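Python Data Analysis: Text Data Processing
1. Overview of Core Concepts
In Pandas, text processing is an important part of data cleaning. It relies mainly on the following:
- str accessor: exposes string methods through .str for vectorized string operations.
- StringDtype: the string type recommended for Pandas 3.0+; it supports missing values and generally performs better than the object dtype.
- String methods
- Regular expressions
Key methods
- split()
- replace()
- contains()
- extract()
- lower()
- strip()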
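To make the idea of vectorized string operations concrete, here is a minimal sketch that compares a chained .str call with an equivalent plain Python loop. The sample values and the variable names (s, cleaned, manual) are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical sample data with one missing entry
s = pd.Series(["  Alice ", None, "BOB"], dtype="string")

# Chained .str calls apply element-wise and pass missing values through
cleaned = s.str.strip().str.lower()

# A plain Python loop needs an explicit check for missing values
manual = [x.strip().lower() if isinstance(x, str) else None for x in s]

print(cleaned.tolist())
print(manual)
```

The .str version needs no missing-value check, which is usually the main practical reason to prefer it over a hand-written loop.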
2. Example Code
2.1 Preparing the Data
In [1]:
import pandas as pd

# Sample contact records used throughout the examples
data = {
    'name': [' Alice Smith ', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson'],
    'email': ['alice@example.com', 'bob@test.org', 'charlie@company.com', 'david@gmail.com', 'eve@school.edu'],
    'phone': ['123-456-7890', '(987) 654-3210', '555.123.4567', '800-555-0199', '999-888-7777'],
    'address': ['123 Main St, NY', '456 Oak Ave, CA', '789 Pine Rd, TX', '321 Elm St, FL', '654 Maple Dr, WA'],
    'description': ['Senior Developer', 'Data Scientist', 'Product Manager', 'UX Designer', 'DevOps Engineer']
}
df = pd.DataFrame(data)
print(df)

            name                email           phone           address  \
0   Alice Smith     alice@example.com    123-456-7890   123 Main St, NY
1    Bob Johnson         bob@test.org  (987) 654-3210   456 Oak Ave, CA
2  Charlie Brown  charlie@company.com    555.123.4567   789 Pine Rd, TX
3      David Lee      david@gmail.com    800-555-0199    321 Elm St, FL
4     Eve Wilson       eve@school.edu    999-888-7777  654 Maple Dr, WA
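2.2 The StringDtype String Type
StringDtype is the recommended string type for Pandas 3.0+. It supports missing values and generally performs better than the object dtype.
In [2]:
df_str = pd.DataFrame({
    'text': ['Hello', None, 'World', pd.NA, 'Python']
})
print(f"\nData type: {df_str['text'].dtype}")

# Convert the column to the dedicated string dtype
df_str['text'] = df_str['text'].astype('string')
print(f"\nData type after conversion: {df_str['text'].dtype}")

# String methods pass missing values through as <NA>
print(df_str['text'].str.len())
print(df_str['text'].str.upper())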
Name: text, dtype: string
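The effect of the dtype on missing values shows up clearly in the result of str.len(). The following is a small sketch with made-up values; the names as_object and as_string are illustrative only, and the dtypes are set explicitly so the comparison does not depend on a particular Pandas version's inference rules.

```python
import pandas as pd

values = ["Hello", None, "World"]

# Same data stored under two explicit dtypes
as_object = pd.Series(values, dtype=object)
as_string = pd.Series(values, dtype="string")

# With object dtype, the missing value forces a float result (NaN)
len_object = as_object.str.len()

# With the string dtype, the result is a nullable integer that keeps <NA>
len_string = as_string.str.len()

print(len_object.dtype, len_string.dtype)
print(as_string.isna().tolist())
```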
2.3 String Splitting (split)
Split strings on a given delimiter. This is commonly used to extract pieces of information.
In [3]:
# Strip surrounding whitespace, split on whitespace, then take the first and last tokens
df['first_name'] = df['name'].str.strip().str.split().str[0]
df['last_name'] = df['name'].str.strip().str.split().str[-1]
print(df[['name', 'first_name', 'last_name']])

# expand=True returns the parts as separate DataFrame columns
email_split = df['email'].str.split('@', expand=True)
email_split.columns = ['username', 'domain']

address_split = df['address'].str.split(', ', expand=True)
address_split.columns = ['street', 'state']

            name first_name last_name
0   Alice Smith       Alice     Smith
1    Bob Johnson        Bob   Johnson
2  Charlie Brown    Charlie     Brown
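Taking the first and last token works for two-part names, but names with more parts may need a limit on the number of splits. The sketch below uses the n parameter of split() and rsplit() on hypothetical names that are not in the sample data.

```python
import pandas as pd

# Hypothetical names with a varying number of parts
names = pd.Series(["Mary Ann Smith", "Bob Johnson", "Cher"])

# n=1 splits only at the first space; rows with fewer parts get a missing value
first_rest = names.str.split(" ", n=1, expand=True)
first_rest.columns = ["first", "rest"]

# rsplit with n=1 splits only at the last space
last = names.str.rsplit(" ", n=1).str[-1]

print(first_rest)
print(last.tolist())
```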
2.4 String Replacement (replace)
Replace a substring, or replace matches of a regular expression.
In [4]:
# Literal replacement: removes hyphens only
df['phone_clean'] = df['phone'].str.replace('-', '')
print(df[['phone', 'phone_clean']])

# Regex replacement: removes every non-digit character
df['phone_digits'] = df['phone'].str.replace(r'\D', '', regex=True)
print(df[['phone', 'phone_digits']])

df['desc_modified'] = df['description'].str.replace('Senior', 'Sr.')
print(df[['description', 'desc_modified']])

            phone    phone_clean
0    123-456-7890     1234567890
1  (987) 654-3210  (987) 6543210
2    555.123.4567   555.123.4567
3    800-555-0199     8005550199
4    999-888-7777     9998887777
            phone  phone_digits
0    123-456-7890    1234567890
1  (987) 654-3210    9876543210
2    555.123.4567    5551234567
3    800-555-0199    8005550199
4    999-888-7777    9998887777
        description    desc_modified
0  Senior Developer    Sr. Developer
1    Data Scientist   Data Scientist
2   Product Manager  Product Manager
3       UX Designer      UX Designer
4   DevOps Engineer  DevOps Engineer
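Regex replacement can also reuse captured groups in the replacement string, which is one way to bring differently formatted phone numbers into a single layout. This is a minimal sketch; the target format 123-456-7890 is an assumption chosen for illustration.

```python
import pandas as pd

phones = pd.Series(["123-456-7890", "(987) 654-3210", "555.123.4567"])

# Step 1: keep digits only
digits = phones.str.replace(r"\D", "", regex=True)

# Step 2: re-insert separators using backreferences to the three captured groups
formatted = digits.str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)

print(formatted.tolist())
```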
2.5 String Concatenation (cat)
Join multiple strings together.
In [5]:
# Concatenate two columns with a separator
df['full_info'] = df['first_name'].str.cat(df['description'], sep=' - ')
print(df[['first_name', 'description', 'full_info']])

# Concatenate several columns by passing a list
df['contact'] = df['first_name'].str.cat([df['email'], df['phone_digits']], sep=' | ')
print(df[['first_name', 'contact']])

# The + operator also concatenates element-wise
df['greeting'] = 'Hello, ' + df['first_name'] + '! Your email is ' + df['email']
print(df[['first_name', 'greeting']])

  first_name       description                  full_info
0      Alice  Senior Developer   Alice - Senior Developer
1        Bob    Data Scientist       Bob - Data Scientist
2    Charlie   Product Manager  Charlie - Product Manager
3      David       UX Designer        David - UX Designer
4        Eve   DevOps Engineer      Eve - DevOps Engineer
  first_name                                     contact
0      Alice      Alice | alice@example.com | 1234567890
1        Bob             Bob | bob@test.org | 9876543210
2    Charlie  Charlie | charlie@company.com | 5551234567
3      David        David | david@gmail.com | 8005550199
4        Eve           Eve | eve@school.edu | 9998887777
  first_name                                           greeting
0      Alice      Hello, Alice! Your email is alice@example.com
1        Bob             Hello, Bob! Your email is bob@test.org
2    Charlie  Hello, Charlie! Your email is charlie@company.com
3      David        Hello, David! Your email is david@gmail.com
4        Eve           Hello, Eve! Your email is eve@school.edu
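The sample data has no missing values, so it does not show how cat() treats them. The sketch below uses made-up series (first, role) to illustrate the na_rep parameter; the placeholder "?" is an arbitrary choice.

```python
import pandas as pd

first = pd.Series(["Alice", "Bob", None], dtype="string")
role = pd.Series(["Developer", None, "Manager"], dtype="string")

# Without na_rep, a missing value on either side makes the result missing
strict = first.str.cat(role, sep=" - ")

# With na_rep, missing values are replaced by the placeholder before joining
filled = first.str.cat(role, sep=" - ", na_rep="?")

print(strict.tolist())
print(filled.tolist())
```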
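2.6 String Extraction (extract)
Use a regular expression to extract the contents of its capture groups.
In [6]:
# Each capture group becomes one column
email_extracted = df['email'].str.extract(r'(.+)@(.+)')
email_extracted.columns = ['user', 'domain']

# Optional parentheses and separators allow for several phone formats
phone_extracted = df['phone'].str.extract(r'\(?(\d{3})\)?[-.\s]?(\d{3})[-.]?(\d{4})')
phone_extracted.columns = ['area_code', 'prefix', 'line']

# Alternation extracts whichever job title appears
title_extract = df['description'].str.extract(r'(Developer|Scientist|Manager|Designer|Engineer)')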
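Instead of assigning column names afterwards, named capture groups can supply them directly. The following sketch uses a hypothetical series that includes one value with no match, to show that non-matching rows come back as missing.

```python
import pandas as pd

emails = pd.Series(["alice@example.com", "not-an-email", "bob@test.org"])

# Named groups (?P<name>...) become the column names of the result
parts = emails.str.extract(r"(?P<user>[^@]+)@(?P<domain>[^@]+)")

print(parts)
```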
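2.7 String Matching (contains/match)
Test whether a string contains, or starts with, a given pattern.
In [7]:
# contains() searches anywhere in the string
has_gmail = df['email'].str.contains('gmail')
is_com_domain = df['email'].str.contains(r'\.com$')

# match() anchors the pattern at the start of the string
starts_with_d = df['first_name'].str.match(r'^D')

# A boolean mask can filter rows directly
tech_roles = df[df['description'].str.contains('Developer|Engineer|Scientist', regex=True)]
print(tech_roles[['first_name', 'description']])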
Name: first_name, dtype: bool
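Two details often matter when these masks are used for filtering: missing values, and how much of the string the pattern has to cover. The sketch below uses made-up data to show the na and case parameters of contains(), and the difference between match() and fullmatch().

```python
import pandas as pd

emails = pd.Series(["david@GMAIL.com", None, "eve@school.edu"], dtype="string")

# case=False ignores letter case; na=False treats missing values as "no match"
is_gmail = emails.str.contains("gmail", case=False, na=False)

codes = pd.Series(["D1", "D1x", "xD1"])

# match() only requires the pattern to match at the start of the string
starts = codes.str.match(r"D\d")

# fullmatch() requires the pattern to cover the entire string
whole = codes.str.fullmatch(r"D\d")

print(is_gmail.tolist(), starts.tolist(), whole.tolist())
```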
2.8 Case Conversion and Whitespace Handling
Change the case of strings and remove whitespace characters.
In [8]:
# Case conversion
print(df['first_name'].str.lower())
print(df['first_name'].str.upper())
print(df['first_name'].str.capitalize())
print(df['description'].str.title())

# Whitespace removal: both sides, left only, right only
print(df['name'].tolist())
print(df['name'].str.strip().tolist())
print(df['name'].str.lstrip().tolist())
print(df['name'].str.rstrip().tolist())

Name: first_name, dtype: object
Name: first_name, dtype: object
Name: first_name, dtype: object
Name: description, dtype: object
[' Alice Smith ', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
['Alice Smith ', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
[' Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
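These methods are often chained into a single normalization step before comparing or deduplicating values. A minimal sketch, assuming that differences in case and spacing should be ignored; the sample values are hypothetical.

```python
import pandas as pd

raw = pd.Series(["  Alice   Smith ", "ALICE SMITH", "alice smith"])

normalized = (
    raw.str.strip()                            # drop leading/trailing whitespace
       .str.replace(r"\s+", " ", regex=True)   # collapse runs of inner whitespace
       .str.lower()                            # unify case
)

print(raw.nunique(), normalized.nunique())
```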
2.9 String Length and Counting
Get the length of each string and count occurrences of a substring.
In [9]:
# Length of the raw name, including any surrounding whitespace
df['name_len'] = df['name'].str.len()
print(df[['first_name', 'name_len']])

# Number of occurrences of 'e' in each email
df['e_count'] = df['email'].str.count('e')
print(df[['email', 'e_count']])

# Index of the first '@' in each email
df['at_position'] = df['email'].str.find('@')
print(df[['email', 'at_position']])

# Prefix and suffix checks
df['is_senior'] = df['description'].str.startswith('Senior')
df['is_com'] = df['email'].str.endswith('.com')
print(df[['description', 'is_senior', 'email', 'is_com']])

        description  is_senior                email  is_com
0  Senior Developer       True    alice@example.com    True
1    Data Scientist      False         bob@test.org   False
2   Product Manager      False  charlie@company.com    True
3       UX Designer      False      david@gmail.com    True
4   DevOps Engineer      False       eve@school.edu   False
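Two behaviors of these methods are easy to overlook: len() counts surrounding whitespace, and find() returns -1 when the substring is absent. The sketch below demonstrates both on made-up values.

```python
import pandas as pd

s = pd.Series([" Alice Smith ", "bob@test.org", "no-at-sign"])

# len() counts every character, including leading and trailing spaces
raw_len = s.str.len()
stripped_len = s.str.strip().str.len()

# find() gives the index of the first occurrence, or -1 if there is none
at_pos = s.str.find("@")

print(raw_len.tolist(), stripped_len.tolist(), at_pos.tolist())
```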
2.10 Padding and Alignment
Pad strings to a fixed width and align them.
In [10]:
# Pad on the left (right-aligns the text)
padded_left = df['first_name'].str.pad(width=10, side='left', fillchar='-')
print(padded_left.tolist())

# Pad on the right (left-aligns the text)
padded_right = df['first_name'].str.pad(width=10, side='right', fillchar='*')
print(padded_right.tolist())

# Pad on both sides (centers the text)
padded_center = df['first_name'].str.center(width=12, fillchar='=')
print(padded_center.tolist())

# Left-pad numeric strings with zeros
numbers = pd.Series(['1', '23', '456', '7890'])
zfilled = numbers.str.zfill(6)
print(list(zip(numbers, zfilled)))

['-----Alice', '-------Bob', '---Charlie', '-----David', '-------Eve']
['Alice*****', 'Bob*******', 'Charlie***', 'David*****', 'Eve*******']
['===Alice====', '====Bob=====', '==Charlie===', '===David====', '====Eve=====']
[('1', '000001'), ('23', '000023'), ('456', '000456'), ('7890', '007890')]
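The .str accessor also provides rjust() and ljust(), which express the same padding in terms of alignment rather than the padded side. The sketch below, with a shortened hypothetical series, checks that rjust() matches pad(side='left').

```python
import pandas as pd

names = pd.Series(["Alice", "Bob"])

# rjust right-aligns the text, so the fill characters go on the left
left_padded = names.str.rjust(10, fillchar="-")

# ljust left-aligns the text, so the fill characters go on the right
right_padded = names.str.ljust(10, fillchar="*")

# Same result as pad(side='left')
same_left = left_padded.equals(names.str.pad(width=10, side="left", fillchar="-"))

print(left_padded.tolist(), right_padded.tolist(), same_left)
```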
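3. Summary of Common Use Cases
- Data cleaning: use strip() to remove whitespace and lower()/upper() to unify case.
- Information extraction: use split() or extract() to pull information out of composite fields.
- Data validation
- Data standardization: use replace() to unify formats such as phone numbers and email addresses.
- Feature engineering: use get_dummies() to convert text features into numeric features.
- Data filtering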
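Validation, filtering, and get_dummies() are listed above but not demonstrated in the examples, so here is a short sketch of all three. The DataFrame, the skills column, and the email pattern are assumptions made for illustration; the pattern is deliberately simple and is not a complete email validator.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["alice@example.com", "bad-address", "eve@school.edu"],
    "skills": ["python|sql", "sql", "python|r"],
})

# Validation: the whole string must match a simplified email pattern
pattern = r"[^@\s]+@[^@\s]+\.[a-z]+"
df["email_ok"] = df["email"].str.fullmatch(pattern)

# Filtering: keep only the rows that passed validation
valid = df[df["email_ok"]]

# Feature engineering: one indicator column per delimiter-separated token
dummies = df["skills"].str.get_dummies(sep="|")

print(valid[["email"]])
print(dummies)
```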