
12_Python Data Analysis: Text Data Processing

2026-06-28 00:56:54

Python Data Analysis: Text Data Processing

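1. Core Concepts Overview

In pandas, text processing is an important part of data cleaning. It mainly involves the following:

.str accessor
: Exposes string methods on a Series through .str, so each operation is applied to every element at once (vectorized).
StringDtype
: The dedicated string type (dtype 'string'), recommended for pandas 3.0+ and available since pandas 1.0. It represents missing values as <NA> and can perform better, particularly when backed by PyArrow.
String methods
: Splitting, replacing, concatenating, extracting, matching, and similar operations.
Regular expressions
: Used for more complex pattern matching and extraction.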

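Key Methods

split()
: Split strings on a delimiter.
replace()
: Replace a substring or a regex match.
contains()
: Test whether each string contains a pattern.
extract()
: Extract regex capture groups.
lower() / upper()
: Convert case.
strip()
: Remove leading and trailing whitespace.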
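
Before the full examples, here is a minimal sketch of how calls on the .str accessor chain and how missing values pass through them. The Series `raw` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical input: untidy strings plus one missing value
raw = pd.Series(["  Alice ", "BOB", None])

# Each .str call returns a new Series, so calls can be chained.
# Missing entries stay missing instead of raising an error.
cleaned = raw.str.strip().str.lower()

print(cleaned.dropna().tolist())
print(cleaned.isna().tolist())
```

The missing entry is still missing after both steps, so it can be handled separately with fillna() or dropna().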

2. Example Code

2.1 Preparing the Data

In [1]:

import pandas as pd
import numpy as np
# Create sample data
data = {
    'name': ['  Alice Smith  ', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson'],
    'email': ['alice@example.com', 'bob@test.org', 'charlie@company.com', 'david@gmail.com', 'eve@school.edu'],
    'phone': ['123-456-7890', '(987) 654-3210', '555.123.4567', '800-555-0199', '999-888-7777'],
    'address': ['123 Main St, NY', '456 Oak Ave, CA', '789 Pine Rd, TX', '321 Elm St, FL', '654 Maple Dr, WA'],
    'description': ['Senior Developer', 'Data Scientist', 'Product Manager', 'UX Designer', 'DevOps Engineer']
}
df = pd.DataFrame(data)
print("Original data:")
print(df)

Original data:
              name                email           phone           address  \
0    Alice Smith      alice@example.com    123-456-7890   123 Main St, NY   
1      Bob Johnson         bob@test.org  (987) 654-3210   456 Oak Ave, CA   
2    Charlie Brown  charlie@company.com    555.123.4567   789 Pine Rd, TX   
3        David Lee      david@gmail.com    800-555-0199    321 Elm St, FL   
4       Eve Wilson       eve@school.edu    999-888-7777  654 Maple Dr, WA   
        description  
0  Senior Developer  
1    Data Scientist  
2   Product Manager  
3       UX Designer  
4   DevOps Engineer

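2.2 The StringDtype String Type

StringDtype is the recommended string type for pandas 3.0+. It represents every missing value as <NA>, and it can perform better than object dtype, particularly when backed by PyArrow.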

In [2]:

# Create string data containing missing values
df_str = pd.DataFrame({
    'text': ['Hello', None, 'World', pd.NA, 'Python']
})
print("Original data:")
print(df_str)
print(f"\nDtype: {df_str['text'].dtype}")
# Convert to StringDtype
df_str['text'] = df_str['text'].astype('string')
print(f"\nDtype after conversion: {df_str['text'].dtype}")
print("\nData after conversion:")
print(df_str)
# String operations on StringDtype propagate missing values as <NA>
print("\nString length (with missing values):")
print(df_str['text'].str.len())
print("\nUppercase:")
print(df_str['text'].str.upper())

Original data:
     text
0   Hello
1    None
2   World
3    <NA>
4  Python
Dtype: object
Dtype after conversion: string
Data after conversion:
     text
0   Hello
1    <NA>
2   World
3    <NA>
4  Python
String length (with missing values):
0       5
1    <NA>
2       5
3    <NA>
4       6
Name: text, dtype: Int64
Uppercase:
0     HELLO
1      <NA>
2     WORLD
3      <NA>
4    PYTHON
Name: text, dtype: string
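
Because missing values propagate through string methods, a test such as contains() also returns a missing result for them. The sketch below shows one way to get a clean boolean mask with the `na` parameter of `str.contains`. The Series `emails` is a made-up example, not part of the data above.

```python
import pandas as pd

# Hypothetical data with one missing email, stored as StringDtype
emails = pd.Series(["alice@example.com", None, "david@gmail.com"], dtype="string")

# Without na=..., the missing entry produces a missing result
raw_mask = emails.str.contains("gmail")

# na=False treats missing entries as "no match", giving a mask with no gaps
safe_mask = emails.str.contains("gmail", na=False)
gmail_rows = emails[safe_mask]

print(raw_mask.isna().tolist())
print(gmail_rows.tolist())
```

Passing na=False states explicitly how missing rows should be treated, which keeps filtering code behaving the same whether the column is object or string dtype.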

2.3 Splitting Strings (split)

Split strings on a given delimiter; this is commonly used to pull fields out of a combined value.

In [3]:

# Split the name (strip first so the padding spaces do not interfere)
df['first_name'] = df['name'].str.strip().str.split().str[0]
df['last_name'] = df['name'].str.strip().str.split().str[-1]
print("\nNames after splitting:")
print(df[['name', 'first_name', 'last_name']])
# Split the email into username and domain
email_split = df['email'].str.split('@', expand=True)
email_split.columns = ['username', 'domain']
print("\nEmail split result:")
print(email_split)
# Split the address into street and state
address_split = df['address'].str.split(', ', expand=True)
address_split.columns = ['street', 'state']
print("\nAddress split result:")
print(address_split)

Names after splitting:
              name first_name last_name
0    Alice Smith        Alice     Smith
1      Bob Johnson        Bob   Johnson
2    Charlie Brown    Charlie     Brown
3        David Lee      David       Lee
4       Eve Wilson        Eve    Wilson
Email split result:
  username       domain
0    alice  example.com
1      bob     test.org
2  charlie  company.com
3    david    gmail.com
4      eve   school.edu
Address split result:
         street state
0   123 Main St    NY
1   456 Oak Ave    CA
2   789 Pine Rd    TX
3    321 Elm St    FL
4  654 Maple Dr    WA
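
The examples above assume every value splits into exactly two parts. When a value may contain extra delimiters (for example a middle name), limiting the number of splits keeps the column count fixed. A minimal sketch, using a made-up `names` Series:

```python
import pandas as pd

# Hypothetical names, one of them with a middle name
names = pd.Series(["Mary Ann Smith", "Bob Johnson"])

# n=1 performs at most one split, counted from the left
first_rest = names.str.split(" ", n=1, expand=True)
first_rest.columns = ["first", "rest"]

# rsplit counts from the right, which isolates the last name instead
rest_last = names.str.rsplit(" ", n=1, expand=True)
rest_last.columns = ["rest", "last"]

print(first_rest)
print(rest_last)
```

With n=1 the result always has two columns, so assigning column names does not fail on rows that contain more spaces than expected.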

2.4 Replacing Strings (replace)

Replace a literal substring, or use a regular expression to replace matches.

In [4]:

# Simple literal replacement (regex=False makes the literal match explicit)
df['phone_clean'] = df['phone'].str.replace('-', '', regex=False)
print("\nPhone numbers with hyphens removed:")
print(df[['phone', 'phone_clean']])
# Regex replacement: remove every non-digit character
df['phone_digits'] = df['phone'].str.replace(r'\D', '', regex=True)
print("\nPhone numbers with digits only:")
print(df[['phone', 'phone_digits']])
# Replace a specific word
df['desc_modified'] = df['description'].str.replace('Senior', 'Sr.', regex=False)
print("\nJob description replacement:")
print(df[['description', 'desc_modified']])

Phone numbers with hyphens removed:
            phone    phone_clean
0    123-456-7890     1234567890
1  (987) 654-3210  (987) 6543210
2    555.123.4567   555.123.4567
3    800-555-0199     8005550199
4    999-888-7777     9998887777
Phone numbers with digits only:
            phone phone_digits
0    123-456-7890   1234567890
1  (987) 654-3210   9876543210
2    555.123.4567   5551234567
3    800-555-0199   8005550199
4    999-888-7777   9998887777
Job description replacement:
        description    desc_modified
0  Senior Developer    Sr. Developer
1    Data Scientist   Data Scientist
2   Product Manager  Product Manager
3       UX Designer      UX Designer
4   DevOps Engineer  DevOps Engineer
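
Stripping non-digits gives a uniform value, but a standardized display format is often wanted as well. One possible approach, sketched here on a made-up `phones` Series, is to reuse capture groups in the replacement string via backreferences:

```python
import pandas as pd

# Hypothetical phone numbers in three different formats
phones = pd.Series(["123-456-7890", "(987) 654-3210", "555.123.4567"])

# Step 1: keep digits only
digits = phones.str.replace(r"\D", "", regex=True)

# Step 2: regroup the ten digits; \1, \2, \3 refer to the capture groups
formatted = digits.str.replace(r"(\d{3})(\d{3})(\d{4})", r"\1-\2-\3", regex=True)

print(formatted.tolist())
```

This sketch assumes ten-digit numbers; values of other lengths would need their own rule.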

2.5 Concatenating Strings (cat)

Join multiple strings together.

In [5]:

# Concatenate two columns
df['full_info'] = df['first_name'].str.cat(df['description'], sep=' - ')
print("\nConcatenated info:")
print(df[['first_name', 'description', 'full_info']])
# Concatenate several columns
df['contact'] = df['first_name'].str.cat([df['email'], df['phone_digits']], sep=' | ')
print("\nMulti-column concatenation:")
print(df[['first_name', 'contact']])
# Concatenate with + (more flexible for custom formats)
df['greeting'] = 'Hello, ' + df['first_name'] + '! Your email is ' + df['email']
print("\nCustom-format concatenation:")
print(df[['first_name', 'greeting']])

Concatenated info:
  first_name       description                  full_info
0      Alice  Senior Developer   Alice - Senior Developer
1        Bob    Data Scientist       Bob - Data Scientist
2    Charlie   Product Manager  Charlie - Product Manager
3      David       UX Designer        David - UX Designer
4        Eve   DevOps Engineer      Eve - DevOps Engineer
Multi-column concatenation:
  first_name                                     contact
0      Alice      Alice | alice@example.com | 1234567890
1        Bob             Bob | bob@test.org | 9876543210
2    Charlie  Charlie | charlie@company.com | 5551234567
3      David        David | david@gmail.com | 8005550199
4        Eve           Eve | eve@school.edu | 9998887777
Custom-format concatenation:
  first_name                                           greeting
0      Alice      Hello, Alice! Your email is alice@example.com
1        Bob             Hello, Bob! Your email is bob@test.org
2    Charlie  Hello, Charlie! Your email is charlie@company.com
3      David        Hello, David! Your email is david@gmail.com
4        Eve           Hello, Eve! Your email is eve@school.edu
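
The sample data has no missing values, so the concatenations above always succeed. With missing values, str.cat returns a missing result for any row where one side is missing, unless `na_rep` supplies a placeholder. A minimal sketch with made-up Series `first` and `role`:

```python
import pandas as pd

# Hypothetical columns, each with one missing value
first = pd.Series(["Alice", "Bob", None])
role = pd.Series(["Developer", None, "Engineer"])

# Default: a missing value on either side makes the whole result missing
default = first.str.cat(role, sep=" - ")

# na_rep substitutes a placeholder for missing values before joining
filled = first.str.cat(role, sep=" - ", na_rep="unknown")

print(default.isna().tolist())
print(filled.tolist())
```

The placeholder text "unknown" is an arbitrary choice for this example.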

2.6 Extracting Substrings (extract)

Use a regular expression to extract the contents of its capture groups.

In [6]:

# Extract the username and domain from the email
email_extracted = df['email'].str.extract(r'(.+)@(.+)')
email_extracted.columns = ['user', 'domain']
print("\nInfo extracted from email:")
print(email_extracted)
# Extract the area code, prefix, and line number from the phone number
phone_extracted = df['phone'].str.extract(r'\(?(\d{3})\)?[-.\s]?(\d{3})[-.]?(\d{4})')
phone_extracted.columns = ['area_code', 'prefix', 'line']
print("\nInfo extracted from phone:")
print(phone_extracted)
# Extract the role keyword from the job description
title_extract = df['description'].str.extract(r'(Developer|Scientist|Manager|Designer|Engineer)')
print("\nExtracted role keywords:")
print(title_extract)

Info extracted from email:
      user       domain
0    alice  example.com
1      bob     test.org
2  charlie  company.com
3    david    gmail.com
4      eve   school.edu
Info extracted from phone:
  area_code prefix  line
0       123    456  7890
1       987    654  3210
2       555    123  4567
3       800    555  0199
4       999    888  7777
Extracted role keywords:
           0
0  Developer
1  Scientist
2    Manager
3   Designer
4   Engineer
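
In the examples above the columns are renamed after extraction, and the single-group case yields a DataFrame with a column named 0. As an alternative sketch, named groups set the column names directly, and expand=False returns a Series when there is only one group. The `emails` Series here is a made-up example.

```python
import pandas as pd

# Hypothetical email addresses
emails = pd.Series(["alice@example.com", "bob@test.org"])

# Named groups (?P<name>...) become the column names of the result
parts = emails.str.extract(r"(?P<user>[^@]+)@(?P<domain>.+)")

# With a single capture group, expand=False returns a Series instead of a DataFrame
tld = emails.str.extract(r"\.(\w+)$", expand=False)

print(parts)
print(tld.tolist())
```

Naming the groups keeps the pattern and the column names in one place, so they cannot drift apart when the pattern changes.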

2.7 Matching Strings (contains/match)

Test whether strings contain a pattern anywhere (contains) or match it from the start (match).

In [7]:

# Check whether the string contains a substring
has_gmail = df['email'].str.contains('gmail')
print("\nContains gmail:")
print(has_gmail)
# Match with a regular expression
is_com_domain = df['email'].str.contains(r'\.com$')
print("\nIs a .com domain:")
print(is_com_domain)
# Match at the start of the string
starts_with_d = df['first_name'].str.match(r'^D')
print("\nStarts with D:")
print(starts_with_d)
# Select rows whose description contains any of the keywords
tech_roles = df[df['description'].str.contains('Developer|Engineer|Scientist', regex=True)]
print("\nTechnical roles:")
print(tech_roles[['first_name', 'description']])

Contains gmail:
0    False
1    False
2    False
3     True
4    False
Name: email, dtype: bool
Is a .com domain:
0     True
1    False
2     True
3     True
4    False
Name: email, dtype: bool
Starts with D:
0    False
1    False
2    False
3     True
4    False
Name: first_name, dtype: bool
Technical roles:
  first_name       description
0      Alice  Senior Developer
1        Bob    Data Scientist
4        Eve   DevOps Engineer
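
contains() interprets its pattern as a regular expression by default, so characters such as "." have special meaning. The sketch below, on a made-up `emails` Series, contrasts regex and literal matching and shows case-insensitive matching with `case=False`.

```python
import pandas as pd

# Hypothetical emails chosen to expose the difference
emails = pd.Series(["alice@example.com", "bob@dotcom.org", "EVE@SCHOOL.EDU"])

# As a regex, "." matches any character, so "tcom" in "dotcom" also matches
as_regex = emails.str.contains(".com")

# regex=False searches for the literal text ".com"
literal = emails.str.contains(".com", regex=False)

# case=False ignores letter case
school = emails.str.contains("school", case=False)

print(as_regex.tolist())
print(literal.tolist())
print(school.tolist())
```

When the search text comes from user input or contains punctuation, regex=False (or escaping it with re.escape) avoids accidental matches of this kind.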

2.8 Case Conversion and Whitespace Handling

Change the case of strings and trim whitespace.

In [8]:

# Case conversion
print("\nLowercase:")
print(df['first_name'].str.lower())
print("\nUppercase:")
print(df['first_name'].str.upper())
print("\nCapitalized (first letter only):")
print(df['first_name'].str.capitalize())
# title() also lowercases the rest of each word, so 'UX' becomes 'Ux'
print("\nTitle case:")
print(df['description'].str.title())
# Whitespace handling
print("\nOriginal name column (with spaces):")
print(df['name'].tolist())
print("\nStrip both ends:")
print(df['name'].str.strip().tolist())
print("\nStrip left side:")
print(df['name'].str.lstrip().tolist())
print("\nStrip right side:")
print(df['name'].str.rstrip().tolist())

Lowercase:
0      alice
1        bob
2    charlie
3      david
4        eve
Name: first_name, dtype: object
Uppercase:
0      ALICE
1        BOB
2    CHARLIE
3      DAVID
4        EVE
Name: first_name, dtype: object
Capitalized (first letter only):
0      Alice
1        Bob
2    Charlie
3      David
4        Eve
Name: first_name, dtype: object
Title case:
0    Senior Developer
1      Data Scientist
2     Product Manager
3         Ux Designer
4     Devops Engineer
Name: description, dtype: object
Original name column (with spaces):
['  Alice Smith  ', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
Strip both ends:
['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
Strip left side:
['Alice Smith  ', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
Strip right side:
['  Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee', 'Eve Wilson']
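
strip() only touches the ends of a string; runs of spaces or tabs inside it remain. A common cleaning step, sketched here on a made-up `names` Series, combines strip(), a regex replace that collapses internal whitespace, and a case conversion:

```python
import pandas as pd

# Hypothetical untidy names: extra spaces, a tab, inconsistent case
names = pd.Series(["  Alice   Smith ", "BOB\tJohnson", "charlie  BROWN"])

normalized = (
    names.str.strip()                            # trim both ends
         .str.replace(r"\s+", " ", regex=True)   # collapse internal whitespace runs
         .str.title()                            # capitalize each word
)

print(normalized.tolist())
```

title() fits plain names; for values with acronyms such as "UX" it would change the meaning, as the output above shows, so the final step depends on the column.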

2.9 String Length and Counting

Get the length of each string and count how often a substring occurs.

In [9]:

# String length (computed on the raw name column, so surrounding spaces are counted)
df['name_len'] = df['name'].str.len()
print("\nName length:")
print(df[['first_name', 'name_len']])
# Count occurrences of a substring
df['e_count'] = df['email'].str.count('e')
print("\nOccurrences of 'e' in email:")
print(df[['email', 'e_count']])
# Find the position of a substring (0-based; -1 if absent)
df['at_position'] = df['email'].str.find('@')
print("\nPosition of the @ sign:")
print(df[['email', 'at_position']])
# Check whether the string starts or ends with a given string
df['is_senior'] = df['description'].str.startswith('Senior')
df['is_com'] = df['email'].str.endswith('.com')
print("\nPrefix/suffix checks:")
print(df[['description', 'is_senior', 'email', 'is_com']])

Name length:
  first_name  name_len
0      Alice        15
1        Bob        11
2    Charlie        13
3      David         9
4        Eve        10
Occurrences of 'e' in email:
                 email  e_count
0    alice@example.com        3
1         bob@test.org        1
2  charlie@company.com        1
3      david@gmail.com        0
4       eve@school.edu        3
Position of the @ sign:
                 email  at_position
0    alice@example.com            5
1         bob@test.org            3
2  charlie@company.com            7
3      david@gmail.com            5
4       eve@school.edu            3
Prefix/suffix checks:
        description  is_senior                email  is_com
0  Senior Developer       True    alice@example.com    True
1    Data Scientist      False         bob@test.org   False
2   Product Manager      False  charlie@company.com    True
3       UX Designer      False      david@gmail.com    True
4   DevOps Engineer      False       eve@school.edu   False

2.10 Padding and Alignment

Pad strings to a fixed width and align them.

In [10]:

# Pad on the left (right-align)
padded_left = df['first_name'].str.pad(width=10, side='left', fillchar='-')
print("\nLeft padding (right-aligned):")
print(padded_left.tolist())
# Pad on the right (left-align)
padded_right = df['first_name'].str.pad(width=10, side='right', fillchar='*')
print("\nRight padding (left-aligned):")
print(padded_right.tolist())
# Pad on both sides (center)
padded_center = df['first_name'].str.center(width=12, fillchar='=')
print("\nPadding on both sides (centered):")
print(padded_center.tolist())
# Pad numeric strings with leading zeros using zfill
numbers = pd.Series(['1', '23', '456', '7890'])
zfilled = numbers.str.zfill(6)
print("\nzfill padding:")
print(list(zip(numbers, zfilled)))

Left padding (right-aligned):
['-----Alice', '-------Bob', '---Charlie', '-----David', '-------Eve']
Right padding (left-aligned):
['Alice*****', 'Bob*******', 'Charlie***', 'David*****', 'Eve*******']
Padding on both sides (centered):
['===Alice====', '====Bob=====', '==Charlie===', '===David====', '====Eve=====']
zfill padding:
[('1', '000001'), ('23', '000023'), ('456', '000456'), ('7890', '007890')]

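3. Summary of Common Use Cases

Data cleaning
: Use strip() to remove whitespace and lower()/upper() to normalize case.
Information extraction
: Use split() or extract() to pull fields out of composite columns.
Data validation
: Use contains() with regular expressions to check data formats.
Data standardization
: Use replace() to unify formats such as phone numbers and email addresses.
Feature engineering
: Use get_dummies() to convert text features into numeric features.
Data filtering
: Combine contains() with boolean indexing to filter rows.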
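
The feature engineering case is the only one above without an example. The sketch below assumes that get_dummies() refers to `Series.str.get_dummies`, which splits each string on a separator and builds one indicator column per distinct token (`pd.get_dummies` is the alternative for single-valued categories). The `skills` Series is made up for illustration.

```python
import pandas as pd

# Hypothetical multi-valued text feature, tokens separated by "|"
skills = pd.Series(["python|sql", "sql", "python|excel"])

# One indicator column per distinct token; 1 means the token is present in that row
dummies = skills.str.get_dummies(sep="|")

print(dummies)
```

The resulting columns are numeric indicators, so they can be joined back onto the original DataFrame and passed to a model directly.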
