
One Python Topic a Day, Day 163: Text Preprocessing in Practice

2026-07-03 19:11:36

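1. Introduction: Why Does 80% of the Time Go to Cleaning Data?

"In real-world NLP projects, 80% of the time goes to data cleaning, and only the remaining 20% to modeling and tuning."

Many NLP practitioners learn this rule of thumb the hard way.

Yesterday you ran a smooth text analysis with spaCy, but that was on clean English example sentences. What if your data looks like this: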

"RT @user: This is SOOOO exciting!!! Can't wait for the #event 😍😍😍 http://t.co/xyz"

Or like this:

"这家店的东西质量很好！！！服务态度也超赞的～～但是价格有点小贵……下次还会来💯"

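You will find that a model cannot process such text directly.

Today's goal: build a complete text preprocessing pipeline that turns raw text into structured data a model can work with.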

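2. Core Goals of Text Preprocessing

Goal	Description
Denoising	Remove noise that is irrelevant to the task (HTML tags, URLs, special symbols)
Normalization	Map different surface forms of the same meaning to one form (case, Simplified/Traditional Chinese, numbers)
Tokenization	Split the text into the smallest units the model can process
Dimensionality reduction	Remove uninformative words (stop words) to shrink the feature space
Reduction to base form	Reduce inflected word forms to a base form to shrink the vocabulary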

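3. The Preprocessing Steps in Detail (with Code)

3.1 Data Cleaning: Removing the Noise

Task: remove HTML tags, URLs, @mentions, special symbols, and extra whitespace.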

import re

def clean_text(text):
    """Basic cleaning function."""
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Remove URLs
    text = re.sub(r'https?://\S+', '', text)
    # Remove @mentions
    text = re.sub(r'@\w+', '', text)
    # Remove special symbols: keep word characters, whitespace, Chinese
    # characters, and full-width Chinese punctuation. ASCII punctuation
    # such as ! and ' is removed, so "Can't" becomes "Cant".
    text = re.sub(r'[^\w\s\u4e00-\u9fff。，！？、]', '', text)
    # Collapse extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Example
dirty_text = "<p>RT @user: This is SOOOO exciting!!! Can't wait for the #event 😍😍😍 http://t.co/xyz</p>"
clean = clean_text(dirty_text)
print("Cleaned:", clean)
# Output: Cleaned: RT This is SOOOO exciting Cant wait for the event
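
Stripping tags alone leaves HTML entities such as &amp; or &#39; in the text. The following is a minimal sketch, assuming the input may contain such entities, that decodes them with the standard library's html.unescape. The helper name strip_html is hypothetical and is not part of the original clean_text.

```python
import html
import re


def strip_html(text):
    """Remove tags, then decode entities such as &amp; and &#39;."""
    # Strip tags first so that a decoded "&lt;" is not mistaken for a tag
    text = re.sub(r'<[^>]+>', '', text)
    # Decode named and numeric character references
    return html.unescape(text)


sample = "<p>Tom &amp; Jerry can&#39;t wait &lt;3</p>"
print(strip_html(sample))  # prints: Tom & Jerry can't wait <3
```

If this step is used, it would most naturally run before the symbol-removal regex, so that decoded characters go through the same filtering as the rest of the text.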

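3.2 Normalization: Unifying "苹果" and "Apple"

Task: unify case, replace numbers with a placeholder, and collapse elongated letters. (Emoji handling is shown in the Twitter case study in Section 5.)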

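import re

def normalize_text(text):
    """Normalize text."""
    # Lowercase (English)
    text = text.lower()
    # Replace runs of digits with a placeholder
    text = re.sub(r'\d+', '<NUM>', text)
    # Collapse letters repeated three or more times (e.g. soooo → so);
    # matching letters only leaves punctuation such as "!!!" intact
    text = re.sub(r'([a-z])\1{2,}', r'\1', text)
    return text

# Example
text = "This is SOOOO AMAZING!!! 12345"
norm = normalize_text(text)
print("Normalized:", norm)
# Output: Normalized: this is so amazing!!! <NUM>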
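
Mixed Chinese and English text often contains full-width letters, digits, and spaces, which are another case of "different forms, same meaning". As a hedged sketch (the helper name fold_width is hypothetical), Unicode NFKC normalization from the standard library folds them to their ASCII counterparts. NFKC also rewrites other compatibility characters, such as ligatures, so check that this is acceptable for your data.

```python
import unicodedata


def fold_width(text):
    """Fold full-width letters, digits, and spaces to ASCII via NFKC."""
    return unicodedata.normalize("NFKC", text)


print(fold_width("ＡＢＣ１２３"))  # prints: ABC123
```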

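3.3 Tokenization: Cutting Text into the Smallest Semantic Units

Task: split the text into words, punctuation marks, and so on. Chinese has no spaces between words, so it needs a dedicated segmentation tool.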

import jieba
import spacy

# English tokenization (spaCy)
# Requires the model: python -m spacy download en_core_web_sm
nlp_en = spacy.load("en_core_web_sm")

def tokenize_english(text):
    doc = nlp_en(text)
    return [token.text for token in doc]

# Chinese tokenization (jieba)
def tokenize_chinese(text):
    return list(jieba.cut(text))

# Example
en_text = "I can't believe it!"
ch_text = "我不相信这是真的！"
print("English tokens:", tokenize_english(en_text))
print("Chinese tokens:", tokenize_chinese(ch_text))
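
To make concrete what a tokenizer does, here is a dependency-free sketch built on a single regular expression. It is an illustration only, not a replacement for spaCy or jieba; the names TOKEN_PATTERN and simple_tokenize are hypothetical.

```python
import re

# A word (optionally with an apostrophe part, as in "can't"),
# or any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and individual punctuation marks."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("I can't believe it!"))
# prints: ['I', "can't", 'believe', 'it', '!']
print(simple_tokenize("我不相信这是真的！"))
# prints: ['我不相信这是真的', '！']
```

The second call shows why Chinese needs a segmenter: without spaces, the regex returns the whole sentence as one token. Note also that spaCy typically splits a contraction such as "can't" into two tokens, whereas this sketch keeps it whole.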

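3.4 Stop-Word Removal: Dropping Uninformative Words

Task: remove high-frequency words that contribute little to the analysis, such as "的", "是", "a", and "the".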

import nltk
from nltk.corpus import stopwords

# Download the stop-word list if it is not installed yet
# nltk.download('stopwords')

stop_words_en = set(stopwords.words('english'))
stop_words_zh = set(["的", "了", "在", "是", "我", "你", "他", "她", "它"])  # small sample list

def remove_stopwords(tokens, lang='en'):
    if lang == 'en':
        return [t for t in tokens if t.lower() not in stop_words_en]
    elif lang == 'zh':
        return [t for t in tokens if t not in stop_words_zh]
    return tokens

# Example
tokens = ['I', 'love', 'natural', 'language', 'processing']
filtered = remove_stopwords(tokens, 'en')
print("After stop-word removal:", filtered)
# Output: After stop-word removal: ['love', 'natural', 'language', 'processing']
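
One caveat for sentiment tasks: general-purpose stop-word lists, NLTK's English list included, contain negations such as "not" and "no", and removing them can flip the meaning of a sentence. The sketch below uses a tiny hand-written stop-word set (an assumption made to keep the example self-contained) to show one way to keep negations; the function name is hypothetical.

```python
# Illustrative subset only; a real stop-word list is much longer
STOP_WORDS = {"i", "the", "a", "is", "was", "not", "no", "this", "it"}

# Words that carry polarity and should survive filtering
NEGATIONS = {"not", "no", "never"}

# Stop-word set adjusted for sentiment analysis
SENTIMENT_STOP_WORDS = STOP_WORDS - NEGATIONS


def remove_stopwords_keep_negation(tokens):
    """Filter stop words but keep negation words."""
    return [t for t in tokens if t.lower() not in SENTIMENT_STOP_WORDS]


tokens = ["This", "movie", "is", "not", "good"]
print(remove_stopwords_keep_negation(tokens))
# prints: ['movie', 'not', 'good']
```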

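3.5 Stemming vs. Lemmatization: Reducing Words to One Form

Task: merge the different inflected forms of a word.

Stemming: crudely chops off suffixes (running → run; the result may not be a real word)
Lemmatization: looks the word up in a dictionary (running → run; the result is a real word)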

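from nltk.stem import PorterStemmer, WordNetLemmatizer

# The lemmatizer needs the WordNet data: nltk.download('wordnet')
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'runs', 'better', 'geese']
print("Stemming:", [stemmer.stem(w) for w in words])
# pos='v' treats every word as a verb, so the noun "geese" is left unchanged
print("Lemmatization:", [lemmatizer.lemmatize(w, pos='v') for w in words])
# Output:
# Stemming: ['run', 'run', 'better', 'gees']
# Lemmatization: ['run', 'run', 'better', 'geese']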
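
The difference between the two approaches can be seen in a toy version that needs no downloads. This is an illustration under simplifying assumptions: the suffix rules are NOT the Porter algorithm, and the small dictionary merely stands in for a real lexicon such as WordNet. All names are hypothetical.

```python
# Toy suffix rules, tried in order; for illustration only
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

# Toy dictionary standing in for a real lexicon
LEMMA_DICT = {"studies": "study", "geese": "goose", "walked": "walk"}


def toy_stem(word):
    """Rule-based: strip the first matching suffix, real word or not."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


def toy_lemmatize(word):
    """Dictionary-based: return the known base form, else the word itself."""
    return LEMMA_DICT.get(word, word)


for w in ["studies", "geese", "walked"]:
    print(w, "->", toy_stem(w), "/", toy_lemmatize(w))
# prints:
# studies -> studi / study
# geese -> geese / goose
# walked -> walk / walk
```

The rules are cheap but blind ("studi" is not a word, and the irregular plural "geese" is missed), while the dictionary is accurate only for the words it contains.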

4. Building a Complete Preprocessing Pipeline

The steps above can be wrapped into a reusable pipeline.

import re
import jieba
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

class TextPreprocessor:
    """Text preprocessing pipeline."""

    def __init__(self, lang='en', use_lemmatization=True):
        self.lang = lang
        self.use_lemmatization = use_lemmatization
        if lang == 'en':
            self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
            self.stopwords = set(stopwords.words('english'))
            if use_lemmatization:
                self.lemmatizer = WordNetLemmatizer()
        elif lang == 'zh':
            # Chinese is segmented with jieba, so spaCy is not needed
            self.stopwords = set(["的", "了", "在", "是", "我", "你", "他", "她", "它", "我们", "你们", "他们"])
            # A larger Chinese stop-word list can be loaded here
        else:
            raise ValueError("lang must be 'en' or 'zh'")

    def clean(self, text):
        """Clean the raw text."""
        # Remove HTML tags
        text = re.sub(r'<[^>]+>', '', text)
        # Remove URLs
        text = re.sub(r'http[s]?://\S+', '', text)
        # Remove @mentions
        text = re.sub(r'@\w+', '', text)
        # Keep only basic characters
        if self.lang == 'en':
            # The underscore is kept so that placeholder tokens such as
            # heart_eyes survive cleaning
            text = re.sub(r'[^a-zA-Z_\s.,!?]', '', text)
        else:
            text = re.sub(r'[^\w\s\u4e00-\u9fff。，！？、]', '', text)
        # Collapse whitespace
        text = re.sub(r'\s+', ' ', text).strip()
        return text

    def tokenize(self, text):
        """Tokenize the text."""
        if self.lang == 'en':
            doc = self.nlp(text)
            return [token.text for token in doc]
        else:
            return list(jieba.cut(text))

    def normalize_token(self, token):
        """Normalize a single token."""
        # Lowercase (English)
        if self.lang == 'en':
            token = token.lower()
        # Lemmatize (by default WordNet treats the token as a noun)
        if self.lang == 'en' and self.use_lemmatization:
            token = self.lemmatizer.lemmatize(token)
        return token

    def process(self, text):
        """Run the full pipeline."""
        # Clean
        text = self.clean(text)
        # Tokenize
        tokens = self.tokenize(text)
        # Normalize, then drop stop words
        processed = []
        for token in tokens:
            norm = self.normalize_token(token)
            # Single-character tokens are dropped too; this removes
            # punctuation, but also one-character Chinese words
            if norm not in self.stopwords and len(norm) > 1:
                processed.append(norm)
        return processed

# Test on English
prep_en = TextPreprocessor(lang='en')
text_en = "<p>I absolutely LOVED the movie!!! It was soooo amazing 😍 http://example.com</p>"
result_en = prep_en.process(text_en)
print("English result:", result_en)

# Test on Chinese
prep_zh = TextPreprocessor(lang='zh')
text_zh = "<p>这家店的奶茶真的超级好喝！！！服务态度也超赞的～～下次还会来💯</p>"
result_zh = prep_zh.process(text_zh)
print("Chinese result:", result_zh)
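
The token lists produced by process() are still text; a model needs numbers. As a rough sketch of the next step, the example below turns token lists into bag-of-words count vectors using only the standard library. The token lists are made up for illustration and are not actual pipeline output, and to_vector is a hypothetical helper.

```python
from collections import Counter

# Hypothetical token lists, standing in for TextPreprocessor.process() output
docs = [
    ["love", "movie", "amazing"],
    ["movie", "boring"],
]

# The vocabulary is the sorted set of all tokens seen in the corpus
vocab = sorted({tok for doc in docs for tok in doc})


def to_vector(tokens, vocab):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


vectors = [to_vector(doc, vocab) for doc in docs]
print(vocab)    # prints: ['amazing', 'boring', 'love', 'movie']
print(vectors)  # prints: [[1, 0, 1, 1], [0, 1, 0, 1]]
```

Every distinct token adds one dimension to these vectors, which is why stop-word removal and lemmatization directly shrink the feature space.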

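5. Case Study: Preprocessing Social Media Comments for Sentiment Analysis

Scenario: preprocess Twitter comments to prepare data for sentiment analysis.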

def preprocess_tweet(tweet):
    """Twitter-specific preprocessing (uses re and prep_en from Section 4)."""
    # Remove the RT (retweet) marker
    tweet = re.sub(r'^RT ', '', tweet)
    # Drop the # from hashtags but keep the word
    tweet = re.sub(r'#(\w+)', r'\1', tweet)
    # Map emoji to words
    tweet = tweet.replace('😍', ' heart_eyes ').replace('😂', ' laughing ')
    # Run the general-purpose pipeline
    return prep_en.process(tweet)

tweets = [
    "RT @user: I can't believe how good this movie is!!! #awesome",
    "Worst service ever 😤😤😤 @company",
    "Just bought the new iPhone. Loving it! 😍😍😍"
]

for tweet in tweets:
    processed = preprocess_tweet(tweet)
    print(f"Original: {tweet}")
    print(f"Processed: {processed}\n")

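Output (representative; the exact tokens depend on the installed spaCy model and NLTK stop-word list, for example fragments of the contraction "can't" may survive, and each repeated emoji produces its own token):

Original: RT @user: I can't believe how good this movie is!!! #awesome
Processed: ['believe', 'good', 'movie', 'awesome']

Original: Worst service ever 😤😤😤 @company
Processed: ['worst', 'service', 'ever']

Original: Just bought the new iPhone. Loving it! 😍😍😍
Processed: ['bought', 'new', 'iphone', 'loving', 'heart_eyes']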
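
Chained replace() calls become hard to maintain as the emoji list grows. One alternative, sketched here under the assumption that a small hand-made mapping is enough, is a dictionary plus a single compiled pattern that also collapses runs of the same emoji into one token. EMOJI_MAP, its labels, and replace_emoji are hypothetical names.

```python
import re

# Hypothetical mapping; extend it with the emoji that matter for your data
EMOJI_MAP = {"😍": "heart_eyes", "😂": "laughing", "😤": "frustrated"}

# One alternation over all mapped emoji; \1* absorbs repeats of the same emoji
EMOJI_PATTERN = re.compile("(" + "|".join(map(re.escape, EMOJI_MAP)) + r")\1*")


def replace_emoji(text):
    """Replace each run of a mapped emoji with a single word token."""
    replaced = EMOJI_PATTERN.sub(lambda m: f" {EMOJI_MAP[m.group(1)]} ", text)
    # Tidy the spaces introduced around the inserted words
    return re.sub(r"\s+", " ", replaced).strip()


print(replace_emoji("Loving it! 😍😍😍"))
# prints: Loving it! heart_eyes
```

Whether to collapse repeats is a design choice: three hearts may signal stronger sentiment than one, so some pipelines keep the repetition as a feature instead.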

6. Before and After Preprocessing

Original text:

"I absolutely LOVED the movie!!! It was soooo amazing 😍😍😍 #mustwatch"

After cleaning, normalization, tokenization, stop-word removal, and lemmatization:

['absolutely', 'love', 'movie', 'amazing', 'mustwatch']

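Differences:

Case is unified
Elongated letters are collapsed (soooo → so; "so" is on NLTK's English stop-word list, so it is then removed)
Emoji are handled (removed as noise here; if they are first mapped to words as in Section 5, the extra tokens would appear in the list)
Words are lemmatized (loved → love, which requires lemmatizing the word as a verb)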

Effect on the model:

The 10 word tokens in the input are reduced to 5
Uninformative words (it, was, the) are removed
The core sentiment words are kept

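7. Conclusion: Clean Data Sets the Ceiling for Your Model

"Garbage in, garbage out." In NLP this saying is especially true.

Today you worked through the full path from raw text to a clean corpus and learned how to:

Clean out noise (HTML, URLs, special symbols)
Normalize and standardize text
Tokenize and filter stop words
Apply stemming and lemmatization
Build a reusable preprocessing pipeline

These skills are the common starting point for your NLP projects: whether you are doing sentiment analysis, topic modeling, or question answering, preprocessing is the first step you cannot skip.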

