1. Introduction: Why Do We Still Need NLTK?

Yesterday you saw the big picture of natural language processing: from word vectors to large models, from BERT to ChatGPT. In this era dominated by large language models, you may well ask:

"Why learn NLTK, a toolkit that dates all the way back to 2001?"

The answer is simple:
NLTK is the "kung fu classic" of NLP: it implements the most basic, most essential concepts of natural language processing. Master NLTK, and you have built up the "inner strength" of NLP.
Its teaching value is irreplaceable: there are no opaque black boxes; every step is visible and inspectable.
It is lightweight and fast: for prototype validation, classroom experiments, and small projects, NLTK is much quicker than loading a BERT model.
Its corpora are rich: dozens of corpora are built in, ready to use at any time.
NLTK (Natural Language Toolkit) is Python's best-known natural language processing library. Developed at the University of Pennsylvania, it is widely used in teaching and research.
2. Installation and Environment Setup
Install the library first (pip install nltk), then download NLTK's data packages (corpora, models, etc.) on first use:

```python
import nltk

# Download commonly used data (only needed once)
nltk.download('punkt')                       # tokenizer models
nltk.download('stopwords')                   # stopword lists
nltk.download('averaged_perceptron_tagger')  # POS tagging model
nltk.download('maxent_ne_chunker')           # named entity recognition
nltk.download('words')                       # word list
nltk.download('vader_lexicon')               # sentiment analysis lexicon
```

Or download everything at once:

```python
nltk.download('all')  # about 3.5 GB; use with caution
```
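If your scripts run repeatedly, it helps to download a resource only when it is missing. A minimal sketch using nltk.data.find, which raises LookupError for absent resources (the ensure_nltk_data helper is our own convention, not an NLTK API; the resource paths follow NLTK's standard data layout):

```python
import nltk

def ensure_nltk_data(resource_path, package_name):
    """Download an NLTK data package only if it is not installed yet."""
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
    except LookupError:
        nltk.download(package_name)

ensure_nltk_data('tokenizers/punkt', 'punkt')
ensure_nltk_data('corpora/stopwords', 'stopwords')
```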
3. NLTK's Core Data Structure: Text as Object

NLTK's most common data structure is the text object: under the hood just a Python string or list of tokens, wrapped with some practical methods.
```python
from nltk.tokenize import word_tokenize, sent_tokenize

text = "Natural Language Processing with NLTK is fun. Let's learn it step by step."

# Sentence segmentation
sentences = sent_tokenize(text)
print("Sentences:", sentences)

# Word tokenization
words = word_tokenize(text)
print("Words:", words)

# Word count
print("Word count:", len(words))
```
Output:

```
Sentences: ['Natural Language Processing with NLTK is fun.', "Let's learn it step by step."]
Words: ['Natural', 'Language', 'Processing', 'with', 'NLTK', 'is', 'fun', '.', 'Let', "'s", 'learn', 'it', 'step', 'by', 'step', '.']
Word count: 16
```
Note that Let's is split into Let and 's; this is typical behavior for English tokenizers.
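To make the "text as object" idea from above concrete: wrapping the token list in NLTK's Text class adds analysis methods such as count and concordance. A minimal sketch:

```python
from nltk import Text
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Natural Language Processing with NLTK is fun. "
                       "Let's learn it step by step.")
t = Text(tokens)

print(t.count('step'))  # how many times the token 'step' occurs
t.concordance('step')   # print every occurrence of 'step' with context
```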
4. Text Preprocessing: The Art of Cleaning Data

4.1 Removing Stopwords
Stopwords are high-frequency words that carry little weight in analysis, such as "the", "is", and "at" in English.
```python
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
words = ['Natural', 'Language', 'Processing', 'is', 'fun', '.']

filtered = [w for w in words if w.lower() not in stop_words]
print("After stopword removal:", filtered)
# Output: ['Natural', 'Language', 'Processing', 'fun', '.']
```
4.2 Stemming

Stemming reduces a word to its stem, which may not itself be a valid word.
```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ['running', 'runner', 'runs', 'easily', 'fairly']

stems = [stemmer.stem(w) for w in words]
print("Stems:", stems)
# Output: ['run', 'runner', 'run', 'easili', 'fairli']  # note: easily becomes easili
```
4.3 Lemmatization

Lemmatization reduces a word to its dictionary form (the lemma), which is always a valid word.
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
words = ['running', 'runner', 'runs', 'better', 'geese']

lemmas = [lemmatizer.lemmatize(w, pos='v') for w in words]  # pos specifies the part of speech
print("Lemmas:", lemmas)
# Output: ['run', 'runner', 'run', 'better', 'geese']
# With pos='v' only verbs are reduced; with pos='n', 'geese' would become 'goose'
```
Stemming vs. lemmatization (a side-by-side sketch follows this list):
Stemming: crudely chops off suffixes; fast, but the result may not be a real word
Lemmatization: maps words to their dictionary form using part-of-speech information; accurate, but slower
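A minimal side-by-side sketch of the two (assuming the wordnet data from section 4.3 is already downloaded):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Compare stem vs. lemma (as a noun) for a few words
for word in ['studies', 'studying', 'geese', 'easily']:
    print(f"{word:10} stem: {stemmer.stem(word):10} lemma: {lemmatizer.lemmatize(word, pos='n')}")
```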
5. Part-of-Speech Tagging: A Label for Every Word

Part-of-speech (POS) tagging is the foundation of many higher-level tasks.
```python
from nltk import pos_tag
from nltk.tokenize import word_tokenize

text = "I love natural language processing"
words = word_tokenize(text)

tagged = pos_tag(words)
print("POS tagging result:")
for word, tag in tagged:
    print(f"{word}: {tag}")
```
Output:

```
I: PRP           (personal pronoun)
love: VBP        (verb, non-3rd person singular present)
natural: JJ      (adjective)
language: NN     (noun, singular)
processing: NN   (noun, singular)
```
Common tags:
NN noun
VB verb
JJ adjective
RB adverb
PRP pronoun
The complete tag set is documented in nltk.help.upenn_tagset() (download the 'tagsets' resource first).
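The tags also tell you which pos value to pass to the lemmatizer from section 4.3. A minimal sketch mapping Penn Treebank tags to WordNet parts of speech (the get_wordnet_pos helper is our own convention, not an NLTK API):

```python
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def get_wordnet_pos(treebank_tag):
    """Map a Penn Treebank tag to a WordNet POS constant."""
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # reasonable default

lemmatizer = WordNetLemmatizer()
for word, tag in pos_tag(word_tokenize("The geese were flying quickly")):
    print(word, '->', lemmatizer.lemmatize(word, pos=get_wordnet_pos(tag)))
```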
6. Named Entity Recognition: Finding Proper Nouns

NLTK includes a built-in named entity recognizer that can identify people, places, organizations, and more.
```python
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize

# Requires the models downloaded earlier:
# nltk.download('maxent_ne_chunker') and nltk.download('words')
sentence = "Apple Inc. is planning to open a new store in New York next month."
words = word_tokenize(sentence)
tagged = pos_tag(words)

entities = ne_chunk(tagged)
print(entities)  # prints a tree structure
```
ne_chunk returns a tree structure, which you can walk with a loop:

```python
for chunk in entities:
    if hasattr(chunk, 'label'):  # subtrees with a label are named entities
        print(f"Entity: {chunk.label()} -> {' '.join(c[0] for c in chunk)}")
```
Output:

```
Entity: ORGANIZATION -> Apple Inc.
Entity: GPE -> New York
```
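Alternatively, the tree can be flattened into word-level IOB tags with nltk.chunk.tree2conlltags, where B- marks the start of an entity, I- a continuation, and O everything else:

```python
from nltk.chunk import tree2conlltags

iob_tagged = tree2conlltags(entities)
print(iob_tagged[:4])
# Each item is a (word, pos_tag, iob_tag) triple,
# e.g. ('Apple', 'NNP', 'B-ORGANIZATION') for the sentence above
```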
7. Frequency Distributions and Text Statistics

NLTK's FreqDist class is a powerful tool for counting word frequencies.
```python
from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "the cat in the hat sat on the mat with a cat"
words = word_tokenize(text)

fdist = FreqDist(words)
print("Most common words:", fdist.most_common(3))
print("Occurrences of 'the':", fdist['the'])

# Plot the frequency distribution (requires matplotlib)
fdist.plot(10, cumulative=False)
```
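Beyond most_common, FreqDist offers several other handy methods; continuing with the fdist from above, a quick sketch:

```python
print("Total tokens:", fdist.N())               # number of tokens counted
print("Most frequent token:", fdist.max())      # single most frequent token
print("Hapaxes:", fdist.hapaxes())              # words occurring exactly once
print("Relative freq of 'the':", fdist.freq('the'))  # count / total tokens
```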
8. Sentiment Analysis: Quick Judgments with VADER

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based sentiment analysis tool, particularly well suited to social media text.
```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()

sentences = [
    "This movie was absolutely amazing! I loved it.",
    "The food was terrible and the service was even worse.",
    "It's okay, nothing special."
]

for sentence in sentences:
    scores = sid.polarity_scores(sentence)
    print(f"\nSentence: {sentence}")
    print(f"Sentiment scores: {scores}")
    print(f"Overall sentiment: {scores['compound']}")
```
Output:

```
Sentence: This movie was absolutely amazing! I loved it.
Sentiment scores: {'neg': 0.0, 'neu': 0.282, 'pos': 0.718, 'compound': 0.8887}
Overall sentiment: 0.8887

Sentence: The food was terrible and the service was even worse.
Sentiment scores: {'neg': 0.454, 'neu': 0.546, 'pos': 0.0, 'compound': -0.8271}
```
The compound score ranges over [-1, 1]: above 0.05 counts as positive, below -0.05 as negative, and anything in between as neutral.
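A minimal sketch turning those thresholds into labels (the 0.05 cutoffs follow VADER's usual convention; sid is the analyzer from above):

```python
def vader_label(compound_score):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound_score > 0.05:
        return 'positive'
    if compound_score < -0.05:
        return 'negative'
    return 'neutral'

print(vader_label(sid.polarity_scores("It's okay, nothing special.")['compound']))
```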
9. Corpus Access: NLTK's Treasure Trove

NLTK bundles a large collection of classic corpora; once downloaded, they work entirely offline.
```python
import nltk
from nltk.corpus import gutenberg, brown, reuters

# Download these corpora once before first use
nltk.download('gutenberg')
nltk.download('brown')
nltk.download('reuters')

# Project Gutenberg corpus
print(gutenberg.fileids())

# Load the King James Bible
bible = gutenberg.words('bible-kjv.txt')
print("Bible word count:", len(bible))

# Brown corpus (categorized by genre)
print(brown.categories())
news_text = brown.words(categories='news')
print("News word count:", len(news_text))

# Reuters corpus
print(reuters.fileids()[:5])
print(reuters.categories())
```
Commonly used corpora (a genre-comparison sketch follows this list):
gutenberg: classic literary works
brown: the Brown corpus, annotated by genre
reuters: Reuters newswire articles
inaugural: US presidential inaugural addresses
stopwords: stopword lists for many languages
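As one example of what the genre annotations enable, here is a minimal sketch comparing modal verb usage across two Brown genres with ConditionalFreqDist, a pattern borrowed from the NLTK book (assumes the brown corpus is downloaded):

```python
from nltk import ConditionalFreqDist
from nltk.corpus import brown

# Count (genre, word) pairs for two genres
cfd = ConditionalFreqDist(
    (genre, word.lower())
    for genre in ['news', 'romance']
    for word in brown.words(categories=genre)
)

# Tabulate how often each modal verb appears per genre
cfd.tabulate(samples=['can', 'could', 'may', 'might', 'must', 'will'])
```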
10. Hands-On Project: Naive Bayes Text Classification

Using NLTK's built-in movie review corpus, we will train a sentiment classifier.

10.1 Loading the Data
```python
import random

import nltk
from nltk.corpus import movie_reviews

nltk.download('movie_reviews')

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)

print("Number of samples:", len(documents))
print("Example document:", documents[0][0][:10], documents[0][1])
```
10.2 Feature Extraction: Bag of Words

We use a simple feature representation: which of the 2,000 most frequent words a document contains.
```python
from nltk import FreqDist

# Collect all words and keep the 2,000 most frequent as features
# (most_common is used because FreqDist keys are not sorted by frequency)
all_words = FreqDist(w.lower() for w in movie_reviews.words())
word_features = [word for word, _ in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership tests are fast
    features = {}
    for word in word_features:
        features[f'contains({word})'] = (word in document_words)
    return features
```
10.3 Train/Test Split
```python
featuresets = [(document_features(doc), category)
               for (doc, category) in documents]

# Hold out the first 100 shuffled documents as the test set
train_set, test_set = featuresets[100:], featuresets[:100]
```
10.4 Training the Naive Bayes Classifier
```python
from nltk import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(train_set)
```
10.5 Evaluation
```python
import nltk

print("Accuracy:", nltk.classify.accuracy(classifier, test_set))

# Inspect the most informative features
classifier.show_most_informative_features(10)
```
Output:

```
Accuracy: 0.81
Most Informative Features
    contains(bad) = True              neg : pos    =     13.2 : 1.0
    contains(amazing) = True          pos : neg    =     11.0 : 1.0
    contains(worst) = True            neg : pos    =     10.8 : 1.0
    contains(fantastic) = True        pos : neg    =      9.8 : 1.0
```
10.6 Predicting on New Text
```python
from nltk.tokenize import word_tokenize

def predict_sentiment(text):
    words = word_tokenize(text.lower())
    features = document_features(words)
    return classifier.classify(features)

print(predict_sentiment("This movie is really great and exciting!"))
print(predict_sentiment("Terrible film, waste of time."))
```
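To reuse the classifier without retraining it on every run, a minimal sketch persisting it with Python's standard pickle module (the file name is arbitrary):

```python
import pickle

# Save the trained classifier to disk
with open('sentiment_classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# Load it back later
with open('sentiment_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)
```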
Conclusion: NLTK, the "Classical Mathematics" of NLP

Even in the midst of the deep learning wave, NLTK still stands firm, because it represents the foundations of NLP. Just as learning mathematics starts from arithmetic, learning NLP cannot skip NLTK.

Today you learned:
Tokenization, stopword removal, stemming, and lemmatization
POS tagging and named entity recognition
Word frequency statistics and sentiment analysis
Text classification with Naive Bayes
Working with the built-in corpora
These skills are a solid foundation for going deeper into the world of NLP.