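A Whole New Way of Thinking | Think Python, Chapter 12
12. Text Analysis and Generation
In this chapter we use lists, dictionaries, and tuples to analyze text and to generate text with a Markov model. These algorithms are similar in spirit, at a much smaller scale, to part of what a large language model does: they predict the next word from the words that came before. We count word frequencies, record which words follow which, and generate new text in a style similar to the original.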
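12.1 Unique Words
First, read the text and count the unique words in the book, recording each word as a key in a dictionary.

    # Text of "Strange Case of Dr Jekyll and Mr Hyde"
    filename = 'dr_jekyll.txt'

    # Use a dictionary to record every unique word
    unique_words = {}
    for line in open(filename):
        # Split each line into a sequence of words
        seq = line.split()
        for word in seq:
            unique_words[word] = 1

    # The length of the dictionary is the number of unique words
    len(unique_words)

Output: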
6040

    # Sort by length and look at the 5 longest words
    sorted(unique_words, key=len)[-5:]

Output:
['chocolate-coloured', 'superiors—behold!”', 'coolness—frightened', 'gentleman—something', 'pocket-handkerchief.']
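
The dictionary above is used only for its keys; every value is 1. As a side note, the same count can be sketched with a set. The sample text here is made up for illustration and stands in for the book file, and the tie-breaking sort key is an addition so that the result does not depend on set ordering.

```python
# Hypothetical sample text standing in for the book file
sample = """the quick brown fox
jumps over the lazy dog
the end"""

unique = set()
for line in sample.splitlines():
    # split() breaks the line on whitespace; update() adds each word once
    unique.update(line.split())

print(len(unique))

# Sort by length, then alphabetically, so ties are ordered predictably
longest = sorted(unique, key=lambda w: (len(w), w))[-3:]
print(longest)
```

A set stores each word once and needs no placeholder value, which is why it fits this particular count. The dictionary form becomes the better choice as soon as each word needs a count attached to it.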
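12.2 Punctuation
The raw text contains punctuation, em dashes, and other noise that sticks to the words, so the words need cleaning before they are counted.

    import unicodedata

    # Replace em dashes with spaces, then split the line into words
    def split_line(line):
        return line.replace('—', ' ').split()

    # Collect every punctuation character that appears in the text
    punc_marks = {}
    for line in open(filename):
        for char in line:
            category = unicodedata.category(char)
            if category.startswith('P'):
                punc_marks[char] = 1

    # Join the keys into a single string of punctuation marks
    punctuation = ''.join(punc_marks)
    print(punctuation)

Output: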
.’;,-“”:?—‘!()_

    # Clean a word: strip punctuation from both ends and convert to lowercase
    def clean_word(word):
        return word.strip(punctuation).lower()

    # Example: clean a word wrapped in punctuation
    clean_word('“Behold!”')

Output:
'behold'

    # Count the unique words again, this time after cleaning
    unique_words2 = {}
    for line in open(filename):
        for word in split_line(line):
            word = clean_word(word)
            unique_words2[word] = 1

    len(unique_words2)

Output:
4005
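
To make the category test less opaque, the following sketch prints the Unicode category of a few characters and shows that `strip` only touches the ends of a word. The helper name `is_punctuation` and the sample characters are chosen here for illustration.

```python
import unicodedata

# Sample characters chosen for illustration
for char in ['.', '-', '—', '“', '”', '(', '_', 'a', ' ']:
    print(repr(char), unicodedata.category(char))

def is_punctuation(char):
    """Return True if the Unicode category of char starts with 'P'."""
    return unicodedata.category(char).startswith('P')

# strip() removes the given characters from both ends only,
# so the hyphen inside the word survives
cleaned = 'pocket-handkerchief.'.strip('.-').lower()
print(cleaned)
```

Every punctuation category begins with `P` (for example `Po` for the period, `Pd` for dashes, `Pi` and `Pf` for curly quotes), while letters and spaces fall under `L` and `Z`.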
12.3 Word Frequencies
Count how many times each word appears.

    # Dictionary that maps each word to its frequency
    word_counter = {}
    for line in open(filename):
        for word in split_line(line):
            word = clean_word(word)
            if word not in word_counter:
                word_counter[word] = 1
            else:
                word_counter[word] += 1

    # Return the second element of a tuple (the frequency)
    def second_element(t):
        return t[1]

    # Sort by frequency, highest first
    items = sorted(word_counter.items(), key=second_element, reverse=True)

    # Print the 5 most frequent words
    for word, freq in items[:5]:
        print(freq, word, sep='\t')

Output:
1614	the
972	and
941	of
640	to
640	i
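
The counting loop and the sort can also be sketched with `collections.Counter` from the standard library. This is an alternative to the hand-written dictionary, not the approach taken above, and the word list is a made-up sample.

```python
from collections import Counter

# Hypothetical sample; in the chapter the words come from the cleaned book text
words = 'the cat and the dog and the bird'.split()

# Counter tallies how often each item occurs
counter = Counter(words)

# most_common(n) returns (word, count) pairs, highest count first
for word, freq in counter.most_common(2):
    print(freq, word, sep='\t')
```

Writing the loop by hand, as the section does, shows what the counter is doing internally; `Counter` is the shorter form once that is understood.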
12.4 Optional Parameters
A function can define an optional parameter by giving it a default value.

    # Print the most common words; num defaults to 5
    def print_most_common(word_counter, num=5):
        items = sorted(word_counter.items(), key=second_element, reverse=True)
        for word, freq in items[:num]:
            print(freq, word, sep='\t')

    # Use the default value
    print_most_common(word_counter)

Output:
1614	the
972	and
941	of
640	to
640	i

    # Pass num=3
    print_most_common(word_counter, 3)

Output:
1614	the
972	and
941	of
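
As a small variation, the sketch below returns the top items instead of printing them and passes the optional argument by keyword. The function name `top_words` and the sample counts are invented for this example.

```python
def top_words(word_counter, num=5):
    """Return the num most frequent (word, count) pairs."""
    def count_of(item):
        return item[1]
    items = sorted(word_counter.items(), key=count_of, reverse=True)
    return items[:num]

# Made-up counts for illustration
counts = {'the': 3, 'and': 2, 'cat': 1}

print(top_words(counts))         # default num=5; only three words exist
print(top_words(counts, num=1))  # keyword form of the optional argument
```

In a function definition, parameters without a default must come before parameters with one, which is why `word_counter` is listed first.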
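12.5 Dictionary Subtraction
Find the words that appear in the book but not in a standard word list. These are candidates for misspellings, although in practice many of them are proper names or words the list simply leaves out.

    # Read the standard word list
    word_list = open('words.txt').read().split()

    # Build a dictionary of valid words
    valid_words = {}
    for word in word_list:
        valid_words[word] = 1

    # Dictionary subtraction: keep the keys that are in d1 but not in d2
    def subtract(d1, d2):
        res = {}
        for key in d1:
            if key not in d2:
                res[key] = d1[key]
        return res

    # Find the words in the book that are not in the word list
    diff = subtract(word_counter, valid_words)

    # Look at the most common of them
    print_most_common(diff)

Output:
640	i
628	a
128	utterson
124	mr
98	hyde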
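
The same subtraction can be sketched more compactly with a dictionary comprehension, or with the set operations that dictionary key views support. The two small dictionaries are made up for the example.

```python
# Made-up data: word counts from a text and a tiny "valid word" list
book_counts = {'i': 4, 'hyde': 2, 'door': 3, 'street': 1}
valid = {'door': 1, 'street': 1}

# Comprehension: keep the items whose key is missing from valid
diff = {word: n for word, n in book_counts.items() if word not in valid}
print(diff)

# Key views behave like sets, so "-" gives the keys only in book_counts
missing = sorted(book_counts.keys() - valid.keys())
print(missing)
```

The comprehension keeps the counts, as `subtract` does; the key-view form returns only the words, which is enough when the frequencies are not needed.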
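12.6 Random Numbers
The random module can choose words at random, which gives a first, very basic kind of random text.

    import random

    # Choose one element from a list at random
    t = [1, 2, 3]
    random.choice(t)

Output (varies from run to run):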
1

    # Convert the dictionary keys to a list
    words = list(word_counter)

    # Choose one word at random
    random.choice(words)

Output (varies from run to run):
'posture'

    # Choose 6 words at random, weighted by word frequency
    weights = word_counter.values()
    random_words = random.choices(words, weights=weights, k=6)
    result = ' '.join(random_words)
    print(result)

Output (varies from run to run):
reach streets edward a said to
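
To see what the weights actually do, this sketch draws many samples from a tiny made-up vocabulary and counts the results. The counts, the seed value, and the sample size are arbitrary choices for illustration.

```python
import random

# Made-up frequencies; 'never' has weight 0 and should not be drawn
counts = {'the': 8, 'door': 2, 'never': 0}
words = list(counts)
weights = list(counts.values())

# Fixing the seed makes the run repeatable
random.seed(12)
sample = random.choices(words, weights=weights, k=1000)

# Tally how often each word was drawn
tally = {word: sample.count(word) for word in words}
print(tally)
```

With these weights, 'the' should be drawn roughly four times as often as 'door'. Because `random.choices` samples with replacement, the same word can appear many times in one result.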
12.7 Bigrams
Count each pair of consecutive words (a bigram).

    # Bigram counter
    bigram_counter = {}

    def count_bigram(bigram):
        # Convert to a tuple so it can be used as a dictionary key
        key = tuple(bigram)
        if key not in bigram_counter:
            bigram_counter[key] = 1
        else:
            bigram_counter[key] += 1

    # Sliding window that holds the current two words
    window = []

    def process_word(word):
        window.append(word)
        if len(window) == 2:
            # The window is full: count the bigram
            count_bigram(window)
            # Remove the oldest word
            window.pop(0)

    # Process every word in the book
    for line in open(filename):
        for word in split_line(line):
            word = clean_word(word)
            process_word(word)

    # Look at the most common bigrams
    print_most_common(bigram_counter)

Output:
178	('of', 'the')
139	('in', 'the')
94	('it', 'was')
80	('and', 'the')
73	('to', 'the')
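
When the words are already in a list, the sliding window can be sketched with `zip`, which pairs each word with the one after it. This is an alternative formulation under that assumption; the sample sentence is made up.

```python
# Made-up sample; the chapter feeds in the book one word at a time instead
words = 'it was the best of times it was the worst of times'.split()

bigram_counts = {}
# zip(words, words[1:]) yields (word, next word) tuples
for bigram in zip(words, words[1:]):
    # get() returns 0 the first time a bigram is seen
    bigram_counts[bigram] = bigram_counts.get(bigram, 0) + 1

print(bigram_counts[('it', 'was')])
print(len(bigram_counts))
```

The `zip` form needs the whole word list in memory, whereas the window in `process_word` handles one word at a time and so works while streaming through a file.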
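12.8 Markov Analysis
Build a mapping from each word to the list of words that follow it in the text. A word that follows more often appears in the list more often, which is what the text generator relies on.

    # Successor map: each key is a word, each value is a list of the words that follow it
    successor_map = {}

    def add_bigram(bigram):
        first, second = bigram
        if first not in successor_map:
            successor_map[first] = [second]
        else:
            successor_map[first].append(second)

    window = []

    def process_word_bigram(word):
        window.append(word)
        if len(window) == 2:
            add_bigram(window)
            window.pop(0)

    # Start from an empty map and window, then analyze the whole book
    successor_map = {}
    window = []
    for line in open(filename):
        for word in split_line(line):
            word = clean_word(word)
            process_word_bigram(word)

    # Look at the successors of one word
    successor_map['going']

Output: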
['east', 'in', 'to', 'to', 'up', 'to', 'of']
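
As a compact illustration of the same data structure, the sketch below builds a successor map from a short made-up sentence using `dict.setdefault`. The function name `build_successor_map` is invented for this example.

```python
def build_successor_map(words):
    """Map each word to the list of words that follow it, repeats included."""
    successors = {}
    for first, second in zip(words, words[1:]):
        # setdefault creates the empty list the first time a word is seen
        successors.setdefault(first, []).append(second)
    return successors

# Made-up sample sentence
words = 'the cat saw the dog and the cat ran'.split()
smap = build_successor_map(words)

print(smap['the'])
print('ran' in smap)
```

Here 'cat' appears twice among the successors of 'the', so a random choice from that list picks it twice as often as 'dog'. The final word, 'ran', has no successor and therefore never becomes a key.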
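12.9 Generating Text
Generate text by a random walk through the Markov model: start from a word, pick one of its successors at random, and repeat. The result tends to read like the original locally, even though it rarely makes sense as a whole.

    # Generate 10 words, beginning from a starting word
    word = 'although'
    for i in range(10):
        successors = successor_map[word]
        word = random.choice(successors)
        print(word, end=' ')

Output (varies from run to run):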
continue to hesitate and swallowed the smile withered from that
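
The loop above raises a `KeyError` if it reaches a word with no recorded successor, such as the last word of the text. A minimal sketch of a generator that stops in that case instead, using a small hand-written map rather than the book's, might look like this; the function name `generate` and the seed value are assumptions of the example.

```python
import random

def generate(successor_map, start, n):
    """Return up to n words from a random walk over successor_map."""
    word = start
    output = []
    for _ in range(n):
        successors = successor_map.get(word)
        if not successors:
            # Dead end: this word was never followed by anything
            break
        word = random.choice(successors)
        output.append(word)
    return output

# Hand-written map for illustration; 'ran' is a dead end
successor_map = {
    'the': ['cat', 'dog', 'cat'],
    'cat': ['saw', 'ran'],
    'saw': ['the'],
    'dog': ['and'],
    'and': ['the'],
}

random.seed(3)  # arbitrary seed so the run is repeatable
text = generate(successor_map, 'the', 10)
print(' '.join(text))
```

Returning a list rather than printing inside the loop makes the function easier to test and to reuse.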
12.10 Debugging
12.11 Glossary
12.12 Exercises
12.12.2 Counting Trigrams

    trigram_counter = {}

    def count_trigram(trigram):
        # Convert to a tuple so it can be used as a dictionary key
        key = tuple(trigram)
        if key not in trigram_counter:
            trigram_counter[key] = 1
        else:
            trigram_counter[key] += 1

    window = []

    def process_word_trigram(word):
        window.append(word)
        if len(window) == 3:
            count_trigram(window)
            window.pop(0)

12.12.3 Trigram Markov Map

    # Assumes successor_map has been reset to {} so that it holds only tuple keys
    def add_trigram(trigram):
        # The first two words form the key; the third word is the successor
        key = (trigram[0], trigram[1])
        third = trigram[2]
        if key not in successor_map:
            successor_map[key] = [third]
        else:
            successor_map[key].append(third)
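
To show how a map keyed by word pairs can drive generation, here is a sketch on a short made-up sentence. The helper name `build_trigram_map`, the starting pair, and the length limit are assumptions of this example, not part of the exercise text.

```python
import random

def build_trigram_map(words):
    """Map each pair of consecutive words to the list of words that follow it."""
    successors = {}
    for first, second, third in zip(words, words[1:], words[2:]):
        successors.setdefault((first, second), []).append(third)
    return successors

# Made-up sample sentence
words = 'the cat saw the dog and the cat ran'.split()
tmap = build_trigram_map(words)
print(tmap[('the', 'cat')])

# Generate: choose a successor, then slide the two-word key forward by one
key = ('the', 'cat')
out = list(key)
while key in tmap and len(out) < 8:
    third = random.choice(tmap[key])
    out.append(third)
    key = (key[1], third)

print(' '.join(out))
```

Because each choice depends on two preceding words rather than one, the output usually follows the source text more closely than the bigram version does.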