
Python Text Analysis and Generation: From Word Frequency Counts to Markov Automatic Writing

2026-06-29 23:50:01

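A Whole New Way of Thinking | Think Python, Chapter 12

12. Text Analysis and Generation

In this chapter we use lists, dictionaries, and tuples to analyze text and to generate text with a Markov model. Algorithms of this kind loosely share a core idea with large language models: choosing the next word based on the words that precede it. We will count word frequencies, analyze which words follow which, and generate new text in a similar style.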

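12.1 Unique words

First, read the text and count the number of unique words in the book, recording each word as a dictionary key.

# File name: the text of "Dr. Jekyll and Mr. Hyde"
filename = 'dr_jekyll.txt'

# Use a dictionary to record every unique word
unique_words = {}
for line in open(filename):
    # Split each line into a sequence of words
    seq = line.split()
    for word in seq:
        unique_words[word] = 1

# The length of the dictionary is the number of unique words
len(unique_words)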


# Sort by length and look at the five longest words
sorted(unique_words, key=len)[-5:]

Output:

['chocolate-coloured', 'superiors—behold!”', 'coolness—frightened', 'gentleman—something', 'pocket-handkerchief.']
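
A dictionary whose values are all 1 is effectively standing in for a set. As a minimal sketch, assuming a small in-memory sample in place of the book file (`sample_lines` and `count_unique` are illustrative names, not part of the original), the same count could be written with a `set`:

```python
def count_unique(lines):
    """Return the set of distinct whitespace-separated tokens in lines."""
    unique = set()
    for line in lines:
        # str.split() with no arguments splits on any run of whitespace
        unique.update(line.split())
    return unique

# Hypothetical sample text standing in for the book file
sample_lines = [
    "the door was locked",
    "the door was very old",
]

unique = count_unique(sample_lines)
print(len(unique))                    # number of distinct tokens
print(sorted(unique, key=len)[-1])    # the longest token
```

The dictionary version in the text has one advantage that the set lacks: it can later be extended to store counts as values.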

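12.2 Punctuation

The raw text contains punctuation, long dashes, and other noise, so the words need to be cleaned.

import unicodedata

# Replace long dashes with spaces, then split the line into words
def split_line(line):
    return line.replace('—', ' ').split()

# Collect every punctuation character that appears in the text
punc_marks = {}
for line in open(filename):
    for char in line:
        # Unicode categories that start with 'P' are punctuation
        category = unicodedata.category(char)
        if category.startswith('P'):
            punc_marks[char] = 1

# Join the keys into a single string of punctuation characters
punctuation = ''.join(punc_marks)
print(punctuation)

Output: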

.’;,-“”:?—‘!()_

# Clean a word: strip leading and trailing punctuation, then lowercase it
def clean_word(word):
    return word.strip(punctuation).lower()

# Example: clean a word wrapped in punctuation
clean_word('“Behold!”')

Output:

'behold'
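
It may help to see what `str.strip` does and does not remove. The sketch below assumes a hard-coded `punctuation_sample` string (a stand-in for the `punctuation` string built from the book) and a hypothetical helper named `clean_word_demo`. It suggests that only leading and trailing characters are stripped, so internal hyphens survive.

```python
# Assumed stand-in for the punctuation string collected from the book
punctuation_sample = '.’;,-“”:?‘!()_'

def clean_word_demo(word):
    """Strip punctuation from both ends of word and lowercase it."""
    return word.strip(punctuation_sample).lower()

examples = ['“Behold!”', 'pocket-handkerchief.', '(Hyde)', '...']
cleaned = [clean_word_demo(w) for w in examples]
print(cleaned)
# ['behold', 'pocket-handkerchief', 'hyde', '']
```

A token made up entirely of punctuation is reduced to an empty string, so a caller may want to skip empty results before counting.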

# Count the unique words again, this time after cleaning
unique_words2 = {}
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        unique_words2[word] = 1

len(unique_words2)


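12.3 Word frequencies

Count how many times each word appears.

# Dictionary that maps each word to its frequency
word_counter = {}
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        if word not in word_counter:
            word_counter[word] = 1
        else:
            word_counter[word] += 1

# Return the second element of a tuple (the frequency)
def second_element(t):
    return t[1]

# Sort the (word, frequency) pairs by frequency, highest first
items = sorted(word_counter.items(), key=second_element, reverse=True)

# Print the five most frequent words
for word, freq in items[:5]:
    print(freq, word, sep='\t')

Output: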

1614	the
972	and
941	of
640	to
640	i
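
For comparison, the standard library's `collections.Counter` handles both the counting and the sorting. This is a sketch of an alternative, not the approach the chapter takes, and the `words` list is a hypothetical stand-in for the cleaned words of the book.

```python
from collections import Counter

# Hypothetical cleaned word list standing in for the words of the book
words = ['the', 'cat', 'and', 'the', 'dog', 'and', 'the', 'bird']

# Counter builds the word -> count mapping in one step
counter = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs
for word, freq in counter.most_common(2):
    print(freq, word, sep='\t')
```

Writing the loop by hand, as the chapter does, makes the role of the dictionary explicit; `Counter` is the shorter form once that idea is familiar.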

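12.4 Optional parameters

Define a function with an optional parameter, that is, a parameter that has a default value.

# Print the most common words; num defaults to 5
def print_most_common(word_counter, num=5):
    items = sorted(word_counter.items(), key=second_element, reverse=True)
    for word, freq in items[:num]:
        print(freq, word, sep='\t')

# Use the default value
print_most_common(word_counter)

Output: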

1614	the
972	and
941	of
640	to
640	i

# Override the default with num=3
print_most_common(word_counter, 3)

Output:

1614	the
972	and
941	of
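
An optional parameter can be overridden by position, as above, or by name. The following sketch uses a hypothetical `most_common` function that returns the pairs instead of printing them, along with made-up counts, to compare the three call styles.

```python
def most_common(word_counter, num=5):
    """Return the num most frequent (word, count) pairs."""
    items = sorted(word_counter.items(), key=lambda t: t[1], reverse=True)
    return items[:num]

# Hypothetical counts for illustration
counts = {'the': 4, 'and': 3, 'of': 2, 'to': 1}

default_result = most_common(counts)          # num falls back to 5
positional_result = most_common(counts, 2)    # num overridden by position
keyword_result = most_common(counts, num=2)   # num overridden by name

print(default_result)
print(keyword_result)
```

Passing `num=2` by name tends to read more clearly at the call site, since a bare `2` does not say what it controls.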

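12.5 Dictionary subtraction

Find the words that appear in the book but not in a standard word list. Such words are candidates for spelling errors, although many of them turn out to be proper names or words that the list simply omits.

# Read the standard word list
word_list = open('words.txt').read().split()

# Build a dictionary of valid words
valid_words = {}
for word in word_list:
    valid_words[word] = 1

# Dictionary subtraction: keep the keys that are in d1 but not in d2
def subtract(d1, d2):
    res = {}
    for key in d1:
        if key not in d2:
            res[key] = d1[key]
    return res

# Find the words in the book that are not in the word list
diff = subtract(word_counter, valid_words)

# Show the most common of these words
print_most_common(diff)

Output: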

640	i
628	a
128	utterson
124	mr
98	hyde
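
Dictionary key views support set operations, so the subtraction can also be expressed as a set difference. This is a sketch of an alternative under assumed data; `subtract_keys` and the sample dictionaries are illustrative names only.

```python
def subtract_keys(d1, d2):
    """Return a dict with the items of d1 whose keys are not in d2."""
    # dict key views support set operations such as difference
    missing = d1.keys() - d2.keys()
    return {key: d1[key] for key in missing}

# Hypothetical counts and word list for illustration
counts = {'hyde': 98, 'the': 1614, 'utterson': 128}
valid = {'the': 1, 'and': 1}

diff = subtract_keys(counts, valid)
print(sorted(diff.items()))
# [('hyde', 98), ('utterson', 128)]
```

One trade-off: the set difference does not preserve the insertion order of `d1`, whereas the explicit loop in `subtract` does.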

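12.6 Random numbers

Use the random module to choose words at random, which gives a first, basic form of random text.

import random

# Choose one element from a list at random
t = [1, 2, 3]
random.choice(t)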


# Convert the dictionary keys to a list
words = list(word_counter)

# Choose one word at random
random.choice(words)

Output:

'posture'

# Choose 6 words at random, weighted by word frequency
weights = word_counter.values()
random_words = random.choices(words, weights=weights, k=6)
result = ' '.join(random_words)
print(result)

Output:

reach streets edward a said to
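
Because the generator is pseudorandom, the output above changes from run to run. A minimal sketch of making it repeatable, assuming a made-up word list and weights: seeding a `random.Random` instance with the same value yields the same sequence each time. The `sample` helper is a hypothetical name.

```python
import random

# Hypothetical words and frequencies for illustration
words = ['the', 'and', 'of', 'hyde']
weights = [1614, 972, 941, 98]

def sample(seed):
    """Draw 6 weighted words from a generator seeded with seed."""
    # A private Random instance avoids disturbing the global generator
    rng = random.Random(seed)
    return rng.choices(words, weights=weights, k=6)

first = sample(42)
second = sample(42)
print(first == second)   # same seed, same sequence
```

Fixing the seed is mostly useful for debugging and testing, where a reproducible sequence makes failures easier to investigate.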

12.7 Bigrams

Count each combination of two consecutive words (a bigram).

# Bigram counter
bigram_counter = {}

def count_bigram(bigram):
    # Convert to a tuple so it can be used as a dictionary key
    key = tuple(bigram)
    if key not in bigram_counter:
        bigram_counter[key] = 1
    else:
        bigram_counter[key] += 1

# Sliding window that holds the two most recent words
window = []

def process_word(word):
    window.append(word)
    if len(window) == 2:
        # When the window is full, count the bigram
        count_bigram(window)
        # Drop the oldest word
        window.pop(0)

# Process every word in the book
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        process_word(word)

# Show the most common bigrams
print_most_common(bigram_counter)

Output:

178	('of', 'the')
139	('in', 'the')
94	('it', 'was')
80	('and', 'the')
73	('to', 'the')
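
When all the words already sit in a list, adjacent pairs can be formed with `zip` instead of a sliding window. The sketch below assumes a short hypothetical word list, and `count_bigrams` is an illustrative name.

```python
def count_bigrams(words):
    """Count each pair of adjacent words in a list."""
    counts = {}
    # zip pairs each word with the word that follows it
    for pair in zip(words, words[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

# Hypothetical word list for illustration
words = ['it', 'was', 'the', 'best', 'it', 'was', 'the', 'worst']
bigrams = count_bigrams(words)
print(bigrams[('it', 'was')])
# 2
```

The sliding window in the text processes one word at a time, so it works while streaming through a file; the `zip` form needs the whole word list in memory first.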

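12.8 Markov analysis

Build a mapping from each word to the list of words that follow it. This mapping is what drives text generation.

# Successor map: each key is a word, each value is the list of words that follow it
successor_map = {}

def add_bigram(bigram):
    first, second = bigram
    if first not in successor_map:
        successor_map[first] = [second]
    else:
        successor_map[first].append(second)

window = []

def process_word_bigram(word):
    window.append(word)
    if len(window) == 2:
        add_bigram(window)
        window.pop(0)

# Analyze the whole book again, starting from an empty map and window
successor_map = {}
window = []
for line in open(filename):
    for word in split_line(line):
        word = clean_word(word)
        process_word_bigram(word)

# Look up the successors of one word
successor_map['going']

Output: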

['east', 'in', 'to', 'to', 'up', 'to', 'of']
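
The "create the list if the key is missing" pattern is what `collections.defaultdict` is designed for. As a sketch under assumed data (`build_successor_map` and the word list are hypothetical), the successor map might be built like this:

```python
from collections import defaultdict

def build_successor_map(words):
    """Map each word to the list of words that follow it."""
    successors = defaultdict(list)
    for first, second in zip(words, words[1:]):
        # Missing keys start as an empty list automatically
        successors[first].append(second)
    return successors

# Hypothetical word list for illustration
words = ['going', 'east', 'and', 'going', 'to', 'town', 'going', 'to']
smap = build_successor_map(words)
print(smap['going'])
# ['east', 'to', 'to']
```

Duplicates are kept on purpose: because 'to' appears twice in the list, `random.choice` will pick it twice as often as 'east'. One caveat with `defaultdict` is that indexing a missing key inserts it, so `in` or `.get()` is the safer way to test for membership.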

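12.9 Text generation

Generate text at random from the Markov model. Each word is chosen from the successors of the previous word, so the result tends to read plausibly from one word to the next.

# Starting from a seed word, generate 10 words
word = 'although'
for i in range(10):
    successors = successor_map[word]
    word = random.choice(successors)
    print(word, end=' ')

Output: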

continue to hesitate and swallowed the smile withered from that
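
The loop above raises a `KeyError` if it reaches a word that never had a successor, such as the last word of the book. A minimal sketch of a more defensive version, wrapped in a hypothetical `generate` function and tried on a made-up toy map:

```python
import random

def generate(successor_map, start, n, rng=random):
    """Generate up to n words by following random successors from start."""
    word = start
    output = []
    for _ in range(n):
        successors = successor_map.get(word)
        if not successors:
            # Dead end: this word was never followed by anything
            break
        word = rng.choice(successors)
        output.append(word)
    return output

# Hypothetical toy map; 'end' has no successors
toy_map = {'a': ['b'], 'b': ['c', 'c'], 'c': ['end']}
result = generate(toy_map, 'a', 10)
print(' '.join(result))
# b c end
```

Stopping at a dead end is one option among several; restarting from a randomly chosen word would be another.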

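12.10 Debugging

Read the code and check that its logic matches what you expect.
Run the code in pieces to narrow down where the error is.
Pause and think about what kind of error it might be.
Rubber duck debugging: explain the code to an inanimate object.
Retreat: roll back recent changes and simplify the structure.
Rest, then come back to debugging.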

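12.11 Glossary

English	Chinese	Definition
default value	默认值	The value a parameter takes when no argument is passed
override	覆盖	To replace a default value with a passed argument
deterministic	确定性	Given the same input, always produces the same output
pseudorandom	伪随机	Describes a sequence that looks random but is generated by an algorithm
bigram	双词	A sequence of two consecutive words
trigram	三词	A sequence of three consecutive words
n-gram	n元组	A sequence of n consecutive elements
rubber duck debugging	橡皮鸭调试	Finding errors by explaining the code aloud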

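12.12 Exercises

12.12.2 Trigram counts

# Trigram counter
trigram_counter = {}

def count_trigram(trigram):
    # Convert to a tuple so it can be used as a dictionary key
    key = tuple(trigram)
    if key not in trigram_counter:
        trigram_counter[key] = 1
    else:
        trigram_counter[key] += 1

# Sliding window that holds the three most recent words
window = []

def process_word_trigram(word):
    window.append(word)
    if len(window) == 3:
        count_trigram(window)
        window.pop(0)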

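12.12.3 Trigram Markov mapping

# Assumes successor_map has been reset to an empty dictionary beforehand
def add_trigram(trigram):
    # The first two words form the key; the third word is the successor
    key = (trigram[0], trigram[1])
    third = trigram[2]
    if key not in successor_map:
        successor_map[key] = [third]
    else:
        successor_map[key].append(third)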
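
To generate text from a map keyed by word pairs, the key has to slide forward one word at each step. The sketch below is one possible way to do that, not the book's solution; `build_trigram_map`, `generate_from_pairs`, and the sample words are all hypothetical.

```python
import random

def build_trigram_map(words):
    """Map each pair of adjacent words to the words that follow the pair."""
    mapping = {}
    for first, second, third in zip(words, words[1:], words[2:]):
        mapping.setdefault((first, second), []).append(third)
    return mapping

def generate_from_pairs(mapping, start_pair, n, rng=random):
    """Generate up to n words, sliding the two-word key forward each step."""
    key = tuple(start_pair)
    output = []
    for _ in range(n):
        successors = mapping.get(key)
        if not successors:
            # Dead end: this pair was never followed by anything
            break
        word = rng.choice(successors)
        output.append(word)
        # Drop the older word of the pair and append the new one
        key = (key[1], word)
    return output

# Hypothetical word list for illustration
words = ['the', 'door', 'was', 'locked', 'and', 'the', 'door', 'was', 'old']
tmap = build_trigram_map(words)
print(tmap[('door', 'was')])
# ['locked', 'old']
print(generate_from_pairs(tmap, ('and', 'the'), 2))
# ['door', 'was']
```

Conditioning on two words instead of one usually makes the generated text follow the source more closely, at the cost of more dead ends when the source text is small.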
