Last night, a little past eleven, I was downstairs at the office having a smoke when my phone buzzed. Xiao Li from my team: "Bro, the boss wants me to 'pull keywords out of a piece of text' — four different ways, and they all have to actually ship…" I had to laugh. This stuff is simple if you call it simple and messy if you call it messy. The messy part: you have exactly one piece of text, no corpus, and you still want it to "get you" the way a search engine does. For that you have to be a little clever.
The four approaches below are the kind I use in my own little tools. I'm handing you runnable code (no third-party dependencies), so take it and tweak it. Also, don't expect it to be accurate on the first pass. Keywords are inherently a matter of taste; you'll need to tune the stopword list, the window size, and the number of results to your own business.
First, a sample text, so we're not talking in the abstract. It's a short incident note (in Chinese): a slow API, a downstream SQL updating a hot row, retries moved from synchronous calls to async MQ, and structured logs stitched together by trace_id:
TEXT = """我们线上有个接口最近老是慢,trace 一拉发现是下游 SQL 在更新热点行。后来把 where 条件补上,还把重试逻辑从同步改成 MQ 异步,峰值就稳了。顺手把日志结构化,做了个按 trace_id 串起来的排查面板,故障定位快多了。"""
OK, the first route is the crudest but often good enough: plain term frequency (TF) on the single text, plus a bit of noise reduction. Don't laugh, it genuinely covers half the real-world asks. The core idea: count occurrences, throw out words that are very short or very common, and give longer words a small bonus. The downside is just as obvious: generic words like "接口" (API), "系统" (system), "问题" (problem) tend to float to the top, so you have to pair it with a stopword list.
import re
from collections import Counter

DEFAULT_STOP = {"的","了","是","在","和","有","就","也","把","还","从","一个","这个","那个","我们","你","我","他","她","它","最近","发现","后来","以及"}

def tokenize(text: str):
    # Rough tokenizer that handles both Chinese and English/digits:
    # Chinese as runs of consecutive Han characters, English as word chunks.
    text = text.lower()
    chunks = re.findall(r"[\u4e00-\u9fff]+|[a-z0-9_]+", text)
    return [c for c in chunks if len(c) >= 2]

def keywords_tf(text: str, topk=10, stopwords=None):
    stopwords = stopwords or DEFAULT_STOP
    toks = [t for t in tokenize(text) if t not in stopwords]
    cnt = Counter(toks)
    # Simple "word-length bonus + square-root compression of the raw frequency".
    scored = []
    for w, f in cnt.items():
        score = (1 + (len(w) - 2) * 0.15) * (1 + (f ** 0.5))
        scored.append((w, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return scored[:topk]
The second one starts to look like the real thing: the RAKE idea (no, not the garden tool). Roughly: use stopwords and punctuation to chop the text into "candidate phrases", things like "热点行" (hot row), "结构化 日志" (structured logs), "MQ 异步" (async MQ), then look at how tightly the words inside a phrase hang together; the tighter the connection, the more it looks like a single concept, and the higher it scores. This is quite nice for Chinese too, because phrases usually make better keywords than single words.
def split_phrases(text: str, stopwords=None):
    stopwords = stopwords or DEFAULT_STOP
    # Treat punctuation as hard separators first.
    parts = re.split(r"[。!?;\n\r,,.()()::]+", text)
    phrases = []
    for p in parts:
        toks = tokenize(p)
        buf = []
        for t in toks:
            if t in stopwords:
                if buf:
                    phrases.append(" ".join(buf))
                buf = []
            else:
                buf.append(t)
        if buf:
            phrases.append(" ".join(buf))
    # Drop phrases that are too short (>= 1 keeps everything; raise it if you only want longer phrases).
    return [ph for ph in phrases if len(ph.split()) >= 1]

def keywords_rake(text: str, topk=10, stopwords=None):
    stopwords = stopwords or DEFAULT_STOP
    phrases = split_phrases(text, stopwords)
    # For each word, count its frequency and its "degree" (how many other words it co-occurs with).
    freq = Counter()
    degree = Counter()
    for ph in phrases:
        ws = ph.split()
        for w in ws:
            freq[w] += 1
            degree[w] += (len(ws) - 1)
    word_score = {w: (degree[w] + freq[w]) / max(freq[w], 1) for w in freq}
    phrase_score = []
    for ph in phrases:
        score = sum(word_score.get(w, 0.0) for w in ph.split())
        phrase_score.append((ph.replace(" ", ""), score))  # rejoin so it reads like a Chinese phrase
    phrase_score.sort(key=lambda x: x[1], reverse=True)
    return phrase_score[:topk]
Third, TextRank, the old hand. Think of it this way: if two words keep showing up together inside a sliding window, they're "tight"; words that are tight with important words are more likely to be the topic. Then run a PageRank-style iteration so that words pointed to by important words become important themselves. This is especially good for troubleshooting-style technical text, because things like "SQL / 热点行 (hot row) / where / 更新 (update)" cluster together.
from collections import defaultdict

def keywords_textrank(text: str, topk=10, window=4, stopwords=None, iters=20, d=0.85):
    stopwords = stopwords or DEFAULT_STOP
    toks = [t for t in tokenize(text) if t not in stopwords]
    if not toks:
        return []
    # Build an undirected co-occurrence graph over a sliding window.
    g = defaultdict(set)
    for i, w in enumerate(toks):
        for j in range(i + 1, min(i + window, len(toks))):
            u = toks[j]
            if w != u:
                g[w].add(u)
                g[u].add(w)
    # Initialize scores, then run a few PageRank-style iterations.
    score = {w: 1.0 for w in g}
    for _ in range(iters):
        new = {}
        for w in g:
            s = 1 - d
            for v in g[w]:
                s += d * (score[v] / max(len(g[v]), 1))
            new[w] = s
        score = new
    ranked = sorted(score.items(), key=lambda x: x[1], reverse=True)
    return ranked[:topk]
The fourth one is my personal favorite, for a very practical reason: you only have "one piece of text", so split it into sentences, treat them as a tiny corpus, and use sentence-level IDF. That immediately pushes down the generic words that show up in every sentence and pushes up the words that appear in only a few sentences but really matter — in the sample above, things like "trace_id", "热点行" (hot row), "MQ" stand out more. Then add one small trick: don't let all the picks come from the same family; apply a simple diversity filter, or the result ends up as "更新, 更新, 更新" (update, update, update).
import math

def split_sentences(text: str):
    sents = [s.strip() for s in re.split(r"[。!?;\n\r]+", text) if s.strip()]
    return sents if sents else [text]

def keywords_sentence_idf(text: str, topk=10, stopwords=None):
    stopwords = stopwords or DEFAULT_STOP
    sents = split_sentences(text)
    sent_tokens = []
    df = Counter()  # in how many sentences each word appears
    for s in sents:
        toks = [t for t in tokenize(s) if t not in stopwords]
        sent_tokens.append(toks)
        for w in set(toks):
            df[w] += 1
    N = len(sents)
    tf = Counter([w for toks in sent_tokens for w in toks])
    scored = []
    for w, f in tf.items():
        idf = math.log((N + 1) / (df[w] + 1)) + 1.0
        scored.append((w, f * idf))
    scored.sort(key=lambda x: x[1], reverse=True)

    # Simple diversity filter: skip a word whose character-level Jaccard similarity
    # to an already-picked word is too high (keeps near-duplicates from flooding the list).
    def jaccard(a, b):
        sa, sb = set(a), set(b)
        return len(sa & sb) / max(len(sa | sb), 1)

    picked = []
    for w, s in scored:
        if all(jaccard(w, pw) < 0.6 for pw, _ in picked):
            picked.append((w, s))
        if len(picked) >= topk:
            break
    return picked
If you just want to see the results, run it like this:
if __name__ == "__main__":
    print("TF:", keywords_tf(TEXT, topk=8))
    print("RAKE:", keywords_rake(TEXT, topk=8))
    print("TextRank:", keywords_textrank(TEXT, topk=8))
    print("Sentence-IDF:", keywords_sentence_idf(TEXT, topk=8))
Anyway, my experience: if you're writing the WeChat-official-account kind of post where the idea has to be clear and the code has to be copy-pasteable, run all four and then tell your readers this — in practice, RAKE and sentence-level IDF produce phrases that read more like something a human would say, TextRank is the most stable, and TF is the fastest and crudest but fine in a pinch. And add one last line: the stopword list is the soul of the whole thing. Don't run production text against the default stopwords, or "系统" (system), "接口" (API), "问题" (problem) will drive you up the wall.
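To make that concrete, here's a minimal sketch of what per-business stopword tuning could look like. BUSINESS_STOP and keywords_all are names I'm making up on the spot, not part of the snippets above; the only thing it relies on is that all four functions accept a stopwords argument, which they do.

# Hypothetical extension of the snippets above: layer business-specific noise words
# on top of DEFAULT_STOP and run all four extractors with the merged list.
BUSINESS_STOP = {"系统", "接口", "问题", "然后"}  # made-up example; swap in whatever pollutes your results

def keywords_all(text, topk=8, extra_stopwords=BUSINESS_STOP):
    # Merge the default stop list with the business-specific one, then feed it to every extractor.
    stop = DEFAULT_STOP | set(extra_stopwords)
    return {
        "tf": keywords_tf(text, topk=topk, stopwords=stop),
        "rake": keywords_rake(text, topk=topk, stopwords=stop),
        "textrank": keywords_textrank(text, topk=topk, stopwords=stop),
        "sentence_idf": keywords_sentence_idf(text, topk=topk, stopwords=stop),
    }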
Ah, and while I'm at it, I'm remembering the last time we were debugging a production issue: the number-one extracted keyword turned out to be "然后" (then). I just… fine, I'll stop complaining. Let me go grab my takeout first; when I'm back we can talk about whether to bolt on keyword highlighting and summary-sentence extraction while we're at it.