
Algorithm Engineer Interview: TF-IDF Principles and Python Code

2026-02-05 01:04:56

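TF-IDF (Term Frequency-Inverse Document Frequency) is a classic text feature representation used to measure how important a term is to a document. The core idea: the more often a term occurs in the current document, and the rarer it is in other documents, the better it distinguishes the current document. The method is widely used in information retrieval, text classification, and keyword extraction.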

Principle Breakdown

TF-IDF consists of two parts:

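1. Term Frequency (TF) measures how often a term occurs in a document. It is usually normalized by document length, so that long documents do not get higher term frequencies simply because they are longer:

$$\mathrm{TF}(t, d) = \frac{\text{number of times term } t \text{ occurs in document } d}{\text{total number of terms in document } d}$$

2. Inverse Document Frequency (IDF) measures how rare a term is across the whole corpus. A term that appears in many documents (such as "的" or "是", roughly "of" and "is") has a low IDF; a term that appears in only a few documents (such as "量子计算", "quantum computing") has a high IDF:

$$\mathrm{IDF}(t) = \log \frac{N}{\mathrm{df}(t)}$$

Here N is the total number of documents in the corpus and df(t) is the number of documents that contain term t. To avoid a zero denominator, practical implementations often add 1 to df(t) and smooth the total document count as well, for example:

$$\mathrm{IDF}_{\text{smooth}}(t) = \log \frac{1 + N}{1 + \mathrm{df}(t)}$$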
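
The two definitions above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the toy corpus and the helper names `term_frequency`, `document_frequency`, and `idf` are made up for this example, and the smoothed branch uses the "add 1 to both counts" variant described above.

```python
import math

# Toy corpus, already tokenized (hypothetical example data).
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "barked"],
    ["the", "quantum", "computer"],
]


def term_frequency(term, doc):
    """Occurrences of `term` divided by the document length."""
    return doc.count(term) / len(doc) if doc else 0.0


def document_frequency(term, docs):
    """Number of documents that contain `term` at least once."""
    return sum(1 for doc in docs if term in doc)


def idf(term, docs, smooth=True):
    """IDF as log(N / df), or the smoothed log((1 + N) / (1 + df))."""
    n_docs = len(docs)
    df = document_frequency(term, docs)
    if smooth:
        return math.log((1 + n_docs) / (1 + df))
    if df == 0:
        # Without smoothing, a term that occurs nowhere divides by zero.
        raise ValueError(f"unsmoothed IDF is undefined for unseen term {term!r}")
    return math.log(n_docs / df)


print(term_frequency("the", corpus[0]))
print(idf("the", corpus, smooth=False), idf("quantum", corpus, smooth=False))
print(idf("zebra", corpus, smooth=True))
```

"the" appears in every document, so its unsmoothed IDF is log(3/3) = 0, while "quantum" appears in only one and gets log(3). The unseen term "zebra" is only well defined in the smoothed variant.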

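Finally, the TF-IDF value is the product of the two:

$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$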
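
As a worked illustration of the product, the sketch below scores every term of one document against a three-document toy corpus. The corpus and the function name `tfidf_scores` are assumptions made for this example, and it uses length-normalized TF with unsmoothed IDF.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the quantum computer hummed".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents each term appears.
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf_scores(doc):
    """Return {term: TF * IDF} for one tokenized document (unsmoothed IDF)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }


scores = tfidf_scores(docs[0])
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:>4}: {score:.4f}")
```

Although "the" is the most frequent term in the first document, it occurs in all three documents, so its IDF and therefore its TF-IDF score is zero. "cat" and "mat" occur only in this document and score highest.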

Strengths and Limitations

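• Strengths: simple to compute, highly interpretable, no training required;
• Limitations: ignores word order and semantics (for example, "不高兴" ("unhappy") and "高兴" ("happy") are treated as unrelated terms); sensitive to short texts, where counts are sparse; cannot capture contextual relationships.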
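
The word-order and semantics limitations can be seen without computing any weights, because TF-IDF only reweights per-term counts. The following is a minimal sketch; the helper `bag_of_words` and the English sentences are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Whitespace tokenization followed by term counting; order is discarded."""
    return Counter(text.lower().split())


# Different meaning, identical representation: word order is lost.
a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")
same_representation = a == b
print(same_representation)

# Related meaning, disjoint representation: terms are opaque strings,
# so "happy" and "unhappy" occupy unrelated dimensions.
overlap = set(bag_of_words("happy")) & set(bag_of_words("unhappy"))
print(overlap)
```

Any weighting scheme built on these counts, TF-IDF included, inherits both properties.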

Even so, TF-IDF remains an important tool for text preprocessing and baseline models.

Python Implementation

import math
from collections import Counter

import numpy as np


class TFIDFVectorizer:
    def __init__(self, smooth_idf=True, use_log_tf=True):
        """
        TF-IDF vectorizer.

        Parameters:
            smooth_idf: whether to smooth the IDF
            use_log_tf: whether to use log-scaled TF, log(1 + count),
                instead of the length-normalized count
        """
        self.smooth_idf = smooth_idf
        self.use_log_tf = use_log_tf
        self.vocab_ = None
        self.idf_ = None
        self.N = 0  # total number of documents

    def fit(self, documents):
        """
        Fit the TF-IDF model.

        Parameters:
            documents: list of documents, each a string
        """
        # 1. Build the vocabulary
        vocab = set()
        for doc in documents:
            vocab.update(self._tokenize(doc))
        self.vocab_ = sorted(vocab)

        # 2. Count the document frequency of each word
        doc_freq = {word: 0 for word in self.vocab_}
        self.N = len(documents)
        for doc in documents:
            for word in set(self._tokenize(doc)):
                doc_freq[word] += 1

        # 3. Compute the IDF
        self.idf_ = {}
        for word, df in doc_freq.items():
            if self.smooth_idf:
                # Smoothed: log((1 + N) / (1 + df)). Smoothing N as well keeps
                # the value non-negative; log(N / (1 + df)) turns negative
                # for a word that appears in every document.
                self.idf_[word] = math.log((1 + self.N) / (1 + df))
            else:
                # Unsmoothed: log(N / df)
                self.idf_[word] = math.log(self.N / df) if df > 0 else 0.0
        return self

    def transform(self, documents):
        """
        Convert documents to TF-IDF vectors.

        Parameters:
            documents: list of documents

        Returns:
            TF-IDF matrix of shape (n_docs, n_features)
        """
        if self.vocab_ is None or self.idf_ is None:
            raise ValueError("Call fit before transform")

        # Map each vocabulary word to its column index
        word_to_idx = {word: i for i, word in enumerate(self.vocab_)}

        # Initialize the result matrix
        n_docs = len(documents)
        n_features = len(self.vocab_)
        tfidf_matrix = np.zeros((n_docs, n_features))

        for i, doc in enumerate(documents):
            words = self._tokenize(doc)
            # Count term occurrences
            word_counts = Counter(words)
            total_words = len(words)

            for word, count in word_counts.items():
                # Words not seen during fit are ignored
                if word in word_to_idx:
                    idx = word_to_idx[word]
                    # Compute TF
                    if self.use_log_tf:
                        tf = math.log(1 + count)
                    else:
                        tf = count / total_words if total_words > 0 else 0
                    # Compute TF-IDF
                    tfidf_matrix[i, idx] = tf * self.idf_.get(word, 0)
        return tfidf_matrix

    def fit_transform(self, documents):
        """Fit and transform in one step."""
        self.fit(documents)
        return self.transform(documents)

    def _tokenize(self, text):
        """Simple tokenizer (real applications may need proper word segmentation)."""
        # Lowercase, then split on whitespace
        return text.lower().split()

    def get_feature_names(self):
        """Return the feature names (the vocabulary)."""
        return self.vocab_ if self.vocab_ is not None else []


# Example documents: Chinese text that has already been segmented with spaces
documents = [
    "机器学习 是 人工智能 的 重要 分支",
    "深度学习 是 机器学习 的 一个 子领域",
    "自然语言处理 使用 机器学习 技术",
    "人工智能 将 改变 世界",
]

# Use the custom TF-IDF vectorizer
vectorizer = TFIDFVectorizer(smooth_idf=True, use_log_tf=True)
tfidf_matrix = vectorizer.fit_transform(documents)
feature_names = vectorizer.get_feature_names()

print("Vocabulary:", feature_names)
print("\nTF-IDF matrix:")
print(tfidf_matrix)
print("\nMatrix shape:", tfidf_matrix.shape)

# Print the non-zero TF-IDF entries of each document
for i, doc in enumerate(documents):
    print(f"\nDocument {i + 1}: {doc[:30]}...")
    for j, word in enumerate(feature_names):
        score = tfidf_matrix[i, j]
        if score > 0:
            print(f"  {word}: {score:.4f}")

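For the information-retrieval use mentioned at the start, rows of a TF-IDF matrix are commonly compared with cosine similarity. The sketch below is a pure-Python illustration under stated assumptions: the vectors are made-up values over a hypothetical four-term vocabulary, not output of the class above, and `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up TF-IDF rows over a hypothetical 4-term vocabulary (not real output).
doc_vectors = [
    [0.5, 0.0, 0.3, 0.0],
    [0.4, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9],
]
query_vector = [0.6, 0.0, 0.2, 0.0]

# Rank document indices from most to least similar to the query.
ranking = sorted(
    range(len(doc_vectors)),
    key=lambda i: cosine_similarity(query_vector, doc_vectors[i]),
    reverse=True,
)
print(ranking)
```

Because cosine similarity divides by the vector norms, a long document is not favored merely for having larger raw weights.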