Before We Begin
Training a model is only the first step; evaluating it scientifically is what decides whether a machine learning project succeeds. The same model can lead to completely opposite conclusions under different evaluation methods, and choosing the wrong metric can even convince you that a model is excellent when it is actually useless.

This article systematically introduces a complete evaluation framework for the three major categories of machine learning tasks: from supervised classification and regression to unsupervised clustering, from metric computation to validation methods, and from theory to hands-on code, walking you step by step through a full methodology for evaluating models scientifically.

By the end, you will no longer fixate on "accuracy" alone; you will be able to choose the evaluation method best suited to each scenario and judge a model's true performance objectively and comprehensively.
Classification is one of the most common machine learning tasks. Evaluating a classifier takes more than accuracy; we need a complete set of metrics to measure performance from every angle.

The confusion matrix is the foundation of classification evaluation: it lays out exactly where the model predicts each class correctly and where it errs.
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | TP (true positive) | FN (false negative) |
| Actually negative | FP (false positive) | TN (true negative) |
Code example:
```python
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print("Confusion matrix:")
print(cm)
```

From the confusion matrix we can derive four core metrics:
| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 score | 2 × P × R / (P + R) |
Code example:
```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1 score: {f1_score(y_true, y_pred):.4f}")
```

Full example:
```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report)

# Load the breast cancer dataset
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42)

# Train the model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compute the metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_pred):.4f}")
print(f"Recall: {recall_score(y_test, y_pred):.4f}")
print(f"F1 score: {f1_score(y_test, y_pred):.4f}")
print("\nConfusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification report:")
print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))
```

The ROC curve (Receiver Operating Characteristic) plots **the model's true positive rate (TPR) against its false positive rate (FPR)** as the classification threshold varies.
AUC (Area Under the Curve) is the area under the ROC curve, with values in [0, 1]:

- AUC = 0.5: no better than random guessing
- The closer AUC is to 1, the better the model ranks positives above negatives
- Intuitively, AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample
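That last interpretation can be checked directly. The sketch below uses toy scores (not the article's dataset) to compute AUC from the pairwise definition and compare it against sklearn's `roc_auc_score`:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: higher score means "more likely positive"
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Pairwise definition: AUC = P(score of a positive > score of a negative),
# with ties counted as 1/2
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
auc_manual = np.mean([1.0 if p > n else 0.5 if p == n else 0.0
                      for p in pos for n in neg])

auc_sklearn = roc_auc_score(y_true, y_score)
print(f"{auc_manual:.4f} {auc_sklearn:.4f}")  # both print 0.8889 (= 8/9)
```

The two values agree exactly, which is why AUC is a ranking metric: it does not depend on any particular threshold, only on how the model orders the samples.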
Code example:
```python
from sklearn.metrics import roc_curve, roc_auc_score, auc
import matplotlib.pyplot as plt

# Get predicted probabilities (probability of the positive class)
y_prob = model.predict_proba(X_test)[:, 1]

# Compute the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2,
         label=f'ROC curve (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--',
         label='random guessing')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False positive rate (FPR)')
plt.ylabel('True positive rate (TPR)')
plt.title('ROC curve')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()

print(f"AUC: {roc_auc:.4f}")
```

Multi-model comparison example:
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC(probability=True),
}

plt.figure(figsize=(10, 8))
for name, model in models.items():
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label=f'{name} (AUC = {roc_auc:.4f})')

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve comparison across models')
plt.legend(loc="lower right")
plt.grid(True, alpha=0.3)
plt.show()
```

When the classes are imbalanced, the ROC curve can be overly optimistic; the Precision-Recall (PR) curve is more reliable.
Code example:
```python
from sklearn.metrics import precision_recall_curve, average_precision_score

y_prob = model.predict_proba(X_test)[:, 1]
precision, recall, _ = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, lw=2, label=f'PR curve (AP = {ap:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Average precision (AP): {ap:.4f}")
```

Regression tasks predict continuous values, so their evaluation metrics are entirely different from those used for classification. This section introduces the most commonly used regression metrics.
All three metrics measure model performance through the prediction error; smaller is better.
| Metric | Formula |
|---|---|
| MSE | (1/n) Σᵢ (yᵢ − ŷᵢ)² |
| RMSE | √MSE |
| MAE | (1/n) Σᵢ \|yᵢ − ŷᵢ\| |
Code example:
```python
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
```

Full example: California housing price prediction (the classic Boston housing dataset has been removed from scikit-learn, so we use the California housing dataset instead):
```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

# Load the California housing dataset
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42)

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Compute the metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
```

R² (R-squared) measures how well the model explains the variance in the data; the closer to 1, the better.
Formula:
R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)²
Code example:
```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print(f"R²: {r2:.4f}")

# Or use the model's score method directly
print(f"R² (via score): {model.score(X_test, y_test):.4f}")
```

How to choose between MSE and MAE? MSE (and RMSE) square each error, so a few large errors dominate the score: use them when large errors are especially costly. MAE weights all errors linearly and is therefore more robust: prefer it when the data contains outliers or when robustness matters.
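This difference is easy to demonstrate. In the sketch below (toy numbers chosen for illustration), a single outlier prediction inflates MSE far more than MAE:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_clean = np.array([2.5, 5.0, 3.0, 7.5])   # every error is at most 0.5

# Same predictions, except one becomes a large outlier error
y_pred_outlier = y_pred_clean.copy()
y_pred_outlier[3] = 17.5                         # error of 10.5 instead of 0.5

mse_clean = mean_squared_error(y_true, y_pred_clean)
mae_clean = mean_absolute_error(y_true, y_pred_clean)
mse_outlier = mean_squared_error(y_true, y_pred_outlier)
mae_outlier = mean_absolute_error(y_true, y_pred_outlier)

# MSE grows quadratically with the single large error; MAE only linearly
print(f"MSE: {mse_clean:.4f} -> {mse_outlier:.4f}")  # 0.1875 -> 27.6875
print(f"MAE: {mae_clean:.4f} -> {mae_outlier:.4f}")  # 0.3750 -> 2.8750
```

One bad prediction multiplies MSE by roughly 148x but MAE by only about 8x, which is exactly why MAE is the robust choice in the presence of outliers.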
With the right metrics chosen, we still need sound validation methods to estimate a model's generalization ability. A single train/test split can produce results that fluctuate considerably; cross-validation gives a more stable and reliable estimate.

The simplest and most widely used method: split the data directly into a training set and a test set.
```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42,
    stratify=iris.target)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(f"Test-set accuracy: {model.score(X_test, y_test):.4f}")
```

Characteristics:

- Simple and fast: the model is trained only once
- The result depends heavily on which samples land in the test set, so it can fluctuate
- Best suited to large datasets, where a single split is already representative
K-fold cross-validation averages over multiple splits, giving more stable and reliable results.
Process:

1. Split the data into K equally sized folds
2. In each round, train on K − 1 folds and validate on the remaining fold
3. Repeat K times, so that every fold serves as the validation set exactly once
4. Average the K scores as the final performance estimate
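The steps above can be sketched as a manual loop (iris data and logistic regression are used here purely for illustration); the `cross_val_score` helper wraps exactly this logic:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # train on K-1 folds
    y_val_pred = model.predict(X[val_idx])       # validate on the held-out fold
    scores.append(accuracy_score(y[val_idx], y_val_pred))

print(f"Per-fold accuracy: {[f'{s:.4f}' for s in scores]}")
print(f"Mean accuracy: {np.mean(scores):.4f}")
```

Writing the loop out once makes it clear why the estimate is more stable: every sample is used for validation exactly once, and the final number averages over five different train/validation splits.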
Code example:
```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, iris.data, iris.target, cv=kf, scoring='accuracy')
print(f"Per-fold accuracy: {[f'{s:.4f}' for s in scores]}")
print(f"Mean accuracy: {scores.mean():.4f} (±{scores.std()*2:.4f})")
```

Simplified version using `cross_val_score` directly:
```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, iris.data, iris.target, cv=5, scoring='accuracy')
print(f"5-fold mean accuracy: {scores.mean():.4f}")
```

Different scenarios call for different cross-validation strategies:
1. Stratified K-Fold (StratifiedKFold): each fold preserves the class proportions of the full dataset, making it the default choice for classification.
```python
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, iris.data, iris.target, cv=skf)
print(f"Stratified 5-fold mean accuracy: {scores.mean():.4f}")
```

2. Leave-One-Out (LeaveOneOut): each sample takes a turn as the entire validation set; exhaustive but computationally expensive.
```python
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
scores = cross_val_score(model, iris.data, iris.target, cv=loo)
print(f"Leave-one-out mean accuracy: {scores.mean():.4f}")
```

3. Group K-Fold (GroupKFold): samples from the same group (for example, the same patient or user) never appear in both the training and validation folds.
```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Example: suppose we have group information (placeholder X and y are
# defined here so the snippet is self-contained)
X = np.arange(20).reshape(10, 2)          # placeholder features
y = np.array([0, 1] * 5)                  # placeholder labels
groups = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]   # group labels

gkf = GroupKFold(n_splits=5)
for train_idx, test_idx in gkf.split(X, y, groups=groups):
    # train and validate here
    pass
```

Advice on choosing K:

- K = 5 or K = 10 is the standard choice in most situations
- Larger K means each training set is bigger, but the computational cost grows accordingly
- For very small datasets, consider leave-one-out
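To make that trade-off concrete, the rough sketch below (iris data and logistic regression chosen arbitrarily) runs the same model with different values of K; note that the number of model fits grows with K:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Larger K: each fold trains on more data, but requires more training runs
results = {}
for k in (2, 5, 10):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=kf)
    results[k] = (scores.mean(), scores.std())
    print(f"K={k:2d}: mean accuracy={scores.mean():.4f} "
          f"(std={scores.std():.4f}, {k} model fits)")
```

The exact numbers depend on the dataset and model, so treat this as a way to inspect the stability of your own estimates rather than a universal rule.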
Unsupervised learning has no ground-truth labels, which makes evaluation harder. We consider two situations: with and without ground-truth labels.

If the true cluster assignments are known, the following metrics can be used:

1. Adjusted Rand Index (ARI): measures the agreement between the predicted and true assignments, corrected for chance; 1 means a perfect match, and values near 0 mean no better than random.
```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 2, 2]
ari = adjusted_rand_score(y_true, y_pred)
print(f"ARI: {ari:.4f}")
```

2. Normalized Mutual Information (NMI)
```python
from sklearn.metrics import normalized_mutual_info_score

nmi = normalized_mutual_info_score(y_true, y_pred)
print(f"NMI: {nmi:.4f}")
```

Full example:
```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y_true = iris.target

# K-means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)

print(f"ARI: {adjusted_rand_score(y_true, y_pred):.4f}")
print(f"NMI: {normalized_mutual_info_score(y_true, y_pred):.4f}")
```

In most cases clustering has no ground-truth labels, so internal metrics are needed:
1. Silhouette Coefficient: for each sample, compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b), giving s = (b − a) / max(a, b); it ranges over [−1, 1], and higher is better.
```python
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data

kmeans = KMeans(n_clusters=3, random_state=42)
y_pred = kmeans.fit_predict(X)

silhouette = silhouette_score(X, y_pred)
print(f"Silhouette coefficient: {silhouette:.4f}")
```

2. Davies-Bouldin Index: the average similarity of each cluster to its most similar cluster; lower is better.
```python
from sklearn.metrics import davies_bouldin_score

db = davies_bouldin_score(X, y_pred)
print(f"Davies-Bouldin index: {db:.4f}")
```

3. Calinski-Harabasz Index: the ratio of between-cluster to within-cluster dispersion; higher is better.
```python
from sklearn.metrics import calinski_harabasz_score

ch = calinski_harabasz_score(X, y_pred)
print(f"Calinski-Harabasz index: {ch:.4f}")
```

A tip for choosing K: try different values of K (say, 2 to 10), compute the silhouette coefficient for each, and pick the K with the highest score.
```python
import matplotlib.pyplot as plt

k_range = range(2, 11)
sil_scores = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    y_pred = kmeans.fit_predict(X)
    sil_scores.append(silhouette_score(X, y_pred))

plt.figure(figsize=(8, 4))
plt.plot(k_range, sil_scores, 'o-')
plt.xlabel('Number of clusters K')
plt.ylabel('Silhouette coefficient')
plt.title('Choosing the best K by silhouette coefficient')
plt.grid(True, alpha=0.3)
plt.show()

best_k = k_range[sil_scores.index(max(sil_scores))]
print(f"Best K: {best_k}")
```

Faced with so many metrics and methods, beginners often wonder: which one should I actually use? This section gives you a clear selection guide.
| Scenario | Recommended metric |
|---|---|
| Balanced classes | Accuracy |
| Imbalanced classes | Precision / Recall / F1, PR curve |
| Missed positives are costly | Recall |
| False alarms are costly | Precision |
| Need to balance P & R | F1 score |
| Comparing models | ROC curve + AUC |
One-sentence principle: choose based on the business cost of each kind of error; whichever mistake is least acceptable, prioritize optimizing the corresponding metric.
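One practical lever this principle implies: most classifiers expose probabilities, and `predict()` simply thresholds them at 0.5. Moving that threshold trades precision against recall. The sketch below (breast cancer data and logistic regression, chosen as an example) shows the effect:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_breast_cancer(return_X_y=True), test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=10000).fit(X_train, y_train)
y_prob = model.predict_proba(X_test)[:, 1]

# Lowering the threshold predicts "positive" more often, which helps recall;
# raising it makes positive predictions more conservative, which helps precision
results = {}
for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    results[threshold] = (precision_score(y_test, y_pred),
                          recall_score(y_test, y_pred))
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

If missing a positive case is the unacceptable error, pick the threshold from the low end; if a false alarm is, pick from the high end. Note that recall can only stay the same or drop as the threshold rises, while precision usually (though not strictly always) improves.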
| Scenario | Recommended metric |
|---|---|
| Outliers present, robustness needed | MAE |
| Large errors must be penalized | MSE / RMSE |
| Intuitive interpretation needed | RMSE or MAE (same units as the target) |
| Comparing models | R² |
| Scenario | Recommended metric |
|---|---|
| Ground-truth labels available | ARI, NMI |
| No ground-truth labels | Silhouette coefficient, DB index, CH index |
| Choosing the number of clusters K | Silhouette coefficient |
| Scenario | Recommended validation method |
|---|---|
| Large dataset | Hold-out split |
| Medium dataset | K-fold cross-validation (K = 5 or 10) |
| Very small dataset | Leave-one-out |
| Classification tasks | Stratified K-fold |
| Grouped data | GroupKFold |
Now let's tie everything together in one complete case study. Using the breast cancer dataset, we will go from data splitting through model training to multi-dimensional evaluation, covering the full workflow.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report,
    roc_curve, auc, precision_recall_curve, average_precision_score
)

# 1. Load the data and split it
data = load_breast_cancer()
X = data.data
y = data.target
print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)}")

# Stratified split to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 2. Train several models for comparison
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
}

# 3. Evaluate with 5-fold stratified cross-validation
print("\n=== 5-fold stratified cross-validation ===")
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=skf, scoring='accuracy')
    cv_results[name] = scores
    print(f"{name}: mean accuracy = {scores.mean():.4f} (±{scores.std()*2:.4f})")

# 4. Detailed evaluation on the test set
print("\n=== Detailed test-set evaluation ===")
plt.figure(figsize=(12, 5))
for name, model in models.items():
    # Train the model
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    y_prob = model.predict_proba(X_test_scaled)[:, 1]

    # Compute the metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    print(f"\n[{name}]")
    print(f"Accuracy: {acc:.4f}")
    print(f"Precision: {prec:.4f}")
    print(f"Recall: {rec:.4f}")
    print(f"F1 score: {f1:.4f}")
    print("\nConfusion matrix:")
    print(confusion_matrix(y_test, y_pred))
    print("\nClassification report:")
    print(classification_report(y_test, y_pred, target_names=['malignant', 'benign']))

    # Plot the ROC curve
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    roc_auc = auc(fpr, tpr)
    plt.subplot(1, 2, 1)
    plt.plot(fpr, tpr, lw=2, label=f'{name} (AUC = {roc_auc:.4f})')

    # Plot the PR curve
    precision, recall, _ = precision_recall_curve(y_test, y_prob)
    ap = average_precision_score(y_test, y_prob)
    plt.subplot(1, 2, 2)
    plt.plot(recall, precision, lw=2, label=f'{name} (AP = {ap:.4f})')

# Finish the plots
plt.subplot(1, 2, 1)
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve comparison')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve comparison')
plt.legend(loc='lower left')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
```

How to read the output:

- Compare each model's cross-validation mean with its test-set metrics; a large gap suggests overfitting or an unlucky split
- For this task, pay particular attention to recall on the malignant class in the classification report, since a missed cancer is far more costly than a false alarm
- The ROC and PR curves show whether one model dominates across all thresholds or only in some regions
This end-to-end workflow demonstrates how to evaluate a model scientifically from multiple angles, instead of looking at a single accuracy number.
Reading without practicing gets you nowhere. Put what you've learned in this article into practice with the following tasks:

Using the wine dataset `load_wine()`:

- Train a classifier and report accuracy, precision, recall, and F1 (the dataset has three classes, so choose an averaging strategy such as `average='macro'`)
- Print the confusion matrix and evaluate with 5-fold stratified cross-validation

Using the California housing dataset `fetch_california_housing()`:

- Train a regression model and report MSE, RMSE, MAE, and R²
- Compare at least two different models using cross-validation

Using the iris dataset `load_iris()`:

- Run K-means clustering and evaluate it with ARI, NMI, and the silhouette coefficient
- Use the silhouette coefficient to search for the best K between 2 and 10
At this point we have systematically covered the complete evaluation framework for the three major categories of machine learning tasks:

For classification: the confusion matrix, accuracy, precision, recall, and F1, plus the ROC curve with AUC and the PR curve with AP. For regression: MSE, RMSE, and MAE through to the R² coefficient of determination. For validation: the hold-out split, K-fold cross-validation, stratified K-fold, leave-one-out, and group K-fold. For clustering: the external metrics ARI and NMI and the internal metrics silhouette coefficient, DB index, and CH index. Finally, a selection guide for choosing the most suitable evaluation method in each scenario.

Remember: evaluating correctly matters more than tuning blindly. With the wrong metric you may believe your model is excellent, only to watch it fall apart in production. Master the methods in this article and you will be able to evaluate your models scientifically, objectively, and comprehensively, laying a solid foundation for further improvement.

One last reminder: machine learning is a practical craft. Please run the code in this article and complete the exercises; it will teach you more than ten re-readings ever could.
| Category | Key points |
|---|---|
| Classification - basics | Confusion matrix, accuracy, precision, recall, F1 |
| Classification - advanced | ROC curve + AUC, PR curve + AP |
| Regression | MSE, RMSE, MAE, R² |
| Validation methods | Hold-out, K-fold, stratified K-fold, leave-one-out, GroupKFold |
| Clustering - external | ARI, NMI |
| Clustering - internal | Silhouette coefficient, Davies-Bouldin index, Calinski-Harabasz index |

Commonly used sklearn imports, collected in one place:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import roc_curve, roc_auc_score, precision_recall_curve, average_precision_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score, davies_bouldin_score, calinski_harabasz_score
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut, GroupKFold
```