Preface
We've finished feature engineering and know how to prepare data; the next step is training models. But a model comes with many hyperparameters: regularization strength, number of trees, tree depth, and so on. These are not learned by the model; we have to set them by hand, and different choices can make performance differ dramatically.
So before diving into specific algorithms, this post is devoted to hyperparameter tuning: what are the common tuning methods, what are the pros and cons of each, and which scenario calls for which method? By the end, you'll know how to find good hyperparameters scientifically and efficiently.
At the end of the previous post, *Feature Engineering: Introduction and Practice*, we left two hands-on assignments. Let's go through the answers together.
Assignment 1: California housing regression.

```python
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# 1. Load the data and check for missing values
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
print(f"Missing-value check:\n{X.isnull().sum()}")
# The California housing dataset has no missing values, so we move on

# 2. Log-transform the income feature MedInc
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(X['MedInc'], bins=30)
plt.title('Original distribution')
X['MedInc_log'] = np.log1p(X['MedInc'])
plt.subplot(1, 2, 2)
plt.hist(X['MedInc_log'], bins=30)
plt.title('After log transform')
plt.show()
# Replace the original feature
X = X.drop(['MedInc'], axis=1)

# 3. Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 4. Select the top 5 features with an F-test
selector = SelectKBest(f_regression, k=5)
X_selected = selector.fit_transform(X_scaled, y)
selected_names = X.columns[selector.get_support()]
print(f"\nTop 5 selected features: {list(selected_names)}")
print(f"p-values: {selector.pvalues_.round(4)}")

# 5. Train/test split, fit the model, compute R²
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"\nR² = {r2:.4f}")
```

As you can see, the log transform makes the income distribution more symmetric, the F-test picks out the five most informative features, and the final linear model reaches a respectable R².
Assignment 2: Titanic survival prediction with a full preprocessing pipeline.

```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.feature_selection import SelectFromModel
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the data
data = sns.load_dataset('titanic')
X = data.drop(['survived'], axis=1)
y = data['survived']

# 1. Handle missing values:
#    age -> median, embarked -> most frequent (done inside the pipelines below)
numeric_features = ['age', 'fare']
categorical_features = ['sex', 'pclass', 'embarked']

# 2. Polynomial interaction features for fare and pclass
#    (defined here; you could add it as an extra step in the numeric pipeline
#     to actually include the interaction term)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)

# 3. Preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)])

# 4. Full pipeline: preprocessing -> feature selection -> model
full_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('feature_selection', SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold='mean')),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))])

# 5. Train and evaluate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
full_pipeline.fit(X_train, y_train)
y_pred = full_pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {acc:.4f}")

# How many features did the selector keep?
X_preprocessed = preprocessor.fit_transform(X_train)
print(f"\nTotal features after preprocessing: {X_preprocessed.shape[1]}")
selector = full_pipeline.named_steps['feature_selection']
print(f"Features kept after selection: {selector.get_support().sum()}")
```

Run this code and you should land at roughly 81-83% accuracy, which is a solid result on the Titanic dataset.
First, let's clear up a basic distinction:
| Concept | Who sets it | Examples |
| --- | --- | --- |
| Model parameters | Learned automatically from the data during training | Linear-regression weights, neural-network weights |
| Hyperparameters | Chosen by us before training starts | Regularization strength, number of trees, tree depth |
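The distinction is easy to see in code. Here is a minimal sketch of our own (not from the assignments above) using scikit-learn's `LogisticRegression`: `C` is a hyperparameter we supply up front, while the weights in `coef_` only exist after `fit()` has learned them from the data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(C=1.0, max_iter=1000)  # C is a hyperparameter: set by hand before training
clf.fit(X, y)                                   # fit() learns the model parameters from the data

print(clf.C)            # hyperparameter: still exactly the value we chose
print(clf.coef_.shape)  # model parameters: one learned weight per class per feature -> (3, 4)
```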
The same model with different hyperparameters can perform very differently. For example, an SVM with a badly chosen regularization strength `C` can underfit or overfit the very same data that a well-chosen `C` fits nicely.
So we need to find the combination of hyperparameters that gives the best performance on a validation set. That process is hyperparameter tuning.
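Before reaching for any library, the idea can be sketched as a plain loop (a toy illustration of our own; the candidate list for `C` is an arbitrary assumption): train one model per candidate value and keep the one that scores best on a held-out validation set.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Hold out a validation set to score each candidate on
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

best_C, best_score = None, -1.0
for C in [0.01, 0.1, 1, 10, 100]:  # hand-picked candidate values
    score = SVC(C=C).fit(X_train, y_train).score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

print(f"best C: {best_C}, validation accuracy: {best_score:.4f}")
```

Every method in this post is a smarter way of deciding which candidates to feed into a loop like this.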
Grid search is brute force: you provide a list of candidate values for each hyperparameter, and the algorithm exhaustively tries every possible combination, keeping the one that performs best on the validation data.
Say you're tuning two hyperparameters:
- `C`: 5 candidate values, e.g. [0.01, 0.1, 1, 10, 100]
- `gamma`: 4 candidate values, e.g. [0.001, 0.01, 0.1, 1]
Grid search tries all 5 × 4 = 20 combinations and picks the best one.
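The combination count is literally a Cartesian product. A quick sketch with the standard library (candidate lists match the example above):

```python
from itertools import product

C_values = [0.01, 0.1, 1, 10, 100]    # 5 candidates
gamma_values = [0.001, 0.01, 0.1, 1]  # 4 candidates

# Every (C, gamma) pair grid search would have to evaluate
combos = list(product(C_values, gamma_values))
print(len(combos))  # 20
```

This is also why grid search blows up quickly: add a third hyperparameter with 5 candidates and you're already at 100 combinations.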
```python
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

# Load the data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target)

# Define the parameter grid
param_grid = [
    {'C': [0.01, 0.1, 1, 10, 100],
     'gamma': [0.001, 0.01, 0.1, 1],
     'kernel': ['rbf']}
]

# Set up and run the grid search
grid = GridSearchCV(SVC(), param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Best cross-validation accuracy: {grid.best_score_:.4f}")

# Evaluate on the test set
y_pred = grid.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Inspect all results
results = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'std_test_score']]
print(results)
```

When to use it: few hyperparameters (1-3), each with a short candidate list → grid search is the right tool.
Random search drops the grid entirely: it samples points at random from the parameter space, tries N of them, and keeps the best.
Why is random search better than grid search? In practice, only a few hyperparameters really matter. A grid with 5 candidates per axis only ever tries 5 distinct values of each parameter no matter how many trials it burns, while random sampling draws a fresh value of every parameter on every trial, so the important dimensions get explored far more finely for the same budget. On top of that, the number of trials is directly controllable, and you can sample from continuous distributions instead of hand-picking candidate lists.
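A tiny simulation makes the coverage argument concrete (entirely our own toy setup: a fake objective where only the first "hyperparameter" matters, with its optimum at 0.37). A 3×3 grid spends 9 trials but only ever tries 3 distinct values along the important axis; 9 random points try 9 distinct values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy objective: only x matters; the best possible x is 0.37
def score(x, y):
    return -(x - 0.37) ** 2

# 3x3 grid: 9 trials, but only 3 distinct x values (0, 0.5, 1)
grid_pts = [(x, y) for x in (0.0, 0.5, 1.0) for y in (0.0, 0.5, 1.0)]
best_grid = max(score(x, y) for x, y in grid_pts)

# 9 random trials: 9 distinct x values drawn from the whole range
rand_pts = rng.uniform(0, 1, size=(9, 2))
best_rand = max(score(x, y) for x, y in rand_pts)

print(f"grid best: {best_grid:.4f}, random best: {best_rand:.4f}")
```

With most seeds the random points land closer to the optimum than any grid column can, which is exactly the classic argument for random search over grids in high dimensions.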
```python
from sklearn import datasets
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from scipy.stats import loguniform
import pandas as pd

# Load the data
iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target)

# Define parameter distributions: C and gamma are log-uniform
param_dist = {
    'C': loguniform(1e-2, 1e2),      # 0.01 ~ 100, log-uniform
    'gamma': loguniform(1e-3, 1e0),  # 0.001 ~ 1, log-uniform
    'kernel': ['rbf']
}

# Set up and run the random search
random_search = RandomizedSearchCV(
    SVC(), param_distributions=param_dist, n_iter=20, cv=5,
    random_state=42, scoring='accuracy')
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation accuracy: {random_search.best_score_:.4f}")

y_pred = random_search.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")

# Inspect the top results
results = pd.DataFrame(random_search.cv_results_)[['params', 'mean_test_score']]
print(results.sort_values('mean_test_score', ascending=False).head())
```

Rule of thumb: for hyperparameters that span several orders of magnitude (like `C` and `gamma` here), sample them log-uniformly rather than uniformly, and give random search at least a few dozen iterations.
Grid search and random search are both blind trial and error: once a trial is done, nothing learned from it informs the next one.
Bayesian optimization is different. It builds a probabilistic model of the relationship between hyperparameters and validation performance, then uses that model to predict which hyperparameters look promising, homing in on the optimum step by step.
In short: it "learns" as it goes, getting smarter with every trial, so it typically finds good hyperparameters much faster than blind search.
We'll demonstrate with the `hyperopt` library:
```python
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
import numpy as np

# Load the data
data = load_breast_cancer()
X, y = data.data, data.target

# 1. Define the search space
space = {
    'C': hp.loguniform('C', np.log(1e-2), np.log(1e2)),         # 0.01 ~ 100, log-uniform
    'gamma': hp.loguniform('gamma', np.log(1e-3), np.log(1e0)), # 0.001 ~ 1, log-uniform
}

# 2. Define the objective function (to be minimized)
def objective(params):
    svm = SVC(**params, random_state=42)
    score = cross_val_score(svm, X, y, cv=5, scoring='accuracy').mean()
    # hyperopt minimizes, so return the negative accuracy
    return {'loss': -score, 'status': STATUS_OK}

# 3. Run Bayesian optimization
trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=20,
    trials=trials,
    rstate=np.random.default_rng(42))  # older hyperopt versions expect np.random.RandomState(42)

print(f"Best parameters: {best}")

# Inspect the trial history
for trial in trials.trials:
    print(f"accuracy: {-trial['result']['loss']:.4f}, params: {trial['misc']['vals']}")
```

Bayesian optimization usually needs fewer trials than random search to find a result at least as good.
My advice:
- Few hyperparameters (1-3) with short candidate lists → grid search.
- More hyperparameters, or a limited trial budget → random search.
- Expensive models (each training run takes a long time) or a large search space → Bayesian optimization.
Now let's put grid search and random search head to head on the breast cancer dataset (we already ran Bayesian optimization on it above):
```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.stats import loguniform

# Load the data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target)

# 1. Grid search
print("=== Grid search ===")
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l2']
}
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_grid, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
print(f"Best parameters: {grid.best_params_}")
print(f"Cross-validation accuracy: {grid.best_score_:.4f}")
y_pred = grid.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Combinations tried: {len(grid.cv_results_['params'])}\n")

# 2. Random search
print("=== Random search ===")
param_dist = {
    'C': loguniform(1e-3, 1e2),
    'penalty': ['l2']
}
random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000, random_state=42),
    param_distributions=param_dist, n_iter=20, cv=5,
    random_state=42, scoring='accuracy')
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Cross-validation accuracy: {random_search.best_score_:.4f}")
y_pred = random_search.predict(X_test)
print(f"Test accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Combinations tried: {len(random_search.cv_results_['params'])}\n")

# Summary table
results = pd.DataFrame({
    'Method': ['Grid search', 'Random search'],
    'Combinations tried': [len(grid.cv_results_['params']),
                           len(random_search.cv_results_['params'])],
    'Best CV accuracy': [f"{grid.best_score_:.4f}", f"{random_search.best_score_:.4f}"],
    'Test accuracy': [f"{accuracy_score(y_test, grid.predict(X_test)):.4f}",
                      f"{accuracy_score(y_test, random_search.predict(X_test)):.4f}"]
})
print("=== Summary ===")
print(results.to_string(index=False))
```

Run this and you'll see that random search can match or even beat grid search with fewer trials.
Now it's your turn. Complete the following exercise:
Tune an SVM on the Iris dataset: using the three methods from this post, tune `C` and `gamma`, then compare the best parameters found, the cross-validation accuracy, and the number of trials each method needed.
Hyperparameter tuning is an indispensable step in the machine learning pipeline: the same model with well-tuned hyperparameters versus a blind guess can perform very differently.
We covered the three most common tuning methods:
- Grid search: exhaustively tries every combination; simple and thorough, but the number of combinations explodes quickly.
- Random search: samples the parameter space at random; the budget is controllable and it scales much better with the number of hyperparameters.
- Bayesian optimization: models the parameter-performance relationship and learns from past trials; the most sample-efficient of the three.
Remember the selection mantra: few parameters → grid search; more parameters or a tight budget → random search; expensive training runs → Bayesian optimization.
Master hyperparameter tuning and you can squeeze the best performance out of your models.
> Reminder: run the code in this post and complete the assignment — real understanding comes from practice.
| Tool | Import |
| --- | --- |
| GridSearchCV | `from sklearn.model_selection import GridSearchCV` |
| RandomizedSearchCV | `from sklearn.model_selection import RandomizedSearchCV` |
| hyperopt | `from hyperopt import hp, fmin, tpe` |
| loguniform | `from scipy.stats import loguniform` |
Key takeaway: tune on cross-validation, report on a held-out test set, and match the method to the size of your search space and your compute budget.