Preface
"Data and features determine the upper bound of machine learning; models and algorithms merely approximate that bound." This saying circulates widely in the machine-learning community, and it captures how important feature engineering is.
Even the most sophisticated deep-learning model can produce a mess if the feature engineering is poor; conversely, with excellent feature engineering, even a simple linear model can deliver impressive results.
This post follows the previous one, Model Evaluation Methods, and walks you through the core of feature engineering: starting from basic data preprocessing (missing values, encoding, scaling), moving on to feature transformation (log transforms, polynomial features, binning), and then feature selection (filter, embedded, and wrapper methods), step by step from raw data to quality features. Every step comes with runnable Python code you can use right away.
At the end of the previous post, Model Evaluation Methods, we left three practice exercises. Let's go through the answers together.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, auc)
import matplotlib.pyplot as plt
import numpy as np
# Load the data
wine = load_wine()
X = wine.data
y = wine.target
# 1. Stratified train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardize (fit on the training set only, to avoid leakage)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# 2. Train a logistic regression model
model = LogisticRegression(max_iter=1000, random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
# 3. Compute the metrics
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"Precision (macro): {precision_score(y_test, y_pred, average='macro'):.4f}")
print(f"Recall (macro): {recall_score(y_test, y_pred, average='macro'):.4f}")
print(f"F1 score (macro): {f1_score(y_test, y_pred, average='macro'):.4f}")
print("\n4. Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
print("\n5. Classification report:")
print(classification_report(y_test, y_pred, target_names=wine.target_names))
# 6. Plot ROC curves (one-vs-rest)
plt.figure(figsize=(8, 6))
for i in range(3):
    y_true_bin = (y_test == i)
    y_score = model.predict_proba(X_test_scaled)[:, i]
    fpr, tpr, _ = roc_curve(y_true_bin, y_score)
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, lw=2, label=f'Class {i} (AUC = {roc_auc:.4f})')
plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('Multi-class ROC curves (one-vs-rest)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np
# Load the data
housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(
    housing.data, housing.target, test_size=0.2, random_state=42
)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Compute the metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"R²: {r2:.4f}")
Output:
MSE: 0.5559
RMSE: 0.7456
MAE: 0.5178
R²: 0.5758
Interpreting R²: at roughly 0.58, the linear model explains about 58% of the variance in house prices, leaving plenty of room for improvement. That matches intuition: house prices depend on many nonlinear factors, so a purely linear model fits only moderately well.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score,
                             adjusted_rand_score, normalized_mutual_info_score)
import matplotlib.pyplot as plt
iris = load_iris()
X = iris.data
y_true = iris.target
k_range = [2, 3, 4, 5]
sil_scores = []
db_scores = []
ch_scores = []
for k in k_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    y_pred = kmeans.fit_predict(X)
    sil = silhouette_score(X, y_pred)
    db = davies_bouldin_score(X, y_pred)
    ch = calinski_harabasz_score(X, y_pred)
    sil_scores.append(sil)
    db_scores.append(db)
    ch_scores.append(ch)
    print(f"K={k}: silhouette={sil:.4f}, DB={db:.4f}, CH={ch:.1f}")
# Plot the silhouette curve
plt.figure(figsize=(8, 4))
plt.plot(k_range, sil_scores, 'o-', linewidth=2)
plt.xlabel('Number of clusters K')
plt.ylabel('Silhouette score')
plt.title('Choosing K with the silhouette score')
plt.grid(True, alpha=0.3)
plt.show()
# Pick the best K
best_k = k_range[sil_scores.index(max(sil_scores))]
print(f"\nBest K by silhouette score: {best_k}")
# Evaluate K=3 against the true labels
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
y_pred = kmeans.fit_predict(X)
ari = adjusted_rand_score(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
print(f"K=3: ARI = {ari:.4f}, NMI = {nmi:.4f}")
Interpreting the results: the silhouette score actually peaks at K=2, even though iris has three true species. Internal metrics measure geometric separation, not ground truth. At K=3, ARI and NMI both land around 0.7, showing substantial but imperfect agreement with the true labels.
Raw data is rarely ready for modeling. Data preprocessing is the first, and most critical, step of feature engineering. Garbage in, garbage out: if preprocessing is sloppy, no model downstream can save you.
Real-world data always has missing values. Common strategies:
1. Drop missing values - use when the missing proportion is very large
import pandas as pd
import numpy as np
# Create example data
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, np.nan, 8],
    'C': [10, 20, 30, 40]
})
# Drop rows that contain missing values
df_dropped_rows = df.dropna()
print("After dropping rows:")
print(df_dropped_rows)
# Drop columns where every value is missing
df_dropped_cols = df.dropna(axis=1, how='all')
2. Fill (impute) missing values - more common, since it keeps the data
# Fill with the mean
df['A_fill_mean'] = df['A'].fillna(df['A'].mean())
# Fill with the median (more robust to outliers)
df['A_fill_median'] = df['A'].fillna(df['A'].median())
# Fill with the mode (for categorical features)
df['B_fill_mode'] = df['B'].fillna(df['B'].mode()[0])
# Forward/backward fill (common for time series)
df['A_ffill'] = df['A'].ffill()
df['A_bfill'] = df['A'].bfill()
Using sklearn's SimpleImputer:
from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([[1, 2], [np.nan, 3], [7, 6]])
imputer = SimpleImputer(strategy='mean') # mean, median, most_frequent
X_imputed = imputer.fit_transform(X)
print(X_imputed)
Rules of thumb: use the median for skewed numeric features (robust to outliers), the mode for categorical features, and forward/backward fill for time series; drop rows or columns only when the missing proportion is very large.
Most machine-learning models can only handle numbers, so categorical features must be encoded.
1. One-Hot Encoding
import pandas as pd
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
df_onehot = pd.get_dummies(df['color'], prefix='color')
print(df_onehot)
With sklearn:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
encoder = OneHotEncoder(sparse_output=False)
X = np.array([['red'], ['green'], ['blue'], ['red']])
encoded = encoder.fit_transform(X)
print(encoded)
2. Label Encoding (note: sklearn intends LabelEncoder for target labels; for input features, prefer OrdinalEncoder)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
labels = ['red', 'green', 'blue', 'red']
encoded = encoder.fit_transform(labels)
print(encoded)  # [2 1 0 2] — classes are sorted: blue=0, green=1, red=2
3. Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
# Education level has a natural order
categories = [['primary school', 'secondary school', 'university', 'graduate school']]
encoder = OrdinalEncoder(categories=categories)
X = np.array([['university'], ['primary school'], ['graduate school']])
encoded = encoder.fit_transform(X)
print(encoded)  # [[2.], [0.], [3.]]
Many algorithms (distance-based KNN, anything trained by gradient descent, SVMs, neural networks) are sensitive to feature scale and need scaling.
1. Standardization
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
2. Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
When is scaling needed? Scale-sensitive models (KNN, SVM, linear models, neural networks, PCA) need it; tree-based models (decision trees, random forests, gradient boosting) do not.
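To make the scale sensitivity concrete, here is a minimal sketch comparing KNN with and without standardization. The dataset (wine, whose features have very different scales) and k=5 are arbitrary choices for illustration, not from the original exercises:

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# KNN on raw features: distances are dominated by large-scale features
knn_raw = KNeighborsClassifier(n_neighbors=5)
knn_raw.fit(X_train, y_train)
acc_raw = accuracy_score(y_test, knn_raw.predict(X_test))

# The same model after standardization
scaler = StandardScaler()
knn_scaled = KNeighborsClassifier(n_neighbors=5)
knn_scaled.fit(scaler.fit_transform(X_train), y_train)
acc_scaled = accuracy_score(y_test, knn_scaled.predict(scaler.transform(X_test)))

print(f"raw: {acc_raw:.4f}, scaled: {acc_scaled:.4f}")
```

On this dataset the scaled version wins clearly, because without scaling one wide-range feature (proline) dominates the distance computation.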
Feature transformation applies mathematical transformations to raw features to create new features that are easier for the model to learn from.
A log transform makes right-skewed distributions more symmetric, compresses the value range, and makes outliers less extreme. It is commonly used for positively skewed data such as income and prices.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Example: income data is typically right-skewed
income = np.array([1000, 2000, 3000, 5000, 10000, 20000, 50000, 100000])
# Log transform
income_log = np.log1p(income)  # log1p = log(1+x), safer when values can be 0
print("Original:", income)
print("After log transform:", income_log.round(2))
# Visual comparison
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(income, bins=10)
plt.title('Original distribution')
plt.subplot(1, 2, 2)
plt.hist(income_log, bins=10)
plt.title('After log transform')
plt.show()
When to use a log transform? When a feature is strictly positive and heavily right-skewed — long-tailed quantities such as income, price, or view counts are typical candidates.
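A quick way to check whether a log transform helps is to measure skewness before and after. A minimal sketch — the simulated log-normal "income" data below is an assumption for illustration only:

```python
import numpy as np
from scipy.stats import skew

# Simulated right-skewed income data (log-normal); parameters are arbitrary
rng = np.random.default_rng(42)
income = rng.lognormal(mean=9, sigma=1, size=1000)

# Skewness near 0 means roughly symmetric; large positive means a long right tail
print(f"skewness before: {skew(income):.2f}")
print(f"skewness after:  {skew(np.log1p(income)):.2f}")
```

The transformed data lands close to zero skewness, which is exactly what a log-normal variable should do under a log transform.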
Combining raw features into polynomial terms captures interactions between features and increases the model's expressive power.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
X = np.array([[1, 2]])
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(f"Original features: {X}")
print(f"Degree-2 polynomial features: {X_poly}")
# Output columns: [1, a, b, a^2, ab, b^2]
Full example:
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
import matplotlib.pyplot as plt
# Generate nonlinear data
np.random.seed(42)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 2*X + 0.5*X**2 + np.random.randn(100, 1)*2
# Fit on the raw feature (a straight line)
model1 = LinearRegression()
model1.fit(X, y)
y_pred1 = model1.predict(X)
# Add degree-2 polynomial features
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model2 = LinearRegression()
model2.fit(X_poly, y)
y_pred2 = model2.predict(X_poly)
# Compare
plt.figure(figsize=(8, 5))
plt.scatter(X, y, alpha=0.6, label='Raw data')
plt.plot(X, y_pred1, 'r-', linewidth=2, label='Linear model')
plt.plot(X, y_pred2, 'g-', linewidth=2, label='Degree-2 polynomial')
plt.legend()
plt.xlabel('X')
plt.ylabel('y')
plt.title('What polynomial features buy you')
plt.grid(True, alpha=0.3)
plt.show()
Caution: polynomial features multiply the feature count quickly — degree=3 is already enough to make it explode. Watch out for overfitting.
Discretizing (binning) a continuous feature introduces nonlinearity, increases expressive power, and is more robust to outliers.
1. Equal-width binning
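You can see the blow-up directly by counting output features for a hypothetical 10-feature input (the feature count here is an arbitrary choice for illustration):

```python
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# A dummy input with 10 features; only the shape matters for the count
X = np.zeros((1, 10))
for degree in [1, 2, 3, 4]:
    poly = PolynomialFeatures(degree=degree)
    poly.fit(X)
    # Count includes the bias column, all powers, and all cross terms
    print(f"degree={degree}: {poly.n_output_features_} features")
```

With n original features and degree d, the count is C(n+d, d): for 10 features that is 66 at degree 2 but already 1001 at degree 4.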
import pandas as pd
import numpy as np
age = np.array([5, 18, 25, 30, 45, 52, 60, 75])
age_bins = pd.cut(age, bins=3, labels=['young', 'middle-aged', 'senior'])
print(pd.DataFrame({'age': age, 'group': age_bins}))
2. Equal-frequency binning
age_qbins = pd.qcut(age, q=3, labels=['low tercile', 'mid tercile', 'high tercile'])
print(pd.DataFrame({'age': age, 'group': age_qbins}))
3. Custom intervals
bins = [0, 18, 35, 60, 100]
labels = ['minor', 'young adult', 'middle-aged', 'senior']
age_custom = pd.cut(age, bins=bins, labels=labels)
print(pd.DataFrame({'age': age, 'group': age_custom}))
Pros and cons of binning: it gives linear models nonlinear ability and tames outliers, but it throws away within-bin information, and the bin boundaries are somewhat arbitrary.
Not every feature is useful. Too many irrelevant features bring the curse of dimensionality, encourage overfitting, and add compute cost. Feature selection finds the most useful subset.
Feature selection methods fall into three families: filter, wrapper, and embedded.
Variance threshold: drop features whose variance is tiny — a near-constant feature carries almost no information.
from sklearn.feature_selection import VarianceThreshold
X = [[0, 0, 1], [0, 1, 1], [0, 0, 1], [0, 1, 1]]
# Drop features whose variance is below the threshold (the default drops only constant features)
selector = VarianceThreshold(threshold=0.1)
X_selected = selector.fit_transform(X)
print(X_selected)
1. Pearson correlation / F-test - measures linear association
import numpy as np
from sklearn.feature_selection import f_regression
# Regression problem: score features with the F statistic
X = np.random.randn(100, 5)
y = X[:, 0] + X[:, 1] + 0.5 * np.random.randn(100)
f_values, p_values = f_regression(X, y)
for i, (f, p) in enumerate(zip(f_values, p_values)):
    print(f"Feature {i}: F={f:.2f}, p={p:.4f}")
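For a single feature, the F statistic from f_regression is a monotone function of the squared Pearson correlation, so you can also inspect the correlations directly. A self-contained sketch (same kind of simulated data as above, regenerated here with an arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
# Only features 0 and 1 actually drive the target
y = X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(100)

# Pearson correlation of each feature with the target
for i in range(X.shape[1]):
    r = np.corrcoef(X[:, i], y)[0, 1]
    print(f"feature {i}: r={r:.3f}")
```

The two informative features show |r| well above the noise features, which hover near zero.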
2. Chi-square test - for classification; tests the association between a (non-negative) feature and the target
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# Keep the top 2 features
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)
print(f"Original feature count: {X.shape[1]}, after selection: {X_selected.shape[1]}")
print("p-values:", selector.pvalues_.round(4))
Common filter methods at a glance: variance threshold (unsupervised), F-test/Pearson correlation (regression), and chi-square (classification with non-negative features).
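One more filter score worth knowing, as an addition not covered above: mutual information, which also captures nonlinear dependence between a feature and the target. A minimal sketch on iris:

```python
from sklearn.feature_selection import mutual_info_classif
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Estimated mutual information between each feature and the class label
mi = mutual_info_classif(X, y, random_state=42)
for name, score in zip(iris.feature_names, mi):
    print(f"{name}: {score:.3f}")
```

On iris, the petal measurements score far higher than the sepal ones, matching what the chi-square selector found above.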
Embedded method: train a tree model, read off its feature importances directly, and keep the most important features. This is the most common and most convenient approach.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data
y = data.target
# Train a random forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
# Print the feature importances
for i, importance in enumerate(rf.feature_importances_):
    print(f"{data.feature_names[i]}: {importance:.4f}")
# Select features by importance
selector = SelectFromModel(rf, threshold='mean', prefit=True)
X_selected = selector.transform(X)
print(f"\nOriginal feature count: {X.shape[1]}, after selection: {X_selected.shape[1]}")
Recursive Feature Elimination (RFE), a wrapper method, retrains the model repeatedly, removing the least important feature each round until the desired number remains.
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import numpy as np
# Generate data
X = np.random.randn(100, 10)
y = X[:, 0] + X[:, 3] + X[:, 7] + np.random.randn(100)
# Linear regression, keep the top 3 features
lr = LinearRegression()
rfe = RFE(lr, n_features_to_select=3, step=1)
rfe.fit(X, y)
print("Selection mask (True = selected):")
print(rfe.support_)
print("Feature ranking (1 = selected; larger = less important):")
print(rfe.ranking_)
Comparison of the three approaches:

| Method   | Uses a model?          | Speed  | Notes                                        |
|----------|------------------------|--------|----------------------------------------------|
| Filter   | No                     | Fast   | Scores each feature independently of any model |
| Embedded | Yes, trained once      | Medium | Importances come from the trained model       |
| Wrapper  | Yes, trained repeatedly | Slow   | Best fit to the final model, most expensive   |
Practical advice: use a filter method as a cheap first pass, tree-based embedded selection as your default, and wrapper methods (RFE) only when the feature count is small enough to afford repeated retraining.
Now let's walk through a complete feature-engineering workflow, from raw data to model-ready features, step by step. We'll use the Titanic survival dataset as the demo.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 1. Load the data (here we use seaborn's Titanic dataset)
import seaborn as sns
data = sns.load_dataset('titanic')
print("Raw data shape:", data.shape)
print("\nFirst 5 rows:")
print(data.head())
print("\nMissing values:")
print(data.isnull().sum())
# 2. Separate features and target
X = data.drop(['survived'], axis=1)
y = data['survived']
# Split numeric and categorical features
numeric_features = ['age', 'fare']
categorical_features = ['sex', 'pclass', 'embarked', 'who', 'adult_male', 'alone']
# 3. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 4. Build the preprocessing pipelines
# Numeric features: impute with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
# Categorical features: impute with the mode, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine the per-type preprocessors
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
# 5. Full pipeline: preprocessing + feature selection + model
full_pipeline = Pipeline(steps=[
    ('preprocessing', preprocessor),
    ('feature_selection', SelectFromModel(RandomForestClassifier(
        n_estimators=100, random_state=42), threshold='mean')),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# 6. Train
full_pipeline.fit(X_train, y_train)
# 7. Evaluate
y_pred = full_pipeline.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"\nTest accuracy: {acc:.4f}")
# How many features were kept? (reuse the already-fitted preprocessing step)
X_preprocessed = full_pipeline.named_steps['preprocessing'].transform(X_train)
print(f"\nTotal features after preprocessing: {X_preprocessed.shape[1]}")
selector = full_pipeline.named_steps['feature_selection']
print(f"Features after selection: {selector.get_support().sum()}")
What does this workflow demonstrate?
Numeric and categorical features each get their own preprocessing, combined elegantly by ColumnTransformer; everything is wrapped in a Pipeline, which prevents data leakage. This is the complete feature-engineering workflow recommended in industry.
Now it's your turn. Complete the following tasks:
Using the California housing dataset fetch_california_housing():
Using the Titanic data, walk through the full feature-engineering pipeline yourself:
Feature engineering is the part of machine learning that is more art than algorithm, but that doesn't mean it has no method. This post covered the complete workflow:
Starting with data preprocessing, we covered missing values (drop/impute), categorical encoding (one-hot/label/ordinal), and feature scaling (standardization/min-max);
Then feature transformation: log transforms to make skewed distributions more symmetric, polynomial features to capture interactions, and binning to give linear models nonlinear ability;
Then feature selection: filter methods (variance, chi-square, correlation), embedded methods (tree feature importances), and wrapper methods (RFE), along with their trade-offs;
Finally, a complete hands-on workflow showing how ColumnTransformer and Pipeline combine every step cleanly and prevent data leakage.
Remember: garbage in, garbage out. Time spent on feature engineering pays off more than blind hyperparameter tuning.
One more reminder: feature engineering is learned by doing. Run the code in this post and finish the exercises — that's how you actually master it.
| Task | Key imports |
|------|-------------|
| Missing values | from sklearn.impute import SimpleImputer |
| Categorical encoding | from sklearn.preprocessing import OneHotEncoder |
| | from sklearn.preprocessing import LabelEncoder |
| | from sklearn.preprocessing import OrdinalEncoder |
| Feature scaling | from sklearn.preprocessing import StandardScaler |
| | from sklearn.preprocessing import MinMaxScaler |
| Feature transformation | np.log1p / pd.cut / pd.qcut |
| | from sklearn.preprocessing import PolynomialFeatures |
| Feature selection | from sklearn.feature_selection import VarianceThreshold |
| | from sklearn.feature_selection import SelectKBest, chi2 |
| | from sklearn.feature_selection import SelectKBest, f_regression |
| | from sklearn.feature_selection import SelectFromModel |
| | from sklearn.feature_selection import RFE |
| Pipelines | from sklearn.compose import ColumnTransformer |
| | from sklearn.pipeline import Pipeline |
Models that need scaling: KNN, SVM, linear models, neural networks, PCA
Models that don't: decision trees, random forest, XGBoost, LightGBM