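In data competitions and in industry alike there is an old saying: "Data and features determine the upper bound of machine learning; models and algorithms merely approach that bound." This article walks systematically through the core methods of feature engineering, with code for each technique, as a hands-on handbook for improving model performance.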
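I. What Is Feature Engineering?
Feature engineering is the process of turning raw data into input features that a model can use effectively. It mainly consists of the following stages: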
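```
Raw data
├── Feature understanding (EDA)
├── Feature cleaning (missing values, outliers)
├── Feature construction (creating new features)
├── Feature transformation (scaling, encoding, distribution transforms)
├── Feature selection (filtering out redundant features)
└── Final feature matrix → model input
```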
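II. Numerical Feature Processing
1. Standardization and Normalization
Different scaling methods suit different scenarios, and the wrong choice can hurt model performance.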
```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import (
    StandardScaler, MinMaxScaler, RobustScaler,
    PowerTransformer, QuantileTransformer
)

# Note: the 4th row contains an outlier (10000)
data = np.array([[1, 200], [2, 300], [3, 400], [4, 10000], [5, 500]])

# Z-score standardization (mean 0, variance 1; suits roughly normal data and linear models)
scaler = StandardScaler()
print(scaler.fit_transform(data))

# Min-max normalization (squeezes values into [0, 1]; sensitive to outliers)
scaler = MinMaxScaler()
print(scaler.fit_transform(data))

# RobustScaler (based on the median and IQR; robust to outliers)
scaler = RobustScaler()
print(scaler.fit_transform(data))  # recommended when outliers are present

# Selection guide:
# - Linear models / KNN / SVM → StandardScaler or RobustScaler
# - Neural networks → MinMaxScaler or StandardScaler
# - Tree models → no scaling needed (decision trees are insensitive to feature scale)
```
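To make the "median and IQR" description concrete, here is a minimal sketch that reproduces RobustScaler by hand on the second column of the toy data. It assumes the scaler's default quantile range (25% to 75%); the variable names are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Second column of the toy data, including the outlier 10000
x = np.array([[200.0], [300.0], [400.0], [10000.0], [500.0]])

# Manual robust scaling: subtract the median, divide by the interquartile range
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

# The same transformation via scikit-learn (default quantile_range=(25.0, 75.0))
auto = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, auto))
```

The outlier still ends up far from the rest, but the ordinary values keep a usable spread instead of being squeezed together as they would be under min-max scaling.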
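2. Distribution Transforms: Making Skewed Data More "Normal"
```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer, QuantileTransformer

# Simulate right-skewed data (e.g. income, transaction amounts)
income = np.random.exponential(scale=50000, size=10000)

# Method 1: log transform (the most common)
income_log = np.log1p(income)  # log(1 + x), avoids log(0)

# Method 2: Box-Cox transform (data must be strictly positive)
income_boxcox, lambda_ = stats.boxcox(income + 1)
print(f"Optimal lambda: {lambda_:.4f}")

# Method 3: Yeo-Johnson transform (also handles negative values)
pt = PowerTransformer(method='yeo-johnson')
income_yj = pt.fit_transform(income.reshape(-1, 1))

# Method 4: quantile transform (forces a uniform or normal output distribution)
qt = QuantileTransformer(output_distribution='normal', n_quantiles=1000)
income_qt = qt.fit_transform(income.reshape(-1, 1))

# Check how the skewness changes
print(f"Original skewness: {stats.skew(income):.2f}")
print(f"Skewness after log transform: {stats.skew(income_log):.2f}")
print(f"Skewness after quantile transform: {stats.skew(income_qt.flatten()):.2f}")
```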
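💡 Rule of thumb: when skewness is greater than 1, try a log transform first; if the data is still skewed after the log, use Box-Cox or a quantile transform.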
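The rule of thumb can be sketched as a small helper. This is only an illustration: `maybe_log_transform` and its `threshold` parameter are hypothetical names, and the check for non-negative values is an added assumption so that `log1p` stays well defined.

```python
import numpy as np
from scipy import stats

def maybe_log_transform(x, threshold=1.0):
    """Apply log1p only when x is non-negative and right-skewed beyond the threshold."""
    x = np.asarray(x, dtype=float)
    if (x >= 0).all() and stats.skew(x) > threshold:
        return np.log1p(x), True
    return x, False

rng = np.random.default_rng(0)
skewed = rng.exponential(scale=50000, size=10000)   # strongly right-skewed
balanced = rng.uniform(0, 100, size=10000)          # roughly symmetric

skewed_out, skewed_applied = maybe_log_transform(skewed)
balanced_out, balanced_applied = maybe_log_transform(balanced)

print(skewed_applied, balanced_applied)
print(f"skewness after the helper: {stats.skew(skewed_out):.2f}")
```

Printing the skewness afterwards matters: a log transform can over-correct and leave the data skewed in the other direction, which is when Box-Cox or a quantile transform is worth trying.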
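3. Binning: Discretizing Continuous Variables
```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": np.random.randint(18, 80, 10000)})

# Equal-width binning. Let pandas label the bins with the intervals it computes:
# hard-coded labels such as "18-29" would not match the real edges
# (with ages spanning 18..79, the first of 5 equal-width bins ends at about 30.2).
df["age_bin_equal"] = pd.cut(df["age"], bins=5)

# Equal-frequency binning (roughly the same number of samples in each bin)
df["age_bin_quantile"] = pd.qcut(
    df["age"], q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"]
)

# Custom business binning (usually the most meaningful in practice)
bins = [0, 24, 34, 44, 59, 100]
labels = ["Gen Z", "post-80s", "post-70s", "post-60s", "seniors"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Advanced: optimal binning (supervised: bin edges are chosen using the label)
# pip install optbinning
from optbinning import OptimalBinning

# y_binary is not defined in this article; it stands for a 0/1 target array aligned with df
optb = OptimalBinning(name="age", dtype="numerical", solver="cp")
optb.fit(df["age"].values, y_binary)
print(optb.binning_table.build())
```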
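Before attaching human-readable labels to bins, it helps to look at the edges pandas actually computes. The sketch below uses one sample per age from 18 to 79 (an assumption made so the result is deterministic) and compares equal-width with equal-frequency binning.

```python
import numpy as np
import pandas as pd

ages = pd.Series(np.arange(18, 80))  # one sample for each age 18..79

# Equal-width: retbins=True also returns the computed bin edges
width_bins, edges = pd.cut(ages, bins=5, retbins=True)
print(edges)

# Equal-frequency: bin sizes differ by at most one sample here
freq_counts = pd.qcut(ages, q=5).value_counts()
print(freq_counts.sort_index())
```

The first edge sits slightly below the minimum so that the smallest value is included, and the interior edges are not whole numbers, which is why labels typed by hand can easily drift from the real intervals.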
4. Constructing Interaction Features
```python
# Basic arithmetic features (df is assumed to already contain these columns)
df["price_per_sqm"] = df["price"] / df["area"]
df["income_to_debt"] = df["income"] / (df["debt"] + 1)
df["profit_margin"] = (df["revenue"] - df["cost"]) / (df["revenue"] + 1e-6)

# Polynomial features (capture non-linear relationships)
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=False, include_bias=False)
X_poly = poly.fit_transform(df[["age", "income", "spend"]])
poly_feature_names = poly.get_feature_names_out(["age", "income", "spend"])
# Output: the 3 original columns plus age², income², spend²,
# age*income, age*spend, income*spend

# Aggregate statistics (per user)
user_stats = df.groupby("user_id").agg(
    spend_mean=("spend", "mean"),
    spend_std=("spend", "std"),
    spend_max=("spend", "max"),
    order_count=("order_id", "count"),
    recency_days=("order_date", lambda x: (pd.Timestamp.now() - x.max()).days),
).reset_index()
# This is the starting point for classic RFM features
```
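III. Categorical Feature Encoding
This is the part of feature engineering where it is easiest to go wrong: choosing the wrong encoding can seriously hurt model performance.
5. Label Encoding vs. One-Hot Encoding
```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "city": ["Beijing", "Shanghai", "Guangzhou", "Beijing", "Shenzhen"],
    "level": ["high", "mid", "low", "high", "mid"],
    "product": ["A", "B", "C", "A", "D"],
})

# ① Ordered categories (e.g. level: low < mid < high)
# Caution: LabelEncoder sorts the classes lexicographically, so here
# "high"→0, "low"→1, "mid"→2, which is the wrong order.
# It is also intended for target labels rather than input features.
le = LabelEncoder()
df["level_encoded"] = le.fit_transform(df["level"])

# Correct approach: specify the order explicitly
oe = OrdinalEncoder(categories=[["low", "mid", "high"]])
df["level_ordinal"] = oe.fit_transform(df[["level"]]).ravel()  # low→0, mid→1, high→2 ✓

# ② One-hot encoding: unordered categories (e.g. city, color)
# pandas way (convenient: returns a DataFrame directly)
df_encoded = pd.get_dummies(df, columns=["city"], prefix="city", drop_first=True)

# sklearn way (fits into a Pipeline; sparse_output requires scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False, drop="first", handle_unknown="ignore")
city_encoded = ohe.fit_transform(df[["city"]])
```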
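⚠️ Pitfall: for a high-cardinality category (e.g. user_id with hundreds of thousands of distinct values), One-Hot encoding produces hundreds of thousands of columns. Never apply it directly.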
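One possible workaround, not taken from the article itself, is to collapse rare categories into a single bucket before one-hot encoding. The helper `collapse_rare`, its `min_count` parameter and the `"other"` label are all hypothetical choices for this sketch.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label="other"):
    """Replace categories seen fewer than min_count times with a single label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other_label)

users = pd.Series(["u1", "u1", "u1", "u2", "u2", "u3", "u4"])
collapsed = collapse_rare(users, min_count=2)
dummies = pd.get_dummies(collapsed, prefix="user")

print(collapsed.tolist())
print(dummies.columns.tolist())
```

The threshold should be derived from the training data only and then reused at prediction time, so that the set of columns stays fixed.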
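6. Target Encoding: A Strong Option for High-Cardinality Categories
```python
# Target encoding: replace each category with the mean of the target for that category
# Example: city = "Shanghai" → the average purchase amount of Shanghai users
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def target_encoding_cv(df, cat_col, target_col, n_splits=5):
    """Cross-validated target encoding, to prevent data leakage."""
    df[f"{cat_col}_te"] = np.nan
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in kf.split(df):
        train = df.iloc[train_idx]
        val = df.iloc[val_idx]
        # Compute the per-category means on the training folds only
        means = train.groupby(cat_col)[target_col].mean()
        # Encode the validation fold (unseen categories fall back to the global mean)
        global_mean = train[target_col].mean()
        df.loc[df.index[val_idx], f"{cat_col}_te"] = (
            val[cat_col].map(means).fillna(global_mean)
        )
    return df

# A more convenient option: the category_encoders library
# pip install category_encoders
from category_encoders import TargetEncoder, WOEEncoder, CatBoostEncoder

# X_train / y_train are assumed to be an existing training split.
# CatBoostEncoder uses ordered target statistics to limit leakage
# and is a common choice in practice.
encoder = CatBoostEncoder(cols=["city", "product"])
X_encoded = encoder.fit_transform(X_train, y_train)

# WOE encoding (for binary classification; widely used in credit risk modeling)
encoder = WOEEncoder(cols=["city"])
X_encoded = encoder.fit_transform(X_train, y_train)
```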
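Why the cross-validated version matters can be shown with an extreme toy case: a category that is unique per row. The data below is synthetic and the column names `uid` and `y` are made up for this sketch.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
df = pd.DataFrame({"uid": np.arange(100), "y": rng.integers(0, 2, size=100)})

# Naive target encoding: every uid is unique, so its "mean" is the label itself
naive = df.groupby("uid")["y"].transform("mean")

# Out-of-fold encoding: each row is encoded using the other folds only
oof = pd.Series(np.nan, index=df.index)
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(df):
    train, val = df.iloc[train_idx], df.iloc[val_idx]
    means = train.groupby("uid")["y"].mean()
    oof.iloc[val_idx] = val["uid"].map(means).fillna(train["y"].mean()).to_numpy()

print((naive == df["y"]).all())   # the naive feature leaks the target exactly
print(oof.nunique())              # the out-of-fold feature only carries fold-level means
```

With the naive feature a model could score perfectly on training data while learning nothing that transfers, which is the leakage the out-of-fold scheme is designed to avoid.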
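7. Frequency Encoding and Count Encoding
```python
# Frequency encoding: replace each category with its relative frequency
freq_map = df["city"].value_counts(normalize=True).to_dict()
df["city_freq"] = df["city"].map(freq_map)

# Count encoding: replace each category with its number of occurrences
count_map = df["city"].value_counts().to_dict()
df["city_count"] = df["city"].map(count_map)

# Both encodings implicitly tell the model how common (and so how "important") a city is
```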
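In practice the frequency map should be learned on the training set and then applied to new data, where unseen categories will appear. A minimal sketch, assuming that an unseen category should get frequency 0 (that fallback is a design choice, not something the article prescribes):

```python
import pandas as pd

train = pd.DataFrame({"city": ["Beijing", "Beijing", "Shanghai", "Guangzhou"]})
test = pd.DataFrame({"city": ["Beijing", "Shenzhen"]})  # "Shenzhen" never seen in train

# Learn the mapping on the training data only
freq_map = train["city"].value_counts(normalize=True)

train["city_freq"] = train["city"].map(freq_map)
# Categories missing from the map become NaN, so fill them explicitly
test["city_freq"] = test["city"].map(freq_map).fillna(0.0)

print(test)
```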
IV. Time Feature Engineering
8. Extracting Rich Features from Timestamps
```python
df["ts"] = pd.to_datetime(df["timestamp"])

# Basic time features
df["year"] = df["ts"].dt.year
df["month"] = df["ts"].dt.month
df["day"] = df["ts"].dt.day
df["hour"] = df["ts"].dt.hour
df["dayofweek"] = df["ts"].dt.dayofweek  # 0 = Monday
df["weekofyear"] = df["ts"].dt.isocalendar().week.astype(int)
df["quarter"] = df["ts"].dt.quarter
df["is_weekend"] = df["ts"].dt.dayofweek >= 5
df["is_month_end"] = df["ts"].dt.is_month_end

# Cyclical encoding (solves the "January and December are neighbors" problem)
# Representing cyclic features with sin/cos lets the model see the periodicity
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
df["weekday_sin"] = np.sin(2 * np.pi * df["dayofweek"] / 7)
df["weekday_cos"] = np.cos(2 * np.pi * df["dayofweek"] / 7)

# Time-difference features
reference_date = pd.Timestamp("2020-01-01")
df["days_since_ref"] = (df["ts"] - reference_date).dt.days
# Sort first so that shift(1) really returns each user's previous event
df = df.sort_values(["user_id", "ts"])
df["days_since_last_buy"] = (df["ts"] - df.groupby("user_id")["ts"].shift(1)).dt.days
```
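The effect of the sin/cos encoding can be checked numerically: December and January end up as close to each other as January and February, whereas the raw month numbers put them 11 apart. The helper `cyclical` is a hypothetical name used only in this sketch.

```python
import numpy as np

def cyclical(value, period):
    """Map a periodic value to a point on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

dist_dec_jan = np.linalg.norm(cyclical(12, 12) - cyclical(1, 12))
dist_jan_feb = np.linalg.norm(cyclical(1, 12) - cyclical(2, 12))
raw_gap_dec_jan = abs(12 - 1)

print(dist_dec_jan, dist_jan_feb, raw_gap_dec_jan)
```

Both columns are needed: sine or cosine alone maps two different months to the same value.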
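9. Sliding-Window Statistical Features (the Core of Time-Series Scenarios)
```python
df = df.sort_values(["user_id", "ts"])

# Per-user rolling-window features.
# Note: an integer window counts rows (each user's last N orders, including the
# current one), not calendar days. For true N-day windows, use a time offset
# such as rolling("7D") on a DatetimeIndex.
for window in [7, 14, 30, 90]:
    df[f"spend_sum_{window}d"] = (
        df.groupby("user_id")["spend"]
        .transform(lambda x: x.rolling(window, min_periods=1).sum())
    )
    df[f"spend_mean_{window}d"] = (
        df.groupby("user_id")["spend"]
        .transform(lambda x: x.rolling(window, min_periods=1).mean())
    )
    df[f"order_cnt_{window}d"] = (
        df.groupby("user_id")["order_id"]
        .transform(lambda x: x.rolling(window, min_periods=1).count())
    )

# Lag features (past values)
for lag in [1, 3, 7, 14]:
    df[f"spend_lag_{lag}"] = df.groupby("user_id")["spend"].shift(lag)
```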
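The rolling calls above use an integer window, which counts rows rather than days. If calendar windows are wanted, one way to sketch it is a time-offset window on a DatetimeIndex. The toy data and the column name `spend_prev_7d` are made up, and `closed="left"` is an added choice that excludes the current order so the feature only looks at the past.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-20", "2024-01-02", "2024-01-03"]
    ),
    "spend": [10.0, 20.0, 30.0, 5.0, 7.0],
})
df = df.sort_values(["user_id", "ts"]).reset_index(drop=True)

# Sum of each user's spend in the 7 days strictly before the current order
rolled = (
    df.set_index("ts")
    .groupby("user_id")["spend"]
    .rolling("7D", closed="left")
    .sum()
)

# The result is ordered by user_id then ts, matching the sorted frame row by row
df["spend_prev_7d"] = rolled.to_numpy()
print(df)
```

Rows with no earlier order inside the window come out as NaN, which tree models can often handle directly and other models need imputed.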
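V. Feature Selection
After constructing features, you need to filter out redundant and noisy ones to avoid the curse of dimensionality.
10. Filter Methods: Quickly Removing Useless Features
```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    VarianceThreshold, SelectKBest, f_classif, mutual_info_classif
)

# X, y and feature_names are assumed to exist (feature matrix, labels, column names)

# ① Drop low-variance features (a nearly constant feature carries little information).
# Variance depends on scale, so the threshold only makes sense for comparable scales.
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)

# ② Drop highly correlated features (keep one of each pair with |correlation| > 0.95)
corr_matrix = pd.DataFrame(X).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
X_reduced = pd.DataFrame(X).drop(columns=to_drop)

# ③ Univariate statistical tests (classification task)
selector = SelectKBest(score_func=mutual_info_classif, k=20)
X_best = selector.fit_transform(X, y)
scores = pd.Series(selector.scores_, index=feature_names).sort_values(ascending=False)
print(scores.head(10))
```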
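The correlation filter is easiest to follow on a toy frame where the answer is known in advance. In this sketch column `b` is almost a copy of `a`, and `c` is independent noise; all data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = pd.DataFrame({
    "a": a,
    "b": 2 * a + 0.01 * rng.normal(size=200),  # nearly a linear copy of a
    "c": rng.normal(size=200),                 # unrelated noise
})

corr = X.corr().abs()
# Keep only the upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]

print(to_drop)
```

Because only the upper triangle is scanned, the later column of each correlated pair is the one dropped, so column order decides which feature survives.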
11. Embedded Methods: Using the Model Itself for Feature Selection
```python
import pandas as pd
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# X_train, y_train and feature_names are assumed to exist

# Random forest feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importance_df = pd.DataFrame({
    "feature": feature_names,
    "importance": rf.feature_importances_,
}).sort_values("importance", ascending=False)
print(importance_df.head(15))

# Automatically keep the important features (importance at or above the median)
selector = SelectFromModel(rf, threshold="median")
X_selected = selector.fit_transform(X_train, y_train)

# LightGBM feature importance (typically faster to train on large datasets)
model = lgb.LGBMClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
lgb.plot_importance(model, max_num_features=20, figsize=(10, 8))
```
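12. Permutation Importance (a More Reliable Feature Importance)
```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Randomly shuffle the values of one feature and measure how much model performance drops.
# The larger the drop, the more important the feature; the result does not depend on feature scale.
# rf is the fitted model from the previous step; X_test / y_test are a held-out split.
result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10, random_state=42, n_jobs=-1
)
perm_df = pd.DataFrame({
    "feature": feature_names,
    "importance_mean": result.importances_mean,
    "importance_std": result.importances_std,
}).sort_values("importance_mean", ascending=False)
print(perm_df.head(10))
```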
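A quick sanity check of permutation importance is to run it on synthetic data where only one feature matters. In this sketch the label depends on feature 0 alone; the data, sizes and seeds are arbitrary assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)  # only feature 0 carries signal

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Evaluate on held-out data so the scores reflect generalization, not memorization
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=0)
print(result.importances_mean)
```

Shuffling feature 0 should cost the model most of its accuracy, while shuffling the noise features should change almost nothing. With strongly correlated features the importance can be split between them, so low scores there need a second look.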
VI. Automated Feature Engineering
13. Generating Features Automatically with Featuretools
```python
# pip install featuretools
import featuretools as ft

# df_orders and df_users are assumed to be existing DataFrames

# Define the entity set
es = ft.EntitySet(id="orders")
es = es.add_dataframe(
    dataframe_name="orders", dataframe=df_orders,
    index="order_id", time_index="order_date"
)
es = es.add_dataframe(
    dataframe_name="users", dataframe=df_users, index="user_id"
)
es = es.add_relationship("users", "user_id", "orders", "user_id")

# Automatic Deep Feature Synthesis (DFS)
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="users",
    agg_primitives=["mean", "sum", "count", "max", "min", "std", "n_unique"],
    trans_primitives=["month", "weekday", "hour"],
    max_depth=2
)
print(f"Automatically generated {len(feature_defs)} features")
print(feature_matrix.head())
```
VII. An End-to-End Feature Engineering Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder, OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test, y_test are assumed to be an existing train/test split

# Columns by type
num_cols = ["age", "income", "spend"]
cat_cols = ["city", "category"]
ord_cols = ["level"]

# Pipeline for numerical features
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", RobustScaler()),
])

# Pipeline for categorical features
cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="unknown")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

# Pipeline for ordered categories
ord_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder(categories=[["low", "mid", "high"]])),
])

# Combine them
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols),
    ("ord", ord_pipeline, ord_cols),
])

# Full modeling pipeline
full_pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=200, random_state=42)),
])
full_pipeline.fit(X_train, y_train)
print(f"Test accuracy: {full_pipeline.score(X_test, y_test):.4f}")
```
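💡 The biggest benefit of a Pipeline is that it prevents data leakage: every transformer is fitted on the training data only and then applied unchanged to the test data, so fit/transform stay consistent across both, and the same object can be deployed to production directly.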
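The leakage argument is clearest under cross-validation, where the pipeline is re-fitted inside every fold. A minimal sketch on synthetic data follows; `LogisticRegression` is only a stand-in model and the dataset parameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# In each fold the scaler is fitted on the training part only,
# so validation rows never influence the scaling statistics.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Scaling the full dataset first and cross-validating afterwards would let every fold see statistics computed from its own validation rows.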
VIII. Feature Engineering Best Practices
Use joblib to save the scaler/encoder so that predictions stay consistent with training.
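```python
import joblib

# Save the whole fitted Pipeline (including every transformer)
joblib.dump(full_pipeline, "model_v1.pkl")

# Load it and predict (X_new is assumed to have the same columns as the training data)
pipeline = joblib.load("model_v1.pkl")
predictions = pipeline.predict(X_new)
```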
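A save/load round trip can be verified end to end on toy data. This sketch writes to a temporary directory so nothing is left behind; the small pipeline and the synthetic data are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_v1.pkl")
    joblib.dump(pipe, path)          # the scaler statistics are stored with the model
    restored = joblib.load(path)

same_predictions = np.array_equal(pipe.predict(X), restored.predict(X))
print(same_predictions)
```

joblib files are pickle-based, so they should only be loaded from trusted sources, and ideally with the same library versions that were used to save them.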
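There are no shortcuts in feature engineering: it takes a lot of domain understanding and experimentation. Still, having these methods organized in one place should save you many detours on each project, and you can come back to this handbook the next time a feature engineering problem comes up.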