[Repeated-Measures Longitudinal Data] Python 19. Mixed Effects Random Forest (MERF)

2026-07-02 13:39:16

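Longitudinal data are collected by repeatedly observing the same individuals, objects, or entities at different time points. Like a moving picture of the study subjects rather than a single snapshot, they record how each subject changes over time and help us understand trends, patterns, and the factors behind them.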
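To make the structure concrete, here is a minimal sketch of longitudinal data in long format (one row per subject per time point), regrouped into one trajectory per subject. The column names `ID`, `Time`, and `Indicator1` follow the script later in this article; the values and the helper name `to_trajectories` are made up for illustration.

```python
from collections import defaultdict

# Long format: one record per subject (ID) per time point.
# The values below are invented for illustration only.
rows = [
    {"ID": 1, "Time": 2, "Indicator1": 52.0},
    {"ID": 1, "Time": 1, "Indicator1": 50.0},
    {"ID": 2, "Time": 1, "Indicator1": 47.5},
    {"ID": 2, "Time": 2, "Indicator1": 46.0},
    {"ID": 1, "Time": 3, "Indicator1": 55.5},
]


def to_trajectories(records):
    """Group long-format records into {ID: [(Time, value), ...]} sorted by time."""
    by_subject = defaultdict(list)
    for rec in records:
        by_subject[rec["ID"]].append((rec["Time"], rec["Indicator1"]))
    return {sid: sorted(points) for sid, points in by_subject.items()}


trajectories = to_trajectories(rows)
for sid, points in sorted(trajectories.items()):
    print(sid, points)
```

Because several rows belong to the same subject, they are not independent observations, which is the reason a random-effects structure is needed.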

🔬 Model concept, principle, and rationale

19. Mixed effects random forest

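Concept: MERF combines the random forest algorithm with the ideas of mixed-effects models. It is used for clustered or longitudinal data and captures fixed effects and random effects at the same time.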
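In symbols, the concept can be written as follows. This is the MERF formulation as it is commonly stated in the literature, given here as background rather than taken from this article's script; the notation ($f$, $Z_i$, $b_i$, $D$, $R_i$) is introduced here for illustration.

```latex
y_i = f(X_i) + Z_i b_i + \varepsilon_i, \qquad
b_i \sim \mathcal{N}(0, D), \qquad
\varepsilon_i \sim \mathcal{N}(0, R_i), \qquad i = 1, \dots, n
```

Here $y_i$ holds the repeated measurements of subject $i$, $f$ is the unknown fixed-effects function estimated by the random forest, and $Z_i b_i$ is the subject-specific random-effects term. With a random intercept only, $Z_i$ is a column of ones and $b_i$ is a single offset per subject, which corresponds to the `residuals ~ 1` model grouped by `ID` in the script.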

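Principle: Implementations vary. A common strategy is to build a random forest for the fixed-effects part first and then fit a linear mixed model to its residuals to estimate the random effects and their variance components. An alternative is to build the random-effects likelihood directly into the tree-splitting rule.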
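The second stage of the two-stage strategy can be sketched in a few lines. This is a dependency-free illustration, not the article's implementation: it assumes a random-intercept model, takes the variance components as known inputs, and the function name `shrunken_intercepts` is hypothetical.

```python
from statistics import fmean


def shrunken_intercepts(residuals_by_subject, var_between, var_within):
    """Estimate one random intercept per subject from fixed-effects residuals.

    For a random-intercept model r_ij = b_i + e_ij with Var(b_i) = var_between
    and Var(e_ij) = var_within, the best linear unbiased predictor of b_i is
    the subject's mean residual shrunk toward zero by the factor
    n_i * var_between / (n_i * var_between + var_within).
    """
    intercepts = {}
    for subject, residuals in residuals_by_subject.items():
        n = len(residuals)
        weight = n * var_between / (n * var_between + var_within)
        intercepts[subject] = weight * fmean(residuals)
    return intercepts


# Residuals (observed minus forest prediction); values invented for illustration.
residuals_by_subject = {
    "A": [2.0, 2.0, 2.0, 2.0],  # four visits, so little shrinkage
    "B": [2.0],                 # one visit, so strong shrinkage
}
# Assumption: the variance components are known. In practice they are
# estimated, for example by a linear mixed model fitted to the residuals.
b_hat = shrunken_intercepts(residuals_by_subject, var_between=1.0, var_within=1.0)
print(b_hat)
```

Both subjects have the same mean residual, but the subject with four visits keeps more of it than the subject with one visit. This shrinkage is what distinguishes a mixed-model random effect from a plain per-subject mean.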

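Rationale: "machine learning meets classical longitudinal modeling". MERF draws on the random forest's strength with high-dimensional, nonlinear, and interacting predictors, while its random-effects structure respects the within-subject dependence of longitudinal data. Together these can improve predictive accuracy and interpretability.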

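Visualization: a variable importance plot and partial dependence plots, which show the nonlinear relationship between key predictors and the outcome.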
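What a partial dependence plot computes can be shown by hand. The sketch below is dependency-free; `toy_model` is a hypothetical stand-in for a fitted forest, and the data are invented.

```python
from statistics import fmean


def partial_dependence_1d(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value.

    `predict` maps one feature row (a list of numbers) to a prediction.
    All other features keep their observed values, so the curve shows the
    average effect of the chosen feature over the data.
    """
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)
            modified[feature_index] = value
            predictions.append(predict(modified))
        curve.append(fmean(predictions))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted forest: nonlinear in x0, linear in x1."""
    return row[0] ** 2 + row[1]


rows = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]  # invented data; the mean of x1 is 3
grid = [0.0, 1.0, 2.0]
pd_curve = partial_dependence_1d(toy_model, rows, feature_index=0, grid=grid)
print(list(zip(grid, pd_curve)))
```

In the script, `sklearn.inspection.partial_dependence` performs the same kind of averaging for the fitted random forest.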

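Public health relevance: from the large number of longitudinal variables with complex interactions in electronic health records (such as repeated laboratory tests and medication records), predict a patient's risk of a hospital-acquired infection during the stay, while accounting for between-patient heterogeneity and the correlation among measurements over time.

💡 22 models for analyzing repeated-measures longitudinal data

Advancing the core idea: methods for longitudinal data analysis have moved from handling correlation (e.g., GEE, mixed models), to revealing heterogeneity (e.g., GBTM, LCM), and on to integrating dynamic mechanisms with prediction (e.g., joint models, state-space models, Bayesian methods).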

1. Data preprocessing:

2. Model fitting:

3. Model visualization:

4. Saving results and generating the report

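💎 Introduction to the implementation code

In short, because longitudinal data follow the same subjects over time, they are a powerful tool for understanding how things develop and for investigating causal relationships.

Longitudinal data also bring challenges: they cost more to collect, often contain missing values, and usually require more complex analysis methods than cross-sectional data.

Below we use Python to implement model 19 for longitudinal data, the Mixed Effects Random Forest (MERF):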

# pip install numpy pandas scipy scikit-learn matplotlib seaborn statsmodels plotly openpyxl

# ===========================================
# 19. Mixed Effects Random Forest (MERF) - Python implementation
# ===========================================
import os
import sys
import json
import pickle
import itertools
import warnings
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any

# Suppress warnings to keep the console output readable
warnings.filterwarnings('ignore')

# ===========================================
# 1. Import the required libraries
# ===========================================
print("=" * 70)
print("Mixed Effects Random Forest (MERF) analysis - Python implementation")
print("=" * 70)

print("\n1. Importing the required libraries...")
try:
    # Data handling
    import numpy as np
    import pandas as pd
    from scipy import stats

    # Machine learning
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.inspection import permutation_importance, PartialDependenceDisplay

    # Mixed-effects models (optional; a simple fallback is used later if missing)
    try:
        import statsmodels.api as sm
        from statsmodels.regression.mixed_linear_model import MixedLM
    except ImportError:
        print("Warning: statsmodels is not installed; some features may be limited")
        print("Install with: pip install statsmodels")

    # Visualization
    import matplotlib.pyplot as plt
    import seaborn as sns
    from matplotlib import gridspec
    import matplotlib.cm as cm
    import plotly.express as px
    import plotly.graph_objects as go
    from plotly.subplots import make_subplots

    print("✓ All required libraries imported")
except ImportError as e:
    print(f"✗ Error while importing libraries: {e}")
    print("Please make sure the following libraries are installed:")
    print("pip install numpy pandas scikit-learn matplotlib seaborn statsmodels plotly")
    sys.exit(1)

# Font fallbacks (the first entries support Chinese labels) and minus-sign rendering
plt.rcParams['font.sans-serif'] = ['SimHei', 'Arial Unicode MS', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False

# ===========================================
# 2. Set the working directory and paths
# ===========================================
print("\n2. Setting the working directory and paths...")

# Locate the desktop folder automatically
desktop_path = Path.home() / "Desktop"
if not desktop_path.exists():
    # Try other likely desktop locations
    possible_paths = [
        Path.home() / "桌面",  # Chinese-language systems
        Path.home() / "OneDrive" / "Desktop",  # OneDrive
    ]
    for path in possible_paths:
        if path.exists():
            desktop_path = path
            break
print(f"Desktop path: {desktop_path}")

# Results folder (parents=True so this also works if no desktop folder was found)
result_dir = desktop_path / "19-MERF_results"
result_dir.mkdir(parents=True, exist_ok=True)

# Subfolders
figures_dir = result_dir / "figures"
tables_dir = result_dir / "tables"
models_dir = result_dir / "models"
reports_dir = result_dir / "reports"
for dir_path in [figures_dir, tables_dir, models_dir, reports_dir]:
    dir_path.mkdir(exist_ok=True)
print(f"Results will be saved to: {result_dir}")

# ===========================================
# 3. Load the data
# ===========================================
print("\n3. Loading the data...")
try:
    # Candidate locations for the Excel file
    data_file_paths = [
        desktop_path / "longitudinal_data.xlsx",
        desktop_path.parent / "longitudinal_data.xlsx",
        Path.cwd() / "longitudinal_data.xlsx",
    ]
    data_path = None
    for path in data_file_paths:
        if path.exists():
            data_path = path
            break
    if data_path is None:
        raise FileNotFoundError("Could not find longitudinal_data.xlsx")
    print(f"Found data file: {data_path}")

    # Read the Excel file
    try:
        # Prefer the "Full_Dataset" sheet; otherwise use the first sheet
        excel_file = pd.ExcelFile(data_path)
        sheet_names = excel_file.sheet_names
        if "Full_Dataset" in sheet_names:
            data_full = pd.read_excel(data_path, sheet_name="Full_Dataset")
        elif len(sheet_names) > 0:
            data_full = pd.read_excel(data_path, sheet_name=sheet_names[0])
            print(f"Using sheet: {sheet_names[0]}")
        else:
            raise ValueError("The Excel file has no usable sheet")
    except Exception as e:
        print(f"Failed to read the Excel file: {e}")
        # Fall back to a CSV file with the same name
        csv_path = data_path.with_suffix('.csv')
        if csv_path.exists():
            print("Trying the CSV version...")
            data_full = pd.read_csv(csv_path)
        else:
            raise
    print("✓ Data loaded")

    # Inspect the data structure
    print("\nData structure:")
    print(f"Dimensions: {data_full.shape}")
    print(f"Number of variables: {len(data_full.columns)}")

    # Check the key columns
    if 'Time' in data_full.columns:
        print(f"Time points: {sorted(data_full['Time'].unique())}")
    if 'ID' in data_full.columns:
        print(f"Number of subjects: {data_full['ID'].nunique()}")

    # Show the first rows
    print("\nData preview:")
    print(data_full.head())
except Exception as e:
    print(f"✗ Error while loading data: {e}")
    print("Creating example data for demonstration...")
    # Simulated example data (every column is drawn independently per row)
    np.random.seed(1234)
    n_subjects = 100
    n_timepoints = 5
    n_total = n_subjects * n_timepoints
    data_full = pd.DataFrame({
        'ID': np.repeat(range(1, n_subjects + 1), n_timepoints),
        'Time': np.tile(range(1, n_timepoints + 1), n_subjects),
        'Indicator1': np.random.normal(50, 10, n_total),
        'Indicator2': np.random.normal(30, 5, n_total),
        'Indicator3': np.random.normal(20, 3, n_total),
        'Age': np.random.normal(45, 10, n_total),
        'Sex': np.random.choice(['M', 'F'], n_total),
        'Treatment': np.random.choice(['A', 'B', 'C'], n_total),
        'Medication_Adherence': np.random.uniform(0.5, 1.0, n_total),
        'Stress_Level': np.random.normal(5, 2, n_total),
        'Quality_of_Life': np.random.normal(70, 15, n_total),
        'Baseline_Score': np.random.normal(50, 8, n_total)
    })
    print("Continuing the analysis with example data...")

# ===========================================
# 4. Data preparation and preprocessing
# ===========================================
print("\n4. Data preparation and preprocessing...")

# 4.1 Build the MERF analysis dataset
print("\n4.1 Building the MERF analysis dataset...")

# Columns used in the analysis
required_columns = ['ID', 'Time', 'Indicator1', 'Indicator2', 'Indicator3',
                    'Age', 'Sex', 'Treatment', 'Medication_Adherence',
                    'Stress_Level', 'Quality_of_Life', 'Baseline_Score']

# Check for missing columns
missing_cols = [col for col in required_columns if col not in data_full.columns]
if missing_cols:
    print(f"Warning: the following columns do not exist: {missing_cols}")
    # Fall back to the columns that are available
    available_cols = [col for col in required_columns if col in data_full.columns]
    print(f"Using available columns: {available_cols}")
    required_columns = available_cols
merf_data = data_full[required_columns].copy()

# Check for missing values
print("\nMissing-value summary:")
missing_stats = merf_data.isnull().sum()
missing_cols = missing_stats[missing_stats > 0]
if len(missing_cols) > 0:
    print("Columns with missing values:")
    for col, count in missing_cols.items():
        print(f"  {col}: {count} missing ({count/len(merf_data)*100:.1f}%)")
    # Drop rows with missing values (complete-case analysis)
    merf_data_complete = merf_data.dropna().copy()
    print(f"\nDimensions after dropping missing values: {merf_data_complete.shape}")
    print(f"Dropped {len(merf_data) - len(merf_data_complete)} rows")
else:
    merf_data_complete = merf_data.copy()
    print("No missing values")

# Convert variables to appropriate types
print("\nConverting variable types...")
# Treat ID as a categorical variable
merf_data_complete['ID'] = merf_data_complete['ID'].astype('category')

# Categorical predictors
categorical_cols = ['Sex', 'Treatment']
for col in categorical_cols:
    if col in merf_data_complete.columns:
        merf_data_complete[col] = merf_data_complete[col].astype('category')
        print(f"  {col}: converted to categorical, {merf_data_complete[col].nunique()} categories")

# Make sure the numeric columns are numeric
numeric_cols = ['Time', 'Age', 'Indicator1', 'Indicator2', 'Indicator3',
                'Medication_Adherence', 'Stress_Level', 'Quality_of_Life', 'Baseline_Score']
for col in numeric_cols:
    if col in merf_data_complete.columns:
        merf_data_complete[col] = pd.to_numeric(merf_data_complete[col], errors='coerce')

# Check that there is enough data
print(f"\nFinal data dimensions: {merf_data_complete.shape}")
if len(merf_data_complete) < 50:
    print(f"Warning: small sample size ({len(merf_data_complete)}); the model may be unstable")

# 4.2 Train/test split
print("\n4.2 Splitting the data (training and test sets)...")
np.random.seed(1234)
# Split by subject so that no subject appears in both sets
unique_ids = np.asarray(merf_data_complete['ID'].unique())
n_ids = len(unique_ids)
train_ids = np.random.choice(unique_ids, size=int(0.7 * n_ids), replace=False)

# Build the training and test sets
train_mask = merf_data_complete['ID'].isin(train_ids)
test_mask = ~train_mask
train_data = merf_data_complete[train_mask].copy()
test_data = merf_data_complete[test_mask].copy()
print(f"Training set size: {len(train_data)} ({len(train_data)/len(merf_data_complete)*100:.1f}%)")
print(f"Test set size: {len(test_data)} ({len(test_data)/len(merf_data_complete)*100:.1f}%)")
print(f"Training subjects: {train_data['ID'].nunique()}")
print(f"Test subjects: {test_data['ID'].nunique()}")

# ===========================================
# 5. Descriptive statistics
# ===========================================
print("\n5. Descriptive statistics...")

# 5.1 Overall summary
print("\n5.1 Overall descriptive statistics:")
desc_overall = pd.DataFrame({
    'N': [len(train_data)],
    'Mean_Ind1': [train_data['Indicator1'].mean()],
    'SD_Ind1': [train_data['Indicator1'].std()],
    'Mean_Ind2': [train_data['Indicator2'].mean()],
    'SD_Ind2': [train_data['Indicator2'].std()],
    'Mean_Ind3': [train_data['Indicator3'].mean()],
    'SD_Ind3': [train_data['Indicator3'].std()],
    'Mean_Age': [train_data['Age'].mean()],
    'SD_Age': [train_data['Age'].std()]
}).round(3)
print(desc_overall.to_string(index=False))
# Save the descriptive statistics
desc_overall.to_csv(tables_dir / "overall_descriptive_statistics.csv", index=False, encoding='utf-8-sig')

# 5.2 By treatment group
if 'Treatment' in train_data.columns:
    print("\n5.2 Descriptive statistics by treatment group:")
    desc_treatment = train_data.groupby('Treatment').agg({
        'Indicator1': ['count', 'mean', 'std'],
        'Age': ['mean', 'std']
    }).round(3)
    # Flatten the column names
    desc_treatment.columns = ['N', 'Mean_Ind1', 'SD_Ind1', 'Mean_Age', 'SD_Age']
    desc_treatment = desc_treatment.reset_index()
    print(desc_treatment.to_string(index=False))
    desc_treatment.to_csv(tables_dir / "descriptive_statistics_by_treatment.csv", index=False, encoding='utf-8-sig')

# 5.3 By time point
if 'Time' in train_data.columns:
    print("\n5.3 Descriptive statistics by time point:")
    desc_time = train_data.groupby('Time').agg({
        'Indicator1': ['count', 'mean', 'std']
    }).round(3)
    # Flatten the column names
    desc_time.columns = ['N', 'Mean_Ind1', 'SD_Ind1']
    desc_time = desc_time.reset_index()
    print(desc_time.to_string(index=False))
    desc_time.to_csv(tables_dir / "descriptive_statistics_by_time.csv", index=False, encoding='utf-8-sig')

# ===========================================
# 6. Fit the mixed effects random forest
# ===========================================
print("\n6. Fitting the mixed effects random forest...")

# Prepare the features and the target
print("\nPreparing features and target...")
numeric_features = ['Time', 'Age', 'Indicator2', 'Indicator3', 'Medication_Adherence',
                    'Stress_Level', 'Quality_of_Life', 'Baseline_Score']
# Fixed-effects features
if 'Sex' in train_data.columns and 'Treatment' in train_data.columns:
    # One-hot encode the categorical predictors
    X_train = pd.get_dummies(train_data[numeric_features + ['Sex', 'Treatment']],
                             columns=['Sex', 'Treatment'], drop_first=True)
    X_test = pd.get_dummies(test_data[numeric_features + ['Sex', 'Treatment']],
                            columns=['Sex', 'Treatment'], drop_first=True)
else:
    X_train = train_data[numeric_features].copy()
    X_test = test_data[numeric_features].copy()
y_train = train_data['Indicator1'].values
y_test = test_data['Indicator1'].values

# Grouping variable (subject ID)
groups_train = train_data['ID'].values
groups_test = test_data['ID'].values
print(f"Training feature shape: {X_train.shape}")
print(f"Training target shape: {y_train.shape}")
print(f"Training group shape: {groups_train.shape}")
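The subject-level split in step 4.2 above is what keeps the evaluation honest: if rows of one subject landed in both sets, the test set would leak information about that subject. A minimal, dependency-free sketch of the same idea follows; the function name `split_by_subject` is hypothetical, and it uses Python's `random` module rather than NumPy, so it will not reproduce the script's exact split.

```python
import random


def split_by_subject(subject_ids, train_fraction=0.7, seed=1234):
    """Split row indices so that every subject falls entirely in one set."""
    unique_ids = sorted(set(subject_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_train = int(train_fraction * len(unique_ids))
    train_ids = set(unique_ids[:n_train])
    train_rows = [i for i, sid in enumerate(subject_ids) if sid in train_ids]
    test_rows = [i for i, sid in enumerate(subject_ids) if sid not in train_ids]
    return train_rows, test_rows


# Ten hypothetical subjects with three visits each.
ids = [sid for sid in range(1, 11) for _ in range(3)]
train_rows, test_rows = split_by_subject(ids)

train_subjects = {ids[i] for i in train_rows}
test_subjects = {ids[i] for i in test_rows}
print(len(train_subjects), len(test_subjects))
```

scikit-learn also offers group-aware splitters such as `GroupShuffleSplit` and `GroupKFold` for the same purpose.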

# 6.1 Method 1: random forest + linear mixed model
# NOTE: the opening of this step (the section banner, the outer `try:`, and the
# fit of `rf_model`) is missing from the source. The lines down to `rf_model.fit`
# are a reconstruction: 500 trees matches the N_Trees value reported in step 7,
# and the remaining hyperparameters are assumed to mirror Method 2.
print("\n6.1 Method 1: random forest + linear mixed model...")
method1_models = {}
try:
    rf_model = RandomForestRegressor(
        n_estimators=500,
        max_features='sqrt',
        random_state=1234,
        n_jobs=-1,
        verbose=0
    )
    rf_model.fit(X_train, y_train)

    # Training-set predictions
    train_data['rf_pred'] = rf_model.predict(X_train)
    # Residuals of the forest (observed minus predicted)
    train_data['residuals'] = train_data['Indicator1'] - train_data['rf_pred']

    # Fit a linear mixed model (random intercept per subject) to the residuals
    print("Fitting the linear mixed model (random effects)...")
    mixed_data = train_data.copy()
    mixed_data['intercept'] = 1  # intercept column
    try:
        from statsmodels.regression.mixed_linear_model import MixedLM

        mixed_model = MixedLM.from_formula(
            'residuals ~ 1',
            data=mixed_data,
            groups=mixed_data['ID']
        )
        mixed_result = mixed_model.fit()
        print("✓ Linear mixed model fitted!")

        # random_effects maps each group label to a pandas Series; with a random
        # intercept only, its single element is that subject's offset. (The
        # original checked for a list here, which made every offset zero.)
        random_effects = mixed_result.random_effects
        re_by_id = {gid: float(np.asarray(re)[0]) for gid, re in random_effects.items()}
        # Map the offsets back to the training rows (0 if a subject is unmatched)
        train_data['random_effects'] = (
            train_data['ID'].astype(object).map(re_by_id).astype(float).fillna(0.0)
        )
        # Combined prediction: forest prediction + subject offset
        train_data['merf_pred'] = train_data['rf_pred'] + train_data['random_effects']
        method1_models = {
            'rf_model': rf_model,
            'mixed_model': mixed_result,
            'random_effects': random_effects
        }
        # Save the models
        with open(models_dir / "random_forest_model.pkl", 'wb') as f:
            pickle.dump(rf_model, f)
        mixed_result.save(str(models_dir / "linear_mixed_model.pkl"))
        print("✓ Models saved")
    except Exception as e:
        print(f"✗ Linear mixed model fit failed: {e}")
        print("Using a simple random-effects estimate...")
        # Simple fallback: each subject's mean residual serves as its random effect
        random_effects_simple = (
            train_data.groupby('ID', observed=True)['residuals'].mean().to_dict()
        )
        train_data['random_effects'] = (
            train_data['ID'].astype(object).map(random_effects_simple).astype(float)
        )
        train_data['merf_pred'] = train_data['rf_pred'] + train_data['random_effects']
        method1_models = {
            'rf_model': rf_model,
            'random_effects': random_effects_simple,
            'mixed_model': None
        }
        with open(models_dir / "random_forest_model.pkl", 'wb') as f:
            pickle.dump(rf_model, f)
        with open(models_dir / "random_effects_simple.pkl", 'wb') as f:
            pickle.dump(random_effects_simple, f)
        print("✓ Simple random-effects estimate used")
except Exception as e:
    print(f"✗ Method 1 fit failed: {e}")
    method1_models = {}

# 6.2 Method 2: enhanced random forest (with interaction features)
print("\n6.2 Method 2: enhanced random forest...")
method2_model = {}
try:
    # Build the interaction features
    train_data_enhanced = train_data.copy()
    train_data_enhanced['Time_Age'] = train_data_enhanced['Time'] * train_data_enhanced['Age']
    if 'Treatment' in train_data_enhanced.columns:
        # Label for each time-by-treatment cell (not included in the feature list below)
        train_data_enhanced['Time_Treatment'] = (
            train_data_enhanced['Time'].astype(str) + '_' + train_data_enhanced['Treatment'].astype(str)
        )
    train_data_enhanced['Stress_Adherence'] = (
        train_data_enhanced['Stress_Level'] * train_data_enhanced['Medication_Adherence']
    )

    # Assemble the features
    enhanced_features = ['Time', 'Age', 'Indicator2', 'Indicator3', 'Medication_Adherence',
                         'Stress_Level', 'Quality_of_Life', 'Baseline_Score',
                         'Time_Age', 'Stress_Adherence']
    if 'Sex' in train_data_enhanced.columns and 'Treatment' in train_data_enhanced.columns:
        X_train_enhanced = pd.get_dummies(train_data_enhanced[enhanced_features + ['Sex', 'Treatment']],
                                          columns=['Sex', 'Treatment'], drop_first=True)
    else:
        X_train_enhanced = train_data_enhanced[enhanced_features].copy()
    y_train_enhanced = train_data_enhanced['Indicator1'].values

    # Fit the enhanced random forest
    rf_enhanced = RandomForestRegressor(
        n_estimators=500,
        max_features='sqrt',
        random_state=1234,
        n_jobs=-1,
        verbose=0
    )
    rf_enhanced.fit(X_train_enhanced, y_train_enhanced)
    print("✓ Enhanced random forest fitted!")

    # Save the model
    with open(models_dir / "random_forest_enhanced.pkl", 'wb') as f:
        pickle.dump(rf_enhanced, f)
    method2_model = {
        'rf_enhanced': rf_enhanced,
        'feature_names': X_train_enhanced.columns.tolist()
    }
except Exception as e:
    print(f"✗ Method 2 fit failed: {e}")
    method2_model = {}

# 6.3 Method 3: gradient boosting as a comparison
print("\n6.3 Method 3: gradient boosting model...")
method3_model = {}
try:
    from sklearn.ensemble import GradientBoostingRegressor

    gb_model = GradientBoostingRegressor(
        n_estimators=200,
        learning_rate=0.1,
        max_depth=5,
        random_state=1234,
        verbose=0
    )
    gb_model.fit(X_train, y_train)
    print("✓ Gradient boosting model fitted!")
    # Save the model
    with open(models_dir / "gradient_boosting_model.pkl", 'wb') as f:
        pickle.dump(gb_model, f)
    method3_model = {'gb_model': gb_model}
except Exception as e:
    print(f"✗ Method 3 fit failed: {e}")
    method3_model = {}

# ===========================================
# 7. Model evaluation and comparison
# ===========================================
print("\n7. Model evaluation and comparison...")


def build_enhanced_test_matrix(test_df, feature_names):
    """Build the Method 2 design matrix for the test set, aligned to the training columns."""
    enhanced = test_df.copy()
    enhanced['Time_Age'] = enhanced['Time'] * enhanced['Age']
    if 'Treatment' in enhanced.columns:
        enhanced['Time_Treatment'] = enhanced['Time'].astype(str) + '_' + enhanced['Treatment'].astype(str)
    enhanced['Stress_Adherence'] = enhanced['Stress_Level'] * enhanced['Medication_Adherence']
    enhanced_features = ['Time', 'Age', 'Indicator2', 'Indicator3', 'Medication_Adherence',
                         'Stress_Level', 'Quality_of_Life', 'Baseline_Score',
                         'Time_Age', 'Stress_Adherence']
    if 'Sex' in enhanced.columns and 'Treatment' in enhanced.columns:
        X = pd.get_dummies(enhanced[enhanced_features + ['Sex', 'Treatment']],
                           columns=['Sex', 'Treatment'], drop_first=True)
    else:
        X = enhanced[enhanced_features].copy()
    # Add any training column that is absent here, then match the column order
    for col in feature_names:
        if col not in X.columns:
            X[col] = 0
    return enhanced, X[feature_names]


# 7.1 Performance of each model
# NOTE: Method 1 is scored on the training set (in-sample), whereas Methods 2
# and 3 are scored on the test set. The rows of this table are therefore not
# directly comparable, and Method 1 will tend to look better than it is.
print("\n7.1 Training-set performance...")
model_performance = []

# Method 1
if method1_models and 'merf_pred' in train_data.columns:
    y_true = train_data['Indicator1']
    y_pred = train_data['merf_pred']
    mae1 = mean_absolute_error(y_true, y_pred)
    rmse1 = np.sqrt(mean_squared_error(y_true, y_pred))
    r2_1 = r2_score(y_true, y_pred)
    model_performance.append({
        'Model': 'Random forest + linear mixed model',
        'MAE': mae1, 'RMSE': rmse1, 'R2': r2_1, 'N_Trees': 500
    })
    print("Random forest + linear mixed model:")
    print(f"  MAE: {mae1:.3f}")
    print(f"  RMSE: {rmse1:.3f}")
    print(f"  R²: {r2_1:.3f}")

# Method 2
if method2_model:
    try:
        # Test data with the enhanced features
        test_data_enhanced, X_test_enhanced = build_enhanced_test_matrix(
            test_data, method2_model['feature_names'])
        y_test_enhanced = test_data_enhanced['Indicator1'].values
        # Predict
        rf_enhanced_pred = method2_model['rf_enhanced'].predict(X_test_enhanced)
        mae2 = mean_absolute_error(y_test_enhanced, rf_enhanced_pred)
        rmse2 = np.sqrt(mean_squared_error(y_test_enhanced, rf_enhanced_pred))
        r2_2 = r2_score(y_test_enhanced, rf_enhanced_pred)
        model_performance.append({
            'Model': 'Enhanced random forest',
            'MAE': mae2, 'RMSE': rmse2, 'R2': r2_2, 'N_Trees': 500
        })
        print("\nEnhanced random forest:")
        print(f"  MAE: {mae2:.3f}")
        print(f"  RMSE: {rmse2:.3f}")
        print(f"  R²: {r2_2:.3f}")
    except Exception as e:
        print(f"Method 2 evaluation failed: {e}")

# Method 3
if method3_model:
    try:
        gb_pred = method3_model['gb_model'].predict(X_test)
        mae3 = mean_absolute_error(y_test, gb_pred)
        rmse3 = np.sqrt(mean_squared_error(y_test, gb_pred))
        r2_3 = r2_score(y_test, gb_pred)
        model_performance.append({
            'Model': 'Gradient boosting',
            'MAE': mae3, 'RMSE': rmse3, 'R2': r2_3, 'N_Trees': 200
        })
        print("\nGradient boosting:")
        print(f"  MAE: {mae3:.3f}")
        print(f"  RMSE: {rmse3:.3f}")
        print(f"  R²: {r2_3:.3f}")
    except Exception as e:
        print(f"Method 3 evaluation failed: {e}")

# Convert to a DataFrame
model_performance_df = pd.DataFrame(model_performance)

# Save the performance comparison
if not model_performance_df.empty:
    print("\nModel performance comparison:")
    print(model_performance_df.to_string(index=False))
    model_performance_df.to_csv(tables_dir / "model_performance_comparison.csv", index=False, encoding='utf-8-sig')
    print("✓ Model performance comparison table saved")
else:
    print("Warning: no valid model performance data")

# 7.2 Test-set performance
print("\n7.2 Test-set performance...")
# Pick the best model (highest R²)
if not model_performance_df.empty:
    best_model_index = model_performance_df['R2'].idxmax()
    best_model_name = model_performance_df.loc[best_model_index, 'Model']
    print(f"\nBest model: {best_model_name}")

    # Predict on the test set with the best model
    test_performance = {}
    y_test_pred = None
    if best_model_name == "Random forest + linear mixed model" and method1_models:
        test_data['rf_pred'] = method1_models['rf_model'].predict(X_test)
        # Test subjects are new, so their random effects are unknown; the mean
        # training random effect is used as an approximation
        if 'random_effects' in train_data.columns:
            mean_random_effect = train_data['random_effects'].mean()
            test_data['merf_pred'] = test_data['rf_pred'] + mean_random_effect
        else:
            test_data['merf_pred'] = test_data['rf_pred']
        y_test_pred = test_data['merf_pred']
        best_model = method1_models
    elif best_model_name == "Enhanced random forest" and method2_model:
        test_data_enhanced, X_test_enhanced = build_enhanced_test_matrix(
            test_data, method2_model['feature_names'])
        test_data_enhanced['rf_enhanced_pred'] = method2_model['rf_enhanced'].predict(X_test_enhanced)
        y_test_pred = test_data_enhanced['rf_enhanced_pred']
        best_model = method2_model
    elif best_model_name == "Gradient boosting" and method3_model:
        test_data['gb_pred'] = method3_model['gb_model'].predict(X_test)
        y_test_pred = test_data['gb_pred']
        best_model = method3_model

    # Test-set metrics for the selected model
    if y_test_pred is not None:
        test_mae = mean_absolute_error(test_data['Indicator1'], y_test_pred)
        test_rmse = np.sqrt(mean_squared_error(test_data['Indicator1'], y_test_pred))
        test_r2 = r2_score(test_data['Indicator1'], y_test_pred)
        test_performance = {
            'Model': best_model_name,
            'MAE': test_mae, 'RMSE': test_rmse, 'R2': test_r2
        }

    if test_performance:
        print("\nTest-set performance:")
        print(f"  MAE: {test_performance['MAE']:.3f}")
        print(f"  RMSE: {test_performance['RMSE']:.3f}")
        print(f"  R²: {test_performance['R2']:.3f}")
        # Save the test-set performance
        test_performance_df = pd.DataFrame([test_performance])
        test_performance_df.to_csv(tables_dir / "test_set_performance.csv", index=False, encoding='utf-8-sig')
        print("✓ Test-set performance table saved")

# ===========================================
# 8. Visualization
# ===========================================
print("\n8. Generating the plots...")

# 8.1 Variable importance
print("\n8.1 Variable importance plot...")
try:
    if method1_models and 'rf_model' in method1_models:
        rf_model = method1_models['rf_model']
        # Impurity-based importance from the fitted forest
        feature_importance = pd.DataFrame({
            'Variable': X_train.columns,
            'Importance': rf_model.feature_importances_
        }).sort_values('Importance', ascending=False)
        print("Variable importance ranking:")
        print(feature_importance.head(10).to_string(index=False))
        # Save the variable importance table
        feature_importance.to_csv(tables_dir / "variable_importance.csv", index=False, encoding='utf-8-sig')

        # Draw the variable importance plot
        plt.figure(figsize=(12, 8))
        plt.barh(feature_importance['Variable'][:15], feature_importance['Importance'][:15])
        plt.xlabel('Importance score')
        plt.ylabel('Variable')
        plt.title('Mixed effects random forest: variable importance')
        plt.gca().invert_yaxis()  # most important at the top
        plt.tight_layout()
        plt.savefig(figures_dir / "variable_importance_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Variable importance plot saved")
except Exception as e:
    print(f"✗ Variable importance plot failed: {e}")

# 8.2 Partial dependence
print("\n8.2 Partial dependence plot...")
try:
    if method1_models and 'rf_model' in method1_models:
        rf_model = method1_models['rf_model']
        # The most important variables
        top_vars = feature_importance['Variable'].head(4).tolist()

        fig, axes = plt.subplots(2, 2, figsize=(14, 10))
        axes = axes.ravel()
        for i, var in enumerate(top_vars):
            if i >= 4:  # show at most 4
                break
            if var in X_train.columns:
                # Compute the partial dependence
                from sklearn.inspection import partial_dependence
                pdp_result = partial_dependence(
                    rf_model, X_train, [var],
                    grid_resolution=20
                )
                # The grid is stored under 'grid_values' in scikit-learn >= 1.3
                # and under 'values' in older releases
                grid_key = 'grid_values' if 'grid_values' in pdp_result else 'values'
                grid = pdp_result[grid_key][0]
                # Draw the partial dependence curve
                axes[i].plot(grid, pdp_result['average'][0],
                             linewidth=2, color='blue')
                axes[i].scatter(grid, pdp_result['average'][0],
                                color='red', s=30)
                axes[i].set_xlabel(var)
                axes[i].set_ylabel('Partial dependence')
                axes[i].set_title(f'Partial dependence: {var}')
                axes[i].grid(True, alpha=0.3)
        plt.suptitle('Mixed effects random forest: partial dependence of key variables', fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.savefig(figures_dir / "partial_dependence_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Partial dependence plot saved")
except Exception as e:
    print(f"✗ Partial dependence plot failed: {e}")

# 8.3 Predicted vs. observed
print("\n8.3 Predicted vs. observed plot...")
try:
    if 'merf_pred' in train_data.columns:
        # Scatter plot of predicted vs. observed values
        plt.figure(figsize=(10, 8))
        plt.scatter(train_data['Indicator1'], train_data['merf_pred'],
                    alpha=0.5, color='steelblue')
        # Perfect-prediction line
        min_val = min(train_data['Indicator1'].min(), train_data['merf_pred'].min())
        max_val = max(train_data['Indicator1'].max(), train_data['merf_pred'].max())
        plt.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect prediction')
        # Regression line
        from scipy import stats
        slope, intercept, r_value, p_value, std_err = stats.linregress(
            train_data['Indicator1'], train_data['merf_pred']
        )
        plt.plot([min_val, max_val],
                 [slope*min_val + intercept, slope*max_val + intercept],
                 'g-', linewidth=2, label=f'Regression line (R={r_value:.3f})')
        plt.xlabel('Observed')
        plt.ylabel('Predicted')
        plt.title('Mixed effects random forest: predicted vs. observed')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.axis('equal')
        plt.tight_layout()
        plt.savefig(figures_dir / "predicted_vs_observed_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Predicted vs. observed plot saved")

        # Residual plot. Note: `residuals` holds the forest-only residuals
        # (observed minus rf_pred) computed in step 6.1
        if 'residuals' in train_data.columns:
            plt.figure(figsize=(12, 6))
            plt.scatter(train_data['merf_pred'], train_data['residuals'],
                        alpha=0.5, color='steelblue')
            # Zero line
            plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
            # Smoothed trend: moving average of the residuals ordered by prediction
            sorted_indices = np.argsort(train_data['merf_pred'].values)
            sorted_pred = train_data['merf_pred'].values[sorted_indices]
            sorted_resid = train_data['residuals'].values[sorted_indices]
            window_size = min(50, len(sorted_resid))
            smooth_resid = np.convolve(sorted_resid, np.ones(window_size)/window_size, mode='valid')
            smooth_pred = sorted_pred[window_size-1:]
            plt.plot(smooth_pred, smooth_resid, 'g-', linewidth=2, label='Smoothed trend')
            plt.xlabel('Predicted')
            plt.ylabel('Residual')
            plt.title('Mixed effects random forest: residual plot')
            plt.legend()
            plt.grid(True, alpha=0.3)
            plt.tight_layout()
            plt.savefig(figures_dir / "residual_plot.png", dpi=300, bbox_inches='tight')
            plt.close()
            print("✓ Residual plot saved")
except Exception as e:
    print(f"✗ Predicted vs. observed plot failed: {e}")

# 8.4 Random-effects distribution
print("\n8.4 Random-effects distribution plot...")
try:
    if 'random_effects' in train_data.columns:
        # Histogram of the random effects (one value per training row)
        plt.figure(figsize=(12, 6))
        plt.hist(train_data['random_effects'], bins=30,
                 color='skyblue', edgecolor='black', alpha=0.7)
        mean_effect = train_data['random_effects'].mean()
        std_effect = train_data['random_effects'].std()
        plt.axvline(x=mean_effect, color='r', linestyle='--',
                    linewidth=2, label=f'Mean = {mean_effect:.3f}')
        plt.xlabel('Random effect')
        plt.ylabel('Frequency')
        plt.title(f'Mixed effects random forest: random-effects distribution\n(mean = {mean_effect:.3f}, SD = {std_effect:.3f})')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig(figures_dir / "random_effects_distribution_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Random-effects distribution plot saved")

        # Random effect of each subject (observed=True skips IDs that are only in the test set)
        random_effects_by_id = train_data.groupby('ID', observed=True)['random_effects'].mean().reset_index()
        random_effects_by_id = random_effects_by_id.sort_values('random_effects')
        plt.figure(figsize=(14, 6))
        plt.bar(range(len(random_effects_by_id)), random_effects_by_id['random_effects'],
                color='steelblue', alpha=0.7)
        plt.axhline(y=0, color='r', linestyle='--', linewidth=1)
        plt.xlabel('Subject ID (sorted)')
        plt.ylabel('Random effect')
        plt.title('Mixed effects random forest: random effect by subject')
        plt.xticks([])  # hide the x-axis tick labels
        plt.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.savefig(figures_dir / "subject_random_effects_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Subject random-effects plot saved")
except Exception as e:
    print(f"✗ Random-effects distribution plot failed: {e}")

# 8.5 Time trend
print("\n8.5 Time-trend prediction plot...")
try:
    if 'Time' in train_data.columns and 'merf_pred' in train_data.columns:
        # Mean observed and predicted values at each time point
        time_trend = train_data.groupby('Time').agg({
            'Indicator1': ['mean', 'std', 'count'],
            'merf_pred': ['mean', 'std']
        }).round(3)
        # Flatten the column names
        time_trend.columns = ['Observed_Mean', 'Observed_SD', 'N', 'Predicted_Mean', 'Predicted_SD']
        time_trend['Observed_SE'] = time_trend['Observed_SD'] / np.sqrt(time_trend['N'])
        time_trend['Predicted_SE'] = time_trend['Predicted_SD'] / np.sqrt(time_trend['N'])

        # Draw the time-trend plot
        plt.figure(figsize=(12, 6))
        time_points = time_trend.index
        # Observed values
        plt.plot(time_points, time_trend['Observed_Mean'], 'b-', linewidth=2,
                 marker='o', markersize=8, label='Observed')
        plt.fill_between(time_points,
                         time_trend['Observed_Mean'] - 1.96 * time_trend['Observed_SE'],
                         time_trend['Observed_Mean'] + 1.96 * time_trend['Observed_SE'],
                         alpha=0.2, color='blue', label='Observed 95% CI')
        # Predicted values
        plt.plot(time_points, time_trend['Predicted_Mean'], 'r--', linewidth=2,
                 marker='s', markersize=6, label='Predicted')
        plt.fill_between(time_points,
                         time_trend['Predicted_Mean'] - 1.96 * time_trend['Predicted_SE'],
                         time_trend['Predicted_Mean'] + 1.96 * time_trend['Predicted_SE'],
                         alpha=0.2, color='red', label='Predicted 95% CI')
        plt.xlabel('Time point')
        plt.ylabel('Mean Indicator1')
        plt.title('Mixed effects random forest: time-trend prediction')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig(figures_dir / "time_trend_prediction_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Time-trend prediction plot saved")
except Exception as e:
    print(f"✗ Time-trend prediction plot failed: {e}")

# 8.6 Treatment-group comparison
print("\n8.6 Treatment-group comparison plot...")
try:
    if 'Treatment' in train_data.columns and 'merf_pred' in train_data.columns:
        # Mean observed and predicted values in each treatment group
        treatment_comparison = train_data.groupby('Treatment').agg({
            'Indicator1': ['mean', 'std', 'count'],
            'merf_pred': ['mean', 'std']
        }).round(3)
        # Flatten the column names
        treatment_comparison.columns = ['Observed_Mean', 'Observed_SD', 'N',
                                        'Predicted_Mean', 'Predicted_SD']
        treatment_comparison['Observed_SE'] = treatment_comparison['Observed_SD'] / np.sqrt(treatment_comparison['N'])
        treatment_comparison['Predicted_SE'] = treatment_comparison['Predicted_SD'] / np.sqrt(treatment_comparison['N'])

        # Draw the treatment-group comparison plot
        plt.figure(figsize=(12, 6))
        treatments = treatment_comparison.index
        x_pos = np.arange(len(treatments))
        # Bars: observed values
        plt.bar(x_pos - 0.2, treatment_comparison['Observed_Mean'],
                width=0.4, color='skyblue', alpha=0.7, label='Observed')
        # Error bars: observed values
        plt.errorbar(x_pos - 0.2, treatment_comparison['Observed_Mean'],
                     yerr=1.96*treatment_comparison['Observed_SE'],
                     fmt='none', color='blue', capsize=5, linewidth=2)
        # Points: predicted values
        plt.scatter(x_pos + 0.2, treatment_comparison['Predicted_Mean'],
                    color='red', s=100, zorder=5, label='Predicted')
        # Error bars: predicted values
        plt.errorbar(x_pos + 0.2, treatment_comparison['Predicted_Mean'],
                     yerr=1.96*treatment_comparison['Predicted_SE'],
                     fmt='none', color='red', capsize=5, linewidth=2, linestyle='--')
        plt.xlabel('Treatment group')
        plt.ylabel('Mean Indicator1')
        plt.title('Mixed effects random forest: treatment-group comparison')
        plt.xticks(x_pos, treatments)
        plt.legend()
        plt.grid(True, alpha=0.3, axis='y')
        plt.tight_layout()
        plt.savefig(figures_dir / "treatment_group_comparison_plot.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Treatment-group comparison plot saved")
except Exception as e:
    print(f"✗ Treatment-group comparison plot failed: {e}")

# 8.7 Relationships among key variables
print("\n8.7 Feature interaction plot...")
try:
    # Look at pairwise relationships among important variables
    important_vars = ['Indicator1', 'Time', 'Age', 'Indicator2', 'Quality_of_Life']
    available_vars = [var for var in important_vars if var in train_data.columns]
    if len(available_vars) >= 3:
        # Scatter-plot matrix
        scatter_matrix_data = train_data[available_vars[:4]]  # at most 4 variables
        fig = plt.figure(figsize=(12, 10))
        n_vars = len(scatter_matrix_data.columns)
        for i in range(n_vars):
            for j in range(n_vars):
                ax = plt.subplot(n_vars, n_vars, i * n_vars + j + 1)
                if i == j:
                    # Diagonal: histogram
                    ax.hist(scatter_matrix_data.iloc[:, i], bins=20,
                            alpha=0.7, color='skyblue', edgecolor='black')
                    ax.set_title(scatter_matrix_data.columns[i], fontsize=10)
                else:
                    # Off-diagonal: scatter plot
                    ax.scatter(scatter_matrix_data.iloc[:, j],
                               scatter_matrix_data.iloc[:, i],
                               alpha=0.5, s=10, color='steelblue')
                # Show ticks only on the outer panels
                if i != n_vars - 1:
                    ax.set_xticks([])
                if j != 0:
                    ax.set_yticks([])
        plt.suptitle('Scatter-plot matrix of key variables', fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.savefig(figures_dir / "variable_scatter_matrix.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Scatter-plot matrix saved")
except Exception as e:
    print(f"✗ Feature interaction plot failed: {e}")

# 8.8 Model performance comparison
print("\n8.8 Model performance comparison plot...")
try:
    if not model_performance_df.empty:
        fig, axes = plt.subplots(1, 3, figsize=(15, 5))
        metrics = ['MAE', 'RMSE', 'R2']
        colors = plt.cm.viridis(np.linspace(0, 1, len(model_performance_df)))
        for i, metric in enumerate(metrics):
            ax = axes[i]
            bars = ax.bar(range(len(model_performance_df)),
                          model_performance_df[metric],
                          color=colors, alpha=0.8)
            ax.set_xlabel('Model')
            ax.set_ylabel(metric)
            ax.set_title(f'{metric} comparison')
            ax.set_xticks(range(len(model_performance_df)))
            ax.set_xticklabels(model_performance_df['Model'], rotation=45, ha='right')

            # Annotate each bar with its value
            for j, bar in enumerate(bars):
                height = bar.get_height()
                ax.text(bar.get_x() + bar.get_width() / 2., height + 0.01,
                        f'{height:.3f}', ha='center', va='bottom', fontsize=9)

        plt.suptitle('Mixed effects random forest: model performance comparison',
                     fontsize=16, fontweight='bold')
        plt.tight_layout()
        plt.savefig(figures_dir / "模型性能对比图.png", dpi=300, bbox_inches='tight')
        plt.close()
        print("✓ Model performance comparison plot saved")
except Exception as e:
    print(f"✗ Failed to generate the model performance comparison plot: {e}")

# ===========================================
# 9. Build summary tables
# ===========================================
print("\n9. Building summary tables...")

# 9.1 Model performance summary table
print("\n9.1 Model performance summary table...")
if not model_performance_df.empty:
    # Work on a copy so the numeric frame stays usable later
    model_performance_formatted = model_performance_df.copy()

    # Rename columns (assumes the order Model, MAE, RMSE, R2, N_Trees)
    model_performance_formatted.columns = ['Model', 'Mean absolute error',
                                           'Root mean squared error',
                                           'R-squared', 'Number of trees']

    # Format numbers to three decimals
    for col in ['Mean absolute error', 'Root mean squared error', 'R-squared']:
        if col in model_performance_formatted.columns:
            model_performance_formatted[col] = model_performance_formatted[col].apply(
                lambda x: f"{x:.3f}"
            )

    model_performance_formatted.to_csv(tables_dir / "模型性能汇总表.csv",
                                       index=False, encoding='utf-8-sig')
    print("✓ Model performance summary table saved")

# 9.2 Variable importance summary table
print("\n9.2 Variable importance summary table...")
if 'feature_importance' in locals():
    # Work on a copy of the importance table
    importance_formatted = feature_importance.copy()
    importance_formatted['Rank'] = range(1, len(importance_formatted) + 1)

    # Rename columns
    importance_formatted.columns = ['Variable', 'Importance score', 'Rank']

    # Reorder columns
    importance_formatted = importance_formatted[['Rank', 'Variable', 'Importance score']]

    # Format numbers to three decimals
    importance_formatted['Importance score'] = importance_formatted['Importance score'].apply(
        lambda x: f"{x:.3f}"
    )

    importance_formatted.to_csv(tables_dir / "变量重要性汇总表.csv",
                                index=False, encoding='utf-8-sig')
    print("✓ Variable importance summary table saved")

# 9.3 Random-effects statistics table
print("\n9.3 Random-effects statistics table...")
if 'random_effects' in train_data.columns:
    # Descriptive statistics of the random effects
    random_stats = pd.DataFrame({
        'Number of individuals': [train_data['ID'].nunique()],
        'Random effect mean': [train_data['random_effects'].mean()],
        'Random effect SD': [train_data['random_effects'].std()],
        'Random effect minimum': [train_data['random_effects'].min()],
        'Random effect median': [train_data['random_effects'].median()],
        'Random effect maximum': [train_data['random_effects'].max()],
        'Random effect variance': [train_data['random_effects'].var()]
    })

    # Format numbers to three decimals
    for col in random_stats.columns:
        if col != 'Number of individuals':
            random_stats[col] = random_stats[col].apply(lambda x: f"{x:.3f}")

    random_stats.to_csv(tables_dir / "随机效应统计表.csv",
                        index=False, encoding='utf-8-sig')
    print("✓ Random-effects statistics table saved")

    # Random effect per individual
    random_by_id = train_data.groupby('ID')['random_effects'].agg(['mean', 'count']).reset_index()
    random_by_id.columns = ['ID', 'Random effect mean', 'Number of observations']

    # Format numbers to three decimals
    random_by_id['Random effect mean'] = random_by_id['Random effect mean'].apply(
        lambda x: f"{x:.3f}"
    )

    random_by_id.to_csv(tables_dir / "个体随机效应表.csv",
                        index=False, encoding='utf-8-sig')
    print("✓ Per-individual random-effects table saved")

# 9.4 Prediction-error analysis table
print("\n9.4 Prediction-error analysis table...")
if 'merf_pred' in train_data.columns:
    # Prediction-error metrics on the training data
    y_true = train_data['Indicator1']
    y_pred = train_data['merf_pred']

    error_stats = pd.DataFrame({
        'Sample size': [len(train_data)],
        'Mean absolute error': [mean_absolute_error(y_true, y_pred)],
        'Root mean squared error': [np.sqrt(mean_squared_error(y_true, y_pred))],
        # Note: this percentage error is infinite if any observed value is 0
        'Mean absolute percentage error': [np.mean(np.abs((y_true - y_pred) / y_true)) * 100],
        'R-squared': [r2_score(y_true, y_pred)],
        'Maximum absolute error': [np.max(np.abs(y_true - y_pred))]
    })

    # Format numbers to three decimals
    for col in error_stats.columns:
        if col != 'Sample size':
            error_stats[col] = error_stats[col].apply(lambda x: f"{x:.3f}")

    error_stats.to_csv(tables_dir / "预测误差分析表.csv",
                       index=False, encoding='utf-8-sig')
    print("✓ Prediction-error analysis table saved")

    # Prediction error by time point
    if 'Time' in train_data.columns:
        error_by_time = train_data.groupby('Time').apply(
            lambda x: pd.Series({
                'Observed mean': x['Indicator1'].mean(),
                'Predicted mean': x['merf_pred'].mean(),
                'Mean absolute error': mean_absolute_error(x['Indicator1'], x['merf_pred']),
                'Sample size': len(x)
            })
        ).reset_index()

        # Format numbers to three decimals
        for col in ['Observed mean', 'Predicted mean', 'Mean absolute error']:
            error_by_time[col] = error_by_time[col].apply(lambda x: f"{x:.3f}")

        error_by_time.to_csv(tables_dir / "按时间点预测误差表.csv",
                             index=False, encoding='utf-8-sig')
        print("✓ Prediction error by time point table saved")

# 9.5 Treatment-group effect table
print("\n9.5 Treatment-group effect table...")
if 'Treatment' in train_data.columns and 'merf_pred' in train_data.columns:
    treatment_effect = train_data.groupby('Treatment').apply(
        lambda x: pd.Series({
            'Sample size': len(x),
            'Observed mean': x['Indicator1'].mean(),
            'Predicted mean': x['merf_pred'].mean(),
            'Observed SD': x['Indicator1'].std(),
            'Predicted SD': x['merf_pred'].std(),
            'Mean absolute error': mean_absolute_error(x['Indicator1'], x['merf_pred'])
        })
    ).reset_index()

    # Format numbers to three decimals
    for col in ['Observed mean', 'Predicted mean', 'Observed SD',
                'Predicted SD', 'Mean absolute error']:
        treatment_effect[col] = treatment_effect[col].apply(lambda x: f"{x:.3f}")

    treatment_effect.to_csv(tables_dir / "治疗组效果分析表.csv",
                            index=False, encoding='utf-8-sig')
    print("✓ Treatment-group effect table saved")

# 9.6 Model parameter settings table
print("\n9.6 Model parameter settings table...")
model_params = pd.DataFrame({
    'Parameter': ['Number of trees', 'Variables per split (max_features)', 'Random seed',
                  'Variable importance method', 'Parallel jobs'],
    # A regression forest splits on variance reduction, so importance is
    # reported as mean decrease in impurity rather than Gini decrease
    'Value': [500, 'sqrt', 1234, 'Mean decrease in impurity', -1],
    'Description': ['Number of trees in the random forest',
                    'Number of variables considered at each split',
                    'Seed for the random number generator, for reproducibility',
                    'Method used to compute variable importance',
                    'Use all available CPU cores']
})
model_params.to_csv(tables_dir / "模型参数设置表.csv",
                    index=False, encoding='utf-8-sig')
print("✓ Model parameter settings table saved")

# 9.7 Overall report table
print("\n9.7 Building the overall report table...")

# Collect the analysis information
analysis_info = pd.DataFrame({
    'Item': [
        'Analysis type', 'Analysis approach', 'Number of individuals',
        'Number of time points', 'Total sample size', 'Training-set size', 'Test-set size',
        'Outcome variable', 'Main predictors', 'Random-effects structure',
        'Model algorithm', 'Number of trees', 'Best model',
        'Training-set R²', 'Test-set R²', 'Main findings', 'Public health relevance'
    ],
    'Result': [
        'Mixed effects random forest (MERF)',
        'Machine learning combined with a traditional longitudinal model',
        str(merf_data_complete['ID'].nunique()),
        str(merf_data_complete['Time'].nunique()),
        str(len(merf_data_complete)),
        str(len(train_data)),
        str(len(test_data)),
        'Indicator1',
        'Time, Age, Treatment, Indicator2, Indicator3, etc.',
        'Random intercept per individual',
        'Random forest + linear mixed model',
        '500',
        best_model_name if 'best_model_name' in locals() else 'Random forest + linear mixed model',
        f"{model_performance_df['R2'].max():.3f}" if not model_performance_df.empty else 'N/A',
        f"{test_r2:.3f}" if 'test_r2' in locals() else 'N/A',
        'See the analysis report',
        'See the analysis report'
    ]
})
analysis_info.to_csv(tables_dir / "混合效应随机森林分析综合报告表.csv",
                     index=False, encoding='utf-8-sig')
print("✓ Overall MERF report table saved")

# ===========================================
# 10. Build the HTML analysis report
# ===========================================
print("\n10. Generating the HTML analysis report...")
try:
    # Page head, styles, and analysis summary (CSS braces are doubled inside the f-string)
    html_content = f"""
    <!DOCTYPE html>
    <html lang="en">
    <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Mixed Effects Random Forest (MERF) Analysis Report</title>
        <style>
            body {{
                font-family: 'Microsoft YaHei', Arial, sans-serif;
                line-height: 1.6;
                color: #333;
                max-width: 1200px;
                margin: 0 auto;
                padding: 20px;
                background-color: #f5f5f5;
            }}
            .container {{
                background-color: white;
                padding: 30px;
                border-radius: 10px;
                box-shadow: 0 0 20px rgba(0,0,0,0.1);
            }}
            h1, h2, h3 {{
                color: #2c3e50;
                border-bottom: 2px solid #3498db;
                padding-bottom: 10px;
            }}
            h1 {{
                text-align: center;
                color: #2980b9;
            }}
            .summary {{
                background-color: #ecf0f1;
                padding: 20px;
                border-radius: 5px;
                margin: 20px 0;
            }}
            table {{
                width: 100%;
                border-collapse: collapse;
                margin: 20px 0;
            }}
            th, td {{
                padding: 12px;
                text-align: left;
                border-bottom: 1px solid #ddd;
            }}
            th {{
                background-color: #3498db;
                color: white;
            }}
            tr:hover {{
                background-color: #f5f5f5;
            }}
            .figure {{
                text-align: center;
                margin: 30px 0;
            }}
            .figure img {{
                max-width: 100%;
                height: auto;
                border: 1px solid #ddd;
                border-radius: 5px;
                box-shadow: 0 0 10px rgba(0,0,0,0.1);
            }}
            .caption {{
                font-style: italic;
                color: #7f8c8d;
                margin-top: 10px;
            }}
            .highlight {{
                background-color: #fffacd;
                padding: 10px;
                border-left: 4px solid #f39c12;
                margin: 20px 0;
            }}
            .footer {{
                text-align: center;
                margin-top: 40px;
                padding-top: 20px;
                border-top: 1px solid #ddd;
                color: #7f8c8d;
            }}
        </style>
    </head>
    <body>
        <div class="container">
            <h1>Mixed Effects Random Forest (MERF) Analysis Report</h1>
            <p style="text-align: center; color: #7f8c8d;">Generated on: {datetime.now().strftime('%Y-%m-%d')}</p>
            <div class="summary">
                <h2>📊 Analysis summary</h2>
                <p>This analysis uses a mixed effects random forest (MERF), which combines the predictive power of a random forest with the ability of mixed effects models to handle longitudinal data.</p>
                <ul>
                    <li><strong>Number of individuals: </strong>{merf_data_complete['ID'].nunique()}</li>
                    <li><strong>Number of time points: </strong>{merf_data_complete['Time'].nunique()}</li>
                    <li><strong>Total sample size: </strong>{len(merf_data_complete)}</li>
                    <li><strong>Training-set size: </strong>{len(train_data)}</li>
                    <li><strong>Test-set size: </strong>{len(test_data)}</li>
                </ul>
            </div>
            <h2>📈 Model performance</h2>"""

    # Model performance table
    if not model_performance_df.empty:
        html_content += """
            <table>
                <thead>
                    <tr>
                        <th>Model</th>
                        <th>Mean absolute error (MAE)</th>
                        <th>Root mean squared error (RMSE)</th>
                        <th>R-squared (R²)</th>
                        <th>Number of trees</th>
                    </tr>
                </thead>
                <tbody>"""
        for _, row in model_performance_df.iterrows():
            html_content += f"""
                    <tr>
                        <td>{row['Model']}</td>
                        <td>{row['MAE']:.3f}</td>
                        <td>{row['RMSE']:.3f}</td>
                        <td>{row['R2']:.3f}</td>
                        <td>{row['N_Trees']}</td>
                    </tr>"""
        html_content += """
                </tbody>
            </table>"""

    # Best model
    if 'best_model_name' in locals():
        html_content += f"""
            <div class="highlight">
                <h3>🏆 Best model</h3>
                <p><strong>{best_model_name}</strong> performed best on the test set.</p>"""
        if 'test_performance' in locals():
            html_content += f"""
                <p><strong>Test-set performance:</strong></p>
                <ul>
                    <li>MAE: {test_performance['MAE']:.3f}</li>
                    <li>RMSE: {test_performance['RMSE']:.3f}</li>
                    <li>R²: {test_performance['R2']:.3f}</li>
                </ul>"""
        html_content += """
            </div>"""

    # Variable importance
    if 'feature_importance' in locals():
        html_content += """
            <h2>🔝 Variable importance</h2>
            <p>The 10 most important variables in the model:</p>
            <table>
                <thead>
                    <tr>
                        <th>Rank</th>
                        <th>Variable</th>
                        <th>Importance score</th>
                    </tr>
                </thead>
                <tbody>"""
        for i, (_, row) in enumerate(feature_importance.head(10).iterrows(), 1):
            html_content += f"""
                    <tr>
                        <td>{i}</td>
                        <td>{row['Variable']}</td>
                        <td>{row['Importance']:.3f}</td>
                    </tr>"""
        html_content += """
                </tbody>
            </table>"""

    # Random-effects analysis
    if 'random_effects' in train_data.columns:
        html_content += f"""
            <h2>🎲 Random-effects analysis</h2>
            <p>Random effects capture heterogeneity between individuals:</p>
            <ul>
                <li><strong>Random effect mean: </strong>{train_data['random_effects'].mean():.3f}</li>
                <li><strong>Random effect SD: </strong>{train_data['random_effects'].std():.3f}</li>
                <li><strong>Random effect variance: </strong>{train_data['random_effects'].var():.3f}</li>
            </ul>
            <p>A large random-effect variance indicates substantial differences between individuals, so individual-specific effects should be modeled.</p>"""

    # Visualization results
    html_content += """
        <h2>📊 Visualization results</h2>
        <p>The main figures produced by the analysis:</p>"""

    # Embed each figure that exists on disk (file names are kept as saved by the script)
    figure_files = {
        "变量重要性图.png": "Variable importance plot: contribution of each predictor to the model's predictions",
        "预测vs观测图.png": "Predicted vs. observed scatter plot: goodness of fit and prediction accuracy",
        "时间趋势预测图.png": "Time-trend prediction plot: predictive performance at each time point",
        "治疗组比较图.png": "Treatment-group comparison plot: observed and predicted means by treatment group",
        "随机效应分布图.png": "Random-effects distribution plot: distribution of random effects across individuals"
    }
    for fig_file, description in figure_files.items():
        fig_path = figures_dir / fig_file
        if fig_path.exists():
            html_content += f"""
            <div class="figure">
                <h3>{fig_file.replace('.png', '')}</h3>
                <img src="{fig_path}" alt="{fig_file}">
                <p class="caption">{description}</p>
            </div>"""

    # Public health relevance
    html_content += """
        <h2>🏥 Public health relevance and applications</h2>
        <div class="highlight">
            <p>Mixed effects random forests offer particular advantages in public health:</p>
            <ul>
                <li><strong>Electronic health record analysis: </strong>extracting predictive signal from many longitudinal variables with complex interactions</li>
                <li><strong>Chronic disease management and prediction: </strong>handling complex nonlinear relationships in long-term follow-up data</li>
                <li><strong>Health intervention evaluation: </strong>assessing nonlinear effects of complex interventions</li>
                <li><strong>Public health surveillance and early warning: </strong>integrating heterogeneous data sources to improve outbreak prediction</li>
            </ul>
        </div>
        <h2>💡 Conclusions and recommendations</h2>
        <div class="highlight">
            <h3>Main conclusions</h3>"""

    # Conclusions
    if not model_performance_df.empty:
        best_r2 = model_performance_df['R2'].max()
        html_content += f"""
            <p>The mixed effects random forest reached a training-set R² of {best_r2:.3f}.</p>"""
    if 'feature_importance' in locals():
        top_var = feature_importance.iloc[0]['Variable']
        html_content += f"""
            <p>The most important predictor is <strong>{top_var}</strong>, which contributes most to the model's predictions.</p>"""

    # Recommendations and technical details (static text, no interpolation needed)
    html_content += """
            <h3>Recommendations</h3>
            <ul>
                <li>Consider richer random-effects structures, such as random slopes in addition to random intercepts</li>
                <li>Try other machine learning algorithms, such as gradient-boosted trees or neural networks</li>
                <li>Tune hyperparameters: use cross-validation to optimize the random forest settings</li>
                <li>Account for temporal dependence: add autoregressive terms or lagged variables</li>
                <li>External validation: check model performance on an independent data set</li>
            </ul>
        </div>
        <h2>🔧 Technical information</h2>
        <table>
            <tr>
                <th>Analysis tool</th>
                <td>Python 3.12</td>
            </tr>
            <tr>
                <th>Main libraries</th>
                <td>scikit-learn, statsmodels, pandas, numpy, matplotlib, seaborn</td>
            </tr>
            <tr>
                <th>Method</th>
                <td>Random forest + linear mixed model</td>
            </tr>
            <tr>
                <th>Random seed</th>
                <td>1234</td>
            </tr>
        </table>"""

    # Footer: an f-string, so the timestamp and output path are actually interpolated
    html_content += f"""
        <div class="footer">
            <p>Mixed Effects Random Forest (MERF) Analysis Report</p>
            <p>Generated at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
            <p>Results saved to: {result_dir}</p>
        </div>
    </div>
    </body>
    </html>"""

    # Save the HTML report
    report_file = reports_dir / f"混合效应随机森林分析报告_{datetime.now().strftime('%Y%m%d')}.html"
    with open(report_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    print(f"✓ HTML analysis report saved: {report_file}")
except Exception as e:
    print(f"✗ Failed to generate the HTML report: {e}")

# ===========================================
# 11. Save the workspace
# ===========================================
print("\n11. Saving the workspace...")
try:
    # Key objects to persist
    save_objects = {
        'merf_data_complete': merf_data_complete,
        'train_data': train_data,
        'test_data': test_data,
        'X_train': X_train,
        'y_train': y_train,
        'X_test': X_test,
        'y_test': y_test,
        'method1_models': method1_models if method1_models else None,
        'method2_model': method2_model if method2_model else None,
        'method3_model': method3_model if method3_model else None,
        'model_performance_df': model_performance_df if 'model_performance_df' in locals() else None,
        'feature_importance': feature_importance if 'feature_importance' in locals() else None
    }

    # Save as a pickle file
    with open(result_dir / "MERF_Analysis_Workspace.pkl", 'wb') as f:
        pickle.dump(save_objects, f)
    print("✓ Full workspace saved: MERF_Analysis_Workspace.pkl")

    # Also save a JSON summary (plain data only)
    json_data = {
        'analysis_info': {
            'n_subjects': merf_data_complete['ID'].nunique(),
            'n_timepoints': merf_data_complete['Time'].nunique(),
            'total_samples': len(merf_data_complete),
            'train_samples': len(train_data),
            'test_samples': len(test_data),
            'best_model': best_model_name if 'best_model_name' in locals() else 'N/A',
            'best_r2': float(model_performance_df['R2'].max()) if not model_performance_df.empty else 0.0
        }
    }
    with open(result_dir / "MERF_Analysis_Summary.json", 'w', encoding='utf-8') as f:
        json.dump(json_data, f, ensure_ascii=False, indent=2)
    print("✓ Analysis summary saved: MERF_Analysis_Summary.json")
except Exception as e:
    print(f"✗ Failed to save the workspace: {e}")

# ===========================================
# 12. Final summary
# ===========================================
print("\n" + "=" * 70)
print("Mixed effects random forest (MERF) analysis complete!")
print("=" * 70)
print(f"\nResult files (saved to: {result_dir}):")

# Check which table files were produced
result_files = [
    "模型性能比较.csv",
    "模型性能汇总表.csv",
    "变量重要性表.csv",
    "变量重要性汇总表.csv",
    "随机效应统计表.csv",
    "个体随机效应表.csv",
    "预测误差分析表.csv",
    "按时间点预测误差表.csv",
    "治疗组效果分析表.csv",
    "模型参数设置表.csv",
    "混合效应随机森林分析综合报告表.csv"
]
print("\n📊 Table files:")
for file in result_files:
    file_path = tables_dir / file
    if file_path.exists():
        print(f"  ✓ {file_path}")
    else:
        print(f"  ✗ {file} (not generated)")

print("\n🖼️ Figure files:")
visual_files = [
    "变量重要性图.png",
    "部分依赖图.png",
    "预测vs观测图.png",
    "残差图.png",
    "随机效应分布图.png",
    "个体随机效应图.png",
    "时间趋势预测图.png",
    "治疗组比较图.png",
    "变量关系散点图矩阵.png",
    "模型性能对比图.png"
]
for file in visual_files:
    file_path = figures_dir / file
    if file_path.exists():
        print(f"  ✓ {file_path}")
    else:
        print(f"  ✗ {file} (not generated)")

print("\n🤖 Model files:")
model_files = list(models_dir.glob("*.pkl"))
for file in model_files:
    print(f"  ✓ {file}")

print("\n📋 Report files:")
report_files = list(reports_dir.glob("*.html")) + list(reports_dir.glob("*.pdf"))
for file in report_files:
    print(f"  ✓ {file}")

print("\n💾 Workspace files:")
workspace_files = list(result_dir.glob("*.pkl")) + list(result_dir.glob("*.json"))
for file in workspace_files:
    print(f"  ✓ {file}")

# Key statistics
print("\n📈 Key statistics:")
print(f"  Number of individuals: {merf_data_complete['ID'].nunique()}")
print(f"  Number of time points: {merf_data_complete['Time'].nunique()}")
print(f"  Total sample size: {len(merf_data_complete)}")
print(f"  Training-set size: {len(train_data)}")
print(f"  Test-set size: {len(test_data)}")
if not model_performance_df.empty:
    print(f"  Best model R²: {model_performance_df['R2'].max():.3f}")
    print(f"  Best model: {model_performance_df.loc[model_performance_df['R2'].idxmax(), 'Model']}")

print("\n" + "=" * 70)
print("Analysis complete! All results are saved in the folders listed above.")
print("=" * 70)

# Try to open the results folder in the platform's file manager
import platform
import subprocess

system_name = platform.system()
try:
    if system_name == "Windows":
        os.startfile(result_dir)
    elif system_name == "Darwin":  # macOS
        subprocess.run(['open', str(result_dir)])
    elif system_name == "Linux":
        subprocess.run(['xdg-open', str(result_dir)])
except Exception:
    # Opening the folder is a convenience only; ignore failures
    pass
print(f"\nResults folder: {result_dir}")
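
The prediction-error table in step 9.4 computes the mean absolute percentage error by dividing by the observed value, which breaks down whenever an observation equals zero. A minimal sketch of a zero-safe variant is shown here. The function name `safe_mape` and the choice to skip zero-valued observations are assumptions made for illustration and are not part of the original script.

```python
import numpy as np


def safe_mape(y_true, y_pred):
    """MAPE in percent, computed only where the observed value is non-zero.

    Hypothetical helper: returns NaN when every observed value is zero.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    nonzero = y_true != 0
    if not nonzero.any():
        return float("nan")
    ratio = np.abs((y_true[nonzero] - y_pred[nonzero]) / y_true[nonzero])
    return float(np.mean(ratio) * 100)


# Toy values: the second observation is zero and is skipped
observed = [10.0, 0.0, 20.0]
predicted = [9.0, 1.0, 22.0]
print(safe_mape(observed, predicted))
```

Skipping zeros changes which observations the metric covers, so it may be worth reporting the number of skipped rows alongside the value.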

Longitudinal data are repeated measurements taken on the same group of individuals at multiple time points.
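
As a small illustration of this definition, the sketch below builds a toy data set in long format (one row per individual per time point) and counts individuals and time points the same way the script does with `nunique()`. The values are made up, and the column names `ID`, `Time`, and `Indicator1` simply mirror those used in the script.

```python
import pandas as pd

# Hypothetical long-format data: one row per individual per visit
long_df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "Time": [0, 1, 2, 0, 1, 2],
    "Indicator1": [5.1, 5.6, 6.0, 4.8, 4.9, 5.3],
})

n_subjects = long_df["ID"].nunique()       # distinct individuals
n_timepoints = long_df["Time"].nunique()   # distinct measurement occasions
obs_per_subject = long_df.groupby("ID").size()

# Wide view: one row per individual, one column per time point
wide_df = long_df.pivot(index="ID", columns="Time", values="Indicator1")

print(n_subjects, n_timepoints)
print(wide_df)
```

Most mixed-model and tree-based tools expect the long layout; the wide layout is mainly convenient for checking missing visits.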

1 Mixed Effects Model (MEM)

2 Fixed Effects Model (FEM)

3 Hierarchical Linear Model (HLM)

4 Generalized Estimating Equations (GEE)

5 Generalized Linear Mixed Models (GLMM)

6 Latent Growth Curve Model (LGCM)

7 Group-Based Trajectory Model (GBTM)

8 Cross-lagged (Panel) Model (CLPM)

9 Repeated Measures ANOVA / MANOVA (RM-ANOVA / RM-MANOVA)

10 Nonlinear Mixed-Effects Models (NLME)

11 Joint Models (JM)

12 Structural Equation Modeling (SEM)

13 Generalized Additive Models (GAM)

14 Latent Class Models (LCM)

15 Latent Profile Analysis (LPA)

16 State Space Models (SSM)

17 Longitudinal Factor Analysis (LFA)

18 Bayesian Longitudinal Models (Bayesian Models)

19 Mixed Effects Random Forest (MERF)

20 Longitudinal Gradient Boosting (Longitudinal GBM)

21 K-means Longitudinal Clustering (DTW-KMeans)

22 Model-Based Clustering (MB-CLUST)
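
Item 19 in this list, the mixed effects random forest, is described in the script's report as a random forest combined with a random intercept per individual. The two-step idea can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the script's actual implementation: the data are simulated, the column names are hypothetical, and the per-individual mean residual stands in for the random intercept that a linear mixed model would estimate (a mixed model would additionally shrink these intercepts toward zero).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1234)

# Simulated longitudinal data (hypothetical): 30 individuals, 5 visits each
n_subjects, n_visits = 30, 5
ids = np.repeat(np.arange(n_subjects), n_visits)
time = np.tile(np.arange(n_visits), n_subjects)
indicator2 = rng.normal(0.0, 1.0, len(ids))                          # time-varying covariate
subject_shift = np.repeat(rng.normal(0.0, 2.0, n_subjects), n_visits)  # true random intercepts
y = 0.5 * time + 0.8 * indicator2 + subject_shift + rng.normal(0.0, 0.5, len(ids))

df = pd.DataFrame({"ID": ids, "Time": time, "Indicator2": indicator2, "y": y})

# Step 1: a random forest models the fixed-effect part
features = ["Time", "Indicator2"]
forest = RandomForestRegressor(n_estimators=50, random_state=1234)
forest.fit(df[features], df["y"])
df["fixed_pred"] = forest.predict(df[features])

# Step 2: crude random intercept = mean residual per individual
# (stand-in for the intercepts a linear mixed model would estimate)
df["resid"] = df["y"] - df["fixed_pred"]
df["random_intercept"] = df.groupby("ID")["resid"].transform("mean")

# Combined prediction = fixed part + individual-specific intercept
df["combined_pred"] = df["fixed_pred"] + df["random_intercept"]

mse_fixed = float(np.mean((df["y"] - df["fixed_pred"]) ** 2))
mse_combined = float(np.mean((df["y"] - df["combined_pred"]) ** 2))
print(f"In-sample MSE, forest only: {mse_fixed:.3f}")
print(f"In-sample MSE, forest + intercepts: {mse_combined:.3f}")
```

The comparison is in-sample and only illustrates the mechanics. Judging predictive gain would require held-out visits or held-out individuals, and for new individuals no intercept is available, so only the forest part applies.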
