DID(Difference-in-Differences,双重差分)用于评估“某项政策/冲击”带来的因果影响。核心思路是比较两次差值:
如果处理组在政策后“多出来”的那部分变化显著,就可以把它解释为政策效应。
DID 的关键识别假设是“平行趋势”:
实证中通常做两件事:
变量含义:
其中 就是最关注的政策效应。
id(个体)和 year(时间)两个维度treat 要有组间差异(不能全是 0 或全是 1)post 要有时期间差异(不能全是 0 或全是 1)import osimport numpy as npimport pandas as pdimport statsmodels.formula.api as smfimport matplotlib as mplimport matplotlib.pyplot as pltimport seaborn as snsfrom matplotlib import font_manager as fm# ========= 0) 主题与中文字体 =========sns.set_theme(style="whitegrid", context="talk")def setup_cn_font(): candidate_paths = [ "/System/Library/Fonts/Hiragino Sans GB.ttc", "/System/Library/Fonts/Supplemental/Songti.ttc", "/Library/Fonts/Arial Unicode.ttf", "/System/Library/Fonts/STHeiti Light.ttc", ] chosen_path = None for p in candidate_paths: if os.path.exists(p): fm.fontManager.addfont(p) chosen_path = p break if chosen_path is None: raise RuntimeError("未找到可用中文字体,请先安装中文字体(如 Hiragino/Songti/Noto Sans CJK)。") chosen_name = fm.FontProperties(fname=chosen_path).get_name() mpl.rcParams["font.family"] = "sans-serif" mpl.rcParams["font.sans-serif"] = [chosen_name] mpl.rcParams["axes.unicode_minus"] = False mpl.rcParams["pdf.fonttype"] = 42 mpl.rcParams["ps.fonttype"] = 42 print(f"Using Chinese font: {chosen_name} ({chosen_path})") return fm.FontProperties(fname=chosen_path)CN_FONT = setup_cn_font()# ========= 1) 构造示例数据 =========np.random.seed(42)n_id = 120years = np.arange(2015, 2023)policy_year = 2019rows = []for i in range(n_id): treat = 1 if i < n_id // 2 else 0 alpha_i = np.random.normal(0, 1) for y in years: post = 1 if y >= policy_year else 0 x1 = np.random.normal(0, 1) x2 = np.random.normal(0, 1) tau = 0.8 eps = np.random.normal(0, 1) y_out = 2 + alpha_i + 0.2*(y-2015) + tau*(treat*post) + 0.5*x1 - 0.3*x2 + eps rows.append([i, y, treat, post, x1, x2, y_out])df = pd.DataFrame(rows, columns=["id", "year", "treat", "post", "x1", "x2", "y"])df["did"] = df["treat"] * df["post"]# ========= 2) 基准 DID =========model = smf.ols("y ~ did + x1 + x2 + C(id) + C(year)", data=df).fit( cov_type="cluster", cov_kwds={"groups": df["id"]})key_vars = ["did", "x1", "x2"]result_table = pd.DataFrame({ "coef": model.params[key_vars], "std_err": model.bse[key_vars], "p_value": model.pvalues[key_vars]})print("\n=== 基准回归核心结果 ===")print(result_table)# ========= 3) 事件研究 =========df["rel_year"] = df["year"] - policy_yeardef ev_name(k): return f"event_m{abs(k)}" if k < 0 else f"event_p{k}"for k in range(-3, 4): if k != -1: col = ev_name(k) df[col] = ((df["rel_year"] == k) & (df["treat"] == 1)).astype(int)event_cols = [ev_name(k) for k in range(-3, 4) if k != -1]formula_es = "y ~ " + " + ".join(event_cols) + " + x1 + x2 + C(id) + C(year)"es_model = smf.ols(formula_es, data=df).fit( cov_type="cluster", cov_kwds={"groups": df["id"]})# ========= 4) 事件研究图 =========ks = [k for k in range(-3, 4) if k != -1]coef = np.array([es_model.params.get(ev_name(k), np.nan) for k in ks])se = np.array([es_model.bse.get(ev_name(k), np.nan) for k in ks])fig, ax = plt.subplots(figsize=(9, 5), dpi=150)ax.axhline(0, color="black", linewidth=1.2)ax.axvline(-1, color="gray", linestyle="--", linewidth=1.2)ax.errorbar(ks, coef, yerr=1.96*se, fmt="o-", capsize=4, color="#D55E00")ax.set_title("事件研究:动态政策效应", fontproperties=CN_FONT, pad=10)ax.set_xlabel("事件时间 k(-1期为基准)", fontproperties=CN_FONT)ax.set_ylabel("系数估计值", fontproperties=CN_FONT)for t in ax.get_xticklabels() + ax.get_yticklabels(): t.set_fontproperties(CN_FONT)fig.tight_layout()fig.savefig("event_study.png", dpi=300, bbox_inches="tight")plt.show()# ========= 5) 核心系数图 =========coef_df = pd.DataFrame({ "变量": ["DID效应", "控制变量 x1", "控制变量 x2"], "coef": [model.params["did"], model.params["x1"], model.params["x2"]], "se": [model.bse["did"], model.bse["x1"], model.bse["x2"]]})coef_df["ci_low"] = coef_df["coef"] - 1.96 * coef_df["se"]coef_df["ci_high"] = coef_df["coef"] + 1.96 * coef_df["se"]fig, ax = plt.subplots(figsize=(8.5, 4.8), dpi=150)ypos = np.arange(len(coef_df))ax.errorbar( coef_df["coef"], ypos, xerr=[coef_df["coef"] - coef_df["ci_low"], coef_df["ci_high"] - coef_df["coef"]], fmt="o", capsize=5, color="#0072B2", ecolor="#56B4E9")ax.axvline(0, color="black", linewidth=1.1)ax.set_yticks(ypos)ax.set_yticklabels(coef_df["变量"], fontproperties=CN_FONT)ax.set_xlabel("系数估计值", fontproperties=CN_FONT)ax.set_title("核心回归系数(95%置信区间)", fontproperties=CN_FONT, pad=10)for t in ax.get_xticklabels(): t.set_fontproperties(CN_FONT)fig.tight_layout()fig.savefig("did_key_results.png", dpi=300, bbox_inches="tight")plt.show()print("\n图片已保存:event_study.png, did_key_results.png")did 系数显著为正:政策使结果变量上升did 系数显著为负:政策使结果变量下降PatsyError: numbers besides '0' and '1' are only allowed with **原因通常是列名含 event_-3 这种负号命名。解决:改成 event_m3、event_p0 这类命名。covariance of constraints does not have full rank在“固定效应很多 + 聚类标准误”时较常见,通常不影响核心系数读取。建议重点查看 did、x1、x2 的系数、标准误和 p 值。

df = pd.read_csv("你的数据.csv")id, year, treat, post, y 这些字段DID 难点不在公式本身,而在于数据组织、模型设定和结果表达是否规范。把这套流程跑顺后,你就能快速迁移到真实课题,稳定产出可复现的实证结果。
完整代码:https://www.ppmandata.cn/codeBase/2029458496147169295
⬇️点击阅读原文可直接跳转