01 什么是断点回归(RDD)?
断点回归(Regression Discontinuity Design)是一种准实验方法,利用"临界值"附近的自然实验来识别因果效应。
核心思想
如果个体在临界值的一侧vs另一侧几乎是随机分配的,那么临界值处的"跳跃"就是处理效应的估计。
经典应用场景
02 Sharp RDD vs Fuzzy RDD
Sharp RDD(清晰断点)
定义:处理状态在临界值处确定性跳跃
D = 0 if X < cD = 1 if X >= c
适用场景:政策明确规定了资格门槛
举例:新《劳动合同法》规定,月薪超过5000元的员工享有特定权益 → 5000元就是Sharp断点
Fuzzy RDD(模糊断点)
定义:处理概率在临界值处跳跃但非100%
lim P(D=1 | X→c⁺) ≠ lim P(D=1 | X→c⁻)
适用场景:资格门槛不严格强制执行
举例:贫困线附近的家庭可能"不上报"收入 → 处理概率跳跃但不是0/1
03 核心概念:带宽与核函数
为什么要带宽?
RDD本质上是在临界值附近做局部实验,只使用"接近"断点的观测。
带宽选择 = 精度 vs 偏差的权衡 带宽太宽 → 包含远处观测 → 偏差增大 带宽太窄 → 样本量太小 → 方差增大
带宽选择方法
| | |
|---|
| MSE最优 | | |
| Imbens-Kalyanaraman | | |
| 不同带宽检验 | | |
核函数
核函数决定带宽内观测的权重:
# 三角核 (Triangle) - 默认推荐K(u) = 1 - |u| if |u| <= 1K(u) = 0 otherwise# 均匀核 (Uniform)K(u) = 0.5 if |u| <= 1# Epanechnikov核K(u) = 0.75(1-u²) if |u| <= 1
04 Python实操:完整代码
数据生成
import numpy as npimport pandas as pddef generate_rdd_data(n=2000, cutoff=0, treatment_effect=2, seed=42): np.random.seed(seed) # 驱动变量 (Running Variable) - 断点附近的分配接近随机 running_var = np.random.uniform(-1, 1, n) # 潜在结果 Y(0) 和 Y(1) error = np.random.normal(0, 0.5, n) Y0 = 1 + 2 * running_var + error Y1 = Y0 + treatment_effect # 真实处理效应 # 处理变量 (Sharp RDD: 确定性的 0/1) treatment = (running_var >= cutoff).astype(int) # 观测结果 outcome = treatment * Y1 + (1 - treatment) * Y0 return pd.DataFrame({ 'running_var': running_var, 'outcome': outcome, 'treatment': treatment })
Sharp RDD 估计
class SharpRDD: """Sharp RDD 估计器""" def __init__(self, data, running_var, outcome, cutoff=0): self.data = data.copy() self.x = data[running_var].values self.y = data[outcome].values self.cutoff = cutoff self.x_centered = self.x - cutoff def local_linear_regression(self, bandwidth=0.3, kernel='triangular'): """ 局部线性回归 (LLR) 参数: ------ bandwidth : 带宽 kernel : 核函数类型 """ # 计算核权重 weights = self._kernel_weights(self.x_centered, bandwidth, kernel) # 带宽内样本 mask = np.abs(self.x_centered) <= bandwidth x_c = self.x_centered[mask] y = y[mask] w = weights[mask] # 加权最小二乘 X = np.column_stack([np.ones(len(x_c)), x_c]) XW = X * np.sqrt(w)[:, np.newaxis] yW = y * np.sqrt(w) beta = np.linalg.lstsq(XW, yW, rcond=None)[0] # 处理效应 = 断点处的跳跃 tau = beta[0] # 截距项即为断点跳跃 # 标准误 residuals = y - X @ beta sigma2 = np.sum(w * residuals**2) / (len(y) - X.shape[1]) var_beta = sigma2 * np.linalg.inv(XW.T @ XW) se = np.sqrt(var_beta[0, 0]) return tau, se def _kernel_weights(self, x, bandwidth, kernel): """核函数权重计算""" x_norm = np.abs(x / bandwidth) if kernel == 'triangular': return np.where(x_norm <= 1, 1 - x_norm, 0) elif kernel == 'uniform': return np.where(x_norm <= 1, 0.5, 0) elif kernel == 'epanechnikov': return np.where(x_norm <= 1, 0.75 * (1 - x_norm**2), 0)
Fuzzy RDD 估计(2SLS框架)
class FuzzyRDD: """Fuzzy RDD = 工具变量框架下的RDD""" def estimate(self, bandwidth=0.3): """ 两阶段最小二乘 (2SLS) 阶段1: D = α + βX + γ·1{X≥c} + ε 预测处理状态 阶段2: Y = a + bX + τ·D̂ + u 使用预测值估计因果效应 (LATE) """ indicator = (self.x >= self.cutoff).astype(float) mask = np.abs(self.x_centered) <= bandwidth x_c = self.x_centered[mask] y = self.y[mask] d = self.d[mask] z = indicator[mask] # 工具变量 = 断点指示 # 阶段1: 预测处理 X1 = np.column_stack([np.ones(len(x_c)), x_c, z]) d_pred = X1 @ np.linalg.lstsq(X1, d, rcond=None)[0] # 阶段2: IV估计 X2 = np.column_stack([np.ones(len(x_c)), x_c, d_pred]) coef2 = np.linalg.lstsq(X2, y, rcond=None)[0] tau = coef2[2] # 处理效应 (LATE) # 标准误 & F统计量 residuals = y - X2 @ coef2 n, k = len(y), X2.shape[1] sigma2 = np.sum(residuals**2) / (n - k) var_covar = sigma2 * np.linalg.inv(X2.T @ X2) se = np.sqrt(var_covar[2, 2]) # 第一阶段F统计量 (检验工具变量强度) ssr1 = np.sum((d - d.mean())**2) ssr2 = np.sum((d - d_pred)**2) f_stat = ((ssr1 - ssr2) / 1) / (ssr2 / (n - 3)) return tau, se, f_stat
05 诊断检验(必做!)
RDD结果的可信度依赖于三大假设:
假设1:驱动变量连续性(无Manipulation)
McCrary密度检验:检验个体是否"操控"驱动变量
def density_test(self, bandwidth=0.3): """ McCrary (2008) 密度检验 H0: 密度在断点处连续 """ left_mask = (self.x < self.cutoff) & (self.x >= self.cutoff - bandwidth) right_mask = (self.x >= self.cutoff) & (self.x <= self.cutoff + bandwidth) n_left, n_right = left_mask.sum(), right_mask.sum() log_diff = np.log(n_right / n_left) se = np.sqrt(1/n_left + 1/n_right) z_stat = log_diff / se p_value = 2 * (1 - norm.cdf(abs(z_stat))) return { 'z_stat': z_stat, 'p_value': p_value, 'conclusion': '连续 ✓' if p_value > 0.1 else '可能Manipulation ✗' }
解读:
- • p > 0.1 → 不能拒绝H0,密度连续,RDD可信
假设2:协变量前定(平衡性检验)
协变量在断点两侧应该相似
def covariate_balance_test(self, covariates, bandwidth=0.3): """ 检验协变量在断点两侧是否平衡 H0: 协变量均值相等 """ results = {} for cov in covariates: cov_vals = self.data[cov].values indicator = (self.x >= self.cutoff).astype(int) mask = np.abs(self.x_centered) <= bandwidth cov_left = cov_vals[(mask) & (indicator == 0)] cov_right = cov_vals[(mask) & (indicator == 1)] diff = np.mean(cov_right) - np.mean(cov_left) se = np.sqrt(np.var(cov_left)/len(cov_left) + np.var(cov_right)/len(cov_right)) t_stat = diff / se p_value = 2 * (1 - norm.cdf(abs(t_stat))) results[cov] = {'diff': diff, 'p_value': p_value, 'balanced': p_value > 0.1} return results
解读:
- • p > 0.1 → 协变量平衡,处理组和对照组可比
假设3:结果趋势连续(安慰剂检验)
在假断点处估计,应该不显著
def placebo_tests(self, outcome, n_placebos=20): """ 安慰剂检验:在假断点处进行RDD估计 真实效应应该只在真实断点处显著 """ placebos = np.linspace(-0.8, 0.8, n_placebos + 1) placebos = [p for p in placebos if abs(p) > 0.1] # 排除真实断点 tau_placebo = [] for p in placebos: rdd = SharpRDD(self.data, 'running_var', outcome, cutoff=p) tau, _ = rdd.local_linear_regression(bandwidth=0.3) tau_placebo.append(tau) # 计算真实效应在安慰剂分布中的位置 valid_taus = [t for t in tau_placebo if not np.isnan(t)] percentile = sum(t < self.true_tau for t in valid_taus) / len(valid_taus) return { 'placebo_taus': tau_placebo, 'percentile': percentile, 'passed': percentile > 0.05 }
解读:
- • 真实效应位于安慰剂分布的5%-95%之外 → 效应不稳健
- • 真实效应位于95%置信区间内 → 通过安慰剂检验 ✓
假设4:带宽敏感性
不同带宽下结果应该稳健
def bandwidth_sensitivity(self, outcome, bandwidths=None): """ 带宽敏感性分析 检验不同带宽下估计结果是否一致 """ if bandwidths is None: bandwidths = np.linspace(0.1, 0.6, 10) results = [] for bw in bandwidths: rdd = SharpRDD(self.data, 'running_var', outcome, cutoff=0) tau, se = rdd.local_linear_regression(bandwidth=bw) results.append({ 'bandwidth': bw, 'tau': tau, 'se': se, 'ci_lower': tau - 1.96*se, 'ci_upper': tau + 1.96*se }) return pd.DataFrame(results)
06 可视化
图1:RDD散点图 + 拟合线
def plot_rdd_scatter(data, x, y, cutoff=0, bandwidth=0.3): fig, ax = plt.subplots(figsize=(10, 6)) # 散点图(透明度显示密度) ax.scatter(data[x], data[y], alpha=0.2, s=10, color='gray') # 分箱均值 binned = data.groupby(pd.cut(data[x], 30))[y].mean() ax.plot(binned.index.astype(str), binned.values, 'o-', color='black', markersize=8) # 断点线 ax.axvline(x=cutoff, color='red', linestyle='--', linewidth=2) # 分段拟合线 left_mask = data[x] < cutoff right_mask = data[x] >= cutoff # 左侧拟合 x_left = data.loc[left_mask, x] y_left = data.loc[left_mask, y] z_left = np.polyfit(x_left, y_left, 1) ax.plot(x_left, np.polyval(z_left, x_left), 'b-', linewidth=2) # 右侧拟合 x_right = data.loc[right_mask, x] y_right = data.loc[right_mask, y] z_right = np.polyfit(x_right, y_right, 1) ax.plot(x_right, np.polyval(z_right, x_right), 'r-', linewidth=2) ax.set_xlabel(x) ax.set_ylabel(y) ax.set_title('RDD: Discontinuity at Cutoff') plt.savefig('rdd_scatter.png', dpi=150)
效果示意:
图2:事件研究图
def plot_event_study(data, outcome, cutoff=0): # 分时间窗口计算均值和置信区间 bins = np.linspace(data['running_var'].min(), data['running_var'].max(), 30) results = [] for i in range(len(bins)-1): mask = (data['running_var'] >= bins[i]) & (data['running_var'] < bins[i+1]) if mask.sum() > 0: results.append({ 'bin_center': (bins[i] + bins[i+1]) / 2, 'mean_y': data.loc[mask, outcome].mean(), 'se_y': data.loc[mask, outcome].std() / np.sqrt(mask.sum()) }) df = pd.DataFrame(results) fig, ax = plt.subplots(figsize=(12, 6)) ax.errorbar(df['bin_center'], df['mean_y'], yerr=1.96*df['se_y'], fmt='o', color='black', capsize=3) ax.axvline(x=cutoff, color='green', linestyle='--', linewidth=2) ax.set_xlabel('Running Variable') ax.set_ylabel(outcome) ax.set_title('Event Study: No Pre-trend') plt.savefig('rdd_event_study.png', dpi=150)
效果示意:
图3:安慰剂检验分布
def plot_placebo_distribution(placebo_taus, true_tau): fig, ax = plt.subplots(figsize=(10, 6)) ax.hist(placebo_taus, bins=20, color='gray', alpha=0.7) ax.axvline(x=true_tau, color='red', linestyle='--', linewidth=2, label=f'True Effect = {true_tau:.3f}') ax.set_xlabel('Estimated Effect') ax.set_ylabel('Frequency') ax.set_title('Placebo Test: No Effect at False Cutoffs') plt.savefig('placebo_test.png', dpi=150)
效果示意:
07 完整分析流程演示
运行结果
# 生成模拟数据df = generate_rdd_data(n=2000, treatment_effect=2)# Sharp RDD估计rdd = SharpRDD(df, 'running_var', 'outcome', cutoff=0)tau, se = rdd.local_linear_regression(bandwidth=0.3)print(f"估计效应: {tau:.4f} (SE={se:.4f}, t={tau/se:.2f})")# 输出: 估计效应: 2.0362 (SE=0.0310, t=65.64)
结果汇总表
诊断检验结果
08 论文写作指南
RDD结果的标准呈现
===========================================================表X:断点回归结果=========================================================== (1) (2) (3) Narrow BW Mid BW Wide BW-----------------------------------------------------------断点指示变量 2.04*** 2.03*** 2.01*** (0.03) (0.04) (0.05)样本量 612 1,023 1,456带宽 0.15 0.30 0.45核函数 三角 三角 三角控制变量 N Y Y地区固定效应 N N Y===========================================================注:括号内为聚类标准误;*** p<0.01
稳健性检验清单
09 参考文献
Angrist, J. D., & Pischke, J. S. (2009). Mostly Harmless Econometrics
Imbens, G., & Kalyanaraman, K. (2012). Optimal bandwidth choice for the regression discontinuity estimator. REStud, 79(3), 933-959.
McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics, 142(2), 698-714.
Calonico, S., Cattaneo, M. D., & Titiunik, R. (2014). Robust nonparametric confidence intervals for regression-discontinuity designs. Econometrica, 82(6), 2295-2326.