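Python Data Analysis: Window Operations
1. Core Concepts Overview
Window operations are an important tool for time series analysis. They compute statistics over a moving window of observations: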
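- rolling(): statistics over a fixed-size sliding window.
- expanding(): statistics over a window that grows from the first observation and includes all history.
- ewm(): exponentially weighted statistics that give recent observations more weight.
- Common statistics: mean(), sum(), std(), min(), max(), corr(), and others.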
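Key parameters
- window: size of the rolling window.
- min_periods: minimum number of observations required in the window to produce a value.
- center: whether to place the label at the center of the window instead of its right edge.
- win_type: window type (e.g. 'triang', 'gaussian').
- span: decay for ewm(), specified as a span.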
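As a minimal sketch of how the three window types differ, the toy series below (values chosen only for illustration, not taken from the dataset used later) is averaged with each of them. Note that rolling() returns NaN until the window is full, while expanding() and ewm() produce a value from the first observation.

```python
import pandas as pd

# Toy series, chosen only for illustration
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

rolling_mean = s.rolling(window=3).mean()   # mean of the last 3 observations
expanding_mean = s.expanding().mean()       # mean of all observations so far
ewm_mean = s.ewm(span=3).mean()             # exponentially decaying weights

print(pd.DataFrame({
    "value": s,
    "rolling_3": rolling_mean,
    "expanding": expanding_mean,
    "ewm_span_3": ewm_mean,
}))
```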
2. Example Code
2.1 Preparing the Data
In [1]:
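import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed seed: 42 reproduces the prices printed below
np.random.seed(42)

# 60 daily observations: a random-walk price and a random volume
dates = pd.date_range('2024-01-01', periods=60, freq='D')
price = 100 + np.cumsum(np.random.randn(60) * 2)
volume = np.random.randint(1000, 5000, 60)

# Assumed construction, matching the columns printed below
df = pd.DataFrame({'price': price, 'volume': volume}, index=dates)

print(df.head(10))
print(f"\nData shape: {df.shape}")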
                 price  volume
2024-01-01  100.993428    3977
2024-01-02  100.716900    4104
2024-01-03  102.012277    4119
2024-01-04  105.058336    1502
2024-01-05  104.590030    3454
2024-01-06  104.121756    4645
2024-01-07  107.280181    2751
2024-01-08  108.815051    1804
2024-01-09  107.876102    3146
2024-01-10  108.961222    3731
2.2 Rolling Windows
Compute statistics over a fixed-size window.
In [2]:
# 7-day and 30-day simple moving averages
df['ma_7'] = df['price'].rolling(window=7).mean()
df['ma_30'] = df['price'].rolling(window=30).mean()

# Plot the price together with both averages
df[['price', 'ma_7', 'ma_30']].plot(figsize=(12, 6), title='Stock Price with Moving Averages')
plt.legend(['Price', 'MA 7', 'MA 30'])
plt.grid(True, alpha=0.3)

print(df[['price', 'ma_7', 'ma_30']].head(15))
                 price        ma_7  ma_30
2024-01-01  100.993428         NaN    NaN
2024-01-02  100.716900         NaN    NaN
2024-01-03  102.012277         NaN    NaN
2024-01-04  105.058336         NaN    NaN
2024-01-05  104.590030         NaN    NaN
2024-01-06  104.121756         NaN    NaN
2024-01-07  107.280181  103.538987    NaN
2024-01-08  108.815051  104.656362    NaN
2024-01-09  107.876102  105.679105    NaN
2024-01-10  108.961222  106.671811    NaN
2024-01-11  108.034387  107.096961    NaN
2024-01-12  107.102927  107.455947    NaN
2024-01-13  107.586852  107.950960    NaN
2024-01-14  103.760291  107.448119    NaN
2024-01-15  100.310456  106.233177    NaN
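To make explicit what a single rolling value contains, the sketch below recomputes one entry of a 3-point moving average by hand. The series and the chosen position are illustrative assumptions, not part of the dataset above.

```python
import pandas as pd

# Illustrative data only
s = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0, 13.0])
window = 3

ma = s.rolling(window=window).mean()

# Recompute one value by hand: the mean of the current point and the two before it
i = 4
manual = s.iloc[i - window + 1 : i + 1].mean()

print(ma.iloc[i], manual)
```

The first window - 1 results are NaN because the window is not yet full, which is why ma_7 above starts on the seventh row.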
In [3]:
# 7-day rolling standard deviation and 14-day rolling high/low
df['rolling_std_7'] = df['price'].rolling(window=7).std()
df['rolling_max_14'] = df['price'].rolling(window=14).max()
df['rolling_min_14'] = df['price'].rolling(window=14).min()

print(df[['price', 'rolling_std_7', 'rolling_max_14', 'rolling_min_14']].head(20))
                 price  rolling_std_7  rolling_max_14  rolling_min_14
2024-01-01  100.993428            NaN             NaN             NaN
2024-01-02  100.716900            NaN             NaN             NaN
2024-01-03  102.012277            NaN             NaN             NaN
2024-01-04  105.058336            NaN             NaN             NaN
2024-01-05  104.590030            NaN             NaN             NaN
2024-01-06  104.121756            NaN             NaN             NaN
2024-01-07  107.280181       2.398755             NaN             NaN
2024-01-08  108.815051       2.803018             NaN             NaN
2024-01-09  107.876102       2.403706             NaN             NaN
2024-01-10  108.961222       2.045125             NaN             NaN
2024-01-11  108.034387       1.961430             NaN             NaN
2024-01-12  107.102927       1.627702             NaN             NaN
2024-01-13  107.586852       0.716650             NaN             NaN
2024-01-14  103.760291       1.752300      108.961222      100.716900
2024-01-15  100.310456       3.086759      108.961222      100.310456
2024-01-16   99.185881       3.944302      108.961222       99.185881
2024-01-17   97.160218       4.453943      108.961222       97.160218
2024-01-18   97.788713       4.322602      108.961222       97.160218
2024-01-19   95.972665       4.106657      108.961222       95.972665
2024-01-20   93.148058       3.368370      108.961222       93.148058
2.3 Advanced Rolling Usage
Using the min_periods and center parameters.
In [4]:
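# min_periods=3: produce a value as soon as 3 observations are available
df['ma_7_min3'] = df['price'].rolling(window=7, min_periods=3).mean()
print("\nMoving average with min_periods=3 (first 10 rows):")
print(df[['price', 'ma_7', 'ma_7_min3']].head(10))

# center=True: label each window at its midpoint instead of its right edge
df['ma_7_center'] = df['price'].rolling(window=7, center=True).mean()
print("\nMoving average with center=True (first 10 rows):")
print(df[['price', 'ma_7', 'ma_7_center']].head(10))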
Moving average with min_periods=3 (first 10 rows):
                 price        ma_7   ma_7_min3
2024-01-01  100.993428         NaN         NaN
2024-01-02  100.716900         NaN         NaN
2024-01-03  102.012277         NaN  101.240868
2024-01-04  105.058336         NaN  102.195235
2024-01-05  104.590030         NaN  102.674194
2024-01-06  104.121756         NaN  102.915454
2024-01-07  107.280181  103.538987  103.538987
2024-01-08  108.815051  104.656362  104.656362
2024-01-09  107.876102  105.679105  105.679105
2024-01-10  108.961222  106.671811  106.671811

Moving average with center=True (first 10 rows):
                 price        ma_7  ma_7_center
2024-01-01  100.993428         NaN          NaN
2024-01-02  100.716900         NaN          NaN
2024-01-03  102.012277         NaN          NaN
2024-01-04  105.058336         NaN   103.538987
2024-01-05  104.590030         NaN   104.656362
2024-01-06  104.121756         NaN   105.679105
2024-01-07  107.280181  103.538987   106.671811
2024-01-08  108.815051  104.656362   107.096961
2024-01-09  107.876102  105.679105   107.455947
2024-01-10  108.961222  106.671811   107.950960
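The effect of the two parameters is easier to see on a short series. The following sketch uses a hypothetical 7-point series and a window of 5 to show where the NaN values fall in each case.

```python
import pandas as pd

# Hypothetical 7-point series, for illustration only
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

default = s.rolling(window=5).mean()                 # needs 5 points per value
relaxed = s.rolling(window=5, min_periods=2).mean()  # needs only 2 points
centered = s.rolling(window=5, center=True).mean()   # label at the window midpoint

print(pd.DataFrame({
    "default": default,
    "min_periods_2": relaxed,
    "centered": centered,
}))
```

With center=True the NaN values are split between both ends of the series, and each value depends on observations that come after its label. A centered average is therefore suited to smoothing historical data rather than to signals that must be available in real time.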
2.4 Rolling Apply with Custom Functions
Apply a custom function to each rolling window.
In [5]:
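def price_range(x):
    # Assumed body: window high minus window low, which matches the printed output
    return x.max() - x.min()

df['price_range_7'] = df['price'].rolling(window=7).apply(price_range)

def total_return(x):
    # Percentage change from the first to the last value in the window
    return (x.iloc[-1] - x.iloc[0]) / x.iloc[0] * 100

df['return_7'] = df['price'].rolling(window=7).apply(total_return)

print(df[['price', 'price_range_7', 'return_7']].head(15))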
                 price  price_range_7   return_7
2024-01-01  100.993428            NaN        NaN
2024-01-02  100.716900            NaN        NaN
2024-01-03  102.012277            NaN        NaN
2024-01-04  105.058336            NaN        NaN
2024-01-05  104.590030            NaN        NaN
2024-01-06  104.121756            NaN        NaN
2024-01-07  107.280181       6.563282   6.224913
2024-01-08  108.815051       8.098151   8.040509
2024-01-09  107.876102       6.802774   5.748157
2024-01-10  108.961222       4.839466   3.714970
2024-01-11  108.034387       4.839466   3.293198
2024-01-12  107.102927       4.839466   2.863159
2024-01-13  107.586852       1.858295   0.285859
2024-01-14  103.760291       5.200931  -4.645276
2024-01-15  100.310456       8.650767  -7.013274
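By default, apply() passes each window to the function as a Series. A possible variation, sketched below on synthetic data, is to pass raw=True so that the function receives a NumPy array, which is usually faster for simple numeric functions. The helper name price_range_raw and the seed are assumptions made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
s = pd.Series(100 + rng.standard_normal(50).cumsum())

def price_range_raw(values):
    # With raw=True, `values` is a NumPy array rather than a Series
    return values.max() - values.min()

via_apply = s.rolling(window=7).apply(price_range_raw, raw=True)

# The same quantity from built-in rolling methods, used here as a cross-check
via_builtin = s.rolling(window=7).max() - s.rolling(window=7).min()

print(np.allclose(via_apply.dropna(), via_builtin.dropna()))
```

When a statistic can be expressed with built-in rolling methods, as in the cross-check, that form is generally preferable to apply().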
2.5 Expanding Windows
The window grows from the first observation, so each value is computed from all data seen so far.
In [6]:
# Running mean, maximum and minimum over all data up to each date
df['expanding_mean'] = df['price'].expanding().mean()
df['expanding_max'] = df['price'].expanding().max()
df['expanding_min'] = df['price'].expanding().min()

df[['price', 'expanding_mean', 'expanding_max', 'expanding_min']].plot(figsize=(12, 6), title='Expanding Window Statistics')
plt.legend(['Price', 'Expanding Mean', 'Expanding Max', 'Expanding Min'])
plt.grid(True, alpha=0.3)

print(df[['price', 'expanding_mean', 'expanding_max', 'expanding_min']].head(15))

                 price  expanding_mean  expanding_max  expanding_min
2024-01-01  100.993428      100.993428     100.993428     100.993428
2024-01-02  100.716900      100.855164     100.993428     100.716900
2024-01-03  102.012277      101.240868     102.012277     100.716900
2024-01-04  105.058336      102.195235     105.058336     100.716900
2024-01-05  104.590030      102.674194     105.058336     100.716900
2024-01-06  104.121756      102.915454     105.058336     100.716900
2024-01-07  107.280181      103.538987     107.280181     100.716900
2024-01-08  108.815051      104.198495     108.815051     100.716900
2024-01-09  107.876102      104.607118     108.815051     100.716900
2024-01-10  108.961222      105.042528     108.961222     100.716900
2024-01-11  108.034387      105.314515     108.961222     100.716900
2024-01-12  107.102927      105.463550     108.961222     100.716900
2024-01-13  107.586852      105.626881     108.961222     100.716900
2024-01-14  103.760291      105.493553     108.961222     100.716900
2024-01-15  100.310456      105.148013     108.961222     100.310456
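Expanding statistics are closely related to the cumulative methods on a Series. The sketch below, on an illustrative series, checks that expanding().max() and expanding().sum() agree with cummax() and cumsum(), and that the expanding mean is the cumulative sum divided by the number of observations so far.

```python
import numpy as np
import pandas as pd

# Illustrative values only
s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])

exp_max = s.expanding().max()
exp_sum = s.expanding().sum()
exp_mean = s.expanding().mean()

counts = np.arange(1, len(s) + 1)  # number of observations in each window

same_max = np.allclose(exp_max, s.cummax())
same_sum = np.allclose(exp_sum, s.cumsum())
same_mean = np.allclose(exp_mean, s.cumsum() / counts)

print(same_max, same_sum, same_mean)
```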
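2.6 EWM: Exponentially Weighted Moving Average
Recent observations receive higher weights, so the average reacts faster to price changes.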
In [7]:
# Exponentially weighted means with short and long spans
df['ewm_span_7'] = df['price'].ewm(span=7).mean()
df['ewm_span_30'] = df['price'].ewm(span=30).mean()

# Compare the simple and the exponential 7-day averages
df[['price', 'ma_7', 'ewm_span_7']].plot(figsize=(12, 6), title='SMA vs EMA')
plt.legend(['Price', 'SMA 7', 'EMA 7'])
plt.grid(True, alpha=0.3)

print("\nExponentially weighted moving average (first 15 rows):")
print(df[['price', 'ma_7', 'ewm_span_7']].head(15))
                 price        ma_7  ewm_span_7
2024-01-01  100.993428         NaN  100.993428
2024-01-02  100.716900         NaN  100.835412
2024-01-03  102.012277         NaN  101.344326
2024-01-04  105.058336         NaN  102.702593
2024-01-05  104.590030         NaN  103.321266
2024-01-06  104.121756         NaN  103.564718
2024-01-07  107.280181  103.538987  104.636672
2024-01-08  108.815051  104.656362  105.797479
2024-01-09  107.876102  105.679105  106.359320
2024-01-10  108.961222  106.671811  107.048612
2024-01-11  108.034387  107.096961  107.305923
2024-01-12  107.102927  107.455947  107.253514
2024-01-13  107.586852  107.950960  107.338877
2024-01-14  103.760291  107.448119  106.428000
2024-01-15  100.310456  106.233177  104.877900
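The weighting can be written out explicitly. Assuming pandas' default adjust=True, each value is a weighted average of all observations so far, with weight (1 - alpha)**i on the observation i steps back and alpha = 2 / (span + 1). The sketch below recomputes this by hand on illustrative values and compares it with ewm().

```python
import numpy as np
import pandas as pd

x = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0])  # illustrative values
span = 3
alpha = 2 / (span + 1)

manual = []
for t in range(len(x)):
    # Weights (1 - alpha)**i for lags i = 0..t, most recent observation first
    weights = (1 - alpha) ** np.arange(t + 1)
    recent_first = x.iloc[: t + 1].to_numpy()[::-1]
    manual.append(np.sum(weights * recent_first) / np.sum(weights))

manual = pd.Series(manual, index=x.index)
pandas_ewm = x.ewm(span=span).mean()

print(np.allclose(manual, pandas_ewm))
```

Because the weights are renormalized over the available observations, the first value equals the first observation, which is why ewm_span_7 has no leading NaN values in the table above.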
In [8]:
# The decay can also be given as com (center of mass), halflife or alpha
df['ewm_com_5'] = df['price'].ewm(com=5).mean()
df['ewm_halflife_5'] = df['price'].ewm(halflife=5).mean()
df['ewm_alpha_0.3'] = df['price'].ewm(alpha=0.3).mean()

print("\nComparison of EWM parameters (first 10 rows):")
print(df[['price', 'ewm_span_7', 'ewm_com_5', 'ewm_halflife_5', 'ewm_alpha_0.3']].head(10))
                 price  ewm_span_7   ewm_com_5  ewm_halflife_5  ewm_alpha_0.3
2024-01-01  100.993428  100.993428  100.993428      100.993428     100.993428
2024-01-02  100.716900  100.835412  100.842595      100.845596     100.830764
2024-01-03  102.012277  101.344326  101.305326      101.289469     101.370268
2024-01-04  105.058336  102.702593  102.513449      102.435662     102.826276
2024-01-05  104.590030  103.321266  103.092087      102.993425     103.462298
2024-01-06  104.121756  103.564718  103.350110      103.252068     103.686514
2024-01-07  107.280181  104.636672  104.258690      104.091645     104.861369
2024-01-08  108.815051  105.797479  105.248215      105.004078     106.120033
2024-01-09  107.876102  106.359320  105.791486      105.525639     106.669007
2024-01-10  108.961222  107.048612  106.421531      106.118618     107.376661
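The four parameters are different ways of specifying the same smoothing factor alpha. The sketch below encodes the conversions as documented by pandas (alpha = 2 / (span + 1), alpha = 1 / (1 + com), alpha = 1 - exp(-ln 2 / halflife)) in three hypothetical helper functions and checks that equivalent settings give identical results on illustrative data.

```python
import numpy as np
import pandas as pd

def alpha_from_span(span):
    return 2.0 / (span + 1.0)

def alpha_from_com(com):
    return 1.0 / (1.0 + com)

def alpha_from_halflife(halflife):
    return 1.0 - np.exp(-np.log(2.0) / halflife)

s = pd.Series([100.0, 101.5, 99.0, 103.0, 104.5, 102.0])  # illustrative values

by_span = s.ewm(span=7).mean()
by_alpha = s.ewm(alpha=alpha_from_span(7)).mean()
by_com = s.ewm(com=3).mean()  # com=3 corresponds to the same alpha (0.25) as span=7

print(alpha_from_span(7), alpha_from_com(5), alpha_from_halflife(5))
print(np.allclose(by_span, by_alpha), np.allclose(by_span, by_com))
```

A larger alpha puts more weight on the latest observation, which is consistent with the alpha=0.3 column above tracking the price most closely.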
2.7 Rolling Correlation and Covariance
Compute the rolling correlation and covariance between two series.
In [9]:
# Second series: the original price plus independent noise
df['price2'] = df['price'] + np.random.randn(60) * 5

# 14-day rolling correlation and covariance between the two series
df['rolling_corr'] = df['price'].rolling(window=14).corr(df['price2'])
df['rolling_cov'] = df['price'].rolling(window=14).cov(df['price2'])

fig, axes = plt.subplots(2, 1, figsize=(12, 8))
df[['price', 'price2']].plot(ax=axes[0], title='Two Price Series')
axes[0].grid(True, alpha=0.3)
df['rolling_corr'].plot(ax=axes[1], title='Rolling Correlation (14-day)', color='purple')
axes[1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
axes[1].grid(True, alpha=0.3)

print(df[['price', 'price2', 'rolling_corr', 'rolling_cov']].head(20))

                 price      price2  rolling_corr  rolling_cov
2024-01-01  100.993428  103.500670           NaN          NaN
2024-01-02  100.716900  106.507833           NaN          NaN
2024-01-03  102.012277  103.297861           NaN          NaN
2024-01-04  105.058336  106.630901           NaN          NaN
2024-01-05  104.590030  111.449340           NaN          NaN
2024-01-06  104.121756  104.999522           NaN          NaN
2024-01-07  107.280181  105.733739           NaN          NaN
2024-01-08  108.815051  112.180678           NaN          NaN
2024-01-09  107.876102  106.592951           NaN          NaN
2024-01-10  108.961222  107.122094           NaN          NaN
2024-01-11  108.034387  114.403055           NaN          NaN
2024-01-12  107.102927  105.643164           NaN          NaN
2024-01-13  107.586852   94.310972           NaN          NaN
2024-01-14  103.760291  105.487881      0.209861     2.842722
2024-01-15  100.310456   98.332873      0.331210     5.029522
2024-01-16   99.185881   97.740196      0.516154     9.106168
2024-01-17   97.160218   99.424900      0.559975    12.178414
2024-01-18   97.788713   96.958409      0.626131    16.239314
2024-01-19   95.972665   97.047359      0.696766    20.240063
2024-01-20   93.148058   83.036483      0.788135    35.555180
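Correlation is covariance scaled by the two standard deviations, and the same relation holds window by window. The sketch below checks this on synthetic series; the seed and the series names a and b are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
a = pd.Series(100 + rng.standard_normal(60).cumsum())
b = a + rng.standard_normal(60) * 5

window = 14
roll_cov = a.rolling(window).cov(b)
roll_corr = a.rolling(window).corr(b)

# corr = cov / (std_a * std_b), evaluated on the same windows
rebuilt = roll_cov / (a.rolling(window).std() * b.rolling(window).std())

print(np.allclose(rebuilt.dropna(), roll_corr.dropna()))
```

Unlike covariance, the correlation is unit-free and bounded between -1 and 1, which makes it easier to compare across windows with different volatility.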
2.8 Rolling Regression
Fit a linear regression on each rolling window.
In [10]:
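def rolling_beta(x, y):
    # Assumed body, reconstructed from the printed betas: slope of y on x.
    # np.cov uses ddof=1 while np.var defaults to ddof=0; np.var(x, ddof=1)
    # would give the exact least-squares slope.
    cov = np.cov(x, y)[0, 1]
    var = np.var(x)
    return cov / var if var != 0 else np.nan

window = 14  # assumed from the plot title and the first non-NaN row
betas = []
for i in range(len(df)):
    if i < window - 1:
        # Not enough history for a full window yet
        betas.append(np.nan)
    else:
        x = df['price'].iloc[i-window+1:i+1].values
        y = df['price2'].iloc[i-window+1:i+1].values
        betas.append(rolling_beta(x, y))

df['rolling_beta'] = betas

df['rolling_beta'].plot(figsize=(12, 5), title='Rolling Beta (14-day)')
plt.grid(True, alpha=0.3)

print(df[['price', 'price2', 'rolling_beta']].head(20))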

                 price      price2  rolling_beta
2024-01-01  100.993428  103.500670           NaN
2024-01-02  100.716900  106.507833           NaN
2024-01-03  102.012277  103.297861           NaN
2024-01-04  105.058336  106.630901           NaN
2024-01-05  104.590030  111.449340           NaN
2024-01-06  104.121756  104.999522           NaN
2024-01-07  107.280181  105.733739           NaN
2024-01-08  108.815051  112.180678           NaN
2024-01-09  107.876102  106.592951           NaN
2024-01-10  108.961222  107.122094           NaN
2024-01-11  108.034387  114.403055           NaN
2024-01-12  107.102927  105.643164           NaN
2024-01-13  107.586852   94.310972           NaN
2024-01-14  103.760291  105.487881      0.373930
2024-01-15  100.310456   98.332873      0.623060
2024-01-16   99.185881   97.740196      0.983193
2024-01-17   97.160218   99.424900      0.927755
2024-01-18   97.788713   96.958409      0.980649
2024-01-19   95.972665   97.047359      0.948598
2024-01-20   93.148058   83.036483      1.230099
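One possible vectorized alternative to an explicit loop is the ratio of rolling covariance to rolling variance, which equals the least-squares slope of y on x when both use the same normalization. The sketch below runs on synthetic data (seed and coefficients are arbitrary) and cross-checks the last window against np.polyfit. Its numbers need not match the table above, since those depend on how the variance was normalized there.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)  # arbitrary seed
x = pd.Series(100 + rng.standard_normal(60).cumsum())
y = 0.8 * x + rng.standard_normal(60) * 2

window = 14
# cov() and var() both default to ddof=1, so their ratio is the OLS slope
beta_vec = x.rolling(window).cov(y) / x.rolling(window).var()

# Cross-check the last window against an explicit least-squares fit
slope, intercept = np.polyfit(
    x.iloc[-window:].to_numpy(), y.iloc[-window:].to_numpy(), deg=1
)

print(beta_vec.iloc[-1], slope)
```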
2.9 Technical Indicator Example: Bollinger Bands
Compute Bollinger Bands with rolling windows.
In [11]:
window = 20   # assumed: the first non-NaN band is on the 20th row of the output
num_std = 2   # assumed: consistent with the printed band widths

# Middle band is the rolling mean; outer bands are num_std standard deviations away
df['bb_middle'] = df['price'].rolling(window=window).mean()
df['bb_std'] = df['price'].rolling(window=window).std()
df['bb_upper'] = df['bb_middle'] + (df['bb_std'] * num_std)
df['bb_lower'] = df['bb_middle'] - (df['bb_std'] * num_std)

plt.figure(figsize=(12, 6))
plt.plot(df.index, df['price'], label='Price', color='blue')
plt.plot(df.index, df['bb_middle'], label='Middle Band', color='orange', linestyle='--')
plt.plot(df.index, df['bb_upper'], label='Upper Band', color='red', linestyle=':')
plt.plot(df.index, df['bb_lower'], label='Lower Band', color='green', linestyle=':')
plt.fill_between(df.index, df['bb_upper'], df['bb_lower'], alpha=0.1, color='gray')
plt.title('Bollinger Bands')
plt.legend()
plt.grid(True, alpha=0.3)

print(df[['price', 'bb_middle', 'bb_upper', 'bb_lower']].head(25))

                 price   bb_middle    bb_upper    bb_lower
2024-01-01  100.993428         NaN         NaN         NaN
2024-01-02  100.716900         NaN         NaN         NaN
2024-01-03  102.012277         NaN         NaN         NaN
2024-01-04  105.058336         NaN         NaN         NaN
2024-01-05  104.590030         NaN         NaN         NaN
2024-01-06  104.121756         NaN         NaN         NaN
2024-01-07  107.280181         NaN         NaN         NaN
2024-01-08  108.815051         NaN         NaN         NaN
2024-01-09  107.876102         NaN         NaN         NaN
2024-01-10  108.961222         NaN         NaN         NaN
2024-01-11  108.034387         NaN         NaN         NaN
2024-01-12  107.102927         NaN         NaN         NaN
2024-01-13  107.586852         NaN         NaN         NaN
2024-01-14  103.760291         NaN         NaN         NaN
2024-01-15  100.310456         NaN         NaN         NaN
2024-01-16   99.185881         NaN         NaN         NaN
2024-01-17   97.160218         NaN         NaN         NaN
2024-01-18   97.788713         NaN         NaN         NaN
2024-01-19   95.972665         NaN         NaN         NaN
2024-01-20   93.148058  103.023787  112.460066   93.587507
2024-01-21   96.079355  102.778083  112.681313   92.874853
2024-01-22   95.627802  102.523628  112.900063   92.147193
2024-01-23   95.762859  102.211157  113.019811   91.402504
2024-01-24   92.913363  101.603908  113.082912   90.124905
2024-01-25   91.824597  100.965637  113.143839   88.787434
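The band calculation can be wrapped in a small reusable function. The following is a sketch only: bollinger_bands is a hypothetical helper, the defaults of 20 periods and 2 standard deviations are the conventional choice, and the prices are synthetic. It also flags observations that fall outside the bands.

```python
import numpy as np
import pandas as pd

def bollinger_bands(prices, window=20, num_std=2.0):
    """Return middle, upper and lower bands as a DataFrame (hypothetical helper)."""
    middle = prices.rolling(window=window).mean()
    std = prices.rolling(window=window).std()
    return pd.DataFrame({
        "middle": middle,
        "upper": middle + num_std * std,
        "lower": middle - num_std * std,
    })

rng = np.random.default_rng(3)  # arbitrary seed, synthetic prices
prices = pd.Series(100 + rng.standard_normal(120).cumsum() * 2)

bands = bollinger_bands(prices)

# Flag observations outside the bands; comparisons with NaN bands are False
outside = (prices > bands["upper"]) | (prices < bands["lower"])

print(bands.dropna().head())
print("points outside the bands:", int(outside.sum()))
```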
2.10 Overall Comparison
Compare the different types of window operations side by side.
In [12]:
# Collect the price and the three 7-day style averages in one frame
comparison_df = pd.DataFrame({
    'price': df['price'],
    'rolling_ma': df['ma_7'],
    'expanding_ma': df['expanding_mean'],
    'ewm_ma': df['ewm_span_7'],
})

comparison_df.plot(figsize=(12, 6), title='Comparison of Window Operations')
plt.legend(['Price', 'Rolling MA (7)', 'Expanding MA', 'EMA (7)'])
plt.grid(True, alpha=0.3)

print(comparison_df.head(15))
print(comparison_df.tail(10))

                 price  rolling_ma  expanding_ma      ewm_ma
2024-01-01  100.993428         NaN    100.993428  100.993428
2024-01-02  100.716900         NaN    100.855164  100.835412
2024-01-03  102.012277         NaN    101.240868  101.344326
2024-01-04  105.058336         NaN    102.195235  102.702593
2024-01-05  104.590030         NaN    102.674194  103.321266
2024-01-06  104.121756         NaN    102.915454  103.564718
2024-01-07  107.280181  103.538987    103.538987  104.636672
2024-01-08  108.815051  104.656362    104.198495  105.797479
2024-01-09  107.876102  105.679105    104.607118  106.359320
2024-01-10  108.961222  106.671811    105.042528  107.048612
2024-01-11  108.034387  107.096961    105.314515  107.305923
2024-01-12  107.102927  107.455947    105.463550  107.253514
2024-01-13  107.586852  107.950960    105.626881  107.338877
2024-01-14  103.760291  107.448119    105.493553  106.428000
2024-01-15  100.310456  106.233177    105.148013  104.877900
                price  rolling_ma  expanding_ma     ewm_ma
2024-02-20  78.100777   79.233914     93.128858  79.640048
2024-02-21  77.330613   78.775691     92.825046  79.062689
2024-02-22  75.976769   78.329731     92.507154  78.291209
2024-02-23  77.200121   78.190148     92.223690  78.018437
2024-02-24  79.262120   78.043100     91.988025  78.329358
2024-02-25  81.124681   78.063956     91.794037  79.028189
2024-02-26  79.446246   78.348761     91.577409  79.132703
2024-02-27  78.827821   78.452624     91.357589  79.056482
2024-02-28  79.490348   78.761158     91.156449  79.164949
2024-02-29  81.441438   79.541825     90.994532  79.734071
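3. Summary of Common Use Cases
- Trend analysis
- Volatility calculation
- Technical indicators
- Anomaly detection
- Correlation analysis
- Cumulative statistics: use expanding to compute cumulative returns, running maxima, and similar quantities.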
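As a sketch of the cumulative statistics mentioned in the last item, the example below uses expanding windows on a short illustrative price series to compute the cumulative return and the running maximum, and derives the drawdown from the latter. The prices and column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

prices = pd.Series([100.0, 104.0, 101.0, 108.0, 97.0, 103.0])  # illustrative

# Period-over-period returns; the first period has no previous price
returns = prices.pct_change().fillna(0.0)

# Compound the returns over an expanding window to get the cumulative return
cumulative_return = (1 + returns).expanding().apply(np.prod, raw=True) - 1

running_max = prices.expanding().max()  # highest price seen so far
drawdown = prices / running_max - 1     # distance below that high

print(pd.DataFrame({
    "price": prices,
    "cum_return": cumulative_return,
    "running_max": running_max,
    "drawdown": drawdown,
}))
```

The compounded return over the expanding window equals the price divided by the first price, minus one, so the simpler expression prices / prices.iloc[0] - 1 gives the same series.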