Python Data Analysis: Time Series
1. Core Concepts Overview
Pandas provides a rich set of tools for working with time series:
- Timestamp
- Period
- DatetimeIndex
- date_range()
- resample()
- tz_localize() / tz_convert()
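Key parameters
- freq: frequency string, such as 'D' (day), 'W' (week), 'ME' (month end), or 'h' (hour). pandas versions before 2.2 spell the last two 'M' and 'H'.
- start: first timestamp of the generated range.
- periods: number of timestamps to generate.
- closed: which side of each interval is closed.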
2. Example Code
2.1 Preparing the Data
In [1]:
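import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Font settings so that CJK labels and minus signs render correctly in plots
plt.rcParams['font.sans-serif'] = ['SimHei', 'DejaVu Sans']
plt.rcParams['axes.unicode_minus'] = False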
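2.2 Creating Timestamps (Timestamp)
Build timestamps from strings or from individual components, then inspect their attributes and do arithmetic with Timedelta.
In [2]:
# Build timestamps from a date string, from components, and from a date-time string
ts1 = pd.Timestamp('2024-03-15')
ts2 = pd.Timestamp(2024, 3, 15, 14, 30, 0)
ts3 = pd.Timestamp('2024-03-15 14:30:00')

# Calendar attributes
print(f"Month: {ts1.month}")
print(f"Day of week: {ts1.dayofweek} (0=Monday)")
print(f"Day name: {ts1.day_name()}")
print(f"Is month end: {ts1.is_month_end}")
print(f"Is leap year: {ts1.is_leap_year}")

# Arithmetic with Timedelta
print(f"Plus 1 day: {ts1 + pd.Timedelta(days=1)}")
print(f"Plus 3 hours: {ts1 + pd.Timedelta(hours=3)}")
print(f"Minus 1 week: {ts1 - pd.Timedelta(weeks=1)}")
Plus 3 hours: 2024-03-15 03:00:00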
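Parsing often starts from strings in a non-ISO layout rather than from clean literals. The following is a minimal sketch of that case using pd.to_datetime with an explicit format; the sample strings are made up for illustration.

```python
import pandas as pd

# Hypothetical raw strings in day/month/year order, one of them invalid
raw = ["15/03/2024 14:30", "16/03/2024 09:00", "not a date"]

# An explicit format avoids day/month ambiguity;
# errors="coerce" turns unparseable entries into NaT instead of raising
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")

print(parsed)
print(parsed.isna().tolist())  # [False, False, True]
```

Coercing to NaT keeps the result aligned with the input, so bad rows can be counted or dropped afterwards.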
2.3 Generating Date Ranges (date_range)
Use date_range to generate a DatetimeIndex at a fixed frequency.
In [3]:
daily = pd.date_range(start='2024-01-01', end='2024-01-10', freq='D')
# 'W' is anchored on Sunday (W-SUN), so the first date is 2024-01-07
weekly = pd.date_range(start='2024-01-01', periods=5, freq='W')
# 'B' skips weekends
business_days = pd.date_range(start='2024-01-01', periods=10, freq='B')
# 'ME' yields month-end dates
monthly = pd.date_range(start='2024-01-01', periods=6, freq='ME')
hourly = pd.date_range(start='2024-01-01', periods=12, freq='h')

print(daily)
print(weekly)
print(business_days)
print(monthly)
print(hourly)
DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-06', '2024-01-07', '2024-01-08',
               '2024-01-09', '2024-01-10'],
              dtype='datetime64[ns]', freq='D')
DatetimeIndex(['2024-01-07', '2024-01-14', '2024-01-21', '2024-01-28',
               '2024-02-04'],
              dtype='datetime64[ns]', freq='W-SUN')
DatetimeIndex(['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04',
               '2024-01-05', '2024-01-08', '2024-01-09', '2024-01-10',
               '2024-01-11', '2024-01-12'],
              dtype='datetime64[ns]', freq='B')
DatetimeIndex(['2024-01-31', '2024-02-29', '2024-03-31', '2024-04-30',
               '2024-05-31', '2024-06-30'],
              dtype='datetime64[ns]', freq='ME')
DatetimeIndex(['2024-01-01 00:00:00', '2024-01-01 01:00:00',
               '2024-01-01 02:00:00', '2024-01-01 03:00:00',
               '2024-01-01 04:00:00', '2024-01-01 05:00:00',
               '2024-01-01 06:00:00', '2024-01-01 07:00:00',
               '2024-01-01 08:00:00', '2024-01-01 09:00:00',
               '2024-01-01 10:00:00', '2024-01-01 11:00:00'],
              dtype='datetime64[ns]', freq='h')
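Beyond the single-letter frequencies, date_range also accepts multiples of a unit, an evenly spaced start/end/periods combination, and control over whether the endpoints are included. The sketch below assumes pandas 1.4 or later, where the inclusive parameter replaced the older closed parameter of date_range.

```python
import pandas as pd

# A multiple of a base unit: every 90 minutes
every_90min = pd.date_range("2024-01-01", periods=3, freq="90min")

# start + end + periods without freq: evenly spaced points between the bounds
evenly_spaced = pd.date_range(start="2024-01-01", end="2024-01-02", periods=5)

# inclusive="left" keeps the start and drops the end (assumes pandas >= 1.4)
left_only = pd.date_range("2024-01-01", "2024-01-05", freq="D", inclusive="left")

print(every_90min)
print(evenly_spaced)
print(left_only)
```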
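2.4 Creating Time Series Data
Create a Series and a DataFrame indexed by a DatetimeIndex.
In [4]:
# Seed 42 is assumed; it reproduces the first values shown in the output
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=30, freq='D')
values = np.random.randn(30).cumsum() + 100
ts_data = pd.Series(values, index=dates)

# NOTE: the column name 'value' is assumed for illustration
df_ts = pd.DataFrame({
    'value': values,
    'volume': np.random.randint(1000, 5000, 30)
}, index=dates)

print("\nTime series DataFrame (first 5 rows):")
print(df_ts.head())
print(f"\nIndex type: {type(df_ts.index)}")
2024-01-01  100.496714    2363
2024-01-02  100.358450    3139
2024-01-03  101.006138    2390
2024-01-04  102.529168    4003
2024-01-05  102.295015    2478
Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>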
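Real data usually arrives with dates stored as strings in an ordinary column rather than as a ready-made index. A minimal sketch of the conversion, using made-up records and hypothetical column names:

```python
import pandas as pd

# Hypothetical raw records: dates as strings, deliberately out of order
raw = pd.DataFrame({
    "date": ["2024-01-03", "2024-01-01", "2024-01-02"],
    "price": [101.0, 100.5, 100.4],
})

# Convert the column to datetime, promote it to the index, and sort by time
raw["date"] = pd.to_datetime(raw["date"])
df = raw.set_index("date").sort_index()

print(type(df.index).__name__)            # DatetimeIndex
print(df.index.is_monotonic_increasing)   # True
```

Sorting matters because string-based slicing over a range of dates expects a monotonic index.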
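2.5 Indexing and Slicing Time Series
Select data by date strings. Label-based slices on a DatetimeIndex include both endpoints, and a partial string such as '2024-01' selects the whole month.
In [5]:
# Single date
print(f"2024-01-05: {ts_data['2024-01-05']}")

# Date range (both endpoints included)
print("\nSlice 2024-01-05 to 2024-01-10:")
print(ts_data['2024-01-05':'2024-01-10'])

# Partial string: all of January 2024
print(ts_data['2024-01'])

# truncate drops everything outside [before, after]
print("\nData after truncate:")
print(ts_data.truncate(before='2024-01-10', after='2024-01-20'))
2024-01-05: 102.29501487162543
Slice 2024-01-05 to 2024-01-10: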
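The slicing rules can be checked on a small synthetic series. This sketch uses .loc, which makes the label-based intent explicit, and adds a boolean mask built from an index attribute; the series itself is invented for the example.

```python
import pandas as pd

# 30 daily values starting on Monday 2024-01-01
s = pd.Series(range(30), index=pd.date_range("2024-01-01", periods=30, freq="D"))

# Label slices include both endpoints: 6 rows, not 5
window = s.loc["2024-01-05":"2024-01-10"]

# truncate selects the same rows as the equivalent .loc slice
same = s.truncate(before="2024-01-05", after="2024-01-10").equals(window)

# Boolean mask from an index attribute: Saturdays (5) and Sundays (6)
weekends = s[s.index.dayofweek >= 5]

print(len(window), same, len(weekends))
```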
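2.6 Resampling (Resample)
Resampling changes the frequency of a time series: downsampling aggregates to a lower frequency, upsampling expands to a higher one.
In [6]:
# Seed 42 is assumed; it reproduces the first values shown in the output
np.random.seed(42)
hourly_dates = pd.date_range('2024-01-01', periods=168, freq='h')  # one week of data
hourly_values = np.random.randn(168).cumsum() + 100
hourly_ts = pd.Series(hourly_values, index=hourly_dates)
print(hourly_ts.head(10))

# Downsample: hourly -> daily mean -> weekly sum of the daily means
daily_mean = hourly_ts.resample('D').mean()
weekly_sum = daily_mean.resample('W').sum()
print(daily_mean)
print(weekly_sum)
2024-01-01 00:00:00    100.496714
2024-01-01 01:00:00    100.358450
2024-01-01 02:00:00    101.006138
2024-01-01 03:00:00    102.529168
2024-01-01 04:00:00    102.295015
2024-01-01 05:00:00    102.060878
2024-01-01 06:00:00    103.640091
2024-01-01 07:00:00    104.407525
2024-01-01 08:00:00    103.938051
2024-01-01 09:00:00    104.480611
...
Freq: W-SUN, dtype: float64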
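When downsampling, the closed and label arguments of resample decide which bin a boundary point falls into and which edge names the bin. A small sketch with constant data makes the difference visible; the series is invented for the example, and '60min' is used instead of 'h' only so the snippet also runs on older pandas versions.

```python
import pandas as pd

# 25 hourly points, 2024-01-01 00:00 through 2024-01-02 00:00, each equal to 1
idx = pd.date_range("2024-01-01", periods=25, freq="60min")
ones = pd.Series(1, index=idx)

# Default for daily bins: [00:00, next 00:00), labeled by the left edge
left = ones.resample("D").sum()

# Right-closed bins: (previous 00:00, 00:00], labeled by the right edge
right = ones.resample("D", closed="right", label="right").sum()

print(left.tolist())   # [24, 1]
print(right.tolist())  # [1, 24]
```

With the default, the midnight point of 2024-01-02 starts a new bin; with right-closed bins it is counted as the last point of the bin ending at that midnight.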
In [7]:
# Several aggregations per day in one call
resampled = hourly_ts.resample('D').agg(['mean', 'min', 'max', 'std'])
print(resampled.head())

# Upsample: daily -> hourly
daily_data = pd.Series([100, 105, 98, 102, 108],
                       index=pd.date_range('2024-01-01', periods=5, freq='D'))

# Forward fill repeats the last known value
hourly_ffill = daily_data.resample('h').ffill()
print("\nUpsampling - forward fill (first 10):")
print(hourly_ffill.head(10))

# Linear interpolation draws a straight line between known values
hourly_interp = daily_data.resample('h').interpolate(method='linear')
print("\nUpsampling - linear interpolation (first 10):")
print(hourly_interp.head(10))
                  mean        min         max       std
2024-01-01  100.851231  96.456681  104.480611  2.635300
2024-01-02   93.170429  89.088604   96.023221  2.180211
2024-01-03   89.778946  87.753344   92.469581  1.115705
2024-01-04   91.223889  89.287646   93.998399  1.042130
2024-01-05   88.648586  86.473055   90.499713  1.233611

2024-01-01 00:00:00    100.000000
2024-01-01 01:00:00    100.208333
2024-01-01 02:00:00    100.416667
2024-01-01 03:00:00    100.625000
2024-01-01 04:00:00    100.833333
2024-01-01 05:00:00    101.041667
2024-01-01 06:00:00    101.250000
2024-01-01 07:00:00    101.458333
2024-01-01 08:00:00    101.666667
2024-01-01 09:00:00    101.875000
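Upsampling creates rows that have no observed value, and the fill method decides what goes into them. The sketch below compares three options on a two-point series whose values are chosen so the arithmetic is easy to follow; '60min' is used instead of 'h' only for compatibility with older pandas versions.

```python
import pandas as pd

# Two daily observations, 24 hours apart
daily = pd.Series([100.0, 124.0],
                  index=pd.date_range("2024-01-01", periods=2, freq="D"))

upsampler = daily.resample("60min")

gaps = upsampler.asfreq()         # new rows are left as NaN
filled = upsampler.ffill()        # new rows repeat the last observation
interp = upsampler.interpolate()  # new rows lie on a straight line

print(gaps.isna().sum())          # 23 of the 25 rows have no observation
print(filled.iloc[12], interp.iloc[12])
```

Forward fill is the safer choice when a value holds until the next reading, while interpolation assumes the quantity changes smoothly between readings.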
2.7 Time Zone Handling
Attach a time zone to naive timestamps with tz_localize, then convert between zones with tz_convert.
In [8]:
# A naive series has no time zone
ts_no_tz = pd.Series(range(5), index=pd.date_range('2024-01-01', periods=5, freq='h'))
print(f"Time zone: {ts_no_tz.index.tz}")

# Declare that the naive times are in UTC
ts_utc = ts_no_tz.tz_localize('UTC')
print(f"Time zone: {ts_utc.index.tz}")
print(ts_utc)

# Convert the same instants to other zones
ts_shanghai = ts_utc.tz_convert('Asia/Shanghai')
ts_ny = ts_utc.tz_convert('America/New_York')
print(ts_shanghai)
print(ts_ny)
2024-01-01 00:00:00+00:00    0
2024-01-01 01:00:00+00:00    1
2024-01-01 02:00:00+00:00    2
2024-01-01 03:00:00+00:00    3
2024-01-01 04:00:00+00:00    4

2024-01-01 08:00:00+08:00    0
2024-01-01 09:00:00+08:00    1
2024-01-01 10:00:00+08:00    2
2024-01-01 11:00:00+08:00    3
2024-01-01 12:00:00+08:00    4

2023-12-31 19:00:00-05:00    0
2023-12-31 20:00:00-05:00    1
2023-12-31 21:00:00-05:00    2
2023-12-31 22:00:00-05:00    3
2023-12-31 23:00:00-05:00    4
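Two details are easy to miss here: tz_convert changes only the wall-clock representation, not the instant, and tz_localize can hit local times that do not exist or occur twice around daylight saving transitions. The sketch below illustrates both with the 2024 US transitions; the choice of nonexistent="shift_forward" and ambiguous="NaT" is one possible policy, not the only one.

```python
import pandas as pd

# tz_convert keeps the instant and changes only how it is displayed
utc = pd.Timestamp("2024-01-01 00:00", tz="UTC")
shanghai = utc.tz_convert("Asia/Shanghai")
print(shanghai.hour, shanghai == utc)  # 8 True

# 02:30 on 2024-03-10 does not exist in New York (clocks jump 02:00 -> 03:00);
# 01:30 on 2024-11-03 occurs twice (clocks fall back 02:00 -> 01:00)
naive = pd.DatetimeIndex(["2024-03-10 02:30", "2024-11-03 01:30"])
localized = naive.tz_localize(
    "America/New_York",
    nonexistent="shift_forward",  # move to the first valid time after the gap
    ambiguous="NaT",              # mark the ambiguous time as missing
)
print(localized)
```

Without these arguments, tz_localize raises an error on such timestamps, which is often the right default when the data should never contain them.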
2.8 Periods (Period)
A Period represents a span of time, such as a whole month or quarter, rather than a single instant.
In [9]:
p1 = pd.Period('2024-01', freq='M')   # a month
p2 = pd.Period('2024-Q1', freq='Q')   # a quarter
p3 = pd.Period('2024', freq='Y')      # a year

print(f"Start time: {p1.start_time}")
print(f"End time: {p1.end_time}")

period_range = pd.period_range('2024-01', periods=6, freq='M')
print(period_range)
Start time: 2024-01-01 00:00:00
End time: 2024-01-31 23:59:59.999999999
PeriodIndex(['2024-01', '2024-02', '2024-03', '2024-04', '2024-05', '2024-06'], dtype='period[M]')
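Periods support integer arithmetic and can be converted to and from timestamps, which is handy for grouping daily data by month. A minimal sketch with invented dates:

```python
import pandas as pd

p = pd.Period("2024-01", freq="M")

# Integer arithmetic moves by whole periods
next_month = p + 1

# asfreq re-expresses the period at another frequency
quarter = p.asfreq("Q")

# Timestamps -> periods -> timestamps (start of each period by default)
days = pd.date_range("2024-01-15", periods=3, freq="D")
months = days.to_period("M")
starts = months.to_timestamp()

print(next_month, quarter)
print(months)
print(starts)
```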
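2.9 Date Offsets (DateOffset)
Use offset objects for calendar-aware date arithmetic. MonthEnd(1) and QuarterEnd(1) move to the next month-end or quarter-end date, so from 2024-03-15 both land on 2024-03-31, which is still inside the current month and quarter.
In [10]:
from pandas.tseries.offsets import (
    Day, Week, MonthEnd, QuarterEnd, BusinessDay, DateOffset
)

ts = pd.Timestamp('2024-03-15')
print(f"Original time: {ts}")
print(f"Plus 5 days: {ts + Day(5)}")
print(f"Plus 2 weeks: {ts + Week(2)}")
print(f"Month end (MonthEnd(0)): {ts + MonthEnd(0)}")
print(f"Next month-end date (MonthEnd(1)): {ts + MonthEnd(1)}")
print(f"Next quarter-end date: {ts + QuarterEnd(1)}")
print(f"5 business days later: {ts + BusinessDay(5)}")

# Combine several calendar units in one offset
custom_offset = DateOffset(months=2, days=5)
print(f"\nCustom offset (2 months, 5 days): {ts + custom_offset}")
Original time: 2024-03-15 00:00:00
Next quarter-end date: 2024-03-31 00:00:00
5 business days later: 2024-03-22 00:00:00
Custom offset (2 months, 5 days): 2024-05-20 00:00:00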
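The difference between MonthEnd(0) and MonthEnd(1) only shows up when the starting date is already a month end. The following sketch compares the two from a mid-month date and from a month-end date, and adds the explicit rollforward and rollback methods.

```python
import pandas as pd
from pandas.tseries.offsets import MonthEnd

mid = pd.Timestamp("2024-03-15")
end = pd.Timestamp("2024-03-31")

# From a mid-month date both offsets land on the same day
print(mid + MonthEnd(0), mid + MonthEnd(1))  # 2024-03-31 for both

# From a month end, MonthEnd(0) stays put and MonthEnd(1) moves a full month
print(end + MonthEnd(0))  # 2024-03-31
print(end + MonthEnd(1))  # 2024-04-30

# rollforward / rollback snap to the nearest anchor without overshooting
print(MonthEnd().rollforward(mid))  # 2024-03-31
print(MonthEnd().rollback(mid))     # 2024-02-29
```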
2.10 Visualizing Time Series
Plot a monthly series and its month-over-month growth. Month-end dates use freq='ME'; the older alias 'M' raises a FutureWarning in recent pandas versions.
In [11]:
monthly_dates = pd.date_range('2023-01-01', periods=12, freq='ME')
# Random sales with an upward trend
monthly_sales = np.random.randint(100, 200, 12) + np.linspace(0, 50, 12)
monthly_ts = pd.Series(monthly_sales, index=monthly_dates)

fig, axes = plt.subplots(2, 1, figsize=(12, 8))

# Top panel: sales trend
monthly_ts.plot(ax=axes[0], marker='o', title='Monthly Sales Trend')
axes[0].set_xlabel('Date')
axes[0].set_ylabel('Sales')
axes[0].grid(True, alpha=0.3)

# Bottom panel: month-over-month growth in percent
mom_growth = monthly_ts.pct_change() * 100
mom_growth.plot(ax=axes[1], kind='bar', color='steelblue',
                title='Month-over-Month Growth (%)')
axes[1].set_xlabel('Date')
axes[1].set_ylabel('Growth (%)')
axes[1].axhline(y=0, color='r', linestyle='--', alpha=0.5)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
print(mom_growth.round(2))
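pct_change compares each value with the previous row by default, which on monthly data is month-over-month growth; a year-over-year figure needs periods=12. A small sketch on an invented linear series shows the difference. The offset object is passed as freq only to sidestep the 'M' versus 'ME' alias change between pandas versions.

```python
import pandas as pd

# 13 month-end dates with made-up sales rising by 5 per month: 100, 105, ..., 160
idx = pd.date_range("2023-01-31", periods=13, freq=pd.offsets.MonthEnd())
sales = pd.Series([100.0 + 5 * i for i in range(13)], index=idx)

mom = sales.pct_change() * 100            # versus the previous month
yoy = sales.pct_change(periods=12) * 100  # versus the same month a year earlier

print(round(mom.iloc[-1], 2))  # 3.23
print(round(yoy.iloc[-1], 2))  # 60.0
```

The first 12 year-over-year values are NaN because there is no observation a year earlier to compare against.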
3. Summary of Common Use Cases
- Financial data analysis
- Sales reports
- Log analysis
- Sensor data
- Time zone conversion
- Resampling