Python数据分析:分组操作 (GroupBy)
1. 核心知识点概述
GroupBy是Pandas中非常强大的数据分析工具,遵循"拆分-应用-合并"(split-apply-combine)模式:
groupby()- 聚合操作:
sum()、mean()、count()、min()、max()等。 - 转换操作:
transform()对每个分组进行转换,保持原DataFrame形状。 - 过滤操作
- 应用自定义函数
关键参数说明
2. 示例代码
2.1 准备数据
In [1]:
'department': ['Sales', 'Sales', 'IT', 'IT', 'HR', 'HR', 'Sales', 'IT', 'HR', 'Sales'], 'employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'], 'gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M'], 'salary': np.random.randint(5000, 15000, 10), 'bonus': np.random.randint(500, 2000, 10), 'years': np.random.randint(1, 10, 10)df['total_comp'] = df['salary'] + df['bonus']
department employee gender salary bonus years total_comp0 Sales Alice F 12270 1832 6 141021 Sales Bob M 5860 1269 9 71292 IT Charlie M 10390 843 1 112333 IT David M 10191 1937 3 121284 HR Eve F 10734 1305 7 120395 HR Frank M 11265 885 4 121506 Sales Grace F 5466 1715 9 71817 IT Henry M 9426 1455 3 108818 HR Ivy F 10578 776 5 113549 Sales Jack M 13322 1684 3 15006
2.2 基础分组与聚合
按单列分组并进行聚合计算。
In [2]:
dept_avg_salary = df.groupby('department')['salary'].mean()dept_stats = df.groupby('department')['salary'].agg(['count', 'sum', 'mean', 'min', 'max'])dept_multi = df.groupby('department')[['salary', 'bonus', 'total_comp']].mean()print("\n各部门平均薪资、奖金、总薪酬:")
Name: salary, dtype: float64HR 3 32577 10859.000000 10578 11265IT 3 30007 10002.333333 9426 10390Sales 4 36918 9229.500000 5466 13322HR 10859.000000 988.666667 11847.666667IT 10002.333333 1411.666667 11414.000000Sales 9229.500000 1625.000000 10854.500000
2.3 多列分组
按多个列进行分组。
In [3]:
dept_gender = df.groupby(['department', 'gender'])['salary'].mean()dept_gender_reset = df.groupby(['department', 'gender'], as_index=False)['salary'].mean()dept_gender_stats = df.groupby(['department', 'gender']).agg({ 'salary': ['mean', 'min', 'max'],
Name: salary, dtype: float64HR F 10656.000000 10578 10734 6.000000 M 11265.000000 11265 11265 4.000000IT M 10002.333333 9426 10390 2.333333Sales F 8868.000000 5466 12270 7.500000 M 9591.000000 5860 13322 6.000000
2.4 agg() 多聚合函数
使用agg()同时应用多个聚合函数。
In [4]:
salary_agg = df.groupby('department')['salary'].agg(['count', 'sum', 'mean', 'std', 'min', 'max'])custom_agg = df.groupby('department').agg({ 'salary': ['mean', 'max'], 'years': ['mean', 'min', 'max'],custom_func = df.groupby('department')['salary'].agg(['mean', salary_range])custom_func.columns = ['mean_salary', 'salary_range']
count sum mean std min maxHR 3 32577 10859.000000 360.154134 10578 11265IT 3 30007 10002.333333 508.940403 9426 10390Sales 4 36918 9229.500000 4143.696377 5466 13322 salary bonus years employee mean max sum mean min max countHR 10859.000000 11265 2966 5.333333 4 7 3IT 10002.333333 10390 4235 2.333333 1 3 3Sales 9229.500000 13322 6500 6.750000 3 9 4
2.5 transform() 转换操作
对每个分组进行转换,保持与原DataFrame相同的形状。
In [5]:
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')df['salary_diff'] = df['salary'] - df['dept_avg_salary']print(df[['employee', 'department', 'salary', 'dept_avg_salary', 'salary_diff']])df['salary_rank'] = df.groupby('department')['salary'].transform('rank', ascending=False)print(df[['employee', 'department', 'salary', 'salary_rank']].sort_values(['department', 'salary_rank']))df['salary_zscore'] = df.groupby('department')['salary'].transform(lambda x: (x - x.mean()) / x.std())print(df[['employee', 'department', 'salary', 'salary_zscore']])
employee department salary dept_avg_salary salary_diff0 Alice Sales 12270 9229.500000 3040.5000001 Bob Sales 5860 9229.500000 -3369.5000002 Charlie IT 10390 10002.333333 387.6666673 David IT 10191 10002.333333 188.6666674 Eve HR 10734 10859.000000 -125.0000005 Frank HR 11265 10859.000000 406.0000006 Grace Sales 5466 9229.500000 -3763.5000007 Henry IT 9426 10002.333333 -576.3333338 Ivy HR 10578 10859.000000 -281.0000009 Jack Sales 13322 9229.500000 4092.500000 employee department salary salary_rank employee department salary salary_zscore0 Alice Sales 12270 0.7337651 Bob Sales 5860 -0.8131632 Charlie IT 10390 0.7617133 David IT 10191 0.3707055 Frank HR 11265 1.1272956 Grace Sales 5466 -0.9082477 Henry IT 9426 -1.1324189 Jack Sales 13322 0.987645
2.6 filter() 过滤分组
根据条件筛选分组。
In [6]:
large_depts = df.groupby('department').filter(lambda x: len(x) > 2)print(large_depts[['employee', 'department']])high_pay_depts = df.groupby('department').filter(lambda x: x['salary'].mean() > 8000)print("\n平均工资大于8000的部门:")print(high_pay_depts[['employee', 'department', 'salary']])stable_depts = df.groupby('department').filter(lambda x: x['salary'].std() < 3000)print(stable_depts[['employee', 'department', 'salary']])
employee department salary employee department salary
2.7 apply() 应用自定义函数
对每个分组应用任意自定义函数。
In [7]:
def get_top_earner(group): return group.loc[group['salary'].idxmax()]top_earners = df.groupby('department', group_keys=False).apply(get_top_earner)print(top_earners[['employee', 'department', 'salary']])def salary_distribution(group): 'q25': group['salary'].quantile(0.25), 'q50': group['salary'].median(), 'q75': group['salary'].quantile(0.75), 'iqr': group['salary'].quantile(0.75) - group['salary'].quantile(0.25)dist = df.groupby('department').apply(salary_distribution)
employee department salaryHR 10656.0 10734.0 10999.5 343.5IT 9808.5 10191.0 10290.5 482.0Sales 5761.5 9065.0 12533.0 6771.5
C:\Users\zhanghc\AppData\Local\Temp\ipykernel_7176\3295751641.py:5: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. top_earners = df.groupby('department', group_keys=False).apply(get_top_earner)C:\Users\zhanghc\AppData\Local\Temp\ipykernel_7176\3295751641.py:18: DeprecationWarning: DataFrameGroupBy.apply operated on the grouping columns. This behavior is deprecated, and in a future version of pandas the grouping columns will be excluded from the operation. Either pass `include_groups=False` to exclude the groupings or explicitly select the grouping columns after groupby to silence this warning. dist = df.groupby('department').apply(salary_distribution)
2.8 分组迭代
遍历每个分组进行处理。
In [8]:
for name, group in df.groupby('department'): print(f" 员工数: {len(group)}") print(f" 平均薪资: {group['salary'].mean():.2f}") print(f" 最高薪资: {group['salary'].max()}") print(f" 最低薪资: {group['salary'].min()}")sales_dept = df.groupby('department').get_group('Sales')print(sales_dept[['employee', 'salary', 'bonus']])
2.9 分组描述性统计
快速获取分组描述性统计信息。
In [9]:
desc = df.groupby('department')['salary'].describe()desc_multi = df.groupby('department')[['salary', 'bonus']].describe()group_sizes = df.groupby('department').size()unique_genders = df.groupby('department')['gender'].nunique()
count mean std min 25% 50% \HR 3.0 10859.000000 360.154134 10578.0 10656.0 10734.0 IT 3.0 10002.333333 508.940403 9426.0 9808.5 10191.0 Sales 4.0 9229.500000 4143.696377 5466.0 5761.5 9065.0 count mean std min 25% 50% HR 3.0 10859.000000 360.154134 10578.0 10656.0 10734.0 IT 3.0 10002.333333 508.940403 9426.0 9808.5 10191.0 Sales 4.0 9229.500000 4143.696377 5466.0 5761.5 9065.0 75% max count mean std min 25% HR 10999.5 11265.0 3.0 988.666667 279.321201 776.0 830.50 IT 10290.5 10390.0 3.0 1411.666667 548.285814 843.0 1149.00 Sales 12533.0 13322.0 4.0 1625.000000 245.741056 1269.0 1580.25 Sales 1699.5 1744.25 1832.0 Name: gender, dtype: int64
2.10 分组与透视表结合
groupby与pivot_table的对比使用。
In [10]:
pivot_groupby = df.groupby(['department', 'gender'])['salary'].mean().unstack()print("\nGroupBy创建的透视表:")pivot_table = df.pivot_table(values='salary', index='department', columns='gender', aggfunc='mean')pivot_with_margins = df.pivot_table(values='salary', index='department', columns='gender', aggfunc='mean', margins=True)print(pivot_with_margins)
HR 10656.0 11265.000000 10859.000000IT NaN 10002.333333 10002.333333Sales 8868.0 9591.000000 9229.500000All 9762.0 10075.666667 9950.200000
3. 常见应用场景总结
- 部门统计
- 时间聚合
- 异常检测:通过transform计算每个值相对于组平均的偏差。
- 数据标准化
- 分组排名
- 筛选分组