当前位置：首页>Linux>深入浅出 Linux 内核(进程篇):CFS调度

深入浅出 Linux 内核(进程篇):CFS调度

2026-07-01 23:40:42

大家好，我是蟹老板~

我，当年啃Linux内核CFS调度的时候，说真的，差点把头发薅秃，第一次看CFS（完全公平调度器）的时候，心里就一个念头：这名字起得也太理想主义了吧？“完全公平”？在这个世界上，哪有什么真正的公平啊。

不过在看完代码之后，倒是也觉得，这帮内核大佬们，至少在技术上，是真心想追求那么一点儿公平的。

提醒一下：各位最好先看看《深入浅出 Linux 内核（进程篇）：进程调度》这篇文章，这样的话可以对进程调度有一个大体的认识。

本文内核版本为Linux5.6.4

进程这东西，说简单也简单，说复杂也复杂，大概分三类：交互式进程、批处理进程还有实时进程。对应这三类进程，Linux内核整了六种调度策略，我当年背这几个宏定义，背了又忘，忘了又背，现在闭着眼都能写出来：

#define SCHED_NORMAL		0#define SCHED_FIFO		1#define SCHED_RR		2#define SCHED_BATCH		3/* SCHED_ISO: reserved but not implemented yet */#define SCHED_IDLE		5#define SCHED_DEADLINE		6

这里面，SCHED_NORMAL和SCHED_BATCH就是咱们最常接触的普通进程，写个接口、跑个脚本，基本都是这俩策略；SCHED_FIFO、SCHED_RR还有SCHED_DEADLINE，是给实时进程用的，比如工业控制里的任务，差一秒都不行；至于SCHED_IDLE，就是CPU没事干的时候才跑的0号进程，相当于CPU的“摸鱼模式”嘛，啥时候没其他进程干活了，它才出来凑数。

绝大多数情况下，你跑的那些程序都是普通进程，归CFS调度器管。这一点，从内核怎么挑下一个要跑的进程就能看出来——pick_next_task函数里，有个优化判断，如果所有任务都归fair调度类管，就直接走快速路径。

static inline struct task_struct *pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf){const struct sched_class *class;struct task_struct *p;/*	 * Optimization: we know that if all tasks are in the fair class we can	 * call that function directly, but only if the @prev task wasn't of a	 * higher scheduling class, because otherwise those loose the	 * opportunity to pull in more work from other CPUs.	 */	 /* likely告诉编译器该分支执行概率大，即系统中都是由CFS调度的进程  */if (likely((prev->sched_class == &idle_sched_class ||		    prev->sched_class == &fair_sched_class) &&		   rq->nr_running == rq->cfs.h_nr_running)) {		p = pick_next_task_fair(rq, prev, rf);if (unlikely(p == RETRY_TASK))goto restart;/* Assumes fair_sched_class->next == idle_sched_class */if (!p) {put_prev_task(rq, prev);			p = pick_next_task_idle(rq);		}return p;	}	…………	for_each_class(class) {		p = class->pick_next_task(rq);if (p)return p;	}/* The idle class should always have a runnable task: */BUG();}

Linux为了搞进程调度，整了好几个核心概念，我当年刚开始记的时候，总把它们弄混，后来干脆弄了个表，才算理清楚——调度类有5个，负责提供调度方法；调度实体有3个，是调度的基本单元，既能调度单个进程，也能调度一组进程；调度策略就是刚才说的6种，按实时和普通进程分；调度算法有4种，跟着调度策略走，啥策略对应啥算法。

这些概念的优先级的关系，我给大家捋了捋，从高到低来，看表格更清楚，当年我就是靠这个表格，才没把优先级搞反：

调度类	调度策略	调度实体	调度算法
stop_sched_class	无	无	无
dl_sched_class	SCHED_DEADLINE	sched_dl_entity	EDF
rt_sched_class	SCHED_RR/SCHED_FIFO	sched_rt_entity	RR/FIFO
fair_sched_class	SCHED_NORMAL/SCHED_BATCH	sched_entity	CFS
idle_sched_class	无	无	无

咱们今天聊的CFS调度，其实就是fair_sched_class调度类管的事儿，对应SCHED_NORMAL和SCHED_BATCH两种策略，调度实体就是sched_entity，说穿了，就是普通进程的“管家”。

一、CFS调度类

聊到CFS，就绕不开它的调度类fair_sched_class，这玩意儿说白了就是给CFS调度提供各种方法的“工具包”，我当年第一次看它的源码，光记这些方法就头大了好几天——毕竟方法太多，每个方法的作用还不一样，记错一个，理解就全偏了。如下表，大伙不用死记硬背，大概有个印象就行，后续用到的时候再回头查：

方法	描述
enqueue_task_fair	将进程对应的调度实体添加到运行队列红黑树中
dequeue_task_fair	将进程对应的调度实体从运行队列红黑树中删除
yield_task_fair	sched_yield系统调用，将当前进程放入运行队列skip成员中
check_preempt_wakeup	检查当前进程是否可被新进程抢占
__pick_next_task_fair	在调度类中选择下一个要执行的进程
put_prev_task_fair	更新进程vruntime，再将进程放回运行队列，与set_next_task配合使用
set_next_task_fair	将进程放从运行队列删除，并作为当前运行实体，与put_prev_task配合使用
task_tick_fair	scheduler_tick定时器调用，CFS调度算法调用update_curr计算vruntime，可能触发调度
task_fork_fair	do_fork调用，CFS调度算法初始化vruntime

对应的源码大家可以对照着看，我当年就是一边看源码，一边对照表格，才搞懂每个方法的作用，建议大家也这么做，比光看文字管用多了：

/* * All the scheduling class methods: */const struct sched_class fair_sched_class = {	.next			= &idle_sched_class, /* 下一个调度类idle */	.enqueue_task		= enqueue_task_fair, /* 进程加入运行队列 */	.dequeue_task		= dequeue_task_fair,/* 进程退出运行队列 */	.yield_task		= yield_task_fair,/* 进程主动让出CPU，进程退出运行队列后再加入运行队列尾 */	.yield_to_task		= yield_to_task_fair,	.check_preempt_curr	= check_preempt_wakeup,/* 检查当前进程是否可被新进程抢占 */	.pick_next_task		= __pick_next_task_fair,/* 在调度类中选择下一个要执行的进程 */	.put_prev_task		= put_prev_task_fair,/* 将进程放回运行队列 */	.set_next_task          = set_next_task_fair,#ifdef CONFIG_SMP	.balance		= balance_fair,	.select_task_rq		= select_task_rq_fair,/* 返回进程运行队列CPU number */ 	.migrate_task_rq	= migrate_task_rq_fair,/* 进程迁移到指定CPU，set_task_cpu调用 */	.rq_online		= rq_online_fair,/* 启动运行队列 */	.rq_offline		= rq_offline_fair,/* 禁止运行队列 */	.task_dead		= task_dead_fair,/* 进程切换时前一个进程状态为TASK_DEAD时调用 */	.set_cpus_allowed	= set_cpus_allowed_common,#endif	.task_tick		= task_tick_fair,/*scheduler_tick定时器调用，CFS调度算法调用update_curr计算vruntime*/	.task_fork		= task_fork_fair,/* do_fork调用，CFS调度算法初始化vruntime *//* __sched_setscheduler调用check_class_changed，进程切换调度策略时触发切换调度类以及优先级的变化 */	.prio_changed		= prio_changed_fair,	.switched_from		= switched_from_fair,	.switched_to		= switched_to_fair,/* 系统调用sched_rr_get_interval，返回 round robin time，非RR调度类返回0 */	.get_rr_interval	= get_rr_interval_fair,	.update_curr		= update_curr_fair,/* 更新当前进程运行时间 */#ifdef CONFIG_FAIR_GROUP_SCHED	.task_change_group	= task_change_group_fair,#endif#ifdef CONFIG_UCLAMP_TASK	.uclamp_enabled		= 1,#endif}

二、CFS运行队列

内核给每个CPU都分配了一个运行队列rq，也就是this_rq()能拿到的那个，每个rq里都藏着一个CFS运行队列cfs_rq，专门管普通进程的调度。说真的，我当年调试一个CPU占用过高的bug时，就是盯着这个cfs_rq的成员看，看来看去，才发现是nr_running数值异常，导致调度混乱，折腾了整整一个通宵。

cfs_rq里的那些成员，重点记几个就行，比如min_vruntime和tasks_timeline，这俩是CFS的核心，记不住它们，基本就没搞懂CFS：

成员	描述
load	运行队列所有调度实体总负载。se入队时update_load_add增加负载；se出队时通过update_load_sub减去负载
nr_running	CFS运行队列调度实体数量，se入队时加1，se出队时减1
h_nr_running	CFS运行队列调度实体数量
idle_h_nr_running	记录idle调度实体数量
min_vruntime	记录 CFS运行队列红黑树中最小的vruntime值，保持单调递增。该值非常重要，在对唤醒进程和fork进程vruntime做补偿时使用
tasks_timeline	CFS运行队列红黑树根，所有调度实体都依据se->vruntime(红黑树的KEY)大小加入到红黑树接受调度
curr	记录CFS运行队列中当前正在运行的se
next	记录CFS运行队列中急需运行的se，wakeup唤醒进程时可能将被唤醒的进程赋值给next。pick_next_entity时会优先选择cfs_rq->next
last	记录CFS运行队列中当前运行的se，与curr不同，curr一定会记录当前运行se，而last只会记录执行wakeup操作的se。pick_next_entity时会次优先选择cfs_rq->last。这样有利于重复利用cache
skip	记录CFS运行队列中跳过运行的se，系统调用sched_yield会将当前实体赋值到cfs_rq->skip。pick_next_entity时如发现选择的se是cfs_rq->skip时，会重新选择se

对应的结构体源码也放这儿，建议大家这个结构体打印出来，贴在显示器旁边，随时看，看多了自然就记住了：

/* CFS-related fields in a runqueue */struct cfs_rq {struct load_weight	load; /* CFS运行队列所有调度实体总负载 */unsigned long		runnable_weight;unsigned int		nr_running; /* CFS运行队列调度实体数量 */unsigned int		h_nr_running;   /* SCHED_{NORMAL,BATCH,IDLE} */unsigned int		idle_h_nr_running; /* SCHED_IDLE */	u64			exec_clock;	u64			min_vruntime;/* CFS运行队列红黑树中最小的vruntime值 */#ifndef CONFIG_64BIT	u64			min_vruntime_copy;#endifstruct rb_root_cached	tasks_timeline;/* CFS运行队列红黑树根 *//*	 * 'curr' points to currently running entity on this cfs_rq.	 * It is set to NULL otherwise (i.e when none are currently running).	 */struct sched_entity	*curr;/* 记录CFS运行队列中当前正在运行的se */struct sched_entity	*next;/* 记录CFS运行队列中被wakeup的se */struct sched_entity	*last;/* 记录CFS运行队列中执行wakeup的se */struct sched_entity	*skip;/* 记录CFS运行队列需要跳过的se */#ifdef	CONFIG_SCHED_DEBUGunsigned int		nr_spread_over;#endif#ifdef CONFIG_SMP/*	 * CFS load tracking	 */struct sched_avg	avg;#ifndef CONFIG_64BIT	u64			load_last_update_time_copy;#endifstruct {raw_spinlock_t	lock ____cacheline_aligned;int		nr;unsigned long	load_avg;unsigned long	util_avg;unsigned long	runnable_sum;	} removed;#ifdef CONFIG_FAIR_GROUP_SCHEDunsigned long		tg_load_avg_contrib;long			propagate;long			prop_runnable_sum;/*	 *   h_load = weight * f(tg)	 *	 * Where f(tg) is the recursive weight fraction assigned to	 * this group.	 */unsigned long		h_load;	u64			last_h_load_update;struct sched_entity	*h_load_next;#endif/* CONFIG_FAIR_GROUP_SCHED */#endif/* CONFIG_SMP */#ifdef CONFIG_FAIR_GROUP_SCHEDstruct rq		*rq;	/* CPU runqueue to which this cfs_rq is attached *//*	 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities	 * (like users, containers etc.)	 *	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a CPU.	 * This list is used during load balance.	 */int			on_list;struct list_head	leaf_cfs_rq_list;struct task_group	*tg;	/* group that "owns" this runqueue */	…………#endif/* CONFIG_FAIR_GROUP_SCHED */}

三、CFS调度实体

每个进程的描述符task_struct里，都藏着一个sched_entity成员，这就是调度实体——说穿了，CFS调度的不是进程本身，而是这个调度实体。它可以是一个进程，也可以是一个调度组，当年我就是没搞懂这一点，误以为CFS直接调度进程，绕了好大一圈弯路。

struct sched_entity {/* For load-balancing: */struct load_weight		load; /* 调度权重 */unsigned long			runnable_weight;struct rb_node			run_node;/* 调度实体在cfs_rq红黑树的节点 */struct list_head		group_node;unsigned int			on_rq;/* 调度实体是否在调度队列中接受调度 */	u64				exec_start;	u64				sum_exec_runtime;/* 当前累计运行实际时间 */	u64				vruntime;/* 虚拟运行时间 */	u64				prev_sum_exec_runtime;/* 上次累计运行实际时间 */	u64				nr_migrations;struct sched_statistics		statistics;#ifdef CONFIG_FAIR_GROUP_SCHEDint				depth;struct sched_entity		*parent;/* rq on which this entity is (to be) queued: */struct cfs_rq			*cfs_rq;/* rq "owned" by this entity/group: */struct cfs_rq			*my_q;#endif#ifdef CONFIG_SMP/*	 * Per entity load average tracking.	 *	 * Put into separate cache line so it does not	 * collide with read-mostly values above.	 */struct sched_avg		avg;/* 进程负载 */#endif}

聊完调度实体，就得说说调度优先级了，这是影响进程抢占的关键。task_struct里有几个相关的成员，我给大家重点说下。

int				prio;                    /* 动态优先级 */int				static_prio;             /* 静态优先级 */int				normal_prio;             /* 普通优先级 */unsigned int			rt_priority;     /* 实时优先级 */

四、调度优先级

内核用0到139表示进程优先级，记住了，数值越低，优先级越高，别搞反了！

其中，0到99是给实时进程用的，100到139是给普通进程用的。MAX_RT_PRIO宏是最大实时进程优先级，数值是100；DEFAULT_PRIO宏是默认普通进程优先级，数值是120，也就是说，咱们新建一个普通进程，默认优先级就是120，也可以用nice系统调用来改。

说到nice，大家可得注意，它的取值范围是-20到19，负值会让进程优先级变高，正值会让优先级变低，别搞反了哦。实时进程的优先级范围是0到MAX_PRIO-1，也就是0到99，普通进程是MAX_RT_PRIO到MAX_PRIO-1，也就是100到139，这些数值，记不住没关系，用到的时候查源码就行，源码里写得明明白白：

#define MAX_NICE	19#define MIN_NICE	-20#define NICE_WIDTH	(MAX_NICE - MIN_NICE + 1)/* * Priority of a process goes from 0..MAX_PRIO-1, valid RT * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH * tasks are in the range MAX_RT_PRIO..MAX_PRIO-1. Priority * values are inverted: lower p->prio value means higher priority. * * The MAX_USER_RT_PRIO value allows the actual maximum * RT priority to be separate from the value exported to * user-space.  This allows kernel threads to set their * priority to a value higher than any user task. Note: * MAX_RT_PRIO must not be smaller than MAX_USER_RT_PRIO. */#define MAX_USER_RT_PRIO	100#define MAX_RT_PRIO		MAX_USER_RT_PRIO#define MAX_PRIO		(MAX_RT_PRIO + NICE_WIDTH)#define DEFAULT_PRIO		(MAX_RT_PRIO + NICE_WIDTH / 2)

五、调度权重

调度实体里有个load成员，也就是struct load_weight load，这就是调度权重，它可是CFS实现“公平”的关键——毕竟，优先级最终都会转换成权重，权重越高，进程能拿到的CPU时间就越多。

struct load_weight {unsigned long			weight; /* 调度实体的权重 */	u32				inv_weight; /* 调度实体的权重的中间计算值 */};

关于调度权重，有几点我得跟大家说清楚，这些都是当年我踩过坑才记住的。普通进程的优先级是通过nice值确定的，默认优先级120，nice值范围-20到19，对应40个优先级等级，nice值越高，优先级越低，权重也越低，反之亦然。

还有个小规律，nice值每增加1，进程能拿到的CPU时间，就比nice为0的时候少10%；nice值每减少1，就多10%。内核为了计算方便，把nice为0时的权重定为1024，其他nice值对应的权重，不用实时计算，直接查表就行，也就是内核里的全局变量sched_prio_to_weight。

这个数组里，相邻两个值大概是1.25倍的关系，比如nice为0是1024，nice为1就是820，820大概是1024的0.8倍，刚好对应10%的减少，是不是很巧妙？另外，内核还有个sched_prio_to_wmult数组，存的是2^32除以sched_prio_to_weight的值，提前算好，就是为了提高计算效率，不用每次都做除法——毕竟内核里的计算，能省一点是一点。

给大家举个直观的例子，进程A和B，创建时nice值都是0，权重都是1024，那它俩各占50%的CPU时间，也就是1024/(1024+1024)=50%。如果把B的nice值改成1，那B的权重就变成820，此时A占55%左右，B占45%，刚好少了10%，是不是很准？

这两个数组的源码如下，大家看一眼就行，不用死记硬背，知道有这么个东西就行：

/* * Nice levels are multiplicative, with a gentle 10% change for every * nice level changed. I.e. when a CPU-bound task goes from nice 0 to * nice 1, it will get ~10% less CPU time than another CPU-bound task * that remained on nice 0. * * The "10% effect" is relative and cumulative: from _any_ nice level, * if you go up 1 level, it's -10% CPU usage, if you go down 1 level * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25. * If a task goes up by ~10% and another task goes down by ~10% then * the relative distance between them is ~25%.) */const int sched_prio_to_weight[40] = { /* -20 */     88761,     71755,     56483,     46273,     36291, /* -15 */     29154,     23254,     18705,     14949,     11916, /* -10 */      9548,      7620,      6100,      4904,      3906, /*  -5 */      3121,      2501,      1991,      1586,      1277, /*   0 */      1024,       820,       655,       526,       423, /*   5 */       335,       272,       215,       172,       137, /*  10 */       110,        87,        70,        56,        45, /*  15 */        36,        29,        23,        18,        15,};/* * Inverse (2^32/x) values of the sched_prio_to_weight[] array, precalculated. * * In cases where the weight does not change often, we can use the * precalculated inverse to speed up arithmetics by turning divisions * into multiplications: */const u32 sched_prio_to_wmult[40] = { /* -20 */     48388,     59856,     76040,     92818,    118348, /* -15 */    147320,    184698,    229616,    287308,    360437, /* -10 */    449829,    563644,    704093,    875809,   1099582, /*  -5 */   1376151,   1717300,   2157191,   2708050,   3363326, /*   0 */   4194304,   5237765,   6557202,   8165337,  10153587, /*   5 */  12820798,  15790321,  19976592,  24970740,  31350126, /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717, /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,};

这两个数组的关系，其实就是sched_prio_to_wmult[i] = 2^32 / sched_prio_to_weight[i]，内核提前算好，就是为了避免除法运算，提高效率。内核里还有个set_load_weight函数，专门用来设置调度实体的load，也就是p->se.load：

staticvoidset_load_weight(struct task_struct *p, bool update_load){int prio = p->static_prio - MAX_RT_PRIO;struct load_weight *load = &p->se.load;/*	 * SCHED_IDLE tasks get minimal weight:	 */if (task_has_idle_policy(p)) {		load->weight = scale_load(WEIGHT_IDLEPRIO);		load->inv_weight = WMULT_IDLEPRIO;		p->se.runnable_weight = load->weight;return;	}/*	 * SCHED_OTHER tasks have to update their load when changing their	 * weight	 */if (update_load && p->sched_class == &fair_sched_class) {		reweight_task(p, prio);	} else {		load->weight = scale_load(sched_prio_to_weight[prio]);		load->inv_weight = sched_prio_to_wmult[prio];		p->se.runnable_weight = load->weight;	}}

六、虚拟运行时间的计算

为啥内核要搞个sched_prio_to_wmult数组？

这就得说到CFS调度器的核心——虚拟运行时间的计算了。CFS没有用传统的时间片调度，而是靠权重计算虚拟运行时间，来实现“完全公平”，这也是它叫“完全公平调度器”的原因。

简单说，每个进程的虚拟运行时间，是相对于nice为0时的权重比例值。nice值小的进程，权重大，虚拟运行时间比真实运行时间慢，这样就能拿到更多的CPU运行时间——毕竟，权重高，说明进程更重要，理应多占用CPU嘛。

虚拟运行时间的计算公式是：vruntime = delta_exec * (NICE_0_LOAD / weight)，其中delta_exec是实际运行时间，NICE_0_LOAD是nice为0时的权重1024，weight是当前进程的权重。

但你想啊，内核里做除法多费时间，所以内核就把公式改了一下，变成vruntime = delta_exec * (NICE_0_LOAD * inv_weight) / 2^32，而inv_weight就是sched_prio_to_wmult数组里的值，也就是2^32 / weight。这么一改，除法就变成了乘法和位移运算，效率一下子就提上来了，是不是很聪明？

最终的公式就是：vruntime = delta_exec * (NICE_0_LOAD * sched_prio_to_wmult[i]) >> 32，内核就是这么实现的：

static u64 __calc_delta(u64 delta_exec, unsigned long weight, struct load_weight *lw){	u64 fact = scale_load_down(weight);/* weight=NICE_0_LOAD */int shift = WMULT_SHIFT;	__update_inv_weight(lw);if (unlikely(fact >> 32)) {while (fact >> 32) {			fact >>= 1;			shift--;		}	}/* fact=NICE_0_LOAD*inv_weight */	fact = mul_u32_u32(fact, lw->inv_weight);while (fact >> 32) {		fact >>= 1;		shift--;	}/* delta_exec*=fact >> 32 */return mul_u64_u32_shr(delta_exec, fact, shift);}staticinline u64 calc_delta_fair(u64 delta, struct sched_entity *se){if (unlikely(se->load.weight != NICE_0_LOAD))		delta = __calc_delta(delta, NICE_0_LOAD, &se->load);/* 如果调度实体权重为NICE_0_LOAD，则其虚拟运行时间等于实际运行时间 */return delta;}

很明显，当nice值为0时，进程的权重就是NICE_0_LOAD，这时候虚拟运行时间就等于实际运行时间，不用做额外计算，内核直接返回delta就行，省了不少事。

七、虚拟运行时间的更新

虚拟运行时间不是一成不变的，进程一直在运行，它的vruntime就得一直更新，这个更新工作，就是由update_curr函数完成的。我当年调试的时候，就是通过打印这个函数里的delta_exec和vruntime，才找到进程调度异常的原因，这个函数可是个“调试神器”。

update_curr的工作流程其实很简单，首先它的入参是当前进程的CFS运行队列cfs_rq；然后用rq_clock_task获取当前运行队列的clock_task值，这个值是由update_rq_clock_task更新的；接着计算delta_exec，也就是上次调用update_curr到现在的时间差；再调用calc_delta_fair计算vruntime的增量；最后调用update_min_vruntime更新cfs_rq->min_vruntime，这个值是CFS运行队列里最小的vruntime，特别重要。

/* * Update the current task's runtime statistics. */staticvoidupdate_curr(struct cfs_rq *cfs_rq){struct sched_entity *curr = cfs_rq->curr; /* 当前进程调度实体 */	u64 now = rq_clock_task(rq_of(cfs_rq)); /* 运行队列clock_task值 */	u64 delta_exec;if (unlikely(!curr))return;/* delta_exec计算该进程上次调用update_curr到现在的时间差 */	delta_exec = now - curr->exec_start;if (unlikely((s64)delta_exec <= 0))return;/* curr->exec_start记录本次调用的时间 */	curr->exec_start = now;	schedstat_set(curr->statistics.exec_max,		      max(delta_exec, curr->statistics.exec_max));/* curr->sum_exec_runtime记录进程占用CPU的总时间 */	curr->sum_exec_runtime += delta_exec;	schedstat_add(cfs_rq->exec_clock, delta_exec);/* 利用delta_exec实际运行时间的增量计算vruntime增量 */	curr->vruntime += calc_delta_fair(delta_exec, curr);/* 更新cfs_rq->min_vruntime */	update_min_vruntime(cfs_rq);if (entity_is_task(curr)) {struct task_struct *curtask = task_of(curr);		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);		cgroup_account_cputime(curtask, delta_exec);		account_group_exec_runtime(curtask, delta_exec);	}	account_cfs_rq_runtime(cfs_rq, delta_exec);}

update_min_vruntime更新cfs_rq->min_vruntime的原则很简单，就是记录运行队列里最小的虚拟运行时间，而且这个值是单调递增的，不会倒退——毕竟，时间哪能倒流呢？

它工作流程也不复杂，先把vruntime赋值为上次记录的min_vruntime；如果curr调度实体在运行队列里，就把vruntime更新为curr的vruntime；然后，如果红黑树有最左节点（也就是vruntime最小的调度实体），就比较这个节点的vruntime和当前的vruntime，取最小值；最后，把cfs_rq->min_vruntime更新为这个最小值和原来的min_vruntime中的最大值，确保它不会倒退。

staticvoidupdate_min_vruntime(struct cfs_rq *cfs_rq){struct sched_entity *curr = cfs_rq->curr; /* 当前进程调度实体 */struct rb_node *leftmost = rb_first_cached(&cfs_rq->tasks_timeline);/* CFS运行队列红黑树最左子节点 *//* vruntime赋值为上次记录的min_vruntime值 */	u64 vruntime = cfs_rq->min_vruntime;if (curr) {if (curr->on_rq)			vruntime = curr->vruntime;/* curr实体在运行队列中，则更新vruntime值 */else			curr = NULL;	}if (leftmost) { /* non-empty tree */struct sched_entity *se;/* 获取CFS运行队列红黑树最左节点调度实体 */		se = rb_entry(leftmost, struct sched_entity, run_node);if (!curr)/* 若当前调度实体为空，则vruntime为CFS运行队列红黑树最左节点调度实体的vruntime */			vruntime = se->vruntime;else/* 若当前调度实体不为空，则vruntime为CFS运行队列红黑树最左节点调度实体的vruntime和当前vruntime值中的最小值 */			vruntime = min_vruntime(vruntime, se->vruntime);	}/* ensure we never gain time by being placed backwards. *//* cfs_rq->min_vruntime是vruntime和cfs_rq->min_vruntime中的最大值 */	cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime, vruntime);#ifndef CONFIG_64BITsmp_wmb();	cfs_rq->min_vruntime_copy = cfs_rq->min_vruntime;#endif}

那vruntime什么时候会被更新呢？

其实内核里有很多地方会调用update_curr，我给大家举几个常见的：进程创建时，do_fork会调用sched_fork，再调用task_fork_fair，最后调用update_curr；内核有个定时器任务scheduler_tick，会定时调用调度类的task_tick方法，对于CFS来说，就是task_tick_fair，这个方法会调用entity_tick，进而调用update_curr，这个是周期性的更新；还有，调度实体加入CFS运行队列时，enqueue_task_fair也会调用update_curr。

八、虚拟运行时间的使用和分配

说了这么多，vruntime到底是怎么用的呢？

其实很简单，它有两个核心作用：一是判断进程是否需要被抢占，二是作为红黑树的key，决定进程的调度顺序。

先说说抢占的判断。进程一直在运行，vruntime也一直在增加，当它的vruntime超过一定额度后，内核就会设置进程thread_info里的TIF_NEED_RESCHED标志，触发延时调度schedule，这样就能让其他进程有机会运行，实现公平。

vruntime越小的进程，在红黑树里就越靠左，pick_next_task的时候就优先被选中。

内核的定时器scheduler_tick会周期性地调用task_tick_fair，然后走到check_preempt_tick：

staticvoidcheck_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr){    ideal_runtime = sched_slice(cfs_rq, curr);  // 理论该跑多久    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;    // 如果实际跑的时间超过理论时间，就触发调度    if (delta_exec > ideal_runtime) {        resched_curr(rq_of(cfs_rq));        return;    }    // 跑到这儿，说明还没超时，但得看看有没有vruntime更小的进程在排队    se = __pick_first_entity(cfs_rq);    delta = curr->vruntime - se->vruntime;    if (delta > ideal_runtime)        resched_curr(rq_of(cfs_rq));}

说白了就是：你跑够了理论时间，或者有比你更“饿”的进程在等着，就得让位。

九、新进程和刚醒来的进程，怎么处理vruntime？

你想想啊，一个进程睡了很久刚被唤醒，或者一个新创建的进程，它们的vruntime可能早就过时了。如果直接用旧的vruntime，那它就会获得“报复性”的CPU占用——因为它的vruntime太小了，红黑树里排在最左边。

所以内核会做点手脚：

唤醒的进程：入队时vruntime加上cfs_rq->min_vruntime，再减掉sched_latency/2作为补偿
新创建的进程：vruntime设为父进程的值，加上一个调度周期的虚拟时间作为“惩罚”，再减掉min_vruntime

staticvoidplace_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial){    u64 vruntime = cfs_rq->min_vruntime;    if (initial)  // fork出来的        vruntime += sched_vslice(cfs_rq, se);    else          // 唤醒的        vruntime -= sysctl_sched_latency / 2;    se->vruntime = max_vruntime(se->vruntime, vruntime);}

为啥要这么折腾？就是为了防止“睡美人”现象——一个睡了很久的进程醒来后疯狂抢占CPU，把其他进程饿死。

十、CFS入队：把进程塞进红黑树

enqueue_task_fair负责入队，核心是__enqueue_entity：

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se){    struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;    bool leftmost = true;    while (*link) {        parent = *link;        entry = rb_entry(parent, struct sched_entity, run_node);        if (entity_before(se, entry)) {            link = &parent->rb_left;        } else {            link = &parent->rb_right;            leftmost = false;  // 往右走就不可能是最左节点了        }    }    rb_link_node(&se->run_node, parent, link);    rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);}

你看，就是标准的红黑树插入——以vruntime为key，小的在左，大的在右。

十一、CFS出队：从红黑树里删掉

dequeue_entity负责出队，简单粗暴：

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se){    rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);}

就是调用红黑树的删除函数罢了。

十二、CFS选择下一个进程

这是最能体现CFS“公平”的地方了。pick_next_task_fair的核心三板斧：

第一步：put_prev_task —— 把当前进程放回队列

staticvoidput_prev_entity(struct cfs_rq *cfs_rq, struct sched_entity *prev){    if (prev->on_rq) {        update_curr(cfs_rq);        __enqueue_entity(cfs_rq, prev);  // 放回红黑树    }    cfs_rq->curr = NULL;}

第二步：pick_next_entity —— 从队列里挑一个

优先顺序：next > last > 红黑树最左节点，但要跳过skip标记的进程。

static struct sched_entity * pick_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *curr){    struct sched_entity *left = __pick_first_entity(cfs_rq);    struct sched_entity *se = left;    if (cfs_rq->skip == se) {        // 重选，避开skip    }    if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)        se = cfs_rq->last;    if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)        se = cfs_rq->next;    clear_buddies(cfs_rq, se);    return se;}

第三步：set_next_entity —— 让选中的进程上岗

staticvoidset_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se){    if (se->on_rq) {        __dequeue_entity(cfs_rq, se);  // 从红黑树移除    }    cfs_rq->curr = se;    se->prev_sum_exec_runtime = se->sum_exec_runtime;  // 记录开始时间}

十三、进程唤醒

wake_up_process或者wake_up_new_task调用时，会检查新进程能不能抢占当前进程：

staticvoidcheck_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_flags){    if (wakeup_preempt_entity(se, pse) == 1) {        set_next_buddy(pse);        goto preempt;    }    return;preempt:    resched_curr(rq);  // 触发调度    if (sched_feat(LAST_BUDDY) && scale && entity_is_task(se))        set_last_buddy(se);  // 记录last，方便下次优先选择}

wakeup_preempt_entity干了什么？就是比较两个实体的vruntime，如果差距大于调度唤醒粒度（默认1ms换算成虚拟时间），就允许抢占。

说实话，CFS的设计挺巧妙的——用vruntime这把标尺衡量公平，用红黑树高效组织进程，还照顾到sleep进程和fork进程的特殊情况。但公平这事儿啊，永远都是相对的。就像咱们现实生活里，哪有什么绝对的公平？只不过是在各方利益之间，找到一个能被大多数人接受的平衡点罢了。

内核也是这样，在各种进程之间来回权衡，最终给出一个“看起来还算公平”的调度方案。

好了，这篇文章就到这儿吧。如果你在实际开发中遇到过因为调度导致的问题，欢迎来交流——我写过的Bug，可能比你想象的要多得多。

本文来自网友投稿或网络内容，如有侵犯您的权益请联系我们删除，联系邮箱：wyl860211@qq.com 。

一、CFS调度类

二、CFS运行队列

三、CFS调度实体

四、调度优先级

五、调度权重

六、虚拟运行时间的计算

七、虚拟运行时间的更新

八、虚拟运行时间的使用和分配

九、新进程和刚醒来的进程，怎么处理vruntime？

十、CFS入队：把进程塞进红黑树

十一、CFS出队：从红黑树里删掉

十二、CFS选择下一个进程

十三、进程唤醒

深入浅出 Linux 内核(进程篇):CFS调度

最新文章

热门文章

随机文章

深入浅出 Linux 内核(进程篇):CFS调度

一、CFS调度类

二、CFS运行队列

三、CFS调度实体

四、调度优先级

五、调度权重

六、虚拟运行时间的计算

七、虚拟运行时间的更新

八、虚拟运行时间的使用和分配

九、新进程和刚醒来的进程，怎么处理vruntime？

十、CFS入队：把进程塞进红黑树

十一、CFS出队：从红黑树里删掉

十二、CFS选择下一个进程

十三、进程唤醒

【第09期】Linux下的搜索命令:find/xargs/locate/whereis/which/whatis 详解

4.8mm薄过iPad!荣耀新平板自带Linux,3499元开启“安全养虾”时代

最新文章

热门文章

随机文章