Why is MemAvailable sometimes less than MemFree?, https://blogs.oracle.com/linux/memavailable-less-than-memfree
[PATCH 00/14] mm: balance LRU lists based on relative thrashing v2

⚠️ WARNING: System PROD01 swap utilization > 90% ⚠️
Swap is often misunderstood, and the way swap is used in modern kernels may cause some confusion. This blog will briefly cover how swap usage is now more of an optimization than an emergency, and will also answer some frequently asked questions about swap.
Terms, processes and settings
MemFree – As viewed in /proc/meminfo, this is the amount of memory that is not being used for anything at all. While it may be tempting to see a large number here, a large MemFree is really a waste of RAM resources.
MemAvailable – This is the amount of memory that is available for programs to use, which includes free memory as well as easily reclaimed cache (minus reserved buffers). This is also present in /proc/meminfo.
SwapFree – Idle process memory pages which have been evicted from RAM reside on the swap disk. SwapFree is the amount of free/unused space left on the swap disk. Also present in /proc/meminfo.
kswapd – The kernel thread that handles memory reclaim and swapping in the background. It evicts least recently used (LRU) pages from RAM – whether that's from the page cache or process memory – based on a number of complex factors.
kcompactd – Another kernel thread that consolidates smaller, free memory chunks into larger, physically contiguous chunks, and reduces memory fragmentation.
Low watermark – This is the threshold that controls when kswapd is woken up, to do background reclaim. This threshold can be found in /proc/zoneinfo.
Min watermark – This is the threshold that controls when the allocating process itself will block during allocation, to reclaim memory (also called direct reclaim). This can also be found in /proc/zoneinfo.
High watermark – This is the threshold that controls when kswapd goes back to sleep, and memory reclaim is considered successful. This threshold can be found in /proc/zoneinfo.
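As a quick sanity check, the meminfo fields above can be read directly (a minimal sketch, assuming a standard Linux /proc filesystem):

```shell
# Read MemFree, MemAvailable and SwapFree (in kB) from /proc/meminfo
memfree=$(awk '$1 == "MemFree:" {print $2}' /proc/meminfo)
memavail=$(awk '$1 == "MemAvailable:" {print $2}' /proc/meminfo)
swapfree=$(awk '$1 == "SwapFree:" {print $2}' /proc/meminfo)
echo "MemFree: ${memfree} kB, MemAvailable: ${memavail} kB, SwapFree: ${swapfree} kB"
```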
A brief introduction to MemFree, kswapd & background reclaim
After your system is booted, most of the memory is in MemFree. As programs allocate memory, MemFree decreases. The Linux kernel, by design, will use up most of the available memory in order to enhance system and application performance. What this means is that it's expected, normal and healthy for MemFree to be low. Most free memory is used for caching – anything that can prevent a disk I/O is good for application performance.
In general, memory used for page cache is easily reclaimable. Since the data in the page cache is also (for the most part) available on disk, the pages can be easily reclaimed and reused in case of memory pressure (if they're clean).
When MemFree falls below the low watermark, kswapd is woken up and it tries to reclaim memory. It stops its efforts once free memory reaches the high watermark threshold.
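The watermarks are reported per zone; summing them gives a rough system-wide picture (a sketch, assuming the usual /proc/zoneinfo layout where each zone prints its min/low/high values in pages):

```shell
# Sum the per-zone min/low/high watermarks (in pages) from /proc/zoneinfo
read -r wmark_min wmark_low wmark_high <<EOF
$(awk '$1 == "min"  { m += $2 }
       $1 == "low"  { l += $2 }
       $1 == "high" { h += $2 }
       END { print m, l, h }' /proc/zoneinfo)
EOF
echo "watermarks (pages): min=${wmark_min} low=${wmark_low} high=${wmark_high}"
```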
If MemFree continues to fall below the min watermark, then not only will kswapd try to reclaim memory, but individual processes that are trying to allocate memory will also attempt to reclaim. This is when you might notice some application latencies, as each allocation request has to do some direct reclaim before it can be satisfied.
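Direct reclaim leaves a trace in /proc/vmstat: the allocstall_* counters record how often allocating processes had to stall and reclaim memory themselves. A minimal sketch:

```shell
# Total direct-reclaim stalls since boot (summed across zone types)
stalls=$(awk '/^allocstall/ { sum += $2 } END { print sum + 0 }' /proc/vmstat)
echo "direct reclaim stalls since boot: ${stalls}"
```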
The pages reclaimed in these flows can either be anonymous pages (i.e. process heap, stack) or file pages (i.e. page cache pages). Some kernel memory (e.g. slab cache) is also reclaimed here, but that is typically not where most gains come from, so we'll ignore slab cache shrinkers in this post.
Is swap necessary? Can I just disable swap?
Enabling swap space gives the system a way to evict pages that are not backed by disk. If there are allocation bursts, having swap enabled gives the system a little flexibility in deciding what pages to evict. If there is no swap but the system is under memory pressure, the only way to recover memory is to evict page cache pages. Even if there are idle processes with a lot of inactive anonymous memory allocated, they will not be evicted because they are not disk-backed – there's no backing store for those pages.
If, for instance, there are a lot of processes running on that system (and therefore significant anonymous memory usage), and moderate page cache usage, and there are sudden spikes in memory allocation requests, then the reclaim algorithm will be handicapped by the small amount of page cache. If the page cache also becomes hot (due to a backup application or other file-I/O-intensive application), then the system will start thrashing – evicting page cache pages only to read them back in again. This is unstable, and might lead to the Out of Memory (OOM) killer being invoked.
There is no need to have cold anonymous pages always in memory – enabling swap space lets the kernel choose between file pages and idle process memory to evict, based on the workload's memory access patterns. This leads to optimal performance for all applications on the system, as well as optimal resource (i.e. memory) utilization.
On the other hand, swapping during normal workload (on some systems) can increase latency for other processes. In most cases, this is insignificant. But for latency-sensitive or real-time applications, this could be a problem. Most workloads do not fall under this category.
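How readily the kernel swaps anonymous pages versus dropping file pages is biased by the vm.swappiness sysctl (it shows up again in get_scan_count(), discussed later in this post). A quick way to inspect it:

```shell
# vm.swappiness biases reclaim between anon pages (swap) and file pages;
# the range is 0..200 on recent kernels, with 100 meaning equal assumed IO cost
swappiness=$(cat /proc/sys/vm/swappiness)
echo "vm.swappiness = ${swappiness}"
```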
Why is the swap usage high even when my system is not actively swapping now? Shouldn't SwapFree increase?
SwapFree in /proc/meminfo is the amount of currently unused swap space. As uptime increases, SwapFree will usually decrease; if there was memory pressure in the past (whether that was one day ago or one month ago) that caused pages to be swapped out, those pages tend to remain in swap unless the process that owns them dies, or if the page is modified in memory (which makes the swap copy stale, and thus it will be discarded).
So SwapFree can be quite low due to historic swapping, even if the system's not under any memory pressure now. If you don't see active swapping (in the si and so columns of vmstat output), there is typically little ongoing swap I/O impact on the system, though reclaim latency can still exist for other reasons.
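If vmstat isn't handy, the same signal can be derived from the cumulative pswpin/pswpout counters in /proc/vmstat; only a growing delta indicates active swapping. A sketch:

```shell
# Two samples of swapped-in + swapped-out page counts, one second apart
swap_pages() {
    awk '$1 == "pswpin" || $1 == "pswpout" { s += $2 } END { print s + 0 }' /proc/vmstat
}
before=$(swap_pages)
sleep 1
after=$(swap_pages)
echo "swap pages moved in the last second: $((after - before))"
```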
If you see thrashing, i.e. pages are swapped out due to memory pressure, only to be swapped back in because they're actively referenced, that's an issue and should be addressed, possibly by balancing the workload on the system, or adding more RAM.
Why is my swap usage high even when I have plenty of free memory?
That depends on the watermarks. There might be a lot of free memory globally, but it might be unevenly split among the NUMA (Non-Uniform Memory Access) nodes, where one NUMA node is teetering on the edge of memory pressure and the other is under-utilizing its resources. If this happens, the NUMA node running low on memory could start swapping (depending on allocation policy, cpusets/mempolicy constraints, and zone reclaim behavior).
The NUMA imbalance situation is shown below with a sample of numastat data:
Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal               772243.23       774101.36      1546344.59
MemFree                  9418.90       106649.16       116068.06  <--
...
Active(file)            81925.87        56966.03       138891.89
Inactive(file)         245829.25       168282.12       414111.36
...
FilePages              333708.79       236806.75       570515.54
...

Meminfo:
zzz <11/27/2023 11:35:24> Count:0
MemTotal:       1583456860 kB
MemFree:         131671936 kB
...
Active(anon):     77319236 kB
Inactive(anon):    3020752 kB
Active(file):    140084028 kB
Inactive(file):  413750580 kB
...
SwapTotal:        25165820 kB
SwapFree:         23269628 kB
Here, NUMA node 0 has just ~9 GB free, whereas NUMA node 1 has over 100 GB free. This is a stark imbalance and node 0 pages could be swapped out if memory pressure worsened on node 0, despite the fact that MemFree is >125 GB (almost all of which is coming from node 1).
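Without the numastat tool, the same per-node imbalance can be spotted via sysfs (a sketch; it assumes the per-node meminfo files under /sys/devices/system/node exist, i.e. a NUMA-aware kernel):

```shell
# Per-NUMA-node free memory, straight from sysfs
node_free=$(for f in /sys/devices/system/node/node*/meminfo; do
    [ -r "$f" ] || continue
    # Lines look like: "Node 0 MemFree:  12345 kB"
    awk '$3 == "MemFree:" { printf "node%s: %d kB free\n", $2, $4 }' "$f"
done)
echo "$node_free"
```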
Here's another example, which is more drastic:
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal              1547405.20      1548190.00      3095595.20
MemFree                 13432.87       848495.85       861928.72  <--
...
Active(file)           864159.89        31523.51       895683.39  <--
Inactive(file)           9656.17         9037.02        18693.19
...
FilePages              881459.59        63041.10       944500.70
Mapped                   7441.12        23012.17        30453.29
AnonPages               25261.50        19255.40        44516.90
Shmem                    7627.10        22435.06        30062.16
...

Summary percentage report:
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemFree                    0.87%          54.81%          27.84%
MemUsed                   99.13%          45.19%          72.16%
Note: From a Numa.ExaWatcher report
Almost 850 GB free, but all on NUMA node 1. Node 0 will experience memory pressure, sooner rather than later, and the system has multiple strategies to deal with that:
Reclaim inactive page cache pages.
Demote active pages to inactive more aggressively (which then get reclaimed easily).
Allocate from a foreign node (i.e. node 1).
Swap inactive process memory out to disk.
Shrink slab caches.
Swapping is a small piece of the pie, and not the most important piece. All the other strategies are deployed as well. The goal is to reclaim as much memory as possible, as efficiently as possible.
Why is my system swapping even though there's plenty of page cache? Shouldn't it reclaim from the page cache first?
Short answer: it does both. Broadly speaking, the kernel reclaims memory from both file pages (i.e. page cache) and anonymous process pages (by evicting them to swap). It prefers reclaiming clean page cache pages, since those are very easy to reuse without I/O. But it does not wait to swap until the page cache is depleted – both sets of pages are reclaimed, although at different rates.
Here's the relevant kernel code snippet (this, and all other code references in this post, are from the 5.15 (i.e. Oracle UEK7) kernel):
/*
 * Determine how aggressively the anon and file LRU lists should be
 * scanned.  The relative value of each set of LRU lists is determined
 * by looking at the fraction of the pages scanned we did rotate back
 * onto the active list instead of evict.
 *
 * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
 * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
 */
static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
                           unsigned long *nr)
{
        ...
        /*
         * If there is enough inactive page cache, we do not reclaim
         * anything from the anonymous working right now.
         */
        if (sc->cache_trim_mode) {
                scan_balance = SCAN_FILE;
                goto out;
        }

        scan_balance = SCAN_FRACT;
        /*
         * Calculate the pressure balance between anon and file pages.
         *
         * The amount of pressure we put on each LRU is inversely
         * proportional to the cost of reclaiming each list, as
         * determined by the share of pages that are refaulting, times
         * the relative IO cost of bringing back a swapped out
         * anonymous page vs reloading a filesystem page (swappiness).
         *
         * Although we limit that influence to ensure no list gets
         * left behind completely: at least a third of the pressure is
         * applied, before swappiness.
         *
         * With swappiness at 100, anon and file have equal IO cost.
         */
        total_cost = sc->anon_cost + sc->file_cost;
        anon_cost = total_cost + sc->anon_cost;
        file_cost = total_cost + sc->file_cost;
        total_cost = anon_cost + file_cost;

        ap = swappiness * (total_cost + 1);
        ap /= anon_cost + 1;

        fp = (200 - swappiness) * (total_cost + 1);
        fp /= file_cost + 1;

        fraction[0] = ap;
        fraction[1] = fp;
        denominator = ap + fp;
If the number of inactive file pages is low, or if the active file pages have a high refault rate, the reclaim preference will tilt towards anon pages – basically whatever pages the system can reclaim with least cost paid in terms of performance, I/O cost, etc. There's no single trigger to start swapping.
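To make the fraction concrete, here is the same arithmetic as in get_scan_count() with made-up numbers (the sc->anon_cost and sc->file_cost values below are hypothetical; swappiness is the default 60):

```shell
# Mirror the get_scan_count() pressure balance with sample numbers
swappiness=60
sc_anon_cost=100          # hypothetical recent anon reclaim cost
sc_file_cost=300          # hypothetical file cost (e.g. refaulting cache)
total_cost=$((sc_anon_cost + sc_file_cost))
anon_cost=$((total_cost + sc_anon_cost))
file_cost=$((total_cost + sc_file_cost))
total_cost=$((anon_cost + file_cost))
ap=$(( swappiness * (total_cost + 1) / (anon_cost + 1) ))
fp=$(( (200 - swappiness) * (total_cost + 1) / (file_cost + 1) ))
denominator=$((ap + fp))
echo "scan pressure: anon ${ap}/${denominator}, file ${fp}/${denominator}"
```

Even with the file list costing three times as much as anon, roughly 63% of the scan pressure still lands on the file LRUs at the default swappiness; swapping only picks up as the file cost grows.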
Let's check the behavior on one system:
Before:
zzz <03/08/2022 00:00:02> Count:119
Cached:          40329468 kB
SwapFree:        25165820 kB

Page cache usage shoots up:
zzz <03/08/2022 01:50:23> Count:359
Cached:         225038676 kB
SwapFree:        17709564 kB
...
zzz <03/08/2022 02:50:34> Count:359
Cached:         226585112 kB
SwapFree:         8473852 kB
...
zzz <03/08/2022 03:20:42> Count:359
Cached:         245398456 kB
SwapFree:         7650044 kB
Here, we see page cache usage increase sharply between the hours of midnight and 2 AM, probably due to a backup application scheduled to run at that time. This can increase memory pressure on the system, and the system can start swapping if needed. Adding such backup processes to a memory-constrained cgroup will ensure that it does not consume all available memory on the system, and thus not affect other processes.
Related question #1: How do pages get demoted from the active LRU to the inactive LRU list?
Linux categorizes pages into two sets: anonymous pages, which are not file-backed (for instance, heap, stack, etc.) and file pages, which are pages in RAM that have a backing file on disk (for instance, libraries, data files, etc.).
These two categories are further divided into two lists: active and inactive, using the LRU (Least Recently Used) algorithm. The active list contains pages that have been recently referenced, and the inactive LRU list contains pages that have not been accessed in a while. If a page on the inactive LRU list is accessed, it gets 'promoted' to the active list. Memory reclaim favors pages from the inactive LRU lists, as it's not optimal to evict pages actively in use. Similarly, it's easier to evict clean file-backed pages since they're already up to date on the disk, as opposed to reclaiming dirty file-backed pages (which would need to be written out first) or anonymous pages (which also need to be written out, but to swap).
Function shrink_active_list() moves pages from the active to the inactive LRU list; shrink_active_list() is called by kswapd during page reclaim. It is typically called when the inactive list is too small – by deactivating pages, it provides enough candidate pages for the reclaim algorithm to make progress.
It scans a batch of pages (denoted by nr_to_scan) at the tail of the active list, and if they can be demoted, it moves them to the inactive list. If they cannot be deactivated, the pages are rotated back to the head of the active list – this gives the page an extra turn around the active list, before it's checked again for demotion.
static void shrink_active_list(unsigned long nr_to_scan,
                               struct lruvec *lruvec,
                               struct scan_control *sc,
                               enum lru_list lru)
{
        ...
        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
                                     &nr_scanned, sc, lru);
        ...
        while (!list_empty(&l_hold)) {
                ...
                if (page_referenced(page, 0, sc->target_mem_cgroup,
                                    &vm_flags)) {
                        /*
                         * Identify referenced, file-backed active pages and
                         * give them one more trip around the active list. So
                         * that executable code get better chances to stay in
                         * memory under moderate memory pressure.  Anon pages
                         * are not likely to be evicted by use-once streaming
                         * IO, plus JVM can create lots of anon VM_EXEC pages,
                         * so we ignore them here.
                         */
                        if ((vm_flags & VM_EXEC) && page_is_file_lru(page)) {
                                nr_rotated += thp_nr_pages(page);
                                list_add(&page->lru, &l_active);
                                continue;
                        }
                }

                ClearPageActive(page);  /* we are de-activating */
                SetPageWorkingset(page);
                list_add(&page->lru, &l_inactive);
        }

        /*
         * Move pages back to the lru list.
         */
        nr_activate = move_pages_to_lru(&l_active);
        nr_deactivate = move_pages_to_lru(&l_inactive);
        ...
        __count_vm_events(PGDEACTIVATE, nr_deactivate);
        __count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);

        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
        ...
The counter that tracks how many pages were deactivated (i.e. pgdeactivate) can be read from /proc/vmstat. It's a cumulative, global counter that tracks all deactivations for the duration of the system's uptime. If you monitor this file and see pgdeactivate going up, it's a sign that the system is shrinking active lists due to lack of sufficient inactive pages to reclaim.
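Since pgdeactivate is cumulative, what matters is its rate of change. A minimal sampling sketch:

```shell
# One-second delta of the cumulative pgdeactivate counter
pgde() { awk '$1 == "pgdeactivate" { print $2 }' /proc/vmstat; }
d1=$(pgde)
sleep 1
d2=$(pgde)
echo "pages deactivated in the last second: $((d2 - d1))"
```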
Let's look at inactive LRU list handling – how do pages get reclaimed from the inactive list? The core function that implements this logic is shrink_page_list(). It's called here:
static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
                struct lruvec *lruvec, struct scan_control *sc,
                enum lru_list lru)
{
        ...
        nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false);
        ...
        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
        item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
        if (!cgroup_reclaim(sc))
                __count_vm_events(item, nr_reclaimed);
        __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
        __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
        ...
There are more counters one can monitor in /proc/vmstat to understand current reclaim activity – pgsteal_kswapd and pgsteal_direct indicate how many pages were reclaimed by kswapd and by the direct reclaim flow, respectively. In addition, the pgsteal_anon and pgsteal_file counters indicate which LRU list these pages were reclaimed from.
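These counters can be pulled out the same way (a sketch; the counter names are as they appear on recent kernels):

```shell
# Who reclaimed pages: background kswapd vs direct reclaim?
k=$(awk '$1 == "pgsteal_kswapd" { print $2 }' /proc/vmstat)
d=$(awk '$1 == "pgsteal_direct" { print $2 }' /proc/vmstat)
echo "reclaimed by kswapd: ${k:-n/a}, by direct reclaim: ${d:-n/a}"
```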
shrink_page_list() is handed a set of candidate pages which are isolated from the inactive LRU lists (it could be inactive file or inactive anon). It performs a series of checks on each page to ultimately decide if the page can be freed, or if some other course of action is more appropriate. If the page was recently referenced, it is activated (i.e. added to the active LRU list). If a page is dirty, it is queued for writeback, and reclaim is deferred. For anon pages, swap space is allocated, and the mapping is freed. The best case scenario is when a page is clean and file-backed – the page is freed right away.
static enum page_references page_check_references(struct page *page,
                                                  struct scan_control *sc)
{
        int referenced_ptes, referenced_page;
        unsigned long vm_flags;

        referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                                          &vm_flags);
        referenced_page = TestClearPageReferenced(page);
        ...
        if (referenced_ptes) {
                /*
                 * All mapped pages start out with page table
                 * references from the instantiating fault, so we need
                 * to look twice if a mapped file page is used more
                 * than once.
                 *
                 * Mark it and spare it for another trip around the
                 * inactive list. Another page table reference will
                 * lead to its activation.
                 *
                 * Note: the mark is set for activated pages as well
                 * so that recently deactivated but used pages are
                 * quickly recovered.
                 */
                SetPageReferenced(page);

                if (referenced_page || referenced_ptes > 1)
                        return PAGEREF_ACTIVATE;

                /*
                 * Activate file-backed executable pages after first usage.
                 */
                if ((vm_flags & VM_EXEC) && !PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;

                return PAGEREF_KEEP;
        }

        /* Reclaim if clean, defer dirty pages to writeback */
        if (referenced_page && !PageSwapBacked(page))
                return PAGEREF_RECLAIM_CLEAN;

        return PAGEREF_RECLAIM;
}
Related question #2: Is MemAvailable an accurate statistic for how much memory is readily available for allocation on my system?
Not always. MemAvailable is a heuristic and can be inaccurate for some workloads.
long si_mem_available(void)
{
        ...
        /*
         * Not all the page cache can be freed, otherwise the system will
         * start swapping. Assume at least half of the page cache, or the
         * low watermark worth of cache, needs to stay.
         */
        pagecache = pages[LRU_ACTIVE_FILE] + pages[LRU_INACTIVE_FILE];
        pagecache -= min(pagecache / 2, wmark_low);
        available += pagecache;
        ...
MemAvailable is just a heuristic – the calculation assumes that at least half the page cache can be easily reclaimed, which is not true if the page cache is hot. For instance:
zzz <03/08/2022 01:50:23> Count:359
MemAvailable:   204435292 kB
Active(file):   185688096 kB
Inactive(file):  11493160 kB
Here, most of the page cache is 'active', i.e. hot. For more details on how MemAvailable is calculated, see "Why is MemAvailable sometimes less than MemFree?".
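Plugging the sample numbers above into the si_mem_available() page cache term shows how the heuristic behaves (wmark_low here is a hypothetical 1 GB; the real value comes from the zone watermarks):

```shell
# Page cache contribution to MemAvailable, per the kernel heuristic
active_file=185688096      # kB, Active(file) from the sample
inactive_file=11493160     # kB, Inactive(file) from the sample
wmark_low=1048576          # kB, hypothetical low watermark (~1 GB)
pagecache=$((active_file + inactive_file))
reserve=$((pagecache / 2))         # min(pagecache / 2, wmark_low)
if [ "$wmark_low" -lt "$reserve" ]; then
    reserve=$wmark_low
fi
avail_cache=$((pagecache - reserve))
echo "page cache counted as available: ${avail_cache} kB"
```

The heuristic counts roughly 196 GB of the cache as available even though most of it sits on the active list, i.e. is hot and not cheaply reclaimable.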
How does the kernel decide which processes to swap out?
Processes aren't chosen to be swapped out – rather, pages are chosen to be swapped out/evicted based on a host of parameters, including how recently they were last accessed/referenced. The reclaim algorithm favors pages that have a low cost of reclaim. A file-backed page's reclaim cost depends on whether it's clean or dirty (dirty pages have to be written out to disk before they can be evicted, making them costly to evict due to the extra I/O), plus the refault cost (if a page is evicted but is needed again, it must be read back in, generating more I/O). For process pages not backed by a file (i.e. anonymous pages), there is no way to avoid I/O before they can be reclaimed – they must be written out (to swap) before being freed. The rate of refault (i.e. reading a page back in from swap) also increases the reclaim cost of anon pages. It's irrelevant what process the anon page being swapped out belongs to – what matters is that the page has not been referenced in a while, and thus is being evicted so that the memory can be reused elsewhere.
My system is swapping more after a kernel upgrade to Oracle UEK7, even though the workload is the same. Why?
There have been some optimizations merged in the upstream kernel (version 5.15) that affect reclaim behavior and swap usage. Among them (and this is not a comprehensive list):
d483a5dd009a mm: vmscan: limit the range of LRU type balancing
96f8bf4fb1dd mm: vmscan: reclaim writepage is IO cost
7cf111bc39f6 mm: vmscan: determine anon/file pressure balance at the reclaim root
314b57fb0460 mm: balance LRU lists based on relative thrashing
264e90cc07f1 mm: only count actual rotations as LRU reclaim cost
fbbb602e40c2 mm: deactivations shouldn't bias the LRU balance
1431d4d11abb mm: base LRU balancing on an explicit cost model
a4fe1631f313 mm: vmscan: drop unnecessary div0 avoidance rounding in get_scan_count()
968246874739 mm: remove use-once cache bias from LRU balancing
34e58cac6d8f mm: workingset: let cache workingset challenge anon
6058eaec816f mm: fold and remove lru_cache_add_anon() and lru_cache_add_file()
c843966c556d mm: allow swappiness that prefers reclaiming anon over the file workingset
497a6c1b0990 mm: keep separate anon and file statistics on page reclaim activity
5df741963d52 mm: fix LRU balancing effect of new transparent huge pages
This changes the memory reclaim behavior so that swap usage is not the last resort behavior of a system under pressure – rather it's part of the normal reclaim flow, especially if the page cache pages are being refaulted in, at a high frequency. This kernel.org patchset makes the system utilize swap space more, even under "normal" memory pressure, if the page cache is hot.
Let's look at some of these statistics, from /proc/vmstat:
workingset_nodes 1265381
workingset_refault_anon 201574
workingset_refault_file 181043188
workingset_activate_anon 79184
workingset_activate_file 25598943
workingset_restore_anon 2066
workingset_restore_file 7334788
workingset_nodereclaim 171392
...
pswpin 201575
pswpout 4522469
...
pgsteal_kswapd 246507286
pgsteal_direct 253820
pgdemote_kswapd 0
pgdemote_direct 0
pgscan_kswapd 269136889
pgscan_direct 278885
pgscan_direct_throttle 0
pgscan_anon 26736135
pgscan_file 242679639
pgsteal_anon 4493398
pgsteal_file 242267708
...
Some observations:
workingset_refault_file is high – it indicates that the page cache pages are hot, and they are being refaulted in at a high frequency after being reclaimed. This will result in increased pressure on the anon LRU list (and therefore swap).
workingset_restore_file tracks how often file pages that are about to be reclaimed are restored to the working set (because the pages were referenced again before being evicted) – this value is high, indicating that a lot of pages on the inactive(file) LRU list are not really good candidates for reclaim, and they're promoted to the active list quickly. Both these counters are indicative of a hot page cache that is not suitable for reclaim.
There is high kswapd activity (indicated by pgsteal_kswapd, pgscan_kswapd).
There seems to be relatively little direct reclaim, which is good (pgscan_direct, pgsteal_direct).
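The refault observation can be quantified from the sample counters above (numbers copied from the /proc/vmstat snippet; the ratio is only an approximation, since the two counters cover slightly different windows of events):

```shell
# Rough file refault ratio: refaults per reclaimed file page
refault_file=181043188     # workingset_refault_file from the sample
pgsteal_file=242267708     # pgsteal_file from the sample
ratio=$(( refault_file * 100 / pgsteal_file ))
echo "~${ratio}% of reclaimed file pages were faulted back in"
```

A ratio this high is exactly the "hot page cache" signal that tilts reclaim pressure toward the anon LRU lists.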
My swap utilization is close to 100% – will the OOM-killer be invoked now? Should I be concerned about system stability?
Usually no. High swap utilization by itself does not trigger OOM; what matters is whether the current allocation can make reclaim/compaction progress. If there is sufficient reclaimable memory in the page cache, OOM is less likely, but there are exceptions (for example, memcg OOM, strict cpuset/mempolicy constraints, or high-order/GFP constraints).
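For completeness, swap utilization itself is a one-liner from /proc/meminfo (a sketch; it prints 0% when no swap is configured):

```shell
# Percent of swap space currently in use
swap_total=$(awk '$1 == "SwapTotal:" { print $2 }' /proc/meminfo)
swap_free=$(awk '$1 == "SwapFree:" { print $2 }' /proc/meminfo)
if [ "$swap_total" -gt 0 ]; then
    swap_used_pct=$(( (swap_total - swap_free) * 100 / swap_total ))
else
    swap_used_pct=0
fi
echo "swap utilization: ${swap_used_pct}%"
```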
The OOM killer is invoked only when the kernel has tried repeatedly to reclaim memory in order to satisfy an allocation request, and has failed. Let's look at the relevant code snippet – __alloc_pages_slowpath() is the function that deals with allocations that cannot be satisfied right away. This function wakes up kswapd, tries to reclaim memory and then compact it (to ensure memory is not too fragmented for higher-order allocations), and then retries the allocation.
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                       struct alloc_context *ac)
{
        ...
        if (alloc_flags & ALLOC_KSWAPD)
                wake_all_kswapds(order, gfp_mask, ac);
        ...
        /*
         * For costly allocations, try direct compaction first, as it's likely
         * that we have enough base pages and don't need to reclaim. For non-
         * movable high-order allocations, do that as well, as compaction will
         * try prevent permanent fragmentation by migrating from blocks of the
         * same migratetype.
         * Don't try this for allocations that are allowed to ignore
         * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
         */
        if (can_direct_reclaim && can_compact &&
                        (costly_order ||
                           (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
                        && !gfp_pfmemalloc_allowed(gfp_mask)) {
                page = __alloc_pages_direct_compact(gfp_mask, order,
                                                alloc_flags, ac,
                                                INIT_COMPACT_PRIORITY,
                                                &compact_result);
        ...
        /* Try direct reclaim and then allocating */
        page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                                        &did_some_progress);
        if (page)
                goto got_pg;
        ...
        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_RETRY_MAYFAIL and we can compact
         */
        if (costly_order && (!can_compact ||
                             !(gfp_mask & __GFP_RETRY_MAYFAIL)))
                goto nopage;

        if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                                 did_some_progress > 0, &no_progress_loops))
                goto retry;
        ...
        /* Reclaim has failed us, start killing things */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
                goto got_pg;

        /* Avoid allocations with no watermarks from looping endlessly */
        if (tsk_is_oom_victim(current) &&
            (alloc_flags & ALLOC_OOM ||
             (gfp_mask & __GFP_NOMEMALLOC)))
                goto nopage;

        /* Retry as long as the OOM killer is making progress */
        if (did_some_progress) {
                no_progress_loops = 0;
                goto retry;
        }
As you can gather from the snippets and comments above, the kernel tries very hard to satisfy an allocation request. As long as the background reclaim (done by kswapd) is making some progress, the kernel tries to compact and allocate. As long as there's memory used by the page cache, it's very much available to be reclaimed, however slowly. If swap space fills up, the system can no longer optimize for file I/O and will start evicting the page cache – even writing out dirty pages to disk so they can be evicted, and shrinking slab caches, etc. It only invokes the OOM-killer when it has run out of all options. In that case, instead of the entire system grinding to a halt, unable to make progress due to lack of memory, it selects one "victim" process and kills that, hopefully freeing up enough memory to ease the pressure on the system so that the rest of the processes can continue.
In short, if reclaim/compaction keeps making progress, OOM is unlikely. If the page cache has been almost completely evicted, and swap space is 100% full, and the free memory is below the per-zone low watermarks (or the system is too fragmented and compaction fails) – that's when the system is in trouble and the OOM-killer jumps in.
When should I be concerned about swapping? When is swapping bad?
When a system is swapping, it's usually seen as one of the first symptoms of memory pressure, which portends worse things to follow. That is a myth. Perhaps this is how swap was used back when memory was scarce, and disks were super slow and the system would actively try to not page out data to disk unless it had no other choice.
That is not the case anymore. With terabytes of RAM and SSDs now commonplace, swap is not a necessary evil – it is simply one of the mechanisms for keeping the system operating at its maximum efficiency. Memory reclaim chooses the best candidate pages to evict for optimal system performance, in many cases keeping active page cache pages in memory while moving idle pages out to swap.
Even if swap space has been used up 100%, it's not a reason to be alarmed; see the previous question (My swap utilization is close to 100% – will the OOM-killer be invoked now? Should I be concerned about system stability?). It's when there is constant swapping in and swapping out – also known as thrashing – that the system is in trouble. Thrashing happens because the available RAM is not enough to satisfy the memory demands of the workload. Due to sustained memory pressure, pages are swapped out (or evicted to disk), but those are still part of the active working set. The processes continue to read/write to those pages, which then results in them being swapped in again, which increases memory pressure (as those new pages have to go somewhere), which results in other active pages being swapped out, and so on. This will result in performance degradation for all applications, and the system will slowly grind to a halt.
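The difference between harmless swap usage and thrashing can be checked directly: /proc/vmstat exposes cumulative pswpin/pswpout counters (in pages), and sampling them over an interval shows whether the system is actively swapping right now. A minimal sketch, assuming a Linux /proc:

```shell
# Sample the cumulative swap-in/swap-out page counters one second apart.
# Sustained non-zero rates on BOTH sides suggest thrashing; occasional
# swap-outs with near-zero swap-ins are normally harmless.
read_swap_counters() {
    awk '$1 == "pswpin"  { i = $2 }
         $1 == "pswpout" { o = $2 }
         END { print i, o }' /proc/vmstat
}

set -- $(read_swap_counters); in1=$1; out1=$2
sleep 1
set -- $(read_swap_counters); in2=$1; out2=$2

echo "swap-ins/s:  $((in2 - in1))"
echo "swap-outs/s: $((out2 - out1))"
```

The same rates are what vmstat reports in its si and so columns.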
Are there any kernel sysctl knobs that can affect swapping on my system?
There are a few knobs that affect memory reclaim behavior – which indirectly affects swap.
vm.swappiness
This is the primary swap-related tunable that controls how aggressively the kernel reclaims file pages vs. anonymous memory – the latter gets evicted to swap space. The value can range from 0 to 200, with 60 being the default. Lowering vm.swappiness generally tends to make the reclaim algorithm favor page cache eviction more than swapping, while increasing this value favors evicting process memory to swap and keeping page cache around a little longer.
Note: Setting vm.swappiness=0 can increase OOM risk on some workloads by greatly reducing the number of anonymous pages that get reclaimed.
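For reference, the knob can be inspected and adjusted at runtime. A short sketch – the value 10 below is purely illustrative, not a recommendation:

```shell
# Read the current value (readable by any user):
cat /proc/sys/vm/swappiness

# Change it for the running kernel (root required, does not persist):
#   sysctl -w vm.swappiness=10

# Persist across reboots via a sysctl drop-in file:
#   echo 'vm.swappiness = 10' > /etc/sysctl.d/99-swappiness.conf
```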
vm.min_free_kbytes
This directly affects the zone watermark values – which control how much memory the kernel sets aside as "reserved" – i.e. normal allocations will not be able to dip into this pool. This in turn influences when memory reclaim starts and how long it runs for, which will affect how much swap space gets utilized. If this value is set too low, the system will not have enough reserve memory to handle allocation bursts. The kernel will struggle to do reclaim, compaction, dirty page writeback, etc. Even the memory reclaim flow might need to allocate memory (e.g. to migrate pages), which will stall (in direct reclaim) or fail if this setting is too low. On the other hand, if this is set too high, all that memory is set aside as reserved, which cannot be used for regular allocations. Also, this increases the zone watermarks, which means kswapd will run for longer, evicting more pages from memory and swapping more too. It is not advisable to tune this setting unless there is evidence that the current setting is suboptimal for the workload.
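The relationship can be observed directly: the per-zone min/low/high watermarks that vm.min_free_kbytes feeds into are exposed in /proc/zoneinfo (in pages, not kilobytes). A read-only sketch:

```shell
# Show the reserve size, then each zone's watermarks (values in pages).
echo "min_free_kbytes: $(cat /proc/sys/vm/min_free_kbytes)"

awk '$1 == "Node" { zone = $2 " " $4 }
     $1 ~ /^(min|low|high)$/ { printf "Node %s %-5s %s pages\n", zone, $1, $2 }' \
    /proc/zoneinfo
```

When free memory in a zone drops below the low watermark, kswapd wakes up; below min, allocations stall in direct reclaim.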
vm.compaction_proactiveness
This knob controls how aggressively background compaction is done, by kcompactd. If compaction is successful, this will reduce fragmentation on the system, thus increasing the likelihood that a higher order allocation request will succeed. This will help reduce kswapd's run time.
watermark_boost_factor; watermark_scale_factor
Like vm.min_free_kbytes, these settings affect the zone watermarks, which alters reclaim aggressiveness and in turn can increase swap usage.
vm.drop_caches
This sysctl is not a tunable, but writing to it will perform a specific reclaim action that reduces the memory pressure on the system.
Writing 1 to this parameter will drop reclaimable page cache.
Writing 2 will free up reclaimable dentry and inode slab caches.
Writing 3 will free up both slab caches and page cache memory.
⚠️WARNING: Please do not do this on production systems without understanding the performance impact. Dropping the page cache on a system with a heavy workload which is very much actively utilizing said cache is not a good idea. This is a blunt hammer that will evict active and inactive pages – regardless of the performance hit to the applications. Other considerations:
While the kernel is actually processing the drop_caches write, the system could experience a brownout (temporary performance degradation) as pages are scanned and evicted.
Any memory it frees up will usually be temporary – the system will read all those pages right back in if the workload demands it. It will not fix the real problems on the system, if any.
Due to the re-faulting of those evicted pages, some applications can see higher latencies or performance drops after page cache is dropped.
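With those caveats in mind, the mechanics look like this. The sketch below guards on write permission and only drops the page cache (value 1):

```shell
# Page cache before:
grep -E '^(MemFree|Cached):' /proc/meminfo

if [ -w /proc/sys/vm/drop_caches ]; then
    # Write back dirty pages first; only clean pages can be dropped.
    sync
    echo 1 > /proc/sys/vm/drop_caches
    # Page cache after (Cached should shrink, MemFree grow):
    grep -E '^(MemFree|Cached):' /proc/meminfo
else
    echo "skipping: need root to write /proc/sys/vm/drop_caches"
fi
```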
There are many more kernel tunables that affect memory reclaim behavior, and therefore swap aggressiveness. Please consider all the pros and cons of these tunables carefully before changing them from their default values. It is not advisable to change them without expert recommendations; doing so could have surprising and undesirable consequences.
I would like my system to use less swap. How do I achieve that?
Reducing swap usage in itself is not a worthy goal, unless there are some negative effects or performance issues. Typically, on a well planned system where the workload does not exceed the memory capacity, there should not be heavy swapping or thrashing. If you just see some swap usage now and then, that is completely normal. If SwapFree continues to go down (and stay down), that is also normal.
As we've discussed earlier in this document, there are a few things that can contribute to increased swapping even on a healthy system. Here are a couple of things to check:
Check if there's NUMA imbalance.
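One quick way to look for NUMA imbalance is to compare free memory across nodes; heavy skew can mean one node is reclaiming (and swapping) while the others sit mostly idle. A sketch using sysfs:

```shell
# Print MemFree for each NUMA node; a large skew between nodes hints
# that allocations are concentrating on one node.
free_per_node=$(awk '$3 == "MemFree:" { printf "node%s: %s kB free\n", $2, $4 }' \
    /sys/devices/system/node/node*/meminfo 2>/dev/null)
echo "${free_per_node:-NUMA topology not exposed on this system}"
```

numastat (from the numactl package, if installed) gives a richer view, including per-node allocation miss counters.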
If the page cache is hot, and it's being used mostly for read-once files (like backup, or security scanner etc.), consider using cgroup limits for those applications so they don't use too much memory.
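As a sketch of that approach – the unit name, paths, and limits below are purely illustrative – a transient systemd scope can cap how much memory (including page cache charged to it) a backup job may use; the raw cgroup v2 interface is shown for comparison:

```shell
# Run a hypothetical backup under a memory ceiling so its read-once
# file pages cannot push out the rest of the system's working set:
#   systemd-run --scope -p MemoryMax=2G -p MemorySwapMax=512M \
#       /usr/local/bin/backup.sh
#
# Roughly equivalent with the raw cgroup v2 hierarchy:
#   mkdir /sys/fs/cgroup/backup
#   echo 2G > /sys/fs/cgroup/backup/memory.max
#   echo $$ > /sys/fs/cgroup/backup/cgroup.procs   # move this shell in
#   /usr/local/bin/backup.sh
```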
⚠️WARNING: Do not run swapoff -a and swapon -a in an attempt to free up swap. This will not help. If the system is already running low in memory, and you turn off swap, you're forcing the data that was swapped out to be read back into memory, which could cause a crash/reboot.
There is no advantage in trying to increase SwapFree to be closer to SwapTotal. There's no benefit in trying to not use swap, if it's enabled. It's good for overall system performance to let the kernel decide what pages it wants to keep in memory and what pages it wants to evict. There's no need to try to change that algorithm unless you're running into performance issues, or the system is thrashing.
What are the latest updates in swap, in upstream Linux?
On newer kernels, Multi-Generational LRU (MGLRU) may be used rather than just two LRU lists (active and inactive), depending on kernel version, configuration and runtime settings. MGLRU was merged upstream in Linux 6.1 and has continued to evolve through the 6.x series, with the goal of improving page reclaim. It categorizes pages into multiple generations based on how recently they were accessed – pages in older generations have not been accessed recently, whereas pages in newer generations are more active. The kernel can then reclaim older pages much more efficiently and reduce the rate of refaults. Benchmarks have shown this to improve the efficiency of kswapd, as well as reducing working set refaults significantly. This should (theoretically at least) reduce unnecessary swapping, since truly inactive/idle pages are identified more accurately. Note that this will not eliminate swapping; for instance, if most of the pages in the oldest generation belong to idle processes, and the page cache pages are one generation newer, those process pages will be evicted first – i.e. swapped – before the page cache pages are reclaimed. That is still the optimal thing to do for the system overall. The goal is to evict truly cold pages from memory, and MGLRU brings more accuracy in identifying those pages, thus reducing refaults/swap-ins of evicted pages.
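Whether a given kernel has MGLRU built in and active can be checked from userspace; a small sketch (the sysfs path exists only on kernels built with CONFIG_LRU_GEN):

```shell
# On MGLRU-capable kernels this file holds a feature bitmask
# (e.g. 0x0007 when fully enabled); writing to it toggles features.
if [ -r /sys/kernel/mm/lru_gen/enabled ]; then
    mglru_state=$(cat /sys/kernel/mm/lru_gen/enabled)
    echo "MGLRU enabled mask: $mglru_state"
else
    mglru_state="unavailable"
    echo "MGLRU not available (kernel lacks CONFIG_LRU_GEN or predates 6.1)"
fi
```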
Conclusion
What we have shown in this blog is that swap usage is no longer a bellwether of system problems. Swap space is used to improve performance by freeing up real memory for active processes and the page cache. Alerts that monitor swap free or swap used should be changed to look for sustained, near-constant periods of active swapping – i.e. the si and so fields from vmstat. If you do not see this and there are no performance issues, there is nothing to worry about w.r.t. swap space usage. Low SwapFree by itself is not a signal of impending disaster. The default kernel sysctl vm.swappiness does favor reclaiming page cache over swapping when possible, but, at the same time, swapping will happen in parallel with page cache eviction. If you want to reduce swap usage, consider using cgroup limits to constrain applications like backups so that they do not end up using all available memory for caching of file data.