
What Are the Most Astonishing Designs in Linux?

2026-07-02 21:28:15

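Introduction

This article takes a close look at the two most recent core evolutions of the Linux kernel's swap mechanism: the cluster-based, high-performance allocator introduced by kernel developers including Chris Li and Kairui Song, and the later, revolutionary Swap Table architecture led by Kairui Song. Over roughly a year of study I found that this area is full of easily confused concepts, and that the deeper reasons behind the design ("why is it built this way?") are rarely spelled out. This article is therefore both a set of study notes on the latest source code, such as the v4 Swap Table series, and an attempt to lay out the history and explain the design rationale. Corrections are welcome. Everything here is based on the latest mm-new code (base commit 199236646ffd82b5a5bcf2bca1579ea06cb0ae74).

To avoid confusion, a few core concepts need precise definitions before we go into technical details. Even if you have no kernel background, please read through the terms below first; the rest of the article walks through the mechanism step by step.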

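swap area/file: a swap partition or swap file used to relieve memory pressure. Once it is registered, pages can be swapped out to it and swapped back in.

swap address space: the swap space is divided into units of 64 MB. This design reuses the XArray-based hierarchical management of the file page cache, and the unit size is a trade-off between addressing cost and lock granularity. (The new mechanism has removed this part.)

swap map: can be thought of as one huge array that manages the swap entries/slots.

swap cluster: introduced to support THP allocation and to improve SSD performance. A cluster covers 512 pages (the PMD size), and clusters are linked into lists.

swp_entry_t: a 64-bit integer that can be understood as a PTE with the present bit cleared; it addresses a physical page in the swap space. It is usually just called a swap entry.

swap slot: the actual physical page on the swap device; it corresponds one-to-one to a swap entry.

swap table: a simpler, higher-performance way to manage the swap cache, replacing the current XArray.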

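swap_info_struct (include/linux/swap.h)

Before exploring the kernel's swap mechanism, we must first be clear about what it is built on: physical memory management. All system activity revolves around allocating and using physical pages (more commonly called folios in modern kernels). These physical pages, with 4 KB as the base unit, are handed to the kernel's buddy system once the system has booted and initialized, and wait there to serve processes and the kernel itself. (Note: the initialization and management of physical memory will be covered in detail in a later article; here the reader only needs this basic picture.)

Physical memory, however, is a limited and precious resource. The challenge appears when the system faces the following two typical situations:

Memory overcommit: the total memory requested by all processes exceeds the actual capacity of physical memory.

Inactive memory: many physical pages have been allocated, but their data has not been accessed for a long time. They are "asleep", which wastes the resource.

To deal with these challenges, the Linux kernel introduced page swap-out and swap-in. The core idea is not simply "extending memory"; it is closer to time-division multiplexing of memory: pages that are temporarily "asleep" are moved to a slower but much larger storage medium, freeing physical memory for tasks that are currently more active. When a moved page is accessed later, it is loaded back into physical memory.

Swap media: a temporary home for pages

The backing store that temporarily holds memory pages is what we usually call swap space. In modern systems it is usually provided by a block storage device, most commonly:

Solid-state drive (SSD): with good random read/write performance and very low latency, SSDs have become the preferred medium for swap space. They significantly shorten swap-in and swap-out time and reduce the impact on system responsiveness.

Hard disk drive (HDD): its random read/write performance is far below that of an SSD, but it still has advantages in cost and capacity. When memory pressure is not extreme or performance requirements are modest, an HDD remains a viable choice.

Whether the medium is an SSD or an HDD, and whether it is a dedicated swap partition or a flexible swap file, the kernel needs one unified mechanism to identify, manage, and operate on these varied physical media.

swap_info_struct: the abstract identity card of a swap device

So how does the kernel manage swap devices that may coexist and have different characteristics? The answer is a core data structure: struct swap_info_struct.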

struct swap_info_struct {
	struct percpu_ref users;	/* indicate and keep swap device valid. */
	unsigned long	flags;		/* SWP_USED etc: see above */
	signed short	prio;		/* swap priority of this type */
	struct plist_node list;		/* entry in swap_active_head */
	signed char	type;		/* strange name for an index */
	unsigned int	max;		/* extent of the swap_map */
	unsigned char *swap_map;	/* vmalloc'ed array of usage counts */
	unsigned long *zeromap;		/* kvmalloc'ed bitmap to track zero pages */
	struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
	struct list_head free_clusters; /* free clusters list */
	struct list_head full_clusters; /* full clusters list */
	struct list_head nonfull_clusters[SWAP_NR_ORDERS];
					/* list of cluster that contains at least one free slot */
	struct list_head frag_clusters[SWAP_NR_ORDERS];
					/* list of cluster that are fragmented or contented */
	unsigned int pages;		/* total of usable pages of swap */
	atomic_long_t inuse_pages;	/* number of those currently in use */
	struct swap_sequential_cluster *global_cluster; /* Use one global cluster for rotating device */
	spinlock_t global_cluster_lock;	/* Serialize usage of global cluster */
	struct rb_root swap_extent_root;/* root of the swap extent rbtree */
	struct block_device *bdev;	/* swap device or bdev of swap file */
	struct file *swap_file;		/* seldom referenced */
	struct completion comp;		/* seldom referenced */
	spinlock_t lock;		/*
					 * protect map scan related fields like
					 * swap_map, inuse_pages and all cluster
					 * lists. other fields are only changed
					 * at swapon/swapoff, so are protected
					 * by swap_lock. changing flags need
					 * hold this lock and swap_lock. If
					 * both locks need hold, hold swap_lock
					 * first.
					 */
	spinlock_t cont_lock;		/*
					 * protect swap count continuation page
					 * list.
					 */
	struct work_struct discard_work; /* discard worker */
	struct work_struct reclaim_work; /* reclaim worker */
	struct list_head discard_clusters; /* discard clusters list */
	struct plist_node avail_lists[]; /*
					   * entries in swap_avail_heads, one
					   * entry per node.
					   * Must be last as the number of the
					   * array is nr_node_ids, which is not
					   * a fixed value so have to allocate
					   * dynamically.
					   * And it has to be an array so that
					   * plist_for_each_* can work.
					   */
};

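struct swap_info_struct has evolved over many years and has accumulated a rich set of members. To cut through the fog, let us set aside all the complicated details and go back to the simplest, most intuitive design idea.

Since the core job of swap space is to temporarily hold physical pages swapped out of memory, a natural idea is: can we use one huge array to directly map every 4 KB page (slot) in the swap area? The array index corresponds directly to the page offset in the swap area, and each element records the state of that slot. It is hard to imagine a more direct or more efficient design.

This "huge array" idea is exactly the origin of swap_map, one of the core members of swap_info_struct. swap_map is essentially the kernel's embodiment of that big array: it keeps one metadata entry for every swap slot.

But as we follow this idea further, a problem surfaces: a real operating system is a highly concurrent environment.

Why is the seemingly simple swap_map always accompanied by a spinlock (spinlock_t) nearby in the structure? What exactly does that lock protect?

The answer is that a physical page may belong to more than one process before it is swapped out. Because of copy-on-write (CoW), a read-only physical page may be mapped by several processes at once. For example, after a parent process calls fork() to create a child, and until the child writes to the shared memory, the page table entries (PTEs) of parent and child may point to the same physical page.

When such a shared page is swapped out, the PTEs of all processes sharing it are modified: instead of pointing to the physical page, they now hold the same swp_entry (swap entry). This means the swap_map entry for that swp_entry is referenced by several PTEs at the same time.

The job of swap_map therefore becomes more complex. It must record not only whether a slot is in use, but also exactly how many PTEs reference it. To do this, every swap_map entry must contain a reference count (swap_count). The count is incremented when a new PTE points to the entry and decremented when a PTE is destroyed. Only when the count drops to zero can the swap slot be safely freed and reused.

Now the need for a lock is obvious. On a multi-core CPU, two processes may exit on two different cores at the same time and both try to decrement the same swap_count. Without a lock, this concurrent read-modify-write inevitably produces a wrong count and eventually corrupts swap space management: either a slot that is still referenced is freed by mistake (data corruption), or a slot that should be freed is never reclaimed (space leak).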
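
The counting rule can be modelled in user space. What follows is a minimal sketch, not kernel code: toy_swap_map, toy_slot_dup, and toy_slot_put are hypothetical names, a C11 atomic_flag spin loop stands in for the kernel spinlock, and the real swap_map encoding (flag bits, count continuation) is ignored.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define TOY_NR_SLOTS 1024	/* assumed size of the toy swap area, in slots */

/* Hypothetical user-space model: one usage count per swap slot. */
struct toy_swap_map {
	atomic_flag lock;			/* stands in for the kernel spinlock */
	unsigned char count[TOY_NR_SLOTS];	/* 0 means the slot is free */
};

static void toy_swap_map_init(struct toy_swap_map *map)
{
	atomic_flag_clear(&map->lock);
	memset(map->count, 0, sizeof(map->count));
}

static void toy_lock(struct toy_swap_map *map)
{
	/* Spin until we are the one who flipped the flag from clear to set. */
	while (atomic_flag_test_and_set_explicit(&map->lock, memory_order_acquire))
		;
}

static void toy_unlock(struct toy_swap_map *map)
{
	atomic_flag_clear_explicit(&map->lock, memory_order_release);
}

/* A new PTE now refers to this slot: take one reference. */
static void toy_slot_dup(struct toy_swap_map *map, size_t offset)
{
	assert(offset < TOY_NR_SLOTS);
	toy_lock(map);
	assert(map->count[offset] < 255);	/* the toy model has no overflow handling */
	map->count[offset]++;
	toy_unlock(map);
}

/*
 * A PTE that referred to this slot went away: drop one reference.
 * Returns 1 when the count reached zero and the slot may be reused.
 */
static int toy_slot_put(struct toy_swap_map *map, size_t offset)
{
	int now_free;

	assert(offset < TOY_NR_SLOTS);
	toy_lock(map);
	assert(map->count[offset] > 0);
	map->count[offset]--;
	now_free = (map->count[offset] == 0);
	toy_unlock(map);
	return now_free;
}
```

The decrement and the "did it reach zero" check happen under the same lock, which is the point of the paragraph above: two concurrent callers of toy_slot_put can never both see the count reach zero, and neither update can be lost.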

The different devices are managed through the swap_info array and are looked up by the type decoded from the entry:

/*
 * Callers of all helpers below must ensure the entry, type, or offset is
 * valid, and protect the swap device with reference count or locks.
 */
static inline struct swap_info_struct *__swap_type_to_info(int type)
{
	struct swap_info_struct *si;

	si = READ_ONCE(swap_info[type]); /* rcu_dereference() */
	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
	return si;
}

swp_entry_t (include/linux/mm_types.h)

/*
 * A swap entry has to fit into a "unsigned long", as the entry is hidden
 * in the "index" field of the swapper address space.
 */
typedef struct {
	unsigned long val;
} swp_entry_t;

This is the key to addressing. Just as decoding a PTE leads to a page, decoding an entry leads to a swap slot. Here is the arm64 encoding:

/*
 * Encode and decode a swap entry:
 *	bits 0-1:	present (must be zero)
 *	bits 2:		remember PG_anon_exclusive
 *	bit  3:		remember uffd-wp state
 *	bits 6-10:	swap type
 *	bit  11:	PTE_PRESENT_INVALID (must be zero)
 *	bits 12-61:	swap offset
 */

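As with a PTE, the present bits must be zero, which means the page is not in memory. The two other important fields are the swap type, which selects among the swap devices, and the offset, which locates the position inside that device and therefore the slot's entry in swap_map.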
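
The bit layout in that comment can be exercised with a small user-space sketch. The toy_* names and constants below are hypothetical; the shifts and widths are read off the comment (type in bits 6-10, offset in bits 12-61), and the PG_anon_exclusive and uffd-wp bits are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions taken from the arm64 comment above (assumed, not kernel macros). */
#define TOY_SWP_TYPE_SHIFT	6
#define TOY_SWP_TYPE_BITS	5
#define TOY_SWP_OFFSET_SHIFT	12
#define TOY_SWP_OFFSET_BITS	50

#define TOY_SWP_TYPE_MASK	((UINT64_C(1) << TOY_SWP_TYPE_BITS) - 1)
#define TOY_SWP_OFFSET_MASK	((UINT64_C(1) << TOY_SWP_OFFSET_BITS) - 1)

/* Pack a device index and a slot offset into one 64-bit value. */
static uint64_t toy_swp_encode(unsigned int type, uint64_t offset)
{
	assert(type <= TOY_SWP_TYPE_MASK);
	assert(offset <= TOY_SWP_OFFSET_MASK);
	return ((uint64_t)type << TOY_SWP_TYPE_SHIFT) |
	       (offset << TOY_SWP_OFFSET_SHIFT);
}

/* Recover the device index: selects which swap device to use. */
static unsigned int toy_swp_type(uint64_t val)
{
	return (unsigned int)((val >> TOY_SWP_TYPE_SHIFT) & TOY_SWP_TYPE_MASK);
}

/* Recover the slot offset inside that device. */
static uint64_t toy_swp_offset(uint64_t val)
{
	return (val >> TOY_SWP_OFFSET_SHIFT) & TOY_SWP_OFFSET_MASK;
}
```

Because neither field overlaps bits 0-1 or bit 11, any value produced by toy_swp_encode automatically keeps the "must be zero" bits clear, which is what lets the hardware treat the PTE as not present.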

swap_cluster_info (mm/swap.h)

This is where the optimization work by Chris Li and Kairui Song begins:

/*
 * We use this to track usage of a cluster. A cluster is a block of swap disk
 * space with SWAPFILE_CLUSTER pages long and naturally aligns in disk. All
 * free clusters are organized into a list. We fetch an entry from the list to
 * get a free cluster.
 *
 * The flags field determines if a cluster is free. This is
 * protected by cluster lock.
 */
struct swap_cluster_info {
	spinlock_t lock;	/*
				 * Protect swap_cluster_info fields
				 * other than list, and swap_info_struct->swap_map
				 * elements corresponding to the swap cluster.
				 */
	u16 count;
	u8 flags;
	u8 order;
	atomic_long_t __rcu *table;	/* Swap table entries, see mm/swap_table.h */
	struct list_head list;
};

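The starting point is the problems of the existing allocator.

Kairui and Chris Li did a great deal of optimization work, and the resulting new allocator has the following properties:

1. Fine-grained cluster state management: the power of doubly linked lists

The core of the new architecture is a set of doubly linked lists that classify clusters by occupancy and by page contiguity. Over its lifetime, each cluster moves between these lists as its state changes.

The main lists are:

free_clusters (fully free list):

Members: clusters that are not used at all. These are the allocator's preferred resource.
Purpose: provide "clean", immediately usable 2 MB chunks of swap space for new swap-out operations.

nonfull_clusters (non-full list):

Members: clusters in which some slots are occupied but free slots remain.
Purpose: this is the most common state. The allocator prefers to use the remaining space of these clusters, which raises swap space utilization and avoids excessive fragmentation.

full_clusters (full list):

Members: clusters whose 512 slots are all occupied.
Purpose: these clusters are set aside for the time being. The allocator no longer scans them, which avoids useless search overhead. When a page is swapped back into memory and a slot is freed, the cluster moves back to the nonfull_clusters list.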
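
The migration between the three lists can be sketched as a small state machine driven by the number of used slots. This is a simplified user-space model: toy_cluster and its helpers are hypothetical names, real list manipulation and locking are omitted, and only the "which list does this cluster belong on" decision is shown.

```c
#include <assert.h>

#define TOY_CLUSTER_SLOTS 512	/* slots per cluster, as in the text */

enum toy_cluster_list {
	TOY_LIST_FREE,		/* no slot used */
	TOY_LIST_NONFULL,	/* some slots used, some free */
	TOY_LIST_FULL,		/* every slot used */
};

/* Hypothetical model of one cluster: usage count plus the list it sits on. */
struct toy_cluster {
	unsigned int used;
	enum toy_cluster_list on_list;
};

/* Pick the list a cluster belongs on from its used-slot count. */
static enum toy_cluster_list toy_cluster_classify(unsigned int used)
{
	assert(used <= TOY_CLUSTER_SLOTS);
	if (used == 0)
		return TOY_LIST_FREE;
	if (used == TOY_CLUSTER_SLOTS)
		return TOY_LIST_FULL;
	return TOY_LIST_NONFULL;
}

/* Account one newly allocated slot and "move" the cluster if needed. */
static void toy_cluster_alloc_slot(struct toy_cluster *ci)
{
	assert(ci->used < TOY_CLUSTER_SLOTS);
	ci->used++;
	ci->on_list = toy_cluster_classify(ci->used);
}

/* Account one freed slot and "move" the cluster if needed. */
static void toy_cluster_free_slot(struct toy_cluster *ci)
{
	assert(ci->used > 0);
	ci->used--;
	ci->on_list = toy_cluster_classify(ci->used);
}
```

In this model a cluster only changes list at the two boundaries (0 and 512 used slots), so most allocations and frees leave it where it is; that is why keeping full clusters off the scan path is cheap to maintain.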

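2. Optimization for large folios (mTHP): separate per-order lists

To better support swapping out large folios (multi-size THP, mTHP), the new architecture adds per-order management on top of the classification above, for nonfull_clusters and for frag_clusters (the fragmented list).

Separate nonfull_clusters[order] lists:

The system keeps a separate non-full list for each folio size (order-0, order-1, ... order-9).
When a 64 KB (order-4) folio needs to be swapped out, the allocator goes straight to the order-4 nonfull_clusters list to look for a cluster that can hold 64 KB of contiguous space.
This design is very friendly to mTHP: it avoids scanning many unrelated clusters that contain only small fragments just to allocate one large folio, which greatly improves allocation efficiency for large folio swap-out.

frag_clusters[order] (fragmented list, optional/advanced feature):

Holds clusters that have enough free slots in total but cannot satisfy a contiguous allocation of the given order.
Managing these fragmented clusters separately further optimizes the allocation path for high-order folios.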
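
To make "enough free slots but no usable run" concrete, here is a sketch of searching one cluster for 2^order contiguous free slots. It assumes the run must be naturally aligned to its own size; toy_cluster_find_run and the bool array are hypothetical stand-ins for the kernel's per-cluster scan.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_CLUSTER_SLOTS 512	/* slots per cluster, as in the text */

/*
 * Search one cluster for 2^order contiguous free slots, naturally aligned
 * (assumption of this sketch). used[i] is true when slot i is occupied.
 * Returns the index of the first slot of the run, or -1 if none exists.
 */
static int toy_cluster_find_run(const bool used[TOY_CLUSTER_SLOTS],
				unsigned int order)
{
	unsigned int nr = 1u << order;

	assert(nr <= TOY_CLUSTER_SLOTS);
	/* Only aligned start positions are candidates: 0, nr, 2*nr, ... */
	for (unsigned int start = 0; start < TOY_CLUSTER_SLOTS; start += nr) {
		unsigned int i;

		for (i = 0; i < nr; i++)
			if (used[start + i])
				break;
		if (i == nr)
			return (int)start;
	}
	return -1;
}
```

A cluster in which every even slot is occupied still has 256 free slots, yet toy_cluster_find_run returns -1 for order 1 and above. That is the kind of cluster the text describes as fragmented for a given order.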

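This design eliminates the expensive global scan over the whole swap_map. All operations are confined to individual clusters, and the classified lists locate a suitable target efficiently. This fundamentally resolves the allocation contention of the old architecture and lays a solid foundation for high-performance swapping under high concurrency. Patch series: https://lore.kernel.org/linux-mm/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/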

swap table

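From a "giant lock" to "micro locks": the evolution of the Swap management architecture

With all the necessary background in place, we now come to the core of this exploration: the revolutionary optimizations of the new-generation swap architecture.

Everything starts from the original, simplest design idea: manage all swap slots with one huge swap_map array. To keep this big array consistent on multi-core systems, a global spinlock (spinlock) was unavoidable. On modern many-core processors this "giant lock" became a severe performance bottleneck: every CPU that needs to interact with the swap subsystem has to queue up and proceed serially.

Intermediate solution: a 64 MB address_space borrowed from the page cache framework

To break free of the giant lock, kernel developers took a first step. They borrowed the management framework of the file system page cache and logically split the large swap space into management units of 64 MB each, the swap_address_space.

Each 64 MB address_space is managed by its own XArray tree. This breaks the single global lock into many per-64 MB region locks and noticeably reduces lock contention. (Note: for the design of XArray, Kairui's in-depth article is strongly recommended.)

A quick estimate: a 64 MB space contains 64 MB / 4 KB = 16384 slots and needs a 14-bit index (2^14 = 16384). Assuming each XArray node covers 6 bits of index space (64 slots per node), then:

Two-level tree: covers at most 6 + 6 = 12 index bits, i.e. 2^12 = 4096 slots, which is not enough for the 14-bit range.
Three-level tree: covers 6 + 6 + 6 = 18 index bits, i.e. 2^18 slots, easily enough for the 14-bit requirement.

So in this architecture, a swap cache lookup typically needs 3 pointer dereferences while walking the tree.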
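
The estimate is plain integer arithmetic and can be checked with a few lines of C. The helper names are hypothetical, and the 6 bits per node is the same assumption the text makes.

```c
#include <assert.h>

/* Number of index bits needed to address nr_slots slots (nr_slots >= 1). */
static unsigned int toy_index_bits(unsigned long nr_slots)
{
	unsigned int bits = 0;

	assert(nr_slots >= 1);
	while ((1UL << bits) < nr_slots)
		bits++;
	return bits;
}

/*
 * Number of tree levels needed to cover index_bits when every node
 * resolves bits_per_node bits of the index (rounded up).
 */
static unsigned int toy_tree_levels(unsigned int index_bits,
				    unsigned int bits_per_node)
{
	assert(bits_per_node > 0);
	return (index_bits + bits_per_node - 1) / bits_per_node;
}
```

For the 64 MB unit, toy_index_bits((64UL << 20) / 4096) gives 14 and toy_tree_levels(14, 6) gives 3, matching the count above. The same helpers applied to a 512-slot range give 9 bits, which a tree with 6-bit nodes would still need 2 levels to cover.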

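The final revolution: 2 MB clusters and the birth of the Swap Table

Although the 64 MB address_space scheme was an improvement, it is still a "borrowed" architecture: the lock granularity remains coarse, and the cost of the tree lookup (O(log N)) is not negligible. The new-generation architecture addresses this with a management model designed specifically for swap:

Finer management unit:
The new architecture uses the cluster as its basic management unit, sized at 2 MB. This is not an arbitrary number: it matches the size of a transparent huge page (THP) mapped by one PMD entry on x86-64.

O(1) addressing with the Swap Table:
A 2 MB cluster contains 512 slots of 4 KB each.
To index the swap cache state of these 512 slots, the kernel no longer uses a tree and goes back to the fastest structure, a flat array: the Swap Table.
The Swap Table is an array of 512 unsigned long entries. Its size is exactly 512 * 8 B = 4 KB, which fits into a single physical page and greatly improves memory access locality.

State encoding, inherited and extended:
Each 8-byte Swap Table entry inherits the XArray's encoding trick and uses the lowest bit of the pointer to distinguish three states:

folio pointer: points to the actual physical page in the swap cache; its lowest address bit is always 0.
shadow entry: "shadow" metadata that tracks a recently swapped-out page; its lowest bit is set to 1.
NULL: the slot is empty, or the page is not in the swap cache.

This design compresses and reuses information without any extra space overhead.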
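
The three-state encoding can be sketched with ordinary integers. This is an illustration of the low-bit tagging idea only: the toy_* names are hypothetical, and the shadow payload here is an arbitrary number shifted left by one, not the kernel's actual shadow format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum toy_entry_kind {
	TOY_ENTRY_NULL,		/* empty slot, nothing cached */
	TOY_ENTRY_FOLIO,	/* aligned pointer, low bit 0 */
	TOY_ENTRY_SHADOW,	/* tagged value, low bit 1 */
};

/* Store a folio pointer: alignment guarantees the low bit is already 0. */
static uintptr_t toy_entry_from_folio(const void *folio)
{
	uintptr_t val = (uintptr_t)folio;

	assert(folio != NULL && (val & 1) == 0);
	return val;
}

/* Store shadow metadata: shift the payload up and set the low bit as a tag. */
static uintptr_t toy_entry_from_shadow(uintptr_t payload)
{
	assert(payload <= (UINTPTR_MAX >> 1));
	return (payload << 1) | 1;
}

/* Decide which of the three states a table entry is in. */
static enum toy_entry_kind toy_entry_classify(uintptr_t entry)
{
	if (entry == 0)
		return TOY_ENTRY_NULL;
	return (entry & 1) ? TOY_ENTRY_SHADOW : TOY_ENTRY_FOLIO;
}

/* Recover the payload of a shadow entry. */
static uintptr_t toy_shadow_payload(uintptr_t entry)
{
	assert(entry & 1);
	return entry >> 1;
}
```

One machine word per slot is enough because a real object pointer never has its low bit set, so that bit is free to act as a type tag.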

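With these changes, a swap lookup goes from an O(log N) tree walk to an O(1) direct address computation, and the lock granularity shrinks from 64 MB to 2 MB. This is where the new architecture's large performance gain comes from. Kairui also cleaned up the code and provided a set of helper interfaces: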

/*
 * All swap cache helpers below require the caller to ensure the swap entries
 * used are valid and stabilize the device by any of the following ways:
 * - Hold a reference by get_swap_device(): this ensures a single entry is
 *   valid and increases the swap device's refcount.
 * - Locking a folio in the swap cache: this ensures the folio's swap entries
 *   are valid and pinned, also implies reference to the device.
 * - Locking anything referencing the swap entry: e.g. PTL that protects
 *   swap entries in the page table, similar to locking swap cache folio.
 * - See the comment of get_swap_device() for more complex usage.
 */
struct folio *swap_cache_get_folio(swp_entry_t entry);
void *swap_cache_get_shadow(swp_entry_t entry);
void swap_cache_add_folio(struct folio *folio, swp_entry_t entry, void **shadow);
void swap_cache_del_folio(struct folio *folio);
/* Below helpers require the caller to lock and pass in the swap cluster. */
void __swap_cache_del_folio(struct swap_cluster_info *ci,
			    struct folio *folio, swp_entry_t entry, void *shadow);
void __swap_cache_replace_folio(struct swap_cluster_info *ci,
				struct folio *old, struct folio *new);
void __swap_cache_clear_shadow(swp_entry_t entry, int nr_ents);

Addressing inside the swap table works as follows: first find the cluster, then use the offset to find the slot:

static inline struct swap_cluster_info *__swap_offset_to_cluster(
		struct swap_info_struct *si, pgoff_t offset)
{
	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
	VM_WARN_ON_ONCE(offset >= si->max);
	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
}

static inline unsigned int swp_cluster_offset(swp_entry_t entry)
{
	return swp_offset(entry) % SWAPFILE_CLUSTER;
}
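
The two helpers amount to one division and one remainder. A user-space sketch follows, assuming 512 slots per cluster as in the rest of this article; toy_slot_pos and toy_offset_to_pos are hypothetical names.

```c
#include <assert.h>

#define TOY_SWAPFILE_CLUSTER 512UL	/* assumed slots per cluster */

/* Where a swap offset lands: which cluster, and which entry of its table. */
struct toy_slot_pos {
	unsigned long cluster;	/* index into the per-device cluster array */
	unsigned long index;	/* index into that cluster's 512-entry table */
};

/* Split a device-wide slot offset into (cluster, index) in O(1). */
static struct toy_slot_pos toy_offset_to_pos(unsigned long offset)
{
	struct toy_slot_pos pos = {
		.cluster = offset / TOY_SWAPFILE_CLUSTER,
		.index = offset % TOY_SWAPFILE_CLUSTER,
	};

	return pos;
}
```

For example, offset 1300 falls in cluster 2 at index 276, since 2 * 512 = 1024 and 1300 - 1024 = 276. No tree is walked: the lookup costs two arithmetic operations and two array accesses.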

To close, one small point to think about: the metadata that manages all this also needs precious memory to live in. The current swap table corresponds to exactly one page:

static struct swap_table *swap_table_alloc(gfp_t gfp)
{
	struct folio *folio;

	if (!SWP_TABLE_USE_PAGE)
		return kmem_cache_zalloc(swap_table_cachep, gfp);

	folio = folio_alloc(gfp | __GFP_ZERO, 0);
	if (folio)
		return folio_address(folio);
	return NULL;
}

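Kairui and colleagues implemented an RCU mechanism for freeing on the cold path, and they also plan to optimize the per-page reference count kept in swap map, which would save a considerable amount of memory. (Note that the earlier XArray-based approach also had to allocate memory for its storage.)

References:

swapcache：从 xarray 到 swap table 的变迁 (Swap cache: the transition from XArray to swap table)

内存管理特性分析（十一）:Linux swap 机制及优化技术分析 (Memory management feature analysis, part 11: the Linux swap mechanism and its optimization techniques)

一个让 Linus Torvalds 「不明觉赞」的内核优化与修复历程 (A kernel optimization and fix story that impressed Linus Torvalds)

clk 分享 mthp 和 swap 分配器的优化 (A clk talk on mTHP and swap allocator optimizations)

Postscript:

While studying this area I went through many stages, and I am still an observer and learner; I hope to contribute code here in the future. Learning the kernel takes more than reading kernel code: it also takes test programs to observe and get a feel for the behavior. Here is a tool from Barry Song (宝华), also a member of the community, to try out and learn from: tools/mm/thp_swap_allocator_test.c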

// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * thp_swap_allocator_test
 *
 * The purpose of this test program is helping check if THP swpout
 * can correctly get swap slots to swap out as a whole instead of
 * being split. It randomly releases swap entries through madvise
 * DONTNEED and swapin/out on two memory areas: a memory area for
 * 64KB THP and the other area for small folios. The second memory
 * can be enabled by "-s".
 * Before running the program, we need to setup a zRAM or similar
 * swap device by:
 *  echo lzo > /sys/block/zram0/comp_algorithm
 *  echo 64M > /sys/block/zram0/disksize
 *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
 *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
 *  mkswap /dev/zram0
 *  swapon /dev/zram0
 * The expected result should be 0% anon swpout fallback ratio w/ or
 * w/o "-s".
 *
 * Author(s): Barry Song <v-songbaohua@oppo.com>
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <linux/mman.h>
#include <sys/mman.h>
#include <errno.h>
#include <time.h>

#define MEMSIZE_MTHP (60 * 1024 * 1024)
#define MEMSIZE_SMALLFOLIO (4 * 1024 * 1024)
#define ALIGNMENT_MTHP (64 * 1024)
#define ALIGNMENT_SMALLFOLIO (4 * 1024)
#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
#define TOTAL_DONTNEED_SMALLFOLIO (1 * 1024 * 1024)
#define MTHP_FOLIO_SIZE (64 * 1024)

#define SWPOUT_PATH \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
#define SWPOUT_FALLBACK_PATH \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"

static void *aligned_alloc_mem(size_t size, size_t alignment)
{
	void *mem = NULL;

	if (posix_memalign(&mem, alignment, size) != 0) {
		perror("posix_memalign");
		return NULL;
	}
	return mem;
}

/*
 * This emulates the behavior of native libc and Java heap,
 * as well as process exit and munmap. It helps generate mTHP
 * and ensures that iterations can proceed with mTHP, as we
 * currently don't support large folios swap-in.
 */
static void random_madvise_dontneed(void *mem, size_t mem_size,
		size_t align_size, size_t total_dontneed_size)
{
	size_t num_pages = total_dontneed_size / align_size;
	size_t i;
	size_t offset;
	void *addr;

	for (i = 0; i < num_pages; ++i) {
		offset = (rand() % (mem_size / align_size)) * align_size;
		addr = (char *)mem + offset;
		if (madvise(addr, align_size, MADV_DONTNEED) != 0)
			perror("madvise dontneed");

		memset(addr, 0x11, align_size);
	}
}

static void random_swapin(void *mem, size_t mem_size,
		size_t align_size, size_t total_swapin_size)
{
	size_t num_pages = total_swapin_size / align_size;
	size_t i;
	size_t offset;
	void *addr;

	for (i = 0; i < num_pages; ++i) {
		offset = (rand() % (mem_size / align_size)) * align_size;
		addr = (char *)mem + offset;
		memset(addr, 0x11, align_size);
	}
}

static unsigned long read_stat(const char *path)
{
	FILE *file;
	unsigned long value;

	file = fopen(path, "r");
	if (!file) {
		perror("fopen");
		return 0;
	}

	if (fscanf(file, "%lu", &value) != 1) {
		perror("fscanf");
		fclose(file);
		return 0;
	}

	fclose(file);
	return value;
}

int main(int argc, char *argv[])
{
	int use_small_folio = 0, aligned_swapin = 0;
	void *mem1 = NULL, *mem2 = NULL;
	int i;

	for (i = 1; i < argc; ++i) {
		if (strcmp(argv[i], "-s") == 0)
			use_small_folio = 1;
		else if (strcmp(argv[i], "-a") == 0)
			aligned_swapin = 1;
	}

	mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
	if (mem1 == NULL) {
		fprintf(stderr, "Failed to allocate large folios memory\n");
		return EXIT_FAILURE;
	}

	if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
		perror("madvise hugepage for mem1");
		free(mem1);
		return EXIT_FAILURE;
	}

	if (use_small_folio) {
		mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
		if (mem2 == NULL) {
			fprintf(stderr, "Failed to allocate small folios memory\n");
			free(mem1);
			return EXIT_FAILURE;
		}

		if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
			perror("madvise nohugepage for mem2");
			free(mem1);
			free(mem2);
			return EXIT_FAILURE;
		}
	}

	/* warm-up phase to occupy the swapfile */
	memset(mem1, 0x11, MEMSIZE_MTHP);
	madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT);
	if (use_small_folio) {
		memset(mem2, 0x11, MEMSIZE_SMALLFOLIO);
		madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT);
	}

	/* iterations with newly created mTHP, swap-in, and swap-out */
	for (i = 0; i < 100; ++i) {
		unsigned long initial_swpout;
		unsigned long initial_swpout_fallback;
		unsigned long final_swpout;
		unsigned long final_swpout_fallback;
		unsigned long swpout_inc;
		unsigned long swpout_fallback_inc;
		double fallback_percentage;

		initial_swpout = read_stat(SWPOUT_PATH);
		initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);

		/*
		 * The following setup creates a 1:1 ratio of mTHP to small folios
		 * since large folio swap-in isn't supported yet. Once we support
		 * mTHP swap-in, we'll likely need to reduce MEMSIZE_MTHP and
		 * increase MEMSIZE_SMALLFOLIO to maintain the ratio.
		 */
		random_swapin(mem1, MEMSIZE_MTHP,
			      aligned_swapin ? ALIGNMENT_MTHP : ALIGNMENT_SMALLFOLIO,
			      TOTAL_DONTNEED_MTHP);
		random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
					TOTAL_DONTNEED_MTHP);

		if (use_small_folio) {
			random_swapin(mem2, MEMSIZE_SMALLFOLIO,
				      ALIGNMENT_SMALLFOLIO,
				      TOTAL_DONTNEED_SMALLFOLIO);
		}

		if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
			perror("madvise pageout for mem1");
			free(mem1);
			if (mem2 != NULL)
				free(mem2);
			return EXIT_FAILURE;
		}

		if (use_small_folio) {
			if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
				perror("madvise pageout for mem2");
				free(mem1);
				free(mem2);
				return EXIT_FAILURE;
			}
		}

		final_swpout = read_stat(SWPOUT_PATH);
		final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);

		swpout_inc = final_swpout - initial_swpout;
		swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;

		fallback_percentage = (double)swpout_fallback_inc /
			(swpout_fallback_inc + swpout_inc) * 100;

		printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
		       i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
	}

	free(mem1);
	if (mem2 != NULL)
		free(mem2);

	return EXIT_SUCCESS;
}

Original link: Linux有哪些惊为天人的设计？ - answer by lianux on Zhihu
https://www.zhihu.com/question/494965057/answer/1953374530031552367
