From: Hui Zhu The swap_cluster_alloc_table() function requires several locks to be held by its callers: ci->lock, the per-CPU swap_cluster lock, and, for non-solid-state devices (non-SWP_SOLIDSTATE), the si->global_cluster_lock. While most call paths (e.g., via cluster_alloc_swap_entry() or alloc_swap_scan_list()) correctly acquire these locks before invocation, the path through swap_reclaim_work() -> swap_reclaim_full_clusters() -> isolate_lock_cluster() is distinct. This path operates exclusively on si->full_clusters, where the swap allocation tables are guaranteed to be already allocated. Consequently, isolate_lock_cluster() should never trigger a call to swap_cluster_alloc_table() for these clusters. Strengthen the locking and state assertions to formalize these invariants: 1. Add a lockdep_assert_held() for si->global_cluster_lock in swap_cluster_alloc_table() for non-SWP_SOLIDSTATE devices. 2. Reorder existing lockdep assertions in swap_cluster_alloc_table() to match the actual lock acquisition order (per-CPU lock, then global lock, then cluster lock). 3. Add a VM_WARN_ON_ONCE() in isolate_lock_cluster() to ensure that table allocations are only attempted for clusters being isolated from the free list. Attempting to allocate a table for a cluster from other lists (like the full list during reclaim) indicates a violation of subsystem invariants. These changes ensure locking consistency and help catch potential synchronization or logic issues during development. Changelog: v3: According to the comments of Kairui Song, squash patches and fix logic bug in isolate_lock_cluster() where flags were cleared before check. v2: According to the comments of YoungJun Park, Kairui Song and Chris Li, change acquire locks in swap_reclaim_work() to adds a VM_WARN_ON in isolate_lock_cluster(). According to the comments of YoungJun Park, add code in patch 2 to Change the order of lockdep_assert_held() to match the actual lock acquisition order. Reviewed-by: Youngjun Park Signed-off-by: Hui Zhu --- mm/swapfile.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 94af29d1de88..4e0fb1ce5245 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -476,8 +476,10 @@ swap_cluster_alloc_table(struct swap_info_struct *si, * Only cluster isolation from the allocator does table allocation. * Swap allocator uses percpu clusters and holds the local lock. */ - lockdep_assert_held(&ci->lock); lockdep_assert_held(&this_cpu_ptr(&percpu_swap_cluster)->lock); + if (!(si->flags & SWP_SOLIDSTATE)) + lockdep_assert_held(&si->global_cluster_lock); + lockdep_assert_held(&ci->lock); /* The cluster must be free and was just isolated from the free list. */ VM_WARN_ON_ONCE(ci->flags || !cluster_is_empty(ci)); @@ -577,6 +579,7 @@ static struct swap_cluster_info *isolate_lock_cluster( struct swap_info_struct *si, struct list_head *list) { struct swap_cluster_info *ci, *found = NULL; + u8 flags; spin_lock(&si->lock); list_for_each_entry(ci, list, list) { @@ -589,6 +592,7 @@ static struct swap_cluster_info *isolate_lock_cluster( ci->flags != CLUSTER_FLAG_FULL); list_del(&ci->list); + flags = ci->flags; ci->flags = CLUSTER_FLAG_NONE; found = ci; break; @@ -596,6 +600,9 @@ static struct swap_cluster_info *isolate_lock_cluster( spin_unlock(&si->lock); if (found && !cluster_table_is_alloced(found)) { + /* Table of non-free cluster must be allocated. */ + VM_WARN_ON_ONCE(flags != CLUSTER_FLAG_FREE); + /* Only an empty free cluster's swap table can be freed. */ VM_WARN_ON_ONCE(list != &si->free_clusters); VM_WARN_ON_ONCE(!cluster_is_empty(found)); -- 2.43.0