From: Kairui Song Fragment clusters were mostly failing high order allocation already. The reason we scan it through now is that a swap slot may get freed without releasing the swap cache, so a swap map entry will end up in HAS_CACHE only status, and the cluster won't be moved back to non-full or free cluster list. This may cause a higher allocation failure rate. Usually only !SWP_SYNCHRONOUS_IO devices may have a large number of slots stuck in HAS_CACHE only status. Because when a !SWP_SYNCHRONOUS_IO device's usage is low (!vm_swap_full()), it will try to lazy free the swap cache. But this fragment list scan out is a bit overkill. Fragmentation is only an issue for the allocator when the device is getting full, and by that time, swap will be releasing the swap cache aggressively already. Only scan one fragment cluster at a time is good enough to reclaim already pinned slots, and move the cluster back to nonfull. And besides, only high order allocation requires iterating over the list, order 0 allocation will succeed on the first attempt. And high order allocation failure isn't a serious problem. So the iteration of fragment clusters is trivial, but it will slow down large allocation by a lot when the fragment cluster list is long. So it's better to drop this fragment cluster iteration design. Test on a 48c96t system, build linux kernel using 10G ZRAM, make -j48, defconfig with 768M cgroup memory limit, on top of tmpfs, 4K folio only: Before: sys time: 4432.56s After: sys time: 4430.18s Change to make -j96, 2G memory limit, 64kB mTHP enabled, and 10G ZRAM: Before: sys time: 11609.69s 64kB/swpout: 1787051 64kB/swpout_fallback: 20917 After: sys time: 5572.85s 64kB/swpout: 1797612 64kB/swpout_fallback: 19254 Change to 8G ZRAM: Before: sys time: 21524.35s 64kB/swpout: 1687142 64kB/swpout_fallback: 128496 After: sys time: 6278.45s 64kB/swpout: 1679127 64kB/swpout_fallback: 130942 Change to use 10G brd device with SWP_SYNCHRONOUS_IO flag removed: Before: sys time: 7393.50s 64kB/swpout:1788246 swpout_fallback: 0 After: sys time: 7399.88s 64kB/swpout:1784257 swpout_fallback: 0 Change to use 8G brd device with SWP_SYNCHRONOUS_IO flag removed: Before: sys time: 26292.26s 64kB/swpout:1645236 swpout_fallback: 138945 After: sys time: 9463.16s 64kB/swpout:1581376 swpout_fallback: 259979 The performance is a lot better for large folios, and the large order allocation failure rate is only very slightly higher or unchanged even for !SWP_SYNCHRONOUS_IO devices high pressure. Signed-off-by: Kairui Song Acked-by: Nhat Pham --- mm/swapfile.c | 23 ++++++++--------------- 1 file changed, 8 insertions(+), 15 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index b4f3cc712580..1f1110e37f68 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -926,32 +926,25 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o swap_reclaim_full_clusters(si, false); if (order < PMD_ORDER) { - unsigned int frags = 0, frags_existing; - while ((ci = isolate_lock_cluster(si, &si->nonfull_clusters[order]))) { found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), order, usage); if (found) goto done; - /* Clusters failed to allocate are moved to frag_clusters */ - frags++; } - frags_existing = atomic_long_read(&si->frag_cluster_nr[order]); - while (frags < frags_existing && - (ci = isolate_lock_cluster(si, &si->frag_clusters[order]))) { - atomic_long_dec(&si->frag_cluster_nr[order]); - /* - * Rotate the frag list to iterate, they were all - * failing high order allocation or moved here due to - * per-CPU usage, but they could contain newly released - * reclaimable (eg. lazy-freed swap cache) slots. - */ + /* + * Scan only one fragment cluster is good enough. Order 0 + * allocation will surely success, and large allocation + * failure is not critical. Scanning one cluster still + * keeps the list rotated and reclaimed (for HAS_CACHE). + */ + ci = isolate_lock_cluster(si, &si->frag_clusters[order]); + if (ci) { found = alloc_swap_scan_cluster(si, ci, cluster_offset(si, ci), order, usage); if (found) goto done; - frags++; } } -- 2.50.1