zswap_shrinker_count() calls mem_cgroup_flush_stats(), which takes the
global cgroup rstat lock synchronously. On machines with many CPUs and
NUMA nodes, this creates severe lock contention in the kswapd reclaim
path:

- Multiple kswapd threads (one per NUMA node) run concurrently.
- do_shrink_slab() invokes zswap_shrinker_count() for each memcg-aware
  shrinker pass.
- Each call flushes the full cgroup rstat hierarchy under the global
  lock.

On AMD EPYC 9684X machines (96 cores, 192 threads, 12 NUMA nodes)
running production workloads with zswap enabled, perf shows 2.88% of
kernel cycles in osq_lock contention from this path:

    2.88%  [k] osq_lock
           --__mutex_lock.constprop.0
             --__cgroup_rstat_lock
               --cgroup_rstat_flush_locked
                 --cgroup_rstat_flush
                   --zswap_shrinker_count
                     do_shrink_slab
                     shrink_slab
                     shrink_node
                     balance_pgdat
                     kswapd

84% of kswapd kernel cycles are spent in shrink_slab ->
zswap_shrinker_count -> cgroup_rstat_flush, not in actual page reclaim
(shrink_lruvec).

Controlled A/B on identical hardware and workload:

    shrinker=Y: 2.88% osq_lock, memory PSI 1.58%
    shrinker=N: 0.00% osq_lock, memory PSI 0.57%

eBPF-based rstat lock wait measurement across 8 production metals
confirms the contention splits cleanly along shrinker enablement:

    shrinker=Y: 50-250x more contended lock acquisitions (248/s vs 1.1/s)
    shrinker=N: baseline lock wait (0.0017 s/s vs 1.04 s/s)

zswap_shrinker_count() only produces a heuristic estimate, scaled by
compression ratio via mult_frac(). The actual writeback happens in
zswap_shrinker_scan(). Slightly stale stats are acceptable here.

Switch to mem_cgroup_flush_stats_ratelimited(), which only flushes if
the periodic 2-second flusher is one full cycle late. This matches the
approach already used in prepare_scan_control() (mm/vmscan.c) for the
same reclaim path.

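For illustration only, the ratelimiting behaviour described above can be
modelled in userspace roughly as follows. This is a sketch, not the
kernel implementation: struct rstat_model, model_flush(),
model_flush_ratelimited() and simulate() are hypothetical names, time is
in milliseconds rather than jiffies, and the "flush" just records a
timestamp instead of taking the rstat lock.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
#define FLUSH_PERIOD_MS 2000u	/* period of the background flusher */

struct rstat_model {
	uint64_t last_flush_ms;	/* time of the most recent flush */
};

/* Stands in for the expensive, lock-taking hierarchy flush. */
static void model_flush(struct rstat_model *m, uint64_t now_ms)
{
	m->last_flush_ms = now_ms;
}

/* Flush only when the periodic flusher is a full cycle late. */
static bool model_flush_ratelimited(struct rstat_model *m, uint64_t now_ms)
{
	if (now_ms > m->last_flush_ms + 2 * FLUSH_PERIOD_MS) {
		model_flush(m, now_ms);
		return true;
	}
	return false;
}

/*
 * Issue `calls` ratelimited flush requests spaced `step_ms` apart and
 * return how many of them actually flushed. When `flusher_healthy` is
 * true, a background flush runs every FLUSH_PERIOD_MS.
 */
static unsigned int simulate(unsigned int calls, uint64_t step_ms,
			     bool flusher_healthy)
{
	struct rstat_model m = { 0 };
	uint64_t next_periodic = FLUSH_PERIOD_MS;
	unsigned int caller_flushes = 0;

	for (unsigned int i = 1; i <= calls; i++) {
		uint64_t now = i * step_ms;

		/* Let the background flusher catch up to the current time. */
		while (flusher_healthy && next_periodic <= now) {
			model_flush(&m, next_periodic);
			next_periodic += FLUSH_PERIOD_MS;
		}
		if (model_flush_ratelimited(&m, now))
			caller_flushes++;
	}
	return caller_flushes;
}
```

In this model, callers never flush while the background flusher keeps
up, and fall back to flushing themselves only once it has stalled for
more than two periods, so the staleness seen by the shrinker stays
bounded.
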
After applying this patch, rstat flush latency and lock wait time on
shrinker=Y machines dropped to the same level as shrinker=N controls,
while the zswap shrinker continues to function (pool size remains
bounded under the max_pool_percent cap).

Previously discussed:

- Chengming Zhou (Dec 2023): rstat contention from
  zswap_shrinker_count [1]
- Shakeel Butt (Aug 2024): zswap_shrinker_count still uses sync
  flush [2]
- Yosry Ahmed (Aug 2024): suggested eliminating in-kernel flushers [3]
- Jesper Dangaard Brouer (Sep 2024): cgroup/rstat V11 patch [4]

[1] https://lore.kernel.org/linux-mm/20231206103935.3440502-1-zhouchengming@bytedance.com/
[2] https://lore.kernel.org/linux-mm/CALvZod7LFxLCxVpOFH8b2Ppm8T40HPGMKQwX_=NPCWB_mFW+oQ@mail.gmail.com/
[3] https://lore.kernel.org/linux-mm/CAJD7tkYvFyOSX+rP_FKGBhxvZiCDxtpsNp-c5CGOA-4Bq9oXSg@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/172616070094.2055617.17676042522679701515.stgit@firesoul/

Suggested-by: Jesper Dangaard Brouer
Signed-off-by: Jesper Dangaard Brouer
Signed-off-by: Yunzhao Li
Tested-by: Yunzhao Li
---
 mm/zswap.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e..b5a17ea20 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1217,7 +1217,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	 * Without memcg, use the zswap pool-wide metrics.
 	 */
 	if (!mem_cgroup_disabled()) {
-		mem_cgroup_flush_stats(memcg);
+		mem_cgroup_flush_stats_ratelimited(memcg);
 		nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
 		nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
 	} else {
-- 
2.43.0