When zswap writeback is enabled and compression of a given page fails, the page is swapped out to the backing swap device. This behavior breaks zswap's writeback LRU order, and hence users can experience unexpected latency spikes. If the page is compressed without failure but the result is PAGE_SIZE in size, the LRU order is kept, but the decompression overhead for loading the page back on a later access is unnecessary.

Keep the LRU order and avoid the unnecessary decompression overhead in those cases by storing the original content as-is in the zpool. The length field of zswap_entry will be set appropriately, to PAGE_SIZE. Hence whether an entry is saved as-is (whether decompression is unnecessary) is identified by 'zswap_entry->length == PAGE_SIZE'.

Because the uncompressed data is saved in the zpool, the same as the compressed data, this introduces no change in terms of memory management, including the movability and migratability of the involved pages. This change also does not increase the per-entry zswap metadata overhead. But as the number of incompressible pages increases, the total zswap metadata overhead increases proportionally. The overhead should not be problematic in usual cases, since the zswap metadata for a single zswap entry is much smaller than PAGE_SIZE, and common zswap use cases should have a sufficient amount of compressible pages. It can also be mitigated by zswap writeback.

When writeback is disabled, the additional overhead could be problematic. For that case, keep the current behavior: just return the failure and let swap_writeout() put the page back to the active LRU list.

Knowing how many compression failures the crypto engine has raised so far, and how many incompressible pages are stored at a given moment, will be useful for future investigations. Add two new debugfs files, crypto_compress_fail and stored_incompressible_pages, for the two counts, respectively.

Tests
-----

I tested this patch using a simple self-written microbenchmark that is available at GitHub[1]. You can reproduce the test I did by executing run_tests.sh of the repo on your system. Note that the repo's documentation is not good as of this writing, so you may need to read and use the code.

The basic test scenario is simple. Run a test program that makes artificial accesses to memory having artificial content, under a memory.high-set memory limit, and measure how many accesses were made in a given time.

The test program repeatedly and randomly accesses three anonymous memory regions. The regions are all 500 MiB in size and are accessed with the same probability. Two of them are filled with simple content that can easily be compressed, while the remaining one is filled with content read from /dev/urandom, which is likely to fail at compressing to a size smaller than PAGE_SIZE. The program runs for two minutes and prints out the number of accesses made every five seconds. A minimal sketch of the program's core loop is shown after the configuration list below.

The test script runs the program under the four configurations below.

- 0: memory.high is set to 2 GiB, zswap is disabled.
- 1-1: memory.high is set to 1350 MiB, zswap is disabled.
- 1-2: On 1-1, zswap is enabled without this patch.
- 1-3: On 1-2, this patch is applied.

For all zswap-enabled cases, the zswap shrinker is enabled.

Configuration '0' is for showing the original memory performance. Configurations 1-1, 1-2 and 1-3 are for showing the performance of swap, zswap, and this patch under a level of memory pressure (~10% of the working set).
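Below is a minimal, hypothetical sketch of the test program's core loop, only to clarify the access pattern. It is not the actual code from the repo[1]; error handling and the memory.high cgroup setup are omitted.

	/*
	 * Hypothetical sketch of the benchmark's core loop; see [1] for
	 * the real implementation.  Error handling and the memory.high
	 * cgroup setup are omitted for brevity.
	 */
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>
	#include <unistd.h>

	#define NR_REGIONS	3
	#define REGION_SZ	(500UL << 20)	/* 500 MiB per region */

	int main(void)
	{
		char *regions[NR_REGIONS];
		unsigned long nr_accesses = 0;
		time_t start, last_report;
		int fd, i;

		/* two easily compressible regions, one incompressible one */
		fd = open("/dev/urandom", O_RDONLY);
		for (i = 0; i < NR_REGIONS; i++) {
			size_t off = 0;

			regions[i] = malloc(REGION_SZ);
			if (i < 2) {
				memset(regions[i], 'c', REGION_SZ);
				continue;
			}
			while (off < REGION_SZ)
				off += read(fd, regions[i] + off,
						REGION_SZ - off);
		}

		/* randomly access the three regions with equal probability */
		start = last_report = time(NULL);
		while (time(NULL) - start < 120) {	/* two minutes */
			regions[rand() % NR_REGIONS][rand() % REGION_SZ] = 'a';
			nr_accesses++;
			if (time(NULL) - last_report >= 5) {
				printf("%lu accesses\n", nr_accesses);
				nr_accesses = 0;
				last_report = time(NULL);
			}
		}
		return 0;
	}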
Configurations 0 and 1-1 are not the main focus of this patch, but I'm adding them since their results transparently show how far this microbenchmark is from the real world.

Because the per-5-seconds performance is not very reliable, I measured the average of it over the last one minute of the test program run. I also measured a few vmstat counters, including zswpin, zswpout, zswpwb, pswpin and pswpout, during the test runs.

The measurement results are as below. To save space, I show performance numbers that are normalized to that of configuration '0' (no memory pressure). The averaged accesses per 5 seconds of configuration '0' was 36493417.75.

    config            0        1-1      1-2      1-3
    perf_normalized   1.0000   0.0057   0.0235   0.0367
    perf_stdev_ratio  0.0582   0.0652   0.0167   0.0346
    zswpin            0        0        3548424  1999335
    zswpout           0        0        3588817  2361689
    zswpwb            0        0        10214    340270
    pswpin            0        485806   772038   340967
    pswpout           0        649543   144773   340270

'perf_normalized' is the performance metric, normalized to that of configuration '0' (no pressure). 'perf_stdev_ratio' is the standard deviation of the averaged data points, as a ratio to the averaged metric value. For example, configuration '0' performance showed a 5.8% stdev. Configurations 1-1 and 1-3 had about 6.5% and 3.5% stdev, respectively. The results were also highly variable between multiple runs. So this result is not very stable and only shows ballpark figures. Please keep this in mind when reading these results.

Under about 10% working set memory pressure, the performance dropped to about 0.57% of the no-pressure one when the normal swap is used (1-1). Note that ~10% working set pressure is already extreme, at least on this test setup. No one would desire system setups that can degrade performance to 0.57% of the best case.

By turning zswap on (1-2), the performance improved about 4x, resulting in about 2.35% of the no-pressure one. Because of the incompressible pages in the third memory region, a significant amount of (non-zswap) swap I/O operations were still made, though.

By applying this patch (1-3), about a 56% performance improvement was made, resulting in about 3.67% of the no-pressure one. The reduced pswpin of 1-3 compared to 1-2 shows where this improvement came from.

Tests without Zswap Shrinker
----------------------------

The zswap shrinker is not enabled by default, so I ran the above test again after disabling it. The results are as below.

    config            0        1-1      1-2      1-3
    perf_normalized   1.0000   0.0056   0.0185   0.0260
    perf_stdev_ratio  0.0467   0.0348   0.1832   0.3387
    zswpin            0        0        2506765  6049078
    zswpout           0        0        2534357  6115426
    zswpwb            0        0        0        0
    pswpin            0        463694   472978   0
    pswpout           0        686227   612149   0

The overall normalized performance of the different configs is very similar to the zswap-shrinker-enabled case. By adding the memory pressure, the performance dropped to 0.56% of the original one. By enabling zswap without the zswap shrinker, the performance increased to 1.85% of the original one. By applying this patch on top of that, the performance further increased to 2.6% of the original one.

Even though the zswap shrinker is disabled, 1-2 shows high numbers of pswpin and pswpout, because the incompressible pages are directly swapped out. In contrast, 1-3 shows zero pswpin and pswpout, since it keeps the incompressible pages in memory, and shows higher performance.

Note that the performance of 1-2 and 1-3 varies quite a lot. The standard deviation of the performance for 1-2 was about 18.32% of the performance, while that for 1-3 was about 33.87%. Because the zswap shrinker is disabled and the memory pressure is induced by memory.high, the workload got penalty_jiffies sleeps, and this resulted in the unstable performance.
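As a side note for reproducing such investigations, the two new counters can be read from debugfs at runtime. A usage sketch, assuming debugfs is mounted at its usual /sys/kernel/debug location:

    # cat /sys/kernel/debug/zswap/crypto_compress_fail
    # cat /sys/kernel/debug/zswap/stored_incompressible_pages

crypto_compress_fail is cumulative, while stored_incompressible_pages is a snapshot of the number of pages currently stored as-is.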
Related Works
-------------

This is not an entirely new attempt. Nhat Pham and Takero Funaki tried very similar approaches in October 2023[2] and April 2024[3], respectively. The two approaches didn't get merged, mainly due to the metadata overhead concern. I described at the beginning of this changelog why I think that shouldn't be a problem for this change, which is automatically disabled when writeback is disabled. This patch is not particularly different from those, and is actually built upon them, though I wrote it from scratch again. Hence I am adding Suggested-by tags for them. Actually, Nhat first suggested this to me offlist.

Historically, writeback disabling was introduced partially as a way to solve the LRU order issue. Yosry pointed out[4] that this is still suboptimal when the incompressible pages are cold, since zswap will repeatedly try to swap those pages out, burning CPU cycles on compression attempts that will fail anyway. One imaginable solution for that problem is reusing the swapped-out page and its struct page to store the content in the zswap pool. But that is out of the scope of this patch.

[1] https://github.com/sjp38/eval_zswap/blob/master/run.sh
[2] https://lore.kernel.org/20231017003519.1426574-3-nphamcs@gmail.com
[3] https://lore.kernel.org/20240706022523.1104080-6-flintglass@gmail.com
[4] https://lore.kernel.org/CAJD7tkZXS-UJVAFfvxJ0nNgTzWBiqepPYA4hEozi01_qktkitg@mail.gmail.com

Signed-off-by: SeongJae Park
Suggested-by: Nhat Pham
Suggested-by: Takero Funaki
Acked-by: Nhat Pham
Cc: Chengming Zhou
Cc: David Hildenbrand
Cc: Johannes Weiner
Cc: SeongJae Park
Cc: Baoquan He
Cc: Barry Song
Cc: Chris Li
Cc: Kairui Song

---

Changes from v3 (https://lore.kernel.org/20250815213020.89327-1-sj@kernel.org)
(discussions for changes from v3 were made on the v2 thread)
- Drop the cumulated compression failure counter (compress_fail)
- Add a cumulated crypto-failure only counter (crypto_compress_fail)
- Add a non-cumulated stored incompressible pages counter
  (stored_incompressible_pages)
- Clean up compression failure handling code for readability

Changes from v2 (https://lore.kernel.org/20250812170046.56468-1-sj@kernel.org)
- No code change but changelog updates
- Add zswap shrinker disabled case test results.
- Fix a typo in the changelog.
- Add a clarification of the intention of the 0 and 1-1 test configs.

Changes from v1 (https://lore.kernel.org/20250807181616.1895-1-sj@kernel.org)
- Optimize out memcpy() per incompressible page saving, using
  k[un]map_local().
- Add a debugfs file for counting compression failures.
- Use a clear form of a ternary operation.
- Add the history of writeback disabling with a link.
- Wordsmith comments.

Changes from RFC v2 (https://lore.kernel.org/20250805002954.1496-1-sj@kernel.org)
- Fix race conditions at decompressed pages identification.
- Remove the parameter and make saving as-is the default behavior.
- Open-code main changes.
- Clarify there are no memory management changes on the cover letter.
- Remove the 20% pressure case from test results, since it is arguably too
  extreme and only adds confusion.
- Drop RFC tag.

Changes from RFC v1 (https://lore.kernel.org/20250730234059.4603-1-sj@kernel.org)
- Consider PAGE_SIZE compression successes as failures.
- Use zpool for storing incompressible pages.
- Test with zswap shrinker enabled.
- Wordsmith changelog and comments.
- Add documentation of save_incompressible_pages parameter.

 mm/zswap.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 54 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 3c0fd8a13718..1f1ac043a2d9 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -42,8 +42,10 @@
 /*********************************
 * statistics
 **********************************/
-/* The number of compressed pages currently stored in zswap */
+/* The number of pages currently stored in zswap */
 atomic_long_t zswap_stored_pages = ATOMIC_LONG_INIT(0);
+/* The number of incompressible pages currently stored in zswap */
+atomic_long_t zswap_stored_incompressible_pages = ATOMIC_LONG_INIT(0);
 
 /*
 * The statistics below are not protected from concurrent access for
@@ -60,6 +62,8 @@ static u64 zswap_written_back_pages;
 static u64 zswap_reject_reclaim_fail;
 /* Store failed due to compression algorithm failure */
 static u64 zswap_reject_compress_fail;
+/* Compression failed by the crypto library */
+static u64 zswap_crypto_compress_fail;
 /* Compressed page was too big for the allocator to (optimally) store */
 static u64 zswap_reject_compress_poor;
 /* Load or writeback failed due to decompression failure */
@@ -811,6 +815,8 @@ static void zswap_entry_free(struct zswap_entry *entry)
 		obj_cgroup_uncharge_zswap(entry->objcg, entry->length);
 		obj_cgroup_put(entry->objcg);
 	}
+	if (entry->length == PAGE_SIZE)
+		atomic_long_dec(&zswap_stored_incompressible_pages);
 	zswap_entry_cache_free(entry);
 	atomic_long_dec(&zswap_stored_pages);
 }
@@ -976,8 +982,28 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 */
 	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
 	dlen = acomp_ctx->req->dlen;
-	if (comp_ret)
-		goto unlock;
+
+	/*
+	 * If a page cannot be compressed into a size smaller than PAGE_SIZE,
+	 * save the content as is without a compression, to keep the LRU order
+	 * of writebacks.  If writeback is disabled, reject the page since it
+	 * only adds metadata overhead.  swap_writeout() will put the page back
+	 * to the active LRU list in the case.
+	 */
+	if (comp_ret || !dlen) {
+		zswap_crypto_compress_fail++;
+		dlen = PAGE_SIZE;
+	}
+	if (dlen >= PAGE_SIZE) {
+		if (!mem_cgroup_zswap_writeback_enabled(
+					folio_memcg(page_folio(page)))) {
+			comp_ret = -EINVAL;
+			goto unlock;
+		}
+		comp_ret = 0;
+		dlen = PAGE_SIZE;
+		dst = kmap_local_page(page);
+	}
 
 	zpool = pool->zpool;
 	gfp = GFP_NOWAIT | __GFP_NORETRY | __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -990,6 +1016,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	entry->length = dlen;
 
 unlock:
+	if (dst != acomp_ctx->buffer)
+		kunmap_local(dst);
 	if (comp_ret == -ENOSPC || alloc_ret == -ENOSPC)
 		zswap_reject_compress_poor++;
 	else if (comp_ret)
@@ -1012,6 +1040,14 @@ static bool zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	acomp_ctx = acomp_ctx_get_cpu_lock(entry->pool);
 	obj = zpool_obj_read_begin(zpool, entry->handle, acomp_ctx->buffer);
 
+	/* zswap entries of length PAGE_SIZE are not compressed. */
+	if (entry->length == PAGE_SIZE) {
+		memcpy_to_folio(folio, 0, obj, entry->length);
+		zpool_obj_read_end(zpool, entry->handle, obj);
+		acomp_ctx_put_unlock(acomp_ctx);
+		return true;
+	}
+
 	/*
 	 * zpool_obj_read_begin() might return a kmap address of highmem when
 	 * acomp_ctx->buffer is not used.  However, sg_init_one() does not
@@ -1524,6 +1560,8 @@ static bool zswap_store_page(struct page *page,
 		obj_cgroup_charge_zswap(objcg, entry->length);
 	}
 	atomic_long_inc(&zswap_stored_pages);
+	if (entry->length == PAGE_SIZE)
+		atomic_long_inc(&zswap_stored_incompressible_pages);
 
 	/*
 	 * We finish initializing the entry while it's already in xarray.
@@ -1792,6 +1830,14 @@ static int debugfs_get_stored_pages(void *data, u64 *val)
 }
 DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");
 
+static int debugfs_get_stored_incompressible_pages(void *data, u64 *val)
+{
+	*val = atomic_long_read(&zswap_stored_incompressible_pages);
+	return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(stored_incompressible_pages_fops,
+		debugfs_get_stored_incompressible_pages, NULL, "%llu\n");
+
 static int zswap_debugfs_init(void)
 {
 	if (!debugfs_initialized())
@@ -1809,6 +1855,8 @@ static int zswap_debugfs_init(void)
 			   zswap_debugfs_root, &zswap_reject_kmemcache_fail);
 	debugfs_create_u64("reject_compress_fail", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_fail);
+	debugfs_create_u64("crypto_compress_fail", 0444,
+			   zswap_debugfs_root, &zswap_crypto_compress_fail);
 	debugfs_create_u64("reject_compress_poor", 0444,
 			   zswap_debugfs_root, &zswap_reject_compress_poor);
 	debugfs_create_u64("decompress_fail", 0444,
@@ -1819,6 +1867,9 @@
 			    zswap_debugfs_root, NULL, &total_size_fops);
 	debugfs_create_file("stored_pages", 0444,
 			    zswap_debugfs_root, NULL, &stored_pages_fops);
+	debugfs_create_file("stored_incompressible_pages", 0444,
+			    zswap_debugfs_root, NULL,
+			    &stored_incompressible_pages_fops);
 
 	return 0;
 }

base-commit: 803d261a97f9b4025282723d2930e58d49adcbf9
-- 
2.39.5