process_huge_page(), used to clear hugepages, is optimized for cache
locality. In particular it processes a hugepage in 4KB page units and
in a difficult to predict order: clearing pages in the periphery in a
backwards or forwards direction, then converging inwards to the
faulting page (or the page specified via base_addr.)

This helps maximize temporal locality at the time of access. However,
while it keeps stores inside a 4KB page sequential, the pages
themselves are ordered semi-randomly in a way that is not easy for the
processor to predict. This limits the clearing bandwidth to what's
achievable within a single 4KB page.

Consider the baseline bandwidth:

  $ perf bench mem mmap -p 2MB -f populate -s 64GB -l 3
  # Running 'mem/mmap' benchmark:
  # function 'populate' (Eagerly populated mmap())
  # Copying 64GB bytes ...

       11.791097 GB/sec

(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
region-size=64GB, local node; 2.56 GHz, boost=0.)

11.79 GBps amounts to around 323ns/4KB. With a memory access latency
of ~100ns, that doesn't leave much room for help from, say, hardware
prefetchers.

(Note that since this is a purely write workload, it's reasonable to
assume that the processor does not need to prefetch any cachelines.
However, for a processor to skip the prefetch, it would need to look
at the access pattern and see that full cachelines were being written.
This might be easily visible if clear_page() were using, say, x86
string instructions; less so if it were using a store loop. In any
case, the existence of these kinds of predictors, or appropriately
helpful threshold values, is implementation specific. Additionally,
even when the processor can skip the prefetch, coherence protocols
will still need to establish exclusive ownership, necessitating
communication with remote caches.)

With that, the change is quite straightforward. Instead of clearing
pages discontiguously, clear contiguously: switch to a loop around
clear_user_highpage().
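
For reference, the new clearing path reduces to a simple linear walk
over the folio (a condensed sketch of clear_contig_highpages() as
added in the diff below; the comment is only annotation):

	static void clear_contig_highpages(struct page *page, unsigned long addr,
					   unsigned int nr_pages)
	{
		unsigned int i;

		might_sleep();
		/* Clear the 4KB pages front-to-back, in address order. */
		for (i = 0; i < nr_pages; i++) {
			cond_resched();
			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
		}
	}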

Performance
==

Testing a demand fault workload shows a decent improvement in
bandwidth with pg-sz=2MB. Performance of pg-sz=1GB does not change
because it has always used straight clearing.

  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

              discontiguous-pages     contiguous-pages
                  (baseline)
              (GBps +- %stdev)        (GBps +- %stdev)

  pg-sz=2MB    11.76 +- 1.10%          23.58 +- 1.95%      +100.51%
  pg-sz=1GB    24.85 +- 2.41%          25.40 +- 1.33%           -

Analysis (pg-sz=2MB)
==

At the L1 data cache level, nothing changes. The processor continues
to access the same number of cachelines, allocating and missing them
as it writes to them.

  discontiguous-pages
    7,394,341,051   L1-dcache-loads        #  445.172 M/sec  ( +- 0.04% )  (35.73%)
    3,292,247,227   L1-dcache-load-misses  #   44.52% of all L1-dcache accesses  ( +- 0.01% )  (35.73%)

  contiguous-pages
    7,205,105,282   L1-dcache-loads        #  861.895 M/sec  ( +- 0.02% )  (35.75%)
    3,241,584,535   L1-dcache-load-misses  #   44.99% of all L1-dcache accesses  ( +- 0.00% )  (35.74%)

The L2 prefetcher, however, is now able to prefetch ~22% more
cachelines (the L2 prefetch miss rate also goes up significantly,
showing that we are backend limited):

  discontiguous-pages
    2,835,860,245   l2_pf_hit_l2.all       #  170.242 M/sec  ( +- 0.12% )  (15.65%)

  contiguous-pages
    3,472,055,269   l2_pf_hit_l2.all       #  411.319 M/sec  ( +- 0.62% )  (15.67%)

That still leaves a large gap between the ~22% improvement in
prefetching and the ~100% improvement in bandwidth, but the better
prefetching seems to streamline the traffic well enough that most of
the data starts coming from the L2, leading to substantially fewer
cache-misses at the LLC:

  discontiguous-pages
    8,493,499,137   cache-references       #  511.416 M/sec  ( +- 0.15% )  (50.01%)
      930,501,344   cache-misses           #   10.96% of all cache refs  ( +- 0.52% )  (50.01%)

  contiguous-pages
    9,421,926,416   cache-references       #    1.120 G/sec  ( +- 0.09% )  (50.02%)
       68,787,247   cache-misses           #    0.73% of all cache refs  ( +- 0.15% )  (50.03%)

In addition, there are a few minor frontend optimizations: clear_pages()
on x86 is now fully inlined, so we don't have a CALL/RET pair (which
isn't free when using the RETHUNK speculative execution mitigation, as
we do on my test system.) The loop in clear_contig_highpages() is also
easier to predict (especially when handling faults) as compared to
that in process_huge_page().

  discontiguous-pages
      980,014,411   branches               #   59.005 M/sec  (31.26%)
      180,897,177   branch-misses          #   18.46% of all branches  (31.26%)

  contiguous-pages
      515,630,550   branches               #   62.654 M/sec  (31.27%)
       78,039,496   branch-misses          #   15.13% of all branches  (31.28%)

Note that although clearing contiguously is easier for the processor
to optimize, it does not, sadly, mean that the processor will
necessarily take advantage of it. For instance, this change does not
result in any improvement in my tests on Intel Icelakex (Oracle X9),
or on ARM64 Neoverse-N1 (Ampere Altra).

Signed-off-by: Ankur Arora
Reviewed-by: Raghavendra K T
Tested-by: Raghavendra K T

---

Interestingly enough, with this change we are pretty much back to
commit 79ac6ba40eb8 ("[PATCH] hugepage: Small fixes to hugepage
clear/copy path") from circa 2006!

Raghu, I've retained your R-by and Tested-by on this patch (and the
next) since both of these commits just break up the original patch.
Please let me know if that's not okay.
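
The event counts above are in perf stat format; an invocation along
the following lines should gather comparable numbers on a similar
setup. The exact command isn't reproduced in this message, so the
event list, repeat count, and bench parameters below are assumptions
rather than the precise invocation used:

  $ perf stat -r 5 \
        -e L1-dcache-loads,L1-dcache-load-misses,l2_pf_hit_l2.all \
        -e cache-references,cache-misses,branches,branch-misses \
        -- perf bench mem mmap -p 2MB -f demand -s 64GB -l 5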

 mm/memory.c | 28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..c06e43a8861a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7237,40 +7237,30 @@ static inline int process_huge_page(
 	return 0;
 }
 
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
-				unsigned int nr_pages)
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+				   unsigned int nr_pages)
 {
-	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
-	int i;
+	unsigned int i;
 
 	might_sleep();
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
-		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+
+		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
 	}
 }
 
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
-	struct folio *folio = arg;
-
-	clear_user_highpage(folio_page(folio, idx), addr);
-	return 0;
-}
-
 /**
  * folio_zero_user - Zero a folio which will be mapped to userspace.
  * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
  */
 void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
-	unsigned int nr_pages = folio_nr_pages(folio);
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
 
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
-	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+	clear_contig_highpages(folio_page(folio, 0),
+			       base_addr, folio_nr_pages(folio));
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.31.1