process_huge_page(), used to clear hugepages, is optimized for cache
locality. In particular it processes a hugepage in 4KB page units and
in a difficult to predict order: clearing pages in the periphery in a
backwards or forwards direction, then converging inwards to the
faulting page (or the page specified via base_addr.)

This helps maximize temporal locality at the time of access. However,
while it keeps stores inside a 4KB page sequential, the pages
themselves are ordered semi-randomly in a way that is not easy for the
processor to predict. This limits the clearing bandwidth to what's
achievable within a single 4KB page.

Consider the baseline bandwidth:

  $ perf bench mem mmap -p 2MB -f populate -s 64GB -l 3
  # Running 'mem/mmap' benchmark:
  # function 'populate' (Eagerly populated mmap())
  # Copying 64GB bytes ...

       11.791097 GB/sec

(Unless otherwise noted, all numbers are on AMD Genoa (EPYC 9J13);
region-size=64GB, local node; 2.56 GHz, boost=0.)

11.79 GBps amounts to around 323ns/4KB. With a memory access latency
of ~100ns, that doesn't leave much room for help from, say, hardware
prefetchers.

(Note that since this is a purely write workload, it's reasonable to
assume that the processor does not need to prefetch any cachelines.
However, for a processor to skip the prefetch, it would need to look
at the access pattern and see that full cachelines were being written.
This might be easily visible if clear_page() were using, say, x86
string instructions; less so if it were using a store loop. In any
case, the existence of these kinds of predictors, or appropriately
helpful threshold values, is implementation specific. Additionally,
even when the processor can skip the prefetch, coherence protocols
will still need to establish exclusive ownership, necessitating
communication with remote caches.)

With that, the change is quite straightforward. Instead of clearing
pages discontiguously, clear contiguously: switch to a loop around
clear_user_highpage().
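
For reference, the new clearing path reduces to a simple linear walk
over the folio (a condensed sketch of clear_contig_highpages() as
added in the diff below; the comment is only annotation):

	static void clear_contig_highpages(struct page *page, unsigned long addr,
					   unsigned int nr_pages)
	{
		unsigned int i;

		might_sleep();
		/* Clear the 4KB pages front-to-back, in address order. */
		for (i = 0; i < nr_pages; i++) {
			cond_resched();
			clear_user_highpage(page + i, addr + i * PAGE_SIZE);
		}
	}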

Performance
==

Testing a demand fault workload shows a decent improvement in
bandwidth with pg-sz=2MB. Performance of pg-sz=1GB does not change
because it has always used straight clearing.

  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

              discontiguous-pages     contiguous-pages
                  (baseline)
              (GBps +- %stdev)        (GBps +- %stdev)

  pg-sz=2MB    11.76 +- 1.10%          23.58 +- 1.95%      +100.51%
  pg-sz=1GB    24.85 +- 2.41%          25.40 +- 1.33%           -

Analysis (pg-sz=2MB)
==

At the L1 data cache level, nothing changes. The processor continues
to access the same number of cachelines, allocating and missing them
as it writes to them.

  discontiguous-pages
    7,394,341,051   L1-dcache-loads        #  445.172 M/sec  ( +- 0.04% )  (35.73%)
    3,292,247,227   L1-dcache-load-misses  #   44.52% of all L1-dcache accesses  ( +- 0.01% )  (35.73%)

  contiguous-pages
    7,205,105,282   L1-dcache-loads        #  861.895 M/sec  ( +- 0.02% )  (35.75%)
    3,241,584,535   L1-dcache-load-misses  #   44.99% of all L1-dcache accesses  ( +- 0.00% )  (35.74%)

The L2 prefetcher, however, is now able to prefetch ~22% more
cachelines (the L2 prefetch miss rate also goes up significantly,
showing that we are backend limited):

  discontiguous-pages
    2,835,860,245   l2_pf_hit_l2.all       #  170.242 M/sec  ( +- 0.12% )  (15.65%)

  contiguous-pages
    3,472,055,269   l2_pf_hit_l2.all       #  411.319 M/sec  ( +- 0.62% )  (15.67%)

That still leaves a large gap between the ~22% improvement in
prefetching and the ~100% improvement in bandwidth, but the better
prefetching seems to streamline the traffic well enough that most of
the data starts coming from the L2, leading to substantially fewer
cache-misses at the LLC:

  discontiguous-pages
    8,493,499,137   cache-references       #  511.416 M/sec  ( +- 0.15% )  (50.01%)
      930,501,344   cache-misses           #   10.96% of all cache refs  ( +- 0.52% )  (50.01%)

  contiguous-pages
    9,421,926,416   cache-references       #    1.120 G/sec  ( +- 0.09% )  (50.02%)
       68,787,247   cache-misses           #    0.73% of all cache refs  ( +- 0.15% )  (50.03%)

In addition, there are a few minor frontend optimizations: clear_pages()
on x86 is now fully inlined, so we don't have a CALL/RET pair (which
isn't free when using the RETHUNK speculative execution mitigation, as
we do on my test system.) The loop in clear_contig_highpages() is also
easier to predict (especially when handling faults) as compared to
that in process_huge_page().

  discontiguous-pages
      980,014,411   branches               #   59.005 M/sec  (31.26%)
      180,897,177   branch-misses          #   18.46% of all branches  (31.26%)

  contiguous-pages
      515,630,550   branches               #   62.654 M/sec  (31.27%)
       78,039,496   branch-misses          #   15.13% of all branches  (31.28%)

Note that although clearing contiguously is easier for the processor
to optimize, it does not, sadly, mean that the processor will
necessarily take advantage of it. For instance, this change does not
result in any improvement in my tests on Intel Icelakex (Oracle X9),
or on ARM64 Neoverse-N1 (Ampere Altra).

Signed-off-by: Ankur Arora
Reviewed-by: Raghavendra K T
Tested-by: Raghavendra K T

---

Interestingly enough, with this change we are pretty much back to
commit 79ac6ba40eb8 ("[PATCH] hugepage: Small fixes to hugepage
clear/copy path") from circa 2006!

Raghu, I've retained your R-by and Tested-by on this patch (and the
next) since both of these commits just break up the original patch.
Please let me know if that's not okay.
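
The event counts above are in perf stat format; an invocation along
the following lines should gather comparable numbers on a similar
setup. The exact command isn't reproduced in this message, so the
event list, repeat count, and bench parameters below are assumptions
rather than the precise invocation used:

  $ perf stat -r 5 \
        -e L1-dcache-loads,L1-dcache-load-misses,l2_pf_hit_l2.all \
        -e cache-references,cache-misses,branches,branch-misses \
        -- perf bench mem mmap -p 2MB -f demand -s 64GB -l 5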

 mm/memory.c | 28 +++++++++-------------------
 1 file changed, 9 insertions(+), 19 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2a55edc48a65..c06e43a8861a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7237,40 +7237,30 @@ static inline int process_huge_page(
 	return 0;
 }
 
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
-				unsigned int nr_pages)
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+				   unsigned int nr_pages)
 {
-	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
-	int i;
+	unsigned int i;
 
 	might_sleep();
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
-		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
+
+		clear_user_highpage(page + i, addr + i * PAGE_SIZE);
 	}
 }
 
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
-	struct folio *folio = arg;
-
-	clear_user_highpage(folio_page(folio, idx), addr);
-	return 0;
-}
-
 /**
  * folio_zero_user - Zero a folio which will be mapped to userspace.
  * @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
  */
 void folio_zero_user(struct folio *folio, unsigned long addr_hint)
 {
-	unsigned int nr_pages = folio_nr_pages(folio);
+	unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
 
-	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
-		clear_gigantic_page(folio, addr_hint, nr_pages);
-	else
-		process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+	clear_contig_highpages(folio_page(folio, 0),
+			       base_addr, folio_nr_pages(folio));
 }
 
 static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
-- 
2.31.1