file->f_ra.mmap_miss is used to stop mmap readahead after repeated
misses.  filemap_fault() increases it when synchronous mmap readahead is
needed, while filemap_map_pages() reduces it when fault-around finds
folios already present in the page cache.

The hit side of that accounting is too generous in two cases.

First, fault-around can install PTEs for multiple pages around the
faulting address.  The fault only proves that the faulting address was
accessed, not that the nearby PTEs will be used by the workload.
Crediting all of those nearby PTEs as mmap hits can make sparse random
access look like successful mmap readahead and keep mmap readahead
enabled for longer than intended.

Second, a fault that misses in the page cache can start synchronous mmap
readahead, drop the mmap_lock, and return VM_FAULT_RETRY.  The retry may
then find the folio that this same fault pulled into the page cache.  If
filemap_map_pages() credits that retry as a hit, the same miss can
immediately cancel its own mmap_miss increase.

Only credit one mmap hit when filemap_map_pages() actually maps the
faulting address.  Also skip the credit on FAULT_FLAG_TRIED retries.
Keep the existing workingset behavior: recently refaulted folios still
do not reduce mmap_miss.
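
As an illustration only, the old and new crediting rules can be put
side by side in a userspace toy model.  Everything prefixed toy_ is
hypothetical and not kernel code; the clamp-at-zero subtraction is a
simplified assumption about how filemap_map_pages() applies the
collected hits to file->f_ra.mmap_miss.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the mmap_miss field of file->f_ra. */
struct toy_ra {
	unsigned int mmap_miss;
};

/* Miss path: the fault needed synchronous mmap readahead. */
static void toy_miss(struct toy_ra *ra)
{
	ra->mmap_miss++;
}

/* Old rule (modeled): every PTE installed by fault-around is one hit. */
static unsigned int toy_hits_old(unsigned int ptes_mapped)
{
	return ptes_mapped;
}

/*
 * New rule (modeled): at most one hit, and only when the faulting
 * address itself was mapped, the fault is not a FAULT_FLAG_TRIED
 * retry, and the folio is not a workingset (recently refaulted) folio.
 */
static unsigned int toy_hits_new(bool fault_mapped, bool tried,
				 bool workingset)
{
	return (fault_mapped && !tried && !workingset) ? 1 : 0;
}

/* Assumption: hits are subtracted from mmap_miss, clamped at zero. */
static void toy_credit(struct toy_ra *ra, unsigned int hits)
{
	ra->mmap_miss = hits >= ra->mmap_miss ? 0 : ra->mmap_miss - hits;
}
```

In this model, one miss followed by a retry that maps 16 PTEs leaves
mmap_miss at 0 under the old rule, while under the new rule the retry
credits nothing and mmap_miss stays at 1.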

The numbers below come from a local KVM/data-disk microbenchmark using
mmap_miss_probe.  The guest had 8 GiB of memory and 2 vCPUs.  The first
table below uses a 20 GiB file, 8192 KiB read_ahead_kb, a cold page
cache before each run, and touches 1% of the file's base pages; each
entry is the median of 3 runs.  The only memory pressure is file cache
capacity pressure from the file being larger than guest memory; no
separate memory hog was used.

mmap_miss_probe is a small userspace benchmark used only for these
measurements.  It mmap()s a prepared file with MADV_NORMAL and then
touches one byte at selected base-page offsets; the access order is
random, sequential, or a fixed page stride.  The harness drops caches
before each run and samples /proc/vmstat around that access loop.
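
The source of mmap_miss_probe is not part of this message.  The
following is only a rough sketch of what the fixed-stride access loop
might look like; probe_touch() and its parameters are hypothetical
names, and the random order, cache dropping, and /proc/vmstat sampling
are left out.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical sketch: mmap() a prepared file with MADV_NORMAL and read
 * one byte at every stride_pages-th base page, wrapping at end of file.
 * stride_pages == 1 gives the sequential case.
 * Returns the number of pages touched, or -1 on error.
 */
static long probe_touch(const char *path, size_t stride_pages,
			size_t naccesses)
{
	long page = sysconf(_SC_PAGESIZE);
	volatile unsigned char sink = 0;
	unsigned char *map;
	struct stat st;
	size_t npages, len;
	long touched = 0;
	int fd;

	assert(stride_pages > 0);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size < page) {
		close(fd);
		return -1;
	}
	npages = (size_t)st.st_size / (size_t)page;
	len = npages * (size_t)page;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	if (map == MAP_FAILED)
		return -1;
	madvise(map, len, MADV_NORMAL);

	for (size_t i = 0; i < naccesses; i++) {
		size_t index = (i * stride_pages) % npages;

		sink = map[index * (size_t)page];	/* one byte per page */
		touched++;
	}
	(void)sink;
	munmap(map, len);
	return touched;
}
```

For example, probe_touch("/mnt/mmap-matrix/file", 1021, 52428) would
correspond to the stride1021 case, assuming that file name and that
access count.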

Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each before/after entry is "pgpgin GiB / elapsed seconds".  "pgpgin GiB"
is the delta of the guest /proc/vmstat pgpgin counter, converted from
KiB to GiB; I use it as an approximate block input counter, not as
resident memory or exact application IO.  "Elapsed seconds" is the
wall-clock runtime of the whole mmap_miss_probe access pass, not
per-access latency.
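
For reference, the "pgpgin GiB" value can be derived roughly as
follows.  This is a sketch, not the harness actually used: both helper
names are hypothetical, and the text passed in is assumed to be the
contents of /proc/vmstat read before and after the access loop.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: find the "pgpgin <value>" line in a copy of
 * /proc/vmstat and return the value (in KiB), or -1 if it is missing.
 */
static long long vmstat_pgpgin_kib(const char *text)
{
	const char *line = text;

	while (line && *line) {
		long long value;

		if (sscanf(line, "pgpgin %lld", &value) == 1)
			return value;
		line = strchr(line, '\n');
		if (line)
			line++;		/* step past the newline */
	}
	return -1;
}

/* Convert the counter delta across a run from KiB to GiB. */
static double pgpgin_delta_gib(long long before_kib, long long after_kib)
{
	return (double)(after_kib - before_kib) / (1024.0 * 1024.0);
}
```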

        20 GiB file (larger than guest memory):

        workload       before                  after
        random         223.377 GiB/101.293s      1.010 GiB/4.790s
        stride1021     204.214 GiB/97.557s     204.208 GiB/108.086s
        stride2053     409.584 GiB/193.700s      0.970 GiB/3.685s
        stride4099     406.452 GiB/134.241s      0.975 GiB/3.499s
        sequential       0.212 GiB/0.050s        0.212 GiB/0.057s

The same 8 GiB guest with a 4 GiB file, so the file fits in memory,
showed the same direction for sparse random access without file-cache
reclaim pressure:

        workload       before             after
        random         3.987 GiB/1.960s   0.980 GiB/1.221s
        stride1021     4.002 GiB/1.838s   4.002 GiB/1.851s
        stride2053     3.991 GiB/1.835s   0.811 GiB/0.985s
        stride4099     4.001 GiB/1.836s   0.819 GiB/1.037s
        sequential     0.056 GiB/0.013s   0.056 GiB/0.018s

This RFC does not claim to solve every sparse pattern.  The stride1021
rows above are included deliberately to show this: the 20 GiB run still
reads about 204 GiB (pgpgin) after the change.

In the tables, strideN means that the benchmark advances by N base pages
between mmap loads.  With 4 KiB base pages, stride1021 is
1021 * 4 KiB = 4084 KiB.  With 8192 KiB read_ahead_kb,
file->f_ra.ra_pages is 2048 base pages, and synchronous mmap read-around
uses a 2048-page window centered on the fault, i.e. roughly
[index - 1024, index + 1023].  A stride1021 access therefore lands
inside the read-around window of the previous miss, while the access
after it (2042 pages past that miss) falls outside it.  So about every
other access is a real faulting-address page-cache hit, and the other
half each read about 8 MiB.  Because those hits are at the faulting
address, this change still credits them, so they keep offsetting the
misses.  For about 52k accesses in the 20 GiB/1% run, half of them times
8 MiB is about 205 GiB, which matches the observed 204 GiB.

This first version keeps the scope intentionally limited to mmap_miss
hit accounting.
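
The stride arithmetic above can be checked with a small model.  It is a
sketch under stated assumptions, not a description of the kernel:
readahead is never disabled, only the window of the most recent miss is
considered cached, every miss reads a full ra_pages window, and the scan
only moves forward.  The function names and the 52428 access count (1%
of the 5242880 base pages in a 20 GiB file) are assumptions made for
this example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Count modeled page-cache misses for a forward strided scan.  A miss
 * at page "index" is assumed to populate the read-around window
 * [index - ra_pages/2, index + ra_pages/2 - 1]; only its upper end
 * matters because the scan moves forward.
 */
static uint64_t model_misses(uint64_t stride, uint64_t ra_pages,
			     uint64_t naccesses)
{
	uint64_t misses = 0, window_end = 0;
	int have_window = 0;

	assert(ra_pages >= 2);
	for (uint64_t i = 0; i < naccesses; i++) {
		uint64_t index = i * stride;

		if (have_window && index <= window_end)
			continue;	/* inside the last window: a hit */
		misses++;
		window_end = index + ra_pages / 2 - 1;
		have_window = 1;
	}
	return misses;
}

/* Modeled block input in GiB: one full window read per miss. */
static double model_pgpgin_gib(uint64_t misses, uint64_t ra_pages,
			       uint64_t page_kib)
{
	return (double)(misses * ra_pages * page_kib) / (1024.0 * 1024.0);
}
```

With ra_pages = 2048 and 52428 accesses, this model gives 26214 misses
(about 204.8 GiB) for stride 1021 and 52428 misses (about 409.6 GiB)
for stride 2053, close to the "before" column of the 20 GiB table.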

Signed-off-by: fujunjie <fujunjie1@qq.com>
---
 mm/filemap.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c1..463cd19c49f09 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3757,6 +3757,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	unsigned int count = 0;
 	pte_t *old_ptep = vmf->pte;
 	unsigned long addr0;
+	bool fault_mapped = false;
 
 	/*
 	 * Map the large folio fully where possible:
@@ -3780,16 +3781,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		if (PageHWPoison(page + count))
 			goto skip;
 
-		/*
-		 * If there are too many folios that are recently evicted
-		 * in a file, they will probably continue to be evicted.
-		 * In such situation, read-ahead is only a waste of IO.
-		 * Don't decrease mmap_miss in this scenario to make sure
-		 * we can stop read-ahead.
-		 */
-		if (!folio_test_workingset(folio))
-			(*mmap_miss)++;
-
 		/*
 		 * NOTE: If there're PTE markers, we'll leave them to be
 		 * handled in the specific fault path, and it'll prohibit the
@@ -3806,8 +3797,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			*rss += count;
 			folio_ref_add(folio, count - ref_from_caller);
 			ref_from_caller = 0;
-			if (in_range(vmf->address, addr, count * PAGE_SIZE))
+			if (in_range(vmf->address, addr, count * PAGE_SIZE)) {
 				ret = VM_FAULT_NOPAGE;
+				fault_mapped = true;
+			}
 		}
 
 		count++;
@@ -3822,8 +3815,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		*rss += count;
 		folio_ref_add(folio, count - ref_from_caller);
 		ref_from_caller = 0;
-		if (in_range(vmf->address, addr, count * PAGE_SIZE))
+		if (in_range(vmf->address, addr, count * PAGE_SIZE)) {
 			ret = VM_FAULT_NOPAGE;
+			fault_mapped = true;
+		}
 	}
 
 	vmf->pte = old_ptep;
@@ -3831,6 +3826,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		/* Locked folios cannot get truncated. */
 		folio_ref_dec(folio);
 
+	if (fault_mapped && !(vmf->flags & FAULT_FLAG_TRIED) &&
+	    !folio_test_workingset(folio))
+		(*mmap_miss)++;
+
 	return ret;
 }
 
@@ -3844,10 +3843,6 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	if (PageHWPoison(page))
 		goto out;
 
-	/* See comment of filemap_map_folio_range() */
-	if (!folio_test_workingset(folio))
-		(*mmap_miss)++;
-
 	/*
 	 * NOTE: If there're PTE markers, we'll leave them to be
 	 * handled in the specific fault path, and it'll prohibit
@@ -3856,8 +3851,12 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	if (!pte_none(ptep_get(vmf->pte)))
 		goto out;
 
-	if (vmf->address == addr)
+	if (vmf->address == addr) {
 		ret = VM_FAULT_NOPAGE;
+		if (!(vmf->flags & FAULT_FLAG_TRIED) &&
+		    !folio_test_workingset(folio))
+			(*mmap_miss)++;
+	}
 
 	set_pte_range(vmf, folio, page, 1, addr);
 	(*rss)++;
-- 
2.34.1