file->f_ra.mmap_miss is used to stop mmap readahead after repeated
misses. filemap_fault() increases it when synchronous mmap readahead is
needed, while filemap_map_pages() reduces it when fault-around finds
folios already present in the page cache.

The hit side of that accounting is too generous in two cases.

First, fault-around can install PTEs for multiple pages around the
faulting address. The fault only proves that the faulting address was
accessed, not that the nearby PTEs will be used by the workload.
Crediting all of those nearby PTEs as mmap hits can make sparse random
access look like successful mmap readahead and keep mmap readahead
enabled for longer than intended.

Second, a fault that misses in the page cache can start synchronous
mmap readahead, drop the mmap_lock, and return VM_FAULT_RETRY. The retry
may then find the folio that this same fault pulled into the page cache.
If filemap_map_pages() credits that retry as a hit, the same miss can
immediately cancel its own mmap_miss increase.

Only credit one mmap hit when filemap_map_pages() actually maps the
faulting address. Also skip the credit on FAULT_FLAG_TRIED retries.
Keep the existing workingset behavior: recently refaulted folios still
do not reduce mmap_miss.

Current evidence comes from a local KVM/data-disk microbenchmark using
mmap_miss_probe. The first table below shows the median of 3 runs in an
8 GiB guest with 2 vCPUs, a 20 GiB file, 8192 KiB read_ahead_kb, a cold
page cache before each run, and 1% of the file accessed. The memory
pressure in that setup is file cache capacity pressure caused by the
file being larger than guest memory; no separate memory hog was used.

mmap_miss_probe is a small userspace benchmark used only for these
measurements. It mmap()s a prepared file with MADV_NORMAL and then
touches one byte at selected base-page offsets; the access order is
random, sequential, or a fixed page stride. The harness drops caches
before each run and samples /proc/vmstat around that access loop.
Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each before/after entry is "pgpgin GiB / elapsed seconds". "pgpgin GiB"
is the delta of the guest /proc/vmstat pgpgin counter, converted from
KiB to GiB; I use it as an approximate block input counter, not as
resident memory or exact application IO. "Elapsed seconds" is the
wall-clock runtime of the whole mmap_miss_probe access pass, not
per-access latency.

  workload     before                  after
  random       223.377 GiB/101.293s    1.010 GiB/4.790s
  stride1021   204.214 GiB/97.557s     204.208 GiB/108.086s
  stride2053   409.584 GiB/193.700s    0.970 GiB/3.685s
  stride4099   406.452 GiB/134.241s    0.975 GiB/3.499s
  sequential   0.212 GiB/0.050s        0.212 GiB/0.057s

The same 8 GiB guest with a 4 GiB file, so the file fits in memory,
showed the same direction for sparse random access without file-cache
reclaim pressure:

  workload     before                  after
  random       3.987 GiB/1.960s        0.980 GiB/1.221s
  stride1021   4.002 GiB/1.838s        4.002 GiB/1.851s
  stride2053   3.991 GiB/1.835s        0.811 GiB/0.985s
  stride4099   4.001 GiB/1.836s        0.819 GiB/1.037s
  sequential   0.056 GiB/0.013s        0.056 GiB/0.018s

This RFC does not claim to solve every sparse pattern. In particular,
the stride1021 rows above are intentionally included: the 20 GiB run is
still about 204 GiB of pgpgin.

In the tables, strideN means that the benchmark advances by N base pages
between mmap loads. Thus stride1021 is 1021 * 4 KiB = 4084 KiB. With
8192 KiB read_ahead_kb, file->f_ra.ra_pages is 2048 base pages, and
synchronous mmap read-around uses a 2048-page window centered around the
fault, i.e. roughly [index - 1024, index + 1023]. A stride1021 access
therefore lands inside the previous read-around window. About every
other access can be a real faulting-address page-cache hit, and the
other half can each read about 8 MiB. For about 52k accesses in the
20 GiB/1% run, half of them times 8 MiB is about 205 GiB, which matches
the observed 204 GiB.
This first version keeps the scope intentionally limited to mmap_miss
hit accounting.

Signed-off-by: fujunjie
---
 mm/filemap.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 4e636647100c1..463cd19c49f09 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3757,6 +3757,7 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 	unsigned int count = 0;
 	pte_t *old_ptep = vmf->pte;
 	unsigned long addr0;
+	bool fault_mapped = false;
 
 	/*
 	 * Map the large folio fully where possible:
@@ -3780,16 +3781,6 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		if (PageHWPoison(page + count))
 			goto skip;
 
-		/*
-		 * If there are too many folios that are recently evicted
-		 * in a file, they will probably continue to be evicted.
-		 * In such situation, read-ahead is only a waste of IO.
-		 * Don't decrease mmap_miss in this scenario to make sure
-		 * we can stop read-ahead.
-		 */
-		if (!folio_test_workingset(folio))
-			(*mmap_miss)++;
-
 		/*
 		 * NOTE: If there're PTE markers, we'll leave them to be
 		 * handled in the specific fault path, and it'll prohibit the
@@ -3806,8 +3797,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 			*rss += count;
 			folio_ref_add(folio, count - ref_from_caller);
 			ref_from_caller = 0;
-			if (in_range(vmf->address, addr, count * PAGE_SIZE))
+			if (in_range(vmf->address, addr, count * PAGE_SIZE)) {
 				ret = VM_FAULT_NOPAGE;
+				fault_mapped = true;
+			}
 		}
 
 		count++;
@@ -3822,8 +3815,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		*rss += count;
 		folio_ref_add(folio, count - ref_from_caller);
 		ref_from_caller = 0;
-		if (in_range(vmf->address, addr, count * PAGE_SIZE))
+		if (in_range(vmf->address, addr, count * PAGE_SIZE)) {
 			ret = VM_FAULT_NOPAGE;
+			fault_mapped = true;
+		}
 	}
 
 	vmf->pte = old_ptep;
@@ -3831,6 +3826,10 @@ static vm_fault_t filemap_map_folio_range(struct vm_fault *vmf,
 		/* Locked folios cannot get truncated. */
 		folio_ref_dec(folio);
 
+	if (fault_mapped && !(vmf->flags & FAULT_FLAG_TRIED) &&
+	    !folio_test_workingset(folio))
+		(*mmap_miss)++;
+
 	return ret;
 }
 
@@ -3844,10 +3843,6 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	if (PageHWPoison(page))
 		goto out;
 
-	/* See comment of filemap_map_folio_range() */
-	if (!folio_test_workingset(folio))
-		(*mmap_miss)++;
-
 	/*
 	 * NOTE: If there're PTE markers, we'll leave them to be
 	 * handled in the specific fault path, and it'll prohibit
@@ -3856,8 +3851,12 @@ static vm_fault_t filemap_map_order0_folio(struct vm_fault *vmf,
 	if (!pte_none(ptep_get(vmf->pte)))
 		goto out;
 
-	if (vmf->address == addr)
+	if (vmf->address == addr) {
 		ret = VM_FAULT_NOPAGE;
+		if (!(vmf->flags & FAULT_FLAG_TRIED) &&
+		    !folio_test_workingset(folio))
+			(*mmap_miss)++;
+	}
 
 	set_pte_range(vmf, folio, page, 1, addr);
 	(*rss)++;
-- 
2.34.1