When an mTHP folio is allocated in do_anonymous_page() and the target
pte range turns out not to be fully empty, the current code releases
the folio and returns. This creates the illusion that the page fault
has been handled, even though vmf->address itself is still pte_none(),
so another page fault is triggered right away.

The race scenario is as below. Take 64KB mTHP as an example: two
threads of the same process, base page 4KB, range R = [X, X + 64KB),
X < Y < X + 64KB.

CPU 0 (writer, faults at X)            CPU 1 (reader, faults at Y)
--------------------------------       -----------------------------
do_anonymous_page()                    do_anonymous_page()
  alloc_anon_folio()
    pte_range_none(R) --> true
    vma_alloc_folio() --> 64KB
                                         pte_offset_map_lock(Y)
                                         install zero_pfn PTE at Y
                                         pte_unmap_unlock()
  pte_offset_map_lock(X)
  pte_range_none(R) -> false, Y is populated
  /* but pte at X is still none */
  goto release
  return 0

To avoid this, check whether vmf->address itself has been mapped; if
it has not, retry alloc_anon_folio() and the subsequent operations. On
retry, alloc_anon_folio() re-checks pte_range_none() and falls back to
a smaller order, so the retry cannot loop forever.

Signed-off-by: Wandun Chen
---
Reproducer (not included in the patch, available on request): two
threads hammer the same 64K mTHP range, writer at offset 0, reader at
offset 32K, per-round barrier, 1024 rounds.

Minor faults before: writer=1951 reader=973 (927 extra writer faults)
Minor faults after:  writer=1024 reader=1022

I'm not sure how often this situation occurs in real workloads.
---
 mm/memory.c | 19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0c9d9c2cbf0e..104f5be1de36 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5339,10 +5339,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	unsigned long addr = vmf->address;
+	unsigned long fault_offset;
 	struct folio *folio;
 	vm_fault_t ret = 0;
 	int nr_pages;
 	pte_t entry;
+	bool should_retry = false;
 
 	/* File mapping without ->vm_ops ? */
 	if (vma->vm_flags & VM_SHARED)
@@ -5389,6 +5391,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	ret = vmf_anon_prepare(vmf);
 	if (ret)
 		return ret;
+retry:
 	/* Returns NULL on OOM or ERR_PTR(-EAGAIN) if we must retry the fault */
 	folio = alloc_anon_folio(vmf);
 	if (IS_ERR(folio))
@@ -5413,14 +5416,26 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 		update_mmu_tlb(vma, addr, vmf->pte);
 		goto release;
 	} else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
-		update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
-		goto release;
+		fault_offset = (vmf->address - addr) >> PAGE_SHIFT;
+		if (!pte_none(ptep_get(vmf->pte + fault_offset))) {
+			update_mmu_tlb_range(vma, addr, vmf->pte, nr_pages);
+			goto release;
+		}
+
+		should_retry = true;
 	}
 
 	ret = check_stable_address_space(vma->vm_mm);
 	if (ret)
 		goto release;
 
+	if (should_retry) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		folio_put(folio);
+		should_retry = false;
+		goto retry;
+	}
+
 	/* Deliver the page fault to userland, check inside PT lock */
 	if (userfaultfd_missing(vma)) {
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-- 
2.43.0
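
P.S. For reference, below is a minimal sketch along the lines of the
reproducer described above. It is a reconstruction for illustration,
not the exact program behind the numbers quoted: the per-round
MADV_DONTNEED refresh, the alignment fixup and the RUSAGE_THREAD fault
accounting are my assumptions about how such a test would be built,
and it assumes a kernel with 64kB mTHP enabled (e.g. via
/sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled).
Build with "gcc -O2 -pthread repro.c" and compare the writer/reader
minor fault counts before and after the patch.

/* repro.c - sketch only, see caveats above. */
#define _GNU_SOURCE
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/resource.h>

#define RANGE_SZ	(64UL * 1024)	/* one 64KB mTHP range R */
#define ROUNDS		1024

static pthread_barrier_t start_b, stop_b;
static char *buf;

/* Per-thread minor fault count (Linux-specific RUSAGE_THREAD). */
static long minflt(void)
{
	struct rusage ru;

	getrusage(RUSAGE_THREAD, &ru);
	return ru.ru_minflt;
}

static void *writer(void *arg)
{
	long before = minflt();

	(void)arg;
	for (int i = 0; i < ROUNDS; i++) {
		/* Re-empty the whole range so both faults race again. */
		madvise(buf, RANGE_SZ, MADV_DONTNEED);
		pthread_barrier_wait(&start_b);
		buf[0] = 1;			/* write fault at X */
		pthread_barrier_wait(&stop_b);
	}
	printf("writer minor faults: %ld\n", minflt() - before);
	return NULL;
}

static void *reader(void *arg)
{
	long before = minflt();
	volatile char c;

	(void)arg;
	for (int i = 0; i < ROUNDS; i++) {
		pthread_barrier_wait(&start_b);
		c = buf[RANGE_SZ / 2];		/* read fault at Y = X + 32K */
		pthread_barrier_wait(&stop_b);
	}
	(void)c;
	printf("reader minor faults: %ld\n", minflt() - before);
	return NULL;
}

int main(void)
{
	pthread_t w, r;
	char *map;

	/* Over-map and align: mTHP needs the range naturally aligned. */
	map = mmap(NULL, 2 * RANGE_SZ, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return 1;
	buf = (char *)(((uintptr_t)map + RANGE_SZ - 1) & ~(RANGE_SZ - 1));

	pthread_barrier_init(&start_b, NULL, 2);
	pthread_barrier_init(&stop_b, NULL, 2);
	pthread_create(&w, NULL, writer, NULL);
	pthread_create(&r, NULL, reader, NULL);
	pthread_join(w, NULL);
	pthread_join(r, NULL);
	return 0;
}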