From: Lance Yang

Currently, special non-swap entries (like PTE markers) are not caught
early in hpage_collapse_scan_pmd(), leading to failures deep in the
swap-in logic.

A function that is called __collapse_huge_page_swapin() and documented
to "Bring missing pages in from swap" will handle other types as well.

As analyzed by David[1], we could have ended up with the following
entry types right before do_swap_page():

  (1) Migration entries. We would have waited.
      -> Maybe worth it to wait, maybe not. We suspect we don't stumble
         into that frequently such that we don't care. We could always
         unlock this separately later.

  (2) Device-exclusive entries. We would have converted to non-exclusive.
      -> See make_device_exclusive(), we cannot tolerate PMD entries and
         have to split them through FOLL_SPLIT_PMD. As popped up during
         a recent discussion, collapsing here is actually
         counter-productive, because the next conversion will PTE-map
         it again.
      -> Ok to not collapse.

  (3) Device-private entries. We would have migrated to RAM.
      -> Device-private still does not support THPs, so collapsing right
         now just means that the next device access would split the
         folio again.
      -> Ok to not collapse.

  (4) HWPoison entries
      -> Cannot collapse

  (5) Markers
      -> Cannot collapse

First, this patch adds an early check for these non-swap entries. If
any one is found, the scan is aborted immediately with the
SCAN_PTE_NON_PRESENT result, as Lorenzo suggested[2], avoiding wasted
work. While at it, convert pte_swp_uffd_wp_any() to pte_swp_uffd_wp()
since we are in the swap pte branch.

Second, as Wei pointed out[3], we may have a chance to get a non-swap
entry, since we will drop and re-acquire the mmap lock before
__collapse_huge_page_swapin(). To handle this, we also add a
non_swap_entry() check there.

Note that we can unlock later what we really need, and not account it
towards max_swap_ptes.
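For illustration only, the classification order described above can be
modelled in userspace roughly as follows. This is a simplified sketch,
not the kernel code: enum entry_kind, model_scan() and its parameters
are hypothetical stand-ins, and the userfaultfd and cc->is_khugepaged
conditions are left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the PTE states discussed above. */
enum entry_kind {
	ENTRY_PRESENT,
	ENTRY_NONE,		/* pte_none() or zero page */
	ENTRY_SWAP,		/* genuine swap entry */
	ENTRY_MIGRATION,
	ENTRY_DEVICE_EXCLUSIVE,
	ENTRY_DEVICE_PRIVATE,
	ENTRY_HWPOISON,
	ENTRY_MARKER,
};

/* Result codes of the model; only a subset of what a real scan reports. */
enum scan_result {
	SCAN_SUCCEED,
	SCAN_EXCEED_NONE_PTE,
	SCAN_EXCEED_SWAP_PTE,
	SCAN_PTE_NON_PRESENT,
};

/*
 * Walk the entries in order. None/zero entries and swap entries are
 * counted against their limits; any other non-present entry aborts the
 * scan at once, before any swap-in work would be attempted.
 */
static enum scan_result model_scan(const enum entry_kind *entries, size_t n,
				   size_t max_none, size_t max_swap)
{
	size_t none = 0, unmapped = 0;

	assert(entries != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		switch (entries[i]) {
		case ENTRY_PRESENT:
			break;
		case ENTRY_NONE:
			if (++none > max_none)
				return SCAN_EXCEED_NONE_PTE;
			break;
		case ENTRY_SWAP:
			if (++unmapped > max_swap)
				return SCAN_EXCEED_SWAP_PTE;
			break;
		default:
			/* Migration, device, hwpoison and marker entries. */
			return SCAN_PTE_NON_PRESENT;
		}
	}
	return SCAN_SUCCEED;
}
```

In this model the non-swap entries are never counted as unmapped, which
mirrors the note about not accounting them towards max_swap_ptes.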

[1] https://lore.kernel.org/linux-mm/09eaca7b-9988-41c7-8d6e-4802055b3f1e@redhat.com
[2] https://lore.kernel.org/linux-mm/7df49fe7-c6b7-426a-8680-dcd55219c8bd@lucifer.local
[3] https://lore.kernel.org/linux-mm/20251005010511.ysek2nqojebqngf3@master

Acked-by: David Hildenbrand
Reviewed-by: Wei Yang
Reviewed-by: Dev Jain
Suggested-by: David Hildenbrand
Suggested-by: Lorenzo Stoakes
Signed-off-by: Lance Yang
---
v2 -> v3:
 - Collect Acked-by from David - thanks!
 - Collect Reviewed-by from Wei and Dev - thanks!
 - Add a non_swap_entry() check in __collapse_huge_page_swapin() (per Wei
   and David) - thanks!
 - Rework the changelog to incorporate David's detailed analysis of
   non-swap entry types - thanks!!!
 - https://lore.kernel.org/linux-mm/20251001032251.85888-1-lance.yang@linux.dev/

v1 -> v2:
 - Skip all non-present entries except swap entries (per David) thanks!
 - https://lore.kernel.org/linux-mm/20250924100207.28332-1-lance.yang@linux.dev/

 mm/khugepaged.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index abe54f0043c7..bec3e268dc76 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1020,6 +1020,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;
 
+		if (non_swap_entry(pte_to_swp_entry(vmf.orig_pte))) {
+			result = SCAN_PTE_NON_PRESENT;
+			goto out;
+		}
+
 		vmf.pte = pte;
 		vmf.ptl = ptl;
 		ret = do_swap_page(&vmf);
@@ -1281,7 +1286,23 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, addr += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
-		if (is_swap_pte(pteval)) {
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++none_or_zero;
+			if (!userfaultfd_armed(vma) &&
+			    (!cc->is_khugepaged ||
+			     none_or_zero <= khugepaged_max_ptes_none)) {
+				continue;
+			} else {
+				result = SCAN_EXCEED_NONE_PTE;
+				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				goto out_unmap;
+			}
+		} else if (!pte_present(pteval)) {
+			if (non_swap_entry(pte_to_swp_entry(pteval))) {
+				result = SCAN_PTE_NON_PRESENT;
+				goto out_unmap;
+			}
+
 			++unmapped;
 			if (!cc->is_khugepaged ||
 			    unmapped <= khugepaged_max_ptes_swap) {
@@ -1290,7 +1311,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				 * enabled swap entries. Please see
 				 * comment below for pte_uffd_wp().
 				 */
-				if (pte_swp_uffd_wp_any(pteval)) {
+				if (pte_swp_uffd_wp(pteval)) {
 					result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
@@ -1301,18 +1322,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				goto out_unmap;
 			}
 		}
-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
-				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-				goto out_unmap;
-			}
-		}
 		if (pte_uffd_wp(pteval)) {
 			/*
 			 * Don't collapse the page if any of the small
-- 
2.49.0