folio_walk_start() asserts the mmap lock is held. For callers that only
need to read a single, already-present page, the mmap lock is a heavy and
often badly contended hammer. Such a caller can instead hold the per-VMA
lock, which keeps the VMA itself stable.

The per-VMA lock does not, however, keep the page tables walked below
that VMA from being freed. A concurrent munmap() or THP collapse of an
adjacent region in the same mm can free a shared upper-level table, and
THP collapse (collapse_huge_page() -> retract_page_tables()) frees page
tables of VMAs whose lock it does not hold.

Page table freeing synchronizes against lockless walkers the way gup_fast
relies on: tlb_remove_table_sync_one() sends an IPI and waits for every
CPU to enable interrupts, so a walker that keeps interrupts disabled
across the walk cannot be observing a table that is about to be freed.
rcu_read_lock() is not sufficient -- it does not block that IPI -- so the
caller must keep interrupts disabled, not merely hold an RCU read-side
critical section.

Add an FW_VMA_LOCKED flag. When passed, folio_walk_start() asserts the
per-VMA lock and that interrupts are disabled, instead of asserting the
mmap lock; it requires CONFIG_MMU_GATHER_RCU_TABLE_FREE and refuses
hugetlb VMAs (PMD sharing maps page tables this VMA's lock does not
cover). The caller must keep interrupts disabled until folio_walk_end().

No existing caller passes FW_VMA_LOCKED, so behaviour is unchanged.

Assisted-by: Claude:claude-opus-4-8
Signed-off-by: Rik van Riel
---
 include/linux/pagewalk.h |  7 +++++++
 mm/pagewalk.c            | 29 +++++++++++++++++++++++++++--
 2 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index b41d7265c01b..d0387470d732 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -150,6 +150,13 @@ typedef int __bitwise folio_walk_flags_t;
 
 /* Walk shared zeropages (small + huge) as well. */
 #define FW_ZEROPAGE			((__force folio_walk_flags_t)BIT(0))
+/*
+ * The caller holds the per-VMA lock instead of the mmap lock, with interrupts
+ * disabled across the walk (until folio_walk_end()) to serialize against page
+ * table freeing, the same way gup_fast does. Only valid with RCU-freed page
+ * tables (CONFIG_MMU_GATHER_RCU_TABLE_FREE) and not for hugetlb.
+ */
+#define FW_VMA_LOCKED			((__force folio_walk_flags_t)BIT(1))
 
 enum folio_walk_level {
 	FW_LEVEL_PTE,
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 3ae2586ff45b..ab1e81983cb8 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -890,7 +890,10 @@ int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
  * huge_ptep_set_*, ...). Note that the page table entry stored in @fw might
  * not correspond to the first physical entry of a logical hugetlb entry.
  *
- * The mmap lock must be held in read mode.
+ * The mmap lock must be held in read mode. Alternatively, if @FW_VMA_LOCKED is
+ * passed, the VMA's per-VMA lock must be held and interrupts must be disabled
+ * across the walk and until folio_walk_end() (only supported with RCU-freed page
+ * tables, i.e. CONFIG_MMU_GATHER_RCU_TABLE_FREE, and not for hugetlb).
  *
  * Return: folio pointer on success, otherwise NULL.
  */
@@ -908,7 +911,29 @@ struct folio *folio_walk_start(struct folio_walk *fw,
 	pgd_t *pgdp;
 	p4d_t *p4dp;
 
-	mmap_assert_locked(vma->vm_mm);
+	if (flags & FW_VMA_LOCKED) {
+		/*
+		 * Lockless walk under the per-VMA lock instead of the mmap
+		 * lock. The VMA lock keeps the VMA stable, but the page tables
+		 * walked below it can still be freed concurrently: a munmap() or
+		 * THP collapse of an adjacent region in the same mm can free a
+		 * shared upper-level table, and collapse_huge_page() ->
+		 * retract_page_tables() frees page tables of VMAs whose lock it
+		 * does not hold. Page table freeing serializes against lockless
+		 * walkers via tlb_remove_table_sync_one(), which IPIs and waits
+		 * for every CPU to enable interrupts; an RCU read-side critical
+		 * section does not block that IPI, so the caller must keep
+		 * interrupts disabled across the whole walk, like gup_fast.
+		 * Hugetlb (PMD sharing) maps page tables not covered by this
+		 * VMA's lock and is not supported.
+		 */
+		VM_WARN_ON_ONCE(!IS_ENABLED(CONFIG_MMU_GATHER_RCU_TABLE_FREE));
+		VM_WARN_ON_ONCE(is_vm_hugetlb_page(vma));
+		lockdep_assert_irqs_disabled();
+		vma_assert_locked(vma);
+	} else {
+		mmap_assert_locked(vma->vm_mm);
+	}
 	vma_pgtable_walk_begin(vma);
 
 	if (WARN_ON_ONCE(addr < vma->vm_start || addr >= vma->vm_end))
-- 
2.53.0-Meta