From: Lance Yang

When unsharing hugetlb PMD page tables, we currently send two IPIs: one
for TLB invalidation, and another to synchronize with concurrent
GUP-fast walkers. However, if the TLB flush already reaches all CPUs,
the second IPI is redundant. GUP-fast runs with IRQs disabled, so when
the TLB flush IPI completes, any concurrent GUP-fast must have finished.

Add tlb_table_flush_implies_ipi_broadcast() to let architectures
indicate their TLB flush provides full synchronization, enabling the
redundant IPI to be skipped. The default implementation returns false
to maintain current behavior.

Suggested-by: David Hildenbrand (Red Hat)
Signed-off-by: Lance Yang
---
 include/asm-generic/tlb.h | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 324a21f53b64..3f0add95604f 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -248,6 +248,21 @@ static inline void tlb_remove_table(struct mmu_gather *tlb, void *table)
 #define tlb_needs_table_invalidate() (true)
 #endif
 
+/*
+ * Architectures can override if their TLB flush already broadcasts IPIs to all
+ * CPUs when freeing or unsharing page tables.
+ *
+ * Return true only when the flush guarantees:
+ * - IPIs reach all CPUs with potentially stale paging-structure cache entries
+ * - Synchronization with IRQ-disabled code like GUP-fast
+ */
+#ifndef tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+	return false;
+}
+#endif
+
 void tlb_remove_table_sync_one(void);
 
 #else
@@ -829,12 +844,17 @@ static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
 	 * We only perform this when we are the last sharer of a page table,
 	 * as the IPI will reach all CPUs: any GUP-fast.
 	 *
+	 * However, if the TLB flush already synchronized with other CPUs
+	 * (indicated by tlb_table_flush_implies_ipi_broadcast()), we can skip
+	 * the additional IPI.
+	 *
 	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
 	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
 	 * required IPIs already for us.
 	 */
 	if (tlb->fully_unshared_tables) {
-		tlb_remove_table_sync_one();
+		if (!tlb_table_flush_implies_ipi_broadcast())
+			tlb_remove_table_sync_one();
 		tlb->fully_unshared_tables = false;
 	}
 }
-- 
2.49.0

From: Lance Yang

Pass both freed_tables and unshared_tables to flush_tlb_mm_range() to
ensure lazy-TLB CPUs receive IPIs and flush their paging-structure
caches:

	flush_tlb_mm_range(..., freed_tables || unshared_tables);

Implement tlb_table_flush_implies_ipi_broadcast() for x86: on native
x86 without paravirt or INVLPGB, the TLB flush IPI already provides the
necessary synchronization, allowing the second IPI to be skipped. For
paravirt with a non-native flush_tlb_multi and for INVLPGB,
conservatively keep both IPIs.

Suggested-by: David Hildenbrand (Red Hat)
Signed-off-by: Lance Yang
---
 arch/x86/include/asm/tlb.h | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlb.h b/arch/x86/include/asm/tlb.h
index 866ea78ba156..96602b7b7210 100644
--- a/arch/x86/include/asm/tlb.h
+++ b/arch/x86/include/asm/tlb.h
@@ -5,10 +5,24 @@
 #define tlb_flush tlb_flush
 static inline void tlb_flush(struct mmu_gather *tlb);
 
+#define tlb_table_flush_implies_ipi_broadcast tlb_table_flush_implies_ipi_broadcast
+static inline bool tlb_table_flush_implies_ipi_broadcast(void);
+
 #include
 #include
 #include
 #include
+#include
+
+static inline bool tlb_table_flush_implies_ipi_broadcast(void)
+{
+#ifdef CONFIG_PARAVIRT
+	/* Paravirt may use hypercalls that don't send real IPIs. */
+	if (pv_ops.mmu.flush_tlb_multi != native_flush_tlb_multi)
+		return false;
+#endif
+	return !cpu_feature_enabled(X86_FEATURE_INVLPGB);
+}
 
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
@@ -20,7 +34,8 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 		end = tlb->end;
 	}
 
-	flush_tlb_mm_range(tlb->mm, start, end, stride_shift, tlb->freed_tables);
+	flush_tlb_mm_range(tlb->mm, start, end, stride_shift,
+			   tlb->freed_tables || tlb->unshared_tables);
 }
 
 static inline void invlpg(unsigned long addr)
-- 
2.49.0

From: Lance Yang

Similar to the hugetlb PMD unsharing optimization, skip the second IPI
in collapse_huge_page() when the TLB flush already provides the
necessary synchronization.

Before commit a37259732a7d ("x86/mm: Make MMU_GATHER_RCU_TABLE_FREE
unconditional"), bare metal x86 didn't enable MMU_GATHER_RCU_TABLE_FREE.
In that configuration, tlb_remove_table_sync_one() was a NOP. GUP-fast
synchronization relied on IRQ disabling, which blocks TLB flush IPIs.

When Rik made MMU_GATHER_RCU_TABLE_FREE unconditional to support AMD's
INVLPGB, all x86 systems started sending the second IPI. However, on
native x86 this is redundant:

- pmdp_collapse_flush() calls flush_tlb_range(), sending IPIs to all
  CPUs to invalidate TLB entries
- GUP-fast runs with IRQs disabled, so when the flush IPI completes,
  any concurrent GUP-fast must have finished
- tlb_remove_table_sync_one() provides no additional synchronization

On x86, skip the second IPI when running native (without paravirt) and
without INVLPGB. For paravirt with a non-native flush_tlb_multi and for
INVLPGB, conservatively keep both IPIs. Use
tlb_table_flush_implies_ipi_broadcast(), consistent with the hugetlb
optimization.
Suggested-by: David Hildenbrand (Red Hat)
Signed-off-by: Lance Yang
---
 mm/khugepaged.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..06ea793a8190 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1178,7 +1178,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	_pmd = pmdp_collapse_flush(vma, address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
-	tlb_remove_table_sync_one();
+	/*
+	 * Skip the second IPI if the TLB flush above already synchronized
+	 * with concurrent GUP-fast via broadcast IPIs.
+	 */
+	if (!tlb_table_flush_implies_ipi_broadcast())
+		tlb_remove_table_sync_one();
 
 	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
 	if (pte) {
-- 
2.49.0