When memory is hot-removed, section_deactivate() can tear down
mem_section->usage while concurrent pfn walkers still inspect the
subsection map via pfn_section_valid() or pfn_section_first_valid().
After commit 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing
memory_section->usage") converted the teardown to an RCU-based scheme,
the code still relies on the clearing of SECTION_HAS_MEM_MAP becoming
visible to readers before ms->usage is cleared and queued for freeing.
That ordering is not guaranteed: section_deactivate() can clear
ms->usage and queue kfree_rcu() before another CPU observes the
SECTION_HAS_MEM_MAP clear. A concurrent pfn walker can therefore see
valid_section() return true, enter its sched-RCU read-side critical
section after kfree_rcu() has already been queued, and then dereference
a stale ms->usage pointer. In addition, pfn_to_online_page() calls
pfn_section_valid() without its own sched-RCU read-side critical
section, which has the same problem.

The race looks like this:

  compact_zone()                      memunmap_pages
  ==============                      ==============
                                      __remove_pages()->
                                        sparse_remove_section()->
                                          section_deactivate():
                                      a) [ Clear SECTION_HAS_MEM_MAP is
                                           reordered to b) ]
                                      kfree_rcu(ms->usage)
  __pageblock_pfn_to_page
  ......
  pfn_valid():
    rcu_read_lock_sched()
    valid_section()    // return true
    pfn_section_valid()
    [Access ms->usage which is UAF]
                                      WRITE_ONCE(ms->usage, NULL)
    rcu_read_unlock_sched()
                                      b) Clear SECTION_HAS_MEM_MAP

Fix this by detaching ms->usage with rcu_replace_pointer() in
section_deactivate(), so that correctness no longer depends on the
order in which the SECTION_HAS_MEM_MAP clear becomes visible.

Fixes: 5ec8e8ea8b77 ("mm/sparsemem: fix race in accessing memory_section->usage")
Signed-off-by: Muchun Song

---

This patch is focused on the ms->usage lifetime race only. One open
question is the interaction between pfn_to_online_page() and vmemmap
teardown during memory hot-remove. Could pfn_to_online_page() still
hand out a stale struct page here?
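As an aside, the detach-then-free pattern the fix relies on can be
sketched as a userspace analogue using C11 atomics rather than the
kernel's RCU API (a sketch only, not kernel code; the names ms_usage,
detach_usage and read_usage are illustrative):

```c
/*
 * Userspace analogue of the fix (illustrative names, not kernel code):
 * detaching the pointer with a single atomic exchange gives teardown
 * and readers exactly one unambiguous handoff point.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct usage {
	unsigned long subsection_map;
};

static _Atomic(struct usage *) ms_usage;

/*
 * Teardown side: like rcu_replace_pointer(ms->usage, NULL, true), the
 * exchange returns the old pointer, so only the detaching path can
 * hand it to the deferred free (kfree_rcu() in the kernel).
 */
static struct usage *detach_usage(void)
{
	return atomic_exchange(&ms_usage, NULL);
}

/*
 * Reader side: like rcu_dereference_sched(ms->usage), a reader
 * observes either NULL or a pointer that stays valid for the duration
 * of its read-side critical section.
 */
static struct usage *read_usage(void)
{
	return atomic_load(&ms_usage);
}
```

The point mirrored from the patch is that the single exchange makes the
handoff unambiguous: exactly one path obtains the old pointer to pass
to the deferred free, and a reader's load returns either NULL or a
pointer that is not freed until the grace period ends.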
The new sched-RCU critical section ends before pfn_to_page(pfn), but
section_deactivate() can still tear the vmemmap down immediately
afterwards:

  mm/sparse-vmemmap.c:section_deactivate()
    ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
    usage = rcu_replace_pointer(ms->usage, NULL, true);
    kfree_rcu(usage, rcu);
    depopulate_section_memmap(...);

That looks like a reader can observe valid = true, drop sched-RCU, race
with section_deactivate(), and then execute pfn_to_page(pfn) after the
backing vmemmap was depopulated. Callers such as
mm/compaction.c:__reset_isolation_pfn(),
mm/page_idle.c:page_idle_get_folio(), and
fs/proc/kcore.c:read_kcore_iter() dereference the returned page
immediately, and they do not appear to hold get_online_mems() across
the pfn_to_online_page() call.

I am not fully sure whether that reasoning is correct, or whether
current callers are expected to rely on additional hotplug
serialization instead. Comments on whether this is a real issue, and
how the vmemmap lifetime is expected to be handled here, would be very
helpful.

---
 include/linux/mmzone.h | 6 +++---
 mm/memory_hotplug.c    | 6 +++++-
 mm/sparse-vmemmap.c    | 6 ++++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 238bf2d35a54..0e850924cbeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -2014,7 +2014,7 @@ struct mem_section {
 	 */
 	unsigned long section_mem_map;
 
-	struct mem_section_usage *usage;
+	struct mem_section_usage __rcu *usage;
 #ifdef CONFIG_PAGE_EXTENSION
 	/*
 	 * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use
@@ -2178,14 +2178,14 @@ static inline int subsection_map_index(unsigned long pfn)
 static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
 {
 	int idx = subsection_map_index(pfn);
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 
 	return usage ?
		test_bit(idx, usage->subsection_map) : 0;
 }
 
 static inline bool pfn_section_first_valid(struct mem_section *ms, unsigned long *pfn)
 {
-	struct mem_section_usage *usage = READ_ONCE(ms->usage);
+	struct mem_section_usage *usage = rcu_dereference_sched(ms->usage);
 	int idx = subsection_map_index(*pfn);
 	unsigned long bit;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0c1d3df3a296..335835abe74c 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -340,6 +340,7 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	unsigned long nr = pfn_to_section_nr(pfn);
 	struct dev_pagemap *pgmap;
 	struct mem_section *ms;
+	bool valid;
 
 	if (nr >= NR_MEM_SECTIONS)
 		return NULL;
@@ -355,7 +356,10 @@ struct page *pfn_to_online_page(unsigned long pfn)
 	if (IS_ENABLED(CONFIG_HAVE_ARCH_PFN_VALID) && !pfn_valid(pfn))
 		return NULL;
 
-	if (!pfn_section_valid(ms, pfn))
+	rcu_read_lock_sched();
+	valid = pfn_section_valid(ms, pfn);
+	rcu_read_unlock_sched();
+	if (!valid)
 		return NULL;
 
 	if (!online_device_section(ms))
diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 717ac953bba2..05f68dcec0f8 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -601,8 +601,10 @@ static void section_deactivate(unsigned long pfn, unsigned long nr_pages,
 	 * was allocated during boot.
 	 */
 	if (!PageReserved(virt_to_page(ms->usage))) {
-		kfree_rcu(ms->usage, rcu);
-		WRITE_ONCE(ms->usage, NULL);
+		struct mem_section_usage *usage;
+
+		usage = rcu_replace_pointer(ms->usage, NULL, true);
+		kfree_rcu(usage, rcu);
 	}
 	memmap = pfn_to_page(SECTION_ALIGN_DOWN(pfn));
 }
-- 
2.20.1