When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.

  mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
  mce: [Hardware Error]: RIP 10: {memchr_inv+0x4c/0xf0}
  mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
  mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
  mce: [Hardware Error]: Run the above through 'mcelog --ascii'
  mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
  Kernel panic - not syncing: Fatal local machine check

The root cause of this panic is that handling a memory failure triggered by
an in-userspace #MC necessitates splitting the THP. The splitting process
employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
reads the sub-pages of the THP to identify zero-filled pages. However,
reading the sub-pages results in a second in-kernel #MC, occurring before
the initial memory_failure() completes, ultimately leading to a kernel
panic. See the kernel panic call trace on the two #MCs.

  First Machine Check occurs // [1]
    memory_failure()         // [2]
      try_to_split_thp_page()
        split_huge_page()
          split_huge_page_to_list_to_order()
            __folio_split()  // [3]
              remap_page()
                remove_migration_ptes()
                  remove_migration_pte()
                    try_to_map_unused_to_zeropage()  // [4]
                      memchr_inv()                   // [5]
                        Second Machine Check occurs  // [6]
                          Kernel panic

  [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
      typically recoverable by terminating the affected process.

  [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().

  [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().

  [4] Try to map the unused THP to zeropage.

  [5] Re-access sub-pages of the hw-poisoned THP in the kernel.

  [6] Triggered in-kernel, leading to a kernel panic.

In Step[2], memory_failure() sets the poisoned flag on the sub-page of the
THP by TestSetPageHWPoison() before calling try_to_split_thp_page().

As suggested by David Hildenbrand, fix this panic by not accessing the
poisoned sub-page of the THP during zeropage identification, while
continuing to scan unaffected sub-pages of the THP for possible zeropage
mapping. This prevents a second in-kernel #MC that would cause kernel
panic in Step[4].

[ Credits to Andrew Zaborowski for his original fix that prevents passing
  the RMP_USE_SHARED_ZEROPAGE flag to remap_page() in Step[3] if the THP
  has the has_hwpoisoned flag set, avoiding access to the entire THP for
  zero-page identification. ]

Reported-by: Farrah Chen
Suggested-by: David Hildenbrand
Tested-by: Farrah Chen
Tested-by: Qiuxu Zhuo
Signed-off-by: Qiuxu Zhuo
---
v1 -> v2:
  - Apply David Hildenbrand's fix suggestion.

  - Update the commit message to reflect the new fix.

  - Add David Hildenbrand's "Suggested-by:" tag.

  - Remove Andrew Zaborowski's SoB but add credits to him in the commit
    message.
    [ I cannot reach him to get his SoB for the completely rewritten
      commit message and new fix approach.
    ]

 mm/huge_memory.c | 3 +++
 mm/migrate.c     | 3 ++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..2bf5178cca96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
 	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
 		return false;
 
+	if (folio_contain_hwpoisoned_page(folio))
+		return false;
+
 	for (i = 0; i < folio_nr_pages(folio); i++) {
 		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
 		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 9e5ef39ce73a..393fc2ffc96e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 	pte_t newpte;
 	void *addr;
 
-	if (PageCompound(page))
+	if (PageCompound(page) || PageHWPoison(page))
 		return false;
+
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);

base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a
-- 
2.43.0
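
For illustration only, here is a minimal userspace sketch of the idea behind
the mm/migrate.c hunk: scan sub-pages for zero-filled content, but never read
a sub-page that is marked poisoned. It is not kernel code. struct subpage,
SUBPAGE_SIZE, subpage_is_zero_filled() and count_zeropage_candidates() are
hypothetical names invented for this sketch; the hwpoison field only models
what PageHWPoison() reports, and the byte loop stands in for the
"!memchr_inv(addr, 0, PAGE_SIZE)" check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SUBPAGE_SIZE 4096	/* assumed base page size, for illustration */

/* Hypothetical stand-in for one sub-page of a THP that is being split. */
struct subpage {
	unsigned char data[SUBPAGE_SIZE];
	bool hwpoison;		/* models the result of PageHWPoison() */
};

/* Userspace stand-in for "!memchr_inv(addr, 0, len)". */
static bool subpage_is_zero_filled(const unsigned char *addr, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (addr[i])
			return false;
	}
	return true;
}

/*
 * Count the sub-pages that could be remapped to the shared zeropage.
 * Poisoned sub-pages are skipped before their contents are touched;
 * *nr_reads reports how many sub-pages were actually read.
 */
size_t count_zeropage_candidates(const struct subpage *pages, size_t nr,
				 size_t *nr_reads)
{
	size_t candidates = 0;

	assert(pages != NULL && nr_reads != NULL);
	*nr_reads = 0;

	for (size_t i = 0; i < nr; i++) {
		/* Reading poisoned memory is what raised the second #MC. */
		if (pages[i].hwpoison)
			continue;

		(*nr_reads)++;
		if (subpage_is_zero_filled(pages[i].data, SUBPAGE_SIZE))
			candidates++;
	}

	return candidates;
}
```

The poison check comes before any read of the sub-page data, mirroring how
the patch tests PageHWPoison(page) at the top of
try_to_map_unused_to_zeropage(), so unaffected sub-pages are still scanned
while the poisoned one is left alone.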