At the end of dissolve_free_hugetlb_folio(), a free HugeTLB folio
becomes a non-HugeTLB folio and is released to the buddy allocator as
a high-order folio, e.g. a folio containing 262144 pages if it was a
1G HugeTLB hugepage. This is problematic if the HugeTLB hugepage
contained HWPoison subpages: since the buddy allocator does not check
HWPoison on non-zero-order folios, a raw HWPoison page can be handed
out together with its buddy pages and re-used by either the kernel or
userspace. Memory failure recovery (MFR) in the kernel does attempt to
take the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio(). However, there is always a time window
between the moment dissolve_free_hugetlb_folio() frees a HWPoison
high-order folio to the buddy allocator and the moment MFR takes the
HWPoison raw page off it.

A similar situation arises when a transparent huge page (THP) runs
into memory failure but splitting it fails. Such a THP will eventually
be released to the buddy allocator once the owning userspace processes
are gone, but with some of its subpages still HWPoison.

One obvious way to avoid both problems is to add page sanity checks to
the page allocation or free path. However, that goes against past
efforts to reduce sanity check overhead [1,2,3].

Introduce free_has_hwpoisoned() to free only the healthy pages of such
a high-order folio and to exclude the HWPoison ones. The idea is to
iterate over the subpages of the folio to identify contiguous ranges
of healthy pages. Instead of freeing pages one by one, each healthy
range is decomposed into the largest possible blocks of different
orders, where every block meets the requirements to be freed via
__free_one_page().
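
For illustration (the numbers are hypothetical, only the arithmetic
matters), a healthy range of 13 pages starting at pfn 0x4 is freed
with three calls, using ffs(pfn) - 1 for the alignment order and
fls_long(remaining) - 1 for the size order as in free_contiguous_pages()
below:

  pfn 0x4:  align order 2, size order 3 (13 pages left) -> free order 2
  pfn 0x8:  align order 3, size order 3 ( 9 pages left) -> free order 3
  pfn 0x10: align order 4, size order 0 ( 1 page left)  -> free order 0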
free_has_hwpoisoned() has linear time complexity with respect to the
number of pages in the folio. While the power-of-two decomposition
keeps the number of calls into the buddy allocator logarithmic for
each contiguous healthy range, the mandatory linear scan of the pages
for PageHWPoison() dominates the overall time complexity. For a 1G
hugepage with several HWPoison pages, free_has_hwpoisoned() takes
around 2ms on average.

Since free_has_hwpoisoned() has nontrivial overhead, it is added to
free_pages_prepare() as a shortcut that is only taken when
PG_has_hwpoisoned indicates a HWPoison page exists, and only after all
checks and preparations have succeeded.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz

Signed-off-by: Jiaqi Yan
---
 mm/page_alloc.c | 133 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 131 insertions(+), 2 deletions(-)
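
For reviewers: below is a purely illustrative userspace model of the
block decomposition, not part of the patch. The helper name
model_free_range() is made up for this note, and it uses compiler
builtins in place of the kernel's ffs()/fls_long(). It can be used to
sanity-check the chosen block orders and the per-range call count,
e.g. freeing the 262143 healthy pages of a 1G folio whose last page is
poisoned takes 18 calls.

#include <stdio.h>

/* Decompose [pfn, pfn + nr_pages) and return the number of free calls. */
static unsigned int model_free_range(unsigned long pfn, unsigned long nr_pages)
{
	unsigned long end_pfn = pfn + nr_pages;
	unsigned int calls = 0;

	while (pfn < end_pfn) {
		unsigned long remaining = end_pfn - pfn;
		/* largest power-of-two block the pfn alignment allows */
		unsigned int align_order = __builtin_ctzl(pfn);
		/* largest power-of-two block that fits in the range */
		unsigned int size_order =
			8 * sizeof(long) - 1 - __builtin_clzl(remaining);
		unsigned int order = align_order < size_order ?
				     align_order : size_order;

		printf("free pfn %#lx order %u\n", pfn, order);
		pfn += 1UL << order;
		calls++;
	}
	return calls;
}

int main(void)
{
	/* a 1G folio at pfn 0x40000 with only its last page poisoned */
	printf("%u free calls\n", model_free_range(0x40000, 262143));
	return 0;
}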
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cbf758e27aa2c..d6883f1b17d95 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -242,6 +242,7 @@ gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
 unsigned int pageblock_order __read_mostly;
 #endif
 
+static void free_has_hwpoisoned(struct page *page, unsigned int order);
 static void __free_pages_ok(struct page *page, unsigned int order,
 			    fpi_t fpi_flags);
 
@@ -1340,14 +1341,30 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
 
 #endif /* CONFIG_MEM_ALLOC_PROFILING */
 
-__always_inline bool free_pages_prepare(struct page *page,
-			unsigned int order)
+/*
+ * Returns
+ * - true: checks and preparations all good, caller can proceed freeing.
+ * - false: do not proceed freeing for one of the two reasons:
+ *   1. Some check failed, so it is not safe to proceed with freeing.
+ *   2. A compound page having some HWPoison pages. The healthy pages
+ *      are already safely freed, and the HWPoison ones isolated.
+ */
+__always_inline bool free_pages_prepare(struct page *page, unsigned int order)
 {
 	int bad = 0;
 	bool skip_kasan_poison = should_skip_kasan_poison(page);
 	bool init = want_init_on_free();
 	bool compound = PageCompound(page);
 	struct folio *folio = page_folio(page);
+	/*
+	 * When dealing with a compound page, PG_has_hwpoisoned is cleared
+	 * along with PAGE_FLAGS_SECOND, so the check must be done first.
+	 *
+	 * Note we can't exclude PG_has_hwpoisoned from PAGE_FLAGS_SECOND:
+	 * because PG_has_hwpoisoned == PG_active, free_page_is_bad() would
+	 * get confused and complain that the first tail page is still active.
+	 */
+	bool should_fhh = compound && folio_test_has_hwpoisoned(folio);
 
 	VM_BUG_ON_PAGE(PageTail(page), page);
 
@@ -1470,6 +1487,16 @@ __always_inline bool free_pages_prepare(struct page *page,
 
 	debug_pagealloc_unmap_pages(page, 1 << order);
 
+	/*
+	 * After breaking down the compound page and dealing with page metadata
+	 * (e.g. page owner and page alloc tags), take a shortcut if this
+	 * was a compound page containing some HWPoison subpages.
+	 */
+	if (should_fhh) {
+		free_has_hwpoisoned(page, order);
+		return false;
+	}
+
 	return true;
 }
 
@@ -2953,6 +2980,108 @@ static bool free_frozen_page_commit(struct zone *zone,
 	return ret;
 }
 
+/*
+ * Given a range of physically contiguous pages, efficiently free them
+ * block by block. Block order is chosen to meet the PFN alignment
+ * requirement in __free_one_page().
+ */
+static void free_contiguous_pages(struct page *curr,
+				  unsigned long nr_pages)
+{
+	unsigned int order;
+	unsigned int align_order;
+	unsigned int size_order;
+	unsigned long remaining;
+	unsigned long pfn = page_to_pfn(curr);
+	const unsigned long end_pfn = pfn + nr_pages;
+	struct zone *zone = page_zone(curr);
+
+	/*
+	 * At every iteration, the decomposition chooses the order to be
+	 * the minimum of two constraints:
+	 * - Alignment: the largest power-of-two that divides the current pfn.
+	 * - Size: the largest power-of-two that fits in the current
+	 *   remaining number of pages.
+	 */
+	while (pfn < end_pfn) {
+		remaining = end_pfn - pfn;
+		align_order = ffs(pfn) - 1;
+		size_order = fls_long(remaining) - 1;
+		order = min(align_order, size_order);
+
+		free_one_page(zone, curr, pfn, order, FPI_NONE);
+		curr += (1UL << order);
+		pfn += (1UL << order);
+	}
+
+	VM_WARN_ON(pfn != end_pfn);
+}
+
+/*
+ * Given a high-order compound page containing some number of HWPoison
+ * pages, free only the healthy ones assuming FPI_NONE.
+ *
+ * Pages must have passed free_pages_prepare(). Even when there are
+ * HWPoison pages, breaking down the compound page and updating metadata
+ * (e.g. page owner, alloc tag) are done together in free_pages_prepare(),
+ * which simplifies the splitting here: unlike __split_unmapped_folio(),
+ * there is no need to turn the split pages into a compound page or to
+ * carry over metadata.
+ *
+ * It calls free_one_page() O(2^order) times and causes nontrivial
+ * overhead, so only use this when the compound page really has HWPoison.
+ *
+ * This implementation doesn't work in the memdesc world.
+ */
+static void free_has_hwpoisoned(struct page *page, unsigned int order)
+{
+	struct page *curr = page;
+	struct page *next;
+	unsigned long nr_pages;
+	/*
+	 * Don't assume end points to a valid page. It is only used
+	 * here for pointer arithmetic.
+	 */
+	struct page *end = page + (1 << order);
+	unsigned long total_freed = 0;
+	unsigned long total_hwp = 0;
+
+	VM_WARN_ON(order == 0);
+	VM_WARN_ON(page->flags.f & PAGE_FLAGS_CHECK_AT_PREP);
+
+	while (curr < end) {
+		next = curr;
+		nr_pages = 0;
+
+		while (next < end && !PageHWPoison(next)) {
+			++next;
+			++nr_pages;
+		}
+
+		if (next != end && PageHWPoison(next)) {
+			/*
+			 * Avoid an accounting error when the page is later
+			 * freed by unpoison_memory().
+			 */
+			clear_page_tag_ref(next);
+			++total_hwp;
+		}
+
+		free_contiguous_pages(curr, nr_pages);
+		total_freed += nr_pages;
+
+		if (next == end)
+			break;
+
+		VM_WARN_ON(!PageHWPoison(next));
+		curr = next + 1;
+	}
+
+	VM_WARN_ON(total_freed + total_hwp != (1 << order));
+	pr_info("Freed %#lx pages, excluded %lu hwpoison pages\n",
+		total_freed, total_hwp);
+}
+
 /*
  * Free a pcp page
  */
-- 
2.53.0.rc2.204.g2597b5adb4-goog