When pages are freed via __free_one_page they're placed on the per-SPB
free_list determined by their pageblock's migratetype, not the original
allocation's migratetype. Slab-heavy and cache-heavy workloads both
expose structural mismatches that leave non-movable allocations
stranded:

  - RECLAIMABLE pageblocks fill up densely with live slab objects (e.g.
    btrfs_inode caches), leaving very few sub-pageblock free fragments
    on the RECL free list.
  - UNMOVABLE pageblocks accumulate sparse free space from vmalloc and
    raw-alloc churn -- tens of thousands of free pages, all on the
    UNMOV free list.
  - MOVABLE-tagged pageblocks in tainted SPBs absorb freed page-cache
    and anon-LRU pages, accumulating large pools all on the MOVABLE
    free list -- invisible to non-movable demand even though the
    tainted SPB has plenty of unused space.

Add two new passes between Pass 2b and Pass 3 of __rmqueue_smallest,
both restricted to SB_TAINTED (clean SPBs must not be polluted with
cross-type mixing) and both purely transient borrows (no pageblock
relabel; the borrowed page returns to its source list when freed):

  Pass 2c -- cross-non-movable borrow. UNMOV alloc tries the RECL free
  list; RECL alloc tries the UNMOV free list. Restricted to
  UNMOV <-> RECL.

  Pass 2d -- cross-MOV borrow. Non-movable alloc tries the MOVABLE free
  list of a tainted SPB. Tradeoff: the borrowed UNMOV/RECL content
  blocks compaction of its source pageblock until freed; restricted to
  SB_TAINTED so contamination is bounded to one pageblock inside an
  already-tainted SPB. The alternative -- Pass 3 tainting a fresh clean
  SPB -- removes a 1 GiB region from the clean pool, which is strictly
  worse for the anti-fragmentation invariant the series is built
  around.

The requesting type's PB_has_ flag is set via __spb_set_has_type so
spb_defrag accounting reflects that the pageblock now hosts that type's
content. The source type's PB_has_ flag stays set since other buddies
of that type remain.

Movable allocations don't participate (they have Pass 4) and CMA is
skipped.
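
For illustration only, and not part of this patch: the borrow semantics
described above can be modelled in a small userspace C sketch. Every
toy_* name below is hypothetical, and the model assumes a single SPB
with three pageblocks, order-0 pages only, and a "free list" that is
implied by each pageblock's tag. It is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy migratetypes and outcomes; hypothetical names, not kernel symbols. */
enum toy_mt { TOY_UNMOVABLE, TOY_MOVABLE, TOY_RECLAIMABLE, TOY_NR_MT };
enum toy_pass { TOY_PASS_OWN, TOY_PASS_2C, TOY_PASS_2D, TOY_PASS_FAIL };

#define TOY_NR_PB        3
#define TOY_PAGES_PER_PB 4

struct toy_spb {
	bool tainted;
	enum toy_mt pb_tag[TOY_NR_PB];               /* pageblock tag */
	bool page_free[TOY_NR_PB][TOY_PAGES_PER_PB]; /* free pages */
	bool pb_has[TOY_NR_PB][TOY_NR_MT];           /* content accounting */
};

/* One pageblock per type, every page free. */
void toy_spb_init(struct toy_spb *spb, bool tainted)
{
	int pb, i, mt;

	spb->tainted = tainted;
	spb->pb_tag[0] = TOY_UNMOVABLE;
	spb->pb_tag[1] = TOY_RECLAIMABLE;
	spb->pb_tag[2] = TOY_MOVABLE;
	for (pb = 0; pb < TOY_NR_PB; pb++) {
		for (i = 0; i < TOY_PAGES_PER_PB; i++)
			spb->page_free[pb][i] = true;
		for (mt = 0; mt < TOY_NR_MT; mt++)
			spb->pb_has[pb][mt] = false;
	}
}

/* A free page counts towards the list named by its pageblock's tag. */
int toy_free_count(const struct toy_spb *spb, enum toy_mt list_mt)
{
	int pb, i, n = 0;

	for (pb = 0; pb < TOY_NR_PB; pb++)
		for (i = 0; i < TOY_PAGES_PER_PB; i++)
			if (spb->pb_tag[pb] == list_mt && spb->page_free[pb][i])
				n++;
	return n;
}

/* Take one page from list_mt's list for want_mt; the tag is untouched. */
static int toy_take(struct toy_spb *spb, enum toy_mt list_mt,
		    enum toy_mt want_mt)
{
	int pb, i;

	for (pb = 0; pb < TOY_NR_PB; pb++) {
		for (i = 0; i < TOY_PAGES_PER_PB; i++) {
			if (spb->pb_tag[pb] != list_mt || !spb->page_free[pb][i])
				continue;
			spb->page_free[pb][i] = false;
			spb->pb_has[pb][want_mt] = true;
			return pb * TOY_PAGES_PER_PB + i;
		}
	}
	return -1;
}

/* Own list first; then 2c and 2d for non-movable requests on tainted SPBs. */
int toy_alloc(struct toy_spb *spb, enum toy_mt mt, enum toy_pass *pass)
{
	int page = toy_take(spb, mt, mt);

	*pass = TOY_PASS_OWN;
	if (page >= 0)
		return page;
	if (spb->tainted && mt != TOY_MOVABLE) {
		enum toy_mt opposite = (mt == TOY_UNMOVABLE) ?
				       TOY_RECLAIMABLE : TOY_UNMOVABLE;

		*pass = TOY_PASS_2C;
		page = toy_take(spb, opposite, mt);
		if (page >= 0)
			return page;
		*pass = TOY_PASS_2D;
		page = toy_take(spb, TOY_MOVABLE, mt);
		if (page >= 0)
			return page;
	}
	*pass = TOY_PASS_FAIL; /* the real allocator would go on to Pass 3 */
	return -1;
}

/* Freeing returns the page to the list of its (unchanged) pageblock tag. */
void toy_free(struct toy_spb *spb, int page)
{
	spb->page_free[page / TOY_PAGES_PER_PB][page % TOY_PAGES_PER_PB] = true;
}
```

In this model, once the RECLAIMABLE pageblock of a tainted SPB is full,
the next RECLAIMABLE request is served from the UNMOVABLE-tagged
pageblock (TOY_PASS_2C), and toy_free() puts that page back on the
UNMOVABLE list because the tag never changed. A non-tainted SPB reports
TOY_PASS_FAIL instead.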

Observable as SPB_ALLOC_OUTCOME_PASS_2C and SPB_ALLOC_OUTCOME_PASS_2D on
the spb_alloc_walk tracepoint.

Live measurement on a 250 GB system with btrfs root (Stage 1 +
simplified Stage 2a) at boot+7min: tainted Normal-zone SPBs grew from a
baseline of 4 to 12, even though the existing 11 each had between 825
and 87,062 free pages, ALL on the UNMOV list, while the workload kept
allocating RECL btrfs_inode slab pages. Pass 2c lets those allocs absorb
into the existing UNMOV-listed free pool rather than creating fresh
tainted SPBs; Pass 2d extends the same idea to the MOV-listed free pool
that page-cache reclaim leaves behind.

Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 mm/page_alloc.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 156 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4ecddb428c3..ce8cd99dd283 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2820,6 +2820,7 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 	struct page *page;
 	int full;
 	struct superpageblock *sb;
+	int opposite_mt;
 	/*
 	 * Category search order: 2 passes.
 	 * Movable: clean first, then tainted (pack into clean SBs).
@@ -2999,6 +3000,161 @@ struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
 				}
 			}
 		}
+
+		/*
+		 * Pass 2c: cross-non-movable borrow within tainted SPBs.
+		 *
+		 * If we're a non-movable alloc and Pass 1/2/2b couldn't find a
+		 * buddy on our migratetype's free list anywhere, but tainted
+		 * SPBs have free buddies on the *opposite* non-movable type's
+		 * free list, take one of those.
+		 *
+		 * Why this happens: when pages are freed, __free_one_page puts
+		 * them on the free_list determined by their pageblock's tag,
+		 * not the original allocation's migratetype. Slab caches tend
+		 * to be dense (RECL pageblocks fill up; few sub-PB fragments),
+		 * while UNMOV pageblocks accumulate sparse free space from
+		 * vmalloc/raw alloc churn. Net effect: tainted SPBs frequently
+		 * have tens of thousands of free pages all on the UNMOV list,
+		 * invisible to RECL allocs (or vice versa). Without this pass,
+		 * the alloc falls through to Pass 3 and taints a fresh clean
+		 * SPB even though the existing tainted ones have plenty of
+		 * unused space.
+		 *
+		 * We do NOT relabel the source pageblock. The buddy is taken
+		 * from @opposite_mt's free list and the splits go back on
+		 * @opposite_mt's list (page_del_and_expand uses the same mt
+		 * for delete and expand). The pageblock tag is unchanged, so
+		 * the page returns to @opposite_mt's list when freed via
+		 * __free_one_page. Effectively a borrow: the alloc takes a
+		 * physical page from a UNMOV-tagged pageblock for a RECL
+		 * use, and the page cycles back to UNMOV's list on free.
+		 *
+		 * We do set our type's PB_has_ flag via __spb_set_has_type so
+		 * spb_defrag accounting reflects that this pageblock now hosts
+		 * our migratetype's content too. The source type's PB_has_ flag
+		 * stays set since other buddies of that type remain.
+		 *
+		 * Restricted to UNMOV ↔ RECL. Movable allocations don't
+		 * participate (they have their own Pass 4 fallback path).
+		 *
+		 * Restricted to SB_TAINTED to avoid spreading mixing into
+		 * clean SPBs.
+		 */
+		opposite_mt = -1;
+		if (migratetype == MIGRATE_UNMOVABLE)
+			opposite_mt = MIGRATE_RECLAIMABLE;
+		else if (migratetype == MIGRATE_RECLAIMABLE)
+			opposite_mt = MIGRATE_UNMOVABLE;
+
+		if (opposite_mt >= 0) {
+			for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+				list_for_each_entry(sb,
+						&zone->spb_lists[SB_TAINTED][full], list) {
+					int co;
+
+					if (!sb->nr_free_pages)
+						continue;
+					for (co = min_t(int, pageblock_order - 1,
+							NR_PAGE_ORDERS - 1);
+					     co >= (int)order;
+					     --co) {
+						current_order = co;
+						area = &sb->free_area[current_order];
+						page = get_page_from_free_area(
+							area, opposite_mt);
+						if (!page)
+							continue;
+						if (get_pageblock_isolate(page))
+							continue;
+						if (is_migrate_cma(
+						    get_pageblock_migratetype(page)))
+							continue;
+						page_del_and_expand(zone, page,
+							order, current_order,
+							opposite_mt);
+						__spb_set_has_type(page,
+							migratetype);
+						trace_mm_page_alloc_zone_locked(
+							page, order, migratetype,
+							pcp_allowed_order(order) &&
+							migratetype < MIGRATE_PCPTYPES);
+						return page;
+					}
+				}
+			}
+		}
+
+		/*
+		 * Pass 2d: cross-MOV borrow within tainted SPBs.
+		 *
+		 * If Pass 1/2/2b/2c all failed, the next step is Pass 3
+		 * which would taint a fresh clean SPB. Before that, try
+		 * to borrow an individual buddy from a tainted SPB's
+		 * MIGRATE_MOVABLE free list.
+		 *
+		 * Tainted SPBs accumulate large amounts of free space on
+		 * the MOV free list (e.g. reclaimed page-cache pages
+		 * whose pageblock tag is MOVABLE). Pass 1 cannot see
+		 * those for non-movable allocs, Pass 2/2b cannot claim a
+		 * whole pageblock when sb->nr_free == 0, and Pass 2c is
+		 * restricted to UNMOV<->RECL. The result is a tainted
+		 * SPB with tens to hundreds of thousands of free pages
+		 * all unreachable from non-movable demand.
+		 *
+		 * Borrow semantics mirror Pass 2c: take a buddy from the
+		 * MOVABLE free list without relabeling the source
+		 * pageblock. The page is used for the requesting non-
+		 * movable mt for the lifetime of the allocation, then on
+		 * free returns to the MOVABLE list.
+		 *
+		 * Cost: the borrowed UNMOV/RECL content blocks
+		 * compaction of its source pageblock until freed.
+		 * Restricted to SB_TAINTED so the contamination is
+		 * bounded to an already-tainted SPB; the alternative
+		 * (Pass 3) taints a fresh clean SPB and removes a 1 GiB
+		 * region from the clean pool, which is strictly worse.
+		 *
+		 * Skipped for movable allocs (they have Pass 4) and for
+		 * CMA allocs.
+		 */
+		if (!movable && !is_migrate_cma(migratetype)) {
+			for (full = SB_FULL; full < __NR_SB_FULLNESS; full++) {
+				list_for_each_entry(sb,
+						&zone->spb_lists[SB_TAINTED][full], list) {
+					int co;
+
+					if (!sb->nr_free_pages)
+						continue;
+					for (co = min_t(int, pageblock_order - 1,
+							NR_PAGE_ORDERS - 1);
+					     co >= (int)order;
+					     --co) {
+						current_order = co;
+						area = &sb->free_area[current_order];
+						page = get_page_from_free_area(
+							area, MIGRATE_MOVABLE);
+						if (!page)
+							continue;
+						if (get_pageblock_isolate(page))
+							continue;
+						if (is_migrate_cma(
+						    get_pageblock_migratetype(page)))
+							continue;
+						page_del_and_expand(zone, page,
+							order, current_order,
+							MIGRATE_MOVABLE);
+						__spb_set_has_type(page,
+							migratetype);
+						trace_mm_page_alloc_zone_locked(
+							page, order, migratetype,
+							pcp_allowed_order(order) &&
+							migratetype < MIGRATE_PCPTYPES);
+						return page;
+					}
+				}
+			}
+		}
 	}
 
 	/*
--
2.54.0