A production host in the Meta fleet (6.16 kernel, memory allocation
profiling enabled) panicked with a kernel stack overflow while a kernel
driver was freeing a resource:

  BUG: TASK stack guard page was hit
  Oops: stack guard page
  RIP: 0010:kfree+0x8/0x5d0
  Call Trace:
   __free_slab+0x66/0xc0
   kfree+0x3f0/0x5d0
   ... ( ~125x __free_slab <-> kfree ) ...
   do_syscall_64

The crash dump shows a 125-deep __free_slab<->kfree recursion that
overflowed the 16 KiB kernel stack.

What happened: a KMALLOC_NORMAL slab's obj_exts array (used by
allocation profiling / memcg accounting) is itself kmalloc()'d from a
KMALLOC_NORMAL cache, so the "slab holds another slab's obj_exts array"
relation can form cycles. With sizeof(struct slabobj_ext) == 16 and the
host's geometry:

  - kmalloc-512 has 64 objects/slab -> array is 64*16 == 1024 bytes,
    served from kmalloc-1k;
  - kmalloc-1k has 32 objects/slab -> array is 32*16 == 512 bytes,
    served from kmalloc-512.

A kmalloc-512 slab and a kmalloc-1k slab therefore hold each other's
obj_exts array. Discarding one frees the other's array, which empties
and discards that slab, which frees the first's array, and so on:

  __free_slab() -> free_slab_obj_exts() -> kfree() -> discard_slab()
    -> __free_slab()

recurses along the cycle until the stack is exhausted. The dump confirms
it: the recursion's slabs strictly alternate kmalloc-512 (obj_exts in
kmalloc-1k) and kmalloc-1k (obj_exts in kmalloc-512), and
mem_alloc_profiling_key was enabled.

Commit 280ea9c3154b ("mm/slab: avoid allocating slabobj_ext array from
its own slab") is not sufficient: it bumps the allocation size only when
the array would come from the *same* cache (object_size ==). At the
geometry above neither cache is self-referential (512 != 1024 and
1024 != 512), so the bump never triggers and the kmalloc-512 <->
kmalloc-1k cross cycle remains.
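
For illustration only, the geometry above can be checked with a small
userspace model. It is a sketch, not kernel code: model_kmalloc_class(),
model_old_alloc_size() and model_cross_cycle() are hypothetical helpers,
the size classes are assumed to be plain powers of two, and the
equal-size bump follows the description of 280ea9c3154b given above.

```c
#include <stddef.h>

/* Assumption: sizeof(struct slabobj_ext) == 16, as stated above. */
#define SLABOBJ_EXT_SIZE 16

/*
 * Hypothetical model of kmalloc size classes: round up to a power of two.
 * (The real allocator also has non-power-of-two classes such as 96 and 192.)
 */
static size_t model_kmalloc_class(size_t size)
{
	size_t class_size = 8;

	while (class_size < size)
		class_size <<= 1;
	return class_size;
}

/* Equal-size bump only, modelled on the check described for 280ea9c3154b. */
static size_t model_old_alloc_size(size_t object_size, size_t objs_per_slab)
{
	size_t sz = objs_per_slab * SLABOBJ_EXT_SIZE;

	if (model_kmalloc_class(sz) == object_size)
		return object_size + 1;
	return sz;
}

/* Returns 1 if the two caches end up holding each other's obj_exts array. */
static int model_cross_cycle(size_t size_a, size_t objs_a,
			     size_t size_b, size_t objs_b)
{
	size_t home_a = model_kmalloc_class(model_old_alloc_size(size_a, objs_a));
	size_t home_b = model_kmalloc_class(model_old_alloc_size(size_b, objs_b));

	return home_a == size_b && home_b == size_a;
}
```

With the host's geometry, model_cross_cycle(512, 64, 1024, 32) returns 1:
neither request hits the equal-size check, so each cache's array stays in
the other cache.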

Fix it structurally by removing cycles of every shape: serve the array
from a cache strictly larger than the one it describes whenever it would
otherwise come from the same or a smaller cache. Every reference edge
then points from a smaller to a larger cache (here kmalloc-1k's array
moves to kmalloc-2k), so the relation is a DAG and cannot contain a
cycle. No slab can be self- or cross-pinned, the tear-down recursion is
bounded by the number of kmalloc size classes (it terminates at the
large-kmalloc path, which carries no obj_exts), and profiling/accounting
coverage is unchanged - the array is still allocated, only relocated.

Reproduced on next-20260623 at the same geometry: churning
kmalloc-512/kmalloc-1k under vm.mem_profiling and then shrinking leaves
kmalloc-512 with thousands of unreclaimable objects without this patch
(8056) and at baseline with it (847).

Fixes: 4b8736964640 ("mm/slab: add allocation accounting into slab allocation and free paths")
Signed-off-by: Shakeel Butt
Reported-by: Danielle Costantino
---
 mm/slub.c | 26 ++++++++++----------------
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 9ec774dc7009..48e54d340865 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2124,15 +2124,14 @@ static inline void init_slab_obj_exts(struct slab *slab)
 }
 
 /*
- * Calculate the allocation size for slabobj_ext array.
+ * Size of the slabobj_ext array for @slab.
  *
- * When memory allocation profiling is enabled, the obj_exts array
- * could be allocated from the same slab cache it's being allocated for.
- * This would prevent the slab from ever being freed because it would
- * always contain at least one allocated object (its own obj_exts array).
- *
- * To avoid this, increase the allocation size when we detect the array
- * may come from the same cache, forcing it to use a different cache.
+ * The array is itself kmalloc()'d. If it came from the same or a smaller
+ * kmalloc cache than @s, the "slab holds another slab's array" relation could
+ * form a cycle (self, or e.g. kmalloc-512 <-> kmalloc-1k) that pins the slabs
+ * forever and recurses via free_slab_obj_exts() -> kfree() -> discard_slab()
+ * at teardown. Force it into a strictly larger cache to keep that relation a
+ * DAG (acyclic).
  */
 static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 					 struct slab *slab, gfp_t gfp)
@@ -2147,14 +2146,9 @@ static inline size_t obj_exts_alloc_size(struct kmem_cache *s,
 		return sz;
 
 	obj_exts_cache = kmalloc_slab(sz, NULL, gfp, __kmalloc_token(0));
-	/*
-	 * We can't simply compare s with obj_exts_cache, because partitioned kmalloc
-	 * caches have multiple caches per size, selected by caller address or type.
-	 * Since caller address or type may differ between kmalloc_slab() and actual
-	 * allocation, bump size when sizes are equal.
-	 */
-	if (s->object_size == obj_exts_cache->object_size)
-		return obj_exts_cache->object_size + 1;
+	/* compare object_size, not the cache pointer (partitioned kmalloc caches) */
+	if (obj_exts_cache->object_size <= s->object_size)
+		return s->object_size + 1;
 
 	return sz;
 }
-- 
2.53.0-Meta
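
Not part of the patch: a userspace sketch of the "<=" rule from the last
hunk, for illustration only. model_kmalloc_class(),
model_new_alloc_size() and model_array_home() are hypothetical helpers,
the size classes are assumed to be plain powers of two, and the early
"return sz" path visible in the hunk context is left out.

```c
#include <stddef.h>

/* Assumption: sizeof(struct slabobj_ext) == 16, as in the commit message. */
#define SLABOBJ_EXT_SIZE 16

/* Hypothetical model of kmalloc size classes: round up to a power of two. */
static size_t model_kmalloc_class(size_t size)
{
	size_t class_size = 8;

	while (class_size < size)
		class_size <<= 1;
	return class_size;
}

/* Mirrors the patched check: bump when the array's class is <= the cache's. */
static size_t model_new_alloc_size(size_t object_size, size_t objs_per_slab)
{
	size_t sz = objs_per_slab * SLABOBJ_EXT_SIZE;

	if (model_kmalloc_class(sz) <= object_size)
		return object_size + 1;
	return sz;
}

/* Size class that ends up holding the obj_exts array for a given cache. */
static size_t model_array_home(size_t object_size, size_t objs_per_slab)
{
	return model_kmalloc_class(model_new_alloc_size(object_size,
							objs_per_slab));
}
```

In this model, model_array_home(1024, 32) is 2048 and
model_array_home(512, 64) is 1024, so each array lives in a strictly
larger class than the cache it describes.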