With CONFIG_VMAP_STACK, kernel stacks are allocated in the vmalloc area,
which an unprivileged user can surround with attacker-controlled data by
spraying vmap allocations adjacent to a target stack (for example via
XDP_UMEM_REG, though other vmalloc spray paths work too). Today each
guarded vmalloc allocation is followed by a single unmapped guard page.

A single guard page is not enough to contain the x86_64 ENTER instruction
used as a one-instruction stack pivot. ENTER imm16, imm8 builds a stack
frame and lowers RSP by:

    imm16 + 8 * (L + 1),  L = imm8 & 0x1f

imm16 is an unsigned 16-bit operand (ENTER never raises RSP), and L is in
[0, 31], so the maximum displacement of a single ENTER is:

    0xffff + 8 * 0x20 = 0x100ff bytes

That is more than enough to step off the current stack, across the
one-page guard, and into the adjacent sprayed pages. When those pages
contain a return sled feeding a ROP chain, reaching any ENTER gadget
(opcode 0xc8, abundant as both intended and unintended gadgets) turns a
control-flow hijack into full ROP execution without any register control
at the hijack site, making it a one-gadget-style primitive that
significantly eases exploitation. The pivot happens after the control
transfer, so it is not constrained by CFI (kCFI/FineIBT).

Widen the guard region from one page to VMAP_GUARD_PAGES (0x11 pages,
0x11000 bytes), which is the smallest whole-page span exceeding the
0x100ff-byte maximum single-ENTER pivot. A pivot off the top of the stack
now lands in the unmapped guard and faults, instead of in mapped,
attacker-controlled memory. RANDOMIZE_KSTACK_OFFSET only perturbs RSP by
a sub-page amount, so it does not change the required width.

Introduce a VMAP_GUARD_PAGES knob that defaults to a single page (no
change for current architectures) and can be overridden per arch via
asm/vmalloc.h, and set it to 0x11 on x86_64.

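The bound and the resulting guard width can be checked with a few lines of
user-space C. This is only an illustrative sketch and not part of the patch:
the helper names enter_displacement() and guard_pages_for() are hypothetical,
and the 4 KiB page size is an assumption passed in by the caller.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes by which a single ENTER imm16, imm8 lowers RSP, following the
 * formula in the commit message: imm16 + 8 * (L + 1), L = imm8 & 0x1f.
 * Hypothetical helper, for illustration only.
 */
static unsigned long enter_displacement(uint16_t imm16, uint8_t imm8)
{
	unsigned long level = imm8 & 0x1f;	/* nesting level, 0..31 */

	return (unsigned long)imm16 + 8 * (level + 1);
}

/*
 * Smallest number of whole pages whose total size strictly exceeds
 * 'bytes'. page_size is supplied by the caller (0x1000 is assumed for
 * x86_64 in the usage note).
 */
static unsigned long guard_pages_for(unsigned long bytes,
				     unsigned long page_size)
{
	assert(page_size != 0);
	return bytes / page_size + 1;
}
```

Under these assumptions, enter_displacement(0xffff, 0xff) evaluates to
0x100ff, and guard_pages_for(0x100ff, 0x1000) evaluates to 0x11, matching
the value chosen for VMAP_GUARD_PAGES on x86_64.
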
This is deliberately scoped to x86_64: the 0x100ff bound is a property of
the ENTER opcode, and ENTER is also a one-byte opcode (0xc8) that appears
as abundant unintended gadgets. Other architectures (e.g. arm64) have no
equivalent single-instruction, immediate-controlled pivot reachable as an
unaligned unintended gadget, so they keep the one-page guard and pay no
cost.

The override is gated on CONFIG_X86_64 rather than applying to all of
x86: VMAP_STACK is selected only on x86_64, so 32-bit kernel stacks are
not in the vmalloc area and the technique does not apply there. 32-bit
x86 also has a far smaller vmalloc window, where widening every guarded
area by 16 pages would needlessly pressure the address space.

The guard pages are never populated, so there is no extra physical memory
and no additional page-table population beyond the larger virtual span;
the cost is virtual address space and vmap_area bookkeeping, which is
negligible against the 64-bit vmalloc window. get_vm_area_size() is
adjusted by the same VMAP_GUARD_SIZE so the usable size reported to
callers is unchanged.

On x86_64 this widens the guard for all guarded vmap areas, not only
thread stacks. ret2enter targets the stack specifically, so a narrower
alternative is to apply the wider guard only on the thread-stack
allocation path via a dedicated VM_ flag; we kept the change in the
common path as defense in depth for any vmalloc-adjacent pivot target,
but are happy to scope it to stacks if maintainers prefer.

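The size accounting described above can be modelled in user space. The sketch
below is not kernel code: every model_* and MODEL_* name is hypothetical, the
4 KiB page size is an assumption, and it only mirrors the add-on-reserve,
subtract-on-query pairing between __get_vm_area_node() and
get_vm_area_size().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for PAGE_SIZE, VMAP_GUARD_PAGES, VM_NO_GUARD. */
#define MODEL_PAGE_SIZE		0x1000UL	/* assumed 4 KiB pages */
#define MODEL_GUARD_PAGES	0x11UL
#define MODEL_GUARD_SIZE	(MODEL_GUARD_PAGES * MODEL_PAGE_SIZE)
#define MODEL_NO_GUARD		0x1UL

struct model_area {
	size_t size;		/* reserved span, including any guard */
	unsigned long flags;
};

/* Reserve 'size' bytes, appending the guard unless it is disabled. */
static struct model_area model_reserve(size_t size, unsigned long flags)
{
	struct model_area area;

	if (!(flags & MODEL_NO_GUARD))
		size += MODEL_GUARD_SIZE;

	area.size = size;
	area.flags = flags;
	return area;
}

/* Report the usable size, i.e. the reserved span minus the guard. */
static size_t model_usable_size(const struct model_area *area)
{
	if (!(area->flags & MODEL_NO_GUARD)) {
		assert(area->size >= MODEL_GUARD_SIZE);
		return area->size - MODEL_GUARD_SIZE;
	}
	return area->size;
}
```

In this model, reserving 0x4000 bytes with the guard enabled yields a
0x15000-byte span whose usable size is still 0x4000, which is why both sides
must use the same guard constant.
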
Signed-off-by: Xiang Mei
Signed-off-by: Jennifer Miller
---
 arch/x86/include/asm/vmalloc.h | 21 +++++++++++++++++++++
 include/linux/vmalloc.h        | 16 ++++++++++++++--
 mm/vmalloc.c                   |  2 +-
 3 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/vmalloc.h b/arch/x86/include/asm/vmalloc.h
index 49ce331f3ac6..2c341f398227 100644
--- a/arch/x86/include/asm/vmalloc.h
+++ b/arch/x86/include/asm/vmalloc.h
@@ -5,6 +5,27 @@
 #include
 #include
 
+/*
+ * The x86 ENTER instruction can be used as a one-instruction stack pivot:
+ * ENTER imm16, imm8 lowers RSP by imm16 + 8 * (L + 1), L = imm8 & 0x1f.
+ * imm16 is an unsigned 16-bit operand (ENTER never raises RSP) and L is in
+ * [0, 31], so a single ENTER can lower RSP by at most
+ * 0xffff + 8 * 0x20 = 0x100ff bytes. With CONFIG_VMAP_STACK the kernel
+ * stack lives in the vmalloc area, where an unprivileged user can spray
+ * adjacent allocations; a single-page guard is too small to contain such a
+ * pivot. Use 0x11 guard pages (0x11000 bytes), the smallest whole-page
+ * span exceeding 0x100ff, so the pivot faults in the guard instead of
+ * landing in attacker-controlled memory.
+ *
+ * Restrict this to 64-bit: VMAP_STACK is selected only on x86_64, so 32-bit
+ * kernel stacks are not in the vmalloc area and the technique does not apply.
+ * 32-bit also has a far smaller vmalloc window, where a 16-page-per-area
+ * widening would needlessly pressure the address space.
+ */
+#ifdef CONFIG_X86_64
+#define VMAP_GUARD_PAGES	0x11
+#endif
+
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
 
 #ifdef CONFIG_X86_64
diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 3b02c0c6b371..b8546e519deb 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -49,6 +49,18 @@ struct iov_iter;		/* in uio.h */
 #define IOREMAP_MAX_ORDER	(7 + PAGE_SHIFT)	/* 128 pages */
 #endif
 
+/*
+ * Number of unmapped guard pages appended to each guarded vmalloc
+ * allocation. The default is a single page; an architecture may override
+ * VMAP_GUARD_PAGES (via asm/vmalloc.h) when a wider guard is needed to
+ * contain a worst-case single-instruction stack pivot into an adjacent,
+ * attacker-controlled vmap allocation (see arch/x86 for the ENTER case).
+ */
+#ifndef VMAP_GUARD_PAGES
+#define VMAP_GUARD_PAGES	1
+#endif
+#define VMAP_GUARD_SIZE		(VMAP_GUARD_PAGES * PAGE_SIZE)
+
 struct vm_struct {
 	union {
 		struct vm_struct *next;		/* Early registration of vm_areas. */
@@ -236,8 +248,8 @@ int vmap_pages_range(unsigned long addr, unsigned long end, pgprot_t prot,
 static inline size_t get_vm_area_size(const struct vm_struct *area)
 {
 	if (!(area->flags & VM_NO_GUARD))
-		/* return actual size without guard page */
-		return area->size - PAGE_SIZE;
+		/* return actual size without guard region */
+		return area->size - VMAP_GUARD_SIZE;
 	else
 		return area->size;
 
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bb6ae08d18f5..8bb2b3ef40a8 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3217,7 +3217,7 @@ struct vm_struct *__get_vm_area_node(unsigned long size,
 		return NULL;
 
 	if (!(flags & VM_NO_GUARD))
-		size += PAGE_SIZE;
+		size += VMAP_GUARD_SIZE;
 
 	area->flags = flags;
 	area->caller = caller;
-- 
2.43.0