I've been chasing down the following flaky splat, introduced by recent
changes in BTF generation [1]:
------------[ cut here ]------------
BUG: unable to handle page fault for address: ffa000000233d828
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 100000067 P4D 100253067 PUD 100258067 PMD 0
Oops: Oops: 0000 [#1] SMP NOPTI
CPU: 1 UID: 0 PID: 390 Comm: test_progs Tainted: G W OE 6.19.0-rc1-gf785a31395d9 #331 PREEMPT(full)
Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-4.el9 04/01/2014
RIP: 0010:simplify_symbols+0x2b2/0x480
9.737179] Code: 85 f6 4d 89 f7 b8 01 00 00 00 4c 0f 44 f8 49 83 fd f0 4d 0f 44 fe 75 5b 4d 85 ff 0f 85 76 ff ff ff eb 50 49 8b 4e 20 c1 e0 06 <48> 8b 44 01 10 9 cf fd ff ff 49 89 c5 eb 36 49 c7 c5
RSP: 0018:ffa00000017afc40 EFLAGS: 00010216
RAX: 00000000003fffc0 RBX: 0000000000000002 RCX: ffa0000001f3d858
RDX: ffffffffc0218038 RSI: ffffffffc0218008 RDI: aaaaaaaaaaaaaaab
RBP: ffa00000017afd18 R08: 0000000000000072 R09: 0000000000000069
R10: ffffffff8160d6ca R11: 0000000000000000 R12: ffa0000001f3d577
R13: ffffffffc0214058 R14: ffa00000017afdc0 R15: ffa0000001f3e518
FS: 00007f1c638654c0(0000) GS:ff1100089b7bc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffa000000233d828 CR3: 000000010ba1f001 CR4: 0000000000771ef0
PKRU: 55555554
Call Trace:
? __kmalloc_node_track_caller_noprof+0x37f/0x740
? __pfx_setup_modinfo_srcversion+0x10/0x10
? srso_alias_return_thunk+0x5/0xfbef5
? kstrdup+0x4a/0x70
? srso_alias_return_thunk+0x5/0xfbef5
? setup_modinfo_srcversion+0x1a/0x30
? srso_alias_return_thunk+0x5/0xfbef5
? setup_modinfo+0x12b/0x1e0
load_module+0x133a/0x1610
__x64_sys_finit_module+0x31b/0x450
? entry_SYSCALL_64_after_hwframe+0x76/0x7e
do_syscall_64+0x80/0x2d0
? srso_alias_return_thunk+0x5/0xfbef5
? exc_page_fault+0x95/0xc0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f1c63a2582d
9.794028] Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff 8 8b 0d bb 15 0f 00 f7 d8 64 89 01 48
RSP: 002b:00007ffe513df128 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f1c63a2582d
RDX: 0000000000000000 RSI: 0000000000ee83f9 RDI: 0000000000000016
RBP: 00007ffe513df150 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000206 R12: 00007ffe513e3588
R13: 000000000088fad0 R14: 00000000014bddb0 R15: 00007f1c63ba7000
Modules linked in: bpf_testmod(OE)
CR2: ffa000000233d828
---[ end trace 0000000000000000 ]---
RIP: 0010:simplify_symbols+0x2b2/0x480
9.821595] Code: 85 f6 4d 89 f7 b8 01 00 00 00 4c 0f 44 f8 49 83 fd f0 4d 0f 44 fe 75 5b 4d 85 ff 0f 85 76 ff ff ff eb 50 49 8b 4e 20 c1 e0 06 <48> 8b 44 01 10 9 cf fd ff ff 49 89 c5 eb 36 49 c7 c5
RSP: 0018:ffa00000017afc40 EFLAGS: 00010216
RAX: 00000000003fffc0 RBX: 0000000000000002 RCX: ffa0000001f3d858
RDX: ffffffffc0218038 RSI: ffffffffc0218008 RDI: aaaaaaaaaaaaaaab
RBP: ffa00000017afd18 R08: 0000000000000072 R09: 0000000000000069
R10: ffffffff8160d6ca R11: 0000000000000000 R12: ffa0000001f3d577
R13: ffffffffc0214058 R14: ffa00000017afdc0 R15: ffa0000001f3e518
FS: 00007f1c638654c0(0000) GS:ff1100089b7bc000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffa000000233d828 CR3: 000000010ba1f001 CR4: 0000000000771ef0
PKRU: 55555554
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
This hasn't happened on BPF CI so far, for example, however I was able
to reproduce it on a particular x64 machine using a kernel built with
LLVM 20.
The crash happens on attempt to load one of the BPF selftest modules
(tools/testing/selftests/bpf/test_kmods/bpf_test_modorder_x.ko) which
is used by kfunc_module_order test.
The reason for the crash is that simplify_symbols() doesn't check for
bounds of the ELF section index:
for (i = 1; i < symsec->sh_size / sizeof(Elf_Sym); i++) {
const char *name = info->strtab + sym[i].st_name;
switch (sym[i].st_shndx) {
case SHN_COMMON:
[...]
default:
/* Divert to percpu allocation if a percpu var. */
if (sym[i].st_shndx == info->index.pcpu)
secbase = (unsigned long)mod_percpu(mod);
else
/** HERE --> **/ secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
sym[i].st_value += secbase;
break;
}
}
And in the case I was able to reproduce, the value 0xffff
(SHN_HIRESERVE aka SHN_XINDEX [2]) fell through here.
Now this code fragment is between 15 and 20 years old, so obviously
it's not expected for a kmodule symbol to have such st_shndx
value. Even so, the kernel probably should fail loading the module
instead of crashing, which is what this patch attempts to fix.
Investigating further, I discovered that the module binary became
corrupted by `${OBJCOPY} --update-section` operation updating .BTF_ids
section data in scripts/gen-btf.sh. This explains how the bug has
surfaced after gen-btf.sh was introduced:
$ llvm-readelf -s --wide bpf_test_modorder_x.ko | grep 'BTF_ID'
llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (2), but unable to locate the extended symbol index table
llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (3), but unable to locate the extended symbol index table
llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (4), but unable to locate the extended symbol index table
3: 0000000000000000 16 NOTYPE LOCAL DEFAULT RSV[0xffff] __BTF_ID__set8__bpf_test_modorder_kfunc_x_ids
llvm-readelf: warning: 'bpf_test_modorder_x.ko': found an extended symbol index (16), but unable to locate the extended symbol index table
4: 0000000000000008 4 OBJECT LOCAL DEFAULT RSV[0xffff] __BTF_ID__func__bpf_test_modorder_retx__44417
vs expected
$ llvm-readelf -s --wide bpf_test_modorder_x.ko | grep 'BTF_ID'
3: 0000000000000000 16 NOTYPE LOCAL DEFAULT 6 __BTF_ID__set8__bpf_test_modorder_kfunc_x_ids
4: 0000000000000008 4 OBJECT LOCAL DEFAULT 6 __BTF_ID__func__bpf_test_modorder_retx__44417
But why? Updating section data without changing it's size is not
supposed to affect sections indices, right?
With a bit more testing I confirmed that this is a LLVM-specific
issue (doesn't reproduce with GCC kbuild), and it's not stable,
because in link-vmlinux.h we also do:
${OBJCOPY} --update-section .BTF_ids=${btfids_vmlinux} ${VMLINUX}
However:
$ llvm-readelf -s --wide ~/workspace/prog-aux/linux/vmlinux | grep 0xffff
# no output, which is good
So the suspect is the implementation of llvm-objcopy. As it turns out
there is a relevant known bug that explains the flakiness and isn't
fixed yet [3].
[1] https://lore.kernel.org/bpf/20251219181825.1289460-3-ihor.solodrai@linux.dev/
[2] https://man7.org/linux/man-pages/man5/elf.5.html
[3] https://github.com/llvm/llvm-project/issues/168060#issuecomment-3533552952
Signed-off-by: Ihor Solodrai
---
RFC
While this llvm-objcopy bug is not fixed, we can not trust it in the
kernel build pipeline. In the short-term we have to come up with a
workaround for .BTF_ids section update and replace the calls to
${OBJCOPY} --update-section with something else.
One potential workaround is to force the use of the objcopy (from
binutils) instead of llvm-objcopy when updating .BTF_ids section.
Alternatively, we could just dd the .BTF_ids data computed by
resolve_btfids at the right offset in the target ELF file.
Surprisingly I couldn't find a good way to read a section offset and
size from the ELF with a specified format in a command line. Both
readelf and {llvm-}objdump give a human readable output, and it
appears we can't rely on the column order, for example.
We could still try parsing readelf output with awk/grep, covering
output variants that appear in the kernel build.
We can also do:
llvm-readobj --elf-output-style=JSON --sections "$elf" | \
jq -r --arg name .BTF_ids '
.[0].Sections[] |
select(.Section.Name.Name == $name) |
"\(.Section.Offset) \(.Section.Size)"'
...but idk man, doesn't feel right.
Most reliable way to determine the size and offset of .BTF_ids section
is probably reading them by a C program with libelf, such as
resolve_btfids. Which is quite ironic, given the recent
changes. Setting the irony aside, we could add smth like:
resolve_btfids --section-info=.BTF_ids $elf
Reverting the gen-btf.sh patch is also a possible workaround, but I'd
really like to avoid it, given that BPF features/optimizations in
development depend on it.
I'd appreciate comments and suggestions on this issue. Thank you!
---
kernel/module/main.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 710ee30b3bea..5bf456fad63e 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1568,6 +1568,13 @@ static int simplify_symbols(struct module *mod, const struct load_info *info)
break;
default:
+ if (sym[i].st_shndx >= info->hdr->e_shnum) {
+ pr_err("%s: Symbol %s has an invalid section index %u (max %u)\n",
+ mod->name, name, sym[i].st_shndx, info->hdr->e_shnum - 1);
+ ret = -ENOEXEC;
+ break;
+ }
+
/* Divert to percpu allocation if a percpu var. */
if (sym[i].st_shndx == info->index.pcpu)
secbase = (unsigned long)mod_percpu(mod);
--
2.52.0