| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/06/27 01:53 | flow | patch-triage |
0m Args:null Results: {
"EnableConfigs": [
"SECCOMP_FILTER",
"MEMFD_CREATE",
"64BIT"
],
"FocusSymbols": [
"seccomp_notify_pin_install",
"seccomp_install_pin",
"seccomp_pin_check",
"seccomp_notify_send_redirect",
"seccomp_redirect_restore_cb",
"seccomp_redirect_is_sigreturn",
"seccomp_run_filters_seq",
"vm_mmap_seal_remote",
"mm_get_unmapped_area_remote"
],
"Reasoning": "The patch introduces a new non-cooperative pinned-memfd redirect feature for seccomp, allowing a supervisor to install a sealed memfd mapping into a trapped task's address space and redirect pointer arguments to it. This adds new ioctls (`SECCOMP_IOCTL_NOTIF_PIN_INSTALL` and `SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`) and modifies core seccomp and memory management logic. These new execution paths should be heavily fuzzed to ensure they don't introduce vulnerabilities like memory leaks, race conditions, or unauthorized memory access.",
"WorthFuzzing": true
} |
| 1/1 | 2026/06/27 01:53 | action | read-patch-diff |
0m Args:null Results: {
"PatchDiff": "commit 5ab21aee5036895747567f9993bf88600f1b136a\nAuthor: syz-cluster \u003ctriage@syzkaller.com\u003e\nDate: Sat Jun 27 01:53:38 2026 +0000\n\n syz-cluster: applied patch under review\n\ndiff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst\nindex cff0fa7f3175e4..3a8ac7775da6fd 100644\n--- a/Documentation/userspace-api/seccomp_filter.rst\n+++ b/Documentation/userspace-api/seccomp_filter.rst\n@@ -289,6 +289,114 @@ above in this document: all arguments being read from the tracee's memory\n should be read into the tracer's memory before any policy decisions are made.\n This allows for an atomic decision on syscall arguments.\n \n+Non-cooperative pinned-memfd redirect\n+=====================================\n+\n+The TOCTOU described above means ``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` cannot\n+enforce a policy on pointer arguments: after the supervisor inspects the\n+target's memory and lets the syscall continue, the target (or a thread sharing\n+its address space) can rewrite that memory before the kernel reads it. The\n+cooperative workaround, the target ``mmap()`` + ``mseal()``-ing a shared\n+buffer, is unavailable in the fork+execve sandbox model, where the supervisor\n+confines a binary it did not write.\n+\n+Two ioctls let the supervisor close this race without target cooperation. The\n+redirect step (below) requires a listener created with\n+``SECCOMP_FILTER_FLAG_REDIRECT`` (in addition to\n+``SECCOMP_FILTER_FLAG_NEW_LISTENER``). Because it rewrites another task's\n+registers, at most one such listener may exist in a task's filter chain; a\n+second fails with ``-EBUSY``:\n+\n+.. code-block:: c\n+\n+ fd = seccomp(SECCOMP_SET_MODE_FILTER,\n+ SECCOMP_FILTER_FLAG_NEW_LISTENER | SECCOMP_FILTER_FLAG_REDIRECT,\n+ \u0026prog);\n+\n+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_INSTALL)`` installs a sealed mapping of a\n+supervisor-owned ``memfd`` directly into the trapped task's address space:\n+\n+.. 
code-block:: c\n+\n+ struct seccomp_notif_pin_install {\n+ __u64 id;\n+ __u32 flags; /* reserved, must be 0 */\n+ __u32 memfd;\n+ __u64 target_addr;\n+ __u64 size;\n+ __u64 offset; /* page-aligned offset into memfd */\n+ };\n+\n+``id`` names an active notification (the trapped task to install into).\n+``target_addr``, ``size`` and ``offset`` are page-aligned; ``offset`` selects\n+where in ``memfd`` the mapping starts, so one memfd can back several pins. If\n+``target_addr`` is ``0`` the kernel picks a free address and writes it back;\n+otherwise an existing mapping there yields ``-EEXIST``. The pin is read-only\n+and sealed, the target and its threads cannot unmap, move, reprotect or\n+overwrite it, and lasts until the target ``execve()``s or exits.\n+\n+``memfd`` must be write-sealed (``F_SEAL_WRITE`` or ``F_SEAL_FUTURE_WRITE``)\n+or the ioctl returns ``-EINVAL``; otherwise the target could rewrite the pin's\n+bytes through a separate writable handle to the same memfd.\n+``F_SEAL_FUTURE_WRITE`` still lets the supervisor update the contents through\n+its own mapping made before the seal.\n+\n+``ioctl(SECCOMP_IOCTL_NOTIF_SEND_REDIRECT)`` then resumes the trapped syscall\n+like ``SECCOMP_USER_NOTIF_FLAG_CONTINUE``, but with selected argument\n+registers replaced:\n+\n+.. code-block:: c\n+\n+ struct seccomp_notif_resp_redirect {\n+ __u64 id;\n+ __u32 flags; /* SECCOMP_REDIRECT_FLAG_CONTINUE must be set */\n+ __u32 args_mask; /* which arg registers to replace */\n+ __u32 ptr_mask; /* which of those are pointers into a pin */\n+ __u32 memfd; /* the pin's backing memfd */\n+ __u64 args[6]; /* replacement values */\n+ __u64 ptr_len[6]; /* validated access length for each pointer arg */\n+ };\n+\n+Each bit in ``ptr_mask`` (a subset of ``args_mask``) marks ``args[i]`` as a\n+pointer; the access ``[args[i], args[i] + ptr_len[i])`` must lie within a\n+single read-only pin of ``memfd`` in the target, or the ioctl returns\n+``-EFAULT``. 
``ptr_len[i]`` must be non-zero for those bits and ``0``\n+otherwise. Bits in ``args_mask`` but not ``ptr_mask`` are scalar replacements\n+written verbatim, e.g. to set the length register that goes with a redirected\n+pointer. The original registers are restored at syscall exit, so the\n+substitution is invisible to the target and the TOCTOU is closed.\n+\n+Scope and limitations\n+---------------------\n+\n+The redirect mechanism is deliberately narrow and is *not* a general syscall\n+rewriting facility:\n+\n+- **Read-only input pointers only.** A pin is read-only, so only an argument\n+ the syscall *reads* (a pathname, a ``sockaddr``) may be redirected into it.\n+ Aiming an output or in/out argument at a pin makes the syscall fail with\n+ ``-EFAULT`` when it writes back.\n+\n+- **Same syscall only.** A redirect replaces arguments, never the syscall\n+ number. ``rt_sigreturn()`` (and its compat variant) cannot be redirected and\n+ return ``-EOPNOTSUPP``.\n+\n+- **Signals and restarts.** The redirected syscall really runs, so it can be\n+ interrupted and restarted. On a restart the original arguments are restored\n+ and the syscall re-traps, so the supervisor is notified again and must answer\n+ consistently. Syscalls the kernel restarts without re-trapping (e.g.\n+ ``nanosleep()``, ``futex(FUTEX_WAIT)``) keep the substituted arguments --\n+ safe for read-only inputs, but a reason not to redirect arguments of syscalls\n+ that block or wait.\n+\n+- **clone()/fork().** A child keeps the substituted argument registers (the\n+ restore is not inherited). 
Redirect ``clone()``/``fork()`` arguments only if\n+ that is acceptable.\n+\n+- **ptrace.** A tracer sees the substituted arguments at the syscall-exit stop;\n+ they are restored before the task resumes, so a ``PTRACE_SETREGS`` of a\n+ substituted register at that stop is overwritten.\n+\n Sysctls\n =======\n \ndiff --git a/include/linux/mm.h b/include/linux/mm.h\nindex 485df9c2dbddb3..73e5580442a6dd 100644\n--- a/include/linux/mm.h\n+++ b/include/linux/mm.h\n@@ -4152,6 +4152,8 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,\n \tunsigned long len, unsigned long prot, unsigned long flags,\n \tvm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,\n \tstruct list_head *uf);\n+unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,\n+\tunsigned long addr, unsigned long len, unsigned long pgoff);\n extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,\n \t\t\t unsigned long start, size_t len, struct list_head *uf,\n \t\t\t bool unlock);\n@@ -4192,6 +4194,7 @@ struct vm_unmapped_area_info {\n \tunsigned long align_mask;\n \tunsigned long align_offset;\n \tunsigned long start_gap;\n+\tstruct mm_struct *mm;\t/* mm to search; NULL means current-\u003emm */\n };\n \n extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info);\ndiff --git a/include/linux/seccomp.h b/include/linux/seccomp.h\nindex 9b959972bf4a22..5d53f8fce50896 100644\n--- a/include/linux/seccomp.h\n+++ b/include/linux/seccomp.h\n@@ -10,12 +10,22 @@\n \t\t\t\t\t SECCOMP_FILTER_FLAG_SPEC_ALLOW | \\\n \t\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER | \\\n \t\t\t\t\t SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \\\n-\t\t\t\t\t SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)\n+\t\t\t\t\t SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \\\n+\t\t\t\t\t SECCOMP_FILTER_FLAG_REDIRECT)\n \n /* sizeof() the first published struct seccomp_notif_addfd */\n #define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24\n #define SECCOMP_NOTIFY_ADDFD_SIZE_LATEST 
SECCOMP_NOTIFY_ADDFD_SIZE_VER0\n \n+/* sizeof() the first published struct seccomp_notif_pin_install */\n+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0 32\t\t/* up to @size */\n+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER1 40\t\t/* adds @offset */\n+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_LATEST SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER1\n+\n+/* sizeof() the first published struct seccomp_notif_resp_redirect */\n+#define SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0 120\n+#define SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_LATEST SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0\n+\n #ifdef CONFIG_SECCOMP\n \n #include \u003clinux/thread_info.h\u003e\ndiff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h\nindex dbfc9b37fcaee4..d6888691633c08 100644\n--- a/include/uapi/linux/seccomp.h\n+++ b/include/uapi/linux/seccomp.h\n@@ -25,6 +25,12 @@\n #define SECCOMP_FILTER_FLAG_TSYNC_ESRCH\t\t(1UL \u003c\u003c 4)\n /* Received notifications wait in killable state (only respond to fatal signals) */\n #define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV\t(1UL \u003c\u003c 5)\n+/*\n+ * Declares that this listener's notifier may issue\n+ * SECCOMP_IOCTL_NOTIF_PIN_INSTALL / SECCOMP_IOCTL_NOTIF_SEND_REDIRECT. At most\n+ * one such filter may exist in a task's filter chain. Requires NEW_LISTENER.\n+ */\n+#define SECCOMP_FILTER_FLAG_REDIRECT\t\t(1UL \u003c\u003c 6)\n \n /*\n * All BPF programs must return a 32-bit value.\n@@ -137,6 +143,52 @@ struct seccomp_notif_addfd {\n \t__u32 newfd_flags;\n };\n \n+/**\n+ * struct seccomp_notif_pin_install - have the kernel install a sealed\n+ * MAP_SHARED mapping of @memfd into the trapped task's mm at @target_addr,\n+ * which SECCOMP_IOCTL_NOTIF_SEND_REDIRECT can then use as a target for\n+ * substituted pointer arguments.\n+ *\n+ * The supervisor owns @memfd. The kernel installs the mapping into\n+ * the trapped task's address space without target-side cooperation\n+ * (the target need not mmap or mseal anything itself). 
The mapping\n+ * is marked VM_SEALED at install time, so the target and any\n+ * CLONE_VM peer cannot munmap, mremap, mprotect, or MAP_FIXED-stomp\n+ * it. The mapping is read-only. The supervisor retains access via its\n+ * own mapping of the same memfd in its own mm.\n+ *\n+ * @memfd must be write-sealed (F_SEAL_WRITE or F_SEAL_FUTURE_WRITE),\n+ * otherwise the ioctl fails with -EINVAL. This guarantees the pin's bytes\n+ * cannot be rewritten through any other reference to the same memfd (for\n+ * example one the target reopened via the supervisor's /proc/\u003cpid\u003e/fd),\n+ * not just through the read-only pin itself. F_SEAL_FUTURE_WRITE still\n+ * lets the supervisor update the bytes through its own pre-seal mapping.\n+ *\n+ * @offset lets one memfd back several disjoint read-only pins.\n+ *\n+ * @id: The ID of an active seccomp notification on this listener,\n+ * identifying the trapped task whose mm receives the pin.\n+ * @flags: Reserved, must be 0.\n+ * @memfd: Supervisor-side fd for the backing memfd. Must be write-sealed.\n+ * @target_addr: Address in the trapped task's mm to install at. Must be\n+ * page-aligned. If non-zero, MAP_FIXED semantics apply, no\n+ * other mapping may exist in [@target_addr, @target_addr +\n+ * @size). If zero, the kernel chooses a free area in the\n+ * target mm. On success the actual mapped address is written\n+ * back here.\n+ * @size: Size of the pin in bytes. Must be page-aligned.\n+ * @offset: Page-aligned byte offset into @memfd to map from. 
Zero maps\n+ * from the start of the memfd.\n+ */\n+struct seccomp_notif_pin_install {\n+\t__u64 id;\n+\t__u32 flags;\n+\t__u32 memfd;\n+\t__u64 target_addr;\n+\t__u64 size;\n+\t__u64 offset;\n+};\n+\n #define SECCOMP_IOC_MAGIC\t\t'!'\n #define SECCOMP_IO(nr)\t\t\t_IO(SECCOMP_IOC_MAGIC, nr)\n #define SECCOMP_IOR(nr, type)\t\t_IOR(SECCOMP_IOC_MAGIC, nr, type)\n@@ -154,4 +206,78 @@ struct seccomp_notif_addfd {\n \n #define SECCOMP_IOCTL_NOTIF_SET_FLAGS\tSECCOMP_IOW(4, __u64)\n \n+/* Valid flags for struct seccomp_notif_resp_redirect. */\n+#define SECCOMP_REDIRECT_FLAG_CONTINUE (1UL \u003c\u003c 0)\n+\n+/*\n+ * Number of syscall argument registers a redirect response may\n+ * substitute (matches struct seccomp_data::args[]).\n+ */\n+#define SECCOMP_REDIRECT_ARGS 6\n+\n+/**\n+ * struct seccomp_notif_resp_redirect - resume the trapped syscall with\n+ * substituted arg-register values, optionally pointing into previously\n+ * installed pinned-memfd regions.\n+ *\n+ * Like SECCOMP_USER_NOTIF_FLAG_CONTINUE the syscall actually runs, but the\n+ * kernel first rewrites the arg registers selected by @args_mask. Each\n+ * pointer substitution (@ptr_mask) is validated against the trapped task's\n+ * current address space: the whole access [args[i], args[i] + ptr_len[i])\n+ * must lie inside a single VM_SEALED, read-only mapping of @memfd. No per-pin\n+ * bookkeeping is kept; authorization is re-derived from the live mapping, so\n+ * a target that has exited or execve()d (its mapping gone) simply fails\n+ * validation. Original registers are saved and restored at syscall exit for\n+ * ABI compliance - except after a successful execve, whose new register file\n+ * is left untouched (the redirect still applies, as execve copies the\n+ * pathname from the immutable pin before the old mm is gone, closing that\n+ * TOCTOU too).\n+ *\n+ * @id: The ID of the seccomp notification this response consumes.\n+ * @flags: SECCOMP_REDIRECT_FLAG_*. 
CONTINUE must be set.\n+ * @args_mask: Bit i set means args[i] replaces the trapped task's\n+ * corresponding arg register before the syscall runs.\n+ * @ptr_mask: Subset of @args_mask. Bit i set means args[i] is a pointer and\n+ * the access [args[i], args[i] + ptr_len[i]) is validated to lie\n+ * entirely inside a single VM_SEALED, read-only mapping of @memfd.\n+ * Scalar replacements (in @args_mask but not @ptr_mask) are\n+ * written verbatim.\n+ * @memfd: Supervisor-side fd for the backing memfd whose sealed mapping the\n+ * pointer substitutions must fall within. Consulted only when\n+ * @ptr_mask is non-zero.\n+ * @args: Replacement values for the arg registers.\n+ * @ptr_len: For each bit set in @ptr_mask, ptr_len[i] is the byte length of\n+ * the access starting at args[i]; it must be non-zero and args[i] +\n+ * ptr_len[i] must not overflow. For every i whose bit is clear in\n+ * @ptr_mask it must be 0.\n+ */\n+struct seccomp_notif_resp_redirect {\n+\t__u64 id;\n+\t__u32 flags;\n+\t__u32 args_mask;\n+\t__u32 ptr_mask;\n+\t__u32 memfd;\n+\t__u64 args[SECCOMP_REDIRECT_ARGS];\n+\t__u64 ptr_len[SECCOMP_REDIRECT_ARGS];\n+};\n+\n+/*\n+ * Install a sealed memfd-backed pin in the trapped task's mm without\n+ * target-side cooperation. The supervisor owns the backing memfd;\n+ * the kernel installs the mapping and marks it VM_SEALED. The actual\n+ * mapped address is written back to @target_addr (relevant when it was\n+ * passed as 0 to let the kernel choose).\n+ */\n+#define SECCOMP_IOCTL_NOTIF_PIN_INSTALL\tSECCOMP_IOWR(5, \\\n+\t\t\t\t\t\tstruct seccomp_notif_pin_install)\n+\n+/*\n+ * Resume the trapped syscall with substituted arg-register values\n+ * pointing into an installed pin. 
The kernel saves and restores the\n+ * original registers at syscall exit so the caller observes ABI-\n+ * correct register preservation.\n+ */\n+#define SECCOMP_IOCTL_NOTIF_SEND_REDIRECT\tSECCOMP_IOW(6, \\\n+\t\t\t\t\t\tstruct seccomp_notif_resp_redirect)\n+\n #endif /* _UAPI_LINUX_SECCOMP_H */\ndiff --git a/kernel/seccomp.c b/kernel/seccomp.c\nindex 066909393c38f5..84812ce9bdb3b5 100644\n--- a/kernel/seccomp.c\n+++ b/kernel/seccomp.c\n@@ -37,12 +37,19 @@\n #ifdef CONFIG_SECCOMP_FILTER\n #include \u003clinux/file.h\u003e\n #include \u003clinux/filter.h\u003e\n+#include \u003clinux/memfd.h\u003e\n #include \u003clinux/pid.h\u003e\n #include \u003clinux/ptrace.h\u003e\n #include \u003clinux/capability.h\u003e\n #include \u003clinux/uaccess.h\u003e\n #include \u003clinux/anon_inodes.h\u003e\n #include \u003clinux/lockdep.h\u003e\n+#include \u003clinux/mm.h\u003e\n+#include \u003clinux/mman.h\u003e\n+#include \u003clinux/mmap_lock.h\u003e\n+#include \u003clinux/sched/mm.h\u003e\n+#include \u003clinux/task_work.h\u003e\n+#include \u003cuapi/asm-generic/mman-common.h\u003e\n \n /*\n * When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the\n@@ -87,6 +94,13 @@ struct seccomp_knotif {\n \tlong val;\n \tu32 flags;\n \n+\t/*\n+\t * Set by SEND_REDIRECT: the reply rewrote the syscall's registers,\n+\t * so on resume the syscall must be re-evaluated against the filters\n+\t * outer to the one that notified (see __seccomp_filter()).\n+\t */\n+\tbool redirect;\n+\n \t/*\n \t * Signals when this has changed states, such as the listener\n \t * dying, a new seccomp addfd message, or changing to REPLIED\n@@ -226,6 +240,7 @@ struct seccomp_filter {\n \trefcount_t users;\n \tbool log;\n \tbool wait_killable_recv;\n+\tbool redirect_capable;\n \tstruct action_cache cache;\n \tstruct seccomp_filter *prev;\n \tstruct bpf_prog *prog;\n@@ -946,6 +961,13 @@ static long seccomp_attach_filter(unsigned int flags,\n \t\t}\n \t}\n \n+\tif (flags \u0026 SECCOMP_FILTER_FLAG_REDIRECT) 
{\n+\t\tfor (walker = current-\u003eseccomp.filter; walker;\n+\t\t walker = walker-\u003eprev)\n+\t\t\tif (walker-\u003eredirect_capable)\n+\t\t\t\treturn -EBUSY;\n+\t}\n+\n \t/* Set log flag, if present. */\n \tif (flags \u0026 SECCOMP_FILTER_FLAG_LOG)\n \t\tfilter-\u003elog = true;\n@@ -954,6 +976,10 @@ static long seccomp_attach_filter(unsigned int flags,\n \tif (flags \u0026 SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)\n \t\tfilter-\u003ewait_killable_recv = true;\n \n+\t/* Set redirect-capable flag, if present. */\n+\tif (flags \u0026 SECCOMP_FILTER_FLAG_REDIRECT)\n+\t\tfilter-\u003eredirect_capable = true;\n+\n \t/*\n \t * If there is an existing filter, make it the prev and don't drop its\n \t * task reference.\n@@ -1162,10 +1188,12 @@ static bool should_sleep_killable(struct seccomp_filter *match,\n \n static int seccomp_do_user_notification(int this_syscall,\n \t\t\t\t\tstruct seccomp_filter *match,\n-\t\t\t\t\tconst struct seccomp_data *sd)\n+\t\t\t\t\tconst struct seccomp_data *sd,\n+\t\t\t\t\tbool *redirected)\n {\n \tint err;\n \tu32 flags = 0;\n+\tbool redirect = false;\n \tlong ret = 0;\n \tstruct seccomp_knotif n = {};\n \tstruct seccomp_kaddfd *addfd, *tmp;\n@@ -1222,6 +1250,7 @@ static int seccomp_do_user_notification(int this_syscall,\n \tret = n.val;\n \terr = n.error;\n \tflags = n.flags;\n+\tredirect = n.redirect;\n \n interrupted:\n \t/* If there were any pending addfd calls, clear them out */\n@@ -1248,14 +1277,38 @@ static int seccomp_do_user_notification(int this_syscall,\n \tmutex_unlock(\u0026match-\u003enotify_lock);\n \n \t/* Userspace requests to continue the syscall. 
*/\n-\tif (flags \u0026 SECCOMP_USER_NOTIF_FLAG_CONTINUE)\n+\tif (flags \u0026 SECCOMP_USER_NOTIF_FLAG_CONTINUE) {\n+\t\t*redirected = redirect;\n \t\treturn 0;\n+\t}\n \n \tsyscall_set_return_value(current, current_pt_regs(),\n \t\t\t\t err, ret);\n \treturn -1;\n }\n \n+static u32 seccomp_run_filters_seq(const struct seccomp_data *sd,\n+\t\t\t\t struct seccomp_filter **match,\n+\t\t\t\t struct seccomp_filter *f,\n+\t\t\t\t int this_syscall)\n+{\n+\tfor (; f; f = f-\u003eprev) {\n+\t\tu32 cur_ret = bpf_prog_run_pin_on_cpu(f-\u003eprog, sd);\n+\t\tu32 action = cur_ret \u0026 SECCOMP_RET_ACTION_FULL;\n+\n+\t\tif (action == SECCOMP_RET_ALLOW)\n+\t\t\tcontinue;\n+\t\t/* LOG does not block the syscall; record it and continue. */\n+\t\tif (action == SECCOMP_RET_LOG) {\n+\t\t\tseccomp_log(this_syscall, 0, action, true);\n+\t\t\tcontinue;\n+\t\t}\n+\t\t*match = f;\n+\t\treturn cur_ret;\n+\t}\n+\treturn SECCOMP_RET_ALLOW;\n+}\n+\n static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)\n {\n \tu32 filter_ret, action;\n@@ -1272,6 +1325,8 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)\n \tpopulate_seccomp_data(\u0026sd);\n \n \tfilter_ret = seccomp_run_filters(\u0026sd, \u0026match);\n+\n+eval:\n \tdata = filter_ret \u0026 SECCOMP_RET_DATA;\n \taction = filter_ret \u0026 SECCOMP_RET_ACTION_FULL;\n \n@@ -1334,11 +1389,40 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)\n \n \t\treturn 0;\n \n-\tcase SECCOMP_RET_USER_NOTIF:\n-\t\tif (seccomp_do_user_notification(this_syscall, match, \u0026sd))\n+\tcase SECCOMP_RET_USER_NOTIF: {\n+\t\tstruct seccomp_filter *outer;\n+\t\tbool redirected = false;\n+\n+\t\tif (seccomp_do_user_notification(this_syscall, match, \u0026sd,\n+\t\t\t\t\t\t \u0026redirected))\n \t\t\tgoto skip;\n \n+\t\tif (redirected \u0026\u0026 match-\u003eprev) {\n+\t\t\t/*\n+\t\t\t * The notifier rewrote the registers. 
Resume\n+\t\t\t * evaluation at the next outer filter on the\n+\t\t\t * substituted syscall, sequentially toward the root:\n+\t\t\t * each outer filter judges the new syscall exactly as\n+\t\t\t * if the target had issued it. Walking outward is\n+\t\t\t * monotonic, so a notifier cannot re-notify on its own\n+\t\t\t * redirect.\n+\t\t\t */\n+\t\t\tthis_syscall = syscall_get_nr(current,\n+\t\t\t\t\t\t current_pt_regs());\n+\t\t\tif (this_syscall \u003c 0)\n+\t\t\t\treturn 0;\n+\t\t\touter = match-\u003eprev;\n+\t\t\tmatch = NULL;\n+\t\t\tpopulate_seccomp_data(\u0026sd);\n+\t\t\tfilter_ret = seccomp_run_filters_seq(\u0026sd, \u0026match, outer,\n+\t\t\t\t\t\t\t this_syscall);\n+\t\t\tif (!match)\n+\t\t\t\treturn 0;\n+\t\t\tgoto eval;\n+\t\t}\n+\n \t\treturn 0;\n+\t}\n \n \tcase SECCOMP_RET_LOG:\n \t\tseccomp_log(this_syscall, 0, action, true);\n@@ -1823,6 +1907,346 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,\n \treturn ret;\n }\n \n+static unsigned long seccomp_install_pin(struct task_struct *target,\n+\t\t\t\t\t struct file *memfd_file,\n+\t\t\t\t\t unsigned long target_addr, size_t size,\n+\t\t\t\t\t unsigned long offset)\n+{\n+\tstruct mm_struct *mm;\n+\tunsigned long ret;\n+\n+\tmm = get_task_mm(target);\n+\tif (!mm)\n+\t\treturn -ESRCH;\n+\n+\t/*\n+\t * Install a sealed, read-only mapping. A fixed request (@target_addr\n+\t * != 0) is MAP_FIXED_NOREPLACE: an existing mapping yields -EEXIST\n+\t * rather than being silently clobbered. 
A request of 0 lets the kernel\n+\t * pick a free area in the target mm.\n+\t */\n+\tret = vm_mmap_seal_remote(mm, memfd_file, target_addr, size,\n+\t\t\t\t offset \u003e\u003e PAGE_SHIFT);\n+\tmmput(mm);\n+\tif (IS_ERR_VALUE(ret))\n+\t\treturn ret;\n+\tif (target_addr \u0026\u0026 ret != target_addr)\n+\t\treturn -ENOMEM;\n+\treturn ret;\n+}\n+\n+static long seccomp_notify_pin_install(struct seccomp_filter *filter,\n+\t\t\t\t struct seccomp_notif_pin_install __user *upin,\n+\t\t\t\t unsigned int size)\n+{\n+\tstruct seccomp_notif_pin_install pin;\n+\tstruct seccomp_knotif *knotif;\n+\tstruct task_struct *target;\n+\tstruct file *memfd_file;\n+\tunsigned long addr;\n+\tint seals;\n+\tlong ret;\n+\n+\tBUILD_BUG_ON(sizeof(pin) \u003c SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0);\n+\tBUILD_BUG_ON(sizeof(pin) != SECCOMP_NOTIFY_PIN_INSTALL_SIZE_LATEST);\n+\n+\tif (size \u003c SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0 || size \u003e= PAGE_SIZE)\n+\t\treturn -EINVAL;\n+\n+\tret = copy_struct_from_user(\u0026pin, sizeof(pin), upin, size);\n+\tif (ret)\n+\t\treturn ret;\n+\n+\tif (pin.flags)\n+\t\treturn -EINVAL;\n+\tif (!pin.size || !IS_ALIGNED(pin.target_addr, PAGE_SIZE) ||\n+\t !IS_ALIGNED(pin.size, PAGE_SIZE) || !IS_ALIGNED(pin.offset, PAGE_SIZE))\n+\t\treturn -EINVAL;\n+\tif (pin.target_addr + pin.size \u003c pin.target_addr)\n+\t\treturn -EINVAL;\n+\tif (pin.offset + pin.size \u003c pin.offset)\n+\t\treturn -EINVAL;\n+\n+\tmemfd_file = fget(pin.memfd);\n+\tif (!memfd_file)\n+\t\treturn -EBADF;\n+\n+\tseals = memfd_get_seals(memfd_file);\n+\tif (seals \u003c 0 || !(seals \u0026 (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {\n+\t\tret = -EINVAL;\n+\t\tgoto out_fput;\n+\t}\n+\n+\tret = mutex_lock_interruptible(\u0026filter-\u003enotify_lock);\n+\tif (ret \u003c 0)\n+\t\tgoto out_fput;\n+\n+\tknotif = find_notification(filter, pin.id);\n+\tif (!knotif) {\n+\t\tret = -ENOENT;\n+\t\tgoto out_unlock;\n+\t}\n+\tif (knotif-\u003estate != SECCOMP_NOTIFY_SENT) {\n+\t\tret = 
-EINPROGRESS;\n+\t\tgoto out_unlock;\n+\t}\n+\n+\ttarget = knotif-\u003etask;\n+\tget_task_struct(target);\n+\tmutex_unlock(\u0026filter-\u003enotify_lock);\n+\n+\taddr = seccomp_install_pin(target, memfd_file,\n+\t\t\t\t pin.target_addr, pin.size, pin.offset);\n+\tput_task_struct(target);\n+\tif (IS_ERR_VALUE(addr))\n+\t\tret = addr;\n+\telse if (put_user(addr, \u0026upin-\u003etarget_addr))\n+\t\t/* Pin is installed (and sealed); we just can't report where. */\n+\t\tret = -EFAULT;\n+\telse\n+\t\tret = 0;\n+\tgoto out_fput;\n+\n+out_unlock:\n+\tmutex_unlock(\u0026filter-\u003enotify_lock);\n+out_fput:\n+\tfput(memfd_file);\n+\treturn ret;\n+}\n+\n+static bool seccomp_pin_check(struct task_struct *target,\n+\t\t\t struct file *memfd_file, u64 ptr, u64 len)\n+{\n+\tstruct vm_area_struct *vma;\n+\tstruct mm_struct *mm;\n+\tbool ok = false;\n+\tu64 end;\n+\n+\tif (!len)\n+\t\treturn false;\n+\tend = ptr + len;\n+\tif (end \u003c ptr)\n+\t\treturn false;\n+\n+\tmm = get_task_mm(target);\n+\tif (!mm)\n+\t\treturn false;\n+\n+\t/*\n+\t * The access must lie in a single sealed, read-only, memfd-backed VMA.\n+\t * Read-only so no CLONE_VM peer can rewrite the bytes the kernel is\n+\t * about to read; VM_SEALED keeps the mapping itself immutable.\n+\t */\n+\tmmap_read_lock(mm);\n+\tvma = vma_lookup(mm, ptr);\n+\tif (vma \u0026\u0026 end \u003c= vma-\u003evm_end \u0026\u0026 (vma-\u003evm_flags \u0026 VM_SEALED) \u0026\u0026\n+\t !(vma-\u003evm_flags \u0026 VM_WRITE) \u0026\u0026\n+\t vma-\u003evm_file \u0026\u0026 file_inode(vma-\u003evm_file) == file_inode(memfd_file))\n+\t\tok = true;\n+\tmmap_read_unlock(mm);\n+\n+\tmmput(mm);\n+\treturn ok;\n+}\n+\n+struct seccomp_redirect_restore {\n+\tstruct callback_head twork;\n+\tunsigned long orig_args[SECCOMP_REDIRECT_ARGS];\n+\tu32 args_mask;\t\t/* bit i: arg i was substituted, restore it */\n+\tu64 self_exec_id;\t/* snapshot to detect an intervening execve */\n+};\n+\n+static void seccomp_redirect_restore_cb(struct 
callback_head *cb)\n+{\n+\tstruct seccomp_redirect_restore *r =\n+\t\tcontainer_of(cb, struct seccomp_redirect_restore, twork);\n+\tunsigned long args[SECCOMP_REDIRECT_ARGS];\n+\tint i;\n+\n+\tif (READ_ONCE(current-\u003eself_exec_id) != r-\u003eself_exec_id) {\n+\t\tkfree(r);\n+\t\treturn;\n+\t}\n+\n+\tsyscall_get_arguments(current, current_pt_regs(), args);\n+\tfor (i = 0; i \u003c SECCOMP_REDIRECT_ARGS; i++)\n+\t\tif (r-\u003eargs_mask \u0026 (1U \u003c\u003c i))\n+\t\t\targs[i] = r-\u003eorig_args[i];\n+\tsyscall_set_arguments(current, current_pt_regs(), args);\n+\tkfree(r);\n+}\n+\n+/*\n+ * rt_sigreturn restores the entire register frame from the user signal\n+ * stack; the SEND_REDIRECT register-restore (run from task_work at user-mode\n+ * return) would corrupt that frame, and the syscall takes no arguments to\n+ * substitute anyway. Refuse to redirect it, including the compat variant.\n+ */\n+static bool seccomp_redirect_is_sigreturn(const struct seccomp_data *sd)\n+{\n+#ifdef SECCOMP_ARCH_COMPAT\n+\tif (sd-\u003earch == SECCOMP_ARCH_COMPAT)\n+\t\treturn sd-\u003enr == __NR_seccomp_sigreturn_32;\n+#endif\n+\treturn sd-\u003enr == __NR_seccomp_sigreturn;\n+}\n+\n+static long seccomp_notify_send_redirect(struct seccomp_filter *filter,\n+\t\t\t\t\t struct seccomp_notif_resp_redirect __user *uresp,\n+\t\t\t\t\t unsigned int size)\n+{\n+\tstruct seccomp_notif_resp_redirect resp;\n+\tstruct seccomp_knotif *knotif;\n+\tstruct seccomp_redirect_restore *restore;\n+\tstruct file *memfd_file = NULL;\n+\tstruct pt_regs *target_regs;\n+\tunsigned long args[SECCOMP_REDIRECT_ARGS];\n+\tlong ret;\n+\tint i;\n+\n+\tBUILD_BUG_ON(sizeof(resp) \u003c SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0);\n+\tBUILD_BUG_ON(sizeof(resp) != SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_LATEST);\n+\n+\tif (!filter-\u003eredirect_capable)\n+\t\treturn -EPERM;\n+\n+\tif (size \u003c SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0 || size \u003e= PAGE_SIZE)\n+\t\treturn -EINVAL;\n+\n+\tret = 
copy_struct_from_user(\u0026resp, sizeof(resp), uresp, size);\n+\tif (ret)\n+\t\treturn ret;\n+\n+\tif (!(resp.flags \u0026 SECCOMP_REDIRECT_FLAG_CONTINUE))\n+\t\treturn -EINVAL;\n+\tif (resp.flags \u0026 ~SECCOMP_REDIRECT_FLAG_CONTINUE)\n+\t\treturn -EINVAL;\n+\tif (resp.args_mask \u0026 ~((1U \u003c\u003c SECCOMP_REDIRECT_ARGS) - 1))\n+\t\treturn -EINVAL;\n+\tif (resp.ptr_mask \u0026 ~resp.args_mask)\n+\t\treturn -EINVAL;\n+\tif (!resp.args_mask)\n+\t\treturn -EINVAL;\n+\n+\tfor (i = 0; i \u003c SECCOMP_REDIRECT_ARGS; i++) {\n+\t\tif (resp.ptr_mask \u0026 (1U \u003c\u003c i)) {\n+\t\t\tif (!resp.ptr_len[i])\n+\t\t\t\treturn -EINVAL;\n+\t\t} else if (resp.ptr_len[i]) {\n+\t\t\treturn -EINVAL;\n+\t\t}\n+\t}\n+\n+\trestore = kzalloc_obj(*restore, GFP_KERNEL_ACCOUNT);\n+\tif (!restore)\n+\t\treturn -ENOMEM;\n+\tinit_task_work(\u0026restore-\u003etwork, seccomp_redirect_restore_cb);\n+\n+\t/* The backing memfd is only consulted to validate pointer args. */\n+\tif (resp.ptr_mask) {\n+\t\tmemfd_file = fget(resp.memfd);\n+\t\tif (!memfd_file) {\n+\t\t\tkfree(restore);\n+\t\t\treturn -EBADF;\n+\t\t}\n+\t}\n+\n+\tret = mutex_lock_interruptible(\u0026filter-\u003enotify_lock);\n+\tif (ret \u003c 0)\n+\t\tgoto out_free;\n+\n+\tknotif = find_notification(filter, resp.id);\n+\tif (!knotif) {\n+\t\tret = -ENOENT;\n+\t\tgoto out_unlock_free;\n+\t}\n+\tif (knotif-\u003estate != SECCOMP_NOTIFY_SENT) {\n+\t\tret = -EINPROGRESS;\n+\t\tgoto out_unlock_free;\n+\t}\n+\n+\tif (seccomp_redirect_is_sigreturn(knotif-\u003edata)) {\n+\t\tret = -EOPNOTSUPP;\n+\t\tgoto out_unlock_free;\n+\t}\n+\n+\tfor (i = 0; i \u003c SECCOMP_REDIRECT_ARGS; i++) {\n+\t\tif (!(resp.ptr_mask \u0026 (1U \u003c\u003c i)))\n+\t\t\tcontinue;\n+\t\tif (!seccomp_pin_check(knotif-\u003etask, memfd_file,\n+\t\t\t\t resp.args[i], resp.ptr_len[i])) {\n+\t\t\tret = -EFAULT;\n+\t\t\tgoto out_unlock_free;\n+\t\t}\n+\t}\n+\n+\t/*\n+\t * Save original pt_regs args (target is parked in\n+\t * seccomp_do_user_notification, so 
its pt_regs is stable) and\n+\t * write substituted values. The trapped task's task_work fires\n+\t * at user-mode return, restoring originals for ABI compliance.\n+\t */\n+\ttarget_regs = task_pt_regs(knotif-\u003etask);\n+\tsyscall_get_arguments(knotif-\u003etask, target_regs, args);\n+\tfor (i = 0; i \u003c SECCOMP_REDIRECT_ARGS; i++)\n+\t\trestore-\u003eorig_args[i] = args[i];\n+\trestore-\u003eargs_mask = resp.args_mask;\n+\trestore-\u003eself_exec_id = READ_ONCE(knotif-\u003etask-\u003eself_exec_id);\n+\n+\tfor (i = 0; i \u003c SECCOMP_REDIRECT_ARGS; i++)\n+\t\tif (resp.args_mask \u0026 (1U \u003c\u003c i))\n+\t\t\targs[i] = resp.args[i];\n+\tsyscall_set_arguments(knotif-\u003etask, target_regs, args);\n+\n+\t/*\n+\t * Use TWA_RESUME, not TWA_SIGNAL. TWA_SIGNAL sets TIF_NOTIFY_SIGNAL,\n+\t * which makes signal_pending() true for the entire redirected syscall\n+\t * (the work is queued here, before the target resumes and runs it).\n+\t * An interruptible syscall would then bail out with -ERESTARTSYS before\n+\t * doing any work, restart, re-trap and get redirected again -- a\n+\t * livelock. TWA_RESUME does not feed signal_pending(), and the restore\n+\t * still runs before signal delivery: get_signal() runs task_work_run()\n+\t * before it dequeues a signal, so the original args are back in pt_regs\n+\t * before handle_signal() builds the sigframe or the -ERESTART* path\n+\t * rewinds for restart.\n+\t */\n+\tret = task_work_add(knotif-\u003etask, \u0026restore-\u003etwork, TWA_RESUME);\n+\tif (ret) {\n+\t\tfor (i = 0; i \u003c SECCOMP_REDIRECT_ARGS; i++)\n+\t\t\targs[i] = restore-\u003eorig_args[i];\n+\t\tsyscall_set_arguments(knotif-\u003etask, target_regs, args);\n+\t\tgoto out_unlock_free;\n+\t}\n+\n+\t/*\n+\t * Mark REPLIED with FLAG_CONTINUE so the wait-loop exit path runs the\n+\t * syscall normally. 
Flag the redirect so the resume path re-validates\n+\t * the rewritten syscall against the filters outer to this one.\n+\t */\n+\tknotif-\u003estate = SECCOMP_NOTIFY_REPLIED;\n+\tknotif-\u003eerror = 0;\n+\tknotif-\u003eval = 0;\n+\tknotif-\u003eflags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;\n+\tknotif-\u003eredirect = true;\n+\tif (filter-\u003enotif-\u003eflags \u0026 SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)\n+\t\tcomplete_on_current_cpu(\u0026knotif-\u003eready);\n+\telse\n+\t\tcomplete(\u0026knotif-\u003eready);\n+\n+\tmutex_unlock(\u0026filter-\u003enotify_lock);\n+\tif (memfd_file)\n+\t\tfput(memfd_file);\n+\treturn 0;\n+\n+out_unlock_free:\n+\tmutex_unlock(\u0026filter-\u003enotify_lock);\n+out_free:\n+\tif (memfd_file)\n+\t\tfput(memfd_file);\n+\tkfree(restore);\n+\treturn ret;\n+}\n+\n static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,\n \t\t\t\t unsigned long arg)\n {\n@@ -1847,6 +2271,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,\n \tswitch (EA_IOCTL(cmd)) {\n \tcase EA_IOCTL(SECCOMP_IOCTL_NOTIF_ADDFD):\n \t\treturn seccomp_notify_addfd(filter, buf, _IOC_SIZE(cmd));\n+\tcase EA_IOCTL(SECCOMP_IOCTL_NOTIF_PIN_INSTALL):\n+\t\treturn seccomp_notify_pin_install(filter, buf,\n+\t\t\t\t\t\t _IOC_SIZE(cmd));\n+\tcase EA_IOCTL(SECCOMP_IOCTL_NOTIF_SEND_REDIRECT):\n+\t\treturn seccomp_notify_send_redirect(filter, buf,\n+\t\t\t\t\t\t _IOC_SIZE(cmd));\n \tdefault:\n \t\treturn -EINVAL;\n \t}\n@@ -1986,6 +2416,14 @@ static long seccomp_set_mode_filter(unsigned int flags,\n \t ((flags \u0026 SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))\n \t\treturn -EINVAL;\n \n+\t/*\n+\t * SECCOMP_FILTER_FLAG_REDIRECT declares intent to redirect via the\n+\t * listener notifier, so it requires a listener.\n+\t */\n+\tif ((flags \u0026 SECCOMP_FILTER_FLAG_REDIRECT) \u0026\u0026\n+\t ((flags \u0026 SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))\n+\t\treturn -EINVAL;\n+\n \t/* Prepare the new filter before holding any locks. 
*/\n \tprepared = seccomp_prepare_user_filter(filter);\n \tif (IS_ERR(prepared))\ndiff --git a/mm/internal.h b/mm/internal.h\nindex 181e79f1d6a207..3d698bccc10040 100644\n--- a/mm/internal.h\n+++ b/mm/internal.h\n@@ -1436,6 +1436,14 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,\n unsigned long, unsigned long,\n unsigned long, unsigned long);\n \n+unsigned long __do_mmap(struct mm_struct *mm, struct file *file,\n+\tunsigned long addr, unsigned long len, unsigned long prot,\n+\tunsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,\n+\tunsigned long *populate, struct list_head *uf);\n+\n+unsigned long mm_get_unmapped_area_remote(struct mm_struct *mm,\n+\t\t\t\t\t unsigned long len);\n+\n extern void set_pageblock_order(void);\n unsigned long reclaim_pages(struct list_head *folio_list);\n unsigned int reclaim_clean_pages_from_list(struct zone *zone,\ndiff --git a/mm/mmap.c b/mm/mmap.c\nindex 2311ae7c2ff45c..4328dc21272d3f 100644\n--- a/mm/mmap.c\n+++ b/mm/mmap.c\n@@ -277,7 +277,7 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,\n }\n \n /**\n- * do_mmap() - Perform a userland memory mapping into the current process\n+ * __do_mmap() - Perform a userland memory mapping into @mm's\n * address space of length @len with protection bits @prot, mmap flags @flags\n * (from which VMA flags will be inferred), and any additional VMA flags to\n * apply @vm_flags. If this is a file-backed mapping then the file is specified\n@@ -307,8 +307,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,\n * start of a VMA, rather only the start of a valid mapped range of length\n * @len bytes, rounded down to the nearest page size.\n *\n- * The caller must write-lock current-\u003emm-\u003emmap_lock.\n+ * The caller must write-lock @mm-\u003emmap_lock. do_mmap() is the common\n+ * wrapper that targets current-\u003emm.\n *\n+ * @mm: The mm_struct to install the mapping into. 
The caller must hold a\n+ * reference and write-lock its mmap_lock.\n * @file: An optional struct file pointer describing the file which is to be\n * mapped, if a file-backed mapping.\n * @addr: If non-zero, hints at (or if @flags has MAP_FIXED set, specifies) the\n@@ -333,13 +336,12 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,\n * Returns: Either an error, or the address at which the requested mapping has\n * been performed.\n */\n-unsigned long do_mmap(struct file *file, unsigned long addr,\n-\t\t\tunsigned long len, unsigned long prot,\n-\t\t\tunsigned long flags, vm_flags_t vm_flags,\n-\t\t\tunsigned long pgoff, unsigned long *populate,\n-\t\t\tstruct list_head *uf)\n+unsigned long __do_mmap(struct mm_struct *mm, struct file *file,\n+\t\t\tunsigned long addr, unsigned long len,\n+\t\t\tunsigned long prot, unsigned long flags,\n+\t\t\tvm_flags_t vm_flags, unsigned long pgoff,\n+\t\t\tunsigned long *populate, struct list_head *uf)\n {\n-\tstruct mm_struct *mm = current-\u003emm;\n \tint pkey = 0;\n \n \t*populate = 0;\n@@ -557,7 +559,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,\n \t\t\tvm_flags |= VM_NORESERVE;\n \t}\n \n-\taddr = mmap_region(file, addr, len, vm_flags, pgoff, uf);\n+\taddr = mmap_region(mm, file, addr, len, vm_flags, pgoff, uf);\n \tif (!IS_ERR_VALUE(addr) \u0026\u0026\n \t ((vm_flags \u0026 VM_LOCKED) ||\n \t (flags \u0026 (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))\n@@ -565,6 +567,15 @@ unsigned long do_mmap(struct file *file, unsigned long addr,\n \treturn addr;\n }\n \n+unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len,\n+\t\t unsigned long prot, unsigned long flags,\n+\t\t vm_flags_t vm_flags, unsigned long pgoff,\n+\t\t unsigned long *populate, struct list_head *uf)\n+{\n+\treturn __do_mmap(current-\u003emm, file, addr, len, prot, flags,\n+\t\t\t vm_flags, pgoff, populate, uf);\n+}\n+\n unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,\n 
\t\t\t unsigned long prot, unsigned long flags,\n \t\t\t unsigned long fd, unsigned long pgoff)\n@@ -809,6 +820,40 @@ unsigned long mm_get_unmapped_area_vmflags(struct file *filp, unsigned long addr\n \treturn arch_get_unmapped_area(filp, addr, len, pgoff, flags, vm_flags);\n }\n \n+/*\n+ * Find a free @len-byte area in @mm, honoring @mm's mmap layout direction.\n+ * Unlike the arch_get_unmapped_area() family, the search runs against @mm\n+ * rather than current-\u003emm, so a supervisor can place a mapping in a remote\n+ * task's address space (see vm_mmap_seal_remote()). The caller must hold\n+ * mmap_write_lock(@mm). Returns a page-aligned address or -ENOMEM.\n+ */\n+unsigned long mm_get_unmapped_area_remote(struct mm_struct *mm, unsigned long len)\n+{\n+\tstruct vm_unmapped_area_info info = {\n+\t\t.length = len,\n+\t\t.mm = mm,\n+\t};\n+\tunsigned long addr;\n+\n+\tif (mm_flags_test(MMF_TOPDOWN, mm)) {\n+\t\tinfo.flags = VM_UNMAPPED_AREA_TOPDOWN;\n+\t\tinfo.low_limit = PAGE_SIZE;\n+\t\tinfo.high_limit = arch_get_mmap_base(0, mm-\u003emmap_base);\n+\t\taddr = vm_unmapped_area(\u0026info);\n+\t\tif (!offset_in_page(addr))\n+\t\t\treturn addr;\n+\t\t/* Topdown exhausted (e.g. huge stack rlimit); retry bottom-up. 
*/\n+\t\tinfo.flags = 0;\n+\t\tinfo.low_limit = TASK_UNMAPPED_BASE;\n+\t\tinfo.high_limit = arch_get_mmap_end(0, len, 0);\n+\t\treturn vm_unmapped_area(\u0026info);\n+\t}\n+\n+\tinfo.low_limit = mm-\u003emmap_base;\n+\tinfo.high_limit = arch_get_mmap_end(0, len, 0);\n+\treturn vm_unmapped_area(\u0026info);\n+}\n+\n unsigned long\n __get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,\n \t\tunsigned long pgoff, unsigned long flags, vm_flags_t vm_flags)\ndiff --git a/mm/nommu.c b/mm/nommu.c\nindex ed3934bc2de483..7f2136129c7294 100644\n--- a/mm/nommu.c\n+++ b/mm/nommu.c\n@@ -1009,7 +1009,8 @@ static int do_mmap_private(struct vm_area_struct *vma,\n /*\n * handle mapping creation for uClinux\n */\n-unsigned long do_mmap(struct file *file,\n+unsigned long __do_mmap(struct mm_struct *mm,\n+\t\t\tstruct file *file,\n \t\t\tunsigned long addr,\n \t\t\tunsigned long len,\n \t\t\tunsigned long prot,\n@@ -1246,6 +1247,15 @@ unsigned long do_mmap(struct file *file,\n \treturn -ENOMEM;\n }\n \n+unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len,\n+\t\t unsigned long prot, unsigned long flags,\n+\t\t vm_flags_t vm_flags, unsigned long pgoff,\n+\t\t unsigned long *populate, struct list_head *uf)\n+{\n+\treturn __do_mmap(current-\u003emm, file, addr, len, prot, flags,\n+\t\t\t vm_flags, pgoff, populate, uf);\n+}\n+\n unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,\n \t\t\t unsigned long prot, unsigned long flags,\n \t\t\t unsigned long fd, unsigned long pgoff)\ndiff --git a/mm/util.c b/mm/util.c\nindex af2c2103f0d952..21568dd0e9f8b0 100644\n--- a/mm/util.c\n+++ b/mm/util.c\n@@ -588,6 +588,68 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,\n \treturn ret;\n }\n \n+/**\n+ * vm_mmap_seal_remote - install a sealed MAP_SHARED file mapping into @mm,\n+ * without target-side cooperation.\n+ * @mm: Target mm; caller holds a reference (e.g. 
get_task_mm()).\n+ * @file: Backing file.\n+ * @addr: Page-aligned address. If non-zero, MAP_FIXED_NOREPLACE is used\n+ * (-EEXIST if occupied); if zero, the kernel chooses a free area in\n+ * @mm and returns it.\n+ * @len: Length in bytes (page-aligned).\n+ * @pgoff: Page offset into @file.\n+ *\n+ * The mapping is read-only. The VMA is created VM_SEALED, so it is immediately\n+ * immutable against the target mm's owner and its CLONE_VM peers. LSM/fsnotify\n+ * hooks run against %current; cross-task authorization is the caller's\n+ * responsibility (no ptrace_may_access check).\n+ *\n+ * Returns the mapped address on success, or a negative errno.\n+ */\n+unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,\n+\tunsigned long addr, unsigned long len, unsigned long pgoff)\n+{\n+\tconst unsigned long prot = PROT_READ;\n+\tconst unsigned long flags = MAP_SHARED | MAP_FIXED_NOREPLACE;\n+\tloff_t off = (loff_t)pgoff \u003c\u003c PAGE_SHIFT;\n+\tunsigned long ret;\n+\tunsigned long populate;\n+\tLIST_HEAD(uf);\n+\n+\tif (WARN_ON_ONCE(!mm))\n+\t\treturn -EINVAL;\n+\tif (!VM_SEALED)\t\t/* sealing unavailable (e.g. 
!CONFIG_64BIT) */\n+\t\treturn -EOPNOTSUPP;\n+\n+\tret = security_mmap_file(file, prot, flags);\n+\tif (!ret)\n+\t\tret = fsnotify_mmap_perm(file, prot, off, len);\n+\tif (ret)\n+\t\treturn ret;\n+\n+\tif (mmap_write_lock_killable(mm))\n+\t\treturn -EINTR;\n+\n+\tif (!addr) {\n+\t\taddr = mm_get_unmapped_area_remote(mm, PAGE_ALIGN(len));\n+\t\tif (IS_ERR_VALUE(addr)) {\n+\t\t\tret = addr;\n+\t\t\tgoto unlock;\n+\t\t}\n+\t}\n+\tret = __do_mmap(mm, file, addr, len, prot, flags, VM_SEALED,\n+\t\t\tpgoff, \u0026populate, \u0026uf);\n+\t/*\n+\t * Do not mm_populate() against a foreign mm; the target task will\n+\t * fault pages in on first access.\n+\t */\n+unlock:\n+\tmmap_write_unlock(mm);\n+\tuserfaultfd_unmap_complete(mm, \u0026uf);\n+\treturn ret;\n+}\n+EXPORT_SYMBOL_GPL(vm_mmap_seal_remote);\n+\n /*\n * Perform a userland memory mapping into the current process address space. See\n * the comment for do_mmap() for more details on this operation in general.\ndiff --git a/mm/vma.c b/mm/vma.c\nindex 9eea2850818a85..2f9159ab5123a3 100644\n--- a/mm/vma.c\n+++ b/mm/vma.c\n@@ -2731,11 +2731,10 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)\n \treturn false;\n }\n \n-static unsigned long __mmap_region(struct file *file, unsigned long addr,\n-\t\tunsigned long len, vma_flags_t vma_flags,\n+static unsigned long __mmap_region(struct mm_struct *mm, struct file *file,\n+\t\tunsigned long addr, unsigned long len, vma_flags_t vma_flags,\n \t\tunsigned long pgoff, struct list_head *uf)\n {\n-\tstruct mm_struct *mm = current-\u003emm;\n \tstruct vm_area_struct *vma = NULL;\n \tbool have_mmap_prepare = file \u0026\u0026 file-\u003ef_op-\u003emmap_prepare;\n \tVMA_ITERATOR(vmi, mm, addr);\n@@ -2809,14 +2808,16 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,\n \n /**\n * mmap_region() - Actually perform the userland mapping of a VMA into\n- * current-\u003emm with known, aligned and overflow-checked @addr and @len, and\n+ * @mm with known, 
aligned and overflow-checked @addr and @len, and\n * correctly determined VMA flags @vm_flags and page offset @pgoff.\n *\n * This is an internal memory management function, and should not be used\n * directly.\n *\n- * The caller must write-lock current-\u003emm-\u003emmap_lock.\n+ * The caller must write-lock @mm-\u003emmap_lock.\n *\n+ * @mm: The mm_struct to install the mapping into. The caller must hold a\n+ * reference and write-lock its mmap_lock.\n * @file: If a file-backed mapping, a pointer to the struct file describing the\n * file to be mapped, otherwise NULL.\n * @addr: The page-aligned address at which to perform the mapping.\n@@ -2830,15 +2831,16 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,\n * Returns: Either an error, or the address at which the requested mapping has\n * been performed.\n */\n-unsigned long mmap_region(struct file *file, unsigned long addr,\n-\t\t\t unsigned long len, vm_flags_t vm_flags,\n-\t\t\t unsigned long pgoff, struct list_head *uf)\n+unsigned long mmap_region(struct mm_struct *mm, struct file *file,\n+\t\t\t unsigned long addr, unsigned long len,\n+\t\t\t vm_flags_t vm_flags, unsigned long pgoff,\n+\t\t\t struct list_head *uf)\n {\n \tunsigned long ret;\n \tbool writable_file_mapping = false;\n \tconst vma_flags_t vma_flags = legacy_to_vma_flags(vm_flags);\n \n-\tmmap_assert_write_locked(current-\u003emm);\n+\tmmap_assert_write_locked(mm);\n \n \t/* Check to see if MDWE is applicable. */\n \tif (map_deny_write_exec(\u0026vma_flags, \u0026vma_flags))\n@@ -2857,13 +2859,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,\n \t\twritable_file_mapping = true;\n \t}\n \n-\tret = __mmap_region(file, addr, len, vma_flags, pgoff, uf);\n+\tret = __mmap_region(mm, file, addr, len, vma_flags, pgoff, uf);\n \n \t/* Clear our write mapping regardless of error. 
*/\n \tif (writable_file_mapping)\n \t\tmapping_unmap_writable(file-\u003ef_mapping);\n \n-\tvalidate_mm(current-\u003emm);\n+\tvalidate_mm(mm);\n \treturn ret;\n }\n \n@@ -2957,8 +2959,8 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,\n \n /**\n * unmapped_area() - Find an area between the low_limit and the high_limit with\n- * the correct alignment and offset, all from @info. Note: current-\u003emm is used\n- * for the search.\n+ * the correct alignment and offset, all from @info. Note: @info-\u003emm (or\n+ * current-\u003emm when it is NULL) is used for the search.\n *\n * @info: The unmapped area information including the range [low_limit -\n * high_limit), the alignment offset and mask.\n@@ -2970,7 +2972,7 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)\n \tunsigned long length, gap;\n \tunsigned long low_limit, high_limit;\n \tstruct vm_area_struct *tmp;\n-\tVMA_ITERATOR(vmi, current-\u003emm, 0);\n+\tVMA_ITERATOR(vmi, info-\u003emm ? : current-\u003emm, 0);\n \n \t/* Adjust search length to account for worst case alignment overhead */\n \tlength = info-\u003elength + info-\u003ealign_mask + info-\u003estart_gap;\n@@ -3016,7 +3018,8 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)\n /**\n * unmapped_area_topdown() - Find an area between the low_limit and the\n * high_limit with the correct alignment and offset at the highest available\n- * address, all from @info. Note: current-\u003emm is used for the search.\n+ * address, all from @info. 
Note: @info-\u003emm (or current-\u003emm when it is NULL)\n+ * is used for the search.\n *\n * @info: The unmapped area information including the range [low_limit -\n * high_limit), the alignment offset and mask.\n@@ -3028,7 +3031,7 @@ unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)\n \tunsigned long length, gap, gap_end;\n \tunsigned long low_limit, high_limit;\n \tstruct vm_area_struct *tmp;\n-\tVMA_ITERATOR(vmi, current-\u003emm, 0);\n+\tVMA_ITERATOR(vmi, info-\u003emm ? : current-\u003emm, 0);\n \n \t/* Adjust search length to account for worst case alignment overhead */\n \tlength = info-\u003elength + info-\u003ealign_mask + info-\u003estart_gap;\ndiff --git a/mm/vma.h b/mm/vma.h\nindex 8e4b61a7304c68..4f5222ad2e9dde 100644\n--- a/mm/vma.h\n+++ b/mm/vma.h\n@@ -459,9 +459,9 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);\n int mm_take_all_locks(struct mm_struct *mm);\n void mm_drop_all_locks(struct mm_struct *mm);\n \n-unsigned long mmap_region(struct file *file, unsigned long addr,\n-\t\tunsigned long len, vm_flags_t vm_flags, unsigned long pgoff,\n-\t\tstruct list_head *uf);\n+unsigned long mmap_region(struct mm_struct *mm, struct file *file,\n+\t\tunsigned long addr, unsigned long len, vm_flags_t vm_flags,\n+\t\tunsigned long pgoff, struct list_head *uf);\n \n int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,\n \t\t unsigned long addr, unsigned long request,\ndiff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c\nindex 358b6c65e120e8..1b1eec5051980d 100644\n--- a/tools/testing/selftests/seccomp/seccomp_bpf.c\n+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c\n@@ -217,6 +217,10 @@ struct seccomp_metadata {\n #define SECCOMP_FILTER_FLAG_NEW_LISTENER\t(1UL \u003c\u003c 3)\n #endif\n \n+#ifndef SECCOMP_FILTER_FLAG_REDIRECT\n+#define SECCOMP_FILTER_FLAG_REDIRECT\t\t(1UL \u003c\u003c 6)\n+#endif\n+\n #ifndef 
SECCOMP_RET_USER_NOTIF\n #define SECCOMP_RET_USER_NOTIF 0x7fc00000U\n \n@@ -295,6 +299,35 @@ struct seccomp_notif_addfd_big {\n #define PTRACE_EVENTMSG_SYSCALL_EXIT\t2\n #endif\n \n+#ifndef SECCOMP_IOCTL_NOTIF_PIN_INSTALL\n+struct seccomp_notif_pin_install {\n+\t__u64 id;\n+\t__u32 flags;\n+\t__u32 memfd;\n+\t__u64 target_addr;\n+\t__u64 size;\n+\t__u64 offset;\n+};\n+#define SECCOMP_IOCTL_NOTIF_PIN_INSTALL\tSECCOMP_IOWR(5, \\\n+\t\t\t\t\t\tstruct seccomp_notif_pin_install)\n+#endif\n+\n+#ifndef SECCOMP_IOCTL_NOTIF_SEND_REDIRECT\n+#define SECCOMP_REDIRECT_FLAG_CONTINUE (1UL \u003c\u003c 0)\n+#define SECCOMP_REDIRECT_ARGS 6\n+struct seccomp_notif_resp_redirect {\n+\t__u64 id;\n+\t__u32 flags;\n+\t__u32 args_mask;\n+\t__u32 ptr_mask;\n+\t__u32 memfd;\n+\t__u64 args[SECCOMP_REDIRECT_ARGS];\n+\t__u64 ptr_len[SECCOMP_REDIRECT_ARGS];\n+};\n+#define SECCOMP_IOCTL_NOTIF_SEND_REDIRECT\tSECCOMP_IOW(6, \\\n+\t\t\t\t\t\tstruct seccomp_notif_resp_redirect)\n+#endif\n+\n #ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE\n #define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001\n #endif\n@@ -4368,6 +4401,1000 @@ TEST(user_notification_addfd_rlimit)\n \tclose(memfd);\n }\n \n+/*\n+ * Create a write-sealed memfd of @size for PIN_INSTALL and map a supervisor\n+ * writable view, primed with @content. F_SEAL_FUTURE_WRITE keeps this\n+ * pre-seal mapping writable (so the test can still stage content) while\n+ * barring any other writable reference, as PIN_INSTALL requires. 
Returns\n+ * the memfd.\n+ */\n+static int make_pin_memfd(struct __test_metadata *_metadata, const char *name,\n+\t\t\t size_t size, char **sup_view, const char *content)\n+{\n+\tint memfd = memfd_create(name, MFD_ALLOW_SEALING);\n+\n+\tASSERT_GE(memfd, 0);\n+\tASSERT_EQ(0, ftruncate(memfd, size));\n+\tASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW));\n+\n+\t*sup_view = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,\n+\t\t\t memfd, 0);\n+\tASSERT_NE(MAP_FAILED, *sup_view);\n+\tASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE));\n+\tmemcpy(*sup_view, content, strlen(content) + 1);\n+\treturn memfd;\n+}\n+\n+/*\n+ * Non-cooperative pinned-memfd: kernel installs a sealed PROT_READ\n+ * MAP_SHARED mapping of the supervisor's memfd directly into the\n+ * trapped task's mm. The target runs no mmap or mseal code itself —\n+ * this exercises the same kernel path that a fork+execve sandbox\n+ * supervisor would use to install a pin in the new image's fresh\n+ * post-exec mm.\n+ *\n+ * Target child does nothing but call openat() on a bait path. The\n+ * supervisor catches the trap, calls PIN_INSTALL (kernel does the\n+ * mmap + seal in target's mm via vm_mmap_seal_remote()), writes a\n+ * safe path into its own memfd view, and SEND_REDIRECTs args[1]\n+ * into the freshly installed pin. 
The child's openat resumes,\n+ * reads from the sealed pin, and returns an fd to the safe path.\n+ */\n+TEST(user_notification_pinned_memfd_remote)\n+{\n+\tpid_t pid;\n+\tlong ret;\n+\tint status, listener, memfd, unsealed;\n+\tstruct seccomp_notif req = {};\n+\tstruct seccomp_notif_pin_install pin = {};\n+\tstruct seccomp_notif_pin_install unsealed_pin = {};\n+\tstruct seccomp_notif_resp_redirect redir = {};\n+\tchar *sup_view;\n+\tconst size_t PIN_SIZE = 4096;\n+\tconst char *safe_path = \"/dev/null\";\n+\n+\tret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);\n+\tASSERT_EQ(0, ret) {\n+\t\tTH_LOG(\"Kernel does not support PR_SET_NO_NEW_PRIVS!\");\n+\t}\n+\n+\tmemfd = make_pin_memfd(_metadata, \"pinned-remote\", PIN_SIZE,\n+\t\t\t \u0026sup_view, safe_path);\n+\n+\tlistener = user_notif_syscall(__NR_openat,\n+\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER |\n+\t\t\t\t SECCOMP_FILTER_FLAG_REDIRECT);\n+\tASSERT_GE(listener, 0);\n+\n+\tpid = fork();\n+\tASSERT_GE(pid, 0);\n+\n+\tif (pid == 0) {\n+\t\tint fd;\n+\n+\t\t/*\n+\t\t * Target performs no setup. Just trap on openat. 
Kernel\n+\t\t * (driven by the supervisor) will install the pin in this\n+\t\t * process's mm at a kernel-chosen address behind our back,\n+\t\t * and our openat will be redirected to read from there.\n+\t\t */\n+\t\tfd = syscall(__NR_openat, AT_FDCWD,\n+\t\t\t \"/this/should/never/be/touched\", O_RDONLY, 0);\n+\t\tif (fd \u003c 0)\n+\t\t\t_exit(11);\n+\t\t_exit(0);\n+\t}\n+\n+\tASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, \u0026req));\n+\tEXPECT_EQ(req.data.nr, __NR_openat);\n+\n+\tpin.id = req.id;\n+\tpin.memfd = memfd;\n+\tpin.target_addr = 0;\n+\tpin.size = PIN_SIZE;\n+\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, \u0026pin)) {\n+\t\tif (errno == EINVAL) {\n+\t\t\tSKIP(goto cleanup,\n+\t\t\t \"Kernel does not support pinned-memfd remote install\");\n+\t\t}\n+\t\tTH_LOG(\"PIN_INSTALL failed: errno=%d\", errno);\n+\t}\n+\n+\t/* The kernel wrote a non-zero, page-aligned address back to us. */\n+\tEXPECT_NE(0, pin.target_addr);\n+\tEXPECT_EQ(0, pin.target_addr \u0026 (PIN_SIZE - 1));\n+\n+\t/* Reject: the backing memfd must be write-sealed. */\n+\tunsealed = memfd_create(\"unsealed\", MFD_ALLOW_SEALING);\n+\tASSERT_GE(unsealed, 0);\n+\tASSERT_EQ(0, ftruncate(unsealed, PIN_SIZE));\n+\tunsealed_pin.id = req.id;\n+\tunsealed_pin.memfd = unsealed;\n+\tunsealed_pin.size = PIN_SIZE;\n+\tEXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL,\n+\t\t\t \u0026unsealed_pin));\n+\tEXPECT_EQ(EINVAL, errno);\n+\tclose(unsealed);\n+\n+\t/* Reject: redirect outside any installed pin. 
*/\n+\tredir.id = req.id;\n+\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\tredir.args_mask = 1U \u003c\u003c 1;\n+\tredir.ptr_mask = 1U \u003c\u003c 1;\n+\tredir.memfd = memfd;\n+\tredir.ptr_len[1] = strlen(safe_path) + 1;\n+\tredir.args[1] = pin.target_addr + PIN_SIZE;\t/* one byte past */\n+\tEXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t \u0026redir));\n+\tEXPECT_EQ(EFAULT, errno);\n+\n+\t/* Reject: base is inside the pin but the extent runs past its end. */\n+\tredir.args[1] = pin.target_addr;\n+\tredir.ptr_len[1] = PIN_SIZE + 1;\n+\tEXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t \u0026redir));\n+\tEXPECT_EQ(EFAULT, errno);\n+\n+\t/* Happy path: redirect into the kernel-installed pin. */\n+\tredir.args[1] = pin.target_addr;\n+\tredir.ptr_len[1] = strlen(safe_path) + 1;\n+\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t \u0026redir));\n+\n+\tEXPECT_EQ(waitpid(pid, \u0026status, 0), pid);\n+\tEXPECT_EQ(true, WIFEXITED(status));\n+\tEXPECT_EQ(0, WEXITSTATUS(status)) {\n+\t\tTH_LOG(\"child exit %d (11=openat fail)\", WEXITSTATUS(status));\n+\t}\n+\n+cleanup:\n+\tmunmap(sup_view, PIN_SIZE);\n+\tclose(memfd);\n+\tclose(listener);\n+}\n+\n+/*\n+ * Helper for the execve test: read up to @max bytes of a NUL-terminated\n+ * string from @pid's mm at @addr into @out. 
Returns the length read\n+ * (excluding the NUL), or -1 on failure or no NUL.\n+ */\n+static ssize_t read_remote_string(pid_t pid, unsigned long addr,\n+\t\t\t\t char *out, size_t max)\n+{\n+\tstruct iovec local = { .iov_base = out, .iov_len = max };\n+\tstruct iovec remote = { .iov_base = (void *)addr, .iov_len = max };\n+\tssize_t n;\n+\tsize_t i;\n+\n+\tn = process_vm_readv(pid, \u0026local, 1, \u0026remote, 1, 0);\n+\tif (n \u003c= 0)\n+\t\treturn -1;\n+\tfor (i = 0; i \u003c (size_t)n; i++)\n+\t\tif (out[i] == '\\0')\n+\t\t\treturn (ssize_t)i;\n+\treturn -1;\n+}\n+\n+/*\n+ * Send a file descriptor over a connected UNIX socket via SCM_RIGHTS.\n+ * Used by the execve_scm test so the target child can hand its\n+ * SECCOMP_FILTER_FLAG_NEW_LISTENER fd to the supervising parent\n+ * without the parent having to inherit the seccomp filter itself.\n+ */\n+static int send_fd(int sock, int fd)\n+{\n+\tchar cbuf[CMSG_SPACE(sizeof(int))] = {};\n+\tchar data = 'x';\n+\tstruct iovec iov = { .iov_base = \u0026data, .iov_len = 1 };\n+\tstruct msghdr msg = {\n+\t\t.msg_iov = \u0026iov, .msg_iovlen = 1,\n+\t\t.msg_control = cbuf, .msg_controllen = sizeof(cbuf),\n+\t};\n+\tstruct cmsghdr *cmsg = CMSG_FIRSTHDR(\u0026msg);\n+\n+\tcmsg-\u003ecmsg_level = SOL_SOCKET;\n+\tcmsg-\u003ecmsg_type = SCM_RIGHTS;\n+\tcmsg-\u003ecmsg_len = CMSG_LEN(sizeof(int));\n+\tmemcpy(CMSG_DATA(cmsg), \u0026fd, sizeof(int));\n+\treturn sendmsg(sock, \u0026msg, 0) \u003c 0 ? 
-1 : 0;\n+}\n+\n+static int recv_fd(int sock)\n+{\n+\tchar cbuf[CMSG_SPACE(sizeof(int))] = {};\n+\tchar data;\n+\tstruct iovec iov = { .iov_base = \u0026data, .iov_len = 1 };\n+\tstruct msghdr msg = {\n+\t\t.msg_iov = \u0026iov, .msg_iovlen = 1,\n+\t\t.msg_control = cbuf, .msg_controllen = sizeof(cbuf),\n+\t};\n+\tstruct cmsghdr *cmsg;\n+\tint fd;\n+\n+\tif (recvmsg(sock, \u0026msg, 0) \u003c 0)\n+\t\treturn -1;\n+\tcmsg = CMSG_FIRSTHDR(\u0026msg);\n+\tif (!cmsg || cmsg-\u003ecmsg_level != SOL_SOCKET ||\n+\t cmsg-\u003ecmsg_type != SCM_RIGHTS ||\n+\t cmsg-\u003ecmsg_len != CMSG_LEN(sizeof(int)))\n+\t\treturn -1;\n+\tmemcpy(\u0026fd, CMSG_DATA(cmsg), sizeof(int));\n+\treturn fd;\n+}\n+\n+struct addr_range {\n+\tunsigned long start, end;\n+};\n+\n+/*\n+ * Parse /proc/\u003cpid\u003e/maps looking for the dynamic linker's executable\n+ * mapping (glibc ld-linux-*.so, musl ld-musl-*.so, etc.). The trapped\n+ * task's instruction_pointer falling in this range identifies a\n+ * loader-bootstrap syscall (race-free, kernel-truth) so the supervisor\n+ * can auto-allow it without inspecting argument content via the racy\n+ * process_vm_readv path.\n+ *\n+ * Requires the supervisor not to be subject to the seccomp filter\n+ * itself -- fopen() internally calls openat(). 
The execve_scm test\n+ * structure (child installs filter, sends listener fd to parent via\n+ * SCM_RIGHTS) satisfies that.\n+ *\n+ * Returns 0 on success with @out populated, -1 if not found.\n+ */\n+static int find_loader_text_range(pid_t pid, struct addr_range *out)\n+{\n+\tchar maps_path[64];\n+\tchar line[512];\n+\tFILE *f;\n+\tint found = 0;\n+\n+\tsnprintf(maps_path, sizeof(maps_path), \"/proc/%d/maps\", pid);\n+\tf = fopen(maps_path, \"r\");\n+\tif (!f)\n+\t\treturn -1;\n+\n+\twhile (fgets(line, sizeof(line), f)) {\n+\t\tunsigned long start, end;\n+\t\tchar perms[8];\n+\t\tchar *path;\n+\n+\t\tif (sscanf(line, \"%lx-%lx %7s\", \u0026start, \u0026end, perms) != 3)\n+\t\t\tcontinue;\n+\t\tif (!strchr(perms, 'x'))\n+\t\t\tcontinue;\n+\t\tpath = strchr(line, '/');\n+\t\tif (!path)\n+\t\t\tcontinue;\n+\t\t/*\n+\t\t * Match common dynamic-linker basenames: ld-linux-*.so\n+\t\t * (glibc), ld-musl-*.so (musl), ld-*.so (older glibc).\n+\t\t */\n+\t\tif (strstr(path, \"/ld-\") || strstr(path, \"/ld.so\")) {\n+\t\t\tout-\u003estart = start;\n+\t\t\tout-\u003eend = end;\n+\t\t\tfound = 1;\n+\t\t\tbreak;\n+\t\t}\n+\t}\n+\tfclose(f);\n+\treturn found ? 0 : -1;\n+}\n+\n+/*\n+ * Non-cooperative pinned-memfd across a real execve, using the proper\n+ * supervisor-isolation pattern: the child (target) installs the seccomp\n+ * filter on itself and sends its listener fd to the parent (supervisor)\n+ * via SCM_RIGHTS over a socketpair. The parent therefore does not carry\n+ * the seccomp filter and can freely call openat() -- which is what makes\n+ * the race-free, kernel-truth loader detection (req.data.instruction_pointer\n+ * + /proc/\u003cpid\u003e/maps) actually usable.\n+ *\n+ * Phase 1: child does a pre-execve openat; the supervisor PIN_INSTALLs and\n+ * SEND_REDIRECTs. Phase 2: child execve's, so the pre-execve pin VMA dies\n+ * with the old mm. 
Phase 3: in the fresh post-execve mm the supervisor\n+ * PIN_INSTALLs again (idempotent replace of the stale bookkeeping) and\n+ * SEND_REDIRECTs, proving the full redirect mechanism survives an mm\n+ * replacement, not just the install side.\n+ */\n+TEST(user_notification_pinned_memfd_execve_scm)\n+{\n+\tpid_t pid;\n+\tint status, listener, memfd, sv[2];\n+\tstruct seccomp_notif req = {};\n+\tstruct seccomp_notif_pin_install pin = {};\n+\tstruct seccomp_notif_resp_redirect redir = {};\n+\tstruct seccomp_notif_resp cont_resp = {};\n+\tchar *sup_view;\n+\tconst size_t PIN_SIZE = 4096;\n+\tconst char *safe_path = \"/dev/null\";\n+\tconst char *bait = \"/seccomp_pinned_memfd_test_bait_scm\";\n+\tbool post_exec_install_ok = false;\n+\tbool post_exec_redirect_done = false;\n+\tbool loader_known = false;\n+\tbool loader_check_attempted = false;\n+\tstruct addr_range loader_range = {};\n+\tint phase = 0;\n+\tint trap_count = 0;\n+\tconst int trap_limit = 200;\n+\n+\tif (access(\"/bin/cat\", X_OK) != 0)\n+\t\tSKIP(return, \"/bin/cat not present\");\n+\n+\tmemfd = make_pin_memfd(_metadata, \"pin-execve-scm\", PIN_SIZE,\n+\t\t\t \u0026sup_view, safe_path);\n+\n+\tASSERT_EQ(0, socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv));\n+\n+\tpid = fork();\n+\tASSERT_GE(pid, 0);\n+\n+\tif (pid == 0) {\n+\t\tstruct sock_filter filter[] = {\n+\t\t\tBPF_STMT(BPF_LD | BPF_W | BPF_ABS,\n+\t\t\t\t offsetof(struct seccomp_data, nr)),\n+\t\t\tBPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat,\n+\t\t\t\t 0, 1),\n+\t\t\tBPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),\n+\t\t\tBPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),\n+\t\t};\n+\t\tstruct sock_fprog prog = {\n+\t\t\t.len = (unsigned short)ARRAY_SIZE(filter),\n+\t\t\t.filter = filter,\n+\t\t};\n+\t\tint my_listener;\n+\t\tint fd;\n+\n+\t\tclose(sv[0]);\n+\t\tif (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))\n+\t\t\t_exit(20);\n+\t\tmy_listener = seccomp(SECCOMP_SET_MODE_FILTER,\n+\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER |\n+\t\t\t\t 
SECCOMP_FILTER_FLAG_REDIRECT,\n+\t\t\t\t \u0026prog);\n+\t\tif (my_listener \u003c 0)\n+\t\t\t_exit(21);\n+\t\tif (send_fd(sv[1], my_listener) \u003c 0)\n+\t\t\t_exit(22);\n+\t\tclose(my_listener);\n+\t\tclose(sv[1]);\n+\n+\t\t/* Pre-execve trap. */\n+\t\tfd = syscall(__NR_openat, AT_FDCWD,\n+\t\t\t \"/this/should/never/be/touched\", O_RDONLY, 0);\n+\t\tif (fd \u003c 0)\n+\t\t\t_exit(11);\n+\n+\t\texecl(\"/bin/cat\", \"cat\", bait, (char *)NULL);\n+\t\t_exit(12);\n+\t}\n+\n+\tclose(sv[1]);\n+\tlistener = recv_fd(sv[0]);\n+\tclose(sv[0]);\n+\tASSERT_GE(listener, 0);\n+\n+\t/*\n+\t * Parent has the listener fd and does NOT have the seccomp\n+\t * filter. fopen(/proc/\u003cpid\u003e/maps) below works without\n+\t * deadlocking on the parent's own openat.\n+\t */\n+\tfor (;;) {\n+\t\tstruct pollfd pfd = { .fd = listener, .events = POLLIN };\n+\t\tint pret = poll(\u0026pfd, 1, 500);\n+\t\tpid_t reaped;\n+\t\tbool ip_in_loader;\n+\n+\t\tif (pret \u003c 0)\n+\t\t\tbreak;\n+\t\tif (pret == 0 || !(pfd.revents \u0026 POLLIN)) {\n+\t\t\treaped = waitpid(pid, \u0026status, WNOHANG);\n+\t\t\tif (reaped == pid)\n+\t\t\t\tbreak;\n+\t\t\tif (pfd.revents \u0026 (POLLHUP | POLLERR))\n+\t\t\t\tbreak;\n+\t\t\tcontinue;\n+\t\t}\n+\n+\t\tmemset(\u0026req, 0, sizeof(req));\n+\t\tif (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, \u0026req) \u003c 0) {\n+\t\t\tTH_LOG(\"NOTIF_RECV failed: errno=%d\", errno);\n+\t\t\tbreak;\n+\t\t}\n+\t\tif (++trap_count \u003e trap_limit) {\n+\t\t\tTH_LOG(\"trap_limit (%d) exceeded\", trap_limit);\n+\t\t\tbreak;\n+\t\t}\n+\n+\t\tif (phase == 0) {\n+\t\t\tpin.id = req.id;\n+\t\t\tpin.memfd = memfd;\n+\t\t\tpin.target_addr = 0;\n+\t\t\tpin.size = PIN_SIZE;\n+\t\t\tif (ioctl(listener,\n+\t\t\t\t SECCOMP_IOCTL_NOTIF_PIN_INSTALL,\n+\t\t\t\t \u0026pin) != 0) {\n+\t\t\t\tTH_LOG(\"pre-exec PIN_INSTALL failed: errno=%d\",\n+\t\t\t\t errno);\n+\t\t\t\tif (errno == EINVAL)\n+\t\t\t\t\tSKIP(goto cleanup_scm,\n+\t\t\t\t\t \"Kernel lacks pinned-memfd 
remote\");\n+\t\t\t\tgoto cleanup_scm;\n+\t\t\t}\n+\n+\t\t\tmemset(\u0026redir, 0, sizeof(redir));\n+\t\t\tredir.id = req.id;\n+\t\t\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\t\t\tredir.args_mask = 1U \u003c\u003c 1;\n+\t\t\tredir.ptr_mask = 1U \u003c\u003c 1;\n+\t\t\tredir.memfd = memfd;\n+\t\t\tredir.ptr_len[1] = strlen(safe_path) + 1;\n+\t\t\tredir.args[1] = pin.target_addr;\n+\t\t\tif (ioctl(listener,\n+\t\t\t\t SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t\t \u0026redir) != 0) {\n+\t\t\t\tTH_LOG(\"pre-exec SEND_REDIRECT failed: errno=%d\",\n+\t\t\t\t errno);\n+\t\t\t\tgoto cleanup_scm;\n+\t\t\t}\n+\t\t\tphase = 1;\n+\t\t\tcontinue;\n+\t\t}\n+\n+\t\t/*\n+\t\t * Post-execve. Lazily resolve the loader range. The\n+\t\t * supervisor's own openat (fopen on /proc/\u003cpid\u003e/maps)\n+\t\t * doesn't trap because the filter lives on the child,\n+\t\t * not on us.\n+\t\t */\n+\t\tif (!loader_known \u0026\u0026 !loader_check_attempted) {\n+\t\t\tif (find_loader_text_range(req.pid,\n+\t\t\t\t\t\t \u0026loader_range) == 0)\n+\t\t\t\tloader_known = true;\n+\t\t\tloader_check_attempted = true;\n+\t\t}\n+\n+\t\tip_in_loader = loader_known \u0026\u0026\n+\t\t\treq.data.instruction_pointer \u003e= loader_range.start \u0026\u0026\n+\t\t\treq.data.instruction_pointer \u003c loader_range.end;\n+\n+\t\tif (ip_in_loader) {\n+\t\t\tmemset(\u0026cont_resp, 0, sizeof(cont_resp));\n+\t\t\tcont_resp.id = req.id;\n+\t\t\tcont_resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;\n+\t\t\tioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, \u0026cont_resp);\n+\t\t\tcontinue;\n+\t\t}\n+\n+\t\t/* Program code: inspect the path to identify the bait. 
*/\n+\t\t{\n+\t\t\tchar path[PATH_MAX];\n+\t\t\tssize_t n;\n+\n+\t\t\tn = read_remote_string(req.pid, req.data.args[1],\n+\t\t\t\t\t path, sizeof(path));\n+\t\t\tif (n \u003c 0 || strcmp(path, bait) != 0) {\n+\t\t\t\tmemset(\u0026cont_resp, 0, sizeof(cont_resp));\n+\t\t\t\tcont_resp.id = req.id;\n+\t\t\t\tcont_resp.flags =\n+\t\t\t\t\tSECCOMP_USER_NOTIF_FLAG_CONTINUE;\n+\t\t\t\tioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,\n+\t\t\t\t \u0026cont_resp);\n+\t\t\t\tcontinue;\n+\t\t\t}\n+\n+\t\t\tpin.id = req.id;\n+\t\t\tpin.memfd = memfd;\n+\t\t\tpin.target_addr = 0;\n+\t\t\tpin.size = PIN_SIZE;\n+\t\t\tif (ioctl(listener,\n+\t\t\t\t SECCOMP_IOCTL_NOTIF_PIN_INSTALL,\n+\t\t\t\t \u0026pin) == 0) {\n+\t\t\t\tpost_exec_install_ok = true;\n+\t\t\t} else {\n+\t\t\t\tTH_LOG(\"post-exec PIN_INSTALL failed: errno=%d\",\n+\t\t\t\t errno);\n+\t\t\t\tmemset(\u0026cont_resp, 0, sizeof(cont_resp));\n+\t\t\t\tcont_resp.id = req.id;\n+\t\t\t\tcont_resp.flags =\n+\t\t\t\t\tSECCOMP_USER_NOTIF_FLAG_CONTINUE;\n+\t\t\t\tioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,\n+\t\t\t\t \u0026cont_resp);\n+\t\t\t\tcontinue;\n+\t\t\t}\n+\n+\t\t\tmemset(\u0026redir, 0, sizeof(redir));\n+\t\t\tredir.id = req.id;\n+\t\t\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\t\t\tredir.args_mask = 1U \u003c\u003c 1;\n+\t\t\tredir.ptr_mask = 1U \u003c\u003c 1;\n+\t\t\tredir.memfd = memfd;\n+\t\t\tredir.ptr_len[1] = strlen(safe_path) + 1;\n+\t\t\tredir.args[1] = pin.target_addr;\n+\t\t\tif (ioctl(listener,\n+\t\t\t\t SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t\t \u0026redir) == 0) {\n+\t\t\t\tpost_exec_redirect_done = true;\n+\t\t\t} else {\n+\t\t\t\tTH_LOG(\"post-exec SEND_REDIRECT failed: errno=%d\",\n+\t\t\t\t errno);\n+\t\t\t\tmemset(\u0026cont_resp, 0, sizeof(cont_resp));\n+\t\t\t\tcont_resp.id = req.id;\n+\t\t\t\tcont_resp.flags =\n+\t\t\t\t\tSECCOMP_USER_NOTIF_FLAG_CONTINUE;\n+\t\t\t\tioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,\n+\t\t\t\t \u0026cont_resp);\n+\t\t\t}\n+\t\t}\n+\t}\n+\n+\tif (waitpid(pid, 
\u0026status, WNOHANG) == 0) {\n+\t\tkill(pid, SIGKILL);\n+\t\twaitpid(pid, \u0026status, 0);\n+\t}\n+\tEXPECT_EQ(true, loader_known) {\n+\t\tTH_LOG(\"find_loader_text_range never resolved\");\n+\t}\n+\tEXPECT_EQ(true, post_exec_install_ok);\n+\tEXPECT_EQ(true, post_exec_redirect_done);\n+\tEXPECT_EQ(true, WIFEXITED(status));\n+\tEXPECT_EQ(0, WEXITSTATUS(status));\n+\n+cleanup_scm:\n+\tmunmap(sup_view, PIN_SIZE);\n+\tclose(memfd);\n+\tclose(listener);\n+}\n+\n+/*\n+ * Stateless redirect validation must hold up across many short-lived\n+ * targets over one listener, and must not accumulate per-target state.\n+ *\n+ * PIN_INSTALL records nothing: the installed VM_SEALED VMA is the only\n+ * record, and SEND_REDIRECT re-validates the pointer against the live\n+ * mapping (sealed, read-only, backed by the supervisor's memfd inode).\n+ * So a supervisor servicing a long churn of targets keeps working with\n+ * no bookkeeping to leak. Each iteration lets the kernel choose the pin\n+ * address in the fresh target mm; every install/redirect must succeed, and\n+ * kmemleak/KASAN over the loop confirms nothing accumulates.\n+ */\n+TEST(user_notification_pinned_memfd_churn)\n+{\n+\tconst size_t PIN_SIZE = 4096;\n+\tconst char *safe_path = \"/dev/null\";\n+\tconst int iters = 16;\n+\tint listener, memfd, i;\n+\tchar *sup_view;\n+\tlong ret;\n+\n+\tret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);\n+\tASSERT_EQ(0, ret) {\n+\t\tTH_LOG(\"Kernel does not support PR_SET_NO_NEW_PRIVS!\");\n+\t}\n+\n+\tmemfd = make_pin_memfd(_metadata, \"pinned-reap\", PIN_SIZE,\n+\t\t\t \u0026sup_view, safe_path);\n+\n+\tlistener = user_notif_syscall(__NR_openat,\n+\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER |\n+\t\t\t\t SECCOMP_FILTER_FLAG_REDIRECT);\n+\tASSERT_GE(listener, 0);\n+\n+\tfor (i = 0; i \u003c iters; i++) {\n+\t\tstruct seccomp_notif req = {};\n+\t\tstruct seccomp_notif_pin_install pin = {};\n+\t\tstruct seccomp_notif_resp_redirect redir = {};\n+\t\tint status;\n+\t\tpid_t pid;\n+\n+\t\tpid = 
fork();\n+\t\tASSERT_GE(pid, 0);\n+\t\tif (pid == 0) {\n+\t\t\tint fd = syscall(__NR_openat, AT_FDCWD,\n+\t\t\t\t\t \"/never/touched\", O_RDONLY, 0);\n+\t\t\t_exit(fd \u003c 0 ? 11 : 0);\n+\t\t}\n+\n+\t\tASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, \u0026req));\n+\t\tEXPECT_EQ(req.data.nr, __NR_openat);\n+\n+\t\tpin.id = req.id;\n+\t\tpin.memfd = memfd;\n+\t\tpin.target_addr = 0;\n+\t\tpin.size = PIN_SIZE;\n+\t\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL,\n+\t\t\t\t \u0026pin)) {\n+\t\t\tif (errno == EINVAL) {\n+\t\t\t\tkill(pid, SIGKILL);\n+\t\t\t\twaitpid(pid, \u0026status, 0);\n+\t\t\t\tSKIP(goto cleanup,\n+\t\t\t\t \"Kernel lacks pinned-memfd remote install\");\n+\t\t\t}\n+\t\t\tTH_LOG(\"iter %d PIN_INSTALL failed: errno=%d\", i, errno);\n+\t\t}\n+\n+\t\tredir.id = req.id;\n+\t\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\t\tredir.args_mask = 1U \u003c\u003c 1;\n+\t\tredir.ptr_mask = 1U \u003c\u003c 1;\n+\t\tredir.memfd = memfd;\n+\t\tredir.ptr_len[1] = strlen(safe_path) + 1;\n+\t\tredir.args[1] = pin.target_addr;\n+\t\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t\t \u0026redir));\n+\n+\t\tEXPECT_EQ(waitpid(pid, \u0026status, 0), pid);\n+\t\tEXPECT_EQ(true, WIFEXITED(status));\n+\t\tEXPECT_EQ(0, WEXITSTATUS(status)) {\n+\t\t\tTH_LOG(\"iter %d child exit %d (11=openat fail)\",\n+\t\t\t i, WEXITSTATUS(status));\n+\t\t}\n+\t\t/*\n+\t\t * Target is dead now; its pin (this iter's mm, at the\n+\t\t * kernel-chosen address) is stale. The next iteration's\n+\t\t * PIN_INSTALL walk must reap it rather than leak the range +\n+\t\t * mm + memfd reference.\n+\t\t */\n+\t}\n+\n+cleanup:\n+\tmunmap(sup_view, PIN_SIZE);\n+\tclose(memfd);\n+\tclose(listener);\n+}\n+\n+#ifdef __NR_socket\n+/*\n+ * A redirect must not let an inner (more recently installed) filter's\n+ * notifier smuggle a syscall past an outer filter. Two filters are\n+ * stacked on the target:\n+ *\n+ * outer (installed first): socket(AF_INET, ...) 
-\u003e RET_ERRNO(EACCES),\n+ * everything else ALLOW.\n+ * inner (installed second): socket -\u003e RET_USER_NOTIF.\n+ *\n+ * The child calls socket(AF_UNIX, ...), which the outer filter allows, so\n+ * the inner notifier wins and fires. The supervisor SEND_REDIRECTs arg0\n+ * to AF_INET. The kernel must then re-run the outer filter against the\n+ * rewritten registers and block it with EACCES; without the outer-suffix\n+ * re-validation the inner filter would have bypassed the outer policy.\n+ */\n+TEST(user_notification_redirect_outer_refilter)\n+{\n+\tstruct sock_filter outer_filter[] = {\n+\t\tBPF_STMT(BPF_LD | BPF_W | BPF_ABS,\n+\t\t\t offsetof(struct seccomp_data, nr)),\n+\t\tBPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),\n+\t\tBPF_STMT(BPF_LD | BPF_W | BPF_ABS, syscall_arg(0)),\n+\t\tBPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 0, 1),\n+\t\tBPF_STMT(BPF_RET | BPF_K,\n+\t\t\t SECCOMP_RET_ERRNO | (EACCES \u0026 SECCOMP_RET_DATA)),\n+\t\tBPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),\n+\t};\n+\tstruct sock_fprog outer_prog = {\n+\t\t.len = (unsigned short)ARRAY_SIZE(outer_filter),\n+\t\t.filter = outer_filter,\n+\t};\n+\tstruct seccomp_notif req = {};\n+\tstruct seccomp_notif_resp_redirect redir = {};\n+\tint status, listener;\n+\tpid_t pid;\n+\tlong ret;\n+\n+\tret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);\n+\tASSERT_EQ(0, ret) {\n+\t\tTH_LOG(\"Kernel does not support PR_SET_NO_NEW_PRIVS!\");\n+\t}\n+\n+\t/* Outer filter first =\u003e it becomes the outer/root of the stack. */\n+\tASSERT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER, 0, \u0026outer_prog));\n+\n+\t/* Inner USER_NOTIF filter second (innermost); returns the listener. 
*/\n+\tlistener = user_notif_syscall(__NR_socket,\n+\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER |\n+\t\t\t\t SECCOMP_FILTER_FLAG_REDIRECT);\n+\tASSERT_GE(listener, 0);\n+\n+\tpid = fork();\n+\tASSERT_GE(pid, 0);\n+\n+\tif (pid == 0) {\n+\t\tint fd = syscall(__NR_socket, AF_UNIX, SOCK_STREAM, 0);\n+\n+\t\tif (fd \u003e= 0)\n+\t\t\t_exit(12);\t/* bypass: outer filter was skipped */\n+\t\tif (errno != EACCES)\n+\t\t\t_exit(13);\t/* unexpected errno */\n+\t\t_exit(0);\n+\t}\n+\n+\tASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, \u0026req));\n+\tEXPECT_EQ(req.data.nr, __NR_socket);\n+\tEXPECT_EQ(req.data.args[0], AF_UNIX);\n+\n+\t/* Scalar redirect of arg0 (no pin needed): AF_UNIX -\u003e AF_INET. */\n+\tredir.id = req.id;\n+\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\tredir.args_mask = 1U \u003c\u003c 0;\n+\tredir.args[0] = AF_INET;\n+\tret = ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT, \u0026redir);\n+\tif (ret \u003c 0 \u0026\u0026 errno == EINVAL) {\n+\t\tkill(pid, SIGKILL);\n+\t\twaitpid(pid, \u0026status, 0);\n+\t\tSKIP(return, \"Kernel lacks SECCOMP_IOCTL_NOTIF_SEND_REDIRECT\");\n+\t}\n+\tEXPECT_EQ(0, ret);\n+\n+\tEXPECT_EQ(waitpid(pid, \u0026status, 0), pid);\n+\tEXPECT_EQ(true, WIFEXITED(status));\n+\tEXPECT_EQ(0, WEXITSTATUS(status)) {\n+\t\tswitch (WEXITSTATUS(status)) {\n+\t\tcase 12:\n+\t\t\tTH_LOG(\"child exit 12: redirect bypassed the outer filter\");\n+\t\t\tbreak;\n+\t\tcase 13:\n+\t\t\tTH_LOG(\"child exit 13: socket failed with unexpected errno\");\n+\t\t\tbreak;\n+\t\tdefault:\n+\t\t\tTH_LOG(\"child exit %d (unexpected)\", WEXITSTATUS(status));\n+\t\t}\n+\t}\n+\n+\tclose(listener);\n+}\n+#endif /* __NR_socket */\n+\n+#ifdef __x86_64__\n+/*\n+ * Load-bearing ABI check: after SEND_REDIRECT, the trapped task's\n+ * redirected arg register must be restored to its original value\n+ * before user-mode code resumes. 
The kernel's restore mechanism\n+ * (task_work_add(TWA_SIGNAL) -\u003e seccomp_redirect_restore_cb) is\n+ * what guarantees this; without a test the property is just an\n+ * assertion. Bypass libc's syscall() wrapper (which caller-saves\n+ * arg values and would mask a restore bug) and capture the actual\n+ * arg register immediately after the SYSCALL instruction.\n+ *\n+ * The child issues openat with RSI = sentinel_path. The supervisor\n+ * SEND_REDIRECTs args[1] (RSI) to point into the pin. The kernel:\n+ * - saves the original RSI into the knotif\n+ * - writes the pin address into RSI via syscall_set_arguments()\n+ * - runs the syscall (kernel reads path from the pin)\n+ * - on syscall_exit_to_user_mode, fires task_work which calls\n+ * syscall_set_arguments() again with the saved original\n+ * - returns to user mode\n+ *\n+ * If task_work fires correctly, the child observes RSI == sentinel.\n+ * If broken, RSI holds the pin address (the redirected value the\n+ * kernel left in pt_regs).\n+ */\n+TEST(user_notification_pinned_memfd_abi)\n+{\n+\tpid_t pid;\n+\tlong ret;\n+\tint status, listener, memfd;\n+\tstruct seccomp_notif req = {};\n+\tstruct seccomp_notif_pin_install pin = {};\n+\tstruct seccomp_notif_resp_redirect redir = {};\n+\tchar *sup_view;\n+\tconst size_t PIN_SIZE = 4096;\n+\tconst char *safe_path = \"/dev/null\";\n+\t/*\n+\t * The \"sentinel\" is a real string the child can also pass as\n+\t * the openat path. 
Its address is captured pre-syscall as RSI;\n+\t * post-syscall RSI must equal the same address.\n+\t */\n+\tstatic const char sentinel_path[] = \"/seccomp_abi_sentinel\";\n+\n+\tret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);\n+\tASSERT_EQ(0, ret) {\n+\t\tTH_LOG(\"Kernel does not support PR_SET_NO_NEW_PRIVS!\");\n+\t}\n+\n+\tmemfd = make_pin_memfd(_metadata, \"pin-abi\", PIN_SIZE,\n+\t\t\t \u0026sup_view, safe_path);\n+\n+\tlistener = user_notif_syscall(__NR_openat,\n+\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER |\n+\t\t\t\t SECCOMP_FILTER_FLAG_REDIRECT);\n+\tASSERT_GE(listener, 0);\n+\n+\tpid = fork();\n+\tASSERT_GE(pid, 0);\n+\n+\tif (pid == 0) {\n+\t\tregister long r10_val asm(\"r10\") = 0;\n+\t\tunsigned long rsi_after;\n+\t\tlong fd;\n+\n+\t\tasm volatile(\n+\t\t\t\"syscall\\n\\t\"\n+\t\t\t\"mov %%rsi, %[after]\"\n+\t\t\t: \"=a\"(fd), [after] \"=\u0026r\"(rsi_after)\n+\t\t\t: \"0\"((long)__NR_openat),\n+\t\t\t \"D\"((long)AT_FDCWD),\n+\t\t\t \"S\"((unsigned long)sentinel_path),\n+\t\t\t \"d\"((long)O_RDONLY),\n+\t\t\t \"r\"(r10_val)\n+\t\t\t: \"rcx\", \"r11\", \"memory\"\n+\t\t);\n+\n+\t\tif (fd \u003c 0)\n+\t\t\t_exit(11);\n+\t\t/*\n+\t\t * Load-bearing check: RSI immediately post-SYSCALL must\n+\t\t * still be the sentinel pointer the child passed in. 
The\n+\t\t * kernel's REDIRECT-then-restore mechanism is the only\n+\t\t * thing that guarantees this; a broken restore would leave\n+\t\t * the pin address in RSI.\n+\t\t */\n+\t\tif (rsi_after != (unsigned long)sentinel_path)\n+\t\t\t_exit(12);\n+\t\t_exit(0);\n+\t}\n+\n+\tASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, \u0026req));\n+\tEXPECT_EQ(req.data.nr, __NR_openat);\n+\tEXPECT_EQ(req.data.args[1], (unsigned long)sentinel_path);\n+\n+\tpin.id = req.id;\n+\tpin.memfd = memfd;\n+\tpin.target_addr = 0;\n+\tpin.size = PIN_SIZE;\n+\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, \u0026pin)) {\n+\t\tif (errno == EINVAL)\n+\t\t\tSKIP(goto cleanup,\n+\t\t\t \"Kernel lacks pinned-memfd remote install\");\n+\t}\n+\n+\tredir.id = req.id;\n+\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\tredir.args_mask = 1U \u003c\u003c 1;\n+\tredir.ptr_mask = 1U \u003c\u003c 1;\n+\tredir.memfd = memfd;\n+\tredir.ptr_len[1] = strlen(safe_path) + 1;\n+\tredir.args[1] = pin.target_addr;\n+\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t \u0026redir));\n+\n+\tEXPECT_EQ(waitpid(pid, \u0026status, 0), pid);\n+\tEXPECT_EQ(true, WIFEXITED(status));\n+\tEXPECT_EQ(0, WEXITSTATUS(status)) {\n+\t\tswitch (WEXITSTATUS(status)) {\n+\t\tcase 11:\n+\t\t\tTH_LOG(\"child exit 11: openat returned -errno\");\n+\t\t\tbreak;\n+\t\tcase 12:\n+\t\t\tTH_LOG(\"child exit 12: ABI violation -- RSI not restored after redirect\");\n+\t\t\tbreak;\n+\t\tdefault:\n+\t\t\tTH_LOG(\"child exit %d (unexpected)\", WEXITSTATUS(status));\n+\t\t}\n+\t}\n+\n+cleanup:\n+\tmunmap(sup_view, PIN_SIZE);\n+\tclose(memfd);\n+\tclose(listener);\n+}\n+\n+static void redir_sigusr1_handler(int signo)\n+{\n+\t/* _exit() is async-signal-safe; bail with a distinct code if the\n+\t * signal frame was clobbered so the handler sees the wrong signo.\n+\t */\n+\tif (signo != SIGUSR1)\n+\t\t_exit(12);\n+}\n+\n+/*\n+ * Regression test: a redirect's deferred arg-register restore must run\n+ * 
before a signal frame is built, not after.\n+ *\n+ * The restore was queued as a TWA_RESUME task_work, which runs in\n+ * exit_to_user_mode_loop() *after* arch_do_signal_or_restart() has\n+ * already set up the handler frame (regs-\u003edi = signo, regs-\u003esi =\n+ * \u0026siginfo, regs-\u003edx = \u0026ucontext). The restore then overwrote those\n+ * registers with the trapped syscall's original argument values, so the\n+ * handler was entered with a corrupted signal number. Queuing the\n+ * restore with TWA_SIGNAL makes it run at the top of get_signal(),\n+ * before the frame is built (and before any syscall-restart rewind).\n+ *\n+ * The child traps on pause(), the supervisor redirects arg0 (RDI), and\n+ * then interrupts it with SIGUSR1. The handler must observe\n+ * signo == SIGUSR1, not the leaked original RDI sentinel.\n+ */\n+TEST(user_notification_redirect_signal_abi)\n+{\n+\tpid_t pid;\n+\tlong ret;\n+\tint status, listener;\n+\tstruct seccomp_notif req = {};\n+\tstruct seccomp_notif_resp_redirect redir = {};\n+\t/* A recognizable original RDI the broken restore would leak in. */\n+\tconst unsigned long RDI_SENTINEL = 0x5a5a5a5aUL;\n+\n+\tret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);\n+\tASSERT_EQ(0, ret) {\n+\t\tTH_LOG(\"Kernel does not support PR_SET_NO_NEW_PRIVS!\");\n+\t}\n+\n+\tlistener = user_notif_syscall(__NR_pause,\n+\t\t\t\t SECCOMP_FILTER_FLAG_NEW_LISTENER |\n+\t\t\t\t SECCOMP_FILTER_FLAG_REDIRECT);\n+\tASSERT_GE(listener, 0);\n+\n+\tpid = fork();\n+\tASSERT_GE(pid, 0);\n+\n+\tif (pid == 0) {\n+\t\tstruct sigaction sa = {\n+\t\t\t.sa_handler = redir_sigusr1_handler,\n+\t\t};\n+\t\tlong rc;\n+\n+\t\tif (sigaction(SIGUSR1, \u0026sa, NULL))\n+\t\t\t_exit(10);\n+\n+\t\t/* Raw pause() carrying a controlled RDI sentinel. */\n+\t\tasm volatile(\n+\t\t\t\"syscall\"\n+\t\t\t: \"=a\"(rc)\n+\t\t\t: \"0\"((long)__NR_pause),\n+\t\t\t \"D\"(RDI_SENTINEL)\n+\t\t\t: \"rcx\", \"r11\", \"memory\");\n+\n+\t\t/* pause() returns -EINTR once the handler has run. 
*/\n+\t\tif (rc != -EINTR)\n+\t\t\t_exit(11);\n+\t\t_exit(0);\n+\t}\n+\n+\tASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, \u0026req));\n+\tEXPECT_EQ(req.data.nr, __NR_pause);\n+\tEXPECT_EQ(req.data.args[0], RDI_SENTINEL);\n+\n+\t/* Redirect arg0 (non-pointer); this arms the original-RDI restore. */\n+\tredir.id = req.id;\n+\tredir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;\n+\tredir.args_mask = 1U \u003c\u003c 0;\n+\tredir.args[0] = 0;\n+\tEXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,\n+\t\t\t \u0026redir)) {\n+\t\tif (errno == EINVAL) {\n+\t\t\tkill(pid, SIGKILL);\n+\t\t\twaitpid(pid, \u0026status, 0);\n+\t\t\tSKIP(goto cleanup,\n+\t\t\t \"Kernel lacks SECCOMP_IOCTL_NOTIF_SEND_REDIRECT\");\n+\t\t}\n+\t}\n+\n+\t/* Let the child reach pause(), then interrupt it with a handler. */\n+\tusleep(100000);\n+\tEXPECT_EQ(0, kill(pid, SIGUSR1));\n+\n+\tEXPECT_EQ(waitpid(pid, \u0026status, 0), pid);\n+\tEXPECT_EQ(true, WIFEXITED(status));\n+\tEXPECT_EQ(0, WEXITSTATUS(status)) {\n+\t\tswitch (WEXITSTATUS(status)) {\n+\t\tcase 10:\n+\t\t\tTH_LOG(\"child exit 10: sigaction failed\");\n+\t\t\tbreak;\n+\t\tcase 11:\n+\t\t\tTH_LOG(\"child exit 11: pause() did not return -EINTR\");\n+\t\t\tbreak;\n+\t\tcase 12:\n+\t\t\tTH_LOG(\"child exit 12: handler saw wrong signo (frame clobbered)\");\n+\t\t\tbreak;\n+\t\tdefault:\n+\t\t\tTH_LOG(\"child exit %d (unexpected)\", WEXITSTATUS(status));\n+\t\t}\n+\t}\n+\n+cleanup:\n+\tclose(listener);\n+}\n+#endif /* __x86_64__ */\n+\n #ifndef SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP\n #define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL \u003c\u003c 0)\n #define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)\n"
} |
| 2/1 | 2026/06/27 01:53 | agent | patch-evaluator |
Model:gemini-3.1-pro-preview Args: null Results: {
"EnableConfigs": [
"SECCOMP_FILTER",
"MEMFD_CREATE",
"64BIT"
],
"FocusSymbols": [
"seccomp_notify_pin_install",
"seccomp_install_pin",
"seccomp_pin_check",
"seccomp_notify_send_redirect",
"seccomp_redirect_restore_cb",
"seccomp_redirect_is_sigreturn",
"seccomp_run_filters_seq",
"vm_mmap_seal_remote",
"mm_get_unmapped_area_remote"
],
"Reasoning": "The patch introduces a new non-cooperative pinned-memfd redirect feature for seccomp, allowing a supervisor to install a sealed memfd mapping into a trapped task's address space and redirect pointer arguments to it. This adds new ioctls (`SECCOMP_IOCTL_NOTIF_PIN_INSTALL` and `SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`) and modifies core seccomp and memory management logic. These new execution paths should be heavily fuzzed to ensure they don't introduce vulnerabilities like memory leaks, race conditions, or unauthorized memory access.",
"WorthFuzzing": true
}
Instruction: You are an expert Linux kernel maintainer. Your job is to review a provided patch series and determine if it makes functional changes to the kernel that should be fuzzed.
IMPORTANT: The changes have ALREADY been applied and committed as the HEAD commit in your workspace. Do NOT rely on your internal knowledge of the kernel. You must actively use your code access tools to examine the actual source code and confirm any assumptions.
Return WorthFuzzing=false if the patch only contains:
- Modifications to Documentation/, Kconfig files, or code comments.
- Purely decorative changes, such as logging (e.g., pr_err, printk) or tracepoints.
- Changes to numeric constants or macros that do not functionally alter execution flow.
- Code paths that are impossible to reach in virtualized environments like GCE or QEMU, even when utilizing software-emulated hardware (e.g., usb gadget, mac80211_hwsim).
If it modifies reachable core kernel logic, drivers, or architectures, use your code search tools to verify the code can be executed, then return WorthFuzzing=true.
When returning WorthFuzzing=true, you MUST ALSO:
1. Extract any specific kernel functions that should be heavily fuzzed into FocusSymbols. Avoid listing generic hot-path functions to prevent skewed test distributions.
2. Identify any specific CONFIG_ options required to properly test this new/modified feature. Go and look into the Kconfig files and check for ifdefs around the code, do not make assumptions. Do not list too generic configs (we already have them enabled). Only list those that specifically cover the modified code. List them in the EnableConfigs output array, and DO NOT add a 'CONFIG_' prefix (e.g., return "NET_IPV4" instead of "CONFIG_NET_IPV4").
Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool.
Prompt: For your convenience, here is the diff of the changes:
commit 5ab21aee5036895747567f9993bf88600f1b136a
Author: syz-cluster <triage@syzkaller.com>
Date: Sat Jun 27 01:53:38 2026 +0000
syz-cluster: applied patch under review
diff --git a/Documentation/userspace-api/seccomp_filter.rst b/Documentation/userspace-api/seccomp_filter.rst
index cff0fa7f3175e4..3a8ac7775da6fd 100644
--- a/Documentation/userspace-api/seccomp_filter.rst
+++ b/Documentation/userspace-api/seccomp_filter.rst
@@ -289,6 +289,114 @@ above in this document: all arguments being read from the tracee's memory
should be read into the tracer's memory before any policy decisions are made.
This allows for an atomic decision on syscall arguments.
+Non-cooperative pinned-memfd redirect
+=====================================
+
+The TOCTOU described above means ``SECCOMP_USER_NOTIF_FLAG_CONTINUE`` cannot
+enforce a policy on pointer arguments: after the supervisor inspects the
+target's memory and lets the syscall continue, the target (or a thread sharing
+its address space) can rewrite that memory before the kernel reads it. The
+cooperative workaround, the target ``mmap()`` + ``mseal()``-ing a shared
+buffer, is unavailable in the fork+execve sandbox model, where the supervisor
+confines a binary it did not write.
+
+Two ioctls let the supervisor close this race without target cooperation. The
+redirect step (below) requires a listener created with
+``SECCOMP_FILTER_FLAG_REDIRECT`` (in addition to
+``SECCOMP_FILTER_FLAG_NEW_LISTENER``). Because it rewrites another task's
+registers, at most one such listener may exist in a task's filter chain; a
+second fails with ``-EBUSY``:
+
+.. code-block:: c
+
+ fd = seccomp(SECCOMP_SET_MODE_FILTER,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER | SECCOMP_FILTER_FLAG_REDIRECT,
+ &prog);
+
+``ioctl(SECCOMP_IOCTL_NOTIF_PIN_INSTALL)`` installs a sealed mapping of a
+supervisor-owned ``memfd`` directly into the trapped task's address space:
+
+.. code-block:: c
+
+ struct seccomp_notif_pin_install {
+ __u64 id;
+ __u32 flags; /* reserved, must be 0 */
+ __u32 memfd;
+ __u64 target_addr;
+ __u64 size;
+ __u64 offset; /* page-aligned offset into memfd */
+ };
+
+``id`` names an active notification (the trapped task to install into).
+``target_addr``, ``size`` and ``offset`` are page-aligned; ``offset`` selects
+where in ``memfd`` the mapping starts, so one memfd can back several pins. If
+``target_addr`` is ``0`` the kernel picks a free address and writes it back;
+otherwise an existing mapping there yields ``-EEXIST``. The pin is read-only
+and sealed: the target and its threads cannot unmap, move, reprotect or
+overwrite it. It lasts until the target ``execve()``s or exits.
+
+``memfd`` must be write-sealed (``F_SEAL_WRITE`` or ``F_SEAL_FUTURE_WRITE``)
+or the ioctl returns ``-EINVAL``; otherwise the target could rewrite the pin's
+bytes through a separate writable handle to the same memfd.
+``F_SEAL_FUTURE_WRITE`` still lets the supervisor update the contents through
+its own mapping made before the seal.
+
+``ioctl(SECCOMP_IOCTL_NOTIF_SEND_REDIRECT)`` then resumes the trapped syscall
+like ``SECCOMP_USER_NOTIF_FLAG_CONTINUE``, but with selected argument
+registers replaced:
+
+.. code-block:: c
+
+ struct seccomp_notif_resp_redirect {
+ __u64 id;
+ __u32 flags; /* SECCOMP_REDIRECT_FLAG_CONTINUE must be set */
+ __u32 args_mask; /* which arg registers to replace */
+ __u32 ptr_mask; /* which of those are pointers into a pin */
+ __u32 memfd; /* the pin's backing memfd */
+ __u64 args[6]; /* replacement values */
+ __u64 ptr_len[6]; /* validated access length for each pointer arg */
+ };
+
+Each bit in ``ptr_mask`` (a subset of ``args_mask``) marks ``args[i]`` as a
+pointer; the access ``[args[i], args[i] + ptr_len[i])`` must lie within a
+single read-only pin of ``memfd`` in the target, or the ioctl returns
+``-EFAULT``. ``ptr_len[i]`` must be non-zero for those bits and ``0``
+otherwise. Bits in ``args_mask`` but not ``ptr_mask`` are scalar replacements
+written verbatim, e.g. to set the length register that goes with a redirected
+pointer. The original registers are restored at syscall exit, so the
+substitution is invisible to the target and the TOCTOU is closed.
+
+Scope and limitations
+---------------------
+
+The redirect mechanism is deliberately narrow and is *not* a general syscall
+rewriting facility:
+
+- **Read-only input pointers only.** A pin is read-only, so only an argument
+ the syscall *reads* (a pathname, a ``sockaddr``) may be redirected into it.
+ Aiming an output or in/out argument at a pin makes the syscall fail with
+ ``-EFAULT`` when it writes back.
+
+- **Same syscall only.** A redirect replaces arguments, never the syscall
+ number. ``rt_sigreturn()`` (and its compat variant) cannot be redirected and
+ return ``-EOPNOTSUPP``.
+
+- **Signals and restarts.** The redirected syscall really runs, so it can be
+ interrupted and restarted. On a restart the original arguments are restored
+ and the syscall re-traps, so the supervisor is notified again and must answer
+ consistently. Syscalls the kernel restarts without re-trapping (e.g.
+ ``nanosleep()``, ``futex(FUTEX_WAIT)``) keep the substituted arguments --
+ safe for read-only inputs, but a reason not to redirect arguments of syscalls
+ that block or wait.
+
+- **clone()/fork().** A child keeps the substituted argument registers (the
+ restore is not inherited). Redirect ``clone()``/``fork()`` arguments only if
+ that is acceptable.
+
+- **ptrace.** A tracer sees the substituted arguments at the syscall-exit stop;
+ they are restored before the task resumes, so a ``PTRACE_SETREGS`` of a
+ substituted register at that stop is overwritten.
+
Sysctls
=======
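Editor's note, not part of the patch: the mask rules stated in the documentation hunk above (``ptr_mask`` is a subset of ``args_mask``; ``ptr_len[i]`` is non-zero exactly for the bits set in ``ptr_mask``) can be pre-checked in userspace before issuing the ioctl. The following is a minimal sketch under stated assumptions. ``struct redirect_resp`` is a local mirror of the documented layout, and ``redirect_masks_ok()`` is a hypothetical helper; neither exists in the patch, and the kernel's own validation may apply further checks (for example the pin-range check that yields ``-EFAULT``).

```c
#include <stdbool.h>
#include <stdint.h>

#define REDIRECT_ARGS 6 /* mirrors SECCOMP_REDIRECT_ARGS from the patch */

/* Local mirror of the documented layout; NOT the real uapi header. */
struct redirect_resp {
	uint64_t id;
	uint32_t flags;
	uint32_t args_mask;               /* which arg registers to replace */
	uint32_t ptr_mask;                /* which of those point into a pin */
	uint32_t memfd;
	uint64_t args[REDIRECT_ARGS];
	uint64_t ptr_len[REDIRECT_ARGS];
};

/*
 * Hypothetical helper: returns true when the masks follow the two rules
 * stated in the documentation. It does not validate addresses or flags.
 */
static bool redirect_masks_ok(const struct redirect_resp *r)
{
	/* Rule 1: every pointer bit must also be a replaced-argument bit. */
	if (r->ptr_mask & ~r->args_mask)
		return false;

	/* Rule 2: ptr_len[i] is non-zero if and only if bit i is a pointer. */
	for (int i = 0; i < REDIRECT_ARGS; i++) {
		bool is_ptr = (r->ptr_mask & (1U << i)) != 0;
		bool has_len = r->ptr_len[i] != 0;

		if (is_ptr != has_len)
			return false;
	}
	return true;
}
```

A supervisor could call such a helper just before ``SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`` to catch a malformed response locally rather than decoding an errno.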
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 485df9c2dbddb3..73e5580442a6dd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4152,6 +4152,8 @@ extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
struct list_head *uf);
+unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len, unsigned long pgoff);
extern int do_vmi_munmap(struct vma_iterator *vmi, struct mm_struct *mm,
unsigned long start, size_t len, struct list_head *uf,
bool unlock);
@@ -4192,6 +4194,7 @@ struct vm_unmapped_area_info {
unsigned long align_mask;
unsigned long align_offset;
unsigned long start_gap;
+ struct mm_struct *mm; /* mm to search; NULL means current->mm */
};
extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info);
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 9b959972bf4a22..5d53f8fce50896 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -10,12 +10,22 @@
SECCOMP_FILTER_FLAG_SPEC_ALLOW | \
SECCOMP_FILTER_FLAG_NEW_LISTENER | \
SECCOMP_FILTER_FLAG_TSYNC_ESRCH | \
- SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
+ SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV | \
+ SECCOMP_FILTER_FLAG_REDIRECT)
/* sizeof() the first published struct seccomp_notif_addfd */
#define SECCOMP_NOTIFY_ADDFD_SIZE_VER0 24
#define SECCOMP_NOTIFY_ADDFD_SIZE_LATEST SECCOMP_NOTIFY_ADDFD_SIZE_VER0
+/* sizeof() the first published struct seccomp_notif_pin_install */
+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0 32 /* up to @size */
+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER1 40 /* adds @offset */
+#define SECCOMP_NOTIFY_PIN_INSTALL_SIZE_LATEST SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER1
+
+/* sizeof() the first published struct seccomp_notif_resp_redirect */
+#define SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0 120
+#define SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_LATEST SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0
+
#ifdef CONFIG_SECCOMP
#include <linux/thread_info.h>
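Editor's note, not part of the patch: the size constants in the hunk above (32, 40 and 120) follow from the field layouts shown elsewhere in the diff. The sketch below checks that arithmetic at compile time. The two structs are local mirrors using ``uint*_t`` in place of ``__u*``, and their names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct seccomp_notif_pin_install as shown in the diff. */
struct pin_install_layout {
	uint64_t id;
	uint32_t flags;
	uint32_t memfd;
	uint64_t target_addr;
	uint64_t size;
	uint64_t offset;
};

/* Local mirror of struct seccomp_notif_resp_redirect as documented. */
struct resp_redirect_layout {
	uint64_t id;
	uint32_t flags;
	uint32_t args_mask;
	uint32_t ptr_mask;
	uint32_t memfd;
	uint64_t args[6];
	uint64_t ptr_len[6];
};

/* VER0 covers everything up to and including size: 8 + 4 + 4 + 8 + 8. */
_Static_assert(offsetof(struct pin_install_layout, offset) == 32,
	       "PIN_INSTALL VER0 should be 32 bytes");
/* VER1 adds the 8-byte offset field. */
_Static_assert(sizeof(struct pin_install_layout) == 40,
	       "PIN_INSTALL VER1 should be 40 bytes");
/* 8 + 4 * 4 + 6 * 8 + 6 * 8 = 120. */
_Static_assert(sizeof(struct resp_redirect_layout) == 120,
	       "RESP_REDIRECT VER0 should be 120 bytes");
```

If any mirrored layout disagreed with the constants, the translation unit would fail to compile.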
diff --git a/include/uapi/linux/seccomp.h b/include/uapi/linux/seccomp.h
index dbfc9b37fcaee4..d6888691633c08 100644
--- a/include/uapi/linux/seccomp.h
+++ b/include/uapi/linux/seccomp.h
@@ -25,6 +25,12 @@
#define SECCOMP_FILTER_FLAG_TSYNC_ESRCH (1UL << 4)
/* Received notifications wait in killable state (only respond to fatal signals) */
#define SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV (1UL << 5)
+/*
+ * Declares that this listener's notifier may issue
+ * SECCOMP_IOCTL_NOTIF_PIN_INSTALL / SECCOMP_IOCTL_NOTIF_SEND_REDIRECT. At most
+ * one such filter may exist in a task's filter chain. Requires NEW_LISTENER.
+ */
+#define SECCOMP_FILTER_FLAG_REDIRECT (1UL << 6)
/*
* All BPF programs must return a 32-bit value.
@@ -137,6 +143,52 @@ struct seccomp_notif_addfd {
__u32 newfd_flags;
};
+/**
+ * struct seccomp_notif_pin_install - have the kernel install a sealed
+ * MAP_SHARED mapping of @memfd into the trapped task's mm at @target_addr,
+ * which SECCOMP_IOCTL_NOTIF_SEND_REDIRECT can then use as a target for
+ * substituted pointer arguments.
+ *
+ * The supervisor owns @memfd. The kernel installs the mapping into
+ * the trapped task's address space without target-side cooperation
+ * (the target need not mmap or mseal anything itself). The mapping
+ * is marked VM_SEALED at install time, so the target and any
+ * CLONE_VM peer cannot munmap, mremap, mprotect, or MAP_FIXED-stomp
+ * it. The mapping is read-only. The supervisor retains access via its
+ * own mapping of the same memfd in its own mm.
+ *
+ * @memfd must be write-sealed (F_SEAL_WRITE or F_SEAL_FUTURE_WRITE),
+ * otherwise the ioctl fails with -EINVAL. This guarantees the pin's bytes
+ * cannot be rewritten through any other reference to the same memfd (for
+ * example one the target reopened via the supervisor's /proc/<pid>/fd),
+ * not just through the read-only pin itself. F_SEAL_FUTURE_WRITE still
+ * lets the supervisor update the bytes through its own pre-seal mapping.
+ *
+ * @offset lets one memfd back several disjoint read-only pins.
+ *
+ * @id: The ID of an active seccomp notification on this listener,
+ * identifying the trapped task whose mm receives the pin.
+ * @flags: Reserved, must be 0.
+ * @memfd: Supervisor-side fd for the backing memfd. Must be write-sealed.
+ * @target_addr: Address in the trapped task's mm to install at. Must be
+ * page-aligned. If non-zero, MAP_FIXED_NOREPLACE semantics apply:
+ * no other mapping may exist in [@target_addr, @target_addr +
+ * @size), or the ioctl fails with -EEXIST. If zero, the kernel
+ * chooses a free area in the target mm. On success the actual
+ * mapped address is written back here.
+ * @size: Size of the pin in bytes. Must be page-aligned.
+ * @offset: Page-aligned byte offset into @memfd to map from. Zero maps
+ * from the start of the memfd.
+ */
+struct seccomp_notif_pin_install {
+ __u64 id;
+ __u32 flags;
+ __u32 memfd;
+ __u64 target_addr;
+ __u64 size;
+ __u64 offset;
+};
+
#define SECCOMP_IOC_MAGIC '!'
#define SECCOMP_IO(nr) _IO(SECCOMP_IOC_MAGIC, nr)
#define SECCOMP_IOR(nr, type) _IOR(SECCOMP_IOC_MAGIC, nr, type)
@@ -154,4 +206,78 @@ struct seccomp_notif_addfd {
#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
+/* Valid flags for struct seccomp_notif_resp_redirect. */
+#define SECCOMP_REDIRECT_FLAG_CONTINUE (1UL << 0)
+
+/*
+ * Number of syscall argument registers a redirect response may
+ * substitute (matches struct seccomp_data::args[]).
+ */
+#define SECCOMP_REDIRECT_ARGS 6
+
+/**
+ * struct seccomp_notif_resp_redirect - resume the trapped syscall with
+ * substituted arg-register values, optionally pointing into previously
+ * installed pinned-memfd regions.
+ *
+ * Like SECCOMP_USER_NOTIF_FLAG_CONTINUE, the syscall actually runs, but the
+ * kernel first rewrites the arg registers selected by @args_mask. Each
+ * pointer substitution (@ptr_mask) is validated against the trapped task's
+ * current address space: the whole access [args[i], args[i] + ptr_len[i])
+ * must lie inside a single VM_SEALED, read-only mapping of @memfd. No per-pin
+ * bookkeeping is kept; authorization is re-derived from the live mapping, so
+ * a target that has exited or execve()d (its mapping gone) simply fails
+ * validation. Original registers are saved and restored at syscall exit for
+ * ABI compliance, except after a successful execve, whose new register file
+ * is left untouched. The redirect still applies to execve, which copies the
+ * pathname from the immutable pin before the old mm is gone, closing that
+ * TOCTOU too.
+ *
+ * @id: The ID of the seccomp notification this response consumes.
+ * @flags: SECCOMP_REDIRECT_FLAG_*. CONTINUE must be set.
+ * @args_mask: Bit i set means args[i] replaces the trapped task's
+ * corresponding arg register before the syscall runs.
+ * @ptr_mask: Subset of @args_mask. Bit i set means args[i] is a pointer and
+ * the access [args[i], args[i] + ptr_len[i]) is validated to lie
+ * entirely inside a single VM_SEALED, read-only mapping of @memfd.
+ * Scalar replacements (in @args_mask but not @ptr_mask) are
+ * written verbatim.
+ * @memfd: Supervisor-side fd for the backing memfd whose sealed mapping the
+ * pointer substitutions must fall within. Consulted only when
+ * @ptr_mask is non-zero.
+ * @args: Replacement values for the arg registers.
+ * @ptr_len: For each bit set in @ptr_mask, ptr_len[i] is the byte length of
+ * the access starting at args[i]; it must be non-zero and args[i] +
+ * ptr_len[i] must not overflow. For every i whose bit is clear in
+ * @ptr_mask it must be 0.
+ */
+struct seccomp_notif_resp_redirect {
+ __u64 id;
+ __u32 flags;
+ __u32 args_mask;
+ __u32 ptr_mask;
+ __u32 memfd;
+ __u64 args[SECCOMP_REDIRECT_ARGS];
+ __u64 ptr_len[SECCOMP_REDIRECT_ARGS];
+};
+
+/*
+ * Install a sealed memfd-backed pin in the trapped task's mm without
+ * target-side cooperation. The supervisor owns the backing memfd;
+ * the kernel installs the mapping and marks it VM_SEALED. The actual
+ * mapped address is written back to @target_addr (relevant when it was
+ * passed as 0 to let the kernel choose).
+ */
+#define SECCOMP_IOCTL_NOTIF_PIN_INSTALL SECCOMP_IOWR(5, \
+ struct seccomp_notif_pin_install)
+
+/*
+ * Resume the trapped syscall with substituted arg-register values
+ * pointing into an installed pin. The kernel saves and restores the
+ * original registers at syscall exit so the caller observes
+ * ABI-correct register preservation.
+ */
+#define SECCOMP_IOCTL_NOTIF_SEND_REDIRECT SECCOMP_IOW(6, \
+ struct seccomp_notif_resp_redirect)
+
#endif /* _UAPI_LINUX_SECCOMP_H */
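The field rules documented for `struct seccomp_notif_resp_redirect` can be checked by a supervisor before it issues the ioctl. The following is a minimal sketch under stated assumptions: the struct and constants are local mirrors of the definitions in this patch (they are not in any released header), and `resp_redirect_check()` is a hypothetical helper. It folds the wrapping-range case into `-EINVAL`, whereas the kernel path in this patch rejects that case during pin validation.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define REDIRECT_ARGS 6				/* mirrors SECCOMP_REDIRECT_ARGS */
#define REDIRECT_FLAG_CONTINUE (1u << 0)	/* mirrors SECCOMP_REDIRECT_FLAG_CONTINUE */

/* Local mirror of struct seccomp_notif_resp_redirect from the patch. */
struct resp_redirect {
	uint64_t id;
	uint32_t flags;
	uint32_t args_mask;
	uint32_t ptr_mask;
	uint32_t memfd;
	uint64_t args[REDIRECT_ARGS];
	uint64_t ptr_len[REDIRECT_ARGS];
};

/* The patch pins the first published layout at 120 bytes. */
static_assert(sizeof(struct resp_redirect) == 120, "layout drifted");

/* Hypothetical pre-flight check; returns 0 or -EINVAL. */
static int resp_redirect_check(const struct resp_redirect *r)
{
	int i;

	if (r->flags != REDIRECT_FLAG_CONTINUE)
		return -EINVAL;	/* CONTINUE is required, nothing else is defined */
	if (!r->args_mask || (r->args_mask & ~((1u << REDIRECT_ARGS) - 1)))
		return -EINVAL;	/* at least one arg, and only bits 0..5 */
	if (r->ptr_mask & ~r->args_mask)
		return -EINVAL;	/* ptr_mask must be a subset of args_mask */

	for (i = 0; i < REDIRECT_ARGS; i++) {
		if (r->ptr_mask & (1u << i)) {
			if (!r->ptr_len[i] ||
			    r->args[i] + r->ptr_len[i] < r->args[i])
				return -EINVAL;	/* empty or wrapping range */
		} else if (r->ptr_len[i]) {
			return -EINVAL;	/* length given without a pointer bit */
		}
	}
	return 0;
}
```

A scalar-only replacement sets a bit in `args_mask`, leaves `ptr_mask` clear, and keeps every `ptr_len[]` entry at zero.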
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 066909393c38f5..84812ce9bdb3b5 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -37,12 +37,19 @@
#ifdef CONFIG_SECCOMP_FILTER
#include <linux/file.h>
#include <linux/filter.h>
+#include <linux/memfd.h>
#include <linux/pid.h>
#include <linux/ptrace.h>
#include <linux/capability.h>
#include <linux/uaccess.h>
#include <linux/anon_inodes.h>
#include <linux/lockdep.h>
+#include <linux/mm.h>
+#include <linux/mman.h>
+#include <linux/mmap_lock.h>
+#include <linux/sched/mm.h>
+#include <linux/task_work.h>
+#include <uapi/asm-generic/mman-common.h>
/*
* When SECCOMP_IOCTL_NOTIF_ID_VALID was first introduced, it had the
@@ -87,6 +94,13 @@ struct seccomp_knotif {
long val;
u32 flags;
+ /*
+ * Set by SEND_REDIRECT: the reply rewrote the syscall's registers,
+ * so on resume the syscall must be re-evaluated against the filters
+ * outer to the one that notified (see __seccomp_filter()).
+ */
+ bool redirect;
+
/*
* Signals when this has changed states, such as the listener
* dying, a new seccomp addfd message, or changing to REPLIED
@@ -226,6 +240,7 @@ struct seccomp_filter {
refcount_t users;
bool log;
bool wait_killable_recv;
+ bool redirect_capable;
struct action_cache cache;
struct seccomp_filter *prev;
struct bpf_prog *prog;
@@ -946,6 +961,13 @@ static long seccomp_attach_filter(unsigned int flags,
}
}
+ if (flags & SECCOMP_FILTER_FLAG_REDIRECT) {
+ for (walker = current->seccomp.filter; walker;
+ walker = walker->prev)
+ if (walker->redirect_capable)
+ return -EBUSY;
+ }
+
/* Set log flag, if present. */
if (flags & SECCOMP_FILTER_FLAG_LOG)
filter->log = true;
@@ -954,6 +976,10 @@ static long seccomp_attach_filter(unsigned int flags,
if (flags & SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV)
filter->wait_killable_recv = true;
+ /* Set redirect-capable flag, if present. */
+ if (flags & SECCOMP_FILTER_FLAG_REDIRECT)
+ filter->redirect_capable = true;
+
/*
* If there is an existing filter, make it the prev and don't drop its
* task reference.
@@ -1162,10 +1188,12 @@ static bool should_sleep_killable(struct seccomp_filter *match,
static int seccomp_do_user_notification(int this_syscall,
struct seccomp_filter *match,
- const struct seccomp_data *sd)
+ const struct seccomp_data *sd,
+ bool *redirected)
{
int err;
u32 flags = 0;
+ bool redirect = false;
long ret = 0;
struct seccomp_knotif n = {};
struct seccomp_kaddfd *addfd, *tmp;
@@ -1222,6 +1250,7 @@ static int seccomp_do_user_notification(int this_syscall,
ret = n.val;
err = n.error;
flags = n.flags;
+ redirect = n.redirect;
interrupted:
/* If there were any pending addfd calls, clear them out */
@@ -1248,14 +1277,38 @@ static int seccomp_do_user_notification(int this_syscall,
mutex_unlock(&match->notify_lock);
/* Userspace requests to continue the syscall. */
- if (flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE)
+ if (flags & SECCOMP_USER_NOTIF_FLAG_CONTINUE) {
+ *redirected = redirect;
return 0;
+ }
syscall_set_return_value(current, current_pt_regs(),
err, ret);
return -1;
}
+static u32 seccomp_run_filters_seq(const struct seccomp_data *sd,
+ struct seccomp_filter **match,
+ struct seccomp_filter *f,
+ int this_syscall)
+{
+ for (; f; f = f->prev) {
+ u32 cur_ret = bpf_prog_run_pin_on_cpu(f->prog, sd);
+ u32 action = cur_ret & SECCOMP_RET_ACTION_FULL;
+
+ if (action == SECCOMP_RET_ALLOW)
+ continue;
+ /* LOG does not block the syscall; record it and continue. */
+ if (action == SECCOMP_RET_LOG) {
+ seccomp_log(this_syscall, 0, action, true);
+ continue;
+ }
+ *match = f;
+ return cur_ret;
+ }
+ return SECCOMP_RET_ALLOW;
+}
+
static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
{
u32 filter_ret, action;
@@ -1272,6 +1325,8 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
populate_seccomp_data(&sd);
filter_ret = seccomp_run_filters(&sd, &match);
+
+eval:
data = filter_ret & SECCOMP_RET_DATA;
action = filter_ret & SECCOMP_RET_ACTION_FULL;
@@ -1334,11 +1389,40 @@ static int __seccomp_filter(int this_syscall, const bool recheck_after_trace)
return 0;
- case SECCOMP_RET_USER_NOTIF:
- if (seccomp_do_user_notification(this_syscall, match, &sd))
+ case SECCOMP_RET_USER_NOTIF: {
+ struct seccomp_filter *outer;
+ bool redirected = false;
+
+ if (seccomp_do_user_notification(this_syscall, match, &sd,
+ &redirected))
goto skip;
+ if (redirected && match->prev) {
+ /*
+ * The notifier rewrote the registers. Resume
+ * evaluation at the next outer filter on the
+ * substituted syscall, sequentially toward the root:
+ * each outer filter judges the new syscall exactly as
+ * if the target had issued it. Walking outward is
+ * monotonic, so a notifier cannot re-notify on its own
+ * redirect.
+ */
+ this_syscall = syscall_get_nr(current,
+ current_pt_regs());
+ if (this_syscall < 0)
+ return 0;
+ outer = match->prev;
+ match = NULL;
+ populate_seccomp_data(&sd);
+ filter_ret = seccomp_run_filters_seq(&sd, &match, outer,
+ this_syscall);
+ if (!match)
+ return 0;
+ goto eval;
+ }
+
return 0;
+ }
case SECCOMP_RET_LOG:
seccomp_log(this_syscall, 0, action, true);
@@ -1823,6 +1907,346 @@ static long seccomp_notify_addfd(struct seccomp_filter *filter,
return ret;
}
+static unsigned long seccomp_install_pin(struct task_struct *target,
+ struct file *memfd_file,
+ unsigned long target_addr, size_t size,
+ unsigned long offset)
+{
+ struct mm_struct *mm;
+ unsigned long ret;
+
+ mm = get_task_mm(target);
+ if (!mm)
+ return -ESRCH;
+
+ /*
+ * Install a sealed, read-only mapping. A fixed request (@target_addr
+ * != 0) is MAP_FIXED_NOREPLACE: an existing mapping yields -EEXIST
+ * rather than being silently clobbered. A request of 0 lets the kernel
+ * pick a free area in the target mm.
+ */
+ ret = vm_mmap_seal_remote(mm, memfd_file, target_addr, size,
+ offset >> PAGE_SHIFT);
+ mmput(mm);
+ if (IS_ERR_VALUE(ret))
+ return ret;
+ if (target_addr && ret != target_addr)
+ return -ENOMEM;
+ return ret;
+}
+
+static long seccomp_notify_pin_install(struct seccomp_filter *filter,
+ struct seccomp_notif_pin_install __user *upin,
+ unsigned int size)
+{
+ struct seccomp_notif_pin_install pin;
+ struct seccomp_knotif *knotif;
+ struct task_struct *target;
+ struct file *memfd_file;
+ unsigned long addr;
+ int seals;
+ long ret;
+
+ BUILD_BUG_ON(sizeof(pin) < SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0);
+ BUILD_BUG_ON(sizeof(pin) != SECCOMP_NOTIFY_PIN_INSTALL_SIZE_LATEST);
+
+ if (size < SECCOMP_NOTIFY_PIN_INSTALL_SIZE_VER0 || size >= PAGE_SIZE)
+ return -EINVAL;
+
+ ret = copy_struct_from_user(&pin, sizeof(pin), upin, size);
+ if (ret)
+ return ret;
+
+ if (pin.flags)
+ return -EINVAL;
+ if (!pin.size || !IS_ALIGNED(pin.target_addr, PAGE_SIZE) ||
+ !IS_ALIGNED(pin.size, PAGE_SIZE) || !IS_ALIGNED(pin.offset, PAGE_SIZE))
+ return -EINVAL;
+ if (pin.target_addr + pin.size < pin.target_addr)
+ return -EINVAL;
+ if (pin.offset + pin.size < pin.offset)
+ return -EINVAL;
+
+ memfd_file = fget(pin.memfd);
+ if (!memfd_file)
+ return -EBADF;
+
+ seals = memfd_get_seals(memfd_file);
+ if (seals < 0 || !(seals & (F_SEAL_WRITE | F_SEAL_FUTURE_WRITE))) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret < 0)
+ goto out_fput;
+
+ knotif = find_notification(filter, pin.id);
+ if (!knotif) {
+ ret = -ENOENT;
+ goto out_unlock;
+ }
+ if (knotif->state != SECCOMP_NOTIFY_SENT) {
+ ret = -EINPROGRESS;
+ goto out_unlock;
+ }
+
+ target = knotif->task;
+ get_task_struct(target);
+ mutex_unlock(&filter->notify_lock);
+
+ addr = seccomp_install_pin(target, memfd_file,
+ pin.target_addr, pin.size, pin.offset);
+ put_task_struct(target);
+ if (IS_ERR_VALUE(addr))
+ ret = addr;
+ else if (put_user(addr, &upin->target_addr))
+ /* Pin is installed (and sealed); we just can't report where. */
+ ret = -EFAULT;
+ else
+ ret = 0;
+ goto out_fput;
+
+out_unlock:
+ mutex_unlock(&filter->notify_lock);
+out_fput:
+ fput(memfd_file);
+ return ret;
+}
+
+static bool seccomp_pin_check(struct task_struct *target,
+ struct file *memfd_file, u64 ptr, u64 len)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ bool ok = false;
+ u64 end;
+
+ if (!len)
+ return false;
+ end = ptr + len;
+ if (end < ptr)
+ return false;
+
+ mm = get_task_mm(target);
+ if (!mm)
+ return false;
+
+ /*
+ * The access must lie in a single sealed, read-only, memfd-backed VMA.
+ * Read-only so no CLONE_VM peer can rewrite the bytes the kernel is
+ * about to read; VM_SEALED keeps the mapping itself immutable.
+ */
+ mmap_read_lock(mm);
+ vma = vma_lookup(mm, ptr);
+ if (vma && end <= vma->vm_end && (vma->vm_flags & VM_SEALED) &&
+ !(vma->vm_flags & VM_WRITE) &&
+ vma->vm_file && file_inode(vma->vm_file) == file_inode(memfd_file))
+ ok = true;
+ mmap_read_unlock(mm);
+
+ mmput(mm);
+ return ok;
+}
+
+struct seccomp_redirect_restore {
+ struct callback_head twork;
+ unsigned long orig_args[SECCOMP_REDIRECT_ARGS];
+ u32 args_mask; /* bit i: arg i was substituted, restore it */
+ u64 self_exec_id; /* snapshot to detect an intervening execve */
+};
+
+static void seccomp_redirect_restore_cb(struct callback_head *cb)
+{
+ struct seccomp_redirect_restore *r =
+ container_of(cb, struct seccomp_redirect_restore, twork);
+ unsigned long args[SECCOMP_REDIRECT_ARGS];
+ int i;
+
+ if (READ_ONCE(current->self_exec_id) != r->self_exec_id) {
+ kfree(r);
+ return;
+ }
+
+ syscall_get_arguments(current, current_pt_regs(), args);
+ for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+ if (r->args_mask & (1U << i))
+ args[i] = r->orig_args[i];
+ syscall_set_arguments(current, current_pt_regs(), args);
+ kfree(r);
+}
+
+/*
+ * rt_sigreturn restores the entire register frame from the user signal
+ * stack; the SEND_REDIRECT register-restore (run from task_work at user-mode
+ * return) would corrupt that frame, and the syscall takes no arguments to
+ * substitute anyway. Refuse to redirect it, including the compat variant.
+ */
+static bool seccomp_redirect_is_sigreturn(const struct seccomp_data *sd)
+{
+#ifdef SECCOMP_ARCH_COMPAT
+ if (sd->arch == SECCOMP_ARCH_COMPAT)
+ return sd->nr == __NR_seccomp_sigreturn_32;
+#endif
+ return sd->nr == __NR_seccomp_sigreturn;
+}
+
+static long seccomp_notify_send_redirect(struct seccomp_filter *filter,
+ struct seccomp_notif_resp_redirect __user *uresp,
+ unsigned int size)
+{
+ struct seccomp_notif_resp_redirect resp;
+ struct seccomp_knotif *knotif;
+ struct seccomp_redirect_restore *restore;
+ struct file *memfd_file = NULL;
+ struct pt_regs *target_regs;
+ unsigned long args[SECCOMP_REDIRECT_ARGS];
+ long ret;
+ int i;
+
+ BUILD_BUG_ON(sizeof(resp) < SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0);
+ BUILD_BUG_ON(sizeof(resp) != SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_LATEST);
+
+ if (!filter->redirect_capable)
+ return -EPERM;
+
+ if (size < SECCOMP_NOTIFY_RESP_REDIRECT_SIZE_VER0 || size >= PAGE_SIZE)
+ return -EINVAL;
+
+ ret = copy_struct_from_user(&resp, sizeof(resp), uresp, size);
+ if (ret)
+ return ret;
+
+ if (!(resp.flags & SECCOMP_REDIRECT_FLAG_CONTINUE))
+ return -EINVAL;
+ if (resp.flags & ~SECCOMP_REDIRECT_FLAG_CONTINUE)
+ return -EINVAL;
+ if (resp.args_mask & ~((1U << SECCOMP_REDIRECT_ARGS) - 1))
+ return -EINVAL;
+ if (resp.ptr_mask & ~resp.args_mask)
+ return -EINVAL;
+ if (!resp.args_mask)
+ return -EINVAL;
+
+ for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++) {
+ if (resp.ptr_mask & (1U << i)) {
+ if (!resp.ptr_len[i])
+ return -EINVAL;
+ } else if (resp.ptr_len[i]) {
+ return -EINVAL;
+ }
+ }
+
+ restore = kzalloc_obj(*restore, GFP_KERNEL_ACCOUNT);
+ if (!restore)
+ return -ENOMEM;
+ init_task_work(&restore->twork, seccomp_redirect_restore_cb);
+
+ /* The backing memfd is only consulted to validate pointer args. */
+ if (resp.ptr_mask) {
+ memfd_file = fget(resp.memfd);
+ if (!memfd_file) {
+ kfree(restore);
+ return -EBADF;
+ }
+ }
+
+ ret = mutex_lock_interruptible(&filter->notify_lock);
+ if (ret < 0)
+ goto out_free;
+
+ knotif = find_notification(filter, resp.id);
+ if (!knotif) {
+ ret = -ENOENT;
+ goto out_unlock_free;
+ }
+ if (knotif->state != SECCOMP_NOTIFY_SENT) {
+ ret = -EINPROGRESS;
+ goto out_unlock_free;
+ }
+
+ if (seccomp_redirect_is_sigreturn(knotif->data)) {
+ ret = -EOPNOTSUPP;
+ goto out_unlock_free;
+ }
+
+ for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++) {
+ if (!(resp.ptr_mask & (1U << i)))
+ continue;
+ if (!seccomp_pin_check(knotif->task, memfd_file,
+ resp.args[i], resp.ptr_len[i])) {
+ ret = -EFAULT;
+ goto out_unlock_free;
+ }
+ }
+
+ /*
+ * Save original pt_regs args (target is parked in
+ * seccomp_do_user_notification, so its pt_regs is stable) and
+ * write substituted values. The trapped task's task_work fires
+ * at user-mode return, restoring originals for ABI compliance.
+ */
+ target_regs = task_pt_regs(knotif->task);
+ syscall_get_arguments(knotif->task, target_regs, args);
+ for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+ restore->orig_args[i] = args[i];
+ restore->args_mask = resp.args_mask;
+ restore->self_exec_id = READ_ONCE(knotif->task->self_exec_id);
+
+ for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+ if (resp.args_mask & (1U << i))
+ args[i] = resp.args[i];
+ syscall_set_arguments(knotif->task, target_regs, args);
+
+ /*
+ * Use TWA_RESUME, not TWA_SIGNAL. TWA_SIGNAL sets TIF_NOTIFY_SIGNAL,
+ * which makes signal_pending() true for the entire redirected syscall
+ * (the work is queued here, before the target resumes and runs it).
+ * An interruptible syscall would then bail out with -ERESTARTSYS before
+ * doing any work, restart, re-trap and get redirected again -- a
+ * livelock. TWA_RESUME does not feed signal_pending(), and the restore
+ * still runs before signal delivery: get_signal() runs task_work_run()
+ * before it dequeues a signal, so the original args are back in pt_regs
+ * before handle_signal() builds the sigframe or the -ERESTART* path
+ * rewinds for restart.
+ */
+ ret = task_work_add(knotif->task, &restore->twork, TWA_RESUME);
+ if (ret) {
+ for (i = 0; i < SECCOMP_REDIRECT_ARGS; i++)
+ args[i] = restore->orig_args[i];
+ syscall_set_arguments(knotif->task, target_regs, args);
+ goto out_unlock_free;
+ }
+
+ /*
+ * Mark REPLIED with FLAG_CONTINUE so the wait-loop exit path runs the
+ * syscall normally. Flag the redirect so the resume path re-validates
+ * the rewritten syscall against the filters outer to this one.
+ */
+ knotif->state = SECCOMP_NOTIFY_REPLIED;
+ knotif->error = 0;
+ knotif->val = 0;
+ knotif->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ knotif->redirect = true;
+ if (filter->notif->flags & SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP)
+ complete_on_current_cpu(&knotif->ready);
+ else
+ complete(&knotif->ready);
+
+ mutex_unlock(&filter->notify_lock);
+ if (memfd_file)
+ fput(memfd_file);
+ return 0;
+
+out_unlock_free:
+ mutex_unlock(&filter->notify_lock);
+out_free:
+ if (memfd_file)
+ fput(memfd_file);
+ kfree(restore);
+ return ret;
+}
+
static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
unsigned long arg)
{
@@ -1847,6 +2271,12 @@ static long seccomp_notify_ioctl(struct file *file, unsigned int cmd,
switch (EA_IOCTL(cmd)) {
case EA_IOCTL(SECCOMP_IOCTL_NOTIF_ADDFD):
return seccomp_notify_addfd(filter, buf, _IOC_SIZE(cmd));
+ case EA_IOCTL(SECCOMP_IOCTL_NOTIF_PIN_INSTALL):
+ return seccomp_notify_pin_install(filter, buf,
+ _IOC_SIZE(cmd));
+ case EA_IOCTL(SECCOMP_IOCTL_NOTIF_SEND_REDIRECT):
+ return seccomp_notify_send_redirect(filter, buf,
+ _IOC_SIZE(cmd));
default:
return -EINVAL;
}
@@ -1986,6 +2416,14 @@ static long seccomp_set_mode_filter(unsigned int flags,
((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
return -EINVAL;
+ /*
+ * SECCOMP_FILTER_FLAG_REDIRECT declares intent to redirect via the
+ * listener notifier, so it requires a listener.
+ */
+ if ((flags & SECCOMP_FILTER_FLAG_REDIRECT) &&
+ ((flags & SECCOMP_FILTER_FLAG_NEW_LISTENER) == 0))
+ return -EINVAL;
+
/* Prepare the new filter before holding any locks. */
prepared = seccomp_prepare_user_filter(filter);
if (IS_ERR(prepared))
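After a redirect, `__seccomp_filter()` resumes evaluation at `match->prev` and walks toward the root via `seccomp_run_filters_seq()`. The sketch below models only that walk in userspace so the control flow is easy to follow. The action codes, the `struct filter` layout, and the sample filter functions are simplified stand-ins invented for illustration, not the kernel's types or `SECCOMP_RET_*` values.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified action codes for the model; not the real SECCOMP_RET_* values. */
enum action { ACT_ALLOW, ACT_LOG, ACT_ERRNO, ACT_KILL };

/* Each filter links to the one installed before it, as in the kernel. */
struct filter {
	enum action (*run)(int nr);
	const struct filter *prev;
};

/*
 * Model of seccomp_run_filters_seq(): walk from @f toward the root and stop
 * at the first filter that neither allows nor merely logs the syscall.
 * *match is left untouched when every remaining filter allows.
 */
static enum action run_filters_seq(int nr, const struct filter *f,
				   const struct filter **match)
{
	assert(match != NULL);

	for (; f; f = f->prev) {
		enum action a = f->run(nr);

		if (a == ACT_ALLOW || a == ACT_LOG)
			continue;
		*match = f;
		return a;
	}
	return ACT_ALLOW;
}

/* Sample filters for the model. */
static enum action allow_all(int nr) { (void)nr; return ACT_ALLOW; }
static enum action log_all(int nr) { (void)nr; return ACT_LOG; }
static enum action deny_59(int nr) { return nr == 59 ? ACT_ERRNO : ACT_ALLOW; }
```

Because the walk starts at the filter outer to the notifier and only moves outward, the notifying filter never sees its own substituted syscall again in this model.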
diff --git a/mm/internal.h b/mm/internal.h
index 181e79f1d6a207..3d698bccc10040 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1436,6 +1436,14 @@ extern unsigned long __must_check vm_mmap_pgoff(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);
+unsigned long __do_mmap(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len, unsigned long prot,
+ unsigned long flags, vm_flags_t vm_flags, unsigned long pgoff,
+ unsigned long *populate, struct list_head *uf);
+
+unsigned long mm_get_unmapped_area_remote(struct mm_struct *mm,
+ unsigned long len);
+
extern void set_pageblock_order(void);
unsigned long reclaim_pages(struct list_head *folio_list);
unsigned int reclaim_clean_pages_from_list(struct zone *zone,
diff --git a/mm/mmap.c b/mm/mmap.c
index 2311ae7c2ff45c..4328dc21272d3f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -277,7 +277,7 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
}
/**
- * do_mmap() - Perform a userland memory mapping into the current process
+ * __do_mmap() - Perform a userland memory mapping into @mm's
* address space of length @len with protection bits @prot, mmap flags @flags
* (from which VMA flags will be inferred), and any additional VMA flags to
* apply @vm_flags. If this is a file-backed mapping then the file is specified
@@ -307,8 +307,11 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
* start of a VMA, rather only the start of a valid mapped range of length
* @len bytes, rounded down to the nearest page size.
*
- * The caller must write-lock current->mm->mmap_lock.
+ * The caller must write-lock @mm->mmap_lock. do_mmap() is the common
+ * wrapper that targets current->mm.
*
+ * @mm: The mm_struct to install the mapping into. The caller must hold a
+ * reference and write-lock its mmap_lock.
* @file: An optional struct file pointer describing the file which is to be
* mapped, if a file-backed mapping.
* @addr: If non-zero, hints at (or if @flags has MAP_FIXED set, specifies) the
@@ -333,13 +336,12 @@ static inline bool file_mmap_ok(struct file *file, struct inode *inode,
* Returns: Either an error, or the address at which the requested mapping has
* been performed.
*/
-unsigned long do_mmap(struct file *file, unsigned long addr,
- unsigned long len, unsigned long prot,
- unsigned long flags, vm_flags_t vm_flags,
- unsigned long pgoff, unsigned long *populate,
- struct list_head *uf)
+unsigned long __do_mmap(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff,
+ unsigned long *populate, struct list_head *uf)
{
- struct mm_struct *mm = current->mm;
int pkey = 0;
*populate = 0;
@@ -557,7 +559,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags |= VM_NORESERVE;
}
- addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+ addr = mmap_region(mm, file, addr, len, vm_flags, pgoff, uf);
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
(flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -565,6 +567,15 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
return addr;
}
+unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len,
+ unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff,
+ unsigned long *populate, struct list_head *uf)
+{
+ return __do_mmap(current->mm, file, addr, len, prot, flags,
+ vm_flags, pgoff, populate, uf);
+}
+
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff)
@@ -809,6 +820,40 @@ unsigned long mm_get_unmapped_area_vmflags(struct file *filp, unsigned long addr
return arch_get_unmapped_area(filp, addr, len, pgoff, flags, vm_flags);
}
+/*
+ * Find a free @len-byte area in @mm, honoring @mm's mmap layout direction.
+ * Unlike the arch_get_unmapped_area() family, the search runs against @mm
+ * rather than current->mm, so a supervisor can place a mapping in a remote
+ * task's address space (see vm_mmap_seal_remote()). The caller must hold
+ * mmap_write_lock(@mm). Returns a page-aligned address or -ENOMEM.
+ */
+unsigned long mm_get_unmapped_area_remote(struct mm_struct *mm, unsigned long len)
+{
+ struct vm_unmapped_area_info info = {
+ .length = len,
+ .mm = mm,
+ };
+ unsigned long addr;
+
+ if (mm_flags_test(MMF_TOPDOWN, mm)) {
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = arch_get_mmap_base(0, mm->mmap_base);
+ addr = vm_unmapped_area(&info);
+ if (!offset_in_page(addr))
+ return addr;
+ /* Topdown exhausted (e.g. huge stack rlimit); retry bottom-up. */
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = arch_get_mmap_end(0, len, 0);
+ return vm_unmapped_area(&info);
+ }
+
+ info.low_limit = mm->mmap_base;
+ info.high_limit = arch_get_mmap_end(0, len, 0);
+ return vm_unmapped_area(&info);
+}
+
unsigned long
__get_unmapped_area(struct file *file, unsigned long addr, unsigned long len,
unsigned long pgoff, unsigned long flags, vm_flags_t vm_flags)
diff --git a/mm/nommu.c b/mm/nommu.c
index ed3934bc2de483..7f2136129c7294 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1009,7 +1009,8 @@ static int do_mmap_private(struct vm_area_struct *vma,
/*
* handle mapping creation for uClinux
*/
-unsigned long do_mmap(struct file *file,
+unsigned long __do_mmap(struct mm_struct *mm,
+ struct file *file,
unsigned long addr,
unsigned long len,
unsigned long prot,
@@ -1246,6 +1247,15 @@ unsigned long do_mmap(struct file *file,
return -ENOMEM;
}
+unsigned long do_mmap(struct file *file, unsigned long addr, unsigned long len,
+ unsigned long prot, unsigned long flags,
+ vm_flags_t vm_flags, unsigned long pgoff,
+ unsigned long *populate, struct list_head *uf)
+{
+ return __do_mmap(current->mm, file, addr, len, prot, flags,
+ vm_flags, pgoff, populate, uf);
+}
+
unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff)
diff --git a/mm/util.c b/mm/util.c
index af2c2103f0d952..21568dd0e9f8b0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -588,6 +588,68 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,
return ret;
}
+/**
+ * vm_mmap_seal_remote - install a sealed MAP_SHARED file mapping into @mm,
+ * without target-side cooperation.
+ * @mm: Target mm; caller holds a reference (e.g. get_task_mm()).
+ * @file: Backing file.
+ * @addr: Page-aligned address. If non-zero, MAP_FIXED_NOREPLACE is used
+ * (-EEXIST if occupied); if zero, the kernel chooses a free area in
+ * @mm and returns it.
+ * @len: Length in bytes (page-aligned).
+ * @pgoff: Page offset into @file.
+ *
+ * The mapping is read-only. The VMA is created VM_SEALED, so it is immediately
+ * immutable against the target mm's owner and its CLONE_VM peers. LSM/fsnotify
+ * hooks run against %current; cross-task authorization is the caller's
+ * responsibility (no ptrace_may_access check).
+ *
+ * Returns the mapped address on success, or a negative errno.
+ */
+unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len, unsigned long pgoff)
+{
+ const unsigned long prot = PROT_READ;
+ const unsigned long flags = MAP_SHARED | MAP_FIXED_NOREPLACE;
+ loff_t off = (loff_t)pgoff << PAGE_SHIFT;
+ unsigned long ret;
+ unsigned long populate;
+ LIST_HEAD(uf);
+
+ if (WARN_ON_ONCE(!mm))
+ return -EINVAL;
+ if (!VM_SEALED) /* sealing unavailable (e.g. !CONFIG_64BIT) */
+ return -EOPNOTSUPP;
+
+ ret = security_mmap_file(file, prot, flags);
+ if (!ret)
+ ret = fsnotify_mmap_perm(file, prot, off, len);
+ if (ret)
+ return ret;
+
+ if (mmap_write_lock_killable(mm))
+ return -EINTR;
+
+ if (!addr) {
+ addr = mm_get_unmapped_area_remote(mm, PAGE_ALIGN(len));
+ if (IS_ERR_VALUE(addr)) {
+ ret = addr;
+ goto unlock;
+ }
+ }
+ ret = __do_mmap(mm, file, addr, len, prot, flags, VM_SEALED,
+ pgoff, &populate, &uf);
+ /*
+ * Do not mm_populate() against a foreign mm; the target task will
+ * fault pages in on first access.
+ */
+unlock:
+ mmap_write_unlock(mm);
+ userfaultfd_unmap_complete(mm, &uf);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(vm_mmap_seal_remote);
+
/*
* Perform a userland memory mapping into the current process address space. See
* the comment for do_mmap() for more details on this operation in general.
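`vm_mmap_seal_remote()` maps a backing file that `seccomp_notify_pin_install()` has already required to carry a write seal. On the supervisor side, preparing such a memfd might look like the sketch below. `make_sealed_memfd()` is a hypothetical helper name, while `memfd_create()`, `ftruncate()`, `pwrite()` and `fcntl(F_ADD_SEALS)` are existing Linux interfaces. The sketch uses `F_SEAL_WRITE`, so the contents are fixed before the pin is installed. A supervisor that wants to keep updating the bytes would instead rely on `F_SEAL_FUTURE_WRITE` with a mapping created before sealing, as the patch's documentation describes.

```c
#define _GNU_SOURCE	/* for memfd_create() and the F_SEAL_* constants */
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a memfd of @size bytes whose first @len bytes are copied from
 * @data, then seal it against resizing and writing. Returns the fd, or -1
 * on failure. Hypothetical helper, shown for illustration only.
 */
static int make_sealed_memfd(const void *data, size_t len, size_t size)
{
	int fd;

	assert(len <= size);

	fd = memfd_create("pin", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)size) < 0 ||
	    pwrite(fd, data, len, 0) != (ssize_t)len ||
	    fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The returned fd is what a supervisor would place in the `memfd` field of the pin-install request, with `size` chosen as a multiple of the page size to satisfy the alignment checks.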
diff --git a/mm/vma.c b/mm/vma.c
index 9eea2850818a85..2f9159ab5123a3 100644
--- a/mm/vma.c
+++ b/mm/vma.c
@@ -2731,11 +2731,10 @@ static bool can_set_ksm_flags_early(struct mmap_state *map)
return false;
}
-static unsigned long __mmap_region(struct file *file, unsigned long addr,
- unsigned long len, vma_flags_t vma_flags,
+static unsigned long __mmap_region(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len, vma_flags_t vma_flags,
unsigned long pgoff, struct list_head *uf)
{
- struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
bool have_mmap_prepare = file && file->f_op->mmap_prepare;
VMA_ITERATOR(vmi, mm, addr);
@@ -2809,14 +2808,16 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
/**
* mmap_region() - Actually perform the userland mapping of a VMA into
- * current->mm with known, aligned and overflow-checked @addr and @len, and
+ * @mm with known, aligned and overflow-checked @addr and @len, and
* correctly determined VMA flags @vm_flags and page offset @pgoff.
*
* This is an internal memory management function, and should not be used
* directly.
*
- * The caller must write-lock current->mm->mmap_lock.
+ * The caller must write-lock @mm->mmap_lock.
*
+ * @mm: The mm_struct to install the mapping into. The caller must hold a
+ * reference and write-lock its mmap_lock.
* @file: If a file-backed mapping, a pointer to the struct file describing the
* file to be mapped, otherwise NULL.
* @addr: The page-aligned address at which to perform the mapping.
@@ -2830,15 +2831,16 @@ static unsigned long __mmap_region(struct file *file, unsigned long addr,
* Returns: Either an error, or the address at which the requested mapping has
* been performed.
*/
-unsigned long mmap_region(struct file *file, unsigned long addr,
- unsigned long len, vm_flags_t vm_flags,
- unsigned long pgoff, struct list_head *uf)
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len,
+ vm_flags_t vm_flags, unsigned long pgoff,
+ struct list_head *uf)
{
unsigned long ret;
bool writable_file_mapping = false;
const vma_flags_t vma_flags = legacy_to_vma_flags(vm_flags);
- mmap_assert_write_locked(current->mm);
+ mmap_assert_write_locked(mm);
/* Check to see if MDWE is applicable. */
if (map_deny_write_exec(&vma_flags, &vma_flags))
@@ -2857,13 +2859,13 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
writable_file_mapping = true;
}
- ret = __mmap_region(file, addr, len, vma_flags, pgoff, uf);
+ ret = __mmap_region(mm, file, addr, len, vma_flags, pgoff, uf);
/* Clear our write mapping regardless of error. */
if (writable_file_mapping)
mapping_unmap_writable(file->f_mapping);
- validate_mm(current->mm);
+ validate_mm(mm);
return ret;
}
@@ -2957,8 +2959,8 @@ int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
/**
* unmapped_area() - Find an area between the low_limit and the high_limit with
- * the correct alignment and offset, all from @info. Note: current->mm is used
- * for the search.
+ * the correct alignment and offset, all from @info. Note: @info->mm (or
+ * current->mm when it is NULL) is used for the search.
*
* @info: The unmapped area information including the range [low_limit -
* high_limit), the alignment offset and mask.
@@ -2970,7 +2972,7 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)
unsigned long length, gap;
unsigned long low_limit, high_limit;
struct vm_area_struct *tmp;
- VMA_ITERATOR(vmi, current->mm, 0);
+ VMA_ITERATOR(vmi, info->mm ? : current->mm, 0);
/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask + info->start_gap;
@@ -3016,7 +3018,8 @@ unsigned long unmapped_area(struct vm_unmapped_area_info *info)
/**
* unmapped_area_topdown() - Find an area between the low_limit and the
* high_limit with the correct alignment and offset at the highest available
- * address, all from @info. Note: current->mm is used for the search.
+ * address, all from @info. Note: @info->mm (or current->mm when it is NULL)
+ * is used for the search.
*
* @info: The unmapped area information including the range [low_limit -
* high_limit), the alignment offset and mask.
@@ -3028,7 +3031,7 @@ unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
unsigned long length, gap, gap_end;
unsigned long low_limit, high_limit;
struct vm_area_struct *tmp;
- VMA_ITERATOR(vmi, current->mm, 0);
+ VMA_ITERATOR(vmi, info->mm ? : current->mm, 0);
/* Adjust search length to account for worst case alignment overhead */
length = info->length + info->align_mask + info->start_gap;
diff --git a/mm/vma.h b/mm/vma.h
index 8e4b61a7304c68..4f5222ad2e9dde 100644
--- a/mm/vma.h
+++ b/mm/vma.h
@@ -459,9 +459,9 @@ bool vma_wants_writenotify(struct vm_area_struct *vma, pgprot_t vm_page_prot);
int mm_take_all_locks(struct mm_struct *mm);
void mm_drop_all_locks(struct mm_struct *mm);
-unsigned long mmap_region(struct file *file, unsigned long addr,
- unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf);
+unsigned long mmap_region(struct mm_struct *mm, struct file *file,
+ unsigned long addr, unsigned long len, vm_flags_t vm_flags,
+ unsigned long pgoff, struct list_head *uf);
int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *brkvma,
unsigned long addr, unsigned long request,
diff --git a/tools/testing/selftests/seccomp/seccomp_bpf.c b/tools/testing/selftests/seccomp/seccomp_bpf.c
index 358b6c65e120e8..1b1eec5051980d 100644
--- a/tools/testing/selftests/seccomp/seccomp_bpf.c
+++ b/tools/testing/selftests/seccomp/seccomp_bpf.c
@@ -217,6 +217,10 @@ struct seccomp_metadata {
#define SECCOMP_FILTER_FLAG_NEW_LISTENER (1UL << 3)
#endif
+#ifndef SECCOMP_FILTER_FLAG_REDIRECT
+#define SECCOMP_FILTER_FLAG_REDIRECT (1UL << 6)
+#endif
+
#ifndef SECCOMP_RET_USER_NOTIF
#define SECCOMP_RET_USER_NOTIF 0x7fc00000U
@@ -295,6 +299,35 @@ struct seccomp_notif_addfd_big {
#define PTRACE_EVENTMSG_SYSCALL_EXIT 2
#endif
+#ifndef SECCOMP_IOCTL_NOTIF_PIN_INSTALL
+struct seccomp_notif_pin_install {
+ __u64 id;
+ __u32 flags;
+ __u32 memfd;
+ __u64 target_addr;
+ __u64 size;
+ __u64 offset;
+};
+#define SECCOMP_IOCTL_NOTIF_PIN_INSTALL SECCOMP_IOWR(5, \
+ struct seccomp_notif_pin_install)
+#endif
+
+#ifndef SECCOMP_IOCTL_NOTIF_SEND_REDIRECT
+#define SECCOMP_REDIRECT_FLAG_CONTINUE (1UL << 0)
+#define SECCOMP_REDIRECT_ARGS 6
+struct seccomp_notif_resp_redirect {
+ __u64 id;
+ __u32 flags;
+ __u32 args_mask;
+ __u32 ptr_mask;
+ __u32 memfd;
+ __u64 args[SECCOMP_REDIRECT_ARGS];
+ __u64 ptr_len[SECCOMP_REDIRECT_ARGS];
+};
+#define SECCOMP_IOCTL_NOTIF_SEND_REDIRECT SECCOMP_IOW(6, \
+ struct seccomp_notif_resp_redirect)
+#endif
+
#ifndef SECCOMP_USER_NOTIF_FLAG_CONTINUE
#define SECCOMP_USER_NOTIF_FLAG_CONTINUE 0x00000001
#endif
@@ -4368,6 +4401,1000 @@ TEST(user_notification_addfd_rlimit)
close(memfd);
}
+/*
+ * Create a write-sealed memfd of @size for PIN_INSTALL and map a supervisor
+ * writable view, primed with @content. F_SEAL_FUTURE_WRITE keeps this
+ * pre-seal mapping writable (so the test can still stage content) while
+ * barring any other writable reference, as PIN_INSTALL requires. Returns
+ * the memfd.
+ */
+static int make_pin_memfd(struct __test_metadata *_metadata, const char *name,
+ size_t size, char **sup_view, const char *content)
+{
+ int memfd = memfd_create(name, MFD_ALLOW_SEALING);
+
+ ASSERT_GE(memfd, 0);
+ ASSERT_EQ(0, ftruncate(memfd, size));
+ ASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW));
+
+ *sup_view = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
+ memfd, 0);
+ ASSERT_NE(MAP_FAILED, *sup_view);
+ ASSERT_EQ(0, fcntl(memfd, F_ADD_SEALS, F_SEAL_FUTURE_WRITE));
+ memcpy(*sup_view, content, strlen(content) + 1);
+ return memfd;
+}
+
+/*
+ * Non-cooperative pinned-memfd: kernel installs a sealed PROT_READ
+ * MAP_SHARED mapping of the supervisor's memfd directly into the
+ * trapped task's mm. The target runs no mmap or mseal code itself —
+ * this exercises the same kernel path that a fork+execve sandbox
+ * supervisor would use to install a pin in the new image's fresh
+ * post-exec mm.
+ *
+ * Target child does nothing but call openat() on a bait path. The
+ * supervisor catches the trap, calls PIN_INSTALL (kernel does the
+ * mmap + seal in target's mm via vm_mmap_seal_remote()), writes a
+ * safe path into its own memfd view, and SEND_REDIRECTs args[1]
+ * into the freshly installed pin. The child's openat resumes,
+ * reads from the sealed pin, and returns an fd to the safe path.
+ */
+TEST(user_notification_pinned_memfd_remote)
+{
+ pid_t pid;
+ long ret;
+ int status, listener, memfd, unsealed;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_pin_install pin = {};
+ struct seccomp_notif_pin_install unsealed_pin = {};
+ struct seccomp_notif_resp_redirect redir = {};
+ char *sup_view;
+ const size_t PIN_SIZE = 4096;
+ const char *safe_path = "/dev/null";
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ memfd = make_pin_memfd(_metadata, "pinned-remote", PIN_SIZE,
+ &sup_view, safe_path);
+
+ listener = user_notif_syscall(__NR_openat,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_REDIRECT);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ int fd;
+
+ /*
+ * Target performs no setup. Just trap on openat. Kernel
+ * (driven by the supervisor) will install the pin in this
+ * process's mm at a kernel-chosen address behind our back,
+ * and our openat will be redirected to read from there.
+ */
+ fd = syscall(__NR_openat, AT_FDCWD,
+ "/this/should/never/be/touched", O_RDONLY, 0);
+ if (fd < 0)
+ _exit(11);
+ _exit(0);
+ }
+
+ ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+ EXPECT_EQ(req.data.nr, __NR_openat);
+
+ pin.id = req.id;
+ pin.memfd = memfd;
+ pin.target_addr = 0;
+ pin.size = PIN_SIZE;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, &pin)) {
+ if (errno == EINVAL) {
+ SKIP(goto cleanup,
+ "Kernel does not support pinned-memfd remote install");
+ }
+ TH_LOG("PIN_INSTALL failed: errno=%d", errno);
+ }
+
+ /* The kernel wrote a non-zero, page-aligned address back to us. */
+ EXPECT_NE(0, pin.target_addr);
+ EXPECT_EQ(0, pin.target_addr & (PIN_SIZE - 1));
+
+ /* Reject: the backing memfd must be write-sealed. */
+ unsealed = memfd_create("unsealed", MFD_ALLOW_SEALING);
+ ASSERT_GE(unsealed, 0);
+ ASSERT_EQ(0, ftruncate(unsealed, PIN_SIZE));
+ unsealed_pin.id = req.id;
+ unsealed_pin.memfd = unsealed;
+ unsealed_pin.size = PIN_SIZE;
+ EXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+ &unsealed_pin));
+ EXPECT_EQ(EINVAL, errno);
+ close(unsealed);
+
+ /* Reject: redirect outside any installed pin. */
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 1;
+ redir.ptr_mask = 1U << 1;
+ redir.memfd = memfd;
+ redir.ptr_len[1] = strlen(safe_path) + 1;
+ redir.args[1] = pin.target_addr + PIN_SIZE; /* first byte past the pin */
+ EXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir));
+ EXPECT_EQ(EFAULT, errno);
+
+ /* Reject: base is inside the pin but the extent runs past its end. */
+ redir.args[1] = pin.target_addr;
+ redir.ptr_len[1] = PIN_SIZE + 1;
+ EXPECT_EQ(-1, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir));
+ EXPECT_EQ(EFAULT, errno);
+
+ /* Happy path: redirect into the kernel-installed pin. */
+ redir.args[1] = pin.target_addr;
+ redir.ptr_len[1] = strlen(safe_path) + 1;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ TH_LOG("child exit %d (11=openat fail)", WEXITSTATUS(status));
+ }
+
+cleanup:
+ munmap(sup_view, PIN_SIZE);
+ close(memfd);
+ close(listener);
+}
+
+/*
+ * Helper for the execve test: read up to @max bytes of a NUL-terminated
+ * string from @pid's mm at @addr into @out. Returns the length read
+ * (excluding the NUL), or -1 on failure or no NUL.
+ */
+static ssize_t read_remote_string(pid_t pid, unsigned long addr,
+ char *out, size_t max)
+{
+ struct iovec local = { .iov_base = out, .iov_len = max };
+ struct iovec remote = { .iov_base = (void *)addr, .iov_len = max };
+ ssize_t n;
+ size_t i;
+
+ n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
+ if (n <= 0)
+ return -1;
+ for (i = 0; i < (size_t)n; i++)
+ if (out[i] == '\0')
+ return (ssize_t)i;
+ return -1;
+}
+
+/*
+ * Send a file descriptor over a connected UNIX socket via SCM_RIGHTS.
+ * Used by the execve_scm test so the target child can hand its
+ * SECCOMP_FILTER_FLAG_NEW_LISTENER fd to the supervising parent
+ * without the parent having to inherit the seccomp filter itself.
+ */
+static int send_fd(int sock, int fd)
+{
+ char cbuf[CMSG_SPACE(sizeof(int))] = {};
+ char data = 'x';
+ struct iovec iov = { .iov_base = &data, .iov_len = 1 };
+ struct msghdr msg = {
+ .msg_iov = &iov, .msg_iovlen = 1,
+ .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
+ };
+ struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
+
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(int));
+ memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
+ return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
+}
+
+static int recv_fd(int sock)
+{
+ char cbuf[CMSG_SPACE(sizeof(int))] = {};
+ char data;
+ struct iovec iov = { .iov_base = &data, .iov_len = 1 };
+ struct msghdr msg = {
+ .msg_iov = &iov, .msg_iovlen = 1,
+ .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
+ };
+ struct cmsghdr *cmsg;
+ int fd;
+
+ if (recvmsg(sock, &msg, 0) < 0)
+ return -1;
+ cmsg = CMSG_FIRSTHDR(&msg);
+ if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
+ cmsg->cmsg_type != SCM_RIGHTS ||
+ cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
+ return -1;
+ memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
+ return fd;
+}
+
+struct addr_range {
+ unsigned long start, end;
+};
+
+/*
+ * Parse /proc/<pid>/maps looking for the dynamic linker's executable
+ * mapping (glibc ld-linux-*.so, musl ld-musl-*.so, etc.). The trapped
+ * task's instruction_pointer falling in this range identifies a
+ * loader-bootstrap syscall (race-free, kernel-truth) so the supervisor
+ * can auto-allow it without inspecting argument content via the racy
+ * process_vm_readv path.
+ *
+ * Requires the supervisor not to be subject to the seccomp filter
+ * itself -- fopen() internally calls openat(). The execve_scm test
+ * structure (child installs filter, sends listener fd to parent via
+ * SCM_RIGHTS) satisfies that.
+ *
+ * Returns 0 on success with @out populated, -1 if not found.
+ */
+static int find_loader_text_range(pid_t pid, struct addr_range *out)
+{
+ char maps_path[64];
+ char line[512];
+ FILE *f;
+ int found = 0;
+
+ snprintf(maps_path, sizeof(maps_path), "/proc/%d/maps", pid);
+ f = fopen(maps_path, "r");
+ if (!f)
+ return -1;
+
+ while (fgets(line, sizeof(line), f)) {
+ unsigned long start, end;
+ char perms[8];
+ char *path;
+
+ if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) != 3)
+ continue;
+ if (!strchr(perms, 'x'))
+ continue;
+ path = strchr(line, '/');
+ if (!path)
+ continue;
+ /*
+ * Match common dynamic-linker basenames: ld-linux-*.so
+ * (glibc), ld-musl-*.so (musl), ld-*.so (older glibc).
+ */
+ if (strstr(path, "/ld-") || strstr(path, "/ld.so")) {
+ out->start = start;
+ out->end = end;
+ found = 1;
+ break;
+ }
+ }
+ fclose(f);
+ return found ? 0 : -1;
+}
+
+/*
+ * Non-cooperative pinned-memfd across a real execve, using the proper
+ * supervisor-isolation pattern: the child (target) installs the seccomp
+ * filter on itself and sends its listener fd to the parent (supervisor)
+ * via SCM_RIGHTS over a socketpair. The parent therefore does not carry
+ * the seccomp filter and can freely call openat() -- which is what makes
+ * the race-free, kernel-truth loader detection (req.data.instruction_pointer
+ * + /proc/<pid>/maps) actually usable.
+ *
+ * Phase 1: child does a pre-execve openat; the supervisor PIN_INSTALLs and
+ * SEND_REDIRECTs. Phase 2: child execve's, so the pre-execve pin VMA dies
+ * with the old mm. Phase 3: in the fresh post-execve mm the supervisor
+ * PIN_INSTALLs again (idempotent replace of the stale bookkeeping) and
+ * SEND_REDIRECTs, proving the full redirect mechanism survives an mm
+ * replacement, not just the install side.
+ */
+TEST(user_notification_pinned_memfd_execve_scm)
+{
+ pid_t pid;
+ int status, listener, memfd, sv[2];
+ struct seccomp_notif req = {};
+ struct seccomp_notif_pin_install pin = {};
+ struct seccomp_notif_resp_redirect redir = {};
+ struct seccomp_notif_resp cont_resp = {};
+ char *sup_view;
+ const size_t PIN_SIZE = 4096;
+ const char *safe_path = "/dev/null";
+ const char *bait = "/seccomp_pinned_memfd_test_bait_scm";
+ bool post_exec_install_ok = false;
+ bool post_exec_redirect_done = false;
+ bool loader_known = false;
+ bool loader_check_attempted = false;
+ struct addr_range loader_range = {};
+ int phase = 0;
+ int trap_count = 0;
+ const int trap_limit = 200;
+
+ if (access("/bin/cat", X_OK) != 0)
+ SKIP(return, "/bin/cat not present");
+
+ memfd = make_pin_memfd(_metadata, "pin-execve-scm", PIN_SIZE,
+ &sup_view, safe_path);
+
+ ASSERT_EQ(0, socketpair(AF_UNIX, SOCK_SEQPACKET, 0, sv));
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ struct sock_filter filter[] = {
+ BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+ offsetof(struct seccomp_data, nr)),
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_openat,
+ 0, 1),
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_USER_NOTIF),
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog prog = {
+ .len = (unsigned short)ARRAY_SIZE(filter),
+ .filter = filter,
+ };
+ int my_listener;
+ int fd;
+
+ close(sv[0]);
+ if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
+ _exit(20);
+ my_listener = seccomp(SECCOMP_SET_MODE_FILTER,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_REDIRECT,
+ &prog);
+ if (my_listener < 0)
+ _exit(21);
+ if (send_fd(sv[1], my_listener) < 0)
+ _exit(22);
+ close(my_listener);
+ close(sv[1]);
+
+ /* Pre-execve trap. */
+ fd = syscall(__NR_openat, AT_FDCWD,
+ "/this/should/never/be/touched", O_RDONLY, 0);
+ if (fd < 0)
+ _exit(11);
+
+ execl("/bin/cat", "cat", bait, (char *)NULL);
+ _exit(12);
+ }
+
+ close(sv[1]);
+ listener = recv_fd(sv[0]);
+ close(sv[0]);
+ ASSERT_GE(listener, 0);
+
+ /*
+ * Parent has the listener fd and does NOT have the seccomp
+ * filter. fopen(/proc/<pid>/maps) below works without
+ * deadlocking on the parent's own openat.
+ */
+ for (;;) {
+ struct pollfd pfd = { .fd = listener, .events = POLLIN };
+ int pret = poll(&pfd, 1, 500);
+ pid_t reaped;
+ bool ip_in_loader;
+
+ if (pret < 0)
+ break;
+ if (pret == 0 || !(pfd.revents & POLLIN)) {
+ reaped = waitpid(pid, &status, WNOHANG);
+ if (reaped == pid)
+ break;
+ if (pfd.revents & (POLLHUP | POLLERR))
+ break;
+ continue;
+ }
+
+ memset(&req, 0, sizeof(req));
+ if (ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req) < 0) {
+ TH_LOG("NOTIF_RECV failed: errno=%d", errno);
+ break;
+ }
+ if (++trap_count > trap_limit) {
+ TH_LOG("trap_limit (%d) exceeded", trap_limit);
+ break;
+ }
+
+ if (phase == 0) {
+ pin.id = req.id;
+ pin.memfd = memfd;
+ pin.target_addr = 0;
+ pin.size = PIN_SIZE;
+ if (ioctl(listener,
+ SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+ &pin) != 0) {
+ if (errno == EINVAL)
+ SKIP(goto cleanup_scm,
+ "Kernel lacks pinned-memfd remote");
+ TH_LOG("pre-exec PIN_INSTALL failed: errno=%d",
+ errno);
+ goto cleanup_scm;
+ }
+
+ memset(&redir, 0, sizeof(redir));
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 1;
+ redir.ptr_mask = 1U << 1;
+ redir.memfd = memfd;
+ redir.ptr_len[1] = strlen(safe_path) + 1;
+ redir.args[1] = pin.target_addr;
+ if (ioctl(listener,
+ SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir) != 0) {
+ TH_LOG("pre-exec SEND_REDIRECT failed: errno=%d",
+ errno);
+ goto cleanup_scm;
+ }
+ phase = 1;
+ continue;
+ }
+
+ /*
+ * Post-execve. Lazily resolve the loader range. The
+ * supervisor's own openat (fopen on /proc/<pid>/maps)
+ * doesn't trap because the filter lives on the child,
+ * not on us.
+ */
+ if (!loader_known && !loader_check_attempted) {
+ if (find_loader_text_range(req.pid,
+ &loader_range) == 0)
+ loader_known = true;
+ loader_check_attempted = true;
+ }
+
+ ip_in_loader = loader_known &&
+ req.data.instruction_pointer >= loader_range.start &&
+ req.data.instruction_pointer < loader_range.end;
+
+ if (ip_in_loader) {
+ memset(&cont_resp, 0, sizeof(cont_resp));
+ cont_resp.id = req.id;
+ cont_resp.flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND, &cont_resp);
+ continue;
+ }
+
+ /* Program code: inspect the path to identify the bait. */
+ {
+ char path[PATH_MAX];
+ ssize_t n;
+
+ n = read_remote_string(req.pid, req.data.args[1],
+ path, sizeof(path));
+ if (n < 0 || strcmp(path, bait) != 0) {
+ memset(&cont_resp, 0, sizeof(cont_resp));
+ cont_resp.id = req.id;
+ cont_resp.flags =
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,
+ &cont_resp);
+ continue;
+ }
+
+ pin.id = req.id;
+ pin.memfd = memfd;
+ pin.target_addr = 0;
+ pin.size = PIN_SIZE;
+ if (ioctl(listener,
+ SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+ &pin) == 0) {
+ post_exec_install_ok = true;
+ } else {
+ TH_LOG("post-exec PIN_INSTALL failed: errno=%d",
+ errno);
+ memset(&cont_resp, 0, sizeof(cont_resp));
+ cont_resp.id = req.id;
+ cont_resp.flags =
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,
+ &cont_resp);
+ continue;
+ }
+
+ memset(&redir, 0, sizeof(redir));
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 1;
+ redir.ptr_mask = 1U << 1;
+ redir.memfd = memfd;
+ redir.ptr_len[1] = strlen(safe_path) + 1;
+ redir.args[1] = pin.target_addr;
+ if (ioctl(listener,
+ SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir) == 0) {
+ post_exec_redirect_done = true;
+ } else {
+ TH_LOG("post-exec SEND_REDIRECT failed: errno=%d",
+ errno);
+ memset(&cont_resp, 0, sizeof(cont_resp));
+ cont_resp.id = req.id;
+ cont_resp.flags =
+ SECCOMP_USER_NOTIF_FLAG_CONTINUE;
+ ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND,
+ &cont_resp);
+ }
+ }
+ }
+
+ if (waitpid(pid, &status, WNOHANG) == 0) {
+ kill(pid, SIGKILL);
+ waitpid(pid, &status, 0);
+ }
+ EXPECT_EQ(true, loader_known) {
+ TH_LOG("find_loader_text_range never resolved");
+ }
+ EXPECT_EQ(true, post_exec_install_ok);
+ EXPECT_EQ(true, post_exec_redirect_done);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status));
+
+cleanup_scm:
+ munmap(sup_view, PIN_SIZE);
+ close(memfd);
+ close(listener);
+}
+
+/*
+ * Stateless redirect validation must hold up across many short-lived
+ * targets over one listener, and must not accumulate per-target state.
+ *
+ * PIN_INSTALL records nothing: the installed VM_SEALED VMA is the only
+ * record, and SEND_REDIRECT re-validates the pointer against the live
+ * mapping (sealed, read-only, backed by the supervisor's memfd inode).
+ * So a supervisor servicing a long churn of targets keeps working with
+ * no bookkeeping to leak. Each iteration lets the kernel choose the pin
+ * address in the fresh target mm; every install/redirect must succeed, and
+ * kmemleak/KASAN over the loop confirms nothing accumulates.
+ */
+TEST(user_notification_pinned_memfd_churn)
+{
+ const size_t PIN_SIZE = 4096;
+ const char *safe_path = "/dev/null";
+ const int iters = 16;
+ int listener, memfd, i;
+ char *sup_view;
+ long ret;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ memfd = make_pin_memfd(_metadata, "pinned-reap", PIN_SIZE,
+ &sup_view, safe_path);
+
+ listener = user_notif_syscall(__NR_openat,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_REDIRECT);
+ ASSERT_GE(listener, 0);
+
+ for (i = 0; i < iters; i++) {
+ struct seccomp_notif req = {};
+ struct seccomp_notif_pin_install pin = {};
+ struct seccomp_notif_resp_redirect redir = {};
+ int status;
+ pid_t pid;
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+ if (pid == 0) {
+ int fd = syscall(__NR_openat, AT_FDCWD,
+ "/never/touched", O_RDONLY, 0);
+ _exit(fd < 0 ? 11 : 0);
+ }
+
+ ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+ EXPECT_EQ(req.data.nr, __NR_openat);
+
+ pin.id = req.id;
+ pin.memfd = memfd;
+ pin.target_addr = 0;
+ pin.size = PIN_SIZE;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL,
+ &pin)) {
+ if (errno == EINVAL) {
+ kill(pid, SIGKILL);
+ waitpid(pid, &status, 0);
+ SKIP(goto cleanup,
+ "Kernel lacks pinned-memfd remote install");
+ }
+ TH_LOG("iter %d PIN_INSTALL failed: errno=%d", i, errno);
+ }
+
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 1;
+ redir.ptr_mask = 1U << 1;
+ redir.memfd = memfd;
+ redir.ptr_len[1] = strlen(safe_path) + 1;
+ redir.args[1] = pin.target_addr;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ TH_LOG("iter %d child exit %d (11=openat fail)",
+ i, WEXITSTATUS(status));
+ }
+ /*
+ * Target is dead now; its pin (this iter's mm, at the
+ * kernel-chosen address) went away with that mm. Nothing
+ * was recorded at install time, so the next iteration's
+ * PIN_INSTALL starts clean with no stale state to reap.
+ */
+ }
+
+cleanup:
+ munmap(sup_view, PIN_SIZE);
+ close(memfd);
+ close(listener);
+}
+
+#ifdef __NR_socket
+/*
+ * A redirect must not let an inner (more recently installed) filter's
+ * notifier smuggle a syscall past an outer filter. Two filters are
+ * stacked on the target:
+ *
+ * outer (installed first): socket(AF_INET, ...) -> RET_ERRNO(EACCES),
+ * everything else ALLOW.
+ * inner (installed second): socket -> RET_USER_NOTIF.
+ *
+ * The child calls socket(AF_UNIX, ...), which the outer filter allows, so
+ * the inner notifier wins and fires. The supervisor SEND_REDIRECTs arg0
+ * to AF_INET. The kernel must then re-run the outer filter against the
+ * rewritten registers and block it with EACCES; without the outer-suffix
+ * re-validation the inner filter would have bypassed the outer policy.
+ */
+TEST(user_notification_redirect_outer_refilter)
+{
+ struct sock_filter outer_filter[] = {
+ BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
+ offsetof(struct seccomp_data, nr)),
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_socket, 0, 3),
+ BPF_STMT(BPF_LD | BPF_W | BPF_ABS, syscall_arg(0)),
+ BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AF_INET, 0, 1),
+ BPF_STMT(BPF_RET | BPF_K,
+ SECCOMP_RET_ERRNO | (EACCES & SECCOMP_RET_DATA)),
+ BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
+ };
+ struct sock_fprog outer_prog = {
+ .len = (unsigned short)ARRAY_SIZE(outer_filter),
+ .filter = outer_filter,
+ };
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp_redirect redir = {};
+ int status, listener;
+ pid_t pid;
+ long ret;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ /* Outer filter first => it becomes the outer/root of the stack. */
+ ASSERT_EQ(0, seccomp(SECCOMP_SET_MODE_FILTER, 0, &outer_prog));
+
+ /* Inner USER_NOTIF filter second (innermost); returns the listener. */
+ listener = user_notif_syscall(__NR_socket,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_REDIRECT);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ int fd = syscall(__NR_socket, AF_UNIX, SOCK_STREAM, 0);
+
+ if (fd >= 0)
+ _exit(12); /* bypass: outer filter was skipped */
+ if (errno != EACCES)
+ _exit(13); /* unexpected errno */
+ _exit(0);
+ }
+
+ ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+ EXPECT_EQ(req.data.nr, __NR_socket);
+ EXPECT_EQ(req.data.args[0], AF_UNIX);
+
+ /* Scalar redirect of arg0 (no pin needed): AF_UNIX -> AF_INET. */
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 0;
+ redir.args[0] = AF_INET;
+ ret = ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT, &redir);
+ if (ret < 0 && errno == EINVAL) {
+ kill(pid, SIGKILL);
+ waitpid(pid, &status, 0);
+ SKIP(return, "Kernel lacks SECCOMP_IOCTL_NOTIF_SEND_REDIRECT");
+ }
+ EXPECT_EQ(0, ret);
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ switch (WEXITSTATUS(status)) {
+ case 12:
+ TH_LOG("child exit 12: redirect bypassed the outer filter");
+ break;
+ case 13:
+ TH_LOG("child exit 13: socket failed with unexpected errno");
+ break;
+ default:
+ TH_LOG("child exit %d (unexpected)", WEXITSTATUS(status));
+ }
+ }
+
+ close(listener);
+}
+#endif /* __NR_socket */
+
+#ifdef __x86_64__
+/*
+ * Load-bearing ABI check: after SEND_REDIRECT, the trapped task's
+ * redirected arg register must be restored to its original value
+ * before user-mode code resumes. The kernel's restore mechanism
+ * (task_work_add(TWA_SIGNAL) -> seccomp_redirect_restore_cb) is
+ * what guarantees this; without a test the property is just an
+ * assertion. Bypass libc's syscall() wrapper (which caller-saves
+ * arg values and would mask a restore bug) and capture the actual
+ * arg register immediately after the SYSCALL instruction.
+ *
+ * The child issues openat with RSI = sentinel_path. The supervisor
+ * SEND_REDIRECTs args[1] (RSI) to point into the pin. The kernel:
+ * - saves the original RSI into the knotif
+ * - writes the pin address into RSI via syscall_set_arguments()
+ * - runs the syscall (kernel reads path from the pin)
+ * - on syscall_exit_to_user_mode, fires task_work which calls
+ * syscall_set_arguments() again with the saved original
+ * - returns to user mode
+ *
+ * If task_work fires correctly, the child observes RSI == sentinel.
+ * If broken, RSI holds the pin address (the redirected value the
+ * kernel left in pt_regs).
+ */
+TEST(user_notification_pinned_memfd_abi)
+{
+ pid_t pid;
+ long ret;
+ int status, listener, memfd;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_pin_install pin = {};
+ struct seccomp_notif_resp_redirect redir = {};
+ char *sup_view;
+ const size_t PIN_SIZE = 4096;
+ const char *safe_path = "/dev/null";
+ /*
+ * The "sentinel" is a real string the child can also pass as
+ * the openat path. Its address is captured pre-syscall as RSI;
+ * post-syscall RSI must equal the same address.
+ */
+ static const char sentinel_path[] = "/seccomp_abi_sentinel";
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ memfd = make_pin_memfd(_metadata, "pin-abi", PIN_SIZE,
+ &sup_view, safe_path);
+
+ listener = user_notif_syscall(__NR_openat,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_REDIRECT);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ register long r10_val asm("r10") = 0;
+ unsigned long rsi_after;
+ long fd;
+
+ asm volatile(
+ "syscall\n\t"
+ "mov %%rsi, %[after]"
+ : "=a"(fd), [after] "=&r"(rsi_after)
+ : "0"((long)__NR_openat),
+ "D"((long)AT_FDCWD),
+ "S"((unsigned long)sentinel_path),
+ "d"((long)O_RDONLY),
+ "r"(r10_val)
+ : "rcx", "r11", "memory"
+ );
+
+ if (fd < 0)
+ _exit(11);
+ /*
+ * Load-bearing check: RSI immediately post-SYSCALL must
+ * still be the sentinel pointer the child passed in. The
+ * kernel's REDIRECT-then-restore mechanism is the only
+ * thing that guarantees this; a broken restore would leave
+ * the pin address in RSI.
+ */
+ if (rsi_after != (unsigned long)sentinel_path)
+ _exit(12);
+ _exit(0);
+ }
+
+ ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+ EXPECT_EQ(req.data.nr, __NR_openat);
+ EXPECT_EQ(req.data.args[1], (unsigned long)sentinel_path);
+
+ pin.id = req.id;
+ pin.memfd = memfd;
+ pin.target_addr = 0;
+ pin.size = PIN_SIZE;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_PIN_INSTALL, &pin)) {
+ if (errno == EINVAL)
+ SKIP(goto cleanup,
+ "Kernel lacks pinned-memfd remote install");
+ }
+
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 1;
+ redir.ptr_mask = 1U << 1;
+ redir.memfd = memfd;
+ redir.ptr_len[1] = strlen(safe_path) + 1;
+ redir.args[1] = pin.target_addr;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ switch (WEXITSTATUS(status)) {
+ case 11:
+ TH_LOG("child exit 11: openat returned -errno");
+ break;
+ case 12:
+ TH_LOG("child exit 12: ABI violation -- RSI not restored after redirect");
+ break;
+ default:
+ TH_LOG("child exit %d (unexpected)", WEXITSTATUS(status));
+ }
+ }
+
+cleanup:
+ munmap(sup_view, PIN_SIZE);
+ close(memfd);
+ close(listener);
+}
+
+static void redir_sigusr1_handler(int signo)
+{
+ /* _exit() is async-signal-safe; bail with a distinct code if the
+ * signal frame was clobbered so the handler sees the wrong signo.
+ */
+ if (signo != SIGUSR1)
+ _exit(12);
+}
+
+/*
+ * Regression test: a redirect's deferred arg-register restore must run
+ * before a signal frame is built, not after.
+ *
+ * The restore was queued as a TWA_RESUME task_work, which runs in
+ * exit_to_user_mode_loop() *after* arch_do_signal_or_restart() has
+ * already set up the handler frame (regs->di = signo, regs->si =
+ * &siginfo, regs->dx = &ucontext). The restore then overwrote those
+ * registers with the trapped syscall's original argument values, so the
+ * handler was entered with a corrupted signal number. Queuing the
+ * restore with TWA_SIGNAL makes it run at the top of get_signal(),
+ * before the frame is built (and before any syscall-restart rewind).
+ *
+ * The child traps on pause(), the supervisor redirects arg0 (RDI), and
+ * then interrupts it with SIGUSR1. The handler must observe
+ * signo == SIGUSR1, not the leaked original RDI sentinel.
+ */
+TEST(user_notification_redirect_signal_abi)
+{
+ pid_t pid;
+ long ret;
+ int status, listener;
+ struct seccomp_notif req = {};
+ struct seccomp_notif_resp_redirect redir = {};
+ /* A recognizable original RDI the broken restore would leak in. */
+ const unsigned long RDI_SENTINEL = 0x5a5a5a5aUL;
+
+ ret = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
+ ASSERT_EQ(0, ret) {
+ TH_LOG("Kernel does not support PR_SET_NO_NEW_PRIVS!");
+ }
+
+ listener = user_notif_syscall(__NR_pause,
+ SECCOMP_FILTER_FLAG_NEW_LISTENER |
+ SECCOMP_FILTER_FLAG_REDIRECT);
+ ASSERT_GE(listener, 0);
+
+ pid = fork();
+ ASSERT_GE(pid, 0);
+
+ if (pid == 0) {
+ struct sigaction sa = {
+ .sa_handler = redir_sigusr1_handler,
+ };
+ long rc;
+
+ if (sigaction(SIGUSR1, &sa, NULL))
+ _exit(10);
+
+ /* Raw pause() carrying a controlled RDI sentinel. */
+ asm volatile(
+ "syscall"
+ : "=a"(rc)
+ : "0"((long)__NR_pause),
+ "D"(RDI_SENTINEL)
+ : "rcx", "r11", "memory");
+
+ /* pause() returns -EINTR once the handler has run. */
+ if (rc != -EINTR)
+ _exit(11);
+ _exit(0);
+ }
+
+ ASSERT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_RECV, &req));
+ EXPECT_EQ(req.data.nr, __NR_pause);
+ EXPECT_EQ(req.data.args[0], RDI_SENTINEL);
+
+ /* Redirect arg0 (non-pointer); this arms the original-RDI restore. */
+ redir.id = req.id;
+ redir.flags = SECCOMP_REDIRECT_FLAG_CONTINUE;
+ redir.args_mask = 1U << 0;
+ redir.args[0] = 0;
+ EXPECT_EQ(0, ioctl(listener, SECCOMP_IOCTL_NOTIF_SEND_REDIRECT,
+ &redir)) {
+ if (errno == EINVAL) {
+ kill(pid, SIGKILL);
+ waitpid(pid, &status, 0);
+ SKIP(goto cleanup,
+ "Kernel lacks SECCOMP_IOCTL_NOTIF_SEND_REDIRECT");
+ }
+ }
+
+ /* Let the child block in pause(), then deliver a handled SIGUSR1. */
+ usleep(100000);
+ EXPECT_EQ(0, kill(pid, SIGUSR1));
+
+ EXPECT_EQ(waitpid(pid, &status, 0), pid);
+ EXPECT_EQ(true, WIFEXITED(status));
+ EXPECT_EQ(0, WEXITSTATUS(status)) {
+ switch (WEXITSTATUS(status)) {
+ case 10:
+ TH_LOG("child exit 10: sigaction failed");
+ break;
+ case 11:
+ TH_LOG("child exit 11: pause() did not return -EINTR");
+ break;
+ case 12:
+ TH_LOG("child exit 12: handler saw wrong signo (frame clobbered)");
+ break;
+ default:
+ TH_LOG("child exit %d (unexpected)", WEXITSTATUS(status));
+ }
+ }
+
+cleanup:
+ close(listener);
+}
+#endif /* __x86_64__ */
+
#ifndef SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP
#define SECCOMP_USER_NOTIF_FD_SYNC_WAKE_UP (1UL << 0)
#define SECCOMP_IOCTL_NOTIF_SET_FLAGS SECCOMP_IOW(4, __u64)
|
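Independent of seccomp, the user-visible contract that `user_notification_redirect_signal_abi` checks (the handler receives the delivered signal number, and the interrupted blocking call then fails with `EINTR`) can be reproduced in plain userspace. The sketch below is an illustration only and is not part of the patch: `wait_for_sigusr1` and `record_signo` are hypothetical names, it swaps `pause()` for `sigsuspend()` so that no sleep is needed to avoid the wakeup race, and return code 13 is an addition to the 10/11/12 codes used by the test. Note that the libc wrapper reports `-1` with `errno == EINTR`, whereas the raw syscall in the test returns `-EINTR` directly.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>

/* Written only from the handler; sig_atomic_t keeps the store signal-safe. */
static volatile sig_atomic_t seen_signo;

static void record_signo(int signo)
{
	seen_signo = signo;
}

/*
 * Hypothetical helper: returns 0 when the handler observed SIGUSR1 and the
 * wait was interrupted with EINTR. Non-zero codes loosely mirror the child
 * exit codes of the selftest (10: sigaction failed, 11: wrong return value,
 * 12: wrong signo); 13 is specific to this sketch.
 */
int wait_for_sigusr1(void)
{
	struct sigaction sa, old_sa;
	sigset_t block, old_mask, wait_mask;
	int rc, err;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = record_signo;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, &old_sa))
		return 10;

	/* Block SIGUSR1 so the raise() below leaves it pending. */
	sigemptyset(&block);
	sigaddset(&block, SIGUSR1);
	if (sigprocmask(SIG_BLOCK, &block, &old_mask))
		return 13;

	seen_signo = 0;
	raise(SIGUSR1);

	/* Atomically unblock SIGUSR1 and wait; the handler runs first. */
	wait_mask = old_mask;
	sigdelset(&wait_mask, SIGUSR1);
	rc = sigsuspend(&wait_mask);
	err = errno;

	/* Restore the caller's mask and disposition before reporting. */
	sigprocmask(SIG_SETMASK, &old_mask, NULL);
	sigaction(SIGUSR1, &old_sa, NULL);

	if (rc != -1 || err != EINTR)
		return 11;
	if (seen_signo != SIGUSR1)
		return 12;
	return 0;
}
```

Calling `wait_for_sigusr1()` from a single-threaded program should return 0 on a kernel that honors the signal ABI; the regression described in the test comment would correspond to the code-12 branch.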
| 3/2 | 2026/06/27 01:53 | llm | patch-evaluator |
0mModel:gemini-3.1-pro-preview Args: null Results: null Tokens: input: 30826 output: 19 thoughts: 494 Thoughts: **Fuzzing Plan for New Seccomp Feature** Okay, so I've got this new patch in front of me, and my initial assessment is that this **absolutely** warrants fuzzing. It's not just a minor tweak; this adds a new "non-cooperative pinned-memfd redirect" feature to seccomp, which means functional changes to the kernel's core. Specifically, we're talking about new ioctls, `SECCOMP_IOCTL_NOTIF_PIN_INSTALL` and `SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`, a new flag `SECCOMP_FILTER_FLAG_REDIRECT`, and the touchpoints include `kernel/seccomp.c`, various memory management files (`mm/mmap.c`, `mm/nommu.c`, `mm/util.c`, `mm/vma.c`), and some new tests. That's enough to trigger my fuzzing radar. Now, let's nail down the critical kernel functions we need to target. From `kernel/seccomp.c`, I've got these on my list: `seccomp_notify_pin_install`, `seccomp_install_pin`, `seccomp_pin_check`, `seccomp_notify_send_redirect`, `seccomp_redirect_restore_cb`, and `seccomp_redirect_is_sigreturn`. We also have some interesting memory management calls that we'll need to account for, including `vm_mmap_seal_remote` in `mm/util.c` and `mm_get_unmapped_area_remote` in `mm/mmap.c`. Next, I need to verify that we have all the right config options enabled. Looks like the core of the code is under `CONFIG_SECCOMP_FILTER` and `CONFIG_SECCOMP` within `kernel/seccomp.c`. `CONFIG_MEMFD_CREATE` is probably enabled already, but I'll make sure to double check that. Specifically, I'll need to check the preprocessor directives for `#ifdef CONFIG_SECCOMP_FILTER` to ensure the new ioctls added in `seccomp_notify_ioctl` are enabled, because they are key to the functionality. Also, I need to check for any dependencies on `VM_SEALED` and make sure it is not reliant on any config options that might not be enabled. I'll need to do a `grep` for `VM_SEALED`. This looks like a solid plan. |
| 4/2 | 2026/06/27 01:53 | tool | grepper |
0mArgs:{
"Expression": "VM_SEALED"
}Results: {
"Output": "arch/arm64/kernel/vdso.c=99=static int __setup_additional_pages(enum vdso_abi abi,\n--\narch/arm64/kernel/vdso.c-131-\t\t\t\t VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|\narch/arm64/kernel/vdso.c:132:\t\t\t\t VM_SEALED_SYSMAP,\narch/arm64/kernel/vdso.c-133-\t\t\t\t vdso_info[abi].cm);\n--\narch/arm64/kernel/vdso.c=245=static int aarch32_kuser_helpers_setup(struct mm_struct *mm)\n--\narch/arm64/kernel/vdso.c-258-\t\t\t\t VM_MAYREAD | VM_MAYEXEC |\narch/arm64/kernel/vdso.c:259:\t\t\t\t VM_SEALED_SYSMAP,\narch/arm64/kernel/vdso.c-260-\t\t\t\t \u0026aarch32_vdso_maps[AA32_MAP_VECTORS]);\n--\narch/arm64/kernel/vdso.c=265=static int aarch32_sigreturn_setup(struct mm_struct *mm)\n--\narch/arm64/kernel/vdso.c-282-\t\t\t\t VM_MAYWRITE | VM_MAYEXEC |\narch/arm64/kernel/vdso.c:283:\t\t\t\t VM_SEALED_SYSMAP,\narch/arm64/kernel/vdso.c-284-\t\t\t\t \u0026aarch32_vdso_maps[AA32_MAP_SIGPAGE]);\n--\narch/loongarch/kernel/vdso.c=81=int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)\n--\narch/loongarch/kernel/vdso.c-113-\t\t\t\t VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC |\narch/loongarch/kernel/vdso.c:114:\t\t\t\t VM_SEALED_SYSMAP,\narch/loongarch/kernel/vdso.c-115-\t\t\t\t \u0026info-\u003ecode_mapping);\n--\narch/riscv/kernel/vdso.c=115=static int __setup_additional_pages(struct mm_struct *mm,\n--\narch/riscv/kernel/vdso.c-143-\t _install_special_mapping(mm, vdso_base, vdso_text_len,\narch/riscv/kernel/vdso.c:144:\t\t(VM_READ | VM_EXEC | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC | VM_SEALED_SYSMAP),\narch/riscv/kernel/vdso.c-145-\t\tvdso_info-\u003ecm);\n--\narch/s390/kernel/vdso.c=46=static int map_vdso(unsigned long addr, unsigned long vdso_mapping_len)\n--\narch/s390/kernel/vdso.c-68-\tvma = _install_special_mapping(mm, vdso_text_start, vdso_text_len,\narch/s390/kernel/vdso.c:69:\t\t\t\t VM_READ|VM_EXEC|VM_SEALED_SYSMAP|\narch/s390/kernel/vdso.c-70-\t\t\t\t VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,\n--\narch/x86/entry/vdso/vma.c=156=static int map_vdso(const struct 
vdso_image *image, unsigned long addr)\n--\narch/x86/entry/vdso/vma.c-182-\t\t\t\t VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|\narch/x86/entry/vdso/vma.c:183:\t\t\t\t VM_SEALED_SYSMAP,\narch/x86/entry/vdso/vma.c-184-\t\t\t\t \u0026vdso_mapping);\n--\narch/x86/entry/vdso/vma.c-201-\t\t\t\t VM_READ|VM_MAYREAD|VM_IO|VM_DONTDUMP|\narch/x86/entry/vdso/vma.c:202:\t\t\t\t VM_PFNMAP|VM_SEALED_SYSMAP,\narch/x86/entry/vdso/vma.c-203-\t\t\t\t \u0026vvar_vclock_mapping);\n--\nfs/proc/task_mmu.c=1166=static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)\n--\nfs/proc/task_mmu.c-1246-#ifdef CONFIG_64BIT\nfs/proc/task_mmu.c:1247:\t\t[ilog2(VM_SEALED)] = \"sl\",\nfs/proc/task_mmu.c-1248-#endif\n--\ninclude/linux/mm.h=293=enum {\n--\ninclude/linux/mm.h-515-#define VM_ALLOW_ANY_UNCACHED\tINIT_VM_FLAG(ALLOW_ANY_UNCACHED)\ninclude/linux/mm.h:516:#define VM_SEALED\t\tINIT_VM_FLAG(SEALED)\ninclude/linux/mm.h-517-#else\ninclude/linux/mm.h-518-#define VM_ALLOW_ANY_UNCACHED\tVM_NONE\ninclude/linux/mm.h:519:#define VM_SEALED\t\tVM_NONE\ninclude/linux/mm.h-520-#endif\n--\ninclude/linux/mm.h-559-#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS\ninclude/linux/mm.h:560:#define VM_SEALED_SYSMAP\tVM_SEALED\ninclude/linux/mm.h-561-#else\ninclude/linux/mm.h:562:#define VM_SEALED_SYSMAP\tVM_NONE\ninclude/linux/mm.h-563-#endif\n--\ninclude/uapi/linux/seccomp.h=138=struct seccomp_notif_addfd {\n--\ninclude/uapi/linux/seccomp.h-154- * (the target need not mmap or mseal anything itself). The mapping\ninclude/uapi/linux/seccomp.h:155: * is marked VM_SEALED at install time, so the target and any\ninclude/uapi/linux/seccomp.h-156- * CLONE_VM peer cannot munmap, mremap, mprotect, or MAP_FIXED-stomp\n--\ninclude/uapi/linux/seccomp.h=183=struct seccomp_notif_pin_install {\n--\ninclude/uapi/linux/seccomp.h-226- * current address space: the whole access [args[i], args[i] + ptr_len[i])\ninclude/uapi/linux/seccomp.h:227: * must lie inside a single VM_SEALED, read-only mapping of @memfd. 
No per-pin\ninclude/uapi/linux/seccomp.h-228- * bookkeeping is kept; authorization is re-derived from the live mapping, so\n--\ninclude/uapi/linux/seccomp.h-241- * the access [args[i], args[i] + ptr_len[i]) is validated to lie\ninclude/uapi/linux/seccomp.h:242: * entirely inside a single VM_SEALED, read-only mapping of @memfd.\ninclude/uapi/linux/seccomp.h-243- * Scalar replacements (in @args_mask but not @ptr_mask) are\n--\ninclude/uapi/linux/seccomp.h=254=struct seccomp_notif_resp_redirect {\n--\ninclude/uapi/linux/seccomp.h-266- * target-side cooperation. The supervisor owns the backing memfd;\ninclude/uapi/linux/seccomp.h:267: * the kernel installs the mapping and marks it VM_SEALED. The actual\ninclude/uapi/linux/seccomp.h-268- * mapped address is written back to @target_addr (relevant when it was\n--\nkernel/events/uprobes.c=1704=static int xol_add_vma(struct mm_struct *mm, struct xol_area *area)\n--\nkernel/events/uprobes.c-1726-\t\t\t\tVM_EXEC|VM_MAYEXEC|VM_DONTCOPY|VM_IO|\nkernel/events/uprobes.c:1727:\t\t\t\tVM_SEALED_SYSMAP,\nkernel/events/uprobes.c-1728-\t\t\t\t\u0026xol_mapping);\n--\nkernel/seccomp.c=2017=static bool seccomp_pin_check(struct task_struct *target,\n--\nkernel/seccomp.c-2037-\t * Read-only so no CLONE_VM peer can rewrite the bytes the kernel is\nkernel/seccomp.c:2038:\t * about to read; VM_SEALED keeps the mapping itself immutable.\nkernel/seccomp.c-2039-\t */\n--\nkernel/seccomp.c-2041-\tvma = vma_lookup(mm, ptr);\nkernel/seccomp.c:2042:\tif (vma \u0026\u0026 end \u003c= vma-\u003evm_end \u0026\u0026 (vma-\u003evm_flags \u0026 VM_SEALED) \u0026\u0026\nkernel/seccomp.c-2043-\t !(vma-\u003evm_flags \u0026 VM_WRITE) \u0026\u0026\n--\nlib/vdso/datastore.c=128=struct vm_area_struct *vdso_install_vvar_mapping(struct mm_struct *mm, unsigned long addr)\n--\nlib/vdso/datastore.c-131-\t\t\t\t\tVM_READ | VM_MAYREAD | VM_IO | VM_DONTDUMP |\nlib/vdso/datastore.c:132:\t\t\t\t\tVM_MIXEDMAP | 
VM_SEALED_SYSMAP,\nlib/vdso/datastore.c-133-\t\t\t\t\t\u0026vdso_vvar_mapping);\n--\nmm/util.c=565=unsigned long vm_mmap_pgoff(struct file *file, unsigned long addr,\n--\nmm/util.c-601- *\nmm/util.c:602: * The mapping is read-only. The VMA is created VM_SEALED, so it is immediately\nmm/util.c-603- * immutable against the target mm's owner and its CLONE_VM peers. LSM/fsnotify\n--\nmm/util.c=609=unsigned long vm_mmap_seal_remote(struct mm_struct *mm, struct file *file,\n--\nmm/util.c-620-\t\treturn -EINVAL;\nmm/util.c:621:\tif (!VM_SEALED)\t\t/* sealing unavailable (e.g. !CONFIG_64BIT) */\nmm/util.c-622-\t\treturn -EOPNOTSUPP;\n--\nmm/util.c-639-\t}\nmm/util.c:640:\tret = __do_mmap(mm, file, addr, len, prot, flags, VM_SEALED,\nmm/util.c-641-\t\t\tpgoff, \u0026populate, \u0026uf);\n--\nmm/vma.h=671=static inline bool vma_is_sealed(struct vm_area_struct *vma)\nmm/vma.h-672-{\nmm/vma.h:673:\treturn (vma-\u003evm_flags \u0026 VM_SEALED);\nmm/vma.h-674-}\n--\ntools/testing/selftests/seccomp/seccomp_bpf.c=4699=TEST(user_notification_pinned_memfd_execve_scm)\n--\ntools/testing/selftests/seccomp/seccomp_bpf.c-4951- *\ntools/testing/selftests/seccomp/seccomp_bpf.c:4952: * PIN_INSTALL records nothing: the installed VM_SEALED VMA is the only\ntools/testing/selftests/seccomp/seccomp_bpf.c-4953- * record, and SEND_REDIRECT re-validates the pointer against the live\n--\ntools/testing/vma/include/dup.h=100=enum {\n--\ntools/testing/vma/include/dup.h-304-#define VM_ALLOW_ANY_UNCACHED\tINIT_VM_FLAG(ALLOW_ANY_UNCACHED)\ntools/testing/vma/include/dup.h:305:#define VM_SEALED\t\tINIT_VM_FLAG(SEALED)\ntools/testing/vma/include/dup.h-306-#else\ntools/testing/vma/include/dup.h-307-#define VM_ALLOW_ANY_UNCACHED\tVM_NONE\ntools/testing/vma/include/dup.h:308:#define VM_SEALED\t\tVM_NONE\ntools/testing/vma/include/dup.h-309-#endif\n"
} |
| 5/2 | 2026/06/27 01:53 | llm | patch-evaluator |
0mModel:gemini-3.1-pro-preview Args: null Results: null Tokens: input: 35215 output: 29 thoughts: 90 Thoughts: **Analyzing the Dependencies of `VM_SEALED`** Okay, so `VM_SEALED`, as defined in `include/linux/mm.h`, has a direct dependency on `CONFIG_64BIT`. That's not too surprising, and honestly, `CONFIG_64BIT` is almost always enabled these days. But I need to be thorough. I'm going to look for any *other* dependencies, just to be absolutely sure I haven't missed something fundamental. My experience tells me there might be more to this than meets the eye. Aha! Confirmed. `CONFIG_SECCOMP_FILTER` is definitely needed as well. Good, good. Now, let's take a look inside `kernel/seccomp.c` to see if there are any *other* configuration options that impact `VM_SEALED`. Gotta make sure I'm not overlooking some subtle interaction or a hidden requirement in the codebase. |
| 6/2 | 2026/06/27 01:53 | tool | grepper |
0mArgs:{
"Expression": "CONFIG_",
"PathPrefix": "kernel/seccomp.c"
}Results: {
"Output": "kernel/seccomp.c-36-\nkernel/seccomp.c:37:#ifdef CONFIG_SECCOMP_FILTER\nkernel/seccomp.c-38-#include \u003clinux/file.h\u003e\n--\nkernel/seccomp.c=419=static u32 seccomp_run_filters(const struct seccomp_data *sd,\n--\nkernel/seccomp.c-447-}\nkernel/seccomp.c:448:#endif /* CONFIG_SECCOMP_FILTER */\nkernel/seccomp.c-449-\n--\nkernel/seccomp.c=462=static inline void seccomp_assign_mode(struct task_struct *task,\n--\nkernel/seccomp.c-479-\nkernel/seccomp.c:480:#ifdef CONFIG_SECCOMP_FILTER\nkernel/seccomp.c-481-/* Returns 1 if the parent is an ancestor of the child. */\n--\nkernel/seccomp.c=684=static struct seccomp_filter *seccomp_prepare_filter(struct sock_fprog *fprog)\n--\nkernel/seccomp.c-688-\tconst bool save_orig =\nkernel/seccomp.c:689:#if defined(CONFIG_CHECKPOINT_RESTORE) || defined(SECCOMP_ARCH_NATIVE)\nkernel/seccomp.c-690-\t\ttrue;\n--\nkernel/seccomp.c=737=seccomp_prepare_user_filter(const char __user *user_filter)\n--\nkernel/seccomp.c-741-\nkernel/seccomp.c:742:#ifdef CONFIG_COMPAT\nkernel/seccomp.c-743-\tif (in_compat_syscall()) {\n--\nkernel/seccomp.c=1005=void get_seccomp_filter(struct task_struct *tsk)\n--\nkernel/seccomp.c-1013-\nkernel/seccomp.c:1014:#endif\t/* CONFIG_SECCOMP_FILTER */\nkernel/seccomp.c-1015-\n--\nkernel/seccomp.c=1093=static void __secure_computing_strict(int this_syscall)\n--\nkernel/seccomp.c-1095-\tconst int *allowed_syscalls = mode1_syscalls;\nkernel/seccomp.c:1096:#ifdef CONFIG_COMPAT\nkernel/seccomp.c-1097-\tif (in_compat_syscall())\n--\nkernel/seccomp.c-1112-\nkernel/seccomp.c:1113:#ifndef CONFIG_HAVE_ARCH_SECCOMP_FILTER\nkernel/seccomp.c-1114-void secure_computing_strict(int this_syscall)\n--\nkernel/seccomp.c-1117-\nkernel/seccomp.c:1118:\tif (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) \u0026\u0026\nkernel/seccomp.c-1119-\t unlikely(current-\u003eptrace \u0026 PT_SUSPEND_SECCOMP))\n--\nkernel/seccomp.c=1129=int __secure_computing(void)\n--\nkernel/seccomp.c-1137-\nkernel/seccomp.c:1138:#ifdef 
CONFIG_SECCOMP_FILTER\nkernel/seccomp.c-1139-static u64 seccomp_next_notify_id(struct seccomp_filter *filter)\n--\nkernel/seccomp.c=1472=int __secure_computing(void)\n--\nkernel/seccomp.c-1476-\nkernel/seccomp.c:1477:\tif (IS_ENABLED(CONFIG_CHECKPOINT_RESTORE) \u0026\u0026\nkernel/seccomp.c-1478-\t unlikely(current-\u003eptrace \u0026 PT_SUSPEND_SECCOMP))\n--\nkernel/seccomp.c-1497-}\nkernel/seccomp.c:1498:#endif /* CONFIG_HAVE_ARCH_SECCOMP_FILTER */\nkernel/seccomp.c-1499-\n--\nkernel/seccomp.c=1512=static long seccomp_set_mode_strict(void)\n--\nkernel/seccomp.c-1533-\nkernel/seccomp.c:1534:#ifdef CONFIG_SECCOMP_FILTER\nkernel/seccomp.c-1535-static void seccomp_notify_free(struct seccomp_filter *filter)\n--\nkernel/seccomp.c=2577=long prctl_set_seccomp(unsigned long seccomp_mode, void __user *filter)\n--\nkernel/seccomp.c-2603-\nkernel/seccomp.c:2604:#if defined(CONFIG_SECCOMP_FILTER) \u0026\u0026 defined(CONFIG_CHECKPOINT_RESTORE)\nkernel/seccomp.c-2605-static struct seccomp_filter *get_nth_filter(struct task_struct *task,\n--\nkernel/seccomp.c=2651=long seccomp_get_filter(struct task_struct *task, unsigned long filter_off,\n--\nkernel/seccomp.c-2670-\t\t * every cBPF filter's orig_prog above when\nkernel/seccomp.c:2671:\t\t * CONFIG_CHECKPOINT_RESTORE is enabled.\nkernel/seccomp.c-2672-\t\t */\n--\nkernel/seccomp.c=2689=long seccomp_get_metadata(struct task_struct *task,\n--\nkernel/seccomp.c-2724-\nkernel/seccomp.c:2725:#ifdef CONFIG_SYSCTL\nkernel/seccomp.c-2726-\n--\nkernel/seccomp.c=2949=device_initcall(seccomp_sysctl_init)\nkernel/seccomp.c-2950-\nkernel/seccomp.c:2951:#endif /* CONFIG_SYSCTL */\nkernel/seccomp.c-2952-\nkernel/seccomp.c:2953:#ifdef CONFIG_SECCOMP_CACHE_DEBUG\nkernel/seccomp.c:2954:/* Currently CONFIG_SECCOMP_CACHE_DEBUG implies SECCOMP_ARCH_NATIVE */\nkernel/seccomp.c-2955-static void proc_pid_seccomp_cache_arch(struct seq_file *m, const char *name,\n--\nkernel/seccomp.c=2968=int proc_pid_seccomp_cache(struct seq_file *m, struct 
pid_namespace *ns,\n--\nkernel/seccomp.c-3006-}\nkernel/seccomp.c:3007:#endif /* CONFIG_SECCOMP_CACHE_DEBUG */\n"
} |
| 7/2 | 2026/06/27 01:53 | llm | patch-evaluator |
0mModel:gemini-3.1-pro-preview Args: null Results: null Tokens: input: 37253 output: 20 thoughts: 486 Thoughts: **Analyzing Seccomp and VM Sealing Configuration** Okay, so I'm looking into the configurations related to seccomp filtering and memory sealing. The key configuration item here seems to be `CONFIG_SECCOMP_FILTER`. I need to ensure it's enabled, obviously. I also noticed that `CONFIG_64BIT` is a hard requirement for `VM_SEALED`. That makes sense; it's usually on for x86_64 and arm64, but I need to be explicit. Looking at `mm/util.c` confirms this: the check `!VM_SEALED` clearly indicates the dependency. Therefore, I need to add `SECCOMP_FILTER` to `EnableConfigs` to address this dependency. Now, let's nail down the core functions that are likely relevant. I've identified these: * `seccomp_notify_pin_install` * `seccomp_install_pin` * `seccomp_pin_check` * `seccomp_notify_send_redirect` * `seccomp_redirect_restore_cb` * `seccomp_redirect_is_sigreturn` * `vm_mmap_seal_remote` * `mm_get_unmapped_area_remote` * `seccomp_run_filters_seq` A quick scan shows some other functions had recent modifications so I'll also add to my list: * `seccomp_do_user_notification` * `__seccomp_filter` * `seccomp_notify_ioctl` So, let me list that again for posterity: * `seccomp_notify_pin_install` * `seccomp_install_pin` * `seccomp_pin_check` * `seccomp_notify_send_redirect` * `seccomp_redirect_restore_cb` * `seccomp_redirect_is_sigreturn` * `seccomp_run_filters_seq` * `vm_mmap_seal_remote` * `mm_get_unmapped_area_remote` * `seccomp_do_user_notification` * `__seccomp_filter` * `seccomp_notify_ioctl` Okay, and now, are there any other relevant configs? Let me check quickly... I'm curious about `CONFIG_MEMFD_CREATE`. Let's see if that's even a thing... |
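The `!VM_SEALED` check referenced in the step above works because the flag collapses to a zero constant on configurations without sealing, so the guard is resolved at compile time. A minimal userspace sketch of that pattern follows; the `DEMO_` names, the bit position, and the pointer-width test standing in for `CONFIG_64BIT` are assumptions made for illustration and are not the kernel's definitions.

```c
#include <errno.h>
#include <stdint.h>

#define DEMO_VM_NONE 0ULL

/* Assumption: pointer width stands in for the kernel's CONFIG_64BIT. */
#if UINTPTR_MAX > 0xffffffffUL
#define DEMO_VM_SEALED (1ULL << 42)	/* hypothetical bit position */
#else
#define DEMO_VM_SEALED DEMO_VM_NONE
#endif

/*
 * Hypothetical stand-in for the early guard in vm_mmap_seal_remote():
 * when the flag is the zero constant, the branch is always taken and the
 * compiler can discard the rest of the function.
 */
long demo_seal_remote(void)
{
	if (!DEMO_VM_SEALED)	/* sealing unavailable on this build */
		return -EOPNOTSUPP;
	return 0;
}
```

On a 64-bit build `demo_seal_remote()` returns 0; on a 32-bit build it returns `-EOPNOTSUPP`, which is why `64BIT` appears alongside `SECCOMP_FILTER` in the config list.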
| 8/2 | 2026/06/27 01:54 | tool | grepper |
0mArgs:{
"Expression": "CONFIG_MEMFD_CREATE"
}Results: {
"Output": "include/linux/memfd.h-8-\ninclude/linux/memfd.h:9:#ifdef CONFIG_MEMFD_CREATE\ninclude/linux/memfd.h-10-extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);\n--\ninclude/linux/pid_namespace.h=17=struct fs_pin;\ninclude/linux/pid_namespace.h-18-\ninclude/linux/pid_namespace.h:19:#if defined(CONFIG_SYSCTL) \u0026\u0026 defined(CONFIG_MEMFD_CREATE)\ninclude/linux/pid_namespace.h-20-/* modes for vm.memfd_noexec sysctl */\n--\ninclude/linux/pid_namespace.h=26=struct pid_namespace {\n--\ninclude/linux/pid_namespace.h-30-#ifdef CONFIG_SYSCTL\ninclude/linux/pid_namespace.h:31:#if defined(CONFIG_MEMFD_CREATE)\ninclude/linux/pid_namespace.h-32-\tint memfd_noexec_scope;\n--\ninclude/linux/pid_namespace.h=62=static inline struct pid_namespace *get_pid_ns(struct pid_namespace *ns)\n--\ninclude/linux/pid_namespace.h-67-\ninclude/linux/pid_namespace.h:68:#if defined(CONFIG_SYSCTL) \u0026\u0026 defined(CONFIG_MEMFD_CREATE)\ninclude/linux/pid_namespace.h-69-static inline int pidns_memfd_noexec_scope(struct pid_namespace *ns)\n--\nkernel/pid.c=72=struct pid_namespace init_pid_ns = {\n--\nkernel/pid.c-79-\t.pid_max = PID_MAX_DEFAULT,\nkernel/pid.c:80:#if defined(CONFIG_SYSCTL) \u0026\u0026 defined(CONFIG_MEMFD_CREATE)\nkernel/pid.c-81-\t.memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC,\n--\nkernel/pid_namespace.c=76=static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns,\n--\nkernel/pid_namespace.c-121-\nkernel/pid_namespace.c:122:#if defined(CONFIG_SYSCTL) \u0026\u0026 defined(CONFIG_MEMFD_CREATE)\nkernel/pid_namespace.c-123-\tns-\u003ememfd_noexec_scope = pidns_memfd_noexec_scope(parent_pid_ns);\n--\nkernel/pid_sysctl.h-6-\nkernel/pid_sysctl.h:7:#if defined(CONFIG_SYSCTL) \u0026\u0026 defined(CONFIG_MEMFD_CREATE)\nkernel/pid_sysctl.h-8-static int pid_mfd_noexec_dointvec_minmax(const struct ctl_table *table,\n--\nmm/Makefile=139=obj-$(CONFIG_HMM_MIRROR) += hmm.o\nmm/Makefile:140:obj-$(CONFIG_MEMFD_CREATE) += 
memfd.o\nmm/Makefile-141-obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o\n--\ntools/testing/selftests/liveupdate/config=8=CONFIG_LIVEUPDATE_TEST=y\ntools/testing/selftests/liveupdate/config:9:CONFIG_MEMFD_CREATE=y\ntools/testing/selftests/liveupdate/config-10-CONFIG_TMPFS=y\n"
} |
| 9/2 | 2026/06/27 01:54 | llm | patch-evaluator |
0mModel:gemini-3.1-pro-preview Args: null Results: null Tokens: input: 39178 output: 275 thoughts: 457 Thoughts: **My Assessment of the Seccomp Patch** Okay, I've confirmed that `CONFIG_MEMFD_CREATE` is indeed present, so I'll add it to `EnableConfigs` along with `CONFIG_SECCOMP_FILTER`, just to be safe. We should always cover our bases. I've also identified `64BIT` as necessary. Now, let's refresh our memory on the key functions this patch touches: `seccomp_notify_pin_install`, `seccomp_install_pin`, `seccomp_pin_check`, `seccomp_notify_send_redirect`, `seccomp_redirect_restore_cb`, `seccomp_redirect_is_sigreturn`, `seccomp_run_filters_seq`, `vm_mmap_seal_remote`, and `mm_get_unmapped_area_remote`. Got it. This patch is functional, and the new non-cooperative pinned-memfd redirect feature within seccomp is definitely worth fuzzing. It's a significant change. It introduces a novel approach where a supervisor can now install a sealed memfd mapping into a trapped task's address space and redirect pointer arguments to it. This new mechanism brings in new ioctls, `SECCOMP_IOCTL_NOTIF_PIN_INSTALL` and `SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`, along with changes to core seccomp and memory management code. Given the scope of the changes, fuzzing is *essential*. This new path has the potential to introduce serious vulnerabilities if not thoroughly tested. We need to focus the fuzzing on these new execution paths to discover any potential issues, such as memory leaks, race conditions, or unauthorized memory access. Therefore, I will prioritize these functions for fuzzing. |
| 10/2 | 2026/06/27 01:54 | tool | set-results |
0mArgs:{
"EnableConfigs": [
"SECCOMP_FILTER",
"MEMFD_CREATE",
"64BIT"
],
"FocusSymbols": [
"seccomp_notify_pin_install",
"seccomp_install_pin",
"seccomp_pin_check",
"seccomp_notify_send_redirect",
"seccomp_redirect_restore_cb",
"seccomp_redirect_is_sigreturn",
"seccomp_run_filters_seq",
"vm_mmap_seal_remote",
"mm_get_unmapped_area_remote"
],
"Reasoning": "The patch introduces a new non-cooperative pinned-memfd redirect feature for seccomp, allowing a supervisor to install a sealed memfd mapping into a trapped task's address space and redirect pointer arguments to it. This adds new ioctls (`SECCOMP_IOCTL_NOTIF_PIN_INSTALL` and `SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`) and modifies core seccomp and memory management logic. These new execution paths should be heavily fuzzed to ensure they don't introduce vulnerabilities like memory leaks, race conditions, or unauthorized memory access.",
"WorthFuzzing": true
}Results: {
"EnableConfigs": [
"SECCOMP_FILTER",
"MEMFD_CREATE",
"64BIT"
],
"FocusSymbols": [
"seccomp_notify_pin_install",
"seccomp_install_pin",
"seccomp_pin_check",
"seccomp_notify_send_redirect",
"seccomp_redirect_restore_cb",
"seccomp_redirect_is_sigreturn",
"seccomp_run_filters_seq",
"vm_mmap_seal_remote",
"mm_get_unmapped_area_remote"
],
"Reasoning": "The patch introduces a new non-cooperative pinned-memfd redirect feature for seccomp, allowing a supervisor to install a sealed memfd mapping into a trapped task's address space and redirect pointer arguments to it. This adds new ioctls (`SECCOMP_IOCTL_NOTIF_PIN_INSTALL` and `SECCOMP_IOCTL_NOTIF_SEND_REDIRECT`) and modifies core seccomp and memory management logic. These new execution paths should be heavily fuzzed to ensure they don't introduce vulnerabilities like memory leaks, race conditions, or unauthorized memory access.",
"WorthFuzzing": true
} |

| Total Calls | Total Tokens | Avg Tokens | Total Duration (Seconds) | Avg Duration (Seconds) |
|---|---|---|---|---|

| Total Calls | Total Duration (Seconds) | Avg Duration (Seconds) |
|---|---|---|