From: Jann Horn __ptrace_may_access() checks can happen on target tasks that are in the middle of do_exit(), past exit_mm(). At that point, the ->mm pointer has been NULLed out, and the mm_struct has been mmput(). Unfortunately, the mm_struct contains the dumpability and the user_ns in which the task last went through execve(), and we need those for __ptrace_may_access(). Currently, that problem is handled by failing open: If the ->mm is gone, we assume that the task was dumpable. In some edge cases, this could potentially expose access to things like /proc/$pid/fd/$fd of originally non-dumpable processes. (exit_files() comes after exit_mm(), so the file descriptor table is still there when we've gone through exit_mm().) I believe that this patch may be the least bad option to fix this - keep the mm_struct (but not process memory) around with an mmgrab() reference from exit_mm() until the task goes away completely. Note that this moves free_task() down in order to make mmdrop_async() available without a forward declaration. Christian Brauner (Amutable) says: I massaged the patch a bit and rewrote parts of the commit message. The cached task->user_dumpable bit introduced by commit 31e62c2ebbfd ("ptrace: slightly saner 'get_dumpable() logic'") is removed here: the mmgrab'd exit_mm reference supplies the real dumpable flag and the original user_ns. exit_mm() now publishes ->exit_mm before clearing ->mm so that task_still_dumpable() readers cannot observe both NULL on a user task. Cc: stable@vger.kernel.org Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") Signed-off-by: Jann Horn Signed-off-by: Christian Brauner (Amutable) --- In 2020 Jann actually pointed to the exact bug class that the recent ptrace bug used and asked for a architecturally very clean fix to be merged but it never got anywhere. In the off-list ptrace discussion I brought up my intention to have this merged on top. So this rebases Jann's patch [1] which addresses the root cause directly: stash an mmgrab() reference to the mm_struct in task->exit_mm during exit_mm(), and release it via mmdrop_async() in free_task(). The pinned mm_struct keeps the address space metadata (flags, user_ns) reachable for the full lifetime of the task_struct without keeping the actual memory mappings alive. ptrace_may_access() then simply takes the exit_mm into account. The exit_mm field is separate and none of the other in-kernel callers need to change so there's no regression risks. Beyond ptrace, the fact that task->mm isn't accessible post-exit anymore causes complications for other series. Minchan Kim's recent PROCESS_MRELEASE_REAP_KILL series [2] describes the same shape from the OOM-reaper side: process_mrelease() racing against do_exit() ends up with task->mm and falls over with -ESRCH, deferring reclamation indefinitely under Android LMKD pressure. That series works around it by moving the SIGKILL injection into process_mrelease() itself, but the underlying issue is identical: callers legitimately need to reach the mm of a task after exit. That whole patch becomes rather simple with this fix. Note that I had already brought this up with some mm folks that were involved in the ptrace issue off-list. Also note that the upstream patch has at least two regressions. Reading the exit status from /proc//stat and opening /proc//ns/{pid,user} of a zombie. The latter I've used multiple times from inside a container for non-dumpable tasks for nested and non-nested containers. Afaict, both will now be denied because we resort to checking in the initial userns for non-dumpable tasks. And tools like Incus do nested containers so I also suspect that anything that relies on looking at the userns or pidns of zombies risks spurious failures. I think we should do the clean thing and let ptrace look at the actual mm state post exit. Keep it in a separate field and there's no regression risk for users that expect task->mm itself to be NULL post exit. [1] https://lore.kernel.org/r/20201016024019.1882062-2-jannh@google.com [2] https://lore.kernel.org/r/20260511214226.937793-1-minchan@kernel.org --- include/linux/sched.h | 4 +--- kernel/exit.c | 3 ++- kernel/fork.c | 64 ++++++++++++++++++++++++++------------------------- kernel/ptrace.c | 20 +++++++++++----- 4 files changed, 50 insertions(+), 41 deletions(-) diff --git a/include/linux/sched.h b/include/linux/sched.h index ee06cba5c6f5..7cefeb6cbba7 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -961,6 +961,7 @@ struct task_struct { struct mm_struct *mm; struct mm_struct *active_mm; + struct mm_struct *exit_mm; int exit_state; int exit_code; @@ -1002,9 +1003,6 @@ struct task_struct { unsigned sched_rt_mutex:1; #endif - /* Save user-dumpable when mm goes away */ - unsigned user_dumpable:1; - /* Bit to tell TOMOYO we're in execve(): */ unsigned in_execve:1; unsigned in_iowait:1; diff --git a/kernel/exit.c b/kernel/exit.c index f50d73c272d6..9af227c23e2b 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -571,7 +571,8 @@ static void exit_mm(void) */ smp_mb__after_spinlock(); local_irq_disable(); - current->user_dumpable = (get_dumpable(mm) == SUID_DUMP_USER); + mmgrab(mm); + current->exit_mm = mm; current->mm = NULL; membarrier_update_current_mm(NULL); enter_lazy_tlb(mm, current); diff --git a/kernel/fork.c b/kernel/fork.c index 5f3fdfdb14c7..f43ea7f5888a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -528,37 +528,6 @@ void put_task_stack(struct task_struct *tsk) } #endif -void free_task(struct task_struct *tsk) -{ -#ifdef CONFIG_SECCOMP - WARN_ON_ONCE(tsk->seccomp.filter); -#endif - release_user_cpus_ptr(tsk); - scs_release(tsk); - -#ifndef CONFIG_THREAD_INFO_IN_TASK - /* - * The task is finally done with both the stack and thread_info, - * so free both. - */ - release_task_stack(tsk); -#else - /* - * If the task had a separate stack allocation, it should be gone - * by now. - */ - WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0); -#endif - rt_mutex_debug_task_free(tsk); - ftrace_graph_exit_task(tsk); - arch_release_task_struct(tsk); - if (tsk->flags & PF_KTHREAD) - free_kthread_struct(tsk); - bpf_task_storage_free(tsk); - free_task_struct(tsk); -} -EXPORT_SYMBOL(free_task); - void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm) { struct file *exe_file; @@ -775,6 +744,39 @@ static inline void put_signal_struct(struct signal_struct *sig) free_signal_struct(sig); } +void free_task(struct task_struct *tsk) +{ +#ifdef CONFIG_SECCOMP + WARN_ON_ONCE(tsk->seccomp.filter); +#endif + release_user_cpus_ptr(tsk); + scs_release(tsk); + +#ifndef CONFIG_THREAD_INFO_IN_TASK + /* + * The task is finally done with both the stack and thread_info, + * so free both. + */ + release_task_stack(tsk); +#else + /* + * If the task had a separate stack allocation, it should be gone + * by now. + */ + WARN_ON_ONCE(refcount_read(&tsk->stack_refcount) != 0); +#endif + rt_mutex_debug_task_free(tsk); + ftrace_graph_exit_task(tsk); + arch_release_task_struct(tsk); + if (tsk->flags & PF_KTHREAD) + free_kthread_struct(tsk); + bpf_task_storage_free(tsk); + if (tsk->exit_mm) + mmdrop_async(tsk->exit_mm); + free_task_struct(tsk); +} +EXPORT_SYMBOL(free_task); + void __put_task_struct(struct task_struct *tsk) { WARN_ON(!tsk->exit_state); diff --git a/kernel/ptrace.c b/kernel/ptrace.c index 130043bfc209..2955a59c18cf 100644 --- a/kernel/ptrace.c +++ b/kernel/ptrace.c @@ -272,18 +272,26 @@ static bool ptrace_has_cap(struct user_namespace *ns, unsigned int mode) return ns_capable(ns, CAP_SYS_PTRACE); } -static bool task_still_dumpable(struct task_struct *task, unsigned int mode) +/* + * Decide whether ptrace access to @task is allowed based on its mm. + * Reads the dumpable flag and user_ns from ->mm, or from ->exit_mm if + * the task has gone through exit_mm(). Note that kernel threads may have + * neither. + */ +static bool may_access_mm(struct task_struct *task, unsigned int mode) { struct mm_struct *mm = task->mm; + struct user_namespace *mm_userns = &init_user_ns; + + if (!mm) + mm = task->exit_mm; if (mm) { if (get_dumpable(mm) == SUID_DUMP_USER) return true; - return ptrace_has_cap(mm->user_ns, mode); + mm_userns = mm->user_ns; } - if (task->user_dumpable) - return true; - return ptrace_has_cap(&init_user_ns, mode); + return ptrace_has_cap(mm_userns, mode); } /* Returns 0 on success, -errno on denial. */ @@ -350,7 +358,7 @@ static int __ptrace_may_access(struct task_struct *task, unsigned int mode) * Pairs with a write barrier in commit_creds(). */ smp_rmb(); - if (!task_still_dumpable(task, mode)) + if (!may_access_mm(task, mode)) return -EPERM; return security_ptrace_access_check(task, mode); --- base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6 change-id: 20260516-work-exit_mm-5aba5fb06d71