unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe
that the socket is fully set up, which would include having a defined
peer. In other words, there is a window during which
unix_stream_bpf_update_proto() can be called on a socket that still has
unix_peer(sk) == NULL.

T0 bpf                                 T1 connect
------                                 ----------
                                       WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
                                       sock_hold(newsk)
                                       smp_mb__after_atomic()
                                       unix_peer(sk) = newsk

BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
 sock_map_link+0x564/0x8b0
 sock_map_update_common+0x6e/0x340
 sock_map_update_elem_sys+0x17d/0x240
 __sys_bpf+0x26db/0x3250
 __x64_sys_bpf+0x21/0x30
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

The initial idea was to move the peer assignment _before_ the sk_state
update [1], but that required an additional memory barrier, and changing
the hot path was rejected. A check during the proto update was then
considered [2], but a follow-up discussion [3] concluded that the root
cause is sockmap taking the wrong lock.

Thus, teach sockmap about the af_unix-specific locking: instead of the
usual lock_sock(), which takes sock::sk_lock, af_unix protects its
critical sections with unix_state_lock(), which operates on
unix_sock::lock.

[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
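
For reference, the window can be probed from userspace along these
lines. This is only a rough reproducer sketch, not the program behind
the splat above: error handling is omitted, the thread/variable names
and the abstract address are made up, and since the window is just a
few instructions wide, it typically takes many iterations to hit
(build with gcc -pthread):

#define _GNU_SOURCE
#include <linux/bpf.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/un.h>
#include <unistd.h>

static int sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

static volatile int cfd = -1;	/* fd of the socket being connect()ed */

/* Hammer BPF_MAP_UPDATE_ELEM so that it may observe a socket whose
 * sk_state is already TCP_ESTABLISHED but whose peer is still NULL.
 * Update errors (EBADF, EOPNOTSUPP, ...) are expected and ignored.
 */
static void *updater(void *arg)
{
	int mapfd = *(int *)arg;

	for (;;) {
		union bpf_attr attr;
		__u32 key = 0;
		__u64 val = cfd;

		memset(&attr, 0, sizeof(attr));
		attr.map_fd = mapfd;
		attr.key = (__u64)(unsigned long)&key;
		attr.value = (__u64)(unsigned long)&val;
		sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
	}
	return NULL;
}

int main(void)
{
	struct sockaddr_un addr = { .sun_family = AF_UNIX };
	union bpf_attr attr;
	pthread_t thr;
	int mapfd, lfd;

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_SOCKMAP;
	attr.key_size = sizeof(__u32);
	attr.value_size = sizeof(__u64);
	attr.max_entries = 1;
	mapfd = sys_bpf(BPF_MAP_CREATE, &attr);

	strcpy(addr.sun_path + 1, "estab-race");	/* abstract address */
	lfd = socket(AF_UNIX, SOCK_STREAM, 0);
	bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
	listen(lfd, 32);

	pthread_create(&thr, NULL, updater, &mapfd);

	for (;;) {
		int c = socket(AF_UNIX, SOCK_STREAM, 0);

		cfd = c;
		/* Races with the sockmap update in the other thread. */
		connect(c, (struct sockaddr *)&addr, sizeof(addr));
		close(accept(lfd, NULL, NULL));
		close(c);
	}
}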
This patch also happens to fix a deadlock that may occur when
bpf_iter_unix_seq_show()'s lock_sock_fast() takes the fast path and the
iter prog attempts to update a sockmap. The update then ends up spinning
at sock_map_update_elem()'s bh_lock_sock():

WARNING: possible recursive locking detected
--------------------------------------------
test_progs/1393 is trying to acquire lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: sock_map_update_elem+0xdb/0x1f0

but task is already holding lock:
ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(slock-AF_UNIX);
  lock(slock-AF_UNIX);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

4 locks held by test_progs/1393:
 #0: ffff88814b59c790 (&p->lock){+.+.}-{4:4}, at: bpf_seq_read+0x59/0x10d0
 #1: ffff88811ec25fd8 (sk_lock-AF_UNIX){+.+.}-{0:0}, at: bpf_seq_read+0x42c/0x10d0
 #2: ffff88811ec25f58 (slock-AF_UNIX){+...}-{3:3}, at: __lock_sock_fast+0x37/0xe0
 #3: ffffffff85a6a7c0 (rcu_read_lock){....}-{1:3}, at: bpf_iter_run_prog+0x51d/0xb00

Call Trace:
 dump_stack_lvl+0x5d/0x80
 print_deadlock_bug.cold+0xc0/0xce
 __lock_acquire+0x130f/0x2590
 lock_acquire+0x14e/0x2b0
 _raw_spin_lock+0x30/0x40
 sock_map_update_elem+0xdb/0x1f0
 bpf_prog_2d0075e5d9b721cd_dump_unix+0x55/0x4f4
 bpf_iter_run_prog+0x5b9/0xb00
 bpf_iter_unix_seq_show+0x1f7/0x2e0
 bpf_seq_read+0x42c/0x10d0
 vfs_read+0x171/0xb20
 ksys_read+0xff/0x200
 do_syscall_64+0x6b/0x3a0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Suggested-by: Kuniyuki Iwashima
Suggested-by: Martin KaFai Lau
Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
Fixes: 2c860a43dd77 ("bpf: af_unix: Implement BPF iterator for UNIX domain socket.")
Signed-off-by: Michal Luczaj
---
Keeping sparse annotations in sock_map_sk_{acquire,release}() required
some hackery I'm not proud of. Is there a better way?
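
For reference, the iterator program behind the splat above boils down to
updating a sockmap from inside bpf_iter_unix_seq_show()'s locked
section. A rough sketch of such a program follows; it is illustrative,
not the actual test_progs source (the map layout, the key, and the exact
value argument are guesses; only the dump_unix name comes from the
trace), and assumes a vmlinux.h generated via bpftool:

// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only; not the actual test_progs program. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_SOCKMAP);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} sock_map SEC(".maps");

SEC("iter/unix")
int dump_unix(struct bpf_iter__unix *ctx)
{
	struct unix_sock *unix_sk = ctx->unix_sk;
	__u32 key = 0;

	if (!unix_sk)
		return 0;

	/* bpf_iter_unix_seq_show() invokes this program under
	 * lock_sock_fast(); before the fix, the update below would
	 * spin on the very same bh_lock_sock() held by the iterator.
	 */
	bpf_map_update_elem(&sock_map, &key, &unix_sk->sk, BPF_NOEXIST);
	return 0;
}

char _license[] SEC("license") = "GPL";

Attaching this as a bpf_iter link and reading the pinned iterator file
should be enough to replay the splat on an unfixed kernel.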
---
 net/core/sock_map.c | 47 +++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 39 insertions(+), 8 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index b6586d9590b7..0c638b1f363a 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -12,6 +12,7 @@
 #include <linux/list.h>
 #include <linux/jhash.h>
 #include <linux/sock_diag.h>
+#include <net/af_unix.h>
 #include <net/udp.h>
 
 struct bpf_stab {
@@ -115,17 +116,49 @@ int sock_map_prog_detach(const union bpf_attr *attr, enum bpf_prog_type ptype)
 }
 
 static void sock_map_sk_acquire(struct sock *sk)
-	__acquires(&sk->sk_lock.slock)
+	__acquires(sock_or_unix_lock)
 {
-	lock_sock(sk);
+	if (sk_is_unix(sk)) {
+		unix_state_lock(sk);
+		__release(sk); /* Silence sparse. */
+	} else {
+		lock_sock(sk);
+	}
+
 	rcu_read_lock();
 }
 
 static void sock_map_sk_release(struct sock *sk)
-	__releases(&sk->sk_lock.slock)
+	__releases(sock_or_unix_lock)
 {
 	rcu_read_unlock();
-	release_sock(sk);
+
+	if (sk_is_unix(sk)) {
+		unix_state_unlock(sk);
+		__acquire(sk); /* Silence sparse. */
+	} else {
+		release_sock(sk);
+	}
+}
+
+static inline void sock_map_sk_acquire_fast(struct sock *sk)
+{
+	local_bh_disable();
+
+	if (sk_is_unix(sk))
+		unix_state_lock(sk);
+	else
+		bh_lock_sock(sk);
+}
+
+static inline void sock_map_sk_release_fast(struct sock *sk)
+{
+	if (sk_is_unix(sk))
+		unix_state_unlock(sk);
+	else
+		bh_unlock_sock(sk);
+
+	local_bh_enable();
 }
 
 static void sock_map_add_link(struct sk_psock *psock,
@@ -604,16 +637,14 @@ static long sock_map_update_elem(struct bpf_map *map, void *key,
 	if (!sock_map_sk_is_suitable(sk))
 		return -EOPNOTSUPP;
 
-	local_bh_disable();
-	bh_lock_sock(sk);
+	sock_map_sk_acquire_fast(sk);
 	if (!sock_map_sk_state_allowed(sk))
 		ret = -EOPNOTSUPP;
 	else if (map->map_type == BPF_MAP_TYPE_SOCKMAP)
 		ret = sock_map_update_common(map, *(u32 *)key, sk, flags);
 	else
 		ret = sock_hash_update_common(map, key, sk, flags);
-	bh_unlock_sock(sk);
-	local_bh_enable();
+	sock_map_sk_release_fast(sk);
 	return ret;
 }
 
-- 
2.52.0