From: Xuanqiang Luo

Since the lookup of sk in ehash is lockless, when one CPU performs a lookup
while another CPU executes delete and insert operations (deleting reqsk and
inserting sk), the lookup CPU may miss both of them. If sk cannot be found,
an RST may be sent.

The race looks like this:

   CPU 0                           CPU 1
   -----                           -----
                                   spin_lock()
                                   sk_nulls_del_node_init_rcu(osk)
   __inet_lookup_established()
                                   __sk_nulls_add_node_rcu(sk, list)
                                   spin_unlock()

After a failed lookup in ehash, we can take and release the bucket lock
(spin_lock()/spin_unlock()) to wait for the in-flight ehash update (ensuring
the deletion and insertion have completed), and then look up sk again. Since
the sk we expect to find is unlikely to hit this scenario several times in a
row, a single retry is enough.

A similar issue exists in the tw hashdance, so adjust the order of its ehash
operations: remove sk first, then add tw. If a lookup misses sk, it will
likewise wait for the update and then find tw, without the skc_refcnt issue
that would arise if tw could be found first.

Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Xuanqiang Luo
---
 net/ipv4/inet_hashtables.c    | 12 ++++++++++++
 net/ipv4/inet_timewait_sock.c |  9 ++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..4eb3a55b855b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -505,6 +505,7 @@ struct sock *__inet_lookup_established(const struct net *net,
 	unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
 	unsigned int slot = hash & hashinfo->ehash_mask;
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
+	bool try_lock = true;
 
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
@@ -528,6 +529,17 @@ struct sock *__inet_lookup_established(const struct net *net,
 	 */
 	if (get_nulls_value(node) != slot)
 		goto begin;
+
+	if (try_lock) {
+		spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
+
+		try_lock = false;
+		spin_lock(lock);
+		/* Ensure ehash ops under spinlock complete. */
+		spin_unlock(lock);
+		goto begin;
+	}
+
 out:
 	sk = NULL;
 found:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 875ff923a8ed..a91e02e19c53 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -139,14 +139,10 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 
 	spin_lock(lock);
 
-	/* Step 2: Hash TW into tcp ehash chain */
-	inet_twsk_add_node_rcu(tw, &ehead->chain);
-
-	/* Step 3: Remove SK from hash chain */
+	/* Step 2: Remove SK from hash chain */
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
 
 	/* Ensure above writes are committed into memory before updating the
 	 * refcount.
 	 * Provides ordering vs later refcount_inc().
@@ -161,6 +157,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	 */
 	refcount_set(&tw->tw_refcnt, 3);
 
+	/* Step 3: Hash TW into tcp ehash chain */
+	inet_twsk_add_node_rcu(tw, &ehead->chain);
+
 	inet_twsk_schedule(tw, timeo);
 
 	spin_unlock(lock);
-- 
2.25.1
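
For anyone who wants to experiment with the retry idea outside the kernel, here
is a minimal userspace sketch. It is not the kernel code: the names
writer_replace(), reader_lookup() and bucket_lock are hypothetical, a pthread
mutex stands in for the ehash bucket spinlock, and a single atomic pointer
stands in for the nulls chain. It only demonstrates the pattern from the first
hunk: a lockless lookup that, on a miss, takes and releases the writer's lock
once so any in-flight delete+insert has finished, then retries.

/* race_retry_demo.c - illustrative sketch only, not kernel code.
 * Build: gcc -std=c11 -pthread race_retry_demo.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct entry { int key; };

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic(struct entry *) bucket;	/* one-slot "hash bucket" */

/* Writer: delete the old entry and insert the new one under bucket_lock,
 * mirroring sk_nulls_del_node_init_rcu() + __sk_nulls_add_node_rcu().
 * A lockless reader running between the two stores sees an empty bucket.
 */
static void writer_replace(struct entry *new_e)
{
	pthread_mutex_lock(&bucket_lock);
	atomic_store_explicit(&bucket, NULL, memory_order_release);	/* delete */
	atomic_store_explicit(&bucket, new_e, memory_order_release);	/* insert */
	pthread_mutex_unlock(&bucket_lock);
}

/* Reader: lockless lookup.  On a miss, take and drop the writer's lock once
 * so any in-flight delete+insert pair has completed, then retry once.
 */
static struct entry *reader_lookup(int key)
{
	bool try_lock = true;
	struct entry *e;

retry:
	e = atomic_load_explicit(&bucket, memory_order_acquire);
	if (e && e->key == key)
		return e;

	if (try_lock) {
		try_lock = false;
		pthread_mutex_lock(&bucket_lock);	/* wait for the update */
		pthread_mutex_unlock(&bucket_lock);
		goto retry;
	}
	return NULL;
}

int main(void)
{
	struct entry old_e = { .key = 1 }, new_e = { .key = 1 };

	atomic_store(&bucket, &old_e);
	writer_replace(&new_e);	/* single-threaded stand-in for the racing CPU */
	printf("lookup(1) %s\n", reader_lookup(1) ? "found" : "missed");
	return 0;
}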