From: Xuanqiang Luo

Since the lookup of sk in ehash is lockless, when one CPU performs a lookup
while another CPU executes delete and insert operations (deleting reqsk and
inserting sk), the lookup CPU may miss both of them. If sk cannot be found,
an RST may be sent.

The race looks like this:

   CPU 0                           CPU 1
   -----                           -----
                                   spin_lock()
                                   sk_nulls_del_node_init_rcu(osk)
   __inet_lookup_established()
                                   __sk_nulls_add_node_rcu(sk, list)
                                   spin_unlock()

After a failed lookup in ehash, we can take and release the bucket lock
(spin_lock()/spin_unlock()) to wait for the in-flight ehash update (ensuring
the deletion and insertion have completed), and then look up sk again. Since
the sk we expect to find is unlikely to hit this scenario several times in a
row, a single retry is enough.

A similar issue exists in the tw hashdance, so adjust the order of its ehash
operations: remove sk first, then add tw. If a lookup misses sk, it will
likewise wait for the update and then find tw, without the skc_refcnt issue
that would arise if tw could be found first.

Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Xuanqiang Luo
---
 net/ipv4/inet_hashtables.c    | 12 ++++++++++++
 net/ipv4/inet_timewait_sock.c |  9 ++++-----
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..4eb3a55b855b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -505,6 +505,7 @@ struct sock *__inet_lookup_established(const struct net *net,
 	unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
 	unsigned int slot = hash & hashinfo->ehash_mask;
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
+	bool try_lock = true;
 
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
@@ -528,6 +529,17 @@ struct sock *__inet_lookup_established(const struct net *net,
 	 */
 	if (get_nulls_value(node) != slot)
 		goto begin;
+
+	if (try_lock) {
+		spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
+
+		try_lock = false;
+		spin_lock(lock);
+		/* Ensure ehash ops under spinlock complete. */
+		spin_unlock(lock);
+		goto begin;
+	}
+
 out:
 	sk = NULL;
 found:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 875ff923a8ed..a91e02e19c53 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -139,14 +139,10 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 
 	spin_lock(lock);
 
-	/* Step 2: Hash TW into tcp ehash chain */
-	inet_twsk_add_node_rcu(tw, &ehead->chain);
-
-	/* Step 3: Remove SK from hash chain */
+	/* Step 2: Remove SK from hash chain */
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
 
 	/* Ensure above writes are committed into memory before updating the
 	 * refcount.
 	 * Provides ordering vs later refcount_inc().
@@ -161,6 +157,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	 */
 	refcount_set(&tw->tw_refcnt, 3);
 
+	/* Step 3: Hash TW into tcp ehash chain */
+	inet_twsk_add_node_rcu(tw, &ehead->chain);
+
 	inet_twsk_schedule(tw, timeo);
 
 	spin_unlock(lock);
-- 
2.25.1
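
For anyone who wants to experiment with the retry idea outside the kernel, here
is a minimal userspace sketch. It is not the kernel code: the names
writer_replace(), reader_lookup() and bucket_lock are hypothetical, a pthread
mutex stands in for the ehash bucket spinlock, and a single atomic pointer
stands in for the nulls chain. It only demonstrates the pattern from the first
hunk: a lockless lookup that, on a miss, takes and releases the writer's lock
once so any in-flight delete+insert has finished, then retries.

/* race_retry_demo.c - illustrative sketch only, not kernel code.
 * Build: gcc -std=c11 -pthread race_retry_demo.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct entry { int key; };

static pthread_mutex_t bucket_lock = PTHREAD_MUTEX_INITIALIZER;
static _Atomic(struct entry *) bucket;	/* one-slot "hash bucket" */

/* Writer: delete the old entry and insert the new one under bucket_lock,
 * mirroring sk_nulls_del_node_init_rcu() + __sk_nulls_add_node_rcu().
 * A lockless reader running between the two stores sees an empty bucket.
 */
static void writer_replace(struct entry *new_e)
{
	pthread_mutex_lock(&bucket_lock);
	atomic_store_explicit(&bucket, NULL, memory_order_release);	/* delete */
	atomic_store_explicit(&bucket, new_e, memory_order_release);	/* insert */
	pthread_mutex_unlock(&bucket_lock);
}

/* Reader: lockless lookup.  On a miss, take and drop the writer's lock once
 * so any in-flight delete+insert pair has completed, then retry once.
 */
static struct entry *reader_lookup(int key)
{
	bool try_lock = true;
	struct entry *e;

retry:
	e = atomic_load_explicit(&bucket, memory_order_acquire);
	if (e && e->key == key)
		return e;

	if (try_lock) {
		try_lock = false;
		pthread_mutex_lock(&bucket_lock);	/* wait for the update */
		pthread_mutex_unlock(&bucket_lock);
		goto retry;
	}
	return NULL;
}

int main(void)
{
	struct entry old_e = { .key = 1 }, new_e = { .key = 1 };

	atomic_store(&bucket, &old_e);
	writer_replace(&new_e);	/* single-threaded stand-in for the racing CPU */
	printf("lookup(1) %s\n", reader_lookup(1) ? "found" : "missed");
	return 0;
}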