Situation
---------

We observe the following scenario in production:

                                  inet_bind_bucket state for port 54321
                                  -------------------------------------
                                  (bucket doesn't exist)
// Process A opens a long-lived connection:
s1 = socket(AF_INET, SOCK_STREAM)
s1.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s1.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s1.bind(192.0.2.10, 0)
s1.connect(198.51.100.1, 443)
                                  tb->fastreuse = -1
                                  tb->fastreuseport = -1
s1.getsockname() -> 192.0.2.10:54321
s1.send()
s1.recv()
// ... s1 stays open.

// Process B opens a short-lived connection:
s2 = socket(AF_INET, SOCK_STREAM)
s2.setsockopt(SO_REUSEADDR)
s2.bind(192.0.2.20, 0)
                                  tb->fastreuse = 0
                                  tb->fastreuseport = 0
s2.connect(198.51.100.2, 53)
s2.getsockname() -> 192.0.2.20:54321
s2.send()
s2.recv()
s2.close()
                                  // bucket remains in this
                                  // state even though the
                                  // port was released by s2
                                  tb->fastreuse = 0
                                  tb->fastreuseport = 0

// Process A attempts to open another connection
// when there is connection pressure from
// 192.0.2.30:54000..54500 to 198.51.100.1:443.
// Assume only port 54321 is still available.
s3 = socket(AF_INET, SOCK_STREAM)
s3.setsockopt(IP_BIND_ADDRESS_NO_PORT)
s3.setsockopt(IP_LOCAL_PORT_RANGE, 54000..54500)
s3.bind(192.0.2.30, 0)
s3.connect(198.51.100.1, 443) -> EADDRNOTAVAIL (99)

Problem
-------

We end up in a state where Process A can't reuse ephemeral port 54321
for as long as there are sockets, like s1, that keep the bind bucket
alive. The bucket does not return to a "reusable" state even when all
sockets which blocked it from reuse, like s2, are gone. The ephemeral
port becomes available for use again only after all sockets bound to it
are gone and the bind bucket is destroyed.

Programs which behave like Process B in this scenario - that is,
binding to an IP address without setting IP_BIND_ADDRESS_NO_PORT -
might be considered poorly written. However, the reality is that such
implementations are not uncommon. Trying to fix each and every one of
them is like playing whack-a-mole.

For instance, it could be any software using Golang's net.Dialer with
LocalAddr provided:

  dialer := &net.Dialer{
          LocalAddr: &net.TCPAddr{IP: srcIP},
  }
  conn, err := dialer.Dial("tcp4", dialTarget)

Or even a ubiquitous tool like dig when using a specific local address:

  $ dig -b 127.1.1.1 +tcp +short example.com

Hence, we are proposing a systematic fix in the network stack itself.

Solution
--------

If there is no IP address conflict with any socket bound to a given
local port, then from the protocol's perspective the port can be safely
shared.

With that in mind, modify the port search during connect(), that is
__inet_hash_connect(), to consider all bind buckets (ports) when
looking for a local port for egress. To achieve this, add an extra walk
over the bhash2 buckets for the port to check for IP conflicts. The
additional walk is not free, so perform it only once per port - during
the second phase of conflict checking, when the bhash bucket is locked.

We enable this changed behavior only if the IP_LOCAL_PORT_RANGE socket
option is set (a userspace usage sketch follows the Limitations section
below). The rationale is that users are likely to care about using
every possible local port only when they have deliberately constrained
the ephemeral port range.

Limitations
-----------

Sharing both the local IP and port with other established sockets, even
when the remote address is unique, is still not possible if the bucket
is in a non-reusable state (tb->{fastreuse,fastreuseport} >= 0) because
of a socket explicitly bound to that port.
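As a usage reference - not part of this patch - here is a minimal
userspace sketch of how a client like Process A opts in. The helper
name is made up for illustration. IP_LOCAL_PORT_RANGE (since Linux 6.3)
takes a u32 with the range lower bound in the least significant 16 bits
and the upper bound in the most significant 16 bits:

  #include <stdint.h>
  #include <netinet/in.h>
  #include <sys/socket.h>

  #ifndef IP_LOCAL_PORT_RANGE
  #define IP_LOCAL_PORT_RANGE 51 /* may be absent from older uapi headers */
  #endif

  /* Illustrative helper: constrain ephemeral ports to [lo, hi] and
   * defer port assignment from bind() to connect(), as Process A does.
   */
  static int constrain_local_ports(int fd, uint16_t lo, uint16_t hi)
  {
          uint32_t range = ((uint32_t)hi << 16) | lo;
          int one = 1;

          if (setsockopt(fd, SOL_IP, IP_BIND_ADDRESS_NO_PORT,
                         &one, sizeof(one)))
                  return -1;
          return setsockopt(fd, SOL_IP, IP_LOCAL_PORT_RANGE,
                            &range, sizeof(range));
  }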
Alternatives
------------

* Update bind bucket state on port release

  A valid solution to the described problem would also be to walk the
  bind bucket owners when releasing the port and recalculate the
  tb->{fastreuse,fastreuseport} state.

  However, in comparison to the proposed solution, this alone would not
  allow sharing the local port with other sockets bound to
  non-conflicting IPs for as long as they exist.

  Another downside is that we'd pay the extra cost on each unbind (more
  frequent) rather than only when connecting with IP_LOCAL_PORT_RANGE
  set (less frequent). Due to that, we would also likely need to guard
  it behind a sysctl (see below).

* Run your service in a dedicated netns

  This would also solve the problem, and we don't rule out
  transitioning to such a setup in the future, but it comes at the cost
  of shifting the complexity elsewhere. Isolating a service in a netns
  requires assigning it dedicated IPs for egress. If the egress IPs
  must be shared with other processes, as in our case, then SNAT and
  connection tracking on egress are required - adding complexity.

* Guard it behind a sysctl setting instead of a socket option

  Users are unlikely to encounter this problem unless their workload
  connects from a severely narrowed-down ephemeral port range. Hence,
  paying the bind bucket walk cost for each connect() call doesn't seem
  sensible. Whereas with a socket option, only a limited set of
  connections incurs the performance overhead.

Reported-by: Lee Valentine
Signed-off-by: Jakub Sitnicki
---
To: Eric Dumazet
To: Paolo Abeni
To: David S. Miller
To: Jakub Kicinski
To: Neal Cardwell
To: Kuniyuki Iwashima
Cc: netdev@vger.kernel.org
Cc: kernel-team@cloudflare.com
---
 net/ipv4/inet_hashtables.c | 45 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 43 insertions(+), 2 deletions(-)
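Note for reviewers: the premise above - that a port can be shared when
the bound IPs don't conflict - can already be observed today with
explicit bind(). A minimal sketch, with addresses chosen purely for
illustration (both fall within the loopback 127.0.0.0/8 range, so the
binds work on a stock Linux machine):

  #include <arpa/inet.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/socket.h>

  static int bind_to(const char *ip, uint16_t port)
  {
          struct sockaddr_in addr = {
                  .sin_family = AF_INET,
                  .sin_port = htons(port),
          };
          int fd = socket(AF_INET, SOCK_STREAM, 0);

          inet_pton(AF_INET, ip, &addr.sin_addr);
          return bind(fd, (struct sockaddr *)&addr, sizeof(addr)) ? -1 : fd;
  }

  int main(void)
  {
          /* Both binds succeed without SO_REUSEADDR: same port, but
           * the specific local addresses don't conflict.
           */
          int a = bind_to("127.1.1.1", 54321);
          int b = bind_to("127.1.1.2", 54321);

          printf("a=%d b=%d\n", a, b);
          return 0;
  }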
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 77a0b52b2eab..7f0819f53982 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -1005,6 +1005,42 @@ EXPORT_IPV6_MOD(inet_bhash2_reset_saddr);
 #define INET_TABLE_PERTURB_SIZE (1 << CONFIG_INET_TABLE_PERTURB_ORDER)
 static u32 *table_perturb;
 
+/* True on source address conflict with another socket. False otherwise.
+ * Caller must hold hashbucket lock for this tb.
+ */
+static inline bool check_bound(const struct sock *sk,
+			       const struct inet_bind_bucket *tb)
+{
+	const struct inet_bind2_bucket *tb2;
+	const struct sock *sk2;
+
+	hlist_for_each_entry(tb2, &tb->bhash2, bhash_node) {
+#if IS_ENABLED(CONFIG_IPV6)
+		if (sk->sk_family == AF_INET6) {
+			if (tb2->addr_type == IPV6_ADDR_ANY ||
+			    ipv6_addr_equal(&tb2->v6_rcv_saddr,
+					    &sk->sk_v6_rcv_saddr))
+				return true;
+			continue;
+		}
+
+		/* Check for ipv6 non-v6only wildcard sockets */
+		if (tb2->addr_type == IPV6_ADDR_ANY)
+			sk_for_each_bound(sk2, &tb2->owners)
+				if (!sk2->sk_ipv6only)
+					return true;
+
+		if (tb2->addr_type != IPV6_ADDR_MAPPED)
+			continue;
+#endif
+		if (tb2->rcv_saddr == INADDR_ANY ||
+		    tb2->rcv_saddr == sk->sk_rcv_saddr)
+			return true;
+	}
+
+	return false;
+}
+
 int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			struct sock *sk, u64 port_offset,
 			u32 hash_port0,
@@ -1070,6 +1106,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 			if (!inet_bind_bucket_match(tb, net, port, l3mdev))
 				continue;
 			if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
+				if (unlikely(local_ports))
+					break; /* optimistic assumption */
 				rcu_read_unlock();
 				goto next_port;
 			}
@@ -1088,9 +1126,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		 */
 		inet_bind_bucket_for_each(tb, &head->chain) {
 			if (inet_bind_bucket_match(tb, net, port, l3mdev)) {
-				if (tb->fastreuse >= 0 ||
-				    tb->fastreuseport >= 0)
+				if (tb->fastreuse >= 0 || tb->fastreuseport >= 0) {
+					if (unlikely(local_ports &&
+						     !check_bound(sk, tb)))
+						goto ok;
 					goto next_port_unlock;
+				}
 				WARN_ON(hlist_empty(&tb->bhash2));
 				if (!check_established(death_row, sk,
 						       port, &tw, false,
-- 
2.43.0