Introduce a new sysctl knob, net.ipv4.tcp_purge_receive_queue, to
address a memory leak scenario related to TCP sockets.

Issue:

When a TCP socket in the CLOSE_WAIT state receives a RST packet, the
current implementation does not clear the socket's receive queue. This
causes SKBs in the queue to remain allocated until the socket is
explicitly closed by the application. As a consequence:

1. The page pool pages held by these SKBs are not released.
2. The associated page pool cannot be freed.

RFC 9293 Section 3.10.7.4 specifies that when a RST is received in
CLOSE_WAIT state, "all segment queues should be flushed." However, the
current implementation does not flush the receive queue.

Solution:

Add a per-namespace sysctl (net.ipv4.tcp_purge_receive_queue) that,
when enabled, causes the kernel to purge the receive queue when a RST
packet is received in CLOSE_WAIT state. This allows immediate release
of SKBs and their associated memory resources.

The feature is disabled by default to maintain backward compatibility
with existing behavior.

Signed-off-by: Leon Hwang
---
 Documentation/networking/ip-sysctl.rst            | 18 ++++++++++++++++++
 .../net_cachelines/netns_ipv4_sysctl.rst          |  1 +
 include/net/netns/ipv4.h                          |  1 +
 net/ipv4/sysctl_net_ipv4.c                        |  9 +++++++++
 net/ipv4/tcp_input.c                              | 16 ++++++++++++++++
 5 files changed, 45 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index d1eeb5323af0..71a529462baa 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1441,6 +1441,24 @@ tcp_rto_max_ms - INTEGER
 
 	Default: 120,000
 
+tcp_purge_receive_queue - BOOLEAN
+	When a socket in the TCP_CLOSE_WAIT state receives a RST packet, the
+	default behavior is to not clear its receive queue. As a result,
+	any SKBs in the queue are not freed until the socket is closed.
+	Consequently, the pages held by these SKBs are not released, which
+	can also prevent the associated page pool from being freed.
+
+	If enabled, the receive queue is purged upon receiving the RST,
+	allowing the SKBs and their associated memory to be released
+	promptly.
+
+	Possible values:
+
+	- 0 (disabled)
+	- 1 (enabled)
+
+	Default: 0 (disabled)
+
 UDP variables
 =============
 
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index beaf1880a19b..f2c42e7d84a9 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -123,6 +123,7 @@ unsigned_long sysctl_tcp_comp_sack_delay_ns
 unsigned_long sysctl_tcp_comp_sack_slack_ns    __tcp_ack_snd_check
 int           sysctl_max_syn_backlog
 int           sysctl_tcp_fastopen
+u8            sysctl_tcp_purge_receive_queue
 struct_tcp_congestion_ops tcp_congestion_control               init_cc
 struct_tcp_fastopen_context tcp_fastopen_ctx
 unsigned_int  sysctl_tcp_fastopen_blackhole_timeout
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 8e971c7bf164..ab973f30f502 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -220,6 +220,7 @@ struct netns_ipv4 {
 	u8 sysctl_tcp_nometrics_save;
 	u8 sysctl_tcp_no_ssthresh_metrics_save;
 	u8 sysctl_tcp_workaround_signed_windows;
+	u8 sysctl_tcp_purge_receive_queue;
 	int sysctl_tcp_challenge_ack_limit;
 	u8 sysctl_tcp_min_tso_segs;
 	u8 sysctl_tcp_reflect_tos;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 643763bc2142..da30970bb5d5 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1641,6 +1641,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1 = SYSCTL_ONE_THOUSAND,
 		.extra2 = &tcp_rto_max_max,
 	},
+	{
+		.procname = "tcp_purge_receive_queue",
+		.data = &init_net.ipv4.sysctl_tcp_purge_receive_queue,
+		.maxlen = sizeof(u8),
+		.mode = 0644,
+		.proc_handler = proc_dou8vec_minmax,
+		.extra1 = SYSCTL_ZERO,
+		.extra2 = SYSCTL_ONE,
+	},
 };
 
 static __net_init int ipv4_sysctl_init_net(struct net *net)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6c3f1d031444..43f32fb5831d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4895,6 +4895,7 @@ EXPORT_IPV6_MOD(tcp_done_with_error);
 /* When we get a reset we do this. */
 void tcp_reset(struct sock *sk, struct sk_buff *skb)
 {
+	const struct net *net = sock_net(sk);
 	int err;
 
 	trace_tcp_receive_reset(sk);
@@ -4911,6 +4912,21 @@ void tcp_reset(struct sock *sk, struct sk_buff *skb)
 		err = ECONNREFUSED;
 		break;
 	case TCP_CLOSE_WAIT:
+		/* RFC9293 3.10.7.4. Other States
+		 * Second, check the RST bit:
+		 * CLOSE-WAIT STATE
+		 *
+		 * If the RST bit is set, then any outstanding RECEIVEs and
+		 * SEND should receive "reset" responses. All segment queues
+		 * should be flushed. Users should also receive an unsolicited
+		 * general "connection reset" signal. Enter the CLOSED state,
+		 * delete the TCB, and return.
+		 *
+		 * If net.ipv4.tcp_purge_receive_queue is enabled,
+		 * sk_receive_queue will be flushed too.
+		 */
+		if (unlikely(net->ipv4.sysctl_tcp_purge_receive_queue))
+			skb_queue_purge(&sk->sk_receive_queue);
 		err = EPIPE;
 		break;
 	case TCP_CLOSE:
-- 
2.52.0
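Note (not part of the patch): since the knob is per network namespace, it can be enabled selectively for testing on a kernel with this patch applied. A usage sketch; the namespace name `ns1` is an arbitrary example:

```shell
# Enable purge-on-RST in the init namespace (disabled by default).
sysctl -w net.ipv4.tcp_purge_receive_queue=1

# Because the sysctl is per-netns, it can also be enabled in just one
# namespace while leaving the default behavior everywhere else.
ip netns add ns1
ip netns exec ns1 sysctl -w net.ipv4.tcp_purge_receive_queue=1

# Read back the current value; proc_dou8vec_minmax with extra1/extra2
# restricts writes to the range 0..1.
cat /proc/sys/net/ipv4/tcp_purge_receive_queue
```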