This patch is the result of our long-standing debugging sessions, which
started with a report that "networking is slow": TCP throughput suddenly
dropped from tens of Gbps to a few Mbps, yet nothing showed up in the
kernel log or in the netstat counters.

Currently, we have two memory pressure counters for TCP sockets [1],
which we manipulate only when the memory pressure is signalled through
the proto struct [2]. However, memory pressure can also be signalled
through the cgroup memory subsystem, which is not reflected in the
netstat counters. As a result, when the cgroup memory subsystem signals
that it is under pressure, we silently reduce the advertised TCP window
with tcp_adjust_rcv_ssthresh() to 4*advmss, which causes a significant
throughput reduction. Keep in mind that when the cgroup memory subsystem
signals socket memory pressure for a given cgroup, it affects all
sockets used in that cgroup, including those in child cgroups.

This patch exposes a new read-only single value file for each cgroup,
showing how many microseconds that cgroup contributed to throttling the
throughput of network sockets. The file is accessible at the following
path:

  /sys/fs/cgroup/**/memory.net.throttled_usec

To summarize the proposed methods of hierarchical propagation of
memory.net.throttled_usec:

1) None - keep the reported duration local to the cgroup:

     value = self

   This would not be too out of place, since memory.events.local
   already does not accumulate hierarchically. To determine whether
   sockets in a memcg were throttled, we would traverse the
   /sys/fs/cgroup/ hierarchy from the root to the cgroup of interest
   and sum those local durations.

2) Propagate the duration upwards (using rstat or simple iteration
   towards the root memcg during write):

     value = self + sum of children

   This is the most semantically consistent with the other exposed stat
   files and could be added as an entry in memory.stat. However, since
   the pressure is applied from ancestors to children (see
   mem_cgroup_under_socket_pressure()), determining the duration of
   throttling for sockets in a given cgroup would be hardest in this
   variant: it would involve iterating from the root to the examined
   cgroup, at each node subtracting the values of its children from
   that node's value, and summing the results to obtain the total
   duration throttled.

3) Propagate the duration downwards (write only locally, read by
   traversing the hierarchy upwards):

     value = self + sum of ancestors

   This mirrors the logic used in mem_cgroup_under_socket_pressure():
   an increase in the reported value for a memcg coincides with more
   throttling being applied to the sockets of that memcg.

We chose variant 1; that is why it is a separate file instead of
another counter in memory.stat. Variant 2 seems the most fitting at
first glance, but the calculated value would be misleading and hard to
interpret. Ideally, we would go with variant 3, as it mirrors the logic
of mem_cgroup_under_socket_pressure(), but variant 3 can also be
calculated manually from variant 1, so we chose variant 1 as the most
versatile option that does not leak internal implementation details
that may change in the future.

Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/uapi/linux/snmp.h#L231-L232 [1]
Link: https://elixir.bootlin.com/linux/v6.15.4/source/include/net/sock.h#L1300-L1301 [2]
Co-developed-by: Matyas Hurtik
Signed-off-by: Matyas Hurtik
Signed-off-by: Daniel Sedlak
---
Sorry for the delay between the versions.
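Since variant 3 can be derived from the exposed variant 1 values, here is
a minimal userspace sketch of that derivation (not part of the patch; the
relative-path argument convention and the simplified error handling are
assumptions for illustration):

#include <stdio.h>
#include <string.h>

#define CGROUP_ROOT "/sys/fs/cgroup"

/* Read the local throttled duration of one cgroup; 0 if the file is
 * absent (e.g. the root cgroup, where CFTYPE_NOT_ON_ROOT hides it). */
static unsigned long long read_local_usec(const char *cg)
{
	char file[4352];
	unsigned long long val = 0;
	FILE *f;

	snprintf(file, sizeof(file), "%s/memory.net.throttled_usec", cg);
	f = fopen(file, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%llu", &val) != 1)
		val = 0;
	fclose(f);
	return val;
}

int main(int argc, char **argv)
{
	char path[4096];
	unsigned long long total = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <cgroup path relative to %s>\n",
			argv[0], CGROUP_ROOT);
		return 1;
	}
	snprintf(path, sizeof(path), "%s/%s", CGROUP_ROOT, argv[1]);

	/* Sum self + all ancestors, mirroring how the pressure applies. */
	for (;;) {
		total += read_local_usec(path);
		if (!strcmp(path, CGROUP_ROOT))
			break;
		*strrchr(path, '/') = '\0'; /* step up to the parent */
	}
	printf("%llu\n", total);
	return 0;
}

Running it with "foo/bar" would print the summed duration for
/sys/fs/cgroup/foo/bar and all of its ancestors, i.e. the variant 3
value for that cgroup.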
Changes:
v4 -> v5:
- Rebased
- Extend commit message with design decisions
- Rename cgroup counter
- Link to v4: https://lore.kernel.org/netdev/20250805064429.77876-1-daniel.sedlak@cdn77.com/

v3 -> v4:
- Add documentation
- Expose pressure as cumulative counter in microseconds
- Link to v3: https://lore.kernel.org/netdev/20250722071146.48616-1-daniel.sedlak@cdn77.com/

v2 -> v3:
- Expose the socket memory pressure on the cgroups instead of netstat
- Split patch
- Link to v2: https://lore.kernel.org/netdev/20250714143613.42184-1-daniel.sedlak@cdn77.com/

v1 -> v2:
- Add tracepoint
- Link to v1: https://lore.kernel.org/netdev/20250707105205.222558-1-daniel.sedlak@cdn77.com/

 Documentation/admin-guide/cgroup-v2.rst | 10 ++++++
 include/linux/memcontrol.h              | 41 +++++++++++++++----------
 mm/memcontrol.c                         | 31 ++++++++++++++++++-
 3 files changed, 65 insertions(+), 17 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 51c0bc4c2dc5..fe81a134c156 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1887,6 +1887,16 @@ The following nested keys are defined.
 	Shows pressure stall information for memory. See
 	:ref:`Documentation/accounting/psi.rst <psi>` for details.
 
+  memory.net.throttled_usec
+	A read-only single value file showing how many microseconds this cgroup
+	contributed to throttling the throughput of network sockets.
+
+	Socket throttling is applied to a cgroup and to all its children
+	as a consequence of high reclaim pressure.
+
+	Observing throttling of sockets in a particular cgroup can be done
+	by checking this file for that cgroup and also for all its ancestors.
+
 Usage Guidelines
 ~~~~~~~~~~~~~~~~
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fb27e3d2fdac..647fba7dcc8a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -247,14 +247,19 @@ struct mem_cgroup {
 	atomic_t kmem_stat;
 #endif
 	/*
-	 * Hint of reclaim pressure for socket memroy management. Note
+	 * Hint of reclaim pressure for socket memory management. Note
 	 * that this indicator should NOT be used in legacy cgroup mode
 	 * where socket memory is accounted/charged separately.
 	 */
 	u64 socket_pressure;
-#if BITS_PER_LONG < 64
+	/* memory.net.throttled_usec */
+	u64 socket_pressure_duration;
+#if BITS_PER_LONG >= 64
+	spinlock_t socket_pressure_spinlock;
+#else
 	seqlock_t socket_pressure_seqlock;
 #endif
+
 	int kmemcg_id;
 	/*
 	 * memcg->objcg is wiped out as a part of the objcg repaprenting
@@ -1607,19 +1612,33 @@ bool mem_cgroup_sk_charge(const struct sock *sk, unsigned int nr_pages,
 			  gfp_t gfp_mask);
 void mem_cgroup_sk_uncharge(const struct sock *sk, unsigned int nr_pages);
 
-#if BITS_PER_LONG < 64
 static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
 {
-	u64 val = get_jiffies_64() + HZ;
 	unsigned long flags;
 
+#if BITS_PER_LONG >= 64
+	spin_lock_irqsave(&memcg->socket_pressure_spinlock, flags);
+#else
 	write_seqlock_irqsave(&memcg->socket_pressure_seqlock, flags);
-	memcg->socket_pressure = val;
+#endif
+	u64 old_socket_pressure = memcg->socket_pressure;
+	u64 new_socket_pressure = get_jiffies_64() + HZ;
+
+	memcg->socket_pressure = new_socket_pressure;
+	memcg->socket_pressure_duration += jiffies_to_usecs(
+		min(new_socket_pressure - old_socket_pressure, HZ));
+#if BITS_PER_LONG >= 64
+	spin_unlock_irqrestore(&memcg->socket_pressure_spinlock, flags);
+#else
 	write_sequnlock_irqrestore(&memcg->socket_pressure_seqlock, flags);
+#endif
 }
 
 static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 {
+#if BITS_PER_LONG >= 64
+	return READ_ONCE(memcg->socket_pressure);
+#else
 	unsigned int seq;
 	u64 val;
 
@@ -1629,18 +1648,8 @@ static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
 	} while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
 
 	return val;
-}
-#else
-static inline void mem_cgroup_set_socket_pressure(struct mem_cgroup *memcg)
-{
-	WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
-}
-
-static inline u64 mem_cgroup_get_socket_pressure(struct mem_cgroup *memcg)
-{
-	return READ_ONCE(memcg->socket_pressure);
-}
 #endif
+}
 
 int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index df3e9205c9e6..d29147223822 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3755,7 +3755,10 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 	INIT_LIST_HEAD(&memcg->swap_peaks);
 	spin_lock_init(&memcg->peaks_lock);
 	memcg->socket_pressure = get_jiffies_64();
-#if BITS_PER_LONG < 64
+	memcg->socket_pressure_duration = 0;
+#if BITS_PER_LONG >= 64
+	spin_lock_init(&memcg->socket_pressure_spinlock);
+#else
 	seqlock_init(&memcg->socket_pressure_seqlock);
 #endif
 	memcg1_memcg_init(memcg);
@@ -4579,6 +4582,27 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
+static int memory_net_throttled_usec_show(struct seq_file *m, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+	u64 throttled_usec;
+
+#if BITS_PER_LONG >= 64
+	throttled_usec = READ_ONCE(memcg->socket_pressure_duration);
+#else
+	unsigned int seq;
+
+	do {
+		seq = read_seqbegin(&memcg->socket_pressure_seqlock);
+		throttled_usec = memcg->socket_pressure_duration;
+	} while (read_seqretry(&memcg->socket_pressure_seqlock, seq));
+#endif
+
+	seq_printf(m, "%llu\n", throttled_usec);
+
+	return 0;
+}
+
 static struct cftype memory_files[] = {
 	{
 		.name = "current",
@@ -4650,6 +4674,11 @@ static struct cftype memory_files[] = {
 		.flags = CFTYPE_NS_DELEGATABLE,
 		.write = memory_reclaim,
 	},
+	{
+		.name = "net.throttled_usec",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_net_throttled_usec_show,
+	},
 	{ }	/* terminate */
 };
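As a side note for reviewers, the accumulation rule in
mem_cgroup_set_socket_pressure() can be modelled in userspace as follows
(a sketch with assumed HZ=1000 and jiffies-based bookkeeping; the names
and the test values are illustrative only):

#include <stdio.h>

#define HZ 1000ULL /* pretend 1000 jiffies per second */

static unsigned long long deadline; /* models memcg->socket_pressure */
static unsigned long long duration; /* models socket_pressure_duration, in jiffies */

/* Models one call to mem_cgroup_set_socket_pressure() at time @now. */
static void signal_pressure(unsigned long long now)
{
	unsigned long long new_deadline = now + HZ;
	unsigned long long delta = new_deadline - deadline;

	/*
	 * Credit the time elapsed since the previous signal, capped at
	 * one full HZ window when the previous window already expired.
	 */
	duration += delta < HZ ? delta : HZ;
	deadline = new_deadline;
}

int main(void)
{
	deadline = 0; /* memcg created at t=0, as in mem_cgroup_alloc() */

	signal_pressure(100);  /* first signal: credits a full HZ window */
	signal_pressure(400);  /* 300 jiffies after the previous signal */
	signal_pressure(2000); /* previous window expired: capped at HZ */

	/* Prints 2300 (1000 + 300 + 1000); the kernel side converts
	 * this to microseconds via jiffies_to_usecs(). */
	printf("throttled for %llu jiffies\n", duration);
	return 0;
}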
base-commit: 312e6f7676e63bbb9b81e5c68e580a9f776cc6f0
-- 
2.39.5