On high-core-count systems, memory cgroup statistics can become stale due
to per-CPU caching and deferred aggregation. Monitoring tools and
management applications sometimes need guaranteed up-to-date statistics at
specific points in time to make accurate decisions.

This patch adds write handlers to both the memory.stat and
memory.numa_stat files so that userspace can explicitly force an immediate
flush of memory statistics. When "1" is written to either file, the
handler calls css_rstat_flush(css), which unconditionally flushes all
pending statistics for the cgroup and its descendants. The write operation
validates the input and only accepts the value "1", returning -EINVAL for
any other input.

Usage example:

  # Force immediate flush before reading critical statistics
  echo 1 > /sys/fs/cgroup/mygroup/memory.stat
  cat /sys/fs/cgroup/mygroup/memory.stat

This provides several benefits:

1. On-demand accuracy: Tools can flush only when needed, avoiding
   continuous overhead

2. Targeted flushing: Allows flushing specific cgroups when precision is
   required for particular workloads

3. Integration flexibility: Monitoring scripts can decide when to pay the
   flush cost based on their specific accuracy requirements

The implementation is shared between the cgroup v1 and v2 interfaces, with
memory_stat_write() providing the common validation and flush logic. Both
memory.stat and memory.numa_stat use the same write handler since they
both benefit from forcing accurate statistics.

Documentation is updated to reflect that these files are now read-write
instead of read-only, with a clear explanation of the write behavior.

Signed-off-by: Leon Huang Fu
---
v1 -> v2:
  - Flush stats when the file is written (per Michal).
  - https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/

 Documentation/admin-guide/cgroup-v2.rst | 31 +++++++++++++++++--------
 mm/memcontrol-v1.c                      |  2 ++
 mm/memcontrol-v1.h                      |  1 +
 mm/memcontrol.c                         | 13 +++++++++++
 4 files changed, 37 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3345961c30ac..2a4a81d2cc2f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
 	cgroup is within its effective low boundary, the cgroup's
 	memory won't be reclaimed unless there is no reclaimable
 	memory available in unprotected cgroups.
-	Above the effective low boundary (or
+	Above the effective low boundary (or
 	effective min boundary if it is higher), pages are reclaimed
 	proportionally to the overage, reducing reclaim pressure for
 	smaller overages.
@@ -1525,11 +1525,17 @@ The following nested keys are defined.
 	generated on this file reflects only the local events.
 
   memory.stat
-	A read-only flat-keyed file which exists on non-root cgroups.
+	A read-write flat-keyed file which exists on non-root cgroups.
 
-	This breaks down the cgroup's memory footprint into different
-	types of memory, type-specific details, and other information
-	on the state and past events of the memory management system.
+	Reading this file breaks down the cgroup's memory footprint into
+	different types of memory, type-specific details, and other
+	information on the state and past events of the memory management
+	system.
+
+	Writing the value "1" to this file forces an immediate flush of
+	memory statistics for this cgroup and its descendants, improving
+	the accuracy of subsequent reads. Any other value will result in
+	an error.
 
 	All memory amounts are in bytes.
 
@@ -1786,11 +1792,16 @@ The following nested keys are defined.
 	cgroup is mounted with the memory_hugetlb_accounting option).
 
   memory.numa_stat
-	A read-only nested-keyed file which exists on non-root cgroups.
+	A read-write nested-keyed file which exists on non-root cgroups.
+
+	Reading this file breaks down the cgroup's memory footprint into
+	different types of memory, type-specific details, and other
+	information per node on the state of the memory management system.
 
-	This breaks down the cgroup's memory footprint into different
-	types of memory, type-specific details, and other information
-	per node on the state of the memory management system.
+	Writing the value "1" to this file forces an immediate flush of
+	memory statistics for this cgroup and its descendants, improving
+	the accuracy of subsequent reads. Any other value will result in
+	an error.
 
 	This is useful for providing visibility into the NUMA locality
 	information within an memcg since the pages are allowed to be
@@ -2173,7 +2184,7 @@ of the two is enforced.
 
 cgroup writeback requires explicit support from the underlying
 filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
 attributed to the root cgroup.
 
 There are inherent differences in memory and writeback management
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8cab6b52424b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "stat",
 		.seq_show = memory_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 	{
 		.name = "force_empty",
@@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
 	{
 		.name = "numa_stat",
 		.seq_show = memcg_numa_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 #endif
 	{
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..1c92d58330aa 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -29,6 +29,7 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
 unsigned long memcg_events(struct mem_cgroup *memcg, int event);
 unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
 int memory_stat_show(struct seq_file *m, void *v);
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val);
 
 void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
 struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c34029e92bab..d6a5d872fbcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
 	return 0;
 }
 
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+	if (val != 1)
+		return -EINVAL;
+
+	if (css)
+		css_rstat_flush(css);
+
+	return 0;
+}
+
 #ifdef CONFIG_NUMA
 static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
 						     int item)
@@ -4666,11 +4677,13 @@ static struct cftype memory_files[] = {
 	{
 		.name = "stat",
 		.seq_show = memory_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 #ifdef CONFIG_NUMA
 	{
 		.name = "numa_stat",
 		.seq_show = memory_numa_stat_show,
+		.write_u64 = memory_stat_write,
 	},
 #endif
 	{
-- 
2.51.2
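
For reviewers who want to try the interface, the flush-then-read sequence
from the usage example could be wrapped as below. This is only a sketch:
the helper name flush_and_read and its arguments are hypothetical and not
part of this patch, and it assumes that on a kernel without the patch the
write simply fails, in which case it falls back to an ordinary read.

```shell
#!/bin/sh
# Hypothetical helper (not part of the patch): request a stats flush for a
# cgroup directory, then print the requested stat file.
# Usage: flush_and_read <cgroup-dir> [memory.stat|memory.numa_stat]
flush_and_read() {
    cg_dir=$1
    stat_file=${2:-memory.stat}
    path="$cg_dir/$stat_file"

    # Writing "1" asks the kernel to flush pending statistics. If the write
    # fails (for example because the file is read-only on this kernel),
    # warn and continue with whatever the kernel currently reports.
    if ! { echo 1 > "$path"; } 2>/dev/null; then
        echo "warning: could not flush $path; stats may be stale" >&2
    fi

    # Read the file; the function's exit status is that of cat.
    cat "$path"
}
```

A monitoring script would call, for instance,
flush_and_read /sys/fs/cgroup/mygroup memory.numa_stat only at the points
where it needs precise numbers, and use a plain cat elsewhere to avoid
paying the flush cost.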