* Motivation

The purpose of this hierarchical split-counter scheme is to:

- Minimize contention when incrementing and decrementing counters,
- Provide fast access to a sum approximation,
- Provide a sum approximation with an acceptable accuracy level when
  scaling to many-core systems,
- Provide approximate and precise comparison of two counters, and
  between a counter and a value.

It aims to fix the per-mm RSS tracking, which has become too inaccurate
for OOM killer purposes on large many-core systems [1].

* Design

The hierarchical per-CPU counters propagate a sum approximation through
an N-way tree. When a counter reaches its batch size, the carry is
propagated up the tree, which consists of log_N(nr_cpu_ids) levels.
Each level's batch size is the prior level's batch size multiplied by
that level's N-arity, so for a binary tree the batch size doubles at
each level.

Example propagation diagram with 8 CPUs through a binary tree:

Level 0:  0    1    2    3    4    5    6    7
          |   /     |   /     |   /     |   /
          |  /      |  /      |  /      |  /
          | /       | /       | /       | /
Level 1:  0         1         2         3
          |       /           |       /
          |     /             |     /
          |   /               |   /
Level 2:  0                   1
          |                /
          |           /
          |      /
Level 3:  0

For a binary tree, the maximum inaccuracy is bounded by:

  batch_size * log2(nr_cpus) * nr_cpus

which grows as O(n * log(n)) as the number of CPUs increases. For an
N-way tree, the maximum inaccuracy can be pre-calculated based on the
N-arity of each level and the batch size.

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: SeongJae Park
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R . Howlett"
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
Cc: Baolin Wang
Cc: Aboorva Devarajan
---
Changes since v8:
- Remove migrate guard from the fast path. It does not matter through
  which path the carry is propagated up the tree.
- Rebase on top of v6.18-rc6.
- Introduce percpu_counter_tree_init_many and
  percpu_counter_tree_destroy_many APIs.
- Move tree items allocation to the caller.
- Introduce percpu_counter_tree_items_size().
- Move percpu_counter_tree_subsystem_init() call before mm_core_init()
  so percpu_counter_tree_items_size() is initialized before it is used.

Changes since v7:
- Explicitly initialize the subsystem from start_kernel() right after
  mm_core_init() so it is up and running before the creation of the
  first mm at boot.
- Remove the module.h include which is not needed with the explicit
  initialization.
- Only consider levels>0 items for order={0,1} nr_items. No functional
  change except to allocate only the amount of memory which is strictly
  needed.
- Introduce positive precise sum API to handle a scenario where an
  unlucky precise sum iteration would hit negative counter values
  concurrently with counter updates.

Changes since v5:
- Introduce percpu_counter_tree_approximate_sum_positive.
- Introduce !CONFIG_SMP static inlines for UP build.
- Remove percpu_counter_tree_set_bias from the public API and make it
  static.

Changes since v3:
- Add gfp flags to init function.

Changes since v2:
- Introduce N-way tree to reduce tree depth on larger systems.
Changes since v1: - Remove percpu_counter_tree_precise_sum_unbiased from public header, make this function static, - Introduce precise and approximate comparisons between two counters, - Reorder the struct percpu_counter_tree fields, - Introduce approx_sum field, which points to the approximate sum for the percpu_counter_tree_approximate_sum() fast path. --- include/linux/percpu_counter_tree.h | 239 +++++++++++++++ init/main.c | 2 + lib/Makefile | 1 + lib/percpu_counter_tree.c | 443 ++++++++++++++++++++++++++++ 4 files changed, 685 insertions(+) create mode 100644 include/linux/percpu_counter_tree.h create mode 100644 lib/percpu_counter_tree.c diff --git a/include/linux/percpu_counter_tree.h b/include/linux/percpu_counter_tree.h new file mode 100644 index 000000000000..1f4938b67730 --- /dev/null +++ b/include/linux/percpu_counter_tree.h @@ -0,0 +1,239 @@ +/* SPDX-License-Identifier: GPL-2.0+ OR MIT */ +/* SPDX-FileCopyrightText: 2025 Mathieu Desnoyers */ + +#ifndef _PERCPU_COUNTER_TREE_H +#define _PERCPU_COUNTER_TREE_H + +#include +#include +#include + +#ifdef CONFIG_SMP + +struct percpu_counter_tree_level_item { + atomic_t count; +} ____cacheline_aligned_in_smp; + +struct percpu_counter_tree { + /* Fast-path fields. */ + unsigned int __percpu *level0; + unsigned int level0_bit_mask; + union { + unsigned int *i; + atomic_t *a; + } approx_sum; + int bias; /* bias for counter_set */ + + /* Slow-path fields. */ + struct percpu_counter_tree_level_item *items; + unsigned int batch_size; + unsigned int inaccuracy; /* approximation imprecise within ± inaccuracy */ +}; + +size_t percpu_counter_tree_items_size(void); +int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items, + unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags); +int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items, + unsigned int batch_size, gfp_t gfp_flags); +void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counter, unsigned int nr_counters); +void percpu_counter_tree_destroy(struct percpu_counter_tree *counter); +void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc); +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter); +int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b); +int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v); +int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b); +int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v); +void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v); +unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter); +int percpu_counter_tree_subsystem_init(void); + +/* Fast paths */ + +static inline +int percpu_counter_tree_carry(int orig, int res, int inc, unsigned int bit_mask) +{ + if (inc < 0) { + inc = -(-inc & ~(bit_mask - 1)); + /* + * xor bit_mask: underflow. + * + * If inc has bit set, decrement an additional bit if + * there is _no_ bit transition between orig and res. + * Else, inc has bit cleared, decrement an additional + * bit if there is a bit transition between orig and + * res. + */ + if ((inc ^ orig ^ res) & bit_mask) + inc -= bit_mask; + } else { + inc &= ~(bit_mask - 1); + /* + * xor bit_mask: overflow. 
+ * + * If inc has bit set, increment an additional bit if + * there is _no_ bit transition between orig and res. + * Else, inc has bit cleared, increment an additional + * bit if there is a bit transition between orig and + * res. + */ + if ((inc ^ orig ^ res) & bit_mask) + inc += bit_mask; + } + return inc; +} + +static inline +void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc) +{ + unsigned int bit_mask = counter->level0_bit_mask, orig, res; + + res = this_cpu_add_return(*counter->level0, inc); + orig = res - inc; + inc = percpu_counter_tree_carry(orig, res, inc, bit_mask); + if (likely(!inc)) + return; + percpu_counter_tree_add_slowpath(counter, inc); +} + +static inline +int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter) +{ + unsigned int v; + + if (!counter->level0_bit_mask) + v = READ_ONCE(*counter->approx_sum.i); + else + v = atomic_read(counter->approx_sum.a); + return (int) (v + (unsigned int)READ_ONCE(counter->bias)); +} + +#else /* !CONFIG_SMP */ + +struct percpu_counter_tree_level_item; + +struct percpu_counter_tree { + atomic_t count; +}; + +static inline +size_t percpu_counter_tree_items_size(void) +{ + return 0; +} + +static inline +int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items, + unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags) +{ + for (unsigned int i = 0; i < nr_counters; i++) + atomic_set(&counters[i].count, 0); + return 0; +} + +static inline +int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items, + unsigned int batch_size, gfp_t gfp_flags) +{ + return percpu_counter_tree_init_many(counter, items, 1, batch_size, gfp_flags); +} + +static inline +void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counter, unsigned int nr_counters) +{ +} + +static inline +void percpu_counter_tree_destroy(struct percpu_counter_tree *counter) +{ +} + +static inline +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter) +{ + return atomic_read(&counter->count); +} + +static inline +int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b) +{ + int count_a = percpu_counter_tree_precise_sum(a), + count_b = percpu_counter_tree_precise_sum(b); + + if (count_a == count_b) + return 0; + if (count_a < count_b) + return -1; + return 1; +} + +static inline +int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v) +{ + int count = percpu_counter_tree_precise_sum(counter); + + if (count == v) + return 0; + if (count < v) + return -1; + return 1; +} + +static inline +int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b) +{ + return percpu_counter_tree_precise_compare(a, b); +} + +static inline +int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v) +{ + return percpu_counter_tree_precise_compare_value(counter, v); +} + +static inline +void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v) +{ + atomic_set(&counter->count, v); +} + +static inline +unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter) +{ + return 0; +} + +static inline +void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc) +{ + atomic_add(inc, &counter->count); +} + +static inline +int percpu_counter_tree_approximate_sum(struct percpu_counter_tree *counter) +{ + return 
percpu_counter_tree_precise_sum(counter); +} + +static inline +int percpu_counter_tree_subsystem_init(void) +{ + return 0; +} + +#endif /* CONFIG_SMP */ + +static inline +int percpu_counter_tree_approximate_sum_positive(struct percpu_counter_tree *counter) +{ + int v = percpu_counter_tree_approximate_sum(counter); + return v > 0 ? v : 0; +} + +static inline +int percpu_counter_tree_precise_sum_positive(struct percpu_counter_tree *counter) +{ + int v = percpu_counter_tree_precise_sum(counter); + return v > 0 ? v : 0; +} + +#endif /* _PERCPU_COUNTER_TREE_H */ diff --git a/init/main.c b/init/main.c index 07a3116811c5..8487489b9d00 100644 --- a/init/main.c +++ b/init/main.c @@ -104,6 +104,7 @@ #include #include #include +#include #include #include @@ -968,6 +969,7 @@ void start_kernel(void) vfs_caches_init_early(); sort_main_extable(); trap_init(); + percpu_counter_tree_subsystem_init(); mm_core_init(); maple_tree_init(); poking_init(); diff --git a/lib/Makefile b/lib/Makefile index 1ab2c4be3b66..767dc178a55c 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -179,6 +179,7 @@ obj-$(CONFIG_TEXTSEARCH_KMP) += ts_kmp.o obj-$(CONFIG_TEXTSEARCH_BM) += ts_bm.o obj-$(CONFIG_TEXTSEARCH_FSM) += ts_fsm.o obj-$(CONFIG_SMP) += percpu_counter.o +obj-$(CONFIG_SMP) += percpu_counter_tree.o obj-$(CONFIG_AUDIT_GENERIC) += audit.o obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o diff --git a/lib/percpu_counter_tree.c b/lib/percpu_counter_tree.c new file mode 100644 index 000000000000..5056d470bdc2 --- /dev/null +++ b/lib/percpu_counter_tree.c @@ -0,0 +1,443 @@ +// SPDX-License-Identifier: GPL-2.0+ OR MIT +// SPDX-FileCopyrightText: 2025 Mathieu Desnoyers + +/* + * Split Counters With Tree Approximation Propagation + * + * * Propagation diagram when reaching batch size thresholds (± batch size): + * + * Example diagram for 8 CPUs: + * + * log2(8) = 3 levels + * + * At each level, each pair propagates its values to the next level when + * reaching the batch size thresholds. + * + * Counters at levels 0, 1, 2 can be kept on a single byte (±128 range), + * although it may be relevant to keep them on 32-bit counters for + * simplicity. (complexity vs memory footprint tradeoff) + * + * Counter at level 3 can be kept on a 32-bit counter. + * + * Level 0: 0 1 2 3 4 5 6 7 + * | / | / | / | / + * | / | / | / | / + * | / | / | / | / + * Level 1: 0 1 2 3 + * | / | / + * | / | / + * | / | / + * Level 2: 0 1 + * | / + * | / + * | / + * Level 3: 0 + * + * * Approximation inaccuracy: + * + * BATCH(level N): Level N batch size. + * + * Example for BATCH(level 0) = 32. + * + * BATCH(level 0) = 32 + * BATCH(level 1) = 64 + * BATCH(level 2) = 128 + * BATCH(level N) = BATCH(level 0) * 2^N + * + * per-counter global + * inaccuracy inaccuracy + * Level 0: [ -32 .. +31] ±256 (8 * 32) + * Level 1: [ -64 .. +63] ±256 (4 * 64) + * Level 2: [-128 .. +127] ±256 (2 * 128) + * Total: ------ ±768 (log2(nr_cpu_ids) * BATCH(level 0) * nr_cpu_ids) + * + * ----- + * + * Approximate Sum Carry Propagation + * + * Let's define a number of counter bits for each level, e.g.: + * + * log2(BATCH(level 0)) = log2(32) = 5 + * + * nr_bit value_mask range + * Level 0: 5 bits v 0 .. +31 + * Level 1: 1 bit (v & ~((1UL << 5) - 1)) 0 .. +63 + * Level 2: 1 bit (v & ~((1UL << 6) - 1)) 0 .. +127 + * Level 3: 25 bits (v & ~((1UL << 7) - 1)) 0 .. 2^32-1 + * + * Note: Use a full 32-bit per-cpu counter at level 0 to allow precise sum. + * + * Note: Use cacheline aligned counters at levels above 0 to prevent false sharing. 
+ * If memory footprint is an issue, a specialized allocator could be used + * to eliminate padding. + * + * Example with expanded values: + * + * counter_add(counter, inc): + * + * if (!inc) + * return; + * + * res = percpu_add_return(counter @ Level 0, inc); + * orig = res - inc; + * if (inc < 0) { + * inc = -(-inc & ~0b00011111); // Clear used bits + * // xor bit 5: underflow + * if ((inc ^ orig ^ res) & 0b00100000) + * inc -= 0b00100000; + * } else { + * inc &= ~0b00011111; // Clear used bits + * // xor bit 5: overflow + * if ((inc ^ orig ^ res) & 0b00100000) + * inc += 0b00100000; + * } + * if (!inc) + * return; + * + * res = atomic_add_return(counter @ Level 1, inc); + * orig = res - inc; + * if (inc < 0) { + * inc = -(-inc & ~0b00111111); // Clear used bits + * // xor bit 6: underflow + * if ((inc ^ orig ^ res) & 0b01000000) + * inc -= 0b01000000; + * } else { + * inc &= ~0b00111111; // Clear used bits + * // xor bit 6: overflow + * if ((inc ^ orig ^ res) & 0b01000000) + * inc += 0b01000000; + * } + * if (!inc) + * return; + * + * res = atomic_add_return(counter @ Level 2, inc); + * orig = res - inc; + * if (inc < 0) { + * inc = -(-inc & ~0b01111111); // Clear used bits + * // xor bit 7: underflow + * if ((inc ^ orig ^ res) & 0b10000000) + * inc -= 0b10000000; + * } else { + * inc &= ~0b01111111; // Clear used bits + * // xor bit 7: overflow + * if ((inc ^ orig ^ res) & 0b10000000) + * inc += 0b10000000; + * } + * if (!inc) + * return; + * + * atomic_add(counter @ Level 3, inc); + */ + +#include +#include +#include +#include +#include +#include +#include + +#define MAX_NR_LEVELS 5 + +struct counter_config { + unsigned int nr_items; + unsigned char nr_levels; + unsigned char n_arity_order[MAX_NR_LEVELS]; +}; + +/* + * nr_items is the number of items in the tree for levels 1 to and + * including the final level (approximate sum). It excludes the level 0 + * per-cpu counters. 
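+ *
+ * Example: for nr_cpus_order = 8 (up to 256 possible CPUs) the entry
+ * below uses n_arity_order = { 3, 3, 2 }, giving 2^(8-3) = 32 items at
+ * level 1, 2^(8-3-3) = 4 items at level 2 and 2^(8-3-3-2) = 1 item at
+ * the final level, hence nr_items = 32 + 4 + 1 = 37.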
+ */ +static const struct counter_config per_nr_cpu_order_config[] = { + [0] = { .nr_items = 0, .nr_levels = 0, .n_arity_order = { 0 } }, + [1] = { .nr_items = 1, .nr_levels = 1, .n_arity_order = { 1 } }, + [2] = { .nr_items = 3, .nr_levels = 2, .n_arity_order = { 1, 1 } }, + [3] = { .nr_items = 7, .nr_levels = 3, .n_arity_order = { 1, 1, 1 } }, + [4] = { .nr_items = 7, .nr_levels = 3, .n_arity_order = { 2, 1, 1 } }, + [5] = { .nr_items = 11, .nr_levels = 3, .n_arity_order = { 2, 2, 1 } }, + [6] = { .nr_items = 21, .nr_levels = 3, .n_arity_order = { 2, 2, 2 } }, + [7] = { .nr_items = 21, .nr_levels = 3, .n_arity_order = { 3, 2, 2 } }, + [8] = { .nr_items = 37, .nr_levels = 3, .n_arity_order = { 3, 3, 2 } }, + [9] = { .nr_items = 73, .nr_levels = 3, .n_arity_order = { 3, 3, 3 } }, + [10] = { .nr_items = 149, .nr_levels = 4, .n_arity_order = { 3, 3, 2, 2 } }, + [11] = { .nr_items = 293, .nr_levels = 4, .n_arity_order = { 3, 3, 3, 2 } }, + [12] = { .nr_items = 585, .nr_levels = 4, .n_arity_order = { 3, 3, 3, 3 } }, + [13] = { .nr_items = 1173, .nr_levels = 5, .n_arity_order = { 3, 3, 3, 2, 2 } }, + [14] = { .nr_items = 2341, .nr_levels = 5, .n_arity_order = { 3, 3, 3, 3, 2 } }, + [15] = { .nr_items = 4681, .nr_levels = 5, .n_arity_order = { 3, 3, 3, 3, 3 } }, + [16] = { .nr_items = 4681, .nr_levels = 5, .n_arity_order = { 4, 3, 3, 3, 3 } }, + [17] = { .nr_items = 8777, .nr_levels = 5, .n_arity_order = { 4, 4, 3, 3, 3 } }, + [18] = { .nr_items = 17481, .nr_levels = 5, .n_arity_order = { 4, 4, 4, 3, 3 } }, + [19] = { .nr_items = 34953, .nr_levels = 5, .n_arity_order = { 4, 4, 4, 4, 3 } }, + [20] = { .nr_items = 69905, .nr_levels = 5, .n_arity_order = { 4, 4, 4, 4, 4 } }, +}; + +static const struct counter_config *counter_config; +static unsigned int nr_cpus_order, inaccuracy_multiplier; + +static +int __percpu_counter_tree_init(struct percpu_counter_tree *counter, + unsigned int batch_size, gfp_t gfp_flags, + unsigned int __percpu *level0, + struct percpu_counter_tree_level_item *items) +{ + /* Batch size must be power of 2 */ + if (!batch_size || (batch_size & (batch_size - 1))) + return -EINVAL; + counter->batch_size = batch_size; + counter->bias = 0; + counter->level0 = level0; + counter->items = items; + if (!nr_cpus_order) { + counter->approx_sum.i = per_cpu_ptr(counter->level0, 0); + counter->level0_bit_mask = 0; + } else { + counter->approx_sum.a = &counter->items[counter_config->nr_items - 1].count; + counter->level0_bit_mask = 1UL << get_count_order(batch_size); + } + counter->inaccuracy = batch_size * inaccuracy_multiplier; + return 0; +} + +int percpu_counter_tree_init_many(struct percpu_counter_tree *counters, struct percpu_counter_tree_level_item *items, + unsigned int nr_counters, unsigned int batch_size, gfp_t gfp_flags) +{ + void __percpu *level0, *level0_iter; + size_t counter_size, items_size = 0; + void *items_iter; + unsigned int i; + int ret; + + counter_size = ALIGN(sizeof(*counters), __alignof__(*counters)); + level0 = __alloc_percpu_gfp(nr_counters * counter_size, + __alignof__(*counters), gfp_flags); + if (!level0) + return -ENOMEM; + if (nr_cpus_order) { + items_size = percpu_counter_tree_items_size(); + memset(items, 0, items_size * nr_counters); + } + level0_iter = level0; + items_iter = items; + for (i = 0; i < nr_counters; i++) { + ret = __percpu_counter_tree_init(&counters[i], batch_size, gfp_flags, level0_iter, items_iter); + if (ret) + goto free_level0; + level0_iter += counter_size; + if (nr_cpus_order) + items_iter += items_size; + } + return 0; + 
+free_level0: + free_percpu(level0); + return ret; +} + +int percpu_counter_tree_init(struct percpu_counter_tree *counter, struct percpu_counter_tree_level_item *items, + unsigned int batch_size, gfp_t gfp_flags) +{ + return percpu_counter_tree_init_many(counter, items, 1, batch_size, gfp_flags); +} + +void percpu_counter_tree_destroy_many(struct percpu_counter_tree *counters, unsigned int nr_counters) +{ + free_percpu(counters->level0); +} + +void percpu_counter_tree_destroy(struct percpu_counter_tree *counter) +{ + return percpu_counter_tree_destroy_many(counter, 1); +} + +/* + * It does not matter through which path the carry propagates up the + * tree, therefore there is no need to disable preemption because the + * cpu number is only used to favor cache locality. + */ +void percpu_counter_tree_add_slowpath(struct percpu_counter_tree *counter, int inc) +{ + unsigned int level_items, nr_levels = counter_config->nr_levels, + level, n_arity_order, bit_mask; + struct percpu_counter_tree_level_item *item = counter->items; + unsigned int cpu = raw_smp_processor_id(); + + WARN_ON_ONCE(!nr_cpus_order); /* Should never be called for 1 cpu. */ + + n_arity_order = counter_config->n_arity_order[0]; + bit_mask = counter->level0_bit_mask << n_arity_order; + level_items = 1U << (nr_cpus_order - n_arity_order); + + for (level = 1; level < nr_levels; level++) { + atomic_t *count = &item[cpu & (level_items - 1)].count; + unsigned int orig, res; + + res = atomic_add_return_relaxed(inc, count); + orig = res - inc; + inc = percpu_counter_tree_carry(orig, res, inc, bit_mask); + if (likely(!inc)) + return; + item += level_items; + n_arity_order = counter_config->n_arity_order[level]; + level_items >>= n_arity_order; + bit_mask <<= n_arity_order; + } + atomic_add(inc, counter->approx_sum.a); +} + +/* + * Precise sum. Perform the sum of all per-cpu counters. + */ +static int percpu_counter_tree_precise_sum_unbiased(struct percpu_counter_tree *counter) +{ + unsigned int sum = 0; + int cpu; + + for_each_possible_cpu(cpu) + sum += *per_cpu_ptr(counter->level0, cpu); + return (int) sum; +} + +int percpu_counter_tree_precise_sum(struct percpu_counter_tree *counter) +{ + return percpu_counter_tree_precise_sum_unbiased(counter) + READ_ONCE(counter->bias); +} + +/* + * Do an approximate comparison of two counters. + * Return 0 if counters do not differ by more than the sum of their + * respective inaccuracy ranges, + * Return -1 if counter @a less than counter @b, + * Return 1 if counter @a is greater than counter @b. + */ +int percpu_counter_tree_approximate_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b) +{ + int count_a = percpu_counter_tree_approximate_sum(a), + count_b = percpu_counter_tree_approximate_sum(b); + + if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy)) + return 0; + if (count_a < count_b) + return -1; + return 1; +} + +/* + * Do an approximate comparison of a counter against a given value. + * Return 0 if the value is within the inaccuracy range of the counter, + * Return -1 if the value less than counter, + * Return 1 if the value is greater than counter. + */ +int percpu_counter_tree_approximate_compare_value(struct percpu_counter_tree *counter, int v) +{ + int count = percpu_counter_tree_approximate_sum(counter); + + if (abs(v - count) <= counter->inaccuracy) + return 0; + if (count < v) + return -1; + return 1; +} + +/* + * Do a precise comparison of two counters. 
+ * Return 0 if the counters are equal,
+ * Return -1 if counter @a is less than counter @b,
+ * Return 1 if counter @a is greater than counter @b.
+ */
+int percpu_counter_tree_precise_compare(struct percpu_counter_tree *a, struct percpu_counter_tree *b)
+{
+	int count_a = percpu_counter_tree_approximate_sum(a),
+	    count_b = percpu_counter_tree_approximate_sum(b);
+
+	if (abs(count_a - count_b) <= (a->inaccuracy + b->inaccuracy)) {
+		if (b->inaccuracy < a->inaccuracy) {
+			count_a = percpu_counter_tree_precise_sum(a);
+			if (abs(count_a - count_b) <= b->inaccuracy)
+				count_b = percpu_counter_tree_precise_sum(b);
+		} else {
+			count_b = percpu_counter_tree_precise_sum(b);
+			if (abs(count_a - count_b) <= a->inaccuracy)
+				count_a = percpu_counter_tree_precise_sum(a);
+		}
+	}
+	if (count_a < count_b)
+		return -1;
+	if (count_a > count_b)
+		return 1;
+	return 0;
+}
+
+/*
+ * Do a precise comparison of a counter against a given value.
+ * Return 0 if the counter is equal to the value,
+ * Return -1 if the counter is less than the value,
+ * Return 1 if the counter is greater than the value.
+ */
+int percpu_counter_tree_precise_compare_value(struct percpu_counter_tree *counter, int v)
+{
+	int count = percpu_counter_tree_approximate_sum(counter);
+
+	if (abs(v - count) <= counter->inaccuracy)
+		count = percpu_counter_tree_precise_sum(counter);
+	if (count < v)
+		return -1;
+	if (count > v)
+		return 1;
+	return 0;
+}
+
+static
+void percpu_counter_tree_set_bias(struct percpu_counter_tree *counter, int bias)
+{
+	WRITE_ONCE(counter->bias, bias);
+}
+
+void percpu_counter_tree_set(struct percpu_counter_tree *counter, int v)
+{
+	percpu_counter_tree_set_bias(counter,
+		v - percpu_counter_tree_precise_sum_unbiased(counter));
+}
+
+unsigned int percpu_counter_tree_inaccuracy(struct percpu_counter_tree *counter)
+{
+	return counter->inaccuracy;
+}
+
+size_t percpu_counter_tree_items_size(void)
+{
+	if (!nr_cpus_order)
+		return 0;
+	return counter_config->nr_items * sizeof(struct percpu_counter_tree_level_item);
+}
+
+static unsigned int __init calculate_inaccuracy_multiplier(void)
+{
+	unsigned int nr_levels = counter_config->nr_levels, level;
+	unsigned int level_items = 1U << nr_cpus_order;
+	unsigned int inaccuracy = 0, batch_size = 1;
+
+	for (level = 0; level < nr_levels; level++) {
+		unsigned int n_arity_order = counter_config->n_arity_order[level];
+
+		inaccuracy += batch_size * level_items;
+		batch_size <<= n_arity_order;
+		level_items >>= n_arity_order;
+	}
+	return inaccuracy;
+}
+
+int __init percpu_counter_tree_subsystem_init(void)
+{
+	nr_cpus_order = get_count_order(nr_cpu_ids);
+	if (WARN_ON_ONCE(nr_cpus_order >= ARRAY_SIZE(per_nr_cpu_order_config))) {
+		printk(KERN_ERR "Unsupported number of CPUs (%u)\n", nr_cpu_ids);
+		return -1;
+	}
+	counter_config = &per_nr_cpu_order_config[nr_cpus_order];
+	inaccuracy_multiplier = calculate_inaccuracy_multiplier();
+	return 0;
+}
--
2.39.5

Use hierarchical per-cpu counters for rss tracking to fix per-mm RSS
accounting, which has become too inaccurate for OOM killer purposes on
large many-core systems.

The following rss tracking issues were noted by Sweet Tea Dorminy [1];
they led to picking the wrong tasks as OOM kill targets:

Recently, several internal services had an RSS usage regression as part
of a kernel upgrade. Previously, they were on a pre-6.2 kernel and were
able to read RSS statistics in a backup watchdog process to monitor and
decide if they'd overrun their memory budget.
Now, however, a representative service with five threads, expected to
use about a hundred MB of memory, on a 250-cpu machine had memory usage
tens of megabytes different from the expected amount -- this
constituted a significant percentage of inaccuracy, causing the
watchdog to act.

This was a result of commit f1a7941243c1 ("mm: convert mm's rss stats
into percpu_counter") [1]. Previously, the memory error was bounded by
64*nr_threads pages, a very livable megabyte. Now, however, as a result
of scheduler decisions moving the threads around the CPUs, the memory
error could be as large as a gigabyte. This is a really tremendous
inaccuracy for any few-threaded program on a large machine and impedes
monitoring significantly. These stat counters are also used to make OOM
killing decisions, so this additional inaccuracy could make a big
difference in OOM situations -- either resulting in the wrong process
being killed, or in less memory being returned from an OOM-kill than
expected.

Here is a (possibly incomplete) list of the prior approaches that were
used or proposed, along with their downsides:

1) Per-thread rss tracking: large error on many-thread processes.

2) Per-CPU counters: up to 12% slower for short-lived processes and 9%
   increased system time in make test workloads [1]. Moreover, the
   inaccuracy increases as O(n^2) with the number of CPUs.

3) Per-NUMA-node counters: requires atomics on the fast path
   (overhead), and the error is high on systems with many NUMA nodes
   (32 times the number of NUMA nodes).

The approach proposed here is to replace this with the hierarchical
per-cpu counters, which bound the inaccuracy based on the system
topology, growing as O(N*log(N)).

Commit 82241a83cd15 ("mm: fix the inaccurate memory statistics issue
for users") introduced get_mm_counter_sum() for precise /proc memory
status queries. Implement it with percpu_counter_tree_precise_sum(),
since it is not a fast path and precision is preferred over speed.

* Testing results:

Test hardware:

2 sockets AMD EPYC 9654 96-Core Processor (384 logical CPUs total)

Methodology:

Comparing the current upstream implementation with the hierarchical
counters is done by keeping both implementations wired up in parallel,
and running a single-process, single-threaded program which hops
randomly across CPUs in the system, calling mmap(2) and munmap(2) on
random CPUs, keeping track of an array of allocated mappings, and
randomly choosing entries to either map or unmap. get_mm_counter() is
instrumented to compare the upstream counter approximation to the
precise value, and to print the delta when it goes over a given
threshold. The delta of the hierarchical counter approximation to the
precise value is also printed for comparison.

After a few minutes running this test, the upstream implementation's
counter approximation reaches a 1GB delta from the precise value,
compared to an 80MB delta with the hierarchical counter. The
hierarchical counter provides a guaranteed maximum approximation
inaccuracy of 192MB on that hardware topology.

* Fast path implementation comparison

The new inline percpu_counter_tree_add() uses a this_cpu_add_return()
for the fast path (as long as the update does not carry the level-0
counter past its batch size threshold). Above that, it calls a slow
path which "trickles up" the carry to upper level counters with
atomic_add_return(). In comparison, the upstream counters
implementation calls percpu_counter_add_batch(), which uses
this_cpu_try_cmpxchg() on the fast path, and takes a
raw_spin_lock_irqsave() above a certain threshold.
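To make the fast-path comparison concrete, the hierarchical add path
introduced by the previous patch boils down to the following excerpt,
copied from the new header with comments added here for illustration:

  static inline
  void percpu_counter_tree_add(struct percpu_counter_tree *counter, int inc)
  {
          unsigned int bit_mask = counter->level0_bit_mask, orig, res;

          /* Single per-cpu add on the level-0 counter. */
          res = this_cpu_add_return(*counter->level0, inc);
          orig = res - inc;
          /* Compute the carry, if any, for the next level of the tree. */
          inc = percpu_counter_tree_carry(orig, res, inc, bit_mask);
          if (likely(!inc))
                  return;
          /* Trickle the carry up the tree using atomic_add_return(). */
          percpu_counter_tree_add_slowpath(counter, inc);
  }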
The hierarchical implementation is therefore expected to have less
contention on mid-sized allocations than the upstream counters, because
the atomic counters tracking those carry bits are only shared across
nearby CPUs. In comparison, the upstream counters immediately take a
global spinlock when reaching the threshold.

* Benchmarks

The will-it-scale page_fault1 benchmarks are used to compare the
upstream counters to the hierarchical counters, with hyperthreading
disabled. The speedup is within the standard deviation of the upstream
runs, so the overhead is not significant.

                                        upstream   hierarchical   speedup
page_fault1_processes -s 100 -t 1         614783         615558     +0.1%
page_fault1_threads   -s 100 -t 1         612788         612447     -0.1%
page_fault1_processes -s 100 -t 96      37994977       37932035     -0.2%
page_fault1_threads   -s 100 -t 96       2484130        2504860     +0.8%
page_fault1_processes -s 100 -t 192     71262917       71118830     -0.2%
page_fault1_threads   -s 100 -t 192      2446437        2469296     +0.9%

Link: https://lore.kernel.org/lkml/20250331223516.7810-2-sweettea-kernel@dorminy.me/ # [1]
Link: https://lore.kernel.org/lkml/20250704150226.47980-1-mathieu.desnoyers@efficios.com/
Signed-off-by: Mathieu Desnoyers
Cc: Andrew Morton
Cc: "Paul E. McKenney"
Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Mathieu Desnoyers
Cc: Dennis Zhou
Cc: Tejun Heo
Cc: Christoph Lameter
Cc: Martin Liu
Cc: David Rientjes
Cc: christian.koenig@amd.com
Cc: Shakeel Butt
Cc: SeongJae Park
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Sweet Tea Dorminy
Cc: Lorenzo Stoakes
Cc: "Liam R . Howlett"
Cc: Mike Rapoport
Cc: Suren Baghdasaryan
Cc: Vlastimil Babka
Cc: Christian Brauner
Cc: Wei Yang
Cc: David Hildenbrand
Cc: Miaohe Lin
Cc: Al Viro
Cc: linux-mm@kvack.org
Cc: linux-trace-kernel@vger.kernel.org
Cc: Yu Zhao
Cc: Roman Gushchin
Cc: Mateusz Guzik
Cc: Matthew Wilcox
Cc: Baolin Wang
Cc: Aboorva Devarajan
---
Changes since v8:
- Use percpu_counter_tree_init_many and
  percpu_counter_tree_destroy_many APIs.
- Remove percpu tree items allocation. Extend mm_struct size to include
  rss items. Those are handled through the new helpers
  get_rss_stat_items() and get_rss_stat_items_size() and passed as
  parameter to percpu_counter_tree_init_many().

Changes since v7:
- Use precise sum positive API to handle a scenario where an unlucky
  precise sum iteration would observe negative counter values due to
  concurrent updates.

Changes since v6:
- Rebased on v6.18-rc3.
- Implement get_mm_counter_sum as percpu_counter_tree_precise_sum for
  /proc virtual files memory state queries.

Changes since v5:
- Use percpu_counter_tree_approximate_sum_positive.

Changes since v4:
- get_mm_counter needs to return 0 or a positive value.
get_mm_counter_sum -> precise sum positive --- include/linux/mm.h | 28 +++++++++++++++++++++++----- include/linux/mm_types.h | 4 ++-- include/trace/events/kmem.h | 2 +- kernel/fork.c | 24 ++++++++++++++---------- 4 files changed, 40 insertions(+), 18 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7c79b3369b82..cd81fa8924eb 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2681,38 +2681,56 @@ static inline bool get_user_page_fast_only(unsigned long addr, { return get_user_pages_fast_only(addr, 1, gup_flags, pagep) == 1; } + +static inline size_t get_rss_stat_items_size(void) +{ + return percpu_counter_tree_items_size() * NR_MM_COUNTERS; +} + +static inline struct percpu_counter_tree_level_item *get_rss_stat_items(struct mm_struct *mm) +{ + unsigned long ptr = (unsigned long)mm; + + ptr += offsetof(struct mm_struct, cpu_bitmap); + /* Skip cpu_bitmap */ + ptr += cpumask_size(); + /* Skip mm_cidmask */ + ptr += mm_cid_size(); + return (struct percpu_counter_tree_level_item *)ptr; +} + /* * per-process(per-mm_struct) statistics. */ static inline unsigned long get_mm_counter(struct mm_struct *mm, int member) { - return percpu_counter_read_positive(&mm->rss_stat[member]); + return percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]); } static inline unsigned long get_mm_counter_sum(struct mm_struct *mm, int member) { - return percpu_counter_sum_positive(&mm->rss_stat[member]); + return percpu_counter_tree_precise_sum_positive(&mm->rss_stat[member]); } void mm_trace_rss_stat(struct mm_struct *mm, int member); static inline void add_mm_counter(struct mm_struct *mm, int member, long value) { - percpu_counter_add(&mm->rss_stat[member], value); + percpu_counter_tree_add(&mm->rss_stat[member], value); mm_trace_rss_stat(mm, member); } static inline void inc_mm_counter(struct mm_struct *mm, int member) { - percpu_counter_inc(&mm->rss_stat[member]); + percpu_counter_tree_add(&mm->rss_stat[member], 1); mm_trace_rss_stat(mm, member); } static inline void dec_mm_counter(struct mm_struct *mm, int member) { - percpu_counter_dec(&mm->rss_stat[member]); + percpu_counter_tree_add(&mm->rss_stat[member], -1); mm_trace_rss_stat(mm, member); } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 90e5790c318f..adb2f227bac7 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -18,7 +18,7 @@ #include #include #include -#include +#include #include #include @@ -1119,7 +1119,7 @@ struct mm_struct { unsigned long saved_e_flags; #endif - struct percpu_counter rss_stat[NR_MM_COUNTERS]; + struct percpu_counter_tree rss_stat[NR_MM_COUNTERS]; struct linux_binfmt *binfmt; diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h index 7f93e754da5c..91c81c44f884 100644 --- a/include/trace/events/kmem.h +++ b/include/trace/events/kmem.h @@ -442,7 +442,7 @@ TRACE_EVENT(rss_stat, __entry->mm_id = mm_ptr_to_hash(mm); __entry->curr = !!(current->mm == mm); __entry->member = member; - __entry->size = (percpu_counter_sum_positive(&mm->rss_stat[member]) + __entry->size = (percpu_counter_tree_approximate_sum_positive(&mm->rss_stat[member]) << PAGE_SHIFT); ), diff --git a/kernel/fork.c b/kernel/fork.c index 3da0f08615a9..5430d63c9e2d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -133,6 +133,11 @@ */ #define MAX_THREADS FUTEX_TID_MASK +/* + * Batch size of rss stat approximation + */ +#define RSS_STAT_BATCH_SIZE 32 + /* * Protected counters by write_lock_irq(&tasklist_lock) */ @@ -583,14 +588,12 @@ static void check_mm(struct mm_struct 
*mm) "Please make sure 'struct resident_page_types[]' is updated as well"); for (i = 0; i < NR_MM_COUNTERS; i++) { - long x = percpu_counter_sum(&mm->rss_stat[i]); - - if (unlikely(x)) { - pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld Comm:%s Pid:%d\n", - mm, resident_page_types[i], x, + if (unlikely(percpu_counter_tree_precise_compare_value(&mm->rss_stat[i], 0) != 0)) + pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%d Comm:%s Pid:%d\n", + mm, resident_page_types[i], + percpu_counter_tree_precise_sum(&mm->rss_stat[i]), current->comm, task_pid_nr(current)); - } } if (mm_pgtables_bytes(mm)) @@ -688,7 +691,7 @@ void __mmdrop(struct mm_struct *mm) put_user_ns(mm->user_ns); mm_pasid_drop(mm); mm_destroy_cid(mm); - percpu_counter_destroy_many(mm->rss_stat, NR_MM_COUNTERS); + percpu_counter_tree_destroy_many(mm->rss_stat, NR_MM_COUNTERS); free_mm(mm); } @@ -1083,8 +1086,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p, if (mm_alloc_cid(mm, p)) goto fail_cid; - if (percpu_counter_init_many(mm->rss_stat, 0, GFP_KERNEL_ACCOUNT, - NR_MM_COUNTERS)) + if (percpu_counter_tree_init_many(mm->rss_stat, get_rss_stat_items(mm), + NR_MM_COUNTERS, RSS_STAT_BATCH_SIZE, + GFP_KERNEL_ACCOUNT)) goto fail_pcpu; mm->user_ns = get_user_ns(user_ns); @@ -2964,7 +2968,7 @@ void __init mm_cache_init(void) * dynamically sized based on the maximum CPU number this system * can have, taking hotplug into account (nr_cpu_ids). */ - mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_cid_size(); + mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_cid_size() + get_rss_stat_items_size(); mm_cachep = kmem_cache_create_usercopy("mm_struct", mm_size, ARCH_MIN_MMSTRUCT_ALIGN, -- 2.39.5