The current implementation of damon_hot_score() uses a manual for-loop to
calculate the value of 'age_in_log'.  This can be efficiently replaced by
fls().

In a simulated performance test with 10,000,000 iterations, this
optimization showed a significant reduction in latency:

- Average Latency: Reduced from ~9ns to ~1ns.
- P99 Latency: Reduced from ~60ns to ~41ns.
- Latency Distribution: The loop-based version mostly fell into the
  40-59ns range, while the fls-based version shifted significantly
  towards the 20-39ns range in the test environment.

Although these results are based on a simulated kernel module test
environment [1], they indicate a clear instruction-level optimization.

[1] https://github.com/aethernet65535/damon-hot-score-fls-optimize/blob/master/test-kernel-module/fls.c

Signed-off-by: Liew Rui Yan
---
Note on testing methodology:

I attempted to measure the performance directly within the kernel using
bpftrace, perf, and ktime inside damon_hot_score().  However, the ktime
results were highly unstable, and with perf and bpftrace the function was
difficult to trace reliably (likely due to my own tracing limitations).

Despite the instability of the in-kernel ktime measurements, one thing
remained consistent: the fls-based version significantly improves the
"long tail" latency compared to the for-loop.

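As a sanity check on the replacement itself, the two forms can be compared
in user space.  The following is only a minimal sketch under stated
assumptions, not the test module from [1]: fls_sketch() is a hypothetical
stand-in for the kernel's fls() built on the GCC/Clang __builtin_clz()
builtin, unsigned int is assumed to be 32 bits wide, and
DAMON_MAX_AGE_IN_LOG is assumed to be 32 as in mm/damon/ops-common.c.
Because the kernel's fls() takes an unsigned int, the sketch only exercises
values that fit in 32 bits.

```c
#include <assert.h>
#include <limits.h>

/* Assumed to match the definition in mm/damon/ops-common.c. */
#define DAMON_MAX_AGE_IN_LOG 32

/*
 * Hypothetical user-space stand-in for the kernel's fls(): returns the
 * 1-based position of the most significant set bit, or 0 if x is 0.
 * Assumes a 32-bit unsigned int and a GCC/Clang compatible compiler.
 */
static int fls_sketch(unsigned int x)
{
	return x ? 32 - __builtin_clz(x) : 0;
}

/* The original loop: count right shifts until zero, capped at the max. */
static int age_in_log_loop(unsigned long age_in_sec)
{
	int age_in_log;

	for (age_in_log = 0; age_in_log < DAMON_MAX_AGE_IN_LOG && age_in_sec;
			age_in_log++, age_in_sec >>= 1)
		;
	return age_in_log;
}

/* The fls-based form, with min_t() written out as a plain comparison. */
static int age_in_log_fls(unsigned long age_in_sec)
{
	int bits = fls_sketch((unsigned int)age_in_sec);

	return bits < DAMON_MAX_AGE_IN_LOG ? bits : DAMON_MAX_AGE_IN_LOG;
}

/* Compare both forms around every power of two that fits in 32 bits. */
static void check_equivalence(void)
{
	int shift;

	for (shift = 0; shift < 32; shift++) {
		unsigned long v = 1UL << shift;

		assert(age_in_log_loop(v - 1) == age_in_log_fls(v - 1));
		assert(age_in_log_loop(v) == age_in_log_fls(v));
		assert(age_in_log_loop(v + 1) == age_in_log_fls(v + 1));
	}
	assert(age_in_log_loop(UINT_MAX) == age_in_log_fls(UINT_MAX));
}
```

Calling check_equivalence() from a small main() should complete without
tripping an assertion for the values it covers.  'age_in_sec' values of
2^32 or larger are outside what this sketch checks.
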
Test results from the simulated module:

- fls-based:

DAMON Perf Test: Starting 10000000 iterations
=============================================
Total Iterations : 10000000
Average Latency  : 1 ns
P95 Latency      : 40 ns
P99 Latency      : 41 ns
---------------------------------------------
Range (ns)   | Count      | Percent
---------------------------------------------
20-39        | 3522000    | 35%
40-59        | 6478000    | 64%
60-79        | 0          | 0%
=============================================

- for-loop:

DAMON Perf Test: Starting 10000000 iterations
=============================================
Total Iterations : 10000000
Average Latency  : 9 ns
P95 Latency      : 51 ns
P99 Latency      : 60 ns
---------------------------------------------
Range (ns)   | Count      | Percent
---------------------------------------------
20-39        | 0          | 0%
40-59        | 9894000    | 98%
60-79        | 98000      | 0%
=============================================

Full raw benchmark results can be found at [2].

If anyone could suggest a more robust way to profile this specific
function within live DAMON context, I would greatly appreciate the
guidance.

[2] https://github.com/aethernet65535/damon-hot-score-fls-optimize/tree/master/result-raw

 mm/damon/ops-common.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/damon/ops-common.c b/mm/damon/ops-common.c
index 8c6d613425c1..0294de61a23a 100644
--- a/mm/damon/ops-common.c
+++ b/mm/damon/ops-common.c
@@ -117,9 +117,7 @@ int damon_hot_score(struct damon_ctx *c, struct damon_region *r,
 			damon_max_nr_accesses(&c->attrs);
 
 	age_in_sec = (unsigned long)r->age * c->attrs.aggr_interval / 1000000;
-	for (age_in_log = 0; age_in_log < DAMON_MAX_AGE_IN_LOG && age_in_sec;
-			age_in_log++, age_in_sec >>= 1)
-		;
+	age_in_log = min_t(int, fls(age_in_sec), DAMON_MAX_AGE_IN_LOG);
 
 	/* If frequency is 0, higher age means it's colder */
 	if (freq_subscore == 0)
-- 
2.53.0