From: Liang Luo In try_pick_next_cgroup(), a cgroup is charged its full slice upfront. fcg_dispatch() is then supposed to refund the unused time by adding (expected_end - now) * FCG_HWEIGHT_ONE / hweight to cvtime_delta: __sync_fetch_and_add(&cgc->cvtime_delta, (cpuc->cur_at + cgrp_slice_ns - now) * FCG_HWEIGHT_ONE / (cgc->hweight ?: 1)); However, if a cgroup yields early (expected_end > now), this adds a positive value to cvtime_delta, which ultimately increases cvtime, heavily penalizing the cgroup instead of refunding it. Conversely, if a cgroup overuses its slice (now > expected_end), the unsigned subtraction underflows to a massive u64. When multiplied by FCG_HWEIGHT_ONE and divided by hweight via unsigned division, it yields an enormous positive number. Adding this huge value to cvtime_delta pushes the cgroup's virtual time roughly 3.25 days into the future, permanently starving it. cgrp_cap_budget() later applies cvtime_delta to cvtime reading it as u64. Once dispatch can produce negative (refund) deltas, reading them as u64 turns every refund into a huge positive value with the same starving effect, so read and apply the delta as s64 and clamp the result at zero. Order the time delta as (now - expected_end) so overruns are positive and early yields negative, and split the magnitude scaling by sign since BPF has no signed division. Observed on a nested cgroup hierarchy under mixed CPU load: cvtime grew roughly 300x faster than wall-clock time, cvtime_delta reached 2^47..2^49, and tasks in cgroups stalled. After the change cvtime advances at the expected weight-scaled rate and no anomalous deltas appear. Fixes: a4103eacc2ab ("sched_ext: Add a cgroup scheduler which uses flattened hierarchy") Reviewed-by: Andrea Righi Signed-off-by: Liang Luo --- Changes in v2: - Use the standard kernel time-comparison idiom for the delta computation in fcg_dispatch() (Andrea Righi). - Add Andrea's Reviewed-by tag. tools/sched_ext/scx_flatcg.bpf.c | 38 +++++++++++++++++++++++++++----- 1 file changed, 32 insertions(+), 6 deletions(-) diff --git a/tools/sched_ext/scx_flatcg.bpf.c b/tools/sched_ext/scx_flatcg.bpf.c index fec359581826..6aa0c62b76d2 100644 --- a/tools/sched_ext/scx_flatcg.bpf.c +++ b/tools/sched_ext/scx_flatcg.bpf.c @@ -249,15 +249,26 @@ static void cgrp_refresh_hweight(struct cgroup *cgrp, struct fcg_cgrp_ctx *cgc) static void cgrp_cap_budget(struct cgv_node *cgv_node, struct fcg_cgrp_ctx *cgc) { - u64 delta, cvtime, max_budget; + s64 delta; + u64 cvtime, max_budget; /* * A node which is on the rbtree can't be pointed to from elsewhere yet * and thus can't be updated and repositioned. Instead, we collect the * vtime deltas separately and apply it asynchronously here. */ - delta = __sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta); - cvtime = cgv_node->cvtime + delta; + delta = (s64)__sync_fetch_and_sub(&cgc->cvtime_delta, cgc->cvtime_delta); + /* + * cvtime_delta may be negative (early-yield refund from fcg_dispatch()). + * Apply it in the signed domain and clamp the resulting cvtime at zero + * so that a refund larger than the accumulated cvtime doesn't roll the + * field negative. + */ + s64 vtime = (s64)cgv_node->cvtime + delta; + + if (vtime < 0) + vtime = 0; + cvtime = (u64)vtime; /* * Allow a cgroup to carry the maximum budget proportional to its @@ -767,11 +778,26 @@ void BPF_STRUCT_OPS(fcg_dispatch, s32 cpu, struct task_struct *prev) * We want to update the vtime delta and then look for the next * cgroup to execute but the latter needs to be done in a loop * and we can't keep the lock held. Oh well... + * + * Charge the overrun or refund the unused time. The original + * (cur_at + slice - now) was inverted: early yields were + * penalized and overruns underflowed. Use (now - (cur_at + + * slice)) so overruns are positive and early yields negative; + * split the magnitude scaling by sign since BPF has no signed + * division. */ bpf_spin_lock(&cgv_tree_lock); - __sync_fetch_and_add(&cgc->cvtime_delta, - (cpuc->cur_at + cgrp_slice_ns - now) * - FCG_HWEIGHT_ONE / (cgc->hweight ?: 1)); + u64 expected_end = cpuc->cur_at + cgrp_slice_ns; + s64 excess = (s64)(now - expected_end); + s64 delta; + + if (excess > 0) + delta = (s64)((u64)excess * FCG_HWEIGHT_ONE / + (cgc->hweight ?: 1)); + else + delta = -(s64)((u64)-excess * FCG_HWEIGHT_ONE / + (cgc->hweight ?: 1)); + __sync_fetch_and_add(&cgc->cvtime_delta, delta); bpf_spin_unlock(&cgv_tree_lock); } else { stat_inc(FCG_STAT_CNS_GONE); -- 2.43.0