In public cloud environments, block devices usually enforce performance
limits based on two independent token buckets: IOPS and BPS. The device
is throttled when either the IOPS limit or the BPS limit is reached.

To effectively manage "noisy neighbor" problems, we configure iocost
model parameters (or vrate max) to approximately 95% of the cloud
provider's provisioned limits. The goal is to strictly avoid hitting
the storage backend's hard BPS/IOPS limits. By saturating the virtual
budget before the physical limit, iocost engages throttling first.
Unlike the indiscriminate throttling applied by cloud storage backends,
iocost selectively penalizes low-weight cgroups and cgroups that
generate heavy traffic. Consequently, latency-sensitive critical
workloads are shielded from the congestion. In our testing, this
approach has provided good isolation.

However, the existing 'linear' cost model leads to significant
performance loss in this specific configuration due to its additive
nature.

Using tools/cgroup/iocost_coef_gen.py, we measured the following
performance data on a typical cloud disk:

8:16 rbps=173471131 rseqiops=3566 rrandiops=3566 wbps=173333269 wseqiops=3566 wrandiops=3559

Dividing BPS by IOPS (wbps / wseqiops = 173333269 / 3566) yields
approximately 48607 bytes; the read-side ratio (173471131 / 3566, about
48646 bytes) is nearly identical. When running fio with bs=48607, we
observed a 50% drop in throughput compared to running without iocost
enabled.

The reason is that the current 'linear' model calculates cost as:

  Cost = BaseCost + (Pages * PerPageCost)

Expressing BaseCost and PerPageCost in terms of the configured IOPS and
BPS, this is effectively:

  Cost = VTIME_PER_SEC * ((1 / IOPS - 4096 / BPS) + size / BPS)

When the I/O size is such that the IOPS cost component roughly equals
the BPS cost component (as in the bs=48607 case above), the linear
model sums them up. Since cloud disks throttle based on *either* IOPS
*or* BPS (whichever is exhausted first), summing them nearly doubles
the calculated cost (about 1.92x for the coefficients above). This
causes iocost to drain virtual time almost twice as fast as necessary,
throttling the device to roughly 50% utilization.
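
The arithmetic behind the 50% figure can be checked with a small
user-space sketch. This is not kernel code: the function names are
hypothetical, costs are expressed in seconds of device time rather than
vtime, and the write-side coefficients measured above are assumed.

```c
/* Measured write-side coefficients from the iocost_coef_gen.py output. */
#define SKETCH_BPS   173333269.0
#define SKETCH_IOPS  3566.0
#define SKETCH_PAGE  4096.0

/*
 * Per-IO cost under the additive 'linear' model, in seconds.
 * BaseCost is (1 / IOPS - 4096 / BPS), assumed non-negative here.
 */
static double linear_cost(double size)
{
	double base = 1.0 / SKETCH_IOPS - SKETCH_PAGE / SKETCH_BPS;

	return base + size / SKETCH_BPS;
}

/*
 * Time a dual-bucket device needs for one IO: whichever of the
 * IOPS and BPS buckets is the bottleneck for this size.
 */
static double device_cost(double size)
{
	double iops_cost = 1.0 / SKETCH_IOPS;
	double bps_cost = size / SKETCH_BPS;

	return iops_cost > bps_cost ? iops_cost : bps_cost;
}

/* Fraction of the real device capacity the linear model lets through. */
static double linear_utilization(double size)
{
	return device_cost(size) / linear_cost(size);
}
```

With these numbers, linear_utilization(48607) evaluates to roughly
0.52, which is consistent with the observed drop of about 50%, while
linear_utilization(4096) is 1.0 because a single-page I/O is charged
exactly 1 / IOPS.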

To solve this, this patch introduces a new 'linear-max' cost model.
Instead of adding the components, it takes the maximum:

  Cost = VTIME_PER_SEC * max(1 / IOPS, size / BPS)

In terms of the model coefficients, this translates to:

  Cost = max(BaseCost + PerPageCost, Pages * PerPageCost)

This formula correctly models the dual-bucket behavior of cloud disks.
It ensures that for any block size, the calculated cost aligns with the
actual bottleneck (IOPS or BPS). This allows the system to reach close
to the provisioned BPS/IOPS limits without premature throttling, while
still maintaining the latency protection benefits of iocost.
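
For illustration, the two models can be compared in integer vtime units
with the user-space sketch below. It is a simplified approximation, not
the code in blk-iocost.c: struct lcoefs and the helper names are
hypothetical, the sequential/random distinction is dropped, the value
of VTIME_PER_SEC (1 << 37) is an assumption, and bps and iops must be
non-zero.

```c
#include <stdint.h>

/* Assumption: vtime units per second; treated as 1 << 37 in this sketch. */
#define SK_VTIME_PER_SEC  (1ULL << 37)
#define SK_PAGE_SIZE      4096ULL

/* Hypothetical, simplified coefficient pair. */
struct lcoefs {
	uint64_t io;	/* BaseCost: per-IO cost on top of one page */
	uint64_t page;	/* PerPageCost: cost of one 4 KiB page */
};

static uint64_t div_round_up(uint64_t a, uint64_t b)
{
	return (a + b - 1) / b;
}

/* Derive BaseCost and PerPageCost from non-zero BPS and IOPS limits. */
static struct lcoefs make_lcoefs(uint64_t bps, uint64_t iops)
{
	struct lcoefs c = { 0, 0 };
	uint64_t per_io = div_round_up(SK_VTIME_PER_SEC, iops);

	c.page = div_round_up(SK_VTIME_PER_SEC,
			      div_round_up(bps, SK_PAGE_SIZE));
	/* BaseCost is clamped at zero when the page cost dominates. */
	if (per_io > c.page)
		c.io = per_io - c.page;
	return c;
}

/* Existing model: Cost = BaseCost + Pages * PerPageCost */
static uint64_t cost_linear(const struct lcoefs *c, uint64_t pages)
{
	return c->io + pages * c->page;
}

/* New model: Cost = max(BaseCost + PerPageCost, Pages * PerPageCost) */
static uint64_t cost_linear_max(const struct lcoefs *c, uint64_t pages)
{
	uint64_t iops_cost = c->io + c->page;
	uint64_t bps_cost = pages * c->page;

	return iops_cost > bps_cost ? iops_cost : bps_cost;
}
```

Both functions return the same cost for a single-page I/O, and
cost_linear_max() never exceeds cost_linear() for any page count of at
least one, so switching models can only reduce the charged cost.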

Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
---
 block/blk-iocost.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/block/blk-iocost.c b/block/blk-iocost.c
index ef543d163d46..ead478d8e5bc 100644
--- a/block/blk-iocost.c
+++ b/block/blk-iocost.c
@@ -445,6 +445,7 @@ struct ioc {
 	int				autop_idx;
 	bool				user_qos_params:1;
 	bool				user_cost_model:1;
+	bool				cost_model_linear_max:1;
 };
 
 struct iocg_pcpu_stat {
@@ -2565,7 +2566,12 @@ static void calc_vtime_cost_builtin(struct bio *bio, struct ioc_gq *iocg,
 			cost += coef_seqio;
 		}
 	}
-	cost += pages * coef_page;
+
+	if (ioc->cost_model_linear_max)
+		cost = max(cost + coef_page, pages * coef_page);
+	else
+		cost += pages * coef_page;
+
 out:
 	*costp = cost;
 }
@@ -3368,10 +3374,11 @@ static u64 ioc_cost_model_prfill(struct seq_file *sf,
 		return 0;
 
 	spin_lock(&ioc->lock);
-	seq_printf(sf, "%s ctrl=%s model=linear "
+	seq_printf(sf, "%s ctrl=%s model=%s "
 		   "rbps=%llu rseqiops=%llu rrandiops=%llu "
 		   "wbps=%llu wseqiops=%llu wrandiops=%llu\n",
 		   dname, ioc->user_cost_model ? "user" : "auto",
+		   ioc->cost_model_linear_max ? "linear-max" : "linear",
 		   u[I_LCOEF_RBPS], u[I_LCOEF_RSEQIOPS], u[I_LCOEF_RRANDIOPS],
 		   u[I_LCOEF_WBPS], u[I_LCOEF_WSEQIOPS], u[I_LCOEF_WRANDIOPS]);
 	spin_unlock(&ioc->lock);
@@ -3412,6 +3419,7 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
 	struct ioc *ioc;
 	u64 u[NR_I_LCOEFS];
 	bool user;
+	bool linear_max;
 	char *body, *p;
 	int ret;
 
@@ -3442,6 +3450,7 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
 	spin_lock_irq(&ioc->lock);
 	memcpy(u, ioc->params.i_lcoefs, sizeof(u));
 	user = ioc->user_cost_model;
+	linear_max = ioc->cost_model_linear_max;
 
 	while ((p = strsep(&body, " \t\n"))) {
 		substring_t args[MAX_OPT_ARGS];
@@ -3464,7 +3473,11 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
 			continue;
 		case COST_MODEL:
 			match_strlcpy(buf, &args[0], sizeof(buf));
-			if (strcmp(buf, "linear"))
+			if (!strcmp(buf, "linear"))
+				linear_max = false;
+			else if (!strcmp(buf, "linear-max"))
+				linear_max = true;
+			else
 				goto einval;
 			continue;
 		}
@@ -3481,8 +3494,10 @@ static ssize_t ioc_cost_model_write(struct kernfs_open_file *of, char *input,
 	if (user) {
 		memcpy(ioc->params.i_lcoefs, u, sizeof(u));
 		ioc->user_cost_model = true;
+		ioc->cost_model_linear_max = linear_max;
 	} else {
 		ioc->user_cost_model = false;
+		ioc->cost_model_linear_max = false;
 	}
 	ioc_refresh_params(ioc, true);
 	spin_unlock_irq(&ioc->lock);
-- 
2.52.0