When multiple threads issue I/O requests concurrently after a period of
disk idle time, iostat can report abnormal %util spikes (100%+) even
when the actual I/O load is extremely light.

This issue can be reproduced using fio. Bind 8 fio threads to different
CPUs and have each of them issue a 4KB I/O every second:

  fio --name=test --ioengine=sync --rw=randwrite --direct=1 --bs=4k \
      --numjobs=8 --cpus_allowed=0-7 --cpus_allowed_policy=split \
      --thinktime=1s --time_based --runtime=60 --group_reporting \
      --filename=/mnt/sdb/test

The output of iostat -d sdb 1 then shows a false %util of up to 100%+
at random intervals:

  Device  ...    w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  ...  aqu-sz   %util
  sdb     ...  16.00  104.00    0.00   0.00     1.25      6.50  ...    0.02    0.90
  Device  ...    w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  ...  aqu-sz   %util
  sdb     ...   8.00   32.00    0.00   0.00     1.38      4.00  ...    0.01  100.30
  Device  ...    w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  ...  aqu-sz   %util
  sdb     ...   8.00   32.00    0.00   0.00     1.38      4.00  ...    0.01    0.20
  Device  ...    w/s   wkB/s  wrqm/s  %wrqm  w_await  wareq-sz  ...  aqu-sz   %util
  sdb     ...  11.00   44.00    0.00   0.00     1.27      4.00  ...    0.01   82.80

The root cause is a race condition in update_io_ticks(). When the disk
has been idle for a while (e.g., 1 second), part->bd_stamp holds an old
timestamp. If CPU A and CPU B start I/O at the exact same time:

1. Both CPUs read the same old 'stamp' and pass the time_after() check.
2. CPU A executes try_cmpxchg() successfully.
3. CPU B fails try_cmpxchg(), exits update_io_ticks(), and immediately
   increments its local in_flight counter via part_stat_local_inc().
4. CPU A continues to evaluate the 'busy' condition:
   end || bdev_count_inflight(part).
5. Since it is an I/O start, 'end' is false, so CPU A calls
   bdev_count_inflight() to check.
6. However, bdev_count_inflight() iterates over all CPUs and sees
   CPU B's newly incremented in_flight count. It returns true.
7. CPU A incorrectly assumes the disk was busy during the entire
   'now - stamp' window (the 1-second idle period) and adds this large
   delta to io_ticks.

To fix this, we capture the 'busy' state before performing the
try_cmpxchg(). By taking a snapshot of whether the device is active
prior to updating bd_stamp, we prevent CPU A from being misled by
concurrent I/O submissions from other CPUs that occur after the
timestamp comparison but before the inflight check.

Fixes: 99dc422335d8 ("block: support to account io_ticks precisely")
Signed-off-by: Jialin Wang
---
 block/blk-core.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 474700ffaa1c..1481daf1e664 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1026,10 +1026,11 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->bd_stamp);
-	if (unlikely(time_after(now, stamp)) &&
-	    likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&
-	    (end || bdev_count_inflight(part)))
-		__part_stat_add(part, io_ticks, now - stamp);
+	if (unlikely(time_after(now, stamp))) {
+		bool busy = end || bdev_count_inflight(part);
+		if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) && busy)
+			__part_stat_add(part, io_ticks, now - stamp);
+	}
 
 	if (bdev_is_partition(part)) {
 		part = bdev_whole(part);
-- 
2.52.0
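
For illustration only, the interleaving described in the commit message can be replayed in a single-threaded userspace model. This is a rough sketch, not kernel code: struct model and replay_race() are hypothetical names, the per-CPU in_flight counters are collapsed into one integer, and 'end' is assumed to be false (I/O start) throughout.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the accounting state of one disk (hypothetical). */
struct model {
	unsigned long stamp;    /* plays the role of part->bd_stamp */
	unsigned long io_ticks; /* accumulated busy time, in ticks */
	unsigned int inflight;  /* sum of all per-CPU in_flight counters */
};

/*
 * Replay steps 1-7 for two CPUs that both start an I/O at tick 'now'
 * after the disk has been idle since tick 'idle_since'.
 * check_before_cmpxchg = false models the original ordering,
 * check_before_cmpxchg = true models the patched ordering.
 * Returns the io_ticks that CPU A ends up accounting.
 */
static unsigned long replay_race(unsigned long idle_since, unsigned long now,
				 bool check_before_cmpxchg)
{
	struct model m = { .stamp = idle_since, .io_ticks = 0, .inflight = 0 };
	unsigned long stamp_a = m.stamp; /* step 1: CPU A reads the old stamp */
	unsigned long stamp_b = m.stamp; /* step 1: CPU B reads the same stamp */
	bool busy = false;

	/* Patched ordering: snapshot 'busy' before publishing the new stamp. */
	if (check_before_cmpxchg)
		busy = m.inflight > 0;

	/* Step 2: CPU A wins the compare-and-exchange. */
	m.stamp = now;

	/* Step 3: CPU B's exchange fails, so it returns and bumps in_flight. */
	if (m.stamp != stamp_b)
		m.inflight++;

	/* Steps 4-6, original ordering: CPU A now observes CPU B's I/O. */
	if (!check_before_cmpxchg)
		busy = m.inflight > 0;

	/* Step 7: the whole idle window is charged only if 'busy' is set. */
	if (busy)
		m.io_ticks += now - stamp_a;

	return m.io_ticks;
}
```

In this model, replay_race(0, 1000, false) charges the full 1000-tick idle window to io_ticks, while replay_race(0, 1000, true) charges nothing, which matches the behaviour the patch aims for.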