A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable from the buffered write path on a block device when the underlying device returns I/O errors at high density. Reproduced by fuzzing an NVMe controller (FEMU) that returns crafted error completions for a sustained workload from /dev/nvme0n1. The race is: CPU A: block_commit_write (folio lock held) CPU B: end_buffer_async_read set_buffer_uptodate(bh); clear_buffer_uptodate(bh); mark_buffer_dirty(bh); /* WARN fires */ The contract documented at set_buffer_uptodate() in include/linux/buffer_head.h:140 already states: "Any other serialization (with IO errors or whatever that might clear the bit) has to come from other state (eg BH_Lock)." block_commit_write() and the buffer_new() branch in __block_write_begin_int() violate this contract: they hold the folio lock but not BH_Lock when calling set_buffer_uptodate() immediately followed by mark_buffer_dirty(). Take BH_Lock around the pair so the documented serialization holds. The race is the same family as 558d6450c775 ("ext4: fix WARN_ON_ONCE(!buffer_uptodate) after an error writing the superblock"), which addressed the ext4 superblock-specific case via state recovery. No equivalent recovery hook exists in the generic block_commit_write() path, so apply BH_Lock instead. WARN stack: RIP: mark_buffer_dirty+0x4c2/0x560 fs/buffer.c:1183 Call Trace: block_commit_write fs/buffer.c block_write_end fs/buffer.c iomap_write_end fs/iomap/buffered-io.c iomap_file_buffered_write blkdev_buffered_write block/fops.c blkdev_write_iter vfs_write __x64_sys_write Found by FuzzNvme(Syzkaller with FEMU fuzzing framework). Acked-by: Sungwoo Kim Acked-by: Dave Tian Acked-by: Weidong Zhu Signed-off-by: Chao Shi --- Notes for reviewers (RFC): 1. lock_buffer() in the buffered-write end path: in the steady state the bh should be unlocked when block_commit_write() runs (block_write_begin already waited for any RMW read), so contention should be rare. fio numbers TBD; happy to defer until measured if that is the bar. 2. Several other call sites have the same set_buffer_uptodate(bh) immediately followed by mark_buffer_dirty(bh) pattern without BH_Lock: fs/nilfs2/mdt.c:60 fs/ocfs2/alloc.c:6840 fs/ocfs2/aops.c:655 fs/exfat/fatent.c:408 fs/exfat/misc.c:168 fs/ufs/ialloc.c:146 fs/ufs/inode.c:1076, 1088 fs/ufs/balloc.c:324 fs/jfs/super.c:770 fs/ext2/super.c:1594 fs/ntfs3/fsntfs.c:1096, 1491 fs/ntfs3/bitmap.c:755, 796, 1387 Most look like one-shot init / metadata paths where concurrent IO completion on the same bh is unlikely, but I have not audited each. Per-fs follow-ups by respective maintainers, or one tree-wide series? 3. Reproducer is currently fuzzer-only (FEMU + syz-executor). A minimal C reproducer using dm-flakey for read-error injection is in progress. fs/buffer.c | 19 +++++++++++++++++++ 1 file changed, 19 insertions(+) diff --git a/fs/buffer.c b/fs/buffer.c index 4d7f84e77d2..bc4fad93392 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -2041,9 +2041,16 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len, if (buffer_new(bh)) { clean_bdev_bh_alias(bh); if (folio_test_uptodate(folio)) { + /* + * See block_commit_write() for why we + * must hold BH_Lock around set_uptodate + * + mark_dirty. + */ + lock_buffer(bh); clear_buffer_new(bh); set_buffer_uptodate(bh); mark_buffer_dirty(bh); + unlock_buffer(bh); continue; } if (block_end > to || block_start < from) @@ -2104,8 +2111,20 @@ void block_commit_write(struct folio *folio, size_t from, size_t to) if (!buffer_uptodate(bh)) partial = true; } else { + /* + * Per the contract documented at set_buffer_uptodate() + * (include/linux/buffer_head.h), callers must hold + * BH_Lock to serialize against concurrent clears of + * BH_Uptodate. Holding only the folio lock is not + * sufficient: a concurrent end_buffer_async_read() on + * a previously failed read can clear BH_Uptodate + * between set_buffer_uptodate() and mark_buffer_dirty(), + * tripping the WARN_ON_ONCE in mark_buffer_dirty(). + */ + lock_buffer(bh); set_buffer_uptodate(bh); mark_buffer_dirty(bh); + unlock_buffer(bh); } if (buffer_new(bh)) clear_buffer_new(bh); -- 2.43.0