A WARN_ON_ONCE(!buffer_uptodate(bh)) in mark_buffer_dirty() is reachable
from the buffered write path on a block device when the underlying
device returns I/O errors at high density.  Reproduced by fuzzing an
NVMe controller (FEMU) that returns crafted error completions for a
sustained workload from /dev/nvme0n1.

The race is:

  CPU A: block_commit_write (folio lock held)        CPU B: end_buffer_async_read
    set_buffer_uptodate(bh);
                                                       clear_buffer_uptodate(bh);
    mark_buffer_dirty(bh);  /* WARN fires */

The contract documented at set_buffer_uptodate() in
include/linux/buffer_head.h:140 already states:

  "Any other serialization (with IO errors or whatever that might
   clear the bit) has to come from other state (eg BH_Lock)."

block_commit_write() and the buffer_new() branch in
__block_write_begin_int() violate this contract: they hold the folio
lock but not BH_Lock when calling set_buffer_uptodate() immediately
followed by mark_buffer_dirty().  Take BH_Lock around the pair so the
documented serialization holds.

The race is the same family as 558d6450c775 ("ext4: fix
WARN_ON_ONCE(!buffer_uptodate) after an error writing the
superblock"), which addressed the ext4 superblock-specific case via
state recovery.  No equivalent recovery hook exists in the generic
block_commit_write() path, so apply BH_Lock instead.

WARN stack:

  RIP: mark_buffer_dirty+0x4c2/0x560 fs/buffer.c:1183
  Call Trace:
    block_commit_write          fs/buffer.c
    block_write_end             fs/buffer.c
    iomap_write_end             fs/iomap/buffered-io.c
    iomap_file_buffered_write
    blkdev_buffered_write       block/fops.c
    blkdev_write_iter
    vfs_write
    __x64_sys_write

Found by FuzzNvme(Syzkaller with FEMU fuzzing framework).

Acked-by: Sungwoo Kim <iam@sung-woo.kim>
Acked-by: Dave Tian <daveti@purdue.edu>
Acked-by: Weidong Zhu <weizhu@fiu.edu>
Signed-off-by: Chao Shi <coshi036@gmail.com>
---

Notes for reviewers (RFC):

1. lock_buffer() in the buffered-write end path: in the steady state
   the bh should be unlocked when block_commit_write() runs
   (block_write_begin already waited for any RMW read), so contention
   should be rare.  fio numbers TBD; happy to defer until measured if
   that is the bar.

2. Several other call sites have the same set_buffer_uptodate(bh)
   immediately followed by mark_buffer_dirty(bh) pattern without
   BH_Lock:

     fs/nilfs2/mdt.c:60
     fs/ocfs2/alloc.c:6840
     fs/ocfs2/aops.c:655
     fs/exfat/fatent.c:408
     fs/exfat/misc.c:168
     fs/ufs/ialloc.c:146
     fs/ufs/inode.c:1076, 1088
     fs/ufs/balloc.c:324
     fs/jfs/super.c:770
     fs/ext2/super.c:1594
     fs/ntfs3/fsntfs.c:1096, 1491
     fs/ntfs3/bitmap.c:755, 796, 1387

   Most look like one-shot init / metadata paths where concurrent IO
   completion on the same bh is unlikely, but I have not audited each.
   Per-fs follow-ups by respective maintainers, or one tree-wide
   series?

3. Reproducer is currently fuzzer-only (FEMU + syz-executor).  A
   minimal C reproducer using dm-flakey for read-error injection is
   in progress.

 fs/buffer.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 4d7f84e77d2..bc4fad93392 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2041,9 +2041,16 @@ int __block_write_begin_int(struct folio *folio, loff_t pos, unsigned len,
 			if (buffer_new(bh)) {
 				clean_bdev_bh_alias(bh);
 				if (folio_test_uptodate(folio)) {
+					/*
+					 * See block_commit_write() for why we
+					 * must hold BH_Lock around set_uptodate
+					 * + mark_dirty.
+					 */
+					lock_buffer(bh);
 					clear_buffer_new(bh);
 					set_buffer_uptodate(bh);
 					mark_buffer_dirty(bh);
+					unlock_buffer(bh);
 					continue;
 				}
 				if (block_end > to || block_start < from)
@@ -2104,8 +2111,20 @@ void block_commit_write(struct folio *folio, size_t from, size_t to)
 			if (!buffer_uptodate(bh))
 				partial = true;
 		} else {
+			/*
+			 * Per the contract documented at set_buffer_uptodate()
+			 * (include/linux/buffer_head.h), callers must hold
+			 * BH_Lock to serialize against concurrent clears of
+			 * BH_Uptodate.  Holding only the folio lock is not
+			 * sufficient: a concurrent end_buffer_async_read() on
+			 * a previously failed read can clear BH_Uptodate
+			 * between set_buffer_uptodate() and mark_buffer_dirty(),
+			 * tripping the WARN_ON_ONCE in mark_buffer_dirty().
+			 */
+			lock_buffer(bh);
 			set_buffer_uptodate(bh);
 			mark_buffer_dirty(bh);
+			unlock_buffer(bh);
 		}
 		if (buffer_new(bh))
 			clear_buffer_new(bh);
-- 
2.43.0