ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock.  It calls ext4_map_blocks() once and returns false
if the mapped length is shorter than the requested length.

ext4_map_blocks() maps at most one extent per call.  When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length.  ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.

On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition.  The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all
other shared acquirers.  This serialises all writers to queue-depth 1
and throughput collapses.

Fix by looping ext4_map_blocks() over the remaining range.  As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.
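
The effect of the loop can be modelled in userspace.  The sketch below
is not kernel code: struct toy_extent, toy_map_blocks() and
toy_overwrite_io() are hypothetical stand-ins that only mimic the
"at most one extent per call" behaviour described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one extent: blocks [start, start + len). */
struct toy_extent {
	unsigned int start;
	unsigned int len;
	bool unwritten;
};

/*
 * Toy lookup that maps at most one extent per call.  Returns the number
 * of blocks mapped starting at lblk (capped at want), or 0 for a hole.
 */
static int toy_map_blocks(const struct toy_extent *ext, size_t n,
			  unsigned int lblk, unsigned int want,
			  bool *unwritten)
{
	for (size_t i = 0; i < n; i++) {
		if (lblk >= ext[i].start && lblk < ext[i].start + ext[i].len) {
			unsigned int avail = ext[i].start + ext[i].len - lblk;

			*unwritten = ext[i].unwritten;
			return (int)(avail < want ? avail : want);
		}
	}
	return 0;
}

/* Looping variant: true only if every block in the range is allocated. */
static bool toy_overwrite_io(const struct toy_extent *ext, size_t n,
			     unsigned int lblk, unsigned int blklen,
			     bool *unwritten)
{
	*unwritten = false;
	while (blklen > 0) {
		bool uw = false;
		int ret = toy_map_blocks(ext, n, lblk, blklen, &uw);

		if (ret <= 0)
			return false;		/* hole: allocation needed */
		if (uw)
			*unwritten = true;	/* OR across extents */
		blklen -= (unsigned int)ret;
		lblk += (unsigned int)ret;
	}
	return true;
}
```

With a written extent at blocks 0-3 and an unwritten extent at blocks
4-7, a request for blocks 2-5 gets only 2 blocks back from a single
toy_map_blocks() call, while toy_overwrite_io() walks both extents and
reports the range as fully allocated.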

The *unwritten output now uses OR semantics across extents: it is set
if any block in the range is unwritten.  This is correct for both
places that consume it:

 - (unaligned_io && unwritten) takes the exclusive lock, which is
   needed if any block requires partial-block zeroing.
 - (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
   which skips journal transactions and is only safe when every block
   is written/mapped.
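
The two conditions above can be written as plain predicates.  This is
a simplified sketch: toy_needs_exclusive(), toy_pick_ops() and enum
toy_ops are hypothetical names, and the real checks in fs/ext4/file.c
include further conditions that are not modelled here.

```c
#include <stdbool.h>

/* Hypothetical labels for the two iomap ops choices. */
enum toy_ops { TOY_OPS_DEFAULT, TOY_OPS_OVERWRITE };

/* Unaligned I/O touching any unwritten block needs the exclusive lock. */
static bool toy_needs_exclusive(bool unaligned_io, bool unwritten)
{
	return unaligned_io && unwritten;
}

/* Overwrite ops only under the shared lock with no unwritten block. */
static enum toy_ops toy_pick_ops(bool ilock_shared, bool unwritten)
{
	return (ilock_shared && !unwritten) ? TOY_OPS_OVERWRITE
					    : TOY_OPS_DEFAULT;
}
```

Because a single unwritten block anywhere in the range sets the flag,
both predicates err on the conservative side for a range that mixes
written and unwritten extents.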

The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.

Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes.  Thread 2 reads from
the tail of the file in reads of up to 1 MB.  Both threads share one
fd, and the file is preallocated with posix_fallocate().

Tested on ext4 over NVMe, 6.6-based kernel:

                              before          after
  writer-only throughput:     399 MB/s        412 MB/s
  mixed (writer + reader):     11 MB/s        381 MB/s
  write latency (mixed):      880 us           21 us
  rwsem_down_write_slowpath
   (5 s sample, mixed):       1792              2

Signed-off-by: Peng Wang <peng_wang@linux.alibaba.com>
---
 fs/ext4/file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
 	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
 	blklen = map.m_len;
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+	*unwritten = false;
+
+	while (blklen > 0) {
+		map.m_len = blklen;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * err <= 0 means a hole or error; the write needs block
+		 * allocation so it cannot be treated as an overwrite.
+		 */
+		if (err <= 0)
+			return false;
+		if (!(map.m_flags & EXT4_MAP_MAPPED))
+			*unwritten = true;
+		blklen -= err;
+		map.m_lblk += err;
+	}
 	return true;
 }
 
-- 
2.43.0