ext4_overwrite_io() decides whether a direct I/O write is an overwrite
(all target blocks already allocated) so the write can proceed under a
shared inode lock. It calls ext4_map_blocks() once and returns false if
the mapped length is shorter than the requested length.

ext4_map_blocks() maps at most one extent per call. When a write
straddles two extents (e.g. a written extent and an adjacent unwritten
extent created by fallocate), the single call returns only the first
extent's length. ext4_overwrite_io() then mis-classifies the write as
non-overwrite and forces the caller to cycle i_rwsem from shared to
exclusive.

On workloads where a DIO writer appends through a fallocated region
while a DIO reader tails the same file, every write that crosses a
written/unwritten extent boundary triggers an exclusive lock
acquisition. The writer must wait for the reader's shared lock to be
released, and while waiting the RWSEM_FLAG_WAITERS bit blocks all other
shared acquirers. This serialises all writers to queue-depth 1 and
throughput collapses.

Fix by looping ext4_map_blocks() over the remaining range. As long as
every queried extent reports allocated blocks (written or unwritten),
the function returns true and the write keeps the shared lock.

The *unwritten output now uses OR semantics across extents: set if any
block in the range is unwritten. This is correct for the two callers:

 - (unaligned_io && unwritten) takes the exclusive lock, which is
   needed if any block requires partial-block zeroing.

 - (ilock_shared && !unwritten) selects ext4_iomap_overwrite_ops,
   which skips journal transactions and is only safe when every
   block is written/mapped.

The loop adds at most one extra ext4_map_blocks() call per extent
boundary, which is negligible compared to the lock contention it
eliminates.

Reproducer: two threads doing O_DIRECT I/O on a fallocated ext4 file.
Thread 1 appends sequentially in 4-16 KB writes. Thread 2 reads from
the tail of the file in up to 1 MB reads.
Both use the same fd with the file preallocated via
posix_fallocate().

Tested on ext4 over NVMe, 6.6 based kernel:

                                                  before      after
  writer-only throughput:                         399 MB/s    412 MB/s
  mixed (writer + reader):                        11 MB/s     381 MB/s
  write latency (mixed):                          880 us      21 us
  rwsem_down_write_slowpath (5 s sample, mixed):  1792        2

Signed-off-by: Peng Wang
---
 fs/ext4/file.c | 25 ++++++++++++++++---------
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index eb1a323962b1..d060de8eddac 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -228,15 +228,22 @@ static bool ext4_overwrite_io(struct inode *inode,
 	map.m_len = EXT4_MAX_BLOCKS(len, pos, blkbits);
 	blklen = map.m_len;
 
-	err = ext4_map_blocks(NULL, inode, &map, 0);
-	if (err != blklen)
-		return false;
-	/*
-	 * 'err==len' means that all of the blocks have been preallocated,
-	 * regardless of whether they have been initialized or not. We need to
-	 * check m_flags to distinguish the unwritten extents.
-	 */
-	*unwritten = !(map.m_flags & EXT4_MAP_MAPPED);
+	*unwritten = false;
+
+	while (blklen > 0) {
+		map.m_len = blklen;
+		err = ext4_map_blocks(NULL, inode, &map, 0);
+		/*
+		 * err <= 0 means a hole or error; the write needs block
+		 * allocation so it cannot be treated as an overwrite.
+		 */
+		if (err <= 0)
+			return false;
+		if (!(map.m_flags & EXT4_MAP_MAPPED))
+			*unwritten = true;
+		blklen -= err;
+		map.m_lblk += err;
+	}
 	return true;
 }
 
-- 
2.43.0
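
For illustration only, the looping check can be modelled in user space.
The sketch below is not kernel code: struct model_extent, enum
ext_state, model_map_blocks() and model_overwrite_io() are hypothetical
names invented here, and model_map_blocks() only imitates the "at most
one extent per call" behaviour that the commit message describes for
ext4_map_blocks(). It shows how walking the range extent by extent, with
OR semantics for the unwritten flag, classifies a write that straddles a
written and an unwritten extent as an overwrite.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not ext4 structures. */
enum ext_state { EXT_HOLE, EXT_WRITTEN, EXT_UNWRITTEN };

struct model_extent {
	unsigned int start;	/* first logical block */
	unsigned int len;	/* length in blocks */
	enum ext_state state;
};

/*
 * Model of a lookup that maps at most one extent per call. Returns the
 * number of blocks mapped starting at lblk (capped at want), or 0 if
 * lblk falls in a hole or is not covered by any extent.
 */
static int model_map_blocks(const struct model_extent *ext, size_t n,
			    unsigned int lblk, unsigned int want,
			    enum ext_state *state)
{
	for (size_t i = 0; i < n; i++) {
		if (lblk < ext[i].start || lblk - ext[i].start >= ext[i].len)
			continue;
		if (ext[i].state == EXT_HOLE)
			return 0;
		unsigned int avail = ext[i].len - (lblk - ext[i].start);
		*state = ext[i].state;
		return (int)(avail < want ? avail : want);
	}
	return 0;
}

/*
 * Looping check: true only if every block in [lblk, lblk + blklen) is
 * allocated. *unwritten is set if any block in the range is unwritten.
 */
static bool model_overwrite_io(const struct model_extent *ext, size_t n,
			       unsigned int lblk, unsigned int blklen,
			       bool *unwritten)
{
	*unwritten = false;
	while (blklen > 0) {
		enum ext_state state = EXT_HOLE;
		int ret = model_map_blocks(ext, n, lblk, blklen, &state);

		/* A hole means the write needs allocation: not an overwrite. */
		if (ret <= 0)
			return false;
		assert((unsigned int)ret <= blklen);
		if (state == EXT_UNWRITTEN)
			*unwritten = true;
		blklen -= (unsigned int)ret;
		lblk += (unsigned int)ret;
	}
	return true;
}
```

With a table holding a written extent at blocks 0-3, an unwritten extent
at blocks 4-7 and a hole at blocks 8-11, a 4-block query starting at
block 2 returns true with *unwritten set, while a 4-block query starting
at block 6 returns false because it runs into the hole.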