ext4_convert_inline_data() has two unsafe lockless fast paths that can observe a transient state from a concurrent convert_inline_data_nolock() that has destroyed inline data but then fails and restores it. A racing thread sees the transient state, concludes no conversion is needed, and creates dirty pages via block_page_mkwrite(). The restoring thread then re-sets inline data + MAY_INLINE_DATA, leaving dirty pages coexisting with inline data, triggering BUG_ON in ext4_do_writepages(). A similar race exists with ext4_da_convert_inline_data_to_extent(): after it copies data to page cache and clears MAY_INLINE_DATA, a concurrent convert_inline_data_nolock() can restore MAY_INLINE_DATA. Simply moving the MAY_INLINE_DATA check inside xattr_sem would require taking the write lock, starting a journal handle, and reading the inode location on every call -- even for inodes that have long since been converted. This causes measurable performance regression on filesystems with the inline_data feature enabled. Fix this by introducing EXT4_STATE_INLINE_CONVERTED, a monotonic (set once, never cleared) per-inode state bit indicating that inline data has been successfully copied out to page cache or a data block. This provides: 1. A safe lockless fast path in ext4_convert_inline_data(): once the bit is set, return 0 immediately without any locking overhead. 2. A WARN_ON_ONCE guard in convert_inline_data_nolock(): if the bit is unexpectedly set at restore time, warn. This should not happen given the MAY_INLINE_DATA check prevents entering nolock when DA conversion is in progress. 3. Inside xattr_sem, only call convert_inline_data_nolock() when both has_inline_data AND MAY_INLINE_DATA are set. When has_inline_data is set but MAY_INLINE_DATA is clear, DA conversion is in progress; skip convert_nolock to avoid destroy+restore re-setting MAY. The bit is set in all successful conversion paths: - ext4_convert_inline_data_nolock() on success - ext4_convert_inline_data() when inline data is gone after nolock - ext4_da_convert_inline_data_to_extent() after page cache copy - ext4_convert_inline_data_to_extent() after block_commit_write Fixes: 7b4cc9787fe3 ("ext4: evict inline data when writing to memory map") Reported-by: syzbot+d1da16f03614058fdc48@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d1da16f03614058fdc48 Signed-off-by: Yun Zhou --- v3: - Introduce EXT4_STATE_INLINE_CONVERTED monotonic bit for safe lockless fast path, avoiding performance regression from always taking xattr_sem. - Add WARN_ON_ONCE at restore point as sanity check. - Skip convert_nolock when MAY_INLINE_DATA is clear (DA convert in progress) to prevent destroy+restore re-setting MAY. v2: - Reworked fix: root cause confirmed via ftrace. - Removed both unsafe lockless fast paths, serialize under xattr_sem. - Skip convert_nolock when MAY_INLINE_DATA is clear (DA convert in progress) to prevent destroy+restore from re-setting MAY. fs/ext4/ext4.h | 1 + fs/ext4/inline.c | 44 ++++++++++++++++++++++++++------------------ 2 files changed, 27 insertions(+), 18 deletions(-) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index b9b0ada7774b..310aeb14f9f0 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2049,6 +2049,7 @@ enum { EXT4_STATE_FC_FLUSHING_DATA, /* Fast commit flushing data */ EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */ EXT4_STATE_FC_REQUEUE, /* Inode modified during fast commit */ + EXT4_STATE_INLINE_CONVERTED, /* inline data copied out, do not restore */ }; #define EXT4_INODE_BIT_FNS(name, field, offset) \ diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c index 8045e4ff270c..1b6abacf12e6 100644 --- a/fs/ext4/inline.c +++ b/fs/ext4/inline.c @@ -671,6 +671,8 @@ static int ext4_convert_inline_data_to_extent(struct address_space *mapping, if (folio) block_commit_write(folio, from, to); + if (folio && !ret) + ext4_set_inode_state(inode, EXT4_STATE_INLINE_CONVERTED); out: if (folio) { folio_unlock(folio); @@ -921,6 +923,7 @@ static int ext4_da_convert_inline_data_to_extent(struct address_space *mapping, clear_buffer_new(folio_buffers(folio)); folio_mark_dirty(folio); folio_mark_uptodate(folio); + ext4_set_inode_state(inode, EXT4_STATE_INLINE_CONVERTED); ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); *fsdata = (void *)CONVERT_INLINE_DATA; @@ -1172,8 +1175,14 @@ static int ext4_convert_inline_data_nolock(handle_t *handle, } out_restore: - if (error) - ext4_restore_inline_data(handle, inode, iloc, buf, inline_size); + if (error) { + WARN_ON_ONCE(ext4_test_inode_state(inode, + EXT4_STATE_INLINE_CONVERTED)); + ext4_restore_inline_data(handle, inode, iloc, buf, + inline_size); + } else { + ext4_set_inode_state(inode, EXT4_STATE_INLINE_CONVERTED); + } out: brelse(data_bh); @@ -1959,22 +1968,16 @@ int ext4_convert_inline_data(struct inode *inode) handle_t *handle; struct ext4_iloc iloc; - if (!ext4_has_inline_data(inode)) { - ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); + if (!ext4_has_feature_inline_data(inode->i_sb)) + return 0; + + /* + * Once inline data has been successfully copied out (to page + * cache or a data block), this bit is set and never cleared. + * It is safe to check without locks -- the bit is monotonic. + */ + if (ext4_test_inode_state(inode, EXT4_STATE_INLINE_CONVERTED)) return 0; - } else if (!ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) { - /* - * Inode has inline data but EXT4_STATE_MAY_INLINE_DATA is - * cleared. This means we are in the middle of moving of - * inline data to delay allocated block. Just force writeout - * here to finish conversion. - */ - error = filemap_flush(inode->i_mapping); - if (error) - return error; - if (!ext4_has_inline_data(inode)) - return 0; - } needed_blocks = ext4_chunk_trans_extent(inode, 1); @@ -1990,8 +1993,13 @@ int ext4_convert_inline_data(struct inode *inode) } ext4_write_lock_xattr(inode, &no_expand); - if (ext4_has_inline_data(inode)) + if (ext4_has_inline_data(inode) && + ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA)) error = ext4_convert_inline_data_nolock(handle, inode, &iloc); + if (!ext4_has_inline_data(inode)) { + ext4_set_inode_state(inode, EXT4_STATE_INLINE_CONVERTED); + ext4_clear_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA); + } ext4_write_unlock_xattr(inode, &no_expand); ext4_journal_stop(handle); out_free: -- 2.43.0