Currently the inode's lifetime is controlled by its main refcount, i_count. We overload the eviction of the inode (the final unlink) with the deletion of the in-memory object, which leads to some hilarity when we iput() in unfortunate places. In order to make this less of a footgun, we want to separate the notion of "is this inode in use by a user" and "is this inode object currently in use", since deleting an inode is a much heavier operation than deleting the object and cleaning up its memory. To that end, introduce ->i_obj_count to the inode. This will be used to control the lifetime of the inode object itself. We will continue to use the ->i_count refcount as normal to reduce the churn of adding a new refcount to inode. Subsequent patches will expand the usage of ->i_obj_count for internal uses, and then I will separate out the tear down operations of the inode, and then finally adjust the refcount rules for ->i_count to be more consistent with all other refcounts. Signed-off-by: Josef Bacik --- fs/inode.c | 20 ++++++++++++++++++++ include/linux/fs.h | 7 +++++++ 2 files changed, 27 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index cc0f717a140d..454770393fef 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -235,6 +235,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp inode->i_flags = 0; inode->i_state = 0; atomic64_set(&inode->i_sequence, 0); + refcount_set(&inode->i_obj_count, 1); atomic_set(&inode->i_count, 1); inode->i_op = &empty_iops; inode->i_fop = &no_open_fops; @@ -831,6 +832,11 @@ static void evict(struct inode *inode) inode_wake_up_bit(inode, __I_NEW); BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); + /* + * refcount_dec_and_test must be used here to avoid the underflow + * warning.
+ */ + WARN_ON(!refcount_dec_and_test(&inode->i_obj_count)); destroy_inode(inode); } @@ -1925,6 +1931,20 @@ void iput(struct inode *inode) } EXPORT_SYMBOL(iput); +/** + * iobj_put - put a object reference on an inode + * @inode: inode to put + * + * Puts a object reference on an inode. + */ +void iobj_put(struct inode *inode) +{ + if (!inode) + return; + refcount_dec(&inode->i_obj_count); +} +EXPORT_SYMBOL(iobj_put); + #ifdef CONFIG_BLOCK /** * bmap - find a block number in a file diff --git a/include/linux/fs.h b/include/linux/fs.h index a346422f5066..9a1ce67eed33 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -755,6 +755,7 @@ struct inode { #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING) atomic_t i_readcount; /* struct files open RO */ #endif + refcount_t i_obj_count; union { const struct file_operations *i_fop; /* former ->i_op->default_file_ops */ void (*free_inode)(struct inode *); @@ -2804,6 +2805,7 @@ extern int current_umask(void); extern void ihold(struct inode * inode); extern void iput(struct inode *); +extern void iobj_put(struct inode *inode); int inode_update_timestamps(struct inode *inode, int flags); int generic_update_time(struct inode *, int); @@ -3359,6 +3361,11 @@ static inline bool is_zero_ino(ino_t ino) return (u32)ino == 0; } +static inline void iobj_get(struct inode *inode) +{ + refcount_inc(&inode->i_obj_count); +} + /* * inode->i_lock must be held */ -- 2.49.0 Adjusting i_state flags always means updating the values manually. Bring these forward into the 2020's and make a nice clean macro for defining the i_state values as an enum, providing __ variants for the cases where we need the bit position instead of the actual value, and leaving the actual NAME as the 1U << bit value. 
Signed-off-by: Josef Bacik --- include/linux/fs.h | 234 +++++++++++++++++++++++---------------------- 1 file changed, 122 insertions(+), 112 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index 9a1ce67eed33..e741dc453c2c 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -665,6 +665,127 @@ is_uncached_acl(struct posix_acl *acl) #define IOP_MGTIME 0x0020 #define IOP_CACHED_LINK 0x0040 +/* + * Inode state bits. Protected by inode->i_lock + * + * Four bits determine the dirty state of the inode: I_DIRTY_SYNC, + * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME. + * + * Four bits define the lifetime of an inode. Initially, inodes are I_NEW, + * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at + * various stages of removing an inode. + * + * Two bits are used for locking and completion notification, I_NEW and I_SYNC. + * + * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on + * fdatasync() (unless I_DIRTY_DATASYNC is also set). + * Timestamp updates are the usual cause. + * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of + * these changes separately from I_DIRTY_SYNC so that we + * don't have to write inode on fdatasync() when only + * e.g. the timestamps have changed. + * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean. + * I_DIRTY_TIME The inode itself has dirty timestamps, and the + * lazytime mount option is enabled. We keep track of this + * separately from I_DIRTY_SYNC in order to implement + * lazytime. This gets cleared if I_DIRTY_INODE + * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But + * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already + * in place because writeback might already be in progress + * and we don't want to lose the time update + * I_NEW Serves as both a mutex and completion notification. + * New inodes set I_NEW. 
If two processes both create + * the same inode, one of them will release its inode and + * wait for I_NEW to be released before returning. + * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can + * also cause waiting on I_NEW, without I_NEW actually + * being set. find_inode() uses this to prevent returning + * nearly-dead inodes. + * I_WILL_FREE Must be set when calling write_inode_now() if i_count + * is zero. I_FREEING must be set when I_WILL_FREE is + * cleared. + * I_FREEING Set when inode is about to be freed but still has dirty + * pages or buffers attached or the inode itself is still + * dirty. + * I_CLEAR Added by clear_inode(). In this state the inode is + * clean and can be destroyed. Inode keeps I_FREEING. + * + * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are + * prohibited for many purposes. iget() must wait for + * the inode to be completely released, then create it + * anew. Other functions will just ignore such inodes, + * if appropriate. I_NEW is used for waiting. + * + * I_SYNC Writeback of inode is running. The bit is set during + * data writeback, and cleared with a wakeup on the bit + * address once it is done. The bit is also used to pin + * the inode in memory for flusher thread. + * + * I_REFERENCED Marks the inode as recently references on the LRU list. + * + * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to + * synchronize competing switching instances and to tell + * wb stat updates to grab the i_pages lock. See + * inode_switch_wbs_work_fn() for details. + * + * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper + * and work dirs among overlayfs mounts. + * + * I_CREATING New object's inode in the middle of setting up. + * + * I_DONTCACHE Evict inode as soon as it is not used anymore. + * + * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists. + * Used to detect that mark_inode_dirty() should not move + * inode between dirty lists. 
+ * + * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback. + * + * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding + * i_count. + * + * Q: What is the difference between I_WILL_FREE and I_FREEING? + * + * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait + * upon. There's one free address left. + */ + +/* + * As simple macro to define the inode state bits, __NAME will be the bit value + * (0, 1, 2, ...), and NAME will be the bit mask (1U << __NAME). The __NAME_SEQ + * is used to reset the sequence number so the next name gets the next bit value + * in the sequence. + */ +#define INODE_BIT(name) \ + __ ## name, \ + name = (1U << __ ## name), \ + __ ## name ## _SEQ = __ ## name + +enum inode_state_bits { + INODE_BIT(I_NEW), + INODE_BIT(I_SYNC), + INODE_BIT(I_LRU_ISOLATING), + INODE_BIT(I_DIRTY_SYNC), + INODE_BIT(I_DIRTY_DATASYNC), + INODE_BIT(I_DIRTY_PAGES), + INODE_BIT(I_WILL_FREE), + INODE_BIT(I_FREEING), + INODE_BIT(I_CLEAR), + INODE_BIT(I_REFERENCED), + INODE_BIT(I_LINKABLE), + INODE_BIT(I_DIRTY_TIME), + INODE_BIT(I_WB_SWITCH), + INODE_BIT(I_OVL_INUSE), + INODE_BIT(I_CREATING), + INODE_BIT(I_DONTCACHE), + INODE_BIT(I_SYNC_QUEUED), + INODE_BIT(I_PINNING_NETFS_WB), +}; + +#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) +#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) +#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME) + /* * Keep mostly read-only and often accessed (especially for * the RCU path lookup and 'stat' data) fields at the beginning @@ -723,7 +844,7 @@ struct inode { #endif /* Misc */ - u32 i_state; + enum inode_state_bits i_state; /* 32-bit hole */ struct rw_semaphore i_rwsem; @@ -2484,117 +2605,6 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src, }; } -/* - * Inode state bits. Protected by inode->i_lock - * - * Four bits determine the dirty state of the inode: I_DIRTY_SYNC, - * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME. 
- * - * Four bits define the lifetime of an inode. Initially, inodes are I_NEW, - * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at - * various stages of removing an inode. - * - * Two bits are used for locking and completion notification, I_NEW and I_SYNC. - * - * I_DIRTY_SYNC Inode is dirty, but doesn't have to be written on - * fdatasync() (unless I_DIRTY_DATASYNC is also set). - * Timestamp updates are the usual cause. - * I_DIRTY_DATASYNC Data-related inode changes pending. We keep track of - * these changes separately from I_DIRTY_SYNC so that we - * don't have to write inode on fdatasync() when only - * e.g. the timestamps have changed. - * I_DIRTY_PAGES Inode has dirty pages. Inode itself may be clean. - * I_DIRTY_TIME The inode itself has dirty timestamps, and the - * lazytime mount option is enabled. We keep track of this - * separately from I_DIRTY_SYNC in order to implement - * lazytime. This gets cleared if I_DIRTY_INODE - * (I_DIRTY_SYNC and/or I_DIRTY_DATASYNC) gets set. But - * I_DIRTY_TIME can still be set if I_DIRTY_SYNC is already - * in place because writeback might already be in progress - * and we don't want to lose the time update - * I_NEW Serves as both a mutex and completion notification. - * New inodes set I_NEW. If two processes both create - * the same inode, one of them will release its inode and - * wait for I_NEW to be released before returning. - * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can - * also cause waiting on I_NEW, without I_NEW actually - * being set. find_inode() uses this to prevent returning - * nearly-dead inodes. - * I_WILL_FREE Must be set when calling write_inode_now() if i_count - * is zero. I_FREEING must be set when I_WILL_FREE is - * cleared. - * I_FREEING Set when inode is about to be freed but still has dirty - * pages or buffers attached or the inode itself is still - * dirty. - * I_CLEAR Added by clear_inode(). In this state the inode is - * clean and can be destroyed. 
Inode keeps I_FREEING. - * - * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are - * prohibited for many purposes. iget() must wait for - * the inode to be completely released, then create it - * anew. Other functions will just ignore such inodes, - * if appropriate. I_NEW is used for waiting. - * - * I_SYNC Writeback of inode is running. The bit is set during - * data writeback, and cleared with a wakeup on the bit - * address once it is done. The bit is also used to pin - * the inode in memory for flusher thread. - * - * I_REFERENCED Marks the inode as recently references on the LRU list. - * - * I_WB_SWITCH Cgroup bdi_writeback switching in progress. Used to - * synchronize competing switching instances and to tell - * wb stat updates to grab the i_pages lock. See - * inode_switch_wbs_work_fn() for details. - * - * I_OVL_INUSE Used by overlayfs to get exclusive ownership on upper - * and work dirs among overlayfs mounts. - * - * I_CREATING New object's inode in the middle of setting up. - * - * I_DONTCACHE Evict inode as soon as it is not used anymore. - * - * I_SYNC_QUEUED Inode is queued in b_io or b_more_io writeback lists. - * Used to detect that mark_inode_dirty() should not move - * inode between dirty lists. - * - * I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback. - * - * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding - * i_count. - * - * Q: What is the difference between I_WILL_FREE and I_FREEING? - * - * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait - * upon. There's one free address left. 
- */ -#define __I_NEW 0 -#define I_NEW (1 << __I_NEW) -#define __I_SYNC 1 -#define I_SYNC (1 << __I_SYNC) -#define __I_LRU_ISOLATING 2 -#define I_LRU_ISOLATING (1 << __I_LRU_ISOLATING) - -#define I_DIRTY_SYNC (1 << 3) -#define I_DIRTY_DATASYNC (1 << 4) -#define I_DIRTY_PAGES (1 << 5) -#define I_WILL_FREE (1 << 6) -#define I_FREEING (1 << 7) -#define I_CLEAR (1 << 8) -#define I_REFERENCED (1 << 9) -#define I_LINKABLE (1 << 10) -#define I_DIRTY_TIME (1 << 11) -#define I_WB_SWITCH (1 << 12) -#define I_OVL_INUSE (1 << 13) -#define I_CREATING (1 << 14) -#define I_DONTCACHE (1 << 15) -#define I_SYNC_QUEUED (1 << 16) -#define I_PINNING_NETFS_WB (1 << 17) - -#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) -#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES) -#define I_DIRTY_ALL (I_DIRTY | I_DIRTY_TIME) - extern void __mark_inode_dirty(struct inode *, int); static inline void mark_inode_dirty(struct inode *inode) { -- 2.49.0 In wait_sb_inodes we need to hold a reference for the inode while we're waiting on writeback to complete, hold a reference on the inode object during this operation. Signed-off-by: Josef Bacik --- fs/fs-writeback.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index a6cc3d305b84..001773e6e95c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -2704,6 +2704,7 @@ static void wait_sb_inodes(struct super_block *sb) continue; } __iget(inode); + iobj_get(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); @@ -2717,6 +2718,7 @@ static void wait_sb_inodes(struct super_block *sb) cond_resched(); iput(inode); + iobj_put(inode); rcu_read_lock(); spin_lock_irq(&sb->s_inode_wblist_lock); -- 2.49.0 If we're holding the inode on one of the writeback lists we need to have a reference on that inode. Grab a reference when we add i_wb_list to something, drop it when it's removed. This is potentially dangerous, because we remove the inode from the i_wb_list potentially under IRQ via folio_end_writeback(). 
This will be mitigated by making sure all writeback is completed on the final iput, before the final iobj_put, preventing a potential free under IRQ. Signed-off-by: Josef Bacik --- fs/fs-writeback.c | 14 ++++++++++++++ 1 file changed, 14 insertions(+) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 001773e6e95c..c2437e3d320a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1332,6 +1332,7 @@ void sb_mark_inode_writeback(struct inode *inode) if (list_empty(&inode->i_wb_list)) { spin_lock_irqsave(&sb->s_inode_wblist_lock, flags); if (list_empty(&inode->i_wb_list)) { + iobj_get(inode); list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb); trace_sb_mark_inode_writeback(inode); } @@ -1346,15 +1347,26 @@ void sb_clear_inode_writeback(struct inode *inode) { struct super_block *sb = inode->i_sb; unsigned long flags; + bool drop = false; if (!list_empty(&inode->i_wb_list)) { spin_lock_irqsave(&sb->s_inode_wblist_lock, flags); if (!list_empty(&inode->i_wb_list)) { + drop = true; list_del_init(&inode->i_wb_list); trace_sb_clear_inode_writeback(inode); } spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags); } + + /* + * This can be called in IRQ context when we're clearing writeback on + * the folio. This should not be the last iobj_put() on the inode, we + * run all of the writeback before we free the inode in order to avoid + * this possibility. + */ + if (drop) + iobj_put(inode); } /* @@ -2683,6 +2695,8 @@ static void wait_sb_inodes(struct super_block *sb) * to preserve consistency between i_wb_list and the mapping * writeback tag. Writeback completion is responsible to remove * the inode from either list once the writeback tag is cleared. + * At that point the i_obj_count reference will be dropped for + * the i_wb_list reference. */ list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb); -- 2.49.0 While the inode is attached to a list with its i_io_list member we need to hold a reference on the object. 
The put is under the i_lock in some cases which could potentially be unsafe. It isn't currently because we're holding the i_obj_count throughout the entire lifetime of the inode, so it won't be the last currently. Subsequent patches will make sure we're holding an i_obj_count reference while we're manipulating this list to ensure this continues to be safe. Signed-off-by: Josef Bacik --- fs/fs-writeback.c | 25 +++++++++++++++++++++---- 1 file changed, 21 insertions(+), 4 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index c2437e3d320a..24fccb299de4 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -72,6 +72,14 @@ static inline struct inode *wb_inode(struct list_head *head) return list_entry(head, struct inode, i_io_list); } +static inline void inode_delete_from_io_list(struct inode *inode) +{ + if (!list_empty(&inode->i_io_list)) { + list_del_init(&inode->i_io_list); + iobj_put(inode); + } +} + /* * Include the creation of the trace points after defining the * wb_writeback_work structure and inline functions so that the definition @@ -123,6 +131,8 @@ static bool inode_io_list_move_locked(struct inode *inode, assert_spin_locked(&inode->i_lock); WARN_ON_ONCE(inode->i_state & I_FREEING); + if (list_empty(&inode->i_io_list)) + iobj_get(inode); list_move(&inode->i_io_list, head); /* dirty_time doesn't count as dirty_io until expiration */ @@ -310,7 +320,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode, if (wb != &wb->bdi->wb) list_move(&inode->i_io_list, &wb->b_attached); else - list_del_init(&inode->i_io_list); + inode_delete_from_io_list(inode); wb_io_lists_depopulated(wb); } @@ -1200,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode, WARN_ON_ONCE(inode->i_state & I_FREEING); inode->i_state &= ~I_SYNC_QUEUED; - list_del_init(&inode->i_io_list); + inode_delete_from_io_list(inode); wb_io_lists_depopulated(wb); } @@ -1308,16 +1318,23 @@ void wb_start_background_writeback(struct bdi_writeback *wb) void 
inode_io_list_del(struct inode *inode) { struct bdi_writeback *wb; + bool drop = false; wb = inode_to_wb_and_lock_list(inode); spin_lock(&inode->i_lock); inode->i_state &= ~I_SYNC_QUEUED; - list_del_init(&inode->i_io_list); + if (!list_empty(&inode->i_io_list)) { + drop = true; + list_del_init(&inode->i_io_list); + } wb_io_lists_depopulated(wb); spin_unlock(&inode->i_lock); spin_unlock(&wb->list_lock); + + if (drop) + iobj_put(inode); } EXPORT_SYMBOL(inode_io_list_del); @@ -1389,7 +1406,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb) * trigger assertions in inode_io_list_move_locked(). */ if (inode->i_state & I_FREEING) { - list_del_init(&inode->i_io_list); + inode_delete_from_io_list(inode); wb_io_lists_depopulated(wb); return; } -- 2.49.0 We drop the wb list_lock while writing back inodes, and we could manipulate the i_io_list while this is happening and drop our reference for the inode. Protect this by holding the i_obj_count reference during the writeback. Signed-off-by: Josef Bacik --- fs/fs-writeback.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 24fccb299de4..2b0d26a58a5a 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1977,6 +1977,7 @@ static long writeback_sb_inodes(struct super_block *sb, trace_writeback_sb_inodes_requeue(inode); continue; } + iobj_get(inode); spin_unlock(&wb->list_lock); /* @@ -1987,6 +1988,7 @@ static long writeback_sb_inodes(struct super_block *sb, if (inode->i_state & I_SYNC) { /* Wait for I_SYNC. This function drops i_lock... 
*/ inode_sleep_on_writeback(inode); + iobj_put(inode); /* Inode may be gone, start again */ spin_lock(&wb->list_lock); continue; @@ -2035,10 +2037,9 @@ static long writeback_sb_inodes(struct super_block *sb, inode_sync_complete(inode); spin_unlock(&inode->i_lock); - if (unlikely(tmp_wb != wb)) { - spin_unlock(&tmp_wb->list_lock); - spin_lock(&wb->list_lock); - } + spin_unlock(&tmp_wb->list_lock); + iobj_put(inode); + spin_lock(&wb->list_lock); /* * bail out to wb_writeback() often enough to check -- 2.49.0 While the inode is on the hashtable we need to hold a reference to the object itself. Signed-off-by: Josef Bacik --- fs/inode.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index 454770393fef..1ff46d9417de 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -667,6 +667,7 @@ void __insert_inode_hash(struct inode *inode, unsigned long hashval) spin_lock(&inode_hash_lock); spin_lock(&inode->i_lock); + iobj_get(inode); hlist_add_head_rcu(&inode->i_hash, b); spin_unlock(&inode->i_lock); spin_unlock(&inode_hash_lock); @@ -681,11 +682,16 @@ EXPORT_SYMBOL(__insert_inode_hash); */ void __remove_inode_hash(struct inode *inode) { + bool putref; + spin_lock(&inode_hash_lock); spin_lock(&inode->i_lock); + putref = !hlist_unhashed(&inode->i_hash) && !hlist_fake(&inode->i_hash); hlist_del_init_rcu(&inode->i_hash); spin_unlock(&inode->i_lock); spin_unlock(&inode_hash_lock); + if (putref) + iobj_put(inode); } EXPORT_SYMBOL(__remove_inode_hash); @@ -1314,6 +1320,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval, * caller is responsible for filling in the contents */ spin_lock(&inode->i_lock); + iobj_get(inode); inode->i_state |= I_NEW; hlist_add_head_rcu(&inode->i_hash, head); spin_unlock(&inode->i_lock); @@ -1451,6 +1458,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino) if (!old) { inode->i_ino = ino; spin_lock(&inode->i_lock); + iobj_get(inode); inode->i_state = I_NEW; 
hlist_add_head_rcu(&inode->i_hash, head); spin_unlock(&inode->i_lock); @@ -1803,6 +1811,7 @@ int insert_inode_locked(struct inode *inode) } if (likely(!old)) { spin_lock(&inode->i_lock); + iobj_get(inode); inode->i_state |= I_NEW | I_CREATING; hlist_add_head_rcu(&inode->i_hash, head); spin_unlock(&inode->i_lock); -- 2.49.0 While on the LRU list we need to make sure the object itself does not disappear, so hold an i_obj_count reference. This is a little wonky currently as we're dropping the reference before we call evict(), because currently we drop the last reference right before we free the inode. This will be fixed in a future patch when the freeing of the inode is moved under the control of the i_obj_count reference. Signed-off-by: Josef Bacik --- fs/inode.c | 20 +++++++++++++++++--- 1 file changed, 17 insertions(+), 3 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 1ff46d9417de..7e506050a0bc 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -542,10 +542,12 @@ static void __inode_add_lru(struct inode *inode, bool rotate) if (!mapping_shrinkable(&inode->i_data)) return; - if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) + if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { + iobj_get(inode); this_cpu_inc(nr_unused); - else if (rotate) + } else if (rotate) { inode->i_state |= I_REFERENCED; + } } struct wait_queue_head *inode_bit_waitqueue(struct wait_bit_queue_entry *wqe, @@ -571,8 +573,10 @@ void inode_add_lru(struct inode *inode) static void inode_lru_list_del(struct inode *inode) { - if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) + if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { + iobj_put(inode); this_cpu_dec(nr_unused); + } } static void inode_pin_lru_isolating(struct inode *inode) @@ -861,6 +865,15 @@ static void dispose_list(struct list_head *head) inode = list_first_entry(head, struct inode, i_lru); list_del_init(&inode->i_lru); + /* + * This is going right here for now only because we are + 
* currently not using the i_obj_count reference for anything, + * and it needs to hit 0 when we call evict(). + * + * This will be moved when we change the lifetime rules in a + * future patch. + */ + iobj_put(inode); evict(inode); cond_resched(); } @@ -897,6 +910,7 @@ void evict_inodes(struct super_block *sb) } inode->i_state |= I_FREEING; + iobj_get(inode); inode_lru_list_del(inode); spin_unlock(&inode->i_lock); list_add(&inode->i_lru, &dispose); -- 2.49.0 We are holding this inode on a sb list, make sure we're holding an i_obj_count reference while it exists on the list. Signed-off-by: Josef Bacik --- fs/inode.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index 7e506050a0bc..12e2e01aae0c 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -630,6 +630,7 @@ void inode_sb_list_add(struct inode *inode) { struct super_block *sb = inode->i_sb; + iobj_get(inode); spin_lock(&sb->s_inode_list_lock); list_add(&inode->i_sb_list, &sb->s_inodes); spin_unlock(&sb->s_inode_list_lock); @@ -644,6 +645,7 @@ static inline void inode_sb_list_del(struct inode *inode) spin_lock(&sb->s_inode_list_lock); list_del_init(&inode->i_sb_list); spin_unlock(&sb->s_inode_list_lock); + iobj_put(inode); } } -- 2.49.0 Instead of accessing ->i_count directly in these file systems, use the appropriate __iget and iput helpers. 
Signed-off-by: Josef Bacik --- fs/f2fs/super.c | 4 ++-- fs/gfs2/ops_fstype.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c index 1db024b20e29..2045642cfe3b 100644 --- a/fs/f2fs/super.c +++ b/fs/f2fs/super.c @@ -1750,7 +1750,7 @@ static int f2fs_drop_inode(struct inode *inode) if ((!inode_unhashed(inode) && inode->i_state & I_SYNC)) { if (!inode->i_nlink && !is_bad_inode(inode)) { /* to avoid evict_inode call simultaneously */ - atomic_inc(&inode->i_count); + __iget(inode); spin_unlock(&inode->i_lock); /* should remain fi->extent_tree for writepage */ @@ -1769,7 +1769,7 @@ static int f2fs_drop_inode(struct inode *inode) sb_end_intwrite(inode->i_sb); spin_lock(&inode->i_lock); - atomic_dec(&inode->i_count); + iput(inode); } trace_f2fs_drop_inode(inode, 0); return 0; } diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index efe99b732551..c770006f8889 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -1754,7 +1754,7 @@ static void gfs2_evict_inodes(struct super_block *sb) spin_unlock(&inode->i_lock); continue; } - atomic_inc(&inode->i_count); + __iget(inode); spin_unlock(&inode->i_lock); spin_unlock(&sb->s_inode_list_lock); -- 2.49.0 This is the start of the semantic changes of inode lifetimes. Unfortunately we have to do two things in one patch to be properly safe, but this is the only case where this happens. First we take and drop an i_obj_count reference every time we get an i_count reference. This is because we will be changing the i_count reference to be the indicator of a "live" inode. The second thing we do is move the lifetime of the memory allocation for the inode under the control of the i_obj_count reference.
Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 4 +++- fs/fs-writeback.c | 2 -- fs/inode.c | 29 +++++++++++------------------ include/linux/fs.h | 1 + 4 files changed, 15 insertions(+), 21 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 784bd48b4da9..bbbcd96e8f5c 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3418,8 +3418,10 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode) struct btrfs_fs_info *fs_info = inode->root->fs_info; unsigned long flags; - if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) + if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) { + iobj_put(&inode->vfs_inode); return; + } WARN_ON_ONCE(test_bit(BTRFS_FS_STATE_NO_DELAYED_IPUT, &fs_info->fs_state)); atomic_inc(&fs_info->nr_delayed_iputs); diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 2b0d26a58a5a..d2e1fb1a0787 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -2736,7 +2736,6 @@ static void wait_sb_inodes(struct super_block *sb) continue; } __iget(inode); - iobj_get(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); @@ -2750,7 +2749,6 @@ static void wait_sb_inodes(struct super_block *sb) cond_resched(); iput(inode); - iobj_put(inode); rcu_read_lock(); spin_lock_irq(&sb->s_inode_wblist_lock); diff --git a/fs/inode.c b/fs/inode.c index 12e2e01aae0c..16acad5583fc 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -527,6 +527,7 @@ static void init_once(void *foo) */ void ihold(struct inode *inode) { + iobj_get(inode); WARN_ON(atomic_inc_return(&inode->i_count) < 2); } EXPORT_SYMBOL(ihold); @@ -843,13 +844,6 @@ static void evict(struct inode *inode) */ inode_wake_up_bit(inode, __I_NEW); BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); - - /* - * refcount_dec_and_test must be used here to avoid the underflow - * warning. 
- */ - WARN_ON(!refcount_dec_and_test(&inode->i_obj_count)); - destroy_inode(inode); } /* @@ -867,16 +861,8 @@ static void dispose_list(struct list_head *head) inode = list_first_entry(head, struct inode, i_lru); list_del_init(&inode->i_lru); - /* - * This is going right here for now only because we are - * currently not using the i_obj_count reference for anything, - * and it needs to hit 0 when we call evict(). - * - * This will be moved when we change the lifetime rules in a - * future patch. - */ - iobj_put(inode); evict(inode); + iobj_put(inode); cond_resched(); } } @@ -1945,6 +1931,11 @@ void iput(struct inode *inode) retry: if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) { if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) { + /* + * Increment i_count directly as we still have our + * i_obj_count reference still. This is temporary and + * will go away in a future patch. + */ atomic_inc(&inode->i_count); spin_unlock(&inode->i_lock); trace_writeback_lazytime_iput(inode); @@ -1953,6 +1944,7 @@ void iput(struct inode *inode) } iput_final(inode); } + iobj_put(inode); } EXPORT_SYMBOL(iput); @@ -1960,13 +1952,14 @@ EXPORT_SYMBOL(iput); * iobj_put - put a object reference on an inode * @inode: inode to put * - * Puts a object reference on an inode. + * Puts a object reference on an inode, free's it if we get to zero. 
*/ void iobj_put(struct inode *inode) { if (!inode) return; - refcount_dec(&inode->i_obj_count); + if (refcount_dec_and_test(&inode->i_obj_count)) + destroy_inode(inode); } EXPORT_SYMBOL(iobj_put); diff --git a/include/linux/fs.h b/include/linux/fs.h index e741dc453c2c..b2048fd9c300 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3381,6 +3381,7 @@ static inline void iobj_get(struct inode *inode) */ static inline void __iget(struct inode *inode) { + iobj_get(inode); atomic_inc(&inode->i_count); } -- 2.49.0 Currently, if we are the last iput, and we have the I_DIRTY_TIME bit set, we will grab a reference on the inode again and then mark it dirty and then redo the put. This is to make sure we delay the time update for as long as possible. We can rework this logic to simply dec i_count if it is not 1, and if it is do the time update while still holding the i_count reference. Then we can replace the atomic_dec_and_lock with locking the ->i_lock and doing atomic_dec_and_test, since we did the atomic_add_unless above. This is preparation for no longer allowing 0 i_count inodes to exist. Signed-off-by: Josef Bacik --- fs/inode.c | 31 ++++++++++++++++--------------- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 16acad5583fc..814c03f5dbb1 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1928,22 +1928,23 @@ void iput(struct inode *inode) if (!inode) return; BUG_ON(inode->i_state & I_CLEAR); -retry: - if (atomic_dec_and_lock(&inode->i_count, &inode->i_lock)) { - if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) { - /* - * Increment i_count directly as we still have our - * i_obj_count reference still. This is temporary and - * will go away in a future patch. 
- */ - atomic_inc(&inode->i_count); - spin_unlock(&inode->i_lock); - trace_writeback_lazytime_iput(inode); - mark_inode_dirty_sync(inode); - goto retry; - } - iput_final(inode); + + if (atomic_add_unless(&inode->i_count, -1, 1)) { + iobj_put(inode); + return; } + + if (inode->i_nlink && (inode->i_state & I_DIRTY_TIME)) { + trace_writeback_lazytime_iput(inode); + mark_inode_dirty_sync(inode); + } + + spin_lock(&inode->i_lock); + if (atomic_dec_and_test(&inode->i_count)) + iput_final(inode); + else + spin_unlock(&inode->i_lock); + iobj_put(inode); } EXPORT_SYMBOL(iput); -- 2.49.0 We will be adding another list for the inode to keep track of inodes that are being cached for other reasons. This is necessary to make sure we know which list the inode is on, and to differentiate it from the private dispose lists. Signed-off-by: Josef Bacik --- fs/inode.c | 7 +++++++ include/linux/fs.h | 6 ++++++ include/trace/events/writeback.h | 3 ++- 3 files changed, 15 insertions(+), 1 deletion(-) diff --git a/fs/inode.c b/fs/inode.c index 814c03f5dbb1..94769b356224 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -545,6 +545,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate) if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { iobj_get(inode); + inode->i_state |= I_LRU; this_cpu_inc(nr_unused); } else if (rotate) { inode->i_state |= I_REFERENCED; @@ -574,7 +575,11 @@ void inode_add_lru(struct inode *inode) static void inode_lru_list_del(struct inode *inode) { + if (!(inode->i_state & I_LRU)) + return; + if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { + inode->i_state &= ~I_LRU; iobj_put(inode); this_cpu_dec(nr_unused); } @@ -955,6 +960,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item, (inode->i_state & ~I_REFERENCED) || !mapping_shrinkable(&inode->i_data)) { list_lru_isolate(lru, &inode->i_lru); + inode->i_state &= ~I_LRU; spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); return LRU_REMOVED; @@ -991,6 +997,7 @@ static 
enum lru_status inode_lru_isolate(struct list_head *item, WARN_ON(inode->i_state & I_NEW); inode->i_state |= I_FREEING; + inode->i_state &= ~I_LRU; list_lru_isolate_move(lru, &inode->i_lru, freeable); spin_unlock(&inode->i_lock); diff --git a/include/linux/fs.h b/include/linux/fs.h index b2048fd9c300..509e696a4df0 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -744,6 +744,11 @@ is_uncached_acl(struct posix_acl *acl) * I_LRU_ISOLATING Inode is pinned being isolated from LRU without holding * i_count. * + * I_LRU Inode is on the LRU list and has an associated LRU + * reference count. Used to distinguish inodes where + * ->i_lru is on the LRU and those that are using ->i_lru + * for some other means. + * * Q: What is the difference between I_WILL_FREE and I_FREEING? * * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait @@ -780,6 +785,7 @@ enum inode_state_bits { INODE_BIT(I_DONTCACHE), INODE_BIT(I_SYNC_QUEUED), INODE_BIT(I_PINNING_NETFS_WB), + INODE_BIT(I_LRU), }; #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 1e23919c0da9..486f85aca84d 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -28,7 +28,8 @@ {I_DONTCACHE, "I_DONTCACHE"}, \ {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \ {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \ - {I_LRU_ISOLATING, "I_LRU_ISOLATING"} \ + {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \ + {I_LRU, "I_LRU"} \ ) /* enums need to be exported to user space */ -- 2.49.0 Currently we have relied on dirty inodes and inodes with cache on them to simply be left hanging around on the system outside of an LRU. The only way to make sure these inodes are eventually reclaimed is because dirty writeback will grab a reference on the inode and then iput it when it's done, potentially getting it on the LRU. 
For the cached case, the page cache deletion path calls inode_add_lru() when the inode no longer has cached pages in order to make sure the inode object can be freed eventually. In the unmount case we walk all inodes and free them, so this all works out fine. But we want to eliminate 0 i_count objects as a concept, so we need a mechanism to hold a reference on these pinned inodes. To that end, add a list to the super block that contains any inodes that are cached for one reason or another. When we call inode_add_lru(), if the inode falls into one of these categories, we will add it to the cached inode list and hold an i_obj_count reference. If the inode does not fall into one of these categories it will be moved to the normal LRU, which already holds an i_obj_count reference. In the dirty case we delete the inode from the LRU if it is on one, and the iput after the writeout will make sure it's placed onto the correct list at that point. The page cache case migrates it via inode_add_lru() when deleting pages from the page cache. Signed-off-by: Josef Bacik --- fs/fs-writeback.c | 8 +++ fs/inode.c | 102 +++++++++++++++++++++++++++++-- fs/internal.h | 1 + fs/super.c | 3 + include/linux/fs.h | 11 ++++ include/trace/events/writeback.h | 3 +- 6 files changed, 121 insertions(+), 7 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index d2e1fb1a0787..111a9d8215bf 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -2736,6 +2736,14 @@ static void wait_sb_inodes(struct super_block *sb) continue; } __iget(inode); + + /* + * We could have potentially ended up on the cached LRU list, so + * remove ourselves from this list now that we have a reference; + * the iput will handle placing it back on the appropriate LRU + * list if necessary.
+ */ + inode_lru_list_del(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); diff --git a/fs/inode.c b/fs/inode.c index 94769b356224..adcba0a4d776 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -319,6 +319,23 @@ void free_inode_nonrcu(struct inode *inode) } EXPORT_SYMBOL(free_inode_nonrcu); +/* + * Some inodes need to stay pinned in memory because they are dirty or there are + * cached pages that the VM wants to keep around to avoid thrashing. This does + * the appropriate checks to see if we want to shield this inode from periodic + * reclaim. Must be called with ->i_lock held. + */ +static bool inode_needs_cached(struct inode *inode) +{ + lockdep_assert_held(&inode->i_lock); + + if (inode->i_state & (I_DIRTY_ALL | I_SYNC)) + return true; + if (!mapping_shrinkable(&inode->i_data)) + return true; + return false; +} + static void i_callback(struct rcu_head *head) { struct inode *inode = container_of(head, struct inode, i_rcu); @@ -532,20 +549,67 @@ void ihold(struct inode *inode) } EXPORT_SYMBOL(ihold); +static void inode_add_cached_lru(struct inode *inode) +{ + lockdep_assert_held(&inode->i_lock); + + if (inode->i_state & I_CACHED_LRU) + return; + if (!list_empty(&inode->i_lru)) + return; + + inode->i_state |= I_CACHED_LRU; + spin_lock(&inode->i_sb->s_cached_inodes_lock); + list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes); + spin_unlock(&inode->i_sb->s_cached_inodes_lock); + iobj_get(inode); +} + +static bool __inode_del_cached_lru(struct inode *inode) +{ + lockdep_assert_held(&inode->i_lock); + + if (!(inode->i_state & I_CACHED_LRU)) + return false; + + inode->i_state &= ~I_CACHED_LRU; + spin_lock(&inode->i_sb->s_cached_inodes_lock); + list_del_init(&inode->i_lru); + spin_unlock(&inode->i_sb->s_cached_inodes_lock); + return true; +} + +static bool inode_del_cached_lru(struct inode *inode) +{ + if (__inode_del_cached_lru(inode)) { + iobj_put(inode); + return true; + } + return false; +} + static void __inode_add_lru(struct inode *inode, bool rotate) { - if
(inode->i_state & (I_DIRTY_ALL | I_SYNC | I_FREEING | I_WILL_FREE)) + bool need_ref = true; + + lockdep_assert_held(&inode->i_lock); + + if (inode->i_state & (I_FREEING | I_WILL_FREE)) return; if (atomic_read(&inode->i_count)) return; if (!(inode->i_sb->s_flags & SB_ACTIVE)) return; - if (!mapping_shrinkable(&inode->i_data)) + if (inode_needs_cached(inode)) { + inode_add_cached_lru(inode); return; + } + need_ref = __inode_del_cached_lru(inode) == false; if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { - iobj_get(inode); inode->i_state |= I_LRU; + if (need_ref) + iobj_get(inode); this_cpu_inc(nr_unused); } else if (rotate) { inode->i_state |= I_REFERENCED; @@ -573,8 +637,19 @@ void inode_add_lru(struct inode *inode) __inode_add_lru(inode, false); } -static void inode_lru_list_del(struct inode *inode) +/* + * Caller must be holding its own i_count reference on this inode in order to + * prevent this being the final iput. + * + * Needs inode->i_lock held. + */ +void inode_lru_list_del(struct inode *inode) { + lockdep_assert_held(&inode->i_lock); + + if (inode_del_cached_lru(inode)) + return; + if (!(inode->i_state & I_LRU)) return; @@ -950,6 +1025,22 @@ static enum lru_status inode_lru_isolate(struct list_head *item, if (!spin_trylock(&inode->i_lock)) return LRU_SKIP; + /* + * This inode is either dirty or has page cache we want to keep around, + * so move it to the cached list. + * + * We drop the extra i_obj_count reference we grab when adding it to the + * cached lru. + */ + if (inode_needs_cached(inode)) { + list_lru_isolate(lru, &inode->i_lru); + inode_add_cached_lru(inode); + iobj_put(inode); + spin_unlock(&inode->i_lock); + this_cpu_dec(nr_unused); + return LRU_REMOVED; + } + /* * Inodes can get referenced, redirtied, or repopulated while * they're already on the LRU, and this can make them @@ -957,8 +1048,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item, * sync, or the last page cache deletion will requeue them.
 */ if (atomic_read(&inode->i_count) || - (inode->i_state & ~I_REFERENCED) || - !mapping_shrinkable(&inode->i_data)) { + (inode->i_state & ~I_REFERENCED)) { list_lru_isolate(lru, &inode->i_lru); inode->i_state &= ~I_LRU; spin_unlock(&inode->i_lock); diff --git a/fs/internal.h b/fs/internal.h index 38e8aab27bbd..17ecee7056d5 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -207,6 +207,7 @@ extern long prune_icache_sb(struct super_block *sb, struct shrink_control *sc); int dentry_needs_remove_privs(struct mnt_idmap *, struct dentry *dentry); bool in_group_or_capable(struct mnt_idmap *idmap, const struct inode *inode, vfsgid_t vfsgid); +void inode_lru_list_del(struct inode *inode); /* * fs-writeback.c diff --git a/fs/super.c b/fs/super.c index a038848e8d1f..bf3e6d9055af 100644 --- a/fs/super.c +++ b/fs/super.c @@ -364,6 +364,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags, spin_lock_init(&s->s_inode_list_lock); INIT_LIST_HEAD(&s->s_inodes_wb); spin_lock_init(&s->s_inode_wblist_lock); + INIT_LIST_HEAD(&s->s_cached_inodes); + spin_lock_init(&s->s_cached_inodes_lock); s->s_count = 1; atomic_set(&s->s_active, 1); @@ -409,6 +411,7 @@ static void __put_super(struct super_block *s) WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); WARN_ON(!list_empty(&s->s_mounts)); + WARN_ON(!list_empty(&s->s_cached_inodes)); call_rcu(&s->rcu, destroy_super_rcu); } } diff --git a/include/linux/fs.h b/include/linux/fs.h index 509e696a4df0..8384ed81a5ad 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -749,6 +749,9 @@ is_uncached_acl(struct posix_acl *acl) * ->i_lru is on the LRU and those that are using ->i_lru * for some other means. * + * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable, + * and thus is on the s_cached_inodes list. + * + * Q: What is the difference between I_WILL_FREE and I_FREEING?
* * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait @@ -786,6 +789,7 @@ enum inode_state_bits { INODE_BIT(I_SYNC_QUEUED), INODE_BIT(I_PINNING_NETFS_WB), INODE_BIT(I_LRU), + INODE_BIT(I_CACHED_LRU), }; #define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC) @@ -1584,6 +1588,13 @@ struct super_block { spinlock_t s_inode_wblist_lock; struct list_head s_inodes_wb; /* writeback inodes */ + + /* + * Cached inodes: any inode whose reference is held by another + * mechanism, such as dirty inodes or unshrinkable inodes. + */ + spinlock_t s_cached_inodes_lock; + struct list_head s_cached_inodes; } __randomize_layout; static inline struct user_namespace *i_user_ns(const struct inode *inode) diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 486f85aca84d..6949329c744a 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -29,7 +29,8 @@ {I_SYNC_QUEUED, "I_SYNC_QUEUED"}, \ {I_PINNING_NETFS_WB, "I_PINNING_NETFS_WB"}, \ {I_LRU_ISOLATING, "I_LRU_ISOLATING"}, \ - {I_LRU, "I_LRU"} \ + {I_LRU, "I_LRU"}, \ + {I_CACHED_LRU, "I_CACHED_LRU"} \ ) /* enums need to be exported to user space */ -- 2.49.0 When we move to holding a full reference on the inode while it is on an LRU list we need a mechanism to re-run the LRU add logic. The use case for this is btrfs's snapshot delete: we look up all the inodes and try to drop them, but if they're on the LRU we will not call ->drop_inode() because their refcount will be elevated, so we won't know that we need to drop the inode. Fix this by simply removing the inode from its respective LRU list when we grab a reference to it on behalf of an active user. This will ensure that the logic to add the inode to the LRU or drop the inode will be run on the final iput from the user.
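The handover described above can be sketched in plain userspace C. This is illustrative only: `struct lru_obj` and its helpers are invented names modeling the behavior, not kernel API. Taking an active reference removes the object from its LRU (taking over the LRU's reference), and the final put hands the last reference back to the LRU, so the count never hits 0 while the object is usable.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the LRU handover: the LRU owns one full reference;
 * grabbing the object for active use removes it from the LRU (dropping
 * the LRU's reference), and the final put re-runs the caching decision
 * and hands the last reference back to the LRU.
 */
struct lru_obj {
	int count;	/* stands in for i_count */
	bool on_lru;	/* stands in for I_LRU / I_CACHED_LRU membership */
};

/* models igrab()/__iget() followed by inode_lru_list_del() */
static inline void lru_obj_grab(struct lru_obj *obj)
{
	obj->count++;		/* the new active user's reference */
	if (obj->on_lru) {
		obj->on_lru = false;
		obj->count--;	/* drop the LRU's reference */
	}
}

/* models iput(): the last user reference is handed back to the LRU */
static inline void lru_obj_put(struct lru_obj *obj)
{
	if (obj->count == 1 && !obj->on_lru) {
		obj->on_lru = true;	/* __inode_add_lru() keeps the ref */
		return;
	}
	obj->count--;
}
```

Either an active user or an LRU list always owns a reference, which is what lets the later patches forbid usable 0-i_count inodes.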
Signed-off-by: Josef Bacik --- fs/inode.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/fs/inode.c b/fs/inode.c index adcba0a4d776..72981b890ec6 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1146,6 +1146,7 @@ static struct inode *find_inode(struct super_block *sb, return ERR_PTR(-ESTALE); } __iget(inode); + inode_lru_list_del(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); return inode; @@ -1187,6 +1188,7 @@ static struct inode *find_inode_fast(struct super_block *sb, return ERR_PTR(-ESTALE); } __iget(inode); + inode_lru_list_del(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); return inode; @@ -1653,6 +1655,7 @@ struct inode *igrab(struct inode *inode) spin_lock(&inode->i_lock); if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) { __iget(inode); + inode_lru_list_del(inode); spin_unlock(&inode->i_lock); } else { spin_unlock(&inode->i_lock); -- 2.49.0 At evict_inodes() time, we no longer have SB_ACTIVE set, so we can easily go through the normal iput path to clear any inodes. Update dispose_list() to check how we need to free the inode, and then grab a full reference to the inode while we're looping through the remaining inodes, and simply iput them at the end. Since we're just calling iput we don't really care about the i_count on the inode at the current time. Remove the i_count checks and just call iput on every inode we find. Signed-off-by: Josef Bacik --- fs/inode.c | 26 +++++++++++--------------- 1 file changed, 11 insertions(+), 15 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 72981b890ec6..80ad327746a7 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -933,7 +933,7 @@ static void evict(struct inode *inode) * Dispose-list gets a local list with local inodes in it, so it doesn't * need to worry about list corruption and SMP locks. 
 */ -static void dispose_list(struct list_head *head) +static void dispose_list(struct list_head *head, bool for_lru) { while (!list_empty(head)) { struct inode *inode; @@ -941,8 +941,12 @@ static void dispose_list(struct list_head *head) inode = list_first_entry(head, struct inode, i_lru); list_del_init(&inode->i_lru); - evict(inode); - iobj_put(inode); + if (for_lru) { + evict(inode); + iobj_put(inode); + } else { + iput(inode); + } cond_resched(); } } @@ -964,21 +968,13 @@ void evict_inodes(struct super_block *sb) again: spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { - if (atomic_read(&inode->i_count)) - continue; - spin_lock(&inode->i_lock); - if (atomic_read(&inode->i_count)) { - spin_unlock(&inode->i_lock); - continue; - } if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) { spin_unlock(&inode->i_lock); continue; } - inode->i_state |= I_FREEING; - iobj_get(inode); + __iget(inode); inode_lru_list_del(inode); spin_unlock(&inode->i_lock); list_add(&inode->i_lru, &dispose); @@ -991,13 +987,13 @@ void evict_inodes(struct super_block *sb) if (need_resched()) { spin_unlock(&sb->s_inode_list_lock); cond_resched(); - dispose_list(&dispose, false); + dispose_list(&dispose, false); goto again; } } spin_unlock(&sb->s_inode_list_lock); - dispose_list(&dispose); + dispose_list(&dispose, false); } EXPORT_SYMBOL_GPL(evict_inodes); @@ -1108,7 +1104,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) freed = list_lru_shrink_walk(&sb->s_inode_lru, sc, inode_lru_isolate, &freeable); - dispose_list(&freeable); + dispose_list(&freeable, true); return freed; } -- 2.49.0 We want to eliminate usable inodes that have a 0 refcount. To that end, make the LRUs hold a full reference on the inode while it is on an LRU list. From there we can change the eviction code to always just iput the inode, and the LRU operations will just add or drop a full reference where appropriate.
We also now must take into account unlink, and drop our LRU reference when we go to an nlink of 0. We will also avoid adding inodes with a nlink of 0 as they can be reclaimed immediately. Signed-off-by: Josef Bacik --- fs/inode.c | 105 +++++++++++++++++++++++++++++------------------------ 1 file changed, 57 insertions(+), 48 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 80ad327746a7..de0ec791f9a3 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -434,8 +434,18 @@ void drop_nlink(struct inode *inode) { WARN_ON(inode->i_nlink == 0); inode->__i_nlink--; - if (!inode->i_nlink) + if (!inode->i_nlink) { + /* + * LRU's hold a full ref on the inode, but if we've unlinked it + * then we want the inode to be freed when the last user goes, + * so delete the inode from the LRU list. + */ + spin_lock(&inode->i_lock); + inode_lru_list_del(inode); + spin_unlock(&inode->i_lock); + atomic_long_inc(&inode->i_sb->s_remove_count); + } } EXPORT_SYMBOL(drop_nlink); @@ -451,6 +461,12 @@ void clear_nlink(struct inode *inode) { if (inode->i_nlink) { inode->__i_nlink = 0; + + /* See comment in drop_nlink(). 
*/ + spin_lock(&inode->i_lock); + inode_lru_list_del(inode); + spin_unlock(&inode->i_lock); + atomic_long_inc(&inode->i_sb->s_remove_count); } } @@ -555,6 +571,8 @@ static void inode_add_cached_lru(struct inode *inode) if (inode->i_state & I_CACHED_LRU) return; + if (inode->__i_nlink == 0) + return; if (!list_empty(&inode->i_lru)) return; @@ -562,7 +580,7 @@ static void inode_add_cached_lru(struct inode *inode) spin_lock(&inode->i_sb->s_cached_inodes_lock); list_add(&inode->i_lru, &inode->i_sb->s_cached_inodes); spin_unlock(&inode->i_sb->s_cached_inodes_lock); - iobj_get(inode); + __iget(inode); } static bool __inode_del_cached_lru(struct inode *inode) @@ -582,7 +600,7 @@ static bool __inode_del_cached_lru(struct inode *inode) static bool inode_del_cached_lru(struct inode *inode) { if (__inode_del_cached_lru(inode)) { - iobj_put(inode); + iput(inode); return true; } return false; @@ -598,6 +616,8 @@ static void __inode_add_lru(struct inode *inode, bool rotate) return; if (atomic_read(&inode->i_count)) return; + if (inode->__i_nlink == 0) + return; if (!(inode->i_sb->s_flags & SB_ACTIVE)) return; if (inode_needs_cached(inode)) { @@ -609,7 +629,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate) if (list_lru_add_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { inode->i_state |= I_LRU; if (need_ref) - iobj_get(inode); + __iget(inode); this_cpu_inc(nr_unused); } else if (rotate) { inode->i_state |= I_REFERENCED; @@ -655,7 +675,7 @@ void inode_lru_list_del(struct inode *inode) if (list_lru_del_obj(&inode->i_sb->s_inode_lru, &inode->i_lru)) { inode->i_state &= ~I_LRU; - iobj_put(inode); + iput(inode); this_cpu_dec(nr_unused); } } @@ -926,6 +946,7 @@ static void evict(struct inode *inode) BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); } +static void iput_evict(struct inode *inode); /* * dispose_list - dispose of the contents of a local list * @head: the head of the list to free @@ -933,20 +954,14 @@ static void evict(struct inode *inode) * Dispose-list 
gets a local list with local inodes in it, so it doesn't * need to worry about list corruption and SMP locks. */ -static void dispose_list(struct list_head *head, bool for_lru) +static void dispose_list(struct list_head *head) { while (!list_empty(head)) { struct inode *inode; inode = list_first_entry(head, struct inode, i_lru); list_del_init(&inode->i_lru); - - if (for_lru) { - evict(inode); - iobj_put(inode); - } else { - iput(inode); - } + iput_evict(inode); cond_resched(); } } @@ -987,13 +1002,13 @@ void evict_inodes(struct super_block *sb) if (need_resched()) { spin_unlock(&sb->s_inode_list_lock); cond_resched(); - dispose_list(&dispose, false); + dispose_list(&dispose); goto again; } } spin_unlock(&sb->s_inode_list_lock); - dispose_list(&dispose, false); + dispose_list(&dispose); } EXPORT_SYMBOL_GPL(evict_inodes); @@ -1031,22 +1046,7 @@ static enum lru_status inode_lru_isolate(struct list_head *item, if (inode_needs_cached(inode)) { list_lru_isolate(lru, &inode->i_lru); inode_add_cached_lru(inode); - iobj_put(inode); - spin_unlock(&inode->i_lock); - this_cpu_dec(nr_unused); - return LRU_REMOVED; - } - - /* - * Inodes can get referenced, redirtied, or repopulated while - * they're already on the LRU, and this can make them - * unreclaimable for a while. Remove them lazily here; iput, - * sync, or the last page cache deletion will requeue them. 
- */ - if (atomic_read(&inode->i_count) || - (inode->i_state & ~I_REFERENCED)) { - list_lru_isolate(lru, &inode->i_lru); - inode->i_state &= ~I_LRU; + iput(inode); spin_unlock(&inode->i_lock); this_cpu_dec(nr_unused); return LRU_REMOVED; @@ -1082,7 +1082,6 @@ static enum lru_status inode_lru_isolate(struct list_head *item, } WARN_ON(inode->i_state & I_NEW); - inode->i_state |= I_FREEING; inode->i_state &= ~I_LRU; list_lru_isolate_move(lru, &inode->i_lru, freeable); spin_unlock(&inode->i_lock); @@ -1104,7 +1103,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) freed = list_lru_shrink_walk(&sb->s_inode_lru, sc, inode_lru_isolate, &freeable); - dispose_list(&freeable, true); + dispose_list(&freeable); return freed; } @@ -1967,7 +1966,7 @@ EXPORT_SYMBOL(generic_delete_inode); * in cache if fs is alive, sync and evict if fs is * shutting down. */ -static void iput_final(struct inode *inode) +static void iput_final(struct inode *inode, bool skip_lru) { struct super_block *sb = inode->i_sb; const struct super_operations *op = inode->i_sb->s_op; @@ -1981,7 +1980,7 @@ static void iput_final(struct inode *inode) else drop = generic_drop_inode(inode); - if (!drop && + if (!drop && !skip_lru && !(inode->i_state & I_DONTCACHE) && (sb->s_flags & SB_ACTIVE)) { __inode_add_lru(inode, true); @@ -1989,6 +1988,8 @@ static void iput_final(struct inode *inode) return; } + WARN_ON(!list_empty(&inode->i_lru)); + state = inode->i_state; if (!drop) { WRITE_ONCE(inode->i_state, state | I_WILL_FREE); @@ -2003,23 +2004,12 @@ static void iput_final(struct inode *inode) } WRITE_ONCE(inode->i_state, state | I_FREEING); - if (!list_empty(&inode->i_lru)) - inode_lru_list_del(inode); spin_unlock(&inode->i_lock); evict(inode); } -/** - * iput - put an inode - * @inode: inode to put - * - * Puts an inode, dropping its usage count. If the inode use count hits - * zero, the inode is then freed and may also be destroyed. - * - * Consequently, iput() can sleep. 
- */ -void iput(struct inode *inode) +static void __iput(struct inode *inode, bool skip_lru) { if (!inode) return; @@ -2037,12 +2027,31 @@ void iput(struct inode *inode) spin_lock(&inode->i_lock); if (atomic_dec_and_test(&inode->i_count)) - iput_final(inode); + iput_final(inode, skip_lru); else spin_unlock(&inode->i_lock); iobj_put(inode); } + +static void iput_evict(struct inode *inode) +{ + __iput(inode, true); +} + +/** + * iput - put an inode + * @inode: inode to put + * + * Puts an inode, dropping its usage count. If the inode use count hits + * zero, the inode is then freed and may also be destroyed. + * + * Consequently, iput() can sleep. + */ +void iput(struct inode *inode) +{ + __iput(inode, false); +} EXPORT_SYMBOL(iput); /** -- 2.49.0 Now that we take a full reference for inodes on the LRU, move the logic to add the inode to the LRU to before we drop our last reference. This allows us to ensure that if the inode has a reference count it can be used, and we no longer hold onto inodes that have a 0 reference count. Signed-off-by: Josef Bacik --- fs/inode.c | 53 +++++++++++++++++++++++++++++++++-------------------- 1 file changed, 33 insertions(+), 20 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index de0ec791f9a3..b4145ddbaf8e 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -614,7 +614,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate) if (inode->i_state & (I_FREEING | I_WILL_FREE)) return; - if (atomic_read(&inode->i_count)) + if (atomic_read(&inode->i_count) != 1) return; if (inode->__i_nlink == 0) return; @@ -1966,28 +1966,11 @@ EXPORT_SYMBOL(generic_delete_inode); * in cache if fs is alive, sync and evict if fs is * shutting down. 
 */ -static void iput_final(struct inode *inode, bool skip_lru) +static void iput_final(struct inode *inode, bool drop) { - struct super_block *sb = inode->i_sb; - const struct super_operations *op = inode->i_sb->s_op; unsigned long state; - int drop; WARN_ON(inode->i_state & I_NEW); - - if (op->drop_inode) - drop = op->drop_inode(inode); - else - drop = generic_drop_inode(inode); - - if (!drop && !skip_lru && - !(inode->i_state & I_DONTCACHE) && - (sb->s_flags & SB_ACTIVE)) { - __inode_add_lru(inode, true); - spin_unlock(&inode->i_lock); - return; - } - WARN_ON(!list_empty(&inode->i_lru)); state = inode->i_state; @@ -2009,8 +1992,29 @@ static void iput_final(struct inode *inode, bool skip_lru) evict(inode); } +static bool maybe_add_lru(struct inode *inode, bool skip_lru) +{ + const struct super_operations *op = inode->i_sb->s_op; + struct super_block *sb = inode->i_sb; + bool drop = false; + + if (op->drop_inode) + drop = op->drop_inode(inode); + else + drop = generic_drop_inode(inode); + + if (!drop && !skip_lru && + !(inode->i_state & I_DONTCACHE) && + (sb->s_flags & SB_ACTIVE)) + __inode_add_lru(inode, true); + + return drop; +} + static void __iput(struct inode *inode, bool skip_lru) { + bool drop; + if (!inode) return; BUG_ON(inode->i_state & I_CLEAR); @@ -2026,8 +2030,17 @@ static void __iput(struct inode *inode, bool skip_lru) } spin_lock(&inode->i_lock); + + /* + * If we want to keep the inode around on an LRU we will grab a ref to + * the inode when we add it to the LRU list, so we can safely drop the + * caller's reference after this. If we didn't add the inode to the LRU + * then the refcount will still be 1 and we can do the final iput.
+ */ + drop = maybe_add_lru(inode, skip_lru); + if (atomic_dec_and_test(&inode->i_count)) - iput_final(inode, skip_lru); + iput_final(inode, drop); else spin_unlock(&inode->i_lock); -- 2.49.0 In the future when we only serialize the freeing of the inode on the reference count we could potentially be relying on ->i_lru to be consistent, which means we need it to be consistent under the ->i_lock. Move the list_add in evict_inodes() to under the ->i_lock to prevent potential races where we think the inode isn't on a list but is going to be added to the private dispose list. Signed-off-by: Josef Bacik --- fs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/inode.c b/fs/inode.c index b4145ddbaf8e..07c8edb4b58a 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -991,8 +991,8 @@ void evict_inodes(struct super_block *sb) __iget(inode); inode_lru_list_del(inode); - spin_unlock(&inode->i_lock); list_add(&inode->i_lru, &dispose); + spin_unlock(&inode->i_lock); /* * We can have a ton of inodes to evict at unmount time given -- 2.49.0 Now that we do not allow i_count to drop to 0 and be used we can convert it to a refcount_t and benefit from the protections those helpers add. 
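The fast/slow-path split that makes this conversion safe can be modeled in userspace C. This is a sketch with invented names: plain ints stand in for refcount_t, and the lock is elided; only the shape of the control flow mirrors __iput().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch of the __iput() fast/slow paths. fake_dec_not_one()
 * plays the role of refcount_dec_not_one(): the common case drops a
 * reference without ever risking a 0 -> use transition; only a potential
 * final put falls through to the slow path.
 */
struct fake_inode {
	int i_count;		/* stands in for refcount_t i_count */
	int i_obj_count;	/* stands in for refcount_t i_obj_count */
	bool destroyed;
};

/* models refcount_dec_not_one(): decrement unless the count is 1 */
static inline bool fake_dec_not_one(int *count)
{
	if (*count == 1)
		return false;
	(*count)--;
	return true;
}

/* models iobj_put(): the object refcount controls the actual free */
static inline void fake_iobj_put(struct fake_inode *inode)
{
	if (--inode->i_obj_count == 0)
		inode->destroyed = true;	/* destroy_inode() */
}

static inline void fake_iput(struct fake_inode *inode)
{
	if (fake_dec_not_one(&inode->i_count)) {	/* fast path */
		fake_iobj_put(inode);
		return;
	}
	/* slow path: the kernel takes i_lock and runs iput_final() here */
	inode->i_count--;
	fake_iobj_put(inode);
}
```

Because the count is only ever decremented past 1 on the locked slow path, the counter never goes 1 -> 0 behind a concurrent user's back, which is the invariant refcount_t enforces with its saturation warnings.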
Signed-off-by: Josef Bacik --- arch/powerpc/platforms/cell/spufs/file.c | 2 +- fs/btrfs/inode.c | 4 ++-- fs/ceph/mds_client.c | 2 +- fs/ext4/ialloc.c | 4 ++-- fs/fs-writeback.c | 2 +- fs/hpfs/inode.c | 2 +- fs/inode.c | 11 ++++++----- fs/nfs/inode.c | 4 ++-- fs/notify/fsnotify.c | 2 +- fs/ubifs/super.c | 2 +- fs/xfs/xfs_inode.c | 2 +- fs/xfs/xfs_trace.h | 2 +- include/linux/fs.h | 4 ++-- include/trace/events/filelock.h | 2 +- security/landlock/fs.c | 2 +- 15 files changed, 24 insertions(+), 23 deletions(-) diff --git a/arch/powerpc/platforms/cell/spufs/file.c b/arch/powerpc/platforms/cell/spufs/file.c index d5a2c77bc908..3f768b003838 100644 --- a/arch/powerpc/platforms/cell/spufs/file.c +++ b/arch/powerpc/platforms/cell/spufs/file.c @@ -1430,7 +1430,7 @@ static int spufs_mfc_open(struct inode *inode, struct file *file) if (ctx->owner != current->mm) return -EINVAL; - if (atomic_read(&inode->i_count) != 1) + if (refcount_read(&inode->i_count) != 1) return -EBUSY; mutex_lock(&ctx->mapping_lock); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index bbbcd96e8f5c..e85e38df3ea0 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3418,7 +3418,7 @@ void btrfs_add_delayed_iput(struct btrfs_inode *inode) struct btrfs_fs_info *fs_info = inode->root->fs_info; unsigned long flags; - if (atomic_add_unless(&inode->vfs_inode.i_count, -1, 1)) { + if (refcount_dec_not_one(&inode->vfs_inode.i_count)) { iobj_put(&inode->vfs_inode); return; } @@ -4559,7 +4559,7 @@ static void btrfs_prune_dentries(struct btrfs_root *root) inode = btrfs_find_first_inode(root, min_ino); while (inode) { - if (atomic_read(&inode->vfs_inode.i_count) > 1) + if (refcount_read(&inode->vfs_inode.i_count) > 1) d_prune_aliases(&inode->vfs_inode); min_ino = btrfs_ino(inode) + 1; diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 0f497c39ff82..ff666d18f6ad 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2221,7 +2221,7 @@ static int trim_caps_cb(struct inode *inode, int mds, void 
*arg) int count; dput(dentry); d_prune_aliases(inode); - count = atomic_read(&inode->i_count); + count = refcount_read(&inode->i_count); if (count == 1) (*remaining)--; doutc(cl, "%p %llx.%llx cap %p pruned, count now %d\n", diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c index df4051613b29..9a3c7f22a57e 100644 --- a/fs/ext4/ialloc.c +++ b/fs/ext4/ialloc.c @@ -252,10 +252,10 @@ void ext4_free_inode(handle_t *handle, struct inode *inode) "nonexistent device\n", __func__, __LINE__); return; } - if (atomic_read(&inode->i_count) > 1) { + if (refcount_read(&inode->i_count) > 1) { ext4_msg(sb, KERN_ERR, "%s:%d: inode #%lu: count=%d", __func__, __LINE__, inode->i_ino, - atomic_read(&inode->i_count)); + refcount_read(&inode->i_count)); return; } if (inode->i_nlink) { diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 111a9d8215bf..789c4228412c 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -1796,7 +1796,7 @@ static int writeback_single_inode(struct inode *inode, int ret = 0; spin_lock(&inode->i_lock); - if (!atomic_read(&inode->i_count)) + if (!refcount_read(&inode->i_count)) WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING))); else WARN_ON(inode->i_state & I_WILL_FREE); diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c index a59e8fa630db..ee23a941d8f5 100644 --- a/fs/hpfs/inode.c +++ b/fs/hpfs/inode.c @@ -184,7 +184,7 @@ void hpfs_write_inode(struct inode *i) struct hpfs_inode_info *hpfs_inode = hpfs_i(i); struct inode *parent; if (i->i_ino == hpfs_sb(i->i_sb)->sb_root) return; - if (hpfs_inode->i_rddir_off && !atomic_read(&i->i_count)) { + if (hpfs_inode->i_rddir_off && !refcount_read(&i->i_count)) { if (*hpfs_inode->i_rddir_off) pr_err("write_inode: some position still there\n"); kfree(hpfs_inode->i_rddir_off); diff --git a/fs/inode.c b/fs/inode.c index 07c8edb4b58a..28d197731914 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -236,7 +236,7 @@ int inode_init_always_gfp(struct super_block *sb, struct inode *inode, gfp_t gfp inode->i_state = 0; 
atomic64_set(&inode->i_sequence, 0); refcount_set(&inode->i_obj_count, 1); - atomic_set(&inode->i_count, 1); + refcount_set(&inode->i_count, 1); inode->i_op = &empty_iops; inode->i_fop = &no_open_fops; inode->i_ino = 0; @@ -561,7 +561,8 @@ static void init_once(void *foo) void ihold(struct inode *inode) { iobj_get(inode); - WARN_ON(atomic_inc_return(&inode->i_count) < 2); + refcount_inc(&inode->i_count); + WARN_ON(refcount_read(&inode->i_count) < 2); } EXPORT_SYMBOL(ihold); @@ -614,7 +615,7 @@ static void __inode_add_lru(struct inode *inode, bool rotate) if (inode->i_state & (I_FREEING | I_WILL_FREE)) return; - if (atomic_read(&inode->i_count) != 1) + if (refcount_read(&inode->i_count) != 1) return; if (inode->__i_nlink == 0) return; @@ -2019,7 +2020,7 @@ static void __iput(struct inode *inode, bool skip_lru) return; BUG_ON(inode->i_state & I_CLEAR); - if (atomic_add_unless(&inode->i_count, -1, 1)) { + if (refcount_dec_not_one(&inode->i_count)) { iobj_put(inode); return; } @@ -2039,7 +2040,7 @@ static void __iput(struct inode *inode, bool skip_lru) */ drop = maybe_add_lru(inode, skip_lru); - if (atomic_dec_and_test(&inode->i_count)) + if (refcount_dec_and_test(&inode->i_count)) iput_final(inode, drop); else spin_unlock(&inode->i_lock); diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c index 338ef77ae423..9cc84f0afa9a 100644 --- a/fs/nfs/inode.c +++ b/fs/nfs/inode.c @@ -608,7 +608,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr) inode->i_sb->s_id, (unsigned long long)NFS_FILEID(inode), nfs_display_fhandle_hash(fh), - atomic_read(&inode->i_count)); + refcount_read(&inode->i_count)); out: return inode; @@ -2229,7 +2229,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr) dfprintk(VFS, "NFS: %s(%s/%lu fh_crc=0x%08x ct=%d info=0x%llx)\n", __func__, inode->i_sb->s_id, inode->i_ino, nfs_display_fhandle_hash(NFS_FH(inode)), - atomic_read(&inode->i_count), fattr->valid); + refcount_read(&inode->i_count), fattr->valid); 
if (!(fattr->valid & NFS_ATTR_FATTR_FILEID)) { /* Only a mounted-on-fileid? Just exit */ diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 079b868552c2..0883696f873d 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -66,7 +66,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb) * removed all zero refcount inodes, in any case. Test to * be sure. */ - if (!atomic_read(&inode->i_count)) { + if (!refcount_read(&inode->i_count)) { spin_unlock(&inode->i_lock); continue; } diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c index f3e3b2068608..79526f71fa8a 100644 --- a/fs/ubifs/super.c +++ b/fs/ubifs/super.c @@ -358,7 +358,7 @@ static void ubifs_evict_inode(struct inode *inode) goto out; dbg_gen("inode %lu, mode %#x", inode->i_ino, (int)inode->i_mode); - ubifs_assert(c, !atomic_read(&inode->i_count)); + ubifs_assert(c, !refcount_read(&inode->i_count)); truncate_inode_pages_final(&inode->i_data); diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 9c39251961a3..06af749fe5f3 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1035,7 +1035,7 @@ xfs_itruncate_extents_flags( int error = 0; xfs_assert_ilocked(ip, XFS_ILOCK_EXCL); - if (atomic_read(&VFS_I(ip)->i_count)) + if (refcount_read(&VFS_I(ip)->i_count)) xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL); ASSERT(new_size <= XFS_ISIZE(ip)); ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES); diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index ac344e42846c..167d33b8095c 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1152,7 +1152,7 @@ DECLARE_EVENT_CLASS(xfs_iref_class, TP_fast_assign( __entry->dev = VFS_I(ip)->i_sb->s_dev; __entry->ino = ip->i_ino; - __entry->count = atomic_read(&VFS_I(ip)->i_count); + __entry->count = refcount_read(&VFS_I(ip)->i_count); __entry->pincount = atomic_read(&ip->i_pincount); __entry->iflags = ip->i_flags; __entry->caller_ip = caller_ip; diff --git a/include/linux/fs.h b/include/linux/fs.h index 8384ed81a5ad..34fb40ba8a94 100644 --- 
a/include/linux/fs.h +++ b/include/linux/fs.h @@ -880,7 +880,7 @@ struct inode { }; atomic64_t i_version; atomic64_t i_sequence; /* see futex */ - atomic_t i_count; + refcount_t i_count; atomic_t i_dio_count; atomic_t i_writecount; #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING) @@ -3399,7 +3399,7 @@ static inline void iobj_get(struct inode *inode) static inline void __iget(struct inode *inode) { iobj_get(inode); - atomic_inc(&inode->i_count); + refcount_inc(&inode->i_count); } extern void iget_failed(struct inode *); diff --git a/include/trace/events/filelock.h b/include/trace/events/filelock.h index b8d1e00a7982..e745436cfcd2 100644 --- a/include/trace/events/filelock.h +++ b/include/trace/events/filelock.h @@ -189,7 +189,7 @@ TRACE_EVENT(generic_add_lease, __entry->i_ino = inode->i_ino; __entry->wcount = atomic_read(&inode->i_writecount); __entry->rcount = atomic_read(&inode->i_readcount); - __entry->icount = atomic_read(&inode->i_count); + __entry->icount = refcount_read(&inode->i_count); __entry->owner = fl->c.flc_owner; __entry->flags = fl->c.flc_flags; __entry->type = fl->c.flc_type; diff --git a/security/landlock/fs.c b/security/landlock/fs.c index c04f8879ad03..570f851dc469 100644 --- a/security/landlock/fs.c +++ b/security/landlock/fs.c @@ -1281,7 +1281,7 @@ static void hook_sb_delete(struct super_block *const sb) struct landlock_object *object; /* Only handles referenced inodes. */ - if (!atomic_read(&inode->i_count)) + if (!refcount_read(&inode->i_count)) continue; /* -- 2.49.0 We are going to use igrab everywhere we want to acquire a live inode. Update it to do a refcount_inc_not_zero on the i_count, and if successful grab a reference to i_obj_count. Add a comment explaining why we do this and why it is safe.
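The tryget pattern described above can be sketched in userspace C. This is a toy model only: toy_inode, toy_inc_not_zero, and toy_tryget are illustrative stand-ins, not the kernel's refcount_t implementation, but they show why taking an i_count reference first makes the subsequent i_obj_count bump safe.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy stand-in for an inode with the two counts this series adds. */
struct toy_inode {
	atomic_int i_count;     /* "is this inode in use by a user" */
	atomic_int i_obj_count; /* "is this inode object still allocated" */
};

/* Mimics refcount_inc_not_zero(): only take a reference while the
 * count is still above zero, i.e. the inode has not begun dying. */
static int toy_inc_not_zero(atomic_int *r)
{
	int old = atomic_load(r);

	while (old != 0) {
		if (atomic_compare_exchange_weak(r, &old, old + 1))
			return 1; /* got a reference */
	}
	return 0; /* already zero: the inode is being freed */
}

/* Sketch of the inode_tryget() idea: grab a live i_count reference
 * first, and only on success pin the object itself, since an object
 * reference is always dropped after its i_count reference. */
static struct toy_inode *toy_tryget(struct toy_inode *inode)
{
	if (!toy_inc_not_zero(&inode->i_count))
		return NULL;
	atomic_fetch_add(&inode->i_obj_count, 1);
	return inode;
}
```

A failed tryget on a dying inode leaves both counts untouched, which is exactly what lets callers fall back to waiting for eviction to finish.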
Signed-off-by: Josef Bacik --- fs/inode.c | 26 +++++++++++++------------- include/linux/fs.h | 27 +++++++++++++++++++++++++++ 2 files changed, 40 insertions(+), 13 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 28d197731914..b9122c1eee1d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1648,20 +1648,20 @@ EXPORT_SYMBOL(iunique); struct inode *igrab(struct inode *inode) { + lockdep_assert_not_held(&inode->i_lock); + + inode = inode_tryget(inode); + if (!inode) + return NULL; + + /* + * If this inode is on the LRU, take it off so that we can re-run the + * LRU logic on the next iput(). + */ spin_lock(&inode->i_lock); - if (!(inode->i_state & (I_FREEING|I_WILL_FREE))) { - __iget(inode); - inode_lru_list_del(inode); - spin_unlock(&inode->i_lock); - } else { - spin_unlock(&inode->i_lock); - /* - * Handle the case where s_op->clear_inode is not been - * called yet, and somebody is calling igrab - * while the inode is getting freed. - */ - inode = NULL; - } + inode_lru_list_del(inode); + spin_unlock(&inode->i_lock); + return inode; } EXPORT_SYMBOL(igrab); diff --git a/include/linux/fs.h b/include/linux/fs.h index 34fb40ba8a94..b731224708be 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -3393,6 +3393,33 @@ static inline void iobj_get(struct inode *inode) refcount_inc(&inode->i_obj_count); } +static inline struct inode *inode_tryget(struct inode *inode) +{ + /* + * We are using inode_tryget() because we're interested in getting a + * live reference to the inode, which is ->i_count. Normally we would + * grab i_obj_count first, as it is the higher priority reference. + * However we're only interested in making sure we have a live inode, + * and we know that if we get a reference for i_count then we can safely + * acquire i_obj_count because we always drop i_obj_count after dropping + * an i_count reference.
+ * + * This is meant to be used either in a place where we have an existing + * i_obj_count reference on the inode, or under rcu_read_lock() so we + * know we're safe in accessing this inode still. + */ + if (!refcount_inc_not_zero(&inode->i_count)) { + /* + * If we failed to increment the reference count, then the + * inode is being freed or has been freed. We return NULL + * in this case. + */ + return NULL; + } + iobj_get(inode); + return inode; +} + /* * inode->i_lock must be held */ -- 2.49.0 Now that we never drop the i_count to 0 for valid objects, rework the logic in the find_inode* helpers to use inode_tryget() to see if they have a live inode. If this fails we can wait for the inode to be freed as we know it's currently being evicted. Signed-off-by: Josef Bacik --- fs/inode.c | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index b9122c1eee1d..893ac902268b 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1109,6 +1109,7 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc) } static void __wait_on_freeing_inode(struct inode *inode, bool is_inode_hash_locked); + /* * Called with the inode lock held. 
*/ @@ -1132,16 +1133,15 @@ static struct inode *find_inode(struct super_block *sb, if (!test(inode, data)) continue; spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING|I_WILL_FREE)) { - __wait_on_freeing_inode(inode, is_inode_hash_locked); - goto repeat; - } if (unlikely(inode->i_state & I_CREATING)) { spin_unlock(&inode->i_lock); rcu_read_unlock(); return ERR_PTR(-ESTALE); } - __iget(inode); + if (!inode_tryget(inode)) { + __wait_on_freeing_inode(inode, is_inode_hash_locked); + goto repeat; + } inode_lru_list_del(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); @@ -1174,16 +1174,15 @@ static struct inode *find_inode_fast(struct super_block *sb, if (inode->i_sb != sb) continue; spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING|I_WILL_FREE)) { - __wait_on_freeing_inode(inode, is_inode_hash_locked); - goto repeat; - } if (unlikely(inode->i_state & I_CREATING)) { spin_unlock(&inode->i_lock); rcu_read_unlock(); return ERR_PTR(-ESTALE); } - __iget(inode); + if (!inode_tryget(inode)) { + __wait_on_freeing_inode(inode, is_inode_hash_locked); + goto repeat; + } inode_lru_list_del(inode); spin_unlock(&inode->i_lock); rcu_read_unlock(); -- 2.49.0 These two helpers are always used under the RCU and don't appear to mind if the inode state changes in between time of check and time of use. Update them to use the i_count refcount instead of I_WILL_FREE or I_FREEING. 
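For the RCU-side lookups above, a plain refcount read replaces the I_FREEING|I_WILL_FREE flag test because a dying inode drops i_count to zero before anything else matters; both predicates classify inodes the same way. A userspace sketch (toy names only, not kernel code; the race-tolerance relies on the callers not minding time-of-check/time-of-use changes):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative flag values and inode stand-in; not the kernel's. */
enum { TOY_I_FREEING = 1 << 0, TOY_I_WILL_FREE = 1 << 1 };

struct toy_inode {
	unsigned int i_state;
	atomic_int i_count;
};

/* Old-style liveness test: inspect the eviction flags. */
static int live_by_flags(struct toy_inode *inode)
{
	return !(inode->i_state & (TOY_I_FREEING | TOY_I_WILL_FREE));
}

/* New-style liveness test: a dying inode has already dropped i_count
 * to zero, so a plain (racy) read of the refcount answers the same
 * question for lookups that tolerate the state changing right after
 * the check. */
static int live_by_refcount(struct toy_inode *inode)
{
	return atomic_load(&inode->i_count) > 0;
}
```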
Signed-off-by: Josef Bacik --- fs/inode.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 893ac902268b..63ccd32fa221 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1839,7 +1839,7 @@ struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval, hlist_for_each_entry_rcu(inode, head, i_hash) { if (inode->i_sb == sb && - !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) && + refcount_read(&inode->i_count) > 0 && test(inode, data)) return inode; } @@ -1878,8 +1878,8 @@ struct inode *find_inode_by_ino_rcu(struct super_block *sb, hlist_for_each_entry_rcu(inode, head, i_hash) { if (inode->i_ino == ino && inode->i_sb == sb && - !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE))) - return inode; + refcount_read(&inode->i_count) > 0) + return inode; } return NULL; } -- 2.49.0 Follow the same pattern in find_inode*. Instead of checking for I_WILL_FREE|I_FREEING simply call igrab() and if it succeeds we're done. Signed-off-by: Josef Bacik --- fs/inode.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 63ccd32fa221..6b772b9883ec 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -1899,11 +1899,8 @@ int insert_inode_locked(struct inode *inode) continue; if (old->i_sb != sb) continue; - spin_lock(&old->i_lock); - if (old->i_state & (I_FREEING|I_WILL_FREE)) { - spin_unlock(&old->i_lock); + if (!igrab(old)) continue; - } break; } if (likely(!old)) { @@ -1915,12 +1912,13 @@ int insert_inode_locked(struct inode *inode) spin_unlock(&inode_hash_lock); return 0; } + spin_lock(&old->i_lock); if (unlikely(old->i_state & I_CREATING)) { spin_unlock(&old->i_lock); spin_unlock(&inode_hash_lock); + iput(old); return -EBUSY; } - __iget(old); spin_unlock(&old->i_lock); spin_unlock(&inode_hash_lock); wait_on_inode(old); -- 2.49.0 We only want to add to the LRU if the current caller is potentially the last one dropping a reference, so if our refcount is 0 we're being 
deleted, and if the refcount is > 1 then there is another ref holder and they can add the inode to the LRU list. Signed-off-by: Josef Bacik --- fs/inode.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 6b772b9883ec..c61400f39876 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -613,8 +613,6 @@ static void __inode_add_lru(struct inode *inode, bool rotate) lockdep_assert_held(&inode->i_lock); - if (inode->i_state & (I_FREEING | I_WILL_FREE)) - return; if (refcount_read(&inode->i_count) != 1) return; if (inode->__i_nlink == 0) -- 2.49.0 If the inode is on the LRU list then it has a valid reference and we do not need to check for these flags. Signed-off-by: Josef Bacik --- fs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/inode.c b/fs/inode.c index c61400f39876..a14b3a54c4b5 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -682,7 +682,7 @@ void inode_lru_list_del(struct inode *inode) static void inode_pin_lru_isolating(struct inode *inode) { lockdep_assert_held(&inode->i_lock); - WARN_ON(inode->i_state & (I_LRU_ISOLATING | I_FREEING | I_WILL_FREE)); + WARN_ON(inode->i_state & I_LRU_ISOLATING); inode->i_state |= I_LRU_ISOLATING; } -- 2.49.0 Instead of checking I_WILL_FREE|I_FREEING we can simply use inode_tryget() to determine if we have a live inode that can be evicted. 
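The evict-side scan can be modelled in userspace: skip inodes still being set up (I_NEW), and let a failed tryget — rather than an I_FREEING|I_WILL_FREE flag check — filter out inodes already on their way out. This is a sketch with made-up toy_* names over an array instead of the superblock inode list:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

enum { TOY_I_NEW = 1 << 0 }; /* illustrative flag value */

struct toy_inode {
	unsigned int i_state;
	atomic_int i_count;
};

/* Mimics refcount_inc_not_zero(): fails once the count reaches 0. */
static int toy_tryget(struct toy_inode *inode)
{
	int old = atomic_load(&inode->i_count);

	while (old != 0) {
		if (atomic_compare_exchange_weak(&inode->i_count, &old, old + 1))
			return 1;
	}
	return 0;
}

/* Sketch of the evict_inodes()-style scan: pin every live, fully set
 * up inode for disposal and count how many we got; a failed tryget
 * means eviction of that inode is already underway. */
static size_t toy_collect(struct toy_inode *inodes, size_t n)
{
	size_t pinned = 0;

	for (size_t i = 0; i < n; i++) {
		if (inodes[i].i_state & TOY_I_NEW)
			continue;
		if (!toy_tryget(&inodes[i]))
			continue;
		pinned++;
	}
	return pinned;
}
```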
Signed-off-by: Josef Bacik --- fs/inode.c | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index a14b3a54c4b5..4e1eeb0c3889 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -983,12 +983,16 @@ void evict_inodes(struct super_block *sb) spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { spin_lock(&inode->i_lock); - if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) { + if (inode->i_state & I_NEW) { + spin_unlock(&inode->i_lock); + continue; + } + + if (!inode_tryget(inode)) { spin_unlock(&inode->i_lock); continue; } - __iget(inode); inode_lru_list_del(inode); list_add(&inode->i_lru, &dispose); spin_unlock(&inode->i_lock); -- 2.49.0 Instead of checking for I_WILL_FREE|I_FREEING simply use the refcount to make sure we have a live inode. Signed-off-by: Josef Bacik --- fs/crypto/keyring.c | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/crypto/keyring.c b/fs/crypto/keyring.c index 7557f6a88b8f..969db498149a 100644 --- a/fs/crypto/keyring.c +++ b/fs/crypto/keyring.c @@ -956,13 +956,16 @@ static void evict_dentries_for_decrypted_inodes(struct fscrypt_master_key *mk) list_for_each_entry(ci, &mk->mk_decrypted_inodes, ci_master_key_link) { inode = ci->ci_inode; + spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) { + if (inode->i_state & I_NEW) { spin_unlock(&inode->i_lock); continue; } - __iget(inode); spin_unlock(&inode->i_lock); + + if (!igrab(inode)) + continue; spin_unlock(&mk->mk_decrypted_inodes_lock); shrink_dcache_inode(inode); -- 2.49.0 Instead of checking I_WILL_FREE or I_FREEING simply grab a reference to the inode, as it will only succeed if the inode is live. 
Signed-off-by: Josef Bacik --- block/bdev.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/block/bdev.c b/block/bdev.c index b77ddd12dc06..94ffc0b5a68c 100644 --- a/block/bdev.c +++ b/block/bdev.c @@ -1265,13 +1265,15 @@ void sync_bdevs(bool wait) struct block_device *bdev; spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW) || - mapping->nrpages == 0) { + if (inode->i_state & I_NEW || mapping->nrpages == 0) { spin_unlock(&inode->i_lock); continue; } - __iget(inode); spin_unlock(&inode->i_lock); + + if (!igrab(inode)) + continue; + spin_unlock(&blockdev_superblock->s_inode_list_lock); /* * We hold a reference to 'inode' so it couldn't have been -- 2.49.0 We can use the refcount to decide if the inode is alive instead of these flags. Signed-off-by: Josef Bacik --- fs/bcachefs/fs.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/bcachefs/fs.c b/fs/bcachefs/fs.c index 687af0eea0c2..de1c3c740a9d 100644 --- a/fs/bcachefs/fs.c +++ b/fs/bcachefs/fs.c @@ -347,7 +347,7 @@ static struct bch_inode_info *bch2_inode_hash_find(struct bch_fs *c, struct btre spin_unlock(&inode->v.i_lock); return NULL; } - if ((inode->v.i_state & (I_FREEING|I_WILL_FREE))) { + if (!refcount_read(&inode->v.i_count)) { if (!trans) { __wait_on_freeing_inode(c, inode, inum); } else { @@ -2225,7 +2225,6 @@ void bch2_evict_subvolume_inodes(struct bch_fs *c, snapshot_id_list *s) continue; if (!(inode->v.i_state & I_DONTCACHE) && - !(inode->v.i_state & I_FREEING) && igrab(&inode->v)) { this_pass_clean = false; -- 2.49.0 btrfs has its own per-root inode list for snapshot use, and it has a sanity check to make sure we're not overwriting a live inode when we add one to the root's xarray. Change this to check the refcount to validate it's not a live inode.
Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index e85e38df3ea0..03fc3cbdb4af 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -3860,7 +3860,7 @@ static int btrfs_add_inode_to_root(struct btrfs_inode *inode, bool prealloc) ASSERT(ret != -ENOMEM); return ret; } else if (existing) { - WARN_ON(!(existing->vfs_inode.i_state & (I_WILL_FREE | I_FREEING))); + WARN_ON(!refcount_read(&existing->vfs_inode.i_count)); } return 0; -- 2.49.0 Just use igrab to see if the inode is valid instead of checking I_FREEING|I_WILL_FREE. Signed-off-by: Josef Bacik --- fs/drop_caches.c | 11 ++++------- 1 file changed, 4 insertions(+), 7 deletions(-) diff --git a/fs/drop_caches.c b/fs/drop_caches.c index 019a8b4eaaf9..852ccf8e84cb 100644 --- a/fs/drop_caches.c +++ b/fs/drop_caches.c @@ -23,18 +23,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused) spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { spin_lock(&inode->i_lock); - /* - * We must skip inodes in unusual state. We may also skip - * inodes without pages but we deliberately won't in case - * we need to reschedule to avoid softlockups. - */ - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || + if ((inode->i_state & I_NEW) || (mapping_empty(inode->i_mapping) && !need_resched())) { spin_unlock(&inode->i_lock); continue; } - __iget(inode); spin_unlock(&inode->i_lock); + + if (!igrab(inode)) + continue; spin_unlock(&sb->s_inode_list_lock); invalidate_mapping_pages(inode->i_mapping, 0, -1); -- 2.49.0 Instead of checking for I_FREEING, check the refcount of the inode to see if it is alive. 
Signed-off-by: Josef Bacik --- fs/dcache.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/dcache.c b/fs/dcache.c index 60046ae23d51..fa72a7922517 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -1072,8 +1072,8 @@ struct dentry *d_find_alias_rcu(struct inode *inode) spin_lock(&inode->i_lock); // ->i_dentry and ->i_rcu are colocated, but the latter won't be - // used without having I_FREEING set, which means no aliases left - if (likely(!(inode->i_state & I_FREEING) && !hlist_empty(l))) { + // used without having an i_count reference, which means no aliases left + if (likely(refcount_read(&inode->i_count) && !hlist_empty(l))) { if (S_ISDIR(inode->i_mode)) { de = hlist_entry(l->first, struct dentry, d_u.d_alias); } else { -- 2.49.0 Instead check the refcount to see if the inode is alive. Signed-off-by: Josef Bacik --- fs/ext4/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 5b7a15db4953..7674c1f614b1 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -425,7 +425,7 @@ void ext4_check_map_extents_env(struct inode *inode) if (!S_ISREG(inode->i_mode) || IS_NOQUOTA(inode) || IS_VERITY(inode) || is_special_ino(inode->i_sb, inode->i_ino) || - (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) || + ((inode->i_state & I_NEW) || !refcount_read(&inode->i_count)) || ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) || ext4_verity_in_progress(inode)) return; -- 2.49.0 Now that we have the reference count to indicate live inodes, we can remove all of the uses of I_WILL_FREE and I_FREEING in fs-writeback.c and use the appropriate reference count checks. 
Signed-off-by: Josef Bacik --- fs/fs-writeback.c | 47 ++++++++++++++++++++++++++--------------------- 1 file changed, 26 insertions(+), 21 deletions(-) diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c index 789c4228412c..0ea6d0e86791 100644 --- a/fs/fs-writeback.c +++ b/fs/fs-writeback.c @@ -129,7 +129,7 @@ static bool inode_io_list_move_locked(struct inode *inode, { assert_spin_locked(&wb->list_lock); assert_spin_locked(&inode->i_lock); - WARN_ON_ONCE(inode->i_state & I_FREEING); + WARN_ON_ONCE(!refcount_read(&inode->i_count)); if (list_empty(&inode->i_io_list)) iobj_get(inode); @@ -314,7 +314,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode, { assert_spin_locked(&wb->list_lock); assert_spin_locked(&inode->i_lock); - WARN_ON_ONCE(inode->i_state & I_FREEING); + WARN_ON_ONCE(!refcount_read(&inode->i_count)); inode->i_state &= ~I_SYNC_QUEUED; if (wb != &wb->bdi->wb) @@ -415,10 +415,10 @@ static bool inode_do_switch_wbs(struct inode *inode, xa_lock_irq(&mapping->i_pages); /* - * Once I_FREEING or I_WILL_FREE are visible under i_lock, the eviction - * path owns the inode and we shouldn't modify ->i_io_list. + * Once the refcount is 0, the eviction path owns the inode and we + * shouldn't modify ->i_io_list. 
*/ - if (unlikely(inode->i_state & (I_FREEING | I_WILL_FREE))) + if (unlikely(!refcount_read(&inode->i_count))) goto skip_switch; trace_inode_switch_wbs(inode, old_wb, new_wb); @@ -570,13 +570,16 @@ static bool inode_prepare_wbs_switch(struct inode *inode, /* while holding I_WB_SWITCH, no one else can update the association */ spin_lock(&inode->i_lock); if (!(inode->i_sb->s_flags & SB_ACTIVE) || - inode->i_state & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) || + inode->i_state & I_WB_SWITCH || inode_to_wb(inode) == new_wb) { spin_unlock(&inode->i_lock); return false; } + if (!inode_tryget(inode)) { + spin_unlock(&inode->i_lock); + return false; + } inode->i_state |= I_WB_SWITCH; - __iget(inode); spin_unlock(&inode->i_lock); return true; @@ -1207,7 +1210,7 @@ static void inode_cgwb_move_to_attached(struct inode *inode, { assert_spin_locked(&wb->list_lock); assert_spin_locked(&inode->i_lock); - WARN_ON_ONCE(inode->i_state & I_FREEING); + WARN_ON_ONCE(!refcount_read(&inode->i_count)); inode->i_state &= ~I_SYNC_QUEUED; inode_delete_from_io_list(inode); @@ -1405,7 +1408,7 @@ static void redirty_tail_locked(struct inode *inode, struct bdi_writeback *wb) * tracking. Flush worker will ignore this inode anyway and it will * trigger assertions in inode_io_list_move_locked(). */ - if (inode->i_state & I_FREEING) { + if (!refcount_read(&inode->i_count)) { inode_delete_from_io_list(inode); wb_io_lists_depopulated(wb); return; @@ -1621,7 +1624,7 @@ static void requeue_inode(struct inode *inode, struct bdi_writeback *wb, struct writeback_control *wbc, unsigned long dirtied_before) { - if (inode->i_state & I_FREEING) + if (!refcount_read(&inode->i_count)) return; /* @@ -1787,7 +1790,7 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc) * whether it is a data-integrity sync (%WB_SYNC_ALL) or not (%WB_SYNC_NONE). 
* * To prevent the inode from going away, either the caller must have a reference - * to the inode, or the inode must have I_WILL_FREE or I_FREEING set. + * to the inode, or the inode must have a zero refcount. */ static int writeback_single_inode(struct inode *inode, struct writeback_control *wbc) @@ -1797,9 +1800,7 @@ static int writeback_single_inode(struct inode *inode, spin_lock(&inode->i_lock); if (!refcount_read(&inode->i_count)) - WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING))); - else - WARN_ON(inode->i_state & I_WILL_FREE); + WARN_ON(inode->i_state & I_NEW); if (inode->i_state & I_SYNC) { /* @@ -1837,7 +1838,7 @@ static int writeback_single_inode(struct inode *inode, * If the inode is freeing, its i_io_list shoudn't be updated * as it can be finally deleted at this moment. */ - if (!(inode->i_state & I_FREEING)) { + if (refcount_read(&inode->i_count)) { /* * If the inode is now fully clean, then it can be safely * removed from its writeback list (if any). Otherwise the @@ -1957,7 +1958,7 @@ static long writeback_sb_inodes(struct super_block *sb, * kind writeout is handled by the freer. 
*/ spin_lock(&inode->i_lock); - if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) { + if ((inode->i_state & I_NEW) || !refcount_read(&inode->i_count)) { redirty_tail_locked(inode, wb); spin_unlock(&inode->i_lock); continue; @@ -2615,7 +2616,7 @@ void __mark_inode_dirty(struct inode *inode, int flags) if (inode_unhashed(inode)) goto out_unlock; } - if (inode->i_state & I_FREEING) + if (!refcount_read(&inode->i_count)) goto out_unlock; /* @@ -2729,13 +2730,17 @@ static void wait_sb_inodes(struct super_block *sb) spin_unlock_irq(&sb->s_inode_wblist_lock); spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { + if (inode->i_state & I_NEW) { + spin_unlock(&inode->i_lock); + spin_lock_irq(&sb->s_inode_wblist_lock); + continue; + } + + if (!inode_tryget(inode)) { spin_unlock(&inode->i_lock); - spin_lock_irq(&sb->s_inode_wblist_lock); continue; } - __iget(inode); /* * We could have potentially ended up on the cached LRU list, so @@ -2886,7 +2891,7 @@ EXPORT_SYMBOL(sync_inodes_sb); * This function commits an inode to disk immediately if it is dirty. This is * primarily needed by knfsd. * - * The caller must either have a ref on the inode or must have set I_WILL_FREE. + * The caller must either have a ref on the inode or the inode must have a zero refcount. */ int write_inode_now(struct inode *inode, int sync) { -- 2.49.0 Now that we have the reference count to check if the inode is live, use that instead of checking I_WILL_FREE|I_FREEING. 
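The writeback changes above lean on the put-side invariant that only one caller ever drops i_count from 1 to 0, and that caller owns teardown; any other path reading a zero i_count knows eviction is already in progress. A minimal userspace sketch (toy_* names are illustrative stand-ins, not kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy stand-in for an inode; 'evicted' marks completed teardown. */
struct toy_inode {
	atomic_int i_count;
	int evicted;
};

/* Mimics refcount_dec_and_test(): exactly one caller -- the one that
 * drops the count from 1 to 0 -- sees a true return. */
static int toy_dec_and_test(atomic_int *r)
{
	return atomic_fetch_sub(r, 1) == 1;
}

/* Sketch of the iput()-style rule: the final dropper performs
 * eviction; everyone else merely releases their reference. */
static void toy_put(struct toy_inode *inode)
{
	if (toy_dec_and_test(&inode->i_count))
		inode->evicted = 1;
}
```

Because the zero count is reached exactly once, checks like "refcount_read(&inode->i_count) == 0" in the flusher unambiguously mean "the eviction path owns this inode now".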
Signed-off-by: Josef Bacik --- fs/gfs2/ops_fstype.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c index c770006f8889..2b481fdc903d 100644 --- a/fs/gfs2/ops_fstype.c +++ b/fs/gfs2/ops_fstype.c @@ -1745,17 +1745,26 @@ static void gfs2_evict_inodes(struct super_block *sb) struct gfs2_sbd *sdp = sb->s_fs_info; set_bit(SDF_EVICTING, &sdp->sd_flags); - +again: spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { spin_lock(&inode->i_lock); - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) && - !need_resched()) { + if ((inode->i_state & I_NEW) && !need_resched()) { spin_unlock(&inode->i_lock); continue; } - __iget(inode); spin_unlock(&inode->i_lock); + + if (!igrab(inode)) { + if (need_resched()) { + spin_unlock(&sb->s_inode_list_lock); + iput(toput_inode); + toput_inode = NULL; + cond_resched(); + goto again; + } + continue; + } spin_unlock(&sb->s_inode_list_lock); iput(toput_inode); -- 2.49.0 We can use the reference count to see if the inode is live. Signed-off-by: Josef Bacik --- fs/quota/dquot.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c index df4a9b348769..90e69653c261 100644 --- a/fs/quota/dquot.c +++ b/fs/quota/dquot.c @@ -1030,14 +1030,16 @@ static int add_dquot_ref(struct super_block *sb, int type) spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { spin_lock(&inode->i_lock); - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) || + if ((inode->i_state & I_NEW) || !atomic_read(&inode->i_writecount) || !dqinit_needed(inode, type)) { spin_unlock(&inode->i_lock); continue; } - __iget(inode); spin_unlock(&inode->i_lock); + + if (!igrab(inode)) + continue; spin_unlock(&sb->s_inode_list_lock); #ifdef CONFIG_QUOTA_DEBUG -- 2.49.0 We can now just use igrab() to make sure we've got a live inode and remove the I_WILL_FREE|I_FREEING checks. 
Signed-off-by: Josef Bacik --- fs/notify/fsnotify.c | 26 ++++---------------------- 1 file changed, 4 insertions(+), 22 deletions(-) diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c index 0883696f873d..25996ad2a130 100644 --- a/fs/notify/fsnotify.c +++ b/fs/notify/fsnotify.c @@ -46,33 +46,15 @@ static void fsnotify_unmount_inodes(struct super_block *sb) spin_lock(&sb->s_inode_list_lock); list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { - /* - * We cannot __iget() an inode in state I_FREEING, - * I_WILL_FREE, or I_NEW which is fine because by that point - * the inode cannot have any associated watches. - */ spin_lock(&inode->i_lock); - if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) { + if (inode->i_state & I_NEW) { spin_unlock(&inode->i_lock); continue; } - - /* - * If i_count is zero, the inode cannot have any watches and - * doing an __iget/iput with SB_ACTIVE clear would actually - * evict all inodes with zero i_count from icache which is - * unnecessarily violent and may in fact be illegal to do. - * However, we should have been called /after/ evict_inodes - * removed all zero refcount inodes, in any case. Test to - * be sure. - */ - if (!refcount_read(&inode->i_count)) { - spin_unlock(&inode->i_lock); - continue; - } - - __iget(inode); spin_unlock(&inode->i_lock); + + if (!igrab(inode)) + continue; spin_unlock(&sb->s_inode_list_lock); iput(iput_inode); -- 2.49.0 We can simply use the reference count to see if this inode is alive. Signed-off-by: Josef Bacik --- fs/xfs/xfs_bmap_util.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 06ca11731e43..a32b5e9f0183 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -514,7 +514,7 @@ xfs_can_free_eofblocks( * Caller must either hold the exclusive io lock; or be inactivating * the inode, which guarantees there are no other users of the inode. 
*/ - if (!(VFS_I(ip)->i_state & I_FREEING)) + if (refcount_read(&VFS_I(ip)->i_count)) xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL); /* prealloc/delalloc exists only on regular files */ -- 2.49.0 We have the reference count that we can use to see if the inode is alive. Signed-off-by: Josef Bacik --- security/landlock/fs.c | 22 ++++------------------ 1 file changed, 4 insertions(+), 18 deletions(-) diff --git a/security/landlock/fs.c b/security/landlock/fs.c index 570f851dc469..fc7e577b56e1 100644 --- a/security/landlock/fs.c +++ b/security/landlock/fs.c @@ -1280,23 +1280,8 @@ static void hook_sb_delete(struct super_block *const sb) list_for_each_entry(inode, &sb->s_inodes, i_sb_list) { struct landlock_object *object; - /* Only handles referenced inodes. */ - if (!refcount_read(&inode->i_count)) - continue; - - /* - * Protects against concurrent modification of inode (e.g. - * from get_inode_object()). - */ spin_lock(&inode->i_lock); - /* - * Checks I_FREEING and I_WILL_FREE to protect against a race - * condition when release_inode() just called iput(), which - * could lead to a NULL dereference of inode->security or a - * second call to iput() for the same Landlock object. Also - * checks I_NEW because such inode cannot be tied to an object. - */ - if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) { + if (inode->i_state & I_NEW) { spin_unlock(&inode->i_lock); continue; } @@ -1308,10 +1293,11 @@ static void hook_sb_delete(struct super_block *const sb) spin_unlock(&inode->i_lock); continue; } - /* Keeps a reference to this inode until the next loop walk. */ - __iget(inode); spin_unlock(&inode->i_lock); + if (!igrab(inode)) + continue; + /* * If there is no concurrent release_inode() ongoing, then we * are in charge of calling iput() on this inode, otherwise we -- 2.49.0 We don't need the I_WILL_FREE|I_FREEING check, we can use the refcount to see if the inode is valid. 
Signed-off-by: Josef Bacik --- include/linux/fs.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/fs.h b/include/linux/fs.h index b731224708be..9d9acbea6433 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2644,8 +2644,8 @@ static inline void mark_inode_dirty_sync(struct inode *inode) */ static inline bool inode_is_dirtytime_only(struct inode *inode) { - return (inode->i_state & (I_DIRTY_TIME | I_NEW | - I_FREEING | I_WILL_FREE)) == I_DIRTY_TIME; + return (inode->i_state & (I_DIRTY_TIME | I_NEW)) == I_DIRTY_TIME && + refcount_read(&inode->i_count); } extern void inc_nlink(struct inode *inode); -- 2.49.0 We do not need the BUG_ON in our evict truncate code, and in the invalidate path we can simply check for the i_count == 0 to see if we're evicting the inode. Signed-off-by: Josef Bacik --- fs/btrfs/inode.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 03fc3cbdb4af..191ded69ab38 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -5338,7 +5338,6 @@ static void evict_inode_truncate_pages(struct inode *inode) struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; struct rb_node *node; - ASSERT(inode->i_state & I_FREEING); truncate_inode_pages_final(&inode->i_data); btrfs_drop_extent_map_range(BTRFS_I(inode), 0, (u64)-1, false); @@ -7448,7 +7447,7 @@ static void btrfs_invalidate_folio(struct folio *folio, size_t offset, u64 page_start = folio_pos(folio); u64 page_end = page_start + folio_size(folio) - 1; u64 cur; - int inode_evicting = inode->vfs_inode.i_state & I_FREEING; + int inode_evicting = refcount_read(&inode->vfs_inode.i_count) == 0; /* * We have folio locked so no new ordered extent can be created on this -- 2.49.0 Instead of checking I_FREEING, simply check the i_count reference to see if this inode is going away. 
Signed-off-by: Josef Bacik --- fs/ext4/inode.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 7674c1f614b1..3950e19cf862 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -199,8 +199,8 @@ void ext4_evict_inode(struct inode *inode) * For inodes with journalled data, transaction commit could have * dirtied the inode. And for inodes with dioread_nolock, unwritten * extents converting worker could merge extents and also have dirtied - * the inode. Flush worker is ignoring it because of I_FREEING flag but - * we still need to remove the inode from the writeback lists. + * the inode. Flush worker is ignoring it because of the 0 i_count + * but we still need to remove the inode from the writeback lists. */ if (!list_empty_careful(&inode->i_io_list)) inode_io_list_del(inode); @@ -4581,7 +4581,7 @@ int ext4_truncate(struct inode *inode) * or it's a completely new inode. In those cases we might not * have i_rwsem locked because it's not necessary. */ - if (!(inode->i_state & (I_NEW|I_FREEING))) + if (!(inode->i_state & I_NEW) && refcount_read(&inode->i_count) > 0) WARN_ON(!inode_is_locked(inode)); trace_ext4_truncate_enter(inode); -- 2.49.0 We can use the i_count refcount to see if this inode is being freed. Signed-off-by: Josef Bacik --- fs/ext4/orphan.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/ext4/orphan.c b/fs/ext4/orphan.c index 524d4658fa40..40d3eac7b003 100644 --- a/fs/ext4/orphan.c +++ b/fs/ext4/orphan.c @@ -107,7 +107,8 @@ int ext4_orphan_add(handle_t *handle, struct inode *inode) if (!sbi->s_journal || is_bad_inode(inode)) return 0; - WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + WARN_ON_ONCE(!(inode->i_state & I_NEW) && + refcount_read(&inode->i_count) > 0 && !inode_is_locked(inode)); /* * Inode orphaned in orphan file or in orphan list?
@@ -236,7 +237,8 @@ int ext4_orphan_del(handle_t *handle, struct inode *inode) if (!sbi->s_journal && !(sbi->s_mount_state & EXT4_ORPHAN_FS)) return 0; - WARN_ON_ONCE(!(inode->i_state & (I_NEW | I_FREEING)) && + WARN_ON_ONCE(!(inode->i_state & I_NEW) && + refcount_read(&inode->i_count) > 0 && !inode_is_locked(inode)); if (ext4_test_inode_state(inode, EXT4_STATE_ORPHAN_FILE)) return ext4_orphan_file_del(handle, inode); -- 2.49.0 Remove the I_FREEING and I_CLEAR check in PNFS and replace it with an i_count reference check, which will indicate that the inode is going away. Signed-off-by: Josef Bacik --- fs/nfs/pnfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c index a3135b5af7ee..042a5f152068 100644 --- a/fs/nfs/pnfs.c +++ b/fs/nfs/pnfs.c @@ -317,7 +317,7 @@ pnfs_put_layout_hdr(struct pnfs_layout_hdr *lo) WARN_ONCE(1, "NFS: BUG unfreed layout segments.\n"); pnfs_detach_layout_hdr(lo); /* Notify pnfs_destroy_layout_final() that we're done */ - if (inode->i_state & (I_FREEING | I_CLEAR)) + if (refcount_read(&inode->i_count) == 0) wake_up_var_locked(lo, &inode->i_lock); spin_unlock(&inode->i_lock); pnfs_free_layout_hdr(lo); -- 2.49.0 Now that we have the i_count reference count rules set so that we only go into these evict paths with a 0 count, update the sanity checks to check that instead of I_FREEING.
Signed-off-by: Josef Bacik --- fs/inode.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index 4e1eeb0c3889..f715504778d2 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -874,7 +874,7 @@ void clear_inode(struct inode *inode) */ xa_unlock_irq(&inode->i_data.i_pages); BUG_ON(!list_empty(&inode->i_data.i_private_list)); - BUG_ON(!(inode->i_state & I_FREEING)); + BUG_ON(refcount_read(&inode->i_count) != 0); BUG_ON(inode->i_state & I_CLEAR); BUG_ON(!list_empty(&inode->i_wb_list)); /* don't need i_lock here, no concurrent mods to i_state */ @@ -887,19 +887,19 @@ EXPORT_SYMBOL(clear_inode); * to. We remove any pages still attached to the inode and wait for any IO that * is still in progress before finally destroying the inode. * - * An inode must already be marked I_FREEING so that we avoid the inode being + * An inode must already have an i_count of 0 so that we avoid the inode being * moved back onto lists if we race with other code that manipulates the lists * (e.g. writeback_single_inode). The caller is responsible for setting this. * * An inode must already be removed from the LRU list before being evicted from - * the cache. This should occur atomically with setting the I_FREEING state - * flag, so no inodes here should ever be on the LRU when being evicted. + * the cache. This should always be the case as the LRU list holds an i_count + * reference on the inode, and we only evict inodes with an i_count of 0. */ static void evict(struct inode *inode) { const struct super_operations *op = inode->i_sb->s_op; - BUG_ON(!(inode->i_state & I_FREEING)); + BUG_ON(refcount_read(&inode->i_count) != 0); BUG_ON(!list_empty(&inode->i_lru)); if (!list_empty(&inode->i_io_list)) @@ -913,8 +913,8 @@ static void evict(struct inode *inode) /* * Wait for flusher thread to be done with the inode so that filesystem * does not start destroying it while writeback is still running. 
Since - * the inode has I_FREEING set, flusher thread won't start new work on - * the inode. We just have to wait for running writeback to finish. + * the inode has a 0 i_count, flusher thread won't start new work on the + * inode. We just have to wait for running writeback to finish. */ inode_wait_for_writeback(inode); spin_unlock(&inode->i_lock); -- 2.49.0 xfs is already using igrab(), which will do the correct thing; simply remove the references to these flags. Signed-off-by: Josef Bacik --- fs/xfs/scrub/common.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 2ef7742be7d3..b0bb40490493 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -1086,8 +1086,7 @@ xchk_install_handle_inode( /* * Install an already-referenced inode for scrubbing. Get our own reference to - * the inode to make disposal simpler. The inode must not be in I_FREEING or - * I_WILL_FREE state! + * the inode to make disposal simpler. The inode must be live. */ int xchk_install_live_inode( -- 2.49.0 This is a subtle behavior change. Before this change, ocfs2 kept this inode from being discovered and used during this window because I_WILL_FREE was set. However, we now call ->drop_inode() before we drop the last i_count refcount, so we could potentially race here with somebody else and grab a reference to this inode. This isn't bad; the inode is still live and concurrent accesses will be safe. But we could potentially end up writing this inode multiple times if there are concurrent accesses while we're trying to drop the inode.
Signed-off-by: Josef Bacik --- fs/ocfs2/inode.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c index 14bf440ea4df..d3c79d9a9635 100644 --- a/fs/ocfs2/inode.c +++ b/fs/ocfs2/inode.c @@ -1306,13 +1306,9 @@ int ocfs2_drop_inode(struct inode *inode) trace_ocfs2_drop_inode((unsigned long long)oi->ip_blkno, inode->i_nlink, oi->ip_flags); - assert_spin_locked(&inode->i_lock); - inode->i_state |= I_WILL_FREE; spin_unlock(&inode->i_lock); write_inode_now(inode, 1); spin_lock(&inode->i_lock); - WARN_ON(inode->i_state & I_NEW); - inode->i_state &= ~I_WILL_FREE; return 1; } -- 2.49.0 Now that we're using the i_count reference count as the ultimate arbiter of whether or not an inode is alive, we can remove the I_FREEING and I_WILL_FREE flags. Signed-off-by: Josef Bacik --- fs/inode.c | 8 ++------ include/linux/fs.h | 32 +++++++++++--------------------- include/trace/events/writeback.h | 2 -- 3 files changed, 13 insertions(+), 29 deletions(-) diff --git a/fs/inode.c b/fs/inode.c index f715504778d2..1bb528405b3d 100644 --- a/fs/inode.c +++ b/fs/inode.c @@ -878,7 +878,7 @@ void clear_inode(struct inode *inode) BUG_ON(inode->i_state & I_CLEAR); BUG_ON(!list_empty(&inode->i_wb_list)); /* don't need i_lock here, no concurrent mods to i_state */ - inode->i_state = I_FREEING | I_CLEAR; + inode->i_state = I_CLEAR; } EXPORT_SYMBOL(clear_inode); @@ -942,7 +942,7 @@ static void evict(struct inode *inode) * This also means we don't need any fences for the call below.
*/ inode_wake_up_bit(inode, __I_NEW); - BUG_ON(inode->i_state != (I_FREEING | I_CLEAR)); + BUG_ON(inode->i_state != I_CLEAR); } static void iput_evict(struct inode *inode); @@ -1975,7 +1975,6 @@ static void iput_final(struct inode *inode, bool drop) state = inode->i_state; if (!drop) { - WRITE_ONCE(inode->i_state, state | I_WILL_FREE); spin_unlock(&inode->i_lock); write_inode_now(inode, 1); @@ -1983,10 +1982,7 @@ static void iput_final(struct inode *inode, bool drop) spin_lock(&inode->i_lock); state = inode->i_state; WARN_ON(state & I_NEW); - state &= ~I_WILL_FREE; } - - WRITE_ONCE(inode->i_state, state | I_FREEING); spin_unlock(&inode->i_lock); evict(inode); diff --git a/include/linux/fs.h b/include/linux/fs.h index 9d9acbea6433..0599faef0d6a 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -672,8 +672,8 @@ is_uncached_acl(struct posix_acl *acl) * I_DIRTY_DATASYNC, I_DIRTY_PAGES, and I_DIRTY_TIME. * * Four bits define the lifetime of an inode. Initially, inodes are I_NEW, - * until that flag is cleared. I_WILL_FREE, I_FREEING and I_CLEAR are set at - * various stages of removing an inode. + * until that flag is cleared. I_CLEAR is set when the inode is clean and ready + * to be freed. * * Two bits are used for locking and completion notification, I_NEW and I_SYNC. * @@ -697,24 +697,18 @@ is_uncached_acl(struct posix_acl *acl) * New inodes set I_NEW. If two processes both create * the same inode, one of them will release its inode and * wait for I_NEW to be released before returning. - * Inodes in I_WILL_FREE, I_FREEING or I_CLEAR state can - * also cause waiting on I_NEW, without I_NEW actually - * being set. find_inode() uses this to prevent returning + * Inodes with an i_count == 0 or I_CLEAR state can also + * cause waiting on I_NEW, without I_NEW actually being + * set. find_inode() uses this to prevent returning * nearly-dead inodes. - * I_WILL_FREE Must be set when calling write_inode_now() if i_count - * is zero. 
I_FREEING must be set when I_WILL_FREE is - * cleared. - * I_FREEING Set when inode is about to be freed but still has dirty - * pages or buffers attached or the inode itself is still - * dirty. * I_CLEAR Added by clear_inode(). In this state the inode is - * clean and can be destroyed. Inode keeps I_FREEING. + * clean and can be destroyed. * - * Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are - * prohibited for many purposes. iget() must wait for - * the inode to be completely released, then create it - * anew. Other functions will just ignore such inodes, - * if appropriate. I_NEW is used for waiting. + * Inodes that have i_count == 0 or I_CLEAR are prohibited + * for many purposes. iget() must wait for the inode to be + * completely released, then create it anew. Other + * functions will just ignore such inodes, if appropriate. + * I_NEW is used for waiting. * * I_SYNC Writeback of inode is running. The bit is set during * data writeback, and cleared with a wakeup on the bit @@ -752,8 +746,6 @@ is_uncached_acl(struct posix_acl *acl) * I_CACHED_LRU Inode is cached because it is dirty or isn't shrinkable, * and thus is on the s_cached_inode_lru list. * - * Q: What is the difference between I_WILL_FREE and I_FREEING? - * * __I_{SYNC,NEW,LRU_ISOLATING} are used to derive unique addresses to wait * upon. There's one free address left. 
*/ @@ -776,8 +768,6 @@ enum inode_state_bits { INODE_BIT(I_DIRTY_SYNC), INODE_BIT(I_DIRTY_DATASYNC), INODE_BIT(I_DIRTY_PAGES), - INODE_BIT(I_WILL_FREE), - INODE_BIT(I_FREEING), INODE_BIT(I_CLEAR), INODE_BIT(I_REFERENCED), INODE_BIT(I_LINKABLE), diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h index 6949329c744a..58ee61f3d91d 100644 --- a/include/trace/events/writeback.h +++ b/include/trace/events/writeback.h @@ -15,8 +15,6 @@ {I_DIRTY_DATASYNC, "I_DIRTY_DATASYNC"}, \ {I_DIRTY_PAGES, "I_DIRTY_PAGES"}, \ {I_NEW, "I_NEW"}, \ - {I_WILL_FREE, "I_WILL_FREE"}, \ - {I_FREEING, "I_FREEING"}, \ {I_CLEAR, "I_CLEAR"}, \ {I_SYNC, "I_SYNC"}, \ {I_DIRTY_TIME, "I_DIRTY_TIME"}, \ -- 2.49.0 Now that we've made these changes to the inode, document the reference count rules in the vfs documentation. Signed-off-by: Josef Bacik --- Documentation/filesystems/vfs.rst | 23 +++++++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst index 229eb90c96f2..5bfe7863a5de 100644 --- a/Documentation/filesystems/vfs.rst +++ b/Documentation/filesystems/vfs.rst @@ -457,6 +457,29 @@ The Inode Object An inode object represents an object within the filesystem. +Reference counting rules +------------------------ + +The inode is reference counted in two distinct ways: an i_obj_count refcount and +an i_count refcount. These control two different lifetimes of the inode. The +i_obj_count is the simplest: think of it as a reference count on the object +itself. When the i_obj_count reaches zero, the inode is freed. Inode freeing +happens in RCU context, so the inode is not freed immediately, but rather +after a grace period. + +The i_count reference is the indicator that the inode is "alive". That is to +say, it is available for use by all the ways that a user can access the inode. +Once this count reaches zero, we begin the process of evicting the inode.
This +is where the final truncate of an unlinked inode will normally occur. Once +i_count has reached 0, only the final iput() is allowed to do things like +writeback, truncate, etc. All users that want to do this style of operation +must use igrab() or, in very rare and specific circumstances, use +inode_tryget(). + +Every access to an inode must include one of these two references. Generally +i_obj_count is reserved for internal VFS references, the s_inode_list for +example. All file systems should use igrab()/lookup() to get a live reference on +the inode, with very few exceptions. struct inode_operations ----------------------- -- 2.49.0