Now that we've made these changes to the inode, document the reference count
rules in the vfs documentation.

Signed-off-by: Josef Bacik
---
 Documentation/filesystems/vfs.rst | 86 +++++++++++++++++++++++++++++++
 1 file changed, 86 insertions(+)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 229eb90c96f2..e285cf0499ab 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -457,6 +457,92 @@ The Inode Object
 
 An inode object represents an object within the filesystem.
 
+Reference counting rules
+------------------------
+
+The inode is reference counted in two distinct ways: an i_obj_count refcount and
+an i_count refcount. These control two different lifetimes of the inode. The
+i_obj_count is the simplest; think of it as a reference count on the object
+itself. When the i_obj_count reaches zero, the inode is freed. Inode freeing
+happens in the RCU context, so the inode is not freed immediately, but rather
+after a grace period.
+
+The i_count reference is the indicator that the inode is "alive". That is to
+say, it is available for use by all the ways that a user can access the inode.
+Once this count reaches zero, we begin the process of evicting the inode. This
+is where the final truncate of an unlinked inode will normally occur. Once
+i_count has reached 0, only the final iput() is allowed to do things like
+writeback, truncate, etc. All users that want to do this style of operation
+must use igrab() or, in very rare and specific circumstances, use
+inode_tryget().
+
+Every access to an inode must include one of these two references. Generally
+i_obj_count is reserved for internal VFS references, the s_inode_list for
+example. All file systems should use igrab()/lookup() to get a live reference on
+the inode, with very few exceptions.
+
+LRU rules
+---------
+
+This is tightly coupled with the reference counting rules above. If the inode is
+being held on an LRU it must be holding both an i_count and an i_obj_count
+reference. This is because we need the inode to be "live" while it is on the LRU
+so it can be accessed again in the future.
+
+This is different from how we traditionally operated. Traditionally we put 0
+refcount objects on the LRU, and then when eviction happened we would remove the
+inode from the LRU if it had a non-zero refcount, or evict it if it had a zero
+refcount.
+
+Now the rules are much simpler. The LRU has a live reference on the inode. That
+means that eviction simply has to remove the inode from the LRU and call
+iput_evict(), which will make sure the inode is not re-added to the LRU when
+putting the reference. If there are other active references to the inode, then
+when those references are dropped the inode will be added back to the LRU.
+
+We have two uses for i_lru: one is for the normal inactive inode LRU, and the
+other is for inodes that are pinned because they are dirty or because
+they have pagecache attached to them.
+
+The dirty case is easy to reason about. If the inode is dirty we cannot reclaim
+it until it has been written back. The inode gets added to the super block's
+cached inode list when it is dirty, and removed when it is clean.
+
+The pagecache case is a little more complex. The VM wants to pin inodes into
+memory as long as they have pagecache. This is because the pagecache has much
+better reclaim logic: it accounts for thrashing and refaulting, so it needs to
+be the ultimate arbiter of when an inode can be reclaimed. The inode remains on
+the cached list as long as it has pagecache to account for this. When pages are
+removed from the inode the VM calls inode_add_lru() to see if the inode still
+needs to be on the cached list or on the inactive LRU.
+
+Holding a live reference on the inode has one drawback. We must remove the inode
+from the LRU in more cases than previously, which can increase contention on the
+LRU. In practice this won't be a problem, because we only put an inode on the
+LRU when it doesn't have a dentry associated with it. When we grab a live
+reference to an inode we must delete it from the LRU in order to make sure that
+any unlink operation results in the inode being removed on the final iput().
+
+Consider the case where we've removed the last dentry from an inode and the
+inode is added to the LRU list. We then look up the inode to do an unlink. The
+final iput() in the unlink path will just reduce the i_count to 1, and the inode
+will not be truly removed until eviction or unmount. To avoid this we have two
+choices: make sure we delete the inode from the LRU at
+drop_nlink()/clear_nlink() time, or make sure we delete the inode from the LRU
+when we grab a live reference to it. We cannot do the drop at
+drop_nlink()/clear_nlink() time because we could be holding the i_lock.
+Additionally there are awkward things like BTRFS subvolume delete that do not
+use the nlink of the subvolume as the indicator that it needs to be removed, and
+so we would have to audit all of the possible unlink paths to make sure we
+properly deleted the inode from the LRU. Instead, to provide a more robust
+system, we remove an inode from the LRU at igrab() time. Internally, where we're
+already holding the i_lock and use inode_tryget() we will delete the inode from
+the LRU at this point.
+
+The other case is in the unlink path itself. If there was a truncate at all we
+could have ended up on the cached list, so we already have an elevated i_count.
+Removing the inode from the LRU explicitly at this stage is necessary to make
+sure the inode is freed as soon as possible.
 
 struct inode_operations
 -----------------------
-- 
2.49.0