Btrfs currently tracks its metadata pages in the page cache, using a fake inode (fs_info->btree_inode) with offsets corresponding to where the metadata is stored in the filesystem's full logical address space. A consequence of this is that when btrfs uses filemap_add_folio(), this usage is charged to the cgroup of whichever task happens to be running at the time. These folios don't belong to any particular user cgroup, so I don't think it makes much sense for them to be charged in that way. Some negative consequences as a result: - A task can be holding some important btrfs locks, then need to lookup some metadata and go into reclaim, extending the duration it holds that lock for, and unfairly pushing its own reclaim pain onto other cgroups. - If that cgroup goes into reclaim, it might reclaim these folios a different non-reclaiming cgroup might need soon. This is naturally offset by LRU reclaim, but still. We have two options for how to manage such file pages: 1. charge them to the root cgroup. 2. don't charge them to any cgroup at all. 2. breaks the invariant that every mapped page has a cgroup. This is workable, but unnecessarily risky. Therefore, go with 1. A very similar proposal to use the root cgroup was previously made by Qu, where he eventually proposed the idea of setting it per address_space. This makes good sense for the btrfs use case, as the behavior should apply to all use of the address_space, not select allocations. I.e., if someone adds another filemap_add_folio() call using btrfs's btree_inode, we would almost certainly want to account that to the root cgroup as well. Link: https://lore.kernel.org/linux-mm/b5fef5372ae454a7b6da4f2f75c427aeab6a07d6.1727498749.git.wqu@suse.com/ Suggested-by: Qu Wenruo Suggested-by: Shakeel Butt Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Boris Burkov --- include/linux/pagemap.h | 2 ++ mm/filemap.c | 6 ++++++ 2 files changed, 8 insertions(+) diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index c9ba69e02e3e..a3e16d74792f 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -211,6 +211,8 @@ enum mapping_flags { folio contents */ AS_INACCESSIBLE = 8, /* Do not attempt direct R/W access to the mapping */ AS_WRITEBACK_MAY_DEADLOCK_ON_RECLAIM = 9, + AS_KERNEL_FILE = 10, /* mapping for a fake kernel file that shouldn't + account usage to user cgroups */ /* Bits 16-25 are used for FOLIO_ORDER */ AS_FOLIO_ORDER_BITS = 5, AS_FOLIO_ORDER_MIN = 16, diff --git a/mm/filemap.c b/mm/filemap.c index e4a5a46db89b..05c1384bd611 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -960,8 +960,14 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, { void *shadow = NULL; int ret; + struct mem_cgroup *tmp; + bool kernel_file = test_bit(AS_KERNEL_FILE, &mapping->flags); + if (kernel_file) + tmp = set_active_memcg(root_mem_cgroup); ret = mem_cgroup_charge(folio, NULL, gfp); + if (kernel_file) + set_active_memcg(tmp); if (ret) return ret; -- 2.50.1 Kernel file pages are tricky to track because they are indistinguishable from files whose usage is accounted to the root cgroup. To maintain good accounting, introduce a vmstat counter tracking kernel file pages. Confirmed that these work as expected at a high level by mounting a btrfs using AS_KERNEL_FILE for metadata pages, and seeing the counter rise with fs usage then go back to a minimal level after drop_caches and finally down to 0 after unmounting the fs. Suggested-by: Shakeel Butt Acked-by: Shakeel Butt Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Boris Burkov --- include/linux/mmzone.h | 1 + mm/filemap.c | 7 +++++++ mm/vmstat.c | 1 + 3 files changed, 9 insertions(+) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fe13ad175fed..f3272ef5131b 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -259,6 +259,7 @@ enum node_stat_item { NR_HUGETLB, #endif NR_BALLOON_PAGES, + NR_KERNEL_FILE_PAGES, NR_VM_NODE_STAT_ITEMS }; diff --git a/mm/filemap.c b/mm/filemap.c index 05c1384bd611..344ab106c21c 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -190,6 +190,9 @@ static void filemap_unaccount_folio(struct address_space *mapping, __lruvec_stat_mod_folio(folio, NR_FILE_THPS, -nr); filemap_nr_thps_dec(mapping); } + if (test_bit(AS_KERNEL_FILE, &folio->mapping->flags)) + mod_node_page_state(folio_pgdat(folio), + NR_KERNEL_FILE_PAGES, -nr); /* * At this point folio must be either written or cleaned by @@ -989,6 +992,10 @@ int filemap_add_folio(struct address_space *mapping, struct folio *folio, if (!(gfp & __GFP_WRITE) && shadow) workingset_refault(folio, shadow); folio_add_lru(folio); + if (kernel_file) + mod_node_page_state(folio_pgdat(folio), + NR_KERNEL_FILE_PAGES, + folio_nr_pages(folio)); } return ret; } diff --git a/mm/vmstat.c b/mm/vmstat.c index e74f0b2a1021..e522decf6a72 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1290,6 +1290,7 @@ const char * const vmstat_text[] = { [I(NR_HUGETLB)] = "nr_hugetlb", #endif [I(NR_BALLOON_PAGES)] = "nr_balloon_pages", + [I(NR_KERNEL_FILE_PAGES)] = "nr_kernel_file_pages", #undef I /* system-wide enum vm_stat_item counters */ -- 2.50.1 extent_buffers are global and shared so their pages should not belong to any particular cgroup (currently whichever cgroups happens to allocate the extent_buffer). Btrfs tree operations should not arbitrarily block on cgroup reclaim or have the shared extent_buffer pages on a cgroup's reclaim lists. Acked-by: Shakeel Butt Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Boris Burkov --- fs/btrfs/disk-io.c | 1 + 1 file changed, 1 insertion(+) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 70fc4e7cc5a0..7fab5057cf8e 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1930,6 +1930,7 @@ static int btrfs_init_btree_inode(struct super_block *sb) BTRFS_I(inode)->root = btrfs_grab_root(fs_info->tree_root); set_bit(BTRFS_INODE_DUMMY, &BTRFS_I(inode)->runtime_flags); __insert_inode_hash(inode, hash); + set_bit(AS_KERNEL_FILE, &inode->i_mapping->flags); fs_info->btree_inode = inode; return 0; -- 2.50.1