From: Hongru Zhang

On mobile devices, some user-space memory management components check
memory pressure and fragmentation status periodically or via PSI, and
take actions such as killing processes or performing memory compaction
based on this information.

Under high load scenarios, reading /proc/pagetypeinfo causes memory
management components or memory allocation/free paths to be blocked
for extended periods waiting for the zone lock, leading to the
following issues:

1. Long interrupt-disabled spinlocks - occasionally exceeding 10ms on
   Qcom 8750 platforms, reducing system real-time performance
2. Memory management components being blocked for extended periods,
   preventing rapid acquisition of memory fragmentation information for
   critical memory management decisions and actions
3. Increased latency in memory allocation and free paths due to
   prolonged zone lock contention

This patch adds per-migratetype counts to the buddy allocator in
preparation for optimizing /proc/pagetypeinfo access.

The optimized implementation:
- Make per-migratetype count updates protected by zone lock on the
  write side while /proc/pagetypeinfo reads are lock-free, which
  reduces interrupt-disabled spinlock duration and improves system
  real-time performance (addressing issue #1)
- Reduce blocking time for memory management components when reading
  /proc/pagetypeinfo, enabling more rapid acquisition of memory
  fragmentation information (addressing issue #2)
- Minimize the critical section held during /proc/pagetypeinfo reads
  to reduce zone lock contention on memory allocation and free paths
  (addressing issue #3)

The main overhead is a slight increase in latency on the memory
allocation and free paths due to additional per-migratetype counting,
with theoretically minimal impact on overall performance.

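The locking scheme described above (writers update the counters with
the lock held, readers sample them without it) can be sketched in user
space as follows. This is only an illustration under stated assumptions,
not the kernel code: the sketch_* names, SKETCH_MIGRATE_TYPES and the
atomic_flag spinlock are hypothetical stand-ins for struct free_area and
zone->lock, and READ_ONCE()/WRITE_ONCE() are simplified volatile-cast
imitations of the kernel macros (GCC/Clang only).

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified user-space stand-ins for the kernel macros (assumption). */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

#define SKETCH_MIGRATE_TYPES 4	/* hypothetical, not the kernel's MIGRATE_TYPES */

struct sketch_free_area {
	atomic_flag lock;	/* stands in for zone->lock */
	unsigned long nr_free;
	unsigned long mt_nr_free[SKETCH_MIGRATE_TYPES];
};

static void sketch_init(struct sketch_free_area *a)
{
	atomic_flag_clear(&a->lock);
	a->nr_free = 0;
	for (int mt = 0; mt < SKETCH_MIGRATE_TYPES; mt++)
		a->mt_nr_free[mt] = 0;
}

static void sketch_lock(struct sketch_free_area *a)
{
	while (atomic_flag_test_and_set_explicit(&a->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void sketch_unlock(struct sketch_free_area *a)
{
	atomic_flag_clear_explicit(&a->lock, memory_order_release);
}

/* Write side: every counter update happens with the lock held. */
static void sketch_add(struct sketch_free_area *a, int mt)
{
	sketch_lock(a);
	a->nr_free++;
	WRITE_ONCE(a->mt_nr_free[mt], a->mt_nr_free[mt] + 1);
	sketch_unlock(a);
}

static void sketch_del(struct sketch_free_area *a, int mt)
{
	sketch_lock(a);
	a->nr_free--;
	WRITE_ONCE(a->mt_nr_free[mt], a->mt_nr_free[mt] - 1);
	sketch_unlock(a);
}

/* Moving a block between migratetypes leaves nr_free unchanged. */
static void sketch_move(struct sketch_free_area *a, int old_mt, int new_mt)
{
	sketch_lock(a);
	WRITE_ONCE(a->mt_nr_free[old_mt], a->mt_nr_free[old_mt] - 1);
	WRITE_ONCE(a->mt_nr_free[new_mt], a->mt_nr_free[new_mt] + 1);
	sketch_unlock(a);
}

/* Read side: no lock is taken, so the value may be slightly stale. */
static unsigned long sketch_read(struct sketch_free_area *a, int mt)
{
	return READ_ONCE(a->mt_nr_free[mt]);
}
```

The reader never waits for a writer, which is the property the
/proc/pagetypeinfo path relies on; the price is that two counters read
back to back need not belong to the same instant.
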
Signed-off-by: Hongru Zhang
---
 include/linux/mmzone.h | 1 +
 mm/mm_init.c           | 1 +
 mm/page_alloc.c        | 7 ++++++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7fb7331c5725..6eeefe6a3727 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -138,6 +138,7 @@ extern int page_group_by_mobility_disabled;
 struct free_area {
 	struct list_head	free_list[MIGRATE_TYPES];
 	unsigned long		nr_free;
+	unsigned long		mt_nr_free[MIGRATE_TYPES];
 };

 struct pglist_data;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7712d887b696..dca2be8cc3b1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1439,6 +1439,7 @@ static void __meminit zone_init_free_lists(struct zone *zone)
 	for_each_migratetype_order(order, t) {
 		INIT_LIST_HEAD(&zone->free_area[order].free_list[t]);
 		zone->free_area[order].nr_free = 0;
+		zone->free_area[order].mt_nr_free[t] = 0;
 	}

 #ifdef CONFIG_UNACCEPTED_MEMORY
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed82ee55e66a..9431073e7255 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -818,6 +818,7 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
+	area->mt_nr_free[migratetype]++;

 	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
 		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
@@ -840,6 +841,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 		     get_pageblock_migratetype(page), old_mt, nr_pages);

 	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
+	area->mt_nr_free[old_mt]--;
+	area->mt_nr_free[new_mt]++;

 	account_freepages(zone, -nr_pages, old_mt);
 	account_freepages(zone, nr_pages, new_mt);
@@ -855,6 +858,7 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 static inline void __del_page_from_free_list(struct page *page, struct zone *zone,
 					     unsigned int order, int migratetype)
 {
+	struct free_area *area = &zone->free_area[order];
 	int nr_pages = 1 << order;

 	VM_WARN_ONCE(get_pageblock_migratetype(page) != migratetype,
@@ -868,7 +872,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	list_del(&page->buddy_list);
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
-	zone->free_area[order].nr_free--;
+	area->nr_free--;
+	area->mt_nr_free[migratetype]--;

 	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
 		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
-- 
2.43.0

From: Hongru Zhang

This patch optimizes /proc/pagetypeinfo access by utilizing the
per-migratetype free page block counts already maintained by the buddy
allocator, instead of iterating through free lists under zone lock.

Accuracy. Both implementations have accuracy limitations. The previous
implementation required acquiring and releasing the zone lock for
counting each order and migratetype, making it potentially inaccurate.
Under high memory pressure, accuracy would further degrade due to zone
lock contention or fragmentation. The new implementation collects data
within a short time window, which helps maintain relatively small
errors, and is unaffected by memory pressure. Furthermore, user-space
memory management components inherently experience decision latency -
by the time they process the collected data and execute actions, the
memory state has already changed. This means that even perfectly
accurate data at collection time becomes stale by decision time.
Considering these factors, the accuracy trade-off introduced by the new
implementation should be acceptable for practical use cases, offering a
balance between performance and accuracy requirements.

Performance benefits:

System setup:
- 12th Gen Intel(R) Core(TM) i7-12700
- 1 NUMA node, 16G memory in total
- Turbo disabled
- cpufreq governor set to performance

1.
   Average latency over 10,000 /proc/pagetypeinfo accesses

+-----------------------+----------+------------+
|                       | no-patch | with-patch |
+-----------------------+----------+------------+
|    Just after boot    | 700.9 us |  268.6 us  |
+-----------------------+----------+------------+
| After building kernel |  28.7 ms |  269.8 us  |
+-----------------------+----------+------------+

2. Page alloc/free latency with concurrent /proc/pagetypeinfo access

Test setup:
- Using config-pagealloc-micro
- Monitor set to proc-pagetypeinfo, update frequency set to 10ms
- PAGEALLOC_ORDER_MIN=4, PAGEALLOC_ORDER_MAX=4

Without patch test results:

                                      vanilla                 vanilla
                                   no-monitor                 monitor
Min    alloc-odr4-1      8539.00 (     0.00%)    8762.00 (    -2.61%)
Min    alloc-odr4-2      6501.00 (     0.00%)    6683.00 (    -2.80%)
Min    alloc-odr4-4      5537.00 (     0.00%)    5873.00 (    -6.07%)
Min    alloc-odr4-8      5030.00 (     0.00%)    5361.00 (    -6.58%)
Min    alloc-odr4-16     4782.00 (     0.00%)    5162.00 (    -7.95%)
Min    alloc-odr4-32     5838.00 (     0.00%)    6499.00 (   -11.32%)
Min    alloc-odr4-64     6565.00 (     0.00%)    7413.00 (   -12.92%)
Min    alloc-odr4-128    6896.00 (     0.00%)    7898.00 (   -14.53%)
Min    alloc-odr4-256    7303.00 (     0.00%)    8163.00 (   -11.78%)
Min    alloc-odr4-512   10179.00 (     0.00%)   11985.00 (   -17.74%)
Min    alloc-odr4-1024  11000.00 (     0.00%)   12165.00 (   -10.59%)
Min    free-odr4-1        820.00 (     0.00%)    1230.00 (   -50.00%)
Min    free-odr4-2        511.00 (     0.00%)     952.00 (   -86.30%)
Min    free-odr4-4        347.00 (     0.00%)     434.00 (   -25.07%)
Min    free-odr4-8        286.00 (     0.00%)     399.00 (   -39.51%)
Min    free-odr4-16       250.00 (     0.00%)     405.00 (   -62.00%)
Min    free-odr4-32       294.00 (     0.00%)     405.00 (   -37.76%)
Min    free-odr4-64       333.00 (     0.00%)     363.00 (    -9.01%)
Min    free-odr4-128      340.00 (     0.00%)     412.00 (   -21.18%)
Min    free-odr4-256      339.00 (     0.00%)     329.00 (     2.95%)
Min    free-odr4-512      361.00 (     0.00%)     409.00 (   -13.30%)
Min    free-odr4-1024     300.00 (     0.00%)     361.00 (   -20.33%)
Stddev alloc-odr4-1         7.29 (     0.00%)      90.78 ( -1146.00%)
Stddev alloc-odr4-2         3.87 (     0.00%)      51.30 ( -1225.75%)
Stddev alloc-odr4-4         3.20 (     0.00%)      50.90 ( -1491.24%)
Stddev alloc-odr4-8         4.67 (     0.00%)      52.23 ( -1019.35%)
Stddev alloc-odr4-16        5.72 (     0.00%)      27.53 (  -381.04%)
Stddev alloc-odr4-32        6.25 (     0.00%)     641.23 (-10154.46%)
Stddev alloc-odr4-64        2.06 (     0.00%)     386.99 (-18714.22%)
Stddev alloc-odr4-128      14.36 (     0.00%)      52.39 (  -264.77%)
Stddev alloc-odr4-256      32.42 (     0.00%)     326.19 (  -906.05%)
Stddev alloc-odr4-512      65.58 (     0.00%)     184.49 (  -181.31%)
Stddev alloc-odr4-1024      8.88 (     0.00%)     153.01 ( -1622.67%)
Stddev free-odr4-1          2.29 (     0.00%)     152.27 ( -6549.85%)
Stddev free-odr4-2         10.99 (     0.00%)      73.10 (  -564.89%)
Stddev free-odr4-4          1.99 (     0.00%)      28.40 ( -1324.45%)
Stddev free-odr4-8          2.51 (     0.00%)      52.93 ( -2007.64%)
Stddev free-odr4-16         2.85 (     0.00%)      26.04 (  -814.88%)
Stddev free-odr4-32         4.04 (     0.00%)      27.05 (  -569.79%)
Stddev free-odr4-64         2.10 (     0.00%)      48.07 ( -2185.66%)
Stddev free-odr4-128        2.63 (     0.00%)      26.23 (  -897.86%)
Stddev free-odr4-256        6.29 (     0.00%)      37.04 (  -488.71%)
Stddev free-odr4-512        2.56 (     0.00%)      10.65 (  -315.28%)
Stddev free-odr4-1024       0.95 (     0.00%)       6.46 (  -582.22%)
Max    alloc-odr4-1      8564.00 (     0.00%)    9099.00 (    -6.25%)
Max    alloc-odr4-2      6511.00 (     0.00%)    6844.00 (    -5.11%)
Max    alloc-odr4-4      5549.00 (     0.00%)    6038.00 (    -8.81%)
Max    alloc-odr4-8      5045.00 (     0.00%)    5551.00 (   -10.03%)
Max    alloc-odr4-16     4800.00 (     0.00%)    5257.00 (    -9.52%)
Max    alloc-odr4-32     5861.00 (     0.00%)    8115.00 (   -38.46%)
Max    alloc-odr4-64     6571.00 (     0.00%)    8292.00 (   -26.19%)
Max    alloc-odr4-128    6930.00 (     0.00%)    8081.00 (   -16.61%)
Max    alloc-odr4-256    7372.00 (     0.00%)    9150.00 (   -24.12%)
Max    alloc-odr4-512   10333.00 (     0.00%)   12636.00 (   -22.29%)
Max    alloc-odr4-1024  11035.00 (     0.00%)   12590.00 (   -14.09%)
Max    free-odr4-1        828.00 (     0.00%)    1724.00 (  -108.21%)
Max    free-odr4-2        543.00 (     0.00%)    1192.00 (  -119.52%)
Max    free-odr4-4        354.00 (     0.00%)     519.00 (   -46.61%)
Max    free-odr4-8        293.00 (     0.00%)     617.00 (  -110.58%)
Max    free-odr4-16       260.00 (     0.00%)     483.00 (   -85.77%)
Max    free-odr4-32       308.00 (     0.00%)     488.00 (   -58.44%)
Max    free-odr4-64       341.00 (     0.00%)     505.00 (   -48.09%)
Max    free-odr4-128      346.00 (     0.00%)     497.00 (   -43.64%)
Max    free-odr4-256      353.00 (     0.00%)     463.00 (   -31.16%)
Max    free-odr4-512      367.00 (     0.00%)     442.00 (   -20.44%)
Max    free-odr4-1024     303.00 (     0.00%)     381.00 (   -25.74%)

With patch test results:

                                      patched                 patched
                                   no-monitor                 monitor
Min    alloc-odr4-1      8488.00 (     0.00%)    8514.00 (    -0.31%)
Min    alloc-odr4-2      6551.00 (     0.00%)    6527.00 (     0.37%)
Min    alloc-odr4-4      5536.00 (     0.00%)    5591.00 (    -0.99%)
Min    alloc-odr4-8      5008.00 (     0.00%)    5098.00 (    -1.80%)
Min    alloc-odr4-16     4760.00 (     0.00%)    4857.00 (    -2.04%)
Min    alloc-odr4-32     5827.00 (     0.00%)    5919.00 (    -1.58%)
Min    alloc-odr4-64     6561.00 (     0.00%)    6680.00 (    -1.81%)
Min    alloc-odr4-128    6898.00 (     0.00%)    7014.00 (    -1.68%)
Min    alloc-odr4-256    7311.00 (     0.00%)    7464.00 (    -2.09%)
Min    alloc-odr4-512   10181.00 (     0.00%)   10286.00 (    -1.03%)
Min    alloc-odr4-1024  11205.00 (     0.00%)   11725.00 (    -4.64%)
Min    free-odr4-1        789.00 (     0.00%)     867.00 (    -9.89%)
Min    free-odr4-2        490.00 (     0.00%)     526.00 (    -7.35%)
Min    free-odr4-4        350.00 (     0.00%)     360.00 (    -2.86%)
Min    free-odr4-8        272.00 (     0.00%)     287.00 (    -5.51%)
Min    free-odr4-16       247.00 (     0.00%)     254.00 (    -2.83%)
Min    free-odr4-32       298.00 (     0.00%)     304.00 (    -2.01%)
Min    free-odr4-64       334.00 (     0.00%)     325.00 (     2.69%)
Min    free-odr4-128      334.00 (     0.00%)     329.00 (     1.50%)
Min    free-odr4-256      336.00 (     0.00%)     336.00 (     0.00%)
Min    free-odr4-512      360.00 (     0.00%)     342.00 (     5.00%)
Min    free-odr4-1024     327.00 (     0.00%)     355.00 (    -8.56%)
Stddev alloc-odr4-1         5.19 (     0.00%)      45.38 (  -775.09%)
Stddev alloc-odr4-2         6.99 (     0.00%)      37.63 (  -437.98%)
Stddev alloc-odr4-4         3.91 (     0.00%)      17.85 (  -356.28%)
Stddev alloc-odr4-8         5.15 (     0.00%)       9.34 (   -81.47%)
Stddev alloc-odr4-16        3.83 (     0.00%)       5.34 (   -39.34%)
Stddev alloc-odr4-32        1.96 (     0.00%)      10.28 (  -425.09%)
Stddev alloc-odr4-64        1.32 (     0.00%)     333.30 (-25141.39%)
Stddev alloc-odr4-128       2.06 (     0.00%)       7.37 (  -258.28%)
Stddev alloc-odr4-256      15.56 (     0.00%)     113.48 (  -629.25%)
Stddev alloc-odr4-512      61.25 (     0.00%)     165.09 (  -169.53%)
Stddev alloc-odr4-1024     18.89 (     0.00%)       2.93 (    84.51%)
Stddev free-odr4-1          4.45 (     0.00%)      40.12 (  -800.98%)
Stddev free-odr4-2          1.50 (     0.00%)      29.30 ( -1850.31%)
Stddev free-odr4-4          1.27 (     0.00%)      19.49 ( -1439.40%)
Stddev free-odr4-8          0.97 (     0.00%)       8.93 (  -823.07%)
Stddev free-odr4-16         8.38 (     0.00%)       4.51 (    46.21%)
Stddev free-odr4-32         3.18 (     0.00%)       6.59 (  -107.42%)
Stddev free-odr4-64         2.40 (     0.00%)       3.09 (   -28.50%)
Stddev free-odr4-128        1.55 (     0.00%)       2.53 (   -62.92%)
Stddev free-odr4-256        0.41 (     0.00%)       2.80 (  -585.57%)
Stddev free-odr4-512        1.60 (     0.00%)       4.84 (  -202.08%)
Stddev free-odr4-1024       0.66 (     0.00%)       1.19 (   -80.68%)
Max    alloc-odr4-1      8505.00 (     0.00%)    8676.00 (    -2.01%)
Max    alloc-odr4-2      6572.00 (     0.00%)    6651.00 (    -1.20%)
Max    alloc-odr4-4      5552.00 (     0.00%)    5646.00 (    -1.69%)
Max    alloc-odr4-8      5024.00 (     0.00%)    5131.00 (    -2.13%)
Max    alloc-odr4-16     4774.00 (     0.00%)    4875.00 (    -2.12%)
Max    alloc-odr4-32     5834.00 (     0.00%)    5950.00 (    -1.99%)
Max    alloc-odr4-64     6565.00 (     0.00%)    7434.00 (   -13.24%)
Max    alloc-odr4-128    6907.00 (     0.00%)    7034.00 (    -1.84%)
Max    alloc-odr4-256    7347.00 (     0.00%)    7843.00 (    -6.75%)
Max    alloc-odr4-512   10315.00 (     0.00%)   10866.00 (    -5.34%)
Max    alloc-odr4-1024  11278.00 (     0.00%)   11733.00 (    -4.03%)
Max    free-odr4-1        803.00 (     0.00%)    1009.00 (   -25.65%)
Max    free-odr4-2        495.00 (     0.00%)     607.00 (   -22.63%)
Max    free-odr4-4        354.00 (     0.00%)     417.00 (   -17.80%)
Max    free-odr4-8        275.00 (     0.00%)     313.00 (   -13.82%)
Max    free-odr4-16       273.00 (     0.00%)     272.00 (     0.37%)
Max    free-odr4-32       309.00 (     0.00%)     324.00 (    -4.85%)
Max    free-odr4-64       340.00 (     0.00%)     335.00 (     1.47%)
Max    free-odr4-128      340.00 (     0.00%)     338.00 (     0.59%)
Max    free-odr4-256      338.00 (     0.00%)     346.00 (    -2.37%)
Max    free-odr4-512      364.00 (     0.00%)     359.00 (     1.37%)
Max    free-odr4-1024     329.00 (     0.00%)     359.00 (    -9.12%)

Signed-off-by: Hongru Zhang
---
 mm/page_alloc.c | 10 ++++++----
 mm/vmstat.c     | 30 +++++++-----------------------
 2 files changed, 13 insertions(+), 27 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9431073e7255..a90f2bf735f6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -818,7 +818,8 @@ static inline void __add_to_free_list(struct page *page, struct zone *zone,
 	else
 		list_add(&page->buddy_list, &area->free_list[migratetype]);
 	area->nr_free++;
-	area->mt_nr_free[migratetype]++;
+	WRITE_ONCE(area->mt_nr_free[migratetype],
+		area->mt_nr_free[migratetype] + 1);

 	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
 		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, nr_pages);
@@ -841,8 +842,8 @@ static inline void move_to_free_list(struct page *page, struct zone *zone,
 		     get_pageblock_migratetype(page), old_mt, nr_pages);

 	list_move_tail(&page->buddy_list, &area->free_list[new_mt]);
-	area->mt_nr_free[old_mt]--;
-	area->mt_nr_free[new_mt]++;
+	WRITE_ONCE(area->mt_nr_free[old_mt], area->mt_nr_free[old_mt] - 1);
+	WRITE_ONCE(area->mt_nr_free[new_mt], area->mt_nr_free[new_mt] + 1);

 	account_freepages(zone, -nr_pages, old_mt);
 	account_freepages(zone, nr_pages, new_mt);
@@ -873,7 +874,8 @@ static inline void __del_page_from_free_list(struct page *page, struct zone *zon
 	__ClearPageBuddy(page);
 	set_page_private(page, 0);
 	area->nr_free--;
-	area->mt_nr_free[migratetype]--;
+	WRITE_ONCE(area->mt_nr_free[migratetype],
+		area->mt_nr_free[migratetype] - 1);

 	if (order >= pageblock_order && !is_migrate_isolate(migratetype))
 		__mod_zone_page_state(zone, NR_FREE_PAGES_BLOCKS, -nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index bb09c032eecf..9334bbbe1e16 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1590,32 +1590,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 					zone->name,
 					migratetype_names[mtype]);
 		for (order = 0; order < NR_PAGE_ORDERS; ++order) {
-			unsigned long freecount = 0;
-			struct free_area *area;
-			struct list_head *curr;
+			unsigned long freecount;
 			bool overflow = false;

-			area = &(zone->free_area[order]);
-
-			list_for_each(curr, &area->free_list[mtype]) {
-				/*
-				 * Cap the free_list iteration because it might
-				 * be really large and we are under a spinlock
-				 * so a long time spent here could trigger a
-				 * hard lockup detector. Anyway this is a
-				 * debugging tool so knowing there is a handful
-				 * of pages of this order should be more than
-				 * sufficient.
-				 */
-				if (++freecount >= 100000) {
-					overflow = true;
-					break;
-				}
+			/* Keep the same output format for user-space tools compatibility */
+			freecount = READ_ONCE(zone->free_area[order].mt_nr_free[mtype]);
+			if (freecount >= 100000) {
+				overflow = true;
+				freecount = 100000;
 			}
 			seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
-			spin_unlock_irq(&zone->lock);
-			cond_resched();
-			spin_lock_irq(&zone->lock);
 		}
 		seq_putc(m, '\n');
 	}
@@ -1633,7 +1617,7 @@ static void pagetypeinfo_showfree(struct seq_file *m, void *arg)
 		seq_printf(m, "%6d ", order);
 	seq_putc(m, '\n');

-	walk_zones_in_node(m, pgdat, true, false, pagetypeinfo_showfree_print);
+	walk_zones_in_node(m, pgdat, true, true, pagetypeinfo_showfree_print);
 }

 static void pagetypeinfo_showblockcount_print(struct seq_file *m,
-- 
2.43.0

From: Hongru Zhang

Using per-migratetype counts instead of list_empty() helps save a few
CPU instructions.

Signed-off-by: Hongru Zhang
---
 mm/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 1561fc2ff5b8..7759f8fdf445 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,

 static inline bool free_area_empty(struct free_area *area, int migratetype)
 {
-	return list_empty(&area->free_list[migratetype]);
+	return !READ_ONCE(area->mt_nr_free[migratetype]);
 }

 /* mm/util.c */
-- 
2.43.0