This makes it possible to create memory tiers using fake NUMA nodes
generated by NUMA emulation.

The new "numa_emulation.adistance" kernel parameter allows you to set
the abstract distance for each NUMA node.

For example, if the system is booted with the parameters
"numa=fake=2 numa_emulation.adistance=576,704", it will configure memory
tiers with node0 having the default DRAM adistance value and node1
having a higher adistance value, which places node1 in a lower (slower)
memory tier.

Signed-off-by: Akinobu Mita
---
 mm/numa_emulation.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mm/numa_emulation.c b/mm/numa_emulation.c
index 703c8fa05048..a4266da21344 100644
--- a/mm/numa_emulation.c
+++ b/mm/numa_emulation.c
@@ -6,6 +6,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include
 #include
 #include
@@ -344,6 +347,27 @@ static int __init setup_emu2phys_nid(int *dfl_phys_nid)
 	return max_emu_nid;
 }
 
+static int adistance[MAX_NUMNODES];
+module_param_array(adistance, int, NULL, 0400);
+MODULE_PARM_DESC(adistance, "Abstract distance values for each NUMA node");
+
+static int emu_calculate_adistance(struct notifier_block *self,
+				   unsigned long nid, void *data)
+{
+	if (adistance[nid]) {
+		int *adist = data;
+
+		*adist = adistance[nid];
+		return NOTIFY_STOP;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block emu_adist_nb = {
+	.notifier_call = emu_calculate_adistance,
+	.priority = INT_MIN,
+};
+
 /**
  * numa_emulation - Emulate NUMA nodes
  * @numa_meminfo: NUMA configuration to massage
@@ -532,6 +556,8 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 		}
 	}
 
+	register_mt_adistance_algorithm(&emu_adist_nb);
+
 	/* free the copied physical distance table */
 	memblock_free(phys_dist, phys_size);
 	return;
-- 
2.43.0

On systems with multiple memory tiers consisting of DRAM and CXL memory,
the OOM killer is not invoked properly.

Here's the command to reproduce:

$ sudo swapoff -a
$ stress-ng --oomable -v --memrate 20 --memrate-bytes 10G \
    --memrate-rd-mbs 1 --memrate-wr-mbs 1

The memory usage is the number of workers specified with the --memrate
option multiplied by the buffer size specified with the --memrate-bytes
option (20 x 10G = 200G in the command above), so please adjust it so
that it exceeds the total size of the installed DRAM and CXL memory.

If swap is disabled, you can usually expect the OOM killer to terminate
the stress-ng process when memory usage approaches the installed memory
size.

However, if multiple memory tiers exist (multiple
/sys/devices/virtual/memory_tiering/memory_tier directories exist),
/sys/kernel/mm/numa/demotion_enabled is true, and
/sys/kernel/mm/lru_gen/min_ttl_ms is 0, the OOM killer will not be
invoked and the system will become inoperable.

This issue can be reproduced using NUMA emulation even on systems with
only DRAM. You can create two fake memory tiers by booting a single-node
system with the "numa=fake=2 numa_emulation.adistance=576,704" kernel
parameters.

The reason for this issue is that if the target node for allocation has
a lower memory tier beneath it, it is always assumed that memory can be
reclaimed via demotion, even when the lower tier is itself full.

So this change avoids this issue by not attempting to demote if the
demotion target node has less free memory than the minimum watermark.

Signed-off-by: Akinobu Mita
---
 mm/vmscan.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index fddd168a9737..f4748f258294 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -356,7 +356,20 @@ static bool can_demote(int nid, struct scan_control *sc,
 		return false;
 
 	/* If demotion node isn't in the cgroup's mems_allowed, fall back */
-	return mem_cgroup_node_allowed(memcg, demotion_nid);
+	if (mem_cgroup_node_allowed(memcg, demotion_nid)) {
+		int z;
+		struct zone *zone;
+		struct pglist_data *pgdat = NODE_DATA(demotion_nid);
+		unsigned int highest_zoneidx = sc ? sc->reclaim_idx : MAX_NR_ZONES - 1;
+		int order = sc ? sc->order : 0;
+
+		for_each_managed_zone_pgdat(zone, pgdat, z, highest_zoneidx) {
+			if (zone_watermark_ok(zone, order, min_wmark_pages(zone),
+					      highest_zoneidx, 0))
+				return true;
+		}
+	}
+	return false;
 }
 
 static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg,
-- 
2.43.0
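
To illustrate the decision that the vmscan change introduces, here is a
minimal userspace sketch in C. It is not kernel code: struct model_zone,
struct model_node, model_zone_above_min() and model_can_demote() are
hypothetical names invented for this illustration. The watermark test is
reduced to a plain comparison of free pages against the minimum
watermark, whereas the kernel's zone_watermark_ok() also takes the
allocation order and lowmem reserves into account.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct model_zone {
	unsigned long free_pages;	/* pages currently free in the zone */
	unsigned long min_wmark;	/* minimum watermark, in pages */
	bool managed;			/* zone has managed pages */
};

/* Hypothetical stand-in for the demotion target node. */
struct model_node {
	const struct model_zone *zones;
	size_t nr_zones;
	bool allowed_by_cgroup;		/* models mem_cgroup_node_allowed() */
};

/*
 * Simplified watermark check (assumption: order-0, no lowmem reserve).
 * A zone passes only if it has strictly more free pages than its
 * minimum watermark.
 */
static bool model_zone_above_min(const struct model_zone *z)
{
	return z->managed && z->free_pages > z->min_wmark;
}

/*
 * Demotion is considered possible only if the target node is allowed
 * and at least one zone up to highest_zoneidx is above its minimum
 * watermark. Before the change, the cgroup check alone decided this.
 */
static bool model_can_demote(const struct model_node *target,
			     size_t highest_zoneidx)
{
	size_t i;

	if (!target->allowed_by_cgroup)
		return false;

	for (i = 0; i < target->nr_zones && i <= highest_zoneidx; i++) {
		if (model_zone_above_min(&target->zones[i]))
			return true;
	}
	return false;
}
```

With this model, a target node whose zones are all at or below their
minimum watermark reports that demotion is not possible, so reclaim no
longer treats the lower tier as an always-available outlet and the OOM
killer can be reached.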