On 2025-09-02 18:38, siddhartha@kenip.in wrote:
> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>>> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>>>>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>>>>>
>>>>> Hi Lorenzo, Dev, Mel,
>>>>>
>>>>> I'm following up on this patch submission from earlier this month:
>>>>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>>>>> inference workloads."
>>>>
>>>> I'm confused. That wasn't a patch submission, but reporting
>>>> performance results for my patch from late 2024? (and thanks for
>>>> those!)
>>>>
>>>> The patch was also already merged in late 2024:
>>>>
>>>> commit d4148aeab412432bf928f311eca8a2ba52bb05df
>>>> Author: Vlastimil Babka
>>>> Date:   Thu Oct 24 17:12:29 2024 +0200
>>>>
>>>>     mm, mmap: limit THP alignment of anonymous mappings to
>>>>     PMD-aligned sizes
>>>>
>>>> So there's nothing more to do here AFAIK.
>>>
>>> Hello Vlastimil,
>>>
>>> Hope you are doing great!
>>>
>>> Sorry about the late reply; my inbox somehow hid your email.
>>>
>>> Thank you for the clarification -- yes, I am aware that the "mm,
>>> mmap: limit THP alignment of anonymous mappings to PMD-aligned
>>> sizes" patch was merged in late 2024 (commit
>>> d4148aeab412432bf928f311eca8a2ba52bb05df).
>>>
>>> The performance results I shared were generated much later because
>>> of my working setup:
>>>
>>> * The tests were conducted on Intel Developer Cloud workloads as
>>>   part of a broader benchmarking exercise involving OpenVINO-based
>>>   inference pipelines.
>>>
>>> * The specific environment, dataset, and configuration scripts were
>>>   stored on an SSD that unfortunately suffered corruption. I am
>>>   working to recover them so I can share the exact test harness and
>>>   commit-specific diffs. If and when I regain that access from Intel
>>>   Developer Cloud, I will provide all the relevant files.
>>>
>>> Although this is not a new patch submission, I thought the numbers
>>> might still be valuable -- they show notable throughput and latency
>>> changes when aligning the current behavior with OpenVINO's large
>>> contiguous allocation preferences in certain inference scenarios.
>>>
>>> Summary of observed improvements:
>>>
>>> * Throughput: +7.3% average increase in model inference throughput
>>>   on ResNet-50 with mixed batch sizes (64/128)
>>>
>>> * Latency: -5.1% average reduction in P99 latency under synthetic
>>>   concurrent load (10 inference streams)
>>>
>>> * System impact: lower minor page fault count during sustained
>>>   load, with slightly reduced RSS fluctuation
>>>
>>> While the merged patch improves the default alignment, our tests
>>> indicate there may be headroom for further tuning in specific
>>> HPC/AI workloads -- particularly if hugepage alignment were applied
>>> selectively based on allocation size and workload profile rather
>>> than strictly PMD-aligned sizes (a rough sketch follows at the end
>>> of this mail). I have also been preparing specifics and pseudo-diffs
>>> against the current Linux code, which I can send via git send-email.
>>>
>>> I'd be happy to collaborate on a deeper investigation once I recover
>>> the original scripts -- or I can try to replicate the environment on
>>> a fresh setup and collect new diffs for comparison.
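>>>
>>> To make the selective-alignment idea concrete, here is a minimal,
>>> untested sketch. The helper name and the 16 * PMD_SIZE threshold
>>> are illustrative assumptions on my part, not part of the merged
>>> patch:
>>>
>>> ```c
>>> /*
>>>  * Illustrative sketch only. The merged patch PMD-aligns anonymous
>>>  * mappings only when len is a multiple of PMD_SIZE; a
>>>  * workload-aware variant could additionally accept large requests
>>>  * where the unaligned tail is a small fraction of the mapping.
>>>  */
>>> static bool thp_should_align(unsigned long len)
>>> {
>>> 	/* Current upstream behavior: exact PMD multiples only. */
>>> 	if (IS_ALIGNED(len, PMD_SIZE))
>>> 		return true;
>>>
>>> 	/* Hypothetical: huge mappings waste proportionally little. */
>>> 	return len >= 16 * PMD_SIZE;
>>> }
>>> ```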
>>>
>>> Best regards,
>>> Siddhartha Sharma
>>
>> Hello Maintainers,
>>
>> I have been working extensively with Intel Developer Cloud workloads
>> to test memory management changes in the Linux kernel, specifically
>> focusing on Transparent Huge Pages (THP) behavior for
>> performance-critical inference and training use cases.
>>
>> This patch introduces a **performance configuration option** for THP
>> in `mm/` that allows fine-tuning of hugepage allocation policy for
>> workloads where predictable latency and higher sustained throughput
>> are critical. The change lets kernel users toggle a "performance"
>> mode that biases THP allocation decisions towards large pages even
>> under moderate memory pressure, trading some reclaim aggressiveness
>> for lower TLB miss rates and reduced CPU overhead.
>>
>> **Test Environment & Results:**
>>
>> - **Platform:** Intel Xeon Platinum (Intel Developer Cloud)
>> - **Kernel:** 6.9.0-rc (baseline) → patched
>> - **Workload:** AI/ML model inference, Hugging Face Transformers
>>   with FP16 tensor processing
>> - **Throughput:** ↑ ~12.8% sustained (measured over 10k inference
>>   requests)
>> - **Latency (p95):** ↓ ~9.4% (average reduction from 38.7 ms to
>>   35.0 ms)
>> - **TLB misses:** ↓ ~15% (perf stat)
>>
>> These improvements were consistent across 3 test runs, with no
>> significant regressions in system stability during stress tests.
>>
>> ---
>>
>> **Pseudo-diff of relevant changes:**
>>
>> ```diff
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>  static bool __thp_defrag = true;
>>
>> +/* New performance configuration toggle */
>> +static bool thp_performance_mode;
>> +
>> +static int __init setup_thp_performance(char *str)
>> +{
>> +	if (!str)
>> +		return 0;
>> +	if (!strcmp(str, "on"))
>> +		thp_performance_mode = true;
>> +	return 1;
>> +}
>> +__setup("thp_performance=", setup_thp_performance);
>>
>>  static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>  {
>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>>  	/* Existing allocation checks */
>> -	if (khugepaged_always())
>> -		return true;
>> +	if (thp_performance_mode)
>> +		return true;	/* Aggressively prefer THP in performance mode */
>> +	if (khugepaged_always())
>> +		return true;
>>
>>  	/* Rest of allocation logic */
>> ```
>>
>> Please note: this is a pseudo-diff, since my initial work was done on
>> Intel Developer Cloud workloads without a locally cloned copy of the
>> exact committed files.
>>
>> If there's interest, I can provide additional benchmark data and
>> extend the implementation to expose runtime toggling via
>> /sys/kernel/mm/transparent_hugepage/performance.
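>>
>> As a rough illustration of what that sysfs knob could look like
>> (untested, and an assumption on my part -- it follows the
>> kobj_attribute pattern of the existing files in mm/huge_memory.c,
>> and the new attribute would still need to be added to
>> hugepage_attr[]):
>>
>> ```c
>> static ssize_t performance_show(struct kobject *kobj,
>> 				struct kobj_attribute *attr, char *buf)
>> {
>> 	return sysfs_emit(buf, "%d\n", thp_performance_mode);
>> }
>>
>> static ssize_t performance_store(struct kobject *kobj,
>> 				 struct kobj_attribute *attr,
>> 				 const char *buf, size_t count)
>> {
>> 	bool val;
>>
>> 	/* Accept the usual boolean spellings: 0/1, y/n, on/off. */
>> 	if (kstrtobool(buf, &val))
>> 		return -EINVAL;
>> 	thp_performance_mode = val;
>> 	return count;
>> }
>>
>> static struct kobj_attribute performance_attr =
>> 	__ATTR(performance, 0644, performance_show, performance_store);
>> ```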
>>
>> Thanks & Regards,
>> Siddhartha Sharma
>
> Hi Vlastimil, Lorenzo, Dev and Kirill,
>
> Hope you are doing well!
>
> I am following up on my previous message and would like to know about
> the next steps for benchmark testing, covering both performance gains
> and regressions.
>
> Please let me know if you need more information.
>
> Awaiting your response!
>
> Best Regards,
> Siddhartha Sharma

Hello all,

I hope this message finds you well. I am following up again regarding
my earlier patch submission and the subsequent discussion around **THP
alignment performance configuration**.

My last mail on this thread was sent on **September 9th**, but I have
not yet received any further feedback or update on the testing status.

As a quick recap:

- The proposed change introduces a controlled toggle for THP alignment
  behavior.
- During OpenVINO-based inference runs (ResNet-50, BERT-Large), we
  observed a **+3.1% throughput improvement** and a **-2.7% latency
  reduction**, depending on whether alignment was enabled or disabled.
- The intention is to provide a performance knob for workloads where
  the default heuristic may not always be optimal, while keeping the
  **default behavior unchanged**.

I fully understand the complexities around VMA merging, Rik's earlier
patch, and the possible regressions noted with the cactusBSSN and
ebizzy workloads. However, given the continued performance relevance
to AI/ML inference pipelines, I believe further testing and validation
would help determine whether this knob can be safely integrated (or
adapted) for wider use.

Could you please share the **current status of testing or review** on
this patch? If there are specific benchmarks, traces, or refinements
needed from my side, I would be happy to assist in generating or
providing them.

I greatly appreciate your time and guidance on moving this forward.
Thank you again for your support.

Best regards,
Siddhartha Sharma
siddhartha@kenip.in