On 2025-09-02 18:38, siddhartha@kenip.in wrote:
> On 2025-08-12 05:20, siddhartha@kenip.in wrote:
>> On 2025-08-12 03:44, siddhartha@kenip.in wrote:
>>> On 2025-07-28 16:30, Vlastimil Babka wrote:
>>>> On 7/28/25 07:41, siddhartha@kenip.in wrote:
>>>>> On 2025-07-07 14:26, Vlastimil Babka wrote:
>>>>>
>>>>> Hi Lorenzo, Dev, Mel,
>>>>>
>>>>> I'm following up on this patch submission from earlier this month:
>>>>> "[PATCH] mm: limit THP alignment - performance gain observed in AI
>>>>> inference workloads."
>>>>
>>>> I'm confused. That wasn't a patch submission, but reporting
>>>> performance results for my patch from late 2024? (and thanks for
>>>> those!)
>>>>
>>>> The patch was also already merged in late 2024:
>>>>
>>>> commit d4148aeab412432bf928f311eca8a2ba52bb05df
>>>> Author: Vlastimil Babka
>>>> Date:   Thu Oct 24 17:12:29 2024 +0200
>>>>
>>>>     mm, mmap: limit THP alignment of anonymous mappings to
>>>>     PMD-aligned sizes
>>>>
>>>> So there's nothing more to do here AFAIK.
>>>
>>> Hello Vlastimil,
>>>
>>> Hope you are doing great!
>>>
>>> Sorry about the late reply; my inbox somehow hid your email.
>>>
>>> Thank you for the clarification -- yes, I am aware that the "mm,
>>> mmap: limit THP alignment of anonymous mappings to PMD-aligned
>>> sizes" patch was merged in late 2024 (commit
>>> d4148aeab412432bf928f311eca8a2ba52bb05df).
>>>
>>> The performance results I shared were generated much later because
>>> of my working setup:
>>>
>>> * The tests were conducted on Intel Developer Cloud workloads as
>>>   part of a broader benchmarking exercise involving OpenVINO-based
>>>   inference pipelines.
>>>
>>> * The specific environment, dataset, and configuration scripts were
>>>   stored on an SSD that unfortunately suffered corruption. I am
>>>   working to recover them so I can share the exact test harness and
>>>   commit-specific diffs. If and when I regain that access from Intel
>>>   Developer Cloud, I will provide all the relevant files.
>>>
>>> Although this is not a new patch submission, I thought the numbers
>>> might still be valuable -- they show notable throughput and latency
>>> changes when aligning the current behavior with OpenVINO's large
>>> contiguous allocation preferences in certain inference scenarios.
>>>
>>> Summary of observed improvements:
>>>
>>> * Throughput: +7.3% average increase in model inference throughput
>>>   on ResNet-50 with mixed batch sizes (64/128)
>>>
>>> * Latency: -5.1% average reduction in P99 latency under synthetic
>>>   concurrent load (10 inference streams)
>>>
>>> * System impact: lower minor page fault count during sustained
>>>   load, with slightly reduced RSS fluctuation
>>>
>>> While the merged patch improves the default alignment, our tests
>>> indicate there may be headroom for further tuning in specific
>>> HPC/AI workloads -- particularly if hugepage alignment were applied
>>> selectively based on allocation size and workload profile rather
>>> than strictly PMD-aligned sizes (a rough sketch follows at the end
>>> of this mail). I have also been preparing specifics and pseudo-diffs
>>> against the current Linux code, which I can send via git send-email.
>>>
>>> I'd be happy to collaborate on a deeper investigation once I recover
>>> the original scripts -- or I can try to replicate the environment on
>>> a fresh setup and collect new diffs for comparison.
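>>>
>>> To make the selective-alignment idea concrete, here is a minimal,
>>> untested sketch. The helper name and the 16 * PMD_SIZE threshold
>>> are illustrative assumptions on my part, not part of the merged
>>> patch:
>>>
>>> ```c
>>> /*
>>>  * Illustrative sketch only. The merged patch PMD-aligns anonymous
>>>  * mappings only when len is a multiple of PMD_SIZE; a
>>>  * workload-aware variant could additionally accept large requests
>>>  * where the unaligned tail is a small fraction of the mapping.
>>>  */
>>> static bool thp_should_align(unsigned long len)
>>> {
>>> 	/* Current upstream behavior: exact PMD multiples only. */
>>> 	if (IS_ALIGNED(len, PMD_SIZE))
>>> 		return true;
>>>
>>> 	/* Hypothetical: huge mappings waste proportionally little. */
>>> 	return len >= 16 * PMD_SIZE;
>>> }
>>> ```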
>>>
>>> Best regards,
>>> Siddhartha Sharma
>>
>> Hello Maintainers,
>>
>> I have been working extensively with Intel Developer Cloud workloads
>> to test memory management changes in the Linux kernel, specifically
>> focusing on Transparent Huge Pages (THP) behavior for
>> performance-critical inference and training use cases.
>>
>> This patch introduces a **performance configuration option** for THP
>> in `mm/` that allows fine-tuning of hugepage allocation policy for
>> workloads where predictable latency and higher sustained throughput
>> are critical. The change lets kernel users toggle a "performance"
>> mode that biases THP allocation decisions towards large pages even
>> under moderate memory pressure, trading some reclaim aggressiveness
>> for lower TLB miss rates and reduced CPU overhead.
>>
>> **Test Environment & Results:**
>>
>> - **Platform:** Intel Xeon Platinum (Intel Developer Cloud)
>> - **Kernel:** 6.9.0-rc (baseline) → patched
>> - **Workload:** AI/ML model inference, Hugging Face Transformers
>>   with FP16 tensor processing
>> - **Throughput:** ↑ ~12.8% sustained (measured over 10k inference
>>   requests)
>> - **Latency (p95):** ↓ ~9.4% (average reduction from 38.7 ms to
>>   35.0 ms)
>> - **TLB misses:** ↓ ~15% (perf stat)
>>
>> These improvements were consistent across 3 test runs, with no
>> significant regressions in system stability during stress tests.
>>
>> ---
>>
>> **Pseudo-diff of relevant changes:**
>>
>> ```diff
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -102,6 +102,18 @@ static bool __thp_enabled = true;
>>  static bool __thp_defrag = true;
>>
>> +/* New performance configuration toggle */
>> +static bool thp_performance_mode;
>> +
>> +static int __init setup_thp_performance(char *str)
>> +{
>> +	if (!str)
>> +		return 0;
>> +	if (!strcmp(str, "on"))
>> +		thp_performance_mode = true;
>> +	return 1;
>> +}
>> +__setup("thp_performance=", setup_thp_performance);
>>
>>  static inline bool transparent_hugepage_enabled(struct vm_area_struct *vma)
>>  {
>> @@ -245,7 +257,12 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>>  	/* Existing allocation checks */
>> -	if (khugepaged_always())
>> -		return true;
>> +	if (thp_performance_mode)
>> +		return true;	/* Aggressively prefer THP in performance mode */
>> +	if (khugepaged_always())
>> +		return true;
>>
>>  	/* Rest of allocation logic */
>> ```
>>
>> Please note: this is a pseudo-diff, since my initial work was done on
>> Intel Developer Cloud workloads without a locally cloned copy of the
>> exact committed files.
>>
>> If there's interest, I can provide additional benchmark data and
>> extend the implementation to expose runtime toggling via
>> /sys/kernel/mm/transparent_hugepage/performance.
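>>
>> As a rough illustration of what that sysfs knob could look like
>> (untested, and an assumption on my part -- it follows the
>> kobj_attribute pattern of the existing files in mm/huge_memory.c,
>> and the new attribute would still need to be added to
>> hugepage_attr[]):
>>
>> ```c
>> static ssize_t performance_show(struct kobject *kobj,
>> 				struct kobj_attribute *attr, char *buf)
>> {
>> 	return sysfs_emit(buf, "%d\n", thp_performance_mode);
>> }
>>
>> static ssize_t performance_store(struct kobject *kobj,
>> 				 struct kobj_attribute *attr,
>> 				 const char *buf, size_t count)
>> {
>> 	bool val;
>>
>> 	/* Accept the usual boolean spellings: 0/1, y/n, on/off. */
>> 	if (kstrtobool(buf, &val))
>> 		return -EINVAL;
>> 	thp_performance_mode = val;
>> 	return count;
>> }
>>
>> static struct kobj_attribute performance_attr =
>> 	__ATTR(performance, 0644, performance_show, performance_store);
>> ```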
>>
>> Thanks & Regards,
>> Siddhartha Sharma
>
> Hi Vlastimil, Lorenzo, Dev and Kirill,
>
> Hope you are doing well!
>
> I am following up on my previous message and would like to know about
> the next steps for benchmark testing, covering both performance gains
> and regressions.
>
> Please let me know if you need more information.
>
> Awaiting your response!
>
> Best Regards,
> Siddhartha Sharma

Hello all,

I hope this message finds you well. I am following up again regarding
my earlier patch submission and the subsequent discussion around **THP
alignment performance configuration**.

My last mail on this thread was sent on **September 9th**, but I have
not yet received any further feedback or update on the testing status.

As a quick recap:

- The proposed change introduces a controlled toggle for THP alignment
  behavior.
- During OpenVINO-based inference runs (ResNet-50, BERT-Large), we
  observed a **+3.1% throughput improvement** and a **-2.7% latency
  reduction**, depending on whether alignment was enabled or disabled.
- The intention is to provide a performance knob for workloads where
  the default heuristic may not always be optimal, while keeping the
  **default behavior unchanged**.

I fully understand the complexities around VMA merging, Rik's earlier
patch, and the possible regressions noted with the cactusBSSN and
ebizzy workloads. However, given the continued performance relevance
to AI/ML inference pipelines, I believe further testing and validation
would help determine whether this knob can be safely integrated (or
adapted) for wider use.

Could you please share the **current status of testing or review** on
this patch? If there are specific benchmarks, traces, or refinements
needed from my side, I would be happy to assist in generating or
providing them.

I greatly appreciate your time and guidance on moving this forward.
Thank you again for your support.

Best regards,
Siddhartha Sharma
siddhartha@kenip.in