From: Jason Xing

perf c2c profiling of the AF_XDP generic-copy batch TX path reveals
that ~45% of all cache-line contention (HITM) comes from a single
cacheline inside struct xsk_buff_pool. The sendmsg CPU reads pool
geometry fields (addrs, chunk_size, headroom, tx_metadata_len, etc.)
in the validate-and-build hot path, while the NAPI TX-completion CPU
writes cq_prod_lock (via xsk_destruct_skb -> xsk_cq_submit_addr_locked)
and cached_need_wakeup (via xsk_set/clear_tx_need_wakeup) on the same
cacheline: classic false sharing.

Group the read-mostly geometry fields together and push the
write-heavy fields (cq_prod_lock, cached_need_wakeup, free_heads_cnt)
onto their own cacheline. This adds one extra cacheline (64 bytes) to
the per-pool allocation but eliminates cross-CPU false sharing between
the TX sendmsg and TX completion paths. The reorganization improves
overall performance by 5-6%, as measured with xdpsock. After this
change, the only remaining hotspot is the ~6% spent in refcounting,
which is already batched earlier in this series to minimize its impact.

Signed-off-by: Jason Xing
---
 include/net/xsk_buff_pool.h | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index ccb3b350001f..b1b11e3aa273 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -73,23 +73,27 @@ struct xsk_buff_pool {
 	u64 addrs_cnt;
 	u32 free_list_cnt;
 	u32 dma_pages_cnt;
-	u32 free_heads_cnt;
+
+	/* Read-mostly fields */
 	u32 headroom;
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
 	u32 xdp_zc_max_segs;
 	u8 tx_metadata_len; /* inherited from umem */
-	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool unaligned;
 	bool tx_sw_csum;
 	void *addrs;
+
+	/* Write-heavy fields */
 	/* Mutual exclusion of the completion ring in the SKB mode.
 	 * Protect: NAPI TX thread and sendmsg error paths in the SKB
 	 * destructor callback.
 	 */
-	spinlock_t cq_prod_lock;
+	spinlock_t cq_prod_lock ____cacheline_aligned_in_smp;
+	u8 cached_need_wakeup;
+	u32 free_heads_cnt;
 	struct xdp_buff_xsk *free_heads[];
 };
--
2.41.3