Expose the page_pool ring size limit (16384) as a constant so we can
reuse it in drivers.

Signed-off-by: Mina Almasry
---
 include/net/page_pool/types.h | 2 ++
 net/core/page_pool.c          | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
index 1509a536cb85..5edba3122b10 100644
--- a/include/net/page_pool/types.h
+++ b/include/net/page_pool/types.h
@@ -58,6 +58,8 @@ struct pp_alloc_cache {
 	netmem_ref cache[PP_ALLOC_CACHE_SIZE];
 };
 
+#define PAGE_POOL_MAX_RING_SIZE 16384
+
 /**
  * struct page_pool_params - page pool parameters
  * @fast:	params accessed frequently on hotpath
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 1a5edec485f1..7b2808da294f 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -211,7 +211,7 @@ static int page_pool_init(struct page_pool *pool,
 		return -EINVAL;
 
 	if (pool->p.pool_size)
-		ring_qsize = min(pool->p.pool_size, 16384);
+		ring_qsize = min(pool->p.pool_size, PAGE_POOL_MAX_RING_SIZE);
 
 	/* DMA direction is either DMA_FROM_DEVICE or DMA_BIDIRECTIONAL.
	 * DMA_BIDIRECTIONAL is for allowing page used for DMA sending,

base-commit: 327c20c21d80e0d87834b392d83ae73c955ad8ff
-- 
2.51.2.1026.g39e6a42477-goog
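Illustrative aside, not part of the series: with the constant exported, a
driver that wants the deepest possible recycling ring can name it instead of
hard-coding 16384. A minimal sketch under that assumption (the my_* name and
surrounding function are hypothetical; the page_pool_params fields,
PP_FLAG_DMA_MAP and PAGE_POOL_MAX_RING_SIZE are the interfaces from the
headers touched above):

#include <linux/dma-mapping.h>
#include <linux/netdevice.h>
#include <net/page_pool/helpers.h>
#include <net/page_pool/types.h>

/* Hypothetical helper: request the maximum ring size up front. Note that
 * page_pool_init() clamps pool_size to PAGE_POOL_MAX_RING_SIZE, so asking
 * for more is silently reduced to this value.
 */
static struct page_pool *my_create_page_pool(struct device *dev,
					     struct napi_struct *napi)
{
	struct page_pool_params pp = {
		.flags		= PP_FLAG_DMA_MAP,
		.order		= 0,
		.pool_size	= PAGE_POOL_MAX_RING_SIZE,
		.nid		= NUMA_NO_NODE,
		.dev		= dev,
		.napi		= napi,
		.dma_dir	= DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp);
}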
NCCL workloads with NCCL_P2P_PXN_LEVEL=2 or 1 are very slow with the
current gve devmem TCP configuration.

Root causing showed that this particular workload results in a very
bursty pattern of devmem allocations and frees, exhausting the
page_pool ring buffer. This results in sock_devmem_dontneed taking up
to 5ms to free a batch of 128 netmems, as each free fails to find an
available entry in the pp->ring and falls all the way back to the
(slow) gen_pool. It also results in gve_alloc_buffer hitting bursts of
successive allocations which likewise find no entries in the pp->ring
(not dontneed'd yet, presumably); each such allocation takes up to
100us, slowing down the napi poll loop.

From there, the slowness of the napi poll loop results, I suspect, in
the rx buffers not being processed in time, and in packet drops
detected by tcpdump. The sum of all this badness is that the workload
runs at around 0.5 GB/s, when the expected perf is around 12 GB/s.

This entire behavior can be avoided by increasing the pp->ring size to
the maximum allowed 16384, which makes the pp able to handle the
bursty alloc/frees of this particular workload. AFAICT there should be
no negative side effect of arbitrarily increasing the pp->ring size in
this manner for ZC configs; the memory is preallocated and pinned by
the memory provider anyway.

Tested by running an AllToAll PXN=2 workload.

Before: Avg bus bandwidth : 0.434191
After:  Avg bus bandwidth : 12.5494

Note that there is more we can do to optimize this path, such as bulk
netmem dontneeds, bulk netmem pp refills, and possibly taking a page
from the io_uring zcrx playbook and replacing the gen_pool with a
simpler fixed-size array based allocator, but this seems sufficient to
fix these critical workloads.

With thanks to Willem and Eric for helping root cause this.

Cc: ziweixiao@google.com
Fixes: 62d7f40503bc ("gve: support unreadable netmem")
Reported-by: Vedant Mathur
Signed-off-by: Mina Almasry
---
 drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
index 0e2b703c673a..f63ffdd3b3ba 100644
--- a/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
+++ b/drivers/net/ethernet/google/gve/gve_buffer_mgmt_dqo.c
@@ -8,6 +8,8 @@
 #include "gve.h"
 #include "gve_utils.h"
 
+#include "net/netdev_queues.h"
+
 int gve_buf_ref_cnt(struct gve_rx_buf_state_dqo *bs)
 {
 	return page_count(bs->page_info.page) - bs->page_info.pagecnt_bias;
@@ -263,6 +265,8 @@ struct page_pool *gve_rx_create_page_pool(struct gve_priv *priv,
 	if (priv->header_split_enabled) {
 		pp.flags |= PP_FLAG_ALLOW_UNREADABLE_NETMEM;
 		pp.queue_idx = rx->q_num;
+		if (netif_rxq_has_unreadable_mp(priv->dev, rx->q_num))
+			pp.pool_size = PAGE_POOL_MAX_RING_SIZE;
 	}
 
 	return page_pool_create(&pp);
-- 
2.51.2.1026.g39e6a42477-goog
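Illustrative aside, not part of the series: other drivers can adopt the same
workaround by following the gve hunk above, i.e. only bumping the ring to its
maximum when an unreadable memory provider is actually bound to the queue. A
minimal sketch (the my_* helper is hypothetical; netif_rxq_has_unreadable_mp(),
PAGE_POOL_MAX_RING_SIZE and struct page_pool_params are the interfaces used in
this series):

#include <net/netdev_queues.h>
#include <net/page_pool/types.h>

/* Bump the recycling ring to its maximum only when an unreadable
 * (devmem / io_uring zcrx) memory provider is bound to this queue; that
 * memory is preallocated and pinned by the provider, so the deeper ring
 * absorbs bursty alloc/free patterns without pinning extra memory.
 */
static void my_size_pool_for_devmem(struct net_device *dev, int rxq_idx,
				    struct page_pool_params *pp)
{
	if (netif_rxq_has_unreadable_mp(dev, rxq_idx))
		pp->pool_size = PAGE_POOL_MAX_RING_SIZE;
}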