On a 640-CPU system running virtio-net VMs with the vhost-net driver
and multiqueue (64) tap devices, testing has shown contention on the
zone lock of the page allocator. A 'perf record -F99 -g sleep 5' of
the CPUs where the vhost worker threads run shows:

# perf report -i perf.data.vhost --stdio --sort overhead --no-children | head -22
...
# 100.00%
    |
    |--9.47%--queued_spin_lock_slowpath
    |          |
    |           --9.37%--_raw_spin_lock_irqsave
    |                     |
    |                     |--5.00%--__rmqueue_pcplist
    |                     |          get_page_from_freelist
    |                     |          __alloc_pages_noprof
    |                     |          |
    |                     |          |--3.34%--napi_alloc_skb
#

That is, for Rx packets:
- ksoftirqd threads, pinned 1:1 to CPUs, do the SKB allocation.
- vhost-net threads, which float across CPUs, do the SKB free.

One way to avoid this contention is to free SKBs on the same CPU on
which they were allocated. This allows freed pages to be placed on the
per-cpu page (PCP) lists, so that new allocations can be satisfied
directly from the PCP list rather than by requesting new pages from
the page allocator (and taking the zone lock).

Fortunately, previous work has provided all the infrastructure needed
to do this via skb_attempt_defer_free(), which this change uses in
place of consume_skb() in tun_do_read().

Testing was done with a 6.12-based kernel and the patch ported
forward.

Server: dual socket AMD SP5 - 2x AMD SP5 9845 (Turin) with 2 VMs
Load generator: iPerf2 x 1200 clients, MSS=400

Before: maximum traffic rate: 55 Gbps
After:  maximum traffic rate: 110 Gbps
---
 drivers/net/tun.c | 2 +-
 net/core/skbuff.c | 2 ++
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 8192740357a0..388f3ffc6657 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2185,7 +2185,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		if (unlikely(ret < 0))
 			kfree_skb(skb);
 		else
-			consume_skb(skb);
+			skb_attempt_defer_free(skb);
 	}
 
 	return ret;

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6be01454f262..89217c43c639 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7201,6 +7201,7 @@ nodefer:	kfree_skb_napi_cache(skb);
 	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
 	DEBUG_NET_WARN_ON_ONCE(skb->destructor);
 	DEBUG_NET_WARN_ON_ONCE(skb_nfct(skb));
+	DEBUG_NET_WARN_ON_ONCE(skb_shared(skb));
 
 	sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
 
@@ -7221,6 +7222,7 @@ nodefer:	kfree_skb_napi_cache(skb);
 	if (unlikely(kick))
 		kick_defer_list_purge(cpu);
 }
+EXPORT_SYMBOL(skb_attempt_defer_free);
 
 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
 				 size_t offset, size_t len)
-- 
2.34.1
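
P.S. For reviewers who have not looked at the defer machinery recently,
the pattern this patch relies on is roughly the following. This is an
illustrative sketch only - simplified, with a hypothetical helper for
the list handling - not the actual implementation; see
skb_attempt_defer_free() in net/core/skbuff.c for the real code.

/* Sketch: defer freeing an SKB to the CPU that allocated it, so its
 * pages return to that CPU's PCP lists.
 */
static void defer_free_sketch(struct sk_buff *skb)
{
	int cpu = skb->alloc_cpu;	/* CPU recorded at allocation time */

	/* Deferral cannot help if we are already on the allocating CPU
	 * or that CPU has gone offline: free locally instead.
	 */
	if (cpu == raw_smp_processor_id() || !cpu_online(cpu)) {
		kfree_skb(skb);
		return;
	}

	/* Queue the SKB on the allocating CPU's defer list; once the
	 * list grows past a threshold, kick that CPU so its NET_RX
	 * softirq drains the list and frees the SKBs locally.
	 * (queue_on_defer_list() is a hypothetical stand-in.)
	 */
	if (queue_on_defer_list(skb, cpu))
		kick_defer_list_purge(cpu);
}

In tun's case, tun_do_read() runs in vhost worker context on an
arbitrary CPU, so this hands the free back to the ksoftirqd CPU that
allocated the SKB, keeping the page round trip within a single CPU's
PCP lists.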