In order to keep track of loops across the stack we need to _remember the
global loop state in the skb_. We introduce a 2 bit per-skb ttl field to
keep track of this state. The pahole before (-) and after (+) diff looks
like:

         __u8                  slow_gro:1;       /* 132: 3  1 */
         __u8                  csum_not_inet:1;  /* 132: 4  1 */
         __u8                  unreadable:1;     /* 132: 5  1 */
+        __u8                  ttl:2;            /* 132: 6  1 */

-        /* XXX 2 bits hole, try to pack */

         /* XXX 1 byte hole, try to pack */

         __u16                 tc_index;         /* 134     2 */

There used to be a ttl field, removed as part of tc_verd in commit
aec745e2c520 ("net-tc: remove unused tc_verd fields"). It was already
unused by that time, its use having been removed earlier in commit
c19ae86a510c ("tc: remove unused redirect ttl").

An existing per-cpu loop count, MIRRED_NEST_LIMIT, assumes a single call
stack and suffers from two challenges:
1) If we queue the packet somewhere and then restart processing later, the
   per-cpu state is lost (for example, it gets wiped out the moment we go
   egress->ingress and queue the packet in the backlog; the count is gone
   by the time packets are pulled from the backlog).
2) With X/RPS a packet may come in on one CPU but end up being processed
   on a different CPU.

Our first attempt was to "liberate" the skb->from_ingress bit into the
skb->cb field (v1); after deeper review we found that it gets trampled in
the case of hardware offload via the mlnx driver. Our second attempt
(which we didn't post) was to "liberate" the skb->tc_skip_classify bit
into skb->cb, but that led us down a path of making sensitive changes
such as modifying dev queue xmit. This is our third attempt.

Use cases:
1) Mirred increments the ttl whenever it sees an skb. In combination with
   MIRRED_NEST_LIMIT, this helps us resolve both challenges mentioned
   above. This is illustrated in patch #2.
2) netem increments the ttl when using the "duplicate" feature and
   catches it when it sees the packet a second time. This is illustrated
   in patch #5.
Fixes: fe946a751d9b ("net/sched: act_mirred: add loop detection")
Fixes: 0afb51e72855 ("[PKT_SCHED]: netem: reinsert for duplication")
Signed-off-by: Jamal Hadi Salim
---
 include/linux/skbuff.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index daa4e4944ce3..f1326c4b4bcc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -848,6 +848,7 @@ enum skb_tstamp_type {
  *	CHECKSUM_UNNECESSARY (max 3)
  * @unreadable: indicates that at least 1 of the fragments in this skb is
  *	unreadable.
+ * @ttl: time to live counter for packet loops.
  * @dst_pending_confirm: need to confirm neighbour
  * @decrypted: Decrypted SKB
  * @slow_gro: state present at GRO time, slower prepare step required
@@ -1030,6 +1031,7 @@ struct sk_buff {
 	__u8			csum_not_inet:1;
 #endif
 	__u8			unreadable:1;
+	__u8			ttl:2;
#if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS)
 	__u16			tc_index;	/* traffic control index */
 #endif
--
2.34.1