In a system with high real-time requirements, the timeout mechanism of ordinary timers with jiffies granularity is insufficient to meet the demands for real-time performance. Meanwhile, the optimization of CPU usage with af_packet is quite significant. Use hrtimer instead of timer to help compensate for the shortcomings in real-time performance. In HZ=100 or HZ=250 system, the update of TP_STATUS_USER is not real-time enough, with fluctuations reaching over 8ms (on a system with HZ=250). This is unacceptable in some high real-time systems that require timely processing of network packets. By replacing it with hrtimer, if a timeout of 2ms is set, the update of TP_STATUS_USER can be stabilized to within 3 ms. Signed-off-by: Xin Zhao --- Changes in v8: - Delete delete_blk_timer field, as suggested by Willem de Bruijn, hrtimer_cancel will check and wait until the timer callback return and ensure enter enter callback again; - Simplify the logic related to setting timeout, as suggestd by Willem de Bruijn. Currently timer callback just restarts itself unconditionally, so delete the 'out:' label, do not forward hrtimer in prb_open_block, call hrtimer_forward_now directly and always return HRTIMER_RESTART. The only special case is when prb_open_block is called from tpacket_rcv. That would set the timeout further into the future than the already queued timer. An earlier timeout is not problematic. No need to add complexity to avoid that. Changes in v7: - Only update the hrtimer expire time within the hrtimer callback. When the callback return, without sk_buff_head lock protection, __run_hrtimer will enqueue the timer if return HRTIMER_RESTART. Setting the hrtimer expires while enqueuing a timer may cause chaos in the hrtimer red-black tree. The setting expire time is monotonic, so if we do not update the expire time to the retire_blk_timer when it is not in callback, it will not cause problem if we skip the timeout event and update it when find out that expire_ktime is bigger than the expire time of retire_blk_timer. - Use hrtimer_set_expires instead of hrtimer_forward_now. The end time for retiring each block is not fixed because when network packets are received quickly, blocks are retired rapidly, and the new block retire time needs to be recalculated. However, hrtimer_forward_now increments the previous timeout by an interval, which is not correct. - The expire time is monotonic, so if we do not update the expire time to the retire_blk_timer when it is not in callback, it will not cause problem if we skip the timeout event and update it when find out that expire_ktime is bigger than the expire time of retire_blk_timer. - Adding the 'bool callback' parameter back is intended to more accurately determine whether we are inside the hrtimer callback when executing _prb_refresh_rx_retire_blk_timer. This ensures that we only update the hrtimer's timeout value within the hrtimer callback. - Link to v7: https://lore.kernel.org/all/20250822132051.266787-1-jackzxcui1989@163.com/ Changes in v6: - Use hrtimer_is_queued instead to check whether it is within the callback function. So do not need to add 'bool callback' parameter to _prb_refresh_rx_retire_blk_timer as suggested by Willem de Bruijn; - Do not need local_irq_save and local_irq_restore to protect the race of the timer callback running in softirq context or the open_block from tpacket_rcv in process context as suggested by Willem de Bruijn; - Link to v6: https://lore.kernel.org/all/20250820092925.2115372-1-jackzxcui1989@163.com/ Changes in v5: - Remove the unnecessary comments at the top of the _prb_refresh_rx_retire_blk_timer, branch is self-explanatory enough as suggested by Willem de Bruijn; - Indentation of _prb_refresh_rx_retire_blk_timer, align with first argument on previous line as suggested by Willem de Bruijn; - Do not call hrtimer_start within the hrtimer callback as suggested by Willem de Bruijn So add 'bool callback' parameter to _prb_refresh_rx_retire_blk_timer to indicate whether it is within the callback function. Use hrtimer_forward_now instead of hrtimer_start when it is in the callback function and is doing prb_open_block. - Link to v5: https://lore.kernel.org/all/20250819091447.1199980-1-jackzxcui1989@163.com/ Changes in v4: - Add 'bool start' to distinguish whether the call to _prb_refresh_rx_retire_blk_timer is for prb_open_block. When it is for prb_open_block, execute hrtimer_start to (re)start the hrtimer; otherwise, use hrtimer_forward_now to set the expiration time as it is more commonly used compared to hrtimer_set_expires. as suggested by Willem de Bruijn; - Delete the comments to explain why hrtimer_set_expires(not hrtimer_forward_now) is used, as we do not use hrtimer_set_expires any more; - Link to v4: https://lore.kernel.org/all/20250818050233.155344-1-jackzxcui1989@163.com/ Changes in v3: - return HRTIMER_NORESTART when pkc->delete_blk_timer is true as suggested by Willem de Bruijn; - Drop the retire_blk_tov field of tpacket_kbdq_core, add interval_ktime instead as suggested by Willem de Bruijn; - Add comments to explain why hrtimer_set_expires(not hrtimer_forward_now) is used in _prb_refresh_rx_retire_blk_timer as suggested by Willem de Bruijn; - Link to v3: https://lore.kernel.org/all/20250816170130.3969354-1-jackzxcui1989@163.com/ Changes in v2: - Drop the tov_in_msecs field of tpacket_kbdq_core added by the patch as suggested by Willem de Bruijn; - Link to v2: https://lore.kernel.org/all/20250815044141.1374446-1-jackzxcui1989@163.com/ Changes in v1: - Do not add another config for the current changes as suggested by Eric Dumazet; - Mention the beneficial cases 'HZ=100 or HZ=250' in the changelog as suggested by Eric Dumazet; - Add some performance details to the changelog as suggested by Ferenc Fejes; - Delete the 'pkc->tov_in_msecs == 0' bounds check which is not necessary as suggested by Willem de Bruijn; - Use hrtimer_set_expires instead of hrtimer_start_range_ns when retire timer needs update as suggested by Willem de Bruijn. Start the hrtimer in prb_setup_retire_blk_timer; - Just return HRTIMER_RESTART directly as all cases return the same value as suggested by Willem de Bruijn; - Link to v1: https://lore.kernel.org/all/20250813165201.1492779-1-jackzxcui1989@163.com/ - Link to v0: https://lore.kernel.org/all/20250806055210.1530081-1-jackzxcui1989@163.com/ --- net/packet/af_packet.c | 65 ++++++++++-------------------------------- net/packet/diag.c | 2 +- net/packet/internal.h | 6 ++-- 3 files changed, 18 insertions(+), 55 deletions(-) diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index a7017d7f0..bf1127d87 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -203,8 +203,7 @@ static void prb_retire_current_block(struct tpacket_kbdq_core *, static int prb_queue_frozen(struct tpacket_kbdq_core *); static void prb_open_block(struct tpacket_kbdq_core *, struct tpacket_block_desc *); -static void prb_retire_rx_blk_timer_expired(struct timer_list *); -static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *); +static enum hrtimer_restart prb_retire_rx_blk_timer_expired(struct hrtimer *); static void prb_fill_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *); static void prb_clear_rxhash(struct tpacket_kbdq_core *, struct tpacket3_hdr *); @@ -579,23 +578,13 @@ static __be16 vlan_get_protocol_dgram(const struct sk_buff *skb) return proto; } -static void prb_del_retire_blk_timer(struct tpacket_kbdq_core *pkc) -{ - timer_delete_sync(&pkc->retire_blk_timer); -} - static void prb_shutdown_retire_blk_timer(struct packet_sock *po, struct sk_buff_head *rb_queue) { struct tpacket_kbdq_core *pkc; pkc = GET_PBDQC_FROM_RB(&po->rx_ring); - - spin_lock_bh(&rb_queue->lock); - pkc->delete_blk_timer = 1; - spin_unlock_bh(&rb_queue->lock); - - prb_del_retire_blk_timer(pkc); + hrtimer_cancel(&pkc->retire_blk_timer); } static void prb_setup_retire_blk_timer(struct packet_sock *po) @@ -603,9 +592,10 @@ static void prb_setup_retire_blk_timer(struct packet_sock *po) struct tpacket_kbdq_core *pkc; pkc = GET_PBDQC_FROM_RB(&po->rx_ring); - timer_setup(&pkc->retire_blk_timer, prb_retire_rx_blk_timer_expired, - 0); - pkc->retire_blk_timer.expires = jiffies; + hrtimer_setup(&pkc->retire_blk_timer, prb_retire_rx_blk_timer_expired, + CLOCK_MONOTONIC, HRTIMER_MODE_REL_SOFT); + hrtimer_start(&pkc->retire_blk_timer, pkc->interval_ktime, + HRTIMER_MODE_REL_SOFT); } static int prb_calc_retire_blk_tmo(struct packet_sock *po, @@ -672,11 +662,10 @@ static void init_prb_bdqc(struct packet_sock *po, p1->last_kactive_blk_num = 0; po->stats.stats3.tp_freeze_q_cnt = 0; if (req_u->req3.tp_retire_blk_tov) - p1->retire_blk_tov = req_u->req3.tp_retire_blk_tov; + p1->interval_ktime = ms_to_ktime(req_u->req3.tp_retire_blk_tov); else - p1->retire_blk_tov = prb_calc_retire_blk_tmo(po, - req_u->req3.tp_block_size); - p1->tov_in_jiffies = msecs_to_jiffies(p1->retire_blk_tov); + p1->interval_ktime = ms_to_ktime(prb_calc_retire_blk_tmo(po, + req_u->req3.tp_block_size)); p1->blk_sizeof_priv = req_u->req3.tp_sizeof_priv; rwlock_init(&p1->blk_fill_in_prog_lock); @@ -686,16 +675,6 @@ static void init_prb_bdqc(struct packet_sock *po, prb_open_block(p1, pbd); } -/* Do NOT update the last_blk_num first. - * Assumes sk_buff_head lock is held. - */ -static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *pkc) -{ - mod_timer(&pkc->retire_blk_timer, - jiffies + pkc->tov_in_jiffies); - pkc->last_kactive_blk_num = pkc->kactive_blk_num; -} - /* * Timer logic: * 1) We refresh the timer only when we open a block. @@ -719,7 +698,7 @@ static void _prb_refresh_rx_retire_blk_timer(struct tpacket_kbdq_core *pkc) * prb_calc_retire_blk_tmo() calculates the tmo. * */ -static void prb_retire_rx_blk_timer_expired(struct timer_list *t) +static enum hrtimer_restart prb_retire_rx_blk_timer_expired(struct hrtimer *t) { struct packet_sock *po = timer_container_of(po, t, rx_ring.prb_bdqc.retire_blk_timer); @@ -732,9 +711,6 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t) frozen = prb_queue_frozen(pkc); pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); - if (unlikely(pkc->delete_blk_timer)) - goto out; - /* We only need to plug the race when the block is partially filled. * tpacket_rcv: * lock(); increment BLOCK_NUM_PKTS; unlock() @@ -757,21 +733,12 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t) goto refresh_timer; } prb_retire_current_block(pkc, po, TP_STATUS_BLK_TMO); - if (!prb_dispatch_next_block(pkc, po)) - goto refresh_timer; - else - goto out; + prb_dispatch_next_block(pkc, po); } else { /* Case 1. Queue was frozen because user-space was * lagging behind. */ - if (prb_curr_blk_in_use(pbd)) { - /* - * Ok, user-space is still behind. - * So just refresh the timer. - */ - goto refresh_timer; - } else { + if (!prb_curr_blk_in_use(pbd)) { /* Case 2. queue was frozen,user-space caught up, * now the link went idle && the timer fired. * We don't have a block to close.So we open this @@ -780,16 +747,14 @@ static void prb_retire_rx_blk_timer_expired(struct timer_list *t) * Thawing/timer-refresh is a side effect. */ prb_open_block(pkc, pbd); - goto out; } } } refresh_timer: - _prb_refresh_rx_retire_blk_timer(pkc); - -out: + hrtimer_forward_now(&pkc->retire_blk_timer, pkc->interval_ktime); spin_unlock(&po->sk.sk_receive_queue.lock); + return HRTIMER_RESTART; } static void prb_flush_block(struct tpacket_kbdq_core *pkc1, @@ -921,7 +886,7 @@ static void prb_open_block(struct tpacket_kbdq_core *pkc1, pkc1->pkblk_end = pkc1->pkblk_start + pkc1->kblk_size; prb_thaw_queue(pkc1); - _prb_refresh_rx_retire_blk_timer(pkc1); + pkc1->last_kactive_blk_num = pkc1->kactive_blk_num; smp_wmb(); } diff --git a/net/packet/diag.c b/net/packet/diag.c index 6ce1dcc28..c8f43e0c1 100644 --- a/net/packet/diag.c +++ b/net/packet/diag.c @@ -83,7 +83,7 @@ static int pdiag_put_ring(struct packet_ring_buffer *ring, int ver, int nl_type, pdr.pdr_frame_nr = ring->frame_max + 1; if (ver > TPACKET_V2) { - pdr.pdr_retire_tmo = ring->prb_bdqc.retire_blk_tov; + pdr.pdr_retire_tmo = ktime_to_ms(ring->prb_bdqc.interval_ktime); pdr.pdr_sizeof_priv = ring->prb_bdqc.blk_sizeof_priv; pdr.pdr_features = ring->prb_bdqc.feature_req_word; } else { diff --git a/net/packet/internal.h b/net/packet/internal.h index 1e743d031..30ae8979f 100644 --- a/net/packet/internal.h +++ b/net/packet/internal.h @@ -20,7 +20,6 @@ struct tpacket_kbdq_core { unsigned int feature_req_word; unsigned int hdrlen; unsigned char reset_pending_on_curr_blk; - unsigned char delete_blk_timer; unsigned short kactive_blk_num; unsigned short blk_sizeof_priv; @@ -45,12 +44,11 @@ struct tpacket_kbdq_core { /* Default is set to 8ms */ #define DEFAULT_PRB_RETIRE_TOV (8) - unsigned short retire_blk_tov; + ktime_t interval_ktime; unsigned short version; - unsigned long tov_in_jiffies; /* timer to retire an outstanding block */ - struct timer_list retire_blk_timer; + struct hrtimer retire_blk_timer; }; struct pgv { -- 2.34.1