Under certain conditions a queue can be left out with interrupts disabled and with the napi re-scheduling timer permanently stopped. This behaviour is triggered by the napi busy poll path when gro-flush-timeout and defer-hard-irq are set. Here's a sequence of operations: 1. Busy poll starts, NAPI_STATE_SCHED is set to avoid rescheduling napi from the timer. 2. During napi poll, driver disables interrupts due to being in poll mode (napi_complete_done() returns false because napi->state has NAPIF_STATE_IN_BUSY_POLL set). 3. At the end of the busy poll (busy_poll_stop()): 3.1 napi timer is scheduled and skip_schedule is set (due to config) 3.2 napi->poll() is called: - driver poll() processes exactly budget packets and exits early => napi not scheduled. (interrupts are still disabled at this point) 3.3 Since napi poll processed budget packets, __busy_poll_stop() is called with skip_schedule set => napi is not scheduled here either. 4. If the napi timer from 3.1 gets to be triggered due to slow napi poll or some other reason, the timer will run with no effect (due to NAPI_STATE_SCHED being set). 5. Busy poll finishes. Interrupts are still disabled and there is no timer to re-schedule. Unless another busy poll call happens, the queue will be stuck. This patch defers the scheduling of the timer to right before NAPI_STATE_SCHED is cleared. The timer is rescheduled and the NAPI_STATE_SCHED bit cleared with interrupts disabled to make sure the timer cannot fire before the bit is cleared, otherwise the situation described in this bug can reoccur. The timer is no longer scheduled when the napi poll returns < budget because napi_complete_done() will re-enable the interrupts or scheduled another napi. Fixes: 7fd3253a7de6 ("net: Introduce preferred busy-polling") Co-developed-by: Martin Karsten Signed-off-by: Martin Karsten Signed-off-by: Dragos Tatulea --- net/core/dev.c | 26 +++++++++++++++++--------- 1 file changed, 17 insertions(+), 9 deletions(-) diff --git a/net/core/dev.c b/net/core/dev.c index e59f6025067c..1487d4946dcf 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -6869,9 +6869,11 @@ static void skb_defer_free_flush(void) #if defined(CONFIG_NET_RX_BUSY_POLL) -static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule) +static void __busy_poll_stop(struct napi_struct *napi, unsigned long timeout) { - if (!skip_schedule) { + unsigned long flags; + + if (!timeout) { gro_normal_list(&napi->gro); __napi_schedule(napi); return; @@ -6880,7 +6882,11 @@ static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule) /* Flush too old packets. If HZ < 1000, flush all packets */ gro_flush_normal(&napi->gro, HZ >= 1000); + local_irq_save(flags); + hrtimer_start(&napi->timer, ns_to_ktime(timeout), + HRTIMER_MODE_REL_PINNED); clear_bit(NAPI_STATE_SCHED, &napi->state); + local_irq_restore(flags); } enum { @@ -6892,8 +6898,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, unsigned flags, u16 budget) { struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx; - bool skip_schedule = false; - unsigned long timeout; + unsigned long timeout = 0; int rc; /* Busy polling means there is a high chance device driver hard irq @@ -6913,10 +6918,13 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, if (flags & NAPI_F_PREFER_BUSY_POLL) { napi->defer_hard_irqs_count = napi_get_defer_hard_irqs(napi); - timeout = napi_get_gro_flush_timeout(napi); - if (napi->defer_hard_irqs_count && timeout) { - hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED); - skip_schedule = true; + if (napi->defer_hard_irqs_count) { + /* Timer will be scheduled after napi poll to avoid + * firing during a slow poll which could cause the + * queue to get stuck with interrupts disabled and no + * scheduled timer. + */ + timeout = napi_get_gro_flush_timeout(napi); } } @@ -6931,7 +6939,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, trace_napi_poll(napi, rc, budget); netpoll_poll_unlock(have_poll_lock); if (rc == budget) - __busy_poll_stop(napi, skip_schedule); + __busy_poll_stop(napi, timeout); bpf_net_ctx_clear(bpf_net_ctx); local_bh_enable(); } -- 2.43.0