macb_start_xmit() and macb_tx_restart() kick transmission by OR-ing MACB_BIT(TSTART) into NCR. On PCIe-attached macb instances (BCM2712 + RP1 PCIe south bridge on Raspberry Pi 5 is the setup we have in front of us), writes to NCR are posted PCIe writes: they are not guaranteed to reach the device before the issuing CPU returns. If the TSTART doorbell does not reach the MAC, no TX begins, no TCOMP completion arrives, and the ring remains quiescent without any kernel-visible indication. Note that the raspberrypi/linux vendor fork carries a local patch around the TSTART site (a queue->tx_pending breadcrumb that is promoted to queue->txubr_pending by the next TCOMP interrupt, triggering macb_tx_restart()). That workaround makes the loss recoverable under traffic, but it cannot help if TCOMP itself is not raised because no TX started -- which is exactly the case we are targeting here. The handshake is not present in mainline. Add a read-back of NCR after each TSTART write in macb_start_xmit() and macb_tx_restart(). The read is an architected PCIe read barrier for earlier posted writes on the same path; it ensures the doorbell has reached the MAC before the functions return. We do not yet have direct hardware evidence that TSTART is being lost on the RP1 path (that would require a PCIe protocol analyser, or at minimum a before/after counter on queue->tx_stall_last_tail with and without this patch applied in isolation). This patch is one of a three-patch series ("candidate fixes for silent TX stall on BCM2712/RP1"); see the cover letter for context. We have verified the series compiles and applies cleanly against mainline HEAD and against raspberrypi/linux rpi-6.18.y @ f2f68e79f16f; runtime verification is pending. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 Signed-off-by: Lukasz Raczylo --- drivers/net/ethernet/cadence/macb_main.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index a12aa2124..b6cca55ad 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -1922,6 +1922,13 @@ static void macb_tx_restart(struct macb_queue *queue) spin_lock(&bp->lock); macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART)); + /* + * Flush the PCIe posted-write queue so the TSTART doorbell + * reliably reaches the MAC. Without this, the write can sit + * in the fabric and the MAC never advances, causing a silent + * TX stall. + */ + (void)macb_readl(bp, NCR); spin_unlock(&bp->lock); out_tx_ptr_unlock: @@ -2560,6 +2567,11 @@ static netdev_tx_t macb_start_xmit(struct sk_buff *skb, struct net_device *dev) spin_lock(&bp->lock); macb_tx_lpi_wake(bp); macb_writel(bp, NCR, macb_readl(bp, NCR) | MACB_BIT(TSTART)); + /* + * Flush the PCIe posted-write queue; see the comment in + * macb_tx_restart() for the reasoning. + */ + (void)macb_readl(bp, NCR); spin_unlock(&bp->lock); if (CIRC_SPACE(queue->tx_head, queue->tx_tail, bp->tx_ring_size) < 1) -- 2.53.0 macb_tx_poll() runs with TCOMP masked, drains the TX ring, then calls napi_complete_done() and re-enables TCOMP via IER. An existing comment in the function notes: /* Packet completions only seem to propagate to raise * interrupts when interrupts are enabled at the time, so if * packets were sent while interrupts were disabled, * they will not cause another interrupt to be generated when * interrupts are re-enabled. */ and mitigates this by calling macb_tx_complete_pending(), which inspects driver-visible ring state (descriptor->ctrl, after rmb()) and reschedules NAPI if a completion is observable in memory. On PCIe-attached parts (BCM2712 + RP1 on Raspberry Pi 5 is the setup we have in front of us), the descriptor DMA write that sets TX_USED may not have retired to system memory at the point macb_tx_complete_pending() runs. The rmb() synchronises the CPU view of earlier CPU writes; it is not sufficient to retire an in-flight peripheral DMA write. Under that ordering the in-memory descriptor can still read TX_USED=0 when the hardware has in fact completed the frame; the check returns false; NAPI exits; the quirk above prevents the re-enabled IER from re-firing; the ring goes quiescent. Add an explicit ISR read after the IER write. The MMIO read serves two independent purposes: (1) It is an architected PCIe read barrier for earlier peripheral-originated DMA writes on the same path, so a subsequent macb_tx_complete_pending() observes any TX_USED write that was in flight at the time of the barrier. (2) It samples the hardware ISR directly, so a TCOMP bit that the hardware set while TCOMP was masked is visible here, independently of whether the descriptor DMA has retired. If either signal indicates pending work, reschedule NAPI via the same path as the existing check. This patch addresses one of three candidate races for the silent TX stall described in the cover letter. Whether it is sufficient by itself, or whether it requires the PCIe posted-write flush in patch 1/3 to cover the observed behaviour, we have not yet verified at runtime. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 Signed-off-by: Lukasz Raczylo --- drivers/net/ethernet/cadence/macb_main.c | 28 +++++++++++++++--------- 1 file changed, 18 insertions(+), 10 deletions(-) diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index b6cca55ad..ea231b1c5 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -1973,17 +1973,25 @@ static int macb_tx_poll(struct napi_struct *napi, int budget) if (work_done < budget && napi_complete_done(napi, work_done)) { queue_writel(queue, IER, MACB_BIT(TCOMP)); - /* Packet completions only seem to propagate to raise - * interrupts when interrupts are enabled at the time, so if - * packets were sent while interrupts were disabled, - * they will not cause another interrupt to be generated when - * interrupts are re-enabled. - * Check for this case here to avoid losing a wakeup. This can - * potentially race with the interrupt handler doing the same - * actions if an interrupt is raised just after enabling them, - * but this should be harmless. + /* + * TCOMP events that fire while the interrupt is masked do + * not re-fire when IER is re-enabled. Catch this two ways + * to avoid losing a wakeup: + * + * (1) Read ISR -- catches completions the hardware flagged + * but that we did not see as an interrupt. The MMIO + * read doubles as a PCIe read barrier, flushing any + * in-flight descriptor TX_USED DMA writes into memory. + * (2) macb_tx_complete_pending() inspects the ring after + * that flush, catching a descriptor whose TX_USED is + * now visible as a result of the barrier. + * + * This can race with the interrupt handler taking the same + * path if an interrupt fires just after the IER write; + * rescheduling NAPI in that case is harmless. */ - if (macb_tx_complete_pending(queue)) { + if ((queue_readl(queue, ISR) & MACB_BIT(TCOMP)) || + macb_tx_complete_pending(queue)) { queue_writel(queue, IDR, MACB_BIT(TCOMP)); macb_queue_isr_clear(bp, queue, MACB_BIT(TCOMP)); netdev_vdbg(bp->dev, "TX poll: packets pending, reschedule\n"); -- 2.53.0 Patches 1/3 and 2/3 address two candidate races that could lead to a TCOMP completion being missed on PCIe-attached macb instances. This patch adds a defence-in-depth safety net, in case a further race remains that we have not identified. The watchdog is a per-queue delayed_work that runs once per second. It snapshots queue->tx_tail; if the ring is non-empty (queue->tx_head != queue->tx_tail) and tx_tail has not advanced since the previous tick, it calls macb_tx_restart(). No new recovery logic is introduced. macb_tx_restart() already exists in this file, is correctly locked (tx_ptr_lock, bp->lock), and verifies that the hardware's TBQP is behind the driver's head index before re-asserting TSTART. On a healthy ring it is a no-op at the hardware level; the watchdog only supplies the missing trigger. On a healthy queue the per-tick cost is one spin_lock_irqsave() / spin_unlock_irqrestore() and one branch. The delayed_work is only scheduled between macb_open() and macb_close(), and is cancelled synchronously on close. Context for submission: on our 24-node Raspberry Pi 5 fleet, before this series, an out-of-band user-space watchdog (monitoring tx_packets from /sys/class/net/.../statistics and toggling the link down/up when it froze) was required to keep nodes usable. We include this kernel-side watchdog as a cleaner in-kernel equivalent for any residual stall that patches 1 and 2 do not cover. We are willing to drop this patch if the view is that 1 and 2 should stand alone. Link: https://github.com/cilium/cilium/issues/43198 Link: https://bugs.launchpad.net/ubuntu/+source/linux-raspi/+bug/2133877 Signed-off-by: Lukasz Raczylo --- drivers/net/ethernet/cadence/macb.h | 5 ++ drivers/net/ethernet/cadence/macb_main.c | 59 ++++++++++++++++++++++++ 2 files changed, 64 insertions(+) diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h index 2de56017e..9115f2b47 100644 --- a/drivers/net/ethernet/cadence/macb.h +++ b/drivers/net/ethernet/cadence/macb.h @@ -1278,6 +1278,11 @@ struct macb_queue { dma_addr_t tx_ring_dma; struct work_struct tx_error_task; bool txubr_pending; + + /* TX stall watchdog -- see macb_tx_stall_watchdog() in macb_main.c */ + struct delayed_work tx_stall_watchdog_work; + unsigned int tx_stall_last_tail; + struct napi_struct napi_tx; dma_addr_t rx_ring_dma; diff --git a/drivers/net/ethernet/cadence/macb_main.c b/drivers/net/ethernet/cadence/macb_main.c index ea231b1c5..ea2306ef7 100644 --- a/drivers/net/ethernet/cadence/macb_main.c +++ b/drivers/net/ethernet/cadence/macb_main.c @@ -2002,6 +2002,59 @@ static int macb_tx_poll(struct napi_struct *napi, int budget) return work_done; } +#define MACB_TX_STALL_INTERVAL_MS 1000 + +/* + * TX stall watchdog. + * + * Defence-in-depth against lost TCOMP interrupts. macb already has a + * recovery chain (tx_pending -> txubr_pending -> macb_tx_restart()) + * that fires on TCOMP; if TCOMP itself is lost the TX ring stalls + * silently until something else kicks TSTART. This watchdog runs + * once per second per queue, snapshots tx_tail, and calls + * macb_tx_restart() if the ring is non-empty and tx_tail has not + * advanced since the previous tick. + * + * macb_tx_restart() already checks the hardware's TBQP against the + * driver's head index before re-asserting TSTART, so on a healthy + * ring this is a no-op at the hardware level. The watchdog only + * adds the missing trigger. + */ +static void macb_tx_stall_watchdog(struct work_struct *work) +{ + struct macb_queue *queue = container_of(to_delayed_work(work), + struct macb_queue, + tx_stall_watchdog_work); + struct macb *bp = queue->bp; + unsigned int cur_tail, cur_head; + bool stalled = false; + unsigned long flags; + + if (!netif_running(bp->dev)) + return; + + spin_lock_irqsave(&queue->tx_ptr_lock, flags); + cur_tail = queue->tx_tail; + cur_head = queue->tx_head; + if (cur_head != cur_tail && + cur_tail == queue->tx_stall_last_tail) + stalled = true; + else + queue->tx_stall_last_tail = cur_tail; + spin_unlock_irqrestore(&queue->tx_ptr_lock, flags); + + if (stalled) { + netdev_warn_once(bp->dev, + "TX stall detected on queue %u (tail=%u head=%u); re-kicking TSTART\n", + (unsigned int)(queue - bp->queues), + cur_tail, cur_head); + macb_tx_restart(queue); + } + + schedule_delayed_work(&queue->tx_stall_watchdog_work, + msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS)); +} + static void macb_hresp_error_task(struct work_struct *work) { struct macb *bp = from_work(bp, work, hresp_err_bh_work); @@ -3190,6 +3243,9 @@ static int macb_open(struct net_device *dev) for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) { napi_enable(&queue->napi_rx); napi_enable(&queue->napi_tx); + queue->tx_stall_last_tail = queue->tx_tail; + schedule_delayed_work(&queue->tx_stall_watchdog_work, + msecs_to_jiffies(MACB_TX_STALL_INTERVAL_MS)); } macb_init_hw(bp); @@ -3240,6 +3296,7 @@ static int macb_close(struct net_device *dev) for (q = 0, queue = bp->queues; q < bp->num_queues; ++q, ++queue) { napi_disable(&queue->napi_rx); napi_disable(&queue->napi_tx); + cancel_delayed_work_sync(&queue->tx_stall_watchdog_work); netdev_tx_reset_queue(netdev_get_tx_queue(dev, q)); } @@ -4802,6 +4859,8 @@ static int macb_init_dflt(struct platform_device *pdev) } INIT_WORK(&queue->tx_error_task, macb_tx_error_task); + INIT_DELAYED_WORK(&queue->tx_stall_watchdog_work, + macb_tx_stall_watchdog); q++; } -- 2.53.0