The i.MX8MP-based Debix Model A board experiences an interrupt storm on the ENET_EQOS IRQ (135) when connected to an EEE-enabled peer. Setting the eee-broken-1000t DT property in the PHY node solves the problem, which confirms that the issue is related to EEE. Device trees for 8 boards in the mainline kernel, including the i.MX8MP EVK, set the property, which indicates the issue is likely not limited to the Debix board, although some of those device trees may have blindly copied the property from the EVK. The IRQ is documented in the reference manual as the logical OR of 4 signals: - ENET QOS TSN LPI RX Exit Interrupt - ENET QOS TSN Host System Interrupt - ENET QOS TSN Host System RX Channel Interrupts, Logical OR of channels[4:0] - ENET QOS TSN Host System TX Channel Interrupts, Logical OR of channels[4:0] Debugging the issue showed no unmasked interrupt sources from the Host System Interrupt (GMAC_INT_STATUS), Host System RX Channel Interrupts or Host System TX Channel Interrupts (MTL_INT_STATUS, MTL_CHAN_INT_CTRL and DMA_CHAN_STATUS) that was flagged at an unexpected high rate. This leaves the LPI RX Exit Interrupt as the most likely culprit. The reference manual doesn't clearly indicate what the interrupt signal is, but from its name we can reasonably infer that it would be connected to the EQOS lpi_intr_o output. That interrupt is cleared when reading the LPI control/status register. However, its deassertion is synchronous to the RX clock domain, so it will take time to clear. It appears that it could even fail to clear at all, as in the following sequence of events: - When the PHY exits LPI mode, it restarts generating the RX clock (clk_rx_i input signal to the GMAC). - The MAC detects exit from LPI, and asserts lpi_intr_o. This triggers the ENET_EQOS interrupt. - Before the CPU has time to process the interrupt, the PHY enters LPI mode again, and stops generating the RX clock. - The CPU processes the interrupt and reads the GMAC4_LPI_CTRL_STATUS registers. This does not clear lpi_intr_o as there's no clk_rx_i. The ENET_EQOS interrupt will keep firing until the PHY resumes generating the RX clock when it eventually exits LPI mode again. As LPI exit is reported by the LPIIS bit in GMAC_INT_STATUS, the lpi_intr_o signal may not have been meant to be wired to a CPU interrupt. It can't be masked in GMAC registers, and OR'ing it to the other GMAC interrupt signals seems to be a design mistake as it makes it impossible to selectively mask the interrupt in the GIC either. Setting the STMMAC_FLAG_RX_CLK_RUNS_IN_LPI platform data flag gets rid of the interrupt storm, which confirms the above theory. The i.MX8DXL and i.MX93, which also integrate an EQOS, may also be affected, as hinted by the eee-broken-1000t property being set in the i.MX8DXL EVK and the i.MX93 Variscite SoM device trees. The reference manual of the i.MX93 indicates that the ENET_EQOS interrupt also OR's the "ENET QOS TSN LPI RX exit Interrupt", while the i.MX8DXL reference manual doesn't provide details about the ENET_EQOS interrupt. Additional testing is needed with the i.MX8DXL and i.MX93, so for now set the flag for the i.MX8MP only. The eee-broken-1000t property could possibly be removed from some of the i.MX8MP device trees, but that also require per-board testing. Suggested-by: Russell King Signed-off-by: Laurent Pinchart --- I have CC'ed authors and maintainers of the i.MX8DXL, i.MX8MP and i.MX93 device trees that set the eee-broken-1000t property for awareness. To test if the property can be dropped, you will need to - Connect the EQOS interface to an EEE-enabled peer with a 1000T link. - Drop the eee-broken-1000t property from the device tree. - Boot the board and check with `ethtool --show-eee` that EEE is active. - Check the number of interrupts received from the EQOS in /proc/interrupts. After boot on my system (with an NFS root) I have ~6000 interrupts when no interrupt storm occurs, and hundreds of thousands otherwise. - Apply this patch and check that EEE works as expected without any interrupt storm. For i.MX8DXL and i.MX93, you will need to set the STMMAC_FLAG_RX_CLK_RUNS_IN_LPI in the corresponding imx_dwmac_ops instances in drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c. --- drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c index 4268b9987237..125bf435c281 100644 --- a/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c +++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-imx.c @@ -362,8 +362,7 @@ static int imx_dwmac_probe(struct platform_device *pdev) return ret; } - if (data->flags & STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY) - plat_dat->flags |= STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY; + plat_dat->flags |= data->flags; /* Default TX Q0 to use TSO and rest TXQ for TBS */ for (int i = 1; i < plat_dat->tx_queues_to_use; i++) @@ -401,7 +400,8 @@ static struct imx_dwmac_ops imx8mp_dwmac_data = { .addr_width = 34, .mac_rgmii_txclk_auto_adj = false, .set_intf_mode = imx8mp_set_intf_mode, - .flags = STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY, + .flags = STMMAC_FLAG_HWTSTAMP_CORRECT_LATENCY + | STMMAC_FLAG_RX_CLK_RUNS_IN_LPI, }; static struct imx_dwmac_ops imx8dxl_dwmac_data = { base-commit: dcb6fa37fd7bc9c3d2b066329b0d27dedf8becaa -- Regards, Laurent Pinchart