Like with switches the current MFD algorithm does not consider asymmetric ACS within a MFD. If any MFD function has ACS that permits P2P the spec says it can reach through the MFD internal loopback any other function in the device. For discussion let's consider a simple MFD topology like the below: -- MFD 00:1f.0 ACS != REQ_ACS_FLAGS Root 00:00.00 --|- MFD 00:1f.2 ACS != REQ_ACS_FLAGS |- MFD 00:1f.6 ACS = REQ_ACS_FLAGS This asymmetric ACS could be created using the config_acs kernel command line parameter, from quirks, or from a poorly thought out device that has ACS flags only on some functions. Since ACS is an egress property the asymmetric flags allow for 00:1f.0 to do memory acesses into 00:1f.6's BARs, but 00:1f.6 cannot reach any other function. Thus we expect an iommu_group to contain all three devices. Instead the current algorithm gives a group of [1f.0, 1f.2] and a single device group of 1f.6. The current algorithm sees the good ACS flags on 00:1f.6 and does not consider ACS on any other MFD functions. For path properties the ACS flags say that 00:1f.6 is safe to use with PASID and supports SVA as it will not have any portions of its address space routed away from the IOMMU, this part of the ACS system is working correctly. Further, if one of the MFD functions is a bridge, eg like 1f.2: -- MFD 00:1f.0 Root 00:00.00 --|- MFD 00:1f.2 Root Port --- 01:01.0 |- MFD 00:1f.6 Then the correct grouping will include 01:01.0, 00:1f.0/2/6 together in a group if there is any internal loopback within the MFD 00:1f. The current algorithm does not understand this and gives 01:01.0 it's own group even if it thinks there is an internal loopback in the MFD. Unfortunately this detail makes it hard to fix. Currently the code assumes that any MFD without an ACS cap has an internal loopback which will cause a large number of modern real systems to group in a pessimistic way. However, the PCI spec does not really support this: PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and Multi-Function Devices ACS P2P Request Redirect: must be implemented by Functions that support peer-to-peer traffic with other Functions. Meaning from a spec perspective the absence of ACS indicates the absence of internal loopback. Granted I think we are aware of older real devices that ignore this, but it seems to be the only way forward. So, rely on 6.12.1.2 and assume functions without ACS do not have internal loopback. This resolves the common issue with modern systems and MFD root ports, but it makes the ACS quirks system less used. Instead we'd want quirks that say self-loopback is actually present, not like today's quirks that say it is absent. This is surely negative for older hardware, but positive for new HW that complies with the spec. Use pci_reachable_set() in pci_device_group() to make the resulting algorithm faster and easier to understand. Add pci_mfds_are_same_group() which specifically looks pair-wise at all functions in the MFDs. Any function with ACS capabilities and non-isolated aCS flags will become reachable to all other functions. pci_reachable_set() does the calculations for figuring out the set of devices under the pci_bus_sem, which is better than repeatedly searching across all PCI devices. Once the set of devices is determined and the set has more than one device use pci_get_slot() to search for any existing groups in the reachable set. Fixes: 104a1c13ac66 ("iommu/core: Create central IOMMU group lookup/creation interface") Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommu.c | 189 +++++++++++++++++++----------------------- 1 file changed, 87 insertions(+), 102 deletions(-) diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 543d6347c0e5e3..fc3c71b243a850 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -1413,85 +1413,6 @@ int iommu_group_id(struct iommu_group *group) } EXPORT_SYMBOL_GPL(iommu_group_id); -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev, - unsigned long *devfns); - -/* - * For multifunction devices which are not isolated from each other, find - * all the other non-isolated functions and look for existing groups. For - * each function, we also need to look for aliases to or from other devices - * that may already have a group. - */ -static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev, - unsigned long *devfns) -{ - struct pci_dev *tmp = NULL; - struct iommu_group *group; - - if (!pdev->multifunction || pci_acs_enabled(pdev, PCI_ACS_ISOLATED)) - return NULL; - - for_each_pci_dev(tmp) { - if (tmp == pdev || tmp->bus != pdev->bus || - PCI_SLOT(tmp->devfn) != PCI_SLOT(pdev->devfn) || - pci_acs_enabled(tmp, PCI_ACS_ISOLATED)) - continue; - - group = get_pci_alias_group(tmp, devfns); - if (group) { - pci_dev_put(tmp); - return group; - } - } - - return NULL; -} - -/* - * Look for aliases to or from the given device for existing groups. DMA - * aliases are only supported on the same bus, therefore the search - * space is quite small (especially since we're really only looking at pcie - * device, and therefore only expect multiple slots on the root complex or - * downstream switch ports). It's conceivable though that a pair of - * multifunction devices could have aliases between them that would cause a - * loop. To prevent this, we use a bitmap to track where we've been. - */ -static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev, - unsigned long *devfns) -{ - struct pci_dev *tmp = NULL; - struct iommu_group *group; - - if (test_and_set_bit(pdev->devfn & 0xff, devfns)) - return NULL; - - group = iommu_group_get(&pdev->dev); - if (group) - return group; - - for_each_pci_dev(tmp) { - if (tmp == pdev || tmp->bus != pdev->bus) - continue; - - /* We alias them or they alias us */ - if (pci_devs_are_dma_aliases(pdev, tmp)) { - group = get_pci_alias_group(tmp, devfns); - if (group) { - pci_dev_put(tmp); - return group; - } - - group = get_pci_function_alias_group(tmp, devfns); - if (group) { - pci_dev_put(tmp); - return group; - } - } - } - - return NULL; -} - /* * Generic device_group call-back function. It just allocates one * iommu-group per device. @@ -1534,44 +1455,108 @@ static struct iommu_group *pci_group_alloc_non_isolated(void) return group; } +/* + * All functions in the MFD need to be isolated from each other and get their + * own groups, otherwise the whole MFD will share a group. + */ +static bool pci_mfds_are_same_group(struct pci_dev *deva, struct pci_dev *devb) +{ + /* + * SRIOV VFs will use the group of the PF if it has + * BUS_DATA_PCI_NON_ISOLATED. We don't support VFs that also have ACS + * that are set to non-isolating. + */ + if (deva->is_virtfn || devb->is_virtfn) + return false; + + /* Are deva/devb functions in the same MFD? */ + if (PCI_SLOT(deva->devfn) != PCI_SLOT(devb->devfn)) + return false; + /* Don't understand what is happening, be conservative */ + if (deva->multifunction != devb->multifunction) + return true; + if (!deva->multifunction) + return false; + + /* + * PCI Section 6.12.1.2 ACS Functions in SR-IOV, SIOV, and + * Multi-Function Devices + * ... + * ACS P2P Request Redirect: must be implemented by Functions that + * support peer-to-peer traffic with other Functions. + * + * Therefore assume if a MFD has no ACS capability then it does not + * support a loopback. This is the reverse of what Linux <= v6.16 + * assumed - that any MFD was capable of P2P and used quirks identify + * devices that complied with the above. + */ + if (deva->acs_cap && !pci_acs_enabled(deva, PCI_ACS_ISOLATED)) + return true; + if (devb->acs_cap && !pci_acs_enabled(devb, PCI_ACS_ISOLATED)) + return true; + return false; +} + +static bool pci_devs_are_same_group(struct pci_dev *deva, struct pci_dev *devb) +{ + /* + * This is allowed to return cycles: a,b -> b,c -> c,a can be aliases. + */ + if (pci_devs_are_dma_aliases(deva, devb)) + return true; + + return pci_mfds_are_same_group(deva, devb); +} + /* * Return a group if the function has isolation restrictions related to * aliases or MFD ACS. */ static struct iommu_group *pci_get_function_group(struct pci_dev *pdev) { - struct iommu_group *group; - DECLARE_BITMAP(devfns, 256) = {}; + struct pci_reachable_set devfns; + const unsigned int NR_DEVFNS = sizeof(devfns.devfns) * BITS_PER_BYTE; + unsigned int devfn; /* - * Look for existing groups on device aliases. If we alias another - * device or another device aliases us, use the same group. + * Look for existing groups on device aliases and multi-function ACS. If + * we alias another device or another device aliases us, use the same + * group. + * + * pci_reachable_set() should return the same bitmap if called for any + * device in the set and we want all devices in the set to have the same + * group. */ - group = get_pci_alias_group(pdev, devfns); - if (group) - return group; + pci_reachable_set(pdev, &devfns, pci_devs_are_same_group); + /* start is known to have iommu_group_get() == NULL */ + __clear_bit(pdev->devfn, devfns.devfns); /* - * Look for existing groups on non-isolated functions on the same - * slot and aliases of those funcions, if any. No need to clear - * the search bitmap, the tested devfns are still valid. - */ - group = get_pci_function_alias_group(pdev, devfns); - if (group) - return group; - - /* - * When MFD's are included in the set due to ACS we assume that if ACS - * permits an internal loopback between functions it also permits the - * loopback to go downstream if a function is a bridge. + * When MFD functions are included in the set due to ACS we assume that + * if ACS permits an internal loopback between functions it also permits + * the loopback to go downstream if any function is a bridge. * * It is less clear what aliases mean when applied to a bridge. For now * be conservative and also propagate the group downstream. */ - __clear_bit(pdev->devfn & 0xFF, devfns); - if (!bitmap_empty(devfns, sizeof(devfns) * BITS_PER_BYTE)) - return pci_group_alloc_non_isolated(); - return NULL; + if (bitmap_empty(devfns.devfns, NR_DEVFNS)) + return NULL; + + for_each_set_bit(devfn, devfns.devfns, NR_DEVFNS) { + struct iommu_group *group; + struct pci_dev *pdev_slot; + + pdev_slot = pci_get_slot(pdev->bus, devfn); + group = iommu_group_get(&pdev_slot->dev); + pci_dev_put(pdev_slot); + if (group) { + if (WARN_ON(!(group->bus_data & + BUS_DATA_PCI_NON_ISOLATED))) + group->bus_data |= BUS_DATA_PCI_NON_ISOLATED; + return group; + } + } + return pci_group_alloc_non_isolated(); } /* Return a group if the upstream hierarchy has isolation restrictions. */ -- 2.43.0