irqfd_resampler_shutdown() calls synchronize_srcu_expedited() while holding
kvm->irqfds.resampler_lock. This can deadlock when multiple irqfd_shutdown
workers run concurrently on the kvm-irqfd-cleanup workqueue during VM
teardown (e.g. crosvm shutdown on Android):

  CPU A (mutex holder)                  CPU B/C/D (mutex waiters)

  irqfd_shutdown()                      irqfd_shutdown()
    irqfd_resampler_shutdown()            irqfd_resampler_shutdown()
      mutex_lock(resampler_lock) <----      mutex_lock(resampler_lock) // BLOCKED
      list_del_rcu(...)                     ...blocked...
      synchronize_srcu_expedited()          // Waiters block workqueue,
      // waits for SRCU grace               // preventing SRCU grace
      // period which requires              // period from completing
      // workqueue progress

                        --- DEADLOCK ---

The synchronize_srcu_expedited() in the else branch is called directly with
the mutex held. In the if-last branch, kvm_unregister_irq_ack_notifier()
also calls synchronize_srcu_expedited() internally. Both paths can block
indefinitely because:

1. synchronize_srcu_expedited() waits for an SRCU grace period
2. SRCU grace period completion needs workqueue workers to run
3. The blocked mutex waiters occupy workqueue slots, preventing progress
4. The mutex holder never releases the lock -> deadlock

Fix by performing all list manipulations and the last-entry check under the
mutex, then releasing the mutex before the SRCU synchronization. This is
safe because:

- list_del_rcu() removes the irqfd from resampler->list under the mutex,
  so no concurrent reader or writer can access it.
- When last==true, list_del_rcu(&resampler->link) has already removed the
  resampler from kvm->irqfds.resampler_list under the mutex, so no other
  worker can find or operate on this resampler.
- kvm_unregister_irq_ack_notifier() uses its own locking (kvm->irq_lock)
  and is safe to call without resampler_lock.
- synchronize_srcu_expedited() does not require any KVM mutex.
- kfree(resampler) is safe after SRCU sync guarantees all readers have
  finished.
Signed-off-by: Sonam Sanju
---
 virt/kvm/eventfd.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..27bcf2b1a81d 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -93,6 +93,7 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 {
 	struct kvm_kernel_irqfd_resampler *resampler = irqfd->resampler;
 	struct kvm *kvm = resampler->kvm;
+	bool last = false;
 
 	mutex_lock(&kvm->irqfds.resampler_lock);
 
@@ -100,19 +101,27 @@ irqfd_resampler_shutdown(struct kvm_kernel_irqfd *irqfd)
 
 	if (list_empty(&resampler->list)) {
 		list_del_rcu(&resampler->link);
+		last = true;
+	}
+
+	mutex_unlock(&kvm->irqfds.resampler_lock);
+
+	/*
+	 * synchronize_srcu_expedited() (called explicitly below, or internally
+	 * by kvm_unregister_irq_ack_notifier()) must not be invoked under
+	 * resampler_lock. Holding the mutex while waiting for an SRCU grace
+	 * period creates a deadlock: the blocked mutex waiters occupy workqueue
+	 * slots that the SRCU grace period machinery needs to make forward
+	 * progress.
+	 */
+	if (last) {
 		kvm_unregister_irq_ack_notifier(kvm, &resampler->notifier);
-		/*
-		 * synchronize_srcu_expedited(&kvm->irq_srcu) already called
-		 * in kvm_unregister_irq_ack_notifier().
-		 */
 		kvm_set_irq(kvm, KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
 			    resampler->notifier.gsi, 0, false);
 		kfree(resampler);
 	} else {
 		synchronize_srcu_expedited(&kvm->irq_srcu);
 	}
-
-	mutex_unlock(&kvm->irqfds.resampler_lock);
 }
 
 /*
-- 
2.34.1