From: Sean Christopherson

Return -EAGAIN if userspace attempts to delete or move a memslot while
also prefaulting memory for that same memslot, i.e. force userspace to
retry instead of trying to handle the scenario entirely within KVM.
Unlike KVM_RUN, which needs to handle the scenario entirely within KVM
because userspace has come to depend on such behavior,
KVM_PRE_FAULT_MEMORY can return -EAGAIN without breaking userspace as
this scenario can't have ever worked (and there's no sane use case for
prefaulting to a memslot that's being deleted/moved).

And also unlike KVM_RUN, the prefault path doesn't naturally guarantee
forward progress.  E.g. to handle such a scenario, KVM would need to
drop and reacquire SRCU to break the deadlock between the memslot
update (synchronizes SRCU) and the prefault (waits for the memslot
update to complete).

However, dropping SRCU creates more problems, as completing the memslot
update will bump the memslot generation, which in turn will invalidate
the MMU root.  To handle that, prefaulting would need to handle pending
KVM_REQ_MMU_FREE_OBSOLETE_ROOTS requests and do kvm_mmu_reload() prior
to mapping each individual page.  I.e. to fully handle this scenario,
prefaulting would eventually need to look a lot like vcpu_enter_guest().

Given that there's no reasonable use case and practically zero risk of
breaking userspace, punt the problem to userspace and avoid adding
unnecessary complexity to the prefault path.

Note, TDX's guest_memfd post-populate path is unaffected as slots_lock
is held for the entire duration of populate(), i.e. any memslot
modifications will be fully serialized against TDX's flavor of
prefaulting.

Reported-by: Reinette Chatre
Closes: https://lore.kernel.org/all/20250519023737.30360-1-yan.y.zhao@intel.com
Debugged-by: Yan Zhao
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/mmu/mmu.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 92ff15969a36..f31fad33c423 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4653,10 +4653,16 @@ static int kvm_mmu_faultin_pfn(struct kvm_vcpu *vcpu,
 	/*
 	 * Retry the page fault if the gfn hit a memslot that is being deleted
 	 * or moved.  This ensures any existing SPTEs for the old memslot will
-	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.
+	 * be zapped before KVM inserts a new MMIO SPTE for the gfn.  Punt the
+	 * error to userspace if this is a prefault, as KVM's prefaulting ABI
+	 * doesn't need to provide the same forward progress guarantees as KVM_RUN.
 	 */
-	if (slot->flags & KVM_MEMSLOT_INVALID)
+	if (slot->flags & KVM_MEMSLOT_INVALID) {
+		if (fault->prefetch)
+			return -EAGAIN;
+
 		return RET_PF_RETRY;
+	}
 
 	if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT) {
 		/*
-- 
2.43.2
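
For context, not part of the patch: a minimal userspace sketch of how a VMM
might consume the new error, under the assumption that <linux/kvm.h> exposes
the KVM_PRE_FAULT_MEMORY vCPU ioctl and struct kvm_pre_fault_memory, and that
KVM advances 'gpa' and shrinks 'size' on partial progress.  The
prefault_range() helper and its retry policy are hypothetical, shown only to
illustrate that -EAGAIN is now a retryable error for userspace rather than
something KVM resolves internally.

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical helper: prefault [gpa, gpa + size) on an existing vCPU fd. */
static int prefault_range(int vcpu_fd, __u64 gpa, __u64 size)
{
	struct kvm_pre_fault_memory range = {
		.gpa = gpa,
		.size = size,
	};

	while (range.size) {
		if (!ioctl(vcpu_fd, KVM_PRE_FAULT_MEMORY, &range))
			continue;	/* full success: range.size is now 0 */

		if (errno == EINTR)	/* partial progress, gpa/size updated */
			continue;

		/*
		 * With this patch, EAGAIN means the memslot was being deleted
		 * or moved out from under the prefault.  Propagate it so the
		 * caller retries after its memslot update has completed
		 * instead of spinning here.
		 */
		return -errno;
	}

	return 0;
}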