On Intel VMX, CR2 is not part of the VMCS guest/host state area. The CPU
does not save or restore it automatically across VM transitions, so KVM
manages it in software: before VM entry it writes vcpu->arch.cr2 into the
hardware register if it differs from the current value, and after VM exit
it reads the hardware register back into vcpu->arch.cr2.

The host CR2 is intentionally left clobbered by the guest after VM exit,
as an optimization: the expectation is that the next host page fault will
overwrite it before anything else looks at it.

That expectation is fragile. The rest of the kernel treats CR2 as an
invariant.

 - exc_page_fault() reads it at the very start of #PF handling, before
   any instruction could have updated it.
 - __show_regs() reads and prints it from die()/oops/crash paths.

Any flow that reaches a #PF handler, or that reads CR2 in an oops or
crash context, without the CPU having just taken a real host #PF, will
observe the guest's CR2 instead of the host's.

On nested setups the stale guest CR2 left in the hardware register has
the form of a kernel virtual address in the inner guest's address space,
which overlaps 1:1 with the outer-guest kernel layout. That makes the
stale value visually indistinguishable from a plausible outer-guest fault
address, which can lead to confusing oops reports whose CR2 has no
relation to the reported faulting RIP.

Fix: save the host CR2 before VM entry into a local variable. After VM
exit, compare the already-read vcpu->arch.cr2 against the saved host
value, and write the host CR2 back if the guest modified it. In the
common case where the guest did not touch CR2 this is a single register
compare with no write; the restore is placed under unlikely() because
most VM-entry/exit cycles do not involve a guest CR2 write.

The change stays within the existing noinstr region;
native_read_cr2()/native_write_cr2() are plain inline asm with no
instrumentation.
This brings VMX in line with the CR2 invariant the rest of the kernel
already relies on.

AMD SVM is not affected. On SVM, CR2 is part of the VMCB save area and
the CPU saves and restores host and guest CR2 automatically on VMRUN and
#VMEXIT. KVM's SVM code only accesses svm->vmcb->save.cr2 and never
touches the hardware CR2 register.

Signed-off-by: Konstantin Khorenko
---
 arch/x86/kvm/vmx/vmx.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a29896a9ef145..dd441b90dfd4a 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7458,6 +7458,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 					unsigned int flags)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long host_cr2;
 
 	guest_state_enter_irqoff();
 
@@ -7465,13 +7466,25 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
 	vmx_disable_fb_clear(vmx);
 
-	if (vcpu->arch.cr2 != native_read_cr2())
+	host_cr2 = native_read_cr2();
+	if (vcpu->arch.cr2 != host_cr2)
 		native_write_cr2(vcpu->arch.cr2);
 
 	vmx->fail = __vmx_vcpu_run(vmx, (unsigned long *)&vcpu->arch.regs,
 				   flags);
 
 	vcpu->arch.cr2 = native_read_cr2();
+
+	/*
+	 * Restore host CR2 if the guest modified it. The rest of the
+	 * kernel relies on CR2 holding the address of the last host
+	 * #PF; leaving the guest value there can mislead any code path
+	 * that reads CR2 without the CPU having just taken a real host
+	 * #PF (exc_page_fault(), __show_regs() from oops/crash paths,
+	 * NMI/MCE report, nested-virt corner cases, etc.).
+	 */
+	if (unlikely(vcpu->arch.cr2 != host_cr2))
+		native_write_cr2(host_cr2);
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 
 	vmx->idt_vectoring_info = 0;
-- 
2.43.0
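
The save/compare/restore sequence can be followed in isolation with a
small user-space model. This is only a sketch and not kernel code: the
register is an ordinary variable, and fake_cr2, cr2_writes, read_cr2(),
write_cr2(), run_guest(), struct vcpu and enter_exit() are all
hypothetical stand-ins invented for illustration. Only the ordering of
the reads, compares and writes is meant to mirror the patched
vmx_vcpu_enter_exit().

```c
#include <stdbool.h>

/* Hypothetical stand-in for the hardware CR2 register. */
static unsigned long fake_cr2;
/* Counts software writes so the "no write when equal" path is visible. */
static unsigned int cr2_writes;

static unsigned long read_cr2(void)
{
	return fake_cr2;
}

static void write_cr2(unsigned long val)
{
	fake_cr2 = val;
	cr2_writes++;
}

/* Hypothetical, reduced model of the per-vCPU state. */
struct vcpu {
	unsigned long cr2;
};

/*
 * Stand-in for __vmx_vcpu_run(): a guest #PF changes the register
 * directly, without going through write_cr2().
 */
static void run_guest(bool guest_faults, unsigned long guest_fault_addr)
{
	if (guest_faults)
		fake_cr2 = guest_fault_addr;
}

/* Same ordering as the patched function: save, load, run, read, restore. */
void enter_exit(struct vcpu *vcpu, bool guest_faults,
		unsigned long guest_fault_addr)
{
	unsigned long host_cr2 = read_cr2();	/* save host value */

	if (vcpu->cr2 != host_cr2)		/* load guest value */
		write_cr2(vcpu->cr2);

	run_guest(guest_faults, guest_fault_addr);

	vcpu->cr2 = read_cr2();			/* capture guest value */

	if (vcpu->cr2 != host_cr2)		/* restore host value */
		write_cr2(host_cr2);
}
```

In this model, a cycle in which the guest value equals the host value
on both entry and exit performs no write at all. Once the two values
differ, each cycle performs one write on entry and one on exit, and the
register always holds the host value again when enter_exit() returns.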