From: Dongli Zhang

APICv (apic->apicv_active) can be activated or deactivated at runtime, for
instance, because of APICv inhibit reasons. Intel VMX employs different
mechanisms to virtualize the LAPIC depending on whether APICv is active.

When APICv is activated at runtime, GUEST_INTR_STATUS is used to configure
and report the current pending IRR and ISR states. Unless a specific vector
is explicitly included in EOI_EXIT_BITMAP, its EOI is not trapped to KVM;
Intel VMX automatically clears the corresponding ISR bit based on the
GUEST_INTR_STATUS.SVI field.

When APICv is deactivated at runtime, the VM_ENTRY_INTR_INFO_FIELD is used
to specify the next interrupt vector to invoke upon VM-entry, and the VMX
IDT_VECTORING_INFO_FIELD is used to report un-invoked vectors on VM-exit.
EOIs are always trapped to KVM, so software can manually clear pending ISR
bits.

There are scenarios where, with APICv activated at runtime, a guest-issued
EOI fails to clear the pending ISR bit. Taking vector 236 as an example,
here is one scenario:

1. Suppose APICv is inactive. Vector 236 is pending in the IRR.
2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR
   and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq().
3. After VM-entry, vector 236 is invoked through the guest IDT. At this
   point, the data in VM_ENTRY_INTR_INFO_FIELD is no longer valid. The
   guest interrupt handler for vector 236 runs.
4. Suppose a VM-exit occurs very early in the guest interrupt handler,
   before the EOI is issued.
5. Nothing is reported through the IDT_VECTORING_INFO_FIELD because vector
   236 has already been invoked in the guest.
6. Now, suppose APICv is activated. Before the next VM-entry, KVM calls
   kvm_vcpu_update_apicv() to activate APICv.
7. Unfortunately, GUEST_INTR_STATUS.SVI is not configured, although vector
   236 is still pending in the ISR.
8. After VM-entry, the guest finally issues the EOI for vector 236.
   However, because SVI is not configured, vector 236 is not cleared.
9. The ISR is stalled forever on vector 236.

Here is another scenario:

1. Suppose APICv is inactive. Vector 236 is pending in the IRR.
2. To handle KVM_REQ_EVENT, KVM moves vector 236 from the IRR to the ISR
   and configures the VM_ENTRY_INTR_INFO_FIELD via vmx_inject_irq().
3. A VM-exit occurs immediately after the next VM-entry, so vector 236 is
   not invoked through the guest IDT. Instead, it is saved to the
   IDT_VECTORING_INFO_FIELD during the VM-exit.
4. KVM calls kvm_queue_interrupt() to re-queue the un-invoked vector 236
   into vcpu->arch.interrupt, and KVM_REQ_EVENT is requested.
5. Now, suppose APICv is activated. Before the next VM-entry, KVM calls
   kvm_vcpu_update_apicv() to activate APICv.
6. Although APICv is now active, KVM still uses the legacy
   VM_ENTRY_INTR_INFO_FIELD to re-inject vector 236. GUEST_INTR_STATUS.SVI
   is not configured.
7. After the next VM-entry, vector 236 is invoked through the guest IDT.
   Finally, an EOI occurs. However, due to the lack of
   GUEST_INTR_STATUS.SVI configuration, vector 236 is not cleared from the
   ISR.
8. The ISR is stalled forever on vector 236.
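In both scenarios the stuck bit is directly observable in the (x)APIC ISR,
a 256-bit bitmap spread across eight 32-bit registers starting at offset
0x100. A minimal sketch of the addressing math for vector 236 (the helper
macros mirror the ones added by the selftest later in this series; the
names below are illustrative, not kernel code):

  /* Locate vector 236 in the APIC ISR register block. */
  unsigned int vec = 236;
  unsigned int reg = 0x100 + (vec / 32) * 0x10;  /* 0x100 + 7 * 0x10 = 0x170 */
  unsigned int bit = vec % 32;                   /* 236 % 32 = 12 */
  /* Vector 236 is in service iff bit 12 of the register at 0x170 is set. */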
Using QEMU as an example, vector 236 is stuck in the ISR forever:

(qemu) info lapic 1
dumping local APIC state for CPU 1
LVT0     0x00010700 active-hi edge  masked        ExtINT (vec 0)
LVT1     0x00010400 active-hi edge  masked        NMI
LVTPC    0x00000400 active-hi edge                NMI
LVTERR   0x000000fe active-hi edge                Fixed (vec 254)
LVTTHMR  0x00010000 active-hi edge  masked        Fixed (vec 0)
LVTT     0x000400ec active-hi edge  tsc-deadline  Fixed (vec 236)
Timer    DCR=0x0 (divide by 2) initial_count = 0 current_count = 0
SPIV     0x000001ff APIC enabled, focus=off, spurious vec 255
ICR      0x000000fd physical edge de-assert no-shorthand
ICR2     0x00000000 cpu 0 (X2APIC ID)
ESR      0x00000000
ISR      236
IRR      37(level) 236

The issue isn't applicable to AMD SVM, as KVM simply writes vmcb01 directly
irrespective of whether L1 (vmcb01) or L2 (vmcb02) is active (unlike VMX,
there is no need/cost to switch between VMCBs). In addition,
APICV_INHIBIT_REASON_IRQWIN ensures AMD SVM AVIC is not activated until the
last interrupt is EOI'd.

Fix the bug by configuring Intel VMX GUEST_INTR_STATUS.SVI if APICv is
activated at runtime.

Signed-off-by: Dongli Zhang
Reviewed-by: Chao Gao
Link: https://patch.msgid.link/20251110063212.34902-1-dongli.zhang@oracle.com
[sean: call out that SVM writes vmcb01 directly, tweak comment]
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/vmx.c | 9 ---------
 arch/x86/kvm/x86.c     | 7 +++++++
 2 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4cbe8c84b636..6b96f7aea20b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6937,15 +6937,6 @@ void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 	 * VM-Exit, otherwise L1 with run with a stale SVI.
 	 */
 	if (is_guest_mode(vcpu)) {
-		/*
-		 * KVM is supposed to forward intercepted L2 EOIs to L1 if VID
-		 * is enabled in vmcs12; as above, the EOIs affect L2's vAPIC.
-		 * Note, userspace can stuff state while L2 is active; assert
-		 * that VID is disabled if and only if the vCPU is in KVM_RUN
-		 * to avoid false positives if userspace is setting APIC state.
-		 */
-		WARN_ON_ONCE(vcpu->wants_to_run &&
-			     nested_cpu_has_vid(get_vmcs12(vcpu)));
 		to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
 		return;
 	}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0c6d899d53dd..ff8812f3a129 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10886,9 +10886,16 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	 * pending. At the same time, KVM_REQ_EVENT may not be set as APICv was
 	 * still active when the interrupt got accepted. Make sure
 	 * kvm_check_and_inject_events() is called to check for that.
+	 *
+	 * Update SVI when APICv gets enabled, otherwise SVI won't reflect the
+	 * highest bit in vISR and the next accelerated EOI in the guest won't
+	 * be virtualized correctly (the CPU uses SVI to determine which vISR
+	 * vector to clear).
 	 */
 	if (!apic->apicv_active)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+	else
+		kvm_apic_update_hwapic_isr(vcpu);
 
 out:
 	preempt_enable();
--
2.52.0.223.gf5cc29aaa4-goog
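For context on why SVI matters: SVI must always name the highest in-service
vector so that a virtualized EOI clears the correct ISR bit. A user-space
sketch of that scan, analogous in spirit to KVM's apic_find_highest_isr()
(names and layout here are illustrative, not the kernel implementation):

  #include <stdint.h>

  /* Find the highest vector set in a 256-bit ISR image (8 x 32-bit regs). */
  static int find_highest_vector(const uint32_t isr[8])
  {
          int reg;

          for (reg = 7; reg >= 0; reg--) {
                  if (isr[reg])
                          return reg * 32 + (31 - __builtin_clz(isr[reg]));
          }
          return -1;  /* nothing in service */
  }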
From: Dongli Zhang

If an APICv status update was pended while L2 was active, immediately
refresh vmcs01's controls instead of pending KVM_REQ_APICV_UPDATE, as
kvm_vcpu_update_apicv() only calls into vendor code if a change is
necessary. E.g. if APICv is inhibited, and then activated while L2 is
running:

  kvm_vcpu_update_apicv()
  | -> __kvm_vcpu_update_apicv()
       | -> apic->apicv_active = true
       | -> vmx_refresh_apicv_exec_ctrl()
            | -> vmx->nested.update_vmcs01_apicv_status = true
            | -> return

Then L2 exits to L1:

  __nested_vmx_vmexit()
  | -> kvm_make_request(KVM_REQ_APICV_UPDATE)

  vcpu_enter_guest(): KVM_REQ_APICV_UPDATE
  -> kvm_vcpu_update_apicv()
     | -> __kvm_vcpu_update_apicv()
          | -> return   // because if (apic->apicv_active == activate)

Reported-by: Chao Gao
Closes: https://lore.kernel.org/all/aQ2jmnN8wUYVEawF@intel.com
Fixes: 7c69661e225c ("KVM: nVMX: Defer APICv updates while L2 is active until L1 is active")
Cc: stable@vger.kernel.org
Signed-off-by: Dongli Zhang
[sean: write changelog]
Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 40777278eabb..6137e5307d0f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -19,6 +19,7 @@
 #include "trace.h"
 #include "vmx.h"
 #include "smm.h"
+#include "x86_ops.h"
 
 static bool __read_mostly enable_shadow_vmcs = 1;
 module_param_named(enable_shadow_vmcs, enable_shadow_vmcs, bool, S_IRUGO);
@@ -5165,7 +5166,7 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 
 	if (vmx->nested.update_vmcs01_apicv_status) {
 		vmx->nested.update_vmcs01_apicv_status = false;
-		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+		vmx_refresh_apicv_exec_ctrl(vcpu);
 	}
 
 	if (vmx->nested.update_vmcs01_hwapic_isr) {
--
2.52.0.223.gf5cc29aaa4-goog
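The failure above boils down to an idempotence check; a distilled,
illustrative sketch of the pattern (hypothetical types and names, not the
actual KVM code):

  /*
   * The deferred request no-ops because apicv_active already flipped while
   * L2 was running, so the vendor refresh never reaches vmcs01.
   */
  static void update_apicv(struct vcpu *vcpu, bool activate)
  {
          if (vcpu->apicv_active == activate)
                  return;  /* <-- deferred KVM_REQ_APICV_UPDATE dies here */
          vcpu->apicv_active = activate;
          refresh_apicv_exec_ctrl(vcpu);
  }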
Add a test to verify KVM correctly handles a variety of edge cases related
to APICv updates, and in particular updates that are triggered while L2 is
actively running.

Signed-off-by: Sean Christopherson
---
 tools/testing/selftests/kvm/Makefile.kvm       |   1 +
 .../testing/selftests/kvm/include/x86/apic.h   |   4 +
 .../kvm/x86/vmx_apicv_updates_test.c           | 181 ++++++++++++++++++
 3 files changed, 186 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86/vmx_apicv_updates_test.c

diff --git a/tools/testing/selftests/kvm/Makefile.kvm b/tools/testing/selftests/kvm/Makefile.kvm
index ba5c2b643efa..6f00bd8271c2 100644
--- a/tools/testing/selftests/kvm/Makefile.kvm
+++ b/tools/testing/selftests/kvm/Makefile.kvm
@@ -115,6 +115,7 @@ TEST_GEN_PROGS_x86 += x86/ucna_injection_test
 TEST_GEN_PROGS_x86 += x86/userspace_io_test
 TEST_GEN_PROGS_x86 += x86/userspace_msr_exit_test
 TEST_GEN_PROGS_x86 += x86/vmx_apic_access_test
+TEST_GEN_PROGS_x86 += x86/vmx_apicv_updates_test
 TEST_GEN_PROGS_x86 += x86/vmx_dirty_log_test
 TEST_GEN_PROGS_x86 += x86/vmx_exception_with_invalid_guest_state
 TEST_GEN_PROGS_x86 += x86/vmx_msrs_test

diff --git a/tools/testing/selftests/kvm/include/x86/apic.h b/tools/testing/selftests/kvm/include/x86/apic.h
index 80fe9f69b38d..d42a0998d868 100644
--- a/tools/testing/selftests/kvm/include/x86/apic.h
+++ b/tools/testing/selftests/kvm/include/x86/apic.h
@@ -32,6 +32,7 @@
 #define APIC_SPIV	0xF0
 #define APIC_SPIV_FOCUS_DISABLED	(1 << 9)
 #define APIC_SPIV_APIC_ENABLED	(1 << 8)
+#define APIC_ISR	0x100
 #define APIC_IRR	0x200
 #define APIC_ICR	0x300
 #define APIC_LVTCMCI	0x2f0
@@ -68,6 +69,9 @@
 #define APIC_TMCCT	0x390
 #define APIC_TDCR	0x3E0
 
+#define APIC_VECTOR_TO_BIT_NUMBER(v)	((unsigned int)(v) % 32)
+#define APIC_VECTOR_TO_REG_OFFSET(v)	((unsigned int)(v) / 32 * 0x10)
+
 void apic_disable(void);
 void xapic_enable(void);
 void x2apic_enable(void);
diff --git a/tools/testing/selftests/kvm/x86/vmx_apicv_updates_test.c b/tools/testing/selftests/kvm/x86/vmx_apicv_updates_test.c
new file mode 100644
index 000000000000..907d226fd0fd
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86/vmx_apicv_updates_test.c
@@ -0,0 +1,181 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+
+#define GOOD_IPI_VECTOR	0xe0
+#define BAD_IPI_VECTOR	0xf0
+
+static volatile int good_ipis_received;
+
+static void good_ipi_handler(struct ex_regs *regs)
+{
+	good_ipis_received++;
+}
+
+static void bad_ipi_handler(struct ex_regs *regs)
+{
+	TEST_FAIL("Received \"bad\" IPI; ICR MMIO write should have been ignored");
+}
+
+static void l2_vmcall(void)
+{
+	/*
+	 * Exit to L1. Assume all registers may be clobbered as selftests's
+	 * VM-Enter code doesn't preserve L2 GPRs.
+	 */
+	asm volatile("push %%rbp\n\t"
+		     "push %%r15\n\t"
+		     "push %%r14\n\t"
+		     "push %%r13\n\t"
+		     "push %%r12\n\t"
+		     "push %%rbx\n\t"
+		     "push %%rdx\n\t"
+		     "push %%rdi\n\t"
+		     "vmcall\n\t"
+		     "pop %%rdi\n\t"
+		     "pop %%rdx\n\t"
+		     "pop %%rbx\n\t"
+		     "pop %%r12\n\t"
+		     "pop %%r13\n\t"
+		     "pop %%r14\n\t"
+		     "pop %%r15\n\t"
+		     "pop %%rbp\n\t"
+		     ::: "rax", "rcx", "rdx", "rsi", "rdx", "r8", "r9", "r10", "r11", "memory");
+}
+
+static void l2_guest_code(void)
+{
+	x2apic_enable();
+	l2_vmcall();
+
+	xapic_enable();
+	xapic_write_reg(APIC_ID, 1 << 24);
+	l2_vmcall();
+}
+
+static void l1_guest_code(struct vmx_pages *vmx_pages)
+{
+#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+	uint32_t control;
+
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
+	GUEST_ASSERT(load_vmcs(vmx_pages));
+
+	/* Prepare the VMCS for L2 execution. */
+	prepare_vmcs(vmx_pages, l2_guest_code, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+	control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
+	control |= CPU_BASED_USE_MSR_BITMAPS;
+	vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
+
+	/* Modify APIC ID to coerce KVM into inhibiting APICv. */
+	xapic_enable();
+	xapic_write_reg(APIC_ID, 1 << 24);
+
+	/*
+	 * Generate+receive an IRQ without doing EOI to get an IRQ set in vISR
+	 * but not SVI. APICv should be inhibited due to running with a
+	 * modified APIC ID.
+	 */
+	xapic_write_reg(APIC_ICR, APIC_DEST_SELF | APIC_DM_FIXED | GOOD_IPI_VECTOR);
+	GUEST_ASSERT_EQ(xapic_read_reg(APIC_ID), 1 << 24);
+
+	/* Enable IRQs and verify the IRQ was received. */
+	sti_nop();
+	GUEST_ASSERT_EQ(good_ipis_received, 1);
+
+	/*
+	 * Run L2 to switch to x2APIC mode, which in turn will uninhibit APICv,
+	 * as KVM should force the APIC ID back to its default.
+	 */
+	GUEST_ASSERT(!vmlaunch());
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+	vmwrite(GUEST_RIP, vmreadz(GUEST_RIP) + vmreadz(VM_EXIT_INSTRUCTION_LEN));
+	GUEST_ASSERT(rdmsr(MSR_IA32_APICBASE) & MSR_IA32_APICBASE_EXTD);
+
+	/*
+	 * Scribble the APIC access page to verify KVM disabled xAPIC
+	 * virtualization in vmcs01, and to verify that KVM flushes L1's TLB
+	 * when L2 switches back to accelerated xAPIC mode.
+	 */
+	xapic_write_reg(APIC_ICR2, 0xdeadbeefu);
+	xapic_write_reg(APIC_ICR, APIC_DEST_SELF | APIC_DM_FIXED | BAD_IPI_VECTOR);
+
+	/*
+	 * Verify the IRQ is still in-service and emit an EOI to verify KVM
+	 * propagates the highest vISR vector to SVI when APICv is activated
+	 * (and does so even if APICv was uninhibited while L2 was active).
+	 */
+	GUEST_ASSERT_EQ(x2apic_read_reg(APIC_ISR + APIC_VECTOR_TO_REG_OFFSET(GOOD_IPI_VECTOR)),
+			BIT(APIC_VECTOR_TO_BIT_NUMBER(GOOD_IPI_VECTOR)));
+	x2apic_write_reg(APIC_EOI, 0);
+	GUEST_ASSERT_EQ(x2apic_read_reg(APIC_ISR + APIC_VECTOR_TO_REG_OFFSET(GOOD_IPI_VECTOR)), 0);
+
+	/*
+	 * Run L2 one more time to switch back to xAPIC mode to verify that KVM
+	 * handles the x2APIC => xAPIC transition and inhibits APICv while L2
+	 * is active.
+	 */
+	GUEST_ASSERT(!vmresume());
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+	GUEST_ASSERT(!(rdmsr(MSR_IA32_APICBASE) & MSR_IA32_APICBASE_EXTD));
+
+	xapic_write_reg(APIC_ICR, APIC_DEST_SELF | APIC_DM_FIXED | GOOD_IPI_VECTOR);
+	/* Re-enable IRQs, as VM-Exit clears RFLAGS.IF. */
+	sti_nop();
+	GUEST_ASSERT_EQ(good_ipis_received, 2);
+
+	GUEST_ASSERT_EQ(xapic_read_reg(APIC_ISR + APIC_VECTOR_TO_REG_OFFSET(GOOD_IPI_VECTOR)),
+			BIT(APIC_VECTOR_TO_BIT_NUMBER(GOOD_IPI_VECTOR)));
+	xapic_write_reg(APIC_EOI, 0);
+	GUEST_ASSERT_EQ(xapic_read_reg(APIC_ISR + APIC_VECTOR_TO_REG_OFFSET(GOOD_IPI_VECTOR)), 0);
+
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	vm_vaddr_t vmx_pages_gva;
+	struct vmx_pages *vmx;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
+
+	vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
+
+	vmx = vcpu_alloc_vmx(vm, &vmx_pages_gva);
+	prepare_virtualize_apic_accesses(vmx, vm);
+	vcpu_args_set(vcpu, 2, vmx_pages_gva);
+
+	virt_pg_map(vm, APIC_DEFAULT_GPA, APIC_DEFAULT_GPA);
+	vm_install_exception_handler(vm, BAD_IPI_VECTOR, bad_ipi_handler);
+	vm_install_exception_handler(vm, GOOD_IPI_VECTOR, good_ipi_handler);
+
+	vcpu_run(vcpu);
+	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_ABORT:
+		REPORT_GUEST_ASSERT(uc);
+		/* NOT REACHED */
+	case UCALL_DONE:
+		break;
+	default:
+		TEST_FAIL("Unexpected ucall %lu", uc.cmd);
+	}
+
+	/*
+	 * Verify at least two IRQs were injected. Unfortunately, KVM counts
+	 * re-injected IRQs (e.g. if delivering the IRQ hits an EPT violation),
+	 * so being more precise isn't possible given the current stats.
+	 */
+	TEST_ASSERT(vcpu_get_stat(vcpu, irq_injections) >= 2,
+		    "Wanted at least 2 IRQ injections, got %lu\n",
+		    vcpu_get_stat(vcpu, irq_injections));

+	kvm_vm_free(vm);
+	return 0;
+}
--
2.52.0.223.gf5cc29aaa4-goog
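As a worked example of the new helpers for the vector used by the test
(illustrative, not part of the patch):

  /*
   * GOOD_IPI_VECTOR = 0xe0 = 224:
   *   APIC_VECTOR_TO_REG_OFFSET(0xe0) = 224 / 32 * 0x10 = 0x70
   *   APIC_VECTOR_TO_BIT_NUMBER(0xe0) = 224 % 32        = 0
   * so the ISR checks read APIC_ISR + 0x70 and expect BIT(0).
   */
  _Static_assert(APIC_VECTOR_TO_REG_OFFSET(0xe0) == 0x70, "reg offset");
  _Static_assert(APIC_VECTOR_TO_BIT_NUMBER(0xe0) == 0, "bit number");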
If KVM toggles "CPU dirty logging", a.k.a. Page-Modification Logging (PML),
while L2 is active, temporarily load vmcs01 and immediately update the
relevant controls instead of deferring the update until the next nested
VM-Exit.

For PML, deferring the update is relatively straightforward, but for
several APICv related updates, deferring updates creates ordering and state
consistency problems, e.g. KVM at-large thinks APICv is enabled, but vmcs01
is still running with stale (and effectively unknown) state.

Convert PML first precisely because it's the simplest case to handle: if
something is broken with the vmcs01 <=> vmcs02 dance, then hopefully bugs
will bisect here.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c |  5 -----
 arch/x86/kvm/vmx/vmx.c    | 40 +++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/vmx.h    |  1 -
 3 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 6137e5307d0f..920a925bb46f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5152,11 +5152,6 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		vmx_set_virtual_apic_mode(vcpu);
 	}
 
-	if (vmx->nested.update_vmcs01_cpu_dirty_logging) {
-		vmx->nested.update_vmcs01_cpu_dirty_logging = false;
-		vmx_update_cpu_dirty_logging(vcpu);
-	}
-
 	nested_put_vmcs12_pages(vcpu);
 
 	if (vmx->nested.reload_vmcs01_apic_access_page) {

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6b96f7aea20b..1420665fbb66 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1594,6 +1594,41 @@ void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 	vmx_prepare_switch_to_host(to_vmx(vcpu));
 }
 
+static void vmx_switch_loaded_vmcs(struct kvm_vcpu *vcpu,
+				   struct loaded_vmcs *vmcs)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int cpu;
+
+	cpu = get_cpu();
+	vmx->loaded_vmcs = vmcs;
+	vmx_vcpu_load_vmcs(vcpu, cpu);
+	put_cpu();
+}
+
+static void vmx_load_vmcs01(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!is_guest_mode(vcpu)) {
+		WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->vmcs01);
+		return;
+	}
+
+	WARN_ON_ONCE(vmx->loaded_vmcs != &vmx->nested.vmcs02);
+	vmx_switch_loaded_vmcs(vcpu, &vmx->vmcs01);
+}
+
+static void vmx_put_vmcs01(struct kvm_vcpu *vcpu)
+{
+	if (!is_guest_mode(vcpu))
+		return;
+
+	vmx_switch_loaded_vmcs(vcpu, &to_vmx(vcpu)->nested.vmcs02);
+}
+
+DEFINE_GUARD(vmx_vmcs01, struct kvm_vcpu *,
+	     vmx_load_vmcs01(_T), vmx_put_vmcs01(_T))
+
 bool vmx_emulation_required(struct kvm_vcpu *vcpu)
 {
 	return emulate_invalid_guest_state && !vmx_guest_state_valid(vcpu);
@@ -8267,10 +8302,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 	if (WARN_ON_ONCE(!enable_pml))
 		return;
 
-	if (is_guest_mode(vcpu)) {
-		vmx->nested.update_vmcs01_cpu_dirty_logging = true;
-		return;
-	}
+	guard(vmx_vmcs01)(vcpu);
 
 	/*
 	 * Note, nr_memslots_dirty_logging can be changed concurrent with this

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bc3ed3145d7e..b44eda6225f4 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -133,7 +133,6 @@ struct nested_vmx {
 
 	bool change_vmcs01_virtual_apic_mode;
 	bool reload_vmcs01_apic_access_page;
-	bool update_vmcs01_cpu_dirty_logging;
 	bool update_vmcs01_apicv_status;
 	bool update_vmcs01_hwapic_isr;
--
2.52.0.223.gf5cc29aaa4-goog
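DEFINE_GUARD() comes from the kernel's scope-based cleanup machinery
(<linux/cleanup.h>): guard(vmx_vmcs01)(vcpu) runs vmx_load_vmcs01()
immediately and vmx_put_vmcs01() at scope exit. A simplified user-space
analogue of the pattern via __attribute__((cleanup)), for illustration only
(not the actual macro expansion):

  /* Release vmcs01 and restore vmcs02 when the guarded scope ends. */
  static void put_vmcs01(struct kvm_vcpu **vcpu)
  {
          vmx_put_vmcs01(*vcpu);
  }

  void example(struct kvm_vcpu *vcpu)
  {
          struct kvm_vcpu *g __attribute__((cleanup(put_vmcs01))) = vcpu;

          vmx_load_vmcs01(vcpu);
          /* All VMCS accesses here hit vmcs01, even on early return. */
  }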
If KVM updates L1's TPR Threshold while L2 is active, temporarily load
vmcs01 and immediately update TPR_THRESHOLD instead of deferring the update
until the next nested VM-Exit.

Deferring the TPR Threshold update is relatively straightforward, but for
several APICv related updates, deferring updates creates ordering and state
consistency problems, e.g. KVM at-large thinks APICv is enabled, but vmcs01
is still running with stale (and effectively unknown) state.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c | 4 ----
 arch/x86/kvm/vmx/vmx.c    | 7 +++----
 arch/x86/kvm/vmx/vmx.h    | 3 ---
 3 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 920a925bb46f..8efab1cf833f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2402,7 +2402,6 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 	exec_control &= ~CPU_BASED_TPR_SHADOW;
 	exec_control |= vmcs12->cpu_based_vm_exec_control;
 
-	vmx->nested.l1_tpr_threshold = -1;
 	if (exec_control & CPU_BASED_TPR_SHADOW)
 		vmcs_write32(TPR_THRESHOLD, vmcs12->tpr_threshold);
 #ifdef CONFIG_X86_64
@@ -5143,9 +5143,6 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	if (kvm_caps.has_tsc_control)
 		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
 
-	if (vmx->nested.l1_tpr_threshold != -1)
-		vmcs_write32(TPR_THRESHOLD, vmx->nested.l1_tpr_threshold);
-
 	if (vmx->nested.change_vmcs01_virtual_apic_mode) {
 		vmx->nested.change_vmcs01_virtual_apic_mode = false;
 		vmx_set_virtual_apic_mode(vcpu);

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1420665fbb66..3ee86665d8de 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6827,11 +6827,10 @@ void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	    nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
 		return;
 
+	guard(vmx_vmcs01)(vcpu);
+
 	tpr_threshold = (irr == -1 || tpr < irr) ? 0 : irr;
-	if (is_guest_mode(vcpu))
-		to_vmx(vcpu)->nested.l1_tpr_threshold = tpr_threshold;
-	else
-		vmcs_write32(TPR_THRESHOLD, tpr_threshold);
+	vmcs_write32(TPR_THRESHOLD, tpr_threshold);
 }
 
 void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 {

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index b44eda6225f4..36f48c4b39c0 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -184,9 +184,6 @@ struct nested_vmx {
 	u64 pre_vmenter_ssp;
 	u64 pre_vmenter_ssp_tbl;
 
-	/* to migrate it to L1 if L2 writes to L1's CR8 directly */
-	int l1_tpr_threshold;
-
 	u16 vpid02;
 	u16 last_vpid;
--
2.52.0.223.gf5cc29aaa4-goog
If APICv is activated while L2 is running and triggers an SVI update,
temporarily load vmcs01 and immediately update SVI instead of deferring the
update until the next nested VM-Exit. This will eventually allow killing
off kvm_apic_update_hwapic_isr(), and all of nVMX's deferred APICv updates.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c |  5 -----
 arch/x86/kvm/vmx/vmx.c    | 19 +++++++------------
 arch/x86/kvm/vmx/vmx.h    |  1 -
 3 files changed, 7 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8efab1cf833f..c2c96e4fe20e 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5160,11 +5160,6 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		vmx_refresh_apicv_exec_ctrl(vcpu);
 	}
 
-	if (vmx->nested.update_vmcs01_hwapic_isr) {
-		vmx->nested.update_vmcs01_hwapic_isr = false;
-		kvm_apic_update_hwapic_isr(vcpu);
-	}
-
 	if ((vm_exit_reason != -1) &&
 	    (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx)))
 		vmx->nested.need_vmcs12_to_shadow_sync = true;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3ee86665d8de..74a815cddd37 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6963,21 +6963,16 @@
 	u16 status;
 	u8 old;
 
-	/*
-	 * If L2 is active, defer the SVI update until vmcs01 is loaded, as SVI
-	 * is only relevant for if and only if Virtual Interrupt Delivery is
-	 * enabled in vmcs12, and if VID is enabled then L2 EOIs affect L2's
-	 * vAPIC, not L1's vAPIC. KVM must update vmcs01 on the next nested
-	 * VM-Exit, otherwise L1 with run with a stale SVI.
-	 */
-	if (is_guest_mode(vcpu)) {
-		to_vmx(vcpu)->nested.update_vmcs01_hwapic_isr = true;
-		return;
-	}
-
 	if (max_isr == -1)
 		max_isr = 0;
 
+	/*
+	 * Always update SVI in vmcs01, as SVI is only relevant for L2 if and
+	 * only if Virtual Interrupt Delivery is enabled in vmcs12, and if VID
+	 * is enabled then L2 EOIs affect L2's vAPIC, not L1's vAPIC.
+	 */
+	guard(vmx_vmcs01)(vcpu);
+
 	status = vmcs_read16(GUEST_INTR_STATUS);
 	old = status >> 8;
 	if (max_isr != old) {

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 36f48c4b39c0..53969e49d9d1 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -134,7 +134,6 @@ struct nested_vmx {
 	bool change_vmcs01_virtual_apic_mode;
 	bool reload_vmcs01_apic_access_page;
 	bool update_vmcs01_apicv_status;
-	bool update_vmcs01_hwapic_isr;
 
 	/*
	 * Enlightened VMCS has been enabled. It does not mean that L1 has to
--
2.52.0.223.gf5cc29aaa4-goog
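For reference, GUEST_INTR_STATUS is a 16-bit VMCS field that packs RVI in
the low byte and SVI in the high byte, which is what the read-modify-write
above manipulates. A small sketch of the packing (helper names are
illustrative):

  /* GUEST_INTR_STATUS layout: RVI = bits 7:0, SVI = bits 15:8. */
  static inline uint8_t guest_intr_status_svi(uint16_t status)
  {
          return status >> 8;
  }

  static inline uint16_t guest_intr_status_set_svi(uint16_t status, uint8_t svi)
  {
          return (status & 0x00ff) | ((uint16_t)svi << 8);
  }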
If APICv is (un)inhibited while L2 is running, temporarily load vmcs01 and
immediately refresh the APICv controls in vmcs01 instead of deferring the
update until the next nested VM-Exit. This all but eliminates potential
ordering issues due to vmcs01 not being synchronized with
kvm_lapic.apicv_active, e.g. where KVM _thinks_ it refreshed APICv, but
vmcs01 still contains stale state.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c | 5 -----
 arch/x86/kvm/vmx/vmx.c    | 5 +----
 arch/x86/kvm/vmx/vmx.h    | 1 -
 3 files changed, 1 insertion(+), 10 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c2c96e4fe20e..2b0702349aa1 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5155,11 +5155,6 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
 	}
 
-	if (vmx->nested.update_vmcs01_apicv_status) {
-		vmx->nested.update_vmcs01_apicv_status = false;
-		vmx_refresh_apicv_exec_ctrl(vcpu);
-	}
-
 	if ((vm_exit_reason != -1) &&
 	    (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx)))
 		vmx->nested.need_vmcs12_to_shadow_sync = true;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 74a815cddd37..90e167f296d0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4578,10 +4578,7 @@ void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
-	if (is_guest_mode(vcpu)) {
-		vmx->nested.update_vmcs01_apicv_status = true;
-		return;
-	}
+	guard(vmx_vmcs01)(vcpu);
 
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 53969e49d9d1..dfc9766a7fa3 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -133,7 +133,6 @@ struct nested_vmx {
 
 	bool change_vmcs01_virtual_apic_mode;
 	bool reload_vmcs01_apic_access_page;
-	bool update_vmcs01_apicv_status;
 
 	/*
	 * Enlightened VMCS has been enabled. It does not mean that L1 has to
--
2.52.0.223.gf5cc29aaa4-goog
If the KVM-owned APIC-access page is migrated while L2 is running,
temporarily load vmcs01 and immediately update APIC_ACCESS_ADDR instead of
deferring the update until the next nested VM-Exit. Once changing the
virtual APIC mode is converted to always do on-demand updates, all of the
"defer until vmcs01 is active" logic will be gone.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c | 5 -----
 arch/x86/kvm/vmx/vmx.c    | 7 ++-----
 arch/x86/kvm/vmx/vmx.h    | 1 -
 3 files changed, 2 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 2b0702349aa1..8196a1ac22e1 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5150,11 +5150,6 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 
 	nested_put_vmcs12_pages(vcpu);
 
-	if (vmx->nested.reload_vmcs01_apic_access_page) {
-		vmx->nested.reload_vmcs01_apic_access_page = false;
-		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
-	}
-
 	if ((vm_exit_reason != -1) &&
 	    (enable_shadow_vmcs || nested_vmx_is_evmptr12_valid(vmx)))
 		vmx->nested.need_vmcs12_to_shadow_sync = true;

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 90e167f296d0..af8ec72e8ebf 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6895,11 +6895,8 @@ void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	kvm_pfn_t pfn;
 	bool writable;
 
-	/* Defer reload until vmcs01 is the current VMCS. */
-	if (is_guest_mode(vcpu)) {
-		to_vmx(vcpu)->nested.reload_vmcs01_apic_access_page = true;
-		return;
-	}
+	/* Note, the VIRTUALIZE_APIC_ACCESSES check needs to query vmcs01. */
+	guard(vmx_vmcs01)(vcpu);
 
 	if (!(secondary_exec_controls_get(to_vmx(vcpu)) &
 	      SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index dfc9766a7fa3..078bc6fef7e6 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -133,7 +133,6 @@ struct nested_vmx {
 
 	bool change_vmcs01_virtual_apic_mode;
-	bool reload_vmcs01_apic_access_page;
 
 	/*
	 * Enlightened VMCS has been enabled.
--
2.52.0.223.gf5cc29aaa4-goog
If L1's virtual APIC mode changes while L2 is active, e.g. because L1
doesn't intercept writes to the APIC_BASE MSR and L2 changes the mode,
temporarily load vmcs01 and do all of the necessary actions instead of
deferring the update until the next nested VM-Exit.

This will help in fixing yet more issues related to updates while L2 is
active, e.g. KVM neglects to update vmcs02 MSR intercepts if vmcs01's MSR
intercepts are modified while L2 is active. Not updating x2APIC MSRs is
benign because vmcs01's settings are not factored into vmcs02's bitmap, but
deferring the x2APIC MSR updates would create a weird, inconsistent state.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/vmx/nested.c |  5 -----
 arch/x86/kvm/vmx/vmx.c    | 17 +++++++++++------
 arch/x86/kvm/vmx/vmx.h    |  2 --
 3 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 8196a1ac22e1..b99e3c80d43e 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -5143,11 +5143,6 @@ void __nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	if (kvm_caps.has_tsc_control)
 		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
 
-	if (vmx->nested.change_vmcs01_virtual_apic_mode) {
-		vmx->nested.change_vmcs01_virtual_apic_mode = false;
-		vmx_set_virtual_apic_mode(vcpu);
-	}
-
 	nested_put_vmcs12_pages(vcpu);
 
 	if ((vm_exit_reason != -1) &&

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index af8ec72e8ebf..ef8d29c677b9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6842,11 +6842,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 	    !cpu_has_vmx_virtualize_x2apic_mode())
 		return;
 
-	/* Postpone execution until vmcs01 is the current VMCS. */
-	if (is_guest_mode(vcpu)) {
-		vmx->nested.change_vmcs01_virtual_apic_mode = true;
-		return;
-	}
+	guard(vmx_vmcs01)(vcpu);
 
 	sec_exec_control = secondary_exec_controls_get(vmx);
 	sec_exec_control &= ~(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
@@ -6869,8 +6865,17 @@
 			 * only do so if its physical address has changed, but
 			 * the guest may have inserted a non-APIC mapping into
 			 * the TLB while the APIC access page was disabled.
+			 *
+			 * If L2 is active, immediately flush L1's TLB instead
+			 * of requesting a flush of the current TLB, because
+			 * the current TLB context is L2's.
 			 */
-			kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+			if (!is_guest_mode(vcpu))
+				kvm_make_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu);
+			else if (!enable_ept)
+				vpid_sync_context(to_vmx(vcpu)->vpid);
+			else if (VALID_PAGE(vcpu->arch.root_mmu.root.hpa))
+				vmx_flush_tlb_ept_root(vcpu->arch.root_mmu.root.hpa);
 		}
 		break;
 	case LAPIC_MODE_X2APIC:

diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 078bc6fef7e6..a926ce43ad40 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -131,8 +131,6 @@ struct nested_vmx {
 	 */
 	bool vmcs02_initialized;
 
-	bool change_vmcs01_virtual_apic_mode;
-
 	/*
 	 * Enlightened VMCS has been enabled. It does not mean that L1 has to
 	 * use it. However, VMX features available to L1 will be limited based
--
2.52.0.223.gf5cc29aaa4-goog

Fold the calls to .hwapic_isr_update() in kvm_apic_set_state() and
__kvm_vcpu_update_apicv() into kvm_apic_update_apicv(), as updating SVI is
directly related to updating KVM's own cache of ISR information, e.g. SVI
is more or less the APICv equivalent of highest_isr_cache.

Note, calling .hwapic_isr_update() during kvm_apic_update_apicv() has two
benign side effects. First, it adds a call during kvm_lapic_reset(), but
that's a glorified nop as the ISR has already been zeroed. Second, it
changes the order between .hwapic_isr_update() and
.apicv_post_state_restore() in kvm_apic_set_state(), but the former is
VMX-only and the latter is SVM-only, i.e. is also a glorified nop.

Signed-off-by: Sean Christopherson
---
 arch/x86/kvm/lapic.c | 21 +++++----------------
 arch/x86/kvm/lapic.h |  1 -
 arch/x86/kvm/x86.c   |  2 --
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 1597dd0b0cc6..7be4d759884c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -770,17 +770,6 @@ static inline void apic_clear_isr(int vec, struct kvm_lapic *apic)
 	}
 }
 
-void kvm_apic_update_hwapic_isr(struct kvm_vcpu *vcpu)
-{
-	struct kvm_lapic *apic = vcpu->arch.apic;
-
-	if (WARN_ON_ONCE(!lapic_in_kernel(vcpu)) || !apic->apicv_active)
-		return;
-
-	kvm_x86_call(hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
-}
-EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_apic_update_hwapic_isr);
-
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu)
 {
 	/* This may race with setting of irr in __apic_accept_irq() and
@@ -2785,10 +2774,12 @@ void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
 	 */
 	apic->irr_pending = true;
 
-	if (apic->apicv_active)
+	if (apic->apicv_active) {
 		apic->isr_count = 1;
-	else
+		kvm_x86_call(hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
+	} else {
 		apic->isr_count = count_vectors(apic->regs + APIC_ISR);
+	}
 	apic->highest_isr_cache = -1;
 }
 
@@ -3232,10 +3223,8 @@ int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s)
 		__start_apic_timer(apic, APIC_TMCCT);
 	kvm_lapic_set_reg(apic, APIC_TMCCT, 0);
 	kvm_apic_update_apicv(vcpu);
-	if (apic->apicv_active) {
+	if (apic->apicv_active)
 		kvm_x86_call(apicv_post_state_restore)(vcpu);
-		kvm_x86_call(hwapic_isr_update)(vcpu, apic_find_highest_isr(apic));
-	}
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 #ifdef CONFIG_KVM_IOAPIC

diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 282b9b7da98c..aa0a9b55dbb7 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -126,7 +126,6 @@ void kvm_apic_send_ipi(struct kvm_lapic *apic, u32 icr_low, u32 icr_high);
 int kvm_apic_set_base(struct kvm_vcpu *vcpu, u64 value, bool host_initiated);
 int kvm_apic_get_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
 int kvm_apic_set_state(struct kvm_vcpu *vcpu, struct kvm_lapic_state *s);
-void kvm_apic_update_hwapic_isr(struct kvm_vcpu *vcpu);
 int kvm_lapic_find_highest_irr(struct kvm_vcpu *vcpu);
 u64 kvm_get_lapic_tscdeadline_msr(struct kvm_vcpu *vcpu);

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ff8812f3a129..66c5edbbda2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10894,8 +10894,6 @@ void __kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	 */
 	if (!apic->apicv_active)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
-	else
-		kvm_apic_update_hwapic_isr(vcpu);
 
 out:
 	preempt_enable();
--
2.52.0.223.gf5cc29aaa4-goog
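As background for the isr_count bookkeeping folded above: with APICv
active, hardware maintains the ISR, so KVM pins isr_count to 1; with APICv
inactive, KVM counts the bits set across the eight 32-bit ISR registers. A
user-space sketch mirroring the intent of count_vectors() (illustrative,
not the kernel implementation):

  #include <stdint.h>

  /* Count the set bits across a 256-bit APIC register image. */
  static int count_vectors(const uint32_t regs[8])
  {
          int i, count = 0;

          for (i = 0; i < 8; i++)
                  count += __builtin_popcount(regs[i]);
          return count;
  }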