During Windows Server 2025 hibernation, I have seen Windows' calculation
of interrupt target time become skewed relative to the hypervisor's view
of it. This can cause Windows to request timer events in the past for
events that are not yet due according to the real time source. This then
leads to interrupt storms in the guest which slow down execution to the
point where watchdogs trigger. Those manifest as bugchecks 0x9f and 0xa0
during hibernation, typically in the resume path.

To work around this problem, delay timers that get created with a target
time in the past by a tiny bit (10µs, i.e. 100 ticks of the 100ns Hyper-V
reference time) to give the guest CPU time to process real work and make
forward progress, hopefully recovering its interrupt logic in the
process. While this small delay can marginally reduce accuracy of guest
timers, 10µs is within the noise of VM entry/exit overhead (~1-2 µs), so
I do not expect to see real world impact.

To still provide some level of visibility when this happens, add a trace
point that clearly shows the discrepancy between the target time and the
current time.

Signed-off-by: Alexander Graf
---
 arch/x86/kvm/hyperv.c | 22 ++++++++++++++++++----
 arch/x86/kvm/trace.h  | 26 ++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 72b19a88a776..c41061acbcbc 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -666,13 +666,27 @@ static int stimer_start(struct kvm_vcpu_hv_stimer *stimer)
 	stimer->exp_time = stimer->count;
 	if (time_now >= stimer->count) {
 		/*
-		 * Expire timer according to Hypervisor Top-Level Functional
-		 * specification v4(15.3.1):
+		 * Hypervisor Top-Level Functional specification v4(15.3.1):
 		 * "If a one shot is enabled and the specified count is in
 		 * the past, it will expire immediately."
+		 *
+		 * However, there are cases during hibernation when Windows's
+		 * interrupt count calculation can go out of sync with KVM's
+		 * view of it, causing Windows to emit timer events in the past
+		 * for events that do not fire yet according to the real time
+		 * source. This then leads to interrupt storms in the guest
+		 * which slow down execution to a point where watchdogs trigger.
+		 *
+		 * Instead of taking TLFS literally on what "immediately" means,
+		 * give the guest at least 10µs to process work. While this can
+		 * marginally reduce accuracy of guest timers, 10µs are within
+		 * the noise of VM entry/exit overhead (~1-2 µs).
 		 */
-		stimer_mark_pending(stimer, false);
-		return 0;
+		trace_kvm_hv_stimer_start_expired(
+			hv_stimer_to_vcpu(stimer)->vcpu_id,
+			stimer->index,
+			time_now, stimer->count);
+		stimer->count = time_now + 100;
 	}
 
 	trace_kvm_hv_stimer_start_one_shot(hv_stimer_to_vcpu(stimer)->vcpu_id,
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 57d79fd31df0..f9e69c4d9e9b 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1401,6 +1401,32 @@ TRACE_EVENT(kvm_hv_stimer_start_one_shot,
 		  __entry->count)
 );
 
+/*
+ * Tracepoint for stimer_start(one-shot timer already expired).
+ */
+TRACE_EVENT(kvm_hv_stimer_start_expired,
+	TP_PROTO(int vcpu_id, int timer_index, u64 time_now, u64 count),
+	TP_ARGS(vcpu_id, timer_index, time_now, count),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+		__field(int, timer_index)
+		__field(u64, time_now)
+		__field(u64, count)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu_id;
+		__entry->timer_index = timer_index;
+		__entry->time_now = time_now;
+		__entry->count = count;
+	),
+
+	TP_printk("vcpu_id %d timer %d time_now %llu count %llu (expired)",
+		  __entry->vcpu_id, __entry->timer_index, __entry->time_now,
+		  __entry->count)
+);
+
 /*
  * Tracepoint for stimer_timer_callback.
  */
-- 
2.47.1
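
For reference, the unit arithmetic behind the "+ 100" in stimer_start()
can be modelled in user space. This is only a sketch and not part of the
patch: the helper names and macros below are hypothetical, and it assumes
that the stimer count is expressed in 100ns reference-time units and that
the remaining ticks are multiplied by 100 to obtain nanoseconds for the
host timer, as the one-shot path appears to do.

```c
#include <stdint.h>

/* Assumption: Hyper-V reference time advances in 100ns ticks. */
#define HV_TICK_NS		100ULL
/* Delay applied to already-expired one-shot timers: 100 ticks = 10µs. */
#define HV_EXPIRED_DELAY_TICKS	100ULL

/*
 * Hypothetical helper (not a KVM function): return the count a one-shot
 * timer would be armed with. A count at or before time_now is pushed out
 * to time_now + 10µs; a count in the future is returned unchanged.
 */
static uint64_t hv_oneshot_effective_count(uint64_t time_now, uint64_t count)
{
	if (time_now >= count)
		return time_now + HV_EXPIRED_DELAY_TICKS;
	return count;
}

/* Hypothetical helper: remaining time until expiry, in nanoseconds. */
static uint64_t hv_oneshot_delay_ns(uint64_t time_now, uint64_t count)
{
	uint64_t eff = hv_oneshot_effective_count(time_now, count);

	return (eff - time_now) * HV_TICK_NS;
}
```

With time_now = 1000, a count of 900 (in the past) gives a delay of
100 ticks * 100ns = 10000ns, while a count of 1050 keeps its natural
delay of 50 ticks * 100ns = 5000ns.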