During Windows Server 2025 hibernation, I have seen Windows' calculation
of interrupt target time drift away from the hypervisor's view of it.
This can cause Windows to program timers with a target time in the past
for events that are not due yet according to the real time source. This
then leads to interrupt storms in the guest which slow down execution to
the point where watchdogs trigger. Those manifest as bugchecks 0x9f and
0xa0 during hibernation, typically in the resume path.

To work around this problem, delay timers that get created with a target
time in the past by a tiny bit (10µs, i.e. 100 ticks of the 100ns
reference counter) to give the guest CPU time to process real work and
make forward progress, hopefully recovering its interrupt logic in the
process. While this small delay can marginally reduce accuracy of guest
timers, 10µs is close to the noise of VM entry/exit overhead (~1-2µs),
so I do not expect to see real world impact.

To still provide some visibility when this happens, add a tracepoint
that records both the target time and the current time, so that the
discrepancy between the two is clearly visible.

Signed-off-by: Alexander Graf <graf@amazon.com>
---
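For reviewers who want to check the unit arithmetic, here is a minimal
standalone sketch of the workaround described above. It is not the KVM
code: struct one_shot_plan, plan_one_shot() and the two macros are
hypothetical names made up for illustration. It only assumes what the
patch relies on, namely that the stimer count and the reference time are
both expressed in 100ns ticks.

```c
#include <assert.h>
#include <stdint.h>

/* Hyper-V reference time advances in 100ns ticks. */
#define HV_TICK_NS             100ULL
/* Hypothetical name for the patch's literal 100: 100 ticks * 100ns = 10us. */
#define STIMER_MIN_DELAY_TICKS 100ULL

/* Hypothetical result type, for illustration only. */
struct one_shot_plan {
	uint64_t count;    /* effective expiration, in 100ns ticks */
	uint64_t delay_ns; /* relative delay for a host timer, in ns */
	uint64_t lag;      /* how far in the past the request was, in ticks */
};

static struct one_shot_plan plan_one_shot(uint64_t time_now, uint64_t count)
{
	struct one_shot_plan p = { .count = count, .delay_ns = 0, .lag = 0 };

	if (time_now >= count) {
		/* Requested target is not in the future: note the lag ... */
		p.lag = time_now - count;
		/* ... and push the deadline out by the minimum delay. */
		p.count = time_now + STIMER_MIN_DELAY_TICKS;
	}

	/* The effective deadline is now in the future (u64 wraparound ignored). */
	assert(p.count > time_now);
	p.delay_ns = HV_TICK_NS * (p.count - time_now);
	return p;
}
```

For example, plan_one_shot(1000, 900) reports a lag of 100 ticks (the
target was 10µs in the past), an effective count of 1100 and a delay of
10000ns, while plan_one_shot(1000, 1500) leaves the count at 1500 and
yields a 50000ns delay. The lag is the same quantity that can be derived
from the time_now and count fields of the new tracepoint.
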
 arch/x86/kvm/hyperv.c | 22 ++++++++++++++++++----
 arch/x86/kvm/trace.h  | 26 ++++++++++++++++++++++++++
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 72b19a88a776..c41061acbcbc 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -666,13 +666,27 @@ static int stimer_start(struct kvm_vcpu_hv_stimer *stimer)
 	stimer->exp_time = stimer->count;
 	if (time_now >= stimer->count) {
 		/*
-		 * Expire timer according to Hypervisor Top-Level Functional
-		 * specification v4(15.3.1):
+		 * Hypervisor Top-Level Functional specification v4(15.3.1):
 		 * "If a one shot is enabled and the specified count is in
 		 * the past, it will expire immediately."
+		 *
+		 * However, there are cases during hibernation when Windows'
+		 * timer target time calculation can go out of sync with KVM's
+		 * view of it, causing Windows to arm timers in the past for
+		 * events that are not due yet according to the real time
+		 * source. This then leads to interrupt storms in the guest
+		 * which slow down execution to a point where watchdogs trigger.
+		 *
+		 * Instead of taking TLFS literally on what "immediately" means,
+		 * give the guest at least 10µs (100 ticks of 100ns) to process
+		 * work. While this can marginally reduce accuracy of guest
+		 * timers, 10µs is close to VM entry/exit overhead (~1-2µs).
 		 */
-		stimer_mark_pending(stimer, false);
-		return 0;
+		trace_kvm_hv_stimer_start_expired(
+					hv_stimer_to_vcpu(stimer)->vcpu_id,
+					stimer->index,
+					time_now, stimer->count);
+		stimer->count = time_now + 100; /* 100 x 100ns = 10µs */
 	}
 
 	trace_kvm_hv_stimer_start_one_shot(hv_stimer_to_vcpu(stimer)->vcpu_id,
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 57d79fd31df0..f9e69c4d9e9b 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1401,6 +1401,32 @@ TRACE_EVENT(kvm_hv_stimer_start_one_shot,
 		  __entry->count)
 );
 
+/*
+ * Tracepoint for stimer_start(one-shot timer already expired).
+ */
+TRACE_EVENT(kvm_hv_stimer_start_expired,
+	TP_PROTO(int vcpu_id, int timer_index, u64 time_now, u64 count),
+	TP_ARGS(vcpu_id, timer_index, time_now, count),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+		__field(int, timer_index)
+		__field(u64, time_now)
+		__field(u64, count)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu_id;
+		__entry->timer_index = timer_index;
+		__entry->time_now = time_now;
+		__entry->count = count;
+	),
+
+	TP_printk("vcpu_id %d timer %d time_now %llu count %llu (expired)",
+		  __entry->vcpu_id, __entry->timer_index, __entry->time_now,
+		  __entry->count)
+);
+
 /*
  * Tracepoint for stimer_timer_callback.
  */
-- 
2.47.1

