From: Ackerley Tng <ackerleytng@google.com>

Document how synchronization is used while managing guest faults centrally
so code comments can point users at a central place.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 Documentation/virt/kvm/locking.rst | 108 +++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

diff --git a/Documentation/virt/kvm/locking.rst b/Documentation/virt/kvm/locking.rst
index f12664443e913..0663ccfe0633d 100644
--- a/Documentation/virt/kvm/locking.rst
+++ b/Documentation/virt/kvm/locking.rst
@@ -339,3 +339,111 @@ time it will be set using the Dirty tracking mechanism described above.
     cpu_hotplug_lock is held, e.g. from cpufreq_boost_trigger_state(), and many
     operations need to take cpu_hotplug_lock when loading a vendor module, e.g.
     updating static calls.
+
+4. Synchronization while managing guest faults
+----------------------------------------------
+
+This section explains the intersection of these synchronization mechanisms:
+
+- ``kvm->srcu`` (for memslots)
+- ``kvm->mmu_invalidate_*`` (pending invalidations)
+- ``kvm->mn_*`` (synchronization for ``kvm->mmu_invalidate_*``)
+
+4.1 Overview
+^^^^^^^^^^^^
+
+KVM resolves guest page faults by translating the Guest Frame Number (GFN) into
+a Page Frame Number (PFN) via memslots and then populating its shadow page
+tables with the resulting mapping.
+
+While handling the guest page fault, KVM must ensure a consistent view of the
+active memslots container, so KVM takes ``srcu_read_lock(&kvm->srcu);``.
+
+Guest page fault handling can race with some request from host userspace to
+invalidate shadow page tables. These requests originate from a few places, such
+as
+
+1. MMU Notifiers: KVM registers callbacks with the kernel’s memory management
+   subsystem to know when there are changes to mappings in the host userspace
+   page tables.
+2. Memslot Updates: The host userspace VMM, such as QEMU may use the
+   ``KVM_SET_USER_MEMORY_REGION`` ioctl to add, delete, or move a memslot. KVM
+   must zap the affected shadow page tables to ensure the guest doesn't access
+   stale mappings.
+3. Memory Attribute Changes: The ``KVM_SET_MEMORY_ATTRIBUTES`` ioctl allows
+   userspace to change attributes for a range of guest memory (e.g., setting a
+   range as "private" for Confidential Computing). This also requires
+   invalidating existing shadow mappings.
+
+When such a race occurs, KVM optimistically allows the faulting logic to
+proceed, but just before committing the fault, KVM will check for a pending
+invalidation, and retry the fault process if there is a pending invalidation
+affecting the GFN where the fault occurred.
+
+4.2 Tracking pending invalidations with ``kvm->mmu_invalidate*`` fields
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A "pending invalidation" is determined using a combination of
+
+- ``kvm->mmu_invalidate_in_progress``
+- ``kvm->mmu_invalidate_range_start`` and ``kvm->mmu_invalidate_range_end``
+- ``kvm->mmu_invalidate_seq``
+
+``is_page_fault_stale()`` shows how the above fields are used to determine if
+the page fault is stale and requires a retry.
+
+To protect the above combination of fields, a lock is used, which is the
+``kvm->mmu_lock``.
+
+4.2.1 Derived information vs pending invalidations
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Generally, the result of any information derived from GFN aka page
+attribute/page metadata lookups may race with invalidations. Here are some
+examples of lookups:
+
+- ``host_pfn_mapping_level()`` uses memslot information to find the mapping
+  level of pages in host userspace page tables. If there's an invalidation, the
+  pages that were mapped would no longer be mapped and hence the mapping level
+  result would be stale.
+
+There are several ways to ensure valid results:
+
+- Check ``mmu_invalidate_retry_gfn()`` after grabbing the result, before
+  consuming it. In this case, ``mmu_lock`` doesn't need to be held during the
+  lookup, but it does need to be held while checking the MMU notifier. KVM's
+  guest page fault handling uses this option.
+- Hold ``mmu_lock`` AND ensure there is no in-progress MMU notifier invalidation
+  event for the hva. This can be done by explicit checking the MMU notifier or
+  by ensuring that KVM already has a valid mapping that covers the
+  hva. ``kvm_mmu_recover_huge_pages()`` uses this option.
+
+4.3 Further optimization: ignoring invalidations if there is no matching memslot
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Invalidation is only really required when the invalidated memory range overlaps
+with some memslot. Without a matching memslot, the invalidation request could
+actually just be ignored. Hence, KVM only updates the ``kvm->mmu_invalidate_*``
+fields and takes ``kvm->mmu_lock`` if it finds a matching memslot.
+
+This creates another problem: if memslots are updated while there is an ongoing
+invalidation, then the updates to the fields and the lock would be imbalanced.
+
+4.4 Synchronization for invalidation lock/fields: ``kvm->mn_*``
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To make sure the updates to the invalidation lock/fields are balanced, KVM has a
+further layer of synchronization. ``kvm_swap_active_memslots()`` enforces that
+changes to memslots are only committed once all pending invalidations are
+complete.
+
+In other words, ``kvm->mn_*`` ensures the following does not happen:
+
+1. Some memslot existed, causing a pending invalidation request to be recorded
+   in the ``kvm->mmu_invalidate_*`` fields
+2. Memslot got removed, so the invalidation request was never removed from the
+   ``kvm->mmu_invalidate_*`` fields.
+
+In addition, ``kvm_swap_active_memslots()`` also enforces that changes to
+memslots are complete before doing ``synchronize_srcu(&kvm->srcu)`` to make sure
+running readers of the old memslots container are done before freeing it.

-- 
2.54.0.823.g6e5bcc1fc9-goog