From: "Kirill A. Shutemov" Optimize the PAMT alloc/free helpers to avoid taking the global lock when possible. The recently introduced PAMT alloc/free helpers maintain a refcount to keep track of when it is ok to reclaim and free a 4KB PAMT page. This refcount is protected by a global lock in order to guarantee that races don’t result in the PAMT getting freed while another caller requests it be mapped. But a global lock is a bit heavyweight, especially since the refcounts can be (already are) updated atomically. A simple approach would be to increment/decrement the refcount outside of the lock before actually adjusting the PAMT, and only adjust the PAMT if the refcount transitions from/to 0. This would correctly allocate and free the PAMT page without getting out of sync. But there it leaves a race where a simultaneous caller could see the refcount already incremented and return before it is actually mapped. So treat the refcount 0->1 case as a special case. On add, if the refcount is zero *don’t* increment the refcount outside the lock (to 1). Always take the lock in that case and only set the refcount to 1 after the PAMT is actually added. This way simultaneous adders, when PAMT is not installed yet, will take the slow lock path. On the 1->0 case, it is ok to return from tdx_pamt_put() when the DPAMT is not actually freed yet, so the basic approach works. Just decrement the refcount before taking the lock. Only do the lock and removal of the PAMT when the refcount goes to zero. There is an asymmetry between tdx_pamt_get() and tdx_pamt_put() in that tdx_pamt_put() goes 1->0 outside the lock, but tdx_pamt_put() does 0-1 inside the lock. Because of this, there is a special race where tdx_pamt_put() could decrement the refcount to zero before the PAMT is actually removed, and tdx_pamt_get() could try to do a PAMT.ADD when the page is already mapped. Luckily the TDX module will tell return a special error that tells us we hit this case. So handle it specially by looking for the error code. The optimization is a little special, so make the code extra commented and verbose. Signed-off-by: Kirill A. 
---
 arch/x86/include/asm/shared/tdx_errno.h |  2 ++
 arch/x86/virt/vmx/tdx/tdx.c             | 46 +++++++++++++++++++++----
 2 files changed, 42 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/shared/tdx_errno.h b/arch/x86/include/asm/shared/tdx_errno.h
index 49ab7ecc7d54..4bc0b9c9e82b 100644
--- a/arch/x86/include/asm/shared/tdx_errno.h
+++ b/arch/x86/include/asm/shared/tdx_errno.h
@@ -21,6 +21,7 @@
 #define TDX_PREVIOUS_TLB_EPOCH_BUSY		0x8000020100000000ULL
 #define TDX_RND_NO_ENTROPY			0x8000020300000000ULL
 #define TDX_PAGE_METADATA_INCORRECT		0xC000030000000000ULL
+#define TDX_HPA_RANGE_NOT_FREE			0xC000030400000000ULL
 #define TDX_VCPU_NOT_ASSOCIATED		0x8000070200000000ULL
 #define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
 #define TDX_KEY_STATE_INCORRECT		0xC000081100000000ULL
@@ -100,6 +101,7 @@
 DEFINE_TDX_ERRNO_HELPER(TDX_SUCCESS);
 DEFINE_TDX_ERRNO_HELPER(TDX_RND_NO_ENTROPY);
 DEFINE_TDX_ERRNO_HELPER(TDX_OPERAND_INVALID);
 DEFINE_TDX_ERRNO_HELPER(TDX_OPERAND_BUSY);
+DEFINE_TDX_ERRNO_HELPER(TDX_HPA_RANGE_NOT_FREE);
 DEFINE_TDX_ERRNO_HELPER(TDX_VCPU_NOT_ASSOCIATED);
 DEFINE_TDX_ERRNO_HELPER(TDX_FLUSHVP_NOT_DONE);
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index af73b6c2e917..c25e238931a7 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -2117,7 +2117,7 @@ int tdx_pamt_get(struct page *page)
 	u64 pamt_pa_array[MAX_DPAMT_ARG_SIZE];
 	atomic_t *pamt_refcount;
 	u64 tdx_status;
-	int ret;
+	int ret = 0;
 
 	if (!tdx_supports_dynamic_pamt(&tdx_sysinfo))
 		return 0;
@@ -2128,14 +2128,40 @@ int tdx_pamt_get(struct page *page)
 
 	pamt_refcount = tdx_find_pamt_refcount(hpa);
 
+	if (atomic_inc_not_zero(pamt_refcount))
+		goto out_free;
+
 	scoped_guard(spinlock, &pamt_lock) {
-		if (atomic_read(pamt_refcount))
+		/*
+		 * Lost race to other tdx_pamt_add(). Other task has already allocated
+		 * PAMT memory for the HPA.
+		 */
+		if (atomic_read(pamt_refcount)) {
+			atomic_inc(pamt_refcount);
 			goto out_free;
+		}
 
 		tdx_status = tdh_phymem_pamt_add(hpa | TDX_PS_2M, pamt_pa_array);
 
 		if (IS_TDX_SUCCESS(tdx_status)) {
+			/*
+			 * The refcount is zero, and this locked path is the only way to
+			 * increase it from 0-1. If the PAMT.ADD was successful, set it
+			 * to 1 (obviously).
+			 */
+			atomic_set(pamt_refcount, 1);
+		} else if (IS_TDX_HPA_RANGE_NOT_FREE(tdx_status)) {
+			/*
+			 * Less obviously, another CPU's call to tdx_pamt_put() could have
+			 * decremented the refcount before entering its lock section.
+			 * In this case, the PAMT is not actually removed yet. Luckily
+			 * TDX module tells about this case, so increment the refcount
+			 * 0-1, so tdx_pamt_put() skips its pending PAMT.REMOVE.
+			 *
+			 * The call didn't need the pages though, so free them.
+			 */
 			atomic_inc(pamt_refcount);
+			goto out_free;
 		} else {
 			pr_err("TDH_PHYMEM_PAMT_ADD failed: %#llx\n", tdx_status);
 			goto out_free;
@@ -2167,15 +2193,23 @@ void tdx_pamt_put(struct page *page)
 
 	pamt_refcount = tdx_find_pamt_refcount(hpa);
 
+	/*
+	 * Unlike the paired call in tdx_pamt_get(), decrement the refcount
+	 * outside the lock even if it's not the special 0<->1 transition.
+	 * See special logic around HPA_RANGE_NOT_FREE in tdx_pamt_get().
+	 */
+	if (!atomic_dec_and_test(pamt_refcount))
+		return;
+
 	scoped_guard(spinlock, &pamt_lock) {
-		if (!atomic_read(pamt_refcount))
+		/* Lost race with tdx_pamt_get() */
+		if (atomic_read(pamt_refcount))
 			return;
 
 		tdx_status = tdh_phymem_pamt_remove(hpa | TDX_PS_2M, pamt_pa_array);
 
-		if (IS_TDX_SUCCESS(tdx_status)) {
-			atomic_dec(pamt_refcount);
-		} else {
+		if (!IS_TDX_SUCCESS(tdx_status)) {
+			atomic_inc(pamt_refcount);
 			pr_err("TDH_PHYMEM_PAMT_REMOVE failed: %#llx\n", tdx_status);
 			return;
 		}
-- 
2.51.0