Add hmm_range_fault_unlockable(), a new HMM entry point that allows the
mmap read lock to be dropped during page faults. This follows the
int *locked pattern from get_user_pages_remote() in mm/gup.c: callers
pass an int *locked variable indicating they can handle the lock being
dropped.

When locked is non-NULL, hmm_vma_fault() adds FAULT_FLAG_ALLOW_RETRY and
FAULT_FLAG_KILLABLE to the fault flags passed to handle_mm_fault(). If
the fault handler drops the mmap lock (returning VM_FAULT_RETRY or
VM_FAULT_COMPLETED), the function sets *locked = 0 and returns 0,
signalling the caller to restart its walk with a fresh notifier
sequence. If a fatal signal is pending at that point, -EINTR is returned
instead, matching GUP behavior. The caller is responsible for
re-acquiring the lock and restarting from the beginning, since
previously collected PFNs may be stale after the lock was dropped.

The existing hmm_range_fault() is refactored into a thin wrapper that
calls hmm_range_fault_unlockable(range, NULL). Passing NULL means
FAULT_FLAG_ALLOW_RETRY is never set, preserving existing behavior for
all current callers with no functional change.

Faulting hugetlb pages is not supported on the unlockable path: if a
hugetlb page requires faulting, -EFAULT is returned. This is because
walk_hugetlb_range() holds hugetlb_vma_lock_read across the callback and
unconditionally unlocks on return; if the mmap lock is dropped inside
the callback the VMA may be freed, making the walk framework's unlock a
use-after-free. Hugetlb pages already present in page tables are handled
normally.

Documentation/mm/hmm.rst is updated with a new section describing the
unlockable API, its usage pattern, and the hugetlb limitation.

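The caller-side contract described above can be modelled in plain
userspace C. This is only a sketch under stated assumptions: struct
fake_fault, fake_fault_unlockable(), classify() and populate() are
hypothetical stand-ins invented for illustration, not kernel code, and
no real locking is performed. It shows the three outcomes a caller has
to distinguish: success with the lock held, restart (lock dropped or
-EBUSY), and failure (any other error, including -EINTR).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fault path; not the kernel implementation. */
struct fake_fault {
	int drops_left;	/* calls that "drop the lock" before succeeding */
	int busy_left;	/* calls that report -EBUSY (invalidation) */
	bool fatal;	/* models fatal_signal_pending(current) */
};

static int fake_fault_unlockable(struct fake_fault *f, int *locked)
{
	if (f->busy_left > 0) {
		f->busy_left--;
		return -EBUSY;		/* lock is still held */
	}
	if (locked && f->drops_left > 0) {
		f->drops_left--;
		*locked = 0;		/* fault handler dropped the lock */
		return f->fatal ? -EINTR : 0;
	}
	return 0;			/* PFNs collected, lock still held */
}

enum next_step { STEP_DONE, STEP_RESTART, STEP_FAIL };

/* Decide what the caller does after one call returns. */
static enum next_step classify(int ret, int locked)
{
	if (ret == -EBUSY)
		return STEP_RESTART;
	if (ret)
		return STEP_FAIL;
	return locked ? STEP_DONE : STEP_RESTART;
}

/* Returns the number of attempts on success, or a negative errno. */
static int populate(struct fake_fault *f)
{
	int attempts = 0;

	for (;;) {
		int locked = 1;	/* a real caller takes the mmap read lock here */
		int ret = fake_fault_unlockable(f, &locked);

		attempts++;
		/* a real caller unlocks here only if locked is still 1 */
		switch (classify(ret, locked)) {
		case STEP_DONE:
			return attempts;
		case STEP_FAIL:
			return ret;
		case STEP_RESTART:
			break;	/* leave the switch and loop again */
		}
	}
}
```

In this model errors are checked before the lock state, so an -EINTR
returned with *locked == 0 ends the loop instead of being treated as a
restart.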
Signed-off-by: Stanislav Kinsburskii
---
 Documentation/mm/hmm.rst | 89 +++++++++++++++++++++++++++++++++++++++++++++
 include/linux/hmm.h      |  1 +
 mm/hmm.c                 | 91 +++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 172 insertions(+), 9 deletions(-)

diff --git a/Documentation/mm/hmm.rst b/Documentation/mm/hmm.rst
index 7d61b7a8b65b7..13874b4dfd5f4 100644
--- a/Documentation/mm/hmm.rst
+++ b/Documentation/mm/hmm.rst
@@ -208,6 +208,95 @@ invalidate() callback. That lock must be held before calling
 mmu_interval_read_retry() to avoid any race with a concurrent CPU page table
 update.
 
+Scalable lock-drop support (hmm_range_fault_unlockable)
+=======================================================
+
+Some page fault handlers (e.g., userfaultfd) require the mmap lock to be
+dropped during fault resolution. Drivers that need to support such mappings
+can use::
+
+  int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
+
+This follows the same ``int *locked`` pattern used by ``get_user_pages_remote()``
+in ``mm/gup.c``. The caller sets ``*locked = 1`` and holds the mmap read lock
+before calling. If the lock is dropped during the fault (VM_FAULT_RETRY or
+VM_FAULT_COMPLETED), the function returns 0 with ``*locked = 0``, signalling
+the caller to restart its walk with a fresh notifier sequence. The caller is
+responsible for re-acquiring the lock and restarting from the beginning, since
+previously collected PFNs may be stale.
+
+The usage pattern is::
+
+ int driver_populate_range_unlockable(...)
+ {
+      struct hmm_range range;
+      int locked;
+      ...
+
+      range.notifier = &interval_sub;
+      range.start = ...;
+      range.end = ...;
+      range.hmm_pfns = ...;
+
+      if (!mmget_not_zero(interval_sub->notifier.mm))
+          return -EFAULT;
+
+ again:
+      range.notifier_seq = mmu_interval_read_begin(&interval_sub);
+      locked = 1;
+      mmap_read_lock(mm);
+      ret = hmm_range_fault_unlockable(&range, &locked);
+      if (locked)
+          mmap_read_unlock(mm);
+      if (ret) {
+          if (ret == -EBUSY)
+              goto again;
+          return ret;
+      }
+      if (!locked)
+          goto again;
+
+      take_lock(driver->update);
+      if (mmu_interval_read_retry(&interval_sub, range.notifier_seq)) {
+          release_lock(driver->update);
+          goto again;
+      }
+
+      /* Use pfns array content to update device page table,
+       * under the update lock */
+
+      release_lock(driver->update);
+      return 0;
+ }
+
+Passing ``locked = NULL`` to ``hmm_range_fault_unlockable()`` is equivalent to
+calling ``hmm_range_fault()``: the lock will never be dropped.
+
+Note: hugetlb pages are not supported with the unlockable path. If a hugetlb
+page requires faulting during an ``hmm_range_fault_unlockable()`` call,
+``-EFAULT`` is returned. Hugetlb pages that are already present in page tables
+are handled normally.
+
+This limitation exists because ``walk_hugetlb_range()`` in the page walk
+framework holds ``hugetlb_vma_lock_read`` across the callback and unconditionally
+unlocks on return. If the mmap lock is dropped inside the callback (via
+VM_FAULT_RETRY), the VMA may be freed before the walk framework's unlock,
+resulting in a use-after-free. Possible approaches to lift this limitation in
+the future:
+
+1. Extend the walk framework to allow callbacks to signal that the hugetlb vma
+   lock was dropped (e.g., a flag in ``struct mm_walk`` that tells
+   ``walk_hugetlb_range()`` to skip the unlock).
+
+2. Bypass ``walk_page_range()`` for hugetlb pages in the unlockable path and
+   walk hugetlb page tables directly with custom lock management (similar to
+   how GUP handles hugetlb without the walk framework).
+
+3. Re-acquire the mmap lock before returning from the hugetlb callback (like
+   ``fixup_user_fault()``), ensuring the VMA remains valid for the walk
+   framework's unlock. This changes the "never re-take" contract and would
+   require callers to handle hugetlb differently.
+
 Leverage default_flags and pfn_flags_mask
 =========================================
 
diff --git a/include/linux/hmm.h b/include/linux/hmm.h
index db75ffc949a7a..46e581865c48a 100644
--- a/include/linux/hmm.h
+++ b/include/linux/hmm.h
@@ -123,6 +123,7 @@ struct hmm_range {
  * Please see Documentation/mm/hmm.rst for how to use the range API.
  */
 int hmm_range_fault(struct hmm_range *range);
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked);
 
 /*
  * HMM_RANGE_DEFAULT_TIMEOUT - default timeout (ms) when waiting for a range
diff --git a/mm/hmm.c b/mm/hmm.c
index 5955f2f0c83db..9bf2fa37f2efd 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -33,6 +33,7 @@
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
+	int			*locked;
 };
 
 enum {
@@ -86,10 +87,28 @@ static int hmm_vma_fault(unsigned long addr, unsigned long end,
 		fault_flags |= FAULT_FLAG_WRITE;
 	}
 
-	for (; addr < end; addr += PAGE_SIZE)
-		if (handle_mm_fault(vma, addr, fault_flags, NULL) &
-		    VM_FAULT_ERROR)
+	if (hmm_vma_walk->locked)
+		fault_flags |= FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+
+	for (; addr < end; addr += PAGE_SIZE) {
+		vm_fault_t ret;
+
+		ret = handle_mm_fault(vma, addr, fault_flags, NULL);
+
+		if (ret & (VM_FAULT_RETRY | VM_FAULT_COMPLETED)) {
+			/*
+			 * The mmap lock has been dropped by the fault handler.
+			 * Record the failing address and signal lock-drop to
+			 * the caller.
+			 */
+			*hmm_vma_walk->locked = 0;
+			hmm_vma_walk->last = addr;
+			return -EAGAIN;
+		}
+
+		if (ret & VM_FAULT_ERROR)
 			return -EFAULT;
+	}
 
 	return -EBUSY;
 }
@@ -566,6 +585,17 @@ static int hmm_vma_walk_hugetlb_entry(pte_t *pte, unsigned long hmask,
 	if (required_fault) {
 		int ret;
 
+		/*
+		 * Faulting hugetlb pages on the unlockable path is not
+		 * supported. The walk framework holds hugetlb_vma_lock_read
+		 * which must be dropped before handle_mm_fault, but if the
+		 * mmap lock is also dropped (VM_FAULT_RETRY), the vma may
+		 * be freed and the walk framework's unconditional unlock
+		 * becomes a use-after-free.
+		 */
+		if (hmm_vma_walk->locked)
+			return -EFAULT;
+
 		spin_unlock(ptl);
 		hugetlb_vma_unlock_read(vma);
 		/*
@@ -655,14 +685,49 @@ static const struct mm_walk_ops hmm_walk_ops = {
  *
  * This is similar to get_user_pages(), except that it can read the page tables
  * without mutating them (ie causing faults).
+ *
+ * The mmap lock must be held by the caller and will remain held on return.
+ * For a variant that allows the mmap lock to be dropped during faults (e.g.,
+ * for userfaultfd support), see hmm_range_fault_unlockable().
  */
 int hmm_range_fault(struct hmm_range *range)
 {
+	return hmm_range_fault_unlockable(range, NULL);
+}
+EXPORT_SYMBOL(hmm_range_fault);
+
+/**
+ * hmm_range_fault_unlockable - fault a range with mmap lock-drop support
+ * @range:	argument structure
+ * @locked:	pointer to lock state variable (input: 1; output: 0 if lock
+ *		was dropped)
+ *
+ * Similar to hmm_range_fault() but allows the mmap lock to be dropped during
+ * page faults. This enables support for userfaultfd-backed mappings and other
+ * cases where handle_mm_fault() may need to release the mmap lock.
+ *
+ * The caller must hold the mmap read lock and set *locked = 1 before calling.
+ * On return:
+ * - *locked == 1: mmap lock is still held, return value has normal semantics
+ * - *locked == 0: mmap lock was dropped. The caller must re-acquire the lock
+ *   and restart the operation. Return value is 0 or -EINTR in this case.
+ *
+ * When the lock is dropped by the fault handler, this function does not
+ * re-acquire it or retry the fault itself; it returns immediately with
+ * *locked == 0. If a fatal signal is pending at that point, -EINTR is
+ * returned instead of 0.
+ *
+ * Returns 0 on success or a negative error code. See hmm_range_fault() for
+ * the full list of possible errors.
+ */
+int hmm_range_fault_unlockable(struct hmm_range *range, int *locked)
+{
+	struct mm_struct *mm = range->notifier->mm;
 	struct hmm_vma_walk hmm_vma_walk = {
 		.range = range,
 		.last = range->start,
+		.locked = locked,
 	};
-	struct mm_struct *mm = range->notifier->mm;
 	int ret;
 
 	mmap_assert_locked(mm);
@@ -674,16 +739,24 @@ int hmm_range_fault(struct hmm_range *range)
 			return -EBUSY;
 		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
 				      &hmm_walk_ops, &hmm_vma_walk);
+		if (ret == -EAGAIN) {
+			/*
+			 * The mmap lock was dropped during the fault
+			 * (e.g. userfaultfd). Signal the caller to restart
+			 * by returning with *locked = 0.
+			 */
+			if (fatal_signal_pending(current))
+				return -EINTR;
+			return 0;
+		}
 		/*
-		 * When -EBUSY is returned the loop restarts with
-		 * hmm_vma_walk.last set to an address that has not been stored
-		 * in pfns. All entries < last in the pfn array are set to their
-		 * output, and all >= are still at their input values.
+		 * -EBUSY: page table changed during the walk.
+		 * Restart from hmm_vma_walk.last.
 		 */
 	} while (ret == -EBUSY);
 	return ret;
 }
-EXPORT_SYMBOL(hmm_range_fault);
+EXPORT_SYMBOL(hmm_range_fault_unlockable);
 
 /**
  * hmm_dma_map_alloc - Allocate HMM map structure

Convert the mshv driver's HMM fault path to use
hmm_range_fault_unlockable() instead of hmm_range_fault(). This enables
userfaultfd-backed guest memory regions by allowing the mmap lock to be
dropped during page fault handling.

Extract the per-VMA walk into a dedicated mshv_region_hmm_fault_walk()
helper.
The outer mshv_region_hmm_fault_and_lock() handles the do/while restart
loop: if the lock is dropped during a fault (userfaultfd resolution or
similar) or an invalidation occurs (-EBUSY), the function restarts the
entire walk from the beginning with a fresh notifier_seq, since the VMA
layout may have changed.

Signed-off-by: Stanislav Kinsburskii
---
 drivers/hv/mshv_regions.c | 127 +++++++++++++++++++++++++++++++--------------
 1 file changed, 87 insertions(+), 40 deletions(-)

diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index d09940e88298e..05665446ca6d9 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -565,6 +565,75 @@ int mshv_region_get(struct mshv_region *region)
 	return kref_get_unless_zero(&region->mreg_refcount);
 }
 
+/**
+ * mshv_region_hmm_fault_walk - Walk VMAs and fault in pages for a range
+ * @region  : Pointer to the memory region structure
+ * @range   : HMM range structure (caller sets notifier and notifier_seq)
+ * @start   : Starting virtual address of the range to fault (inclusive)
+ * @end     : Ending virtual address of the range to fault (exclusive)
+ * @pfns    : Output array for page frame numbers with HMM flags
+ * @locked  : Pointer to lock state; set to 0 if mmap lock was dropped
+ * @do_fault: If true, fault in missing pages; if false, snapshot only
+ *
+ * Iterates through VMAs covering [start, end), collecting page frame
+ * numbers via hmm_range_fault_unlockable() for each VMA segment.
+ * When @do_fault is true, missing pages are faulted in and write faults
+ * are requested only when both the VMA and the hypervisor mapping permit
+ * writes, to avoid breaking copy-on-write semantics on read-only mappings.
+ *
+ * Return: 0 on success, negative error code on failure.
+ */
+static int mshv_region_hmm_fault_walk(struct mshv_region *region,
+				      struct hmm_range *range,
+				      unsigned long start,
+				      unsigned long end,
+				      unsigned long *pfns,
+				      int *locked,
+				      bool do_fault)
+{
+	unsigned long cur_start = start;
+	unsigned long *cur_pfns = pfns;
+
+	while (cur_start < end) {
+		struct vm_area_struct *vma;
+
+		vma = vma_lookup(range->notifier->mm, cur_start);
+		if (!vma)
+			return -EFAULT;
+
+		range->hmm_pfns = cur_pfns;
+		range->start = cur_start;
+		range->end = min(vma->vm_end, end);
+		range->default_flags = 0;
+		if (do_fault) {
+			range->default_flags = HMM_PFN_REQ_FAULT;
+			/*
+			 * Only request writable pages from HMM when
+			 * both the VMA and the hypervisor mapping allow
+			 * writes. Without this, hmm_range_fault() would
+			 * trigger COW on read-only mappings (e.g. shared
+			 * zero pages, file-backed pages), breaking
+			 * copy-on-write semantics and potentially
+			 * granting the guest write access to shared host
+			 * pages.
+			 */
+			if ((vma->vm_flags & VM_WRITE) &&
+			    (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
+				range->default_flags |= HMM_PFN_REQ_WRITE;
+		}
+
+		int ret = hmm_range_fault_unlockable(range, locked);
+
+		if (ret || !*locked)
+			return ret;
+
+		cur_start = range->end;
+		cur_pfns += (range->end - range->start) >> PAGE_SHIFT;
+	}
+
+	return 0;
+}
+
 /**
  * mshv_region_hmm_fault_and_lock - Fault in pages across VMAs and lock
  *                                  the memory region
@@ -575,11 +644,9 @@ int mshv_region_get(struct mshv_region *region)
  * @do_fault: If true, fault in missing pages; if false, snapshot only
  *            pages already present in page tables
  *
- * Iterates through VMAs covering [start, end), collecting page frame
- * numbers via hmm_range_fault() for each VMA segment. When @do_fault
- * is true, missing pages are faulted in and write faults are requested
- * only when both the VMA and the hypervisor mapping permit writes, to
- * avoid breaking copy-on-write semantics on read-only mappings.
+ * Faults in pages covering [start, end) and acquires region->mreg_mutex.
+ * If the mmap lock is dropped during the fault (e.g. by userfaultfd) or
+ * the mmu notifier sequence is invalidated, the entire walk is restarted.
  *
  * On success, returns with region->mreg_mutex held; the caller is
  * responsible for releasing it. Returns -EBUSY if the mmu notifier
@@ -597,47 +664,27 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_region *region,
 		.notifier = &region->mreg_mni,
 	};
 	struct mm_struct *mm = region->mreg_mni.mm;
+	int locked;
 	int ret;
 
-	range.notifier_seq = mmu_interval_read_begin(range.notifier);
-	mmap_read_lock(mm);
-	while (start < end) {
-		struct vm_area_struct *vma;
+	do {
+		range.notifier_seq = mmu_interval_read_begin(range.notifier);
+		locked = 1;
+		mmap_read_lock(mm);
 
-		vma = vma_lookup(mm, start);
-		if (!vma) {
-			ret = -EFAULT;
-			break;
-		}
+		ret = mshv_region_hmm_fault_walk(region, &range, start, end,
+						 pfns, &locked, do_fault);
 
-		range.hmm_pfns = pfns;
-		range.start = start;
-		range.end = min(vma->vm_end, end);
-		range.default_flags = 0;
-		if (do_fault) {
-			range.default_flags = HMM_PFN_REQ_FAULT;
-			/*
-			 * Only request writable pages from HMM when both
-			 * the VMA and the hypervisor mapping allow writes.
-			 * Without this, hmm_range_fault() would trigger
-			 * COW on read-only mappings (e.g. shared zero
-			 * pages, file-backed pages), breaking
-			 * copy-on-write semantics and potentially granting
-			 * the guest write access to shared host pages.
-			 */
-			if ((vma->vm_flags & VM_WRITE) &&
-			    (region->hv_map_flags & HV_MAP_GPA_WRITABLE))
-				range.default_flags |= HMM_PFN_REQ_WRITE;
-		}
+		if (locked)
+			mmap_read_unlock(mm);
 
-		ret = hmm_range_fault(&range);
-		if (ret)
-			break;
+		/*
+		 * If the lock was dropped (by userfaultfd or similar), restart
+		 * the entire walk with a fresh notifier_seq since the VMA layout
+		 * may have changed. Also restart on -EBUSY (invalidation).
+		 */
+	} while (ret == -EBUSY || (!ret && !locked));
 
-		start = range.end;
-		pfns += (range.end - range.start) >> PAGE_SHIFT;
-	}
-	mmap_read_unlock(mm);
 	if (ret)
 		return ret;
 

Add a selftest that exercises hmm_range_fault_unlockable() with a
userfaultfd-backed mapping. The test:

1. Creates an anonymous mmap region
2. Registers it with userfaultfd (UFFDIO_REGISTER_MODE_MISSING)
3. Spawns a handler thread that responds to page faults by filling
   pages with a known pattern (0xAB) via UFFDIO_COPY
4. Issues HMM_DMIRROR_READ_UNLOCKABLE to the test_hmm driver, which
   calls hmm_range_fault_unlockable() internally
5. Verifies the device read back the data provided by the userfaultfd
   handler

This requires changes to the test_hmm kernel module:
- New dmirror_range_fault_unlockable() that uses the new HMM API
- New dmirror_fault_unlockable() and dmirror_read_unlockable() wrappers
- New HMM_DMIRROR_READ_UNLOCKABLE ioctl (0x09)

Signed-off-by: Stanislav Kinsburskii
---
 lib/test_hmm.c                         | 122 +++++++++++++++++++++++++++++
 lib/test_hmm_uapi.h                    |   1 +
 tools/testing/selftests/mm/hmm-tests.c | 133 ++++++++++++++++++++++++++++++++
 3 files changed, 256 insertions(+)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 0964d53365e61..20b14e279a8bd 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -327,6 +327,84 @@ static int dmirror_range_fault(struct dmirror *dmirror,
 	return ret;
 }
 
+static int dmirror_range_fault_unlockable(struct dmirror *dmirror,
+					  struct hmm_range *range)
+{
+	struct mm_struct *mm = dmirror->notifier.mm;
+	unsigned long timeout =
+		jiffies + msecs_to_jiffies(HMM_RANGE_DEFAULT_TIMEOUT);
+	int locked;
+	int ret;
+
+	while (true) {
+		if (time_after(jiffies, timeout)) {
+			ret = -EBUSY;
+			goto out;
+		}
+
+		range->notifier_seq = mmu_interval_read_begin(range->notifier);
+		locked = 1;
+		mmap_read_lock(mm);
+		ret = hmm_range_fault_unlockable(range, &locked);
+		if (locked)
+			mmap_read_unlock(mm);
+		if (ret) {
+			if (ret == -EBUSY)
+				continue;
+			goto out;
+		}
+		if (!locked)
+			continue;
+
+		mutex_lock(&dmirror->mutex);
+		if (mmu_interval_read_retry(range->notifier,
+					    range->notifier_seq)) {
+			mutex_unlock(&dmirror->mutex);
+			continue;
+		}
+		break;
+	}
+
+	ret = dmirror_do_fault(dmirror, range);
+
+	mutex_unlock(&dmirror->mutex);
+out:
+	return ret;
+}
+
+static int dmirror_fault_unlockable(struct dmirror *dmirror,
+				    unsigned long start,
+				    unsigned long end, bool write)
+{
+	struct mm_struct *mm = dmirror->notifier.mm;
+	unsigned long addr;
+	unsigned long pfns[32];
+	struct hmm_range range = {
+		.notifier = &dmirror->notifier,
+		.hmm_pfns = pfns,
+		.pfn_flags_mask = 0,
+		.default_flags =
+			HMM_PFN_REQ_FAULT | (write ? HMM_PFN_REQ_WRITE : 0),
+		.dev_private_owner = dmirror->mdevice,
+	};
+	int ret = 0;
+
+	if (!mmget_not_zero(mm))
+		return 0;
+
+	for (addr = start; addr < end; addr = range.end) {
+		range.start = addr;
+		range.end = min(addr + (ARRAY_SIZE(pfns) << PAGE_SHIFT), end);
+
+		ret = dmirror_range_fault_unlockable(dmirror, &range);
+		if (ret)
+			break;
+	}
+
+	mmput(mm);
+	return ret;
+}
+
 static int dmirror_fault(struct dmirror *dmirror, unsigned long start,
 			 unsigned long end, bool write)
 {
@@ -426,6 +504,47 @@ static int dmirror_read(struct dmirror *dmirror, struct hmm_dmirror_cmd *cmd)
 	return ret;
 }
 
+static int dmirror_read_unlockable(struct dmirror *dmirror,
+				   struct hmm_dmirror_cmd *cmd)
+{
+	struct dmirror_bounce bounce;
+	unsigned long start, end;
+	unsigned long size = cmd->npages << PAGE_SHIFT;
+	int ret;
+
+	start = cmd->addr;
+	end = start + size;
+	if (end < start)
+		return -EINVAL;
+
+	ret = dmirror_bounce_init(&bounce, start, size);
+	if (ret)
+		return ret;
+
+	while (1) {
+		mutex_lock(&dmirror->mutex);
+		ret = dmirror_do_read(dmirror, start, end, &bounce);
+		mutex_unlock(&dmirror->mutex);
+		if (ret != -ENOENT)
+			break;
+
+		start = cmd->addr + (bounce.cpages << PAGE_SHIFT);
+		ret = dmirror_fault_unlockable(dmirror, start, end, false);
+		if (ret)
+			break;
+		cmd->faults++;
+	}
+
+	if (ret == 0) {
+		if (copy_to_user(u64_to_user_ptr(cmd->ptr), bounce.ptr,
+				 bounce.size))
+			ret = -EFAULT;
+	}
+	cmd->cpages = bounce.cpages;
+	dmirror_bounce_fini(&bounce);
+	return ret;
+}
+
 static int dmirror_do_write(struct dmirror *dmirror, unsigned long start,
 			    unsigned long end, struct dmirror_bounce *bounce)
 {
@@ -1537,6 +1656,9 @@ static long dmirror_fops_unlocked_ioctl(struct file *filp,
 		dmirror->flags = cmd.npages;
 		ret = 0;
 		break;
+	case HMM_DMIRROR_READ_UNLOCKABLE:
+		ret = dmirror_read_unlockable(dmirror, &cmd);
+		break;
 
 	default:
 		return -EINVAL;
diff --git a/lib/test_hmm_uapi.h b/lib/test_hmm_uapi.h
index f94c6d4573382..076df6df92275 100644
--- a/lib/test_hmm_uapi.h
+++ b/lib/test_hmm_uapi.h
@@ -38,6 +38,7 @@ struct hmm_dmirror_cmd {
 #define HMM_DMIRROR_CHECK_EXCLUSIVE	_IOWR('H', 0x06, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_RELEASE		_IOWR('H', 0x07, struct hmm_dmirror_cmd)
 #define HMM_DMIRROR_FLAGS		_IOWR('H', 0x08, struct hmm_dmirror_cmd)
+#define HMM_DMIRROR_READ_UNLOCKABLE	_IOWR('H', 0x09, struct hmm_dmirror_cmd)
 
 #define HMM_DMIRROR_FLAG_FAIL_ALLOC	(1ULL << 0)
 
diff --git a/tools/testing/selftests/mm/hmm-tests.c b/tools/testing/selftests/mm/hmm-tests.c
index e8328c89d855e..e7bf061747edd 100644
--- a/tools/testing/selftests/mm/hmm-tests.c
+++ b/tools/testing/selftests/mm/hmm-tests.c
@@ -26,6 +26,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 
 
 /*
@@ -2852,4 +2855,134 @@ TEST_F_TIMEOUT(hmm, benchmark_thp_migration, 120)
 					   &thp_results, &regular_results);
 	}
 }
+
+/*
+ * Test that HMM can fault in pages backed by userfaultfd using the
+ * hmm_range_fault_unlockable() path. This exercises the lock-drop retry
+ * logic in the HMM framework.
+ */
+struct uffd_thread_args {
+	int uffd;
+	void *page_buffer;
+	unsigned long page_size;
+};
+
+static void *uffd_handler_thread(void *arg)
+{
+	struct uffd_thread_args *args = arg;
+	struct uffd_msg msg;
+	struct uffdio_copy copy;
+	struct pollfd pollfd;
+	int ret;
+
+	pollfd.fd = args->uffd;
+	pollfd.events = POLLIN;
+
+	while (1) {
+		ret = poll(&pollfd, 1, 5000);
+		if (ret <= 0)
+			break;
+
+		ret = read(args->uffd, &msg, sizeof(msg));
+		if (ret != sizeof(msg))
+			break;
+
+		if (msg.event != UFFD_EVENT_PAGEFAULT)
+			break;
+
+		/* Fill the page with a known pattern */
+		memset(args->page_buffer, 0xAB, args->page_size);
+
+		copy.dst = msg.arg.pagefault.address & ~(args->page_size - 1);
+		copy.src = (unsigned long)args->page_buffer;
+		copy.len = args->page_size;
+		copy.mode = 0;
+		copy.copy = 0;
+
+		ret = ioctl(args->uffd, UFFDIO_COPY, &copy);
+		if (ret < 0)
+			break;
+	}
+
+	return NULL;
+}
+
+TEST_F(hmm, userfaultfd_read)
+{
+	struct hmm_buffer *buffer;
+	struct uffd_thread_args uffd_args;
+	unsigned long npages;
+	unsigned long size;
+	unsigned long i;
+	unsigned char *ptr;
+	pthread_t thread;
+	int uffd;
+	int ret;
+	struct uffdio_api api;
+	struct uffdio_register reg;
+
+	npages = 4;
+	size = npages << self->page_shift;
+
+	/* Create userfaultfd */
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd < 0)
+		SKIP(return, "userfaultfd not available");
+
+	api.api = UFFD_API;
+	api.features = 0;
+	ret = ioctl(uffd, UFFDIO_API, &api);
+	ASSERT_EQ(ret, 0);
+
+	buffer = malloc(sizeof(*buffer));
+	ASSERT_NE(buffer, NULL);
+
+	buffer->fd = -1;
+	buffer->size = size;
+	buffer->mirror = malloc(size);
+	ASSERT_NE(buffer->mirror, NULL);
+
+	/* Create anonymous mapping */
+	buffer->ptr = mmap(NULL, size,
+			   PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS,
+			   -1, 0);
+	ASSERT_NE(buffer->ptr, MAP_FAILED);
+
+	/* Register the region with userfaultfd */
+	reg.range.start = (unsigned long)buffer->ptr;
+	reg.range.len = size;
+	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &reg);
+	ASSERT_EQ(ret, 0);
+
+	/* Set up the handler thread */
+	uffd_args.uffd = uffd;
+	uffd_args.page_buffer = malloc(self->page_size);
+	ASSERT_NE(uffd_args.page_buffer, NULL);
+	uffd_args.page_size = self->page_size;
+
+	ret = pthread_create(&thread, NULL, uffd_handler_thread, &uffd_args);
+	ASSERT_EQ(ret, 0);
+
+	/*
+	 * Use the unlockable read path which allows the mmap lock to be
+	 * dropped during the fault, enabling userfaultfd resolution.
+	 */
+	ret = hmm_dmirror_cmd(self->fd, HMM_DMIRROR_READ_UNLOCKABLE,
+			      buffer, npages);
+	ASSERT_EQ(ret, 0);
+	ASSERT_EQ(buffer->cpages, npages);
+
+	/* Verify the device read the data filled by the uffd handler */
+	ptr = buffer->mirror;
+	for (i = 0; i < size; ++i)
+		ASSERT_EQ(ptr[i], (unsigned char)0xAB);
+
+	pthread_join(thread, NULL);
+	free(uffd_args.page_buffer);
+	close(uffd);
+	hmm_buffer_free(buffer);
+}
+
 TEST_HARNESS_MAIN
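
A note on the handler thread in the selftest above: it rounds the
faulting address down to a page boundary before issuing UFFDIO_COPY,
using the mask ~(page_size - 1). The sketch below isolates that
arithmetic in userspace C. page_align_down() is a hypothetical helper
name chosen for illustration, and the mask only works on the assumption
that page_size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round addr down to the start of its page.
 * Assumes page_size is a power of two, so that (page_size - 1) is a mask
 * of the in-page offset bits.
 */
static uint64_t page_align_down(uint64_t addr, uint64_t page_size)
{
	return addr & ~(page_size - 1);
}
```

With 4 KiB pages, an address such as 0x12345 maps to 0x12000, which is
the kind of value the handler would place in copy.dst.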