When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.

It is critically important that no unexpected references are held on
the folios being migrated; otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it
is relatively easy to observe migration failures when running pKVM
(which uses pin_user_pages() on crosvm's virtual address space to
resolve stage-2 page faults from the guest) on a 6.15-based Pixel 6
device, and this results in the VM terminating prematurely.

In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.

Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly
page) are deferred. For example, mlock_folio() takes an additional
reference on the target folio before placing it into a per-cpu
'folio_batch' for later processing by mlock_folio_batch(), which drops
the refcount once the operation is complete. Processing of the batches
is coupled with the LRU batch logic and can be forcefully drained with
lru_add_drain_all(), but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.

This deferred batching therefore interacts poorly with the pKVM
pinning scenario, as we can find ourselves in a situation where the
migration code fails to migrate a folio due to the elevated refcount
from the pending mlock operation.

Extend the existing LRU draining logic in
collect_longterm_unpinnable_folios() so that unpinnable mlocked folios
on the LRU also trigger a drain.

Cc: Hugh Dickins
Cc: Keir Fraser
Cc: Jason Gunthorpe
Cc: David Hildenbrand
Cc: John Hubbard
Cc: Frederick Mayle
Cc: Andrew Morton
Cc: Peter Xu
Fixes: 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
Signed-off-by: Will Deacon
---
This has been quite unpleasant to debug and, as I'm not intimately
familiar with the mm internals, I've tried to include all the relevant
details in the commit message in case there's a preferred alternative
way of solving the problem or there's a flaw in my logic.

 mm/gup.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..656835890f05 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
 			continue;
 		}
 
-		if (!folio_test_lru(folio) && drain_allow) {
+		if (drain_allow &&
+		    (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
 			lru_add_drain_all();
 			drain_allow = false;
 		}
-- 
2.51.0.rc1.167.g924127e9c0-goog
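
Aside for reviewers: below is a minimal, hypothetical userspace sketch
(not taken from crosvm; the mapping size and names are illustrative) of
the MLOCK_ONFAULT setup described in the commit message, assuming a
glibc >= 2.27 mlock2() wrapper. It only demonstrates the fault-time
mlock path that leaves folios sitting on the per-cpu batch; the actual
failure additionally requires a longterm pin (e.g. via pKVM/vfio) of
the same range.

/* Hypothetical reproducer sketch of the mapping setup, not crosvm code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;	/* illustrative 2 MiB of "guest memory" */
	void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED) {
		perror("mmap");
		return EXIT_FAILURE;
	}

	/* Lock on fault, as crosvm does for its guest-memory mapping. */
	if (mlock2(mem, len, MLOCK_ONFAULT)) {
		perror("mlock2");
		return EXIT_FAILURE;
	}

	/*
	 * The first write faults in fresh folios; the fault path mlocks
	 * each one via mlock_folio(), which defers the real work to a
	 * per-cpu folio_batch and leaves the refcount elevated until
	 * the batch is drained.
	 */
	memset(mem, 0, len);
	return 0;
}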