From: "Kiryl Shutsemau (Meta)" Read-write-protect mode (UFFDIO_REGISTER_MODE_RWP) is supported starting from Linux 7.2. It traps every access -- read or write -- to a present page within a registered range. The matching UAPI consists of: - UFFDIO_REGISTER_MODE_RWP registration-mode bit - UFFD_FEATURE_RWP capability bit - UFFD_FEATURE_RWP_ASYNC async (in-kernel) fault resolution - UFFDIO_RWPROTECT install / remove RWP on a range - UFFDIO_SET_MODE runtime sync/async toggle - UFFD_PAGEFAULT_FLAG_RWP new pagefault.flags bit Document the new registration-mode entry, the "Userfaultfd read-write-protect mode" section, the new pagefault flag, and a HISTORY entry. Signed-off-by: Kiryl Shutsemau Acked-by: Mike Rapoport (Microsoft) --- man/man2/userfaultfd.2 | 174 ++++++++++++++++++++++++++++++++++++++++- 1 file changed, 170 insertions(+), 4 deletions(-) diff --git a/man/man2/userfaultfd.2 b/man/man2/userfaultfd.2 index 6d56085f1534..c395bf9bb332 100644 --- a/man/man2/userfaultfd.2 +++ b/man/man2/userfaultfd.2 @@ -111,6 +111,32 @@ .SH DESCRIPTION until user-space write-unprotects the page using an .B UFFDIO_WRITEPROTECT ioctl. +.TP +.BR UFFDIO_REGISTER_MODE_RWP " (since Linux 7.2)" +When registered with +.B UFFDIO_REGISTER_MODE_RWP +mode, +user space will receive a page-fault notification on any access +\[em]read or write\[em] +to a page present within the range. +By default, +the faulted thread will be stopped from execution +until user space removes the protection using a +.B UFFDIO_RWPROTECT +ioctl; +if +.B UFFD_FEATURE_RWP_ASYNC +was negotiated, +the kernel restores access in place +and the faulted thread continues without blocking. +.IP +.B UFFDIO_REGISTER_MODE_RWP +and +.B UFFDIO_REGISTER_MODE_WP +cannot be combined on the same range; +attempting to register with both bits set fails with +.BR EINVAL . +See the "Userfaultfd read-write-protect mode" section below. .P Multiple modes can be enabled at the same time for the same memory range. 
.P @@ -192,6 +218,24 @@ .SS Usage kicking the faulted thread to continue. For more information, please refer to the "Userfaultfd write-protect mode" section. +.P +Since Linux 7.2, +userfaultfd can do read-write-protection tracking, +which traps every access +(read or write) +to a page present within a registered range. +One should check against the feature bit +.B UFFD_FEATURE_RWP +before using this feature, +and optionally negotiate +.B UFFD_FEATURE_RWP_ASYNC +to have the kernel auto-restore page permissions on fault +without delivering a notification. +This mode is intended for working-set tracking +by VM memory managers and similar callers; +cold pages can then be evicted using independent kernel interfaces. +For more information, +please refer to the "Userfaultfd read-write-protect mode" section. .\" .SS Userfaultfd operation After the userfaultfd object is created with @@ -387,6 +431,113 @@ .SS Userfaultfd minor fault mode (since Linux 5.13) Minor fault mode supports only hugetlbfs-backed (since Linux 5.13) and shmem-backed (since Linux 5.14) memory. .\" +.SS Userfaultfd read-write-protect mode (since Linux 7.2) +Since Linux 7.2, +userfaultfd supports read-write-protect mode. +Unlike write-protect mode, +every access +\[em]read or write\[em] +to a protected page generates a userfaultfd notification. +It works on anonymous, shmem, and hugetlbfs mappings. +.P +The user needs to first check availability of this feature using the +.B UFFDIO_API +ioctl against the feature bit +.B UFFD_FEATURE_RWP +before using this mode. +See +.BR UFFDIO_API (2const) +for the recommended discovery sequence. +.P +To register with userfaultfd read-write-protect mode, +the user needs to initiate the +.B UFFDIO_REGISTER +ioctl with mode +.B UFFDIO_REGISTER_MODE_RWP +set. 
+.B UFFDIO_REGISTER_MODE_RWP +cannot be combined with +.BR UFFDIO_REGISTER_MODE_WP ; +however it can be combined with +.B UFFDIO_REGISTER_MODE_MISSING +when the caller also wants notifications for fresh page populations. +.P +After registration, +the user can read-write-protect any existing memory within the range using the +.B UFFDIO_RWPROTECT +ioctl where +.I uffdio_rwprotect.mode +is set to +.BR UFFDIO_RWPROTECT_MODE_RWP . +Read-write protection only affects pages +that are currently populated in the range; +unpopulated addresses remain unpopulated +and fall through to the normal missing-page path on first access. +.P +For anonymous mappings, +protection is preserved across page reclaim +(the marker rides on the swap entry) +and migration. +For shmem and file-backed mappings, +protection is dropped when the backing page is reclaimed +and must be re-armed by the caller. +Protection is also +.I not +preserved across operations that explicitly drop the underlying page: +.B MADV_DONTNEED +on anonymous memory, +hole-punch on shmem, +truncation of a file mapping. +Callers must re-arm the range with +.B UFFDIO_RWPROTECT +after any such operation. +.P +When an access fault happens against a protected page, +user space will receive a page-fault notification whose +.I uffd_msg.pagefault.flags +field has the +.B UFFD_PAGEFAULT_FLAG_RWP +bit set. +.P +To resolve a read-write-protect page fault, +the user initiates another +.B UFFDIO_RWPROTECT +ioctl whose +.I uffdio_rwprotect.mode +has the +.B UFFDIO_RWPROTECT_MODE_RWP +flag cleared. +This restores the original VMA permissions on the affected pages +and wakes any blocked threads +(unless +.B UFFDIO_RWPROTECT_MODE_DONTWAKE +is also set). +.P +If +.B UFFD_FEATURE_RWP_ASYNC +was negotiated alongside +.BR UFFD_FEATURE_RWP , +the kernel resolves access faults in place +without delivering a notification: +page permissions are restored automatically +and the faulting thread continues. 
+Callers can later reconstruct which pages were touched +by inspecting the +.B PAGE_IS_ACCESSED +bit returned by the +.B PAGEMAP_SCAN +ioctl described in +.BR ioctl_userfaultfd (2) +and +.IR Documentation/admin\-guide/mm/pagemap.rst +in the Linux kernel source. +.P +The async mode can be toggled at runtime using the +.B UFFDIO_SET_MODE +ioctl, +which lets a single userfaultfd switch between async detection +and synchronous eviction without re-registering the range. +.\" .SS Reading from the userfaultfd structure Each .BR read (2) @@ -531,13 +682,17 @@ .SS Reading from the userfaultfd structure .B UFFD_PAGEFAULT_FLAG_MINOR If this flag is set, then the fault was a minor fault. .TP +.BR UFFD_PAGEFAULT_FLAG_RWP " (since Linux 7.2)" +If this flag is set, then the fault was a read-write-protect fault. +.TP .B UFFD_PAGEFAULT_FLAG_WRITE If this flag is set, then the fault was a write fault. .P -If neither -.B UFFD_PAGEFAULT_FLAG_WP -nor -.B UFFD_PAGEFAULT_FLAG_MINOR +If none of +.BR UFFD_PAGEFAULT_FLAG_WP , +.BR UFFD_PAGEFAULT_FLAG_MINOR , +or +.B UFFD_PAGEFAULT_FLAG_RWP are set, then the fault was a missing fault. .RE .TP @@ -640,6 +795,17 @@ .SH HISTORY .P Support for hugetlbfs and shared memory areas and non-page-fault events was added in Linux 4.11 +.P +Read-write-protect mode +.RB ( UFFDIO_REGISTER_MODE_RWP , +.BR UFFD_FEATURE_RWP , +.BR UFFDIO_RWPROTECT ) +was added in Linux 7.2, +together with +.B UFFD_FEATURE_RWP_ASYNC +and the +.B UFFDIO_SET_MODE +runtime mode toggle. .SH NOTES The userfaultfd mechanism can be used as an alternative to traditional user-space paging techniques based on the use of the -- 2.54.0 Document the UFFDIO_RWPROTECT ioctl (since Linux 7.2). It installs or removes read-write protection on a range that was registered with UFFDIO_REGISTER_MODE_RWP, and is also how a handler resolves an UFFD_PAGEFAULT_FLAG_RWP notification. 
Cover the two mode bits (UFFDIO_RWPROTECT_MODE_RWP and UFFDIO_RWPROTECT_MODE_DONTWAKE, mutually exclusive), the populated-pages-only semantics, the anon vs file-backed reclaim behaviour, the explicit-drop list (MADV_DONTNEED, hole-punch, truncation), and the EINVAL/EAGAIN/ENOENT/EFAULT errors returned by the kernel. Signed-off-by: Kiryl Shutsemau Acked-by: Mike Rapoport (Microsoft) --- man/man2const/UFFDIO_RWPROTECT.2const | 122 ++++++++++++++++++++++++++ 1 file changed, 122 insertions(+) create mode 100644 man/man2const/UFFDIO_RWPROTECT.2const diff --git a/man/man2const/UFFDIO_RWPROTECT.2const b/man/man2const/UFFDIO_RWPROTECT.2const new file mode 100644 index 000000000000..42654a834cd5 --- /dev/null +++ b/man/man2const/UFFDIO_RWPROTECT.2const @@ -0,0 +1,122 @@ +.\" Copyright, the authors of the Linux man-pages project +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.TH UFFDIO_RWPROTECT 2const (date) "Linux man-pages (unreleased)" +.SH NAME +UFFDIO_RWPROTECT +\- +read-write-protect or un-protect a userfaultfd-registered memory range +.SH LIBRARY +Standard C library +.RI ( libc ,\~ \-lc ) +.SH SYNOPSIS +.nf +.BR "#include <linux/userfaultfd.h>" " /* Definition of " UFFD* " constants */" +.B #include <sys/ioctl.h> +.P +.BI "int ioctl(int " fd ", UFFDIO_RWPROTECT, struct uffdio_rwprotect *" argp ); +.P +.B #include <linux/userfaultfd.h> +.P +.fi +.EX +.B struct uffdio_rwprotect { +.BR " struct uffdio_range range;" " /* Range to change RWP on */" +.BR " __u64 mode;" " /* Mode flags */" +.B }; +.EE +.SH DESCRIPTION +Read-write-protect or un-protect a userfaultfd-registered memory range +registered with mode +.BR UFFDIO_REGISTER_MODE_RWP . +.P +The following mode bits are supported: +.TP +.B UFFDIO_RWPROTECT_MODE_RWP +When this mode bit is set, +the ioctl installs read-write protection +on every page present in the range specified by +.IR range . +Otherwise the ioctl removes read-write protection from the range, +which is also how a fault handler resolves an +.B UFFD_PAGEFAULT_FLAG_RWP +notification. 
+.TP +.B UFFDIO_RWPROTECT_MODE_DONTWAKE +When this mode bit is set, +do not wake up any thread +that waits for page-fault resolution after the operation. +This can be specified only if +.B UFFDIO_RWPROTECT_MODE_RWP +is not specified. +.P +Read-write protection only affects pages +that are currently populated in the range; +unmapped addresses are left untouched. +For anonymous mappings, +protection is preserved across page reclaim +(the marker rides on the swap entry) +and migration. +For shmem and file-backed mappings, +protection is dropped when the backing page is reclaimed. +Callers must also re-arm a range with +.B UFFDIO_RWPROTECT +after any operation that explicitly drops the underlying page: +.B MADV_DONTNEED +on anonymous memory, +hole-punch on shmem, +truncation of a file mapping. +.SH RETURN VALUE +On success, +0 is returned. +On error, \-1 is returned and +.I errno +is set to indicate the error. +.SH ERRORS +.TP +.B EINVAL +The +.I start +or the +.I len +field of the +.I uffdio_range +structure was not a multiple of the system page size; +or +.I len +was zero; +or the specified range was otherwise invalid; +or an invalid mode bit was specified; +or +.B UFFDIO_RWPROTECT_MODE_DONTWAKE +was specified together with +.BR UFFDIO_RWPROTECT_MODE_RWP . +.TP +.B EAGAIN +The process was interrupted; +retry this call. +.TP +.B ENOENT +The range specified in +.I range +is not valid. +For example, the virtual address does not exist, +or part of the range is not registered with +.BR UFFDIO_REGISTER_MODE_RWP . +.TP +.B EFAULT +Encountered a generic fault during processing. +.SH STANDARDS +Linux. +.SH HISTORY +Linux 7.2. +.SH EXAMPLES +See +.BR userfaultfd (2). +.SH SEE ALSO +.BR ioctl (2), +.BR ioctl_userfaultfd (2), +.BR userfaultfd (2) +.P +.I linux.git/\:Documentation/\:admin\-guide/\:mm/\:userfaultfd.rst -- 2.54.0 Document the UFFDIO_SET_MODE ioctl (since Linux 7.2). 
It toggles userfaultfd feature bits at runtime; currently only UFFD_FEATURE_RWP_ASYNC is toggleable, and enabling it requires UFFD_FEATURE_RWP to have been negotiated at UFFDIO_API time. Describe the uffdio_set_mode struct (enable/disable pair, must not overlap), the serialization against in-flight page faults that lets a single userfaultfd switch between async detection and synchronous eviction without re-registering its ranges, and the EINVAL/EFAULT errors returned by the kernel. Signed-off-by: Kiryl Shutsemau Acked-by: Mike Rapoport (Microsoft) --- man/man2const/UFFDIO_SET_MODE.2const | 98 ++++++++++++++++++++++++++++ 1 file changed, 98 insertions(+) create mode 100644 man/man2const/UFFDIO_SET_MODE.2const diff --git a/man/man2const/UFFDIO_SET_MODE.2const b/man/man2const/UFFDIO_SET_MODE.2const new file mode 100644 index 000000000000..b71632011a4c --- /dev/null +++ b/man/man2const/UFFDIO_SET_MODE.2const @@ -0,0 +1,98 @@ +.\" Copyright, the authors of the Linux man-pages project +.\" +.\" SPDX-License-Identifier: Linux-man-pages-copyleft +.\" +.TH UFFDIO_SET_MODE 2const (date) "Linux man-pages (unreleased)" +.SH NAME +UFFDIO_SET_MODE +\- +toggle userfaultfd runtime mode bits +.SH LIBRARY +Standard C library +.RI ( libc ,\~ \-lc ) +.SH SYNOPSIS +.nf +.BR "#include <linux/userfaultfd.h>" " /* Definition of " UFFD* " constants */" +.B #include <sys/ioctl.h> +.P +.BI "int ioctl(int " fd ", UFFDIO_SET_MODE, struct uffdio_set_mode *" argp ); +.P +.B #include <linux/userfaultfd.h> +.P +.fi +.EX +.B struct uffdio_set_mode { +.BR " __u64 enable;" " /* Feature bits to set */" +.BR " __u64 disable;" " /* Feature bits to clear */" +.B }; +.EE +.SH DESCRIPTION +Toggle userfaultfd features that may be flipped at runtime. +.P +Bits set in +.I enable +turn the named features on; +bits set in +.I disable +turn them off. +The two fields must not overlap. +Today only +.B UFFD_FEATURE_RWP_ASYNC +is a valid bit in either field; +any other bit causes the ioctl to fail with +.BR EINVAL . 
+Enabling +.B UFFD_FEATURE_RWP_ASYNC +also requires +.B UFFD_FEATURE_RWP +to have been negotiated at +.BR UFFDIO_API (2const) +time. +.P +The operation is serialized against in-flight page faults, +so the new mode takes effect +only after every fault that started before the call has finished, +and any fault that starts after the call observes the new mode. +This allows a single userfaultfd +to switch between lightweight async detection +and synchronous eviction +without re-registering its ranges. +.SH RETURN VALUE +On success, +0 is returned. +On error, \-1 is returned and +.I errno +is set to indicate the error. +.SH ERRORS +.TP +.B EINVAL +A bit other than +.B UFFD_FEATURE_RWP_ASYNC +was specified in +.I enable +or +.IR disable ; +the two fields overlap; +or +.B UFFD_FEATURE_RWP_ASYNC +was requested without +.B UFFD_FEATURE_RWP +having been negotiated. +.TP +.B EFAULT +.I argp +refers to an address that is outside the calling process's +accessible address space. +.SH STANDARDS +Linux. +.SH HISTORY +Linux 7.2. +.SH EXAMPLES +See +.BR userfaultfd (2). +.SH SEE ALSO +.BR ioctl (2), +.BR ioctl_userfaultfd (2), +.BR userfaultfd (2) +.P +.I linux.git/\:Documentation/\:admin\-guide/\:mm/\:userfaultfd.rst -- 2.54.0 Add the two RWP feature bits introduced in Linux 7.2: UFFD_FEATURE_RWP gates UFFDIO_REGISTER_MODE_RWP and the UFFDIO_RWPROTECT(2const) ioctl. UFFD_FEATURE_RWP_ASYNC in-kernel resolution of RWP faults without delivering a notification; requires UFFD_FEATURE_RWP to be set in the same UFFDIO_API call. Also document 1 << _UFFDIO_SET_MODE in argp->ioctls, the file-descriptor-level bit that advertises UFFDIO_SET_MODE(2const) for toggling UFFD_FEATURE_RWP_ASYNC at runtime; it is independent of any registered range. The existing page intro already describes UFFDIO_API returning EINVAL on unsupported feature bits and the temporary-uffd probe pattern, so the new TP entries do not re-state that. 
Signed-off-by: Kiryl Shutsemau Acked-by: Mike Rapoport (Microsoft) --- man/man2const/UFFDIO_API.2const | 31 +++++++++++++++++++++++++++++++ 1 file changed, 31 insertions(+) diff --git a/man/man2const/UFFDIO_API.2const b/man/man2const/UFFDIO_API.2const index e894114bb8e2..46ee7e31fed9 100644 --- a/man/man2const/UFFDIO_API.2const +++ b/man/man2const/UFFDIO_API.2const @@ -213,6 +213,30 @@ .SH DESCRIPTION the kernel supports resolving faults with the .B UFFDIO_MOVE ioctl. +.TP +.BR UFFD_FEATURE_RWP " (since Linux 7.2)" +If this feature bit is set, +the kernel supports read-write-protection tracking, +and the +.B UFFDIO_REGISTER_MODE_RWP +registration mode and the +.B UFFDIO_RWPROTECT +ioctl become available. +.TP +.BR UFFD_FEATURE_RWP_ASYNC " (since Linux 7.2)" +If this feature bit is set, +the kernel will resolve read-write-protect faults in place +without delivering a notification, +automatically restoring page permissions +and letting the faulted thread continue. +This bit requires +.B UFFD_FEATURE_RWP +to be set in the same +.B UFFDIO_API +call. +The async mode can also be toggled at runtime using the +.BR UFFDIO_SET_MODE (2const) +ioctl. .P The returned .I argp->ioctls @@ -234,6 +258,13 @@ .SH DESCRIPTION The .B UFFDIO_UNREGISTER operation is supported. +.TP +.BR "1 << _UFFDIO_SET_MODE" " (since Linux 7.2)" +The +.B UFFDIO_SET_MODE +operation is supported. +This is a file-descriptor-level ioctl and is reported once per +userfaultfd, independent of any registered range. .SH RETURN VALUE On success, 0 is returned. -- 2.54.0 Add the new registration mode bit introduced in Linux 7.2: UFFDIO_REGISTER_MODE_RWP Track every access (read or write) to a present page in the registered range. Cannot be combined with UFFDIO_REGISTER_MODE_WP; both modes share the same per-PTE marker bit. Anonymous, shmem, and hugetlbfs ranges are compatible. 
Also document the matching argp->ioctls bit, 1 << _UFFDIO_RWPROTECT, which the kernel reports only when the range was registered with UFFDIO_REGISTER_MODE_RWP (which itself requires UFFD_FEATURE_RWP to have been negotiated). Signed-off-by: Kiryl Shutsemau Acked-by: Mike Rapoport (Microsoft) --- man/man2const/UFFDIO_REGISTER.2const | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) diff --git a/man/man2const/UFFDIO_REGISTER.2const b/man/man2const/UFFDIO_REGISTER.2const index 50064c954b81..ded57cf301ad 100644 --- a/man/man2const/UFFDIO_REGISTER.2const +++ b/man/man2const/UFFDIO_REGISTER.2const @@ -72,6 +72,20 @@ .SH DESCRIPTION only hugetlbfs ranges are compatible. Since Linux 5.14, compatibility with shmem ranges was added. +.TP +.BR UFFDIO_REGISTER_MODE_RWP " (since Linux 7.2)" +Track page faults on read-write-protected pages. +Every access +(read or write) +to a page present within the registered range +generates a notification +once the range has been protected with +.BR UFFDIO_RWPROTECT (2const). +This mode cannot be combined with +.BR UFFDIO_REGISTER_MODE_WP ; +attempting to do so fails with +.BR EINVAL . +Anonymous, shmem, and hugetlbfs ranges are compatible. .P If the operation is successful, the kernel modifies the .I argp->ioctls @@ -109,6 +123,16 @@ .SH DESCRIPTION The .B UFFDIO_POISON operation is supported. +.TP +.BR "1 << _UFFDIO_RWPROTECT" " (since Linux 7.2)" +The +.B UFFDIO_RWPROTECT +operation is supported. +This bit is reported only when the range was registered with +.B UFFDIO_REGISTER_MODE_RWP +(which itself requires +.B UFFD_FEATURE_RWP +to have been negotiated). .SH RETURN VALUE On success, 0 is returned. -- 2.54.0 Add the two new ioctls introduced in Linux 7.2 to the list of operations supported on a userfaultfd file descriptor. 
Signed-off-by: Kiryl Shutsemau Acked-by: Mike Rapoport (Microsoft) --- man/man2/ioctl_userfaultfd.2 | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/man/man2/ioctl_userfaultfd.2 b/man/man2/ioctl_userfaultfd.2 index 37553cd7a88f..fb57fe222979 100644 --- a/man/man2/ioctl_userfaultfd.2 +++ b/man/man2/ioctl_userfaultfd.2 @@ -76,9 +76,13 @@ .SH DESCRIPTION .TQ .BR UFFDIO_WRITEPROTECT (2const) .TQ +.BR UFFDIO_RWPROTECT (2const) +.TQ .BR UFFDIO_CONTINUE (2const) .TQ .BR UFFDIO_POISON (2const) +.TQ +.BR UFFDIO_SET_MODE (2const) .SH RETURN VALUE On success, 0 is returned. -- 2.54.0
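
As a rough illustration of the call sequence the pages in this series describe (negotiate UFFD_FEATURE_RWP with UFFDIO_API, register a range with UFFDIO_REGISTER_MODE_RWP, then install or remove protection with UFFDIO_RWPROTECT), a minimal C sketch might look like the following. The helper names rwp_open(), rwp_register() and rwp_set() and the HAVE_RWP_UAPI macro are hypothetical and not part of the series. The sketch assumes that <linux/userfaultfd.h> exposes the new constants as preprocessor macros; when built against headers that lack them, every helper fails with ENOSYS instead of guessing at ioctl numbers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Assumption: the RWP UAPI names are macros, so #if defined() can see them. */
#if defined(UFFD_FEATURE_RWP) && defined(UFFDIO_RWPROTECT)
#define HAVE_RWP_UAPI 1
#else
#define HAVE_RWP_UAPI 0
#endif

/* Hypothetical helper: open a userfaultfd and negotiate UFFD_FEATURE_RWP.
 * Returns the descriptor, or -1 with errno set. */
static int rwp_open(void)
{
#if HAVE_RWP_UAPI
	int uffd = (int)syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd == -1)
		return -1;

	struct uffdio_api api = { .api = UFFD_API, .features = UFFD_FEATURE_RWP };
	if (ioctl(uffd, UFFDIO_API, &api) == -1) {
		int saved = errno;	/* EINVAL if the kernel lacks the feature */
		close(uffd);
		errno = saved;
		return -1;
	}
	return uffd;
#else
	errno = ENOSYS;		/* headers predate the RWP UAPI */
	return -1;
#endif
}

/* Hypothetical helper: register [addr, addr + len) in RWP mode and confirm
 * that the kernel advertises UFFDIO_RWPROTECT for the range. */
static int rwp_register(int uffd, void *addr, size_t len)
{
#if HAVE_RWP_UAPI
	struct uffdio_register reg = {
		.range = { .start = (uintptr_t)addr, .len = len },
		.mode = UFFDIO_REGISTER_MODE_RWP,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
		return -1;
	if (!(reg.ioctls & (1ULL << _UFFDIO_RWPROTECT))) {
		errno = ENOTSUP;
		return -1;
	}
	return 0;
#else
	(void)uffd; (void)addr; (void)len;
	errno = ENOSYS;
	return -1;
#endif
}

/* Hypothetical helper: protect != 0 installs RWP on populated pages;
 * protect == 0 removes it, which also resolves a pending RWP fault
 * and wakes the blocked thread (no DONTWAKE is requested here). */
static int rwp_set(int uffd, void *addr, size_t len, int protect)
{
#if HAVE_RWP_UAPI
	struct uffdio_rwprotect rwp = {
		.range = { .start = (uintptr_t)addr, .len = len },
		.mode = protect ? UFFDIO_RWPROTECT_MODE_RWP : 0,
	};
	return ioctl(uffd, UFFDIO_RWPROTECT, &rwp);
#else
	(void)uffd; (void)addr; (void)len; (void)protect;
	errno = ENOSYS;
	return -1;
#endif
}
```

A monitor thread would typically call rwp_set(uffd, addr, len, 1) after populating the range, read uffd_msg records whose pagefault.flags has UFFD_PAGEFAULT_FLAG_RWP set, and call rwp_set() with protect == 0 on the faulting page to let the blocked thread continue. Since only populated pages are affected, the range needs to be re-armed after operations such as MADV_DONTNEED.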