As-is, khugepaged and writable-file opening exclude each other. A file cannot be open writeable and have THPs (because the filesystem is not aware of them). khugepaged will never collapse file pages for files that are opened writeable. On an open(O_RDWR/O_WRONLY), the page cache for that particular file is dropped. This is fine because nothing could've been dirtied. However, there is an edge-case: collapse_file() might not be able to coexist with concurrent writers, but it can coexist with dirty folios (from previous writers). Therefore, the following can happen: open(file, O_RDWR) write(file) close(file) madvise(file_mapping, MADV_COLLAPSE, some non-dirty range) open(file, O_RDWR) nr_thps > 0 truncate_inode_pages() /* THPs are cleared out, but so are the dirty folios */ When this edge-case happens, there is data loss, as the dirty folios are fully discarded. Fix it by fully writing back the page cache (and waiting) when collapsing file THPs. Doing so provides the guarantee that no dirty folio will be observed while there are active THPs. To fully ensure this is safe, the invalidate_lock needs to be held while doing the writeout, so that do_dentry_open()'s page cache truncation excludes this write-and-wait. Cc: stable@vger.kernel.org Cc: Alexander Viro Cc: Christian Brauner Cc: Jan Kara Cc: Matthew Wilcox Cc: Song Liu Cc: Eric Hagberg Cc: Zi Yan Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS") Reported-by: Gregg Leventhal Closes: https://lore.kernel.org/linux-mm/CAFN_u7H_0ECF3jixP=T=U7AH5=Q3wQNvJMo8an3VqUDMerQfUw@mail.gmail.com/ Tested-by: Zi Yan Signed-off-by: Pedro Falcato --- This patch is written against 7.1.0 (because the code no longer exists in mainline). Zi, I kept your Tested-by, but I had to move some things around and use the invalidate lock. Please re-test if you can. mm/khugepaged.c | 39 +++++++++++++++++++++++++-------------- 1 file changed, 25 insertions(+), 14 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index b8452dbdb043..0707d719a270 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2094,32 +2094,43 @@ static enum scan_result collapse_file(struct mm_struct *mm, unsigned long addr, goto xa_unlocked; } - if (!is_shmem) { +xa_locked: + xas_unlock_irq(&xas); +xa_unlocked: + + /* + * If collapse is successful, flush must be done now before copying. + * If collapse is unsuccessful, does flush actually need to be done? + * Do it anyway, to clear the state. + */ + try_to_unmap_flush(); + + if (result == SCAN_SUCCEED && !is_shmem) { + /* + * invalidate_lock as shared excludes against concurrent opens + * in do_dentry_open() truncating the page cache. This is + * particularly important if there are dirty folios in transit. + */ + filemap_invalidate_lock_shared(mapping); filemap_nr_thps_inc(mapping); /* * Paired with the fence in do_dentry_open() -> get_write_access() * to ensure i_writecount is up to date and the update to nr_thps * is visible. Ensures the page cache will be truncated if the - * file is opened writable. + * file is opened writable. If collapse looks to be successful, + * flush any dirty pages out the page cache. With the nr_thps + * incremented, there won't be any new writers (nor new dirties). */ smp_mb(); - if (inode_is_open_for_write(mapping->host)) { + if (inode_is_open_for_write(mapping->host) || filemap_write_and_wait(mapping)) { result = SCAN_FAIL; filemap_nr_thps_dec(mapping); + filemap_invalidate_unlock_shared(mapping); + goto rollback; } + filemap_invalidate_unlock_shared(mapping); } -xa_locked: - xas_unlock_irq(&xas); -xa_unlocked: - - /* - * If collapse is successful, flush must be done now before copying. - * If collapse is unsuccessful, does flush actually need to be done? - * Do it anyway, to clear the state. - */ - try_to_unmap_flush(); - if (result == SCAN_SUCCEED && nr_none && !shmem_charge(mapping->host, nr_none)) result = SCAN_FAIL; -- 2.54.0