From: Dominique Martinet

9p suffered from corruption in mmap-heavy workloads such as compiling
with clang (e.g. a kernel build with `make LLVM=1`, failing with random
errors on mmapped headers and clang segfaults).

David Howells identified a problem in the netfs trace of such a crash,
below:

> We unlocked page ix=00005 before doing the ZERO subreq that clears
> the page tail. That shouldn't have happened since the collection
> point hasn't reached the end of the folio yet.

 netfs_collect_folio: R=00001b55 ix=00003 r=3000-4000 t=3000/5fb2
 netfs_folio: i=157f3 ix=00003-00003 read-done
 netfs_folio: i=157f3 ix=00003-00003 read-unlock
 netfs_collect_folio: R=00001b55 ix=00004 r=4000-5000 t=4000/5fb2
 netfs_folio: i=157f3 ix=00004-00004 read-done
 netfs_folio: i=157f3 ix=00004-00004 read-unlock
 netfs_collect_folio: R=00001b55 ix=00005 r=5000-5fb2 t=5000/5fb2
 netfs_folio: i=157f3 ix=00005-00005 read-done
 netfs_folio: i=157f3 ix=00005-00005 read-unlock
 ...
 netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
 netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=c
 netfs_collect_stream: R=00001b55[0:] cto=5fb2 frn=ffffffff
 netfs_collect_state: R=00001b55 col=5fb2 cln=6000 n=8
 ...
 netfs_sreq: R=00001b55[2] ZERO SUBMT f=000 s=5fb2 0/4e s=0 e=0
 netfs_sreq: R=00001b55[2] ZERO TERM f=102 s=5fb2 4e/4e s=5 e=0

This patch changes the end condition for collecting folios that have
been read from "is the IO over?" to "have we read up to the end of the
folio?" (collected_to >= fend), which delays the read-unlock to the
"ZERO CONSUM" step after the page tail has been zeroed.

Reported-by: Christian Schoenebeck
Link: https://lkml.kernel.org/r/2332946.iZASKD2KPV@weasel
Fixes: e2d46f2ec332 ("netfs: Change the read result collector to only use one work item")
Signed-off-by: Dominique Martinet
---
A few notes, out of order:
- I don't really understand why this actually matters -- sure, there'll
  be junk after the mmap region ends, but why does this cause a bug?
  Since the folio is unlocked, does the zero-fill operation end up
  zeroing the end of another page? Well, either way there's no doubt
  that the early unlock was a problem...
- I have no idea how to test that this doesn't break something else by
  not unlocking a page that should be unlocked, e.g. something that
  wouldn't need zero-filling (partial refresh?) -- I don't *think* we
  have anything like that, but I don't actually know, so please don't
  trust me too much here... Nothing I normally test on 9p broke, so
  it's probably not too far off though.

This has been broken for about a year, so I guess there's no great
hurry at this point, but any extra pair of eyes would be appreciated...
(Christian S, I've tested as much as I could think of, but any extra
validation would be great too if you have time!)
---
 fs/netfs/read_collect.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/netfs/read_collect.c b/fs/netfs/read_collect.c
index a95e7aadafd072edbb0f1743e0020eaf65b7d1ed..7a0ffa675fb17aa7431c0fd6cc546deb0be0be22 100644
--- a/fs/netfs/read_collect.c
+++ b/fs/netfs/read_collect.c
@@ -137,7 +137,7 @@ static void netfs_read_unlock_folios(struct netfs_io_request *rreq,
 		rreq->front_folio_order = order;
 		fsize = PAGE_SIZE << order;
 		fpos  = folio_pos(folio);
-		fend  = umin(fpos + fsize, rreq->i_size);
+		fend  = fpos + fsize;

 		trace_netfs_collect_folio(rreq, folio, fend, collected_to);

---
base-commit: b10e94717f7e295d7ff6ce08249f7e31b630ef1b
change-id: 20251220-netfs_mmap_corruption-62e187ffc28c

Best regards,
--
Dominique Martinet