RocksDB sequential read benchmark under high concurrency shows severe
lock contention. Multiple threads may issue readahead on the same file
simultaneously, which leads to heavy contention on the xas spinlock in
filemap_add_folio(). Perf profiling indicates that 30%~60% of CPU time
is spent there.

To mitigate this, skip a readahead request if its range is fully
covered by an ongoing readahead. This avoids redundant work and
significantly reduces lock contention. In a one-second sample,
contention on the xas spinlock dropped from 138,314 to 2,144 events,
resulting in a large performance improvement in the benchmark:

                                     w/o patch    w/ patch
RocksDB-readseq (ops/sec)
(32-threads)                            1.2M        2.4M

Cc: Tim Chen
Cc: Vinicius Gomes
Cc: Tianyou Li
Cc: Chen Yu
Suggested-by: Nanhai Zou
Tested-by: Gang Deng
Signed-off-by: Aubrey Li
---
 mm/readahead.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/mm/readahead.c b/mm/readahead.c
index 20d36d6b055e..57ae1a137730 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -337,7 +337,7 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	struct address_space *mapping = ractl->mapping;
 	struct file_ra_state *ra = ractl->ra;
 	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
-	unsigned long max_pages;
+	unsigned long max_pages, index;
 
 	if (unlikely(!mapping->a_ops->read_folio && !mapping->a_ops->readahead))
 		return;
@@ -348,6 +348,19 @@ void force_page_cache_ra(struct readahead_control *ractl,
 	 */
 	max_pages = max_t(unsigned long, bdi->io_pages, ra->ra_pages);
 	nr_to_read = min_t(unsigned long, nr_to_read, max_pages);
+
+	index = readahead_index(ractl);
+	/*
+	 * Skip this readahead if the requested range is fully covered
+	 * by the ongoing readahead range. This typically occurs in
+	 * concurrent scenarios.
+	 */
+	if (index >= ra->start && index + nr_to_read <= ra->start + ra->size)
+		return;
+
+	ra->start = index;
+	ra->size = nr_to_read;
+
 	while (nr_to_read) {
 		unsigned long this_chunk = (2 * 1024 * 1024) / PAGE_SIZE;
 
@@ -357,6 +370,10 @@ void force_page_cache_ra(struct readahead_control *ractl,
 
 		nr_to_read -= this_chunk;
 	}
+
+	/* Reset readahead state to allow the next readahead */
+	ra->start = 0;
+	ra->size = 0;
 }
 
 /*
--
2.43.0