Compiler CSE and SSA GVN optimizations can cause the address dependency of addresses returned by rcu_dereference to be lost when comparing those pointers with either constants or previously loaded pointers. Introduce ptr_eq() to compare two addresses while preserving the address dependencies for later use of the address. It should be used when comparing an address returned by rcu_dereference(). This is needed to prevent the compiler CSE and SSA GVN optimizations from using @a (or @b) in places where the source refers to @b (or @a) based on the fact that after the comparison, the two are known to be equal, which does not preserve address dependencies and allows the following misordering speculations: - If @b is a constant, the compiler can issue the loads which depend on @a before loading @a. - If @b is a register populated by a prior load, weakly-ordered CPUs can speculate loads which depend on @a before loading @a. The same logic applies with @a and @b swapped. Suggested-by: Linus Torvalds Suggested-by: Boqun Feng Signed-off-by: Mathieu Desnoyers Reviewed-by: Boqun Feng Reviewed-by: Joel Fernandes (Google) Tested-by: Joel Fernandes (Google) Acked-by: "Paul E. McKenney" Acked-by: Alan Stern Cc: Greg Kroah-Hartman Cc: Sebastian Andrzej Siewior Cc: "Paul E. McKenney" Cc: Will Deacon Cc: Peter Zijlstra Cc: Boqun Feng Cc: Alan Stern Cc: John Stultz Cc: Neeraj Upadhyay Cc: Linus Torvalds Cc: Boqun Feng Cc: Frederic Weisbecker Cc: Joel Fernandes Cc: Josh Triplett Cc: Uladzislau Rezki Cc: Steven Rostedt Cc: Lai Jiangshan Cc: Zqiang Cc: Ingo Molnar Cc: Waiman Long Cc: Mark Rutland Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: maged.michael@gmail.com Cc: Mateusz Guzik Cc: Gary Guo Cc: Jonas Oberhauser Cc: rcu@vger.kernel.org Cc: linux-mm@kvack.org Cc: lkmm@lists.linux.dev Cc: Nikita Popov Cc: llvm@lists.linux.dev --- Changes since v0: - Include feedback from Alan Stern. --- include/linux/compiler.h | 63 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 63 insertions(+) diff --git a/include/linux/compiler.h b/include/linux/compiler.h index 5b45ea7dff3e..c5ca3b54c112 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -163,6 +163,69 @@ void ftrace_likely_update(struct ftrace_likely_data *f, int val, __asm__ ("" : "=r" (var) : "0" (var)) #endif +/* + * Compare two addresses while preserving the address dependencies for + * later use of the address. It should be used when comparing an address + * returned by rcu_dereference(). + * + * This is needed to prevent the compiler CSE and SSA GVN optimizations + * from using @a (or @b) in places where the source refers to @b (or @a) + * based on the fact that after the comparison, the two are known to be + * equal, which does not preserve address dependencies and allows the + * following misordering speculations: + * + * - If @b is a constant, the compiler can issue the loads which depend + * on @a before loading @a. + * - If @b is a register populated by a prior load, weakly-ordered + * CPUs can speculate loads which depend on @a before loading @a. + * + * The same logic applies with @a and @b swapped. + * + * Return value: true if pointers are equal, false otherwise. + * + * The compiler barrier() is ineffective at fixing this issue. 
It does + * not prevent the compiler CSE from losing the address dependency: + * + * int fct_2_volatile_barriers(void) + * { + * int *a, *b; + * + * do { + * a = READ_ONCE(p); + * asm volatile ("" : : : "memory"); + * b = READ_ONCE(p); + * } while (a != b); + * asm volatile ("" : : : "memory"); <-- barrier() + * return *b; + * } + * + * With gcc 14.2 (arm64): + * + * fct_2_volatile_barriers: + * adrp x0, .LANCHOR0 + * add x0, x0, :lo12:.LANCHOR0 + * .L2: + * ldr x1, [x0] <-- x1 populated by first load. + * ldr x2, [x0] + * cmp x1, x2 + * bne .L2 + * ldr w0, [x1] <-- x1 is used for access which should depend on b. + * ret + * + * On weakly-ordered architectures, this lets CPU speculation use the + * result from the first load to speculate "ldr w0, [x1]" before + * "ldr x2, [x0]". + * Based on the RCU documentation, the control dependency does not + * prevent the CPU from speculating loads. + */ +static __always_inline +int ptr_eq(const volatile void *a, const volatile void *b) +{ + OPTIMIZER_HIDE_VAR(a); + OPTIMIZER_HIDE_VAR(b); + return a == b; +} + #define __UNIQUE_ID(prefix) __PASTE(__PASTE(__UNIQUE_ID_, prefix), __COUNTER__) /** -- 2.39.5 Refer to ptr_eq() in the rcu_dereference() documentation. ptr_eq() is a mechanism that preserves address dependencies when comparing pointers, and should be favored when comparing a pointer obtained from rcu_dereference() against another pointer. Signed-off-by: Mathieu Desnoyers Acked-by: Alan Stern Acked-by: Paul E. McKenney Reviewed-by: Joel Fernandes (Google) Cc: Greg Kroah-Hartman Cc: Sebastian Andrzej Siewior Cc: "Paul E. McKenney" Cc: Will Deacon Cc: Peter Zijlstra Cc: Boqun Feng Cc: Alan Stern Cc: John Stultz Cc: Neeraj Upadhyay Cc: Linus Torvalds Cc: Boqun Feng Cc: Frederic Weisbecker Cc: Joel Fernandes Cc: Josh Triplett Cc: Uladzislau Rezki Cc: Steven Rostedt Cc: Lai Jiangshan Cc: Zqiang Cc: Ingo Molnar Cc: Waiman Long Cc: Mark Rutland Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: maged.michael@gmail.com Cc: Mateusz Guzik Cc: Gary Guo Cc: Jonas Oberhauser Cc: rcu@vger.kernel.org Cc: linux-mm@kvack.org Cc: lkmm@lists.linux.dev Cc: Nikita Popov Cc: llvm@lists.linux.dev --- Changes since v1: - Include feedback from Paul E. McKenney. Changes since v0: - Include feedback from Alan Stern. --- Documentation/RCU/rcu_dereference.rst | 38 +++++++++++++++++++++++---- 1 file changed, 33 insertions(+), 5 deletions(-) diff --git a/Documentation/RCU/rcu_dereference.rst b/Documentation/RCU/rcu_dereference.rst index 2524dcdadde2..de6175bf430f 100644 --- a/Documentation/RCU/rcu_dereference.rst +++ b/Documentation/RCU/rcu_dereference.rst @@ -104,11 +104,12 @@ readers working properly: after such branches, but can speculate loads, which can again result in misordering bugs. -- Be very careful about comparing pointers obtained from - rcu_dereference() against non-NULL values. As Linus Torvalds - explained, if the two pointers are equal, the compiler could - substitute the pointer you are comparing against for the pointer - obtained from rcu_dereference(). For example:: +- Use operations that preserve address dependencies (such as + "ptr_eq()") to compare pointers obtained from rcu_dereference() + against non-NULL pointers. As Linus Torvalds explained, if the + two pointers are equal, the compiler could substitute the + pointer you are comparing against for the pointer obtained from + rcu_dereference(). 
For example:: p = rcu_dereference(gp); if (p == &default_struct) @@ -125,6 +126,29 @@ readers working properly: On ARM and Power hardware, the load from "default_struct.a" can now be speculated, such that it might happen before the rcu_dereference(). This could result in bugs due to misordering. + Performing the comparison with "ptr_eq()" ensures the compiler + does not perform such transformation. + + If the comparison is against another pointer, the compiler is + allowed to use either pointer for the following accesses, which + loses the address dependency and allows weakly-ordered + architectures such as ARM and PowerPC to speculate the + address-dependent load before rcu_dereference(). For example:: + + p1 = READ_ONCE(gp); + p2 = rcu_dereference(gp); + if (p1 == p2) /* BUGGY!!! */ + do_default(p2->a); + + The compiler can use p1->a rather than p2->a, destroying the + address dependency. Performing the comparison with "ptr_eq()" + ensures the compiler preserves the address dependencies. + Corrected code:: + + p1 = READ_ONCE(gp); + p2 = rcu_dereference(gp); + if (ptr_eq(p1, p2)) + do_default(p2->a); However, comparisons are OK in the following cases: @@ -204,6 +228,10 @@ readers working properly: comparison will provide exactly the information that the compiler needs to deduce the value of the pointer. + When in doubt, use operations that preserve address dependencies + (such as "ptr_eq()") to compare pointers obtained from + rcu_dereference() against non-NULL pointers. + - Disable any value-speculation optimizations that your compiler might provide, especially if you are making use of feedback-based optimizations that take data collected from prior runs. Such -- 2.39.5 This API provides existence guarantees of objects through Hazard Pointers [1] (hazptr). Its main benefit over RCU is that it allows fast reclaim of HP-protected pointers without needing to wait for a grace period. This implementation has 8 statically allocated hazard pointer slots per cpu for the fast path, and relies on a on-stack backup slot allocated by the hazard pointer user as fallback in case no per-cpu slot is available. References: [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for lock-free objects," in IEEE Transactions on Parallel and Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 Link: https://lpc.events/event/19/contributions/2082/ Link: https://lore.kernel.org/lkml/j3scdl5iymjlxavomgc6u5ndg3svhab6ga23dr36o4f5mt333w@7xslvq6b6hmv/ Link: https://lpc.events/event/18/contributions/1731/ Signed-off-by: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Michael Ellerman Cc: Greg Kroah-Hartman Cc: Sebastian Andrzej Siewior Cc: "Paul E. McKenney" Cc: Will Deacon Cc: Peter Zijlstra Cc: Boqun Feng Cc: Alan Stern Cc: John Stultz Cc: Neeraj Upadhyay Cc: Linus Torvalds Cc: Andrew Morton Cc: Boqun Feng Cc: Frederic Weisbecker Cc: Joel Fernandes Cc: Josh Triplett Cc: Uladzislau Rezki Cc: Steven Rostedt Cc: Lai Jiangshan Cc: Zqiang Cc: Ingo Molnar Cc: Waiman Long Cc: Mark Rutland Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: maged.michael@gmail.com Cc: Mateusz Guzik Cc: Jonas Oberhauser Cc: rcu@vger.kernel.org Cc: linux-mm@kvack.org Cc: lkmm@lists.linux.dev --- Changes since v3: - Rename hazptr_retire to hazptr_release. - Remove domains. - Introduce "backup_slot" within hazptr context structure (on stack) to handle slot overflow. - Rename hazptr_try_protect to hazptr_acquire. - Preallocate 8 per-CPU slots, and rely on caller-provided backup slots (typically on stack) for out-of-slots situations. 
Changes since v2: - Address Peter Zijlstra's comments. - Address Paul E. McKenney's comments. Changes since v0: - Remove slot variable from hp_dereference_allocate(). --- include/linux/hazptr.h | 182 +++++++++++++++++++++++++++++++++++++++++ init/main.c | 2 + kernel/Makefile | 2 +- kernel/hazptr.c | 150 +++++++++++++++++++++++++++++++++ 4 files changed, 335 insertions(+), 1 deletion(-) create mode 100644 include/linux/hazptr.h create mode 100644 kernel/hazptr.c diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h new file mode 100644 index 000000000000..70c066ddb0f5 --- /dev/null +++ b/include/linux/hazptr.h @@ -0,0 +1,182 @@ +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers +// +// SPDX-License-Identifier: LGPL-2.1-or-later + +#ifndef _LINUX_HAZPTR_H +#define _LINUX_HAZPTR_H + +/* + * hazptr: Hazard Pointers + * + * This API provides existence guarantees of objects through hazard + * pointers. + * + * Its main benefit over RCU is that it allows fast reclaim of + * HP-protected pointers without needing to wait for a grace period. + * + * References: + * + * [1]: M. M. Michael, "Hazard pointers: safe memory reclamation for + * lock-free objects," in IEEE Transactions on Parallel and + * Distributed Systems, vol. 15, no. 6, pp. 491-504, June 2004 + */ + +#include +#include +#include + +/* 8 slots (each sizeof(void *)) fit in a single cache line. */ +#define NR_HAZPTR_PERCPU_SLOTS 8 + +/* + * Hazard pointer slot. + */ +struct hazptr_slot { + void *addr; +}; + +struct hazptr_backup_slot { + struct list_head node; + struct hazptr_slot slot; + /* CPU requesting the backup slot. */ + int cpu; +}; + +struct hazptr_ctx { + struct hazptr_slot *slot; + /* Backup slot in case all per-CPU slots are used. */ + struct hazptr_backup_slot backup_slot; +}; + +struct hazptr_percpu_slots { + struct hazptr_slot slots[NR_HAZPTR_PERCPU_SLOTS]; +} ____cacheline_aligned; + +DECLARE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots); + +/* + * hazptr_synchronize: Wait until @addr is released from all slots. + * + * Wait to observe that each slot contains a value that differs from + * @addr before returning. + * Should be called from preemptible context. + */ +void hazptr_synchronize(void *addr); + +/* + * hazptr_chain_backup_slot: Chain backup slot into overflow list. + * + * Set backup slot address to @addr, and chain it into the overflow + * list. + */ +struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx); + +/* + * hazptr_unchain_backup_slot: Unchain backup slot from overflow list. + */ +void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx); + +static inline +struct hazptr_slot *hazptr_get_free_percpu_slot(void) +{ + struct hazptr_percpu_slots *percpu_slots = this_cpu_ptr(&hazptr_percpu_slots); + unsigned int idx; + + for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) { + struct hazptr_slot *slot = &percpu_slots->slots[idx]; + + if (!READ_ONCE(slot->addr)) + return slot; + } + /* All slots are in use. */ + return NULL; +} + +static inline +bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot) +{ + return slot == &ctx->backup_slot.slot; +} + +/* + * hazptr_acquire: Load pointer at address and protect with hazard pointer. + * + * Load @addr_p, and protect the loaded pointer with hazard pointer. + * + * Returns a non-NULL protected address if the loaded pointer is non-NULL. + * Returns NULL if the loaded pointer is NULL. + * + * On success the protected hazptr slot is stored in @ctx->slot. 
+ */ +static inline +void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p) +{ + struct hazptr_slot *slot = NULL; + void *addr, *addr2; + + /* + * Load @addr_p to know which address should be protected. + */ + addr = READ_ONCE(*addr_p); + for (;;) { + if (!addr) + return NULL; + guard(preempt)(); + if (likely(!hazptr_slot_is_backup(ctx, slot))) { + slot = hazptr_get_free_percpu_slot(); + /* + * If all the per-CPU slots are already in use, fallback + * to the backup slot. + */ + if (unlikely(!slot)) + slot = hazptr_chain_backup_slot(ctx); + } + WRITE_ONCE(slot->addr, addr); /* Store B */ + + /* Memory ordering: Store B before Load A. */ + smp_mb(); + + /* + * Re-load @addr_p after storing it to the hazard pointer slot. + */ + addr2 = READ_ONCE(*addr_p); /* Load A */ + if (likely(ptr_eq(addr2, addr))) + break; + /* + * If @addr_p content has changed since the first load, + * release the hazard pointer and try again. + */ + WRITE_ONCE(slot->addr, NULL); + if (!addr2) { + if (hazptr_slot_is_backup(ctx, slot)) + hazptr_unchain_backup_slot(ctx); + return NULL; + } + addr = addr2; + } + ctx->slot = slot; + /* + * Use addr2 loaded from the second READ_ONCE() to preserve + * address dependency ordering. + */ + return addr2; +} + +/* Release the protected hazard pointer from @slot. */ +static inline +void hazptr_release(struct hazptr_ctx *ctx, void *addr) +{ + struct hazptr_slot *slot; + + if (!addr) + return; + slot = ctx->slot; + WARN_ON_ONCE(slot->addr != addr); + smp_store_release(&slot->addr, NULL); + if (unlikely(hazptr_slot_is_backup(ctx, slot))) + hazptr_unchain_backup_slot(ctx); +} + +void hazptr_init(void); + +#endif /* _LINUX_HAZPTR_H */ diff --git a/init/main.c b/init/main.c index 07a3116811c5..858eaa87bde7 100644 --- a/init/main.c +++ b/init/main.c @@ -104,6 +104,7 @@ #include #include #include +#include #include #include @@ -1002,6 +1003,7 @@ void start_kernel(void) workqueue_init_early(); rcu_init(); + hazptr_init(); kvfree_rcu_init(); /* Trace events are available after this */ diff --git a/kernel/Makefile b/kernel/Makefile index 9fe722305c9b..1178907fe0ea 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -7,7 +7,7 @@ obj-y = fork.o exec_domain.o panic.o \ cpu.o exit.o softirq.o resource.o \ sysctl.o capability.o ptrace.o user.o \ signal.o sys.o umh.o workqueue.o pid.o task_work.o \ - extable.o params.o \ + extable.o params.o hazptr.o \ kthread.o sys_ni.o nsproxy.o nstree.o nscommon.o \ notifier.o ksysfs.o cred.o reboot.o \ async.o range.o smpboot.o ucount.o regset.o ksyms_common.o diff --git a/kernel/hazptr.c b/kernel/hazptr.c new file mode 100644 index 000000000000..2ec288bc1132 --- /dev/null +++ b/kernel/hazptr.c @@ -0,0 +1,150 @@ +// SPDX-FileCopyrightText: 2024 Mathieu Desnoyers +// +// SPDX-License-Identifier: LGPL-2.1-or-later + +/* + * hazptr: Hazard Pointers + */ + +#include +#include +#include +#include +#include + +struct overflow_list { + raw_spinlock_t lock; /* Lock protecting overflow list and list generation. */ + struct list_head head; /* Overflow list head. */ + uint64_t gen; /* Overflow list generation. */ +}; + +static DEFINE_PER_CPU(struct overflow_list, percpu_overflow_list); + +DEFINE_PER_CPU(struct hazptr_percpu_slots, hazptr_percpu_slots); +EXPORT_PER_CPU_SYMBOL_GPL(hazptr_percpu_slots); + +/* + * Perform piecewise iteration on overflow list waiting until "addr" is + * not present. Raw spinlock is released and taken between each list + * item and busy loop iteration. 
The overflow list generation is checked + * each time the lock is taken to validate that the list has not changed + * before resuming iteration or busy wait. If the generation has + * changed, retry the entire list traversal. + */ +static +void hazptr_synchronize_overflow_list(struct overflow_list *overflow_list, void *addr) +{ + struct hazptr_backup_slot *backup_slot; + uint64_t snapshot_gen; + + raw_spin_lock(&overflow_list->lock); +retry: + snapshot_gen = overflow_list->gen; + list_for_each_entry(backup_slot, &overflow_list->head, node) { + /* Busy-wait if node is found. */ + while (smp_load_acquire(&backup_slot->slot.addr) == addr) { /* Load B */ + raw_spin_unlock(&overflow_list->lock); + cpu_relax(); + raw_spin_lock(&overflow_list->lock); + if (overflow_list->gen != snapshot_gen) + goto retry; + } + raw_spin_unlock(&overflow_list->lock); + /* + * Release raw spinlock, validate generation after + * re-acquiring the lock. + */ + raw_spin_lock(&overflow_list->lock); + if (overflow_list->gen != snapshot_gen) + goto retry; + } + raw_spin_unlock(&overflow_list->lock); +} + +static +void hazptr_synchronize_cpu_slots(int cpu, void *addr) +{ + struct hazptr_percpu_slots *percpu_slots = per_cpu_ptr(&hazptr_percpu_slots, cpu); + unsigned int idx; + + for (idx = 0; idx < NR_HAZPTR_PERCPU_SLOTS; idx++) { + struct hazptr_slot *slot = &percpu_slots->slots[idx]; + + /* Busy-wait if node is found. */ + smp_cond_load_acquire(&slot->addr, VAL != addr); /* Load B */ + } +} + +/* + * hazptr_synchronize: Wait until @addr is released from all slots. + * + * Wait to observe that each slot contains a value that differs from + * @addr before returning. + * Should be called from preemptible context. + */ +void hazptr_synchronize(void *addr) +{ + int cpu; + + /* + * Busy-wait should only be done from preemptible context. + */ + lockdep_assert_preemption_enabled(); + + /* + * Store A precedes hazptr_scan(): it unpublishes addr (sets it to + * NULL or to a different value), and thus hides it from hazard + * pointer readers. + */ + if (!addr) + return; + /* Memory ordering: Store A before Load B. */ + smp_mb(); + /* Scan all CPUs slots. */ + for_each_possible_cpu(cpu) { + /* Scan CPU slots. */ + hazptr_synchronize_cpu_slots(cpu, addr); + /* Scan backup slots in percpu overflow list. 
*/ + hazptr_synchronize_overflow_list(per_cpu_ptr(&percpu_overflow_list, cpu), addr); + } +} +EXPORT_SYMBOL_GPL(hazptr_synchronize); + +struct hazptr_slot *hazptr_chain_backup_slot(struct hazptr_ctx *ctx) +{ + struct overflow_list *overflow_list = this_cpu_ptr(&percpu_overflow_list); + struct hazptr_slot *slot = &ctx->backup_slot.slot; + + slot->addr = NULL; + + raw_spin_lock(&overflow_list->lock); + overflow_list->gen++; + list_add(&ctx->backup_slot.node, &overflow_list->head); + ctx->backup_slot.cpu = smp_processor_id(); + raw_spin_unlock(&overflow_list->lock); + return slot; +} +EXPORT_SYMBOL_GPL(hazptr_chain_backup_slot); + +void hazptr_unchain_backup_slot(struct hazptr_ctx *ctx) +{ + struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, ctx->backup_slot.cpu); + + raw_spin_lock(&overflow_list->lock); + overflow_list->gen++; + list_del(&ctx->backup_slot.node); + raw_spin_unlock(&overflow_list->lock); +} +EXPORT_SYMBOL_GPL(hazptr_unchain_backup_slot); + +void __init hazptr_init(void) +{ + int cpu; + + for_each_possible_cpu(cpu) { + struct overflow_list *overflow_list = per_cpu_ptr(&percpu_overflow_list, cpu); + + raw_spin_lock_init(&overflow_list->lock); + INIT_LIST_HEAD(&overflow_list->head); + } +} -- 2.39.5 Integrate with the scheduler to migrate per-CPU slots to the backup slot on context switch. This ensures that the per-CPU slots won't be used by blocked or preempted tasks holding on hazard pointers for a long time. Signed-off-by: Mathieu Desnoyers Cc: Nicholas Piggin Cc: Michael Ellerman Cc: Greg Kroah-Hartman Cc: Sebastian Andrzej Siewior Cc: "Paul E. McKenney" Cc: Will Deacon Cc: Peter Zijlstra Cc: Boqun Feng Cc: Alan Stern Cc: John Stultz Cc: Neeraj Upadhyay Cc: Linus Torvalds Cc: Andrew Morton Cc: Boqun Feng Cc: Frederic Weisbecker Cc: Joel Fernandes Cc: Josh Triplett Cc: Uladzislau Rezki Cc: Steven Rostedt Cc: Lai Jiangshan Cc: Zqiang Cc: Ingo Molnar Cc: Waiman Long Cc: Mark Rutland Cc: Thomas Gleixner Cc: Vlastimil Babka Cc: maged.michael@gmail.com Cc: Mateusz Guzik Cc: Jonas Oberhauser Cc: rcu@vger.kernel.org Cc: linux-mm@kvack.org Cc: lkmm@lists.linux.dev --- include/linux/hazptr.h | 63 ++++++++++++++++++++++++++++++++++++++++-- include/linux/sched.h | 4 +++ init/init_task.c | 3 ++ kernel/Kconfig.preempt | 10 +++++++ kernel/fork.c | 3 ++ kernel/sched/core.c | 2 ++ 6 files changed, 83 insertions(+), 2 deletions(-) diff --git a/include/linux/hazptr.h b/include/linux/hazptr.h index 70c066ddb0f5..10ac53a42a7a 100644 --- a/include/linux/hazptr.h +++ b/include/linux/hazptr.h @@ -24,6 +24,7 @@ #include #include #include +#include /* 8 slots (each sizeof(void *)) fit in a single cache line. */ #define NR_HAZPTR_PERCPU_SLOTS 8 @@ -46,6 +47,9 @@ struct hazptr_ctx { struct hazptr_slot *slot; /* Backup slot in case all per-CPU slots are used. 
*/ struct hazptr_backup_slot backup_slot; +#ifdef CONFIG_PREEMPT_HAZPTR + struct list_head preempt_node; +#endif }; struct hazptr_percpu_slots { @@ -98,6 +102,50 @@ bool hazptr_slot_is_backup(struct hazptr_ctx *ctx, struct hazptr_slot *slot) return slot == &ctx->backup_slot.slot; } +#ifdef CONFIG_PREEMPT_HAZPTR +static inline +void hazptr_chain_task_ctx(struct hazptr_ctx *ctx) +{ + list_add(&ctx->preempt_node, ¤t->hazptr_ctx_list); +} + +static inline +void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx) +{ + list_del(&ctx->preempt_node); +} + +static inline +void hazptr_note_context_switch(void) +{ + struct hazptr_ctx *ctx; + + list_for_each_entry(ctx, ¤t->hazptr_ctx_list, preempt_node) { + struct hazptr_slot *slot; + + if (hazptr_slot_is_backup(ctx, ctx->slot)) + continue; + slot = hazptr_chain_backup_slot(ctx); + /* + * Move hazard pointer from per-CPU slot to backup slot. + * This requires hazard pointer synchronize to iterate + * on per-CPU slots with load-acquire before iterating + * on the overflow list. + */ + WRITE_ONCE(slot->addr, ctx->slot->addr); + /* + * store-release orders store to backup slot addr before + * store to per-CPU slot addr. + */ + smp_store_release(&ctx->slot->addr, NULL); + } +} +#else +static inline void hazptr_chain_task_ctx(struct hazptr_ctx *ctx) { } +static inline void hazptr_unchain_task_ctx(struct hazptr_ctx *ctx) { } +static inline void hazptr_note_context_switch(void) { } +#endif + /* * hazptr_acquire: Load pointer at address and protect with hazard pointer. * @@ -114,6 +162,7 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p) struct hazptr_slot *slot = NULL; void *addr, *addr2; + ctx->slot = NULL; /* * Load @addr_p to know which address should be protected. */ @@ -121,7 +170,9 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p) for (;;) { if (!addr) return NULL; + guard(preempt)(); + hazptr_chain_task_ctx(ctx); if (likely(!hazptr_slot_is_backup(ctx, slot))) { slot = hazptr_get_free_percpu_slot(); /* @@ -140,8 +191,11 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p) * Re-load @addr_p after storing it to the hazard pointer slot. */ addr2 = READ_ONCE(*addr_p); /* Load A */ - if (likely(ptr_eq(addr2, addr))) + if (likely(ptr_eq(addr2, addr))) { + ctx->slot = slot; + /* Success. Break loop, enable preemption and return. */ break; + } /* * If @addr_p content has changed since the first load, * release the hazard pointer and try again. @@ -150,11 +204,14 @@ void *hazptr_acquire(struct hazptr_ctx *ctx, void * const * addr_p) if (!addr2) { if (hazptr_slot_is_backup(ctx, slot)) hazptr_unchain_backup_slot(ctx); + hazptr_unchain_task_ctx(ctx); + /* Loaded NULL. Enable preemption and return NULL. */ return NULL; } addr = addr2; + hazptr_unchain_task_ctx(ctx); + /* Enable preemption and retry. */ } - ctx->slot = slot; /* * Use addr2 loaded from the second READ_ONCE() to preserve * address dependency ordering. 
@@ -170,11 +227,13 @@ void hazptr_release(struct hazptr_ctx *ctx, void *addr) if (!addr) return; + guard(preempt)(); slot = ctx->slot; WARN_ON_ONCE(slot->addr != addr); smp_store_release(&slot->addr, NULL); if (unlikely(hazptr_slot_is_backup(ctx, slot))) hazptr_unchain_backup_slot(ctx); + hazptr_unchain_task_ctx(ctx); } void hazptr_init(void); diff --git a/include/linux/sched.h b/include/linux/sched.h index b469878de25c..bbec9fd6b163 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -933,6 +933,10 @@ struct task_struct { struct rcu_node *rcu_blocked_node; #endif /* #ifdef CONFIG_PREEMPT_RCU */ +#ifdef CONFIG_PREEMPT_HAZPTR + struct list_head hazptr_ctx_list; +#endif + #ifdef CONFIG_TASKS_RCU unsigned long rcu_tasks_nvcsw; u8 rcu_tasks_holdout; diff --git a/init/init_task.c b/init/init_task.c index a55e2189206f..117aebf5573a 100644 --- a/init/init_task.c +++ b/init/init_task.c @@ -160,6 +160,9 @@ struct task_struct init_task __aligned(L1_CACHE_BYTES) = { .rcu_node_entry = LIST_HEAD_INIT(init_task.rcu_node_entry), .rcu_blocked_node = NULL, #endif +#ifdef CONFIG_PREEMPT_HAZPTR + .hazptr_ctx_list = LIST_HEAD_INIT(init_task.hazptr_ctx_list), +#endif #ifdef CONFIG_TASKS_RCU .rcu_tasks_holdout = false, .rcu_tasks_holdout_list = LIST_HEAD_INIT(init_task.rcu_tasks_holdout_list), diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt index da326800c1c9..beb351b42b7c 100644 --- a/kernel/Kconfig.preempt +++ b/kernel/Kconfig.preempt @@ -189,3 +189,13 @@ config SCHED_CLASS_EXT For more information: Documentation/scheduler/sched-ext.rst https://github.com/sched-ext/scx + +config PREEMPT_HAZPTR + bool "Move Hazard Pointers to Task Slots on Context Switch" + help + Integrate hazard pointers with the scheduler so the active + hazard pointers using preallocated per-CPU slots are moved to + their context local slot on context switch. This prevents + blocked or preempted tasks to hold on to per-CPU slots for + a long time, which would cause higher overhead for short + hazard pointer critical sections. diff --git a/kernel/fork.c b/kernel/fork.c index 3da0f08615a9..35c810fe744e 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1780,6 +1780,9 @@ static inline void rcu_copy_process(struct task_struct *p) p->rcu_blocked_node = NULL; INIT_LIST_HEAD(&p->rcu_node_entry); #endif /* #ifdef CONFIG_PREEMPT_RCU */ +#ifdef CONFIG_PREEMPT_HAZPTR + INIT_LIST_HEAD(&p->hazptr_ctx_list); +#endif /* #ifdef CONFIG_PREEMPT_HAZPTR */ #ifdef CONFIG_TASKS_RCU p->rcu_tasks_holdout = false; INIT_LIST_HEAD(&p->rcu_tasks_holdout_list); diff --git a/kernel/sched/core.c b/kernel/sched/core.c index f754a60de848..ac8bf2708140 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -60,6 +60,7 @@ #include #include #include +#include #include #include #include @@ -6812,6 +6813,7 @@ static void __sched notrace __schedule(int sched_mode) local_irq_disable(); rcu_note_context_switch(preempt); + hazptr_note_context_switch(); /* * Make sure that signal_pending_state()->signal_pending() below -- 2.39.5
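As an illustration of the comparison pattern recommended above, here is a
minimal sketch of a reader that compares an rcu_dereference() result against
a previously cached pointer. The names (struct foo, gp, cached) are
hypothetical and not part of the patches; only ptr_eq() and the existing RCU
primitives are taken from this series and the kernel API.

	struct foo { int a; };
	struct foo __rcu *gp;	/* hypothetical RCU-protected pointer */

	int reader(struct foo *cached)
	{
		struct foo *p;
		int val = -1;

		rcu_read_lock();
		p = rcu_dereference(gp);
		/*
		 * ptr_eq() hides the pointer values from the compiler, so
		 * even when p == cached the load below keeps its address
		 * dependency on the rcu_dereference() above rather than
		 * being rewritten to use "cached".
		 */
		if (p && ptr_eq(p, cached))
			val = p->a;
		rcu_read_unlock();
		return val;
	}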
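Similarly, a sketch of the reader-side calling pattern for the hazard pointer
API introduced above. The protected object and the global "gp" pointer are
hypothetical; struct hazptr_ctx, hazptr_acquire() and hazptr_release() are as
defined in this series.

	struct foo { int a; };
	struct foo *gp;		/* published by the updater */

	int reader(void)
	{
		struct hazptr_ctx ctx;	/* provides the on-stack backup slot */
		struct foo *p;
		int val = -1;

		/* Load gp and publish the loaded address in a hazptr slot. */
		p = hazptr_acquire(&ctx, (void * const *)&gp);
		if (p) {
			/* p cannot be reclaimed until hazptr_release(). */
			val = READ_ONCE(p->a);
			hazptr_release(&ctx, p);
		}
		return val;
	}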
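And the matching updater side, again with hypothetical names. Unlike an RCU
updater, it can reclaim the old object as soon as hazptr_synchronize()
observes that no slot (per-CPU or backup) still holds its address, without
waiting for a grace period; hazptr_synchronize() must be called from
preemptible context.

	void update(struct foo *new)
	{
		struct foo *old;

		/* Unpublish the old object (Store A in the comments above). */
		old = xchg(&gp, new);
		if (old) {
			/* Wait until no hazard pointer slot contains "old". */
			hazptr_synchronize(old);
			kfree(old);
		}
	}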