The netlink message needs to be formatted and sent inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. Because it will happen inside an already existing critical section, it has to assume that the neighbor lock is held. Add a helper __neigh_fill_info(), which is like neigh_fill_info(), but makes this assumption. Convert neigh_fill_info() to a wrapper around this new helper. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 24 +++++++++++++++++------- 1 file changed, 17 insertions(+), 7 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 96a3b1a93252..6cdd93dfa3ea 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -2622,8 +2622,8 @@ static int neightbl_dump_info(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; } -static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, - u32 pid, u32 seq, int type, unsigned int flags) +static int __neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, + u32 pid, u32 seq, int type, unsigned int flags) { u32 neigh_flags, neigh_flags_ext; unsigned long now = jiffies; @@ -2649,23 +2649,19 @@ static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, if (nla_put(skb, NDA_DST, neigh->tbl->key_len, neigh->primary_key)) goto nla_put_failure; - read_lock_bh(&neigh->lock); ndm->ndm_state = neigh->nud_state; if (neigh->nud_state & NUD_VALID) { char haddr[MAX_ADDR_LEN]; neigh_ha_snapshot(haddr, neigh, neigh->dev); - if (nla_put(skb, NDA_LLADDR, neigh->dev->addr_len, haddr) < 0) { - read_unlock_bh(&neigh->lock); + if (nla_put(skb, NDA_LLADDR, neigh->dev->addr_len, haddr) < 0) goto nla_put_failure; - } } ci.ndm_used = jiffies_to_clock_t(now - neigh->used); ci.ndm_confirmed = jiffies_to_clock_t(now - neigh->confirmed); ci.ndm_updated = jiffies_to_clock_t(now - neigh->updated); ci.ndm_refcnt = refcount_read(&neigh->refcnt) - 1; - read_unlock_bh(&neigh->lock); if (nla_put_u32(skb, NDA_PROBES, atomic_read(&neigh->probes)) || nla_put(skb, NDA_CACHEINFO, sizeof(ci), &ci)) @@ -2684,6 +2680,20 @@ static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, return -EMSGSIZE; } +static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh, + u32 pid, u32 seq, int type, unsigned int flags) + __releases(neigh->lock) + __acquires(neigh->lock) +{ + int err; + + read_lock_bh(&neigh->lock); + err = __neigh_fill_info(skb, neigh, pid, seq, type, flags); + read_unlock_bh(&neigh->lock); + + return err; +} + static int pneigh_fill_info(struct sk_buff *skb, struct pneigh_entry *pn, u32 pid, u32 seq, int type, unsigned int flags, struct neigh_table *tbl) -- 2.51.1 Andy Roulin has described an issue with the current neighbor notification scheme as follows. This was also presented publicly at the link below. neigh_update sends a rtnl notification if an update, e.g., nud_state change, was done but there is no guarantee of ordering of the rtnl notifications. Consider the following scenario: userspace thread kernel thread ================ ============= neigh_update write_lock_bh(n->lock) n->nud_state = STALE write_unlock_bh(n->lock) neigh_notify neigh_fill_info read_lock_bh(n->lock) ndm->nud_state = STALE read_unlock_bh(n->lock) --------------------------> neigh:update write_lock_bh(n->lock) n->nud_state = REACHABLE write_unlock_bh(n->lock) neigh_notify neigh_fill_info read_lock_bh(n->lock) ndm->nud_state = REACHABLE read_unlock_bh(n->lock) rtnl_nofify RTNL REACHABLE sent <-------- rtnl_notify RTNL STALE sent In this scenario, the kernel neigh is updated first to STALE and then REACHABLE but the netlink notifications are sent out of order, first REACHABLE and then STALE. The solution is to send the netlink message inside the same critical section that formats the message. That way both the contents and ordering of the message reflect the same state, and we cannot see the abovementioned out-of-order delivery. Even with this patch, an issue remains that the contents of the message may not reflect the changes made to the neighbor. A kernel thread might still interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents. The following patches will attempt to address that issue. To support those future patches, convert __neigh_notify() to a helper that assumes that the neighbor lock is already taken by having it call __neigh_fill_info() instead of neigh_fill_info(). Add a new helper, neigh_notify(), which takes the lock before calling __neigh_notify(). Migrate all callers to use the latter. Link: https://lore.kernel.org/netdev/ed6768c1-80b8-aee2-e545-b51661d49336@nvidia.com/ Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 20 ++++++++++++++------ 1 file changed, 14 insertions(+), 6 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 6cdd93dfa3ea..5a56b787b5ec 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -51,8 +51,7 @@ do { \ #define PNEIGH_HASHMASK 0xF static void neigh_timer_handler(struct timer_list *t); -static void __neigh_notify(struct neighbour *n, int type, int flags, - u32 pid); +static void neigh_notify(struct neighbour *n, int type, int flags, u32 pid); static void neigh_update_notify(struct neighbour *neigh, u32 nlmsg_pid); static void pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev, bool skip_perm); @@ -117,7 +116,7 @@ static int neigh_blackhole(struct neighbour *neigh, struct sk_buff *skb) static void neigh_cleanup_and_release(struct neighbour *neigh) { trace_neigh_cleanup_and_release(neigh, 0); - __neigh_notify(neigh, RTM_DELNEIGH, 0, 0); + neigh_notify(neigh, RTM_DELNEIGH, 0, 0); call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); neigh_release(neigh); } @@ -2740,7 +2739,7 @@ static int pneigh_fill_info(struct sk_buff *skb, struct pneigh_entry *pn, static void neigh_update_notify(struct neighbour *neigh, u32 nlmsg_pid) { call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); - __neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); + neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); } static bool neigh_master_filtered(struct net_device *dev, int master_idx) @@ -3555,7 +3554,7 @@ static void __neigh_notify(struct neighbour *n, int type, int flags, if (skb == NULL) goto errout; - err = neigh_fill_info(skb, n, pid, 0, type, flags); + err = __neigh_fill_info(skb, n, pid, 0, type, flags); if (err < 0) { /* -EMSGSIZE implies BUG in neigh_nlmsg_size() */ WARN_ON(err == -EMSGSIZE); @@ -3570,9 +3569,18 @@ static void __neigh_notify(struct neighbour *n, int type, int flags, rcu_read_unlock(); } +static void neigh_notify(struct neighbour *neigh, int type, int flags, u32 pid) + __releases(neigh->lock) + __acquires(neigh->lock) +{ + read_lock_bh(&neigh->lock); + __neigh_notify(neigh, type, flags, pid); + read_unlock_bh(&neigh->lock); +} + void neigh_app_ns(struct neighbour *n) { - __neigh_notify(n, RTM_GETNEIGH, NLM_F_REQUEST, 0); + neigh_notify(n, RTM_GETNEIGH, NLM_F_REQUEST, 0); } EXPORT_SYMBOL(neigh_app_ns); -- 2.51.1 In order to make manipulation with this bit of code clearer, extract it to a helper function, neigh_update_process_arp_queue(). Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 78 ++++++++++++++++++++++++-------------------- 1 file changed, 43 insertions(+), 35 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 5a56b787b5ec..efbfaaba0e0b 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1302,6 +1302,47 @@ static void neigh_update_hhs(struct neighbour *neigh) } } +static void neigh_update_process_arp_queue(struct neighbour *neigh) + __releases(neigh->lock) + __acquires(neigh->lock) +{ + struct sk_buff *skb; + + /* Again: avoid deadlock if something went wrong. */ + while (neigh->nud_state & NUD_VALID && + (skb = __skb_dequeue(&neigh->arp_queue)) != NULL) { + struct dst_entry *dst = skb_dst(skb); + struct neighbour *n2, *n1 = neigh; + + write_unlock_bh(&neigh->lock); + + rcu_read_lock(); + + /* Why not just use 'neigh' as-is? The problem is that + * things such as shaper, eql, and sch_teql can end up + * using alternative, different, neigh objects to output + * the packet in the output path. So what we need to do + * here is re-lookup the top-level neigh in the path so + * we can reinject the packet there. + */ + n2 = NULL; + if (dst && + READ_ONCE(dst->obsolete) != DST_OBSOLETE_DEAD) { + n2 = dst_neigh_lookup_skb(dst, skb); + if (n2) + n1 = n2; + } + READ_ONCE(n1->output)(n1, skb); + if (n2) + neigh_release(n2); + rcu_read_unlock(); + + write_lock_bh(&neigh->lock); + } + __skb_queue_purge(&neigh->arp_queue); + neigh->arp_queue_len_bytes = 0; +} + /* Generic update routine. -- lladdr is new lladdr or NULL, if it is not supplied. -- new is new state. @@ -1461,43 +1502,10 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, neigh_connect(neigh); else neigh_suspect(neigh); - if (!(old & NUD_VALID)) { - struct sk_buff *skb; - /* Again: avoid dead loop if something went wrong */ + if (!(old & NUD_VALID)) + neigh_update_process_arp_queue(neigh); - while (neigh->nud_state & NUD_VALID && - (skb = __skb_dequeue(&neigh->arp_queue)) != NULL) { - struct dst_entry *dst = skb_dst(skb); - struct neighbour *n2, *n1 = neigh; - write_unlock_bh(&neigh->lock); - - rcu_read_lock(); - - /* Why not just use 'neigh' as-is? The problem is that - * things such as shaper, eql, and sch_teql can end up - * using alternative, different, neigh objects to output - * the packet in the output path. So what we need to do - * here is re-lookup the top-level neigh in the path so - * we can reinject the packet there. - */ - n2 = NULL; - if (dst && - READ_ONCE(dst->obsolete) != DST_OBSOLETE_DEAD) { - n2 = dst_neigh_lookup_skb(dst, skb); - if (n2) - n1 = n2; - } - READ_ONCE(n1->output)(n1, skb); - if (n2) - neigh_release(n2); - rcu_read_unlock(); - - write_lock_bh(&neigh->lock); - } - __skb_queue_purge(&neigh->arp_queue); - neigh->arp_queue_len_bytes = 0; - } out: if (update_isrouter) neigh_update_is_router(neigh, flags, ¬ify); -- 2.51.1 ARP queue processing unlocks the neighbor lock, which can allow another thread to asynchronously perform a neighbor update and send an out of order notification. Therefore this needs to be done after the notification is sent. Move it just before the end of the critical section. Since neigh_update_process_arp_queue() unlocks, it does not form a part of the critical section anymore but it can benefit from the lock being taken. The intention is to eventually do the RTNL notification before this call. This motion crosses a call to neigh_update_is_router(), which should not influence processing of the ARP queue. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index efbfaaba0e0b..f3290385db68 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1369,6 +1369,7 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, struct netlink_ext_ack *extack) { bool gc_update = false, managed_update = false; + bool process_arp_queue = false; int update_isrouter = 0; struct net_device *dev; int err, notify = 0; @@ -1504,12 +1505,17 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, neigh_suspect(neigh); if (!(old & NUD_VALID)) - neigh_update_process_arp_queue(neigh); + process_arp_queue = true; out: if (update_isrouter) neigh_update_is_router(neigh, flags, ¬ify); + + if (process_arp_queue) + neigh_update_process_arp_queue(neigh); + write_unlock_bh(&neigh->lock); + if (((new ^ old) & NUD_PERMANENT) || gc_update) neigh_update_gc_list(neigh); if (managed_update) -- 2.51.1 The obvious idea behind the helper is to keep together the two bits that should be done either both or neither: the internal notifier chain message, and the netlink notification. To make sure that the notification sent reflects the change being made, the netlink message needs to be send inside the critical section where the neighbor is changed. But for the notifier chain, there is no such need: the listeners do not assume lock, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock. Therefore these two items have each different locking needs. Now we could unlock inside the helper, but I find that error prone, and the fact that the notification is conditional in the first place does not help to make the call site obvious. So in this patch, the helper is instead removed and the body, which is just these two calls, inlined. That way we can use each notifier independently. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index f3290385db68..e37db91c5339 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -52,7 +52,6 @@ do { \ static void neigh_timer_handler(struct timer_list *t); static void neigh_notify(struct neighbour *n, int type, int flags, u32 pid); -static void neigh_update_notify(struct neighbour *neigh, u32 nlmsg_pid); static void pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev, bool skip_perm); @@ -1187,8 +1186,10 @@ static void neigh_timer_handler(struct timer_list *t) write_unlock(&neigh->lock); } - if (notify) - neigh_update_notify(neigh, 0); + if (notify) { + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); + neigh_notify(neigh, RTM_NEWNEIGH, 0, 0); + } trace_neigh_timer_handler(neigh, 0); @@ -1520,8 +1521,12 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, neigh_update_gc_list(neigh); if (managed_update) neigh_update_managed_list(neigh); - if (notify) - neigh_update_notify(neigh, nlmsg_pid); + + if (notify) { + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); + neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); + } + trace_neigh_update_done(neigh, err); return err; } @@ -2750,12 +2755,6 @@ static int pneigh_fill_info(struct sk_buff *skb, struct pneigh_entry *pn, return -EMSGSIZE; } -static void neigh_update_notify(struct neighbour *neigh, u32 nlmsg_pid) -{ - call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); - neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); -} - static bool neigh_master_filtered(struct net_device *dev, int master_idx) { struct net_device *master; -- 2.51.1 The netlink message needs to be send inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. On the other hand, there is no such need in case of notifier chain: the listeners do not assume lock, and often in fact just schedule a delayed work to act on the neighbor later. At least one in fact also takes the neighbor lock. This requires that the netlink notification be done before the internal notifier chain message is sent. That is safe to do, because the current listeners, as well as __neigh_notify(), only read the updated neighbor fields, and never modify them. (Apart from locking.) Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index e37db91c5339..ada691690948 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1187,8 +1187,8 @@ static void neigh_timer_handler(struct timer_list *t) } if (notify) { - call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); neigh_notify(neigh, RTM_NEWNEIGH, 0, 0); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); } trace_neigh_timer_handler(neigh, 0); @@ -1523,8 +1523,8 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, neigh_update_managed_list(neigh); if (notify) { - call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); + call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); } trace_neigh_update_done(neigh, err); -- 2.51.1 As noted in a previous patch, one race remains in the current code. A kernel thread might interrupt a userspace thread after the change is done, but before formatting and sending the message. Then what we would see is two messages with the same contents: userspace thread kernel thread ================ ============= neigh_update write_lock_bh(n->lock) n->nud_state = STALE write_unlock_bh(n->lock) --------------------------> neigh:update write_lock_bh(n->lock) n->nud_state = REACHABLE write_unlock_bh(n->lock) neigh_notify read_lock_bh(n->lock) __neigh_fill_info ndm->nud_state = REACHABLE rtnl_notify read_unlock_bh(n->lock) RTNL REACHABLE sent <-------- neigh_notify read_lock_bh(n->lock) __neigh_fill_info ndm->nud_state = REACHABLE rtnl_notify read_unlock_bh(n->lock) RTNL REACHABLE sent again The solution is to send the netlink message inside the critical section where the neighbor is changed, so that it reflects the notified-upon neighbor state. To that end, in __neigh_update(), move the current neigh_notify() call up to said critical section, and convert it to __neigh_notify(), because the lock is held. This motion crosses calls to neigh_update_managed_list(), neigh_update_gc_list() and neigh_update_process_arp_queue(), all of which potentially unlock and give an opportunity for the above race. This also crosses a call to neigh_update_process_arp_queue() which calls neigh->output(), which might be neigh_resolve_output() calls neigh_event_send() calls neigh_event_send_probe() calls __neigh_event_send() calls neigh_probe(), which touches neigh->probes, an update which will now not be visible in the notification. However, there is indication that there is no promise that these changes will be accurately projected to notifications: fib6_table_lookup() indirectly calls route.c's find_match() calls rt6_probe(), which looks up a neighbor and call __neigh_set_probe_once(), which sets neigh->probes to 0, but neither this nor the caller seems to send a notification. Additionally, the neighbor object that the neigh_probe() mentioned above is called on, might be the alternative neighbor looked up for the ARP queue packet destination. If that is the case, the changed value of n1->probes is not notified anywhere. So at least in some circumstances, the reported number of probes needs to be assumed to change without notification. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index ada691690948..635d71c6420f 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -52,6 +52,7 @@ do { \ static void neigh_timer_handler(struct timer_list *t); static void neigh_notify(struct neighbour *n, int type, int flags, u32 pid); +static void __neigh_notify(struct neighbour *n, int type, int flags, u32 pid); static void pneigh_ifdown(struct neigh_table *tbl, struct net_device *dev, bool skip_perm); @@ -1512,6 +1513,9 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, if (update_isrouter) neigh_update_is_router(neigh, flags, ¬ify); + if (notify) + __neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); + if (process_arp_queue) neigh_update_process_arp_queue(neigh); @@ -1522,10 +1526,8 @@ static int __neigh_update(struct neighbour *neigh, const u8 *lladdr, if (managed_update) neigh_update_managed_list(neigh); - if (notify) { - neigh_notify(neigh, RTM_NEWNEIGH, 0, nlmsg_pid); + if (notify) call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); - } trace_neigh_update_done(neigh, err); return err; -- 2.51.1 Similarly to the issue from the previous patch, neigh_timer_handler() also updates the neighbor separately from formatting and sending the netlink notification message. We have not seen reports to the effect of this causing trouble, but in theory, the same sort of issues could have come up: neigh_timer_handler() would make changes as necessary, but before formatting and sending a notification, is interrupted before sending by another thread, which makes a parallel change and sends its own message. The message send that is prompted by an earlier change thus contains information that does not reflect the change having been made. To solve this, the netlink notification needs to be in the same critical section that updates the neighbor. The critical section is ended by the neigh_probe() call which drops the lock before calling solicit. Stretching the critical section over the solicit call is problematic, because that can then involved all sorts of forwarding callbacks. Therefore, like in the previous patch, split the netlink notification away from the internal one and move it ahead of the probe call. Signed-off-by: Petr Machata Reviewed-by: Ido Schimmel --- net/core/neighbour.c | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/net/core/neighbour.c b/net/core/neighbour.c index 635d71c6420f..5512dd7035b1 100644 --- a/net/core/neighbour.c +++ b/net/core/neighbour.c @@ -1180,6 +1180,10 @@ static void neigh_timer_handler(struct timer_list *t) if (!mod_timer(&neigh->timer, next)) neigh_hold(neigh); } + + if (notify) + __neigh_notify(neigh, RTM_NEWNEIGH, 0, 0); + if (neigh->nud_state & (NUD_INCOMPLETE | NUD_PROBE)) { neigh_probe(neigh); } else { @@ -1187,10 +1191,8 @@ static void neigh_timer_handler(struct timer_list *t) write_unlock(&neigh->lock); } - if (notify) { - neigh_notify(neigh, RTM_NEWNEIGH, 0, 0); + if (notify) call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, neigh); - } trace_neigh_timer_handler(neigh, 0); -- 2.51.1