tcp_disconnect() calls tcp_send_active_reset() with gfp_any(), which
returns GFP_KERNEL in process context. This can trigger a circular
locking dependency when tcp_disconnect() runs during teardown of a
block device that is backed by network storage.
The deadlock scenario occurs with storage configurations like MD RAID
over NVMeOF TCP when tearing down the block device:
CPU0 (mdadm --stop /dev/mdX):              CPU1 (NVMe I/O submission):
================================           ===========================
del_gendisk()
  blk_unregister_queue()
    elevator_set_none()
      elevator_switch()
        __synchronize_srcu()
        [holds set->srcu]
        [waits for operations]
                                           nvme_tcp_queue_rq()
                                             nvme_tcp_try_send()
                                               tcp_sendmsg()
                                                 lock_sock_nested()
                                                 [holds sk_lock-AF_INET-NVME]
                                                 [can wait for set->srcu]

[cleanup triggers NVMe disconnect]
nvme_tcp_teardown_io_queues()
  nvme_tcp_free_queue()
    sock_release()
      __sock_release()
        tcp_close()
          lock_sock_nested()
          [holds sk_lock-AF_INET-NVME]
          __tcp_close()
            tcp_disconnect()
              tcp_send_active_reset()
                alloc_skb(gfp_any())
                [GFP_KERNEL in process context]
                kmem_cache_alloc_node()
                  fs_reclaim_acquire()
                  [can trigger writeback]
                  [needs block layer]
                  [waits for set->srcu]

                *** DEADLOCK ***
blktests ./check md/001:
[ 95.764798] run blktests md/001 at 2025-11-24 21:13:10
[ 96.020965] brd: module loaded
[ 96.098934] Key type psk registered
[ 96.237974] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 96.244988] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 96.286775] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 96.290980] nvme nvme0: creating 48 I/O queues.
[ 96.304554] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 96.322530] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 96.414331] md: async del_gendisk mode will be removed in future, please upgrade to mdadm-4.5+
[ 96.414427] block device autoloading is deprecated and will be removed.
[ 96.473347] md/raid1:md127: active with 1 out of 2 mirrors
[ 96.474602] md127: detected capacity change from 0 to 2093056
[ 96.665424] md127: detected capacity change from 2093056 to 0
[ 96.665433] md: md127 stopped.
[ 96.694365] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 96.708310] block nvme0n1: no available path - failing I/O
[ 96.708379] block nvme0n1: no available path - failing I/O
[ 96.708414] block nvme0n1: no available path - failing I/O
[ 96.708734] block nvme0n1: no available path - failing I/O
[ 96.708745] block nvme0n1: no available path - failing I/O
[ 96.708761] block nvme0n1: no available path - failing I/O
[ 96.812432] ======================================================
[ 96.816828] WARNING: possible circular locking dependency detected
[ 96.821054] 6.18.0-rc6lblk-fnext+ #7 Tainted: G N
[ 96.825312] ------------------------------------------------------
[ 96.830181] nvme/2595 is trying to acquire lock:
[ 96.833374] ffffffff82e487e0 (fs_reclaim){+.+.}-{0:0}, at: kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.839640]
but task is already holding lock:
[ 96.843657] ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 96.849247]
which lock already depends on the new lock.
[ 96.854869]
the existing dependency chain (in reverse order) is:
[ 96.860473]
-> #4 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 96.865028] lock_sock_nested+0x2e/0x70
[ 96.868084] tcp_sendmsg+0x1a/0x40
[ 96.870833] sock_sendmsg+0xed/0x110
[ 96.873677] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 96.878007] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 96.881344] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 96.884399] blk_mq_dispatch_rq_list+0x29a/0x800
[ 96.887237] __blk_mq_sched_dispatch_requests+0x3de/0x5f0
[ 96.891116] blk_mq_sched_dispatch_requests+0x29/0x70
[ 96.894166] blk_mq_run_work_fn+0x76/0x1b0
[ 96.896710] process_one_work+0x211/0x630
[ 96.899162] worker_thread+0x184/0x330
[ 96.901503] kthread+0x10d/0x250
[ 96.903570] ret_from_fork+0x29a/0x300
[ 96.905888] ret_from_fork_asm+0x1a/0x30
[ 96.908186]
-> #3 (set->srcu){.+.+}-{0:0}:
[ 96.910188] __synchronize_srcu+0x49/0x170
[ 96.911882] elevator_switch+0xc9/0x330
[ 96.913459] elevator_change+0x133/0x1b0
[ 96.915079] elevator_set_none+0x3b/0x80
[ 96.916714] blk_unregister_queue+0xb0/0x120
[ 96.918450] __del_gendisk+0x14e/0x3c0
[ 96.920700] del_gendisk+0x75/0xa0
[ 96.922098] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 96.924044] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 96.926220] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 96.928310] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 96.930429] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 96.932450] kernfs_fop_write_iter+0x16d/0x220
[ 96.934271] vfs_write+0x37b/0x520
[ 96.935746] ksys_write+0x67/0xe0
[ 96.937141] do_syscall_64+0x76/0xa60
[ 96.938645] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.940628]
-> #2 (&q->elevator_lock){+.+.}-{4:4}:
[ 96.942903] __mutex_lock+0xa2/0x1150
[ 96.944434] elevator_change+0x9b/0x1b0
[ 96.946046] elv_iosched_store+0x116/0x190
[ 96.947746] kernfs_fop_write_iter+0x16d/0x220
[ 96.949524] vfs_write+0x37b/0x520
[ 96.951506] ksys_write+0x67/0xe0
[ 96.952934] do_syscall_64+0x76/0xa60
[ 96.954457] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 96.956489]
-> #1 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 96.959011] blk_alloc_queue+0x30e/0x350
[ 96.960664] blk_mq_alloc_queue+0x61/0xd0
[ 96.962293] scsi_alloc_sdev+0x2a0/0x3e0
[ 96.963954] scsi_probe_and_add_lun+0x1bd/0x430
[ 96.965782] __scsi_add_device+0x109/0x120
[ 96.967461] ata_scsi_scan_host+0x97/0x1c0
[ 96.969198] async_run_entry_fn+0x30/0x130
[ 96.970903] process_one_work+0x211/0x630
[ 96.972577] worker_thread+0x184/0x330
[ 96.974097] kthread+0x10d/0x250
[ 96.975448] ret_from_fork+0x29a/0x300
[ 96.977050] ret_from_fork_asm+0x1a/0x30
[ 96.978705]
-> #0 (fs_reclaim){+.+.}-{0:0}:
[ 96.981265] __lock_acquire+0x1468/0x2210
[ 96.982950] lock_acquire+0xd3/0x2f0
[ 96.984445] fs_reclaim_acquire+0x99/0xd0
[ 96.986141] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 96.988171] __alloc_skb+0x15f/0x190
[ 96.989681] tcp_send_active_reset+0x3f/0x1e0
[ 96.991248] tcp_disconnect+0x551/0x770
[ 96.992851] __tcp_close+0x2c7/0x520
[ 96.994327] tcp_close+0x20/0x80
[ 96.995727] inet_release+0x34/0x60
[ 96.997168] __sock_release+0x3d/0xc0
[ 96.998688] sock_close+0x14/0x20
[ 97.000058] __fput+0xf1/0x2c0
[ 97.001388] task_work_run+0x58/0x90
[ 97.002922] exit_to_user_mode_loop+0x12c/0x150
[ 97.004720] do_syscall_64+0x2a0/0xa60
[ 97.006256] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.008279]
other info that might help us debug this:
[ 97.011827] Chain exists of:
fs_reclaim --> set->srcu --> sk_lock-AF_INET-NVME
[ 97.015506] Possible unsafe locking scenario:
[ 97.017718]        CPU0                    CPU1
[ 97.019363]        ----                    ----
[ 97.020984]   lock(sk_lock-AF_INET-NVME);
[ 97.022399]                               lock(set->srcu);
[ 97.024415]                               lock(sk_lock-AF_INET-NVME);
[ 97.026798]   lock(fs_reclaim);
[ 97.027927]
*** DEADLOCK ***
[ 97.030010] 2 locks held by nvme/2595:
[ 97.031353] #0: ffff88810047b388 (&sb->s_type->i_mutex_key#10){+.+.}-{4:4}, at: __sock_release+0x30/0xc0
[ 97.034820] #1: ffff88810c503358 (sk_lock-AF_INET-NVME){+.+.}-{0:0}, at: tcp_close+0x15/0x80
[ 97.037806]
stack backtrace:
[ 97.039367] CPU: 2 UID: 0 PID: 2595 Comm: nvme Tainted: G N 6.18.0-rc6lblk-fnext+ #7 PREEMPT(voluntary)
[ 97.039370] Tainted: [N]=TEST
[ 97.039371] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 97.039372] Call Trace:
[ 97.039374]
[ 97.039375] dump_stack_lvl+0x75/0xb0
[ 97.039379] print_circular_bug+0x26a/0x330
[ 97.039381] check_noncircular+0x12f/0x150
[ 97.039385] __lock_acquire+0x1468/0x2210
[ 97.039388] lock_acquire+0xd3/0x2f0
[ 97.039390] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039393] fs_reclaim_acquire+0x99/0xd0
[ 97.039395] ? kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039396] kmem_cache_alloc_node_noprof+0x5a/0x770
[ 97.039397] ? __alloc_skb+0x15f/0x190
[ 97.039400] ? __alloc_skb+0x15f/0x190
[ 97.039401] __alloc_skb+0x15f/0x190
[ 97.039403] tcp_send_active_reset+0x3f/0x1e0
[ 97.039405] tcp_disconnect+0x551/0x770
[ 97.039407] __tcp_close+0x2c7/0x520
[ 97.039408] tcp_close+0x20/0x80
[ 97.039410] inet_release+0x34/0x60
[ 97.039412] __sock_release+0x3d/0xc0
[ 97.039413] sock_close+0x14/0x20
[ 97.039414] __fput+0xf1/0x2c0
[ 97.039416] task_work_run+0x58/0x90
[ 97.039418] exit_to_user_mode_loop+0x12c/0x150
[ 97.039420] do_syscall_64+0x2a0/0xa60
[ 97.039422] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 97.039423] RIP: 0033:0x7f869032e317
[ 97.039425] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 97.039430] RSP: 002b:00007fff7ceb31c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 97.039432] RAX: 0000000000000001 RBX: 00007fff7ceb44bd RCX: 00007f869032e317
[ 97.039433] RDX: 0000000000000001 RSI: 00007f869044c719 RDI: 0000000000000003
[ 97.039433] RBP: 0000000000000003 R08: 0000000017c8a850 R09: 00007f86903c44e0
[ 97.039434] R10: 00007f8690252130 R11: 0000000000000246 R12: 00007f869044c719
[ 97.039435] R13: 0000000017c8a4c0 R14: 0000000017c8a4c0 R15: 0000000017c8b680
[ 97.039438]
[ 97.263257] brd: module unloaded
Fix this by using GFP_ATOMIC instead of gfp_any() in tcp_disconnect().
This matches the existing pattern in __tcp_close(), which already uses
GFP_ATOMIC when calling tcp_send_active_reset() (tcp.c:3246).
gfp_any() only distinguishes softirq from process context; it does not
account for locks held by the caller that make sleeping, and hence
direct reclaim, unsafe.
The issue was discovered with blktests md/001, which creates an MD RAID1
array with internal bitmap over NVMe-TCP, then stops the array. This
triggers the block device removal -> elevator cleanup -> network teardown
path that exposes the circular dependency.
Signed-off-by: Chaitanya Kulkarni
---
Hi,

Full disclosure: I'm not an expert in this area; if there is a better
solution, I'll be happy to try it.

-ck
---
net/ipv4/tcp.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8a18aeca7ab0..9fd01a8b90b5 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3363,14 +3363,15 @@ int tcp_disconnect(struct sock *sk, int flags)
} else if (unlikely(tp->repair)) {
WRITE_ONCE(sk->sk_err, ECONNABORTED);
} else if (tcp_need_reset(old_state)) {
- tcp_send_active_reset(sk, gfp_any(), SK_RST_REASON_TCP_STATE);
+ /* Use GFP_ATOMIC since we're holding sk_lock */
+ tcp_send_active_reset(sk, GFP_ATOMIC, SK_RST_REASON_TCP_STATE);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (tp->snd_nxt != tp->write_seq &&
(1 << old_state) & (TCPF_CLOSING | TCPF_LAST_ACK)) {
/* The last check adjusts for discrepancy of Linux wrt. RFC
* states
*/
- tcp_send_active_reset(sk, gfp_any(),
+ tcp_send_active_reset(sk, GFP_ATOMIC,
SK_RST_REASON_TCP_DISCONNECT_WITH_DATA);
WRITE_ONCE(sk->sk_err, ECONNRESET);
} else if (old_state == TCP_SYN_SENT)
--
2.40.0