When an sk_msg scatterlist ring wraps (sg.end < sg.start), tls_push_record()
chains the tail portion of the ring to the head using sg_chain(). An extra
entry in the sg array is reserved for this:

	struct sk_msg_sg {
		[...]
		/* The extra two elements:
		 * 1) used for chaining the front and sections when the list becomes
		 *    partitioned (e.g. end < start). The crypto APIs require the
		 *    chaining;
		 * 2) to chain tailer SG entries after the message.
		 */
		struct scatterlist	data[MAX_MSG_FRAGS + 2];

The current code uses MAX_SKB_FRAGS + 1 as the ring size:

	sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
		 MAX_SKB_FRAGS - msg_pl->sg.start + 1,
		 msg_pl->sg.data);

This places the chain pointer at:

	sg_chain(data[start], MAX_SKB_FRAGS - start + 1, ..)
	  = &data[start] + (MAX_SKB_FRAGS - start + 1) - 1
	  = &data[start + (MAX_SKB_FRAGS - start + 1) - 1]
	  = &data[MAX_SKB_FRAGS]

instead of the true last entry, data[MAX_MSG_FRAGS + 1] (MAX_MSG_FRAGS is
defined as MAX_SKB_FRAGS). This is likely due to a "race" between the commit
under Fixes and commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on
psock down") landing close together.

Convert to ARRAY_SIZE() and drop the &data[start] / - start adjustment
(as suggested by Sabrina).
Reported-by: 钱一铭
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski
---
CC: john.fastabend@gmail.com
CC: sd@queasysnail.net
CC: daniel@iogearbox.net
CC: jonathan.lemon@gmail.com
CC: bpf@vger.kernel.org
---
 net/tls/tls_sw.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2590e855f6a5..2608b0c01849 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -800,11 +800,9 @@ static int tls_push_record(struct sock *sk, int flags,
 		sg_mark_end(sk_msg_elem(msg_pl, i));
 	}
 
-	if (msg_pl->sg.end < msg_pl->sg.start) {
-		sg_chain(&msg_pl->sg.data[msg_pl->sg.start],
-			 MAX_SKB_FRAGS - msg_pl->sg.start + 1,
+	if (msg_pl->sg.end < msg_pl->sg.start)
+		sg_chain(msg_pl->sg.data, ARRAY_SIZE(msg_pl->sg.data),
 			 msg_pl->sg.data);
-	}
 
 	i = msg_pl->sg.start;
 	sg_chain(rec->sg_aead_in, 2, &msg_pl->sg.data[i]);
-- 
2.54.0

Sashiko points out that if end == 0 (and start != 0) the current code will
create a chain link to the content type entry right after the wrap link,
i.e. the wrap link would point directly at another chain link. The
scatterlist sg_next() iterator does not recursively resolve consecutive
chain links, meaning this is illegal input to crypto.

The wrapping link is unnecessary if end == 0. end is the entry after the
last one used, so end == 0 means there is nothing pushed after the wrap:

	end      start            i
	 v         v              v
	[ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap]

Skip the wrapping in this case. TLS 1.3 can then use the "wrapping slot"
for its chaining when end == 0, which avoids the chain-after-chain.

Move the wrap chaining before marking END and chaining off the content
type. That feels like the more logical ordering to me, but should not
matter from a functional perspective.
Reported-by: Sashiko
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski
---
CC: john.fastabend@gmail.com
CC: sd@queasysnail.net
CC: daniel@iogearbox.net
CC: jonathan.lemon@gmail.com
---
 net/tls/tls_sw.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 2608b0c01849..3bfdaf5e64f5 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -789,21 +789,33 @@ static int tls_push_record(struct sock *sk, int flags,
 	i = msg_pl->sg.end;
 	sk_msg_iter_var_prev(i);
 
+	/* msg_pl->sg.data is a ring; data[MAX+1] is reserved for the wrap
+	 * link (frags won't use it). 'i' is now the last filled entry:
+	 *
+	 *        i    end              start      [ rsv ]
+	 *        v     v                 v           v
+	 *  [ d ][ d ][   ][   ]...[   ][ d ][ d ][ d ][chain]
+	 *        ^ END                                   v
+	 *   `--------------------------------------------'
+	 *
+	 * Note that SGL does not allow chain-after-chain, so for TLS 1.3,
+	 * we must make sure we don't create the wrap entry and then chain
+	 * link to content_type immediately at index 0.
+	 */
+	if (i < msg_pl->sg.start)
+		sg_chain(msg_pl->sg.data, ARRAY_SIZE(msg_pl->sg.data),
+			 msg_pl->sg.data);
+
 	rec->content_type = record_type;
 	if (prot->version == TLS_1_3_VERSION) {
 		/* Add content type to end of message.  No padding added */
 		sg_set_buf(&rec->sg_content_type, &rec->content_type, 1);
 		sg_mark_end(&rec->sg_content_type);
-		sg_chain(msg_pl->sg.data, msg_pl->sg.end + 1,
-			 &rec->sg_content_type);
+		sg_chain(msg_pl->sg.data, i + 2, &rec->sg_content_type);
 	} else {
 		sg_mark_end(sk_msg_elem(msg_pl, i));
 	}
 
-	if (msg_pl->sg.end < msg_pl->sg.start)
-		sg_chain(msg_pl->sg.data, ARRAY_SIZE(msg_pl->sg.data),
-			 msg_pl->sg.data);
-
 	i = msg_pl->sg.start;
 	sg_chain(rec->sg_aead_in, 2, &msg_pl->sg.data[i]);
 
-- 
2.54.0

bpf_exec_tx_verdict() may return having modified the record and the
plaintext/encrypted sk_msg pointers. We must always reload those pointers
after calling bpf_exec_tx_verdict().
On the wait_for_memory path, after sk_stream_wait_memory() returns, the
post-wait code contains a shortcut:

	if (ctx->open_rec && msg_en->sg.size < required_size)
		goto alloc_encrypted;

which dereferences the cached msg_en. That pointer can equally point at a
freed record if the prior bpf_exec_tx_verdict() split the open record
before returning -ENOMEM.

Drop the shortcut; it seems to have only been an optimization to skip the
trivial intro section of the loop.

Reported-by: Sashiko
Fixes: 54a3ecaeeeae ("bpf: fix ktls panic with sockmap")
Signed-off-by: Jakub Kicinski
---
 net/tls/tls_sw.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 3bfdaf5e64f5..360f71fd7884 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1112,7 +1112,6 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 		if (!sk_stream_memory_free(sk))
 			goto wait_for_sndbuf;
 
-alloc_encrypted:
 		ret = tls_alloc_encrypted_msg(sk, required_size);
 		if (ret) {
 			if (ret != -ENOSPC)
@@ -1255,9 +1254,6 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 				tls_trim_both_msgs(sk, orig_size);
 				goto send_end;
 			}
-
-			if (ctx->open_rec && msg_en->sg.size < required_size)
-				goto alloc_encrypted;
 		}
 
 send_end:
-- 
2.54.0

As explained in commit 54a3ecaeeeae ("bpf: fix ktls panic with sockmap"),
once we call BPF there's no way for us to roll back the iter and copied
data, since BPF may have modified the message. This is regardless of
whether BPF set up a cork or not.

Remove the attempt to roll back the iter completely. This also removes a
UAF: BPF may have modified msg_pl and rec, so these pointers were stale.

Note that I'm entirely unsure what the expected behavior is here for BPF.
It feels like this path must not be exercised by normal applications /
existing deployments in the first place.
Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Signed-off-by: Jakub Kicinski
---
 net/tls/tls_sw.c | 12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 360f71fd7884..22b77840e35a 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -1164,11 +1164,8 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 			else if (ret == -ENOMEM)
 				goto wait_for_memory;
 			else if (ctx->open_rec && ret == -ENOSPC) {
-				if (msg_pl->cork_bytes) {
-					ret = 0;
-					goto send_end;
-				}
-				goto rollback_iter;
+				ret = 0;
+				goto send_end;
 			} else if (ret != -EAGAIN)
 				goto send_end;
 		}
@@ -1180,11 +1177,6 @@ static int tls_sw_sendmsg_locked(struct sock *sk, struct msghdr *msg,
 			}
 			continue;
 
-rollback_iter:
-			copied -= try_to_copy;
-			sk_msg_sg_copy_clear(msg_pl, first);
-			iov_iter_revert(&msg->msg_iter,
-					msg_pl->sg.size - orig_size);
 fallback_to_reg_send:
 			sk_msg_trim(sk, msg_pl, orig_size);
 		}
-- 
2.54.0