btrfs_dev_replace_finishing() can return early, before the source and target devices are swapped, if btrfs_start_delalloc_roots() or btrfs_start_transaction() fails (e.g. with -ENOMEM): ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false); if (ret) { mutex_unlock(&dev_replace->lock_finishing_cancel_unmount); return ret; } ... trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) { mutex_unlock(&dev_replace->lock_finishing_cancel_unmount); return PTR_ERR(trans); } These legs leave dev_replace->replace_state as STARTED. The scrub that drives the replace has already exited (btrfs_scrub_dev() has returned), and the caller releases the exclusive operation unconditionally: both btrfs_dev_replace_kthread() and the BTRFS_IOCTL_DEV_REPLACE_CMD_START ioctl call btrfs_exclop_finish() regardless of the return value. The result is a replace stuck in STARTED with no running scrub, which cannot be canceled. btrfs_dev_replace_cancel() takes its STARTED case and calls btrfs_scrub_cancel(), which returns -ENOTCONN because no scrub is running; cancel then reports BTRFS_IOCTL_DEV_REPLACE_RESULT_NOT_STARTED and tears nothing down. The replace stays STARTED -- writes keep being duplicated to the target -- until the filesystem is unmounted; only then does btrfs_dev_replace_suspend_for_unmount() move it to SUSPENDED so the next mount can resume it. Demote the replace from STARTED to SUSPENDED on these early-return legs. SUSPENDED is exactly the state the rest of the code expects for "no scrub running, resumable": btrfs_dev_replace_cancel()'s SUSPENDED case cleanly tears the target down, and a resume on the next mount restarts the copy. This does not change that the exclusive operation is released while the replace is only suspended; tightening the exclop lifetime so a balance cannot start against an in-mount suspended replace is a separate change. Fixes: e93c89c1aaaa ("Btrfs: add new sources for device replace code") Cc: stable@vger.kernel.org Signed-off-by: Christian Brauner (Amutable) --- fs/btrfs/dev-replace.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index e5090fb7fc11..13badbdee06b 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -861,6 +861,21 @@ static void btrfs_dev_replace_update_device_in_mapping_tree( write_unlock(&fs_info->mapping_tree_lock); } +/* Demote a STARTED replace to SUSPENDED on an early finishing failure. */ +static void btrfs_dev_replace_set_suspended(struct btrfs_fs_info *fs_info) +{ + struct btrfs_dev_replace *dev_replace = &fs_info->dev_replace; + + down_write(&dev_replace->rwsem); + if (dev_replace->replace_state == BTRFS_IOCTL_DEV_REPLACE_STATE_STARTED) { + dev_replace->replace_state = + BTRFS_IOCTL_DEV_REPLACE_STATE_SUSPENDED; + dev_replace->time_stopped = ktime_get_real_seconds(); + dev_replace->item_needs_writeback = 1; + } + up_write(&dev_replace->rwsem); +} + static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, int scrub_ret) { @@ -895,6 +910,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, */ ret = btrfs_start_delalloc_roots(fs_info, LONG_MAX, false); if (ret) { + btrfs_dev_replace_set_suspended(fs_info); mutex_unlock(&dev_replace->lock_finishing_cancel_unmount); return ret; } @@ -908,6 +924,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, while (1) { trans = btrfs_start_transaction(root, 0); if (IS_ERR(trans)) { + btrfs_dev_replace_set_suspended(fs_info); mutex_unlock(&dev_replace->lock_finishing_cancel_unmount); return PTR_ERR(trans); } -- 2.47.3