
Port scripts to strict POSIX implementations #41

Merged: 23 commits merged into main on Sep 1, 2024

Conversation

@aaronhurt
Owner

There is a new script in this PR, zfs-replicate-posix.sh, that adds POSIX shell support. This script should function under a standard Bourne shell (sh) and/or strict POSIX shells like Dash. It is currently considered experimental. Feedback is welcome.
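
For reference, the basic invocation used in testing later in this thread looks like this (the dataset names are just examples):

    LOG_BASE=./logs REPLICATE_SETS=zstore/temp:zstore/temp2 ./zfs-replicate-posix.sh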

@aaronhurt mentioned this pull request Aug 29, 2024
@aaronhurt changed the title from "Add Experimental POSIX Support" to "Add Experimental POSIX Support Fixes #40" Aug 29, 2024
@tschettervictor
Collaborator

tschettervictor commented Aug 30, 2024

The script stops at these two lines (257-258):

    srcSnaps=$(snapList "$src" "$srcHost" 1)
    dstSnaps=$(snapList "$dst" "$dstHost" 0)

When commenting them out, it proceeds, but it stops here.

EDIT: by “proceeds” I mean it moves further. Yes the process fails entirely without them.

@aaronhurt
Owner Author

aaronhurt commented Aug 30, 2024

> The script stops at these two lines (257-258):
>
>     srcSnaps=$(snapList "$src" "$srcHost" 1)
>     dstSnaps=$(snapList "$dst" "$dstHost" 0)
>
> When commenting them out, it proceeds, but it stops here.

It won't work at all if those are commented out. Hrm, it works here; maybe FreeBSD sh handles something differently than dash. I'll check it on my FreeBSD host.

EDIT: You can help debug by editing the first line of the script from set -e to set -ex and you'll get verbose debugging output.
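
With that change, the top of the script would read roughly like this (the surrounding lines match the diff quoted later in this thread):

    #!/usr/bin/env sh
    ## zfs-replicate.sh
    set -ex  ## -x traces every command as it runs; drop the x when done debugging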

@tschettervictor
Collaborator

> EDIT: You can help debug by editing the first line of the script from set -e to set -ex and you'll get verbose debugging output.

Yes, I'm testing that way. The function does run: the trace shows srcSnaps=' ' as the last line of output.

@aaronhurt
Owner Author

This is working from my FreeBSD host with a basic test ...

[ahurt@schroeder ~/zfs-replicate]$ uname -a
FreeBSD schroeder.anbcs.com 14.1-RELEASE-p3 FreeBSD 14.1-RELEASE-p3 GENERIC amd64
[ahurt@schroeder ~/zfs-replicate]$ doas /bin/sh -c "LOG_BASE=./logs REPLICATE_SETS=zstore/temp:zstore/temp2 ./zfs-replicate-posix.sh"
Aug 29 20:15:51 zfs-replicate-posix.sh[48582]: loading configuration from defaults and environmental settings.
Aug 29 20:15:51 zfs-replicate-posix.sh[48582]: creating lockfile /tmp/.replicate.snapshot.lock
Aug 29 20:15:51 zfs-replicate-posix.sh[48582]: checking dataset cmd=/sbin/zfs list -H -o name zstore/temp
Aug 29 20:15:51 zfs-replicate-posix.sh[48582]: zstore/temp
Aug 29 20:15:51 zfs-replicate-posix.sh[48582]: checking dataset cmd=/sbin/zfs list -H -o name zstore/temp2
Aug 29 20:15:52 zfs-replicate-posix.sh[48582]: zstore/temp2
Aug 29 20:15:52 zfs-replicate-posix.sh[48582]: found old snapshot zstore/temp@autorep-08292024_1724980392
Aug 29 20:15:52 zfs-replicate-posix.sh[48582]: destroying snapshot cmd=/sbin/zfs destroy zstore/temp@autorep-08292024_1724980392
Aug 29 20:15:52 zfs-replicate-posix.sh[48582]: creating snapshot cmd=/sbin/zfs snapshot zstore/temp@autorep-08292024_1724980551
Aug 29 20:15:53 zfs-replicate-posix.sh[48582]: creating lockfile /tmp/.replicate.send.lock
Aug 29 20:15:53 zfs-replicate-posix.sh[48582]: sending snapshot cmd=/sbin/zfs send -Rs -I zstore/temp@autorep-08292024_1724980442 zstore/temp@autorep-08292024_1724980551 | /sbin/zfs receive -vFd zstore/temp2
Aug 29 20:15:53 zfs-replicate-posix.sh[48582]: WARNING: skipping dataset zstore/temp/foo and its children: snapshot autorep-08292024_1724980551 does not exist
Aug 29 20:15:53 zfs-replicate-posix.sh[48582]: WARNING: skipping dataset zstore/temp/hello and its children: snapshot autorep-08292024_1724980551 does not exist
Aug 29 20:15:55 zfs-replicate-posix.sh[48582]: attempting destroy zstore/temp2/temp@autorep-08292024_1724980392
Aug 29 20:15:55 zfs-replicate-posix.sh[48582]: success
Aug 29 20:15:55 zfs-replicate-posix.sh[48582]: receiving incremental stream of zstore/temp@autorep-08292024_1724980551 into zstore/temp2/temp@autorep-08292024_1724980551
Aug 29 20:15:57 zfs-replicate-posix.sh[48582]: received 312B stream in 2.66 seconds (117B/sec)
Aug 29 20:15:57 zfs-replicate-posix.sh[48582]: deleting lockfile /tmp/.replicate.send.lock
Aug 29 20:15:57 zfs-replicate-posix.sh[48582]: deleting lockfile /tmp/.replicate.snapshot.lock
Aug 29 20:15:57 zfs-replicate-posix.sh[48582]: pruning logs ./logs/autorep-08292024_1724979945.log
Aug 29 20:15:57 zfs-replicate-posix.sh[48582]: success total sets 1 skipped 0

@tschettervictor
Collaborator

> This is working from my FreeBSD host with a basic test ...

Did you remove all snapshots from src and dst?

@aaronhurt
Owner Author

> Did you remove all snapshots from src and dst?

No, I didn't, but thinking about it, that might make sense. In a strict POSIX shell that could be an error: no snapshot count feeding the next few commands.

@yonas left a comment

@aaronhurt This commit isn't signed. Can you please sign before merging?

@yonas left a comment

zfs-replicate-posix.sh ran without error, but there's nothing in the destination dataset.

$ cat config.sh
REPLICATE_SETS="zroot/test:zroot/fun@localhost"

@aaronhurt
Owner Author

aaronhurt commented Aug 30, 2024

Fixed the issue with no snaps on either side. It was a counting issue: the POSIX way of using wc -l returns 1 for an empty string that's newline terminated. The grep also needs || true appended, or it will exit 1 when the string doesn't match.
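
A minimal sketch of both patterns (the zfs listing here is illustrative, not the exact commit):

    #!/usr/bin/env sh
    set -e
    ## grep exits 1 when nothing matches, which would kill the script under
    ## set -e, so append || true and let an empty string mean "no snapshots"
    snaps=$(zfs list -H -o name -t snapshot 2> /dev/null | grep '@autorep-' || true)
    ## echo "" | wc -l prints 1 because of the trailing newline, so only
    ## count lines when the string is non-empty
    if [ -n "$snaps" ]; then
      count=$(printf '%s\n' "$snaps" | wc -l)
    else
      count=0
    fi
    echo "found $count snapshots"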

@tschettervictor
Collaborator

> zfs-replicate-posix.sh ran without error, but there's nothing in the destination dataset.
>
>     $ cat config.sh
>     REPLICATE_SETS="zroot/test:zroot/fun@localhost"

When replicating locally you don't really need the @localhost
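
For example (same datasets as above, with the host suffix dropped):

    REPLICATE_SETS="zroot/test:zroot/fun"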

@yonas

yonas commented Aug 30, 2024

I recommend this small change:

f302e68 - (HEAD -> pr-41) Update example config script to use sh. (2 minutes ago) <Yonas> [email protected] SHA256:5ZC2NLyr/ExZaQKf7DPl2f6V65LX8roqeB01qtT4izE

diff --git a/config.sample.sh b/config.sample.sh
index afadb6d..a6342bd 100644
--- a/config.sample.sh
+++ b/config.sample.sh
@@ -1,4 +1,5 @@
-#!/usr/bin/env bash
+#!/usr/bin/sh
+
 ## zfs-replicate configuration file
 # shellcheck disable=SC2034

@yonas

yonas commented Aug 30, 2024

> When replicating locally you don't really need the @localhost

Thanks 👍

@aaronhurt
Owner Author

> I recommend this small change: ...

Yes, that should be changed if/when the Bash script is completely replaced with the POSIX version. Although it probably doesn't hurt now, since it's just an ENV file.

@yonas

yonas commented Aug 30, 2024

> Fixed the issue with no snaps on either side. It was a counting issue. ...

Replication works for me now 🎉

@aaronhurt
Owner Author

aaronhurt commented Aug 30, 2024

It should still be #!/usr/bin/env sh though. The env program is at /usr/bin/env 99.999% of the time, whereas sh can be in /bin or /usr/bin depending on the system.
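
A sketch of the portable form:

    #!/usr/bin/env sh
    ## env sits at the fixed path /usr/bin/env and resolves sh via PATH,
    ## so this works whether sh lives in /bin or /usr/bin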

@aaronhurt
Owner Author

If @tschettervictor is happy I'll work on re-working the test cases and I guess replace the bash version with this POSIX version on merge.

@yonas

yonas commented Aug 30, 2024

> It should still be #!/usr/bin/env sh though. ...

Yes, I agree. I just checked Ubuntu, and it was in /usr/bin.

@aaronhurt
Owner Author

Ubuntu sh is Dash, I believe, and FreeBSD sh is a POSIX Bourne shell.

@aaronhurt
Owner Author

I'm going to call it a night, but feel free to leave comments and I'll take a look tomorrow.

@tschettervictor
Collaborator

I will test with all combos and report back.

@tschettervictor
Collaborator

tschettervictor commented Aug 30, 2024

So:
Local replication seems to work perfectly with RECURSE_CHILDREN=0/1 and ALLOW_RECONCILIATION=0/1, with both nested and un-nested datasets.
Remote replication also seems to work fine with the above combinations.

However, I ran into one problem. The way I'm running this test is with REPLICATE_SETS="tank/local/src:tank/local/dst tank/src/1/2:rmttank/dst/1/[email protected] rmttank/src/1/[email protected]:tank/dst/1/2", so I'm doing a local, a remote push, and a remote pull in the same run.

For some reason it is kind of messing up the third set. I believe it might be because we are doing a pull and a push in the same run, but it still needs to be fixed. When doing a pull separate from any other sets, it does work.

Also, when the above happens it doesn't error out with a fail (probably because it failed to list the dataset and never made it to the skip string). There have been other cases where the script also did not error out, like the earlier issue we had, but that is probably a POSIX thing. It would be nice, though, to add some more error checking. With the above example the log showed that no datasets were skipped, even though the third set failed at listing the dataset.

@tschettervictor
Collaborator

tschettervictor commented Aug 30, 2024

Looks like clearing the variable right before doing the "contains" function solves the issue. See my commit; a rough sketch of the idea is below.
I'll do some more testing tomorrow evening.
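
A rough sketch (variable names are assumptions; see the actual commit for the real change):

    for pair in $REPLICATE_SETS; do
      ## clear per-set host state so a pull set cannot inherit the host
      ## parsed out of the previous push set
      srcHost='' dstHost=''
      src=${pair%%:*} dst=${pair#*:}
      case $src in *@*) srcHost=${src##*@} src=${src%@*} ;; esac
      case $dst in *@*) dstHost=${dst##*@} dst=${dst%@*} ;; esac
      echo "src=$src srcHost=$srcHost dst=$dst dstHost=$dstHost"
    done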

EDIT

I've run all the tests I usually do, and all passed. Just for reference, here they are.
My REPLICATE_SETS include one local set, one remote push set, and one remote pull set.

  • Run 3-4 times then delete destination snaps, then run again. This works fine.
  • Run 3-4 times then delete source snaps, then run again. This should fail, unless ALLOW_RECONCILIATION=1
  • Do the above with RECURSE_CHILDREN=0, then do it with RECURSE_CHILDREN=1
  • Also do the above with ALLOW_RECONCILIATION=0 to make sure it fails when source snaps are missing, and passes even if they are missing and ALLOW_RECONCILIATION=1

I have not tested root datasets, as I don't use that at all, and wouldn't recommend it.

As of right now, with the below commit about resetting the host right before checking the datasets, I'm pretty happy to accept this PR.

@yonas

yonas commented Aug 30, 2024

@tschettervictor Very nice.

I have just one last suggestion, since the script was updated to sh:

diff --git a/zfs-replicate-posix.sh b/zfs-replicate-posix.sh
index 9c040db..6693e10 100755
--- a/zfs-replicate-posix.sh
+++ b/zfs-replicate-posix.sh
@@ -1,6 +1,12 @@
 #!/usr/bin/env sh
+
 ## zfs-replicate.sh
 set -e
+if [ -n "$(which freebsd-version)" -a -n "$(which bc)" ]; then
+  if [ 1 -eq $(echo $(freebsd-version | cut -d'-' -f1)" >= 11.3" | bc) ]; then
+    set -o pipefail
+  fi
+fi
 
 ############################################
 ##### warning gremlins live below here #####

@yonas left a comment

Approved.

Last suggestion is adding -o pipefail:

#41 (comment)

@aaronhurt
Owner Author

Thanks @yonas, it's nice to see that pipefail is supported in FreeBSD Bourne shell now, but I'm sure it's not everywhere. I added a conditional check that tests for set -o pipefail support and enables it when available. I also added -u to fail on any unbound variables.
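
The check looks roughly like this (a sketch, not the exact commit):

    #!/usr/bin/env sh
    ## zfs-replicate.sh
    set -eu
    ## probe for pipefail in a subshell; enable it only when the running
    ## shell actually supports the option
    if (set -o pipefail 2> /dev/null); then
      set -o pipefail
    fi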

@tschettervictor
Collaborator

Do we want to test some more? From my end it looks good.

@tschettervictor (Collaborator) left a comment

I’ve tested all combinations using the new POSIX script. All have passed.

@aaronhurt
Owner Author

aaronhurt commented Aug 31, 2024

Done, and the test case covers just that example now.

  ## test config override
  (
    ## likely default values at script load time
    ZFS="/sbin/zfs"
    SSH="/usr/sbin/ssh"
    ## source script functions
    # shellcheck source=/dev/null
    . ../zfs-replicate.sh
    printf "_testSetsNoConfig/loadConfigOverrideDefaults\n"
    _fail "/usr/sbin/ssh %HOST% /sbin/zfs receive -vFd" "$DEST_PIPE_WITH_HOST"
    _fail "/sbin/zfs receive -vFd" "$DEST_PIPE_WITHOUT_HOST"
    ## generate config
    config="$(mktemp)"
    printf "ZFS=\"myZFS\"\n" >> "$config"
    ## set SSH via environment
    SSH="mySSH"
    loadConfig "$config" && rm -f "$config"
    ## values should match config and environment
    _fail "mySSH %HOST% myZFS receive -vFd" "$DEST_PIPE_WITH_HOST"
    _fail "myZFS receive -vFd" "$DEST_PIPE_WITHOUT_HOST"
  )

@tschettervictor
Collaborator

Initial run passed. Testing commencing. Stand by…

@tschettervictor
Collaborator

tschettervictor commented Aug 31, 2024

Testing completed. Just for reference, here are my test runs again.

My REPLICATE_SETS include one local set, one remote push set, and one remote pull set.

  • Run 3-4 times then delete destination snaps, then run again. This should pass.
  • Run 3-4 times then delete source snaps, then run again. This should fail, unless ALLOW_RECONCILIATION=1
  • Do the above with RECURSE_CHILDREN=0, then do it with RECURSE_CHILDREN=1
  • Also do the above with ALLOW_RECONCILIATION=0 to make sure it fails when source snaps are missing, and passes even if they are missing and ALLOW_RECONCILIATION=1
  • ALLOW_ROOT_DATASET=0 works. I have not checked if =1 works as I don't want to replicate my entire pool...

Looks good from my end.

@tschettervictor
Collaborator

I do notice, though (and this might be a good thing), that when the script encounters an error, it will exit even if it successfully replicated the first 2 pairs, so the end message is ERROR: etc...

But this has only happened when a bug has popped up. Otherwise it will show the "skipped" part at the end.

@aaronhurt
Owner Author

> I do notice, though (and this might be a good thing), that when the script encounters an error, it will exit even if it successfully replicated the first 2 pairs, so the end message is ERROR: etc...
>
> But this has only happened when a bug has popped up. Otherwise it will show the "skipped" part at the end.

Definitely document those in issues if you can reproduce them. They’ll make great test cases to add to further improve the script.

@tschettervictor
Collaborator

> Definitely document those in issues if you can reproduce them. They'll make great test cases to add to further improve the script.

#43

@tschettervictor
Collaborator

tschettervictor commented Aug 31, 2024

I’m good to merge. Test have all passed.

I’ll try some more intentional errors.

@aaronhurt
Owner Author

I’m good to merge. Test have all passed.

I’ll try some more intentional errors.

Sounds good, let me know if you find anything. Otherwise I'll merge this tonight or tomorrow morning, but I want to give it a little time to bake.

@tschettervictor
Collaborator

tschettervictor commented Aug 31, 2024

But: when it skips a dataset that failed the "checkDataset" function, it should stop processing that dataset; otherwise it tries to delete it.

The reason is the snapDestroy function. It tries to delete the snapshot, but since none is found, there is no @autorep- suffix. So it tries to delete the dataset entirely.

Aug 31 09:17:11 zfs-replicate.sh[78751]: cannot open 'tank/dst/1/src/1/2/3': dataset does not exist
Aug 31 09:17:11 zfs-replicate.sh[78751]: attempting destroy tank/dst/1/src/1/2
Aug 31 09:17:11 zfs-replicate.sh[78751]: failed - trying rename tank/dst/1/src/1/2 to tank/dst/1/src/1/recv-78927-1

It does fail though…

An if/for/continue would probably solve it inside the createSnap function, right where it does the checkDataset; see the sketch below.
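
A rough sketch of that guard (checkDataset and snapDestroy are names from the thread; the loop structure and variables are assumptions):

    for pair in $REPLICATE_SETS; do
      dst=${pair#*:}
      ## if the dataset check fails, skip the whole set instead of falling
      ## through to snapDestroy with no @autorep- snapshot to match
      if ! checkDataset "$dst" "$dstHost"; then
        echo "WARNING: skipping replication set '$pair' - dataset check failed"
        continue
      fi
      ## ...snapshot creation and send would continue here...
    done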

@aaronhurt
Owner Author

aaronhurt commented Aug 31, 2024

> But: when it skips a dataset that failed the "checkDataset" function, it should stop processing that dataset; otherwise it tries to delete it. ...

I'm not seeing this same behavior for the host check failure. Let me try to simulate a dataset failure.

_testZFSReplicate/snapCreateWithHostCheckErrors
0 creating lockfile /var/folders/pg/d8xk8vw5235172fkh8vk98g40000gn/T//.replicate.snapshot.lock
1 checking dataset cmd=./zfs.sh list -H -o name srcPool0/srcFS0
2 srcPool0/srcFS0@autorep-test1
3 srcPool0/srcFS0@autorep-test2
4 srcPool0/srcFS0@autorep-test3
5 checking dataset cmd=./zfs.sh list -H -o name dstPool0/dstFS0
6 dstPool0/dstFS0@autorep-test1
7 dstPool0/dstFS0@autorep-test2
8 dstPool0/dstFS0@autorep-test3
9 found old snapshot srcPool0/srcFS0@autorep-test1
10 destroying snapshot cmd=./zfs.sh destroy srcPool0/srcFS0@autorep-test1
11 creating snapshot cmd=./zfs.sh snapshot srcPool0/srcFS0@autorep-08312024_1725121827
12 creating lockfile /var/folders/pg/d8xk8vw5235172fkh8vk98g40000gn/T//.replicate.send.lock
13 sending snapshot cmd=./zfs.sh send -Rs -I srcPool0/srcFS0@autorep-test3 srcPool0/srcFS0@autorep-08312024_1725121827 | ./zfs.sh receive -vFd dstPool0/dstFS0
14 receive -vFd dstPool0/dstFS0
15 deleting lockfile /var/folders/pg/d8xk8vw5235172fkh8vk98g40000gn/T//.replicate.send.lock
16 checking host cmd=false
17 WARNING: skipping replication set 'srcPool1/srcFS1/subFS1:dstPool1/dstFS1@dstHost1' - source or destination host check failed
18 checking host cmd=false
19 WARNING: skipping replication set 'srcPool2/srcFS2:dstPool2/dstFS2@dstHost2' - source or destination host check failed
20 checking host cmd=false
21 WARNING: skipping replication set 'srcPool3/srcFS3@srcHost3:dstPool3/dstFS3' - source or destination host check failed
22 checking host cmd=false
23 WARNING: skipping replication set 'srcPool4/srcFS4@srcHost4:dstPool4/dstFS4@dstHost4' - source or destination host check failed
24 deleting lockfile /var/folders/pg/d8xk8vw5235172fkh8vk98g40000gn/T//.replicate.snapshot.lock

@tschettervictor
Collaborator

tschettervictor commented Aug 31, 2024

> I'm not seeing this same behavior for the host check failure. Let me try to simulate a dataset failure. ...

EDIT: My bad. When I copied the new config.sample.sh file I forgot to reset the RECURSE_CHILDREN variable, so the original "recursive" snapshots were already present on both sides. I then did a replication with RECURSE_CHILDREN=0, which is why it tried to remove the datasets.

@aaronhurt
Owner Author

aaronhurt commented Aug 31, 2024

> The dataset it fails at is the "destination with remote host" so the right side of the ":"

It shouldn't matter, ssh should return non-zero if the remote command returns non-zero.

[ahurt@schroeder ~/zfs-replicate]$ zfs list -H -o name fake/fake; echo $?
cannot open 'fake/fake': dataset does not exist
1
[ahurt@charlie ~]$ ssh schroeder zfs list -H -o name fake/fake; echo $?
cannot open 'fake/fake': dataset does not exist
1

@tschettervictor
Collaborator

tschettervictor commented Aug 31, 2024

> It shouldn't matter, ssh should return non-zero if the remote command returns non-zero. ...

See above: user error.
But still, would there be a way to account for an error like that?
The error was that I was replicating recursively, then switched to non-recursive with the recursive stuff already present.

@aaronhurt
Owner Author

aaronhurt commented Aug 31, 2024

> See above: user error. But still, would there be a way to account for an error like that?

Let me see if I can mock up a test, and then handle it.

@aaronhurt
Owner Author

I wasn't able to reproduce the same behavior. Would you happen to have a longer log showing all the steps that happened when it tried to remove a dataset?

@aaronhurt
Owner Author

aaronhurt commented Aug 31, 2024

Looking at the main scripts, zfs-replicate.sh is 560 lines and test/test.sh is 384 lines. We're almost there 🤣

All jokes aside though, that's why it's good to let these things bake for a while. Things you didn't see at first come to light after you stop looking.

@tschettervictor
Collaborator

tschettervictor commented Aug 31, 2024

Here is the log. I first did a replication with RECURSE_CHILDREN=1, then tried doing the exact same replication with RECURSE_CHILDREN=0.

Aug 31 12:12:52 zfs-replicate.sh[80977]: sending snapshot cmd=zfs send -Rs -I tank/src/1@autorep-08312024_1725127961 tank/src/1@autorep-08312024_1725127971 | zfs receive -vFd tank/dst/1
Aug 31 12:12:52 zfs-replicate.sh[80977]: WARNING: skipping dataset tank/src/1/2 and its children: snapshot autorep-08312024_1725127971 does not exist
Aug 31 12:12:54 zfs-replicate.sh[80977]: cannot open 'tank/dst/1/src/1/2/3@autorep-08312024_1725127957': dataset does not exist
Aug 31 12:12:54 zfs-replicate.sh[80977]: cannot open 'tank/dst/1/src/1/2/3@autorep-08312024_1725127961': dataset does not exist
Aug 31 12:12:54 zfs-replicate.sh[80977]: cannot open 'tank/dst/1/src/1/2/3': dataset does not exist
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1@autorep-08312024_1725127957
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/2@autorep-08312024_1725127957
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/2@autorep-08312024_1725127961
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/2
Aug 31 12:12:56 zfs-replicate.sh[80977]: failed - trying rename tank/dst/1/src/1/2 to tank/dst/1/src/1/recv-81180-1
Aug 31 12:12:56 zfs-replicate.sh[80977]: another pass:
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/recv-81180-1
Aug 31 12:12:56 zfs-replicate.sh[80977]: failed (0)
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/recv-81180-1/3@autorep-08312024_1725127957
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/recv-81180-1/3@autorep-08312024_1725127961
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/recv-81180-1/3
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: another pass:
Aug 31 12:12:56 zfs-replicate.sh[80977]: attempting destroy tank/dst/1/src/1/recv-81180-1
Aug 31 12:12:56 zfs-replicate.sh[80977]: success
Aug 31 12:12:56 zfs-replicate.sh[80977]: receiving incremental stream of tank/src/1@autorep-08312024_1725127971 into tank/dst/1/src/1@autorep-08312024_1725127971
Aug 31 12:12:57 zfs-replicate.sh[80977]: received 312B stream in 1 seconds (312B/sec)
Aug 31 12:12:57 zfs-replicate.sh[80977]: deleting lockfile /tmp/.replicate.send.lock
Aug 31 12:12:57 zfs-replicate.sh[80977]: deleting lockfile /tmp/.replicate.snapshot.lock
Aug 31 12:12:57 zfs-replicate.sh[80977]: pruning 2 logs
Aug 31 12:12:57 zfs-replicate.sh[80977]: /mnt/tank/logs/autorep-08312024_1725117429.log
Aug 31 12:12:57 zfs-replicate.sh[80977]: /mnt/tank/logs/autorep-08312024_1725110220.log
Aug 31 12:12:57 zfs-replicate.sh[80977]: success total sets 1 skipped 0

You should really never go from recursive to non-recursive, but this is what happens: it tries to delete all the leftover snapshots from the recursive replication, and it succeeds, but it also tries to rename and destroy existing datasets. Notice the rename to "recv-xxxx", then the destroy of that same dataset.

EDIT: I don't think it actually deletes anything of value. It just removes the ones that aren't supposed to be recursive on the destination, so this might not even be a bug.

@aaronhurt
Owner Author

aaronhurt commented Aug 31, 2024

Hrm, yeah, that's definitely odd, but I see what it's trying to do. That's all zfs itself trying to reconcile state for the replication stream; the script isn't sending those destroy commands. The string "attempting" doesn't exist in the script.

The part I find strange is that it logs a send and then immediately logs a skip. There should be other actions between those two pieces. The script is designed to run synchronously: that send and receive pipe should complete before the script tries to move on to another dataset.

That doesn't make much sense. I added some simulated delay in the test receive command, and it is indeed synchronous; the script blocks until the receive pipe completes.

@tschettervictor
Collaborator

Before it can complete the send, the destination has to match up. So it destroys the existing ones, then completes the send.

@tschettervictor
Collaborator

In any case I just switched back and forth between RECURSE_CHILDREN=1/0 and it seemed to work fine.

@tschettervictor
Collaborator

It skips because it is doing a non-recursive replication with nested datasets. It's seeing the datasets, but also seeing that it's not supposed to do a recursive send, so it skips the nested ones (which are included in the send anyway, except as directories, I suppose).

@aaronhurt merged commit 5a9cee7 into main Sep 1, 2024
6 checks passed
@aaronhurt deleted the ahurt/add-posix branch September 1, 2024 18:21