
Issue 6032 - Replication is broken after backup restore #6035

Merged
progier389 merged 1 commit into 389ds:main from progier389:i6031 on Jan 19, 2024

Conversation

@progier389 (Contributor) commented Jan 17, 2024

Replication is broken after doing an offline backup and then, later on, an online or offline restore.
Note: with an online backup, the changelog is discarded at restore time (because it has no purge RUV).
In fact there are multiple causes:
[1] _cl5CICbInit builds the changelog RUVs incorrectly, so the changelog is recreated.
[2] The changelog is not cleared when it is recreated, because of a wrong test in dbmdb_back_ctrl.
[3] The replication keep-alive entry gets created before the replica is back in sync, which creates missing CSNs.
Solution:
[1] Fix _cl5CICbInit to get the CSN from the changelog record key and store the min and max properly in the context.
[2] Replace the invalid test with a proper one.
[3] Change the keep-alive update starting delay from 2 seconds to 10 minutes (i.e. twice the maximum backoff timeout), to give the other suppliers a chance to replay the missing changes (see the sketch after this description).
Also added/modified some extra data in the logs when replication logging is enabled.
Note: this is a partial fix, as a proper "resync after db reload" is not handled, so some issues remain (typically because of plugin internal operations, like the memberof plugin, or when there are lots of changes to replay), but it is at least enough for the CI test.

Issue: #6032

Reviewed by: @droideck, @tbordaz (Thanks!)
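
To illustrate item [3], here is a minimal sketch of the changed delay, assuming PROTOCOL_BACKOFF_MAXIMUM is expressed in seconds (the constant name appears in the review below; everything else here is illustrative, not the actual patch):

/* Sketch only: the keep-alive entry needs a regular refresh, not a
 * prompt one, so the first update is delayed by twice the maximum
 * backoff timeout (2 * 300 s = 600 s = 10 minutes) instead of 2 seconds,
 * giving other suppliers time to replay missing changes first. */
#define KEEPALIVE_START_DELAY (2 * PROTOCOL_BACKOFF_MAXIMUM)  /* was: 2 seconds */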

@droideck (Member) left a comment:

Looks good!
The 2*PROTOCOL_BACKOFF_MAXIMUM part feels a bit rough, but IIUC you said you are working on a fix that will deal with the "resync after restore", and this will be reverted later.
And anyway, 2*PROTOCOL_BACKOFF_MAXIMUM is still more stable than nothing :)

Ack from me.

@tbordaz (Contributor) left a comment:

The fixes are looking so good... that I wonder how it has been working before ;)

@@ -2589,7 +2589,7 @@ _cl5CICbInit(dbi_val_t *key, dbi_val_t *data, DBLCI_CTX *dblcictx)
         return DBI_RC_SUCCESS;
     }
     /* Update last csn */
-    csn_init_by_string(&dblcictx->csn, data->data);
+    csn_init_by_string(&dblcictx->csn, key->data);
tbordaz (Contributor) commented:
I understand the fix, but I cannot understand how it has been working before.
_cl5Iterate is used for trimming, cleanRUV, .... We know it works. But IIUC it was setting the CL iterator with a crappy csn (data instead of key). It looks like it worked for accessing the first CL entry, but on the next entries 'startcsn' in _cl5Iterate should have been completely random!
I am missing something here.

Anyway, this fix is valid.
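
(For context on the question above, a minimal sketch of the assumed record layout; the comment block is illustrative, and only the final call comes from the diff:)

/* Assumed changelog record layout:
 *   key->data  -> "<csn string>"   (a CSN, parseable by csn_init_by_string)
 *   data->data -> serialized change (not a CSN string at all)
 * Parsing data->data as a CSN therefore yields an essentially random
 * value, which is why the fix reads the CSN from the record key: */
csn_init_by_string(&dblcictx->csn, key->data);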

progier389 (Author) replied:

Yes, it has been broken since 5a75518 in Mar 2023
(Issue 5661 - LMDB hangs while Rebuilding the replication changelog RUV (#5676)).
So it is quite recent (2.4, I think), and this code is only called in case of a disorderly shutdown, a restore, or an import
(i.e. when the purge RUV and upperbound RUV are not found in the changelog),
so it is not that frequent, and obviously the CI tests were not testing it until now ...

tbordaz (Contributor) replied:

Ah!! Okay, I thought it was an old (always existing) bug :)
Thanks for the detailed explanation.

     }
-    ridinfo->maxcsn = csn;
+    ridinfo->maxcsn = dblcictx->csn;
tbordaz (Contributor) commented:

Sorry, same comment again. The fix is valid, but how did it work before?
The csn was NULL and not modified. Does that mean _cl5GenRUVInfo generated ridinfo with NULL min/max csn?

progier389 (Author) replied:

Exactly.
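
(A hypothetical sketch of the corrected bookkeeping: records are iterated in key order, so the CSN parsed from each key can seed the per-replica boundaries. Only the maxcsn assignment comes from the diff above; the mincsn handling and the first_record flag are assumptions:)

/* Hypothetical sketch, not the actual patch: */
if (ridinfo->first_record) {
    ridinfo->mincsn = dblcictx->csn;   /* oldest CSN seen for this replica id */
    ridinfo->first_record = PR_FALSE;
}
ridinfo->maxcsn = dblcictx->csn;       /* newest CSN seen so far */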

@progier389 merged commit 9e595d4 into 389ds:main on Jan 19, 2024
193 of 195 checks passed
progier389 added a commit that referenced this pull request Jan 19, 2024
(cherry picked from commit 9e595d4; same commit message as the PR description above)
@tbordaz (Contributor) left a comment:

Minor comment. ACK

if get_default_db_lib() == "mdb":
    os.remove(f'{S1.ds_paths.db_dir}/data.mdb')      # LMDB keeps the whole environment in one file
else:
    shutil.rmtree(f'{S1.ds_paths.db_dir}/userRoot')  # BDB keeps one directory per backend
tbordaz (Contributor) commented:

A backup should restore all backends, so why remove only userRoot?
Also, the removal of the database is not described in the doc. If it is necessary, should the doc describe it?

progier389 (Author) replied:

In production the removal is not needed: the restore task is supposed to delete the whole db (so the doc is OK). But in this test case I wanted to be absolutely sure that the db gets destroyed, to verify that the restore task restores it properly. The test uses only a single backend (i.e. userRoot), and in Python it is easier to remove it than to remove everything in the db directory.

@progier389 (Author) commented:

> Looks good! The 2*PROTOCOL_BACKOFF_MAXIMUM part feels a bit rough, but IIUC you said you are working on a fix that will deal with the "resync after restore", and this will be reverted later. And anyway, 2*PROTOCOL_BACKOFF_MAXIMUM is still more stable than nothing :)

FYI: initially I had the same feeling when I decided to change the starting timeout, but I realized that waiting 10 minutes is not that long for a timer that fires once per hour! The keep-alive entry just needs to be updated regularly, to limit the number of changes to look at before finding the first change to replay, but precise timing is not required.
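
(For scale, per the numbers above: the start delay is 2 * 300 s = 600 s = 10 minutes, while the refresh period is 3600 s, so only the first of the hourly keep-alive updates is shifted.)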

@progier389 deleted the i6031 branch on March 28, 2024
Successfully merging this pull request may close these issues:

Replication broken after backup restore with freeipa configuration