-
Notifications
You must be signed in to change notification settings - Fork 95
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Issue 6032 - Replication is broken after backup restore #6035
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
2*PROTOCOL_BACKOFF_MAXIMUM,
part feels a bit rough but, IIUC, you said you work on a fix that will deal with the "resync after restore" and this will be reverted later.
And anyway, the 2*PROTOCOL_BACKOFF_MAXIMUM,
is still more stable than nothing :)
Ack from me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fixes are looking so good... that I wonder how it has been working before ;)
@@ -2589,7 +2589,7 @@ _cl5CICbInit(dbi_val_t *key, dbi_val_t *data, DBLCI_CTX *dblcictx) | |||
return DBI_RC_SUCCESS; | |||
} | |||
/* Update last csn */ | |||
csn_init_by_string(&dblcictx->csn, data->data); | |||
csn_init_by_string(&dblcictx->csn, key->data); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that fix but I can not understand how it have been working before.
_cl5Iterate is used for Trimming, cleanRuV,.... We know it works. But IIUC it was setting the CL iterator with crappy csn (data instead of key). It looks it worked to access the first CL entry but on next entries, 'startcsn' in _cl5Iterate should have been completely random !
I am missing something here.
Anyway this fix is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is broken since 5a75518 on Mar 2023
Issue 5661 - LMDB hangs while Rebuilding the replication changelog RUV (#5676)
So it is quite recent, 2.4, I think and this code is only called in case of disorderly shutdown or restore or import
(i.e: When purge RUV and upperbound Ruv are not found in changelog)
so it is not that frequent, and obviously The CI tests were not testing it until now ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah !! Okay I thought it was a old (always existing) bug :)
Thanks for the detailed explanation
} | ||
ridinfo->maxcsn = csn; | ||
ridinfo->maxcsn = dblcictx->csn; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry again same comment. The fix is valid but how did it work before ?
The csn was NULL and not modified. Does that mean the _cl5GenRUVInfo generated ridinfo with NULL min/max csn ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Exactltly
Replication is broken after doing an offline backup then later on an online or offline restore Note: with online backup changelog is discarded at restore time (because it has no purge RUV) In fact there are multiple cause: [1] _cl5CICbInit is building wrongly the changelog RUVs so changelog is recreated [2] Changelog is not cleared when it is "Recreated because of wrong test in dbmdb_back_ctrl [3] Replication keep alive get created before the replica get back in sync. This creates missing csn. Solution: [1] Fix _cl5CICbInit to get the csn from the changelog record key and store properly the min and max in the context. [2] Replace invalid test by a proper one. [3] Change keep alive update starting delay from 2 seconds to 10 minutes (i.e twice the maximum backoff timeout) To let a chance for the other supplier to replay the missing changes. Also added/modified some more data when replication log are enabled Note: this is a partial fix as a proper "resync after db reload" is not handled so this left issues (typically because of the plugin internal operations like memberof plugin or if there are lots of changes to replay) but at least is is enough for the CI test ... Issue: #6032 Reviewed by: @droideck, @tbordaz (Thanks!) (cherry picked from commit 9e595d4)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor comment. ACK
if get_default_db_lib() == "mdb": | ||
os.remove(f'{S1.ds_paths.db_dir}/data.mdb') | ||
else: | ||
shutil.rmtree(f'{S1.ds_paths.db_dir}/userRoot') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A backup should restore all backends. Why only removing userRoot ?
Also the removal of the database is not described in the doc . Is it necessary so the doc should describe it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In production the removal is not needed: the restore task is supposed to delete the whole db (so the doc is OK), But in this test case, I wanted to be absolutely sure that the db get destroyed to verify that the "restore" task restores it properly. The test use only a single backend (i.e: userRoot) and in Python it is easier to remove it than to remove everything in the db directory.
FYI: Initially I had the same feeling when I decided to change the starting timeout but I realized that waiting 10 minutes is not that long for a timer that fires once per hour ! The keep alive entry just need to be updated regularly to limit the number of changes to look at before finding the first change to replay but precise timing is not required. |
Replication is broken after doing an offline backup then later on an online or offline restore
Note: with online backup changelog is discarded at restore time (because it has no purge RUV)
In fact there are multiple cause:
[1] _cl5CICbInit is building wrongly the changelog RUVs so changelog is recreated
[2] Changelog is not cleared when it is "Recreated because of wrong test in dbmdb_back_ctrl
[3] Replication keep alive get created before the replica get back in sync. This creates missing csn.
Solution:
[1] Fix _cl5CICbInit to get the csn from the changelog record key and store properly the min and max in the context.
[2] Replace invalid test by a proper one.
[3] Change keep alive update starting delay from 2 seconds to 10 minutes (i.e twice the maximum backoff timeout)
To let a chance for the other supplier to replay the missing changes.
Also added/modified some more data when replication log are enabled
Note: this is a partial fix as a proper "resync after db reload" is not handled so this left issues (typically because
of the plugin internal operations like memberof plugin or if there are lots of changes to replay) but at least is is enough for the CI test ...
Issue: #6032
Reviewed by: @droideck, @tbordaz (Thanks!)