
Issue 6032 - Replication is broken after backup restore #6035

Merged
progier389 merged 1 commit into 389ds:main from progier389:i6031 on Jan 19, 2024

Conversation

@progier389 (Contributor) commented Jan 17, 2024

Replication is broken after doing an offline backup and then, later on, an online or offline restore.
Note: with an online backup, the changelog is discarded at restore time (because it has no purge RUV).
In fact there are multiple causes:
[1] _cl5CICbInit builds the changelog RUVs incorrectly, so the changelog is recreated.
[2] The changelog is not cleared when it is recreated, because of a wrong test in dbmdb_back_ctrl.
[3] The replication keep-alive entry gets created before the replica is back in sync, which creates missing CSNs.
Solution:
[1] Fix _cl5CICbInit to get the CSN from the changelog record key and store the min and max properly in the context.
[2] Replace the invalid test with a proper one.
[3] Change the keep-alive update starting delay from 2 seconds to 10 minutes (i.e. twice the maximum backoff timeout), to give the other suppliers a chance to replay the missing changes (see the sketch after this description).
Also added/modified some extra data in the logs when replication logging is enabled.
Note: this is a partial fix, as a proper "resync after db reload" is not handled, so some issues remain (typically because of plugin internal operations, like the memberof plugin, or when there are lots of changes to replay), but it is at least enough for the CI test.

Issue: #6032

Reviewed by: @droideck, @tbordaz (Thanks!)
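
To illustrate item [3], here is a minimal sketch of the changed delay, assuming PROTOCOL_BACKOFF_MAXIMUM is expressed in seconds (the constant name appears in the review below; everything else here is illustrative, not the actual patch):

/* Sketch only: the keep-alive entry needs a regular refresh, not a
 * prompt one, so the first update is delayed by twice the maximum
 * backoff timeout (2 * 300 s = 600 s = 10 minutes) instead of 2 seconds,
 * giving other suppliers time to replay missing changes first. */
#define KEEPALIVE_START_DELAY (2 * PROTOCOL_BACKOFF_MAXIMUM)  /* was: 2 seconds */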

@droideck (Member) left a comment:

Looks good!
The 2*PROTOCOL_BACKOFF_MAXIMUM part feels a bit rough, but IIUC you said you are working on a fix that will deal with the "resync after restore", and this will be reverted later.
And anyway, 2*PROTOCOL_BACKOFF_MAXIMUM is still more stable than nothing :)

Ack from me.

@tbordaz (Contributor) left a comment:

The fixes are looking so good... that I wonder how it has been working before ;)

@@ -2589,7 +2589,7 @@ _cl5CICbInit(dbi_val_t *key, dbi_val_t *data, DBLCI_CTX *dblcictx)
         return DBI_RC_SUCCESS;
     }
     /* Update last csn */
-    csn_init_by_string(&dblcictx->csn, data->data);
+    csn_init_by_string(&dblcictx->csn, key->data);
tbordaz (Contributor) commented:
I understand the fix, but I cannot understand how it has been working before.
_cl5Iterate is used for trimming, cleanRUV, .... We know it works. But IIUC it was setting the CL iterator with a crappy csn (data instead of key). It looks like it worked for accessing the first CL entry, but on the next entries 'startcsn' in _cl5Iterate should have been completely random!
I am missing something here.

Anyway, this fix is valid.
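
(For context on the question above, a minimal sketch of the assumed record layout; the comment block is illustrative, and only the final call comes from the diff:)

/* Assumed changelog record layout:
 *   key->data  -> "<csn string>"   (a CSN, parseable by csn_init_by_string)
 *   data->data -> serialized change (not a CSN string at all)
 * Parsing data->data as a CSN therefore yields an essentially random
 * value, which is why the fix reads the CSN from the record key: */
csn_init_by_string(&dblcictx->csn, key->data);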

progier389 (Author) replied:

Yes, it has been broken since 5a75518 in Mar 2023
(Issue 5661 - LMDB hangs while Rebuilding the replication changelog RUV (#5676)).
So it is quite recent (2.4, I think), and this code is only called in case of a disorderly shutdown, a restore, or an import
(i.e. when the purge RUV and upperbound RUV are not found in the changelog),
so it is not that frequent, and obviously the CI tests were not testing it until now ...

tbordaz (Contributor) replied:

Ah!! Okay, I thought it was an old (always existing) bug :)
Thanks for the detailed explanation.

     }
-    ridinfo->maxcsn = csn;
+    ridinfo->maxcsn = dblcictx->csn;
tbordaz (Contributor) commented:

Sorry, same comment again. The fix is valid, but how did it work before?
The csn was NULL and not modified. Does that mean _cl5GenRUVInfo generated ridinfo with NULL min/max csn?

progier389 (Author) replied:

Exactly.
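
(A hypothetical sketch of the corrected bookkeeping: records are iterated in key order, so the CSN parsed from each key can seed the per-replica boundaries. Only the maxcsn assignment comes from the diff above; the mincsn handling and the first_record flag are assumptions:)

/* Hypothetical sketch, not the actual patch: */
if (ridinfo->first_record) {
    ridinfo->mincsn = dblcictx->csn;   /* oldest CSN seen for this replica id */
    ridinfo->first_record = PR_FALSE;
}
ridinfo->maxcsn = dblcictx->csn;       /* newest CSN seen so far */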

@progier389 merged commit 9e595d4 into 389ds:main on Jan 19, 2024
193 of 195 checks passed
progier389 added a commit that referenced this pull request Jan 19, 2024
(cherry picked from commit 9e595d4; same commit message as the PR description above)
@tbordaz (Contributor) left a comment:

Minor comment. ACK

if get_default_db_lib() == "mdb":
    os.remove(f'{S1.ds_paths.db_dir}/data.mdb')      # LMDB keeps the whole environment in one file
else:
    shutil.rmtree(f'{S1.ds_paths.db_dir}/userRoot')  # BDB keeps one directory per backend
tbordaz (Contributor) commented:

A backup should restore all backends, so why remove only userRoot?
Also, the removal of the database is not described in the doc. If it is necessary, should the doc describe it?

progier389 (Author) replied:

In production the removal is not needed: the restore task is supposed to delete the whole db (so the doc is OK). But in this test case I wanted to be absolutely sure that the db gets destroyed, to verify that the restore task restores it properly. The test uses only a single backend (i.e. userRoot), and in Python it is easier to remove it than to remove everything in the db directory.

@progier389 (Author) commented:

> Looks good! The 2*PROTOCOL_BACKOFF_MAXIMUM part feels a bit rough, but IIUC you said you are working on a fix that will deal with the "resync after restore", and this will be reverted later. And anyway, 2*PROTOCOL_BACKOFF_MAXIMUM is still more stable than nothing :)

FYI: initially I had the same feeling when I decided to change the starting timeout, but I realized that waiting 10 minutes is not that long for a timer that fires once per hour! The keep-alive entry just needs to be updated regularly, to limit the number of changes to look at before finding the first change to replay, but precise timing is not required.
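
(For scale, per the numbers above: the start delay is 2 * 300 s = 600 s = 10 minutes, while the refresh period is 3600 s, so only the first of the hourly keep-alive updates is shifted.)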

@progier389 deleted the i6031 branch on March 28, 2024
Successfully merging this pull request may close these issues:

Replication broken after backup restore with freeipa configuration