Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue 5939 - During an update, if the target entry is reverted in the… #6007

Merged
merged 1 commit into from
Dec 12, 2023

Conversation

tbordaz
Copy link
Contributor

@tbordaz tbordaz commented Dec 11, 2023

… entry cache, the server should not retry to lock it

Bug description:
During an update if an BETXN plugin fails the full TXN is aborted and the DB
returns to the previous state.
However potential internal updates, done by BETXN plugins, are also applied
on the entry cache.
Even if the TXN is aborted some entries in the entry cache are left in a state
that does not reflect the DB state.
The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
entries, setting their state to INVALID.

A problem is that reverted entries stay in the entry cache, until refcnt is 0.
During that period, an update targeting that entry fails to retrieve the
entry from the entry cache and fails to add it again as it already exist
the entry.
The update iterates 1000 times, trying to read the entry and to fetch it
from DB.
This is a pure waste as the reverted entry stays too long.

The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
entry state is INVALID.
In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by:

@@ -262,6 +271,7 @@ find_entry_internal_uniqueid(
struct backentry *e;
int err;
size_t tries = 0;
int reverted_entry = 0;

while ((tries < LDBM_CACHE_RETRY_COUNT) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not reset reverted_entry in the while loop ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A reverted entry may be the target of several requests that started before it got reverted. Those req increased the refcnt and we have to wait that all requests call cache_return. The last cache_return will reset and remove the entry from the cache.

slapi_send_ldap_result(pb, (0 == err || DBI_RC_NOTFOUND == err) ? LDAP_NO_SUCH_OBJECT : LDAP_OPERATIONS_ERROR, NULL /* matched */, NULL,
0, NULL);
if (reverted_entry) {
slapi_send_ldap_result(pb, LDAP_BUSY, NULL, NULL, 0, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO We should add a rexr message to explain why LDAP_BUSY was returned.
(i.e: slapi_send_ldap_result(pb, LDAP_BUSY, "target entry busy because of a cancelled operation.", NULL, 0, NULL);

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes good point

@@ -98,6 +98,7 @@ find_entry_internal_dn(
size_t tries = 0;
int isroot = 0;
int op_type;
int reverted_entry = 0;

/* get the managedsait ldap message control */
slapi_pblock_get(pb, SLAPI_MANAGEDSAIT, &managedsait);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we not reset the reverted_entry flag in the while loop ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see response above. In short it is the job of cache_return to reset/free the entry when refcnt==0

CACHE_RETURN(&inst->inst_cache, &e);
tries++;
}
if (tries >= LDBM_CACHE_RETRY_COUNT) {
slapi_log_err(SLAPI_LOG_ERR, "find_entry_internal_dn", "Retry count exceeded (%s)\n", slapi_sdn_get_dn(sdn));
}
if (reverted_entry) {
slapi_send_ldap_result(pb, LDAP_BUSY, NULL, NULL, 0, NULL);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add text to explain we return BUSY

Copy link
Contributor

@progier389 progier389 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@droideck droideck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good, but I think we need to make it flaky... (it has already failed and I've rerun it once)

>       assert topology_st.standalone.ds_error_log.match('.*cache_is_reverted_entry - Entry reverted.*')
E       AssertionError: assert []
E        +  where [] = <bound method DirsrvLog.match of <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0>>('.*cache_is_reverted_entry - Entry reverted.*')
E        +    where <bound method DirsrvLog.match of <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0>> = <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0>.match
E        +      where <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0> = <lib389.DirSrv object at 0x7f7dce59f920>.ds_error_log
E        +        where <lib389.DirSrv object at 0x7f7dce59f920> = <lib389.topologies.TopologyMain object at 0x7f7dce59f950>.standalone

… entry cache, the server should not retry to lock it

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: 389ds#5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
@tbordaz
Copy link
Contributor Author

tbordaz commented Dec 12, 2023

The code looks good, but I think we need to make it flaky... (it has already failed and I've rerun it once)

>       assert topology_st.standalone.ds_error_log.match('.*cache_is_reverted_entry - Entry reverted.*')
E       AssertionError: assert []
E        +  where [] = <bound method DirsrvLog.match of <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0>>('.*cache_is_reverted_entry - Entry reverted.*')
E        +    where <bound method DirsrvLog.match of <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0>> = <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0>.match
E        +      where <lib389.dirsrv_log.DirsrvErrorLog object at 0x7f7dce46b2f0> = <lib389.DirSrv object at 0x7f7dce59f920>.ds_error_log
E        +        where <lib389.DirSrv object at 0x7f7dce59f920> = <lib389.topologies.TopologyMain object at 0x7f7dce59f950>.standalone

Indeed the test relies on the dynamic of a entry state and it is not a surprise that it sometime fails. I change the test to flaky and in addition did more parallel (failing) updates that made the test almost always successful on my laptop.

@tbordaz tbordaz merged commit 1ab0a09 into 389ds:main Dec 12, 2023
188 of 195 checks passed
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
tbordaz added a commit that referenced this pull request Dec 12, 2023
… entry cache, the server should not retry to lock it (#6007)

Bug description:
	During an update if an BETXN plugin fails the full TXN is aborted and the DB
	returns to the previous state.
	However potential internal updates, done by BETXN plugins, are also applied
	on the entry cache.
	Even if the TXN is aborted some entries in the entry cache are left in a state
	that does not reflect the DB state.
	The fix https://pagure.io/389-ds-base/issue/50260 "reverts" those
	entries, setting their state to INVALID.

	A problem is that reverted entries stay in the entry cache, until refcnt is 0.
	During that period, an update targeting that entry fails to retrieve the
	entry from the entry cache and fails to add it again as it already exist
	the entry.
	The update iterates 1000 times, trying to read the entry and to fetch it
	from DB.
	This is a pure waste as the reverted entry stays too long.

	The signature of this issue is a message in the error log: "Retry count exceeded"

Fix description:
	The fix consiste in the loops (fetch on DN or NSUNIQUEID) to test if the
        entry state is INVALID.
	In such case it aborts the loop and return a failure.

relates: #5939

Reviewed by: Pierre Rogier, Simon Pichugin (Thanks !!)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants