Fixes bug with FateInterleavingIT.testInterleaving #5214
Rewrote the test. Seemed to expect an interleave to occur at every opportunity when this may not always occur. Now, just expect at least one interleave to occur. This still accomplishes the same goal of the test: test a fate thread interleaving work on multiple fate ids and ensure the order of operations for any given fate id is as expected even with interleaving.
The test wasn't really created to ensure that interleaving would occur, but I created it to show that the interleaving of Fate operations was how Fate worked by default. Fate tries to advance all operations instead of working on one operation to completion before moving to the next operation. I had a PR that modified Fate to allow it to operate in an interleaving or non-interleaving mode, but it was closed based on the feedback. With the addition of thread pools per operation types, maybe it makes sense to have some operations execute in a non-interleaving fashion (run to completion), like create table or something. |
The test would check that all fate ops occurred before the next set of fate ops across all fate ids. For example, it would ensure FateId1, 2, and 3 would all execute Is this an expected behavior? As the test is right now, either it needs to be changed or the code it is testing needs to be changed. |
I'm wondering if the recent Fate changes changed the execution order logic. For example, if you look at accumulo/core/src/main/java/org/apache/accumulo/core/fate/ZooStore.java Lines 159 to 164 in d0fb7f5
It looks like the new logic for selecting the next thing to work on is in |
Looking at your image, it appears that the code is not interleaving at all. The |
How so? There's one FATE thread executing these ops and it jumps back and forth between the 3 FateIds, is that not what is meant by "interleaving"? |
Maybe I'm misunderstanding your image. If you have 3 Fate ops and 1 Fate thread, then the execution order would have been the following in past versions of the code: FirstOp.isReady The Fate thread is making sure that all transactions make progress. I used the word "interleave" as in to mix, or alternate. Whereas in a non-interleaving or non-alternating scenario, the Fate thread would run each Fate operation to completion before starting on the next Fate operation. In the non-alternating scenario, it would look like: FirstOp.isReady
How would SecondOp.isReady be called before FirstOp.call? FirstOp.call returns SecondOp, so SecondOp.isReady has to execute after FirstOp.call |
Yeah, it's possible something has been lost/broken, which was the main motivation for this PR.
Sorry, you are right, it's been a while since I wrote this test. So, it should be that all of the FirstOp's would complete, then all of the SecondOps would complete, then all of the ThirdOps should complete. I just backported the test to 3.1 and ran it and the tracking table contains:
Backported FateInterleavingIT to 3.1 in #5223 |
Yeah, that looks like the expected order. Wondering if something is not working as expected in FATE, or something is wrong with the test (which is what I attempted to change in this PR). |
Looking at: accumulo/core/src/main/java/org/apache/accumulo/core/fate/AbstractFateStore.java Lines 182 to 203 in b63690a
If the |
The fate code in 4.0 will run as many steps as possible in a fate op as long as isReady keeps returning zero. The code in 3.1 and 2.1 will only run a single step at a time and switch to something else even if the next step isReady. Since this test has a single fate thread for the case of when 0 is always returned for isReady then it should never interleave. For the case where isReady returns non zero it should interleave if possible, but it is not guaranteed to happen in the strict way this test is looking for. I ran the test and saw a situation like the following happen.
In the above situation, the fact that the fate threads saw some of the work before all of the work was added throws off what the test is looking for. Another thing that I saw throw off the test was fate operations taking longer than they slept for. For example, the isReady method sleeps for 50ms, but it could actually take 400ms to run, which has a cascading impact on the expectations of the test. I think the test is trying to ensure that when isReady returns non-zero, fate does switch to work on other things. However, the way it's doing it is too strict, so it does seem good to relax it. Since this test is dealing with multithreading, something like the following could happen (although it has a really low probability).
In this example, if the test thread is running really slowly and fate threads are running fast, then no interleaving would happen, and that would not indicate a bug in the code. This could happen on an overloaded test server. Maybe for the interleaving test the Fate.isReady calls could only return 0 once they have seen that the other two have executed isReady. Like Fate1 step1 would only return 0 for isReady if it has seen that Fate2 and Fate3 have run isReady. Then if the code was not switching between fate ops that were returning non-zero for isReady, the test would get stuck forever. That may be overkill though; changing the test to look for some interleaving is probably good. Normally some interleaving will happen.
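The gating idea above could be sketched roughly as follows. This is only an illustration: `IsReadyGate`, its constructor argument, and the 50ms deferral value are hypothetical names and numbers, not part of the Fate API or of the test in this PR.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper (not Accumulo code): records which fate ids have run isReady.
class IsReadyGate {
  private final Set<String> seen = ConcurrentHashMap.newKeySet();
  private final int expectedIds;

  IsReadyGate(int expectedIds) {
    this.expectedIds = expectedIds;
  }

  // Returns 0 (ready) only once every expected fate id has called isReady at least once;
  // otherwise returns a non-zero deferral in milliseconds (50 is an assumed value).
  long isReady(String fateId) {
    seen.add(fateId);
    return seen.size() >= expectedIds ? 0L : 50L;
  }

  public static void main(String[] args) {
    IsReadyGate gate = new IsReadyGate(3);
    System.out.println(gate.isReady("fate1"));
    System.out.println(gate.isReady("fate2"));
    System.out.println(gate.isReady("fate3"));
  }
}
```

With a gate like this, a scheduler that never switched away from `fate1` would keep getting a non-zero deferral, so the test would hang rather than pass by accident.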
if (prevOp != null && prevOp.getValue().contains("isReady1")
    && !currOp.equals(new AbstractMap.SimpleImmutableEntry<>(prevOp.getKey(),
        prevOp.getValue().replace('1', '2')))) {
  interleaves++;
}
Wondering if it would be better to relax this and look for any interleaving of FateIds in the sequence of operations. This would increment the interleave counter in more cases.
- if (prevOp != null && prevOp.getValue().contains("isReady1")
-     && !currOp.equals(new AbstractMap.SimpleImmutableEntry<>(prevOp.getKey(),
-         prevOp.getValue().replace('1', '2')))) {
-   interleaves++;
- }
+ // has the current operation passed its first step?
+ boolean passedFirstStep = !currStep.equals(expRunOrder.get(0));
+ // was the previous op a different fate id?
+ boolean prevFateIdDiffered = prevOp != null && !prevOp.getKey().equals(fateId);
+ if (passedFirstStep && prevFateIdDiffered) {
+   interleaves++;
+ }
Added, this is quite a bit easier to read too
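For readers following along, a self-contained version of the relaxed check might look like the sketch below. It assumes the observed operations are available as an ordered list of (fate id, step name) pairs; `InterleaveCounter`, `countInterleaves`, and the step labels are hypothetical, and the real test reads its data from a tracking table instead.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone version of the relaxed interleave check (not the test's actual code).
class InterleaveCounter {
  // ops: observed (fateId, step) pairs in execution order.
  // firstStep: the name of the first step in the expected run order.
  static int countInterleaves(List<Map.Entry<String,String>> ops, String firstStep) {
    int interleaves = 0;
    Map.Entry<String,String> prevOp = null;
    for (Map.Entry<String,String> currOp : ops) {
      // has the current operation passed its first step?
      boolean passedFirstStep = !currOp.getValue().equals(firstStep);
      // was the previous op a different fate id?
      boolean prevFateIdDiffered = prevOp != null && !prevOp.getKey().equals(currOp.getKey());
      if (passedFirstStep && prevFateIdDiffered) {
        interleaves++;
      }
      prevOp = currOp;
    }
    return interleaves;
  }

  public static void main(String[] args) {
    var ops = List.of(Map.entry("A", "isReady1"), Map.entry("B", "isReady1"),
        Map.entry("A", "isReady2"), Map.entry("A", "call"),
        Map.entry("B", "isReady2"), Map.entry("B", "call"));
    System.out.println(countInterleaves(ops, "isReady1")); // prints 2
  }
}
```

A sequence where each fate id runs to completion before the next one starts yields a count of 0, since the only change of fate id happens on a first step.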
I wonder if the best solution here is to remove FateInterleavingIT in main and close #5223 (backport to 3.1). FateInterleavingIT was created as part of PR #3852 where I wanted to change the default behavior of some Fate operations such that they would run to completion. Since that time and the introduction of FateInterleavingIT, the Fate implementation has changed and the default behavior has changed. If we want to create a test for the expected execution order of Fate transactions, then we should likely create that and remove FateInterleavingIT. If we don't need a test for the expected execution order, then we should remove FateInterleavingIT and not fix it. In both cases, FateInterleavingIT is OBE and should be removed. I think we have sufficient coverage ensuring that Fate transactions complete, so this is really about the execution order.
Testing execution order is really nice and the changes to this test are adding that. If fate called Repo.call after Repo.isReady returned non-zero, it would be nice to catch that bug, and the changes in this test would have a chance of doing that. So testing the following is good and the changes in this PR are doing that with the comparisons to the new
Testing not interleaving and interleaving is also nice. However, only the not interleaving case can be tested deterministically in 4.0. For the case where isReady always returns zero, we should never see interleaving and we should see things happen in the expected order per fate op. For the case where isReady is returning non-zero, the fate threads should switch to work on something else, but that may not be observed because of timing issues. So maybe we should test execution order and non-interleaving and chuck the interleaving test. It does seem the name of the test should change to FateExecutionOrderIT or something like that.
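To make the difference between the two behaviors concrete, here is a toy single-threaded simulation for the case where isReady always returns zero. It is not the Fate implementation: `FateOrderSketch`, the FIFO queue, and the log format are assumptions made only for illustration, and the real store selects work differently.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model (not Accumulo code) of one thread working several fate ops that are always ready.
class FateOrderSketch {
  static final List<String> STEPS = List.of("FirstOp", "SecondOp", "LastOp");

  // runWhileReady = true: keep stepping the same op until it finishes (behavior described for 4.0).
  // runWhileReady = false: run one step, then requeue the op (behavior described for 3.1 and 2.1).
  static List<String> run(int numOps, boolean runWhileReady) {
    Deque<int[]> queue = new ArrayDeque<>(); // each element is {fateId, nextStepIndex}
    for (int id = 1; id <= numOps; id++) {
      queue.add(new int[] {id, 0});
    }
    List<String> log = new ArrayList<>();
    while (!queue.isEmpty()) {
      int[] op = queue.poll();
      do {
        log.add("Fate" + op[0] + ":" + STEPS.get(op[1]++));
      } while (runWhileReady && op[1] < STEPS.size());
      if (op[1] < STEPS.size()) {
        queue.add(op); // unfinished op goes to the back of the queue
      }
    }
    return log;
  }

  public static void main(String[] args) {
    System.out.println(run(2, true));
    System.out.println(run(2, false));
  }
}
```

In this model, `run(2, true)` logs all three steps of Fate1 before any step of Fate2, while `run(2, false)` logs FirstOp for both ids, then SecondOp for both, then LastOp for both, which matches the ordering described for the 3.1 backport.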
Based on the discussion, it seems like:
|
@@ -386,10 +407,10 @@ private FateId verifySameIds(Iterator<Entry<Key,Value>> iter, SortedMap<Key,Valu
     Text fateId = subset.keySet().iterator().next().getColumnFamily();
     assertTrue(subset.keySet().stream().allMatch(k -> k.getColumnFamily().equals(fateId)));

-    var expectedVals = Set.of("FirstNonInterleavingOp::isReady", "FirstNonInterleavingOp::call",
+    var expectedVals = List.of("FirstNonInterleavingOp::isReady", "FirstNonInterleavingOp::call",
The use of a List here is really subtle in its importance to verify the correctness of the observed data. A comment would be good to make its importance more prominent.
- var expectedVals = List.of("FirstNonInterleavingOp::isReady", "FirstNonInterleavingOp::call",
+ // In addition to ensuring the expected fate operations happened, we also want to ensure they
+ // happened in a specific order, which is why a List is used.
+ var expectedVals = List.of("FirstNonInterleavingOp::isReady", "FirstNonInterleavingOp::call",
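A small illustration of why the collection type matters here: set equality ignores order, so a comparison against a Set would not notice `call` running before `isReady`. The class and method names below are hypothetical and exist only for this example.

```java
import java.util.HashSet;
import java.util.List;

// Hypothetical example (not the test's code) contrasting set and list comparison.
class OrderedExpectation {
  // Order-insensitive comparison: only checks that the same values were seen.
  static boolean matchesAsSet(List<String> expected, List<String> observed) {
    return new HashSet<>(expected).equals(new HashSet<>(observed));
  }

  // Order-sensitive comparison: the values must also appear in the same sequence.
  static boolean matchesAsList(List<String> expected, List<String> observed) {
    return expected.equals(observed);
  }

  public static void main(String[] args) {
    var expected = List.of("FirstNonInterleavingOp::isReady", "FirstNonInterleavingOp::call");
    var outOfOrder = List.of("FirstNonInterleavingOp::call", "FirstNonInterleavingOp::isReady");
    System.out.println(matchesAsSet(expected, outOfOrder));  // prints true
    System.out.println(matchesAsList(expected, outOfOrder)); // prints false
  }
}
```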
Could keep the test for now and see if it causes problems and remove it later if it does. Could add a comment linking to this PR. |
- Renamed FateInterleavingIT/UserFateInterleavingIT/MetaFateInterleavingIT to FateExecutionOrderIT to better indicate what is being tested.
- Simplified how an interleave is counted/checked for.
This PR is now up to date with requested changes. |
@kevinrr888 - I merged #5223 into 3.1, then I removed the FateExecutionOrderIT file on the merge to main. Please rename FateInterleavingIT to FateExecutionOrderIT in this PR. |
@dlmarion I was holding off on re-reviewing that until this received approval, that way we could better decide what the backport would look like. I looked over the merged PR, and it seems it is missing an interleaving test.
Thank you, I have renamed it in this PR. EDIT2:
From earlier comment from Keith, so my comments on the missing interleaving test can be ignored. |
@kevinrr888 - Just read your prior comment (and edits). We are good? I think the test in 3.1 is good as-is, as it tests the fate execution order in 3.1. The fate execution order in 4.0 is different. But in both branches, even though the order is different, there is only one way it works. The reason that the test name in 3.1 is |
@kevinrr888 - test method rename in #5280 |
Yes all good. Only thing is maybe could add this: #5223 (comment). This would catch the case where |
When working on #5130, I noticed that FateInterleavingIT.testInterleaving would sometimes fail. Realized it is a bug existing in main. This is how the test was functioning:
- 3 Repos to execute (FirstOp then SecondOp then LastOp) for each FateId (3 FateIds)
- Each Op would have 3 steps:
  - isReady() returning a deferral of 100ms, allowing other FateIds to be worked on
  - isReady() executed some time after the 100ms deferral, returning a deferral of 0ms (it is ready)
  - call(); the op is complete after this step

It seems that this test was trying to ensure that an interleave would occur every time after the first isReady(). I could be wrong, but I don't think this is guaranteed to happen (and I did not always see it happen). I can provide an example of a failure trace if desired.

For example, the test would ensure that we only saw FirstOp.call for all 3 FateIds before any other calls. However, this does not always occur.

I changed this test to no longer ensure that an interleave always occurs, just that it occurs at least once. I also ensure the correct order of operations still occurs (interleaving should have no effect on the order of operations for a given FateId).

If we do expect an interleave to always occur after the first isReady(), this is not the correct fix, and there may be a problem with the FATE code...
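As a rough illustration of the per-FateId ordering check described above, the sketch below groups an interleaved log by fate id and compares each group to the expected step order. `PerFateOrderCheck`, `orderHolds`, and the step labels are hypothetical; the actual test works against entries scanned from a tracking table.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch (not the test's code): interleaving across fate ids is allowed,
// but the steps seen for any single fate id must match the expected order exactly.
class PerFateOrderCheck {
  static boolean orderHolds(List<Map.Entry<String,String>> observed, List<String> expectedOrder) {
    Map<String,List<String>> stepsPerId = new LinkedHashMap<>();
    for (Map.Entry<String,String> e : observed) {
      // collect the steps for each fate id in the order they were observed
      stepsPerId.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
    }
    return stepsPerId.values().stream().allMatch(expectedOrder::equals);
  }

  public static void main(String[] args) {
    var expected = List.of("isReady1", "isReady2", "call");
    var interleaved = List.of(Map.entry("A", "isReady1"), Map.entry("B", "isReady1"),
        Map.entry("A", "isReady2"), Map.entry("B", "isReady2"),
        Map.entry("A", "call"), Map.entry("B", "call"));
    System.out.println(orderHolds(interleaved, expected)); // prints true
  }
}
```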