-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] Reprovision Workflow IT is flaky #870
Comments
Observations:
String updateStepName = WorkflowResources.getUpdateStepByWorkflowStep(node.type());
if (updateStepName != null) {
WorkflowStep step = workflowStepFactory.createStep(updateStepName);
return new ProcessNode(
node.id(),
step, But the stack trace indicates it's executing CreateIndexStep:
if (!originalTemplateMap.containsKey(node.id())) {
// Case 1: Additive modification, create new node
return createNewProcessNode(node, data, predecessorNodes, nodeTimeout);
flowFrameworkIndicesHandler.updateFlowFrameworkSystemIndexDoc(
workflowId,
Map.ofEntries(
Map.entry(STATE_FIELD, State.COMPLETED),
Map.entry(PROVISIONING_PROGRESS_FIELD, ProvisioningProgress.DONE),
Map.entry(PROVISION_END_TIME_FIELD, Instant.now().toEpochMilli())
),
ActionListener.wrap(updateResponse -> {
logger.info("updated workflow {} state to {}", workflowId, State.COMPLETED);
// Replace template document
flowFrameworkIndicesHandler.updateTemplateInGlobalContext(
workflowId,
template,
ActionListener.wrap(templateResponse -> {
logger.info("Updated template for {}", workflowId, State.COMPLETED);
}, exception -> {
String errorMessage = "Failed to update use case template for " + workflowId;
logger.error(errorMessage, exception);
}),
true // ignores NOT_STARTED state if request is to reprovision
);
}, exception -> { logger.error("Failed to update workflow state for workflow {}", workflowId, exception); })
); This reversed sequence is actually evident in the logs, a portion actually included in observation 1 above. I saw it and thought it was unusual, and now it makes complete sense.
Conclusion: We have a race condition in the test: the first reprovision (which adds a create_index step) completes successfully and updates the state to completed; which in the test case moves on to reprovision a second time. It fails in the case where the GET occurs before the update, fetching the earlier copy of the template. Needed fix: Make sure the template is updated before reporting the state as COMPLETED. Given that we typically update the workflow before provisioning, it would make sense to do it before executing the reprovisioning steps, but after validation; basically at a point where we'll either complete or fail with an error. Alternately, try to do both updates simultaneously using a latch (see ML Commons delete model for an example). Also improve the logging to more clearly indicate the type of reprovision step being used. :) |
@gaiksaya do you want us to try to merge a fix for this into 2.17.1? |
If it is just a test fix then sure @dbwiddis. |
Should be merged shortly: |
For clarity, PR #882 fixed the integration test, but there are still some underlying issues that need addressing in another PR:
|
What is the bug?
FlowFrameworkRestApiIT.testReprovisionWorkflow()
is flaky. During 2.17.0 release it required multiple retries to pass on Windows x86. There are occasional GitHub CI failures on other platforms, see https://github.com/opensearch-project/flow-framework/actions/runs/10777769278/job/29887668921#step:4:59 (Ubuntu) and https://github.com/opensearch-project/flow-framework/actions/runs/10777865339/job/29888745661#step:4:49 (macOS)How can one reproduce the bug?
Run our CI repeatedly. It fails reasonably often.
What is the expected behavior?
Tests consistently pass.
What is your host/environment?
Reproducible on Windows, Linux, macOS as shown above
Do you have any additional context?
The failure is occurring consistently on the attempt to reprovision to remove default ingest pipeline (setting it to
_none
, which is theSearchPipelineService.NOOP_PIPELINE_ID
).The change in default pipeline name is the only difference in the two templates.
The text was updated successfully, but these errors were encountered: