-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PQ AccessDeniedException on Windows in checkpoint move #12345
Comments
I have this exact issue in production re-occurring randomly while running Logstash v7.9.3 with queue.checkpoint.retry : true In my case i've seen a 30 second delay on some temporary files before.
|
@jugsofbeer Thanks for the feedback.
Could be but AFAIK for the cases I saw reported a suspected anti-virus was disabled and problem persisted.
This is so far my hypothesis that we are dealing with some Windows filesystem layering (for virtual servers and/or NAS) which causes some race condition but this is unconfirmed. |
FWIW here are a few similar reported issues on the internets: |
This commit changes `queue.checkpoint.retry` to `true` by default allowing retry of checkpoint write failure. Add exponential backoff retry to checkpoint write to mitigate AccessDeniedExcpetion in Windows. Fixed: #12345
This commit changes `queue.checkpoint.retry` to `true` by default allowing retry of checkpoint write failure. Add exponential backoff retry to checkpoint write to mitigate AccessDeniedExcpetion in Windows. Fixed: elastic#12345
This commit changes `queue.checkpoint.retry` to `true` by default allowing retry of checkpoint write failure. Add exponential backoff retry to checkpoint write to mitigate AccessDeniedExcpetion in Windows. Fixed: elastic#12345
This commit changes `queue.checkpoint.retry` to `true` by default allowing retry of checkpoint write failure. Add exponential backoff retry to checkpoint write to mitigate AccessDeniedExcpetion in Windows. Fixed: #12345 (cherry picked from commit 1a5030b) Co-authored-by: kaisecheng <[email protected]>
This commit changes `queue.checkpoint.retry` to `true` by default allowing retry of checkpoint write failure. Add exponential backoff retry to checkpoint write to mitigate AccessDeniedExcpetion in Windows. Fixed: #12345 (cherry picked from commit 1a5030b) Co-authored-by: kaisecheng <[email protected]>
Hi , we have updated to Logstash 7.17.3 and still getting the Error writing checkpoint: java.nio.file.AccessDeniedException |
No. Logstash mitigates the issue by retrying with backoff by default. The retry is not infinite. This issue is caused by antivirus or even a file explorer, which could lock the file. Retrying indefinitely or a configurable number of retries is not under consideration as it hides the underlying problem and gives too much burden for users to config. I consider giving a higher number of retries which finish around 3 seconds. Another we could think of is to restart the pipeline instead of shutdown. |
Hi , do you have estimate on a new fix for it ? should it be included in a new release ? |
Edit: We need more feedback or telemetry to support the decision of 3 seconds retry instead of blindly make it longer. |
We have put the antivirus to be not enabled and still the issue reproduced , |
The exception cannot be solved in Logstash alone. One suggestion is to find out which process in Windows locks the file. We don't have a timetable for further investigation. It would be great if you can provide minimum pipelines and environment information to reproduce the problem for future investigation. That's said, if there is any improvement this year, it will be in 7.x. I will keep this issue up-to-date. Use |
Hi. Any suggestions on how to check deeper? What did you mean by "but it will lose at least one delivery"? Thanks. |
@reuvenik Thanks for pointing out the drain option! For memory queue, the drain mechanism is enabled by default. Logstash try to drain the events before shutdown. As events are stored in memory, data loss happens if the host crashes, hence no guarantee for delivery. I am far from an expert on the Windows platform, no new idea in mind. So far we confirm antivirus and file explorer can lock the file. I am curious if you can see Handle showing file explorer when it triggers the exception. If you believe there is no process locking the file, you can share your pipeline, config, env setup here to help us reproduce the issue. |
Hi @kaisecheng Before I ran the handle in loop, I tested having explorer opened in the data folder and handle captured it. In any case, sending the pipeline configuration is not relevant because it happened in what seems to be random pipeline each time. And it happened for pipelines with different source technology (kafka, beats, oracle, event hub, jms...). So I guess the problem is not in the pipeline. It even happened for pipeline which is getting feed from other pipeline and sets the index name in the elasticsearch output. |
@reuvenik Thanks for the input.
This page highlight the key points. In addition to the listed limitation, keep in mind that network connectivity issue or bug could happen in plugins. Admin might want to restart the process. In such case, without a persistent queue, events lose in memory. |
@kaisecheng we are also encountering this issue but the strange part was it happens in only one environment and that is also random. Checked the past bugs, Found there were two tests removed (ignored) citing they are failing in windows . Surprisingly the exception was "java.lang.IllegalStateException: java.nio.file.AccessDeniedException: C:\Users\jenkins.LOGSTASH-CI-WIN\AppData\Local\Temp\junit1468481081486511579\data\checkpoint.head.tmp -> C:\Users\jenkins.LOGSTASH-CI-WIN\AppData\Local\Temp\junit1468481081486511579\data\checkpoint.head" By any chance a race condition leads to this type of error in PQ checkpoints? |
I try to investigate the possible causes to for the AccessDeniedException on file move. And found the following code throws AccessDeniedException. try {
var sourceFilename = Path.of("c:\\temp\\source.txt");
var targetFilename = Path.of("c:\\temp\\target.txt");
if( Files.exists(sourceFilename) )
Files.delete(sourceFilename);
if( Files.exists(targetFilename) )
Files.delete(targetFilename);
// Create source and target file.
Files.writeString(sourceFilename, "source file");
Files.writeString(targetFilename, "target file");
// Open a byte channel for the target file but dont close it.
SeekableByteChannel sbc = Files.newByteChannel(targetFilename);
// Try to move the source file to target which is opened. This line throws AccessDeniedException
Files.move(sourceFilename, targetFilename, StandardCopyOption.ATOMIC_MOVE);
} catch (Exception exception) {
System.out.println(exception.toString());
} I try to move the file, where a handle is already open (but not closed) for the target file. Simulating a case where checkpoint head write() happens while a checkpoint read was in progress. Is this a valid case for PQ checkpoints read and write? |
@kaisecheng Some people encountered this bug even after the fix may be due to a buggy implementation of the settings and the retry never kicked in! Please refer: #14486 |
@sharonhac The retry implementation have not kicked in due to another bug. Please refer #14486 |
Hi , as @reuvenik suggested - using queue.type: memory is a workaround for the issue as the data isn't being saved to the disk , and doesn't cause the lock issue , |
@prakashguru many thanks for creating the issue and keep trying to solve the Windows issue. It is indeed a buggy implementation in SettingImpl 🤦 which disable the retry for years. We have seen retry log in old version probably in v6 but it seems to be disabled starting from v7. Thanks again, will try to get a fix up and merge next week |
Identical issue: https://issues.apache.org/jira/browse/MRESOLVER-325 |
The PQ checkpoint strategy has been modified to write to a temporary file and then atomically moved to the final checkpoint file name.
On Windows but under some unknown specific circumstances, an
AccessDeniedException
is thrown by the move operation. This problem seems to happen when using layered filesystems such as with NAS and/or when running in containers.#10234 introduced the
queue.checkpoint.retry
to mitigate that problem by catching that exception and retrying the operation.We still have two problem with this:
Unfortunately the root cause is not yet understood.
The mitigation sometimes does not work.
On the later point, I think we should revisit the
queue.checkpoint.retry
strategy which does a single retry and does a sleep of 500ms before retrying.Until we figure the root cause for this I think we should improve that strategy by having multiple (configurable?) retries using a backoff policy (exponential?). Start with a much lower sleep and increase until we reach a retry limit.
The text was updated successfully, but these errors were encountered: