Some local partitions' segments don't get purged when over retention time #596
In the logs before the broker restart, there are only "Rolled new log segment" entries:
These are the only logs found for several hours before restarting the broker; no delete logs were found.
Sounds related to https://issues.apache.org/jira/browse/KAFKA-16511 -- could you try this on 3.7.1 and see if it's still an issue?
Hi @jeqo, thank you for your response. I actually tried 3.7.1 and the same issue happened again. This time I got a clearer idea of how it happened: the local segment was successfully uploaded to S3, but it didn't get deleted locally. Also:
Then after I restarted the broker, the "blocked" segment was uploaded to S3 again and everything went back to normal (local disk usage dropped as well). As we are still trying to find the root cause, is there anything you think might cause this issue? Thank you.
One thing worth mentioning: this did not happen on only one broker. It actually happened on two brokers, and both brokers host the affected partition. After I restarted the broker that holds the leader of the partition, disk usage on the broker that held the follower of the partition dropped as well.
Hi @jeqo, it just happened again in 2 cases. It looks like the broker tried to upload, but either never succeeded in uploading to S3, or it actually uploaded successfully but doesn't seem to "know" it, and there are no logs for the successful upload either. The logs look like:
By contrast, a successful upload looks like:
I was wondering if the plugin has a configurable timeout for uploading? In our case it has been hanging for about 2 hours already, and if we had a timeout, maybe we could re-upload instead of waiting indefinitely. Also, is there a mechanism to check on the upload instead of just waiting and doing nothing?
@bingkunyangvungle sorry for the late reply. I haven't managed to find time to look into this yet; I will try again this week. In the meantime, there are a couple of configs that you can try for S3 timeouts:
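(For illustration only: a minimal sketch of the kind of S3 timeout settings this likely refers to. The property names below are an assumption based on the plugin's S3 storage backend and should be verified against its documentation.)

# Assumed broker-side properties for the S3 storage backend (names and values are illustrative)
rsm.config.storage.s3.api.call.timeout=60000            # overall timeout per S3 API call, in ms
rsm.config.storage.s3.api.call.attempt.timeout=30000    # timeout per individual attempt/retry, in ms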
Hi @jeqo, thank you for your reply. BTW, the plugin version we are using is 2024-04-02-1712056402. We also tried setting size-based local retention; it works for other partitions, but not for this one. It seems to me that it is aware the segment has not been uploaded, so it won't delete it. (Just guessing.)
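(For context, a hedged sketch of what size-based local retention on a topic looks like; the values below are illustrative assumptions, not the reporter's actual settings.)

local.retention.bytes=53687091200    # keep at most ~50 GiB per partition on local disk (illustrative value)
retention.bytes=268435456000         # total retention, including data tiered to remote storage (illustrative value)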
@bingkunyangvungle I've cut a new release, please try https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/releases/tag/2024-10-23-1729694047 |
Hi @jeqo, thank you for this release. I noticed that the release is still missing the artifacts (core and the S3/Azure/Google Cloud storage), compared to the other releases. Is there a plan to upload them?
Argh, missed that. Uploading them now.
Let me try it. Thank you!
@bingkunyangvungle just checking if you have any news on the results with the latest plugin version, thanks!
Hi @jeqo, we are currently testing the new plugin and will roll it out to our production afterwards. We'll let you know how it goes.
What happened?
We keep the local retention at 20% of the total retention, and the configuration looks like this:
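(The original configuration block is not shown above; for illustration, a hedged sketch of topic-level retention settings matching the 20% description, with assumed values.)

remote.storage.enable=true
retention.ms=604800000          # 7 days of total retention (illustrative value)
local.retention.ms=120960000    # roughly 20% of retention.ms kept on local disk (illustrative value)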
Normally each partition has about 9~11 segments stored locally, but sometimes, for a certain partition, the cluster seems to 'forget' to delete the local segments that fall outside the retention policy. As a result, the number of segments grows continuously and the broker's data size keeps going up as well, causing high disk utilization. Once the issue is observed, restarting the Kafka service on the broker that is the leader for the affected partition causes the out-of-retention segments to be purged afterwards.
This is what happened before and after restarting the leader for the partition:
Kafka version:
3.7.0
Tiered Storage version: 2024-04-02-1712056402
What did you expect to happen?
The out-of-retention segments would be purged automatically.
What else do we need to know?
Not sure whether this is an issue in Kafka itself or in the plugin, so maybe this submission is a good starting point for discussion.