Some local partitions' segments don't get purged when over retention time #596
In the logs before the broker restart, there are only "Rolled new log segment" entries:
These are the only logs found for several hours before restarting the broker; no delete logs were found.
Sounds related to https://issues.apache.org/jira/browse/KAFKA-16511 -- could you try this on 3.7.1 and see if it's still an issue?
Hi @jeqo, thank you for your response. I actually tried 3.7.1 and the same issue happened again. This time I got a clearer idea of how it happened: the local segment was successfully uploaded to S3, but it didn't get deleted locally. Also:
Then after I restarted the broker, the "blocked" segment was uploaded to S3 again and everything went back to normal (local disk usage dropped as well). As we are still trying to find the root cause, is there anything you think might cause this issue? Thank you.
One thing worth mentioning: this did not happen on only one broker. It actually happened on two brokers, and both brokers host the affected partition. After I restarted the broker that holds the leader of the partition, disk usage on the broker that held the follower of the partition dropped as well.
Hi @jeqo, it just happened again in 2 cases. It looks like the broker tried to upload, but either never succeeded in uploading to S3, or it actually uploaded successfully but doesn't seem to "know" it, and there are no logs for the successful upload either. The logs look like:
By contrast, a successful upload looks like:
I was wondering if the plugin has a configurable timeout for uploading? In our case it has been hanging for about 2 hours already, and if we had a timeout, maybe we could re-upload instead of waiting indefinitely. Also, is there a mechanism to check on the upload instead of just waiting and doing nothing?
@bingkunyangvungle sorry for the late reply. I haven't managed to find time to look into this yet; I will try again this week. In the meantime, there are a couple of configs that you can try for S3 timeouts:
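(For illustration only: a minimal sketch of the kind of S3 timeout settings this likely refers to. The property names below are an assumption based on the plugin's S3 storage backend and should be verified against its documentation.)

# Assumed broker-side properties for the S3 storage backend (names and values are illustrative)
rsm.config.storage.s3.api.call.timeout=60000            # overall timeout per S3 API call, in ms
rsm.config.storage.s3.api.call.attempt.timeout=30000    # timeout per individual attempt/retry, in ms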
Hi @jeqo, thank you for your reply. BTW, the plugin version we are using is 2024-04-02-1712056402. We also tried setting size-based local retention; it works for other partitions, but not for this one. It seems to me that it is aware the segment has not been uploaded, so it won't delete it. (Just guessing.)
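(For context, a hedged sketch of what size-based local retention on a topic looks like; the values below are illustrative assumptions, not the reporter's actual settings.)

local.retention.bytes=53687091200    # keep at most ~50 GiB per partition on local disk (illustrative value)
retention.bytes=268435456000         # total retention, including data tiered to remote storage (illustrative value)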
@bingkunyangvungle I've cut a new release, please try https://github.com/Aiven-Open/tiered-storage-for-apache-kafka/releases/tag/2024-10-23-1729694047 |
Hi @jeqo, thank you for this release. I noticed that the release is still missing the artifacts (core and the S3/Azure/Google Cloud storage), compared to the other releases. Is there a plan to upload them?
Argh, missed that. Uploading them now.
Let me try it. Thank you!
@bingkunyangvungle just checking if you have any news on the results with the latest plugin version, thanks!
Hi @jeqo, we are currently testing the new plugin and will roll it out to our production afterwards. We'll let you know how it goes.
What happened?
We keep the local retention at 20% of the total retention, and the configuration looks like this:
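(The original configuration block is not shown above; for illustration, a hedged sketch of topic-level retention settings matching the 20% description, with assumed values.)

remote.storage.enable=true
retention.ms=604800000          # 7 days of total retention (illustrative value)
local.retention.ms=120960000    # roughly 20% of retention.ms kept on local disk (illustrative value)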
Normally each partition has about 9~11 segments stored locally, but sometimes, for a certain partition, the cluster seems to 'forget' to delete the local segments that fall outside the retention policy. As a result, the number of segments grows continuously and the broker's data size keeps going up as well, causing high disk utilization. Once the issue is observed, restarting the Kafka service on the broker that is the leader for the affected partition causes the out-of-retention segments to be purged afterwards.
This is what happened before and after restarting the leader for the partition:
Kafka version:
3.7.0
Tiered Storage version: 2024-04-02-1712056402
What did you expect to happen?
The out-of-retention segments would be purged automatically.
What else do we need to know?
Not sure whether this is an issue in Kafka itself or in the plugin, so maybe this submission is a good starting point for discussion.