[DBZ-PGYB][yugabyte/yugabyte-db#24555] Add task ID to PostgresPartition #163
Can you modify the existing parallel snapshot UT, by setting |
@Sumukh-Phalgaonkar It is currently not possible to have multiple tasks in the connector test framework.
Problem
With the introduction of the parallel snapshot model, the connector can run multiple tasks when snapshot.mode is set to parallel. This introduces a problem at the underlying layer, where the connector stores the sourceInfo for its partitions, i.e. the PostgresPartition objects, in Kafka.
A PostgresPartition is identified by a map with the structure {"server", topicPrefix}. Currently this map is the same for all the PostgresPartition objects created by the tasks when snapshot.mode is parallel, so they all end up referring to the same source partition in the Kafka topic. Assuming two tasks, 0 and 1, the following then happens:
a. After completion, task_0 updates the sourceInfo saying that its snapshot is completed.
b. task_1 reads the same sourceInfo object, concludes that the snapshot is completed, and skips its snapshot.
This causes data loss, since task_1 never actually takes a snapshot.
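The collision can be illustrated with a minimal sketch. This is not the connector's code: the class SharedPartitionSketch, the in-memory offsets map, and the snapshotCompleted key are hypothetical stand-ins for the real offset storage, assumed here only to show why two tasks that share one partition key interfere with each other.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for offset storage; not the connector's real classes.
class SharedPartitionSketch {
    // Stored offsets, looked up by the source partition map.
    static final Map<Map<String, String>, Map<String, Boolean>> offsets = new HashMap<>();

    // Partition key as described above: {"server", topicPrefix}.
    // It does not depend on the task, so every task gets the same key.
    static Map<String, String> sourcePartition(String topicPrefix) {
        return Map.of("server", topicPrefix);
    }

    // Returns true if the task took a snapshot, false if it skipped it.
    static boolean runTask(int taskId, String topicPrefix) {
        Map<String, String> partition = sourcePartition(topicPrefix);
        Map<String, Boolean> stored = offsets.get(partition);
        if (stored != null && Boolean.TRUE.equals(stored.get("snapshotCompleted"))) {
            return false; // another task already marked this partition as done
        }
        // "snapshotCompleted" is an assumed field name for illustration only.
        offsets.put(partition, Map.of("snapshotCompleted", true));
        return true;
    }
}
```

In this sketch, calling runTask(0, "dbserver1") and then runTask(1, "dbserver1") makes the second call skip its snapshot, which mirrors steps a and b.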
Solution
This PR implements a short-term solution: the task ID is added to the partition so that each PostgresPartition identifies a sourcePartition uniquely. The identifying map now becomes {"server", topicPrefix_taskId}.
Note: This solution is a quick fix that works only as long as the number of tasks in the connector remains the same.
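A rough sketch of the resulting partition key follows. The class TaskPartitionSketch and its method are hypothetical and are not taken from the PR diff; the "_" separator is simply one reading of the topicPrefix_taskId notation used in the description.

```java
import java.util.Map;

// Hypothetical illustration of the per-task key; not the actual PostgresPartition code.
class TaskPartitionSketch {
    // Builds {"server", topicPrefix_taskId}, so each task gets its own key.
    static Map<String, String> sourcePartition(String topicPrefix, int taskId) {
        return Map.of("server", topicPrefix + "_" + taskId);
    }
}
```

Under this assumption, task 0 and task 1 with topic prefix dbserver1 map to dbserver1_0 and dbserver1_1, so their stored sourceInfo no longer overlaps. Because the task ID is part of the key, offsets stored under one task ID are only found again by a task with the same ID, which is presumably why the note asks for the task count to stay fixed.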
This partially fixes yugabyte/yugabyte-db#24555