Postgres database replication #933
@jorritsandbrink a few comments from my side:
Overall it looks good!
engine upgrade:
One more note regarding the replication state management: is a similar thing possible here? We'll probably need to keep the LSN in the state, but can we tell Postgres to consume the slot on the next run?
@rudolfix which has your preference?
I prefer (2) as the default, and we can implement a helper function that does (1). It can take an instance of the dlt resource, and each resource exposes its state, so we can flush after the run (i.e. when your destination does not support state sync or you disable it).
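A rough sketch of what that split could look like, assuming a hypothetical CDC resource; read_slot, advance_slot_to, and the state key last_lsn are placeholders, not an existing API. The resource keeps the LSN in dlt state, and the slot is only advanced explicitly:

```python
import dlt

def read_slot(start_lsn: int):
    """Placeholder: consume the replication slot past start_lsn and yield decoded events."""
    return iter(())

def advance_slot_to(lsn: int) -> None:
    """Placeholder: e.g. advance the slot via pg_replication_slot_advance()."""

@dlt.resource(write_disposition="merge")  # hypothetical replication resource
def postgres_cdc():
    state = dlt.current.resource_state()
    start_lsn = state.get("last_lsn", 0)          # LSN confirmed on a previous run
    for event in read_slot(start_lsn):
        yield event["data"]
        state["last_lsn"] = event["lsn"]          # remember how far we got; persisted with the load

def flush_slot(resource) -> None:
    """Option (1) as a helper: advance the slot up to the LSN kept in the resource's state,
    e.g. when destination state sync is disabled."""
    lsn = resource.state.get("last_lsn")
    if lsn:
        advance_slot_to(lsn)
```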
This snapshot thing is neat.
@jorritsandbrink we've got a requirement from one of our users to be able to easily select columns for the replication. The reason is to be able to unselect sensitive data. This may be implemented in the source/resource itself, or by providing a map transform (add_map) that drops selected columns and demonstrating that approach, as sketched below.
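For the map-transform variant, something along these lines should already work with the existing add_map API (the resource and column names are only examples; replication_resource stands in for the future CDC resource):

```python
SENSITIVE_COLUMNS = {"ssn", "email"}  # example column names to drop

def drop_sensitive(item: dict) -> dict:
    # remove unselected columns from each change event before it is normalized and loaded
    return {key: value for key, value in item.items() if key not in SENSITIVE_COLUMNS}

# attach the transform to the (stand-in) replication resource
replication_resource.add_map(drop_sensitive)
```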
Hey :) I see there is a lot of amazing stuff in the making here 🚀. I think I'm the mysterious "user" that @rudolfix mentioned in his previous comment ;) and I'm really excited that we at Taktile will be among the first to try out DB replication via DLT. As a follow-up on a conversation that I had with @matthauskrzykowski on Monday, and to tie up some loose ends from a DM thread that I had with Marcin, I briefly wanted to comment here and summarize the requirements from our point of view. As far as I can tell from the conversation above, all of them are already covered (I still left one open question at the end).
Open question on schema evolution: Would we be able to "back-fill/re-sync" columns that were part of the initial source table, but not part of the initial schema contract?
Thanks for neatly listing your requirements @codingcyclist! Most of your points are indeed within the scope of this issue. I just wanted to respond to two of them:
Do you want to be able to provide a list of columns that should be ignored, while allowing columns that are not in the list to be added as the schema evolves? I don't think that's supported yet with schema contracts or any other mechanism.
We're currently adding support for hard deletes in this PR. The next step we had in mind is soft deletes, which would be a natural follow-up to that PR. I would say we use separate issues for the column-ignoring and soft-delete features. I'll leave your open question regarding backfilling to @rudolfix.
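For reference, the hard-delete behavior from that work looks roughly like the snippet below; the hard_delete column hint name is taken from that PR and might still change:

```python
import dlt

@dlt.resource(
    primary_key="id",
    write_disposition="merge",
    columns={"deleted_ts": {"hard_delete": True}},  # a non-null deleted_ts removes the row from the destination
)
def change_events():
    yield {"id": 1, "value": "a", "deleted_ts": None}                      # upserted
    yield {"id": 2, "value": None, "deleted_ts": "2024-03-01T00:00:00Z"}   # deleted
```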
re. selecting columns: my idea was to ignore certain columns already in the CDC resource. The other option is to filter out values with add_map.
re. backfill: not sure what you mean here @codingcyclist. A situation where certain columns were ignored and then we un-ignore them after the initial (and possibly subsequent) sync already happened? If so, we'd need to implement partial updates (or do a full re-sync, which could be painful).
My take on next steps:
re. selecting columns: I think we have a small misunderstanding here. What we'd need is a mechanism to explicitly define a table schema and let only the columns on the schema be synced into the destination. All other columns should be ignored. If I'm not mistaken, this is already possible with the current version of schema contracts, right?
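For reference on that question, restricting the load to a predefined set of columns with the current contract API would look roughly like this; modes and behavior may shift while contracts are experimental (see the next comment), and replication_resource is a stand-in name:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="replication", destination="duckdb", dataset_name="cdc")

# "discard_value" drops values of columns that are not already part of the table schema,
# so only the explicitly defined columns end up in the destination.
# This assumes the table's columns were defined up front, e.g. via the columns hint on the resource.
pipeline.run(replication_resource, schema_contract={"columns": "discard_value"})
```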
re. backfill: Exactly, this is about back-filling columns that were initially ignored when they get "un-ignored" at a later point in time. But it's more of a nice-to-have; let's not worry too much about it for the time being. Soft deletes would be much more important to have, as they are crucial to ensure referential integrity between our entity and event tables. Here's an example: a decision flow gets deleted, but we still need to keep a record for that decision flow in the destination so that we can map it to the historic decisions that were executed on that decision flow. At first glance, it looks like this would be covered by the SCD2 PR that you linked above.
re. selecting columns: I had a little chat with @rudolfix about this on Slack. Schema contracts are still experimental and likely to change significantly, so it's not a good idea to rely on them for now. Also, it's more efficient if we can do the filtering further upstream. My idea is to include an argument on the replication resource that lets you provide a column list, which is then used to filter already in the resource, as sketched below.
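A minimal sketch of that argument, with include_columns as a hypothetical name and decode_wal_events standing in for the pgoutput decoding step:

```python
from typing import Iterator, Optional, Sequence

import dlt

def decode_wal_events(table: str) -> Iterator[dict]:
    """Placeholder for reading and decoding change events from the replication slot."""
    return iter(())

@dlt.resource(write_disposition="merge")  # hypothetical replication resource
def replication(table: str, include_columns: Optional[Sequence[str]] = None) -> Iterator[dict]:
    for event in decode_wal_events(table):
        if include_columns is not None:
            # filter already in the resource, so unselected columns never reach normalize/load
            event = {k: v for k, v in event.items() if k in include_columns}
        yield event
```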
Feature description
dlt should support Postgres database replication. This should be done efficiently, using a CDC-based approach that leverages the transaction log (WAL), not by querying the actual data directly. Implementation should be as generic as possible, so database replication support for other databases (e.g. SQL Server) can easily be added later.
Are you a dlt user?
None
Use case
Multiple users have requested this in the Slack community, e.g. here.
Proposed solution
dlt should use psycopg2's support for replication, which dlt already uses for postgres connections, and the built-in pgoutput plugin, so users do not have to install an output plugin such as wal2json.
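A minimal sketch of that Postgres-specific plumbing, assuming a placeholder DSN, slot, and publication; a publication has to be created up front, and pgoutput speaks a binary protocol, so decode=False and a separate message parser are needed:

```python
import psycopg2
import psycopg2.extras

# connect with the logical replication connection factory that psycopg2 provides
conn = psycopg2.connect(
    "dbname=mydb user=replicator",  # placeholder DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()

# one-time setup: a slot using the built-in pgoutput plugin (no extension to install);
# a publication is also required, e.g. CREATE PUBLICATION dlt_pub FOR ALL TABLES;
cur.create_replication_slot("dlt_slot", output_plugin="pgoutput")

cur.start_replication(
    slot_name="dlt_slot",
    decode=False,  # pgoutput messages are binary and must be parsed separately
    options={"proto_version": "1", "publication_names": "dlt_pub"},
)

def consume(msg):
    # msg.payload holds a raw pgoutput message that still needs to be decoded into
    # insert/update/delete change events; msg.data_start is the message's LSN
    print(msg.data_start, len(msg.payload))
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge so the slot can advance

cur.consume_stream(consume)
```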
The above is Postgres-specific logic that needs to be integrated into new generic functionality for database replication / CDC processing that will be implemented in another ticket (@rudolfix will create). I think we need something along the lines of:

- a generic format for change events of type insert, update, or delete
- a way to handle insert and update operations as well as delete operations
- a resource that yields change events
- does merge need to recognize we're handling a CDC resource instead of a regular resource?
- should this extend the sql_database source and sql_table resource somehow? Postgres CDC doesn't really fit there very naturally because it doesn't use SQLAlchemy and it doesn't persist state for incremental processing.

Related issues
No response