-
Notifications
You must be signed in to change notification settings - Fork 199
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
filesystem state sync #1184
filesystem state sync #1184
Conversation
✅ Deploy Preview for dlt-hub-docs ready!
To edit notification comments on pull requests, go to your Netlify site configuration. |
8791aaa
to
f6d5c9c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also make sure that it works when state sync is disabled. otherwise it is really good!
schema_name=self.schema.name, table_name=table_name | ||
) | ||
# dlt tables do not respect layout (for now) | ||
if table_name in self.schema.dlt_table_names(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is totally fine and should stay like that (we need to document this)
# | ||
|
||
def _write_to_json_file(self, filepath: str, data: DictStrAny) -> None: | ||
dirname = os.path.dirname(filepath) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't use os.path
, use posixpath
. here paths are normalized from fsspec.
|
||
def get_stored_state(self, pipeline_name: str) -> Optional[StateInfo]: | ||
# raise if dir not initialized | ||
filepath = self._get_state_file_name(pipeline_name, "current") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to reproduce the WHERE clause of other destinations.
- must start with pipeline_name
- find one with highest load_id and return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is what is does, no? I mean it starts with the pipeline name, so you can look up the state with the pipeline, and instead of looking for the highest load_id (which we shouldn't do anyway, since we should not rely on load ids being timestamps) it has this current marker. I have a screenshot above of the file layout this pr produces. This way we can avoid iterating through a list of files to find the newest one which will eventually slow down against destinations with many loads, at least that would be my expectation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed this now.
safe_hash = "".join( | ||
[c for c in version_hash if re.match(r"\w", c)] | ||
) # remove all special chars from hash | ||
return f"{self.dataset_path}/{self.schema.version_table_name}/{self.schema.name}__{safe_hash}.jsonl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO hash is enough. also it would be good to have load_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need the name in the filepath so I can find the right schema when looking for the newest version of a schema, so I will keep it.
|
||
def get_stored_schema(self) -> Optional[StorageSchemaInfo]: | ||
"""Retrieves newest schema from destination storage""" | ||
return self.get_stored_schema_by_hash("current") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same thing like in state: find the oldest load id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done (assuming you mean the newest load id :) )
fs_testing_pipe.py
Outdated
@@ -0,0 +1,20 @@ | |||
import dlt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep those in some ignored folder ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, i have that also, but since i am working on two different machines i need to do this sometimes ;)
self.fs_client.touch(posixpath.join(directory, INIT_FILE_NAME)) | ||
|
||
# write schema to destination | ||
self.store_current_schema(load_id or "1") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When "sync_destination" is called, we are not inside the context of a load. I am not quite sure how to handle this case. I first just did not store the schema, but there is a test that verifies that there is a schema in the destination after "sync_destination" is called on a pipeline with nothing in the versions folder. Either we change the the tests or think of some default value. I am not sure..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can also find the newest load_id for this schema present and increase it by one, but that also does not feel right..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or we use a current timestamp, then it should be in line with the other destinations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i have changed it to be like this though. For lineage purposes it would be interesting to also have the load_id (if available) in the file/table, but for now it is inline with the other destinations
c6a65f3
to
e7e0192
Compare
There is a lot of state and schema testing going on in test_job_client, I will think about if we can adapt these tests to also run for the filesystem destinations. |
change filesystem to fallback, to old state loading when used as staging destination
also fixes table count loading to work for all bucket destinations
@rudolfix I have added some more tests for the statesync. Imho the best approach would be to standardize the state loading and retrieving further and run all standard sql tests for the filesystem also, right now many standard sql tests rely on executing sql, so this would be a bigger refactor including probably an extension of the "WithStateSync" interface. Additionally the drop command will currently not work on the filesystem, since that also relies on the sql client, but I don't think it needs to if the destinations would have an improved interface that could be implemented by both the sql and filesystem destinations. Maybe you could have a look if the coverage seems enough at this point and we tackle the harmonization of (possibly all?) destinations when we also add the ability to route statesync to a separate place. |
self.fs_client.write_text(filepath, json.dumps(data), "utf-8") | ||
|
||
def _to_path_safe_string(self, s: str) -> str: | ||
return "".join([c for c in s if re.match(r"\w", c)]) if s else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we instead use c in strings.ascii_letters
?
In [2]: import string
In [3]: string.ascii_letters
Out[3]: 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed it to a hex string now
def _list_dlt_dir(self, dirname: str) -> Iterator[Tuple[str, List[str]]]: | ||
if not self.fs_client.exists(posixpath.join(dirname, INIT_FILE_NAME)): | ||
raise DestinationUndefinedEntity({"dir": dirname}) | ||
for filepath in self.fs_client.listdir(dirname, detail=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
listir
is just an alias to fs_client.ls
or is it implementing some additional behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when listing directories always request a refresh!
all_files = self.fs_client.ls(truncate_dir, detail=False, refresh=True)
IMO this is the only command you should use. fsspec is a mess and this one is proven to work. please replace all commands that list
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could have our own fsclient wrapper that only exposes stuff we think is reliable..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on having our own abstraction with things we really need
"inserted_at": pendulum.now().isoformat(), | ||
"schema_version_hash": self.schema.version_hash, | ||
} | ||
filepath = f"{self.dataset_path}/{self.schema.loads_table_name}/{self.schema.name}__{load_id}.jsonl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use path.join
instead?
|
||
def _get_state_file_name(self, pipeline_name: str, version_hash: str, load_id: str) -> str: | ||
"""gets full path for schema file for a given hash""" | ||
return f"{self.dataset_path}/{self.schema.state_table_name}/{pipeline_name}__{load_id}__{self._to_path_safe_string(version_hash)}.jsonl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use path.join
here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only use to "encode" hash. why not convert hash to hex? still not perfect but less hacky
@@ -155,7 +155,12 @@ def test_replace_write_disposition(layout: str, default_buckets_env: str) -> Non | |||
for basedir, _dirs, files in client.fs_client.walk( | |||
client.dataset_path, detail=False, refresh=True | |||
): | |||
# remove internal paths | |||
if "_dlt" in basedir or "init" in basedir: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"init" maybe should be a reference to constant INIT_FILE_NAME
managed in the regular way by the final destination you have configured. | ||
|
||
You will also notice `init` files being present in the root folder and the special `dlt` folders. In the absence of the concepts of schemas and tables | ||
in blob storages and directories, `dlt` uses these special files to harmonize the behavior of the `filesystem` destination with the other implemented destinations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this much gpt
🙂 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean.
@@ -19,6 +19,8 @@ | |||
from tests.common.utils import load_json_case | |||
from tests.utils import ALL_TEST_DATA_ITEM_FORMATS, TestDataItemFormat, skip_if_not_active | |||
from dlt.destinations.path_utils import create_path | |||
from tests.load.pipeline.utils import destinations_configs, DestinationTestConfiguration | |||
from tests.load.pipeline.utils import load_table_counts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these two can be merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! we can remove the mark. if we are running filesystem through standard state tests, it should be good
|
||
def _get_state_file_name(self, pipeline_name: str, version_hash: str, load_id: str) -> str: | ||
"""gets full path for schema file for a given hash""" | ||
return f"{self.dataset_path}/{self.schema.state_table_name}/{pipeline_name}__{load_id}__{self._to_path_safe_string(version_hash)}.jsonl" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only use to "encode" hash. why not convert hash to hex? still not perfect but less hacky
def _list_dlt_dir(self, dirname: str) -> Iterator[Tuple[str, List[str]]]: | ||
if not self.fs_client.exists(posixpath.join(dirname, INIT_FILE_NAME)): | ||
raise DestinationUndefinedEntity({"dir": dirname}) | ||
for filepath in self.fs_client.listdir(dirname, detail=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when listing directories always request a refresh!
all_files = self.fs_client.ls(truncate_dir, detail=False, refresh=True)
IMO this is the only command you should use. fsspec is a mess and this one is proven to work. please replace all commands that list
dlt/pipeline/state_sync.py
Outdated
|
||
|
||
def state_resource(state: TPipelineState) -> DltResource: | ||
doc = dlt.mark.with_package_state(state_doc(state), LOAD_PACKAGE_STATE_KEY) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! if you want you can commit pipeline state when we commit package here. and you can decide if we remove this mark.... I think package state will be available also in dlt.resource right? so you can always write to it. if so you can remove the mark... sorry for the stupid idea ;>
2857a5c
to
cd4dd23
Compare
# commit load packages | ||
extract_step.commit_packages() | ||
# commit load packages with state | ||
extract_step.commit_packages(state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the state only gets committed to the load package when it has changed now, this is how it is done for other destinations too, but I think we should probably always add the pipeline state to the load package and then check in the destination wether a state entry with this hash exists, right? eventually (I think) all other destinations will also use this mechanism to store the state, this would make sense to me at least.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we could add info if version was bumped. easy to check the pre and post bump and extract state. but for now it is OK
Follow up ticket for state synching to separate place: #1233 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
# commit load packages | ||
extract_step.commit_packages() | ||
# commit load packages with state | ||
extract_step.commit_packages(state) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we could add info if version was bumped. easy to check the pre and post bump and extract state. but for now it is OK
# Conflicts: # docs/website/docs/dlt-ecosystem/destinations/filesystem.md
* clean some stuff * first messy version of filesystem state sync * clean up a bit * fix bug in state sync * enable state tests for all bucket providers * do not store state to uninitialized dataset folders * fix linter errors * get current pipeline from pipeline context * fix bug in filesystem table init * update testing pipe * move away from "current" file, rather iterator bucket path contents * store pipeline state in load package state and send to filesystem destination from there * fix tests for changed number of files in filesystem destination * remove dev code * create init file also to mark datasets * fix tests to respect new init file change filesystem to fallback, to old state loading when used as staging destination * update filesystem docs * fix incoming tests of placeholders * small fixes * adds some tests for filesystem state also fixes table count loading to work for all bucket destinations * fix test helper * save schema with timestamp instead of load_id * pr fixes and move pipeline state saving to committing of extracted packages * ensure pipeline state is only saved to load package if it has changed * adds missing state injection into state package * fix athena iceberg locations * fix google drive filesystem with missing argument
Description
Implements support for synching state and schema with a filesystem destination. Also saves the info to the loads table the way it is saved in other destinations. The saving of the state is done a bit different, mostly to have the same fileformat always and making things easier. I'm not sure if exracting the state at such a late time creates any problems?
Notes:
ToDo: