
WIP: #2166 databricks direct loading #2219

Draft: donotpush wants to merge 6 commits into base: devel

Conversation

netlify bot commented Jan 15, 2025

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 2bd0be0
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/678e76f86e51fe0008095972

@donotpush donotpush changed the title #2166 databricks direct loading WIP: #2166 databricks direct loading Jan 16, 2025
@donotpush donotpush marked this pull request as draft January 16, 2025 12:22
@rudolfix rudolfix self-requested a review January 20, 2025 11:21
@rudolfix (Collaborator) left a comment:

On top of this review:
We should be able to run Databricks without configured staging, so we can enable the many tests that rely on that.

We have the destination config in the destinations_configs test util:

  1. Remove databricks here:
destination_configs += [
    DestinationTestConfiguration(destination_type=destination)
    for destination in SQL_DESTINATIONS
    if destination
    not in ("athena", "synapse", "databricks", "dremio", "clickhouse", "sqlalchemy")
]
  2. Add staging to this config and move it next to the other staged Databricks setups (see the sketch after the snippet below):
destination_configs += [
    DestinationTestConfiguration(
        destination_type="databricks",
        file_format="parquet",
        bucket_url=AZ_BUCKET,
        extra_info="az-authorization",
    )
]
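
A minimal sketch of how those two points might combine, assuming DestinationTestConfiguration accepts a staging argument the way the other staged configs do (the staging="filesystem" parameter is an assumption, not code from this PR):

    # Hypothetical combined result: a no-staging databricks config plus the staged
    # Azure variant kept with the other staged setups. staging="filesystem" is assumed.
    destination_configs += [
        # direct loading, no staging configured
        DestinationTestConfiguration(destination_type="databricks"),
        # staged loading via the Azure bucket
        DestinationTestConfiguration(
            destination_type="databricks",
            staging="filesystem",
            file_format="parquet",
            bucket_url=AZ_BUCKET,
            extra_info="az-authorization",
        ),
    ]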

You can also ping me and I can prepare a valid commit.

# databricks authentication: get context config
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

The code is correct, but you must handle the situation when default credentials do not exist (i.e. outside of a notebook). I get this exception in that case:

ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

Just skip the code that assigns the values.
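
A minimal sketch of that guard, assuming the context-based assignment should only run when a default-authenticated client can be built (the overall shape is illustrative, not code from this PR):

    # Sketch: skip the notebook-context assignment when default credentials
    # cannot be configured (e.g. when running outside a notebook).
    from databricks.sdk import WorkspaceClient

    try:
        w = WorkspaceClient()
    except ValueError:
        # default auth is not available; leave the configured values untouched
        w = None

    if w is not None:
        # assign values derived from the workspace/notebook context here
        ...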

else:
return "", file_name

volume_path = f"/Volumes/{self._sql_client.database_name}/{self._sql_client.dataset_name}/{self._sql_client.volume_name}/{time.time_ns()}"

How IMO we should handle volumes (a sketch follows this list):

  1. Allow defining staging_volume_name in DatabricksClientConfiguration. This should be (I think) a fully qualified name.
  2. If staging_volume_name is empty: create a volume named _dlt_temp_load_volume here, ad hoc.
  3. We do not need to handle volumes at the level of sql_client; you can drop the additional method you added.
  4. We do not need to care about dropping _dlt_temp_load_volume. It belongs to the current schema, so if the schema is dropped, the volume will be dropped as well (I hope!).
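
A minimal sketch of that fallback, assuming a staging_volume_name field on the Databricks client configuration (the field name and the fallback volume name follow the suggestion above; they are not code that exists in this PR yet):

    # Sketch: use the configured fully qualified volume if set, otherwise
    # create an ad-hoc volume in the current catalog/schema before uploading.
    volume_name = self.config.staging_volume_name  # assumed configuration field
    if not volume_name:
        volume_name = (
            f"{self._sql_client.database_name}"
            f".{self._sql_client.dataset_name}._dlt_temp_load_volume"
        )
        # IF NOT EXISTS keeps repeated loads idempotent
        self._sql_client.execute_sql(f"CREATE VOLUME IF NOT EXISTS {volume_name}")
    volume_path = "/Volumes/" + volume_name.replace(".", "/")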


Why do we need time_ns?

return "", file_name

volume_path = f"/Volumes/{self._sql_client.database_name}/{self._sql_client.dataset_name}/{self._sql_client.volume_name}/{time.time_ns()}"
volume_file_name = ( # replace file_name for random hex code - databricks loading fails when file_name starts with - or .
Why can't the same name as in file_name be used here? It will be unique (it already contains the file_id part, which is uniq_id()).


file_name = FileStorage.get_file_name_from_file_path(local_file_path)
file_format = ""
if file_name.endswith(".parquet"):
I do not think you need to know the file format here. Just upload the file we have; it already has the proper extension. Also keep the file_name, as mentioned above.
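
A minimal sketch of that simplification, uploading the local file under its original file_name with no format branching (the files.upload call is from the databricks-sdk Files API; treat the exact usage here as an assumption to verify):

    # Sketch: keep the original name and upload the file as-is into the volume.
    from databricks.sdk import WorkspaceClient

    file_name = FileStorage.get_file_name_from_file_path(local_file_path)
    volume_file_path = f"{volume_path}/{file_name}"

    w = WorkspaceClient()
    with open(local_file_path, "rb") as f:
        w.files.upload(volume_file_path, f, overwrite=True)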

@@ -63,6 +63,7 @@ def iter_df(self, chunk_size: int) -> Generator[DataFrame, None, None]:

class DatabricksSqlClient(SqlClientBase[DatabricksSqlConnection], DBTransaction):
    dbapi: ClassVar[DBApi] = databricks_lib
    volume_name: str = "_dlt_temp_load_volume"
Move this to DatabricksConfiguration (as mentioned above).

@@ -102,6 +103,18 @@ def close_connection(self) -> None:
        self._conn.close()
        self._conn = None

    def create_volume(self) -> None:
        self.execute_sql(f"""
Create the volume ad hoc, right before uploading the file.

@@ -176,6 +176,7 @@ def initialize_storage(self, truncate_tables: Iterable[str] = None) -> None:
            self.sql_client.create_dataset()
        elif truncate_tables:
            self.sql_client.truncate_tables(*truncate_tables)
        self.sql_client.create_volume()
You can drop all of those.

@@ -1,4 +1,6 @@
[runtime]
log_level="DEBUG"
Remember to remove this before the final push.

" and the server_hostname."
)

self.direct_load = True
I do not think we need this. If we have a local file, we do a direct load; we do not need to be in a notebook context to do it. Only the default access token requires the notebook.
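
A small, self-contained sketch of the decision this comment describes (the helper name is illustrative, not from this PR):

    import os

    def should_direct_load(local_file_path: str) -> bool:
        # Direct load whenever a readable local file exists; no notebook flag needed.
        return bool(local_file_path) and os.path.isfile(local_file_path)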

Successfully merging this pull request may close this issue: Load data into databricks without external staging and auth.