
WIP: #2166 databricks direct loading #2219

Draft: donotpush wants to merge 6 commits into base: devel

Conversation

netlify bot commented Jan 15, 2025

Deploy Preview for dlt-hub-docs canceled.

Latest commit: 2bd0be0
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/678e76f86e51fe0008095972

@donotpush donotpush changed the title #2166 databricks direct loading WIP: #2166 databricks direct loading Jan 16, 2025
@donotpush donotpush marked this pull request as draft January 16, 2025 12:22
@rudolfix rudolfix self-requested a review January 20, 2025 11:21
@rudolfix (Collaborator) left a comment:

On top of this review:
We should be able to run Databricks without configured staging, so we can enable the many tests that rely on that.

We have the destination config in the destinations_configs test util:

  1. Remove databricks here:
destination_configs += [
    DestinationTestConfiguration(destination_type=destination)
    for destination in SQL_DESTINATIONS
    if destination
    not in ("athena", "synapse", "databricks", "dremio", "clickhouse", "sqlalchemy")
]
  2. Add staging to this config and move it next to the other staged Databricks setups (see the sketch after the snippet below):
destination_configs += [
    DestinationTestConfiguration(
        destination_type="databricks",
        file_format="parquet",
        bucket_url=AZ_BUCKET,
        extra_info="az-authorization",
    )
]
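
A minimal sketch of how those two points might combine, assuming DestinationTestConfiguration accepts a staging argument the way the other staged configs do (the staging="filesystem" parameter is an assumption, not code from this PR):

    # Hypothetical combined result: a no-staging databricks config plus the staged
    # Azure variant kept with the other staged setups. staging="filesystem" is assumed.
    destination_configs += [
        # direct loading, no staging configured
        DestinationTestConfiguration(destination_type="databricks"),
        # staged loading via the Azure bucket
        DestinationTestConfiguration(
            destination_type="databricks",
            staging="filesystem",
            file_format="parquet",
            bucket_url=AZ_BUCKET,
            extra_info="az-authorization",
        ),
    ]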

You can also ping me and I can prepare a valid commit.

# databricks authentication: get context config
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

The code is correct, but you must handle the situation when default credentials do not exist (i.e. outside of a notebook). I get this exception in that case:

ValueError: default auth: cannot configure default credentials, please check https://docs.databricks.com/en/dev-tools/auth.html#databricks-client-unified-authentication to configure credentials for your preferred authentication method.

Just skip the code that assigns the values.
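
A minimal sketch of that guard, assuming the context-based assignment should only run when a default-authenticated client can be built (the overall shape is illustrative, not code from this PR):

    # Sketch: skip the notebook-context assignment when default credentials
    # cannot be configured (e.g. when running outside a notebook).
    from databricks.sdk import WorkspaceClient

    try:
        w = WorkspaceClient()
    except ValueError:
        # default auth is not available; leave the configured values untouched
        w = None

    if w is not None:
        # assign values derived from the workspace/notebook context here
        ...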

else:
return "", file_name

volume_path = f"/Volumes/{self._sql_client.database_name}/{self._sql_client.dataset_name}/{self._sql_client.volume_name}/{time.time_ns()}"

How IMO we should handle volumes (a sketch follows this list):

  1. Allow defining staging_volume_name in DatabricksClientConfiguration. This should be (I think) a fully qualified name.
  2. If staging_volume_name is empty: create a volume named _dlt_temp_load_volume here, ad hoc.
  3. We do not need to handle volumes at the level of sql_client; you can drop the additional method you added.
  4. We do not need to care about dropping _dlt_temp_load_volume. It belongs to the current schema, so if the schema is dropped, the volume will be dropped as well (I hope!).
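
A minimal sketch of that fallback, assuming a staging_volume_name field on the Databricks client configuration (the field name and the fallback volume name follow the suggestion above; they are not code that exists in this PR yet):

    # Sketch: use the configured fully qualified volume if set, otherwise
    # create an ad-hoc volume in the current catalog/schema before uploading.
    volume_name = self.config.staging_volume_name  # assumed configuration field
    if not volume_name:
        volume_name = (
            f"{self._sql_client.database_name}"
            f".{self._sql_client.dataset_name}._dlt_temp_load_volume"
        )
        # IF NOT EXISTS keeps repeated loads idempotent
        self._sql_client.execute_sql(f"CREATE VOLUME IF NOT EXISTS {volume_name}")
    volume_path = "/Volumes/" + volume_name.replace(".", "/")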


Why do we need time_ns?

return "", file_name

volume_path = f"/Volumes/{self._sql_client.database_name}/{self._sql_client.dataset_name}/{self._sql_client.volume_name}/{time.time_ns()}"
volume_file_name = ( # replace file_name for random hex code - databricks loading fails when file_name starts with - or .
Why can't the same name as in file_name be used here? It will be unique (it already contains the file_id part, which is uniq_id()).


file_name = FileStorage.get_file_name_from_file_path(local_file_path)
file_format = ""
if file_name.endswith(".parquet"):
I do not think you need to know the file format here. Just upload the file we have; it already has the proper extension. Also keep the file_name, as mentioned above.
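
A minimal sketch of that simplification, uploading the local file under its original file_name with no format branching (the files.upload call is from the databricks-sdk Files API; treat the exact usage here as an assumption to verify):

    # Sketch: keep the original name and upload the file as-is into the volume.
    from databricks.sdk import WorkspaceClient

    file_name = FileStorage.get_file_name_from_file_path(local_file_path)
    volume_file_path = f"{volume_path}/{file_name}"

    w = WorkspaceClient()
    with open(local_file_path, "rb") as f:
        w.files.upload(volume_file_path, f, overwrite=True)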

@@ -63,6 +63,7 @@ def iter_df(self, chunk_size: int) -> Generator[DataFrame, None, None]:

class DatabricksSqlClient(SqlClientBase[DatabricksSqlConnection], DBTransaction):
    dbapi: ClassVar[DBApi] = databricks_lib
    volume_name: str = "_dlt_temp_load_volume"
Move this to DatabricksConfiguration (as mentioned above).

@@ -102,6 +103,18 @@ def close_connection(self) -> None:
        self._conn.close()
        self._conn = None

    def create_volume(self) -> None:
        self.execute_sql(f"""
Create the volume ad hoc, right before uploading the file.

@@ -176,6 +176,7 @@ def initialize_storage(self, truncate_tables: Iterable[str] = None) -> None:
            self.sql_client.create_dataset()
        elif truncate_tables:
            self.sql_client.truncate_tables(*truncate_tables)
        self.sql_client.create_volume()
You can drop all of those.

@@ -1,4 +1,6 @@
[runtime]
log_level="DEBUG"
Remember to remove this before the final push.

" and the server_hostname."
)

self.direct_load = True
I do not think we need this. If we have a local file, we do a direct load; we do not need to be in a notebook context to do it. Only the default access token requires the notebook.
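
A small, self-contained sketch of the decision this comment describes (the helper name is illustrative, not from this PR):

    import os

    def should_direct_load(local_file_path: str) -> bool:
        # Direct load whenever a readable local file exists; no notebook flag needed.
        return bool(local_file_path) and os.path.isfile(local_file_path)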

Successfully merging this pull request may close this issue: Load data into databricks without external staging and auth.