[BUG] Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access #3

Open
caldempsey opened this issue Mar 2, 2024 · 3 comments · Fixed by #6


caldempsey commented Mar 2, 2024

Describe the problem

Reproduced in the notebook on #6

At present we have set up a Jupyter Notebook with PySpark connected to a Spark cluster, where the Spark cluster is intended to perform the writes to a Delta table. I'm observing that the writes fail to complete if the Jupyter Notebook doesn't have access to the data location.

This behavior seems counterintuitive to me as I expect the Spark instance to handle data writes independently of the Jupyter Notebook's access to the data.
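
For context, the write in question is an ordinary Delta save issued from the notebook session. A minimal sketch follows; the master URL, DataFrame contents, and session config are assumptions rather than the exact repo code, but the target path matches the error below:

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming a standalone cluster at spark://spark-master:7077
# and the shared /data mount; not the exact notebook code from the repo.
spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("delta-write-repro")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([("Rex", "Alice"), ("Fido", "Bob")], ["dog", "owner"])

# The executors write the Parquet data files; the Delta commit creates _delta_log.
df.write.format("delta").mode("overwrite").save("/data/delta_table_of_dog_owners")
```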

Steps to reproduce

Via the repo provided:

  1. Clone the repo.
  2. Run the notebook. Observe that Delta tables are written successfully.
  3. Delete everything in the notebook's data folder.
  4. Remove the volume mount ./../../notebook-data-lake/data:/data from infra-delta-lake/localhost/docker-compose.yml (line 63), which prevents the notebook from accessing /data at the same target shared with the Spark master and workers on their local filesystems (see the compose sketch below).
  5. Re-run the notebook and observe that the write now fails.
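
For illustration, here is a sketch of the relevant volume mounts in infra-delta-lake/localhost/docker-compose.yml. The service names are assumptions; the notebook mount is the one removed in step 4:

```yaml
# Sketch only — service names are assumed; the mount paths mirror the repro.
services:
  spark-master:
    volumes:
      - ./../../notebook-data-lake/data:/data   # master can reach /data
  spark-worker:
    volumes:
      - ./../../notebook-data-lake/data:/data   # workers can reach /data
  notebook:
    volumes:
      - ./../../notebook-data-lake/data:/data   # removing this line reproduces the failure
```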

Observed results

When the notebook has access to /data (but is a connected application, not a member of the cluster), Delta tables write successfully, including _delta_log.

When the notebook does not have access to /data, the write complains that it can't create _delta_log, but the Parquet data files still get written!

Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.sql.delta.DeltaIOException: [DELTA_CANNOT_CREATE_LOG_PATH] Cannot create file:/data/delta_table_of_dog_owners/_delta_log
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException(DeltaErrors.scala:1534)
    at org.apache.spark.sql.delta.DeltaErrorsBase.cannotCreateLogPathException$(DeltaErrors.scala:1533)
    at org.apache.spark.sql.delta.DeltaErrors$.cannotCreateLogPathException(DeltaErrors.scala:3203)
    at org.apache.spark.sql.delta.DeltaLog.createDirIfNotExists$1(DeltaLog.scala:443)

Expected results

Expect the _delta_log to be written regardless of whether the Notebook has access to the target filesystem.

Further details

Since this error surfaces from PySpark, I'm wondering whether the Notebook instance is somehow electing itself as the master via PySpark, or whether there's a bug in Delta Lake where you can't write Delta tables without the application call site having access to the location. Neither of these sounds right, but I can't think of a third explanation.

Feel free to have a gander or submit a PR 🙏 !

Environment information

  • Delta Lake version: 3.1.0
  • Spark version: 3.5.1
  • Scala version: 2.12
@caldempsey caldempsey changed the title [] Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access [bug] Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access Mar 2, 2024
@caldempsey caldempsey changed the title [bug] Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access [BUG] Clustered Spark fails to write _delta_log via a Notebook without granting the Notebook data access Mar 2, 2024
@caldempsey caldempsey self-assigned this Mar 2, 2024
@caldempsey caldempsey added the bug Something isn't working label Mar 2, 2024
@caldempsey
Owner Author

This needs to be fixed with a refinement to the overall architecture as above. It's not a bug: when connecting to a standalone cluster, PySpark can only run in client mode, so the driver lives in the notebook process, and the Delta log commit (which the driver performs) needs access to the data location.
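
A quick way to see this from the notebook session (sketch; these are standard PySpark attributes):

```python
# Against a standalone master, PySpark runs the driver here in the notebook
# process (client mode), so the Delta log commit happens locally.
print(spark.sparkContext.master)      # e.g. spark://spark-master:7077
print(spark.sparkContext.deployMode)  # "client" — the notebook is the driver
```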

@caldempsey caldempsey reopened this Mar 6, 2024
@caldempsey
Owner Author

Databricks might also have a solution for this with their latest DataLake connectors. Kind of a game-changer in the space. Something to read up on.
