Using a forward proxy for accessing data files from blob storage #649

Open
igor-lobanov-maersk opened this issue Jan 17, 2025 · 1 comment
Labels
enhancement New feature or request

Comments


igor-lobanov-maersk commented Jan 17, 2025

Please describe why this is necessary.

This was originally raised as duckdb/duckdb-delta#135.

We have a requirement to provide a lookup API on top of data sitting in a Delta Lake table in cloud blob storage (an Azure Data Lake Storage Gen2 account in our case). We are considering putting duckdb straight on top of ADLS via the duckdb-delta extension, which uses delta kernel under the hood.

Most API calls would be clustered around a small subset of the data, so we are likely to have a few 'hot' data files in the Delta Lake table getting most of the traffic. With a large number of repeated queries, there would be a lot of redundant blob storage reads, and there is also a risk that Azure would throttle or fail the requests. I learned from the duckdb team that duckdb does not currently cache locally any data it fetches from blob storage as part of a delta_scan operation, and that it relies on delta kernel to do all such IO.

One apparently viable strategy would be to put a forward HTTP proxy (e.g., squid) between delta kernel and the blob storage, so that it could serve the hot data files from its cache. Whilst duckdb can be configured to use a forward proxy, it seems that delta kernel does not honour that configuration and connects to the blob storage directly.

Describe the functionality you are proposing.

Expose configuration options for delta kernel to use a forward HTTP proxy for accessing a blob storage.

Direct and indirect users of delta kernel should be able to set these options directly (e.g., via environment variables).

Developers of software using delta kernel internally should be able to set these options via delta kernel API to honour the configuration provided by the users of their software, e.g. duckdb-delta should be passing forward proxy settings of duckdb to the delta kernel.

Additional context

Based on my research, there are limited options available for those looking to provide a lookup API on top of data in a Delta Lake table without copying that data elsewhere (e.g., into a DBMS) or creating some kind of storage-aware cache, and either of those adds considerable complexity. There is also ROAPI, but it does not seem particularly lightweight, nor is it mature. Using a forward proxy to cache data files may be an attractive alternative for implementing such a solution generically with mature tools that are already available.

Collaborator

roeap commented Jan 18, 2025

Thanks for reporting this @igor-lobanov-maersk.

I also left a comment on the upstream issue. In short, duckdb could build that, but we can also include this in the default engine in this repo. In fact, the default client implementation is due for an update, specifically in how we handle object stores.

That said, you can already pass in configuration for the underlying HTTP client, which includes some config for a proxy.

https://github.com/apache/arrow-rs/blob/af777cd53e56f8382382137b6e08af249c475397/object_store/src/client/mod.rs#L146-L174

Would that already serve your needs?
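
To make the suggestion above concrete, here is a minimal, std-only sketch of how a proxy setting might be turned into string key/value options for the HTTP client. The helper name `proxy_options` is hypothetical, and the option key `proxy_url` is an assumption that should be checked against the client config keys in the linked source. How the resulting options reach the kernel's default engine is not shown here and depends on the engine API in use.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of delta kernel or object_store):
/// turns an optional proxy URL into string key/value client options.
///
/// The key name "proxy_url" is an assumption; verify it against the
/// client configuration keys in the linked object_store source.
fn proxy_options(proxy_url: Option<&str>) -> HashMap<String, String> {
    let mut opts = HashMap::new();
    // Ignore missing, empty, or whitespace-only values so that no proxy
    // option is emitted unless one was actually configured.
    if let Some(url) = proxy_url.map(str::trim).filter(|u| !u.is_empty()) {
        opts.insert("proxy_url".to_string(), url.to_string());
    }
    opts
}
```

A caller could, for example, read the proxy from an environment variable with `std::env::var("HTTPS_PROXY").ok()` and pass `proxy_options(value.as_deref())` along with its other storage options. An embedding application such as duckdb-delta could instead feed in the proxy value from its own configuration.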
