Commit
Add GCS Clickhouse staging docs #1055
Signed-off-by: Marcel Coetzee <[email protected]>
Showing 1 changed file with 56 additions and 9 deletions.

@@ -68,7 +68,6 @@ To load data into ClickHouse, you need to create a ClickHouse database. While we
secure = 1 # Set to 1 if using HTTPS, else 0.
dataset_table_separator = "___" # Separator for dataset table names from dataset.
```

2. You can pass a database connection string similar to the one used by the `clickhouse-driver` library. The credentials above will look like this:

```toml
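# A hedged sketch, not part of the original diff: assuming the credentials
# shown above, the connection string would look roughly like this.
destination.clickhouse.credentials = "clickhouse://dlt:Dlt*12345789234567@localhost:9440/dlt?secure=1"
```
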
@@ -92,7 +91,8 @@ Data is loaded into ClickHouse using the most efficient method depending on the

`Clickhouse` does not support multiple datasets in one database; dlt, however, relies on datasets to exist for multiple reasons.
To make `clickhouse` work with `dlt`, tables generated by `dlt` in your `clickhouse` database will have their names prefixed with the dataset name, separated by
the configurable `dataset_table_separator`. Additionally, a special sentinel table that does not contain any data will be created, so dlt knows which virtual
datasets already exist in a clickhouse destination.
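
For illustration, with the default separator a resource named `users` loaded into a dataset called `my_dataset` ends up in a ClickHouse table named `my_dataset___users`. A minimal sketch (pipeline, dataset, and resource names are illustrative, not taken from this diff):

```py
import dlt

# Illustrative only: with dataset_name="my_dataset" and the default
# dataset_table_separator "___", a resource named "users" is created in
# ClickHouse as the table "my_dataset___users", alongside the sentinel table.
pipeline = dlt.pipeline(
    pipeline_name="clickhouse_naming_example",
    destination="clickhouse",
    dataset_name="my_dataset",
)
```
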
## Supported file formats

@@ -102,16 +102,16 @@ destination.

The `clickhouse` destination has a few specific deviations from the default sql destinations:

1. `Clickhouse` has an experimental `object` datatype, but we have found it to be a bit unpredictable, so the dlt clickhouse destination will load the complex datatype to a `text` column. If you need
   this feature, get in touch with our Slack community, and we will consider adding it.
2. `Clickhouse` does not support the `time` datatype. Time will be loaded to a `text` column.
3. `Clickhouse` does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from `jsonl`, this will be a base64 string; when loading from parquet, this will be
   the `binary` object converted to `text`.
4. `Clickhouse` accepts adding columns to a populated table that are not null.
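
As a concrete illustration of the type mappings above, a resource can declare `complex` and `binary` columns explicitly; both end up as `text` columns in ClickHouse. A minimal sketch (resource and column names are hypothetical, not from this diff):

```py
import dlt

# Hypothetical resource: per the deviations above, the "payload" complex column
# and the "raw" binary column are both loaded into ClickHouse as text.
@dlt.resource(columns={"payload": {"data_type": "complex"}, "raw": {"data_type": "binary"}})
def events():
    yield {"payload": {"a": 1}, "raw": b"\x00\x01"}
```
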
## Supported column hints

ClickHouse supports the following [column hints](../../general-usage/schema#tables-and-columns):

- `primary_key` - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key (see the sketch below).
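
A minimal sketch of declaring a composite primary key hint on a resource (resource and column names are hypothetical, not from this diff):

```py
import dlt

# Both columns carry the primary_key hint, so together they form a
# composite primary key on the resulting table.
@dlt.resource(primary_key=("customer_id", "order_id"))
def orders():
    yield {"customer_id": 1, "order_id": 42, "amount": 10.5}
```
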
@@ -122,10 +122,12 @@ By default, tables are created using the `ReplicatedMergeTree` table engine in C

```py
import dlt
from dlt.destinations.adapters import clickhouse_adapter


@dlt.resource()
def my_resource():
    ...


# Create this resource's table with the MergeTree engine.
clickhouse_adapter(my_resource, table_engine_type="merge_tree")
```

@@ -158,9 +160,54 @@ pipeline = dlt.pipeline(
)
```

### Using Google Cloud Storage as a staging area

dlt supports using Google Cloud Storage (GCS) as a staging area when loading data into ClickHouse. This is handled automatically by
ClickHouse's [GCS table function](https://clickhouse.com/docs/en/sql-reference/table-functions/gcs), which dlt uses under the hood.

The GCS table function only supports authentication with Hash-based Message Authentication Code (HMAC) keys. For this, GCS provides an S3 compatibility mode that emulates the Amazon S3
API, and ClickHouse takes advantage of it to access GCS buckets via its S3 integration.

To set up GCS staging with HMAC authentication in dlt:

1. Create HMAC keys for your GCS service account by following the [Google Cloud guide](https://cloud.google.com/storage/docs/authentication/managing-hmackeys#create).

2. Configure the HMAC keys, as well as the `client_email`, `project_id`, and `private_key` for your service account, in your dlt project's ClickHouse destination settings in `config.toml`:

```toml
[destination.filesystem]
bucket_url = "gs://dlt-ci"

[destination.filesystem.credentials]
project_id = "a-cool-project"
client_email = "[email protected]"
private_key = "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkaslkdjflasjnkdcopauihj...wEiEx7y+mx\nNffxQBqVVej2n/D93xY99pM=\n-----END PRIVATE KEY-----\n"

[destination.clickhouse.credentials]
database = "dlt"
username = "dlt"
password = "Dlt*12345789234567"
host = "localhost"
port = 9440
secure = 1
gcp_access_key_id = "JFJ$$*f2058024835jFffsadf"
gcp_secret_access_key = "DFJdwslf2hf57)%$02jaflsedjfasoi"
```

Note: In addition to the HMAC keys (`gcp_access_key_id` and `gcp_secret_access_key`), you now need to provide the `client_email`, `project_id`, and `private_key` for your service account
under `[destination.filesystem.credentials]`.
This is because GCS staging support is currently implemented as a temporary workaround and is still unoptimized.

dlt will pass these credentials to ClickHouse, which will handle authentication and GCS access.
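
With the credentials above in place, a pipeline that stages data through GCS before loading into ClickHouse could be configured roughly as in this sketch (pipeline and dataset names are illustrative, not from this diff):

```py
import dlt

# A minimal sketch, assuming the GCS bucket under [destination.filesystem] and
# the clickhouse credentials shown above are configured.
pipeline = dlt.pipeline(
    pipeline_name="gcs_staging_example",
    destination="clickhouse",
    staging="filesystem",  # stage load files in the configured GCS bucket
    dataset_name="example_data",
)
```
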
There is active work in progress to simplify and improve the GCS staging setup for the ClickHouse dlt destination. Proper GCS staging support is being tracked in these GitHub issues:

- [Make filesystem destination work with gcs in s3 compatibility mode](https://github.com/dlt-hub/dlt/issues/1272)
- [GCS staging area support](https://github.com/dlt-hub/dlt/issues/1181)

### dbt support

Integration with [dbt](../transformations/dbt/dbt.md) is generally supported via dbt-clickhouse, but not tested by us.

### Syncing of `dlt` state