Commit

fix typo toml->yaml and add caution
AstrakhantsevaAA committed Nov 8, 2023
1 parent 46d65ae commit 1102c39
Showing 1 changed file with 36 additions and 21 deletions.
57 changes: 36 additions & 21 deletions docs/website/docs/walkthroughs/adjust-a-schema.md
```python
dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination='duckdb',
    dataset_name="games_data",
)
```

The following folder structure in the project root folder will be created:

```
schemas
  |---import/
  |---export/
```
## 2. Run the pipeline to see the schemas

To see the schemas, you must run your pipeline again. The `schemas` and `import`/`export`
directories will be created. In each directory, you'll see a `yaml` file (e.g. `chess.schema.yaml`).

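For example, a run could look like the following. This is a minimal sketch that assumes the `chess` source function from the previous walkthrough; the player names and date range are illustrative:

```python
import dlt
from chess import chess  # the source from the previous walkthrough (module and signature assumed)

pipeline = dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination='duckdb',
    dataset_name="games_data",
)
# running the pipeline creates the schemas/import and schemas/export folders
# and writes a chess.schema.yaml file into each of them
info = pipeline.run(chess(["magnuscarlsen", "rpragchess"], start_month="2022/11", end_month="2022/12"))
print(info)
```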
Look at the export schema (in the export folder): this is the schema that got inferred from the data
and was used to load it into the destination (e.g. `duckdb`).
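
You can also inspect the same inferred schema from Python; a small sketch assuming the `pipeline` object from the run above (`default_schema` and `to_pretty_yaml` are dlt's accessors for the live schema):

```python
# prints the schema that dlt inferred from the chess data, as yaml
print(pipeline.default_schema.to_pretty_yaml())
```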

## 3. Make changes in import schema

The import schema (in the import folder) contains only the tables and hints that were explicitly
declared in the `chess` source. You'll use this schema as the base for your
modifications, typically by pasting relevant snippets from your export schema and modifying them.
You should keep the import schema as simple as possible and let `dlt` do the rest.

💡 How importing a schema works:

1. When a new pipeline is created and the source function is extracted for the first time, a new
schema is added to the pipeline. This schema is created out of global hints and resource hints
present in the source extractor function.
1. Every such new schema will be saved to the `import` folder (if it does not exist there already)
and used as the initial version for all future pipeline runs.
1. Once a schema is present in the `import` folder, **it is writable by the user only**.
1. Any changes to the schemas in that folder are detected and propagated to the pipeline
   automatically on the next run. This means that after a user update, the schema in the `import`
   folder reverts all the automatic updates from the data.

In the next steps we'll experiment a lot, so you will be warned to set `full_refresh=True` until we are done experimenting.

:::caution
`dlt` will **not modify** tables after they are created.
So if you change a `yaml` schema file (e.g. change a data type or add a hint),
you need to **delete the dataset** or set `full_refresh=True`:

```python
dlt.pipeline(
    import_schema_path="schemas/import",
    export_schema_path="schemas/export",
    pipeline_name="chess_pipeline",
    destination='duckdb',
    dataset_name="games_data",
    full_refresh=True,
)
```
:::

### Change the data type

In the import schema, find the `players_games` table and change the data type of its `end_time`
column to `timestamp`:

```yaml
players_games:
  columns:
    end_time:
      data_type: timestamp
```
Run the pipeline script again and make sure that the change is visible in the export schema. Then,
[launch the Streamlit app](../dlt-ecosystem/visualizations/exploring-the-data.md) to see the changed data.
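
If you'd rather verify from Python, here is a quick sketch (assuming the `chess_pipeline` created above; `dlt.attach` reconnects to an existing pipeline and `sql_client` opens a connection to the `duckdb` destination):

```python
import dlt

# attach to the existing pipeline by name and query the loaded table
pipeline = dlt.attach(pipeline_name="chess_pipeline")
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT end_time FROM players_games LIMIT 5")
    print(rows)  # values should now be timestamps, not integers
```
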
### Load the `white` and `black` fields as complex columns

In the export schema, you can see that the `white` and `black` object fields from the chess data
were flattened into separate columns, such as:

```yaml
white__aid:
  data_type: text
```

For some reason, you'd rather deal with a single JSON (or struct) column. Just declare the `white`
column as `complex`, which will instruct `dlt` not to flatten it (or not to convert it into a child
table in the case of a list). Do the same with the `black` column:

```yaml
players_games:
  columns:
    white:
      data_type: complex
    black:
      data_type: complex
```

## 4. Keep your import schema

Just add and push the import folder to git. It will be used automatically when cloned. Alternatively,
[bundle such schema with your source](../general-usage/schema.md#attaching-schemas-to-sources).
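
For the bundling route, here is a minimal sketch under stated assumptions: the schema file lives at the path from step 1, `Schema.from_dict` and the `schema` argument of `dlt.source` are the relevant hooks, and the resource body is a placeholder rather than the real chess.com client:

```python
import yaml

import dlt
from dlt.common.schema import Schema

# load the schema yaml that you keep under version control (path assumed)
with open("schemas/import/chess.schema.yaml", encoding="utf-8") as f:
    chess_schema = Schema.from_dict(yaml.safe_load(f))

@dlt.source(schema=chess_schema)
def chess(players):
    @dlt.resource(write_disposition="append")
    def players_games():
        # placeholder rows; the real source yields game records from the chess.com API
        yield from ({"username": p} for p in players)

    return players_games
```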
