Skip to content

Commit

Permalink
Docs/adding FAQ questions (#1204)
Browse files Browse the repository at this point in the history
* Added 9 questions to FAQs

* Updated

* Updated

* Updated as per comments by @burnash

* Update frequently-asked-questions.md

* Updated FAQ

* Update frequently-asked-questions.md

Removing some questions.

* Update frequently-asked-questions.md

Adding type py for code snippet

* Update frequently-asked-questions.md

---------

Co-authored-by: Zaeem Athar <[email protected]>
  • Loading branch information
dat-a-man and zem360 authored Apr 25, 2024
1 parent df6282b commit 25a3321
Show file tree
Hide file tree
Showing 2 changed files with 112 additions and 0 deletions.
111 changes: 111 additions & 0 deletions docs/website/docs/reference/frequently-asked-questions.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
---
title: Frequently Asked Questions
description: Questions asked frequently by users in technical help or github issues
keywords: [faq, usage information, technical help]
---


## Can I configure different nesting levels for each resource?

Currently, configuring different nesting levels for each resource directly is not supported, but there's an open GitHub issue ([#945](https://github.com/dlt-hub/dlt/issues/945)) addressing this. Meanwhile, you can use these workarounds.

Resources can be separated based on nesting needs, for example, separate the execution of pipelines for resources based on their required maximum table nesting levels.

**For resources that don't require nesting (resource1, resource2), configure nesting as:**

```py
source_data = my_source.with_resources("resource1", "resource2")
source_data.max_table_nesting = 0
load_info = pipeline.run(source_data)
```

**For resources that require deeper nesting (resource3, resource4), configure nesting as:**

```py
source_data = my_source.with_resources("resource3", "resource4")
source_data.max_table_nesting = 2
load_info = pipeline.run(source_data)
```

**Apply hints for complex columns**
If certain columns should not be normalized, you can mark them as `complex`. This can be done in two ways.

1. When fetching the source data.
```py
source_data = my_source()
source_data.resource3.apply_hints(
columns={
"column_name": {
"data_type": "complex"
}
}
)
```

1. During resource definition.
```py
@dlt.resource(columns={"column_name": {"data_type": "complex"}})
def my_resource():
# Function body goes here
pass
```
In this scenario, the specified column (column_name) will not be broken down into nested tables, regardless of the data structure.

These methods allow for a degree of customization in handling data structure and loading processes, serving as interim solutions until a direct feature is developed.

## Can I configure dlt to load data in chunks of 10,000 records for more efficient processing, and how does this affect data resumption and retries in case of failures?

`dlt` buffers to disk, and has built-in resume and retry mechanisms. This makes it less beneficial to manually manage atomicity after the fact unless you're running serverless. If you choose to load every 10k records instead, you could potentially see benefits like quicker data arrival if you're actively reading, and easier resumption from the last loaded point in case of failure, assuming that state is well-managed and records are sorted.

It's worth noting that `dlt` includes a request library replacement with [built-in retries](../../docs/reference/performance#using-the-built-in-requests-client). This means if you pull 10 million records individually, your data should remain safe even in the face of network issues. To resume jobs after a failure, however, it's necessary to run the pipeline in its own virtual machine (VM). Ephemeral storage solutions like Cloud Run don't support job resumption.

## How to contribute a verified source?

To submit your source code to our verified source repository, please refer to [this guide](https://github.com/dlt-hub/verified-sources/blob/master/CONTRIBUTING.md).

## If you don't have the source required listed in verified sources?

In case you don't have the source required listed in the [verified sources](../../docs/dlt-ecosystem/verified-sources/), you could create your own pipeline by referring to the [following docs](../../docs/walkthroughs/create-a-pipeline).

## How can I retrieve the complete schema of a `dlt` source?

To retrieve the complete schema of a `dlt` source as `dlt` recognizes it, you need to execute both the extract and normalization steps. This process can be conducted on a small sample size to avoid processing large volumes of data. You can limit the sample size by applying an `add_limit` to your resources. Here is a sample code snippet demonstrating this process:

```py
# Initialize a new DLT pipeline with specified properties
p = dlt.pipeline(
pipeline_name="my_pipeline",
destination="duckdb",
dataset_name="my_dataset"
)

# Extract data using the predefined source `my_source`
p.extract(my_source().add_limit(10))

# Normalize the data structure for consistency
p.normalize()

# Print the default schema of the pipeline in pretty YAML format for review
print(p.default_schema.to_pretty_yaml())
```

This method ensures you obtain the full schema details, including all columns, as seen by `dlt`, without needing a predefined contract on these resources.

## Is truncating or deleting a staging table safe?

You can safely truncate those or even drop the whole staging dataset. However, it will have to be recreated on the next load and might incur extra loading time or cost.
You can also delete it with Python using [Bigquery client.](https://cloud.google.com/bigquery/docs/samples/bigquery-delete-dataset#bigquery_delete_dataset-python)

## How can I develop a "custom" pagination tracker?

You can use `dlt.sources.incremental` to create a custom cursor for tracking pagination in data streams that lack a specific cursor field. An example can be found in the [Incremental loading with a cursor](https://deploy-preview-1204--dlt-hub-docs.netlify.app/docs/general-usage/incremental-loading#incremental-loading-with-a-cursor-field).

Alternatively, you can manage the state directly in Python. You can access and modify the state like a standard Python dictionary:
```py
state = dlt.current.resource_state()
state["your_custom_key"] = "your_value"
```
This method allows you to create custom pagination logic based on your requirements. An example of using `resource_state()` for pagination can be found [here](https://dlthub.com/docs/general-usage/incremental-loading#custom-incremental-loading-with-pipeline-state).

However, be cautious about overusing the state dictionary, especially in cases involving substreams for each user, as it might become unwieldy. A better strategy might involve tracking users incrementally. Then, upon updates, you only refresh the affected users' substreams entirely. This consideration helps maintain efficiency and manageability in your custom pagination implementation.

1 change: 1 addition & 0 deletions docs/website/sidebars.js
Original file line number Diff line number Diff line change
Expand Up @@ -305,6 +305,7 @@ const sidebars = {
'reference/installation',
'reference/command-line-interface',
'reference/telemetry',
'reference/frequently-asked-questions',
'general-usage/glossary',
],
},
Expand Down

0 comments on commit 25a3321

Please sign in to comment.