Improve memory efficiency of process_results by iterating. #217

Merged (2 commits) on May 16, 2024

@peterallenwebb peterallenwebb commented May 16, 2024

resolves #218

Problem

Returning query results from execute() is memory-inefficient, because multiple intermediate copies of the result data are held in memory simultaneously.

In the case of docs generate, we are sometimes querying for information about every column in a schema. This can mean that a million or more records are returned in more extreme cases, resulting in gigabytes of memory allocation. In this scenario, maintaining multiple copies of the results, even temporarily, is untenable.
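To make the cost of an intermediate copy concrete, here is a minimal sketch that compares peak memory with and without one, using the standard library's tracemalloc. The row shape and the helper names (fake_cursor_rows, with_intermediate_copy, streaming) are hypothetical stand-ins for illustration, not the adapter code, and the numbers it prints are not the measurements reported in this PR.

```python
import tracemalloc
from typing import Callable, Iterator, Tuple

COLUMNS = ("table_schema", "table_name", "column_name")


def fake_cursor_rows(n: int) -> Iterator[Tuple[str, str, str]]:
    """Hypothetical stand-in for rows fetched from a database cursor."""
    for i in range(n):
        yield ("analytics", f"table_{i}", f"column_{i}")


def with_intermediate_copy(n: int = 50_000) -> int:
    """Materialize the raw rows, then build a second, processed list."""
    raw = list(fake_cursor_rows(n))  # first full copy
    processed = [dict(zip(COLUMNS, row)) for row in raw]  # second full copy
    return len(processed)


def streaming(n: int = 50_000) -> int:
    """Process rows as they arrive, so only one full copy ever exists."""
    processed = [dict(zip(COLUMNS, row)) for row in fake_cursor_rows(n)]
    return len(processed)


def peak_bytes(fn: Callable[[], int]) -> int:
    """Return the peak traced allocation, in bytes, while fn runs."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


if __name__ == "__main__":
    print("with intermediate copy:", peak_bytes(with_intermediate_copy))
    print("streaming:             ", peak_bytes(streaming))
```

Both functions end up with the same processed rows; the only difference is whether the raw rows are also kept alive while the processed ones are built.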

Solution

Yield data rows one by one from process_results() rather than returning every row as a list, to eliminate one full copy of the result table. We could still do more work in this direction, but I documented a 33% reduction in memory associated with the get_catalog query with this approach.
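For readers unfamiliar with the pattern, the change amounts to roughly the following. This is a simplified sketch with hypothetical function names and signatures, not the actual diff: the real process_results() lives in the adapter code and may differ in its arguments and row handling.

```python
from typing import Any, Dict, Iterable, Iterator, List, Sequence

Row = Sequence[Any]


def process_results_as_list(
    column_names: Sequence[str], rows: Iterable[Row]
) -> List[Dict[str, Any]]:
    """Eager version: builds and returns every processed row at once."""
    return [dict(zip(column_names, row)) for row in rows]


def process_results_as_iterator(
    column_names: Sequence[str], rows: Iterable[Row]
) -> Iterator[Dict[str, Any]]:
    """Lazy version: yields one processed row at a time."""
    for row in rows:
        yield dict(zip(column_names, row))


if __name__ == "__main__":
    columns = ["table_name", "column_name"]
    rows = [("orders", "id"), ("orders", "amount")]
    # The caller consumes rows as they are produced; no full list of
    # processed rows is ever held by the generator itself.
    for record in process_results_as_iterator(columns, rows):
        print(record)
```

One caveat of the lazy form: a generator can be consumed only once, so any caller that needs len() or a second pass has to materialize it explicitly.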

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development, and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc.) or this PR has already received feedback and approval from Product or DX

@cla-bot (bot) added the cla:yes label (The PR author has signed the CLA) May 16, 2024

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@peterallenwebb peterallenwebb marked this pull request as ready for review May 16, 2024 21:04
@peterallenwebb peterallenwebb requested a review from a team as a code owner May 16, 2024 21:04

@ChenyuLInx ChenyuLInx left a comment

Thanks Peter!

@colin-rogers-dbt colin-rogers-dbt merged commit fd33aaf into main May 16, 2024
15 checks passed
@colin-rogers-dbt colin-rogers-dbt deleted the paw/process-results-iteration branch May 16, 2024 21:27
Labels
cla:yes The PR author has signed the CLA
Development

Successfully merging this pull request may close these issues.

Reduce Memory Use in process_results()
3 participants