Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store the compiled code from hooks, or statistics on it #428

Open
matt-winkler opened this issue Feb 20, 2024 · 8 comments
Open

Store the compiled code from hooks, or statistics on it #428

matt-winkler opened this issue Feb 20, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@matt-winkler
Copy link

Describe the feature

As a project admin, I want to be able to analyze the code written in hooks on dbt models. dbt hooks can execute arbitrary SQL statements. The goal of capturing this data is to be able to catch situations in which business logic is embedded in hooks instead of being coded in dbt models themselves.

Describe alternatives you've considered

  • Linters might be able to achieve something similar. Scanning SQL Fluff's dbt templater I didn't see any obvious controls for hooks.

  • If storing the SQL code itself is problematic because of practical, security reasons, it might be useful to compute summary statistics e.g. total number of lines

Additional context

By having additional tools to align the dbt codebase, dbt users can better leverage additional features like restarting jobs from the point of failure.

Who will this benefit?

Groups migrating from legacy SQL systems to dbt.

Are you interested in contributing this feature?

Yes

@matt-winkler matt-winkler added the enhancement New feature or request label Feb 20, 2024
@dave-connors-3
Copy link
Collaborator

million dollar matt! love this idea -- we rely on the graph variable for all the unpacking of project data, and I am not positive that hooks get written there. I can do some digging, but we may be hamstrung here!

@dave-connors-3
Copy link
Collaborator

I may have spoken too soon -- looks like we have uncompiled code in the node config!

what kind of analysis do you anticipate wanting to do on the sql in hooks?

image

@matt-winkler
Copy link
Author

matt-winkler commented Apr 11, 2024

Nice! Thank you for researching. Off the top:

  • num_lines (INT)
  • calls_other_macros (bool)
  • contains_self_reference (bool) --> to indicate the hook processes data in the table managed by the dbt model itself
  • performs_insert (bool)
  • performs_delete (bool)
  • performs_update (bool)
  • performs_truncate (bool)
  • performs_create_table (bool)
  • calls_stored_procs (bool)

@algarbar
Copy link

(+1) - Thanks @matt-winkler for thinking of this idea!

TL;DR:
This is a great! Baking guardrails into dbt for hooks will yield a number of benefits such as proper coding, code visibility, and contract effectiveness.

Re-Framing

Perhaps we reframe the feature to befit dbt's larger, overall view of code compilation/visibility and model contracts. We can do this while still reserving actionable progress to the feature's focus of intent- hooks.

The feature, as I understand it, is not limited to project admins alone. The goal appears to be to gaurdrail developers to patterns aligned to a dbt ethos. Any code can go into a hook. But, it probably shouldn't.

RE Code compilation/visibility:

Code compilation of hook components does not alter the DAG in ways perhaps intended by the developer. Dependencies can be created within hooks that are left unexpressed. That's bad.

Furthermore, compiled code is presented without principal components of execution. An update statement in a hook, a workhorse macro, ... many components are left out of visiblity in terms of what will* be executed (I.E. The compiled and parsed code). dbt simply doesn't know what will be executed in such code. However, dbt could* surface visbility that {some} code will be executed.

RE Model Contracts

You can't have a contract if there is work on its assets outside visibility!

Matt's listed examples illustrate how use of hooks can unintentionally (or intentionally) obfuscate code and dilute contact fulfillment.

@dave-connors-3:

what kind of analysis do you anticipate wanting to do on the sql in hooks?

Analysis would extend to each of the model contract focus domains. Quality: If a hook has a DML (which is effectively a hidden model), does it have a test? Observability: How does code in a hook impact the DAG? etc...

Example of dbt's model contracts focus domains include:

  • Data/Code Quality
  • Data Observability
  • Data Profiling
  • Data Freshness
  • Consumption
  • Documentation:
    • Service Level Agreements
    • Pricing
    • Field Level Documentation
    • System Location (HANA/TD/BQ/SF)

Outcome

Building guardrails will entail surfacing warnings/errors related to hook usage. Enhancing how the node config is parsed, compiled, and presented will advance dbt's mission (E.G. in their contracts) and community packages (E.G. project evaluator) in their scope/effectiveness.

@dave-connors-3
Copy link
Collaborator

@matt-winkler @algarbar i spent some time on this today, and there are a couple caveats to what we have available that make me wonder if it's worth investing in!

1. the graph object only contains the raw sql in post hooks.

If a hook calls a macro, all we'll see in the graph is {{ my_macro() }} -- I presume this dilutes a good amount of the potential value here since my assumption is that a lot of these "shadow models" are parametrized and packed into a macro .

2. we already collect hard coded references

Because we parse the entirety of the raw sql in our fct_hard_coded_references model, if the user passes the hook into the {{ config() }} macro, we would detect those references in our regex logic.

Given these 2 things, i'm wondering what we should do next! open to thoughts and suggestions

@matt-winkler
Copy link
Author

Hey @dave-connors-3 I'm not sure I agree with the assumption in point 1 above. A major use case for this is identifying when migrations to dbt from other transformation tools go awry, or aren't done according to dbt best practices. Many of those other tools don't include macros like dbt conceives of them at all. It does make sense that e.g. permissions grants via macros might not be as inspectable with this, but that's a different use case than what I (at least) considered for this.

Good point on hard coded references. Do you know if there's a way we can identify those hard coded references are created VIA a hook specifically? More curious on this one.

Copy link
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 10, 2024
@algarbar
Copy link

tl;dr

Use job artifacts for this insight

Progress to Date

A great deal of work has been done to resolve this need through an alternative avenue (DM for link).

Although not a perfect solution, successful runs generate artifacts (E.G. manifest.json & run_results.json) with sufficient detail. Using an automated system to collect and store all metadata including those artifacts in a data warehouse, dbt administrators can analyze these patterns and open the data for insight.

Pros

  • Easily automated data collection
  • Most node information remains available

Cons

  • Insights cannot be captured prior to job execution
  • Insights only possible where artifacts are generated
  • High degree of effort to parse results

@github-actions github-actions bot removed the Stale label Nov 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants