Store the compiled code from hooks, or statistics on it #428

matt-winkler · 2024-02-20T23:15:24Z

Describe the feature

As a project admin, I want to be able to analyze the code written in hooks on dbt models. dbt hooks can execute arbitrary SQL statements. The goal of capturing this data is to be able to catch situations in which business logic is embedded in hooks instead of being coded in dbt models themselves.

Describe alternatives you've considered

Linters might be able to achieve something similar. Scanning SQL Fluff's dbt templater I didn't see any obvious controls for hooks.
If storing the SQL code itself is problematic because of practical, security reasons, it might be useful to compute summary statistics e.g. total number of lines

Additional context

By having additional tools to align the dbt codebase, dbt users can better leverage additional features like restarting jobs from the point of failure.

Who will this benefit?

Groups migrating from legacy SQL systems to dbt.

Are you interested in contributing this feature?

Yes

dave-connors-3 · 2024-04-11T16:15:02Z

million dollar matt! love this idea -- we rely on the graph variable for all the unpacking of project data, and I am not positive that hooks get written there. I can do some digging, but we may be hamstrung here!

dave-connors-3 · 2024-04-11T16:21:22Z

I may have spoken too soon -- looks like we have uncompiled code in the node config!

what kind of analysis do you anticipate wanting to do on the sql in hooks?

matt-winkler · 2024-04-11T16:42:33Z

Nice! Thank you for researching. Off the top:

num_lines (INT)
calls_other_macros (bool)
contains_self_reference (bool) --> to indicate the hook processes data in the table managed by the dbt model itself
performs_insert (bool)
performs_delete (bool)
performs_update (bool)
performs_truncate (bool)
performs_create_table (bool)
calls_stored_procs (bool)

algarbar · 2024-04-11T20:17:05Z

(+1) - Thanks @matt-winkler for thinking of this idea!

TL;DR:
This is a great! Baking guardrails into dbt for hooks will yield a number of benefits such as proper coding, code visibility, and contract effectiveness.

Re-Framing

Perhaps we reframe the feature to befit dbt's larger, overall view of code compilation/visibility and model contracts. We can do this while still reserving actionable progress to the feature's focus of intent- hooks.

The feature, as I understand it, is not limited to project admins alone. The goal appears to be to gaurdrail developers to patterns aligned to a dbt ethos. Any code can go into a hook. But, it probably shouldn't.

RE Code compilation/visibility:

Code compilation of hook components does not alter the DAG in ways perhaps intended by the developer. Dependencies can be created within hooks that are left unexpressed. That's bad.

Furthermore, compiled code is presented without principal components of execution. An update statement in a hook, a workhorse macro, ... many components are left out of visiblity in terms of what will* be executed (I.E. The compiled and parsed code). dbt simply doesn't know what will be executed in such code. However, dbt could* surface visbility that {some} code will be executed.

RE Model Contracts

You can't have a contract if there is work on its assets outside visibility!

Matt's listed examples illustrate how use of hooks can unintentionally (or intentionally) obfuscate code and dilute contact fulfillment.

@dave-connors-3:

what kind of analysis do you anticipate wanting to do on the sql in hooks?

Analysis would extend to each of the model contract focus domains. Quality: If a hook has a DML (which is effectively a hidden model), does it have a test? Observability: How does code in a hook impact the DAG? etc...

Example of dbt's model contracts focus domains include:

Data/Code Quality
Data Observability
Data Profiling
Data Freshness
Consumption
Documentation:
- Service Level Agreements
- Pricing
- Field Level Documentation
- System Location (HANA/TD/BQ/SF)

Outcome

Building guardrails will entail surfacing warnings/errors related to hook usage. Enhancing how the node config is parsed, compiled, and presented will advance dbt's mission (E.G. in their contracts) and community packages (E.G. project evaluator) in their scope/effectiveness.

dave-connors-3 · 2024-05-02T21:53:41Z

@matt-winkler @algarbar i spent some time on this today, and there are a couple caveats to what we have available that make me wonder if it's worth investing in!

1. the graph object only contains the raw sql in post hooks.

If a hook calls a macro, all we'll see in the graph is {{ my_macro() }} -- I presume this dilutes a good amount of the potential value here since my assumption is that a lot of these "shadow models" are parametrized and packed into a macro .

2. we already collect hard coded references

Because we parse the entirety of the raw sql in our fct_hard_coded_references model, if the user passes the hook into the {{ config() }} macro, we would detect those references in our regex logic.

Given these 2 things, i'm wondering what we should do next! open to thoughts and suggestions

matt-winkler · 2024-05-14T00:19:39Z

Hey @dave-connors-3 I'm not sure I agree with the assumption in point 1 above. A major use case for this is identifying when migrations to dbt from other transformation tools go awry, or aren't done according to dbt best practices. Many of those other tools don't include macros like dbt conceives of them at all. It does make sense that e.g. permissions grants via macros might not be as inspectable with this, but that's a different use case than what I (at least) considered for this.

Good point on hard coded references. Do you know if there's a way we can identify those hard coded references are created VIA a hook specifically? More curious on this one.

github-actions · 2024-11-10T02:00:14Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

algarbar · 2024-11-12T21:44:50Z

tl;dr

Use job artifacts for this insight

Progress to Date

A great deal of work has been done to resolve this need through an alternative avenue (DM for link).

Although not a perfect solution, successful runs generate artifacts (E.G. manifest.json & run_results.json) with sufficient detail. Using an automated system to collect and store all metadata including those artifacts in a data warehouse, dbt administrators can analyze these patterns and open the data for insight.

Pros

Easily automated data collection
Most node information remains available

Cons

Insights cannot be captured prior to job execution
Insights only possible where artifacts are generated
High degree of effort to parse results

matt-winkler added the enhancement New feature or request label Feb 20, 2024

github-actions bot added the Stale label Nov 10, 2024

github-actions bot removed the Stale label Nov 13, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store the compiled code from hooks, or statistics on it #428

Store the compiled code from hooks, or statistics on it #428

matt-winkler commented Feb 20, 2024

dave-connors-3 commented Apr 11, 2024

dave-connors-3 commented Apr 11, 2024

matt-winkler commented Apr 11, 2024 •

edited

Loading

algarbar commented Apr 11, 2024

dave-connors-3 commented May 2, 2024

matt-winkler commented May 14, 2024

github-actions bot commented Nov 10, 2024

algarbar commented Nov 12, 2024

Store the compiled code from hooks, or statistics on it #428

Store the compiled code from hooks, or statistics on it #428

Comments

matt-winkler commented Feb 20, 2024

Describe the feature

Describe alternatives you've considered

Additional context

Who will this benefit?

Are you interested in contributing this feature?

dave-connors-3 commented Apr 11, 2024

dave-connors-3 commented Apr 11, 2024

matt-winkler commented Apr 11, 2024 • edited Loading

algarbar commented Apr 11, 2024

Re-Framing

RE Code compilation/visibility:

RE Model Contracts

Outcome

dave-connors-3 commented May 2, 2024

matt-winkler commented May 14, 2024

github-actions bot commented Nov 10, 2024

algarbar commented Nov 12, 2024

tl;dr

Progress to Date

Pros

Cons

matt-winkler commented Apr 11, 2024 •

edited

Loading