Describe the bug

The mutually_exclusive_ranges test currently flags duplicated zero-length ranges as failing rows, even when zero_length_range_allowed=True, which may be unexpected and/or undesired.
As a simple illustration of this, consider the following table:
trans_id | start_date | end_date
---------|------------|-----------
0        | 2000-01-01 | 2000-01-01
1        | 2000-01-01 | 2000-01-01
Mathematically speaking, these ranges are not overlapping, since both are zero-length. However, with the current implementation of mutually_exclusive_ranges, at least one of these rows is marked as a failure, which may be unexpected (it was for me at least :) ).
Although there are absolutely valid use cases for flagging duplicated zero-length rows like the above, this behavior should probably be toggleable.
Steps to reproduce
Here’s a stand-alone Python 3.10 script demonstrating this behavior, using jinja2==3.1.5, pandas==2.2.3, and duckdb==1.1.3:
from typing import Callable

import duckdb
import pandas as pd
from jinja2 import Environment


def main():
    # Copied from https://github.com/dbt-labs/dbt-utils/blob/5c9dc0d43265cb86c5f69954e6d739dc7a1974c3/macros/generic_tests/mutually_exclusive_ranges.sql#L5:
    test_template_str = """
    {% macro mutually_exclusive_ranges(model, lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) %}

    {% if gaps == 'not_allowed' %}
        {% set allow_gaps_operator='=' %}
        {% set allow_gaps_operator_in_words='equal_to' %}
    {% elif gaps == 'allowed' %}
        {% set allow_gaps_operator='<=' %}
        {% set allow_gaps_operator_in_words='less_than_or_equal_to' %}
    {% elif gaps == 'required' %}
        {% set allow_gaps_operator='<' %}
        {% set allow_gaps_operator_in_words='less_than' %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`gaps` argument for mutually_exclusive_ranges test must be one of ['not_allowed', 'allowed', 'required'] Got: '" ~ gaps ~ "'.'"
        ) }}
    {% endif %}

    {% if not zero_length_range_allowed %}
        {% set allow_zero_length_operator='<' %}
        {% set allow_zero_length_operator_in_words='less_than' %}
    {% elif zero_length_range_allowed %}
        {% set allow_zero_length_operator='<=' %}
        {% set allow_zero_length_operator_in_words='less_than_or_equal_to' %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`zero_length_range_allowed` argument for mutually_exclusive_ranges test must be one of [true, false] Got: '" ~ zero_length_range_allowed ~ "'.'"
        ) }}
    {% endif %}

    {% set partition_clause = "partition by " ~ partition_by if partition_by else '' %}

    with window_functions as (

        select
            {% if partition_by %}
            {{ partition_by }} as partition_by_col,
            {% endif %}
            {{ lower_bound_column }} as lower_bound,
            {{ upper_bound_column }} as upper_bound,

            lead({{ lower_bound_column }}) over (
                {{ partition_clause }}
                order by {{ lower_bound_column }}, {{ upper_bound_column }}
            ) as next_lower_bound,

            lead(1) over (
                {{ partition_clause }}
                order by {{ lower_bound_column }}, {{ upper_bound_column }}
            ) is null as is_last_record_alternative,

            row_number() over (
                {{ partition_clause }}
                order by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc
            ) = 1 as is_last_record

        from {{ model }}

    ),

    calc as (
        -- We want to return records where one of our assumptions fails, so we'll use
        -- the `not` function with `and` statements so we can write our assumptions more cleanly
        select
            *,

            -- For each record: lower_bound should be < upper_bound.
            -- Coalesce it to return an error on the null case (implicit assumption
            -- these columns are not_null)
            coalesce(
                lower_bound {{ allow_zero_length_operator }} upper_bound,
                false
            ) as lower_bound_{{ allow_zero_length_operator_in_words }}_upper_bound,

            -- For each record: upper_bound {{ allow_gaps_operator }} the next lower_bound.
            -- Coalesce it to handle null cases for the last record.
            coalesce(
                upper_bound {{ allow_gaps_operator }} next_lower_bound,
                is_last_record,
                false
            ) as upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound

        from window_functions

    ),

    validation_errors as (

        select
            *
        from calc
        where not(
            -- THE FOLLOWING SHOULD BE TRUE --
            lower_bound_{{ allow_zero_length_operator_in_words }}_upper_bound
            and upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound
        )
    )

    select * from validation_errors
    {% endmacro %}
    """

    # Test data with two zero-length ranges occurring on the same day:
    df = pd.DataFrame(
        {
            "trans_id": [0, 1],
            "start_date": 2 * [pd.to_datetime("2000-01-01").date()],
            "end_date": 2 * [pd.to_datetime("2000-01-01").date()],
        }
    )

    # Render the test query for `df` defined above, allowing for zero-length ranges:
    test_renderer: Callable = (
        Environment().from_string(test_template_str).module.mutually_exclusive_ranges
    )
    test_query = test_renderer(
        model="df",
        lower_bound_column="start_date",
        upper_bound_column="end_date",
        zero_length_range_allowed=True,
    )
    print(duckdb.sql(test_query))
    return None


if __name__ == "__main__":
    main()
Additional context

I believe the root cause of this issue lies in how is_last_record is computed in mutually_exclusive_ranges:

row_number() over (
    {{ partition_clause }}
    order by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc
) = 1 as is_last_record
This assumes that ordering by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc always produces the exact reverse order of {{ lower_bound_column }}, {{ upper_bound_column }}. However, this isn't true for duplicated zero-length ranges, where the asc and desc ordering of records are identical. The net result of this is that some records are assigned a null next_lower_bound, but with a false is_last_record flag, causing these rows to be labelled as 'failing'.
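The mismatch can be sketched in plain Python (an illustrative sketch, not the macro itself), emulating the two window orderings with Python's stable sorts. SQL guarantees no particular order among tied rows, so this is just one possible tie-break, but it is exactly the kind that reproduces the failure:

```python
# Two identical zero-length ranges, as (lower_bound, upper_bound):
rows = [("2000-01-01", "2000-01-01"), ("2000-01-01", "2000-01-01")]

# Emulate `lead(lower_bound) over (order by lower_bound, upper_bound)`:
asc = sorted(range(len(rows)), key=lambda i: rows[i])  # stable: [0, 1]
next_lower_bound = {
    idx: (rows[asc[pos + 1]][0] if pos + 1 < len(asc) else None)
    for pos, idx in enumerate(asc)
}

# Emulate `row_number() over (order by ... desc, ... desc) = 1`.
# With tied keys, a stable descending sort is NOT the reverse of the
# ascending one: row 0 still sorts first, so it gets row_number 1.
desc = sorted(range(len(rows)), key=lambda i: rows[i], reverse=True)  # [0, 1]
is_last_record = {idx: pos == 0 for pos, idx in enumerate(desc)}

# Row 1 ends up with next_lower_bound = None but is_last_record = False,
# so coalesce(upper_bound <= next_lower_bound, is_last_record, false)
# evaluates to false and the row is flagged as failing:
print(next_lower_bound[1], is_last_record[1])
```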
To address this edge case, I suggest calculating is_last_record by ordering records in the same way as used to compute next_lower_bound, like so:
-- If no following record, `lead(1)` is null:
lead(1) over (
    {{ partition_clause }}
    -- No longer `desc` sorting:
    order by {{ lower_bound_column }}, {{ upper_bound_column }}
) is null as is_last_record,
A new argument should be introduced to allow users to toggle between this updated behavior and the current behavior.
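As a sanity check on the proposed change, the same kind of pure-Python emulation (stable sorts standing in for the SQL window ordering; illustrative only) shows that deriving is_last_record from the ascending ordering makes it agree with next_lower_bound, so duplicated zero-length ranges no longer fail:

```python
rows = [("2000-01-01", "2000-01-01"), ("2000-01-01", "2000-01-01")]

# One shared ordering, as in `order by lower_bound, upper_bound`:
asc = sorted(range(len(rows)), key=lambda i: rows[i])

# `lead(lower_bound)` and the proposed `lead(1) ... is null` now use the
# same ordering, so they agree on which row is last:
next_lower_bound = {
    idx: (rows[asc[pos + 1]][0] if pos + 1 < len(asc) else None)
    for pos, idx in enumerate(asc)
}
is_last_record = {idx: pos == len(asc) - 1 for pos, idx in enumerate(asc)}

# Rows where coalesce(upper_bound <= next_lower_bound, is_last_record, false)
# would fall through to false, i.e. a null next bound on a non-last row:
failing = [
    idx for idx in range(len(rows))
    if next_lower_bound[idx] is None and not is_last_record[idx]
]
assert failing == []  # no row is spuriously flagged
```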
Are you interested in contributing the fix?
Yes, I am interested in contributing this fix. I'll open a PR re: this issue later today if I find the time.
Expected results

An empty relation, i.e. no rows flagged as failing.
Actual results

The duplicated zero-length range 2000-01-01 to 2000-01-01 is considered to overlap with itself.

System information
The contents of your packages.yml file: N/A, since the provided snippet renders the mutually_exclusive_ranges test directly with jinja2, without using dbt.

Which database are you using dbt with? duckdb

The output of dbt --version: N/A, since the provided snippet renders the mutually_exclusive_ranges test directly with jinja2, without using dbt.