Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mutually_exclusive_ranges Test Fails for Duplicated Zero-Length Ranges #981

Open
3 tasks
mabilton opened this issue Jan 13, 2025 · 0 comments
Open
3 tasks
Labels
bug Something isn't working triage

Comments

@mabilton
Copy link

mabilton commented Jan 13, 2025

Describe the bug

The mutually_exclusive_ranges test currently flags duplicated zero-length ranges as failing rows, even when zero_length_range_allowed=True, which might be unexpected and/or undesired.

As a simple illustration of this, consider the following table:

trans_id start_date end_date
0 2000-01-01 2000-01-01
1 2000-01-01 2000-01-01

Mathematically speaking, these ranges are not overlapping, since both are zero-length. However, with the current implementation of mutually_exclusive_ranges, at least one of these rows is marked as a failure, which may be unexpected (it was for me at least :) ).

Although there are absolutely valid use cases for flagging duplicated zero-length rows like the above, this behavior should probably be toggleable.

Steps to reproduce

Here’s a stand-alone Python 3.10 script demonstrating this behavior, using jinja2==3.1.5 pandas==2.2.3 duckdb==1.1.3:

from typing import Callable

import duckdb
import pandas as pd
from jinja2 import Environment


def main():
    # Copied from https://github.com/dbt-labs/dbt-utils/blob/5c9dc0d43265cb86c5f69954e6d739dc7a1974c3/macros/generic_tests/mutually_exclusive_ranges.sql#L5:
    test_template_str = """
    {% macro mutually_exclusive_ranges(model, lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) %}
    {% if gaps == 'not_allowed' %}
        {% set allow_gaps_operator='=' %}
        {% set allow_gaps_operator_in_words='equal_to' %}
    {% elif gaps == 'allowed' %}
        {% set allow_gaps_operator='<=' %}
        {% set allow_gaps_operator_in_words='less_than_or_equal_to' %}
    {% elif gaps == 'required' %}
        {% set allow_gaps_operator='<' %}
        {% set allow_gaps_operator_in_words='less_than' %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`gaps` argument for mutually_exclusive_ranges test must be one of ['not_allowed', 'allowed', 'required'] Got: '" ~ gaps ~"'.'"
        ) }}
    {% endif %}
    {% if not zero_length_range_allowed %}
        {% set allow_zero_length_operator='<' %}
        {% set allow_zero_length_operator_in_words='less_than' %}
    {% elif zero_length_range_allowed %}
        {% set allow_zero_length_operator='<=' %}
        {% set allow_zero_length_operator_in_words='less_than_or_equal_to' %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`zero_length_range_allowed` argument for mutually_exclusive_ranges test must be one of [true, false] Got: '" ~ zero_length_range_allowed ~"'.'"
        ) }}
    {% endif %}

    {% set partition_clause="partition by " ~ partition_by if partition_by else '' %}

    with window_functions as (

        select
            {% if partition_by %}
            {{ partition_by }} as partition_by_col,
            {% endif %}
            {{ lower_bound_column }} as lower_bound,
            {{ upper_bound_column }} as upper_bound,

            lead({{ lower_bound_column }}) over (
                {{ partition_clause }}
                order by {{ lower_bound_column }}, {{ upper_bound_column }}
            ) as next_lower_bound,

            lead(1) over (
                {{ partition_clause }}
                order by {{ lower_bound_column }}, {{ upper_bound_column }}
            ) is null as is_last_record_alternative,

            row_number() over (
                {{ partition_clause }}
                order by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc
            ) = 1 as is_last_record

        from {{ model }}

    ),

    calc as (
        -- We want to return records where one of our assumptions fails, so we'll use
        -- the `not` function with `and` statements so we can write our assumptions more cleanly
        select
            *,

            -- For each record: lower_bound should be < upper_bound.
            -- Coalesce it to return an error on the null case (implicit assumption
            -- these columns are not_null)
            coalesce(
                lower_bound {{ allow_zero_length_operator }} upper_bound,
                false
            ) as lower_bound_{{ allow_zero_length_operator_in_words }}_upper_bound,

            -- For each record: upper_bound {{ allow_gaps_operator }} the next lower_bound.
            -- Coalesce it to handle null cases for the last record.
            coalesce(
                upper_bound {{ allow_gaps_operator }} next_lower_bound,
                is_last_record,
                false
            ) as upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound

        from window_functions

    ),

    validation_errors as (

        select
            *
        from calc

        where not(
            -- THE FOLLOWING SHOULD BE TRUE --
            lower_bound_{{ allow_zero_length_operator_in_words }}_upper_bound
            and upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound
        )
    )

    select * from validation_errors
    {% endmacro %}
    """

    # Test data with two zero-length ranges occurring on the same day:
    df = pd.DataFrame(
        {
            "trans_id": [0, 1],
            "start_date": 2 * [pd.to_datetime("2000-01-01").date()],
            "end_date": 2 * [pd.to_datetime("2000-01-01").date()],
        }
    )

    # Render test query for `df` defined above, allowing for zero-length ranges:
    test_renderer: Callable = (
        Environment().from_string(test_template_str).module.mutually_exclusive_ranges
    )
    test_query = test_renderer(
        model="df",
        lower_bound_column="start_date",
        upper_bound_column="end_date",
        zero_length_range_allowed=True,
    )

    print(duckdb.sql(test_query))

    return None


if __name__ == "__main__":
    main()

Expected results

lower_bound upper_bound next_lower_bound is_last_record lower_bound_less_than_or_equal_to_upper_bound upper_bound_less_than_or_equal_to_next_lower_bound

i.e. an empty relation.

Actual results

lower_bound upper_bound next_lower_bound is_last_record lower_bound_less_than_or_equal_to_upper_bound upper_bound_less_than_or_equal_to_next_lower_bound
0 2000-01-01 2000-01-01 NaT

i.e. the duplicated zero length range 2000-01-01 to 2000-01-01 is considered to overlap with itself.

System information

The contents of your packages.yml file:

N/A, since provided snippet directly renders mutually_exclusive_ranges test with jinja2, without using dbt.

Which database are you using dbt with?

  • postgres
  • [x ] redshift
  • bigquery
  • snowflake
  • [x ] other (specify: duckdb)

The output of dbt --version:

N/A, since provided snippet directly renders mutually_exclusive_ranges test with jinja2, without using dbt.

Additional context

I believe the root cause of this issue lies in how is_last_record is computed in mutually_exclusive_ranges:

row_number() over (
    {{ partition_clause }}
    order by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc
) = 1 as is_last_record

This assumes that ordering by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc always produces the exact reverse order of {{ lower_bound_column }}, {{ upper_bound_column }}. However, this isn't true for duplicated zero-length ranges, where the asc and desc ordering of records are identical. The net result of this is that some records are assigned a null next_lower_bound, but with a false is_last_record flag, causing these rows to be labelled as 'failing'.

To address this edge case, I suggest calculating is_last_record by ordering records in the same way as used to compute next_lower_bound, like so:

-- If no following record, `lead(1)` is null:
lead(1) over (
    {{ partition_clause }}
    -- No longer `desc` sorting:
    order by {{ lower_bound_column }}, {{ upper_bound_column }}
) is null as is_last_record,

A new argument should be introduced to allow users to toggle between this updated behavior and the current behavior.

Are you interested in contributing the fix?

Yes, I am interested in contributing this fix. I'll open a PR re: this issue later today if I find the time.

@mabilton mabilton added bug Something isn't working triage labels Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
Development

No branches or pull requests

1 participant