`mutually_exclusive_ranges` Test Fails for Duplicated Zero-Length Ranges #981

mabilton · 2025-01-13T00:05:09Z

Describe the bug

The mutually_exclusive_ranges test currently flags duplicated zero-length ranges as failing rows, even when zero_length_range_allowed=True, which might be unexpected and/or undesired.

As a simple illustration of this, consider the following table:

trans_id	start_date	end_date
0	2000-01-01	2000-01-01
1	2000-01-01	2000-01-01

Mathematically speaking, these ranges are not overlapping, since both are zero-length. However, with the current implementation of mutually_exclusive_ranges, at least one of these rows is marked as a failure, which may be unexpected (it was for me at least :) ).

Although there are absolutely valid use cases for flagging duplicated zero-length rows like the above, this behavior should probably be toggleable.

Steps to reproduce

Here’s a stand-alone Python 3.10 script demonstrating this behavior, using jinja2==3.1.5 pandas==2.2.3 duckdb==1.1.3:

from typing import Callable

import duckdb
import pandas as pd
from jinja2 import Environment


def main():
    # Copied from https://github.com/dbt-labs/dbt-utils/blob/5c9dc0d43265cb86c5f69954e6d739dc7a1974c3/macros/generic_tests/mutually_exclusive_ranges.sql#L5:
    test_template_str = """
    {% macro mutually_exclusive_ranges(model, lower_bound_column, upper_bound_column, partition_by=None, gaps='allowed', zero_length_range_allowed=False) %}
    {% if gaps == 'not_allowed' %}
        {% set allow_gaps_operator='=' %}
        {% set allow_gaps_operator_in_words='equal_to' %}
    {% elif gaps == 'allowed' %}
        {% set allow_gaps_operator='<=' %}
        {% set allow_gaps_operator_in_words='less_than_or_equal_to' %}
    {% elif gaps == 'required' %}
        {% set allow_gaps_operator='<' %}
        {% set allow_gaps_operator_in_words='less_than' %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`gaps` argument for mutually_exclusive_ranges test must be one of ['not_allowed', 'allowed', 'required'] Got: '" ~ gaps ~"'.'"
        ) }}
    {% endif %}
    {% if not zero_length_range_allowed %}
        {% set allow_zero_length_operator='<' %}
        {% set allow_zero_length_operator_in_words='less_than' %}
    {% elif zero_length_range_allowed %}
        {% set allow_zero_length_operator='<=' %}
        {% set allow_zero_length_operator_in_words='less_than_or_equal_to' %}
    {% else %}
        {{ exceptions.raise_compiler_error(
            "`zero_length_range_allowed` argument for mutually_exclusive_ranges test must be one of [true, false] Got: '" ~ zero_length_range_allowed ~"'.'"
        ) }}
    {% endif %}

    {% set partition_clause="partition by " ~ partition_by if partition_by else '' %}

    with window_functions as (

        select
            {% if partition_by %}
            {{ partition_by }} as partition_by_col,
            {% endif %}
            {{ lower_bound_column }} as lower_bound,
            {{ upper_bound_column }} as upper_bound,

            lead({{ lower_bound_column }}) over (
                {{ partition_clause }}
                order by {{ lower_bound_column }}, {{ upper_bound_column }}
            ) as next_lower_bound,

            lead(1) over (
                {{ partition_clause }}
                order by {{ lower_bound_column }}, {{ upper_bound_column }}
            ) is null as is_last_record_alternative,

            row_number() over (
                {{ partition_clause }}
                order by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc
            ) = 1 as is_last_record

        from {{ model }}

    ),

    calc as (
        -- We want to return records where one of our assumptions fails, so we'll use
        -- the `not` function with `and` statements so we can write our assumptions more cleanly
        select
            *,

            -- For each record: lower_bound should be < upper_bound.
            -- Coalesce it to return an error on the null case (implicit assumption
            -- these columns are not_null)
            coalesce(
                lower_bound {{ allow_zero_length_operator }} upper_bound,
                false
            ) as lower_bound_{{ allow_zero_length_operator_in_words }}_upper_bound,

            -- For each record: upper_bound {{ allow_gaps_operator }} the next lower_bound.
            -- Coalesce it to handle null cases for the last record.
            coalesce(
                upper_bound {{ allow_gaps_operator }} next_lower_bound,
                is_last_record,
                false
            ) as upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound

        from window_functions

    ),

    validation_errors as (

        select
            *
        from calc

        where not(
            -- THE FOLLOWING SHOULD BE TRUE --
            lower_bound_{{ allow_zero_length_operator_in_words }}_upper_bound
            and upper_bound_{{ allow_gaps_operator_in_words }}_next_lower_bound
        )
    )

    select * from validation_errors
    {% endmacro %}
    """

    # Test data with two zero-length ranges occurring on the same day:
    df = pd.DataFrame(
        {
            "trans_id": [0, 1],
            "start_date": 2 * [pd.to_datetime("2000-01-01").date()],
            "end_date": 2 * [pd.to_datetime("2000-01-01").date()],
        }
    )

    # Render test query for `df` defined above, allowing for zero-length ranges:
    test_renderer: Callable = (
        Environment().from_string(test_template_str).module.mutually_exclusive_ranges
    )
    test_query = test_renderer(
        model="df",
        lower_bound_column="start_date",
        upper_bound_column="end_date",
        zero_length_range_allowed=True,
    )

    print(duckdb.sql(test_query))

    return None


if __name__ == "__main__":
    main()

Expected results

lower_bound upper_bound next_lower_bound	is_last_record	lower_bound_less_than_or_equal_to_upper_bound	upper_bound_less_than_or_equal_to_next_lower_bound

i.e. an empty relation.

Actual results

lower_bound upper_bound next_lower_bound	is_last_record	lower_bound_less_than_or_equal_to_upper_bound	upper_bound_less_than_or_equal_to_next_lower_bound
0	2000-01-01	2000-01-01	NaT

i.e. the duplicated zero length range 2000-01-01 to 2000-01-01 is considered to overlap with itself.

System information

The contents of your packages.yml file:

N/A, since provided snippet directly renders mutually_exclusive_ranges test with jinja2, without using dbt.

Which database are you using dbt with?

postgres
[x ] redshift
bigquery
snowflake
[x ] other (specify: duckdb)

The output of dbt --version:

N/A, since provided snippet directly renders mutually_exclusive_ranges test with jinja2, without using dbt.

Additional context

I believe the root cause of this issue lies in how is_last_record is computed in mutually_exclusive_ranges:

row_number() over (
    {{ partition_clause }}
    order by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc
) = 1 as is_last_record

This assumes that ordering by {{ lower_bound_column }} desc, {{ upper_bound_column }} desc always produces the exact reverse order of {{ lower_bound_column }}, {{ upper_bound_column }}. However, this isn't true for duplicated zero-length ranges, where the asc and desc ordering of records are identical. The net result of this is that some records are assigned a null next_lower_bound, but with a false is_last_record flag, causing these rows to be labelled as 'failing'.

To address this edge case, I suggest calculating is_last_record by ordering records in the same way as used to compute next_lower_bound, like so:

-- If no following record, `lead(1)` is null:
lead(1) over (
    {{ partition_clause }}
    -- No longer `desc` sorting:
    order by {{ lower_bound_column }}, {{ upper_bound_column }}
) is null as is_last_record,

A new argument should be introduced to allow users to toggle between this updated behavior and the current behavior.

Are you interested in contributing the fix?

Yes, I am interested in contributing this fix. I'll open a PR re: this issue later today if I find the time.

The text was updated successfully, but these errors were encountered:

mabilton added bug Something isn't working triage labels Jan 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`mutually_exclusive_ranges` Test Fails for Duplicated Zero-Length Ranges #981

`mutually_exclusive_ranges` Test Fails for Duplicated Zero-Length Ranges #981

mabilton commented Jan 13, 2025 •

edited

Loading

mutually_exclusive_ranges Test Fails for Duplicated Zero-Length Ranges #981

mutually_exclusive_ranges Test Fails for Duplicated Zero-Length Ranges #981

Comments

mabilton commented Jan 13, 2025 • edited Loading

Describe the bug

Steps to reproduce

Expected results

Actual results

System information

Additional context

Are you interested in contributing the fix?

`mutually_exclusive_ranges` Test Fails for Duplicated Zero-Length Ranges #981

`mutually_exclusive_ranges` Test Fails for Duplicated Zero-Length Ranges #981

mabilton commented Jan 13, 2025 •

edited

Loading