Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error #10

Demetrio92 · 2025-01-07T22:25:39Z

I managed to cause an unhandled error in weightedcals by feeding it a large dataframe with large weights that caused a float roundoff in numpy.cumsum.

Reproducible Example:

import numpy as np
import pandas as pd
import weightedcalcs

N = 130_000_000
big_df = pd.DataFrame(dict(
    values=np.linspace(0, 1, num=N)/10,
    weights=np.linspace(0, 1, num=N)*0.1+50,
))
big_df['weights'] = big_df['weights'].astype('float32')

w_calc = weightedcalcs.Calculator('weights')
w_calc.quantile(big_df, 'values', q=0.9)  # breaks, but should produce about 0.9 as an answer

Root-Cause

Here:

weightedcalcs/weightedcalcs/core.py

Line 87 in cbd2818

df["cumul_prop"] = df["weights"].cumsum() / df["weights"].sum()

cumsum reaches a float roundoff threshold and plateaus . As the result df["cumul_prop"] never reaches 1. In the example above df["cumul_prop"].max()` is about 0.16.

Then

weightedcalcs/weightedcalcs/core.py

Line 88 in cbd2818

shaved = df[df["cumul_prop"] >= q]

yields an empty dataframe, and the next line fails with an IndexError.

Using float64 fixes the above example, but it will inevitably cause the same problem for larger weights or larger dataframes.

HotFix

The issue can be mitigated by splitting the normalization into two steps:

df["weights"] =  df["weights"] / df["weights"].sum()
df["cumul_prop"] = df["weights"].cumsum()

Upstream

Numpy devs have an explanation for this behavior and seem to not see it as a bug: numpy/numpy#24443

Next Steps

I can add a PR with the hotfix and with an extra check for this behavior, so at least such cases are caught instead of causing an index error.

The text was updated successfully, but these errors were encountered:

Demetrio92 · 2025-01-07T22:41:23Z

Found a related numpy issue: numpy/numpy#24443

Seems like a "won't fix". But also this creates an opportunity to add a custom check for this condition on the side of weightedcalcs. Again, I am happy to submit a PR

Demetrio92 · 2025-01-07T23:05:01Z

p.s. nevermind, normalization solves only a fraction of cases. At a certain df size it is simply necessary to use float64. Still, we should check for whether this is happening.

jsvine · 2025-01-10T03:36:43Z

Thanks for flagging this, @Demetrio92 — what sort of check would you propose?

Demetrio92 changed the title ~~Bug: it's possible to define weights that will lead to a float overflow and result in an unhandled index error~~ Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error Jan 7, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error #10

Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error #10

Demetrio92 commented Jan 7, 2025 •

edited

Loading

Demetrio92 commented Jan 7, 2025

Demetrio92 commented Jan 7, 2025

jsvine commented Jan 10, 2025

Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error #10

Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error #10

Comments

Demetrio92 commented Jan 7, 2025 • edited Loading

Reproducible Example:

Root-Cause

HotFix

Upstream

Next Steps

Demetrio92 commented Jan 7, 2025

Demetrio92 commented Jan 7, 2025

jsvine commented Jan 10, 2025

Demetrio92 commented Jan 7, 2025 •

edited

Loading