Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error #10

Open
Demetrio92 opened this issue Jan 7, 2025 · 3 comments

Comments

@Demetrio92
Copy link

Demetrio92 commented Jan 7, 2025

I managed to cause an unhandled error in weightedcals by feeding it a large dataframe with large weights that caused a float roundoff in numpy.cumsum.

Reproducible Example:

import numpy as np
import pandas as pd
import weightedcalcs

N = 130_000_000
big_df = pd.DataFrame(dict(
    values=np.linspace(0, 1, num=N)/10,
    weights=np.linspace(0, 1, num=N)*0.1+50,
))
big_df['weights'] = big_df['weights'].astype('float32')

w_calc = weightedcalcs.Calculator('weights')
w_calc.quantile(big_df, 'values', q=0.9)  # breaks, but should produce about 0.9 as an answer

Root-Cause

Here:

df["cumul_prop"] = df["weights"].cumsum() / df["weights"].sum()

cumsum reaches a float roundoff threshold and plateaus . As the result df["cumul_prop"] never reaches 1. In the example above df["cumul_prop"].max()` is about 0.16.

Then

shaved = df[df["cumul_prop"] >= q]

yields an empty dataframe, and the next line fails with an IndexError.

Using float64 fixes the above example, but it will inevitably cause the same problem for larger weights or larger dataframes.

HotFix

The issue can be mitigated by splitting the normalization into two steps:

df["weights"] =  df["weights"] / df["weights"].sum()
df["cumul_prop"] = df["weights"].cumsum()

Upstream

Numpy devs have an explanation for this behavior and seem to not see it as a bug: numpy/numpy#24443

Next Steps

I can add a PR with the hotfix and with an extra check for this behavior, so at least such cases are caught instead of causing an index error.

@Demetrio92
Copy link
Author

Found a related numpy issue: numpy/numpy#24443

Seems like a "won't fix". But also this creates an opportunity to add a custom check for this condition on the side of weightedcalcs. Again, I am happy to submit a PR

@Demetrio92 Demetrio92 changed the title Bug: it's possible to define weights that will lead to a float overflow and result in an unhandled index error Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error Jan 7, 2025
@Demetrio92
Copy link
Author

p.s. nevermind, normalization solves only a fraction of cases. At a certain df size it is simply necessary to use float64. Still, we should check for whether this is happening.

@jsvine
Copy link
Owner

jsvine commented Jan 10, 2025

Thanks for flagging this, @Demetrio92 — what sort of check would you propose?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants