I managed to cause an unhandled error in weightedcalcs by feeding it a large dataframe with large weights, which caused a float roundoff in numpy.cumsum.
Reproducible Example:
import numpy as np
import pandas as pd
import weightedcalcs

N = 130_000_000
big_df = pd.DataFrame(dict(
    values=np.linspace(0, 1, num=N) / 10,
    weights=np.linspace(0, 1, num=N) * 0.1 + 50,
))
# float32 weights make the cumulative sum lose precision
big_df['weights'] = big_df['weights'].astype('float32')
w_calc = weightedcalcs.Calculator('weights')
w_calc.quantile(big_df, 'values', q=0.9)  # breaks, but should produce about 0.09 as an answer
cumsum reaches a float roundoff threshold and plateaus. As a result, df["cumul_prop"] never reaches 1. In the example above, df["cumul_prop"].max() is about 0.16.
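The plateau can be reproduced at a much smaller scale. The sketch below is not the code from weightedcalcs; it uses a hypothetical array with one large leading weight so that the float32 running total starts at 2**24, the point where adding 1.0 no longer changes the value. The reference total is computed in float64 so the comparison is not affected by the same roundoff.

```python
import numpy as np

# Hypothetical small-scale stand-in for the 130M-row example: the leading
# weight puts the running total at 2**24, where float32 can no longer
# represent the next integer.
weights = np.concatenate([[2.0**24], np.ones(1000)]).astype(np.float32)

running = np.cumsum(weights)            # accumulates in float32
total = weights.sum(dtype=np.float64)   # reference total, accumulated in float64

cumul_prop = running / total

print(running[-1])        # stays at 2**24: the 1000 later additions are lost
print(total)              # 2**24 + 1000
print(cumul_prop.max())   # strictly below 1
```

Any filter that waits for the cumulative proportion to reach a threshold above that maximum then matches no rows, which is consistent with the empty dataframe described in this issue.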
This seems like a "won't fix", but it also creates an opportunity to add a custom check for this condition on the weightedcalcs side. Again, I am happy to submit a PR.
Demetrio92 changed the title from "Bug: it's possible to define weights that will lead to a float overflow and result in an unhandled index error" to "Bug: it's possible to define weights that will cause a float roundoff and result in an unhandled index error" on Jan 7, 2025.
p.s. Never mind, normalization solves only a fraction of cases. At a certain df size it is simply necessary to use float64. Still, we should check whether this is happening.
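One way to get float64 accumulation without converting the weights column itself is the dtype argument of numpy.cumsum. This is only a sketch on a hypothetical small array, not a patch for weightedcalcs, and it only moves the threshold further out rather than removing it.

```python
import numpy as np

# Hypothetical float32 weights whose running total sits at 2**24 from the start.
weights = np.concatenate([[2.0**24], np.ones(1000)]).astype(np.float32)

running32 = np.cumsum(weights)                    # float32 accumulator, plateaus
running64 = np.cumsum(weights, dtype=np.float64)  # float64 accumulator

print(weights.dtype)   # the input column is still float32
print(running32[-1])   # stuck at 2**24
print(running64[-1])   # 2**24 + 1000
```

The trade-off is that the cumulative array is stored as float64, so it takes twice the memory of a float32 one even though the weights column is left untouched.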
Root-Cause
Here, in weightedcalcs/weightedcalcs/core.py (line 87 in cbd2818), cumsum reaches a float roundoff threshold and plateaus. As a result, df["cumul_prop"] never reaches 1. In the example above, df["cumul_prop"].max() is about 0.16.
Then weightedcalcs/weightedcalcs/core.py (line 88 in cbd2818) yields an empty dataframe, and the next line fails with an IndexError.
Using float64 fixes the above example, but the same problem will inevitably recur for larger weights or larger dataframes.
HotFix
The issue can be mitigated by splitting the normalization into two steps:
Upstream
The NumPy devs have an explanation for this behavior and do not seem to see it as a bug: numpy/numpy#24443
Next Steps
I can add a PR with the hotfix and with an extra check for this behavior, so at least such cases are caught instead of causing an index error.
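A rough sketch of what such a check could look like is below. The function name check_cumulative_weights and the rtol default are made up for illustration and are not part of weightedcalcs. The idea is to compare the last value of the cumulative sum against a total accumulated in float64, and to raise a descriptive error when they disagree.

```python
import numpy as np


def check_cumulative_weights(weights, rtol=1e-3):
    """Hypothetical guard (not part of weightedcalcs).

    Raises ValueError if the running total of `weights`, accumulated in the
    array's own dtype, drifts away from a float64 reference total.
    The rtol default is an arbitrary assumption.
    """
    weights = np.asarray(weights)
    if weights.size == 0:
        return
    running_total = np.cumsum(weights)[-1]           # same dtype as the input
    reference_total = weights.sum(dtype=np.float64)  # float64 accumulator
    if not np.isclose(running_total, reference_total, rtol=rtol, atol=0.0):
        raise ValueError(
            "Cumulative sum of weights lost precision "
            f"({running_total} vs. {reference_total}); "
            "consider casting the weights to float64."
        )


# Usage: well-behaved weights pass silently, plateauing weights raise.
check_cumulative_weights(np.ones(1000, dtype=np.float32))

bad = np.concatenate([[2.0**24], np.ones(100_000)]).astype(np.float32)
try:
    check_cumulative_weights(bad)
except ValueError as err:
    print(err)
```

This turns the IndexError into an error message that points at the actual cause. It costs one extra pass over the weights and a temporary cumulative array, which may matter at the dataframe sizes discussed here.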