Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug with storing raw arrays #76

Open
TomDLT opened this issue Apr 10, 2020 · 5 comments
Open

Bug with storing raw arrays #76

TomDLT opened this issue Apr 10, 2020 · 5 comments
Labels

Comments

@TomDLT
Copy link
Contributor

TomDLT commented Apr 10, 2020

With @gxlilyBerkeley, we identified a bug when storing raw arrays, which was quite annoying.

If we upload some arrays and download them immediately, they might be changed in the process. We narrowed down the bug to the following conditions:

  • python 2
  • storing as raw arrays
  • large arrays (that trigger Zsdt compression)
  • non C-contiguous arrays

Here is a reproducing example:

import sys
import numpy as np
import cottoncandy as cc

# the bug is only in python 2
print(sys.version_info)

# create some large sparse data to be compressed
array_in = np.zeros((100, 100, 30000)).T
mask = np.random.randn(*array_in.shape) > 3
array_in[mask] = 1

# there is no bug if the array is contiguous in memory
if False:
    array_in = np.ascontiguousarray(array_in)

# store and load the array, using raw arrays
cci = cc.get_interface('tomdlt_tmp', verbose=False)
cci.upload_raw_array("test_array", array_in)
array_out = cci.download_raw_array("test_array")

# array_in and array_out are not identical
np.testing.assert_array_equal(array_in, array_out)
@TomDLT TomDLT added the bug label Apr 10, 2020
@anwarnunez
Copy link
Contributor

ooof, that is annoying. couple of things:
is the "immediate" part relevant? if you store it, exit the session, and then download it in a separate session, does it work?
and what is the difference in the data? is it totally corrupted? or is it an exactness/dtype problem?

@TomDLT
Copy link
Contributor Author

TomDLT commented Apr 10, 2020

The immediate part was just to sort out a possible corruption at the storage level, but we discovered the issue on an array stored for a long time.

The example above is made to reproduce with synthetic data, but on the original array where we discovered the bug, we got a pretty big difference in the data:

Compressed to 0.06% the size

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-53e61b81e11e> in <module>()
      6 cci.upload_raw_array("20200409_cc_test", array_in)
      7 array_out = cci.download_raw_array("20200409_cc_test")
----> 8 np.testing.assert_array_equal(array_in, array_out)
      9 print('success !!')

/home/jlg/tomdlt/miniconda3/envs/py27/lib/python2.7/site-packages/numpy/testing/_private/utils.pyc in assert_array_equal(x, y, err_msg, verbose)
    902     __tracebackhide__ = True  # Hide traceback for py.test
    903     assert_array_compare(operator.__eq__, x, y, err_msg=err_msg,
--> 904                          verbose=verbose, header='Arrays are not equal')
    905 
    906 

/home/jlg/tomdlt/miniconda3/envs/py27/lib/python2.7/site-packages/numpy/testing/_private/utils.pyc in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    825                                 verbose=verbose, header=header,
    826                                 names=('x', 'y'), precision=precision)
--> 827             raise AssertionError(msg)
    828     except ValueError:
    829         import traceback

AssertionError: 
Arrays are not equal

Mismatch: 18%
Max absolute difference: 60.51625653
Max relative difference: 36.06609662
 x: array([[[64.188949, 61.733371, 62.085814, ..., 62.085814, 62.085814,
         63.139827],
        [60.672661, 57.098978, 57.459062, ..., 57.459062, 57.098978,...
 y: array([[[64.188949, 64.188949, 64.188949, ..., 64.188949, 64.188949,
         64.188949],
        [64.188949, 64.188949, 64.188949, ..., 64.188949, 64.188949,...

@TomDLT
Copy link
Contributor Author

TomDLT commented Apr 10, 2020

Interestingly, the histograms of values are identical, which means that it is probably just a reordering of values, coming from an incorrect handling of non-contiguous arrays.

hist_in = np.histogram(array_in.ravel(), bins=100)
hist_out = np.histogram(array_out.ravel(), bins=100)

np.testing.assert_array_equal(hist_in[0], hist_out[0])
np.testing.assert_array_equal(hist_in[1], hist_out[1])

@TomDLT
Copy link
Contributor Author

TomDLT commented Apr 10, 2020

This is confirmed by the following test, which passes (assuming that array_in is 2D and FORTRAN-ordered):

np.testing.assert_array_equal(array_in, array_out.reshape(array_out.shape[::-1]).T)

It seems that the array is stored with the correct data and shape, but the strides/ordering are lost in the process. An easy fix would be to use np.ascontiguousarray before the storing, with the inconvenience of a data copy.

@anwarnunez
Copy link
Contributor

just to follow up from our conversation. this can be fixed in this code block:
https://github.com/gallantlab/cottoncandy/blob/master/cottoncandy/interfaces.py#L603

A more solid solution is to remove PY2 support.

We also need to add a test for large arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants