Add support for H5Z_FILTER_SHUFFLE for GZIP/ZLIB compressed data #11
Comments
That sounds like a totally reasonable plan. I don't think that numcodecs requires you to use Cython, but it may be convenient to keep to the same coding style. Whether this needs to be an option on the existing codec or a separate one is an open question. As discussed in #9, SZIP from libaec would be worth adding, as well as perhaps the fletcher32 checksum (although I don't know how the checksums are stored).

PS: I had no idea that HDF (or HSDS) used numba internally. numcodecs is probably not interested in taking numba as a dependency, but the question could be asked (I am personally on the zarr team, but have not contributed to numcodecs).
Sounds like a great plan, Paul! I'll be happy to help out on the numcodecs side.
The Blosc library does have the C code for byte and bit shuffling, so that could be used as a starting point. I thought about doing this when working on the HDF5-Zarr translator but did not have enough time on that project to actually do it. Another thing: the Zarr spec supports filters, so @pbranson's first option is probably the best approach. The shuffle would be declared as a filter and be available to any compression method.
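To make that concrete, here is a minimal sketch of what shuffle-as-a-filter looks like through the zarr API, assuming a numcodecs `Shuffle` codec with an element-size parameter exists (it did not at the time of this discussion; the class name and signature are assumptions):

```python
import numpy as np
import zarr
from numcodecs import GZip, Shuffle  # Shuffle is the codec proposed in this thread

data = np.arange(10_000, dtype="float64")

# The shuffle runs as a filter before compression, so it composes with
# any compressor, not just blosc.
z = zarr.array(
    data,
    chunks=1_000,
    filters=[Shuffle(elementsize=data.dtype.itemsize)],
    compressor=GZip(level=5),
)
assert np.array_equal(z[:], data)
```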
Thanks for the advice and the suggestions about the filters and reusing the blosc shuffle; there is plenty of hardware optimisation there that is worth leveraging. I have dug through the blosc code and found where the byte shuffle is implemented. The shuffle.h header isn't exposed in numcodecs yet. I haven't done any Cython development work, but it seems that the headers can be exposed via a cdef like this: https://github.com/zarr-developers/numcodecs/blob/master/numcodecs/blosc.pyx#L21. And the zarr filters are just calls to numcodecs codecs, so I could make a Shuffle(Codec) that uses the blosc shuffle in a similar way to blosc.pyx, and then we don't need to change the gzip etc. codecs. I have updated the checklist with a revised plan and will see about raising the issue at numcodecs.
I think the suggestion may have been that you can already use the blosc compressor, as wrapped by numcodecs, to do gzip/shuffle without any extra code: `numcodecs.blosc.Blosc("gzip", compression_level, shuffle, blocksize)`, where shuffle is one of:
- 0: NOSHUFFLE
- 1: SHUFFLE
- 2: BITSHUFFLE
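A runnable version of that suggestion might look like the sketch below. Note that in released numcodecs the Blosc codec takes keyword arguments, and its DEFLATE-based compressor is named "zlib" rather than "gzip"; treat those details as assumptions to check against your numcodecs version.

```python
import numpy as np
from numcodecs import Blosc

# Blosc's "zlib" codec emits the same DEFLATE streams that gzip wraps,
# and the byte shuffle is applied internally before compression.
codec = Blosc(cname="zlib", clevel=5, shuffle=Blosc.SHUFFLE)

data = np.arange(10_000, dtype="float32")
compressed = codec.encode(data)
restored = np.frombuffer(codec.decode(compressed), dtype="float32")
assert np.array_equal(restored, data)
```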
Oh! I didn't realise that Blosc also provided a number of different compressors. I think I will continue with the addition of the shuffle filter; it probably won't perform as well as maintaining the context for the shuffle operation within C, but it adds some flexibility for other numcodecs. Will test the Blosc gzip approach in this library.
OK, let us know! Note that if you are running any tests with Dask and multiple threads, you may want to set the number of blosc threads to 1 (see the sketch below).

Looking through the blosc documentation now, I don't see a way to turn OFF the blosc-specific header data in the compression/decompression functions currently available to Python.
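A sketch of the single-thread setting; the module-level `use_threads` flag is an assumption about the numcodecs API and worth verifying against the installed version:

```python
import numcodecs.blosc

# Force numcodecs to use blosc's single-threaded (contextual) mode so that
# blosc's internal thread pool doesn't contend with Dask's worker threads.
numcodecs.blosc.use_threads = False
```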
Decompressing HDF5 dataset chunks using Blosc gzip and shuffle did not work for me exactly because of that 16-byte Blosc header. That's why I suggested shuffle as a Zarr filter.
Annoying, right?
@pbranson, any luck with this? @ajelenak: it does occur to me that it is probably very easy to create the blosc header for each chunk to be decompressed. That could be a super-simple numcodecs filter, or a custom part of a zarr store class. Even fsspec could do this, but it's a less obvious fit there.
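For illustration, synthesizing such a header might look like the sketch below. The 16-byte layout (version, codec version, flags, typesize, then nbytes/blocksize/cbytes as little-endian uint32) follows the Blosc1 header diagram in blosc.h, but the flag bit positions and the zlib compressor code used here are assumptions to verify against the c-blosc sources:

```python
import struct

def make_blosc1_header(nbytes, cbytes, typesize, blocksize):
    """Build a 16-byte Blosc1 header for a raw zlib-compressed, shuffled chunk."""
    flags = 0x01       # bit 0: byte shuffle was applied (assumed bit position)
    flags |= 3 << 5    # bits 5-7: compressor code, assumed 3 for zlib
    return struct.pack(
        "<BBBBIII",
        2,             # Blosc format version (assumed)
        1,             # codec-specific format version (assumed)
        flags,
        typesize,      # element size the shuffle was applied with
        nbytes,        # uncompressed chunk size
        blocksize,     # blosc block size; the whole chunk if unsure
        16 + cbytes,   # total compressed size, assumed to include the header
    )

# A wrapper codec could then call blosc.decompress(header + raw_chunk).
```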
I am not familiar with the Blosc header, but it should be possible to do something like that. I think a proper shuffle filter is still a better long-term approach.
After understanding that blosc adds a header, I stopped pursuing that approach. I have made a start on setting up the numcodecs Cython and classes, but I'm getting swamped with a pre-Christmas work crunch. I need to test whether the full blosc context/init is required to use the shuffle functions; I am not sure that it is, which would be convenient. I can see how adding a blosc header filter could also be an option, but would that still require a memory copy operation to make the data contiguous prior to sending it to the blosc C code? In which case it is possibly faster to do the shuffle directly. Hopefully I'll get back to it tomorrow.
Yes, I suppose so, unless we somehow read the bytes into a buffer with an initial offset of two bytes, and then fill in the header after reading (see the sketch below). Of course, the point is that the compressed data should be much smaller than the final data buffer, so maybe a copy isn't so bad.
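A minimal sketch of that read-into-an-offset-buffer idea, reserving room for the header up front so no second copy is needed; the header size is a placeholder, since the thread is unsure whether 2 or 16 bytes is correct:

```python
HEADER_SIZE = 16  # placeholder; adjust once the real Blosc header size is confirmed

def read_chunk_with_header_slot(fileobj, chunk_nbytes):
    """Read a compressed chunk into a buffer that leaves space for a header."""
    buf = bytearray(HEADER_SIZE + chunk_nbytes)
    view = memoryview(buf)
    fileobj.readinto(view[HEADER_SIZE:])  # compressed bytes land after the slot
    # ... synthesize the header into view[:HEADER_SIZE] afterwards ...
    return buf
```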
OK, I have provided an update on my progress here: zarr-developers/numcodecs#260. I'm having trouble with incompatible pointer types (I think).
Possibly an easier way to go is the one documented here: https://blosc-doc.readthedocs.io/en/latest/frame_format.html?highlight=chunk%20format#description-for-different-fields-in-header ?
Blosc 2 is not in use anywhere, but its documentation has completely taken over from Blosc v1! I'm pretty sure it's two bytes, giving the compression type, shuffle, and size. I swear I saw the spec recently.
Oh, and I thought it was just me struggling to find it...!
OK, I have finally got back to this. I have implemented the shuffle with numba and raised a PR to get some feedback from the numcodecs devs (a sketch of the general shape is below). I have updated my fork of fsspec_reference_maker to add support for shuffle and tested it out in this gist. Also, it seems that these lines in the main branch have been commented out by mistake?
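Not the actual PR code, but a numba-based shuffle codec could take roughly this shape; the class name, codec_id, and parameter name are illustrative assumptions:

```python
import numba
import numpy as np
from numcodecs.abc import Codec

@numba.njit(cache=True)
def _shuffle(src, itemsize):
    # Group byte j of every element together, as H5Z_FILTER_SHUFFLE does.
    n = src.size // itemsize
    dest = np.empty_like(src)
    for i in range(n):
        for j in range(itemsize):
            dest[j * n + i] = src[i * itemsize + j]
    return dest

@numba.njit(cache=True)
def _unshuffle(src, itemsize):
    # Inverse transform: scatter the grouped bytes back into elements.
    n = src.size // itemsize
    dest = np.empty_like(src)
    for i in range(n):
        for j in range(itemsize):
            dest[i * itemsize + j] = src[j * n + i]
    return dest

class NumbaShuffle(Codec):
    codec_id = "numba_shuffle"  # illustrative id, not the one used in the PR

    def __init__(self, elementsize=4):
        self.elementsize = elementsize

    def encode(self, buf):
        return _shuffle(np.frombuffer(buf, dtype=np.uint8), self.elementsize)

    def decode(self, buf, out=None):
        return _unshuffle(np.frombuffer(buf, dtype=np.uint8), self.elementsize)
```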
To be checked, but I don't think we need those; everything is in chunkstore.
Does numcodecs support numba? Or are you suggesting this as a filter step somewhere else in the pipeline?
Not at present, but I have asked about it in zarr-developers/numcodecs#273. It seems, according to zarr-developers/numcodecs#274, that they are moving towards a more pure-Python approach (for blosc at least), so maybe they might consider adding numba? Should I ping someone specifically in that PR?

Probably not; it seems logical for the shuffle to be in numcodecs, but maybe it should be Cython rather than numba.
I suppose the usual maintainers will notice, but I doubt that numba as a dependency is a great fit. We'll see. Cython would fit better for the case where existing Python bindings to some compiled code don't exist.
I have started testing out this approach on some of the Australian Integrated Marine Observing System datasets that are stored in NetCDF format on S3 (e.g. s3://imos-data/IMOS/SRS/OC/gridded/aqua/P1D/2014/08/A.P1D.20140801T053000Z.aust.K_490.nc).
This dataset utilises gzip compression with the H5Z_FILTER_SHUFFLE filter applied. I have looked over the upstream code dependencies and would appreciate some advice on the path forward.
Currently the numcodecs gzip and zlib codecs do not support a shuffle option. The code for the shuffle is pretty straightforward, but it will likely be very slow in Python if not compiled (see the numpy sketch below): https://github.com/HDFGroup/hsds/blob/03890edfa735cc77da3bc06f6cf5de5bd40d1e23/hsds/util/storUtil.py#L43
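For reference, the core transform is just a byte transpose; a minimal pure-numpy sketch (not the HSDS code linked above) looks like this:

```python
import numpy as np

def shuffle(chunk: bytes, itemsize: int) -> bytes:
    """H5Z shuffle: group byte 0 of every element, then byte 1, and so on."""
    arr = np.frombuffer(chunk, dtype=np.uint8)
    return arr.reshape(-1, itemsize).T.tobytes()

def unshuffle(chunk: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle: reassemble the original element bytes."""
    arr = np.frombuffer(chunk, dtype=np.uint8)
    return arr.reshape(itemsize, -1).T.tobytes()

data = np.arange(8, dtype="<u4").tobytes()
assert unshuffle(shuffle(data, 4), 4) == data
```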
numcodecs uses Cython for compiled code rather than numba.
I am keen to help get this sorted and suggest one possible way forward could be:
I'd appreciate advice from some of the more experienced devs here (@martindurant, @ajelenak and @rabernat) on whether you think this is a reasonable way forward.
Thanks for everyone's effort on this!