Support HDF5 compression filter plugins #351
Comments
We try to cover the most frequently used HDF5 filters, but given the pluggable nature and big ecosystem of HDF, we will never succeed! See zarr-developers/numcodecs#422 for a discussion of SZip and zarr-developers/numcodecs#412 for fletcher32 checksum. Some like SZip are implemented in imagecodecs or elsewhere. I don't immediately see how you can get numcodecs classes from hdf5plugin, but it would be good if it would work. Ideally, though, reading HDF data via zarr and kerchunk should not depend on HDF itself.
We can get this to work!
lz4: Unfortunately, the blocking scheme used by HDF5 is also different from the one used by blosc, so we can't use that as a fallback. On the other hand, the bitshuffle: hdf5plugin:
cramjam has both blocked and block-free lz4 (as compress/decompress functions, easy to wrap).
It would be a shame to have to call HDF :( By the way, blosc has a bitshuffle, but I don't know if it's the same implementation as HDF and whether you can call it in isolation.
I agree. (Although here it would "only" be the plugin code, but it doesn't seem to be straightforward to get the bitshuffling part on its own, without indirectly depending on HDF5.)
I don't know for sure, but from the Python API it doesn't look possible to call it in isolation.
That's nice, but I fear that the cramjam blocked lz4 follows the lz4 block format, which is different from the HDF5 lz4 block format. But as I don't believe there are many datasets out there that use chunks larger than 1 GB, the offset trick mentioned above could be more elegant and easier to implement.
Of course it is - why ever would they be the same?? :) So yes, the question becomes what minimal amount of work we need to do to support 95% of cases, and you are probably right that offsetting is the way to go. I can't immediately see a spec - is it just 8 bytes for the block size?
Sorry, it probably got a bit buried in the links. I believe this should be it. So it should be a 16 byte offset. The 16 bytes before are (big endian int):
So if
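For readers following the offset discussion, here is a minimal sketch of how the framing could be parsed. The layout is an assumption, not something confirmed in this thread: an 8-byte total uncompressed size, a 4-byte block size, and then per block a 4-byte compressed size followed by the payload, all big endian, with a block stored raw when its compressed size equals its uncompressed size. The names `iter_hdf5_lz4_blocks`, `decode_hdf5_lz4` and `decompress_block` are hypothetical.

```python
import struct

# Assumed layout (not confirmed in this thread), all big endian:
#   8 bytes  total uncompressed size
#   4 bytes  block size
#   per block: 4 bytes compressed size, then the payload
HEADER = struct.Struct(">QI")
BLOCK_LEN = struct.Struct(">I")


def iter_hdf5_lz4_blocks(buf):
    """Yield (payload, uncompressed_size) for each block in a chunk."""
    view = memoryview(buf)
    total_size, block_size = HEADER.unpack_from(view, 0)
    if total_size and block_size == 0:
        raise ValueError("block size of 0 with non-empty data")
    pos = HEADER.size
    remaining = total_size
    while remaining > 0:
        (compressed_size,) = BLOCK_LEN.unpack_from(view, pos)
        pos += BLOCK_LEN.size
        # The last block may be shorter than the nominal block size.
        uncompressed_size = min(block_size, remaining)
        yield bytes(view[pos:pos + compressed_size]), uncompressed_size
        pos += compressed_size
        remaining -= uncompressed_size


def decode_hdf5_lz4(buf, decompress_block):
    """Decode a chunk; decompress_block(payload, size) handles one raw LZ4 block."""
    out = bytearray()
    for payload, size in iter_hdf5_lz4_blocks(buf):
        if len(payload) == size:
            out += payload  # assumed convention: block was stored uncompressed
        else:
            out += decompress_block(payload, size)
    return bytes(out)
```

Under these assumptions, a chunk that fits in a single block has its payload starting at byte 16 (8 + 4 + 4), which is the offset discussed above. The `decompress_block` callback could wrap any raw LZ4 block decoder, for example `lz4.block.decompress(payload, uncompressed_size=size)` from the python-lz4 package.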
Sounds good! So all we need is a small test file for CI, and we can go ahead.
@florianziemen do you have one at hand?
In principle yes. I was on holiday last week and our HPC is on holiday today. I'll look into things tomorrow.
This is probably fixed by #350
HDF5 has a zoo of compression filters. Some of them can be mapped to numcodecs filters, and simply need an entry in the json. Others might need further effort.
https://portal.hdfgroup.org/display/support/Registered+Filter+Plugins
I've addressed blosc and zstd in #350 (still at an early stage, but I figured it might be good to announce this to avoid duplication of effort).
lz4 (id 32004) and bitshuffle (id 32008) have so far resisted my efforts, and I have not tackled combinations of filters, which is why they are currently set to yield an error message in the MR draft.
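To illustrate the idea of mapping filter ids to codec entries, here is a rough sketch. It is not the code from #350; the table and the `codec_config` helper are hypothetical. It assumes the registered ids 32001 for blosc and 32015 for zstd, the built-in id 1 for deflate, and that the first deflate parameter is the compression level.

```python
# Hypothetical lookup table: HDF5 filter id -> numcodecs-style codec id.
HDF5_FILTER_TO_CODEC = {
    1: "zlib",       # built-in deflate/gzip filter
    32001: "blosc",  # assumed registered id for blosc
    32015: "zstd",   # assumed registered id for zstd
}

# Filters named in this issue that have no mapping yet.
UNSUPPORTED = {32004: "lz4", 32008: "bitshuffle"}


def codec_config(filter_id, cd_values=()):
    """Return a JSON-serialisable codec config dict for one HDF5 filter."""
    if filter_id in UNSUPPORTED:
        raise NotImplementedError(
            f"HDF5 filter {UNSUPPORTED[filter_id]} (id {filter_id}) is not supported yet"
        )
    if filter_id not in HDF5_FILTER_TO_CODEC:
        raise NotImplementedError(f"unknown HDF5 filter id {filter_id}")
    config = {"id": HDF5_FILTER_TO_CODEC[filter_id]}
    # Assumption: for deflate, the first filter parameter is the level.
    if config["id"] == "zlib" and cd_values:
        config["level"] = int(cd_values[0])
    return config
```

For example, `codec_config(1, (4,))` gives `{"id": "zlib", "level": 4}`, while `codec_config(32004)` raises `NotImplementedError`, mirroring the error message behaviour described above.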
Maybe it would be good to use the implementations from hdf5plugin and announce them to numcodecs as done in gribscan. @d70-t - any thoughts?