Bump Xarray to 2025.1.1 and icechunk to 0.1.0a10 in upstream #375

Open
wants to merge 1 commit into
base: main
from


norlandrhagen
Collaborator

  • Tests passing

@norlandrhagen added the CI (Continuous Integration) and dependencies (Updates a dependency) labels on Jan 9, 2025
@norlandrhagen
Collaborator Author

Hey there @mpiannucci, @TomNicholas and I just bumped the icechunk and Xarray versions and we're seeing some failures on the upstream CI. Wondering if you have any insight!

@mpiannucci
Contributor

Zarr 3 comes with a number of changes:

  1. chunk_shape is now chunks
  2. codecs is no longer an argument to create_array. Instead, you now specify filters (array to array), compressors (bytes to bytes), and optionally a single serializer (which converts between arrays and bytes). The whole codec pipeline therefore needs to be redone, but that should make everything simpler.
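
A rough sketch of what that migration could look like, using only the standard library. `CodecSpec`, `split_codec_pipeline`, and `migrate_create_array_kwargs` are hypothetical names made up for illustration; they are not part of zarr-python or VirtualiZarr, and the real codec objects will look different. The sketch only assumes the two changes listed above: `chunk_shape` becomes `chunks`, and a flat `codecs` list is split into `filters`, `serializer`, and `compressors`.

```python
from dataclasses import dataclass
from typing import Any

# The three codec roles described above.
KINDS = ("array_to_array", "array_to_bytes", "bytes_to_bytes")


# Hypothetical stand-in for a codec description; NOT a zarr-python class.
@dataclass(frozen=True)
class CodecSpec:
    name: str
    kind: str  # one of KINDS


def split_codec_pipeline(codecs: list[CodecSpec]) -> dict[str, Any]:
    """Split a flat codec list into filters, a serializer, and compressors."""
    unknown = [c for c in codecs if c.kind not in KINDS]
    if unknown:
        raise ValueError(f"unknown codec kinds: {unknown}")

    serializers = [c for c in codecs if c.kind == "array_to_bytes"]
    if len(serializers) > 1:
        raise ValueError("at most one serializer is allowed")

    return {
        "filters": tuple(c for c in codecs if c.kind == "array_to_array"),
        "serializer": serializers[0] if serializers else None,
        "compressors": tuple(c for c in codecs if c.kind == "bytes_to_bytes"),
    }


def migrate_create_array_kwargs(old: dict[str, Any]) -> dict[str, Any]:
    """Translate old-style keyword arguments into the new-style names."""
    new = dict(old)  # leave the caller's dict untouched
    if "chunk_shape" in new:
        new["chunks"] = new.pop("chunk_shape")
    if "codecs" in new:
        new.update(split_codec_pipeline(new.pop("codecs")))
    return new


old_kwargs = {
    "shape": (100, 100),
    "chunk_shape": (10, 10),
    "codecs": [
        CodecSpec("transpose", "array_to_array"),
        CodecSpec("bytes", "array_to_bytes"),
        CodecSpec("zstd", "bytes_to_bytes"),
    ],
}
new_kwargs = migrate_create_array_kwargs(old_kwargs)
```

In real code the values would be actual codec instances rather than `CodecSpec` placeholders, but the renaming and the three-way split would follow the same shape.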

@norlandrhagen mentioned this pull request on Jan 10, 2025
22 tasks
@redcliff

Hey, can you bump the icechunk version to 0.1.0a11, which adds Azure Blob support? I'm interested in trying out writing a virtual Zarr store to Azure Blob Storage, but I got blocked by the same issue that is causing the CI failures. Is there an estimate of when we can expect this PR to be merged?

@abarciauskas-bgse
Collaborator

@norlandrhagen just FYI: This is the branch I am using for icechunk until we can come up with a more complete design for refactoring for zarr-python 3.0

@TomNicholas
Member

TomNicholas commented Jan 15, 2025

Thanks @abarciauskas-bgse!

@redcliff we're in a tricky spot right now, with a bunch of backwards-incompatible and mutually exclusive changes in our dependencies: Zarr-python v3, Icechunk, and Kerchunk. Whilst it should be possible to make things work with some hacky branches right now, if you would rather avoid that rabbit hole, it will likely take us a few weeks to get everything working again on up-to-date released branches.

Can I ask what file format you're hoping to virtualize? That can affect which dependencies you need, which in turn affects how easy it is to get things working right now.

@ghidalgo3
Contributor

@TomNicholas, @redcliff and I would like to virtualize NetCDF4 and HDF5 file types. They are the primary n-dimensional array datatypes we host on Planetary Computer.

We can wait until the dependencies stabilize. My end goal is for us to be able to do the following:

  1. Bring archival data onto Planetary Computer's Azure Blob Storage as NetCDF and HDF5 files.
  2. For each file, create an adjacent virtualized icechunk store by calling virtualizarr.dataset_to_icechunk (requires icechunk>=0.1.0a11 for Azure Blob Storage support). That's the store we would encourage our users to read from.
  3. Document for our users how to open the data using xarray.open_zarr through the icechunk reader.

If I understand the sequence of events correctly:

  1. zarr==3.0.0 (no prerelease!) was released days ago and it introduced breaking changes from zarr==3.0.0b3.
  2. The upstream environment for VirtualiZarr indirectly depended on zarr==3.0.0b3 through IceChunk==0.1.0a8.
  3. That indirect dependency from IceChunk has the version constraint zarr>=3, which means that on January 9, when zarr==3.0.0 was released, it immediately started being consumed by VirtualiZarr. You then ran into the breaking changes and now need to adapt to them.

Can we help in any way?

@TomNicholas
Member

Hey @ghidalgo3!

Yes, that's all correct, but with the additional complication that we can't even pin any Zarr-python version >=3.0.0 in main yet (not even the pre-release), because some of our readers (and tests) are still coupled to kerchunk, an optional dependency that currently requires zarr-python<3.0.0.

However, in your case, if you use the newer HDFReader instead of the kerchunk-reliant HDF5Reader, you should be able to use @abarciauskas-bgse's branch to get things mostly working again with Icechunk (as that's presumably what she's doing already).

> Can we help in any way?

Someone needs to rewrite the codec pipeline code to work with the released version of Zarr-python, and since you guys so kindly wrote the v2-compatible version of it, you would be great people to update it 😄

For the purposes of testing that, you could just pin Zarr-python>=3.0.0 in that branch, even though that will currently break kerchunk-reliant tests, as depending on >=3.0.0 is the end state we're aiming for anyway.
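
While such a branch pins Zarr-python>=3.0.0, one option (not something the project has decided on) would be to gate the kerchunk-reliant tests on the installed major version rather than let them fail. The sketch below uses only the standard library; `major_version`, `installed_major`, and `kerchunk_tests_enabled` are hypothetical helper names, and the version parsing is deliberately simplistic (it ignores epochs and other PEP 440 details).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '3.0.0b3'."""
    return int(version_string.split(".")[0])


def installed_major(package: str) -> Optional[int]:
    """Return the installed major version of a package, or None if absent."""
    try:
        return major_version(version(package))
    except PackageNotFoundError:
        return None


def kerchunk_tests_enabled(zarr_major: Optional[int]) -> bool:
    # Assumption taken from the thread: kerchunk currently requires
    # zarr-python<3.0.0, so its tests only make sense on a 2.x install.
    return zarr_major is not None and zarr_major < 3


# Evaluate once at import time; a test suite could use this flag to skip
# kerchunk-reliant tests when it is False.
RUN_KERCHUNK_TESTS = kerchunk_tests_enabled(installed_major("zarr"))
```

A test suite could then skip the affected tests whenever `RUN_KERCHUNK_TESTS` is false, for example with a skip marker from whichever test framework is in use.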

@abarciauskas-bgse
Collaborator

I'm using the dmrpp reader, so I'm not 100% sure my branch will work with the HDFVirtualBackend reader.
