-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zarr reader #271
base: main
Are you sure you want to change the base?
Zarr reader #271
Conversation
Bit of an update. With the help from @sharkinsspatial @abarciauskas-bgse and @maxrjones I got a Zarr loaded as a virtual dataset. <xarray.Dataset> Size: 3kB
Dimensions: (time: 10, lat: 9, lon: 18)
Coordinates:
lat (lat) float32 36B ManifestArray<shape=(9,), dtype=float32, chunk...
lon (lon) float32 72B ManifestArray<shape=(18,), dtype=float32, chun...
* time (time) datetime64[ns] 80B 2013-01-01 ... 2013-01-03T06:00:00
Data variables:
air (time, lat, lon) int16 3kB ManifestArray<shape=(10, 9, 18), dtyp...
Attributes:
Conventions: COARDS
title: 4x daily NMC reanalysis (1948)
description: Data is from NMC initialized reanalysis\n(4x/day). These a...
platform: Model
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly... Next up is how to deal with When I try to write it to Kerchunk JSON, I’m running into some
Where Wondering how There seems to be some conversion logic in @sharkinsspatial's HDF reader for converting fill_values . It also looks like there is some fill_value handling in zarr.py. |
Amazing!
This seems like an issue that should actually be orthogonal to this PR (if it weren't for the ever-present difficulty of testing). Either the problem is in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a a great start! I think the main thing here is that we don't actually need kerchunk in order to test this reader.
# we should parameterize for: | ||
# - group | ||
# - drop_variables | ||
# - loadable_variables | ||
# - store readable over cloud storage? | ||
# - zarr store version v2, v3 | ||
# testing out pytest parameterization with dataclasses :shrug: -- we can revert to a more normal style |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's great to test all these cases, but they don't need to be simultaneously parametrized over, because we don't need to test the matrix of all possible combinations of these things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think parametrizing over v2 vs v3 would be good though, as every feature should be tested for both versions.
# Do we have a good way in XRT to compare virtual datasets to xarray datasets? assert_duckarray_allclose? or just roundtrip it. | ||
# from xarray.testing import assert_duckarray_allclose | ||
# xrt.assert_allclose(ds, vds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before adding to test_integration.py
I would first create a tests/test_readers.py/test_zarr.py
and put tests in there. Those tests should open a virtual dataset from a zarr store and assert things about the contents of the ManifestArrays
, checking they match what you would expect based on the contents of the store. That's important because it's a kerchunk-free way to do useful testing.
virtualizarr/readers/zarr.py
Outdated
coord_names = list(set(all_array_dims)) | ||
|
||
# 4 Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have). | ||
# We want to drop 'drop_variables' but also virtual variables since we already **manifested** them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
groan
virtualizarr/readers/zarr.py
Outdated
loadable_vars=loadable_vars, | ||
indexes=indexes, | ||
coord_names=coord_names, | ||
attrs=zg.attrs.asdict(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you remembered the group-level attributes, nice
I think we should just do this in this PR. We can point to Tom's PR for now in the CI, but I expect that will get merged before this does anyway. If you look at Tom's implementation it's basically what we're doing here. |
Co-authored-by: Tom Nicholas <[email protected]>
|
virtualizarr/readers/zarr.py
Outdated
if not item.endswith( | ||
(".zarray", ".zattrs", ".zgroup", ".zmetadata") | ||
) and item.startswith(array_name): | ||
# dict key is created by splitting the value from store.list() by the array_name and trailing /....yuck.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we had a way to ask zarr if a key was backed by a chunk (as opposed to defaulting to the fill_value
) then we wouldn't need to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be waaay better!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for more information, see https://pre-commit.ci
… are build from np.ndarrays
virtualizarr/readers/zarr.py
Outdated
|
||
if TYPE_CHECKING: | ||
import zarr | ||
from zarr.core.common import concurrent_map |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A cautionary note: zarr.core
is meant to be private API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I'm guessing importing this from the Zarr private api is a big no-no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the implementation of concurrent_map isn't too complicated you could potentially vendor it instead.
Jan 10th ToDo / Notes: Issues:
|
for more information, see https://pre-commit.ci
If |
WIP PR to add a Zarr reader. Thanks to @TomNicholas for the how to write a reader guide.
docs/releases.rst
Future PR(s):
To Do:
For each virtual variable:
Replace VirtualiZarr.ZArray with zarr ArrayMetadata #175)
Get the loadable_variables by just using xr.open_zarr on the same store (should use drop_variables to avoid handling the virtual variables that we already have).