From 545cae7fc6bb03fa5df55b0f7febb9c23bff1b21 Mon Sep 17 00:00:00 2001
From: Tom Nicholas
Date: Fri, 15 Nov 2024 14:54:35 -0700
Subject: [PATCH] Clarify which features are currently available in FAQ (#296)

* clarifies which readers actually work
* removes tiff from auto-detection for now
* release notes
* xfail tiff filetype detection test
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fix emojis
* clarify how to combine in coordinate order
* add line for generating references from zarr v3 store
* typo
* add link to icechunk
* add table entry for icechunk
* add dmr++ table entry
* remove reference to xarray backend for virtualizarr that doesn't exist
* mention icechunk in overall explanation
* don't imply that all virtualizarr readers use kerchunk
* use crosses to indicate features kerchunk doesn't have
* add table entry for a HDF4 reader
* add table entries on how to rename vars/dims
* add table entry for renaming paths in manifest
* add warning emojis to the parallelization ideas to indicate they are as-yet untested
* actually kerchunk does support renaming filepaths in the manifest
* remove rogue |
* remove redundant link
* specify filetype kwarg needed in open_virtual_dataset
* add table entry on how to open existing kerchunk references
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
---
 docs/faq.md                        | 50 ++++++++++++++++++------------
 docs/releases.rst                  |  2 ++
 virtualizarr/backend.py            |  3 +-
 virtualizarr/tests/test_backend.py |  5 +--
 4 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/docs/faq.md b/docs/faq.md
index 81f55aa3..a0274620 100644
--- a/docs/faq.md
+++ b/docs/faq.md
@@ -4,30 +4,35 @@ I'm glad you asked!
 We can think of the problem of providing virtualized zarr-like access to a set of legacy files in some other format as a series of steps:
-1) **Read byte ranges** - We use the various [kerchunk file format backends](https://fsspec.github.io/kerchunk/reference.html#file-format-backends) to determine which byte ranges within a given legacy file would have to be read in order to get a specific chunk of data we want.
-2) **Construct a representation of a single file (or array within a file)** - Kerchunk's backends return a nested dictionary representing an entire file, but we instead immediately parse this dict and wrap it up into a set of `ManifestArray` objects. The record of where to look to find the file and the byte ranges is stored under the `ManifestArray.manifest` attribute, in a `ChunkManifest` object. Both steps (1) and (2) are handled by the `'virtualizarr'` xarray backend, which returns one `xarray.Dataset` object per file, each wrapping multiple `ManifestArray` instances (as opposed to e.g. numpy/dask arrays).
+1) **Read byte ranges** - We use various [virtualizarr readers](https://github.com/zarr-developers/VirtualiZarr/tree/main/virtualizarr/readers) to determine which byte ranges within a given legacy file would have to be read in order to get a specific chunk of data we want. Several of these readers work by calling one of the [kerchunk file format backends](https://fsspec.github.io/kerchunk/reference.html#file-format-backends) and parsing the output.
+2) **Construct a representation of a single file (or array within a file)** - Kerchunk's backends return a nested dictionary representing an entire file, but we instead immediately parse this dict and wrap it up into a set of `ManifestArray` objects. The record of where to look to find the file and the byte ranges is stored under the `ManifestArray.manifest` attribute, in a `ChunkManifest` object. Both steps (1) and (2) are handled by the `virtualizarr.open_virtual_dataset`, which returns one `xarray.Dataset` object for the given file, which wraps multiple `ManifestArray` instances (as opposed to e.g. numpy/dask arrays).
 3) **Deduce the concatenation order** - The desired order of concatenation can either be inferred from the order in which the datasets are supplied (which is what `xr.combined_nested` assumes), or it can be read from the coordinate data in the files (which is what `xr.combine_by_coords` does). If the ordering information is not present as a coordinate (e.g. because it's in the filename), a pre-processing step might be required.
 4) **Check that the desired concatenation is valid** - Whether called explicitly by the user or implicitly via `xr.combine_nested/combine_by_coords/open_mfdataset`, `xr.concat` is used to concatenate/stack the wrapped `ManifestArray` objects. When doing this xarray will spend time checking that the array objects and any coordinate indexes can be safely aligned and concatenated. Along with opening files, and loading coordinates in step (3), this is the main reason why `xr.open_mfdataset` can take a long time to return a dataset created from a large number of files.
 5) **Combine into one big dataset** - `xr.concat` dispatches to the `concat/stack` methods of the underlying `ManifestArray` objects. These perform concatenation by merging their respective Chunk Manifests. Using xarray's `combine_*` methods means that we can handle multi-dimensional concatenations as well as merging many different variables.
-6) **Serialize the combined result to disk** - The resultant `xr.Dataset` object wraps `ManifestArray` objects which contain the complete list of byte ranges for every chunk we might want to read. We now serialize this information to disk, either using the [kerchunk specification](https://fsspec.github.io/kerchunk/spec.html#version-1), or in future we plan to use [new Zarr extensions](https://github.com/zarr-developers/zarr-specs/issues/287) to write valid Zarr stores directly.
-7) **Open the virtualized dataset from disk** - The virtualized zarr store can now be read from disk, skipping all the work we did above. Chunk reads from this store will be redirected to read the corresponding bytes in the original legacy files.
+6) **Serialize the combined result to disk** - The resultant `xr.Dataset` object wraps `ManifestArray` objects which contain the complete list of byte ranges for every chunk we might want to read. We now serialize this information to disk, either using the [Kerchunk specification](https://fsspec.github.io/kerchunk/spec.html#version-1), or the [Icechunk specification](https://icechunk.io/spec/).
+7) **Open the virtualized dataset from disk** - The virtualized zarr store can now be read from disk, avoiding redoing all the work we did above and instead just opening all the virtualized data immediately. Chunk reads will be redirected to read the corresponding bytes in the original legacy files.
-The above steps would also be performed using the `kerchunk` library alone, but because (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, kerchunk's design is much less modular, and the use cases are limited by kerchunk's API surface.
+The above steps could also be performed using the `kerchunk` library alone, but because (3), (4), (5), and (6) are all performed by the `kerchunk.combine.MultiZarrToZarr` function, and no internal abstractions are exposed, kerchunk's design is much less modular, and the use cases are limited by kerchunk's API surface.
 ## How do VirtualiZarr and Kerchunk compare?
-You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides [almost all the same features](https://virtualizarr.readthedocs.io/en/latest/faq.html#how-do-virtualizarr-and-kerchunk-compare) as Kerchunk.
+You now have a choice between using VirtualiZarr and Kerchunk: VirtualiZarr provides almost all the same features as Kerchunk.
 Users of kerchunk may find the following comparison table useful, which shows which features of kerchunk map on to which features of VirtualiZarr.
+
 | Component / Feature | Kerchunk | VirtualiZarr |
 | ------------------------------------------------------------------------ | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------ |
 | **Generation of references from archival files (1)** | | |
-| From a netCDF4/HDF5 file | `kerchunk.hdf.SingleHdf5ToZarr` | `open_virtual_dataset`, via `kerchunk.hdf.SingleHdf5ToZarr` or potentially `hidefix` |
-| From a netCDF3 file | `kerchunk.netCDF3.NetCDF3ToZarr` | `open_virtual_dataset`, via `kerchunk.netCDF3.NetCDF3ToZarr` |
-| From a COG / tiff file | `kerchunk.tiff.tiff_to_zarr` | `open_virtual_dataset`, via `kerchunk.tiff.tiff_to_zarr` or potentially `cog3pio` |
-| From a Zarr v2 store | `kerchunk.zarr.ZarrToZarr` | `open_virtual_dataset`, via `kerchunk.zarr.ZarrToZarr` ? |
-| From a GRIB2 file | `kerchunk.grib2.scan_grib` | `open_virtual_datatree`, via `kerchunk.grib2.scan_grib` ? |
-| From a FITS file | `kerchunk.fits.process_file` | `open_virtual_dataset`, via `kerchunk.fits.process_file` |
+| From a netCDF4/HDF5 file | `kerchunk.hdf.SingleHdf5ToZarr` | `open_virtual_dataset(..., filetype='hdf5')`, via `kerchunk.hdf.SingleHdf5ToZarr` |
+| From a netCDF3 file | `kerchunk.netCDF3.NetCDF3ToZarr` | `open_virtual_dataset(..., filetype='netcdf3')`, via `kerchunk.netCDF3.NetCDF3ToZarr` |
+| From a COG / tiff file | `kerchunk.tiff.tiff_to_zarr` | `open_virtual_dataset(..., filetype='tiff')`, via `kerchunk.tiff.tiff_to_zarr` or potentially `tifffile` (❌ Not yet implemented - see [issue #291](https://github.com/zarr-developers/VirtualiZarr/issues/291)) |
+| From a Zarr v2 store | `kerchunk.zarr.ZarrToZarr` | `open_virtual_dataset(..., filetype='zarr')` (❌ Not yet implemented - see [issue #262](https://github.com/zarr-developers/VirtualiZarr/issues/262)) |
+| From a Zarr v3 store | ❌ | `open_virtual_dataset(..., filetype='zarr')` (❌ Not yet implemented - see [issue #262](https://github.com/zarr-developers/VirtualiZarr/issues/262)) |
+| From a GRIB2 file | `kerchunk.grib2.scan_grib` | `open_virtual_datatree(..., filetype='grib')` (❌ Not yet implemented - see [issue #11](https://github.com/zarr-developers/VirtualiZarr/issues/11)) |
+| From a FITS file | `kerchunk.fits.process_file` | `open_virtual_dataset(..., filetype='fits')`, via `kerchunk.fits.process_file` |
+| From a HDF4 file | `kerchunk.hdf4.HDF4ToZarr` | `open_virtual_dataset(..., filetype='hdf4')`, via `kerchunk.hdf4.HDF4ToZarr` (❌ Not yet implemented - see [issue #216](https://github.com/zarr-developers/VirtualiZarr/issues/216)) |
+| From a [DMR++](https://opendap.github.io/DMRpp-wiki/DMRpp.html) metadata file | ❌ | `open_virtual_dataset(..., filetype='dmrpp')`, via `virtualizarr.readers.dmrpp.DMRParser` |
+| From existing kerchunk JSON/parquet references | `kerchunk.combine.MultiZarrToZarr(append=True)` | `open_virtual_dataset(..., filetype='kerchunk')` |
 | **In-memory representation (2)** | | |
 | In-memory representation of byte ranges for single array | Part of a "reference `dict`" with keys for each chunk in array | `ManifestArray` instance (wrapping a `ChunkManifest` instance) |
 | In-memory representation of actual data values | Encoded bytes directly serialized into the "reference `dict`", created on a per-chunk basis using the `inline_threshold` kwarg | `numpy.ndarray` instances, created on a per-variable basis using the `loadable_variables` kwarg |
@@ -35,15 +40,22 @@ Users of kerchunk may find the following comparison table useful, which shows wh
 | **Manipulation of in-memory references (3, 4 & 5)** | | |
 | Combining references to multiple arrays representing different variables | `kerchunk.combine.MultiZarrToZarr` | `xarray.merge` |
 | Combining references to multiple arrays representing the same variable | `kerchunk.combine.MultiZarrToZarr` using the `concat_dims` kwarg | `xarray.concat` |
-| Combining references in coordinate order | `kerchunk.combine.MultiZarrToZarr` using the `coo_map` kwarg | `xarray.combine_by_coords` with in-memory xarray indexes created by loading coordinate variables first |
-| Combining along multiple dimensions without coordinate data | n/a | `xarray.combine_nested` |
-| **Parallelization** | | |
-| Parallelized generation of references | Wrapping kerchunk's opener inside `dask.delayed` | Wrapping `open_virtual_dataset` inside `dask.delayed` but eventually instead using `xarray.open_mfdataset(..., parallel=True)` |
-| Parallelized combining of references (tree-reduce) | `kerchunk.combine.auto_dask` | Wrapping `ManifestArray` objects within `dask.array.Array` objects inside `xarray.Dataset` to use dask's `concatenate` |
+| Combining references in coordinate order | `kerchunk.combine.MultiZarrToZarr` using the `coo_map` kwarg | `xarray.combine_by_coords` with in-memory coordinate variables loaded via the `loadable_variables` kwarg |
+| Combining along multiple dimensions without coordinate data | ❌ | `xarray.combine_nested` |
+| Dropping variables | `kerchunk.combine.drop` | `xarray.Dataset.drop_vars`, or `open_virtual_dataset(..., drop_variables=...)` |
+| Renaming variables | ❌ | `xarray.Dataset.rename_vars` |
+| Renaming dimensions | ❌ | `xarray.Dataset.rename_dims` |
+| Renaming manifest file paths | `kerchunk.utils.rename_target` | `vds.virtualize.rename_paths` |
+| Splitting uncompressed data into chunks | `kerchunk.utils.subchunk` | `xarray.Dataset.chunk` (❌ Not yet implemented - see [PR #199](https://github.com/zarr-developers/VirtualiZarr/pull/199))
+| Selecting specific chunks | ❌ | `xarray.Dataset.isel` (❌ Not yet implemented - see [issue #51](https://github.com/zarr-developers/VirtualiZarr/issues/51)) |
+**Parallelization** | | |
+| Parallelized generation of references | Wrapping kerchunk's opener inside `dask.delayed` | Wrapping `open_virtual_dataset` inside `dask.delayed` (⚠️ Untested)
+| Parallelized combining of references (tree-reduce) | `kerchunk.combine.auto_dask` | Wrapping `ManifestArray` objects within `dask.array.Array` objects inside `xarray.Dataset` to use dask's `concatenate` (⚠️ Untested) |
 | **On-disk serialization (6) and reading (7)** | | |
 | Kerchunk reference format as JSON | `ujson.dumps(h5chunks.translate())` , then read using an `fsspec.filesystem` mapper | `ds.virtualize.to_kerchunk('combined.json', format='JSON')` , then read using an `fsspec.filesystem` mapper |
 | Kerchunk reference format as parquet | `df.refs_to_dataframe(out_dict, "combined.parq")`, then read using an `fsspec` `ReferenceFileSystem` mapper | `ds.virtualize.to_kerchunk('combined.parq', format=parquet')` , then read using an `fsspec` `ReferenceFileSystem` mapper |
-| Zarr v3 store with `manifest.json` files | n/a | `ds.virtualize.to_zarr()`, then read via any Zarr v3 reader which implements the manifest storage transformer ZEP |
+| Zarr v3 store with `manifest.json` files | ❌ | `ds.virtualize.to_zarr()`, then read via any Zarr v3 reader which implements the manifest storage transformer ZEP |
+| [Icechunk](https://icechunk.io/) store | ❌ | `ds.virtualize.to_icechunk()`, then read back via xarray (requires zarr-python v3). |
 ## Why a new project?
@@ -71,7 +83,7 @@ If you see other opportunities then we would love to hear your ideas!
 ## Is this compatible with Icechunk?
-Yes! VirtualiZarr allows you to ingest data as virtual references and write those references into an Icechunk Store. See the [Icechunk documentation on creating virtaul datasets.](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr)
+Yes! VirtualiZarr allows you to ingest data as virtual references and write those references into an [Icechunk](https://icechunk.io/) Store. See the [Icechunk documentation on creating virtual datasets.](https://icechunk.io/icechunk-python/virtual/#creating-a-virtual-dataset-with-virtualizarr)
 ## I already have Kerchunked data, do I have to redo that work?
diff --git a/docs/releases.rst b/docs/releases.rst
index 42d92743..cd30f128 100644
--- a/docs/releases.rst
+++ b/docs/releases.rst
@@ -33,6 +33,8 @@ Documentation
 - FAQ answers on Icechunk compatibility, converting from existing Kerchunk references to Icechunk, and how to add a new reader for a custom file format.
   (:pull:`266`) By `Tom Nicholas `_.
+- Clarify which readers actually currently work in FAQ, and temporarily remove tiff from the auto-detection.
+  (:issue:`291`, :pull:`296`) By `Tom Nicholas `_.
 - Minor improvements to the Contributing Guide.
   (:pull:`298`) By `Tom Nicholas `_.
diff --git a/virtualizarr/backend.py b/virtualizarr/backend.py
index fab010c7..3b7195cb 100644
--- a/virtualizarr/backend.py
+++ b/virtualizarr/backend.py
@@ -16,7 +16,6 @@
     HDF5VirtualBackend,
     KerchunkVirtualBackend,
     NetCDF3VirtualBackend,
-    TIFFVirtualBackend,
     ZarrV3VirtualBackend,
 )
 from virtualizarr.utils import _FsspecFSFromFilepath, check_for_collisions
@@ -30,7 +29,7 @@
     "netcdf3": NetCDF3VirtualBackend,
     "hdf5": HDF5VirtualBackend,
     "netcdf4": HDF5VirtualBackend,  # note this is the same as for hdf5
-    "tiff": TIFFVirtualBackend,
+    # "tiff": TIFFVirtualBackend,
     "fits": FITSVirtualBackend,
 }
diff --git a/virtualizarr/tests/test_backend.py b/virtualizarr/tests/test_backend.py
index e9b60814..b1ddeee4 100644
--- a/virtualizarr/tests/test_backend.py
+++ b/virtualizarr/tests/test_backend.py
@@ -13,7 +13,6 @@
 from virtualizarr.manifests import ManifestArray
 from virtualizarr.tests import (
     has_astropy,
-    has_tifffile,
     network,
     requires_kerchunk,
     requires_s3fs,
@@ -233,9 +232,7 @@ class TestReadFromURL:
             pytest.param(
                 "tiff",
                 "https://github.com/fsspec/kerchunk/raw/main/kerchunk/tests/lcmap_tiny_cog_2020.tif",
-                marks=pytest.mark.skipif(
-                    not has_tifffile, reason="package tifffile is not available"
-                ),
+                marks=pytest.mark.xfail(reason="not yet implemented"),
             ),
             pytest.param(
                 "fits",