Replies: 1 comment
Today, in practice, we rely on dask.array pretty heavily to defer execution against Zarr-backed arrays. Dask already implements a lazy graph-based computation system, and we should definitely not try to create a new alternative to it here. We could, however, aim to integrate with other similar libraries, such as Cubed. There may be room for a more lightweight deferred-execution array library (similar to Xarray's duck arrays), but I don't think that belongs in Zarr. We should document better how users can wrap their Zarr arrays in Dask / Cubed arrays in order to obtain deferred execution. Within Zarr itself, we should focus on implementing operations that can be pushed down to the storage layer to optimize computational pipelines. This is primarily indexing.
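To make "pushed down to the storage layer" concrete, here is a toy sketch (names like `ChunkedStore` are purely illustrative, not zarr-python API): indexing is pushed down in the sense that only the chunks overlapping the selection are ever read from storage.

```python
import numpy as np

# Hypothetical sketch: a chunked 1-D array where __getitem__ reads only
# the chunks that overlap the requested slice -- the sense in which
# indexing can be "pushed down" to the storage layer.
class ChunkedStore:
    def __init__(self, data: np.ndarray, chunk_size: int):
        self.chunk_size = chunk_size
        # Simulate a chunk store (e.g. Zarr chunks on S3) with a dict.
        self.chunks = {
            i: data[i * chunk_size:(i + 1) * chunk_size].copy()
            for i in range((len(data) + chunk_size - 1) // chunk_size)
        }
        self.reads = 0  # count chunk reads to show the pushdown effect

    def __getitem__(self, sel: slice):
        total = sum(len(c) for c in self.chunks.values())
        start, stop, _ = sel.indices(total)
        first, last = start // self.chunk_size, (stop - 1) // self.chunk_size
        parts = []
        for i in range(first, last + 1):
            self.reads += 1          # one storage read per touched chunk
            chunk = self.chunks[i]
            lo = max(start - i * self.chunk_size, 0)
            hi = min(stop - i * self.chunk_size, len(chunk))
            parts.append(chunk[lo:hi])
        return np.concatenate(parts)

arr = ChunkedStore(np.arange(100), chunk_size=10)
out = arr[35:42]   # touches only chunks 3 and 4, so arr.reads == 2
```

The point is that the selection alone determines which chunks are fetched; everything downstream of that fetch (arithmetic, reductions) belongs in Dask/Cubed.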
I think this is a very good rule of thumb.
This is an interesting example and hints at an ambiguity behind the idea that we don't support "operations that transform data". The fact is, transforming data is exactly what codecs do. We already have a dtype codec. You could also imagine a generalized arithmetic codec that operates elementwise on each item. So one idea might be: if we know how to express an array-API operation as a codec, we could push it into the codec pipeline. This is something we could explore incrementally, one operation at a time.
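As a sketch of what such a "generalized arithmetic codec" might look like, here is a minimal encode/decode pair in the spirit of the numcodecs `Codec` interface. `ScaleOffsetCodec` and its behavior are hypothetical, not existing zarr-python machinery:

```python
import numpy as np

# Hypothetical sketch of an elementwise arithmetic codec: the array-API
# operation y = x * scale + offset expressed as an encode/decode pair,
# so it could in principle ride the codec pipeline. Names are
# illustrative, not part of zarr-python or numcodecs.
class ScaleOffsetCodec:
    def __init__(self, scale: float, offset: float):
        self.scale = scale
        self.offset = offset

    def decode(self, buf: np.ndarray) -> np.ndarray:
        # Applied per chunk when reading: the user sees transformed data.
        return buf * self.scale + self.offset

    def encode(self, buf: np.ndarray) -> np.ndarray:
        # Inverse transform, applied when writing back.
        return (buf - self.offset) / self.scale

codec = ScaleOffsetCodec(scale=2.0, offset=1.0)
chunk = np.array([0.0, 1.0, 2.0])
decoded = codec.decode(chunk)       # transformed view of the stored chunk
roundtrip = codec.encode(decoded)   # recovers the stored values
```

Any elementwise, invertible operation fits this shape; non-invertible or cross-chunk operations (reductions, sorting) clearly do not, which is one way to draw the line incrementally.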
I think this is a perfectly reasonable idea. My reading of the API is that there is no expectation of cross-library understanding of device, so we are free to define it however we want. Could we use this information somehow? For example, if we know two arrays are on the same device, can we use that for any sort of optimization?
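One toy illustration of that optimization idea, assuming a hypothetical `Device` object that identifies a storage backend (none of these names exist in zarr-python):

```python
from dataclasses import dataclass

# Hypothetical sketch: if `device` identified the storage backend, an
# optimizer could pick a cheaper plan when two operands share a device.
# Device and plan_binary_op are illustrative names only.
@dataclass(frozen=True)
class Device:
    kind: str       # e.g. "s3", "local"
    location: str   # e.g. bucket name or filesystem path

def plan_binary_op(a_device: Device, b_device: Device) -> str:
    if a_device == b_device:
        return "compute in place, no transfer"
    return "copy one operand, then compute"

s3 = Device("s3", "my-bucket")
same = plan_binary_op(s3, Device("s3", "my-bucket"))   # shared device
cross = plan_binary_op(s3, Device("local", "/tmp"))    # mixed devices
```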
Supporting the array API in zarr-python
The Python array API standard is an effort to standardize the API of the various NDArray objects across the Python ecosystem. In the v3 roadmap, one of the goals is to "Align the Zarr-Python array API with the array API Standard". I would like to use this discussion to consider how we can achieve this goal.
Array Attributes
The array API standard defines a small set of array attributes: `dtype`, `device`, `mT`, `ndim`, `shape`, `size`, and `T`.
Some of these (`size`, `shape`) are extremely simple to support, but I'm curious about what `array.device` should be for a zarr array. I don't know much about computing on GPUs, but I am guessing that's where `array.device` is most relevant (the docs for this attribute in the array API say as much). One interpretation for zarr-python could be that the `device` represents the particular storage backend for the chunks of the array, so for data stored on AWS S3, the device would be some Python object that represents "data stored on S3", and so on for other storage backends. I'm curious to hear any other thoughts on this -- since I never use the `.device` attribute of a numpy array, my intuitions for what could work here might be way off.

Array methods / functions
The array API defines a LOT of functions and methods that transform arrays into new arrays or scalars. Besides indexing, I think implementing these routines in zarr-python would be a lot of work -- to implement something like `x: zarr.Array = zarray.mean(0).std(0)` we would need to use or create a lazy graph-based computation system, and I don't see that happening any time soon (although it would be super cool).

So I would suggest that we support operations that select data (i.e., indexing), but not operations that transform data.
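To see why an expression like `zarray.mean(0).std(0)` needs graph machinery, here is a toy deferred-expression sketch. This is exactly the kind of system Dask and Cubed already provide (and that zarr-python should not rebuild); `Deferred` is an illustrative stand-in, not a proposal:

```python
import numpy as np

# Toy illustration: each method call only records a node in a tiny
# "graph" (a list of pending ops); nothing executes until .compute().
class Deferred:
    def __init__(self, source, ops=()):
        self.source = source   # underlying array (stand-in for a zarr array)
        self.ops = ops         # recorded (method_name, axis) pairs

    def mean(self, axis):
        return Deferred(self.source, self.ops + (("mean", axis),))

    def std(self, axis):
        return Deferred(self.source, self.ops + (("std", axis),))

    def compute(self) -> np.ndarray:
        out = self.source
        for name, axis in self.ops:
            out = getattr(out, name)(axis=axis)
        return out

data = np.arange(12.0).reshape(3, 4)
lazy = Deferred(data).mean(0).std(0)   # builds the graph; no work done yet
result = lazy.compute()                # only now does numpy evaluate
```

Everything between construction and `.compute()` is bookkeeping; supporting that bookkeeping (plus chunk-aware scheduling) is the real cost of implementing the transforming half of the standard.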
`astype` could be an exception here, since calling `.astype` on some chunks after loading them is pretty cheap. But this would still require breaking ground on a lazy evaluation system (or depending on one, which seems undesirable right now).

How much of the API standard can we support
Without data-transforming functions and methods, not much, in percentage terms! I couldn't find guidelines for libraries that only support a subset of the standard, but maybe this describes most array APIs in use today other than numpy's. However, I think this is fine. As long as zarr arrays can be coerced to numpy / cupy / ... arrays as needed, users should be able to compute what they need using the numpy / cupy / ... APIs.
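That coercion path already exists in numpy via the `__array__` protocol; any object implementing it can be handed to `np.asarray` and then to the full numpy API. A stand-in sketch (`DiskBackedArray` is hypothetical, standing in for a real zarr array):

```python
import numpy as np

# Sketch of the coercion path: an object implementing __array__ can be
# materialized with np.asarray and then used with the full numpy API.
# DiskBackedArray is an illustrative stand-in for a real zarr array.
class DiskBackedArray:
    def __init__(self, data: np.ndarray):
        self._data = data

    def __array__(self, dtype=None, copy=None):
        # A real zarr array would read and assemble its chunks here.
        return np.asarray(self._data, dtype=dtype)

z = DiskBackedArray(np.arange(6).reshape(2, 3))
materialized = np.asarray(z)   # coercion point: zarr-like -> numpy
total = materialized.sum()     # from here on, plain numpy
```

So even a zarr array that implements only attributes plus indexing remains fully usable with numpy / cupy / ... as the compute layer.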
I'm curious to hear what other people think about this approach.