Replies: 1 comment
Today, in practice, we rely on dask.array pretty heavily to defer execution against Zarr-backed arrays. Dask already implements a lazy graph-based computation system, and we should definitely not try to create a new alternative to it here. We could, however, aim to integrate with other similar libraries, such as Cubed. There may be room for a more lightweight deferred-execution array library (similar to Xarray's duck arrays), but I don't think that belongs in Zarr. We should document better how users can wrap their Zarr arrays in Dask / Cubed arrays in order to obtain deferred execution. Within Zarr itself, we should focus on implementing operations that can be pushed down to the storage layer to optimize computational pipelines. This is primarily indexing.
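To make "pushed down to the storage layer" concrete, here is a toy sketch (names like `ChunkedStore` are purely illustrative, not zarr-python API): indexing is pushed down in the sense that only the chunks overlapping the selection are ever read from storage.

```python
import numpy as np

# Hypothetical sketch: a chunked 1-D array where __getitem__ reads only
# the chunks that overlap the requested slice -- the sense in which
# indexing can be "pushed down" to the storage layer.
class ChunkedStore:
    def __init__(self, data: np.ndarray, chunk_size: int):
        self.chunk_size = chunk_size
        # Simulate a chunk store (e.g. Zarr chunks on S3) with a dict.
        self.chunks = {
            i: data[i * chunk_size:(i + 1) * chunk_size].copy()
            for i in range((len(data) + chunk_size - 1) // chunk_size)
        }
        self.reads = 0  # count chunk reads to show the pushdown effect

    def __getitem__(self, sel: slice):
        total = sum(len(c) for c in self.chunks.values())
        start, stop, _ = sel.indices(total)
        first, last = start // self.chunk_size, (stop - 1) // self.chunk_size
        parts = []
        for i in range(first, last + 1):
            self.reads += 1          # one storage read per touched chunk
            chunk = self.chunks[i]
            lo = max(start - i * self.chunk_size, 0)
            hi = min(stop - i * self.chunk_size, len(chunk))
            parts.append(chunk[lo:hi])
        return np.concatenate(parts)

arr = ChunkedStore(np.arange(100), chunk_size=10)
out = arr[35:42]   # touches only chunks 3 and 4, so arr.reads == 2
```

The point is that the selection alone determines which chunks are fetched; everything downstream of that fetch (arithmetic, reductions) belongs in Dask/Cubed.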
I think this is a very good rule of thumb.
This is an interesting example and hints at an ambiguity behind the idea that we don't support "operations that transform data". The fact is, transforming data is exactly what codecs do. We already have a dtype codec. You could also imagine a generalized arithmetic codec that operates elementwise on each item. So one idea might be: if we know how to express an array-API operation as a codec, we could push it into the codec pipeline. This is something we could explore incrementally, one operation at a time.
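As a sketch of what such a "generalized arithmetic codec" might look like, here is a minimal encode/decode pair in the spirit of the numcodecs `Codec` interface. `ScaleOffsetCodec` and its behavior are hypothetical, not existing zarr-python machinery:

```python
import numpy as np

# Hypothetical sketch of an elementwise arithmetic codec: the array-API
# operation y = x * scale + offset expressed as an encode/decode pair,
# so it could in principle ride the codec pipeline. Names are
# illustrative, not part of zarr-python or numcodecs.
class ScaleOffsetCodec:
    def __init__(self, scale: float, offset: float):
        self.scale = scale
        self.offset = offset

    def decode(self, buf: np.ndarray) -> np.ndarray:
        # Applied per chunk when reading: the user sees transformed data.
        return buf * self.scale + self.offset

    def encode(self, buf: np.ndarray) -> np.ndarray:
        # Inverse transform, applied when writing back.
        return (buf - self.offset) / self.scale

codec = ScaleOffsetCodec(scale=2.0, offset=1.0)
chunk = np.array([0.0, 1.0, 2.0])
decoded = codec.decode(chunk)       # transformed view of the stored chunk
roundtrip = codec.encode(decoded)   # recovers the stored values
```

Any elementwise, invertible operation fits this shape; non-invertible or cross-chunk operations (reductions, sorting) clearly do not, which is one way to draw the line incrementally.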
I think this is a perfectly reasonable idea. My reading of the API is that there is no expectation of cross-library understanding of device, so we are free to define it however we want. Could we use this information somehow? For example, if we know two arrays are on the same device, can we use that for any sort of optimization?
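One toy illustration of that optimization idea, assuming a hypothetical `Device` object that identifies a storage backend (none of these names exist in zarr-python):

```python
from dataclasses import dataclass

# Hypothetical sketch: if `device` identified the storage backend, an
# optimizer could pick a cheaper plan when two operands share a device.
# Device and plan_binary_op are illustrative names only.
@dataclass(frozen=True)
class Device:
    kind: str       # e.g. "s3", "local"
    location: str   # e.g. bucket name or filesystem path

def plan_binary_op(a_device: Device, b_device: Device) -> str:
    if a_device == b_device:
        return "compute in place, no transfer"
    return "copy one operand, then compute"

s3 = Device("s3", "my-bucket")
same = plan_binary_op(s3, Device("s3", "my-bucket"))   # shared device
cross = plan_binary_op(s3, Device("local", "/tmp"))    # mixed devices
```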
Supporting the array API in zarr-python
The Python array API standard is an effort to standardize the API of the various NDArray objects across the Python ecosystem. In the v3 roadmap, one of the goals is to "Align the Zarr-Python array API with the array API Standard". I would like to use this discussion to consider how we can achieve this goal.
Array Attributes
The array API standard defines a small set of array attributes: `dtype`, `device`, `mT`, `ndim`, `shape`, `size`, and `T`.
Some of these (`size`, `shape`) are extremely simple to support, but I'm curious about what `array.device` should be for a zarr array. I don't know much about computing on GPUs, but I am guessing that's where `array.device` is most relevant (the docs for this attribute in the array API say as much). One interpretation for zarr-python could be that the `device` represents the particular storage backend for the chunks of the array, so for data stored on AWS S3, the device would be some Python object that represents "data stored on S3", and so on for other storage backends. I'm curious to hear any other thoughts on this -- since I never use the `.device` attribute of a numpy array, my intuitions for what could work here might be way off.

Array methods / functions
The array API defines a LOT of functions and methods that transform arrays into new arrays or scalars. Besides indexing, I think implementing these routines in zarr-python would be a lot of work -- to implement something like `x: zarr.Array = zarray.mean(0).std(0)` we would need to use or create a lazy graph-based computation system, and I don't see that happening any time soon (although it would be super cool).

So I would suggest that we support operations that select data (i.e., indexing), but not operations that transform data.
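To see why an expression like `zarray.mean(0).std(0)` needs graph machinery, here is a toy deferred-expression sketch. This is exactly the kind of system Dask and Cubed already provide (and that zarr-python should not rebuild); `Deferred` is an illustrative stand-in, not a proposal:

```python
import numpy as np

# Toy illustration: each method call only records a node in a tiny
# "graph" (a list of pending ops); nothing executes until .compute().
class Deferred:
    def __init__(self, source, ops=()):
        self.source = source   # underlying array (stand-in for a zarr array)
        self.ops = ops         # recorded (method_name, axis) pairs

    def mean(self, axis):
        return Deferred(self.source, self.ops + (("mean", axis),))

    def std(self, axis):
        return Deferred(self.source, self.ops + (("std", axis),))

    def compute(self) -> np.ndarray:
        out = self.source
        for name, axis in self.ops:
            out = getattr(out, name)(axis=axis)
        return out

data = np.arange(12.0).reshape(3, 4)
lazy = Deferred(data).mean(0).std(0)   # builds the graph; no work done yet
result = lazy.compute()                # only now does numpy evaluate
```

Everything between construction and `.compute()` is bookkeeping; supporting that bookkeeping (plus chunk-aware scheduling) is the real cost of implementing the transforming half of the standard.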
`astype` could be an exception here, since calling `.astype` on some chunks after loading them is pretty cheap. But this would still require breaking ground on a lazy evaluation system (or depending on one, which seems undesirable right now).

How much of the API standard can we support
Without data-transforming functions and methods, not much, in percentage terms! I couldn't find guidelines for libraries that only support a subset of the standard, but maybe this describes most array APIs in use today other than numpy's. However, I think this is fine. As long as zarr arrays can be coerced to numpy / cupy / ... arrays as needed, users should be able to compute what they need using the numpy / cupy / ... APIs.
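That coercion path already exists in numpy via the `__array__` protocol; any object implementing it can be handed to `np.asarray` and then to the full numpy API. A stand-in sketch (`DiskBackedArray` is hypothetical, standing in for a real zarr array):

```python
import numpy as np

# Sketch of the coercion path: an object implementing __array__ can be
# materialized with np.asarray and then used with the full numpy API.
# DiskBackedArray is an illustrative stand-in for a real zarr array.
class DiskBackedArray:
    def __init__(self, data: np.ndarray):
        self._data = data

    def __array__(self, dtype=None, copy=None):
        # A real zarr array would read and assemble its chunks here.
        return np.asarray(self._data, dtype=dtype)

z = DiskBackedArray(np.arange(6).reshape(2, 3))
materialized = np.asarray(z)   # coercion point: zarr-like -> numpy
total = materialized.sum()     # from here on, plain numpy
```

So even a zarr array that implements only attributes plus indexing remains fully usable with numpy / cupy / ... as the compute layer.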
I'm curious to hear what other people think about this approach.