Storing very large image datasets on S3 #2684

AndreiBarsan · 2020-09-30T15:36:33Z

AndreiBarsan
Sep 30, 2020

Problem description

Hi Zarr Team!

We are interested in storing a large ML dataset of images on S3. The size will be over 30T, likely at least 50T. The dataset is mostly images, which have to be stored compressed (WebP) to save storage. We can't use a regular N x H x W dataset since that would make its size an order of magnitude bigger. The workloads will mostly be ML training, so images will need to be read randomly most of the time.

We are particularly interested in leveraging Zarr's ability to read parts of datasets from S3, which as I currently understand is non-trivial with other formats, such as hdf5.

As such, we end up with a ragged array since different images end up encoded as different byte counts.

I have a couple of questions about this set-up:

Is it possible to index this dataset with pretty paths, like an hdf5 file would allow (basically one array per sample?), or is the only way of doing things through one giant ragged array, whose rows can't be accessed as paths?
Does Zarr always store chunks as individual files (or S3 objects)? Or is a chunk just a conceptual element used when reading data? The reason I am asking is that for S3, we'd like users to be able to read individual samples of our dataset (ML training, so random access) without downloading unnecessary data, while at the same time we would like to avoid having millions of objects in S3 because of storage costs and performance reasons (S3 best practices discourage large numbers of small files). Is Zarr able to do partial file reads from chunks?

I am using the latest version of Zarr, v2.4.0.

Thank you, and please let me know if I can help by providing additional information.

alimanfoo · 2020-10-01T09:31:55Z

alimanfoo
Oct 1, 2020
Maintainer

Hi @AndreiBarsan!

Quick question, speaking just conceptually, can you think of your dataset as a single N x H x W array? I.e., a set of images, stacked along the first dimension?

If so, do you expect your users to always read one whole image at a time? I.e., every read operation will be something like d[i, :, :] where d is your dataset and i is the index of the image to read? Or will there be other access patterns?

0 replies

AndreiBarsan · 2020-10-01T14:43:31Z

AndreiBarsan
Oct 1, 2020
Author

Hi @alimanfoo, thanks for the quick response!

Yes, the dataset conceptually is [N (x 3) x H x W]. People would access the dataset exactly as you described. There may be other access patterns but they are not the main use case, so it's OK if they have extra overhead.

The main use case would be people training ML models on this dataset using PyTorch, which uses a pool of workers to read samples from the disk/S3/whatever in order to build mini-batches of data for training. Each worker loads one sample at a time, selected uniformly at random from the full dataset.

0 replies

rabernat · 2020-10-02T16:33:19Z

rabernat
Oct 2, 2020
Maintainer

which have to be stored compressed (WebP) to save storage. We can't use a regular N x H x W dataset since that would make its size an order of magnitude bigger.

IIUC, your raw images are all the same size (N (x 3) x H x W), but your compressed images are not the same number of bytes. Correct?

The "proper" way to do this with Zarr would be to use one of Zarr's compressors (via numcodecs). That way the Zarr array retains its regular N (x 3) x H x W geometry, but you still benefit from compression on disk / over the network. This would be possible if WebP were implemented as a numcodecs codec. That might not be very hard. Or you could see if an existing codec gives similar performance. Zarr handles all of the compression / decompression automatically.

If you don't go this route, it's not clear to me you actually need Zarr. You might be best off just using s3fs directly to read / write your compressed bytes on S3, and then have your own logic for decompression.

0 replies

AndreiBarsan · 2020-10-02T19:39:09Z

AndreiBarsan
Oct 2, 2020
Author

IIUC, your raw images are all the same size (N (x 3) x H x W), but your compressed images are not the same number of bytes. Correct?

Yes.

The "proper" way to do this with Zarr would be to use one of Zarr's compressors (via numcodecs). That way the Zarr array retains its regular N (x 3) x H x W geometry, but you still benefit from compression on disk / over the network. This would be possible if WebP were implemented as a numcodecs codec. That might not be very hard. Or you could see if an existing codec gives similar performance. Zarr handles all of the compression / decompression automatically.

I think this would definitely be possible! I encode the images to bytes with just 2-3 lines of Python + Pillow, so this sounds doable.

If you don't go this route, it's not clear to me you actually need Zarr. You might be best off just using s3fs directly to read / write your compressed bytes on S3, and then have your own logic for decompression.

I see what you mean. But if I let Zarr handle the compression, if a user is reading random samples from the dataset, then for each sample Zarr would fetch its whole chunk (say, 10+ images to get decently sized chunks), right? Or is there a way for Zarr to do partial S3 reads? Otherwise the user would end up having to discard most of the bites they read, right?

0 replies

rabernat · 2020-10-02T20:58:00Z

rabernat
Oct 2, 2020
Maintainer

You're correct that Zarr doesn't support partial chunk reads (yet). Your chunk size choice should be informed by your anticipated usage pattern.

What is N?

0 replies

rabernat · 2020-10-02T21:02:08Z

rabernat
Oct 2, 2020
Maintainer

I think this would definitely be possible! I encode the images to bytes with just 2-3 lines of Python + Pillow, so this sounds doable.

Numcodecs encodes / decodes numpy arrays. If you go this route, you could either:

Implement your own custom codec in user code. See https://numcodecs.readthedocs.io/en/stable/abc.html. In this case, your data would only be readably by people with that code.
Contribute a new "official" codec to numcodecs via a pull-request. This would be better if you want anyone to be able to read the data with a vanilla installation of zarr / numcodecs.

We'd be happy to help you over at https://github.com/zarr-developers/numcodecs/.

0 replies

AndreiBarsan · 2020-10-08T19:18:45Z

AndreiBarsan
Oct 8, 2020
Author

Thank you for the follow-up. N here is the number of samples in the dataset. It can be up to 10M in our case.

Thank you for the info about numcodecs, Ryan, we will look into that! For now, it seems that the S3 bottleneck is actually the number of requests made during training, and not the number of objects, so we will shift to investigating whether that can be addressed.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Storing very large image datasets on S3 #2684

{{title}}

Replies: 7 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Storing very large image datasets on S3 #2684

AndreiBarsan Sep 30, 2020

Problem description

Replies: 7 comments

alimanfoo Oct 1, 2020 Maintainer

AndreiBarsan Oct 1, 2020 Author

rabernat Oct 2, 2020 Maintainer

AndreiBarsan Oct 2, 2020 Author

rabernat Oct 2, 2020 Maintainer

rabernat Oct 2, 2020 Maintainer

AndreiBarsan Oct 8, 2020 Author

AndreiBarsan
Sep 30, 2020

alimanfoo
Oct 1, 2020
Maintainer

AndreiBarsan
Oct 1, 2020
Author

rabernat
Oct 2, 2020
Maintainer

AndreiBarsan
Oct 2, 2020
Author

rabernat
Oct 2, 2020
Maintainer

rabernat
Oct 2, 2020
Maintainer

AndreiBarsan
Oct 8, 2020
Author