split out manifests for smaller "coordinate" arrays #539
Comments
And here is some python code to test out this idea on example datasets:

```python
import xarray as xr


def partition_manifest(ds: xr.Dataset):
    import math
    import operator
    from itertools import dropwhile, takewhile

    from toolz import accumulate

    THRESHOLD_CHUNKS = 2048

    allvars: dict[str, int] = {}
    for name, var in ds.variables.items():
        # we know this from Zarr metadata
        nchunks = math.prod(
            math.ceil(s / c)
            for s, c in zip(var.shape, var.encoding["chunks"], strict=True)
        )
        allvars[name] = nchunks

    # sort by number of chunks, then accumulate a running total
    allvars = dict(sorted(allvars.items(), key=lambda x: x[1]))
    accumulated = tuple(accumulate(operator.add, allvars.values()))

    threshold_filter = lambda tup: tup[1] < THRESHOLD_CHUNKS
    first = lambda tup: next(iter(tup))
    small = dict(map(first, takewhile(threshold_filter, zip(allvars.items(), accumulated, strict=True))))
    big = dict(map(first, dropwhile(threshold_filter, zip(allvars.items(), accumulated, strict=True))))
    assert not set(small) & set(big)

    print(
        f"\nsmall: {sum(small.values())} chunks",
        tuple(small),
        f"\nbig: {sum(big.values())} chunks over {len(big)} arrays",
    )


grid = xr.open_zarr("gs://pangeo-ecco-llc4320/grid", chunks={})
partition_manifest(grid)
# small: 253 chunks ('PHrefC', 'PHrefF', 'Z', 'Zl', 'Zp1', 'Zu', 'drC', 'drF', 'face', 'i', 'i_g', 'iter', 'j', 'j_g', 'k', 'k_l', 'k_p1', 'k_u',
#        'time', 'CS', 'Depth', 'SN', 'XC', 'XG', 'YC', 'YG', 'dxC', 'dxG', 'dyC', 'dyG', 'hFacC', 'hFacS', 'hFacW', 'rA', 'rAs', 'rAw', 'rAz')

ds = xr.open_zarr(
    "gs://gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3",
    chunks=None,
    decode_cf=False,
)
partition_manifest(ds)
# small: 23 chunks ('latitude', 'level', 'longitude', 'time')
# big: 361355904 chunks over 273 arrays

import arraylake as al

client = al.Client()
hrrr = client.get_repo("earthmover-demos/hrrr").to_xarray("solar")
gfs = client.get_repo("earthmover-demos/gfs").to_xarray("solar")
partition_manifest(hrrr)
partition_manifest(gfs)
# small: 54 chunks ('latitude', 'longitude', 'spatial_ref', 'step', 'x', 'y', 'time')
# big: 4589460 chunks over 6 arrays
# small: 11 chunks ('latitude', 'longitude', 'step', 'time')
# big: 1189440 chunks over 5 arrays
```
I really like this @dcherian. It's simple and it will probably make a big difference for interactive use cases. I wonder what things we can generalize. Examples:
Where does this come from? It seems like a very important number. I guess it is the size of a file we expect to be able to fetch from S3 quickly using a single thread? According to our model, the fetching time should be T(n) = T0 + n / B0 ≈ 10 ms + 1 MB / (100 MB/s) ≈ 20 ms. By this math, we could get away with 4 MB and still stay at around 50 ms.
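For concreteness, a quick back-of-the-envelope sketch of that model in Python, plugging in the assumed T0 = 10 ms and B0 = 100 MB/s (illustrative numbers, not measurements):

```python
# Fetch-time model from the comment above: T(n) = T0 + n / B0.
T0 = 0.010   # assumed fixed per-request latency, seconds
B0 = 100e6   # assumed sustained throughput, bytes/second

for size_mb in (1, 4, 8):
    n = size_mb * 1e6
    t = T0 + n / B0
    print(f"{size_mb} MB manifest -> ~{t * 1e3:.0f} ms")
# 1 MB manifest -> ~20 ms
# 4 MB manifest -> ~50 ms
# 8 MB manifest -> ~90 ms
```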
Agree it's okay to rewrite this file with every commit. Could you try running the sample code on some big EO data cubes? Like something from Sylvera? Or https://app.earthmover.io/earthmover-demos/sentinel-datacube-South-America-3
It seems like we should also check this assumption for common datasets. It's unclear to me how much the logic you're proposing here depends on the assumption about inline chunks.
Last comment about inlining: are there cases where it would make sense to actually rechunk the coordinate data before storing it? Like for this dataset:
Why should we store 23 chunks? Why not just 4? As long as we are inlining, there is not really any benefit to chunking. In the scenario where we rechunk, we might exceed the 512-byte inline threshold, but we would still be storing less data total, since we would only have one "big" chunk. Put differently, is there ever a good reason to chunk an array if it is being inlined into a single manifest?
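To illustrate the rechunking idea, here is a minimal sketch using plain xarray (no particular Icechunk API; the helper name and write path are made up): it collapses every 1-D coordinate to a single chunk before writing, so e.g. ERA5's 23 coordinate chunks would become 4.

```python
import xarray as xr

def single_chunk_coords(ds: xr.Dataset) -> xr.Dataset:
    # Hypothetical helper: load each 1-D coordinate into memory and drop the
    # chunk encoding inherited from the source store, so that to_zarr writes
    # it back as a single chunk (which may then exceed the inline threshold).
    for name in list(ds.coords):
        var = ds[name]
        if var.ndim != 1:
            continue
        ds = ds.assign_coords({name: var.compute()})
        ds[name].encoding.pop("chunks", None)
    return ds

# usage (sketch):
# ds = single_chunk_coords(ds)
# ds.to_zarr("rechunked-store", mode="w")
```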
I think this really is an issue about how to split the remaining arrays, that is we should make sure manifests for big arrays (big in terms of number of chunks) are stored separately. I think that can be a followup.
We should ask: what are the characteristics of "coordinate arrays"? From my survey they seem to be arrays that are "generally" small in terms of number of chunks, and empirically we can make good decisions without needing the user to annotate their data.
Yes we want to be in the "latency range" so <8MB.
These are the landsat examples in the table; they all have single chunks and will be easily detected.
It does not depend at all on inlining. Using the inline threshold gives us a worst-case-maximum estimate of number of chunks in a single manifest given the chosen size threshold.
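As a sanity check on that worst-case arithmetic (both numbers are the ones assumed in this thread: a 1 MiB manifest budget and the 512-byte inline threshold):

```python
manifest_budget_bytes = 1 * 1024 * 1024   # assumed 1 MiB target manifest size
inline_chunk_threshold_bytes = 512        # default inline threshold discussed above

# worst case: every chunk reference is inlined at exactly the threshold size
max_inlined_chunks = manifest_budget_bytes // inline_chunk_threshold_bytes
print(max_inlined_chunks)  # 2048, i.e. THRESHOLD_CHUNKS in the script above
```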
This is how that dataset was created. My only goal here was to see if there is a no-config way to detect these "coordinate" variables in real world scenarios. I think there is.
I think this is all orthogonal to the idea here, which is to simply split out a "small" manifest and a "big" manifest. Other optimizations can be explored later.
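To make the small/big split concrete, here is a hypothetical sketch (the file names and JSON layout are made up, not Icechunk's actual manifest format), assuming partition_manifest above is modified to return (small, big) instead of printing:

```python
import json

# `small` and `big` each map variable name -> number of chunks
small, big = partition_manifest(ds)

# write one manifest for the "coordinate-like" arrays readers load eagerly,
# and a separate one for everything else
with open("manifest-small.json", "w") as f:
    json.dump({"arrays": sorted(small)}, f)
with open("manifest-big.json", "w") as f:
    json.dump({"arrays": sorted(big)}, f)
```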
One of our ideas to reduce the time to open a dataset with Xarray is to create a separate manifest file for "coordinate" arrays that are commonly loaded into memory eagerly.
Here is a survey of some datasets with coordinate arrays on the large end of the spectrum (the values for MPAS are an estimate)
If we want a manifest file of size 1 MiB, and our default `inline_chunk_threshold_bytes` is 512 bytes, then we get to store at most 2048 chunks (assuming they are all inlined). So one no-config approach would be to:
This ignores a bunch of complexity:
Thoughts?