-
Notifications
You must be signed in to change notification settings - Fork 23
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Thoughts towards a leaner zarr.js #69
Comments
For reference, here is the output of different $ npm run build
src/index.js, src/indexing.js, src/httpStore.js → dist...
Created bundle index.js: 3.37 kB → 1.46 kB (gzip)
Created bundle indexing.js: 5.39 kB → 2.07 kB (gzip)
Created bundle httpStore.js: 656 B → 373 B (gzip) My vision for this repo would be a "batteries included" top-level export |
Thank you for the detailed write up. I am completely in favour of reducing the bundle size of Zarr. I think it makes the most sense to cut fat where there is the most, which is probably in the dependencies. All of the code in zarr.js itself is probably under 15KB gzipped. There are just not that many lines of code. If we drop something like NestedArray we can maybe shave off a few KB, but it's diminishing returns at that point. A website's logo is probably larger than that. The apps that zarr.js will power will probably be in the hundreds of KBs if not MBs anyway. Bundlephobia shows the size over time well, but it's a bit misleading with the addition of new compressors. ErrorsI agree we can (and should) switch to errors that can be 'subclassed' or checked in user code. The current dependency on zarr for custom stores indeed doesn't have to be that way. What I propose we do is introduce some sentinel value for errors. I think that falling back to isKeyError(err: Error) {
return err.zarrError === "KEY_ERROR";
}
// or more generic
isSpecificZarrError(err: Error, s: string) {
return err.zarrError === s;
}
isZarrError(err: Error) {
return err.zarrError !== undefined;
} Then we can export a file with all these string sentinel values for different types of errors. Switching to a more modern targetThis is an easy win, also for performance. I realized we are targeting ES6 at the moment, but I think that's way too old. It doesn't even have I don't think people using zarr don't target very old browsers anyway, and if they do they can transpile using Babel themselves. By changing the target and module to ES2019 and ESNext respectively the size of Dropping TS interface checker (
|
Here's the branch if you want to play with it: https://github.com/gzuidhof/zarr.js/tree/zarr-weight-loss |
Thanks for the response! I should have time to review #70 this weekend.
I think this is a better idea than mine.
Personally, the only error I find myself reusing is
Great idea.
Great, and thanks for adding the visualization! Looking forward to reviewing. |
Thinking about this a bit more today wrt key errors. Part of me feels it's "correct" in the language context to mimic missing keys by returning undefined, but I see your point. It would be similarly typed to how an es6 map, except methods are async. What are your thoughts on throwing a string literal "KEY_ERROR"? I just want something simple that makes store implementations very straightforward and without a dependency on zarr. |
One day there will be a store or user that will want another field for something to bubble up information about the event. The first example I can come up with that isn't entirely realistic: There's a store with many keys, but you can only read from some with your user. Getting some keys may result in some sort of "permission denied" error, but with just the string literal it would be hard to let the user know. |
TL;DR - I believe modularizing
zarr.js
further and stripping out unused/unnecessary features will make maintenance and adoption easier in the future. The key is deciding what are "core" features and prioritizing design around those as package exports.Description
I love using
zarr-python
withzarr.js
, but in its current state (even with the/core
export)zarr.js
is a bulky dependency (~150kb unminified). Although it is pure ESM and can be treeshaken, the primary issue I've come across is that you can't treeshake classes, and therefore in applications that I just needZarrArray.getRawChunk
, I still need to "pay the price" forget
/getRaw
/set
/setRaw
which contain all the chunk indexing/slicing logic.To this end, I know that it's challenging to convince others to adopt
zarr
for web-applications (cc: @kylebarron). For example, it would be awesome to add a zarr tile loader toloaders.gl
, but right now it makes more sense to duplicate parts ofzarr.js
to decrease the code footprint instead of importing exports fromzarr
.Proposal 1: fully tree-shakeable submodules
I've been experimenting with an implementation of zarr v3 spec (
zarrita.js
), that is structured in a way to that you can progressively opt into more features (and code) if needed for your application. Right now thezarrita/core
module only implementsZarrArray.get_chunk
, and the top-level export extends the ZarrArray prototype to add indexing and slicing and writing to the array (e.g.ZarrArray.set
,ZarrArray.get
). Writing the v3 module in this way, I've been able to reuse the core/indexing logic inzarr-lite
which is a minimal v2 implementation for reading chunks and optionally array selections.Core zarr features
Specifically, I've been thinking about what are the core features of zarr? More motivation can be found on this twitter thread. To me, there are tiers of use that I'd like a zarr implementation to have on the web:
1.) Traverse hierarchy (open nodes of hierarchy, e.g. groups and arrays) + read chunks by key (most minimal set of features useful for the web –
getRawChunk
)2.) (1) + read slices of array using array selection (typically how you think of using zarr in python –
getRaw
))3.) (1) + (2) + create (write) to hierarchy (generally unused on the web, but most feature "complete")
Right now,
zarr.js
is (3), but there is no way to just get the behavior or 1 or 2. Inzarr-lite
I've implemented 1 and 2 as separate exports:This works by using
Object.defineProperties
and extending theZarrArray
prototype with additional methods in the submodule: https://github.com/manzt/zarr-lite/blob/main/src/indexing.js.Benefits of isolating core exports
The main benefit of isolating modules is reduction in code side. You can import from the top-level export and get a "batteries included" zarr, but those writing other libraries or wanting to use core functionally can just import the submodules that make sense for their application. This means that the code for decoding chunks from an array won't be duplicated outside of
zarr.js
and instead folks will just add zarr as s dependency.I also think this has the added benefit of standardizing how features like #16 could be implemented without creating a cost for "core" library users. Win/win.
Proposal 2: remove/move
RawArray
andNestedArray
fromzarr.js
In
zarrita.js
, I experimented with using a minimal object to represent an in-memory array:{ data, shape, stride }
. As a result, the arrays returned fromZarrArray.get
can be composed into other JS ndarray libraries.This removes the need for any array class implementation, and
zarrita/indexing
only needs to implement aset
function to fill andout
+out_selection
fromchunk
+chunk_selection
. https://github.com/manzt/zarrita.js/blob/main/src/ops.jsThe big change is that
zarr.js
would serve as an interface to read slices of massive datasets into memory, but if you want to do something fancy with those arrays, you need to choose a numpy-like library. This would mean there is also much less code we need to maintain, and we can punt on creating something numpy-like for the web. It doesn't need to be a part of zarr.js at all, but a layer on top that we don't need to be responsible for.Proposal 3: reduce code for metadata validation
I think that metadata validation should be very minimal. We control how the metadata is written (if writing) but also
zarr-python
takes care of writing metadata, so I don't think we need the overhead of something like a runtime typescript interface checker, but instead a small function that maybe checks a few fields for compatibility in javascript.Proposal 4: Remove python-isms
We should think deeply about translating native python-isms to native JavaScript alternative. For example, to implement a valid store, currently we need to throw a custom
KeyError
fromzarr
if a chunk is missing. This makes any store implementation requirezarr.js
as a dependency, and adds the addtional (confusing requirement) and that you also can't bundle that store as a separate dependency (instanceof myCustomStore.KeyError !== instanceof zarr.KeyError
) if I bundlemyCustomStore
independently. I'd like to have the ability to:A
KeyError
is a built in python error that's thrown when a key is missing from aMutableMapping
. In JavaScript, the equivalent is to just returnundefined
. This would remove a lot of thetry
/catch
blocks from the core implementation and instead we would just call:As an aside, I also feel that we should standardize store interfaces to be an extension of an ES6 Maps, but where are the functions are optionally async.
Final thoughts
My apologies for the length of this issue, but these are some thoughts that I've had after using
zarr.js
on the web for some time. I wanted to note these ideas and my experiments withzarrita.js
andzarr-lite
here so that we could maybe incorporate some of those lessons learned into a leaner, easier-to-maintain,zarr.js
. This is something I might have the time to work on in the future, but it would mean substantial changes to this repo in it's current state so I don't want to proceed with PRs until we've had a longer discussion.Excited to hear your thoughts.
The text was updated successfully, but these errors were encountered: