
Pytorch Dataset Class that Reads From Zarr Archive #24

Closed
sadamov opened this issue May 3, 2024 · 1 comment · Fixed by #66

sadamov (Collaborator) commented May 3, 2024

Summary
Since the weather community, and especially ECMWF, has moved towards a single zarr archive that contains all the data in the state (domain) and another that contains all the data in the boundary, this project should follow the same approach. Zarr has many advantages: parallel computing with dask, lazy loading with xarray, and efficient compression with different algorithms and chunking.

Specifics
There are three main data-processing steps in the current pipeline. This is a proposal for how the work would be split between them:

  • Data-Preprocessing
    • Usually some format like GRIB2 is converted into xarray->zarr. This step is out of scope
    • Pre-computation of forcings, static and grid features
    • Computation of normalization constants (stats) and inverse variances (a minimal sketch follows this list)
    • Generating the boundary mask
  • Pytorch Dataset [on CPU]:
    • Reshaping of 3D variables into stacked 2D variables
    • Split data into train/val/test based on some indicator (e.g. time)
    • Generate the windowed indices for forcing and boundary
  • Pytorch Model [on GPU]
    • Normalization of the data
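
A minimal sketch of the stats computation mentioned above, assuming a hypothetical state.zarr archive with time and grid dimensions (all paths and variable names are illustrative, not part of the proposal):

```python
import xarray as xr

# Hypothetical archive and dimension names, for illustration only.
state = xr.open_zarr("state.zarr")  # lazily opens the archive

# Per-variable mean and standard deviation over time and grid,
# computed lazily and in parallel by dask.
mean = state.mean(dim=["time", "grid"])
std = state.std(dim=["time", "grid"])

# Inverse variances, e.g. as per-variable loss weights.
inv_var = 1.0 / std**2

# Persist the stats as a small zarr store next to the data.
stats = xr.merge([
    mean.rename({v: f"{v}_mean" for v in mean.data_vars}),
    std.rename({v: f"{v}_std" for v in std.data_vars}),
    inv_var.rename({v: f"{v}_inv_var" for v in inv_var.data_vars}),
])
stats.to_zarr("normalization_stats.zarr", mode="w")
```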

Interfaces

  • Data-Preprocessing
    • Input: out of scope
    • Output: one or multiple zarr files
  • Pytorch Dataset [on CPU]:
    • Input: one or multiple zarr files
    • Output: 5 pytorch tensors with the following dimensions (see the sketch after this list):
      init_states: (2, N_grid, features_dim)
      target_states: (n_lead_times, N_grid, features_dim)
      forcing: (n_lead_times, N_grid, forcing_windowed_dim) # window_steps * n_forcing
      boundary: (n_lead_times, N_grid, boundary_windowed_dim) # window_steps * n_boundary
      batch_times: (2 + n_lead_times)[str]
  • Pytorch Model [on GPU]
    • Input: 5 pytorch tensors (batched with Pytorch DataLoader)
    • Output: out of scope
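
To make the tensor contract above concrete, here is a minimal sketch of such a dataset. The zarr layouts (time/grid/feature dimensions), the variable names ("state", "forcing", "boundary") and the exact window alignment are assumptions for illustration, not fixed by this proposal:

```python
import torch
import xarray as xr


class WeatherDataset(torch.utils.data.Dataset):
    """Sketch of a zarr-backed dataset producing the five tensors above."""

    def __init__(self, state_zarr, boundary_zarr, n_lead_times, window_steps):
        self.state = xr.open_zarr(state_zarr)        # dims: (time, grid, feature)
        self.boundary = xr.open_zarr(boundary_zarr)  # dims: (time, grid, feature)
        self.n_lead_times = n_lead_times
        self.window_steps = window_steps
        self.sample_len = 2 + n_lead_times  # 2 init states + n_lead_times targets

    def __len__(self):
        # Leave room for the targets and the trailing forcing/boundary window.
        return self.state.sizes["time"] - self.n_lead_times - self.window_steps

    def _window(self, da, idx):
        # One window of window_steps consecutive steps per lead time, stacked
        # along the feature dim: (n_lead_times, N_grid, window_steps * n_vars)
        windows = []
        for t in range(self.n_lead_times):
            w = torch.tensor(
                da.isel(time=slice(idx + 2 + t, idx + 2 + t + self.window_steps)).values
            )  # (window_steps, N_grid, n_vars)
            windows.append(torch.cat(list(w), dim=-1))
        return torch.stack(windows)

    def __getitem__(self, idx):
        # Lazy slice; values are only read from disk on .values access.
        sample = self.state["state"].isel(time=slice(idx, idx + self.sample_len))
        states = torch.tensor(sample.values)  # (2 + n_lead_times, N_grid, features_dim)

        init_states = states[:2]    # (2, N_grid, features_dim)
        target_states = states[2:]  # (n_lead_times, N_grid, features_dim)
        forcing = self._window(self.state["forcing"], idx)
        boundary = self._window(self.boundary["boundary"], idx)
        batch_times = [str(t) for t in sample["time"].values]  # 2 + n_lead_times

        return init_states, target_states, forcing, boundary, batch_times
```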

Implementation
One example of such a pytorch dataset and dataloader can be found here for inspiration: https://github.com/MeteoSwiss/neural-lam/blob/main/neural_lam/weather_dataset.py It needs quite a bit of work, however.
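
For illustration, wrapping the dataset sketched above in a standard DataLoader could look like this (paths and hyperparameters are placeholders):

```python
from torch.utils.data import DataLoader

# Hypothetical usage; the train split would come from the time-based
# train/val/test indicator mentioned in the Specifics section.
train_set = WeatherDataset("state_train.zarr", "boundary_train.zarr",
                           n_lead_times=19, window_steps=3)
train_loader = DataLoader(train_set, batch_size=4, shuffle=True,
                          num_workers=8)  # zarr chunks are read in the workers

init_states, target_states, forcing, boundary, batch_times = next(iter(train_loader))
# init_states: (4, 2, N_grid, features_dim) -- the DataLoader prepends the batch dim
```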

[draw.io diagram: dataset]

sadamov added the enhancement (New feature or request) label May 3, 2024
sadamov self-assigned this May 3, 2024
joeloskarsson (Collaborator) commented

This could very nicely use https://lightning.ai/docs/pytorch/stable/common/lightning_module.html#on-after-batch-transfer to normalize the data once it is on the GPU. Using the hook makes sure that you never forget about it (all batches on the GPU are normalized).
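
A minimal sketch of that hook, assuming the five-tensor batch from above and mean/std stats computed in preprocessing (the module and buffer names are placeholders):

```python
import pytorch_lightning as pl


class ForecastModule(pl.LightningModule):
    """Sketch of GPU-side normalization via on_after_batch_transfer."""

    def __init__(self, data_mean, data_std):
        super().__init__()
        # Registered as buffers so the stats move to the GPU with the module.
        self.register_buffer("data_mean", data_mean)
        self.register_buffer("data_std", data_std)

    def on_after_batch_transfer(self, batch, dataloader_idx):
        # Runs after the batch has been moved to the device, so every
        # batch the model sees is guaranteed to be normalized.
        init_states, target_states, forcing, boundary, batch_times = batch
        init_states = (init_states - self.data_mean) / self.data_std
        target_states = (target_states - self.data_mean) / self.data_std
        return init_states, target_states, forcing, boundary, batch_times
```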
