Summary
The weather community, and ECMWF in particular, has moved towards a single zarr archive that contains all the data for the state (domain) and a second one that contains all the data for the boundary. This project should follow the same approach. Zarr offers several advantages: parallel computing with dask, lazy loading with xarray, configurable chunking, and efficient compression with a choice of algorithms.
Specifics
The current pipeline has three main data-processing steps. This is a proposal for how the work would be split between them:
Data-Preprocessing
Conversion of a source format such as GRIB2 into zarr via xarray. This step is out of scope.
Pre-computation of forcings, static and grid features
Computation of normalization constants (stats) and inverse variances
Generating the boundary mask
PyTorch Dataset [on CPU]:
Reshaping of 3D variables into stacked 2D variables
Split data into train/val/test based on some indicator (e.g. time)
Generate the windowed indices for forcing and boundary
PyTorch Model [on GPU]:
Normalization of the data
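The steps listed above can be sketched in a few lines. This is only a rough illustration under stated assumptions, not the proposed implementation: it uses NumPy arrays as a stand-in for the zarr-backed data and for the tensors on the GPU, the array layout `(time, level, y, x)` is assumed, and all function names (`stack_levels`, `split_by_time`, `window_indices`, `compute_stats`, `normalize`) are hypothetical.

```python
import numpy as np


def stack_levels(x3d):
    """(time, level, y, x) -> (time, y * x, level): each level becomes one 2D feature."""
    t, lev, ny, nx = x3d.shape
    return x3d.reshape(t, lev, ny * nx).transpose(0, 2, 1)


def split_by_time(times, train_end, val_end):
    """Index arrays for train/val/test, using the timestamp as the split indicator."""
    idx = np.arange(len(times))
    train = idx[times < train_end]
    val = idx[(times >= train_end) & (times < val_end)]
    test = idx[times >= val_end]
    return train, val, test


def window_indices(n_times, n_past, n_future):
    """One row of consecutive time indices per sample (assumed sliding-window layout)."""
    length = n_past + n_future
    starts = np.arange(n_times - length + 1)
    return starts[:, None] + np.arange(length)[None, :]


def compute_stats(x, eps=1e-8):
    """Per-feature mean and std over all axes except the last (feature) axis."""
    axes = tuple(range(x.ndim - 1))
    return x.mean(axis=axes), x.std(axis=axes) + eps


def normalize(x, mean, std):
    """Standardize features; in the proposal this step would run in the model on the GPU."""
    return (x - mean) / std


# Synthetic example data: 10 daily time steps, 3 levels, 4 x 5 grid.
times = np.arange("2020-01-01", "2020-01-11", dtype="datetime64[D]")
rng = np.random.default_rng(0)
temperature = rng.normal(280.0, 5.0, size=(10, 3, 4, 5))

stacked = stack_levels(temperature)
train_idx, val_idx, test_idx = split_by_time(
    times, np.datetime64("2020-01-07"), np.datetime64("2020-01-09")
)
# Stats are computed on the training split only, then applied everywhere.
mean, std = compute_stats(stacked[train_idx])
normalized = normalize(stacked, mean, std)
windows = window_indices(len(times), n_past=1, n_future=2)
```

Computing the statistics on the training split only is a design choice of this sketch; it keeps validation and test data out of the normalization constants.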
Interfaces
Data-Preprocessing
Input: out of scope
Output: one or multiple zarr files
PyTorch Dataset [on CPU]:
Input: one or multiple zarr files
Output: 5 PyTorch tensors with the following dimensions:
Implementation
One example of such a PyTorch dataset and dataloader can be found here for inspiration: https://github.com/MeteoSwiss/neural-lam/blob/main/neural_lam/weather_dataset.py However, it needs quite a bit of work:
Draw IO