Add "datastores" to represent input data from zarr, npy, etc #66
Okay the remaining bug in |
commit 2cc617e
Author: Joel Oskarsson <[email protected]>
Date: Mon Nov 18 08:35:03 2024 +0100

Add weights_only=True to all torch.load calls (mllam#86)

## Describe your changes

Currently running neural-lam with the latest version of pytorch gives a warning:

```
FutureWarning: You are using torch.load with weights_only=False (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for weights_only will be flipped to True. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via torch.serialization.add_safe_globals. We recommend you start setting weights_only=True for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
```

As we only use `torch.load` to load tensors and lists, we can just set `weights_only=True` and get rid of this warning (and increase security I suppose).

## Issue Link

None

## Type of change

- [x] 🐛 Bug fix (non-breaking change that fixes an issue)
- [ ] ✨ New feature (non-breaking change that adds functionality)
- [ ] 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] 📖 Documentation (Addition or improvements to documentation)

## Checklist before requesting a review

- [x] My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use `pull` with `--rebase` option if possible).
- [x] I have performed a self-review of my code
- [x] For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
- [x] I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
- [x] I have updated the [README](README.MD) to cover introduced code changes
- [ ] I have added tests that prove my fix is effective or that my feature works
- [x] I have given the PR a name that clearly describes the change, written in imperative form ([context](https://www.gitkraken.com/learn/git/best-practices/git-commit-message#using-imperative-verb-form)).
- [x] I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

## Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

- [x] the code is readable
- [ ] the code is well tested
- [x] the code is documented (including return types and parameters)
- [x] the code is easy to maintain

## Author checklist after completed review

- [ ] I have added a line to the CHANGELOG describing this change, in a section reflecting type of change (add section where missing):
  - *added*: when you have added new functionality
  - *changed*: when default behaviour of the code has been changed
  - *fixes*: when your contribution fixes a bug

## Checklist for assignee

- [ ] PR is up to date with the base branch
- [ ] the tests pass
- [ ] author has added an entry to the changelog (and designated the change as *added*, *changed* or *fixed*)
- Once the PR is ready to be merged, squash commits and merge the PR.
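The warning quoted in the commit message above is about arbitrary code execution during unpickling. As a rough illustration of the allowlisting idea behind `weights_only=True`, the sketch below uses only the standard-library `pickle` module with a restricted `Unpickler`. It is not how PyTorch implements the option, and the names `SafeUnpickler`, `ALLOWED` and `safe_loads` are made up for this example.

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist: the only globals this sketch agrees to reconstruct.
ALLOWED = {("collections", "OrderedDict")}


class SafeUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not on the allowlist."""

    def find_class(self, module, name):
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def safe_loads(data: bytes):
    """Deserialize bytes while rejecting non-allowlisted globals."""
    return SafeUnpickler(io.BytesIO(data)).load()


# Plain containers and allowlisted classes load fine.
payload = pickle.dumps(OrderedDict(weights=[1.0, 2.0]))
restored = safe_loads(payload)

# A pickle that references an arbitrary callable is rejected.
try:
    safe_loads(pickle.dumps(print))
    rejected = False
except pickle.UnpicklingError:
    rejected = True
```

Since neural-lam only loads tensors and lists with `torch.load`, the stricter mode should not need any extra allowlisting, which matches the reasoning in the commit message.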
Do we know exactly why the tests are not passing here? From the comments above I thought @sadamov's fixes made all tests pass, but they still look red here on GH. Maybe I have missed something? Looking at the logs I see some "Process completed with exit code 137.", which usually means the process was killed for running out of memory. Is that the issue?
I just remembered that I had to locally fix MDP: https://github.com/mllam/mllam-data-prep/blob/8e7a5bc63a1ae1235b82b1f702c00eb33e891a79/mllam_data_prep/config.py#L306 where I added this line: @leifdenby when will you release v0.3.0 of MDP, I think these issues will be fixed there? I don't know about the memory issue, but I cannot run the |
The tests not passing partially relate to MDP, but there seems to also be an OOM issue. Will investigate.
And we're green again 🟢 🥳
Alright, this is pretty much good to go now! Only waiting for the MDP compatibility to hit merge. I am happy with everything else.
OMG! That makes me happy. With tests running on CPU and GPU. AMAZING! Ok, the good news is @observingClouds and I have decided how to add the projection info to the datastore config. We are going to go with the approach I already implemented where we use this |
Since I am off for a few days I'm gonna change to approve here, so you can go ahead and hit merge on this once #66 (comment) is sorted.
Ok, this is the big moment! All the tests have passed and approvals from both @joeloskarsson and @sadamov! Finally merging after 4 months of work! 🥳 I am merging! 🚀
@leifdenby This is just awesome! So happy with the result. You really introduced a very nice and clean structure to the data-pipeline. Biggest PR of my life, I learned a lot about Python classes along the way and thoroughly enjoyed working with all of you here on this PR ❤️
Fix bugs in recently introduced datastore functionality #66 (error in calculation in `BaseDatastore.get_xy_extent()` and overlooked in-place modification of config dict in `MDPDatastore.coords_projection`), and also fix issue in `ARModel.plot_examples` by using newly introduced (#66) `WeatherDataset.create_dataarray_from_tensor()` to create `xr.DataArray` from prediction tensor and calling plot methods directly on `xr.DataArray` rather than using bare numpy arrays with `matplotlib`.
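The second bug mentioned above, an in-place modification of a config dict inside a property, is easy to reproduce in isolation. The sketch below is a hypothetical stand-in and not the actual `MDPDatastore.coords_projection` code: the class names, the `"class"` key and the config layout are all assumptions made for illustration.

```python
import copy


class BuggyStore:
    """Property pops from the stored config, so it only works once."""

    def __init__(self, projection_config: dict):
        self._projection_config = projection_config

    @property
    def coords_projection(self) -> tuple:
        cfg = self._projection_config  # same dict object, not a copy
        name = cfg.pop("class")  # mutates the stored config in place
        return name, cfg


class FixedStore(BuggyStore):
    """Same property, but operating on a deep copy of the config."""

    @property
    def coords_projection(self) -> tuple:
        cfg = copy.deepcopy(self._projection_config)
        name = cfg.pop("class")  # only the copy is modified
        return name, cfg


# Hypothetical projection config.
config = {"class": "LambertConformal", "kwargs": {"central_longitude": 15.0}}

fixed = FixedStore(copy.deepcopy(config))
first = fixed.coords_projection
second = fixed.coords_projection  # repeated access gives the same result

buggy = BuggyStore(copy.deepcopy(config))
buggy.coords_projection
try:
    buggy.coords_projection  # "class" was popped by the first access
    second_access_failed = False
except KeyError:
    second_access_failed = True
```

Copying before mutating keeps the property free of side effects, so it can be read any number of times.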
Describe your changes
This PR builds on #54 (which introduces zarr-based training data) by splitting the `Config`-class introduced in #54 to separately represent the configuration for what data to load from the functions to load data (the latter is what I call a "datastore"). In doing this I have also introduced a general interface through an abstract base class `BaseDatastore` with a set of functions that are called in the rest of `neural-lam` which provide data for training/validation/test and information about this data (see #58 for my overview of the methods that #54 uses to load data).

The motivation for this work is to allow for a clear separation between how data is loaded into neural-lam and how training/validation/test samples are created from that data. Creating the interface between these two steps makes it clear what is expected to be provided when people want to add new data-sources to neural-lam.
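To make the separation concrete, here is a minimal sketch of what such an interface could look like using Python's `abc` module. It is not the `BaseDatastore` from this PR: the class names, the method names `get_feature_names` and `get_values`, and the toy data are assumptions chosen for illustration only.

```python
from abc import ABC, abstractmethod


class SketchDatastore(ABC):
    """Hypothetical stand-in for an abstract datastore interface."""

    @abstractmethod
    def get_feature_names(self, category: str) -> list[str]:
        """Return feature names for 'state', 'forcing' or 'static'."""

    @abstractmethod
    def get_values(self, category: str, split: str) -> list[list[float]]:
        """Return rows of values (time x feature) for one split."""


class InMemoryDatastore(SketchDatastore):
    """Toy data source; a real one might read zarr or .npy files."""

    def __init__(self, data: dict):
        self._data = data

    def get_feature_names(self, category: str) -> list[str]:
        return list(self._data[category]["features"])

    def get_values(self, category: str, split: str) -> list[list[float]]:
        return self._data[category][split]


def describe(datastore: SketchDatastore, category: str, split: str) -> str:
    """Downstream code depends only on the interface, not on the source."""
    names = datastore.get_feature_names(category)
    n_times = len(datastore.get_values(category, split))
    return f"{category}/{split}: {n_times} time steps, features={names}"


store = InMemoryDatastore(
    {"state": {"features": ["t2m", "u10"], "train": [[280.0, 3.1], [281.5, 2.7]]}}
)
summary = describe(store, "state", "train")

# The abstract class itself cannot be instantiated.
try:
    SketchDatastore()
    instantiable = True
except TypeError:
    instantiable = False
```

Anything that samples training data only needs the abstract methods, so a new data source means writing one new subclass and leaving the sampling code alone.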
In the text below I am trying to use the same nomenclature that @sadamov introduced, namely:

- […] `state`, `forcing` or `static` data.
- […] `grid_index` coordinate, levels and variables into a `{category}_feature` coordinate (i.e. these are operations that […]
- […] `np.ndarray` and `xr.Dataset`/`xr.DataArray` objects
- […] `torch.Tensor` objects

To support both the multizarr config format that @sadamov introduced in #54, the old npy-files and also data transformed with mllam-data-prep I have currently implemented the following three datastore classes:
- `neural_lam.datastore.NpyDataStore`: reads data from .npy-files in the format introduced in neural-lam v0.1.0. This uses `dask.delayed` so no array content is read until it is used.
- `neural_lam.datastore.MultizarrDatastore`: can combine multiple zarr files during train/val/test sampling, with the transformations to facilitate this implemented within `neural_lam.datastore.MultizarrDatastore`. (Removed, as we decided `MDPDatastore` was enough.)
- `neural_lam.datastore.MDPDatastore`: can combine multiple zarr datasets either as a preprocessing step or during sampling, but offloads the implementation of the transformations to the mllam-data-prep package.

Each of these inherits from `BaseCartesianDatastore`, which itself inherits from `BaseDatastore`. I have added this last layer of indirection to make it easier for non-gridded data to be used in `neural-lam` in future.

Testing:
- […] `create_graph`, `create_normalization` commands etc so that they can be called not just from the command line.

Caveats:
- […] `grid` to `grid_index`. I think it is ambiguous what "grid" refers to, since that could be the grid itself as well as the grid-index as it was used.
- […] `.variable` as a variable name for an `xr.DataArray`, because `xr.DataArray.variable` is a reserved attribute for data-arrays
- […] `# target_states: (ar_steps-2, N_grid, d_features)` in `WeatherDataset.__getitem__` is incorrect @sadamov, or at least my understanding of what `ar_steps` represents is different. I expect the target states to have exactly `ar_steps` in them, rather than `ar_steps-2`. Or said another way, what would otherwise happen if `ar_steps == 0`?

Things I am unsure about:
- […] `DataLoader(…, multiprocessing_context="spawn")` as suggested through "RuntimeError: This class is not fork-safe" (fsspec/filesystem_spec#755) and https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader, but not sure if we should do this or always use local zarr datasets rather than open from s3?

On whether something should be in `BaseDatastore` vs `WeatherDataset`:

- […] `WeatherDataset` because it doesn't apply to "state" category for example
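The `ar_steps` caveat raised earlier can be illustrated with a small, dependency-free sketch of slicing one training sample out of a time series. It assumes two initial states followed by exactly `ar_steps` target states, which is the reading of `ar_steps` argued for in that caveat. The function name `split_sample` and the constant `N_INIT` are hypothetical and not taken from `WeatherDataset`.

```python
N_INIT = 2  # assumed number of initial states per sample


def split_sample(series: list, start: int, ar_steps: int) -> tuple:
    """Return (init_states, target_states) taken from `series` at `start`."""
    if ar_steps < 1:
        raise ValueError("ar_steps must be at least 1")
    end = start + N_INIT + ar_steps
    if end > len(series):
        raise IndexError("not enough time steps for this sample")
    window = series[start:end]
    return window[:N_INIT], window[N_INIT:]


time_steps = list(range(10))  # stand-in for per-time-step state arrays
init_states, target_states = split_sample(time_steps, start=3, ar_steps=4)
```

Under this convention the number of target states always equals `ar_steps`, and a value of 0 is rejected explicitly instead of producing a negative length.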