Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/boundary dataloader #90

Open
wants to merge 108 commits into
base: main
Choose a base branch
from

Conversation

sadamov
Copy link
Collaborator

@sadamov sadamov commented Nov 21, 2024

This PR introduces support for handling boundary data through an additional optional datastore. This allows forcing the model's interior state with boundary conditions during training and evaluation. This PR introduces boundary data handling up until WeatherDataset.__getitem__ which now returns 5 elements init_states, target_states, forcing, **boundary**, target_times (#84 will later implement boundary handling on the model side).
Currently boundary datastore must be of type mdp containing forcing category data. The code wa developed in a way that this requirement can be easily relaxed in the future.

Motivation and Context:

In weather modeling incorporating realistic boundary conditions is crucial for accurately simulating atmospheric processes. By providing an optional boundary datastore, the model gains flexibility to accept external boundary data, enhancing prediction accuracy and allowing for more complex simulations. The current workflow using n outmost gridcells as boundary will be fully replaced.
The creation of a new ERA5 WeatherBench MDP example datastore demonstrates the implementation of these features and serves as a template for future models.

Key Changes:

  • Introduced num_past/future_forcing_steps and num_past/future_boundary_steps as separate arguments and CLI flags.
  • Created a new ERA5 WeatherBench MDP example datastore: Developed an example datastore using ERA5 WeatherBench data within the Model-Data-Platform (MDP) framework.
  • Ensured num_grid_points also works for the forcing datastore (i.e. not containing state vars)
  • Combined time slicing for state and forcing and make it timestep-based. The same function now handles state forcing and boundary data and matched the corresponding samples based on actual time (using minimal time differences between state and forcing/boundary).
  • Updated global configuration to support two datastores: Modified the global configuration to accommodate both state and boundary/forcing datastores simultaneously.
  • Added temporal embedding of time_step_diffs: Introduced temporal embeddings by adding time step differences as additional input features to improve the model's temporal awareness.
  • Removed interior/exterior masks from the codebase: Eliminated the use of data masks to simplify the codebase. This is now handled by having two separate datastores
  • Accept boundary data in common_step but do not return it: Modified the common_step method to accept boundary data as input without returning it, preparing for future boundary/state implementation on the model side.
  • Moved the handling of ensemble members to the initialization of WeatherDataset to prevent duplicate time steps in the dataset.

Bugfix:

  • Fixed a bug in analysis time retrieval in the MEPS store: Corrected an issue with retrieving analysis_time from the MEPS data store where duplicate time steps for one member were returned.

Notes:

  • I have introduced strict assertion that time , analysis_time and elapsed_forecast_time all have a consistent step size. This is not strictly necessary but probably what most user want and it helps with the temporal encoding.

Introduced Dependencies

  • "gcsfs>=2021.10.0" Where should we download era5 from, do we want this dependency to Google Cloud Storage?

TODOs:

  • Upload MEPS example with dates close to Danra and next to each other for test/val/train @leifdenby This allows for testing both npy and mdp datastores with the same boundary datastore and limits the required data.

Type of change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📖 Documentation (Addition or improvements to documentation)

Checklist before requesting a review

  • My branch is up-to-date with the target branch - if not update your fork with the changes from the target branch (use pull with --rebase option if possible).
  • I have performed a self-review of my code
  • For any new/modified functions/classes I have added docstrings that clearly describe its purpose, expected inputs and returned values
  • I have placed in-line comments to clarify the intent of any hard-to-understand passages of my code
  • I have updated the README to cover introduced code changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have given the PR a name that clearly describes the change, written in imperative form (context).
  • I have requested a reviewer and an assignee (assignee is responsible for merging). This applies only if you have write access to the repo, otherwise feel free to tag a maintainer to add a reviewer and assignee.

Checklist for reviewers

Each PR comes with its own improvements and flaws. The reviewer should check the following:

  • the code is readable
  • the code is well tested
  • the code is documented (including return types and parameters)
  • the code is easy to maintain

Author checklist after completed review

  • I have added a line to the CHANGELOG describing this change, in a section
    reflecting type of change (add section where missing):
    • added: when you have added new functionality
    • changed: when default behaviour of the code has been changed
    • fixes: when your contribution fixes a bug

Checklist for assignee

  • PR is up to date with the base branch
  • the tests pass
  • author has added an entry to the changelog (and designated the change as added, changed or fixed)
  • Once the PR is ready to be merged, squash commits and merge the PR.

sadamov and others added 30 commits November 21, 2024 13:49
combine two slicing fcts into one
@sadamov sadamov marked this pull request as ready for review December 5, 2024 12:51
@joeloskarsson
Copy link
Collaborator

I think that if we merge main into here the tests will pass. They seem to have failed for the same reason as on main, which was fixed with #94.

Copy link
Member

@leifdenby leifdenby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First read through: this was joy to read 😊 I have a few comments, but in general I think this is very close to done.

neural_lam/config.py Show resolved Hide resolved
neural_lam/datastore/mdp.py Outdated Show resolved Hide resolved
neural_lam/weather_dataset.py Outdated Show resolved Hide resolved
neural_lam/weather_dataset.py Show resolved Hide resolved
neural_lam/weather_dataset.py Outdated Show resolved Hide resolved
def _process_windowed_data(self, da_windowed, da_state, da_target_times):
"""Helper function to process windowed data. This function stacks the
'forcing_feature' and 'window' dimensions and adds the time step
differences to the existing features as a temporal embedding.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it maybe make sense to split adding the temporal embedding in a separate function? To me this doesn't anything to do with the windowing, but maybe I am missing something. Something like _add_temporal_embedding(...)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is highly related. The temporal embedding (which we should probably not refer to as such, as this is not an embedding, so from here on: the delta times) is the time differences for the different entries in the window. After these have been stacked it is not clear to me how you would know what the delta-times are, since you've got rid of the window dimension.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes these are indeed highly related, since the time_deltas directly correspond to the windowing. As the method is concise, I suggest to keep it as is. With the comments it should be clear what is going on. My naming however is bad. I propose to rename time_diff_steps into time_deltas everywhere in weather_dataset.py, see 7e1a246

neural_lam/weather_dataset.py Outdated Show resolved Hide resolved
tests/test_datasets.py Outdated Show resolved Hide resolved
tests/test_time_slicing.py Show resolved Hide resolved
tests/test_time_slicing.py Show resolved Hide resolved
Comment on lines 180 to 187
check_time_overlap(
self.da_state,
self.da_boundary_forcing,
da1_is_forecast=self.datastore.is_forecast,
da2_is_forecast=self.datastore_boundary.is_forecast,
num_past_steps=self.num_past_boundary_steps,
num_future_steps=self.num_future_boundary_steps,
)
Copy link
Collaborator Author

@sadamov sadamov Dec 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Joel Oskarsson
I've been trying out loading data in neural-lam that I have created using the DANRA and ERA5 configs from https://github.com/leifdenby/mllam-data-prep/tree/feat/crop-with-other-dataset. In those configs both datasets cover exactly the same time period. Running neural-lam with num_past_boundary_steps > 0 raises this error https://github.com/sadamov/neural-lam/blob/f0a7046b7ce0c5c955762b67ebe017fdef4eae4e/neural_lam/weather_dataset.py#L207 (even with the fix #90 (comment)). This makes sense, as there is no boundary data before the first time point for the interior. However, an alternative behavior would be to adjust the first time point for the interior (e.g. ignoring the first time steps of the interior data, until data loading is possible for the set num_past_boundary_steps ), to still allow for training in this case. What's your thoughts on this? Now one has to always leave in some extra time steps in the boundary forcing, depending on num_past_boundary_steps , which has to be decided beforehand when the dataset is created. But maybe that makes more sense than allowing for having a different number of samples depending on num_past_boundary_steps ?


Simon Adamov:
true this is indeed by design.
tell the user to include sufficient time-steps and have a very clear requirement and warning
OR: expect most users to chose the same datetimes for boundary and forcing and give a warning how many time steps were removed from interior
As long as only a few timesteps are removed from interior, I think option 2 is nicer to handle. But no user would want half their training data to be ignored "quietly".
So, I think both options are fine, as long as we have clear warnings and error messages. Is 2 your preference?


Joel Oskarsson
That's good ideas. I agree that we should really avoid cutting away much of the training data without any proper warning. But cutting away a few timesteps is not really a problem (but could still show a warning explaining how much is being removed). Yea, option 2 is quite natural I think, since it would be nice to prepare the data by just specifying the exact same time period for the interior and boundary when building corresponding zarrs 😁 I'm thinking that this might be easy to achieve by "just" slicing away a few time steps in the interior xr.Dataarray used in WeatherDataset, but not I have not looked at any details of what such an implementation would look like.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to implement an additional stateless function called crop_time_if_needed in 7e5797e. This function gets called automatically before any time-slicing operations in WeatherDataset. I left all the existing checks and warning in the code, can't hurt. Added additional print statement about how many time steps were removed from the state data. Let me know what you think!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that struck me when looking at this again is that it would often actually be possible to create all the boundary forcing with num_past_boundary_steps > 0 even in cases when the interior and boundary covers exactly the same period. This because boundary forcing is really only needed from time step 3 of the interior data. The first two time steps of the interior data are only used as initial conditions. However, whether the needed time step is included in the data or not for num_past_boundary_steps=1 also depends on the length of time steps of interior data vs boundary data.

So overall I think handling this gets terribly overcomplicated, for just getting one or two extra samples. So much better to stick with this solution with crop_time_if_needed :)

@sadamov
Copy link
Collaborator Author

sadamov commented Dec 20, 2024

@leifdenby @joeloskarsson I have given this a thorough rework and implemented all your feedback from here and from slack. there are two threads unresolved, mostly for you to have a chance to intervene if not in agreement with my solution. Please give this your final review so we can merge soon. It is my understanding that everything up to the return of WeatherDataset should now be ready to run "realistic" experiments with neural-lam.

PS: locally the tests pass for me
image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants