-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/boundary dataloader #90
base: main
Are you sure you want to change the base?
Conversation
combine two slicing fcts into one
I think that if we merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First read through: this was joy to read 😊 I have a few comments, but in general I think this is very close to done.
neural_lam/weather_dataset.py
Outdated
def _process_windowed_data(self, da_windowed, da_state, da_target_times): | ||
"""Helper function to process windowed data. This function stacks the | ||
'forcing_feature' and 'window' dimensions and adds the time step | ||
differences to the existing features as a temporal embedding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it maybe make sense to split adding the temporal embedding in a separate function? To me this doesn't anything to do with the windowing, but maybe I am missing something. Something like _add_temporal_embedding(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is highly related. The temporal embedding (which we should probably not refer to as such, as this is not an embedding, so from here on: the delta times) is the time differences for the different entries in the window. After these have been stacked it is not clear to me how you would know what the delta-times are, since you've got rid of the window dimension.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes these are indeed highly related, since the time_deltas
directly correspond to the windowing. As the method is concise, I suggest to keep it as is. With the comments it should be clear what is going on. My naming however is bad. I propose to rename time_diff_steps
into time_deltas
everywhere in weather_dataset.py
, see 7e1a246
Co-authored-by: Leif Denby <[email protected]>
…/neural-lam into feat/boundary_dataloader
neural_lam/weather_dataset.py
Outdated
check_time_overlap( | ||
self.da_state, | ||
self.da_boundary_forcing, | ||
da1_is_forecast=self.datastore.is_forecast, | ||
da2_is_forecast=self.datastore_boundary.is_forecast, | ||
num_past_steps=self.num_past_boundary_steps, | ||
num_future_steps=self.num_future_boundary_steps, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Joel Oskarsson
I've been trying out loading data in neural-lam that I have created using the DANRA and ERA5 configs from https://github.com/leifdenby/mllam-data-prep/tree/feat/crop-with-other-dataset. In those configs both datasets cover exactly the same time period. Running neural-lam with num_past_boundary_steps > 0 raises this error https://github.com/sadamov/neural-lam/blob/f0a7046b7ce0c5c955762b67ebe017fdef4eae4e/neural_lam/weather_dataset.py#L207 (even with the fix #90 (comment)). This makes sense, as there is no boundary data before the first time point for the interior. However, an alternative behavior would be to adjust the first time point for the interior (e.g. ignoring the first time steps of the interior data, until data loading is possible for the set num_past_boundary_steps ), to still allow for training in this case. What's your thoughts on this? Now one has to always leave in some extra time steps in the boundary forcing, depending on num_past_boundary_steps , which has to be decided beforehand when the dataset is created. But maybe that makes more sense than allowing for having a different number of samples depending on num_past_boundary_steps ?
Simon Adamov:
true this is indeed by design.
tell the user to include sufficient time-steps and have a very clear requirement and warning
OR: expect most users to chose the same datetimes for boundary and forcing and give a warning how many time steps were removed from interior
As long as only a few timesteps are removed from interior, I think option 2 is nicer to handle. But no user would want half their training data to be ignored "quietly".
So, I think both options are fine, as long as we have clear warnings and error messages. Is 2 your preference?
Joel Oskarsson
That's good ideas. I agree that we should really avoid cutting away much of the training data without any proper warning. But cutting away a few timesteps is not really a problem (but could still show a warning explaining how much is being removed). Yea, option 2 is quite natural I think, since it would be nice to prepare the data by just specifying the exact same time period for the interior and boundary when building corresponding zarrs 😁 I'm thinking that this might be easy to achieve by "just" slicing away a few time steps in the interior xr.Dataarray used in WeatherDataset, but not I have not looked at any details of what such an implementation would look like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to implement an additional stateless function called crop_time_if_needed
in 7e5797e. This function gets called automatically before any time-slicing operations in WeatherDataset. I left all the existing checks and warning in the code, can't hurt. Added additional print statement about how many time steps were removed from the state data. Let me know what you think!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that struck me when looking at this again is that it would often actually be possible to create all the boundary forcing with num_past_boundary_steps > 0 even in cases when the interior and boundary covers exactly the same period. This because boundary forcing is really only needed from time step 3 of the interior data. The first two time steps of the interior data are only used as initial conditions. However, whether the needed time step is included in the data or not for num_past_boundary_steps=1 also depends on the length of time steps of interior data vs boundary data.
So overall I think handling this gets terribly overcomplicated, for just getting one or two extra samples. So much better to stick with this solution with crop_time_if_needed
:)
@leifdenby @joeloskarsson I have given this a thorough rework and implemented all your feedback from here and from slack. there are two threads unresolved, mostly for you to have a chance to intervene if not in agreement with my solution. Please give this your final review so we can merge soon. It is my understanding that everything up to the return of WeatherDataset should now be ready to run "realistic" experiments with neural-lam. |
This PR introduces support for handling boundary data through an additional optional datastore. This allows forcing the model's interior state with boundary conditions during training and evaluation. This PR introduces boundary data handling up until
WeatherDataset.__getitem__
which now returns 5 elementsinit_states, target_states, forcing, **boundary**, target_times
(#84 will later implement boundary handling on the model side).Currently boundary datastore must be of type mdp containing forcing category data. The code wa developed in a way that this requirement can be easily relaxed in the future.
Motivation and Context:
In weather modeling incorporating realistic boundary conditions is crucial for accurately simulating atmospheric processes. By providing an optional boundary datastore, the model gains flexibility to accept external boundary data, enhancing prediction accuracy and allowing for more complex simulations. The current workflow using n outmost gridcells as boundary will be fully replaced.
The creation of a new ERA5 WeatherBench MDP example datastore demonstrates the implementation of these features and serves as a template for future models.
Key Changes:
num_past/future_forcing_steps
andnum_past/future_boundary_steps
as separate arguments and CLI flags.time
(using minimal time differences between state and forcing/boundary).Bugfix:
Notes:
time
,analysis_time
andelapsed_forecast_time
all have a consistent step size. This is not strictly necessary but probably what most user want and it helps with the temporal encoding.Introduced Dependencies
TODOs:
Type of change
Checklist before requesting a review
pull
with--rebase
option if possible).Checklist for reviewers
Each PR comes with its own improvements and flaws. The reviewer should check the following:
Author checklist after completed review
reflecting type of change (add section where missing):
Checklist for assignee