Replace constants.py with data_config.yaml #31
- user configs are retrieved either from data_config.yaml or they are set as flags to train_model.py
- Capabilities of ConfigLoader class extended
- lat lon specifications make the code more flexible
- Zarr is registered to model buffer
- Normalization happens on device on_after_batch_transfer
- test-case based on meps-example
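The commits above mention a ConfigLoader class that exposes the contents of data_config.yaml to the rest of the code. The following is only a minimal sketch of that idea, not the class from this PR: the key names (`dataset`, `var_names`, `num_forcing_dim`) and the method `num_data_vars` are hypothetical, and the dictionary `raw` stands in for what a YAML parser would return after reading the file.

```python
from typing import Any, Dict


class ConfigLoader:
    """Hypothetical sketch: attribute-style access to nested config values."""

    def __init__(self, values: Dict[str, Any]):
        self._values = values

    def __getattr__(self, name: str) -> Any:
        # Called only when normal attribute lookup fails, i.e. for config keys.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            value = self._values[name]
        except KeyError:
            raise AttributeError(f"config has no key {name!r}") from None
        # Wrap nested mappings so that chained access keeps working.
        return ConfigLoader(value) if isinstance(value, dict) else value

    def num_data_vars(self) -> int:
        """Example of a value derived from the config rather than stored in it."""
        return len(self.dataset.var_names)


# Stand-in for the parsed YAML file; all keys are illustrative assumptions.
raw = {
    "dataset": {
        "name": "meps_example",
        "var_names": [
            "wvint_entireAtmosphere_0_instant",
            "z_isobaricInhPa_1000_instant",
            "z_isobaricInhPa_500_instant",
        ],
        "num_forcing_dim": 16,
    }
}

config = ConfigLoader(raw)
print(config.dataset.name, config.num_data_vars(), config.dataset.num_forcing_dim)
```

Keeping derived quantities such as the number of variables in methods means the YAML file only has to state each fact once.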
From a quick glance this looks simply amazing @sadamov! Thanks for doing this work. I will give a thorough review later today/tomorrow. Just tagging @SimonKamuk to have a read and give your thoughts too. I've added @ThomasRieutord to the organisation too. I'll also send Thomas an email so that he definitely sees the PR.
I really like this work @sadamov! You've really caught all the bits here (which I'm impressed you've done considering we don't have any tests right now!)
I have just made a few comments/suggestions. Let me know what you think :)
neural_lam/data_config.yaml
- wvint_entireAtmosphere_0_instant
- z_isobaricInhPa_1000_instant
- z_isobaricInhPa_500_instant
forcing_dim: 16
What does `forcing_dim` refer to? Is it the number of forcing features? In that case maybe we should call this `num_forcing_features` instead? The current name implies to me that "dimension 16" is used for forcing or that there are 16 forcing dimensions :)
Shouldn't the "forcing variables" be named too actually? We don't have to do this in this PR, but maybe we should consider that in future
In a future PR the forcings will be provided by a path to a zarr archive containing forcing features. Since in the current MEPS implementation the calculation of forcings is heavily integrated into the Dataset/Dataloader, I suggest changing the name to `num_forcing_dim` for now and implementing the fundamental change of naming the forcing variables once the zarr-based approach has been merged into main. See https://github.com/mllam/neural-lam/tree/feature_dataset_yaml
Sounds good to me! Happy to have the only change here be the rename to `num_forcing_dim`.
I implemented most of the requested changes in the latest commit and requested one more review. From my side we are clear to merge. The latest changes were again tested for model training and evaluation.
neural_lam/config.py
I'm not saying we should do this now, but I learnt more about the Meteo-France work on neural-lam this morning and they make quite heavy use of python dataclasses for configuration storage and schema. This could be something to consider when we want to make the config content more explicit.
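To illustrate what a dataclass-based configuration schema could look like, here is a minimal sketch. It is not taken from the Meteo-France code or from this PR; the class name `DatasetConfig`, its fields, and the helper `dataset_config_from_dict` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass(frozen=True)
class DatasetConfig:
    """Hypothetical schema; the field names are illustrative assumptions."""

    name: str
    var_names: List[str] = field(default_factory=list)
    num_forcing_dim: int = 0

    def __post_init__(self) -> None:
        # Basic validation runs once, when the config object is created.
        if self.num_forcing_dim < 0:
            raise ValueError("num_forcing_dim must be non-negative")


def dataset_config_from_dict(values: Dict[str, Any]) -> DatasetConfig:
    # Unknown or misspelled keys raise TypeError here instead of failing later.
    return DatasetConfig(**values)


cfg = dataset_config_from_dict(
    {
        "name": "meps_example",
        "var_names": ["z_isobaricInhPa_1000_instant", "z_isobaricInhPa_500_instant"],
        "num_forcing_dim": 16,
    }
)
print(cfg)
```

With such a schema, the expected keys and their types are documented in one place, and a typo in the YAML file surfaces as an error when the config is loaded.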
Looks great! Thanks again @sadamov!
Hurraay! 🥳
parser.add_argument(
    "--var_leads_metrics_watch",
    type=dict,
    default={},
@sadamov Can you pass a dict as input on the command line? I could not figure out a way to use this option
Ah no you can't, I'll make a short PR to fix three bugs that I introduced in this PR. One of them is the dictionary here.
Wonderful job with this @sadamov!
### Summary

#31 introduced three minor bugs that are fixed with this PR:

- r"" strings are not required in units of `data_config.yaml`
- dictionaries cannot be passed as argparse arguments; JSON strings are used instead. This bug is related to the flag `var_leads_metrics_watch`

---------

Co-authored-by: joeloskarsson <[email protected]>
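The dictionary problem arises because argparse calls the `type` callable on the raw command-line string, and `dict()` cannot build a dictionary from such a string. The sketch below shows one way to accept a JSON string instead. It is not necessarily how the follow-up PR implements the fix; the helper `json_dict` and the example value are hypothetical.

```python
import argparse
import json


def json_dict(text: str) -> dict:
    """Parse a JSON object given on the command line into a dict."""
    try:
        value = json.loads(text)
    except json.JSONDecodeError as err:
        raise argparse.ArgumentTypeError(f"invalid JSON: {err}") from err
    if not isinstance(value, dict):
        raise argparse.ArgumentTypeError("expected a JSON object")
    return value


parser = argparse.ArgumentParser()
parser.add_argument(
    "--var_leads_metrics_watch",
    type=json_dict,  # converter is applied to the raw string argument
    default={},
    help="JSON object given as a string (example value below is illustrative)",
)

# Equivalent to passing: --var_leads_metrics_watch '{"1": [1, 2]}'
args = parser.parse_args(["--var_leads_metrics_watch", '{"1": [1, 2]}'])
print(args.var_leads_metrics_watch)
```

Note that JSON object keys are always strings, so any integer keys the code expects would need to be converted after parsing.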
Summary
This PR replaces the `constants.py` file with a `data_config.yaml` file. Dataset-related settings can be defined by the user in the new yaml file. Training-specific settings were added as additional flags to the `train_model.py` routine. All respective calls to the old files were replaced.
Rationale
- `/data` folder.
- `constants.py` actually combined both constants and variables; many "constants" should rather be flags to `train_model.py`.
- `utils.py` allows for very specific queries of the yaml and calculations based thereon. This branch shows future possibilities of such a class: https://github.com/joeloskarsson/neural-lam/tree/feature_dataset_yaml
Testing
Both training and evaluation of the model were successfully tested with the `meps_example` dataset.
Note
@leifdenby Could you invite Thomas R. to this repo, in case he wants to give his input on the yaml file? This PR should mostly serve as a basis for discussion. Maybe we should add more information to the yaml file as you outline in https://github.com/mllam/mllam-data-prep. I think we should always keep in mind what the repository will look like with realistic boundary conditions and zarr archives as data input.
This PR solves parts of #23