Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace constants.py with data_config.yaml #31

Merged
merged 27 commits into from
May 22, 2024
Merged

Replace constants.py with data_config.yaml #31

merged 27 commits into from
May 22, 2024

Conversation

sadamov
Copy link
Collaborator

@sadamov sadamov commented May 14, 2024

Summary
This PR replaces the constants.py file with a data_config.yaml file. Dataset related settings can be defined by the user in the new yaml file. Training specific settings were added as additional flags to the train_model.py routine. All respective calls to the old files were replaced.

Rationale

  • Using a Yaml file for data config gives much more flexibility for various datasets used in the community. It also facilitates the future use of forcing and boundary datasets. In a follow-up PR the dataset paths will be defined in the yaml file, removing the dependency on a pre-structured /data folder.
  • It is best practice to define user input in a yaml file, the usage of python scripts for that purpose is not common.
  • The old constants.py actually combined both constants and variables, many "constants" should rather be flags to train_models.py
  • The introduction of a new ConfigClass in utils.py allows for very specific queries of the yaml and calculations based thereon. This branch shows future possibilities of such a class https://github.com/joeloskarsson/neural-lam/tree/feature_dataset_yaml

Testing
Both training and evaluation of the model were succesfully tested with the meps_example dataset.

Note
@leifdenby Could you invite Thomas R. to this repo, in case he wanted to give his input on the yaml file? This PR should mostly serve as a basis for discussion. Maybe we should add more information to the yaml file as you outline in https://github.com/mllam/mllam-data-prep. I think we should always keep in mind how the repository will look like with realistic boundary conditions and zarr-archives as data-input.

This PR solves parts of #23

@sadamov sadamov added the enhancement New feature or request label May 14, 2024
@sadamov sadamov requested a review from leifdenby May 14, 2024 15:31
@sadamov sadamov self-assigned this May 14, 2024
neural_lam/utils.py Outdated Show resolved Hide resolved
@leifdenby
Copy link
Member

From a quick glance this looks simply amazing @sadamov! Thanks for doing this work. I will give a thorough review later today/tomorrow. Just tagging @SimonKamuk to have a read and give your thoughts too. I've added @ThomasRieutord to the organisation too. I'll also send Thomas an email so that he definitely sees the PR.

Copy link
Member

@leifdenby leifdenby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like this work @sadamov! You've really caught all the bits here (which I'm impressed you've done considering we don't have any tests right now!)

I have just made a few comments/suggestions. Let me know what you think :)

README.md Outdated Show resolved Hide resolved
- wvint_entireAtmosphere_0_instant
- z_isobaricInhPa_1000_instant
- z_isobaricInhPa_500_instant
forcing_dim: 16
Copy link
Member

@leifdenby leifdenby May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does forcing_dim refer to? Is it number of forcing features? In that case maybe we should call this num_forcing_features instead? The current name implies to me that that "dimension 16" is used for forcing or that there are 16 forcing dimensions :)

Copy link
Member

@leifdenby leifdenby May 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the "forcing variables" be named too actually? We don't have to do this in this PR, but maybe we should consider that in future

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a future PR the forcings will be provided by a path to a zarr archive containing forcing features. Since in the current MEPS implementation the calculation of forcings is heavily integrated into the Dataset/Dataloader, I suggest to change the name to num_forcing_dim for now and implement the fundamental changes "naming forcing variables" once the zarr-based approach was merged into main. See https://github.com/mllam/neural-lam/tree/feature_dataset_yaml

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good to me! Happy to have the only change here be changed of name to num_forcing_dim

neural_lam/data_config.yaml Show resolved Hide resolved
neural_lam/utils.py Outdated Show resolved Hide resolved
neural_lam/utils.py Outdated Show resolved Hide resolved
neural_lam/weather_dataset.py Show resolved Hide resolved
neural_lam/models/ar_model.py Outdated Show resolved Hide resolved
neural_lam/models/ar_model.py Outdated Show resolved Hide resolved
neural_lam/data_config.yaml Outdated Show resolved Hide resolved
@sadamov sadamov requested a review from leifdenby May 21, 2024 10:25
@sadamov
Copy link
Collaborator Author

sadamov commented May 21, 2024

I implemented most requested changes in the latest commit and requested one more review. From my side we are clear to merge. The latest changes were again tested for model training and evaluation.

Copy link
Member

@leifdenby leifdenby May 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not saying we should do this now, but I learnt more about the Meteo-France work on neural-lam this morning and they make quite heavy use of python dataclasses for configuration storage and schema. This could be something to consider when we want to make the config content more explicit.

Copy link
Member

@leifdenby leifdenby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Thanks again @sadamov !

@leifdenby
Copy link
Member

Remember to update the changelog before you merge @sadamov! I think I would call this a new (very useful!) feature. A few things also change here (where constants are stored and thus how they are accessed in the code)

@sadamov sadamov merged commit 4a97a12 into main May 22, 2024
1 check passed
@sadamov sadamov deleted the feature_yaml branch May 22, 2024 08:22
@leifdenby
Copy link
Member

Hurraay! 🥳

parser.add_argument(
"--var_leads_metrics_watch",
type=dict,
default={},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sadamov Can you pass a dict as input on the command line? I could not figure out a way to use this option

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah no you can't, I'll make a short PR to fix three bugs that I introduced in this PR. One of them the dictionary here.

@joeloskarsson
Copy link
Collaborator

Wonderful job with this @sadamov!

joeloskarsson added a commit that referenced this pull request May 31, 2024
### Summary
#31 introduced three minor bugs
that are fixed with this PR:

- r"" strings are not required in units of `data_config.yaml`
- dictionaries cannot be passed as argsparse, rather JSON strings. This
bug is related to the flag `var_leads_metrics_watch`

---------

Co-authored-by: joeloskarsson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants