warmup config for reproducibility of aifs v0.3 (ecmwf#155)
* warmup config for reproducibility of aifs v0.3

* add entry to changelog

* update docs
anaprietonem authored Nov 25, 2024
1 parent e3fe023 commit 0608f21
Showing 4 changed files with 15 additions and 6 deletions.
2 changes: 2 additions & 0 deletions CHANGELOG.md
@@ -14,6 +14,8 @@ Keep it human-readable, your future self will thank you!
Fixed bug in power spectra plotting for the n320 resolution.

### Added
+- Introduce variable to configure (Cosine Annealing) optimizer warm up [#155](https://github.com/ecmwf/anemoi-training/pull/155)
+

- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)

15 changes: 10 additions & 5 deletions docs/user-guide/training.rst
@@ -188,10 +188,11 @@ level has a weighting less than 0.2).
***************

Anemoi training uses the ``CosineLRScheduler`` from timm as its
-learning rate scheduler. The user can configure the maximum learning
-rate by setting ``config.training.lr.rate``. Note that this learning
-rate is scaled by the number of GPUs where for the `data parallelism
-<distributed>`_.
+learning rate scheduler. Docs for this scheduler can be found at
+https://github.com/huggingface/pytorch-image-models/blob/main/timm/scheduler/cosine_lr.py
+The user can configure the maximum learning rate by setting
+``config.training.lr.rate``. Note that this learning rate is scaled by
+the number of GPUs used for `data parallelism <distributed>`_.

.. code:: yaml
@@ -201,7 +202,11 @@ The user can also control the rate
by setting the total number of iterations through
``config.training.lr.iterations`` and the minimum learning rate reached
through ``config.training.lr.min``. Note that the minimum learning rate
-is not scaled by the number of GPUs.
+is not scaled by the number of GPUs. The user can also control the
+warmup period by setting ``config.training.lr.warmup_t``. If the warmup
+period is set to 0, the learning rate will start at the maximum learning
+rate. If no warmup period is defined, a default warmup period of 1000
+iterations is used.
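
An editorial sketch (not part of this commit) of how the settings described above map onto timm's ``CosineLRScheduler``. The numeric values mirror the default config further down in this diff; ``t_initial=300_000`` is an assumed stand-in for ``${training.max_steps}``.

.. code:: python

   # Sketch only: probe the schedule produced by warmup_t / lr_min / t_initial.
   import torch
   from timm.scheduler.cosine_lr import CosineLRScheduler

   model = torch.nn.Linear(8, 8)
   optimizer = torch.optim.AdamW(model.parameters(), lr=0.625e-4)  # config.training.lr.rate

   scheduler = CosineLRScheduler(
       optimizer,
       t_initial=300_000,  # config.training.lr.iterations (assumed value)
       lr_min=3e-7,        # config.training.lr.min
       warmup_t=1000,      # config.training.lr.warmup_t; 0 starts at the maximum rate immediately
   )

   # timm schedulers are driven with an explicit schedule index, so the schedule
   # can be probed directly: the LR ramps up linearly over the first warmup_t
   # steps, then follows a cosine decay towards lr_min.
   for step in (0, 250, 500, 1000, 150_000, 300_000):
       scheduler.step(step)
       print(step, optimizer.param_groups[0]["lr"])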

*********
Rollout
1 change: 1 addition & 0 deletions src/anemoi/training/config/training/default.yaml
@@ -83,6 +83,7 @@ lr:
  rate: 0.625e-4 #local_lr
  iterations: ${training.max_steps} # NOTE: When max_epochs < max_steps, scheduler will run for max_steps
  min: 3e-7 #Not scaled by #GPU
+  warmup_t: 1000

# Changes in per-gpu batch_size should come with a rescaling of the local_lr
# in order to keep a constant global_lr
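
A back-of-the-envelope editorial sketch (not part of this commit) of the local/global learning-rate relationship referred to in the comment above. The hardware numbers are invented, and the scaling formula follows the (only partially visible) ``self.lr`` computation in ``forecaster.py`` below.

.. code:: python

   # Hypothetical hardware setup, for illustration only.
   num_gpus_per_node = 4
   num_nodes = 16
   num_gpus_per_model = 1       # model-parallel group size
   local_lr = 0.625e-4          # config.training.lr.rate

   # Effective learning rate seen by the optimizer (cf. the self.lr expression
   # in forecaster.py below): it grows with the number of GPUs available for
   # data parallelism and shrinks with the model-parallel group size.
   global_lr = num_gpus_per_node * num_nodes * local_lr / num_gpus_per_model
   print(global_lr)  # 0.004 for this hypothetical setup

   # Per the comment above: if the per-GPU batch size is changed, local_lr
   # should be rescaled as well (a linear scaling in batch size is the usual
   # convention) so that the effective global learning rate stays appropriate.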
3 changes: 2 additions & 1 deletion src/anemoi/training/train/forecaster.py
@@ -127,6 +127,7 @@ def __init__(
            * config.training.lr.rate
            / config.hardware.num_gpus_per_model
        )
+        self.warmup_t = getattr(config.training.lr, "warmup_t", 1000)
        self.lr_iterations = config.training.lr.iterations
        self.lr_min = config.training.lr.min
        self.rollout = config.training.rollout.start
@@ -638,6 +639,6 @@ def configure_optimizers(self) -> tuple[list[torch.optim.Optimizer], list[dict]]
            optimizer,
            lr_min=self.lr_min,
            t_initial=self.lr_iterations,
-            warmup_t=1000,
+            warmup_t=self.warmup_t,
        )
        return [optimizer], [{"scheduler": scheduler, "interval": "step"}]
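
For context, a self-contained editorial sketch (not part of this commit) of the overall pattern visible in ``forecaster.py``: a LightningModule whose ``configure_optimizers`` builds the timm cosine scheduler with a configurable ``warmup_t`` and advances it every optimizer step. The module, optimizer choice and default values here are assumptions for illustration; only the scheduler wiring mirrors the diff above.

.. code:: python

   # Sketch only; assumes torch, timm and a recent pytorch_lightning are installed.
   import pytorch_lightning as pl
   import torch
   from timm.scheduler.cosine_lr import CosineLRScheduler


   class TinyForecaster(pl.LightningModule):
       def __init__(self, lr=0.625e-4, lr_min=3e-7, lr_iterations=300_000, warmup_t=1000):
           super().__init__()
           self.layer = torch.nn.Linear(4, 4)
           self.lr, self.lr_min = lr, lr_min
           # Mirrors getattr(config.training.lr, "warmup_t", 1000) in the diff above.
           self.lr_iterations, self.warmup_t = lr_iterations, warmup_t

       def training_step(self, batch, batch_idx):
           return self.layer(batch).sum()

       def lr_scheduler_step(self, scheduler, metric):
           # timm schedulers expect an explicit schedule index, unlike the
           # built-in torch schedulers, so override Lightning's default stepping.
           scheduler.step(self.trainer.global_step)

       def configure_optimizers(self):
           optimizer = torch.optim.AdamW(self.parameters(), lr=self.lr)
           scheduler = CosineLRScheduler(
               optimizer,
               lr_min=self.lr_min,
               t_initial=self.lr_iterations,
               warmup_t=self.warmup_t,
           )
           # "interval": "step" asks Lightning to advance the scheduler after
           # every optimizer step rather than once per epoch.
           return [optimizer], [{"scheduler": scheduler, "interval": "step"}]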
