Commit 2c7c548

Merge remote-tracking branch 'origin/develop' into pr/aw_rescale

havardhhaugen committed Dec 2, 2024
2 parents: 796ad56 + 7e363fa
Showing 11 changed files with 100 additions and 65 deletions.
35 changes: 31 additions & 4 deletions CHANGELOG.md
@@ -8,55 +8,82 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
Please add your functional changes to the appropriate section in the PR.
Keep it human-readable, your future self will thank you!

## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.0...HEAD)
## [Unreleased](https://github.com/ecmwf/anemoi-training/compare/0.3.1...HEAD)

## [0.3.1 - AIFS v0.3 Compatibility](https://github.com/ecmwf/anemoi-training/compare/0.3.0...0.3.1) - 2024-11-28

### Changed
- Perform full shuffle of training dataset [#153](https://github.com/ecmwf/anemoi-training/pull/153)

### Fixed

- Update `n_pixel` used by datashader to better adapt across resolutions [#152](https://github.com/ecmwf/anemoi-training/pull/152)

- Fixed bug in power spectra plotting for the n320 resolution.

- Allow histogram and spectrum plot for one variable [#165](https://github.com/ecmwf/anemoi-training/pull/165)

### Added

- Introduce variable to configure (Cosine Annealing) optimizer warm up [#155](https://github.com/ecmwf/anemoi-training/pull/155)
- Add reader groups to reduce CPU memory usage and increase dataloader throughput [#76](https://github.com/ecmwf/anemoi-training/pull/76)
- Bump `anemoi-graphs` version to 0.4.1 [#159](https://github.com/ecmwf/anemoi-training/pull/159)

### Changed

## [0.3.0 - Loss & Callback Refactors](https://github.com/ecmwf/anemoi-training/compare/0.2.2...0.3.0) - 2024-11-14

### Changed

- Increase the default MlFlow HTTP max retries [#111](https://github.com/ecmwf/anemoi-training/pull/111)

### Fixed

- Rename loss_scaling to variable_loss_scaling [#138](https://github.com/ecmwf/anemoi-training/pull/138)

- Refactored callbacks. [#60](https://github.com/ecmwf/anemoi-training/pull/60)

- Updated docs [#115](https://github.com/ecmwf/anemoi-training/pull/115)
- Fix enabling LearningRateMonitor [#119](https://github.com/ecmwf/anemoi-training/pull/119)

- Refactored rollout [#87](https://github.com/ecmwf/anemoi-training/pull/87)

    - Enable longer validation rollout than training

- Expand iterables in logging [#91](https://github.com/ecmwf/anemoi-training/pull/91)

    - Save entire config in mlflow


### Added

- Included more loss functions and allowed configuration [#70](https://github.com/ecmwf/anemoi-training/pull/70)

- Include option to use datashader and optimised asynchronous callbacks [#102](https://github.com/ecmwf/anemoi-training/pull/102)

- Fix that applies the metric_ranges in the post-processed variable space [#116](https://github.com/ecmwf/anemoi-training/pull/116)

- Allow updates to scalars [#137](https://github.com/ecmwf/anemoi-training/pull/137)

    - Add without subsetting in ScaleTensor

- Sub-hour datasets [#63](https://github.com/ecmwf/anemoi-training/pull/63)

- Add synchronisation workflow [#92](https://github.com/ecmwf/anemoi-training/pull/92)

- Feat: Anemoi Profiler compatible with MLflow and using the PyTorch (Kineto) Profiler for memory reports [#38](https://github.com/ecmwf/anemoi-training/pull/38)

- Feat: Save a gif for longer rollouts in validation [#65](https://github.com/ecmwf/anemoi-training/pull/65)

- New limited area config file added, limited_area.yaml. [#134](https://github.com/ecmwf/anemoi-training/pull/134/)

- New stretched grid config added, stretched_grid.yaml [#133](https://github.com/ecmwf/anemoi-training/pull/133)
- Functionality to change the weight attribute of nodes in the graph at the start of training without re-generating the graph. [#136](https://github.com/ecmwf/anemoi-training/pull/136)

- Custom System monitor for Nvidia and AMD GPUs [#147](https://github.com/ecmwf/anemoi-training/pull/147)


### Changed

- Renamed frequency keys in callbacks configuration. [#118](https://github.com/ecmwf/anemoi-training/pull/118)
4 changes: 2 additions & 2 deletions pyproject.toml
@@ -40,8 +40,8 @@ classifiers = [
dynamic = [ "version" ]

dependencies = [
"anemoi-datasets>=0.4",
"anemoi-graphs>=0.4",
"anemoi-datasets>=0.5.2",
"anemoi-graphs>=0.4.1",
"anemoi-models>=0.3",
"anemoi-utils[provenance]>=0.4.4",
"datashader>=0.16.3",
10 changes: 5 additions & 5 deletions src/anemoi/training/config/graph/encoder_decoder_only.yaml
@@ -22,15 +22,15 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

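All four graph config files in this commit make the same schema change: the singular `edge_builder:` mapping becomes a plural `edge_builders:` list, in line with the `anemoi-graphs` 0.4.1 bump, so several builders can contribute edges to the same (source, target) pair. Below is a minimal sketch of how a loader might accept both forms; the function and dicts are illustrative stand-ins, not the actual anemoi-training or anemoi-graphs code.

```python
# Hypothetical loader logic for the edge_builders schema change
# (illustrative only; not the real anemoi-training/anemoi-graphs code).

def collect_edge_builders(edge_cfg: dict) -> list[dict]:
    builders = edge_cfg.get("edge_builders")
    if builders is None:
        # Tolerate the pre-0.4.1 schema: one mapping under "edge_builder".
        builders = [edge_cfg["edge_builder"]]
    return builders

encoder_cfg = {
    "source_name": "data",
    "target_name": "hidden",
    "edge_builders": [
        {"_target_": "anemoi.graphs.edges.CutOffEdges", "cutoff_factor": 0.6},
    ],
}

for builder in collect_edge_builders(encoder_cfg):
    print("would instantiate", builder["_target_"])
```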
14 changes: 7 additions & 7 deletions src/anemoi/training/config/graph/limited_area.yaml
@@ -23,23 +23,23 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
# Processor configuration
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
target_mask_attr_name: cutout
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
target_mask_attr_name: cutout
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

16 changes: 8 additions & 8 deletions src/anemoi/training/config/graph/multi_scale.yaml
@@ -22,22 +22,22 @@ edges:
# Encoder configuration
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.CutOffEdges # options: KNNEdges, CutOffEdges
cutoff_factor: 0.6 # only for cutoff method
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Processor configuration
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
- source_name: ${graph.hidden}
# Decoder configuration
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges # options: KNNEdges, CutOffEdges
num_nearest_neighbours: 3 # only for knn method
attributes: ${graph.attributes.edges}

12 changes: 6 additions & 6 deletions src/anemoi/training/config/graph/stretched_grid.yaml
@@ -34,22 +34,22 @@ edges:
# Encoder
- source_name: ${graph.data}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges
num_nearest_neighbours: 12
attributes: ${graph.attributes.edges}
# Processor
- source_name: ${graph.hidden}
target_name: ${graph.hidden}
edge_builder:
_target_: anemoi.graphs.edges.MultiScaleEdges
edge_builders:
- _target_: anemoi.graphs.edges.MultiScaleEdges
x_hops: 1
attributes: ${graph.attributes.edges}
# Decoder
- source_name: ${graph.hidden}
target_name: ${graph.data}
edge_builder:
_target_: anemoi.graphs.edges.KNNEdges
edge_builders:
- _target_: anemoi.graphs.edges.KNNEdges
num_nearest_neighbours: 3
attributes: ${graph.attributes.edges}

28 changes: 9 additions & 19 deletions src/anemoi/training/data/dataset.py
@@ -201,6 +201,7 @@ def per_worker_init(self, n_workers: int, worker_id: int) -> None:

low = shard_start + worker_id * self.n_samples_per_worker
high = min(shard_start + (worker_id + 1) * self.n_samples_per_worker, shard_end)
self.chunk_index_range = np.arange(low, high, dtype=np.uint32)

LOGGER.debug(
"Worker %d (pid %d, global_rank %d, model comm group %d) has low/high range %d / %d",
@@ -212,27 +213,17 @@ def per_worker_init(self, n_workers: int, worker_id: int) -> None:
high,
)

self.chunk_index_range = self.valid_date_indices[np.arange(low, high, dtype=np.uint32)]

# each worker must have a different seed for its random number generator,
# otherwise all the workers will output exactly the same data
# should we check lightning env variable "PL_SEED_WORKERS" here?
# but we always want to seed these anyways ...

base_seed = get_base_seed()

seed = (
base_seed * (self.model_comm_group_id + 1) - worker_id
) # note that test, validation etc. datasets get same seed
torch.manual_seed(seed)
random.seed(seed)
self.rng = np.random.default_rng(seed=seed)
torch.manual_seed(base_seed)
random.seed(base_seed)
self.rng = np.random.default_rng(seed=base_seed)
sanity_rnd = self.rng.random(1)

LOGGER.debug(
(
"Worker %d (%s, pid %d, glob. rank %d, model comm group %d, "
"group_rank %d, base_seed %d) using seed %d, sanity rnd %f"
"group_rank %d, base_seed %d), sanity rnd %f"
),
worker_id,
self.label,
Expand All @@ -241,7 +232,6 @@ def per_worker_init(self, n_workers: int, worker_id: int) -> None:
self.model_comm_group_id,
self.model_comm_group_rank,
base_seed,
seed,
sanity_rnd,
)

@@ -256,12 +246,12 @@ def __iter__(self) -> torch.Tensor:
"""
if self.shuffle:
shuffled_chunk_indices = self.rng.choice(
self.chunk_index_range,
size=self.n_samples_per_worker,
self.valid_date_indices,
size=len(self.valid_date_indices),
replace=False,
)
)[self.chunk_index_range]
else:
shuffled_chunk_indices = self.chunk_index_range
shuffled_chunk_indices = self.valid_date_indices[self.chunk_index_range]

LOGGER.debug(
(
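Taken together, the two hunks above implement the full shuffle from #153: every worker now seeds its RNG with the shared `base_seed` (the per-model-comm-group offset is gone), draws one permutation of the whole `valid_date_indices` array, and slices out its own `chunk_index_range`. Because the RNG state is identical everywhere, the permutation is too, and the slices stay disjoint, giving a global shuffle instead of each worker shuffling only its local chunk. A self-contained sketch of the idea, with simplified stand-ins for the class attributes:

```python
# Sketch of the full-shuffle scheme; valid_date_indices and base_seed are
# simplified stand-ins (the real dataset class carries more state).
import numpy as np

valid_date_indices = np.arange(100)   # all usable sample indices
base_seed = 1234                      # shared by every worker (was per-group before)
n_workers = 4
per_worker = len(valid_date_indices) // n_workers

chunks = []
for worker_id in range(n_workers):
    rng = np.random.default_rng(seed=base_seed)  # identical RNG state everywhere
    permutation = rng.choice(valid_date_indices, size=len(valid_date_indices), replace=False)
    low = worker_id * per_worker
    chunk_index_range = np.arange(low, low + per_worker, dtype=np.uint32)
    chunks.append(permutation[chunk_index_range])  # this worker's slice of the global shuffle

# Workers see disjoint slices of one shared permutation: a full shuffle.
all_indices = np.concatenate(chunks)
assert len(np.unique(all_indices)) == len(all_indices)
```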
4 changes: 2 additions & 2 deletions src/anemoi/training/diagnostics/callbacks/plot.py
@@ -938,7 +938,7 @@ def _plot(
torch.cat(tuple(x[self.sample_idx : self.sample_idx + 1, ...].cpu() for x in outputs[1])),
in_place=False,
)
output_tensor = pl_module.output_mask.apply(output_tensor, dim=2, fill_value=np.nan).numpy()
output_tensor = pl_module.output_mask.apply(output_tensor, dim=1, fill_value=np.nan).numpy()
data[1:, ...] = pl_module.output_mask.apply(data[1:, ...], dim=2, fill_value=np.nan)
data = data.numpy()

@@ -999,7 +999,7 @@ def process(
torch.cat(tuple(x[self.sample_idx : self.sample_idx + 1, ...].cpu() for x in outputs[1])),
in_place=False,
)
output_tensor = pl_module.output_mask.apply(output_tensor, dim=2, fill_value=np.nan).numpy()
output_tensor = pl_module.output_mask.apply(output_tensor, dim=1, fill_value=np.nan).numpy()
data[1:, ...] = pl_module.output_mask.apply(data[1:, ...], dim=2, fill_value=np.nan)
data = data.numpy()
return data, output_tensor
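Both hunks in this file change `dim=2` to `dim=1` when masking `output_tensor`, while `data` keeps `dim=2`; this suggests the post-processed prediction tensor carries one axis fewer than `data` at this point, so its grid axis sits at index 1. A rough stand-in for what an `output_mask.apply(tensor, dim, fill_value)` might do; the real implementation lives elsewhere in anemoi and may differ:

```python
import torch

def apply_mask(x: torch.Tensor, mask: torch.Tensor, dim: int, fill_value: float) -> torch.Tensor:
    # Overwrite positions excluded by `mask` along axis `dim` with fill_value.
    x = x.movedim(dim, 0).clone()
    x[~mask] = fill_value
    return x.movedim(0, dim)

mask = torch.tensor([True] * 8 + [False] * 2)  # keep 8 of 10 grid points
out = apply_mask(torch.randn(4, 10, 3), mask, dim=1, fill_value=float("nan"))
```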
11 changes: 10 additions & 1 deletion src/anemoi/training/diagnostics/mlflow/logger.py
@@ -501,7 +501,16 @@ def _clean_params(params: dict[str, Any]) -> dict[str, Any]:
dict[str, Any]
Cleaned up params ready for MlFlow.
"""
prefixes_to_remove = ["hardware", "data", "dataloader", "model", "training", "diagnostics", "metadata.config"]
prefixes_to_remove = [
"hardware",
"data",
"dataloader",
"model",
"training",
"diagnostics",
"metadata.config",
"metadata.dataset.variables_metadata",
]
keys_to_remove = [key for key in params if any(key.startswith(prefix) for prefix in prefixes_to_remove)]
for key in keys_to_remove:
del params[key]
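The new entry, `metadata.dataset.variables_metadata`, keeps bulky per-variable dataset metadata out of the logged MLflow params. A standalone sketch of the prefix-based filtering; this is a hypothetical helper, while the real `_clean_params` mutates `params` in place inside `anemoi.training.diagnostics.mlflow.logger`:

```python
from typing import Any

def clean_params(params: dict[str, Any], prefixes: list[str]) -> dict[str, Any]:
    # Drop every flattened-config key whose dotted path starts with a prefix.
    return {k: v for k, v in params.items() if not any(k.startswith(p) for p in prefixes)}

params = {
    "metadata.dataset.variables_metadata.2t": {"units": "K"},  # dropped
    "metadata.run_id": "abc123",                               # kept
}
cleaned = clean_params(params, ["metadata.dataset.variables_metadata"])
assert list(cleaned) == ["metadata.run_id"]
```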
27 changes: 17 additions & 10 deletions src/anemoi/training/diagnostics/plots.py
@@ -24,6 +24,7 @@
from matplotlib.collections import PathCollection
from matplotlib.colors import BoundaryNorm
from matplotlib.colors import ListedColormap
from matplotlib.colors import Normalize
from matplotlib.colors import TwoSlopeNorm
from pyshtools.expand import SHGLQ
from pyshtools.expand import SHExpandGLQ
@@ -568,8 +569,12 @@ def error_plot_in_degrees(array1: np.ndarray, array2: np.ndarray) -> np.ndarray:
datashader=datashader,
)
else:
single_plot(fig, ax[1], lon, lat, truth, title=f"{vname} target", datashader=datashader)
single_plot(fig, ax[2], lon, lat, pred, title=f"{vname} pred", datashader=datashader)
combined_data = np.concatenate((input_, truth, pred))
# For 'errors', only persistence and increments need identical colorbar-limits
combined_error = np.concatenate(((pred - input_), (truth - input_)))
norm = Normalize(vmin=np.nanmin(combined_data), vmax=np.nanmax(combined_data))
single_plot(fig, ax[1], lon, lat, truth, norm=norm, title=f"{vname} target", datashader=datashader)
single_plot(fig, ax[2], lon, lat, pred, norm=norm, title=f"{vname} pred", datashader=datashader)
single_plot(
fig,
ax[3],
Expand Down Expand Up @@ -619,15 +624,15 @@ def error_plot_in_degrees(array1: np.ndarray, array2: np.ndarray) -> np.ndarray:
datashader=datashader,
)
else:
single_plot(fig, ax[0], lon, lat, input_, title=f"{vname} input", datashader=datashader)
single_plot(fig, ax[0], lon, lat, input_, norm=norm, title=f"{vname} input", datashader=datashader)
single_plot(
fig,
ax[4],
lon,
lat,
pred - input_,
cmap="bwr",
norm=TwoSlopeNorm(vcenter=0.0),
norm=TwoSlopeNorm(vmin=combined_error.min(), vcenter=0.0, vmax=combined_error.max()),
title=f"{vname} increment [pred - input]",
datashader=datashader,
)
Expand All @@ -638,7 +643,7 @@ def error_plot_in_degrees(array1: np.ndarray, array2: np.ndarray) -> np.ndarray:
lat,
truth - input_,
cmap="bwr",
norm=TwoSlopeNorm(vcenter=0.0),
norm=TwoSlopeNorm(vmin=combined_error.min(), vcenter=0.0, vmax=combined_error.max()),
title=f"{vname} persist err",
datashader=datashader,
)
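The pattern in the hunks above: one `Normalize` computed over the concatenated input/target/prediction arrays gives those three panels a common colorbar, and one `TwoSlopeNorm` centred on zero over the concatenated difference fields does the same for the increment and persistence-error panels. A minimal self-contained sketch, using random data as a stand-in for the real fields:

```python
import numpy as np
from matplotlib.colors import Normalize, TwoSlopeNorm

rng = np.random.default_rng(0)
input_, truth, pred = rng.normal(size=(3, 1000))

# Shared linear scale for the input/target/prediction panels.
combined_data = np.concatenate((input_, truth, pred))
norm = Normalize(vmin=np.nanmin(combined_data), vmax=np.nanmax(combined_data))

# Shared diverging scale, centred on zero, for the two difference panels.
combined_error = np.concatenate((pred - input_, truth - input_))
err_norm = TwoSlopeNorm(vmin=combined_error.min(), vcenter=0.0, vmax=combined_error.max())
```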
@@ -703,13 +708,13 @@ def single_plot(
else:
df = pd.DataFrame({"val": data, "x": lon, "y": lat})
# Adjust binning to match the resolution of the data
n_pixels = int(np.floor(data.shape[0] / 212))
lower_limit = 25
upper_limit = 500
n_pixels = max(min(int(np.floor(data.shape[0] * 0.004)), upper_limit), lower_limit)
psc = dsshow(
df,
dsh.Point("x", "y"),
dsh.mean("val"),
vmin=data.min(),
vmax=data.max(),
cmap=cmap,
plot_width=n_pixels,
plot_height=n_pixels,
Expand All @@ -718,8 +723,10 @@ def single_plot(
ax=ax,
)

ax.set_xlim((-np.pi, np.pi))
ax.set_ylim((-np.pi / 2, np.pi / 2))
xmin, xmax = max(lon.min(), -np.pi), min(lon.max(), np.pi)
ymin, ymax = max(lat.min(), -np.pi / 2), min(lat.max(), np.pi / 2)
ax.set_xlim((xmin - 0.1, xmax + 0.1))
ax.set_ylim((ymin - 0.1, ymax + 0.1))

continents.plot_continents(ax)

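Two behavioural changes land in `single_plot`: the datashader canvas size now scales with the number of grid points but is clamped to [25, 500] pixels, replacing the fixed `shape[0] / 212` rule and the explicit `vmin`/`vmax` (the norm is passed in instead), and the axis limits now follow the data extent rather than always spanning the full globe, which suits limited-area domains. A sketch of the clamping arithmetic; the example grid sizes are rough assumptions, not values from the diff:

```python
import numpy as np

def n_pixels_for(n_points: int, lower_limit: int = 25, upper_limit: int = 500) -> int:
    # Scale canvas resolution with grid size, clamped to [25, 500] pixels.
    return max(min(int(np.floor(n_points * 0.004)), upper_limit), lower_limit)

print(n_pixels_for(542_080))  # roughly an n320-sized grid -> 500 (capped)
print(n_pixels_for(40_320))   # roughly an o96-sized grid  -> 161
print(n_pixels_for(4_000))    # small limited-area grid    -> 25 (floor)
```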