
Restart trainer between stages #45

Closed · wants to merge 10 commits

Conversation

@jelmervdl (Contributor) commented Dec 23, 2023

This changes the way the trainer (marian) is wrapped. Previously, the trainer was started once and fed until we ran out of data to feed it, or until it was stopped by closing its stdin.

This change contains:

  1. Restart the trainer for each stage. This allows you to change arguments to the trainer between stages.
  2. If the trainer exits a stage early, but cleanly, we take that as a signal to continue to the next stage. This allows you to do early stopping.

Example of 1:

```yaml
start:
  mix:
  - clean 0.8
  - medium 0.2
  - dirty 0
  - until clean inf # will continue forever, but `--stop-early` will stop + move on.
  modifiers:
  - UpperCase: 0.05
  - TitleCase: 0.05
  arguments: --stop-early

trainer: marian -c config.yml
```
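A minimal sketch of the restart-per-stage loop described in points 1 and 2, assuming a hypothetical `run_stage` helper and stage objects; this is not the actual trainer.py implementation:

```python
import shlex
import subprocess

def run_stage(trainer_cmd: str, stage_args: str, batches) -> None:
    """Start a fresh trainer process for one stage and feed it lines of data."""
    argv = shlex.split(trainer_cmd) + shlex.split(stage_args)
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, text=True)
    for line in batches:
        try:
            proc.stdin.write(line)
        except BrokenPipeError:
            break  # trainer exited before consuming all data
    try:
        proc.stdin.close()
    except BrokenPipeError:
        pass
    # A clean (zero) exit, even before the stage's data runs out, signals:
    # continue to the next stage. A non-zero exit is a real failure.
    if proc.wait() != 0:
        raise RuntimeError(f"trainer exited with code {proc.returncode}")

# One trainer process per stage, so arguments can differ between stages:
# for stage in stages:
#     run_stage("marian -c config.yml", stage.arguments, stage.batches)
```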

I'm a bit confused about whether this is the way to go: stage-specific modifiers override the global modifier config, while stage-specific arguments are appended to the end of the trainer command. On top of that, the trainer key in the config is a string that's split with shlex.split(), while arguments is a list of arguments. Edit: changed arguments to a string that's split using shlex.split(), like the trainer config option.
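After that edit, both keys are plain strings and can be parsed the same way; a tiny illustration using the values from the example above:

```python
import shlex

trainer = shlex.split("marian -c config.yml")  # the trainer key
stage_args = shlex.split("--stop-early")       # a stage's arguments key
argv = trainer + stage_args
# → ['marian', '-c', 'config.yml', '--stop-early']
```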

Let's see whether this does better or worse. It moves the logic of "hey you're ready to move on" to the trainer program for a bit.
@jelmervdl linked an issue Dec 23, 2023 that may be closed by this pull request
@eu9ene (Contributor) commented Jan 4, 2024

It looks great! A couple of thoughts; both are optional:

  • Arguments as a string are usable, but it would be more readable to write them as YAML, for example:

```yaml
arguments:
  early-stopping: 20
  learning-rate: 0.03
```

which would be converted to --early-stopping 20 --learning-rate 0.03. There's some ambiguity here, but it would be convenient. If this is supported, it would make sense to add a unit test.

  • Did I get it right that now we always rerun Marian/the trainer for each stage? Is it possible that this might lead to some complications for training? When we restart the trainer, it restores the last checkpoint, which might make training a bit less stable. Also, would it be possible for a user to specify wrong arguments for Marian that would break training on restart? I'm thinking we could restart only if arguments are specified for a stage. That might prevent some bugs for users who don't need this functionality.

@jelmervdl (Contributor, Author)

> Arguments as a string are usable, but it would be more readable to write them as YAML

I initially chose not to do a YAML list for the arguments because they will eventually get converted to a string anyway. But Marian has a specific way of translating between the config YAML and the command-line options, which we could play into.

I don't want to tie OpusTrainer 100% to Marian, though, so it would then accept either str or dict[str, ...] and convert the latter to Marian-compatible CLI arguments.
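For illustration, a sketch of what accepting both forms could look like; format_args is a hypothetical helper, and Marian's real YAML↔CLI mapping has more cases (booleans, nested options) than this covers:

```python
import shlex
from typing import Union

def format_args(args: Union[str, dict]) -> list[str]:
    """Turn either a shell-style string or a YAML mapping into CLI arguments."""
    if isinstance(args, str):
        return shlex.split(args)
    out: list[str] = []
    for key, value in args.items():
        out.append(f"--{key}")
        if isinstance(value, list):       # {'devices': [0, 1]} → --devices 0 1
            out.extend(str(v) for v in value)
        elif value is not None:           # {'early-stopping': 20} → --early-stopping 20
            out.append(str(value))
    return out

# format_args({'early-stopping': 20, 'learning-rate': 0.03})
# → ['--early-stopping', '20', '--learning-rate', '0.03']
```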

> Did I get it right that now we always rerun Marian/the trainer for each stage? Is it possible that this might lead to some complications for training?

Yes; I'm tempted to do that for consistency. You would probably lose something by restarting, but it would be consistent. The alternative is that, when switching stages, OpusTrainer compares the arguments of the two stages it switches between and decides whether to restart based on that. But that feels a lot less predictable.

> Also, would it be possible for a user to specify wrong arguments for Marian that would break training on restart?

Yep, but you'd have the last clean checkpoint to fall back on. To be fair, you'd have that in all scenarios, since it will fail right at the start, directly after switching to the next stage.

> That might prevent some bugs for users who don't need this functionality.

… And I find that pretty convincing. A third option would be to restart between stages if any stage has arguments specified, and never restart if the feature is unused.
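That third option is a one-line condition; a sketch with assumed stage objects:

```python
# Hypothetical: only restart between stages when any stage actually sets arguments.
restart_between_stages = any(getattr(stage, "arguments", None) for stage in stages)
```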

For testing, maybe a different feature could work: something where you specify running only X% of the data, so you can check whether it moves correctly through all the stages. By severely underfeeding it, you could make Marian stall very quickly and get it to move on to the next stage quickly.

@jelmervdl (Contributor, Author)

Abandoned for now in favour of re-launching OpusTrainer with a different configuration (and different launch arguments for the trainer) altogether.

@jelmervdl closed this Jan 16, 2024
Successfully merging this pull request may close these issues:

Support training arguments for training stages