
Restart trainer between stages #45

Closed · wants to merge 10 commits

Conversation

@jelmervdl (Contributor) commented Dec 23, 2023

This changes the way the trainer (marian) is wrapped. Previously, the trainer was started once and fed until we ran out of data to feed it, or until it was stopped by closing its stdin.

This change contains:

  1. Restart the trainer for each stage. This allows you to change arguments to the trainer between stages.
  2. If the trainer exits a stage early, but cleanly, we take that as a signal to continue to the next stage. This allows you to do early stopping.

Example of 1:

```yaml
start:
  mix:
  - clean 0.8
  - medium 0.2
  - dirty 0
  - until clean inf # will continue forever, but `--stop-early` will stop + move on.
  modifiers:
  - UpperCase: 0.05
  - TitleCase: 0.05
  arguments: --stop-early

trainer: marian -c config.yml
```
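A minimal sketch of the restart-per-stage loop described in points 1 and 2, assuming a hypothetical `run_stage` helper and stage objects; this is not the actual trainer.py implementation:

```python
import shlex
import subprocess

def run_stage(trainer_cmd: str, stage_args: str, batches) -> None:
    """Start a fresh trainer process for one stage and feed it lines of data."""
    argv = shlex.split(trainer_cmd) + shlex.split(stage_args)
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, text=True)
    for line in batches:
        try:
            proc.stdin.write(line)
        except BrokenPipeError:
            break  # trainer exited before consuming all data
    try:
        proc.stdin.close()
    except BrokenPipeError:
        pass
    # A clean (zero) exit, even before the stage's data runs out, signals:
    # continue to the next stage. A non-zero exit is a real failure.
    if proc.wait() != 0:
        raise RuntimeError(f"trainer exited with code {proc.returncode}")

# One trainer process per stage, so arguments can differ between stages:
# for stage in stages:
#     run_stage("marian -c config.yml", stage.arguments, stage.batches)
```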

I'm a bit confused about whether this is the way to go: stage-specific modifiers override the global modifier config, while stage-specific arguments are appended to the end of the trainer command. On top of that, the trainer key in the config is a string that's split with shlex.split(), while arguments is a list of arguments. Edit: changed arguments to a string that's split using shlex.split(), like the trainer config option.
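After that edit, both keys are plain strings and can be parsed the same way; a tiny illustration using the values from the example above:

```python
import shlex

trainer = shlex.split("marian -c config.yml")  # the trainer key
stage_args = shlex.split("--stop-early")       # a stage's arguments key
argv = trainer + stage_args
# → ['marian', '-c', 'config.yml', '--stop-early']
```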

Let's see whether this does better or worse. It moves the logic of "hey you're ready to move on" to the trainer program for a bit.
@jelmervdl linked an issue Dec 23, 2023 that may be closed by this pull request
@eu9ene (Contributor) commented Jan 4, 2024

It looks great! A couple of thoughts; both are optional:

  • Arguments as a string are usable, but it would be more readable to write them as YAML, for example:

```yaml
arguments:
  early-stopping: 20
  learning-rate: 0.03
```

which would be converted to --early-stopping 20 --learning-rate 0.03. There's some ambiguity here, but it would be convenient. If this is supported, it would make sense to add a unit test.

  • Did I get it right that now we always rerun Marian/the trainer for each stage? Is it possible that this might lead to some complications for training? When we restart the trainer, it restores the last checkpoint, which might make training a bit less stable. Also, would it be possible for a user to specify wrong arguments for Marian that would break training on restart? I'm thinking we could restart only if arguments are specified for a stage. That might prevent some bugs for users who don't need this functionality.

@jelmervdl (Contributor, Author)

> Arguments as a string are usable, but it would be more readable to write them as YAML

I initially chose not to do a YAML list for the arguments because they will eventually get converted to a string anyway. But Marian has a specific way of translating between the config YAML and the command-line options, which we could play into.

I don't want to tie OpusTrainer 100% to Marian, though, so it would then accept either str or dict[str, ...] and convert the latter to Marian-compatible CLI arguments.
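For illustration, a sketch of what accepting both forms could look like; format_args is a hypothetical helper, and Marian's real YAML↔CLI mapping has more cases (booleans, nested options) than this covers:

```python
import shlex
from typing import Union

def format_args(args: Union[str, dict]) -> list[str]:
    """Turn either a shell-style string or a YAML mapping into CLI arguments."""
    if isinstance(args, str):
        return shlex.split(args)
    out: list[str] = []
    for key, value in args.items():
        out.append(f"--{key}")
        if isinstance(value, list):       # {'devices': [0, 1]} → --devices 0 1
            out.extend(str(v) for v in value)
        elif value is not None:           # {'early-stopping': 20} → --early-stopping 20
            out.append(str(value))
    return out

# format_args({'early-stopping': 20, 'learning-rate': 0.03})
# → ['--early-stopping', '20', '--learning-rate', '0.03']
```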

> Did I get it right that now we always rerun Marian/the trainer for each stage? Is it possible that this might lead to some complications for training?

Yes; I'm tempted to do that for consistency. You would probably lose something by restarting, but it would be consistent. The alternative is that, when switching stages, OpusTrainer compares the arguments of the two stages it switches between and decides whether to restart based on that. But that feels a lot less predictable.

> Also, would it be possible for a user to specify wrong arguments for Marian that would break training on restart?

Yep, but you'd have the last clean checkpoint to fall back on. To be fair, you'd have that in all scenarios, since it will fail right at the start, directly after switching to the next stage.

> That might prevent some bugs for users who don't need this functionality.

… And I find that pretty convincing. A third option would be to restart between stages if any stage has arguments specified, and never restart if the feature is unused.
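That third option is a one-line condition; a sketch with assumed stage objects:

```python
# Hypothetical: only restart between stages when any stage actually sets arguments.
restart_between_stages = any(getattr(stage, "arguments", None) for stage in stages)
```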

For testing, maybe a different feature could work: something where you specify running only X% of the data, so you can check whether it moves correctly through all the stages. By severely underfeeding it, you could make Marian stall very quickly and get it to move on to the next stage quickly.

@jelmervdl (Contributor, Author)

Abandoned for now in favour of re-launching OpusTrainer with a different configuration (and different launch arguments for the trainer) altogether.

@jelmervdl closed this Jan 16, 2024
Successfully merging this pull request may close these issues:

Support training arguments for training stages