This tutorial describes how to construct optimizers, customize learning rate and momentum schedules, set parameter-wise configurations, use gradient clipping and gradient accumulation, and add self-implemented optimization methods to the project.
MMSelfSup supports all the optimizers implemented by PyTorch. To use or modify one, change the optimizer field of the config file.
For example, to use SGD, the config could look like this:
optimizer = dict(type='SGD', lr=0.0003, weight_decay=0.0001)
To change the learning rate of the model, modify lr in the optimizer config. You can also set other arguments directly, following the PyTorch API documentation.
For example, to use Adam with the setting torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False) in PyTorch, the config should look like this:
optimizer = dict(type='Adam', lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
In addition to the optimizers implemented by PyTorch, we also implement a customized LARS optimizer in mmselfsup/core/optimizer/optimizers.py.
Learning rate decay is widely used to improve performance. To use it, set the lr_config field in the config file.
For example, we use the CosineAnnealing policy to train SimCLR, and the config is:
lr_config = dict(
    policy='CosineAnnealing',
    ...)
During training, the program then calls CosineAnnealingLrUpdaterHook periodically to update the learning rate.
We also support many other learning rate schedules, such as the Poly schedule.
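To make the cosine policy more concrete, here is a minimal plain-Python sketch of a cosine-annealed learning rate. The function name `cosine_annealing_lr` and the `progress` argument are illustrative assumptions, not the API of the hook itself; `min_lr_ratio` is meant in the same sense as the config key of that name, and the base learning rate of 0.3 is an arbitrary value.

```python
import math


def cosine_annealing_lr(base_lr, progress, min_lr_ratio=0.0):
    """Return the cosine-annealed learning rate.

    progress is the fraction of training completed, in [0, 1].
    """
    min_lr = base_lr * min_lr_ratio
    cos_out = math.cos(math.pi * progress) + 1  # falls from 2 to 0
    return min_lr + 0.5 * (base_lr - min_lr) * cos_out


# Learning rate at the start, middle, and end of training.
schedule = [
    cosine_annealing_lr(0.3, p, min_lr_ratio=1e-2) for p in (0.0, 0.5, 1.0)
]
```

The learning rate starts at the base value, passes through the midpoint between the base and minimum values halfway through training, and ends at `base_lr * min_lr_ratio`.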
Training tends to be unstable in its early stage, and warmup is a technique to reduce this instability. With warmup, the learning rate increases gradually from a small value to the expected value.
In MMSelfSup, we use lr_config to configure the warmup strategy. The main parameters are as follows:
warmup: the warmup curve type. Choose one of 'constant', 'linear', 'exp', and None, where None disables warmup.
warmup_by_epoch: whether to warm up by epoch. Defaults to True; if set to False, warmup is counted by iteration.
warmup_iters: the warmup duration. When warmup_by_epoch=True, the unit is epochs; when warmup_by_epoch=False, the unit is iterations.
warmup_ratio: the initial warmup learning rate is calculated as lr = lr * warmup_ratio.
Here are some examples:
1. linear & warmup by iter
lr_config = dict(
    policy='CosineAnnealing',
    by_epoch=False,
    min_lr_ratio=1e-2,
    warmup='linear',
    warmup_ratio=1e-3,
    warmup_iters=20 * 1252,
    warmup_by_epoch=False)
2. exp & warmup by epoch
lr_config = dict(
    policy='CosineAnnealing',
    min_lr=0,
    warmup='exp',
    warmup_iters=5,
    warmup_ratio=0.1,
    warmup_by_epoch=True)
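The three warmup curves can be sketched in plain Python as follows. This is a minimal illustration, assuming formulas that follow our understanding of the MMCV learning rate hooks; the function `warmup_lr` and its argument names are hypothetical. Here `regular_lr` stands for the value the main schedule would produce at that step, fixed to 1.0 for simplicity.

```python
def warmup_lr(regular_lr, cur_step, warmup_steps, warmup_ratio, warmup='linear'):
    """Learning rate during warmup; cur_step counts iterations or epochs."""
    if warmup == 'constant':
        return regular_lr * warmup_ratio
    if warmup == 'linear':
        k = (1 - cur_step / warmup_steps) * (1 - warmup_ratio)
        return regular_lr * (1 - k)
    if warmup == 'exp':
        k = warmup_ratio ** (1 - cur_step / warmup_steps)
        return regular_lr * k
    raise ValueError(f'unknown warmup type: {warmup}')


# Example 2: exponential warmup over 5 epochs, starting at 0.1 * lr.
exp_curve = [warmup_lr(1.0, epoch, 5, 0.1, 'exp') for epoch in range(6)]

# Example 1: linear warmup over 20 * 1252 iterations, starting at 1e-3 * lr.
linear_start = warmup_lr(1.0, 0, 20 * 1252, 1e-3, 'linear')
linear_end = warmup_lr(1.0, 20 * 1252, 20 * 1252, 1e-3, 'linear')
```

In both cases the learning rate starts at `lr * warmup_ratio` and reaches the regular value when warmup ends.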
We support a momentum scheduler that modifies the optimizer's momentum according to the learning rate, which can make the model converge faster.
The momentum scheduler is usually used together with the LR scheduler, for example, to accelerate convergence. For more details, please refer to the implementation of CyclicLrUpdater and CyclicMomentumUpdater.
Here is an example:
lr_config = dict(
    policy='cyclic',
    target_ratio=(10, 1e-4),
    cyclic_times=1,
    step_ratio_up=0.4,
)
momentum_config = dict(
    policy='cyclic',
    target_ratio=(0.85 / 0.95, 1),
    cyclic_times=1,
    step_ratio_up=0.4,
)
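To see how the two cyclic configs interact, the following simplified sketch evaluates a one-cycle schedule in plain Python. It assumes cosine interpolation between phases and a single cycle; the helper names, the base learning rate of 0.001, the base momentum of 0.95, and the 1000 total iterations are all illustrative assumptions rather than values taken from MMSelfSup.

```python
import math


def annealing_cos(start, end, factor):
    """Cosine interpolation from start (factor=0) to end (factor=1)."""
    cos_out = math.cos(math.pi * factor) + 1
    return end + 0.5 * (start - end) * cos_out


def cyclic_value(base, cur_iter, max_iters, target_ratio, step_ratio_up):
    """Value of a one-cycle schedule at cur_iter (simplified sketch)."""
    up_iters = int(step_ratio_up * max_iters)
    if cur_iter < up_iters:
        # Phase 1: move from base to base * target_ratio[0].
        return annealing_cos(base, base * target_ratio[0], cur_iter / up_iters)
    # Phase 2: move from base * target_ratio[0] to base * target_ratio[1].
    factor = (cur_iter - up_iters) / (max_iters - up_iters)
    return annealing_cos(base * target_ratio[0], base * target_ratio[1], factor)


max_iters = 1000
checkpoints = (0, 400, 1000)
lrs = [cyclic_value(0.001, i, max_iters, (10, 1e-4), 0.4) for i in checkpoints]
moms = [cyclic_value(0.95, i, max_iters, (0.85 / 0.95, 1), 0.4) for i in checkpoints]
```

Under these assumptions the learning rate rises to 10 times its base value during the first 40% of training and then decays to 1e-4 times the base value, while the momentum moves in the opposite direction, dropping from 0.95 to 0.85 and then returning to 0.95.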
Some models need parameter-specific optimization settings, for example, no weight decay for BatchNorm layers or for the bias of each layer. To configure them in detail, use paramwise_options in optimizer.
For example, if we do not want to apply weight decay to the parameters of BatchNorm or GroupNorm layers, or to the bias of each layer, we can use the following config:
optimizer = dict(
    type=...,
    lr=...,
    paramwise_options={
        '(bn|gn)(\\d+)?.(weight|bias)':
        dict(weight_decay=0.),
        'bias': dict(weight_decay=0.)
    })
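The sketch below shows how such patterns could select parameters. It assumes that each key of paramwise_options is matched against the parameter name with `re.search`, which is our understanding of the optimizer builder; the function `resolve_options`, the parameter names, and the base `lr` and `weight_decay` values are hypothetical.

```python
import re

paramwise_options = {
    '(bn|gn)(\\d+)?.(weight|bias)': dict(weight_decay=0.),
    'bias': dict(weight_decay=0.),
}


def resolve_options(param_name, base_options, paramwise_options):
    """Return the optimizer options that apply to one parameter."""
    options = dict(base_options)
    for pattern, overrides in paramwise_options.items():
        if re.search(pattern, param_name):
            options.update(overrides)
    return options


# Hypothetical parameter names, in the style of model.named_parameters().
names = [
    'backbone.conv1.weight',
    'backbone.layer1.0.bn1.weight',
    'backbone.layer1.0.bn1.bias',
    'head.fc.bias',
]
base = dict(lr=0.03, weight_decay=1e-4)
resolved = {n: resolve_options(n, base, paramwise_options) for n in names}
```

Only the convolution weight keeps the base weight decay here; the BatchNorm parameters and the bias of the linear layer have it set to zero.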
Besides the basic functions of PyTorch optimizers, we also provide some enhancements, such as gradient clipping and gradient accumulation. Please refer to MMCV for more details.
Currently we support the grad_clip option in optimizer_config; refer to the PyTorch documentation for more arguments.
Here is an example:
optimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))
# norm_type: type of the used p-norm, here norm_type is 2.
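As a rough illustration of what max_norm and norm_type control, here is a plain-Python stand-in for norm-based gradient clipping. It operates on a flat list of floats rather than on tensors, and the function name `clip_grad_norm` is an assumption made for this sketch; in practice the clipping is done by PyTorch.

```python
import math


def clip_grad_norm(grads, max_norm, norm_type=2):
    """Scale grads in place so their global p-norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total_norm = sum(abs(g) ** norm_type for g in grads) ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        for i, g in enumerate(grads):
            grads[i] = g * clip_coef
    return total_norm


grads = [30.0, 40.0]  # global L2 norm is 50
norm_before = clip_grad_norm(grads, max_norm=35, norm_type=2)
norm_after = math.sqrt(sum(g * g for g in grads))
```

All gradients are scaled by the same factor, so their direction is preserved and only the overall magnitude is limited to max_norm.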
When inheriting from a base config and modifying it, _delete_=True is needed if grad_clip=None in the base.
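A child config for this case might look like the sketch below. The base file path is only a placeholder, and we assume the base sets optimizer_config = dict(grad_clip=None). Note that _delete_=True replaces the whole optimizer_config dict from the base instead of merging into it, so any other keys the base defined there must be repeated.

```python
# Hypothetical child config; the base file path is only a placeholder.
_base_ = ['path/to/base_config.py']

# Replace the inherited optimizer_config entirely rather than merging a dict
# into grad_clip=None from the base.
optimizer_config = dict(
    _delete_=True,
    grad_clip=dict(max_norm=35, norm_type=2))
```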
When computation resources are limited, the batch size can only be set to a small value, which may degrade model performance. Gradient accumulation can be used to work around this problem.
Here is an example:
data = dict(samples_per_gpu=64)
optimizer_config = dict(type="DistOptimizerHook", update_interval=4)
This indicates that, during training, gradients are accumulated and the parameters are updated once every 4 iterations. The above is equivalent to:
data = dict(samples_per_gpu=256)
optimizer_config = dict(type="OptimizerHook")
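The equivalence can be checked with a small plain-Python sketch. It uses a toy one-parameter regression with a mean squared error loss; the data, the weight `w`, and the helper name are made up for illustration, and each micro-batch loss is assumed to be divided by update_interval before its gradient is accumulated.

```python
def mean_squared_error_grad(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5
update_interval = 4
micro_batch = len(xs) // update_interval

# One large batch: a single gradient over all 8 samples.
full_grad = mean_squared_error_grad(w, xs, ys)

# Gradient accumulation: 4 micro-batches of 2 samples each, with every
# contribution divided by update_interval, followed by one optimizer step.
accumulated_grad = 0.0
for i in range(update_interval):
    lo, hi = i * micro_batch, (i + 1) * micro_batch
    grad = mean_squared_error_grad(w, xs[lo:hi], ys[lo:hi])
    accumulated_grad += grad / update_interval
```

The two gradients agree up to floating-point rounding. The equivalence is exact only when the loss is an average over independent samples; layers whose statistics depend on the batch, such as BatchNorm, still see the smaller batch.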
In academic research and industrial practice, you will likely need optimization methods that MMSelfSup does not implement. You can add them through the following steps.
Implement your CustomizedOptim in mmselfsup/core/optimizer/optimizers.py
import torch
from torch.optim import *  # noqa: F401,F403
from torch.optim.optimizer import Optimizer, required
from mmcv.runner.optimizer.builder import OPTIMIZERS


@OPTIMIZERS.register_module()
class CustomizedOptim(Optimizer):

    def __init__(self, *args, **kwargs):
        # TODO: set the hyperparameter defaults and call super().__init__()
        raise NotImplementedError

    @torch.no_grad()
    def step(self):
        # TODO: implement the parameter update rule
        raise NotImplementedError
Import it in mmselfsup/core/optimizer/__init__.py
from .optimizers import CustomizedOptim
from .builder import build_optimizer
__all__ = ['CustomizedOptim', 'build_optimizer', ...]
Use it in your config file
optimizer = dict(
    type='CustomizedOptim',
    ...
)
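The three steps work together through a registry that maps the type string in the config to a class. The toy registry below is a minimal sketch of that idea in plain Python; it is not MMCV's implementation, and the `Registry` class, its `build` method, and the placeholder optimizer are simplified assumptions for illustration.

```python
class Registry:
    """A tiny name-to-class registry (a stand-in, not MMCV's Registry)."""

    def __init__(self):
        self._modules = {}

    def register_module(self):
        def decorator(cls):
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg, **default_args):
        args = dict(cfg)  # copy so the caller's config is left untouched
        cls = self._modules[args.pop('type')]
        return cls(**default_args, **args)


OPTIMIZERS = Registry()


@OPTIMIZERS.register_module()
class CustomizedOptim:
    """Placeholder optimizer that only records its arguments."""

    def __init__(self, params, lr=0.01):
        self.params = list(params)
        self.lr = lr


cfg = dict(type='CustomizedOptim', lr=0.1)
optimizer = OPTIMIZERS.build(cfg, params=[1.0, 2.0])
```

This is also why the import step matters: the decorator only runs, and the class only becomes available under its name, once the module that defines it has been imported.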