ranzen.torch.optimizers¶
Classes:
Adafactor – Implements the Adafactor algorithm.
LAMB – Implements the LAMB algorithm introduced in Large Batch Optimization for Deep Learning.
SAM – Implements the 'Sharpness Aware Minimization' (SAM) algorithm introduced in Sharpness Aware Minimization, along with the adaptive variant of it introduced in ASAM.
- class Adafactor(params, *, lr=None, eps=(1e-30, 0.001), clipping_threshold=1.0, decay_rate=0.8, beta1=None, weight_decay=0.0, multiply_by_parameter_scale=False, warmup_init=False)¶
Bases: Optimizer

Implements the Adafactor algorithm.

This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. Note that this optimizer internally adjusts the learning rate depending on the multiply_by_parameter_scale, relative_step, and warmup_init options. To use a manual (external) learning rate schedule, you should set multiply_by_parameter_scale=False and relative_step=False (see the sketch following the parameter list).
- Parameters:
params (Iterable[Tensor]) – iterable of parameters to optimize or dicts defining parameter groups
lr (float | None) – learning rate. If None, a time-dependent learning rate will instead be computed.
eps (tuple[float, float]) – regularization constants for the square gradient and parameter scale, respectively.
clipping_threshold (float) – threshold of root mean square of final gradient update.
decay_rate (float) – coefficient used to compute running averages of square gradient.
beta1 (float | None) – coefficient used for computing running averages of gradient.
weight_decay (float) – weight decay coefficient.
multiply_by_parameter_scale (bool) – if True, the learning rate is scaled by the root mean square of the parameter.
warmup_init (bool) – whether warm-up initialization is used; the time-dependent learning rate computation depends on this setting.
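A minimal construction sketch (the nn.Linear model and learning-rate value are illustrative placeholders, not part of the API) contrasting the time-dependent and manual learning-rate configurations described above:

import torch
from torch import nn
from ranzen.torch.optimizers import Adafactor

model = nn.Linear(10, 2)

# Time-dependent (internal) learning rate: leave lr=None.
optimizer = Adafactor(model.parameters())

# Manual (external) learning rate: pass an explicit lr and disable
# parameter scaling so that the external value controls the step size.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,
    multiply_by_parameter_scale=False,
)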
- step(closure=None)¶
Performs a single optimization step.
- Parameters:
closure (Callable[[...], Tensor] | None) – A closure that reevaluates the model and returns the loss.
- Returns:
loss returned by the closure if closure is not None, else None.
- Raises:
RuntimeError – if gradients are sparse.
- Return type:
Tensor | None
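A hedged sketch of a single training step (the model, data, and loss function are placeholders), calling step() after the usual backward pass:

import torch
from torch import nn
from ranzen.torch.optimizers import Adafactor

model = nn.Linear(10, 2)
optimizer = Adafactor(model.parameters())
inputs, targets = torch.randn(32, 10), torch.randint(0, 2, (32,))

# Compute the loss, backpropagate, then take an optimizer step.
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()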
- class LAMB(params, lr=0.001, betas=(0.9, 0.999), *, eps=1e-06, weight_decay=0.0, clamp_value=10.0, debias=False)¶
Bases: Optimizer

Implements the LAMB algorithm introduced in Large Batch Optimization for Deep Learning.
LAMB serves as the AdamW counterpart to the LARS optimizer, similarly employing layerwise adaptive learning rates to train models effectively with large batch sizes (see the usage sketch following the parameter list).
Note
Implementation based on: https://github.com/cybertronai/pytorch-lamb
- Parameters:
params (Iterable[Tensor]) – iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – learning rate.
betas (tuple[float, float]) – coefficients used for computing running averages of gradient and its square.
eps (float) – term added to the denominator to improve numerical stability.
weight_decay (float) – weight decay coefficient.
clamp_value (float) – value to clamp the norm of the weights to.
debias (bool) – whether to include the bias-correction term (1 - beta**step) from Adam.
- Raises:
ValueError – if any one of lr, betas, eps, or weight_decay is not in its permitted range.
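A minimal usage sketch (the model, batch, and hyperparameter values are illustrative rather than recommended settings):

import torch
from torch import nn
from ranzen.torch.optimizers import LAMB

model = nn.Linear(10, 2)
# debias=True includes Adam's bias-correction term (see the parameter list above).
optimizer = LAMB(model.parameters(), lr=1e-3, weight_decay=1e-2, debias=True)

inputs, targets = torch.randn(1024, 10), torch.randint(0, 2, (1024,))
loss = nn.functional.cross_entropy(model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()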
- step(closure=None)¶
Performs a single optimization step.
- Parameters:
closure (Callable[[...], Tensor] | None) – A closure that reevaluates the model and returns the loss.
- Returns:
loss returned by the closure if closure is not None, else None.
- Raises:
RuntimeError – if gradients are sparse.
- Return type:
Tensor | None
- class SAM(base_optimizer, *, rho=0.05, adaptive=True)¶
Bases: Optimizer

Implements the 'Sharpness Aware Minimization' (SAM) algorithm introduced in Sharpness Aware Minimization, along with the adaptive variant introduced in ASAM.

SAM seeks parameters that lie in neighborhoods with uniformly low loss, rather than parameters that merely have a low loss value themselves. The adaptive variant addresses the original algorithm's sensitivity to parameter re-scaling, which can weaken the connection between sharpness and the generalization gap.
- Parameters:
base_optimizer (Optimizer) – Base optimizer for SAM.
rho (float) – Neighborhood size.
adaptive (bool) – Whether to use the adaptive variant of the algorithm.
- Raises:
ValueError – if rho is negative.
- Example:
# Use AdamW as the base optimizer.
base_optimizer = AdamW(model.parameters())
# Wrap the base optimizer in SAM.
optimizer = SAM(base_optimizer)

# Closure required for recomputing the loss after computing epsilon(w).
def _closure():
    return loss_function(logits=model(input), targets=targets)

loss = _closure()
loss.backward()
optimizer.step(closure=_closure)
optimizer.zero_grad()
- load_state_dict(state_dict)¶
Loads the optimizer state.
- Parameters:
state_dict (dict[str, Any]) – optimizer state. Should be an object returned from a call to state_dict().
- Return type:
None
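A hedged checkpointing sketch (the model, file path, and AdamW base optimizer are placeholders; saving relies on the state_dict() method inherited from Optimizer):

import torch
from torch import nn
from torch.optim import AdamW
from ranzen.torch.optimizers import SAM

model = nn.Linear(10, 2)
optimizer = SAM(AdamW(model.parameters()))

# Save the optimizer state alongside the model weights.
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    "checkpoint.pt",
)

# Later: rebuild the model/optimizer and restore their states.
checkpoint = torch.load("checkpoint.pt")
model.load_state_dict(checkpoint["model"])
optimizer.load_state_dict(checkpoint["optimizer"])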
- step(closure)¶
Performs a single optimization step.
- Parameters:
closure (Callable[[...], Tensor]) – A closure that reevaluates the model and returns the loss.
- Returns:
loss returned by the closure.
- Return type:
Tensor