ranzen.torch.optimizers¶
Classes:
Adafactor – Implements the Adafactor algorithm.
LAMB – Implements the LAMB algorithm introduced in Large Batch Optimization for Deep Learning.
SAM – Implements the 'Sharpness Aware Minimization' (SAM) algorithm introduced in Sharpness Aware Minimization, along with the adaptive variant introduced in ASAM.
- class Adafactor(params, *, lr=None, eps=(1e-30, 0.001), clipping_threshold=1.0, decay_rate=0.8, beta1=None, weight_decay=0.0, multiply_by_parameter_scale=False, warmup_init=False)¶
Bases:
Optimizer
Implements the Adafactor algorithm.
This implementation is based on: Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.
Note that this optimizer internally adjusts the learning rate depending on the multiply_by_parameter_scale, relative_step, and warmup_init options. To use a manual (external) learning rate schedule, set multiply_by_parameter_scale=False and relative_step=False (see the usage sketch after the parameter list).
- Parameters:
params (Iterable[Tensor]) – iterable of parameters to optimize or dicts defining parameter groups
lr (float | None) – learning rate. If None, a time-dependent learning rate will be computed instead.
eps (tuple[float, float]) – regularization constants for the square gradient and parameter scale, respectively.
clipping_threshold (float) – threshold of root mean square of final gradient update.
decay_rate (float) – coefficient used to compute running averages of square gradient.
beta1 (float | None) – coefficient used for computing running averages of gradient.
weight_decay (float) – weight decay coefficient.
multiply_by_parameter_scale (bool) – if True, learning rate is scaled by root mean square of parameter.
warmup_init (bool) – time-dependent learning rate computation depends on whether warm-up initialization is being used.
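A minimal usage sketch (not from the ranzen docs; model, loss_fn, and the data tensors are illustrative placeholders). Since the constructor above exposes no relative_step argument, the assumption here is that passing an explicit lr together with multiply_by_parameter_scale=False is enough to use a fixed, externally chosen learning rate:

import torch

from ranzen.torch.optimizers import Adafactor

model = torch.nn.Linear(16, 4)  # placeholder model
loss_fn = torch.nn.CrossEntropyLoss()  # placeholder loss

# Fixed (external) learning rate: disable parameter-scale adjustment so the
# supplied lr is used as-is instead of a time-dependent schedule.
optimizer = Adafactor(model.parameters(), lr=1e-3, multiply_by_parameter_scale=False)

inputs, targets = torch.randn(8, 16), torch.randint(0, 4, (8,))
optimizer.zero_grad()
loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()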
- step(closure=None)¶
Performs a single optimization step.
- Parameters:
closure (Callable[[...], Tensor] | None) – A closure that reevaluates the model and returns the loss.
- Returns:
loss returned by the closure if closure is not None, else None.
- Raises:
RuntimeError – if gradients are sparse.
- Return type:
Tensor | None
- class LAMB(params, lr=0.001, betas=(0.9, 0.999), *, eps=1e-06, weight_decay=0.0, clamp_value=10.0, debias=False)¶
Bases:
Optimizer
Implements the LAMB algorithm introduced in Large Batch Optimization for Deep Learning.
LAMB serves as the AdamW counterpart to the LARS optimizer, similarly employing layerwise adaptive learning rates to train models effectively with large batch sizes (a usage sketch follows the parameter list below).
Note
Implementation based on: https://github.com/cybertronai/pytorch-lamb
- Parameters:
params (Iterable[Tensor]) – iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – learning rate.
betas (tuple[float, float]) – coefficients used for computing running averages of gradient and its square.
eps (float) – term added to the denominator to improve numerical stability.
weight_decay (float) – weight decay coefficient.
clamp_value (float) – value to clamp the norm of the weights to.
debias (bool) – whether to include the bias-correction term (1 - beta**step) from Adam.
- Raises:
ValueError – if any one of lr, betas, eps, or weight_decay is not in its permitted range.
- step(closure=None)¶
Performs a single optimization step.
- Parameters:
closure (Callable[[...], Tensor] | None) – A closure that reevaluates the model and returns the loss.
- Returns:
loss returned by the closure if closure is not None, else None.
- Raises:
RuntimeError – if gradients are sparse.
- Return type:
Tensor | None
- class SAM(base_optimizer, *, rho=0.05, adaptive=True)¶
Bases:
Optimizer
Implements the ‘Sharpness Aware Minimization’ (SAM) algorithm introduced in Sharpness Aware Minimization, along with the adaptive variant introduced in ASAM.
SAM seeks parameters that lie in neighborhoods having uniformly low loss (rather than parameters that only themselves have low loss value). The adaptive variant of the algorithm addresses the original algorithm’s sensitivity to parameter re-scaling that can lead to weakening of the connection between sharpness and generalization gap.
- Parameters:
base_optimizer (Optimizer) – Base optimizer for SAM.
rho (float) – Neighborhood size.
adaptive (bool) – Whether to use the adaptive variant of the algorithm.
- Raises:
ValueError – if rho is negative.
- Example:
# Use AdamW as the base optimizer.
base_optimizer = AdamW(model.parameters())
# Wrap the base optimizer in SAM.
optimizer = SAM(base_optimizer)

# Closure required for recomputing the loss after computing epsilon(w).
def _closure():
    return loss_function(logits=model(input), targets=targets)

loss = _closure()
loss.backward()
optimizer.step(closure=_closure)
optimizer.zero_grad()
- load_state_dict(state_dict)¶
Loads the optimizer state.
- Parameters:
state_dict (dict[str, Any]) – optimizer state. Should be an object returned from a call to state_dict().
- Return type:
None
- step(closure)¶
Performs a single optimization step.
- Parameters:
closure (Callable[[...], Tensor]) – A closure that reevaluates the model and returns the loss.
- Returns:
loss returned by the closure.
- Return type:
Tensor