ranzen.torch.optimizers

Classes:

Adafactor

Implements the Adafactor algorithm.

LAMB

Implements the LAMB algorithm introduced in Large Batch Optimization for Deep Learning.

SAM

Implements the 'Sharpness Aware Minimization' (SAM) algorithm introduced in Sharpness Aware Minimization along with the adaptive variant introduced in ASAM.

class Adafactor(params, *, lr=None, eps=(1e-30, 0.001), clipping_threshold=1.0, decay_rate=0.8, beta1=None, weight_decay=0.0, multiply_by_parameter_scale=False, warmup_init=False)

Bases: Optimizer

Implements the Adafactor algorithm.

This implementation is based on Adafactor: Adaptive Learning Rates with Sublinear Memory Cost. Note that this optimizer internally adjusts the learning rate depending on the multiply_by_parameter_scale and warmup_init options and on whether lr is None. To use a manual (external) learning rate schedule, pass an explicit lr and set multiply_by_parameter_scale=False.

Parameters:
  • params (Iterable[Tensor]) – iterable of parameters to optimize or dicts defining parameter groups

  • lr (float | None) – learning rate. If None, a time-dependent learning rate will instead be computed.

  • eps (tuple[float, float]) – regularization constants for square gradient and parameter scale respectively.

  • clipping_threshold (float) – threshold of root mean square of final gradient update.

  • decay_rate (float) – coefficient used to compute running averages of square gradient.

  • beta1 (float | None) – coefficient used for computing running averages of gradient.

  • weight_decay (float) – weight decay coefficient.

  • multiply_by_parameter_scale (bool) – if True, the learning rate is scaled by the root mean square of the parameter.

  • warmup_init (bool) – whether to use a warm-up schedule when computing the time-dependent learning rate.
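
Example (an illustrative sketch, not taken from the library's docstring; the toy model, data, and the lr value of 1e-3 are placeholders):
import torch
from torch import nn

from ranzen.torch.optimizers import Adafactor

# Toy model and objective purely for illustration.
model = nn.Linear(10, 2)

# With lr=None (the default), a time-dependent learning rate is computed internally.
optimizer = Adafactor(model.parameters())

# With an explicit lr and multiply_by_parameter_scale=False, the learning rate can
# instead be driven by an external (manual) schedule.
optimizer = Adafactor(model.parameters(), lr=1e-3, multiply_by_parameter_scale=False)

loss = model(torch.randn(8, 10)).square().mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()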

step(closure=None)

Performs a single optimization step.

Parameters:

closure (Callable[[...], Tensor] | None) – A closure that reevaluates the model and returns the loss.

Returns:

loss returned by the closure if closure is not None else None.

Raises:

RuntimeError – if gradients are sparse.

Return type:

Tensor | None

class LAMB(params, lr=0.001, betas=(0.9, 0.999), *, eps=1e-06, weight_decay=0.0, clamp_value=10.0, debias=False)

Bases: Optimizer

Implements the LAMB algorithm introduced in Large Batch Optimization for Deep Learning.

LAMB serves as the AdamW counterpart to the LARS optimizer, similarly employing layerwise adaptive learning rates to train models effectively with large batch sizes.

Note

Implementation based on: https://github.com/cybertronai/pytorch-lamb

Parameters:
  • params (Iterable[Tensor]) – iterable of parameters to optimize or dicts defining parameter groups.

  • lr (float) – learning rate.

  • betas (tuple[float, float]) – coefficients used for computing running averages of gradient and its square.

  • eps (float) – term added to the denominator to improve numerical stability.

  • weight_decay (float) – weight decay coefficient.

  • clamp_value (float) – value to clamp the norm of the weights to.

  • debias (bool) – whether to include the bias-correction term (1 - beta**step) from Adam.

Raises:

ValueError – if any one of lr, betas, eps, or weight_decay is not in its permitted range.
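
Example (an illustrative sketch, not taken from the library's docstring; the toy model, data, and hyperparameter values are placeholders):
import torch
from torch import nn

from ranzen.torch.optimizers import LAMB

# Toy model and batch purely for illustration.
model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()
inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))

optimizer = LAMB(model.parameters(), lr=1e-3, weight_decay=1e-2, debias=True)

loss = loss_fn(model(inputs), targets)
loss.backward()
optimizer.step()
optimizer.zero_grad()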

step(closure=None)

Performs a single optimization step.

Parameters:

closure (Callable[[...], Tensor] | None) – A closure that reevaluates the model and returns the loss.

Returns:

loss returned by the closure if closure is not None else None.

Raises:

RuntimeError – if gradients are sparse.

Return type:

Tensor | None

class SAM(base_optimizer, *, rho=0.05, adaptive=True)

Bases: Optimizer

Implements the ‘Sharpness Aware Minimization’ (SAM) algorithm introduced in Sharpness Aware Minimization along with the adaptive variant introduced in ASAM.

SAM seeks parameters that lie in neighborhoods with uniformly low loss, rather than parameters that merely have low loss themselves. The adaptive variant addresses the original algorithm’s sensitivity to parameter re-scaling, which can weaken the connection between sharpness and the generalization gap.

Parameters:
  • base_optimizer (Optimizer) – Base optimizer for SAM.

  • rho (float) – Neighborhood size.

  • adaptive (bool) – Whether to use the adaptive variant of the algorithm.

Raises:

ValueError – if rho is negative.

Example:
import torch
from torch import nn
from torch.optim import AdamW

from ranzen.torch.optimizers import SAM

# Toy model and data purely for illustration.
model = nn.Linear(10, 2)
loss_function = nn.CrossEntropyLoss()
inputs = torch.randn(8, 10)
targets = torch.randint(0, 2, (8,))

# Use AdamW as the base optimizer.
base_optimizer = AdamW(model.parameters())
# Wrap the base optimizer in SAM.
optimizer = SAM(base_optimizer)

# Closure required for recomputing the loss after computing epsilon(w).
def _closure():
    return loss_function(model(inputs), targets)

loss = _closure()
loss.backward()

optimizer.step(closure=_closure)
optimizer.zero_grad()

load_state_dict(state_dict)

Loads the optimizer state.

Parameters:

state_dict (dict[str, Any]) – optimizer state. Should be an object returned from a call to state_dict().

Return type:

None
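
Example (an illustrative sketch continuing from the SAM example above; "sam_state.pt" is a placeholder path):
# Save the wrapper's state via the standard Optimizer state-dict API.
torch.save(optimizer.state_dict(), "sam_state.pt")

# Restore it later, after re-creating the model, base optimizer, and SAM wrapper.
optimizer.load_state_dict(torch.load("sam_state.pt"))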

step(closure)

Performs a single optimization step.

Parameters:

closure (Callable[[...], Tensor]) – A closure that reevaluates the model and returns the loss.

Returns:

loss returned by the closure.

Return type:

Tensor