Adam: Adaptive Moment Estimation¶
Adam (Adaptive Moment Estimation) computes per-parameter adaptive learning rates from the first and second gradient moments. Adam combines the advantages of two other optimizers: AdaGrad, which adapts the learning rate to the parameters, and RMSProp, which uses a moving average of squared gradients to set per-parameter learning rates. Adam also introduces bias-corrected estimates of the first and second gradient averages.
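To make the bias correction concrete, here is a minimal pure-Python sketch (not optimi's implementation) of a zero-initialized exponential moving average over a constant gradient. The helper name `ema_estimates` is hypothetical. Because the average starts at zero, the raw estimate is pulled toward zero in early steps; dividing by \(1 - \beta^t\) removes that bias.

```python
def ema_estimates(grads, beta):
    """Return (raw, corrected) moving-average estimates for each step.

    Illustrative helper only; it is not part of optimi.
    """
    avg = 0.0  # starting from zero is what introduces the bias
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        avg = beta * avg + (1 - beta) * g
        raw.append(avg)
        # Bias correction: rescale by the total weight accumulated so far.
        corrected.append(avg / (1 - beta**t))
    return raw, corrected


# A constant gradient of 2.0: the raw average starts near 0.2,
# while the corrected estimate stays at the true value.
raw, corrected = ema_estimates([2.0] * 5, beta=0.9)
print(raw[0], corrected[0])
```

For a constant gradient the corrected estimate equals the gradient at every step, which is the behavior the correction is designed to provide.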
Adam was introduced by Diederik Kingma and Jimmy Ba in Adam: A Method for Stochastic Optimization.
Hyperparameters¶
optimi sets the default \(\beta\)s to (0.9, 0.99) and default \(\epsilon\) to 1e-6. These values reflect current best practices and usually outperform the PyTorch defaults of (0.9, 0.999) and 1e-8.
If training with large batch sizes or observing training loss spikes, consider reducing \(\beta_2\) to a value in \([0.95, 0.99)\).
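As a rough guide to what changing \(\beta_2\) does, an exponential moving average with coefficient \(\beta\) effectively averages over about \(1 / (1 - \beta)\) recent steps. The sketch below uses a hypothetical helper, `ema_horizon`, to compare a few values. It is a rule of thumb, not a tuning recipe.

```python
def ema_horizon(beta):
    """Approximate number of recent steps an EMA with coefficient beta averages over."""
    return 1.0 / (1.0 - beta)


for beta2 in (0.999, 0.99, 0.95):
    print(f"beta2={beta2}: ~{ema_horizon(beta2):.0f} steps")
```

A shorter horizon lets the squared-gradient average react faster when the gradient scale changes suddenly, which is one plausible reason a lower \(\beta_2\) can help with loss spikes.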
optimi’s implementation of Adam supports both AdamW-style decoupled weight decay (decouple_wd=True) and fully decoupled weight decay (decouple_lr=True). Weight decay will likely need to be reduced when using fully decoupled weight decay, because the learning rate no longer scales the effective weight decay.
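A short arithmetic sketch shows how large the difference can be. It assumes, and this section does not state it, that fully decoupled weight decay scales the decay by the schedule ratio lr / max_lr instead of by lr. Both function names are hypothetical.

```python
def decoupled_shrink(lr, weight_decay):
    """Fraction of a parameter removed per step by AdamW-style decay."""
    return lr * weight_decay


def fully_decoupled_shrink(lr, max_lr, weight_decay):
    """Fraction removed per step, ASSUMING decay is scaled by lr / max_lr."""
    return (lr / max_lr) * weight_decay


lr = max_lr = 1e-3  # at the peak of the schedule the ratio is 1

adamw = decoupled_shrink(lr, weight_decay=1e-2)
same_wd = fully_decoupled_shrink(lr, max_lr, weight_decay=1e-2)
matched = fully_decoupled_shrink(lr, max_lr, weight_decay=lr * 1e-2)

print(adamw, same_wd, matched)
```

Under this assumption, reusing the same weight_decay value makes the per-step decay 1 / lr times stronger. Multiplying the AdamW value by the peak learning rate gives a comparable starting point.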
Adam ¶
Adam optimizer. Optionally with decoupled weight decay (AdamW).
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| params | Iterable[Tensor] \| Iterable[dict] | Iterable of parameters to optimize or dicts defining parameter groups | required | 
| lr | float | Learning rate | required | 
| betas | tuple[float, float] | Coefficients for gradient and squared gradient moving averages (default: (0.9, 0.99)) | (0.9, 0.99) | 
| weight_decay | float | Weight decay coefficient. If  | 0 | 
| eps | float | Added to denominator to improve numerical stability (default: 1e-6) | 1e-06 | 
| decouple_wd | bool | Apply decoupled weight decay instead of L2 penalty (default: False) | False | 
| decouple_lr | bool | Apply fully decoupled weight decay instead of L2 penalty (default: False) | False | 
| max_lr | float \| None | Maximum scheduled learning rate. Set if  | None | 
| kahan_sum | bool \| None | Enables Kahan summation for more accurate parameter updates when training in low precision (float16 or bfloat16). If unspecified, automatically applies for low precision parameters (default: None) | None | 
| foreach | bool \| None | Enables the foreach implementation. If unspecified, tries to use foreach over for-loop implementation since it can be significantly faster (default: None) | None | 
| triton | bool \| None | Enables Triton implementation. If unspecified, tries to use Triton as it is significantly faster than both for-loop and foreach implementations (default: None) | None | 
| gradient_release | bool | Fuses optimizer step and zero_grad as part of the parameter's backward pass. Requires model hooks created with  | False | 
Algorithm¶
Adam with L2 regularization.
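For reference, the standard Adam update with L2 regularization, following Kingma and Ba, can be written as below. The symbols are chosen here and may differ from the page's original notation: \(\gamma_t\) is the learning rate at step \(t\), \(\lambda\) the weight decay coefficient, \(\theta\) the parameters, and \(f\) the loss.

\[
\begin{aligned}
g_t &= \nabla_\theta f(\theta_{t-1}) + \lambda \theta_{t-1} \\
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1 - \beta_1^t) \\
\hat{v}_t &= v_t / (1 - \beta_2^t) \\
\theta_t &= \theta_{t-1} - \gamma_t\, \hat{m}_t / \left(\sqrt{\hat{v}_t} + \epsilon\right)
\end{aligned}
\]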
optimi’s Adam also supports AdamW’s decoupled weight decay and fully decoupled weight decay, which are not shown.
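The three weight decay modes can be sketched as a single-parameter update in plain Python. This is an illustrative sketch, not optimi's code. The function `adam_step` is hypothetical and only borrows the argument names from the parameter table. Two behaviors are assumptions: fully decoupled decay is scaled by lr / max_lr, and max_lr falls back to lr when unset.

```python
import math


def adam_step(p, grad, m, v, t, lr, betas=(0.9, 0.99), eps=1e-6,
              weight_decay=0.0, decouple_wd=False, decouple_lr=False,
              max_lr=None):
    """One Adam update on a scalar parameter. Returns (p, m, v)."""
    beta1, beta2 = betas
    if decouple_lr:
        # ASSUMPTION: fully decoupled decay uses the schedule ratio lr / max_lr.
        ratio = lr / (max_lr if max_lr is not None else lr)
        p = p * (1 - ratio * weight_decay)
    elif decouple_wd:
        # AdamW: decay the parameter directly, scaled by the learning rate.
        p = p * (1 - lr * weight_decay)
    elif weight_decay != 0:
        # L2 regularization: fold the penalty into the gradient.
        grad = grad + weight_decay * p

    # Moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad

    # Bias-corrected estimates (t counts from 1).
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v


# First step without weight decay: the update size is close to lr.
p, m, v = adam_step(1.0, grad=0.5, m=0.0, v=0.0, t=1, lr=0.1)
print(p)
```

With L2 regularization the penalty passes through the adaptive scaling, whereas both decoupled variants shrink the parameter independently of the gradient statistics.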