# SGD¶

class torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)[source]

Implements stochastic gradient descent (optionally with momentum).

Nesterov momentum is based on the formula from On the importance of initialization and momentum in deep learning.

Parameters
• params (iterable) – iterable of parameters to optimize or dicts defining parameter groups

• lr (float) – learning rate

• momentum (float, optional) – momentum factor (default: 0)

• weight_decay (float, optional) – weight decay (L2 penalty) (default: 0)

• dampening (float, optional) – dampening for momentum (default: 0)

• nesterov (bool, optional) – enables Nesterov momentum (default: False)

Example

>>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
>>> loss_fn(model(input), target).backward()
>>> optimizer.step()


Note

The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et. al. and implementations in some other frameworks.

Considering the specific case of Momentum, the update can be written as

\begin{aligned} v_{t+1} & = \mu * v_{t} + g_{t+1}, \\ p_{t+1} & = p_{t} - \text{lr} * v_{t+1}, \end{aligned}

where $p$, $g$, $v$ and $\mu$ denote the parameters, gradient, velocity, and momentum respectively.

This is in contrast to Sutskever et. al. and other frameworks which employ an update of the form

\begin{aligned} v_{t+1} & = \mu * v_{t} + \text{lr} * g_{t+1}, \\ p_{t+1} & = p_{t} - v_{t+1}. \end{aligned}

The Nesterov version is analogously modified.

add_param_group(param_group)

Add a param group to the Optimizer s param_groups.

This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters
• param_group (dict) – Specifies what Tensors should be optimized along with group

• optimization options. (specific) –

load_state_dict(state_dict)

Parameters

state_dict (dict) – optimizer state. Should be an object returned from a call to state_dict().

state_dict()

Returns the state of the optimizer as a dict.

It contains two entries:

• state - a dict holding current optimization state. Its content

differs between optimizer classes.

• param_groups - a dict containing all parameter groups

step(closure=None)[source]

Performs a single optimization step.

Parameters

closure (callable, optional) – A closure that reevaluates the model and returns the loss.

zero_grad(set_to_none=False)

Sets the gradients of all optimized torch.Tensor s to zero.

Parameters

set_to_none (bool) – instead of setting to zero, set the grads to None. This will in general have lower memory footprint, and can modestly improve performance. However, it changes certain behaviors. For example: 1. When the user tries to access a gradient and perform manual ops on it, a None attribute or a Tensor full of 0s will behave differently. 2. If the user requests zero_grad(set_to_none=True) followed by a backward pass, .grads are guaranteed to be None for params that did not receive a gradient. 3. torch.optim optimizers have a different behavior if the gradient is 0 or None (in one case it does the step with a gradient of 0 and in the other it skips the step altogether).