torch.nn.functional¶
Convolution functions¶
conv1d¶

torch.nn.functional.
conv1d
(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1) → Tensor¶ Applies a 1D convolution over an input signal composed of several input planes.
This operator supports TensorFloat32.
See
Conv1d
for details and output shape.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iW)$
weight – filters of shape $(\text{out\_channels} , \frac{\text{in\_channels}}{\text{groups}} , kW)$
bias – optional bias of shape $(\text{out\_channels})$ . Default:
None
stride – the stride of the convolving kernel. Can be a single number or a oneelement tuple (sW,). Default: 1
padding – implicit paddings on both sides of the input. Can be a single number or a oneelement tuple (padW,). Default: 0
dilation – the spacing between kernel elements. Can be a single number or a oneelement tuple (dW,). Default: 1
groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
Examples:
>>> filters = torch.randn(33, 16, 3) >>> inputs = torch.randn(20, 16, 50) >>> F.conv1d(inputs, filters)
conv2d¶

torch.nn.functional.
conv2d
(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1) → Tensor¶ Applies a 2D convolution over an input image composed of several input planes.
This operator supports TensorFloat32.
See
Conv2d
for details and output shape.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iH , iW)$
weight – filters of shape $(\text{out\_channels} , \frac{\text{in\_channels}}{\text{groups}} , kH , kW)$
bias – optional bias tensor of shape $(\text{out\_channels})$ . Default:
None
stride – the stride of the convolving kernel. Can be a single number or a tuple (sH, sW). Default: 1
padding – implicit paddings on both sides of the input. Can be a single number or a tuple (padH, padW). Default: 0
dilation – the spacing between kernel elements. Can be a single number or a tuple (dH, dW). Default: 1
groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
Examples:
>>> # With square kernels and equal stride >>> filters = torch.randn(8,4,3,3) >>> inputs = torch.randn(1,4,5,5) >>> F.conv2d(inputs, filters, padding=1)
conv3d¶

torch.nn.functional.
conv3d
(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1) → Tensor¶ Applies a 3D convolution over an input image composed of several input planes.
This operator supports TensorFloat32.
See
Conv3d
for details and output shape.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iT , iH , iW)$
weight – filters of shape $(\text{out\_channels} , \frac{\text{in\_channels}}{\text{groups}} , kT , kH , kW)$
bias – optional bias tensor of shape $(\text{out\_channels})$ . Default: None
stride – the stride of the convolving kernel. Can be a single number or a tuple (sT, sH, sW). Default: 1
padding – implicit paddings on both sides of the input. Can be a single number or a tuple (padT, padH, padW). Default: 0
dilation – the spacing between kernel elements. Can be a single number or a tuple (dT, dH, dW). Default: 1
groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
Examples:
>>> filters = torch.randn(33, 16, 3, 3, 3) >>> inputs = torch.randn(20, 16, 50, 10, 20) >>> F.conv3d(inputs, filters)
conv_transpose1d¶

torch.nn.functional.
conv_transpose1d
(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1, dilation=1) → Tensor¶ Applies a 1D transposed convolution operator over an input signal composed of several input planes, sometimes also called “deconvolution”.
This operator supports TensorFloat32.
See
ConvTranspose1d
for details and output shape.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iW)$
weight – filters of shape $(\text{in\_channels} , \frac{\text{out\_channels}}{\text{groups}} , kW)$
bias – optional bias of shape $(\text{out\_channels})$ . Default: None
stride – the stride of the convolving kernel. Can be a single number or a tuple
(sW,)
. Default: 1padding –
dilation * (kernel_size  1)  padding
zeropadding will be added to both sides of each dimension in the input. Can be a single number or a tuple(padW,)
. Default: 0output_padding – additional size added to one side of each dimension in the output shape. Can be a single number or a tuple
(out_padW)
. Default: 0groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
dilation – the spacing between kernel elements. Can be a single number or a tuple
(dW,)
. Default: 1
Examples:
>>> inputs = torch.randn(20, 16, 50) >>> weights = torch.randn(16, 33, 5) >>> F.conv_transpose1d(inputs, weights)
conv_transpose2d¶

torch.nn.functional.
conv_transpose2d
(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1, dilation=1) → Tensor¶ Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.
This operator supports TensorFloat32.
See
ConvTranspose2d
for details and output shape.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iH , iW)$
weight – filters of shape $(\text{in\_channels} , \frac{\text{out\_channels}}{\text{groups}} , kH , kW)$
bias – optional bias of shape $(\text{out\_channels})$ . Default: None
stride – the stride of the convolving kernel. Can be a single number or a tuple
(sH, sW)
. Default: 1padding –
dilation * (kernel_size  1)  padding
zeropadding will be added to both sides of each dimension in the input. Can be a single number or a tuple(padH, padW)
. Default: 0output_padding – additional size added to one side of each dimension in the output shape. Can be a single number or a tuple
(out_padH, out_padW)
. Default: 0groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
dilation – the spacing between kernel elements. Can be a single number or a tuple
(dH, dW)
. Default: 1
Examples:
>>> # With square kernels and equal stride >>> inputs = torch.randn(1, 4, 5, 5) >>> weights = torch.randn(4, 8, 3, 3) >>> F.conv_transpose2d(inputs, weights, padding=1)
conv_transpose3d¶

torch.nn.functional.
conv_transpose3d
(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1, dilation=1) → Tensor¶ Applies a 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”
This operator supports TensorFloat32.
See
ConvTranspose3d
for details and output shape.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iT , iH , iW)$
weight – filters of shape $(\text{in\_channels} , \frac{\text{out\_channels}}{\text{groups}} , kT , kH , kW)$
bias – optional bias of shape $(\text{out\_channels})$ . Default: None
stride – the stride of the convolving kernel. Can be a single number or a tuple
(sT, sH, sW)
. Default: 1padding –
dilation * (kernel_size  1)  padding
zeropadding will be added to both sides of each dimension in the input. Can be a single number or a tuple(padT, padH, padW)
. Default: 0output_padding – additional size added to one side of each dimension in the output shape. Can be a single number or a tuple
(out_padT, out_padH, out_padW)
. Default: 0groups – split input into groups, $\text{in\_channels}$ should be divisible by the number of groups. Default: 1
dilation – the spacing between kernel elements. Can be a single number or a tuple (dT, dH, dW). Default: 1
Examples:
>>> inputs = torch.randn(20, 16, 50, 10, 20) >>> weights = torch.randn(16, 33, 3, 3, 3) >>> F.conv_transpose3d(inputs, weights)
unfold¶

torch.nn.functional.
unfold
(input, kernel_size, dilation=1, padding=0, stride=1)[source]¶ Extracts sliding local blocks from an batched input tensor.
Warning
Currently, only 4D input tensors (batched imagelike tensors) are supported.
Warning
More than one element of the unfolded tensor may refer to a single memory location. As a result, inplace operations (especially ones that are vectorized) may result in incorrect behavior. If you need to write to the tensor, please clone it first.
See
torch.nn.Unfold
for details
fold¶

torch.nn.functional.
fold
(input, output_size, kernel_size, dilation=1, padding=0, stride=1)[source]¶ Combines an array of sliding local blocks into a large containing tensor.
Warning
Currently, only 3D output tensors (unfolded batched imagelike tensors) are supported.
See
torch.nn.Fold
for details
Pooling functions¶
avg_pool1d¶

torch.nn.functional.
avg_pool1d
(input, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True) → Tensor¶ Applies a 1D average pooling over an input signal composed of several input planes.
See
AvgPool1d
for details and output shape. Parameters
input – input tensor of shape $(\text{minibatch} , \text{in\_channels} , iW)$
kernel_size – the size of the window. Can be a single number or a tuple (kW,)
stride – the stride of the window. Can be a single number or a tuple (sW,). Default:
kernel_size
padding – implicit zero paddings on both sides of the input. Can be a single number or a tuple (padW,). Default: 0
ceil_mode – when True, will use ceil instead of floor to compute the output shape. Default:
False
count_include_pad – when True, will include the zeropadding in the averaging calculation. Default:
True
Examples:
>>> # pool of square window of size=3, stride=2 >>> input = torch.tensor([[[1, 2, 3, 4, 5, 6, 7]]], dtype=torch.float32) >>> F.avg_pool1d(input, kernel_size=3, stride=2) tensor([[[ 2., 4., 6.]]])
avg_pool2d¶

torch.nn.functional.
avg_pool2d
(input, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True, divisor_override=None) → Tensor¶ Applies 2D averagepooling operation in $kH \times kW$ regions by step size $sH \times sW$ steps. The number of output features is equal to the number of input planes.
See
AvgPool2d
for details and output shape. Parameters
input – input tensor $(\text{minibatch} , \text{in\_channels} , iH , iW)$
kernel_size – size of the pooling region. Can be a single number or a tuple (kH, kW)
stride – stride of the pooling operation. Can be a single number or a tuple (sH, sW). Default:
kernel_size
padding – implicit zero paddings on both sides of the input. Can be a single number or a tuple (padH, padW). Default: 0
ceil_mode – when True, will use ceil instead of floor in the formula to compute the output shape. Default:
False
count_include_pad – when True, will include the zeropadding in the averaging calculation. Default:
True
divisor_override – if specified, it will be used as divisor, otherwise size of the pooling region will be used. Default: None
avg_pool3d¶

torch.nn.functional.
avg_pool3d
(input, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True, divisor_override=None) → Tensor¶ Applies 3D averagepooling operation in $kT \times kH \times kW$ regions by step size $sT \times sH \times sW$ steps. The number of output features is equal to $\lfloor\frac{\text{input planes}}{sT}\rfloor$ .
See
AvgPool3d
for details and output shape. Parameters
input – input tensor $(\text{minibatch} , \text{in\_channels} , iT \times iH , iW)$
kernel_size – size of the pooling region. Can be a single number or a tuple (kT, kH, kW)
stride – stride of the pooling operation. Can be a single number or a tuple (sT, sH, sW). Default:
kernel_size
padding – implicit zero paddings on both sides of the input. Can be a single number or a tuple (padT, padH, padW), Default: 0
ceil_mode – when True, will use ceil instead of floor in the formula to compute the output shape
count_include_pad – when True, will include the zeropadding in the averaging calculation
divisor_override – if specified, it will be used as divisor, otherwise size of the pooling region will be used. Default: None
max_pool1d¶
max_pool2d¶
max_pool3d¶
max_unpool1d¶

torch.nn.functional.
max_unpool1d
(input, indices, kernel_size, stride=None, padding=0, output_size=None)[source]¶ Computes a partial inverse of
MaxPool1d
.See
MaxUnpool1d
for details.
max_unpool2d¶

torch.nn.functional.
max_unpool2d
(input, indices, kernel_size, stride=None, padding=0, output_size=None)[source]¶ Computes a partial inverse of
MaxPool2d
.See
MaxUnpool2d
for details.
max_unpool3d¶

torch.nn.functional.
max_unpool3d
(input, indices, kernel_size, stride=None, padding=0, output_size=None)[source]¶ Computes a partial inverse of
MaxPool3d
.See
MaxUnpool3d
for details.
lp_pool1d¶

torch.nn.functional.
lp_pool1d
(input, norm_type, kernel_size, stride=None, ceil_mode=False)[source]¶ Applies a 1D poweraverage pooling over an input signal composed of several input planes. If the sum of all inputs to the power of p is zero, the gradient is set to zero as well.
See
LPPool1d
for details.
lp_pool2d¶

torch.nn.functional.
lp_pool2d
(input, norm_type, kernel_size, stride=None, ceil_mode=False)[source]¶ Applies a 2D poweraverage pooling over an input signal composed of several input planes. If the sum of all inputs to the power of p is zero, the gradient is set to zero as well.
See
LPPool2d
for details.
adaptive_max_pool1d¶

torch.nn.functional.
adaptive_max_pool1d
(*args, **kwargs)¶ Applies a 1D adaptive max pooling over an input signal composed of several input planes.
See
AdaptiveMaxPool1d
for details and output shape. Parameters
output_size – the target output size (single integer)
return_indices – whether to return pooling indices. Default:
False
adaptive_max_pool2d¶

torch.nn.functional.
adaptive_max_pool2d
(*args, **kwargs)¶ Applies a 2D adaptive max pooling over an input signal composed of several input planes.
See
AdaptiveMaxPool2d
for details and output shape. Parameters
output_size – the target output size (single integer or doubleinteger tuple)
return_indices – whether to return pooling indices. Default:
False
adaptive_max_pool3d¶

torch.nn.functional.
adaptive_max_pool3d
(*args, **kwargs)¶ Applies a 3D adaptive max pooling over an input signal composed of several input planes.
See
AdaptiveMaxPool3d
for details and output shape. Parameters
output_size – the target output size (single integer or tripleinteger tuple)
return_indices – whether to return pooling indices. Default:
False
adaptive_avg_pool1d¶

torch.nn.functional.
adaptive_avg_pool1d
(input, output_size) → Tensor¶ Applies a 1D adaptive average pooling over an input signal composed of several input planes.
See
AdaptiveAvgPool1d
for details and output shape. Parameters
output_size – the target output size (single integer)
adaptive_avg_pool2d¶

torch.nn.functional.
adaptive_avg_pool2d
(input, output_size)[source]¶ Applies a 2D adaptive average pooling over an input signal composed of several input planes.
See
AdaptiveAvgPool2d
for details and output shape. Parameters
output_size – the target output size (single integer or doubleinteger tuple)
adaptive_avg_pool3d¶

torch.nn.functional.
adaptive_avg_pool3d
(input, output_size)[source]¶ Applies a 3D adaptive average pooling over an input signal composed of several input planes.
See
AdaptiveAvgPool3d
for details and output shape. Parameters
output_size – the target output size (single integer or tripleinteger tuple)
Nonlinear activation functions¶
threshold¶

torch.nn.functional.
threshold
(input, threshold, value, inplace=False)¶ Thresholds each element of the input Tensor.
See
Threshold
for more details.

torch.nn.functional.
threshold_
(input, threshold, value) → Tensor¶ Inplace version of
threshold()
.
relu¶
hardtanh¶

torch.nn.functional.
hardtanh
(input, min_val=1., max_val=1., inplace=False) → Tensor[source]¶ Applies the HardTanh function elementwise. See
Hardtanh
for more details.

torch.nn.functional.
hardtanh_
(input, min_val=1., max_val=1.) → Tensor¶ Inplace version of
hardtanh()
.
hardswish¶

torch.nn.functional.
hardswish
(input: torch.Tensor, inplace: bool = False) → torch.Tensor[source]¶ Applies the hardswish function, elementwise, as described in the paper:
$\text{Hardswish}(x) = \begin{cases} 0 & \text{if~} x \le 3, \\ x & \text{if~} x \ge +3, \\ x \cdot (x + 3) /6 & \text{otherwise} \end{cases}$See
Hardswish
for more details.
relu6¶
elu¶
selu¶
celu¶
leaky_relu¶

torch.nn.functional.
leaky_relu
(input, negative_slope=0.01, inplace=False) → Tensor[source]¶ Applies elementwise, $\text{LeakyReLU}(x) = \max(0, x) + \text{negative\_slope} * \min(0, x)$
See
LeakyReLU
for more details.

torch.nn.functional.
leaky_relu_
(input, negative_slope=0.01) → Tensor¶ Inplace version of
leaky_relu()
.
prelu¶
rrelu¶
glu¶
gelu¶
logsigmoid¶

torch.nn.functional.
logsigmoid
(input) → Tensor¶ Applies elementwise $\text{LogSigmoid}(x_i) = \log \left(\frac{1}{1 + \exp(x_i)}\right)$
See
LogSigmoid
for more details.
hardshrink¶

torch.nn.functional.
hardshrink
(input, lambd=0.5) → Tensor[source]¶ Applies the hard shrinkage function elementwise
See
Hardshrink
for more details.
tanhshrink¶

torch.nn.functional.
tanhshrink
(input) → Tensor[source]¶ Applies elementwise, $\text{Tanhshrink}(x) = x  \text{Tanh}(x)$
See
Tanhshrink
for more details.
softsign¶
softplus¶

torch.nn.functional.
softplus
(input, beta=1, threshold=20) → Tensor¶ Applies elementwise, the function $\text{Softplus}(x) = \frac{1}{\beta} * \log(1 + \exp(\beta * x))$ .
For numerical stability the implementation reverts to the linear function when $input \times \beta > threshold$ .
See
Softplus
for more details.
softmin¶

torch.nn.functional.
softmin
(input, dim=None, _stacklevel=3, dtype=None)[source]¶ Applies a softmin function.
Note that $\text{Softmin}(x) = \text{Softmax}(x)$ . See softmax definition for mathematical formula.
See
Softmin
for more details. Parameters
input (Tensor) – input
dim (int) – A dimension along which softmin will be computed (so every slice along dim will sum to 1).
dtype (
torch.dtype
, optional) – the desired data type of returned tensor. If specified, the input tensor is casted todtype
before the operation is performed. This is useful for preventing data type overflows. Default: None.
softmax¶

torch.nn.functional.
softmax
(input, dim=None, _stacklevel=3, dtype=None)[source]¶ Applies a softmax function.
Softmax is defined as:
$\text{Softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_j \exp(x_j)}$
It is applied to all slices along dim, and will rescale them so that the elements lie in the range [0, 1] and sum to 1.
See
Softmax
for more details. Parameters
input (Tensor) – input
dim (int) – A dimension along which softmax will be computed.
dtype (
torch.dtype
, optional) – the desired data type of returned tensor. If specified, the input tensor is casted todtype
before the operation is performed. This is useful for preventing data type overflows. Default: None.
Note
This function doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use log_softmax instead (it’s faster and has better numerical properties).
softshrink¶

torch.nn.functional.
softshrink
(input, lambd=0.5) → Tensor¶ Applies the soft shrinkage function elementwise
See
Softshrink
for more details.
gumbel_softmax¶

torch.nn.functional.
gumbel_softmax
(logits, tau=1, hard=False, eps=1e10, dim=1)[source]¶ Samples from the GumbelSoftmax distribution (Link 1 Link 2) and optionally discretizes.
 Parameters
logits – […, num_features] unnormalized log probabilities
tau – nonnegative scalar temperature
hard – if
True
, the returned samples will be discretized as onehot vectors, but will be differentiated as if it is the soft sample in autograddim (int) – A dimension along which softmax will be computed. Default: 1.
 Returns
Sampled tensor of same shape as logits from the GumbelSoftmax distribution. If
hard=True
, the returned samples will be onehot, otherwise they will be probability distributions that sum to 1 across dim.
Note
This function is here for legacy reasons, may be removed from nn.Functional in the future.
Note
The main trick for hard is to do y_hard  y_soft.detach() + y_soft
It achieves two things:  makes the output value exactly onehot (since we add then subtract y_soft value)  makes the gradient equal to y_soft gradient (since we strip all other gradients)
 Examples::
>>> logits = torch.randn(20, 32) >>> # Sample soft categorical using reparametrization trick: >>> F.gumbel_softmax(logits, tau=1, hard=False) >>> # Sample hard categorical using "Straightthrough" trick: >>> F.gumbel_softmax(logits, tau=1, hard=True)
log_softmax¶

torch.nn.functional.
log_softmax
(input, dim=None, _stacklevel=3, dtype=None)[source]¶ Applies a softmax followed by a logarithm.
While mathematically equivalent to log(softmax(x)), doing these two operations separately is slower, and numerically unstable. This function uses an alternative formulation to compute the output and gradient correctly.
See
LogSoftmax
for more details. Parameters
input (Tensor) – input
dim (int) – A dimension along which log_softmax will be computed.
dtype (
torch.dtype
, optional) – the desired data type of returned tensor. If specified, the input tensor is casted todtype
before the operation is performed. This is useful for preventing data type overflows. Default: None.
tanh¶
sigmoid¶
hardsigmoid¶

torch.nn.functional.
hardsigmoid
(input) → Tensor[source]¶ Applies the elementwise function
$\text{Hardsigmoid}(x) = \begin{cases} 0 & \text{if~} x \le 3, \\ 1 & \text{if~} x \ge +3, \\ x / 6 + 1 / 2 & \text{otherwise} \end{cases}$ Parameters
inplace – If set to
True
, will do this operation inplace. Default:False
See
Hardsigmoid
for more details.
silu¶

torch.nn.functional.
silu
(input, inplace=False)[source]¶ Applies the silu function, elementwise.
$\text{silu}(x) = x * \sigma(x), \text{where } \sigma(x) \text{ is the logistic sigmoid.}$Note
See Gaussian Error Linear Units (GELUs) where the SiLU (Sigmoid Linear Unit) was originally coined, and see SigmoidWeighted Linear Units for Neural Network Function Approximation in Reinforcement Learning and Swish: a SelfGated Activation Function where the SiLU was experimented with later.
See
SiLU
for more details.
Normalization functions¶
batch_norm¶

torch.nn.functional.
batch_norm
(input, running_mean, running_var, weight=None, bias=None, training=False, momentum=0.1, eps=1e05)[source]¶ Applies Batch Normalization for each channel across a batch of data.
See
BatchNorm1d
,BatchNorm2d
,BatchNorm3d
for details.
instance_norm¶

torch.nn.functional.
instance_norm
(input, running_mean=None, running_var=None, weight=None, bias=None, use_input_stats=True, momentum=0.1, eps=1e05)[source]¶ Applies Instance Normalization for each channel in each data sample in a batch.
See
InstanceNorm1d
,InstanceNorm2d
,InstanceNorm3d
for details.
layer_norm¶
local_response_norm¶

torch.nn.functional.
local_response_norm
(input, size, alpha=0.0001, beta=0.75, k=1.0)[source]¶ Applies local response normalization over an input signal composed of several input planes, where channels occupy the second dimension. Applies normalization across channels.
See
LocalResponseNorm
for details.
normalize¶

torch.nn.functional.
normalize
(input, p=2, dim=1, eps=1e12, out=None)[source]¶ Performs $L_p$ normalization of inputs over specified dimension.
For a tensor
input
of sizes $(n_0, ..., n_{dim}, ..., n_k)$ , each $n_{dim}$ element vector $v$ along dimensiondim
is transformed as$v = \frac{v}{\max(\lVert v \rVert_p, \epsilon)}.$With the default arguments it uses the Euclidean norm over vectors along dimension $1$ for normalization.
 Parameters
input – input tensor of any shape
p (float) – the exponent value in the norm formulation. Default: 2
dim (int) – the dimension to reduce. Default: 1
eps (float) – small value to avoid division by zero. Default: 1e12
out (Tensor, optional) – the output tensor. If
out
is used, this operation won’t be differentiable.
Linear functions¶
linear¶

torch.nn.functional.
linear
(input, weight, bias=None)[source]¶ Applies a linear transformation to the incoming data: $y = xA^T + b$ .
This operator supports TensorFloat32.
Shape:
Input: $(N, *, in\_features)$ N is the batch size, * means any number of additional dimensions
Weight: $(out\_features, in\_features)$
Bias: $(out\_features)$
Output: $(N, *, out\_features)$
bilinear¶

torch.nn.functional.
bilinear
(input1, input2, weight, bias=None)[source]¶ Applies a bilinear transformation to the incoming data: $y = x_1^T A x_2 + b$
Shape:
input1: $(N, *, H_{in1})$ where $H_{in1}=\text{in1\_features}$ and $*$ means any number of additional dimensions. All but the last dimension of the inputs should be the same.
input2: $(N, *, H_{in2})$ where $H_{in2}=\text{in2\_features}$
weight: $(\text{out\_features}, \text{in1\_features}, \text{in2\_features})$
bias: $(\text{out\_features})$
output: $(N, *, H_{out})$ where $H_{out}=\text{out\_features}$ and all but the last dimension are the same shape as the input.
Dropout functions¶
dropout¶

torch.nn.functional.
dropout
(input, p=0.5, training=True, inplace=False)[source]¶ During training, randomly zeroes some of the elements of the input tensor with probability
p
using samples from a Bernoulli distribution.See
Dropout
for details. Parameters
p – probability of an element to be zeroed. Default: 0.5
training – apply dropout if is
True
. Default:True
inplace – If set to
True
, will do this operation inplace. Default:False
alpha_dropout¶

torch.nn.functional.
alpha_dropout
(input, p=0.5, training=False, inplace=False)[source]¶ Applies alpha dropout to the input.
See
AlphaDropout
for details.
feature_alpha_dropout¶

torch.nn.functional.
feature_alpha_dropout
(input, p=0.5, training=False, inplace=False)[source]¶ Randomly masks out entire channels (a channel is a feature map, e.g. the $j$ th channel of the $i$ th sample in the batch input is a tensor $\text{input}[i, j]$ ) of the input tensor). Instead of setting activations to zero, as in regular Dropout, the activations are set to the negative saturation value of the SELU activation function.
Each element will be masked independently on every forward call with probability
p
using samples from a Bernoulli distribution. The elements to be masked are randomized on every forward call, and scaled and shifted to maintain zero mean and unit variance.See
FeatureAlphaDropout
for details. Parameters
p – dropout probability of a channel to be zeroed. Default: 0.5
training – apply dropout if is
True
. Default:True
inplace – If set to
True
, will do this operation inplace. Default:False
dropout2d¶

torch.nn.functional.
dropout2d
(input, p=0.5, training=True, inplace=False)[source]¶ Randomly zero out entire channels (a channel is a 2D feature map, e.g., the $j$ th channel of the $i$ th sample in the batched input is a 2D tensor $\text{input}[i, j]$ ) of the input tensor). Each channel will be zeroed out independently on every forward call with probability
p
using samples from a Bernoulli distribution.See
Dropout2d
for details. Parameters
p – probability of a channel to be zeroed. Default: 0.5
training – apply dropout if is
True
. Default:True
inplace – If set to
True
, will do this operation inplace. Default:False
dropout3d¶

torch.nn.functional.
dropout3d
(input, p=0.5, training=True, inplace=False)[source]¶ Randomly zero out entire channels (a channel is a 3D feature map, e.g., the $j$ th channel of the $i$ th sample in the batched input is a 3D tensor $\text{input}[i, j]$ ) of the input tensor). Each channel will be zeroed out independently on every forward call with probability
p
using samples from a Bernoulli distribution.See
Dropout3d
for details. Parameters
p – probability of a channel to be zeroed. Default: 0.5
training – apply dropout if is
True
. Default:True
inplace – If set to
True
, will do this operation inplace. Default:False
Sparse functions¶
embedding¶

torch.nn.functional.
embedding
(input, weight, padding_idx=None, max_norm=None, norm_type=2.0, scale_grad_by_freq=False, sparse=False)[source]¶ A simple lookup table that looks up embeddings in a fixed dictionary and size.
This module is often used to retrieve word embeddings using indices. The input to the module is a list of indices, and the embedding matrix, and the output is the corresponding word embeddings.
See
torch.nn.Embedding
for more details. Parameters
input (LongTensor) – Tensor containing indices into the embedding matrix
weight (Tensor) – The embedding matrix with number of rows equal to the maximum possible index + 1, and number of columns equal to the embedding size
padding_idx (int, optional) – If given, pads the output with the embedding vector at
padding_idx
(initialized to zeros) whenever it encounters the index.max_norm (float, optional) – If given, each embedding vector with norm larger than
max_norm
is renormalized to have normmax_norm
. Note: this will modifyweight
inplace.norm_type (float, optional) – The p of the pnorm to compute for the
max_norm
option. Default2
.scale_grad_by_freq (boolean, optional) – If given, this will scale gradients by the inverse of frequency of the words in the minibatch. Default
False
.sparse (bool, optional) – If
True
, gradient w.r.t.weight
will be a sparse tensor. See Notes undertorch.nn.Embedding
for more details regarding sparse gradients.
 Shape:
Input: LongTensor of arbitrary shape containing the indices to extract
 Weight: Embedding matrix of floating point type with shape (V, embedding_dim),
where V = maximum index + 1 and embedding_dim = the embedding size
Output: (*, embedding_dim), where * is the input shape
Examples:
>>> # a batch of 2 samples of 4 indices each >>> input = torch.tensor([[1,2,4,5],[4,3,2,9]]) >>> # an embedding matrix containing 10 tensors of size 3 >>> embedding_matrix = torch.rand(10, 3) >>> F.embedding(input, embedding_matrix) tensor([[[ 0.8490, 0.9625, 0.6753], [ 0.9666, 0.7761, 0.6108], [ 0.6246, 0.9751, 0.3618], [ 0.4161, 0.2419, 0.7383]], [[ 0.6246, 0.9751, 0.3618], [ 0.0237, 0.7794, 0.0528], [ 0.9666, 0.7761, 0.6108], [ 0.3385, 0.8612, 0.1867]]]) >>> # example with padding_idx >>> weights = torch.rand(10, 3) >>> weights[0, :].zero_() >>> embedding_matrix = weights >>> input = torch.tensor([[0,2,0,5]]) >>> F.embedding(input, embedding_matrix, padding_idx=0) tensor([[[ 0.0000, 0.0000, 0.0000], [ 0.5609, 0.5384, 0.8720], [ 0.0000, 0.0000, 0.0000], [ 0.6262, 0.2438, 0.7471]]])
embedding_bag¶

torch.nn.functional.
embedding_bag
(input, weight, offsets=None, max_norm=None, norm_type=2, scale_grad_by_freq=False, mode='mean', sparse=False, per_sample_weights=None, include_last_offset=False)[source]¶ Computes sums, means or maxes of bags of embeddings, without instantiating the intermediate embeddings.
See
torch.nn.EmbeddingBag
for more details.Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
 Parameters
input (LongTensor) – Tensor containing bags of indices into the embedding matrix
weight (Tensor) – The embedding matrix with number of rows equal to the maximum possible index + 1, and number of columns equal to the embedding size
offsets (LongTensor, optional) – Only used when
input
is 1D.offsets
determines the starting index position of each bag (sequence) ininput
.max_norm (float, optional) – If given, each embedding vector with norm larger than
max_norm
is renormalized to have normmax_norm
. Note: this will modifyweight
inplace.norm_type (float, optional) – The
p
in thep
norm to compute for themax_norm
option. Default2
.scale_grad_by_freq (boolean, optional) – if given, this will scale gradients by the inverse of frequency of the words in the minibatch. Default
False
. Note: this option is not supported whenmode="max"
.mode (string, optional) –
"sum"
,"mean"
or"max"
. Specifies the way to reduce the bag. Default:"mean"
sparse (bool, optional) – if
True
, gradient w.r.t.weight
will be a sparse tensor. See Notes undertorch.nn.Embedding
for more details regarding sparse gradients. Note: this option is not supported whenmode="max"
.per_sample_weights (Tensor, optional) – a tensor of float / double weights, or None to indicate all weights should be taken to be 1. If specified,
per_sample_weights
must have exactly the same shape as input and is treated as having the sameoffsets
, if those are not None.include_last_offset (bool, optional) – if
True
, the size of offsets is equal to the number of bags + 1.last element is the size of the input, or the ending index position of the last bag (The) –
Shape:
input
(LongTensor) andoffsets
(LongTensor, optional)If
input
is 2D of shape (B, N),it will be treated as
B
bags (sequences) each of fixed lengthN
, and this will returnB
values aggregated in a way depending on themode
.offsets
is ignored and required to beNone
in this case.If
input
is 1D of shape (N),it will be treated as a concatenation of multiple bags (sequences).
offsets
is required to be a 1D tensor containing the starting index positions of each bag ininput
. Therefore, foroffsets
of shape (B),input
will be viewed as havingB
bags. Empty bags (i.e., having 0length) will have returned vectors filled by zeros.
weight
(Tensor): the learnable weights of the module of shape (num_embeddings, embedding_dim)per_sample_weights
(Tensor, optional). Has the same shape asinput
.output
: aggregated embedding values of shape (B, embedding_dim)
Examples:
>>> # an Embedding module containing 10 tensors of size 3 >>> embedding_matrix = torch.rand(10, 3) >>> # a batch of 2 samples of 4 indices each >>> input = torch.tensor([1,2,4,5,4,3,2,9]) >>> offsets = torch.tensor([0,4]) >>> F.embedding_bag(embedding_matrix, input, offsets) tensor([[ 0.3397, 0.3552, 0.5545], [ 0.5893, 0.4386, 0.5882]])
one_hot¶

torch.nn.functional.
one_hot
(tensor, num_classes=1) → LongTensor¶ Takes LongTensor with index values of shape
(*)
and returns a tensor of shape(*, num_classes)
that have zeros everywhere except where the index of last dimension matches the corresponding value of the input tensor, in which case it will be 1.See also Onehot on Wikipedia .
 Parameters
tensor (LongTensor) – class values of any shape.
num_classes (int) – Total number of classes. If set to 1, the number of classes will be inferred as one greater than the largest class value in the input tensor.
 Returns
LongTensor that has one more dimension with 1 values at the index of last dimension indicated by the input, and 0 everywhere else.
Examples
>>> F.one_hot(torch.arange(0, 5) % 3) tensor([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]) >>> F.one_hot(torch.arange(0, 5) % 3, num_classes=5) tensor([[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]) >>> F.one_hot(torch.arange(0, 6).view(3,2) % 3) tensor([[[1, 0, 0], [0, 1, 0]], [[0, 0, 1], [1, 0, 0]], [[0, 1, 0], [0, 0, 1]]])
Distance functions¶
pairwise_distance¶

torch.nn.functional.
pairwise_distance
(x1, x2, p=2.0, eps=1e06, keepdim=False)[source]¶ See
torch.nn.PairwiseDistance
for details
cosine_similarity¶

torch.nn.functional.
cosine_similarity
(x1, x2, dim=1, eps=1e8) → Tensor¶ Returns cosine similarity between x1 and x2, computed along dim.
$\text{similarity} = \dfrac{x_1 \cdot x_2}{\max(\Vert x_1 \Vert _2 \cdot \Vert x_2 \Vert _2, \epsilon)}$ Parameters
 Shape:
Input: $(\ast_1, D, \ast_2)$ where D is at position dim.
Output: $(\ast_1, \ast_2)$ where 1 is at position dim.
Example:
>>> input1 = torch.randn(100, 128) >>> input2 = torch.randn(100, 128) >>> output = F.cosine_similarity(input1, input2) >>> print(output)
pdist¶

torch.nn.functional.
pdist
(input, p=2) → Tensor¶ Computes the pnorm distance between every pair of row vectors in the input. This is identical to the upper triangular portion, excluding the diagonal, of torch.norm(input[:, None]  input, dim=2, p=p). This function will be faster if the rows are contiguous.
If input has shape $N \times M$ then the output will have shape $\frac{1}{2} N (N  1)$ .
This function is equivalent to scipy.spatial.distance.pdist(input, ‘minkowski’, p=p) if $p \in (0, \infty)$ . When $p = 0$ it is equivalent to scipy.spatial.distance.pdist(input, ‘hamming’) * M. When $p = \infty$ , the closest scipy function is scipy.spatial.distance.pdist(xn, lambda x, y: np.abs(x  y).max()).
 Parameters
input – input tensor of shape $N \times M$ .
p – p value for the pnorm distance to calculate between each vector pair $\in [0, \infty]$ .
Loss functions¶
binary_cross_entropy¶

torch.nn.functional.
binary_cross_entropy
(input, target, weight=None, size_average=None, reduce=None, reduction='mean')[source]¶ Function that measures the Binary Cross Entropy between the target and the output.
See
BCELoss
for details. Parameters
input – Tensor of arbitrary shape
target – Tensor of the same shape as input
weight (Tensor, optional) – a manual rescaling weight if provided it’s repeated to match input tensor shape
size_average (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there multiple elements per sample. If the fieldsize_average
is set toFalse
, the losses are instead summed for each minibatch. Ignored when reduce isFalse
. Default:True
reduce (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged or summed over observations for each minibatch depending onsize_average
. Whenreduce
isFalse
, returns a loss per batch element instead and ignoressize_average
. Default:True
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'mean'
'sum'
.'none'
: no reduction will be applied,'mean'
: the sum of the output will be divided by the number of elements in the output,'sum'
: the output will be summed. Note:size_average
andreduce
are in the process of being deprecated, and in the meantime, specifying either of those two args will overridereduction
. Default:'mean'
Examples:
>>> input = torch.randn((3, 2), requires_grad=True) >>> target = torch.rand((3, 2), requires_grad=False) >>> loss = F.binary_cross_entropy(F.sigmoid(input), target) >>> loss.backward()
binary_cross_entropy_with_logits¶

torch.nn.functional.
binary_cross_entropy_with_logits
(input, target, weight=None, size_average=None, reduce=None, reduction='mean', pos_weight=None)[source]¶ Function that measures Binary Cross Entropy between target and output logits.
See
BCEWithLogitsLoss
for details. Parameters
input – Tensor of arbitrary shape
target – Tensor of the same shape as input
weight (Tensor, optional) – a manual rescaling weight if provided it’s repeated to match input tensor shape
size_average (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there multiple elements per sample. If the fieldsize_average
is set toFalse
, the losses are instead summed for each minibatch. Ignored when reduce isFalse
. Default:True
reduce (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged or summed over observations for each minibatch depending onsize_average
. Whenreduce
isFalse
, returns a loss per batch element instead and ignoressize_average
. Default:True
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'mean'
'sum'
.'none'
: no reduction will be applied,'mean'
: the sum of the output will be divided by the number of elements in the output,'sum'
: the output will be summed. Note:size_average
andreduce
are in the process of being deprecated, and in the meantime, specifying either of those two args will overridereduction
. Default:'mean'
pos_weight (Tensor, optional) – a weight of positive examples. Must be a vector with length equal to the number of classes.
Examples:
>>> input = torch.randn(3, requires_grad=True) >>> target = torch.empty(3).random_(2) >>> loss = F.binary_cross_entropy_with_logits(input, target) >>> loss.backward()
poisson_nll_loss¶

torch.nn.functional.
poisson_nll_loss
(input, target, log_input=True, full=False, size_average=None, eps=1e08, reduce=None, reduction='mean')[source]¶ Poisson negative log likelihood loss.
See
PoissonNLLLoss
for details. Parameters
input – expectation of underlying Poisson distribution.
target – random sample $target \sim \text{Poisson}(input)$ .
log_input – if
True
the loss is computed as $\exp(\text{input})  \text{target} * \text{input}$ , ifFalse
then loss is $\text{input}  \text{target} * \log(\text{input}+\text{eps})$ . Default:True
full – whether to compute full loss, i. e. to add the Stirling approximation term. Default:
False
$\text{target} * \log(\text{target})  \text{target} + 0.5 * \log(2 * \pi * \text{target})$ .size_average (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there multiple elements per sample. If the fieldsize_average
is set toFalse
, the losses are instead summed for each minibatch. Ignored when reduce isFalse
. Default:True
eps (float, optional) – Small value to avoid evaluation of $\log(0)$ when
log_input`=``False`
. Default: 1e8reduce (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged or summed over observations for each minibatch depending onsize_average
. Whenreduce
isFalse
, returns a loss per batch element instead and ignoressize_average
. Default:True
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'mean'
'sum'
.'none'
: no reduction will be applied,'mean'
: the sum of the output will be divided by the number of elements in the output,'sum'
: the output will be summed. Note:size_average
andreduce
are in the process of being deprecated, and in the meantime, specifying either of those two args will overridereduction
. Default:'mean'
cosine_embedding_loss¶

torch.nn.functional.
cosine_embedding_loss
(input1, input2, target, margin=0, size_average=None, reduce=None, reduction='mean') → Tensor[source]¶ See
CosineEmbeddingLoss
for details.
cross_entropy¶

torch.nn.functional.
cross_entropy
(input, target, weight=None, size_average=None, ignore_index=100, reduce=None, reduction='mean')[source]¶ This criterion combines log_softmax and nll_loss in a single function.
See
CrossEntropyLoss
for details. Parameters
input (Tensor) – $(N, C)$ where C = number of classes or $(N, C, H, W)$ in case of 2D Loss, or $(N, C, d_1, d_2, ..., d_K)$ where $K \geq 1$ in the case of Kdimensional loss.
target (Tensor) – $(N)$ where each value is $0 \leq \text{targets}[i] \leq C1$ , or $(N, d_1, d_2, ..., d_K)$ where $K \geq 1$ for Kdimensional loss.
weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a Tensor of size C
size_average (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there multiple elements per sample. If the fieldsize_average
is set toFalse
, the losses are instead summed for each minibatch. Ignored when reduce isFalse
. Default:True
ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient. When
size_average
isTrue
, the loss is averaged over nonignored targets. Default: 100reduce (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged or summed over observations for each minibatch depending onsize_average
. Whenreduce
isFalse
, returns a loss per batch element instead and ignoressize_average
. Default:True
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'mean'
'sum'
.'none'
: no reduction will be applied,'mean'
: the sum of the output will be divided by the number of elements in the output,'sum'
: the output will be summed. Note:size_average
andreduce
are in the process of being deprecated, and in the meantime, specifying either of those two args will overridereduction
. Default:'mean'
Examples:
>>> input = torch.randn(3, 5, requires_grad=True) >>> target = torch.randint(5, (3,), dtype=torch.int64) >>> loss = F.cross_entropy(input, target) >>> loss.backward()
ctc_loss¶

torch.nn.functional.
ctc_loss
(log_probs, targets, input_lengths, target_lengths, blank=0, reduction='mean', zero_infinity=False)[source]¶ The Connectionist Temporal Classification loss.
See
CTCLoss
for details.Note
In some circumstances when using the CUDA backend with CuDNN, this operator may select a nondeterministic algorithm to increase performance. If this is undesirable, you can try to make the operation deterministic (potentially at a performance cost) by setting
torch.backends.cudnn.deterministic = True
. Please see the notes on Reproducibility for background.Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
 Parameters
log_probs – $(T, N, C)$ where C = number of characters in alphabet including blank, T = input length, and N = batch size. The logarithmized probabilities of the outputs (e.g. obtained with
torch.nn.functional.log_softmax()
).targets – $(N, S)$ or (sum(target_lengths)). Targets cannot be blank. In the second form, the targets are assumed to be concatenated.
input_lengths – $(N)$ . Lengths of the inputs (must each be $\leq T$ )
target_lengths – $(N)$ . Lengths of the targets
blank (int, optional) – Blank label. Default $0$ .
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'mean'
'sum'
.'none'
: no reduction will be applied,'mean'
: the output losses will be divided by the target lengths and then the mean over the batch is taken,'sum'
: the output will be summed. Default:'mean'
zero_infinity (bool, optional) – Whether to zero infinite losses and the associated gradients. Default:
False
Infinite losses mainly occur when the inputs are too short to be aligned to the targets.
Example:
>>> log_probs = torch.randn(50, 16, 20).log_softmax(2).detach().requires_grad_() >>> targets = torch.randint(1, 20, (16, 30), dtype=torch.long) >>> input_lengths = torch.full((16,), 50, dtype=torch.long) >>> target_lengths = torch.randint(10,30,(16,), dtype=torch.long) >>> loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths) >>> loss.backward()
hinge_embedding_loss¶

torch.nn.functional.
hinge_embedding_loss
(input, target, margin=1.0, size_average=None, reduce=None, reduction='mean') → Tensor[source]¶ See
HingeEmbeddingLoss
for details.
kl_div¶

torch.nn.functional.
kl_div
(input, target, size_average=None, reduce=None, reduction='mean', log_target=False)[source]¶ The KullbackLeibler divergence Loss
See
KLDivLoss
for details. Parameters
input – Tensor of arbitrary shape
target – Tensor of the same shape as input
size_average (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there multiple elements per sample. If the fieldsize_average
is set toFalse
, the losses are instead summed for each minibatch. Ignored when reduce isFalse
. Default:True
reduce (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged or summed over observations for each minibatch depending onsize_average
. Whenreduce
isFalse
, returns a loss per batch element instead and ignoressize_average
. Default:True
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'batchmean'
'sum'
'mean'
.'none'
: no reduction will be applied'batchmean'
: the sum of the output will be divided by the batchsize'sum'
: the output will be summed'mean'
: the output will be divided by the number of elements in the output Default:'mean'
log_target (bool) – A flag indicating whether
target
is passed in the log space. It is recommended to pass certain distributions (likesoftmax
) in the log space to avoid numerical issues caused by explicitlog
. Default:False
Note
size_average
andreduce
are in the process of being deprecated, and in the meantime, specifying either of those two args will overridereduction
.Note
:attr:
reduction
='mean'
doesn’t return the true kl divergence value, please use :attr:reduction
='batchmean'
which aligns with KL math definition. In the next major release,'mean'
will be changed to be the same as ‘batchmean’.
l1_loss¶
mse_loss¶
margin_ranking_loss¶

torch.nn.functional.
margin_ranking_loss
(input1, input2, target, margin=0, size_average=None, reduce=None, reduction='mean') → Tensor[source]¶ See
MarginRankingLoss
for details.
multilabel_margin_loss¶

torch.nn.functional.
multilabel_margin_loss
(input, target, size_average=None, reduce=None, reduction='mean') → Tensor[source]¶ See
MultiLabelMarginLoss
for details.
multilabel_soft_margin_loss¶

torch.nn.functional.
multilabel_soft_margin_loss
(input, target, weight=None, size_average=None) → Tensor[source]¶ See
MultiLabelSoftMarginLoss
for details.
multi_margin_loss¶

torch.nn.functional.
multi_margin_loss
(input, target, p=1, margin=1.0, weight=None, size_average=None, reduce=None, reduction='mean')[source]¶  multi_margin_loss(input, target, p=1, margin=1, weight=None, size_average=None,
reduce=None, reduction=’mean’) > Tensor
See
MultiMarginLoss
for details.
nll_loss¶

torch.nn.functional.
nll_loss
(input, target, weight=None, size_average=None, ignore_index=100, reduce=None, reduction='mean')[source]¶ The negative log likelihood loss.
See
NLLLoss
for details. Parameters
input – $(N, C)$ where C = number of classes or $(N, C, H, W)$ in case of 2D Loss, or $(N, C, d_1, d_2, ..., d_K)$ where $K \geq 1$ in the case of Kdimensional loss.
target – $(N)$ where each value is $0 \leq \text{targets}[i] \leq C1$ , or $(N, d_1, d_2, ..., d_K)$ where $K \geq 1$ for Kdimensional loss.
weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a Tensor of size C
size_average (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged over each loss element in the batch. Note that for some losses, there multiple elements per sample. If the fieldsize_average
is set toFalse
, the losses are instead summed for each minibatch. Ignored when reduce isFalse
. Default:True
ignore_index (int, optional) – Specifies a target value that is ignored and does not contribute to the input gradient. When
size_average
isTrue
, the loss is averaged over nonignored targets. Default: 100reduce (bool, optional) – Deprecated (see
reduction
). By default, the losses are averaged or summed over observations for each minibatch depending onsize_average
. Whenreduce
isFalse
, returns a loss per batch element instead and ignoressize_average
. Default:True
reduction (string, optional) – Specifies the reduction to apply to the output:
'none'
'mean'
'sum'
.'none'
: no reduction will be applied,'mean'
: the sum of the output will be divided by the number of elements in the output,'sum'
: the output will be summed. Note:size_average
andreduce
are in the process of being deprecated, and in the meantime, specifying either of those two args will overridereduction
. Default:'mean'
Example:
>>> # input is of size N x C = 3 x 5 >>> input = torch.randn(3, 5, requires_grad=True) >>> # each element in target has to have 0 <= value < C >>> target = torch.tensor([1, 0, 4]) >>> output = F.nll_loss(F.log_softmax(input), target) >>> output.backward()
smooth_l1_loss¶

torch.nn.functional.
smooth_l1_loss
(input, target, size_average=None, reduce=None, reduction='mean', beta=1.0)[source]¶ Function that uses a squared term if the absolute elementwise error falls below beta and an L1 term otherwise.
See
SmoothL1Loss
for details.
soft_margin_loss¶

torch.nn.functional.
soft_margin_loss
(input, target, size_average=None, reduce=None, reduction='mean') → Tensor[source]¶ See
SoftMarginLoss
for details.
triplet_margin_loss¶

torch.nn.functional.
triplet_margin_loss
(anchor, positive, negative, margin=1.0, p=2, eps=1e06, swap=False, size_average=None, reduce=None, reduction='mean')[source]¶ See
TripletMarginLoss
for details
triplet_margin_with_distance_loss¶

torch.nn.functional.
triplet_margin_with_distance_loss
(anchor, positive, negative, *, distance_function=None, margin=1.0, swap=False, reduction='mean')[source]¶ See
TripletMarginWithDistanceLoss
for details.
Vision functions¶
pixel_shuffle¶

torch.nn.functional.
pixel_shuffle
()¶ Rearranges elements in a tensor of shape $(*, C \times r^2, H, W)$ to a tensor of shape $(*, C, H \times r, W \times r)$ .
See
PixelShuffle
for details. Parameters
Examples:
>>> input = torch.randn(1, 9, 4, 4) >>> output = torch.nn.functional.pixel_shuffle(input, 3) >>> print(output.size()) torch.Size([1, 1, 12, 12])
pad¶

torch.nn.functional.
pad
(input, pad, mode='constant', value=0)¶ Pads tensor.
 Padding size:
The padding size by which to pad some dimensions of
input
are described starting from the last dimension and moving forward. $\left\lfloor\frac{\text{len(pad)}}{2}\right\rfloor$ dimensions ofinput
will be padded. For example, to pad only the last dimension of the input tensor, thenpad
has the form $(\text{padding\_left}, \text{padding\_right})$ ; to pad the last 2 dimensions of the input tensor, then use $(\text{padding\_left}, \text{padding\_right},$ $\text{padding\_top}, \text{padding\_bottom})$ ; to pad the last 3 dimensions, use $(\text{padding\_left}, \text{padding\_right},$ $\text{padding\_top}, \text{padding\_bottom}$ $\text{padding\_front}, \text{padding\_back})$ . Padding mode:
See
torch.nn.ConstantPad2d
,torch.nn.ReflectionPad2d
, andtorch.nn.ReplicationPad2d
for concrete examples on how each of the padding modes works. Constant padding is implemented for arbitrary dimensions. Replicate padding is implemented for padding the last 3 dimensions of 5D input tensor, or the last 2 dimensions of 4D input tensor, or the last dimension of 3D input tensor. Reflect padding is only implemented for padding the last 2 dimensions of 4D input tensor, or the last dimension of 3D input tensor.
Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
 Parameters
Examples:
>>> t4d = torch.empty(3, 3, 4, 2) >>> p1d = (1, 1) # pad last dim by 1 on each side >>> out = F.pad(t4d, p1d, "constant", 0) # effectively zero padding >>> print(out.size()) torch.Size([3, 3, 4, 4]) >>> p2d = (1, 1, 2, 2) # pad last dim by (1, 1) and 2nd to last by (2, 2) >>> out = F.pad(t4d, p2d, "constant", 0) >>> print(out.size()) torch.Size([3, 3, 8, 4]) >>> t4d = torch.empty(3, 3, 4, 2) >>> p3d = (0, 1, 2, 1, 3, 3) # pad by (0, 1), (2, 1), and (3, 3) >>> out = F.pad(t4d, p3d, "constant", 0) >>> print(out.size()) torch.Size([3, 9, 7, 3])
interpolate¶

torch.nn.functional.
interpolate
(input, size=None, scale_factor=None, mode='nearest', align_corners=None, recompute_scale_factor=None)[source]¶ Down/up samples the input to either the given
size
or the givenscale_factor
The algorithm used for interpolation is determined by
mode
.Currently temporal, spatial and volumetric sampling are supported, i.e. expected inputs are 3D, 4D or 5D in shape.
The input dimensions are interpreted in the form: minibatch x channels x [optional depth] x [optional height] x width.
The modes available for resizing are: nearest, linear (3Donly), bilinear, bicubic (4Donly), trilinear (5Donly), area
 Parameters
input (Tensor) – the input tensor
size (int or Tuple[int] or Tuple[int, int] or Tuple[int, int, int]) – output spatial size.
scale_factor (float or Tuple[float]) – multiplier for spatial size. Has to match input size if it is a tuple.
mode (str) – algorithm used for upsampling:
'nearest'
'linear'
'bilinear'
'bicubic'
'trilinear'
'area'
. Default:'nearest'
align_corners (bool, optional) – Geometrically, we consider the pixels of the input and output as squares rather than points. If set to
True
, the input and output tensors are aligned by the center points of their corner pixels, preserving the values at the corner pixels. If set toFalse
, the input and output tensors are aligned by the corner points of their corner pixels, and the interpolation uses edge value padding for outofboundary values, making this operation independent of input size whenscale_factor
is kept the same. This only has an effect whenmode
is'linear'
,'bilinear'
,'bicubic'
or'trilinear'
. Default:False
recompute_scale_factor (bool, optional) – recompute the scale_factor for use in the interpolation calculation. When scale_factor is passed as a parameter, it is used to compute the output_size. If recompute_scale_factor is
`False
or not specified, the passedin scale_factor will be used in the interpolation computation. Otherwise, a new scale_factor will be computed based on the output and input sizes for use in the interpolation computation (i.e. the computation will be identical to if the computed output_size were passedin explicitly). Note that when scale_factor is floatingpoint, the recomputed scale_factor may differ from the one passed in due to rounding and precision issues.
Note
With
mode='bicubic'
, it’s possible to cause overshoot, in other words it can produce negative values or values greater than 255 for images. Explicitly callresult.clamp(min=0, max=255)
if you want to reduce the overshoot when displaying the image.Warning
With
align_corners = True
, the linearly interpolating modes (linear, bilinear, and trilinear) don’t proportionally align the output and input pixels, and thus the output values can depend on the input size. This was the default behavior for these modes up to version 0.3.1. Since then, the default behavior isalign_corners = False
. SeeUpsample
for concrete examples on how this affects the outputs.Warning
When scale_factor is specified, if recompute_scale_factor=True, scale_factor is used to compute the output_size which will then be used to infer new scales for the interpolation. The default behavior for recompute_scale_factor changed to False in 1.6.0, and scale_factor is used in the interpolation calculation.
Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
upsample¶

torch.nn.functional.
upsample
(input, size=None, scale_factor=None, mode='nearest', align_corners=None)[source]¶ Upsamples the input to either the given
size
or the givenscale_factor
Warning
This function is deprecated in favor of
torch.nn.functional.interpolate()
. This is equivalent withnn.functional.interpolate(...)
.Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
The algorithm used for upsampling is determined by
mode
.Currently temporal, spatial and volumetric upsampling are supported, i.e. expected inputs are 3D, 4D or 5D in shape.
The input dimensions are interpreted in the form: minibatch x channels x [optional depth] x [optional height] x width.
The modes available for upsampling are: nearest, linear (3Donly), bilinear, bicubic (4Donly), trilinear (5Donly)
 Parameters
input (Tensor) – the input tensor
size (int or Tuple[int] or Tuple[int, int] or Tuple[int, int, int]) – output spatial size.
scale_factor (float or Tuple[float]) – multiplier for spatial size. Has to match input size if it is a tuple.
mode (string) – algorithm used for upsampling:
'nearest'
'linear'
'bilinear'
'bicubic'
'trilinear'
. Default:'nearest'
align_corners (bool, optional) – Geometrically, we consider the pixels of the input and output as squares rather than points. If set to
True
, the input and output tensors are aligned by the center points of their corner pixels, preserving the values at the corner pixels. If set toFalse
, the input and output tensors are aligned by the corner points of their corner pixels, and the interpolation uses edge value padding for outofboundary values, making this operation independent of input size whenscale_factor
is kept the same. This only has an effect whenmode
is'linear'
,'bilinear'
,'bicubic'
or'trilinear'
. Default:False
Note
With
mode='bicubic'
, it’s possible to cause overshoot, in other words it can produce negative values or values greater than 255 for images. Explicitly callresult.clamp(min=0, max=255)
if you want to reduce the overshoot when displaying the image.Warning
With
align_corners = True
, the linearly interpolating modes (linear, bilinear, and trilinear) don’t proportionally align the output and input pixels, and thus the output values can depend on the input size. This was the default behavior for these modes up to version 0.3.1. Since then, the default behavior isalign_corners = False
. SeeUpsample
for concrete examples on how this affects the outputs.
upsample_nearest¶

torch.nn.functional.
upsample_nearest
(input, size=None, scale_factor=None)[source]¶ Upsamples the input, using nearest neighbours’ pixel values.
Warning
This function is deprecated in favor of
torch.nn.functional.interpolate()
. This is equivalent withnn.functional.interpolate(..., mode='nearest')
.Currently spatial and volumetric upsampling are supported (i.e. expected inputs are 4 or 5 dimensional).
 Parameters
Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
upsample_bilinear¶

torch.nn.functional.
upsample_bilinear
(input, size=None, scale_factor=None)[source]¶ Upsamples the input, using bilinear upsampling.
Warning
This function is deprecated in favor of
torch.nn.functional.interpolate()
. This is equivalent withnn.functional.interpolate(..., mode='bilinear', align_corners=True)
.Expected inputs are spatial (4 dimensional). Use upsample_trilinear fo volumetric (5 dimensional) inputs.
 Parameters
Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
grid_sample¶

torch.nn.functional.
grid_sample
(input, grid, mode='bilinear', padding_mode='zeros', align_corners=None)[source]¶ Given an
input
and a flowfieldgrid
, computes theoutput
usinginput
values and pixel locations fromgrid
.Currently, only spatial (4D) and volumetric (5D)
input
are supported.In the spatial (4D) case, for
input
with shape $(N, C, H_\text{in}, W_\text{in})$ andgrid
with shape $(N, H_\text{out}, W_\text{out}, 2)$ , the output will have shape $(N, C, H_\text{out}, W_\text{out})$ .For each output location
output[n, :, h, w]
, the size2 vectorgrid[n, h, w]
specifiesinput
pixel locationsx
andy
, which are used to interpolate the output valueoutput[n, :, h, w]
. In the case of 5D inputs,grid[n, d, h, w]
specifies thex
,y
,z
pixel locations for interpolatingoutput[n, :, d, h, w]
.mode
argument specifiesnearest
orbilinear
interpolation method to sample the input pixels.grid
specifies the sampling pixel locations normalized by theinput
spatial dimensions. Therefore, it should have most values in the range of[1, 1]
. For example, valuesx = 1, y = 1
is the lefttop pixel ofinput
, and valuesx = 1, y = 1
is the rightbottom pixel ofinput
.If
grid
has values outside the range of[1, 1]
, the corresponding outputs are handled as defined bypadding_mode
. Options arepadding_mode="zeros"
: use0
for outofbound grid locations,padding_mode="border"
: use border values for outofbound grid locations,padding_mode="reflection"
: use values at locations reflected by the border for outofbound grid locations. For location far away from the border, it will keep being reflected until becoming in bound, e.g., (normalized) pixel locationx = 3.5
reflects by border1
and becomesx' = 1.5
, then reflects by border1
and becomesx'' = 0.5
.
Note
This function is often used in conjunction with
affine_grid()
to build Spatial Transformer Networks .Note
When using the CUDA backend, this operation may induce nondeterministic behaviour in its backward pass that is not easily switched off. Please see the notes on Reproducibility for background.
Note
NaN values in
grid
would be interpreted as1
. Parameters
input (Tensor) – input of shape $(N, C, H_\text{in}, W_\text{in})$ (4D case) or $(N, C, D_\text{in}, H_\text{in}, W_\text{in})$ (5D case)
grid (Tensor) – flowfield of shape $(N, H_\text{out}, W_\text{out}, 2)$ (4D case) or $(N, D_\text{out}, H_\text{out}, W_\text{out}, 3)$ (5D case)
mode (str) – interpolation mode to calculate output values
'bilinear'
'nearest'
. Default:'bilinear'
Note: Whenmode='bilinear'
and the input is 5D, the interpolation mode used internally will actually be trilinear. However, when the input is 4D, the interpolation mode will legitimately be bilinear.padding_mode (str) – padding mode for outside grid values
'zeros'
'border'
'reflection'
. Default:'zeros'
align_corners (bool, optional) – Geometrically, we consider the pixels of the input as squares rather than points. If set to
True
, the extrema (1
and1
) are considered as referring to the center points of the input’s corner pixels. If set toFalse
, they are instead considered as referring to the corner points of the input’s corner pixels, making the sampling more resolution agnostic. This option parallels thealign_corners
option ininterpolate()
, and so whichever option is used here should also be used there to resize the input image before grid sampling. Default:False
 Returns
output Tensor
 Return type
output (Tensor)
Warning
When
align_corners = True
, the grid positions depend on the pixel size relative to the input image size, and so the locations sampled bygrid_sample()
will differ for the same input given at different resolutions (that is, after being upsampled or downsampled). The default behavior up to version 1.2.0 wasalign_corners = True
. Since then, the default behavior has been changed toalign_corners = False
, in order to bring it in line with the default forinterpolate()
.
affine_grid¶

torch.nn.functional.
affine_grid
(theta, size, align_corners=None)[source]¶ Generates a 2D or 3D flow field (sampling grid), given a batch of affine matrices
theta
.Note
This function is often used in conjunction with
grid_sample()
to build Spatial Transformer Networks . Parameters
theta (Tensor) – input batch of affine matrices with shape ($N \times 2 \times 3$ ) for 2D or ($N \times 3 \times 4$ ) for 3D
size (torch.Size) – the target output image size. ($N \times C \times H \times W$ for 2D or $N \times C \times D \times H \times W$ for 3D) Example: torch.Size((32, 3, 24, 24))
align_corners (bool, optional) – if
True
, consider1
and1
to refer to the centers of the corner pixels rather than the image corners. Refer togrid_sample()
for a more complete description. A grid generated byaffine_grid()
should be passed togrid_sample()
with the same setting for this option. Default:False
 Returns
output Tensor of size ($N \times H \times W \times 2$ )
 Return type
output (Tensor)
Warning
When
align_corners = True
, the grid positions depend on the pixel size relative to the input image size, and so the locations sampled bygrid_sample()
will differ for the same input given at different resolutions (that is, after being upsampled or downsampled). The default behavior up to version 1.2.0 wasalign_corners = True
. Since then, the default behavior has been changed toalign_corners = False
, in order to bring it in line with the default forinterpolate()
.Warning
When
align_corners = True
, 2D affine transforms on 1D data and 3D affine transforms on 2D data (that is, when one of the spatial dimensions has unit size) are illdefined, and not an intended use case. This is not a problem whenalign_corners = False
. Up to version 1.2.0, all grid points along a unit dimension were considered arbitrarily to be at1
. From version 1.3.0, underalign_corners = True
all grid points along a unit dimension are considered to be at`0
(the center of the input image).
DataParallel functions (multiGPU, distributed)¶
data_parallel¶

torch.nn.parallel.
data_parallel
(module, inputs, device_ids=None, output_device=None, dim=0, module_kwargs=None)[source]¶ Evaluates module(input) in parallel across the GPUs given in device_ids.
This is the functional version of the DataParallel module.
 Parameters
module (Module) – the module to evaluate in parallel
inputs (Tensor) – inputs to the module
device_ids (list of python:int or torch.device) – GPU ids on which to replicate module
output_device (list of python:int or torch.device) – GPU location of the output Use 1 to indicate the CPU. (default: device_ids[0])
 Returns
a Tensor containing the result of module(input) located on output_device