torch.nn

Parameters

class torch.nn.Parameter[source]

A kind of Variable that is to be considered a module parameter.

Parameters are Variable subclasses that have a special property when used with Modules: when they are assigned as Module attributes, they are automatically added to the module’s list of parameters and will appear, for example, in the parameters() iterator. Assigning a plain Variable has no such effect. This is because one might want to cache some temporary state in the model, such as the last hidden state of an RNN. If there were no such class as Parameter, these temporaries would be registered too.

Another difference is that parameters can’t be volatile and that they require gradient by default.
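The registration rule described above can be illustrated with a small pure-Python sketch. The classes ToyVariable, ToyParameter and ToyModule are hypothetical stand-ins invented for this illustration; they are not the real torch classes and only mimic the idea that attribute assignment records parameters but ignores other values.

```python
# Toy stand-ins, NOT the real torch.nn classes: they only mimic the
# registration rule described in the text.
class ToyVariable:
    def __init__(self, data):
        self.data = data


class ToyParameter(ToyVariable):
    """Marks a value as a learnable parameter."""


class ToyModule:
    def __init__(self):
        # Bypass our own __setattr__ while creating the registry itself.
        object.__setattr__(self, "_parameters", {})

    def __setattr__(self, name, value):
        # Only ToyParameter instances are recorded; plain ToyVariable
        # attributes (e.g. cached hidden state) are stored but not registered.
        if isinstance(value, ToyParameter):
            self._parameters[name] = value
        object.__setattr__(self, name, value)

    def parameters(self):
        return iter(self._parameters.values())


m = ToyModule()
m.weight = ToyParameter([0.1, 0.2])      # registered
m.last_hidden = ToyVariable([0.0, 0.0])  # temporary state, not registered
registered = [p.data for p in m.parameters()]
print(registered)  # [[0.1, 0.2]]
```

Both attributes remain readable on the object; only the one wrapped in ToyParameter shows up when iterating over parameters().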

Parameters:

Containers

Module

class torch.nn.Module[source]

Base class for all neural network modules.

Your models should also subclass this class.

Modules can also contain other Modules, allowing you to nest them in a tree structure. You can assign the submodules as regular attributes:

import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 20, 5)
        self.conv2 = nn.Conv2d(20, 20, 5)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        return F.relu(self.conv2(x))

Submodules assigned in this way will be registered, and will have their parameters converted too when you call .cuda(), etc.

children()[source]

Returns an iterator over children modules.

cpu(device_id=None)[source]

Moves all model parameters and buffers to the CPU.

cuda(device_id=None)[source]

Moves all model parameters and buffers to the GPU.

Parameters: device_id (int, optional) – if specified, all parameters will be copied to that device

double()[source]

Casts all parameters and buffers to double datatype.

eval()[source]

Sets the module in evaluation mode.

This has an effect only on modules such as Dropout or BatchNorm.

float()[source]

Casts all parameters and buffers to float datatype.

forward(*input)[source]

Defines the computation performed at every call.

Should be overridden by all subclasses.

half()[source]

Casts all parameters and buffers to half datatype.

load_state_dict(state_dict)[source]

Copies parameters and buffers from state_dict into this module and its descendants. The keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Parameters: state_dict (dict) – A dict containing parameters and persistent buffers.

parameters(memo=None)[source]

Returns an iterator over module parameters.

This is typically passed to an optimizer.

Example

>>> for param in model.parameters():
...     print(type(param.data), param.size())
<class 'torch.FloatTensor'> (20L,)
<class 'torch.FloatTensor'> (20L, 1L, 5L, 5L)
register_backward_hook(hook)[source]

Registers a backward hook on the module.

The hook will be called every time the gradients with respect to module inputs are computed. The hook should have the following signature:

hook(module, grad_input, grad_output) -> Tensor or None

The grad_input and grad_output may be tuples if the module has multiple inputs or outputs. The hook should not modify its arguments, but it can optionally return a new gradient with respect to input that will be used in place of grad_input in subsequent computations.

This function returns a handle with a method handle.remove() that removes the hook from the module.

register_buffer(name, tensor)[source]

Adds a persistent buffer to the module.

This is typically used to register a buffer that should not be considered a model parameter. For example, BatchNorm’s running_mean is not a parameter, but is part of the persistent state.

Buffers can be accessed as attributes using the given names.

Example

>>> self.register_buffer('running_mean', torch.zeros(num_features))
register_forward_hook(hook)[source]

Registers a forward hook on the module.

The hook will be called every time forward() computes an output. It should have the following signature:

hook(module, input, output) -> None

The hook should not modify the input or output. This function returns a handle with a method handle.remove() that removes the hook from the module.
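A removable hook handle of the kind described above can be sketched in plain Python. RemovableHandle and ToyModule here are hypothetical names for this illustration, and the bookkeeping (a dict keyed by an integer counter) is an assumption, not a description of how the library stores its hooks.

```python
class RemovableHandle:
    """Hypothetical handle: remembers where the hook lives so it can delete it."""

    def __init__(self, hooks, key):
        self._hooks = hooks
        self._key = key

    def remove(self):
        # Removing twice is harmless.
        self._hooks.pop(self._key, None)


class ToyModule:
    def __init__(self):
        self._forward_hooks = {}
        self._next_key = 0

    def register_forward_hook(self, hook):
        key = self._next_key
        self._next_key += 1
        self._forward_hooks[key] = hook
        return RemovableHandle(self._forward_hooks, key)

    def forward(self, x):
        return 2 * x

    def __call__(self, x):
        output = self.forward(x)
        # Hooks observe (module, input, output); their return value is ignored.
        for hook in list(self._forward_hooks.values()):
            hook(self, x, output)
        return output


seen = []
m = ToyModule()
handle = m.register_forward_hook(lambda module, inp, out: seen.append((inp, out)))
m(3)             # hook fires
handle.remove()
m(5)             # hook no longer fires
print(seen)      # [(3, 6)]
```

Returning a handle, rather than requiring the caller to pass the same function object back, lets lambdas and closures be unregistered cleanly.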

register_parameter(name, param)[source]

Adds a parameter to the module.

The parameter can be accessed as an attribute using the given name.

state_dict(destination=None, prefix='')[source]

Returns a dictionary containing the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are the corresponding parameter and buffer names.

Example

>>> module.state_dict().keys()
['bias', 'weight']
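To illustrate how the destination and prefix arguments can produce one flat dictionary for a nested model, here is a minimal sketch. It uses plain dicts as hypothetical stand-ins for modules (with assumed keys 'params', 'buffers' and 'children') and joins names with a dot; it is not the library's implementation.

```python
def toy_state_dict(module, destination=None, prefix=""):
    """Flatten a nested dict-based 'module' into one dict of prefixed keys.

    `module` is a hypothetical plain-dict stand-in with optional keys
    'params', 'buffers' and 'children'.
    """
    if destination is None:
        destination = {}
    for name, value in module.get("params", {}).items():
        destination[prefix + name] = value
    for name, value in module.get("buffers", {}).items():
        destination[prefix + name] = value
    for name, child in module.get("children", {}).items():
        # Each level of nesting extends the prefix with the child's name.
        toy_state_dict(child, destination, prefix + name + ".")
    return destination


model = {
    "children": {
        "conv1": {"params": {"weight": "W1", "bias": "b1"}},
        "bn1": {"params": {"weight": "g"}, "buffers": {"running_mean": "mu"}},
    }
}
state = toy_state_dict(model)
print(sorted(state))
# ['bn1.running_mean', 'bn1.weight', 'conv1.bias', 'conv1.weight']
```

Because buffers are written into the same dictionary as parameters, persistent state such as running averages is saved alongside the weights.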
train(mode=True)[source]

Sets the module in training mode.

This has an effect only on modules such as Dropout or BatchNorm.

zero_grad()[source]

Sets gradients of all model parameters to zero.

Sequential

class torch.nn.Sequential(*args)[source]

A sequential container. Modules will be added to it in the order they are passed in the constructor. Alternatively, an ordered dict of modules can also be passed in.

A small example makes this easier to understand:

# Example of using Sequential
model = nn.Sequential(
          nn.Conv2d(1,20,5),
          nn.ReLU(),
          nn.Conv2d(20,64,5),
          nn.ReLU()
        )

# Example of using Sequential with OrderedDict
from collections import OrderedDict

model = nn.Sequential(OrderedDict([
          ('conv1', nn.Conv2d(1,20,5)),
          ('relu1', nn.ReLU()),
          ('conv2', nn.Conv2d(20,64,5)),
          ('relu2', nn.ReLU())
        ]))
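The ordering behaviour of a sequential container can be sketched without any tensors. ToySequential below is a hypothetical stand-in that chains ordinary Python callables; naming positional stages by their index is an assumption made for this sketch.

```python
from collections import OrderedDict


class ToySequential:
    """Toy stand-in: calls each stage in insertion order."""

    def __init__(self, *args):
        if len(args) == 1 and isinstance(args[0], OrderedDict):
            # Named stages keep the names and order of the OrderedDict.
            self._stages = OrderedDict(args[0])
        else:
            # Positional stages get their index as a name.
            self._stages = OrderedDict((str(i), f) for i, f in enumerate(args))

    def __call__(self, x):
        for stage in self._stages.values():
            x = stage(x)
        return x

    def names(self):
        return list(self._stages)


def double(x):
    return 2 * x


def add_one(x):
    return x + 1


by_position = ToySequential(double, add_one)
by_name = ToySequential(OrderedDict([("scale", double), ("shift", add_one)]))
print(by_position(3), by_position.names())  # 7 ['0', '1']
print(by_name(3), by_name.names())          # 7 ['scale', 'shift']
```

Both forms compute the same result; the OrderedDict form only adds readable names for the stages.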

ModuleList

class torch.nn.ModuleList(modules=None)[source]

Holds submodules in a list.

ModuleList can be indexed like a regular Python list, but the modules it contains are properly registered and will be visible to all Module methods.

Parameters: modules (list, optional) – a list of modules to add

Example:

class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(10, 10) for i in range(10)])

    def forward(self, x):
        # ModuleList can act as an iterable, or be indexed using ints
        for i, l in enumerate(self.linears):
            x = self.linears[i // 2](x) + l(x)
        return x
append(module)[source]

Appends a given module at the end of the list.

Parameters: module (nn.Module) – module to append

extend(modules)[source]

Appends modules from a Python list at the end.

Parameters: modules (list) – list of modules to append

ParameterList

class torch.nn.ParameterList(parameters=None)[source]

Holds parameters in a list.

ParameterList can be indexed like a regular Python list, but the parameters it contains are properly registered and will be visible to all Module methods.

Parameters: parameters (list, optional) – a list of nn.Parameter to add

Example:

class MyModule(nn.Module):
    def __init__(self):
        super(MyModule, self).__init__()
        self.params = nn.ParameterList([nn.Parameter(torch.randn(10, 10)) for i in range(10)])

    def forward(self, x):
        # ParameterList can act as an iterable, or be indexed using ints
        for i, p in enumerate(self.params):
            x = self.params[i // 2].mm(x) + p.mm(x)
        return x
append(parameter)[source]

Appends a given parameter at the end of the list.

Parameters: parameter (nn.Parameter) – parameter to append

extend(parameters)[source]

Appends parameters from a Python list at the end.

Parameters: parameters (list) – list of parameters to append

Convolution Layers

Conv1d

class torch.nn.Conv1d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)[source]

Applies a 1D convolution over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C_{in}, L)\) and output \((N, C_{out}, L_{out})\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_{out_j}) = bias(C_{out_j}) + \sum_{{k}=0}^{C_{in}-1} weight(C_{out_j}, k) \star input(N_i, k) \end{array}\]

where \(\star\) is the valid cross-correlation operator
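As a minimal sketch of what a valid cross-correlation means for a single input plane and a single kernel, the plain-Python function below slides the kernel across the signal without flipping it and without padding. The function name is hypothetical and the code works on Python lists rather than tensors.

```python
def valid_cross_correlation_1d(signal, kernel):
    """Slide the kernel over the signal without flipping it and without padding."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


signal = [1, 2, 3, 4, 5]
kernel = [1, 0, -1]
result = valid_cross_correlation_1d(signal, kernel)
print(result)  # [-2, -2, -2]
```

A true convolution would reverse the kernel first, which for this asymmetric kernel would flip the sign of every output. The output has len(signal) - len(kernel) + 1 entries because only positions where the kernel fits entirely inside the signal are evaluated.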

stride controls the stride for the cross-correlation.
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.
groups controls the connections between inputs and outputs.
At groups=1, all inputs are convolved to all outputs.
At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.
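The effect of groups on connectivity can be sketched by listing, for each output channel, the input channels it reads from. The helper name is hypothetical, and the sketch assumes that both channel counts are divisible by groups and that channels are split into contiguous blocks.

```python
def group_channel_map(in_channels, out_channels, groups):
    """For each output channel, list the input channels it is connected to."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    in_per_group = in_channels // groups
    out_per_group = out_channels // groups
    mapping = {}
    for out_ch in range(out_channels):
        g = out_ch // out_per_group  # which group this output channel belongs to
        mapping[out_ch] = list(range(g * in_per_group, (g + 1) * in_per_group))
    return mapping


dense = group_channel_map(4, 2, groups=1)
split = group_channel_map(4, 2, groups=2)
print(dense)  # {0: [0, 1, 2, 3], 1: [0, 1, 2, 3]}
print(split)  # {0: [0, 1], 1: [2, 3]}
```

With groups=2 each output channel sees only half of the inputs, which matches the picture of two independent convolution layers whose outputs are concatenated.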

Note

Depending on the size of your kernel, several (of the last) columns of the input might be lost, because it is a valid cross-correlation, and not a full cross-correlation. It is up to the user to add proper padding.

Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input
  • dilation (int or tuple, optional) – Spacing between kernel elements
  • groups (int, optional) – Number of blocked connections from input channels to output channels
  • bias (bool, optional) – If True, adds a learnable bias to the output
Shape:
  • Input: \((N, C_{in}, L_{in})\)
  • Output: \((N, C_{out}, L_{out})\) where \(L_{out} = floor((L_{in} + 2 * padding - dilation * (kernel\_size - 1) - 1) / stride + 1)\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (out_channels, in_channels, kernel_size)
  • bias (Tensor) – the learnable bias of the module of shape (out_channels)

Examples:

>>> m = nn.Conv1d(16, 33, 3, stride=2)
>>> input = autograd.Variable(torch.randn(20, 16, 50))
>>> output = m(input)
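The output length for the example above can be checked by hand with the formula from the Shape section. The helper below is a plain-Python sketch with a hypothetical name; it only evaluates the formula and does not run a convolution.

```python
import math


def conv_output_length(l_in, kernel_size, stride=1, padding=0, dilation=1):
    """L_out = floor((L_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)"""
    return math.floor(
        (l_in + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1
    )


# Same settings as the example: kernel_size=3, stride=2, input length 50.
l_out = conv_output_length(50, kernel_size=3, stride=2)
print((20, 33, l_out))  # (20, 33, 24)
```

The batch size (20) is unchanged, the channel dimension becomes out_channels (33), and only the last dimension depends on the kernel, stride, padding and dilation.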

Conv2d

class torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)[source]

Applies a 2D convolution over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C_{in}, H, W)\) and output \((N, C_{out}, H_{out}, W_{out})\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_{out_j}) = bias(C_{out_j}) + \sum_{{k}=0}^{C_{in}-1} weight(C_{out_j}, k) \star input(N_i, k) \end{array}\]

where \(\star\) is the valid 2D cross-correlation operator

stride controls the stride for the cross-correlation.
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.
groups controls the connections between inputs and outputs.
At groups=1, all inputs are convolved to all outputs.
At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.

The parameters kernel_size, stride, padding, dilation can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension

Note

Depending on the size of your kernel, several (of the last) columns of the input might be lost, because it is a valid cross-correlation, and not a full cross-correlation. It is up to the user to add proper padding.

Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input
  • dilation (int or tuple, optional) – Spacing between kernel elements
  • groups (int, optional) – Number of blocked connections from input channels to output channels
  • bias (bool, optional) – If True, adds a learnable bias to the output
Shape:
  • Input: \((N, C_{in}, H_{in}, W_{in})\)
  • Output: \((N, C_{out}, H_{out}, W_{out})\) where \(H_{out} = floor((H_{in} + 2 * padding[0] - dilation[0] * (kernel\_size[0] - 1) - 1) / stride[0] + 1)\) \(W_{out} = floor((W_{in} + 2 * padding[1] - dilation[1] * (kernel\_size[1] - 1) - 1) / stride[1] + 1)\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (out_channels, in_channels, kernel_size[0], kernel_size[1])
  • bias (Tensor) – the learnable bias of the module of shape (out_channels)

Examples:

>>> # With square kernels and equal stride
>>> m = nn.Conv2d(16, 33, 3, stride=2)
>>> # non-square kernels and unequal stride and with padding
>>> m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
>>> # non-square kernels and unequal stride and with padding and dilation
>>> m = nn.Conv2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 100))
>>> output = m(input)
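The int-or-tuple convention and the 2D shape formula can be combined in a short sketch. The helpers as_pair and conv2d_output_size are hypothetical names for this illustration; they evaluate the documented formula per dimension and do not perform a convolution.

```python
import math


def as_pair(value):
    """Expand a single int to (value, value); pass a 2-tuple through."""
    if isinstance(value, int):
        return (value, value)
    if len(value) != 2:
        raise ValueError("expected an int or a tuple of two ints")
    return tuple(value)


def conv2d_output_size(h_in, w_in, kernel_size, stride=1, padding=0, dilation=1):
    k, s, p, d = (as_pair(v) for v in (kernel_size, stride, padding, dilation))

    def one_dim(size, i):
        # Index 0 is the height dimension, index 1 the width dimension.
        return math.floor((size + 2 * p[i] - d[i] * (k[i] - 1) - 1) / s[i] + 1)

    return one_dim(h_in, 0), one_dim(w_in, 1)


square = conv2d_output_size(50, 100, 3, stride=2)
dilated = conv2d_output_size(
    50, 100, (3, 5), stride=(2, 1), padding=(4, 2), dilation=(3, 1)
)
print(square)   # (24, 49)
print(dilated)  # (26, 100)
```

For the last module in the example, the full output size would therefore be (20, 33, 26, 100).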

Conv3d

class torch.nn.Conv3d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True)[source]

Applies a 3D convolution over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C_{in}, D, H, W)\) and output \((N, C_{out}, D_{out}, H_{out}, W_{out})\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_{out_j}) = bias(C_{out_j}) + \sum_{{k}=0}^{C_{in}-1} weight(C_{out_j}, k) \star input(N_i, k) \end{array}\]

where \(\star\) is the valid 3D cross-correlation operator

stride controls the stride for the cross-correlation.
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.
groups controls the connections between inputs and outputs.
At groups=1, all inputs are convolved to all outputs.
At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.

The parameters kernel_size, stride, padding, dilation can either be:

  • a single int – in which case the same value is used for the depth, height and width dimensions
  • a tuple of three ints – in which case, the first int is used for the depth dimension, the second int for the height dimension and the third int for the width dimension

Note

Depending on the size of your kernel, several (of the last) columns of the input might be lost, because it is a valid cross-correlation, and not a full cross-correlation. It is up to the user to add proper padding.

Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input
  • dilation (int or tuple, optional) – Spacing between kernel elements
  • groups (int, optional) – Number of blocked connections from input channels to output channels
  • bias (bool, optional) – If True, adds a learnable bias to the output
Shape:
  • Input: \((N, C_{in}, D_{in}, H_{in}, W_{in})\)
  • Output: \((N, C_{out}, D_{out}, H_{out}, W_{out})\) where \(D_{out} = floor((D_{in} + 2 * padding[0] - dilation[0] * (kernel\_size[0] - 1) - 1) / stride[0] + 1)\) \(H_{out} = floor((H_{in} + 2 * padding[1] - dilation[1] * (kernel\_size[1] - 1) - 1) / stride[1] + 1)\) \(W_{out} = floor((W_{in} + 2 * padding[2] - dilation[2] * (kernel\_size[2] - 1) - 1) / stride[2] + 1)\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (out_channels, in_channels, kernel_size[0], kernel_size[1], kernel_size[2])
  • bias (Tensor) – the learnable bias of the module of shape (out_channels)

Examples:

>>> # With square kernels and equal stride
>>> m = nn.Conv3d(16, 33, 3, stride=2)
>>> # non-square kernels and unequal stride and with padding
>>> m = nn.Conv3d(16, 33, (3, 5, 2), stride=(2, 1, 1), padding=(4, 2, 0))
>>> input = autograd.Variable(torch.randn(20, 16, 10, 50, 100))
>>> output = m(input)

ConvTranspose1d

class torch.nn.ConvTranspose1d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True)[source]

Applies a 1D transposed convolution operator over an input image composed of several input planes.

This module can be seen as the gradient of Conv1d with respect to its input. It is sometimes (but incorrectly) referred to as a deconvolutional operation.

Note

Depending on the size of your kernel, several (of the last) columns of the input might be lost, because it is a valid cross-correlation, and not a full cross-correlation. It is up to the user to add proper padding.

Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input
  • output_padding (int or tuple, optional) – Zero-padding added to one side of the output
  • groups (int, optional) – Number of blocked connections from input channels to output channels
  • bias (bool, optional) – If True, adds a learnable bias to the output
Shape:
  • Input: \((N, C_{in}, L_{in})\)
  • Output: \((N, C_{out}, L_{out})\) where \(L_{out} = (L_{in} - 1) * stride - 2 * padding + kernel\_size + output\_padding\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (in_channels, out_channels, kernel_size)
  • bias (Tensor) – the learnable bias of the module of shape (out_channels)

ConvTranspose2d

class torch.nn.ConvTranspose2d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True)[source]

Applies a 2D transposed convolution operator over an input image composed of several input planes.

This module can be seen as the gradient of Conv2d with respect to its input. It is sometimes (but incorrectly) referred to as a deconvolutional operation.

stride controls the stride for the cross-correlation.
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
If output_padding is non-zero, then the output is implicitly zero-padded on one side for output_padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.
groups controls the connections between inputs and outputs.
At groups=1, all inputs are convolved to all outputs.
At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.

The parameters kernel_size, stride, padding, output_padding can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension

Note

Depending on the size of your kernel, several (of the last) columns of the input might be lost, because it is a valid cross-correlation, and not a full cross-correlation. It is up to the user to add proper padding.

Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input
  • output_padding (int or tuple, optional) – Zero-padding added to one side of the output
  • groups (int, optional) – Number of blocked connections from input channels to output channels
  • bias (bool, optional) – If True, adds a learnable bias to the output
Shape:
  • Input: \((N, C_{in}, H_{in}, W_{in})\)
  • Output: \((N, C_{out}, H_{out}, W_{out})\) where \(H_{out} = (H_{in} - 1) * stride[0] - 2 * padding[0] + kernel\_size[0] + output\_padding[0]\) \(W_{out} = (W_{in} - 1) * stride[1] - 2 * padding[1] + kernel\_size[1] + output\_padding[1]\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (in_channels, out_channels, kernel_size[0], kernel_size[1])
  • bias (Tensor) – the learnable bias of the module of shape (out_channels)

Examples:

>>> # With square kernels and equal stride
>>> m = nn.ConvTranspose2d(16, 33, 3, stride=2)
>>> # non-square kernels and unequal stride and with padding
>>> m = nn.ConvTranspose2d(16, 33, (3, 5), stride=(2, 1), padding=(4, 2))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 100))
>>> output = m(input)
>>> # exact output size can be also specified as an argument
>>> input = autograd.Variable(torch.randn(1, 16, 12, 12))
>>> downsample = nn.Conv2d(16, 16, 3, stride=2, padding=1)
>>> upsample = nn.ConvTranspose2d(16, 16, 3, stride=2, padding=1)
>>> h = downsample(input)
>>> h.size()
torch.Size([1, 16, 6, 6])
>>> output = upsample(h, output_size=input.size())
>>> output.size()
torch.Size([1, 16, 12, 12])
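The reason an explicit output size can be useful follows from the two shape formulas: a strided convolution may map several input sizes to the same output size. The sketch below evaluates both formulas for one spatial dimension with the settings from the example (kernel 3, stride 2, padding 1). The helper names are hypothetical, and treating output_size as a way of selecting the output padding is an assumption of this sketch.

```python
import math


def conv_out(size, kernel_size, stride, padding):
    # Forward convolution formula with dilation fixed at 1.
    return math.floor((size + 2 * padding - (kernel_size - 1) - 1) / stride + 1)


def conv_transpose_out(size, kernel_size, stride, padding, output_padding=0):
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding


# Two different input sizes collapse to the same downsampled size...
down_from_11 = conv_out(11, 3, 2, 1)
down_from_12 = conv_out(12, 3, 2, 1)
# ...so the transposed layer needs extra information to pick one of them.
up_default = conv_transpose_out(6, 3, 2, 1)
up_padded = conv_transpose_out(6, 3, 2, 1, output_padding=1)
print(down_from_11, down_from_12)  # 6 6
print(up_default, up_padded)       # 11 12
```

Without extra information the formula yields 11; one extra point of output padding gives the 12 that matches the original input in the example.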

ConvTranspose3d

class torch.nn.ConvTranspose3d(in_channels, out_channels, kernel_size, stride=1, padding=0, output_padding=0, groups=1, bias=True)[source]

Applies a 3D transposed convolution operator over an input image composed of several input planes. The transposed convolution operator multiplies each input value element-wise by a learnable kernel, and sums over the outputs from all input feature planes.

This module can be seen as the exact reverse of Conv3d. It is sometimes (but incorrectly) referred to as a deconvolutional operation.

stride controls the stride for the cross-correlation.
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
If output_padding is non-zero, then the output is implicitly zero-padded on one side for output_padding number of points
groups controls the connections between inputs and outputs.
At groups=1, all inputs are convolved to all outputs.
At groups=2, the operation becomes equivalent to having two conv layers side by side, each seeing half the input channels, and producing half the output channels, and both subsequently concatenated.

The parameters kernel_size, stride, padding, output_padding can either be:

  • a single int – in which case the same value is used for the depth, height and width dimensions
  • a tuple of three ints – in which case, the first int is used for the depth dimension, the second int for the height dimension and the third int for the width dimension

Note

Depending on the size of your kernel, several (of the last) columns of the input might be lost, because it is a valid cross-correlation, and not a full cross-correlation. It is up to the user to add proper padding.

Parameters:
  • in_channels (int) – Number of channels in the input image
  • out_channels (int) – Number of channels produced by the convolution
  • kernel_size (int or tuple) – Size of the convolving kernel
  • stride (int or tuple, optional) – Stride of the convolution
  • padding (int or tuple, optional) – Zero-padding added to both sides of the input
  • output_padding (int or tuple, optional) – Zero-padding added to one side of the output
  • groups (int, optional) – Number of blocked connections from input channels to output channels
  • bias (bool, optional) – If True, adds a learnable bias to the output
Shape:
  • Input: \((N, C_{in}, D_{in}, H_{in}, W_{in})\)
  • Output: \((N, C_{out}, D_{out}, H_{out}, W_{out})\) where \(D_{out} = (D_{in} - 1) * stride[0] - 2 * padding[0] + kernel\_size[0] + output\_padding[0]\) \(H_{out} = (H_{in} - 1) * stride[1] - 2 * padding[1] + kernel\_size[1] + output\_padding[1]\) \(W_{out} = (W_{in} - 1) * stride[2] - 2 * padding[2] + kernel\_size[2] + output\_padding[2]\)
Variables:
  • weight (Tensor) – the learnable weights of the module of shape (in_channels, out_channels, kernel_size[0], kernel_size[1], kernel_size[2])
  • bias (Tensor) – the learnable bias of the module of shape (out_channels)

Examples:

>>> # With square kernels and equal stride
>>> m = nn.ConvTranspose3d(16, 33, 3, stride=2)
>>> # non-square kernels and unequal stride and with padding
>>> m = nn.ConvTranspose3d(16, 33, (3, 5, 2), stride=(2, 1, 1), padding=(0, 4, 2))
>>> input = autograd.Variable(torch.randn(20, 16, 10, 50, 100))
>>> output = m(input)

Pooling Layers

MaxPool1d

class torch.nn.MaxPool1d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)[source]

Applies a 1D max pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, L)\) and output \((N, C, L_{out})\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_j, k) = \max_{{m}=0}^{{kernel\_size}-1} input(N_i, C_j, stride * k + m) \end{array}\]
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.
Parameters:
  • kernel_size – the size of the window to take a max over
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • dilation – a parameter that controls the stride of elements in the window
  • return_indices – if True, will return the max indices along with the outputs. Useful when Unpooling later
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
Shape:
  • Input: \((N, C, L_{in})\)
  • Output: \((N, C, L_{out})\) where \(L_{out} = floor((L_{in} + 2 * padding - dilation * (kernel\_size - 1) - 1) / stride + 1)\)

Examples:

>>> # pool of size=3, stride=2
>>> m = nn.MaxPool1d(3, stride=2)
>>> input = autograd.Variable(torch.randn(20, 16, 50))
>>> output = m(input)
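The max-pooling formula above can be written out directly for one channel of one sample. The function below is a plain-Python sketch with a hypothetical name; it assumes no padding and no dilation, and it also records the position of each maximum, which is the information return_indices provides.

```python
def max_pool_1d(row, kernel_size, stride=None):
    """Return (maxima, indices) for one channel of one sample, no padding or dilation."""
    if stride is None:
        stride = kernel_size  # documented default
    n_out = (len(row) - kernel_size) // stride + 1
    maxima, indices = [], []
    for k in range(n_out):
        window = row[stride * k : stride * k + kernel_size]
        # Offset of the largest value inside the current window.
        best = max(range(kernel_size), key=lambda m: window[m])
        maxima.append(window[best])
        indices.append(stride * k + best)
    return maxima, indices


values, positions = max_pool_1d([1, 2, 3, 4, 5, 6, 7, 8], kernel_size=2)
print(values, positions)  # [2, 4, 6, 8] [1, 3, 5, 7]

overlap, _ = max_pool_1d([1, 5, 2, 4, 3], kernel_size=3, stride=2)
print(overlap)  # [5, 4]
```

When stride is smaller than kernel_size the windows overlap, so one input value can contribute to more than one output.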

MaxPool2d

class torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)[source]

Applies a 2D max pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, H, W)\), output \((N, C, H_{out}, W_{out})\) and kernel_size \((kH, kW)\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_j, h, w) = \max_{{m}=0}^{kH-1} \max_{{n}=0}^{kW-1} input(N_i, C_j, stride[0] * h + m, stride[1] * w + n) \end{array}\]
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.

The parameters kernel_size, stride, padding, dilation can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
Parameters:
  • kernel_size – the size of the window to take a max over
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • dilation – a parameter that controls the stride of elements in the window
  • return_indices – if True, will return the max indices along with the outputs. Useful when Unpooling later
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
Shape:
  • Input: \((N, C, H_{in}, W_{in})\)
  • Output: \((N, C, H_{out}, W_{out})\) where \(H_{out} = floor((H_{in} + 2 * padding[0] - dilation[0] * (kernel\_size[0] - 1) - 1) / stride[0] + 1)\) \(W_{out} = floor((W_{in} + 2 * padding[1] - dilation[1] * (kernel\_size[1] - 1) - 1) / stride[1] + 1)\)

Examples:

>>> # pool of square window of size=3, stride=2
>>> m = nn.MaxPool2d(3, stride=2)
>>> # pool of non-square window
>>> m = nn.MaxPool2d((3, 2), stride=(2, 1))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 32))
>>> output = m(input)
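The effect of ceil_mode on the output shape can be sketched by swapping the rounding function in the shape formula. The helper name is hypothetical, and an actual implementation may apply additional boundary checks in ceil mode that this sketch ignores.

```python
import math


def pool_output_length(size, kernel_size, stride=None, padding=0, dilation=1,
                       ceil_mode=False):
    if stride is None:
        stride = kernel_size  # documented default
    rounding = math.ceil if ceil_mode else math.floor
    return rounding(
        (size + 2 * padding - dilation * (kernel_size - 1) - 1) / stride + 1
    )


# Window (3, 2) with stride (2, 1) on a 50 x 32 input, as in the example.
floor_size = (pool_output_length(50, 3, 2), pool_output_length(32, 2, 1))
ceil_size = (
    pool_output_length(50, 3, 2, ceil_mode=True),
    pool_output_length(32, 2, 1, ceil_mode=True),
)
print(floor_size)  # (24, 31)
print(ceil_size)   # (25, 31)
```

The two modes differ only when the window positions do not tile the input exactly, as in the height dimension here.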

MaxPool3d

class torch.nn.MaxPool3d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)[source]

Applies a 3D max pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, D, H, W)\), output \((N, C, D_{out}, H_{out}, W_{out})\) and kernel_size \((kD, kH, kW)\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_j, d, h, w) = \max_{{k}=0}^{kD-1} \max_{{m}=0}^{kH-1} \max_{{n}=0}^{kW-1} input(N_i, C_j, stride[0] * d + k, stride[1] * h + m, stride[2] * w + n) \end{array}\]
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points
dilation controls the spacing between the kernel points. It is harder to describe, but this link has a nice visualization of what dilation does.

The parameters kernel_size, stride, padding, dilation can either be:

  • a single int – in which case the same value is used for the depth, height and width dimensions
  • a tuple of three ints – in which case, the first int is used for the depth dimension, the second int for the height dimension and the third int for the width dimension
Parameters:
  • kernel_size – the size of the window to take a max over
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • dilation – a parameter that controls the stride of elements in the window
  • return_indices – if True, will return the max indices along with the outputs. Useful when Unpooling later
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
Shape:
  • Input: \((N, C, D_{in}, H_{in}, W_{in})\)
  • Output: \((N, C, D_{out}, H_{out}, W_{out})\) where \(D_{out} = floor((D_{in} + 2 * padding[0] - dilation[0] * (kernel\_size[0] - 1) - 1) / stride[0] + 1)\) \(H_{out} = floor((H_{in} + 2 * padding[1] - dilation[1] * (kernel\_size[1] - 1) - 1) / stride[1] + 1)\) \(W_{out} = floor((W_{in} + 2 * padding[2] - dilation[2] * (kernel\_size[2] - 1) - 1) / stride[2] + 1)\)

Examples:

>>> # pool of square window of size=3, stride=2
>>> m = nn.MaxPool3d(3, stride=2)
>>> # pool of non-square window
>>> m = nn.MaxPool3d((3, 2, 2), stride=(2, 1, 2))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 44, 31))
>>> output = m(input)

MaxUnpool1d

class torch.nn.MaxUnpool1d(kernel_size, stride=None, padding=0)[source]

Computes a partial inverse of MaxPool1d.

MaxPool1d is not fully invertible, since the non-maximal values are lost.

MaxUnpool1d takes in as input the output of MaxPool1d including the indices of the maximal values and computes a partial inverse in which all non-maximal values are set to zero.

Note

MaxPool1d can map several input sizes to the same output sizes. Hence, the inversion process can get ambiguous. To accommodate this, you can provide the needed output size as an additional argument output_size in the forward call. See the Inputs and Example below.

Parameters:
  • kernel_size (int or tuple) – Size of the max pooling window.
  • stride (int or tuple) – Stride of the max pooling window. It is set to kernel_size by default.
  • padding (int or tuple) – Padding that was added to the input
Inputs:
  • input: the input Tensor to invert
  • indices: the indices given out by MaxPool1d
  • output_size (optional) : a torch.Size that specifies the targeted output size
Shape:
  • Input: \((N, C, H_{in})\)
  • Output: \((N, C, H_{out})\) where \(H_{out} = (H_{in} - 1) * stride[0] - 2 * padding[0] + kernel\_size[0]\) or as given by output_size in the call operator

Example:

>>> pool = nn.MaxPool1d(2, stride=2, return_indices=True)
>>> unpool = nn.MaxUnpool1d(2, stride=2)
>>> input = Variable(torch.Tensor([[[1, 2, 3, 4, 5, 6, 7, 8]]]))
>>> output, indices = pool(input)
>>> unpool(output, indices)
Variable containing:
(0 ,.,.) =
   0   2   0   4   0   6   0   8
[torch.FloatTensor of size 1x1x8]

>>> # Example showcasing the use of output_size
>>> input = Variable(torch.Tensor([[[1, 2, 3, 4, 5, 6, 7, 8, 9]]]))
>>> output, indices = pool(input)
>>> unpool(output, indices, output_size=input.size())
Variable containing:
(0 ,.,.) =
   0   2   0   4   0   6   0   8   0
[torch.FloatTensor of size 1x1x9]

>>> unpool(output, indices)
Variable containing:
(0 ,.,.) =
   0   2   0   4   0   6   0   8
[torch.FloatTensor of size 1x1x8]
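
To make the role of the indices concrete, here is a small plain-Python sketch of 1D max pooling followed by unpooling. The functions max_pool1d and max_unpool1d are illustrative stand-ins operating on lists, not the library implementation.

```python
def max_pool1d(xs, kernel, stride):
    """Return (maxima, indices) for a 1D list, dropping any incomplete window."""
    values, indices = [], []
    for start in range(0, len(xs) - kernel + 1, stride):
        window = xs[start:start + kernel]
        best = max(range(kernel), key=lambda i: window[i])
        values.append(window[best])
        indices.append(start + best)
    return values, indices

def max_unpool1d(values, indices, output_size):
    """Scatter each maximum back to its original position; fill the rest with 0."""
    out = [0] * output_size
    for v, i in zip(values, indices):
        out[i] = v
    return out

xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
values, indices = max_pool1d(xs, kernel=2, stride=2)
print(values, indices)                   # [2, 4, 6, 8] [1, 3, 5, 7]
print(max_unpool1d(values, indices, 8))  # [0, 2, 0, 4, 0, 6, 0, 8]
print(max_unpool1d(values, indices, 9))  # [0, 2, 0, 4, 0, 6, 0, 8, 0]
```

The last element of the 9-element input never enters a window, so only an explicit output size can restore the original length.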

MaxUnpool2d

class torch.nn.MaxUnpool2d(kernel_size, stride=None, padding=0)[source]

Computes a partial inverse of MaxPool2d.

MaxPool2d is not fully invertible, since the non-maximal values are lost.

MaxUnpool2d takes in as input the output of MaxPool2d including the indices of the maximal values and computes a partial inverse in which all non-maximal values are set to zero.

Note

MaxPool2d can map several input sizes to the same output sizes. Hence, the inversion process can get ambiguous. To accommodate this, you can provide the needed output size as an additional argument output_size in the forward call. See the Inputs and Example below.

Parameters:
  • kernel_size (int or tuple) – Size of the max pooling window.
  • stride (int or tuple) – Stride of the max pooling window. It is set to kernel_size by default.
  • padding (int or tuple) – Padding that was added to the input
Inputs:
  • input: the input Tensor to invert
  • indices: the indices given out by MaxPool2d
  • output_size (optional) : a torch.Size that specifies the targeted output size
Shape:
  • Input: \((N, C, H_{in}, W_{in})\)
  • Output: \((N, C, H_{out}, W_{out})\) where \(H_{out} = (H_{in} - 1) * stride[0] -2 * padding[0] + kernel\_size[0]\) \(W_{out} = (W_{in} - 1) * stride[1] -2 * padding[1] + kernel\_size[1]\) or as given by output_size in the call operator

Example:

>>> pool = nn.MaxPool2d(2, stride=2, return_indices=True)
>>> unpool = nn.MaxUnpool2d(2, stride=2)
>>> input = Variable(torch.Tensor([[[[ 1,  2,  3,  4],
...                                  [ 5,  6,  7,  8],
...                                  [ 9, 10, 11, 12],
...                                  [13, 14, 15, 16]]]]))
>>> output, indices = pool(input)
>>> unpool(output, indices)
Variable containing:
(0 ,0 ,.,.) =
   0   0   0   0
   0   6   0   8
   0   0   0   0
   0  14   0  16
[torch.FloatTensor of size 1x1x4x4]

>>> # specify a different output size than input size
>>> unpool(output, indices, output_size=torch.Size([1, 1, 5, 5]))
Variable containing:
(0 ,0 ,.,.) =
   0   0   0   0   0
   6   0   8   0   0
   0   0   0  14   0
  16   0   0   0   0
   0   0   0   0   0
[torch.FloatTensor of size 1x1x5x5]

MaxUnpool3d

class torch.nn.MaxUnpool3d(kernel_size, stride=None, padding=0)[source]

Computes a partial inverse of MaxPool3d.

MaxPool3d is not fully invertible, since the non-maximal values are lost. MaxUnpool3d takes in as input the output of MaxPool3d including the indices of the maximal values and computes a partial inverse in which all non-maximal values are set to zero.

Note

MaxPool3d can map several input sizes to the same output sizes. Hence, the inversion process can get ambiguous. To accommodate this, you can provide the needed output size as an additional argument output_size in the forward call. See the Inputs section below.

Parameters:
  • kernel_size (int or tuple) – Size of the max pooling window.
  • stride (int or tuple) – Stride of the max pooling window. It is set to kernel_size by default.
  • padding (int or tuple) – Padding that was added to the input
Inputs:
  • input: the input Tensor to invert
  • indices: the indices given out by MaxPool3d
  • output_size (optional) : a torch.Size that specifies the targeted output size
Shape:
  • Input: \((N, C, D_{in}, H_{in}, W_{in})\)
  • Output: \((N, C, D_{out}, H_{out}, W_{out})\) where \(D_{out} = (D_{in} - 1) * stride[0] - 2 * padding[0] + kernel\_size[0]\) \(H_{out} = (H_{in} - 1) * stride[1] - 2 * padding[1] + kernel\_size[1]\) \(W_{out} = (W_{in} - 1) * stride[2] - 2 * padding[2] + kernel\_size[2]\) or as given by output_size in the call operator

Example:

>>> # pool of square window of size=3, stride=2
>>> pool = nn.MaxPool3d(3, stride=2, return_indices=True)
>>> unpool = nn.MaxUnpool3d(3, stride=2)
>>> output, indices = pool(Variable(torch.randn(20, 16, 51, 33, 15)))
>>> unpooled_output = unpool(output, indices)
>>> unpooled_output.size()
torch.Size([20, 16, 51, 33, 15])
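
The size ambiguity mentioned in the Note can be checked with the Shape formulas alone. This sketch uses two hypothetical helpers, pool_size and unpool_size, and assumes padding=0 and dilation=1 for the pooling step.

```python
def pool_size(n, kernel, stride):
    """MaxPool3d output length along one dimension (padding=0, dilation=1)."""
    return (n - kernel) // stride + 1

def unpool_size(n, kernel, stride, padding=0):
    """Default MaxUnpool3d output length along one dimension."""
    return (n - 1) * stride - 2 * padding + kernel

in_size = (51, 33, 15)
pooled = tuple(pool_size(n, 3, 2) for n in in_size)
restored = tuple(unpool_size(n, 3, 2) for n in pooled)
print(pooled)    # (25, 16, 7)
print(restored)  # (51, 33, 15)

# A depth of 52 pools to the same size as a depth of 51,
# so the default inverse cannot tell the two apart.
print(pool_size(52, 3, 2))  # 25
```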

AvgPool1d

class torch.nn.AvgPool1d(kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True)[source]

Applies a 1D average pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, L)\), output \((N, C, L_{out})\) and kernel_size \(k\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_j, l) = 1 / k * \sum_{{m}=0}^{k-1} input(N_i, C_j, stride * l + m) \end{array}\]
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points

The parameters kernel_size, stride, padding can each be an int or a one-element tuple.

Parameters:
  • kernel_size – the size of the window
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation
Shape:
  • Input: \((N, C, L_{in})\)
  • Output: \((N, C, L_{out})\) where \(L_{out} = floor((L_{in} + 2 * padding - kernel\_size) / stride + 1)\)

Examples:

>>> # pool with window of size=3, stride=2
>>> m = nn.AvgPool1d(3, stride=2)
>>> m(Variable(torch.Tensor([[[1,2,3,4,5,6,7]]])))
Variable containing:
(0 ,.,.) =
  2  4  6
[torch.FloatTensor of size 1x1x3]
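
The effect of padding and count_include_pad is easier to see on a short list. The sketch below is a plain-Python illustration, not the library code; it assumes that with count_include_pad=False the divisor is the number of window positions that fall inside the original signal.

```python
def avg_pool1d(xs, kernel, stride, padding=0, count_include_pad=True):
    """Average pooling over a list, zero-padded on both sides."""
    padded = [0.0] * padding + [float(x) for x in xs] + [0.0] * padding
    out = []
    for start in range(0, len(padded) - kernel + 1, stride):
        window = padded[start:start + kernel]
        if count_include_pad:
            divisor = kernel
        else:
            # Count only the positions that lie inside the original signal.
            lo = max(start, padding)
            hi = min(start + kernel, padding + len(xs))
            divisor = hi - lo
        out.append(sum(window) / divisor)
    return out

print(avg_pool1d([1, 2, 3, 4, 5, 6, 7], 3, 2))                             # [2.0, 4.0, 6.0]
print(avg_pool1d([1, 2, 3, 4], 2, 2, padding=1))                           # [0.5, 2.5, 2.0]
print(avg_pool1d([1, 2, 3, 4], 2, 2, padding=1, count_include_pad=False))  # [1.0, 2.5, 4.0]
```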

AvgPool2d

class torch.nn.AvgPool2d(kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True)[source]

Applies a 2D average pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, H, W)\), output \((N, C, H_{out}, W_{out})\) and kernel_size \((kH, kW)\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_j, h, w) = 1 / (kH * kW) * \sum_{{m}=0}^{kH-1} \sum_{{n}=0}^{kW-1} input(N_i, C_j, stride[0] * h + m, stride[1] * w + n) \end{array}\]
If padding is non-zero, then the input is implicitly zero-padded on both sides for padding number of points

The parameters kernel_size, stride, padding can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
Parameters:
  • kernel_size – the size of the window
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation
Shape:
  • Input: \((N, C, H_{in}, W_{in})\)
  • Output: \((N, C, H_{out}, W_{out})\) where \(H_{out} = floor((H_{in} + 2 * padding[0] - kernel\_size[0]) / stride[0] + 1)\) \(W_{out} = floor((W_{in} + 2 * padding[1] - kernel\_size[1]) / stride[1] + 1)\)

Examples:

>>> # pool of square window of size=3, stride=2
>>> m = nn.AvgPool2d(3, stride=2)
>>> # pool of non-square window
>>> m = nn.AvgPool2d((3, 2), stride=(2, 1))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 32))
>>> output = m(input)

AvgPool3d

class torch.nn.AvgPool3d(kernel_size, stride=None)[source]

Applies a 3D average pooling over an input signal composed of several input planes.

In the simplest case, the output value of the layer with input size \((N, C, D, H, W)\), output \((N, C, D_{out}, H_{out}, W_{out})\) and kernel_size \((kD, kH, kW)\) can be precisely described as:

\[\begin{array}{ll} out(N_i, C_j, d, h, w) = 1 / (kD * kH * kW) * \sum_{{k}=0}^{kD-1} \sum_{{m}=0}^{kH-1} \sum_{{n}=0}^{kW-1} input(N_i, C_j, stride[0] * d + k, stride[1] * h + m, stride[2] * w + n) \end{array}\]

The parameters kernel_size, stride can either be:

  • a single int – in which case the same value is used for the depth, height and width dimensions
  • a tuple of three ints – in which case, the first int is used for the depth dimension, the second int for the height dimension and the third int for the width dimension
Parameters:
  • kernel_size – the size of the window
  • stride – the stride of the window. Default value is kernel_size
Shape:
  • Input: \((N, C, D_{in}, H_{in}, W_{in})\)
  • Output: \((N, C, D_{out}, H_{out}, W_{out})\) where \(D_{out} = floor((D_{in} - kernel\_size[0]) / stride[0] + 1)\) \(H_{out} = floor((H_{in} - kernel\_size[1]) / stride[1] + 1)\) \(W_{out} = floor((W_{in} - kernel\_size[2]) / stride[2] + 1)\)

Examples:

>>> # pool of square window of size=3, stride=2
>>> m = nn.AvgPool3d(3, stride=2)
>>> # pool of non-square window
>>> m = nn.AvgPool3d((3, 2, 2), stride=(2, 1, 2))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 44, 31))
>>> output = m(input)

FractionalMaxPool2d

class torch.nn.FractionalMaxPool2d(kernel_size, output_size=None, output_ratio=None, return_indices=False, _random_samples=None)[source]

Applies a 2D fractional max pooling over an input signal composed of several input planes.

Fractional MaxPooling is described in detail in the paper Fractional MaxPooling by Ben Graham.

The max-pooling operation is applied in kHxkW regions by a stochastic step size determined by the target output size. The number of output features is equal to the number of input planes.

Parameters:
  • kernel_size – the size of the window to take a max over. Can be a single number k (for a square kernel of k x k) or a tuple (kh x kw)
  • output_size – the target output size of the image of the form oH x oW. Can be a tuple (oH, oW) or a single number oH for a square image oH x oH
  • output_ratio – If one wants to have an output size as a ratio of the input size, this option can be given. This has to be a number or tuple in the range (0, 1)
  • return_indices – if True, will return the indices along with the outputs. Useful to pass to nn.MaxUnpool2d . Default: False

Examples

>>> # pool of square window of size=3, and target output size 13x12
>>> m = nn.FractionalMaxPool2d(3, output_size=(13, 12))
>>> # pool of square window and target output size being half of input image size
>>> m = nn.FractionalMaxPool2d(3, output_ratio=(0.5, 0.5))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 32))
>>> output = m(input)

LPPool2d

class torch.nn.LPPool2d(norm_type, kernel_size, stride=None, ceil_mode=False)[source]

Applies a 2D power-average pooling over an input signal composed of several input planes.

On each window, the function computed is: \(f(X) = pow(sum(pow(X, p)), 1/p)\)

  • At p = infinity, one gets Max Pooling
  • At p = 1, one gets Sum Pooling (which is proportional to Average Pooling)

The parameters kernel_size, stride can either be:

  • a single int – in which case the same value is used for the height and width dimension
  • a tuple of two ints – in which case, the first int is used for the height dimension, and the second int for the width dimension
Parameters:
  • kernel_size – the size of the window
  • stride – the stride of the window. Default value is kernel_size
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
Shape:
  • Input: \((N, C, H_{in}, W_{in})\)
  • Output: \((N, C, H_{out}, W_{out})\) where \(H_{out} = floor((H_{in} - kernel\_size[0]) / stride[0] + 1)\) \(W_{out} = floor((W_{in} - kernel\_size[1]) / stride[1] + 1)\)

Examples:

>>> # power-2 pool of square window of size=3, stride=2
>>> m = nn.LPPool2d(2, 3, stride=2)
>>> # pool of non-square window of power 1.2
>>> m = nn.LPPool2d(1.2, (3, 2), stride=(2, 1))
>>> input = autograd.Variable(torch.randn(20, 16, 50, 32))
>>> output = m(input)
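
The behaviour of the window function f(X) for different values of p can be checked on a single window. This is a plain-Python sketch with a hypothetical helper lp_pool; it assumes non-negative inputs so that fractional powers are well defined. With p = 1 the formula as written yields the window sum, which equals the average up to a constant factor.

```python
def lp_pool(window, p):
    """Power-average of one pooling window: (sum of x**p) ** (1/p)."""
    return sum(float(x) ** p for x in window) ** (1.0 / p)

window = [1, 2, 3, 4]       # a flattened 2x2 window of non-negative values
print(lp_pool(window, 1))   # 10.0
print(lp_pool(window, 2))   # sqrt(30), about 5.477
print(lp_pool(window, 64))  # very close to max(window) == 4
```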

Non-linear Activations

ReLU

class torch.nn.ReLU(inplace=False)[source]

Applies the rectified linear unit function element-wise \({ReLU}(x)= max(0, x)\)

Parameters:inplace – can optionally do the operation in-place
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.ReLU()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

ReLU6

class torch.nn.ReLU6(inplace=False)[source]

Applies the element-wise function \({ReLU6}(x) = min(max(0,x), 6)\)

Parameters:inplace – can optionally do the operation in-place
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.ReLU6()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

ELU

class torch.nn.ELU(alpha=1.0, inplace=False)[source]

Applies element-wise, \(f(x) = max(0,x) + min(0, alpha * (exp(x) - 1))\)

Parameters:
  • alpha – the alpha value for the ELU formulation
  • inplace – can optionally do the operation in-place
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.ELU()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

PReLU

class torch.nn.PReLU(num_parameters=1, init=0.25)[source]

Applies element-wise the function \(PReLU(x) = max(0,x) + a * min(0,x)\) Here “a” is a learnable parameter. When called without arguments, nn.PReLU() uses a single parameter “a” across all input channels. If called with nn.PReLU(nChannels), a separate “a” is used for each input channel.

Note

For good performance, weight decay should not be used when learning “a”.

Parameters:
  • num_parameters – number of “a” to learn. Default: 1
  • init – the initial value of “a”. Default: 0.25
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.PReLU()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

LeakyReLU

class torch.nn.LeakyReLU(negative_slope=0.01, inplace=False)[source]

Applies element-wise, \(f(x) = max(0, x) + {negative\_slope} * min(0, x)\)

Parameters:
  • negative_slope – Controls the angle of the negative slope. Default: 1e-2
  • inplace – can optionally do the operation in-place
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.LeakyReLU(0.1)
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))
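
The formulas for ReLU, ReLU6, ELU and LeakyReLU can be compared side by side on scalar inputs. The functions below are a plain-Python sketch that mirrors the formulas in this section rather than the library implementation. PReLU has the same form as leaky_relu, except that the slope is a learned parameter.

```python
import math

def relu(x):
    return max(0.0, x)

def relu6(x):
    return min(max(0.0, x), 6.0)

def elu(x, alpha=1.0):
    return max(0.0, x) + min(0.0, alpha * (math.exp(x) - 1.0))

def leaky_relu(x, negative_slope=0.01):
    return max(0.0, x) + negative_slope * min(0.0, x)

# Columns: x, ReLU, ReLU6, ELU, LeakyReLU with negative_slope=0.1
for x in (-2.0, 0.5, 7.0):
    print(x, relu(x), relu6(x), round(elu(x), 4), leaky_relu(x, 0.1))
```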

Threshold

class torch.nn.Threshold(threshold, value, inplace=False)[source]

Thresholds each element of the input Tensor

Threshold is defined as:

y =  x        if x >  threshold
     value    if x <= threshold
Parameters:
  • threshold – The value to threshold at
  • value – The value to replace with
  • inplace – can optionally do the operation in-place
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Threshold(0.1, 20)
>>> input = Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

Hardtanh

class torch.nn.Hardtanh(min_value=-1, max_value=1, inplace=False)[source]

Applies the HardTanh function element-wise

HardTanh is defined as:

f(x) = +1, if x  >  1
f(x) = -1, if x  < -1
f(x) =  x,  otherwise

The range of the linear region \([-1, 1]\) can be adjusted

Parameters:
  • min_value – minimum value of the linear region range
  • max_value – maximum value of the linear region range
  • inplace – can optionally do the operation in-place
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Hardtanh(-2, 2)
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))
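
The piecewise definition of HardTanh amounts to clamping the input to the linear region. A plain-Python sketch of the scalar function, using the parameter names from the signature above:

```python
def hardtanh(x, min_value=-1.0, max_value=1.0):
    """Clamp x to the interval [min_value, max_value]."""
    return max(min_value, min(max_value, x))

print([hardtanh(x) for x in (-3.0, -0.5, 0.5, 3.0)])             # [-1.0, -0.5, 0.5, 1.0]
print([hardtanh(x, -2.0, 2.0) for x in (-3.0, -0.5, 0.5, 3.0)])  # [-2.0, -0.5, 0.5, 2.0]
```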

Sigmoid

class torch.nn.Sigmoid[source]

Applies the element-wise function \(f(x) = 1 / ( 1 + exp(-x))\)

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Sigmoid()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

Tanh

class torch.nn.Tanh[source]

Applies element-wise, \(f(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))\)

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Tanh()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

LogSigmoid

class torch.nn.LogSigmoid[source]

Applies element-wise \(LogSigmoid(x) = log( 1 / (1 + exp(-x)))\)

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.LogSigmoid()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

Softplus

class torch.nn.Softplus(beta=1, threshold=20)[source]

Applies element-wise \(f(x) = 1/beta * log(1 + exp(beta * x))\)

SoftPlus is a smooth approximation to the ReLU function and can be used to constrain the output of a machine to always be positive.

For numerical stability the implementation reverts to the linear function for inputs above a certain value.

Parameters:
  • beta – the beta value for the Softplus formulation. Default: 1
  • threshold – values above this revert to a linear function. Default: 20
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Softplus()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))
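
The switch to a linear function for large inputs avoids overflow in exp. The sketch below illustrates this on scalars in plain Python; it assumes the comparison is made on beta * x, which is one plausible reading of the threshold parameter.

```python
import math

def softplus(x, beta=1.0, threshold=20.0):
    """Smooth approximation of ReLU that falls back to x for large inputs."""
    if beta * x > threshold:
        return x  # exp(beta * x) would be huge here, and log(1 + e^z) is about z
    return math.log1p(math.exp(beta * x)) / beta

print(softplus(0.0))     # log(2), about 0.6931
print(softplus(-5.0))    # small but strictly positive
print(softplus(1000.0))  # 1000.0
```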

Softshrink

class torch.nn.Softshrink(lambd=0.5)[source]

Applies the soft shrinkage function elementwise

SoftShrinkage operator is defined as:

f(x) = x-lambda, if x > lambda
f(x) = x+lambda, if x < -lambda
f(x) = 0, otherwise
Parameters:lambd – the lambda value for the Softshrink formulation. Default: 0.5
Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Softshrink()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))
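
The three cases of the soft shrinkage operator translate directly into code. A plain-Python sketch on scalar inputs:

```python
def softshrink(x, lambd=0.5):
    """Shrink x towards zero by lambd; values within [-lambd, lambd] become 0."""
    if x > lambd:
        return x - lambd
    if x < -lambd:
        return x + lambd
    return 0.0

print([softshrink(x) for x in (-2.0, -0.25, 0.0, 0.25, 2.0)])  # [-1.5, 0.0, 0.0, 0.0, 1.5]
```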

Softsign

class torch.nn.Softsign[source]

Applies element-wise, the function \(f(x) = x / (1 + |x|)\)

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Softsign()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

Tanhshrink

class torch.nn.Tanhshrink[source]

Applies element-wise, \(Tanhshrink(x) = x - Tanh(x)\)

Shape:
  • Input: \((N, *)\) where * means, any number of additional dimensions
  • Output: \((N, *)\), same shape as the input

Examples:

>>> m = nn.Tanhshrink()
>>> input = autograd.Variable(torch.randn(2))
>>> print(input)
>>> print(m(input))

Softmin

class torch.nn.Softmin[source]

Applies the Softmin function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range (0, 1) and sum to 1.

\(f_i(x) = exp(-x_i - {shift}) / sum_j exp(-x_j - {shift})\)

where \({shift} = max_i (-x_i)\)

Shape:
  • Input: \((N, L)\)
  • Output: \((N, L)\)
Returns:a Tensor of the same dimension and shape as the input, with values in the range [0, 1]

Examples:

>>> m = nn.Softmin()
>>> input = autograd.Variable(torch.randn(2, 3))
>>> print(input)
>>> print(m(input))

Softmax

class torch.nn.Softmax[source]

Applies the Softmax function to an n-dimensional input Tensor, rescaling it so that the elements of the n-dimensional output Tensor lie in the range (0, 1) and sum to 1.

Softmax is defined as \(f_i(x) = exp(x_i - shift) / sum_j exp(x_j - shift)\) where shift = max_i x_i

Shape:
  • Input: \((N, L)\)
  • Output: \((N, L)\)
Returns:a Tensor of the same dimension and shape as the input with values in the range [0, 1]

Note

This module doesn’t work directly with NLLLoss, which expects the Log to be computed between the Softmax and itself. Use LogSoftmax instead (it’s faster).

Examples:

>>> m = nn.Softmax()
>>> input = autograd.Variable(torch.randn(2, 3))
>>> print(input)
>>> print(m(input))
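
The shift in the Softmax definition does not change the result but keeps exp from overflowing. The following plain-Python sketch works on a single row of scores; softmin is expressed here as softmax applied to the negated input, which is an equivalent formulation rather than the library code.

```python
import math

def softmax(xs):
    """Softmax of one row, shifted by its maximum for numerical stability."""
    shift = max(xs)
    exps = [math.exp(x - shift) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmin(xs):
    """Softmin is softmax of the negated scores."""
    return softmax([-x for x in xs])

probs = softmax([1.0, 2.0, 3.0])
print(probs, sum(probs))
print(softmax([1000.0, 1000.0]))  # [0.5, 0.5]; a naive exp(1000) would overflow
print(softmin([1.0, 2.0, 3.0]))   # largest weight goes to the smallest score
```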

LogSoftmax

class torch.nn.LogSoftmax[source]

Applies the Log(Softmax(x)) function to an n-dimensional input Tensor. The LogSoftmax formulation can be simplified as

\(f_i(x) = log(1 / a * exp(x_i))\) where \(a = sum_j exp(x_j)\)

Shape:
  • Input: \((N, L)\)
  • Output: \((N, L)\)
Returns:a Tensor of the same dimension and shape as the input with values in the range [-inf, 0)

Examples:

>>> m = nn.LogSoftmax()
>>> input = autograd.Variable(torch.randn(2, 3))
>>> print(input)
>>> print(m(input))
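
The simplified formulation is equivalent to subtracting the log of the sum of exponentials from each score. The sketch below computes it for one row in plain Python, with a maximum shift added for stability (an implementation choice of this sketch, not stated in the text above).

```python
import math

def log_softmax(xs):
    """log(softmax(xs)) computed as x_i - log(sum_j exp(x_j))."""
    shift = max(xs)
    log_sum = shift + math.log(sum(math.exp(x - shift) for x in xs))
    return [x - log_sum for x in xs]

scores = [2.0, 0.5, -1.0]
log_probs = log_softmax(scores)
print(log_probs)
# Negative log-likelihood of class 0 for this row.
print(-log_probs[0])
```

Every entry is non-positive, and exponentiating the entries recovers probabilities that sum to 1.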

Normalization layers

BatchNorm1d

class torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True)[source]

Applies Batch Normalization over a 2d or 3d input that is seen as a mini-batch.

\[y = \frac{x - mean[x]}{ \sqrt{Var[x] + \epsilon}} * gamma + beta\]

The mean and standard-deviation are calculated per-dimension over the mini-batches and gamma and beta are learnable parameter vectors of size C (where C is the number of features, num_features).

During training, this layer keeps a running estimate of its computed mean and variance. The running estimates are kept with a default momentum of 0.1.

During evaluation, this running mean/variance is used for normalization.

Parameters:
  • num_features – num_features from an expected input of size batch_size x num_features [x width]
  • eps – a value added to the denominator for numerical stability. Default: 1e-5
  • momentum – the value used for the running_mean and running_var computation. Default: 0.1
  • affine – a boolean value that when set to true, gives the layer learnable affine parameters.
Shape:
  • Input: \((N, C)\) or \((N, C, L)\)
  • Output: \((N, C)\) or \((N, C, L)\) (same shape as input)

Examples

>>> # With Learnable Parameters
>>> m = nn.BatchNorm1d(100)
>>> # Without Learnable Parameters
>>> m = nn.BatchNorm1d(100, affine=False)
>>> input = autograd.Variable(torch.randn(20, 100))
>>> output = m(input)
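
The normalization and the running estimates can be traced by hand for a single feature. This is a plain-Python sketch under stated assumptions: eps is added to the variance inside the square root, the batch variance is the biased estimate, and the running statistics follow an exponential moving average. The helper names are hypothetical.

```python
import math

def batch_norm_feature(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature column across the batch, then scale and shift."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
    normed = [(v - mean) / math.sqrt(var + eps) * gamma + beta for v in values]
    return normed, mean, var

def update_running(running, batch_stat, momentum=0.1):
    """Exponential moving average (assumed update rule)."""
    return (1.0 - momentum) * running + momentum * batch_stat

column = [1.0, 2.0, 3.0, 4.0]      # one feature observed over a batch of 4
normed, mean, var = batch_norm_feature(column)
print(mean, var)                   # 2.5 1.25
print(normed)                      # zero mean, roughly unit variance
print(update_running(0.0, mean))   # 0.25
```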

BatchNorm2d

class torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True)[source]

Applies Batch Normalization over a 4d input that is seen as a mini-batch of 3d inputs

\[y = \frac{x - mean[x]}{ \sqrt{Var[x] + \epsilon}} * gamma + beta\]

The mean and standard-deviation are calculated per-dimension over the mini-batches and gamma and beta are learnable parameter vectors of size C (where C is the number of features, num_features).

During training, this layer keeps a running estimate of its computed mean and variance. The running estimates are kept with a default momentum of 0.1.

During evaluation, this running mean/variance is used for normalization.

Parameters:
  • num_features – num_features from an expected input of size batch_size x num_features x height x width
  • eps – a value added to the denominator for numerical stability. Default: 1e-5
  • momentum – the value used for the running_mean and running_var computation. Default: 0.1
  • affine – a boolean value that when set to true, gives the layer learnable affine parameters.
Shape:
  • Input: \((N, C, H, W)\)
  • Output: \((N, C, H, W)\) (same shape as input)

Examples

>>> # With Learnable Parameters
>>> m = nn.BatchNorm2d(100)
>>> # Without Learnable Parameters
>>> m = nn.BatchNorm2d(100, affine=False)
>>> input = autograd.Variable(torch.randn(20, 100, 35, 45))
>>> output = m(input)

BatchNorm3d

class torch.nn.BatchNorm3d(num_features, eps=1e-05, momentum=0.1, affine=True)[source]

Applies Batch Normalization over a 5d input that is seen as a mini-batch of 4d inputs

\[y = \frac{x - mean[x]}{ \sqrt{Var[x] + \epsilon}} * gamma + beta\]

The mean and standard-deviation are calculated per-dimension over the mini-batches and gamma and beta are learnable parameter vectors of size C (where C is the number of features, num_features).

During training, this layer keeps a running estimate of its computed mean and variance. The running estimates are kept with a default momentum of 0.1.

During evaluation, this running mean/variance is used for normalization.

Parameters:
  • num_features – num_features from an expected input of size batch_size x num_features x depth x height x width
  • eps – a value added to the denominator for numerical stability. Default: 1e-5
  • momentum – the value used for the running_mean and running_var computation. Default: 0.1
  • affine – a boolean value that when set to true, gives the layer learnable affine parameters.
Shape:
  • Input: \((N, C, D, H, W)\)
  • Output: \((N, C, D, H, W)\) (same shape as input)

Examples

>>> # With Learnable Parameters
>>> m = nn.BatchNorm3d(100)
>>> # Without Learnable Parameters
>>> m = nn.BatchNorm3d(100, affine=False)
>>> input = autograd.Variable(torch.randn(20, 100, 35, 45, 10))
>>> output = m(input)

Recurrent layers

RNN

class torch.nn.RNN(*args, **kwargs)[source]

Applies a multi-layer Elman RNN with tanh or ReLU non-linearity to an input sequence.

For each element in the input sequence, each layer computes the following function:

\[h_t = \tanh(w_{ih} * x_t + b_{ih} + w_{hh} * h_{(t-1)} + b_{hh})\]

where \(h_t\) is the hidden state at time t, and \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer. If nonlinearity=’relu’, then ReLU is used instead of tanh.

Parameters:
  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • num_layers – Number of recurrent layers.
  • nonlinearity – The non-linearity to use [‘tanh’|’relu’]. Default: ‘tanh’
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature)
  • dropout – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer
  • bidirectional – If True, becomes a bidirectional RNN. Default: False
Inputs: input, h_0
  • input (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() for details.
  • h_0 (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
Outputs: output, h_n
  • output (seq_len, batch, hidden_size * num_directions): tensor containing the output features (h_k) from the last layer of the RNN, for each k. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
  • h_n (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for k=seq_len.
Variables:
  • weight_ih_l[k] – the learnable input-hidden weights of the k-th layer, of shape (input_size x hidden_size)
  • weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer, of shape (hidden_size x hidden_size)
  • bias_ih_l[k] – the learnable input-hidden bias of the k-th layer, of shape (hidden_size)
  • bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer, of shape (hidden_size)

Examples:

>>> rnn = nn.RNN(10, 20, 2)
>>> input = Variable(torch.randn(5, 3, 10))
>>> h0 = Variable(torch.randn(2, 3, 20))
>>> output, hn = rnn(input, h0)
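
The recurrence can be followed step by step on a tiny example. The sketch below implements a single-layer, unidirectional Elman RNN with the tanh non-linearity in plain Python. The weights are hypothetical toy values, and storing one row per hidden unit is a layout chosen for this sketch.

```python
import math

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def rnn_step(x_t, h_prev, w_ih, w_hh, b_ih, b_hh):
    """One Elman step: h_t = tanh(w_ih x_t + b_ih + w_hh h_prev + b_hh)."""
    from_input = matvec(w_ih, x_t)
    from_hidden = matvec(w_hh, h_prev)
    return [math.tanh(a + bi + c + bh)
            for a, bi, c, bh in zip(from_input, b_ih, from_hidden, b_hh)]

# Hypothetical toy weights: input_size=3, hidden_size=2.
w_ih = [[0.1, 0.2, 0.3], [-0.1, 0.0, 0.1]]
w_hh = [[0.5, 0.0], [0.0, 0.5]]
b_ih = [0.0, 0.0]
b_hh = [0.0, 0.0]

sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
h = [0.0, 0.0]                 # h_0
outputs = []
for x_t in sequence:
    h = rnn_step(x_t, h, w_ih, w_hh, b_ih, b_hh)
    outputs.append(h)          # output collects h_t for every t

print(outputs[0])              # [tanh(0.1), tanh(-0.1)]
print(h == outputs[-1])        # True: h_n is the last hidden state
```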

LSTM

class torch.nn.LSTM(*args, **kwargs)[source]

Applies a multi-layer long short-term memory (LSTM) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

\[\begin{split}\begin{array}{ll} i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ f_t = sigmoid(W_{if} x_t + b_{if} + W_{hf} h_{(t-1)} + b_{hf}) \\ g_t = \tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{(t-1)} + b_{hg}) \\ o_t = sigmoid(W_{io} x_t + b_{io} + W_{ho} h_{(t-1)} + b_{ho}) \\ c_t = f_t * c_{(t-1)} + i_t * g_t \\ h_t = o_t * \tanh(c_t) \end{array}\end{split}\]

where \(h_t\) is the hidden state at time t, \(c_t\) is the cell state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(i_t\), \(f_t\), \(g_t\), \(o_t\) are the input, forget, cell, and out gates, respectively.

Parameters:
  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • num_layers – Number of recurrent layers.
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature)
  • dropout – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer
  • bidirectional – If True, becomes a bidirectional RNN. Default: False
Inputs: input, (h_0, c_0)
  • input (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() for details.
  • h_0 (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
  • c_0 (num_layers * num_directions, batch, hidden_size): tensor containing the initial cell state for each element in the batch.
Outputs: output, (h_n, c_n)
  • output (seq_len, batch, hidden_size * num_directions): tensor containing the output features (h_t) from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
  • h_n (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t=seq_len
  • c_n (num_layers * num_directions, batch, hidden_size): tensor containing the cell state for t=seq_len
Variables:
  • weight_ih_l[k] – the learnable input-hidden weights of the k-th layer (W_ii|W_if|W_ig|W_io), of shape (input_size x 4*hidden_size)
  • weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer (W_hi|W_hf|W_hg|W_ho), of shape (hidden_size x 4*hidden_size)
  • bias_ih_l[k] – the learnable input-hidden bias of the k-th layer (b_ii|b_if|b_ig|b_io), of shape (4*hidden_size)
  • bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer (b_hi|b_hf|b_hg|b_ho), of shape (4*hidden_size)

Examples:

>>> rnn = nn.LSTM(10, 20, 2)
>>> input = Variable(torch.randn(5, 3, 10))
>>> h0 = Variable(torch.randn(2, 3, 20))
>>> c0 = Variable(torch.randn(2, 3, 20))
>>> output, (hn, cn) = rnn(input, (h0, c0))
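
The gate equations can be checked on a single time step. The following plain-Python sketch implements one LSTM cell update; the params dictionary and the helper names are hypothetical conveniences for this example, not the layout used by nn.LSTM.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(w, v, b):
    """Compute w @ v + b for a matrix given as a list of rows."""
    return [sum(wi * vi for wi, vi in zip(row, v)) + bi for row, bi in zip(w, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """params maps each gate name in 'ifgo' to (W_i*, b_i*, W_h*, b_h*)."""
    gates = {}
    for name in "ifgo":
        w_i, b_i, w_h, b_h = params[name]
        pre = [a + b for a, b in zip(affine(w_i, x_t, b_i), affine(w_h, h_prev, b_h))]
        act = math.tanh if name == "g" else sigmoid
        gates[name] = [act(p) for p in pre]
    c_t = [f * c + i * g
           for f, c, i, g in zip(gates["f"], c_prev, gates["i"], gates["g"])]
    h_t = [o * math.tanh(c) for o, c in zip(gates["o"], c_t)]
    return h_t, c_t

# All-zero weights: every sigmoid gate is 0.5 and g_t is 0, so c_t = 0.5 * c_prev.
zeros = ([[0.0, 0.0]], [0.0], [[0.0]], [0.0])  # input_size=2, hidden_size=1
params = {name: zeros for name in "ifgo"}
h_t, c_t = lstm_step([1.0, -1.0], [0.3], [2.0], params)
print(c_t)  # [1.0]
print(h_t)  # [0.5 * tanh(1.0)], about 0.3808
```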

GRU

class torch.nn.GRU(*args, **kwargs)[source]

Applies a multi-layer gated recurrent unit (GRU) RNN to an input sequence.

For each element in the input sequence, each layer computes the following function:

\[\begin{split}\begin{array}{ll} r_t = sigmoid(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\ i_t = sigmoid(W_{ii} x_t + b_{ii} + W_{hi} h_{(t-1)} + b_{hi}) \\ n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\ h_t = (1 - i_t) * n_t + i_t * h_{(t-1)} \\ \end{array}\end{split}\]

where \(h_t\) is the hidden state at time t, \(x_t\) is the hidden state of the previous layer at time t or \(input_t\) for the first layer, and \(r_t\), \(i_t\), \(n_t\) are the reset, input, and new gates, respectively.

Parameters:
  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • num_layers – Number of recurrent layers.
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
  • batch_first – If True, then the input and output tensors are provided as (batch, seq, feature)
  • dropout – If non-zero, introduces a dropout layer on the outputs of each RNN layer except the last layer
  • bidirectional – If True, becomes a bidirectional RNN. Default: False
Inputs: input, h_0
  • input (seq_len, batch, input_size): tensor containing the features of the input sequence. The input can also be a packed variable length sequence. See torch.nn.utils.rnn.pack_padded_sequence() for details.
  • h_0 (num_layers * num_directions, batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
Outputs: output, h_n
  • output (seq_len, batch, hidden_size * num_directions): tensor containing the output features h_t from the last layer of the RNN, for each t. If a torch.nn.utils.rnn.PackedSequence has been given as the input, the output will also be a packed sequence.
  • h_n (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t=seq_len
Variables:
  • weight_ih_l[k] – the learnable input-hidden weights of the k-th layer (W_ir|W_ii|W_in), of shape (3*hidden_size x input_size)
  • weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer (W_hr|W_hi|W_hn), of shape (3*hidden_size x hidden_size)
  • bias_ih_l[k] – the learnable input-hidden bias of the k-th layer (b_ir|b_ii|b_in), of shape (3*hidden_size)
  • bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer (b_hr|b_hi|b_hn), of shape (3*hidden_size)

Examples:

>>> rnn = nn.GRU(10, 20, 2)
>>> input = Variable(torch.randn(5, 3, 10))
>>> h0 = Variable(torch.randn(2, 3, 20))
>>> output, hn = rnn(input, h0)

RNNCell

class torch.nn.RNNCell(input_size, hidden_size, bias=True, nonlinearity='tanh')[source]

An Elman RNN cell with tanh or ReLU non-linearity.

\[h' = \tanh(w_{ih} * x + b_{ih} + w_{hh} * h + b_{hh})\]

If nonlinearity=’relu’, then ReLU is used in place of tanh.

Parameters:
  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
  • nonlinearity – The non-linearity to use [‘tanh’|’relu’]. Default: ‘tanh’
Inputs: input, hidden
  • input (batch, input_size): tensor containing input features
  • hidden (batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
Outputs: h’
  • h’ (batch, hidden_size): tensor containing the next hidden state for each element in the batch
Variables:
  • weight_ih – the learnable input-hidden weights, of shape (hidden_size x input_size)
  • weight_hh – the learnable hidden-hidden weights, of shape (hidden_size x hidden_size)
  • bias_ih – the learnable input-hidden bias, of shape (hidden_size)
  • bias_hh – the learnable hidden-hidden bias, of shape (hidden_size)

Examples:

>>> rnn = nn.RNNCell(10, 20)
>>> input = Variable(torch.randn(6, 3, 10))
>>> hx = Variable(torch.randn(3, 20))
>>> output = []
>>> for i in range(6):
...     hx = rnn(input[i], hx)
...     output.append(hx)

LSTMCell

class torch.nn.LSTMCell(input_size, hidden_size, bias=True)[source]

A long short-term memory (LSTM) cell.

\[\begin{split}\begin{array}{ll} i = sigmoid(W_{ii} x + b_{ii} + W_{hi} h + b_{hi}) \\ f = sigmoid(W_{if} x + b_{if} + W_{hf} h + b_{hf}) \\ g = \tanh(W_{ig} x + b_{ig} + W_{hg} h + b_{hg}) \\ o = sigmoid(W_{io} x + b_{io} + W_{ho} h + b_{ho}) \\ c' = f * c + i * g \\ h' = o * \tanh(c') \\ \end{array}\end{split}\]
Parameters:
  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
Inputs: input, (h_0, c_0)
  • input (batch, input_size): tensor containing input features
  • h_0 (batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
  • c_0 (batch, hidden_size): tensor containing the initial cell state for each element in the batch.
Outputs: h_1, c_1
  • h_1 (batch, hidden_size): tensor containing the next hidden state for each element in the batch
  • c_1 (batch, hidden_size): tensor containing the next cell state for each element in the batch
Variables:
  • weight_ih – the learnable input-hidden weights, of shape (4*hidden_size x input_size)
  • weight_hh – the learnable hidden-hidden weights, of shape (4*hidden_size x hidden_size)
  • bias_ih – the learnable input-hidden bias, of shape (4*hidden_size)
  • bias_hh – the learnable hidden-hidden bias, of shape (4*hidden_size)

Examples:

>>> rnn = nn.LSTMCell(10, 20)
>>> input = Variable(torch.randn(6, 3, 10))
>>> hx = Variable(torch.randn(3, 20))
>>> cx = Variable(torch.randn(3, 20))
>>> output = []
>>> for i in range(6):
...     hx, cx = rnn(input[i], (hx, cx))
...     output.append(hx)

GRUCell

class torch.nn.GRUCell(input_size, hidden_size, bias=True)[source]

A gated recurrent unit (GRU) cell

\[\begin{split}\begin{array}{ll} r = sigmoid(W_{ir} x + b_{ir} + W_{hr} h + b_{hr}) \\ i = sigmoid(W_{ii} x + b_{ii} + W_{hi} h + b_{hi}) \\ n = \tanh(W_{in} x + b_{in} + r * (W_{hn} h + b_{hn})) \\ h' = (1 - i) * n + i * h \end{array}\end{split}\]
Parameters:
  • input_size – The number of expected features in the input x
  • hidden_size – The number of features in the hidden state h
  • bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
Inputs: input, hidden
  • input (batch, input_size): tensor containing input features
  • hidden (batch, hidden_size): tensor containing the initial hidden state for each element in the batch.
Outputs: h’
  • h’ (batch, hidden_size): tensor containing the next hidden state for each element in the batch
Variables:
  • weight_ih – the learnable input-hidden weights, of shape (3*hidden_size x input_size)
  • weight_hh – the learnable hidden-hidden weights, of shape (3*hidden_size x hidden_size)
  • bias_ih – the learnable input-hidden bias, of shape (3*hidden_size)
  • bias_hh – the learnable hidden-hidden bias, of shape (3*hidden_size)

Examples:

>>> rnn = nn.GRUCell(10, 20)
>>> input = Variable(torch.randn(6, 3, 10))
>>> hx = Variable(torch.randn(3, 20))
>>> output = []
>>> for i in range(6):
...     hx = rnn(input[i], hx)
...     output.append(hx)
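
To see how the three GRU gates interact, the cell equations can be written out for scalars. This is a plain-Python sketch under the assumption input_size = hidden_size = 1; the dictionaries `w` and `b` are hypothetical containers keyed by the subscripts in the formula and are not part of the library API.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_cell_step(x, h, w, b):
    """One scalar GRU step; w and b are hypothetical dicts keyed by gate subscripts."""
    r = sigmoid(w["ir"] * x + b["ir"] + w["hr"] * h + b["hr"])          # reset gate
    i = sigmoid(w["ii"] * x + b["ii"] + w["hi"] * h + b["hi"])          # input gate
    n = math.tanh(w["in"] * x + b["in"] + r * (w["hn"] * h + b["hn"]))  # new gate
    return (1 - i) * n + i * h

keys = ["ir", "hr", "ii", "hi", "in", "hn"]
zeros = {k: 0.0 for k in keys}
# With all-zero weights both sigmoid gates are 0.5 and n is 0, so h' = 0.5 * h.
h_next = gru_cell_step(x=1.0, h=2.0, w=zeros, b=zeros)
print(h_next)  # 1.0
```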

Linear layers

Linear

class torch.nn.Linear(in_features, out_features, bias=True)[source]

Applies a linear transformation to the incoming data: \(y = Ax + b\)

Parameters:
  • in_features – size of each input sample
  • out_features – size of each output sample
  • bias – If set to False, the layer will not learn an additive bias. Default: True
Shape:
  • Input: \((N, in\_features)\)
  • Output: \((N, out\_features)\)
Variables:
  • weight – the learnable weights of the module of shape (out_features x in_features)
  • bias – the learnable bias of the module of shape (out_features)

Examples:

>>> m = nn.Linear(20, 30)
>>> input = autograd.Variable(torch.randn(128, 20))
>>> output = m(input)
>>> print(output.size())
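
The transformation y = Ax + b can be spelled out for a single sample. The sketch below uses nested lists instead of tensors and assumes the weight is stored row-wise as (out_features x in_features), as listed under Variables; the function name `linear` is only illustrative.

```python
def linear(x, weight, bias=None):
    """Apply y = Ax + b to one sample x (a list of in_features numbers)."""
    out = []
    for row_idx, row in enumerate(weight):
        acc = sum(w * v for w, v in zip(row, x))  # dot product of one weight row with x
        if bias is not None:
            acc += bias[row_idx]
        out.append(acc)
    return out

weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # out_features = 3, in_features = 2
bias = [0.5, 0.5, 0.5]
y = linear([1.0, 1.0], weight, bias)
print(y)  # [3.5, 7.5, 11.5]
```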

Dropout layers

Dropout

class torch.nn.Dropout(p=0.5, inplace=False)[source]

Randomly zeroes some of the elements of the input tensor. The elements to zero are randomized on every forward call.

Parameters:
  • p – probability of an element to be zeroed. Default: 0.5
  • inplace – If set to True, will do this operation in-place. Default: False
Shape:
  • Input: Any. Input can be of any shape
  • Output: Same. Output is of the same shape as input

Examples:

>>> m = nn.Dropout(p=0.2)
>>> input = autograd.Variable(torch.randn(20, 16))
>>> output = m(input)

Dropout2d

class torch.nn.Dropout2d(p=0.5, inplace=False)[source]

Randomly zeroes whole channels of the input tensor. The channels to zero-out are randomized on every forward call.

Usually the input comes from Conv2d modules.

As described in the paper Efficient Object Localization Using Convolutional Networks, if adjacent pixels within feature maps are strongly correlated (as is normally the case in early convolution layers), then i.i.d. dropout will not regularize the activations and will otherwise just result in an effective learning rate decrease.

In this case, nn.Dropout2d() will help promote independence between feature maps and should be used instead.

Parameters:
  • p (float, optional) – probability of a channel to be zeroed. Default: 0.5
  • inplace (bool, optional) – If set to True, will do this operation in-place
Shape:
  • Input: \((N, C, H, W)\)
  • Output: \((N, C, H, W)\) (same shape as input)

Examples:

>>> m = nn.Dropout2d(p=0.2)
>>> input = autograd.Variable(torch.randn(20, 16, 32, 32))
>>> output = m(input)
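
The difference from element-wise dropout is that the random decision is made once per channel. The following plain-Python sketch illustrates only that masking pattern for one sample; it uses nested lists, a hypothetical helper name `channel_dropout`, and does not rescale the surviving channels, so it should not be read as the module's exact behavior.

```python
import random

def channel_dropout(feature_maps, p, rng):
    """feature_maps: list of channels, each a list of rows. Returns a new nested list."""
    out = []
    for channel in feature_maps:
        if rng.random() < p:
            # The whole feature map is zeroed together.
            out.append([[0.0] * len(row) for row in channel])
        else:
            out.append([list(row) for row in channel])
    return out

rng = random.Random(0)
maps = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(8)]  # 8 channels of size 2x2
dropped = channel_dropout(maps, 0.5, rng)
for channel in dropped:
    print(channel)
```

Every printed channel is either the original 2x2 map or all zeros; no channel is partially zeroed.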

Dropout3d

class torch.nn.Dropout3d(p=0.5, inplace=False)[source]

Randomly zeroes whole channels of the input tensor. The channels to zero are randomized on every forward call.

Usually the input comes from Conv3d modules.

As described in the paper Efficient Object Localization Using Convolutional Networks, if adjacent pixels within feature maps are strongly correlated (as is normally the case in early convolution layers), then i.i.d. dropout will not regularize the activations and will otherwise just result in an effective learning rate decrease.

In this case, nn.Dropout3d() will help promote independence between feature maps and should be used instead.

Parameters:
  • p (float, optional) – probability of a channel to be zeroed. Default: 0.5
  • inplace (bool, optional) – If set to True, will do this operation in-place
Shape:
  • Input: \((N, C, D, H, W)\)
  • Output: \((N, C, D, H, W)\) (same shape as input)

Examples:

>>> m = nn.Dropout3d(p=0.2)
>>> input = autograd.Variable(torch.randn(20, 16, 4, 32, 32))
>>> output = m(input)

Sparse layers

Embedding

class torch.nn.Embedding(num_embeddings, embedding_dim, padding_idx=None, max_norm=None, norm_type=2, scale_grad_by_freq=False, sparse=False)[source]

A simple lookup table that stores embeddings of a fixed dictionary and size.

This module is often used to store word embeddings and retrieve them using indices. The input to the module is a list of indices, and the output is the corresponding word embeddings.

Parameters:
  • num_embeddings (int) – size of the dictionary of embeddings
  • embedding_dim (int) – the size of each embedding vector
  • padding_idx (int, optional) – If given, pads the output with zeros whenever it encounters the index.
  • max_norm (float, optional) – If given, will renormalize the embeddings to always have a norm less than this
  • norm_type (float, optional) – The p of the p-norm to compute for the max_norm option
  • scale_grad_by_freq (boolean, optional) – if given, this will scale gradients by the frequency of the words in the dictionary.
Variables:

weight (Tensor) – the learnable weights of the module of shape (num_embeddings, embedding_dim)

Shape:
  • Input: LongTensor (N, W), N = mini-batch, W = number of indices to extract per mini-batch
  • Output: (N, W, embedding_dim)

Examples:

>>> # an Embedding module containing 10 tensors of size 3
>>> embedding = nn.Embedding(10, 3)
>>> # a batch of 2 samples of 4 indices each
>>> input = Variable(torch.LongTensor([[1,2,4,5],[4,3,2,9]]))
>>> embedding(input)

Variable containing:
(0 ,.,.) =
 -1.0822  1.2522  0.2434
  0.8393 -0.6062 -0.3348
  0.6597  0.0350  0.0837
  0.5521  0.9447  0.0498

(1 ,.,.) =
  0.6597  0.0350  0.0837
 -0.1527  0.0877  0.4260
  0.8393 -0.6062 -0.3348
 -0.8738 -0.9054  0.4281
[torch.FloatTensor of size 2x4x3]

>>> # example with padding_idx
>>> embedding = nn.Embedding(10, 3, padding_idx=0)
>>> input = Variable(torch.LongTensor([[0,2,0,5]]))
>>> embedding(input)

Variable containing:
(0 ,.,.) =
  0.0000  0.0000  0.0000
  0.3452  0.4937 -0.9361
  0.0000  0.0000  0.0000
  0.0706 -2.1962 -0.6276
[torch.FloatTensor of size 1x4x3]

Loss functions

L1Loss

class torch.nn.L1Loss(size_average=True)[source]

Creates a criterion that measures the mean absolute value of the element-wise difference between input x and target y:

\({loss}(x, y) = 1/n \sum |x_i - y_i|\)

x and y arbitrary shapes with a total of n elements each.

The sum operation still operates over all the elements, and divides by n.

The division by n can be avoided if one sets the constructor argument size_average=False.

MSELoss

class torch.nn.MSELoss(size_average=True)[source]

Creates a criterion that measures the mean squared error between n elements in the input x and target y:

\({loss}(x, y) = 1/n \sum |x_i - y_i|^2\)

x and y arbitrary shapes with a total of n elements each.

The sum operation still operates over all the elements, and divides by n.

The division by n can be avoided if one sets the constructor argument size_average to False.
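
The L1Loss and MSELoss formulas differ only in the per-element term. A minimal plain-Python sketch over flat lists, with `size_average` mirroring the constructor argument (the function names are illustrative, not library functions):

```python
def l1_loss(x, y, size_average=True):
    total = sum(abs(a - b) for a, b in zip(x, y))
    return total / len(x) if size_average else total

def mse_loss(x, y, size_average=True):
    total = sum((a - b) ** 2 for a, b in zip(x, y))
    return total / len(x) if size_average else total

x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 0.0]
print(l1_loss(x, y, size_average=False))   # 5.0  (0 + 2 + 3)
print(mse_loss(x, y, size_average=False))  # 13.0 (0 + 4 + 9)
print(l1_loss(x, y), mse_loss(x, y))       # the same sums divided by n = 3
```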

CrossEntropyLoss

class torch.nn.CrossEntropyLoss(weight=None, size_average=True)[source]

This criterion combines LogSoftmax and NLLLoss in one single class.

It is useful when training a classification problem with n classes. If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.

The input is expected to contain scores for each class.

input has to be a 2D Tensor of size batch x n.

This criterion expects a class index (0 to nClasses-1) as the target for each value of a 1D tensor of size batch.

The loss can be described as:

loss(x, class) = -log(exp(x[class]) / (\sum_j exp(x[j])))
               = -x[class] + log(\sum_j exp(x[j]))

or in the case of the weights argument being specified:

loss(x, class) = weights[class] * (-x[class] + log(\sum_j exp(x[j])))

The losses are averaged across observations for each minibatch.

Shape:
  • Input: \((N, C)\) where C = number of classes
  • Target: \((N)\) where each value is 0 <= targets[i] <= C-1
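
The per-sample formula, and the claim that this criterion combines a log-softmax with the negative log likelihood, can be checked numerically. The sketch below works on one sample given as a plain list of scores; `cross_entropy` and `log_softmax` are illustrative helper names, and batching and averaging are left out.

```python
import math

def log_softmax(x):
    log_sum_exp = math.log(sum(math.exp(v) for v in x))
    return [v - log_sum_exp for v in x]

def cross_entropy(x, target, weights=None):
    """loss(x, class) = -x[class] + log(sum_j exp(x[j])), optionally scaled by weights[class]."""
    loss = -x[target] + math.log(sum(math.exp(v) for v in x))
    if weights is not None:
        loss *= weights[target]
    return loss

scores = [1.0, 2.0, 3.0]
ce = cross_entropy(scores, 2)
nll = -log_softmax(scores)[2]  # NLL applied to log-probabilities
print(ce, nll)                 # the two values agree up to rounding
```

This sketch does not guard against overflow in `exp` for large scores; a numerically careful version would subtract the maximum score first.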

NLLLoss

class torch.nn.NLLLoss(weight=None, size_average=True)[source]

The negative log likelihood loss. It is useful to train a classification problem with n classes.

If provided, the optional argument weight should be a 1D Tensor assigning weight to each of the classes.

This is particularly useful when you have an unbalanced training set.

The input given through a forward call is expected to contain log-probabilities of each class: input has to be a 2D Tensor of size (minibatch, n)

Obtaining log-probabilities in a neural network is easily achieved by adding a LogSoftmax layer in the last layer of your network.

You may use CrossEntropyLoss instead, if you prefer not to add an extra layer.

The target that this loss expects is a class index (0 to N-1, where N = number of classes)

The loss can be described as:

loss(x, class) = -x[class]

or in the case of the weights argument it is specified as follows:

loss(x, class) = -weights[class] * x[class]
Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a Tensor of size “nclasses”
  • size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch.
Shape:
  • Input: \((N, C)\) where C = number of classes
  • Target: \((N)\) where each value is 0 <= targets[i] <= C-1
Variables:weight – the class-weights given as input to the constructor

Examples:

>>> m = nn.LogSoftmax()
>>> loss = nn.NLLLoss()
>>> # input is of size nBatch x nClasses = 3 x 5
>>> input = autograd.Variable(torch.randn(3, 5), requires_grad=True)
>>> # each element in target has to have 0 <= value < nclasses
>>> target = autograd.Variable(torch.LongTensor([1, 0, 4]))
>>> output = loss(m(input), target)
>>> output.backward()

NLLLoss2d

class torch.nn.NLLLoss2d(weight=None, size_average=True)[source]

This is the negative log likelihood loss, but for image inputs. It computes the NLL loss per pixel.

This loss does not support per-class weights

Parameters:
  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a 1D Tensor having as many elements, as there are classes.
  • size_average – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch. Default: True
Shape:
  • Input: \((N, C, H, W)\) where C = number of classes
  • Target: \((N, H, W)\) where each value is 0 <= targets[i] <= C-1

Examples

>>> m = nn.Conv2d(16, 32, (3, 3)).float()
>>> loss = nn.NLLLoss2d()
>>> # input is of size nBatch x nClasses x height x width
>>> input = autograd.Variable(torch.randn(3, 16, 10, 10))
>>> # each element in target has to have 0 <= value < nclasses
>>> target = autograd.Variable(torch.LongTensor(3, 8, 8).random_(0, 4))
>>> output = loss(m(input), target)
>>> output.backward()

KLDivLoss

class torch.nn.KLDivLoss(weight=None, size_average=True)[source]

The Kullback-Leibler divergence Loss

KL divergence is a useful distance measure for continuous distributions and is often useful when performing direct regression over the space of (discretely sampled) continuous output distributions.

As with NLLLoss, the input given is expected to contain log-probabilities. Unlike NLLLoss, however, the input is not restricted to a 2D Tensor, because the criterion is applied element-wise.

This criterion expects a target Tensor of the same size as the input Tensor.

The loss can be described as:

\[loss(x, target) = 1/n \sum(target_i * (log(target_i) - x_i))\]

By default, the losses are averaged for each minibatch over observations as well as over dimensions. However, if size_average is set to False, the losses are instead summed.
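
A short numeric check makes the roles of the two arguments concrete: the input holds log-probabilities while the target holds probabilities. The plain-Python sketch below follows the formula above on flat lists; treating terms with target_i == 0 as contributing zero is an assumption of this sketch.

```python
import math

def kl_div_loss(x, target, size_average=True):
    """x: log-probabilities, target: probabilities, both flat lists of equal length."""
    total = 0.0
    for x_i, t_i in zip(x, target):
        if t_i > 0:  # assumption: a zero target contributes nothing
            total += t_i * (math.log(t_i) - x_i)
    return total / len(x) if size_average else total

target = [0.25, 0.25, 0.5]
same = kl_div_loss([math.log(t) for t in target], target)
uniform = kl_div_loss([math.log(1.0 / 3)] * 3, target, size_average=False)
print(same)     # 0.0, the input matches the target distribution
print(uniform)  # positive, the input is a uniform distribution
```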

BCELoss

class torch.nn.BCELoss(weight=None, size_average=True)[source]

Creates a criterion that measures the Binary Cross Entropy between the target and the output:

\[loss(o, t) = -1/n \sum_i (t[i] * log(o[i]) + (1 - t[i]) * log(1 - o[i]))\]

or in the case of the weights argument being specified:

\[loss(o, t) = -1/n \sum_i weights[i] * (t[i] * log(o[i]) + (1 - t[i]) * log(1 - o[i]))\]

This is used for measuring the error of a reconstruction in for example an auto-encoder. Note that the targets t[i] should be numbers between 0 and 1.

By default, the losses are averaged for each minibatch over observations as well as over dimensions. However, if size_average is set to False, the losses are instead summed.
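
The binary cross entropy formula can be evaluated directly on small lists. This is a plain-Python sketch with an illustrative function name; it assumes every output o[i] lies strictly between 0 and 1, since log(0) is undefined.

```python
import math

def bce_loss(o, t, weights=None, size_average=True):
    """o: predicted probabilities in (0, 1), t: targets in [0, 1]."""
    total = 0.0
    for idx, (o_i, t_i) in enumerate(zip(o, t)):
        term = t_i * math.log(o_i) + (1 - t_i) * math.log(1 - o_i)
        if weights is not None:
            term *= weights[idx]
        total -= term
    return total / len(o) if size_average else total

# A maximally uncertain prediction costs log(2) per element.
print(bce_loss([0.5, 0.5], [1.0, 0.0]))
# Confident, correct predictions cost much less.
print(bce_loss([0.9, 0.1], [1.0, 0.0]))
```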

MarginRankingLoss

class torch.nn.MarginRankingLoss(margin=0, size_average=True)[source]

Creates a criterion that measures the loss given inputs x1, x2, two 1D mini-batch Tensors, and a label 1D mini-batch tensor y with values (1 or -1).

If y == 1 then it is assumed the first input should be ranked higher (have a larger value) than the second input, and vice-versa for y == -1.

The loss function for each sample in the mini-batch is:

loss(x, y) = max(0, -y * (x1 - x2) + margin)

If size_average is True, the loss function averages the loss over the batch samples; if size_average is False, the loss function sums over the batch samples. By default, size_average is True.
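
A tiny example shows how the label y flips which input is expected to be larger. The sketch below is plain Python over lists, with an illustrative function name and the same default margin of 0:

```python
def margin_ranking_loss(x1, x2, y, margin=0.0, size_average=True):
    losses = [max(0.0, -label * (a - b) + margin) for a, b, label in zip(x1, x2, y)]
    total = sum(losses)
    return total / len(losses) if size_average else total

x1 = [1.0, 0.0]
x2 = [0.0, 1.0]
# y = 1 for both pairs: the first pair is ranked correctly, the second is not.
print(margin_ranking_loss(x1, x2, [1, 1]))   # 0.5
# y = -1 for the second pair says x2 should be larger there, so nothing is penalized.
print(margin_ranking_loss(x1, x2, [1, -1]))  # 0.0
```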

HingeEmbeddingLoss

class torch.nn.HingeEmbeddingLoss(size_average=True)[source]

Measures the loss given an input x which is a 2D mini-batch tensor and labels y, a 1D tensor containing values (1 or -1). This is usually used for measuring whether two inputs are similar or dissimilar, e.g. using the L1 pairwise distance, and is typically used for learning nonlinear embeddings or semi-supervised learning:

                 { x_i,                  if y_i ==  1
loss(x, y) = 1/n {
                 { max(0, margin - x_i), if y_i == -1

x and y can have arbitrary shapes with a total of n elements each. The sum operation still operates over all the elements, and divides by n.

The division by n can be avoided if one sets the constructor argument size_average=False.

The margin has a default value of 1, or can be set in the constructor.

MultiLabelMarginLoss

class torch.nn.MultiLabelMarginLoss(size_average=True)[source]

Creates a criterion that optimizes a multi-class multi-classification hinge loss (margin-based loss) between input x (a 2D mini-batch Tensor) and output y (which is a 2D Tensor of target class indices). For each sample in the mini-batch:

loss(x, y) = sum_ij(max(0, 1 - (x[y[j]] - x[i]))) / x.size(0)

where i == 0 to x.size(0), j == 0 to y.size(0), y[j] != 0, and i != y[j] for all i and j.

y and x must have the same size.

The criterion only considers the first non zero y[j] targets.

This allows for different samples to have variable amounts of target classes

SmoothL1Loss

class torch.nn.SmoothL1Loss(size_average=True)[source]

Creates a criterion that uses a squared term if the absolute element-wise error falls below 1 and an L1 term otherwise. It is less sensitive to outliers than the MSELoss and in some cases prevents exploding gradients (e.g. see “Fast R-CNN” paper by Ross Girshick). Also known as the Huber loss:

                      { 0.5 * (x_i - y_i)^2, if |x_i - y_i| < 1
loss(x, y) = 1/n \sum {
                      { |x_i - y_i| - 0.5,   otherwise

x and y can have arbitrary shapes with a total of n elements each. The sum operation still operates over all the elements, and divides by n.

The division by n can be avoided if one sets the constructor argument size_average to False.
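
The piecewise definition is easiest to follow with one small and one large error. A plain-Python sketch over flat lists (the function name is illustrative):

```python
def smooth_l1_loss(x, y, size_average=True):
    total = 0.0
    for a, b in zip(x, y):
        diff = abs(a - b)
        # Quadratic near zero, linear for errors of 1 or more.
        total += 0.5 * diff ** 2 if diff < 1 else diff - 0.5
    return total / len(x) if size_average else total

x = [0.5, 3.0]
y = [0.0, 0.0]
# Element 0 uses the squared branch (0.125), element 1 the linear branch (2.5).
print(smooth_l1_loss(x, y, size_average=False))  # 2.625
print(smooth_l1_loss(x, y))                      # 1.3125
```

The two branches meet at |x_i - y_i| = 1, where both evaluate to 0.5, so the loss is continuous.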

SoftMarginLoss

class torch.nn.SoftMarginLoss(size_average=True)[source]

Creates a criterion that optimizes a two-class classification logistic loss between input x (a 2D mini-batch Tensor) and target y (which is a tensor containing either 1 or -1).

loss(x, y) = sum_i (log(1 + exp(-y[i]*x[i]))) / x.nelement()

The normalization by the number of elements in the input can be disabled by setting size_average to False.

MultiLabelSoftMarginLoss

class torch.nn.MultiLabelSoftMarginLoss(weight=None, size_average=True)[source]

Creates a criterion that optimizes a multi-label one-versus-all loss based on max-entropy, between input x (a 2D mini-batch Tensor) and target y (a binary 2D Tensor). For each sample in the minibatch:

loss(x, y) = - sum_i (y[i] log( exp(x[i]) / (1 + exp(x[i])))
                      + (1-y[i]) log(1/(1+exp(x[i])))) / x.nelement()

where i == 0 to x.nelement()-1, y[i] in {0,1}. y and x must have the same size.

CosineEmbeddingLoss

class torch.nn.CosineEmbeddingLoss(margin=0, size_average=True)[source]

Creates a criterion that measures the loss given input tensors x1, x2 and a Tensor label y with values 1 or -1. This is used for measuring whether two inputs are similar or dissimilar, using the cosine distance, and is typically used for learning nonlinear embeddings or semi-supervised learning.

margin should be a number from -1 to 1, 0 to 0.5 is suggested. If margin is missing, the default value is 0.

The loss function for each sample is:

             { 1 - cos(x1, x2),              if y ==  1
loss(x, y) = {
             { max(0, cos(x1, x2) - margin), if y == -1

If size_average is True, the loss function averages the loss over the batch samples; if size_average is False, the loss function sums over the batch samples. By default, size_average = True.
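
The per-sample formula can be tried on a few hand-picked vectors. This plain-Python sketch handles one pair at a time, assumes neither vector is all zeros, and uses an illustrative function name:

```python
import math

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    cos = dot / (norm1 * norm2)
    if y == 1:
        return 1.0 - cos             # similar pairs are pulled toward cos = 1
    return max(0.0, cos - margin)    # dissimilar pairs are penalized above the margin

print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], 1))               # 0.0
print(cosine_embedding_loss([1.0, 0.0], [0.0, 1.0], 1))               # 1.0
print(cosine_embedding_loss([1.0, 0.0], [1.0, 0.0], -1, margin=0.5))  # 0.5
```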

MultiMarginLoss

class torch.nn.MultiMarginLoss(p=1, margin=1, weight=None, size_average=True)[source]

Creates a criterion that optimizes a multi-class classification hinge loss (margin-based loss) between input x (a 2D mini-batch Tensor) and output y (which is a 1D tensor of target class indices, 0 <= y <= x.size(1) - 1):

For each mini-batch sample:

loss(x, y) = sum_i(max(0, (margin - x[y] + x[i]))^p) / x.size(0)
             where `i == 0` to `x.size(0)` and `i != y`.

Optionally, you can give non-equal weighting on the classes by passing a 1D weights tensor into the constructor.

The loss function then becomes:

loss(x, y) = sum_i(max(0, w[y] * (margin - x[y] + x[i]))^p) / x.size(0)

By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed.
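
For a single sample, the hinge terms can be computed by hand. The plain-Python sketch below follows the per-sample formula with the sign convention margin - x[y] + x[i] in both the weighted and unweighted case; the function name is illustrative and averaging over the mini-batch is left out.

```python
def multi_margin_loss(x, y, p=1, margin=1.0, weights=None):
    """x: list of class scores for one sample, y: index of the target class."""
    w = 1.0 if weights is None else weights[y]
    total = 0.0
    for i, x_i in enumerate(x):
        if i == y:
            continue  # the target class is not compared with itself
        total += max(0.0, w * (margin - x[y] + x_i)) ** p
    return total / len(x)

scores = [0.1, 0.2, 0.4, 0.8]
# Hinge terms for the three non-target classes: 0.3, 0.4 and 0.6.
print(multi_margin_loss(scores, 3))  # approximately 0.325
```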

Vision layers

PixelShuffle

class torch.nn.PixelShuffle(upscale_factor)[source]

Rearranges elements in a Tensor of shape \((*, C * r^2, H, W)\) to a tensor of shape \((*, C, H * r, W * r)\).

This is useful for implementing efficient sub-pixel convolution with a stride of \(1/r\).

Look at the paper: Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network by Shi et al. (2016) for more details.

Parameters:upscale_factor (int) – factor to increase spatial resolution by
Shape:
  • Input: \((N, C * {upscale\_factor}^2, H, W)\)
  • Output: \((N, C, H * {upscale\_factor}, W * {upscale\_factor})\)

Examples:

>>> ps = nn.PixelShuffle(3)
>>> input = autograd.Variable(torch.Tensor(1, 9, 4, 4))
>>> output = ps(input)
>>> print(output.size())
torch.Size([1, 1, 12, 12])
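
The rearrangement itself can be sketched with nested lists for one sample. The index mapping used here, where input channel c * r^2 + i * r + j fills the output positions (h * r + i, w * r + j), is the usual sub-pixel layout and is stated as an assumption of this sketch rather than taken from the library source; `pixel_shuffle` is an illustrative name.

```python
def pixel_shuffle(channels, r):
    """channels: nested list of shape [C * r * r][H][W] for a single sample."""
    c_out = len(channels) // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = channels[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out

# Four 1x1 input channels and r = 2 give one 2x2 output channel.
small = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)
print(small)  # [[[0, 1], [2, 3]]]

# Same sizes as the example above: 9 channels of 4x4 with r = 3.
big = pixel_shuffle([[[k] * 4 for _ in range(4)] for k in range(9)], 3)
print(len(big), len(big[0]), len(big[0][0]))  # 1 12 12
```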

UpsamplingNearest2d

class torch.nn.UpsamplingNearest2d(size=None, scale_factor=None)[source]

Applies a 2D nearest neighbor upsampling to an input signal composed of several input channels.

To specify the scale, it takes either the size or the scale_factor as its constructor argument.

When size is given, it is the output size of the image (h, w).

Parameters:
  • size (tuple, optional) – a tuple of ints (H_out, W_out) output sizes
  • scale_factor (int, optional) – the multiplier for the image height / width
Shape:
  • Input: \((N, C, H_{in}, W_{in})\)
  • Output: \((N, C, H_{out}, W_{out})\) where \(H_{out} = floor(H_{in} * scale\_factor)\) \(W_{out} = floor(W_{in} * scale\_factor)\)

Examples:

>>> inp
Variable containing:
(0 ,0 ,.,.) =
  1  2
  3  4
[torch.FloatTensor of size 1x1x2x2]

>>> m = nn.UpsamplingNearest2d(scale_factor=2)
>>> m(inp)
Variable containing:
(0 ,0 ,.,.) =
  1  1  2  2
  1  1  2  2
  3  3  4  4
  3  3  4  4
[torch.FloatTensor of size 1x1x4x4]
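
The printed result can be reproduced with integer division on the output indices. This is a plain-Python sketch for one channel with an integer scale_factor; the function name is illustrative.

```python
def upsample_nearest_2d(image, scale_factor):
    """image: list of rows for a single channel."""
    h_out = len(image) * scale_factor
    w_out = len(image[0]) * scale_factor
    # Each output pixel copies the input pixel it falls inside.
    return [[image[i // scale_factor][j // scale_factor] for j in range(w_out)]
            for i in range(h_out)]

up = upsample_nearest_2d([[1, 2], [3, 4]], 2)
for row in up:
    print(row)
```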

UpsamplingBilinear2d

class torch.nn.UpsamplingBilinear2d(size=None, scale_factor=None)[source]

Applies a 2D bilinear upsampling to an input signal composed of several input channels.

To specify the scale, it takes either the size or the scale_factor as its constructor argument.

When size is given, it is the output size of the image (h, w).

Parameters:
  • size (tuple, optional) – a tuple of ints (H_out, W_out) output sizes
  • scale_factor (int, optional) – the multiplier for the image height / width
Shape:
  • Input: \((N, C, H_{in}, W_{in})\)
  • Output: \((N, C, H_{out}, W_{out})\) where \(H_{out} = floor(H_{in} * scale\_factor)\) \(W_{out} = floor(W_{in} * scale\_factor)\)

Examples:

>>> inp
Variable containing:
(0 ,0 ,.,.) =
  1  2
  3  4
[torch.FloatTensor of size 1x1x2x2]

>>> m = nn.UpsamplingBilinear2d(scale_factor=2)
>>> m(inp)
Variable containing:
(0 ,0 ,.,.) =
  1.0000  1.3333  1.6667  2.0000
  1.6667  2.0000  2.3333  2.6667
  2.3333  2.6667  3.0000  3.3333
  3.0000  3.3333  3.6667  4.0000
[torch.FloatTensor of size 1x1x4x4]
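
The numbers in this example are consistent with a corner-aligned mapping, in which the first and last output samples coincide with the first and last input samples. That reading is inferred from the printed output, not from the library source. A plain-Python sketch for one channel, with an illustrative function name:

```python
def upsample_bilinear_2d(image, scale_factor):
    """image: list of rows for a single channel; scale_factor: positive int."""
    h_in, w_in = len(image), len(image[0])
    h_out, w_out = h_in * scale_factor, w_in * scale_factor

    def source_coord(dst, n_in, n_out):
        # Corner-aligned mapping from an output index to a fractional input index.
        return dst * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0

    out = []
    for i in range(h_out):
        sy = source_coord(i, h_in, h_out)
        y0 = min(int(sy), h_in - 1)
        y1 = min(y0 + 1, h_in - 1)
        fy = sy - y0
        row = []
        for j in range(w_out):
            sx = source_coord(j, w_in, w_out)
            x0 = min(int(sx), w_in - 1)
            x1 = min(x0 + 1, w_in - 1)
            fx = sx - x0
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

result = upsample_bilinear_2d([[1.0, 2.0], [3.0, 4.0]], 2)
for row in result:
    print(["%.4f" % v for v in row])
```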

Multi-GPU layers

DataParallel

class torch.nn.DataParallel(module, device_ids=None, output_device=None, dim=0)[source]

Implements data parallelism at the module level.

This container parallelizes the application of the given module by splitting the input across the specified devices by chunking in the batch dimension. In the forward pass, the module is replicated on each device, and each replica handles a portion of the input. During the backwards pass, gradients from each replica are summed into the original module.

The batch size should be larger than the number of GPUs used. It should also be an integer multiple of the number of GPUs so that each chunk is the same size (so that each GPU processes the same number of samples).

See also: Use nn.DataParallel instead of multiprocessing

Arbitrary positional and keyword inputs are allowed to be passed into DataParallel EXCEPT Tensors. All variables will be scattered on dim specified (default 0). Primitive types will be broadcasted, but all other types will be a shallow copy and can be corrupted if written to in the model’s forward pass.

Parameters:
  • module – module to be parallelized
  • device_ids – CUDA devices (default: all devices)
  • output_device – device location of output (default: device_ids[0])

Example:

>>> net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
>>> output = net(input_var)

Utilities

clip_grad_norm

torch.nn.utils.clip_grad_norm(parameters, max_norm, norm_type=2)[source]

Clips gradient norm of an iterable of parameters.

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.

Parameters:
  • parameters (Iterable[Variable]) – an iterable of Variables that will have gradients normalized
  • max_norm (float or int) – max norm of the gradients
  • norm_type (float or int) – type of the used p-norm. Can be 'inf' for infinity norm.
Returns:

Total norm of the parameters (viewed as a single vector).
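
The phrase "as if they were concatenated into a single vector" means one norm is computed over all gradients and one common scale factor is applied. The plain-Python sketch below illustrates this on lists of numbers; the function name and the small epsilon in the denominator are assumptions of the sketch.

```python
def clip_grad_norm_sketch(grads, max_norm, norm_type=2):
    """grads: list of flat gradient lists, modified in place. Returns the norm before clipping."""
    flat = [g for grad in grads for g in grad]
    if norm_type == float("inf"):
        total_norm = max(abs(g) for g in flat)
    else:
        total_norm = sum(abs(g) ** norm_type for g in flat) ** (1.0 / norm_type)
    clip_coef = max_norm / (total_norm + 1e-6)  # epsilon (assumed) avoids division by zero
    if clip_coef < 1:
        for grad in grads:
            for k in range(len(grad)):
                grad[k] *= clip_coef
    return total_norm

grads = [[3.0], [4.0]]  # two "parameters" with one gradient entry each
total = clip_grad_norm_sketch(grads, max_norm=1.0)
new_norm = sum(g * g for grad in grads for g in grad) ** 0.5
print(total)     # 5.0, the joint 2-norm of (3, 4)
print(new_norm)  # just under 1.0 after rescaling
```

Because every gradient is multiplied by the same coefficient, the direction of the concatenated gradient vector is preserved.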

PackedSequence

torch.nn.utils.rnn.PackedSequence(_cls, data, batch_sizes)[source]

Holds the data and list of batch_sizes of a packed sequence.

All RNN modules accept packed sequences as inputs.

Note

Instances of this class should never be created manually. They are meant to be instantiated by functions like pack_padded_sequence().

Variables:
  • data (Variable) – Variable containing packed sequence
  • batch_sizes (list[int]) – list of integers holding information about the batch size at each sequence step

pack_padded_sequence

torch.nn.utils.rnn.pack_padded_sequence(input, lengths, batch_first=False)[source]

Packs a Variable containing padded sequences of variable length.

Input can be of size TxBx* where T is the length of the longest sequence (equal to lengths[0]), B is the batch size, and * is any number of dimensions (including 0). If batch_first is True BxTx* inputs are expected.

The sequences should be sorted by length in a decreasing order, i.e. input[:,0] should be the longest sequence, and input[:,B-1] the shortest one.

Note

This function accepts any input that has at least two dimensions. You can apply it to pack the labels, and use the output of the RNN with them to compute the loss directly. A Variable can be retrieved from a PackedSequence object by accessing its .data attribute.

Parameters:
  • input (Variable) – padded batch of variable length sequences.
  • lengths (list[int]) – list of sequence lengths of each batch element.
  • batch_first (bool, optional) – if True, the input is expected in BxTx* format.
Returns:

a PackedSequence object
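
The relationship between lengths and batch_sizes can be illustrated without tensors. The sketch below assumes a time-major (T x B) nested list and assumes the packed data is laid out time step by time step, keeping only the sequences that are still active at each step; both helper names are hypothetical.

```python
def batch_sizes_from_lengths(lengths):
    """lengths must be sorted in decreasing order; lengths[0] is the longest sequence."""
    return [sum(1 for n in lengths if n > t) for t in range(lengths[0])]

def pack(padded, lengths):
    """padded: T x B nested list. Returns the flat packed data and the batch sizes."""
    sizes = batch_sizes_from_lengths(lengths)
    data = []
    for t, size in enumerate(sizes):
        data.extend(padded[t][:size])  # drop the padding entries at this time step
    return data, sizes

lengths = [3, 2, 1]
padded = [["a1", "b1", "c1"],
          ["a2", "b2", 0],
          ["a3", 0, 0]]  # 0 marks padding
data, sizes = pack(padded, lengths)
print(sizes)  # [3, 2, 1]
print(data)   # ['a1', 'b1', 'c1', 'a2', 'b2', 'a3']
```

Sorting by decreasing length is what lets each time step keep a simple prefix of the batch.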

pad_packed_sequence

torch.nn.utils.rnn.pad_packed_sequence(sequence, batch_first=False)[source]

Pads a packed batch of variable length sequences.

It is an inverse operation to pack_padded_sequence().

The returned Variable’s data will be of size TxBx*, where T is the length of the longest sequence and B is the batch size. If batch_first is True, the data will be transposed into BxTx* format.

Batch elements will be ordered decreasingly by their length.

Parameters:
  • sequence (PackedSequence) – batch to pad
  • batch_first (bool, optional) – if True, the output will be in BxTx* format.
Returns:

Tuple of Variable containing the padded sequence, and a list of lengths of each sequence in the batch.

torch.nn.functional

Convolution functions

conv1d

torch.nn.functional.conv1d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)[source]

Applies a 1D convolution over an input signal composed of several input planes.

See Conv1d for details and output shape.

Parameters:
  • input – input tensor of shape (minibatch x in_channels x iW)
  • weight – filters of shape (out_channels, in_channels, kW)
  • bias – optional bias of shape (out_channels)
  • stride – the stride of the convolving kernel, default 1

Examples

>>> filters = autograd.Variable(torch.randn(33, 16, 3))
>>> inputs = autograd.Variable(torch.randn(20, 16, 50))
>>> F.conv1d(inputs, filters)
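
As a quick check on the shape produced by the example above, the output length of a convolution along one dimension can be computed with the usual floor formula. The helper below is a plain-Python sketch with a hypothetical name (conv_output_length); it does not call the library.

```python
def conv_output_length(length, kernel, stride=1, padding=0, dilation=1):
    """Output length along one dimension (standard floor formula)."""
    return (length + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1


# The example above: iW = 50, kW = 3, default stride/padding/dilation.
out_w = conv_output_length(50, 3)
print((20, 33, out_w))  # (20, 33, 48): minibatch x out_channels x oW
```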

conv2d

torch.nn.functional.conv2d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)[source]

Applies a 2D convolution over an input image composed of several input planes.

See Conv2d for details and output shape.

Parameters:
  • input – input tensor (minibatch x in_channels x iH x iW)
  • weight – filters tensor (out_channels, in_channels/groups, kH, kW)
  • bias – optional bias tensor (out_channels)
  • stride – the stride of the convolving kernel. Can be a single number or a tuple (sh x sw). Default: 1
  • padding – implicit zero padding on the input. Can be a single number or a tuple. Default: 0
  • groups – split input into groups, in_channels should be divisible by the number of groups

Examples

>>> # With square kernels and equal stride
>>> filters = autograd.Variable(torch.randn(8,4,3,3))
>>> inputs = autograd.Variable(torch.randn(1,4,5,5))
>>> F.conv2d(inputs, filters, padding=1)
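
The relationship between groups and the weight shape (out_channels, in_channels/groups, kH, kW) can be checked with a small shape-only sketch. The function name conv2d_output_shape is hypothetical, dilation is left at 1, and no tensors are involved.

```python
def _pair(value):
    """Accept either a single number or a 2-tuple."""
    return (value, value) if isinstance(value, int) else tuple(value)


def conv2d_output_shape(input_shape, weight_shape, stride=1, padding=0, groups=1):
    """Shape-only sketch: validate the grouping and return the output shape."""
    n, c_in, h, w = input_shape
    c_out, c_in_per_group, kh, kw = weight_shape
    if c_in % groups != 0:
        raise ValueError("in_channels must be divisible by groups")
    if c_in_per_group != c_in // groups:
        raise ValueError("weight must have in_channels/groups input planes")
    (sh, sw), (ph, pw) = _pair(stride), _pair(padding)
    h_out = (h + 2 * ph - kh) // sh + 1
    w_out = (w + 2 * pw - kw) // sw + 1
    return (n, c_out, h_out, w_out)


# Shapes from the example above: padding=1 with a 3x3 kernel keeps 5x5.
print(conv2d_output_shape((1, 4, 5, 5), (8, 4, 3, 3), padding=1))  # (1, 8, 5, 5)
# With groups=2 each filter sees only 4 / 2 = 2 input planes.
print(conv2d_output_shape((1, 4, 5, 5), (8, 2, 3, 3), padding=1, groups=2))  # (1, 8, 5, 5)
```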

conv3d

torch.nn.functional.conv3d(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1)[source]

Applies a 3D convolution over an input image composed of several input planes.

See Conv3d for details and output shape.

Parameters:
  • input – input tensor of shape (minibatch x in_channels x iT x iH x iW)
  • weight – filters tensor of shape (out_channels, in_channels, kT, kH, kW)
  • bias – optional bias tensor of shape (out_channels)
  • stride – the stride of the convolving kernel. Can be a single number or a tuple (st x sh x sw). Default: 1
  • padding – implicit zero padding on the input. Can be a single number or a tuple. Default: 0

Examples

>>> filters = autograd.Variable(torch.randn(33, 16, 3, 3, 3))
>>> inputs = autograd.Variable(torch.randn(20, 16, 50, 10, 20))
>>> F.conv3d(inputs, filters)

conv_transpose1d

torch.nn.functional.conv_transpose1d(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1)[source]

conv_transpose2d

torch.nn.functional.conv_transpose2d(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1)[source]

Applies a 2D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

See ConvTranspose2d for details and output shape.

Parameters:
  • input – input tensor of shape (minibatch x in_channels x iH x iW)
  • weight – filters of shape (in_channels x out_channels x kH x kW)
  • bias – optional bias of shape (out_channels)
  • stride – the stride of the convolving kernel, a single number or a tuple (sh x sw). Default: 1
  • padding – implicit zero padding on the input, a single number or a tuple (padh x padw). Default: 0
  • groups – split input into groups, in_channels should be divisible by the number of groups
  • output_padding – A zero-padding of 0 <= padding < stride that should be added to the output. Can be a single number or a tuple. Default: 0
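
The role of output_padding is easier to see from the output-size formula of a transposed convolution. The sketch below is plain Python with a hypothetical helper name and dilation left out; it uses the standard formula rather than anything taken from the library source.

```python
def conv_transpose_output_size(size, kernel, stride=1, padding=0, output_padding=0):
    """Output size along one dimension of a transposed convolution."""
    if not 0 <= output_padding < stride:
        raise ValueError("output_padding must satisfy 0 <= output_padding < stride")
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# A stride-2, padding-1, 3-wide convolution maps both 9 and 10 down to 5,
# so going back from 5 is ambiguous; output_padding selects the larger size.
print(conv_transpose_output_size(5, 3, stride=2, padding=1))                    # 9
print(conv_transpose_output_size(5, 3, stride=2, padding=1, output_padding=1))  # 10
```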

conv_transpose3d

torch.nn.functional.conv_transpose3d(input, weight, bias=None, stride=1, padding=0, output_padding=0, groups=1)[source]

Applies a 3D transposed convolution operator over an input image composed of several input planes, sometimes also called “deconvolution”.

See ConvTranspose3d for details and output shape.

Parameters:
  • input – input tensor of shape (minibatch x in_channels x iT x iH x iW)
  • weight – filters of shape (in_channels x out_channels x kT x kH x kW)
  • bias – optional bias of shape (out_channels)
  • stride – the stride of the convolving kernel, a single number or a tuple (st x sh x sw). Default: 1
  • padding – implicit zero padding on the input, a single number or a tuple (padt x padh x padw). Default: 0

Pooling functions

avg_pool1d

torch.nn.functional.avg_pool1d(input, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True)[source]

Applies a 1D average pooling over an input signal composed of several input planes.

See AvgPool1d for details and output shape.

Parameters:
  • kernel_size – the size of the window
  • stride – the stride of the window. Default value is kernel_size
  • padding – implicit zero padding to be added on both sides
  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation

Example

>>> # pool with a window of size=3, stride=2
>>> input = Variable(torch.Tensor([[[1,2,3,4,5,6,7]]]))
>>> F.avg_pool1d(input, kernel_size=3, stride=2)
Variable containing:
(0 ,.,.) =
  2  4  6
[torch.FloatTensor of size 1x1x3]
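
The numbers in the example above can be reproduced by hand. The following plain-Python sketch (a stand-in function on a flat list, without padding or ceil_mode) averages each window and shows that stride falls back to kernel_size when it is not given.

```python
def avg_pool1d(signal, kernel_size, stride=None):
    """Average pooling over a flat list, no padding, floor output length."""
    if stride is None:
        stride = kernel_size  # documented default
    n_out = (len(signal) - kernel_size) // stride + 1
    return [
        sum(signal[i * stride:i * stride + kernel_size]) / kernel_size
        for i in range(n_out)
    ]


signal = [1, 2, 3, 4, 5, 6, 7]
print(avg_pool1d(signal, kernel_size=3, stride=2))  # [2.0, 4.0, 6.0]
print(avg_pool1d(signal, kernel_size=3))            # [2.0, 5.0]
```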

avg_pool2d

torch.nn.functional.avg_pool2d(input, kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True)[source]

Applies a 2D average-pooling operation in kh x kw regions, moving the window in steps of sh x sw. The number of output features is equal to the number of input planes.

See AvgPool2d for details and output shape.

Parameters:
  • input – input tensor (minibatch x in_channels x iH x iW)
  • kernel_size – size of the pooling region, a single number or a tuple (kh x kw)
  • stride – stride of the pooling operation, a single number or a tuple (sh x sw). Default is equal to kernel size
  • padding – implicit zero padding on the input, a single number or a tuple (padh x padw), Default: 0
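  • ceil_mode – when True, will use ceil instead of floor to compute the output shape
  • count_include_pad – when True, will include the zero-padding in the averaging calculation (dividing by kh * kw); when False, divides by the number of elements inside the original non-padded image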

avg_pool3d

torch.nn.functional.avg_pool3d(input, kernel_size, stride=None)[source]

Applies a 3D average-pooling operation in kt x kh x kw regions, moving the window in steps of dt x dh x dw. The number of output features is equal to the number of input planes / dt.

max_pool1d

torch.nn.functional.max_pool1d(input, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False)[source]

max_pool2d

torch.nn.functional.max_pool2d(input, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False)[source]

max_pool3d

torch.nn.functional.max_pool3d(input, kernel_size, stride=None, padding=0, dilation=1, ceil_mode=False, return_indices=False)[source]

max_unpool1d

torch.nn.functional.max_unpool1d(input, indices, kernel_size, stride=None, padding=0, output_size=None)[source]

max_unpool2d

torch.nn.functional.max_unpool2d(input, indices, kernel_size, stride=None, padding=0, output_size=None)[source]

max_unpool3d

torch.nn.functional.max_unpool3d(input, indices, kernel_size, stride=None, padding=0, output_size=None)[source]

lp_pool2d

torch.nn.functional.lp_pool2d(input, norm_type, kernel_size, stride=None, ceil_mode=False)[source]

Non-linear activation functions

threshold

torch.nn.functional.threshold(input, threshold, value, inplace=False)[source]

relu

torch.nn.functional.relu(input, inplace=False)[source]

hardtanh

torch.nn.functional.hardtanh(input, min_val=-1.0, max_val=1.0, inplace=False)[source]

relu6

torch.nn.functional.relu6(input, inplace=False)[source]

elu

torch.nn.functional.elu(input, alpha=1.0, inplace=False)[source]

leaky_relu

torch.nn.functional.leaky_relu(input, negative_slope=0.01, inplace=False)[source]

prelu

torch.nn.functional.prelu(input, weight)[source]

rrelu

torch.nn.functional.rrelu(input, lower=0.125, upper=0.3333333333333333, training=False, inplace=False)[source]

logsigmoid

torch.nn.functional.logsigmoid(input)[source]

hardshrink

torch.nn.functional.hardshrink(input, lambd=0.5)[source]

tanhshrink

torch.nn.functional.tanhshrink(input)[source]

softsign

torch.nn.functional.softsign(input)[source]

softplus

torch.nn.functional.softplus(input, beta=1, threshold=20)[source]

softmin

torch.nn.functional.softmin(input)[source]

softmax

torch.nn.functional.softmax(input)[source]

softshrink

torch.nn.functional.softshrink(input, lambd=0.5)[source]

log_softmax

torch.nn.functional.log_softmax(input)[source]

tanh

torch.nn.functional.tanh(input)[source]

sigmoid

torch.nn.functional.sigmoid(input)[source]

Normalization functions

batch_norm

torch.nn.functional.batch_norm(input, running_mean, running_var, weight=None, bias=None, training=False, momentum=0.1, eps=1e-05)[source]

Linear functions

linear

torch.nn.functional.linear(input, weight, bias=None)[source]

Dropout functions

dropout

torch.nn.functional.dropout(input, p=0.5, training=False, inplace=False)[source]

Loss functions

nll_loss

torch.nn.functional.nll_loss(input, target, weight=None, size_average=True)[source]

The negative log likelihood loss.

See NLLLoss for details.

Parameters:
  • input – \((N, C)\) where C = number of classes
  • target – \((N)\) where each value is 0 <= targets[i] <= C-1
  • weight (Variable, optional) – a manual rescaling weight given to each class. If given, has to be a Variable of size “nclasses”
  • size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch.
Variables:

weight – the class-weights given as input to the constructor

Example

>>> # input is of size nBatch x nClasses = 3 x 5
>>> input = autograd.Variable(torch.randn(3, 5))
>>> # each element in target has to have 0 <= value < nclasses
>>> target = autograd.Variable(torch.LongTensor([1, 0, 4]))
>>> output = F.nll_loss(F.log_softmax(input), target)
>>> output.backward()
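
What the loss computes, and the effect of size_average, can be sketched without tensors. The function below is a plain-Python stand-in (no class weights) that takes rows of log-probabilities, as produced by log_softmax, and picks the entry of the target class from each row.

```python
import math


def nll_loss(log_probs, targets, size_average=True):
    """Negative log likelihood over rows of log-probabilities (unweighted)."""
    losses = [-row[t] for row, t in zip(log_probs, targets)]
    total = sum(losses)
    return total / len(losses) if size_average else total


log_probs = [[math.log(0.7), math.log(0.2), math.log(0.1)],
             [math.log(0.1), math.log(0.8), math.log(0.1)]]
targets = [0, 1]

mean_loss = nll_loss(log_probs, targets)                     # averaged
sum_loss = nll_loss(log_probs, targets, size_average=False)  # summed
print(mean_loss, sum_loss)
```

With two observations, the summed loss is exactly twice the averaged one.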

kl_div

torch.nn.functional.kl_div(input, target, size_average=True)[source]

The Kullback-Leibler divergence Loss.

See KLDivLoss for details.

Parameters:
  • input – Variable of arbitrary shape
  • target – Variable of the same shape as input
  • size_average – if True the output is divided by the number of elements in input tensor
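
A minimal sketch of the computation on flat lists follows. It assumes, as in KLDivLoss, that input holds log-probabilities and target holds probabilities; the function is a plain-Python stand-in and not the library code.

```python
import math


def kl_div(log_input, target, size_average=True):
    """Sum of target * (log(target) - log_input), optionally per element."""
    total = sum(
        t * (math.log(t) - x)
        for x, t in zip(log_input, target)
        if t > 0  # terms with target == 0 contribute nothing
    )
    return total / len(log_input) if size_average else total


target = [0.5, 0.5]
same = kl_div([math.log(p) for p in target], target)
different = kl_div([math.log(0.25), math.log(0.75)], target)
print(same, different)  # zero for identical distributions, positive otherwise
```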

cross_entropy

torch.nn.functional.cross_entropy(input, target, weight=None, size_average=True)[source]

This criterion combines log_softmax and nll_loss in a single function.

See torch.nn.CrossEntropyLoss for details.

Parameters:
  • input – Variable \((N, C)\) where C = number of classes
  • target – Variable \((N)\) where each value is 0 <= targets[i] <= C-1
  • weight (Tensor, optional) – a manual rescaling weight given to each class. If given, has to be a Tensor of size “nclasses”
  • size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch.
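
The composition of log_softmax and nll_loss can be written out in a few lines of plain Python. The functions below are illustrative stand-ins operating on nested lists (unweighted, averaged); the log-sum-exp shift is a common stability trick assumed here, not something stated in this document.

```python
import math


def log_softmax(row):
    """Log-softmax of one row of scores, shifted by the max for stability."""
    m = max(row)
    lse = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - lse for v in row]


def nll_loss(log_probs, targets):
    """Averaged negative log likelihood."""
    return sum(-row[t] for row, t in zip(log_probs, targets)) / len(targets)


def cross_entropy(scores, targets):
    """log_softmax followed by nll_loss."""
    return nll_loss([log_softmax(row) for row in scores], targets)


# Equal scores over 4 classes give a loss of log(4), whatever the target is.
uniform_loss = cross_entropy([[0.0, 0.0, 0.0, 0.0]], [2])
print(uniform_loss, math.log(4))
```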

binary_cross_entropy

torch.nn.functional.binary_cross_entropy(input, target, weight=None, size_average=True)[source]

Function that measures the Binary Cross Entropy between the target and the output.

See BCELoss for details.

Parameters:
  • input – Variable of arbitrary shape
  • target – Variable of the same shape as input
  • weight (Variable, optional) – a manual rescaling weight; if provided, it is repeated to match the input tensor shape
  • size_average (bool, optional) – By default, the losses are averaged over observations for each minibatch. However, if size_average is set to False, the losses are instead summed for each minibatch.
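
For reference, the standard binary cross entropy, -w * (t * log(o) + (1 - t) * log(1 - o)) per element, can be sketched on flat lists as follows. This is a plain-Python illustration that assumes every output lies strictly between 0 and 1 and that weight, if given, already has one entry per element.

```python
import math


def binary_cross_entropy(output, target, weight=None, size_average=True):
    """Standard BCE on flat lists; outputs must lie strictly in (0, 1)."""
    if weight is None:
        weight = [1.0] * len(output)
    losses = [
        -w * (t * math.log(o) + (1.0 - t) * math.log(1.0 - o))
        for o, t, w in zip(output, target, weight)
    ]
    total = sum(losses)
    return total / len(losses) if size_average else total


# An output of 0.5 costs log(2) regardless of the target.
loss = binary_cross_entropy([0.5, 0.5], [1.0, 0.0])
print(loss, math.log(2))
```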

smooth_l1_loss

torch.nn.functional.smooth_l1_loss(input, target, size_average=True)[source]

Vision functions

pixel_shuffle

torch.nn.functional.pixel_shuffle(input, upscale_factor)[source]

Rearranges elements in a tensor of shape [*, C*r^2, H, W] to a tensor of shape [*, C, H*r, W*r].

See PixelShuffle for details.

Parameters:
  • input (Variable) – Input
  • upscale_factor (int) – factor to increase spatial resolution by

Examples

>>> ps = nn.PixelShuffle(3)
>>> input = autograd.Variable(torch.Tensor(1, 9, 4, 4))
>>> output = ps(input)
>>> print(output.size())
torch.Size([1, 1, 12, 12])
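
The rearrangement itself can be sketched for a single sample in plain Python. The index mapping below (input channel c*r*r + i*r + j goes to the sub-pixel offset (i, j) of output channel c) is the usual pixel-shuffle layout and is stated here as an assumption; the function works on nested lists rather than tensors.

```python
def pixel_shuffle(channels, r):
    """Rearrange a (C*r*r) x H x W nested list into C x (H*r) x (W*r)."""
    c_in = len(channels)
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    h, w = len(channels[0]), len(channels[0][0])
    out = [[[None] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                src = channels[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


# Four 1x2 channels become one 2x4 channel for r = 2.
sample = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
print(pixel_shuffle(sample, 2))  # [[[0, 10, 1, 11], [20, 30, 21, 31]]]
```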

pad

torch.nn.functional.pad(input, pad, mode='constant', value=0)[source]

Pads tensor.

Currently only 2D and 3D padding are supported. For a 4D input tensor, pad should be in the form (pad_l, pad_r, pad_t, pad_b). For a 5D input tensor, pad should be (pleft, pright, ptop, pbottom, pfront, pback).

Parameters:
  • input (Variable) – 4D or 5D tensor
  • pad (tuple) – 4-elem or 6-elem tuple
  • mode – ‘constant’, ‘reflect’ or ‘replicate’
  • value – fill value for ‘constant’ padding
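
The order of the entries in pad is easy to get wrong, so here is a small plain-Python sketch of ‘constant’ padding applied to a single H x W plane. The helper name pad2d_constant is hypothetical and the other modes are not covered.

```python
def pad2d_constant(plane, pad, value=0):
    """Pad one H x W nested list with (pad_l, pad_r, pad_t, pad_b)."""
    pad_l, pad_r, pad_t, pad_b = pad
    width = len(plane[0]) + pad_l + pad_r
    rows = [[value] * pad_l + list(row) + [value] * pad_r for row in plane]
    top = [[value] * width for _ in range(pad_t)]
    bottom = [[value] * width for _ in range(pad_b)]
    return top + rows + bottom


# One column on the left and one row at the bottom.
print(pad2d_constant([[1, 2], [3, 4]], (1, 0, 0, 1)))
# [[0, 1, 2], [0, 3, 4], [0, 0, 0]]
```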

torch.nn.init

torch.nn.init.uniform(tensor, a=0, b=1)[source]

Fills the input Tensor or Variable with values drawn from the uniform distribution U(a, b).

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • a – the lower bound of the uniform distribution
  • b – the upper bound of the uniform distribution

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.uniform(w)

torch.nn.init.normal(tensor, mean=0, std=1)[source]

Fills the input Tensor or Variable with values drawn from a normal distribution with the given mean and std

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • mean – the mean of the normal distribution
  • std – the standard deviation of the normal distribution

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.normal(w)

torch.nn.init.constant(tensor, val)[source]

Fills the input Tensor or Variable with the value val

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • val – the value to fill the tensor with

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.constant(w, 0.3)

torch.nn.init.xavier_uniform(tensor, gain=1)[source]

Fills the input Tensor or Variable with values according to the method described in “Understanding the difficulty of training deep feedforward neural networks” - Glorot, X. and Bengio, Y., using a uniform distribution. The resulting tensor will have values sampled from U(-a, a) where a = gain * sqrt(2/(fan_in + fan_out)) * sqrt(3)

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • gain – an optional scaling factor to be applied
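
The sqrt(3) factor in the bound comes from the fact that U(-a, a) has standard deviation a / sqrt(3), so the uniform variant has the same standard deviation, gain * sqrt(2/(fan_in + fan_out)), as the normal variant. The plain-Python sketch below evaluates the bound for a 3 x 5 weight; treating the number of columns as fan_in and the number of rows as fan_out is an assumption about the 2D case, and the helper names are hypothetical.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Bound a of U(-a, a), as given in the formula above."""
    return gain * math.sqrt(2.0 / (fan_in + fan_out)) * math.sqrt(3.0)


def xavier_uniform(rows, cols, gain=1.0):
    """Fill a rows x cols nested list (assumes fan_in=cols, fan_out=rows)."""
    a = xavier_uniform_bound(cols, rows, gain)
    return [[random.uniform(-a, a) for _ in range(cols)] for _ in range(rows)]


bound = xavier_uniform_bound(fan_in=5, fan_out=3)
w = xavier_uniform(3, 5)
print(round(bound, 4))  # 0.866
```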

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.xavier_uniform(w, gain=math.sqrt(2.0))

torch.nn.init.xavier_normal(tensor, gain=1)[source]

Fills the input Tensor or Variable with values according to the method described in “Understanding the difficulty of training deep feedforward neural networks” - Glorot, X. and Bengio, Y., using a normal distribution. The resulting tensor will have values sampled from normal distribution with mean=0 and std = gain * sqrt(2/(fan_in + fan_out))

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • gain – an optional scaling factor to be applied

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.xavier_normal(w)

torch.nn.init.kaiming_uniform(tensor, a=0, mode='fan_in')[source]

Fills the input Tensor or Variable with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al using a uniform distribution. The resulting tensor will have values sampled from U(-bound, bound) where bound = sqrt(2/((1 + a^2) * fan_in)) * sqrt(3)

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • a – the coefficient of the slope of the rectifier used after this layer (0 for ReLU by default)
  • mode – either ‘fan_in’ (default) or ‘fan_out’. Choosing fan_in preserves the magnitude of the variance of the weights in the forward pass. Choosing fan_out preserves the magnitudes in the backwards pass.
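
To see how a and mode enter the formula, the following plain-Python sketch evaluates the standard deviation and the uniform bound for a 3 x 5 weight. The helper names are hypothetical, and mapping ‘fan_in’ to the number of columns and ‘fan_out’ to the number of rows is an assumption about the 2D case.

```python
import math


def select_fan(rows, cols, mode="fan_in"):
    """Pick the fan for a 2D weight (assumes fan_in=cols, fan_out=rows)."""
    if mode not in ("fan_in", "fan_out"):
        raise ValueError("mode must be 'fan_in' or 'fan_out'")
    return cols if mode == "fan_in" else rows


def kaiming_std(fan, a=0.0):
    """Standard deviation sqrt(2 / ((1 + a^2) * fan))."""
    return math.sqrt(2.0 / ((1.0 + a ** 2) * fan))


def kaiming_uniform_bound(fan, a=0.0):
    """Bound of U(-bound, bound) with the same standard deviation."""
    return kaiming_std(fan, a) * math.sqrt(3.0)


fan_in = select_fan(3, 5, "fan_in")    # 5
fan_out = select_fan(3, 5, "fan_out")  # 3
print(round(kaiming_uniform_bound(fan_in), 4))  # 1.0954
print(round(kaiming_std(fan_out), 4))           # 0.8165
```

A non-zero a (for example the negative slope of a leaky rectifier) shrinks both values by a factor of sqrt(1 + a^2).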

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.kaiming_uniform(w, mode='fan_in')

torch.nn.init.kaiming_normal(tensor, a=0, mode='fan_in')[source]

Fills the input Tensor or Variable with values according to the method described in “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification” - He, K. et al using a normal distribution. The resulting tensor will have values sampled from normal distribution with mean=0 and std = sqrt( 2/((1 + a^2) * fan_in))

Parameters:
  • tensor – an n-dimensional torch.Tensor
  • a – the coefficient of the slope of the rectifier used after this layer (0 for ReLU by default)
  • mode – either ‘fan_in’ (default) or ‘fan_out’. Choosing fan_in preserves the magnitude of the variance of the weights in the forward pass. Choosing fan_out preserves the magnitudes in the backwards pass.

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.kaiming_normal(w, mode='fan_out')

torch.nn.init.orthogonal(tensor, gain=1)[source]

Fills the input Tensor or Variable with a (semi) orthogonal matrix, as described in “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks” - Saxe, A. et al. The input tensor must have at least 2 dimensions, and for tensors with more than 2 dimensions the trailing dimensions are flattened.

Parameters:
  • tensor – an n-dimensional torch.Tensor, where n >= 2
  • gain – optional gain to be applied

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.orthogonal(w)

torch.nn.init.sparse(tensor, sparsity, std=0.01)[source]

Fills the 2D input Tensor or Variable as a sparse matrix, where the non-zero elements will be drawn from a normal distribution with mean=0 and std=`std`.

Parameters:
  • tensor – a 2-dimensional torch.Tensor
  • sparsity – The fraction of elements in each column to be set to zero
  • std – the standard deviation of the normal distribution used to generate the non-zero values
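
A rough plain-Python sketch of the idea: draw every entry from a normal distribution, then zero a fixed number of randomly chosen entries in each column. Rounding rows * sparsity up to get that number is an assumption made for this illustration, and sparse_init is a hypothetical name.

```python
import math
import random


def sparse_init(rows, cols, sparsity, std=0.01):
    """Fill a rows x cols nested list, zeroing a fraction of each column."""
    num_zeros = int(math.ceil(rows * sparsity))  # assumed rounding rule
    w = [[random.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]
    for c in range(cols):
        # choose distinct rows to zero in this column
        for r in random.sample(range(rows), num_zeros):
            w[r][c] = 0.0
    return w


w = sparse_init(3, 5, sparsity=0.1)
zeros_per_column = [sum(1 for r in range(3) if w[r][c] == 0.0) for c in range(5)]
print(zeros_per_column)
```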

Examples

>>> w = torch.Tensor(3, 5)
>>> nn.init.sparse(w, sparsity=0.1)