# Building Models with PyTorch

Follow along with the video below or on YouTube.

## `torch.nn.Module` and `torch.nn.Parameter`

In this video, we’ll be discussing some of the tools PyTorch makes available for building deep learning networks.

Except for `Parameter`, the classes we discuss in this video are all
subclasses of `torch.nn.Module`. This is the PyTorch base class meant
to encapsulate behaviors specific to PyTorch models and their
components.

One important behavior of `torch.nn.Module` is registering parameters.
If a particular `Module` subclass has learning weights, these weights
are expressed as instances of `torch.nn.Parameter`. The `Parameter`
class is a subclass of `torch.Tensor`, with the special behavior that
when an instance is assigned as an attribute of a `Module`, it is added to
the list of that module’s parameters. These parameters may be accessed
through the `parameters()` method on the `Module` class.
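To make the registration behavior concrete, here is a minimal sketch. The `Scale` module and its `gain` and `offset` attributes are hypothetical names chosen for illustration; the point is that only the attribute wrapped in `Parameter` should show up in the module’s parameter list.

```python
import torch


class Scale(torch.nn.Module):
    """Hypothetical module used only to illustrate parameter registration."""

    def __init__(self):
        super().__init__()
        # Wrapped in Parameter: registered with the module automatically.
        self.gain = torch.nn.Parameter(torch.ones(3))
        # Plain tensor attribute: stored on the object, but not registered.
        self.offset = torch.zeros(3)

    def forward(self, x):
        return x * self.gain + self.offset


scale = Scale()
param_names = [name for name, _ in scale.named_parameters()]
print(param_names)  # ['gain']
```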

As an example, here’s a very simple model with two linear layers, an activation function, and a final softmax. We’ll create an instance of it and ask it to report on its parameters:

```
import torch

class TinyModel(torch.nn.Module):

    def __init__(self):
        super(TinyModel, self).__init__()

        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)
        self.softmax = torch.nn.Softmax()

    def forward(self, x):
        x = self.linear1(x)
        x = self.activation(x)
        x = self.linear2(x)
        x = self.softmax(x)
        return x

tinymodel = TinyModel()

print('The model:')
print(tinymodel)

print('\n\nJust one layer:')
print(tinymodel.linear2)

print('\n\nModel params:')
for param in tinymodel.parameters():
    print(param)

print('\n\nLayer params:')
for param in tinymodel.linear2.parameters():
    print(param)
```

Out:

```
The model:
TinyModel(
(linear1): Linear(in_features=100, out_features=200, bias=True)
(activation): ReLU()
(linear2): Linear(in_features=200, out_features=10, bias=True)
(softmax): Softmax(dim=None)
)
Just one layer:
Linear(in_features=200, out_features=10, bias=True)
Model params:
Parameter containing:
tensor([[ 0.0571, 0.0514, 0.0380, ..., -0.0842, 0.0774, 0.0065],
[-0.0913, 0.0189, -0.0532, ..., -0.0845, 0.0352, 0.0051],
[ 0.0251, 0.0836, -0.0172, ..., 0.0570, -0.0572, -0.0473],
...,
[-0.0350, 0.0485, 0.0463, ..., 0.0860, 0.0746, 0.0073],
[ 0.0270, 0.0855, 0.0798, ..., 0.0445, 0.0396, -0.0221],
[ 0.0194, 0.0152, 0.0956, ..., 0.0007, 0.0366, 0.0250]],
requires_grad=True)
Parameter containing:
tensor([-2.1610e-02, -5.0603e-02, 8.6782e-02, 6.0751e-03, -6.3726e-02,
2.4468e-02, 1.2892e-02, 9.2461e-02, 2.1300e-02, 2.7373e-02,
9.4679e-02, 1.4067e-02, -2.6608e-02, -8.6886e-02, -4.7837e-02,
-6.1986e-02, -4.6729e-02, -8.2987e-02, -9.0866e-04, -9.5583e-02,
-4.8372e-04, 9.7001e-02, -4.0596e-02, -1.9317e-02, -9.0029e-02,
5.7627e-02, -9.4703e-02, -2.0492e-02, -7.2362e-02, 7.5222e-02,
-9.7996e-02, 5.6772e-02, 2.2736e-02, -9.3790e-03, 9.3635e-02,
-9.4123e-02, -2.7665e-02, -8.8355e-02, -2.7177e-02, 4.3059e-02,
7.0330e-02, -5.0305e-02, 1.3456e-03, -7.7540e-02, 5.3380e-02,
-6.9149e-03, 1.2532e-02, 8.0125e-02, 5.6433e-02, 1.2123e-02,
-5.1828e-02, -4.8273e-02, -5.9009e-02, -4.7401e-02, 2.0305e-02,
5.7873e-02, 6.3690e-02, 2.1616e-02, 3.8619e-02, 7.3971e-02,
-1.6028e-02, -8.0447e-02, -8.7910e-02, 2.1118e-02, 8.2500e-02,
9.8243e-02, 6.4076e-02, 6.1481e-02, -8.7497e-02, 1.6044e-02,
-5.4011e-03, -2.8502e-02, 1.3706e-02, -1.1251e-02, 4.0769e-02,
-8.3764e-02, -5.1807e-02, -9.0810e-02, 6.6931e-02, -1.9685e-02,
8.1601e-02, 9.2864e-02, 9.7851e-02, 2.8761e-02, -7.8003e-02,
4.0935e-02, 9.0031e-02, -9.3671e-02, 6.4133e-02, 6.5193e-02,
-5.5207e-02, 2.5968e-02, 7.1156e-03, -1.3688e-03, -8.5449e-02,
-4.5128e-02, -9.4556e-02, -3.7420e-02, 1.2335e-02, 2.9909e-02,
-4.5760e-02, 1.1676e-02, -6.0563e-03, 3.4332e-02, 9.2552e-02,
5.6492e-02, -4.9660e-02, 3.9791e-02, 1.1446e-02, 1.2436e-02,
6.3308e-03, 5.9118e-02, 1.3268e-02, 1.9854e-02, -8.5503e-02,
-7.3752e-02, -9.1903e-02, -5.5283e-02, 7.8254e-02, -9.1527e-02,
-1.8005e-02, -7.6893e-03, -6.4159e-02, 2.3980e-02, -7.7777e-02,
6.0304e-02, -4.4397e-02, -5.3985e-03, 3.2544e-06, 7.5553e-02,
-1.8083e-02, 5.0106e-02, -3.7003e-02, -9.7276e-02, 9.7635e-02,
-2.8739e-02, 6.5956e-03, -6.9433e-02, -6.2872e-02, 3.7345e-02,
-5.7289e-02, -1.9548e-02, -8.4536e-02, 3.3381e-02, 9.9169e-02,
-1.1814e-03, -1.7997e-02, 7.9620e-02, -6.1143e-03, 4.3198e-02,
8.0696e-04, 5.6093e-03, 5.0833e-04, -9.1980e-02, -5.3342e-02,
-8.1917e-02, 6.1473e-02, 9.2090e-02, 3.6675e-04, -7.1452e-02,
-4.1001e-02, -9.4951e-03, -8.8975e-02, 7.7082e-02, -5.3429e-02,
1.1141e-02, 5.4670e-02, -9.6425e-02, -4.9106e-02, -8.6400e-02,
7.4532e-02, 5.7822e-02, 5.9280e-03, -3.4205e-02, 9.2063e-02,
6.7175e-02, -4.5628e-02, -8.5734e-02, -2.8184e-02, -4.8467e-02,
-6.3843e-02, -5.2728e-02, -2.4884e-02, 7.9072e-02, -7.7626e-02,
-9.9929e-02, 7.6227e-02, -5.5655e-02, -5.1605e-02, -6.1012e-02,
9.1945e-02, -8.8931e-02, -5.0817e-02, -8.0194e-02, -3.5432e-02,
-7.9072e-02, -5.2822e-02, 3.3801e-02, 2.4268e-02, 4.7813e-02],
requires_grad=True)
Parameter containing:
tensor([[-0.0062, 0.0150, -0.0102, ..., 0.0325, 0.0597, -0.0381],
[-0.0240, 0.0614, 0.0021, ..., -0.0529, 0.0215, 0.0547],
[-0.0157, 0.0068, -0.0157, ..., 0.0446, 0.0137, -0.0209],
...,
[-0.0330, 0.0282, -0.0684, ..., -0.0548, -0.0681, 0.0631],
[ 0.0497, 0.0644, -0.0552, ..., -0.0525, -0.0120, 0.0385],
[-0.0114, -0.0346, 0.0067, ..., -0.0390, 0.0026, -0.0091]],
requires_grad=True)
Parameter containing:
tensor([-0.0362, -0.0086, 0.0368, 0.0635, -0.0117, -0.0418, -0.0185, -0.0370,
-0.0502, 0.0313], requires_grad=True)
Layer params:
Parameter containing:
tensor([[-0.0062, 0.0150, -0.0102, ..., 0.0325, 0.0597, -0.0381],
[-0.0240, 0.0614, 0.0021, ..., -0.0529, 0.0215, 0.0547],
[-0.0157, 0.0068, -0.0157, ..., 0.0446, 0.0137, -0.0209],
...,
[-0.0330, 0.0282, -0.0684, ..., -0.0548, -0.0681, 0.0631],
[ 0.0497, 0.0644, -0.0552, ..., -0.0525, -0.0120, 0.0385],
[-0.0114, -0.0346, 0.0067, ..., -0.0390, 0.0026, -0.0091]],
requires_grad=True)
Parameter containing:
tensor([-0.0362, -0.0086, 0.0368, 0.0635, -0.0117, -0.0418, -0.0185, -0.0370,
-0.0502, 0.0313], requires_grad=True)
```

This shows the fundamental structure of a PyTorch model: there is an
`__init__()` method that defines the layers and other components of a
model, and a `forward()` method where the computation gets done. Note
that we can print the model, or any of its submodules, to learn about
its structure.
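As a rough sketch of how the two methods fit together, the example below builds a small model and passes a batch through it. `TwoLayerNet` is a hypothetical name, and the batch size of 4 is an arbitrary choice; calling the module instance is what runs `forward()`.

```python
import torch


class TwoLayerNet(torch.nn.Module):
    """Hypothetical model: two linear layers with a ReLU in between."""

    def __init__(self):
        super().__init__()
        # Components are defined once, in __init__().
        self.linear1 = torch.nn.Linear(100, 200)
        self.activation = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(200, 10)

    def forward(self, x):
        # The computation itself lives in forward().
        return self.linear2(self.activation(self.linear1(x)))


model = TwoLayerNet()
batch = torch.rand(4, 100)   # 4 samples with 100 features each
output = model(batch)        # calling the module invokes forward()
print(output.shape)          # torch.Size([4, 10])
```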

## Common Layer Types

### Linear Layers

The most basic type of neural network layer is a *linear* or *fully
connected* layer. This is a layer where every input influences every
output of the layer to a degree specified by the layer’s weights. If a
layer has *m* inputs and *n* outputs, the weights form an *m* x *n*
matrix (PyTorch stores it transposed, with shape *n* x *m*). For example:

```
lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)
print('Input:')
print(x)

print('\n\nWeight and Bias parameters:')
for param in lin.parameters():
    print(param)

y = lin(x)
print('\n\nOutput:')
print(y)
```

Out:

```
Input:
tensor([[0.2434, 0.1368, 0.0267]])
Weight and Bias parameters:
Parameter containing:
tensor([[ 0.5738, -0.2361, 0.1416],
[ 0.4828, -0.1578, 0.2356]], requires_grad=True)
Parameter containing:
tensor([-0.1179, 0.3006], requires_grad=True)
Output:
tensor([[-0.0067, 0.4028]], grad_fn=<AddmmBackward0>)
```

If you multiply `x` by the transpose of the linear layer’s weight
matrix and add the biases, you’ll find that you get the output vector
`y`.
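The check described here can be sketched in a few lines. This is an illustrative snippet, not part of the original example; the tolerance passed to `torch.allclose` is an assumption chosen to absorb floating-point rounding.

```python
import torch

lin = torch.nn.Linear(3, 2)
x = torch.rand(1, 3)

# The weight has shape (out_features, in_features) = (2, 3),
# so we multiply by its transpose and then add the bias.
manual = x @ lin.weight.T + lin.bias

matches = torch.allclose(manual, lin(x), atol=1e-6)
print(matches)
```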

One other important feature to note: When we printed the weights of our
layer via `lin.parameters()`, each one reported itself as a `Parameter`
(which is a subclass of `Tensor`), and let us know that it’s tracking
gradients with autograd. This is a default behavior for `Parameter`
that differs from `Tensor`.
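A quick way to see the difference in defaults, as a minimal sketch (the variable names `plain` and `param` are just for illustration):

```python
import torch

plain = torch.rand(2, 2)                      # ordinary tensor
param = torch.nn.Parameter(torch.rand(2, 2))  # tensor wrapped as a Parameter

print(isinstance(param, torch.Tensor))  # True
print(plain.requires_grad)              # False
print(param.requires_grad)              # True
```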

Linear layers are used widely in deep learning models. One of the most
common places you’ll see them is in classifier models, which will
usually have one or more linear layers at the end. The last layer
will have *n* outputs, where *n* is the number of classes the classifier
addresses.

### Convolutional Layers

*Convolutional* layers are built to handle data with a high degree of
spatial correlation. They are very commonly used in computer vision,
where they detect close groupings of features which they then compose into
higher-level features. They pop up in other contexts too - for example,
in NLP applications, where a word’s immediate context (that is, the
other words nearby in the sequence) can affect the meaning of a
sentence.

We saw convolutional layers in action in LeNet5 in an earlier video:

```
import torch.nn.functional as F


class LeNet(torch.nn.Module):

    def __init__(self):
        super(LeNet, self).__init__()
        # 1 input image channel (black & white), 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = torch.nn.Conv2d(1, 6, 5)
        self.conv2 = torch.nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = torch.nn.Linear(16 * 6 * 6, 120)  # 6*6 from image dimension
        self.fc2 = torch.nn.Linear(120, 84)
        self.fc3 = torch.nn.Linear(84, 10)

    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features
```

Let’s break down what’s happening in the convolutional layers of this
model. Starting with `conv1`:

- LeNet5 is meant to take in a 1x32x32 black & white image. **The first argument to a convolutional layer’s constructor is the number of input channels.** Here, it is 1. If we were building this model to look at 3-color channels, it would be 3.
- A convolutional layer is like a window that scans over the image, looking for a pattern it recognizes. These patterns are called *features,* and one of the parameters of a convolutional layer is the number of features we would like it to learn. **The second argument to the constructor is the number of output features.** Here, we’re asking our layer to learn 6 features.
- Just above, I likened the convolutional layer to a window - but how big is the window? **The third argument is the window or kernel size.** Here, the “5” means we’ve chosen a 5x5 kernel. (If you want a kernel with height different from width, you can specify a tuple for this argument - e.g., `(3, 5)` to get a 3x5 convolution kernel.)

The output of a convolutional layer is an *activation map* - a spatial
representation of the presence of features in the input tensor.
`conv1` will give us an output tensor of 6x28x28; 6 is the number of
features, and 28 is the height and width of our map. (The 28 comes from
the fact that when scanning a 5-pixel window over a 32-pixel row, there
are only 28 valid positions.)
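The 6x28x28 figure can be checked directly. The sketch below feeds a random tensor standing in for one 32x32 black & white image through a layer configured like `conv1`; note that PyTorch adds a leading batch dimension, so the shape has four entries.

```python
import torch

conv1 = torch.nn.Conv2d(1, 6, 5)  # 1 input channel, 6 features, 5x5 kernel
image = torch.rand(1, 1, 32, 32)  # (batch, channels, height, width)

activation_map = conv1(image)
print(activation_map.shape)       # torch.Size([1, 6, 28, 28])

# A 5-pixel window fits in a 32-pixel row at 32 - 5 + 1 positions.
valid_positions = 32 - 5 + 1
print(valid_positions)            # 28
```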

We then pass the output of the convolution through a ReLU activation function (more on activation functions later), then through a max pooling layer. The max pooling layer takes features near each other in the activation map and groups them together. It does this by reducing the tensor, merging every 2x2 group of cells in the output into a single cell, and assigning that cell the maximum value of the 4 cells that went into it. This gives us a lower-resolution version of the activation map, with dimensions 6x14x14.

Our next convolutional layer, `conv2`, expects 6 input channels
(corresponding to the 6 features sought by the first layer), has 16
output channels, and a 3x3 kernel. It puts out a 16x12x12 activation
map, which is again reduced by a max pooling layer to 16x6x6. Prior to
passing this output to the linear layers, it is reshaped to a 16 * 6 *
6 = 576-element vector for consumption by the next layer.
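The size bookkeeping through both convolution and pooling stages is plain arithmetic, sketched below. The helper names `conv_output_size` and `pool_output_size` are hypothetical, and the formulas assume stride 1 with no padding for the convolutions and non-overlapping windows for the pooling, as in the LeNet example.

```python
def conv_output_size(size, kernel):
    """Valid positions for a kernel slid over `size` cells (stride 1, no padding)."""
    return size - kernel + 1


def pool_output_size(size, window):
    """Cells left after non-overlapping pooling with the given window."""
    return size // window


size = 32
size = conv_output_size(size, 5)   # conv1: 32 -> 28
size = pool_output_size(size, 2)   # max pool: 28 -> 14
size = conv_output_size(size, 3)   # conv2: 14 -> 12
size = pool_output_size(size, 2)   # max pool: 12 -> 6

flat_features = 16 * size * size   # 16 channels of 6x6
print(size, flat_features)         # 6 576
```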

There are convolutional layers for addressing 1D, 2D, and 3D tensors. There are also many more optional arguments for a conv layer constructor, including stride length (e.g., only scanning every second or every third position in the input), padding (so you can scan out to the edges of the input), and more. See the documentation for more information.
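As an illustration of two of these optional arguments, the sketch below applies `padding` and `stride` to the same 32x32 input. The specific values (`padding=2`, `stride=2`) are arbitrary choices for the demonstration.

```python
import torch

image = torch.rand(1, 1, 32, 32)

# padding=2 adds a 2-pixel border, so a 5x5 kernel keeps the 32x32 size.
padded = torch.nn.Conv2d(1, 6, 5, padding=2)
# stride=2 scans every second position, roughly halving the output size.
strided = torch.nn.Conv2d(1, 6, 5, stride=2)

padded_shape = tuple(padded(image).shape)
strided_shape = tuple(strided(image).shape)
print(padded_shape)   # (1, 6, 32, 32)
print(strided_shape)  # (1, 6, 14, 14)
```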

### Recurrent Layers

*Recurrent neural networks* (or *RNNs*) are used for sequential data -
anything from time-series measurements from a scientific instrument to
natural language sentences to DNA nucleotides. An RNN does this by
maintaining a *hidden state* that acts as a sort of memory for what it
has seen in the sequence so far.

The internal structure of an RNN layer - or its variants, the LSTM (long short-term memory) and GRU (gated recurrent unit) - is moderately complex and beyond the scope of this video, but we’ll show you what one looks like in action with an LSTM-based part-of-speech tagger (a type of classifier that tells you if a word is a noun, verb, etc.):

```
import torch.nn.functional as F


class LSTMTagger(torch.nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim

        self.word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)

        # The LSTM takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        self.lstm = torch.nn.LSTM(embedding_dim, hidden_dim)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores
```

The constructor has four arguments:

- `vocab_size` is the number of words in the input vocabulary. Each word is a one-hot vector (or unit vector) in a `vocab_size`-dimensional space.
- `tagset_size` is the number of tags in the output set.
- `embedding_dim` is the size of the *embedding* space for the vocabulary. An embedding maps a vocabulary onto a low-dimensional space, where words with similar meanings are close together in the space.
- `hidden_dim` is the size of the LSTM’s memory.

The input will be a sentence with the words represented as indices of
one-hot vectors. The embedding layer will then map these down to an
`embedding_dim`-dimensional space. The LSTM takes this sequence of
embeddings and iterates over it, fielding an output vector of length
`hidden_dim` for each word. The final linear layer acts as a classifier; applying
`log_softmax()` to the output of the final layer converts the output
into a normalized set of estimated log-probabilities that a given word maps
to a given tag.
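To follow the data flow described above, here is a sketch that runs the same sequence of layers step by step on a made-up sentence. All sizes and the word indices are arbitrary assumptions; exponentiating the `log_softmax()` output should give per-word probabilities that sum to 1.

```python
import torch
import torch.nn.functional as F

# Arbitrary illustrative sizes, not taken from any real dataset.
vocab_size, embedding_dim, hidden_dim, tagset_size = 10, 6, 8, 3

word_embeddings = torch.nn.Embedding(vocab_size, embedding_dim)
lstm = torch.nn.LSTM(embedding_dim, hidden_dim)
hidden2tag = torch.nn.Linear(hidden_dim, tagset_size)

sentence = torch.tensor([1, 4, 2, 7])                  # 4 word indices
embeds = word_embeddings(sentence)                     # shape (4, 6)
lstm_out, _ = lstm(embeds.view(len(sentence), 1, -1))  # shape (4, 1, 8)
tag_space = hidden2tag(lstm_out.view(len(sentence), -1))
tag_scores = F.log_softmax(tag_space, dim=1)           # shape (4, 3)

print(tag_scores.shape)
# Each row holds log-probabilities, so exp() of a row sums to 1.
row_sums = tag_scores.exp().sum(dim=1)
print(row_sums)
```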

If you’d like to see this network in action, check out the Sequence Models and LSTM Networks tutorial on pytorch.org.

### Transformers

*Transformers* are multi-purpose networks that have taken over the state
of the art in NLP with models like BERT. A discussion of transformer
architecture is beyond the scope of this video, but PyTorch has a
`Transformer` class that allows you to define the overall parameters
of a transformer model - the number of attention heads, the number of
encoder & decoder layers, dropout and activation functions, etc. (You
can even build the BERT model from this single class, with the right
parameters!) The `torch.nn` module also has classes to
encapsulate the individual components (`TransformerEncoder`,
`TransformerDecoder`) and subcomponents (`TransformerEncoderLayer`,
`TransformerDecoderLayer`). For details, check out the documentation
on transformer classes, and the relevant tutorial on pytorch.org.
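As a small sketch of what configuring the `Transformer` class looks like, the example below builds a deliberately tiny model and runs random tensors through it. The sizes are arbitrary assumptions picked to keep the example fast; with the default layout, inputs are shaped (sequence length, batch, `d_model`).

```python
import torch

transformer = torch.nn.Transformer(
    d_model=32,             # embedding size; must be divisible by nhead
    nhead=4,                # number of attention heads
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=64,
    dropout=0.1,
)

src = torch.rand(10, 2, 32)  # source: length 10, batch of 2
tgt = torch.rand(7, 2, 32)   # target: length 7, batch of 2

out = transformer(src, tgt)
print(out.shape)             # torch.Size([7, 2, 32])
```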

## Other Layers and Functions

### Data Manipulation Layers

There are other layer types that perform important functions in models, but don’t participate in the learning process themselves.

**Max pooling** (and its twin, min pooling) reduces a tensor by combining
cells, and assigning the maximum value of the input cells to the output
cell (we saw this). For example:

```
my_tensor = torch.rand(1, 6, 6)
print(my_tensor)
maxpool_layer = torch.nn.MaxPool2d(3)
print(maxpool_layer(my_tensor))
```

Out:

```
tensor([[[0.5411, 0.0483, 0.2942, 0.8217, 0.6970, 0.9515],
[0.3894, 0.5568, 0.1896, 0.3720, 0.4051, 0.4246],
[0.6309, 0.9324, 0.4207, 0.0084, 0.2297, 0.4796],
[0.4238, 0.8163, 0.0321, 0.9867, 0.4923, 0.2789],
[0.1043, 0.9663, 0.2690, 0.0341, 0.1976, 0.5090],
[0.1249, 0.1805, 0.3648, 0.4620, 0.6176, 0.4570]]])
tensor([[[0.9324, 0.9515],
[0.9663, 0.9867]]])
```

If you look closely at the values above, you’ll see that each of the values in the maxpooled output is the maximum value of each quadrant of the 6x6 input.
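With random values the quadrant maxima are hard to spot at a glance. The sketch below repeats the experiment on a tensor of consecutive integers (an illustrative choice), where each quadrant’s maximum is simply its bottom-right cell.

```python
import torch

# Rows of consecutive values: 0..5, 6..11, ..., 30..35
my_tensor = torch.arange(36.0).reshape(1, 6, 6)

maxpool_layer = torch.nn.MaxPool2d(3)  # 3x3 window, non-overlapping by default
pooled = maxpool_layer(my_tensor)
print(pooled)
# tensor([[[14., 17.],
#          [32., 35.]]])
```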

**Normalization layers** re-center and normalize the output of one layer
before feeding it to another. Centering and scaling the intermediate
tensors has a number of beneficial effects, such as letting you use
higher learning rates without exploding/vanishing gradients.

```
my_tensor = torch.rand(1, 4, 4) * 20 + 5
print(my_tensor)
print(my_tensor.mean())
norm_layer = torch.nn.BatchNorm1d(4)
normed_tensor = norm_layer(my_tensor)
print(normed_tensor)
print(normed_tensor.mean())
```

Out:

```
tensor([[[22.4249, 6.7776, 17.2258, 11.7652],
[ 5.1992, 16.0877, 13.7682, 15.9702],
[ 5.2930, 5.6240, 8.0620, 9.8038],
[ 5.3078, 14.2051, 11.5529, 23.4982]]])
tensor(12.0353)
tensor([[[ 1.3442, -1.3262, 0.4569, -0.4750],
[-1.6945, 0.7470, 0.2269, 0.7206],
[-1.0303, -0.8510, 0.4691, 1.4122],
[-1.2735, 0.0862, -0.3191, 1.5063]]],
grad_fn=<NativeBatchNormBackward0>)
tensor(-7.4506e-08, grad_fn=<MeanBackward0>)
```

In the cell above, we added a large scaling factor and offset to
an input tensor; you should see the input tensor’s `mean()` somewhere
in the neighborhood of 15. After running it through the normalization
layer, you can see that the values are smaller, and grouped around zero
- in fact, the mean should be very close to zero (about -7.5e-8 in the run shown above).

This is beneficial because many activation functions (discussed below) have their strongest gradients near 0, but sometimes suffer from vanishing or exploding gradients for inputs that drive them far away from zero. Keeping the data centered around the area of steepest gradient will tend to mean faster, better learning and higher feasible learning rates.
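To see the centering and scaling per channel, here is a sketch using a larger batch than the example above. The tensor sizes and the seed are arbitrary assumptions; in training mode (the default for a freshly constructed module), `BatchNorm1d` normalizes each channel using statistics computed over the batch and length dimensions.

```python
import torch

torch.manual_seed(0)
x = torch.rand(8, 4, 16) * 20 + 5   # (batch, channels, length)

norm_layer = torch.nn.BatchNorm1d(4)
y = norm_layer(x)                   # uses batch statistics in training mode

# Per-channel mean and (biased) variance of the normalized output.
channel_mean = y.mean(dim=(0, 2))
channel_var = ((y - channel_mean.view(1, 4, 1)) ** 2).mean(dim=(0, 2))

print(channel_mean)  # each entry close to 0
print(channel_var)   # each entry close to 1
```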

**Dropout layers** are a tool for encouraging *sparse representations*
in your model - that is, pushing it to do inference with less data.

Dropout layers work by randomly setting parts of the input tensor to zero
*during training* - dropout layers are always turned off for inference
(that is, when the module is in evaluation mode).
This forces the model to learn against this masked or reduced dataset.
For example:

```
my_tensor = torch.rand(1, 4, 4)
dropout = torch.nn.Dropout(p=0.4)
print(dropout(my_tensor))
print(dropout(my_tensor))
```

Out:

```
tensor([[[1.6089, 0.0000, 0.3209, 1.2512],
[0.1894, 0.0000, 0.0000, 0.1348],
[0.0000, 0.9695, 0.0217, 0.0000],
[1.0857, 0.3673, 0.0000, 0.7505]]])
tensor([[[0.0000, 0.5674, 0.3209, 0.0000],
[0.1894, 1.5554, 0.0000, 0.1348],
[0.3677, 0.9695, 0.0217, 1.1259],
[1.0857, 0.3673, 0.9233, 0.7505]]])
```

Above, you can see the effect of dropout on a sample tensor. You can use
the optional `p` argument to set the probability of an individual
element being dropped; if you don’t, it defaults to 0.5.
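Two details are easy to miss in the output above: surviving values are scaled up by 1 / (1 - p) during training, which is why some entries exceed 1, and the layer passes its input through unchanged in evaluation mode. The sketch below checks both on a tensor of ones (an illustrative input).

```python
import torch

p = 0.4
dropout = torch.nn.Dropout(p=p)
x = torch.ones(1000)

dropout.train()                     # training mode: elements are dropped
y_train = dropout(x)
survivors = y_train[y_train != 0]   # elements that were not zeroed
print(survivors[:5])                # each equals 1 / (1 - 0.4)

dropout.eval()                      # evaluation mode: dropout is a no-op
y_eval = dropout(x)
print(torch.equal(y_eval, x))       # True
```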

### Activation Functions

Activation functions make deep learning possible. A neural network is
really a program - with many parameters - that *simulates a mathematical
function*. If all we did was multiply tensors by layer weights
repeatedly, we could only simulate *linear functions;* further, there
would be no point to having many layers, as the whole network
could be reduced to a single matrix multiplication. Inserting
*non-linear* activation functions between layers is what allows a deep
learning model to simulate any function, rather than just linear ones.
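The claim that stacked linear layers collapse into one can be checked numerically. In this sketch (layer sizes are arbitrary), two `Linear` layers applied back to back with no activation in between produce the same result as a single affine map built from their weights and biases.

```python
import torch

layer1 = torch.nn.Linear(4, 5)
layer2 = torch.nn.Linear(5, 3)
x = torch.rand(2, 4)

stacked = layer2(layer1(x))  # two layers, no activation in between

# layer2(layer1(x)) = x (W2 W1)^T + (W2 b1 + b2)
combined_weight = layer2.weight @ layer1.weight            # shape (3, 4)
combined_bias = layer2.weight @ layer1.bias + layer2.bias  # shape (3,)
single = x @ combined_weight.T + combined_bias

same = torch.allclose(stacked, single, atol=1e-5)
print(same)
```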

`torch.nn` has modules encapsulating all of the major
activation functions including ReLU and its many variants, Tanh,
Hardtanh, sigmoid, and more. It also includes other functions, such as
Softmax, that are most useful at the output stage of a model.
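A few of these modules applied to the same small input, as a quick sketch (the input values are arbitrary):

```python
import torch

x = torch.tensor([[-2.0, 0.0, 3.0]])

relu_out = torch.nn.ReLU()(x)            # negatives clipped to zero
tanh_out = torch.nn.Tanh()(x)            # squashed into (-1, 1)
sigmoid_out = torch.nn.Sigmoid()(x)      # squashed into (0, 1)
softmax_out = torch.nn.Softmax(dim=1)(x) # each row sums to 1

print(relu_out)                # tensor([[0., 0., 3.]])
print(tanh_out)
print(sigmoid_out)
print(softmax_out.sum(dim=1))
```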

### Loss Functions

Loss functions tell us how far a model’s prediction is from the correct answer. PyTorch contains a variety of loss functions, including common MSE (mean squared error, based on the squared L2 norm), Cross Entropy Loss and Negative Log Likelihood Loss (useful for classifiers), and others.
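As a brief sketch of how these are used, the example below computes an MSE loss by hand-checkable numbers, then shows that cross entropy on raw scores matches negative log likelihood applied to `log_softmax()` output. The prediction, target, and score values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# Regression-style loss: mean of squared differences, (0 + 0 + 4) / 3.
prediction = torch.tensor([1.0, 2.0, 3.0])
target = torch.tensor([1.0, 2.0, 5.0])
mse = torch.nn.MSELoss()(prediction, target)
print(mse)

# Classifier losses: 2 samples, 3 classes, with integer class labels.
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]])
labels = torch.tensor([0, 2])
cross_entropy = torch.nn.CrossEntropyLoss()(logits, labels)
nll = torch.nn.NLLLoss()(F.log_softmax(logits, dim=1), labels)
print(torch.allclose(cross_entropy, nll))
```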
