{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# For tips on running notebooks in Google Colab, see\n",
    "# https://pytorch.org/tutorials/beginner/colab\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deep Learning with PyTorch\n",
    "==========================\n",
    "\n",
    "Deep Learning Building Blocks: Affine maps, non-linearities and objectives\n",
    "--------------------------------------------------------------------------\n",
    "\n",
    "Deep learning consists of composing linearities with non-linearities in\n",
    "clever ways. The introduction of non-linearities allows for powerful\n",
    "models. In this section, we will play with these core components, make\n",
    "up an objective function, and see how the model is trained.\n",
    "\n",
    "### Affine Maps\n",
    "\n",
    "One of the core workhorses of deep learning is the affine map, which is\n",
    "a function $f(x)$ where\n",
    "\n",
    "$$f(x) = Ax + b$$\n",
    "\n",
    "for a matrix $A$ and vectors $x, b$. The parameters to be learned here\n",
    "are $A$ and $b$. Often, $b$ is refered to as the *bias* term.\n",
    "\n",
    "PyTorch and most other deep learning frameworks do things a little\n",
    "differently than traditional linear algebra. It maps the rows of the\n",
    "input instead of the columns. That is, the $i$\\'th row of the output\n",
    "below is the mapping of the $i$\\'th row of the input under $A$, plus the\n",
    "bias term. Look at the example below.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Author: Robert Guthrie\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "\n",
    "torch.manual_seed(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "lin = nn.Linear(5, 3)  # maps from R^5 to R^3, parameters A, b\n",
    "# data is 2x5.  A maps from 5 to 3... can we map \"data\" under A?\n",
    "data = torch.randn(2, 5)\n",
    "print(lin(data))  # yes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Non-Linearities\n",
    "===============\n",
    "\n",
    "First, note the following fact, which will explain why we need\n",
    "non-linearities in the first place. Suppose we have two affine maps\n",
    "$f(x) = Ax + b$ and $g(x) = Cx + d$. What is $f(g(x))$?\n",
    "\n",
    "$$f(g(x)) = A(Cx + d) + b = ACx + (Ad + b)$$\n",
    "\n",
    "$AC$ is a matrix and $Ad + b$ is a vector, so we see that composing\n",
    "affine maps gives you an affine map.\n",
    "\n",
    "From this, you can see that if you wanted your neural network to be long\n",
    "chains of affine compositions, that this adds no new power to your model\n",
    "than just doing a single affine map.\n",
    "\n",
    "If we introduce non-linearities in between the affine layers, this is no\n",
    "longer the case, and we can build much more powerful models.\n",
    "\n",
    "There are a few core non-linearities.\n",
    "$\\tanh(x), \\sigma(x), \\text{ReLU}(x)$ are the most common. You are\n",
    "probably wondering: \\\"why these functions? I can think of plenty of\n",
    "other non-linearities.\\\" The reason for this is that they have gradients\n",
    "that are easy to compute, and computing gradients is essential for\n",
    "learning. For example\n",
    "\n",
    "$$\\frac{d\\sigma}{dx} = \\sigma(x)(1 - \\sigma(x))$$\n",
    "\n",
    "A quick note: although you may have learned some neural networks in your\n",
    "intro to AI class where $\\sigma(x)$ was the default non-linearity,\n",
    "typically people shy away from it in practice. This is because the\n",
    "gradient *vanishes* very quickly as the absolute value of the argument\n",
    "grows. Small gradients means it is hard to learn. Most people default to\n",
    "tanh or ReLU.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# In pytorch, most non-linearities are in torch.functional (we have it imported as F)\n",
    "# Note that non-linearites typically don't have parameters like affine maps do.\n",
    "# That is, they don't have weights that are updated during training.\n",
    "data = torch.randn(2, 2)\n",
    "print(data)\n",
    "print(F.relu(data))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Softmax and Probabilities\n",
    "=========================\n",
    "\n",
    "The function $\\text{Softmax}(x)$ is also just a non-linearity, but it is\n",
    "special in that it usually is the last operation done in a network. This\n",
    "is because it takes in a vector of real numbers and returns a\n",
    "probability distribution. Its definition is as follows. Let $x$ be a\n",
    "vector of real numbers (positive, negative, whatever, there are no\n",
    "constraints). Then the i\\'th component of $\\text{Softmax}(x)$ is\n",
    "\n",
    "$$\\frac{\\exp(x_i)}{\\sum_j \\exp(x_j)}$$\n",
    "\n",
    "It should be clear that the output is a probability distribution: each\n",
    "element is non-negative and the sum over all components is 1.\n",
    "\n",
    "You could also think of it as just applying an element-wise\n",
    "exponentiation operator to the input to make everything non-negative and\n",
    "then dividing by the normalization constant.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Softmax is also in torch.nn.functional\n",
    "data = torch.randn(5)\n",
    "print(data)\n",
    "print(F.softmax(data, dim=0))\n",
    "print(F.softmax(data, dim=0).sum())  # Sums to 1 because it is a distribution!\n",
    "print(F.log_softmax(data, dim=0))  # theres also log_softmax"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Objective Functions\n",
    "===================\n",
    "\n",
    "The objective function is the function that your network is being\n",
    "trained to minimize (in which case it is often called a *loss function*\n",
    "or *cost function*). This proceeds by first choosing a training\n",
    "instance, running it through your neural network, and then computing the\n",
    "loss of the output. The parameters of the model are then updated by\n",
    "taking the derivative of the loss function. Intuitively, if your model\n",
    "is completely confident in its answer, and its answer is wrong, your\n",
    "loss will be high. If it is very confident in its answer, and its answer\n",
    "is correct, the loss will be low.\n",
    "\n",
    "The idea behind minimizing the loss function on your training examples\n",
    "is that your network will hopefully generalize well and have small loss\n",
    "on unseen examples in your dev set, test set, or in production. An\n",
    "example loss function is the *negative log likelihood loss*, which is a\n",
    "very common objective for multi-class classification. For supervised\n",
    "multi-class classification, this means training the network to minimize\n",
    "the negative log probability of the correct output (or equivalently,\n",
    "maximize the log probability of the correct output).\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optimization and Training\n",
    "=========================\n",
    "\n",
    "So what we can compute a loss function for an instance? What do we do\n",
    "with that? We saw earlier that Tensors know how to compute gradients\n",
    "with respect to the things that were used to compute it. Well, since our\n",
    "loss is an Tensor, we can compute gradients with respect to all of the\n",
    "parameters used to compute it! Then we can perform standard gradient\n",
    "updates. Let $\\theta$ be our parameters, $L(\\theta)$ the loss function,\n",
    "and $\\eta$ a positive learning rate. Then:\n",
    "\n",
    "$$\\theta^{(t+1)} = \\theta^{(t)} - \\eta \\nabla_\\theta L(\\theta)$$\n",
    "\n",
    "There are a huge collection of algorithms and active research in\n",
    "attempting to do something more than just this vanilla gradient update.\n",
    "Many attempt to vary the learning rate based on what is happening at\n",
    "train time. You don\\'t need to worry about what specifically these\n",
    "algorithms are doing unless you are really interested. Torch provides\n",
    "many in the torch.optim package, and they are all completely\n",
    "transparent. Using the simplest gradient update is the same as the more\n",
    "complicated algorithms. Trying different update algorithms and different\n",
    "parameters for the update algorithms (like different initial learning\n",
    "rates) is important in optimizing your network\\'s performance. Often,\n",
    "just replacing vanilla SGD with an optimizer like Adam or RMSProp will\n",
    "boost performance noticably.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Creating Network Components in PyTorch\n",
    "======================================\n",
    "\n",
    "Before we move on to our focus on NLP, lets do an annotated example of\n",
    "building a network in PyTorch using only affine maps and\n",
    "non-linearities. We will also see how to compute a loss function, using\n",
    "PyTorch\\'s built in negative log likelihood, and update parameters by\n",
    "backpropagation.\n",
    "\n",
    "All network components should inherit from nn.Module and override the\n",
    "forward() method. That is about it, as far as the boilerplate is\n",
    "concerned. Inheriting from nn.Module provides functionality to your\n",
    "component. For example, it makes it keep track of its trainable\n",
    "parameters, you can swap it between CPU and GPU with the `.to(device)`\n",
    "method, where device can be a CPU device `torch.device(\"cpu\")` or CUDA\n",
    "device `torch.device(\"cuda:0\")`.\n",
    "\n",
    "Let\\'s write an annotated example of a network that takes in a sparse\n",
    "bag-of-words representation and outputs a probability distribution over\n",
    "two labels: \\\"English\\\" and \\\"Spanish\\\". This model is just logistic\n",
    "regression.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Example: Logistic Regression Bag-of-Words classifier\n",
    "====================================================\n",
    "\n",
    "Our model will map a sparse BoW representation to log probabilities over\n",
    "labels. We assign each word in the vocab an index. For example, say our\n",
    "entire vocab is two words \\\"hello\\\" and \\\"world\\\", with indices 0 and 1\n",
    "respectively. The BoW vector for the sentence \\\"hello hello hello\n",
    "hello\\\" is\n",
    "\n",
    "$$\\left[ 4, 0 \\right]$$\n",
    "\n",
    "For \\\"hello world world hello\\\", it is\n",
    "\n",
    "$$\\left[ 2, 2 \\right]$$\n",
    "\n",
    "etc. In general, it is\n",
    "\n",
    "$$\\left[ \\text{Count}(\\text{hello}), \\text{Count}(\\text{world}) \\right]$$\n",
    "\n",
    "Denote this BOW vector as $x$. The output of our network is:\n",
    "\n",
    "$$\\log \\text{Softmax}(Ax + b)$$\n",
    "\n",
    "That is, we pass the input through an affine map and then do log\n",
    "softmax.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "data = [(\"me gusta comer en la cafeteria\".split(), \"SPANISH\"),\n",
    "        (\"Give it to me\".split(), \"ENGLISH\"),\n",
    "        (\"No creo que sea una buena idea\".split(), \"SPANISH\"),\n",
    "        (\"No it is not a good idea to get lost at sea\".split(), \"ENGLISH\")]\n",
    "\n",
    "test_data = [(\"Yo creo que si\".split(), \"SPANISH\"),\n",
    "             (\"it is lost on me\".split(), \"ENGLISH\")]\n",
    "\n",
    "# word_to_ix maps each word in the vocab to a unique integer, which will be its\n",
    "# index into the Bag of words vector\n",
    "word_to_ix = {}\n",
    "for sent, _ in data + test_data:\n",
    "    for word in sent:\n",
    "        if word not in word_to_ix:\n",
    "            word_to_ix[word] = len(word_to_ix)\n",
    "print(word_to_ix)\n",
    "\n",
    "VOCAB_SIZE = len(word_to_ix)\n",
    "NUM_LABELS = 2\n",
    "\n",
    "\n",
    "class BoWClassifier(nn.Module):  # inheriting from nn.Module!\n",
    "\n",
    "    def __init__(self, num_labels, vocab_size):\n",
    "        # calls the init function of nn.Module.  Dont get confused by syntax,\n",
    "        # just always do it in an nn.Module\n",
    "        super(BoWClassifier, self).__init__()\n",
    "\n",
    "        # Define the parameters that you will need.  In this case, we need A and b,\n",
    "        # the parameters of the affine mapping.\n",
    "        # Torch defines nn.Linear(), which provides the affine map.\n",
    "        # Make sure you understand why the input dimension is vocab_size\n",
    "        # and the output is num_labels!\n",
    "        self.linear = nn.Linear(vocab_size, num_labels)\n",
    "\n",
    "        # NOTE! The non-linearity log softmax does not have parameters! So we don't need\n",
    "        # to worry about that here\n",
    "\n",
    "    def forward(self, bow_vec):\n",
    "        # Pass the input through the linear layer,\n",
    "        # then pass that through log_softmax.\n",
    "        # Many non-linearities and other functions are in torch.nn.functional\n",
    "        return F.log_softmax(self.linear(bow_vec), dim=1)\n",
    "\n",
    "\n",
    "def make_bow_vector(sentence, word_to_ix):\n",
    "    vec = torch.zeros(len(word_to_ix))\n",
    "    for word in sentence:\n",
    "        vec[word_to_ix[word]] += 1\n",
    "    return vec.view(1, -1)\n",
    "\n",
    "\n",
    "def make_target(label, label_to_ix):\n",
    "    return torch.LongTensor([label_to_ix[label]])\n",
    "\n",
    "\n",
    "model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)\n",
    "\n",
    "# the model knows its parameters.  The first output below is A, the second is b.\n",
    "# Whenever you assign a component to a class variable in the __init__ function\n",
    "# of a module, which was done with the line\n",
    "# self.linear = nn.Linear(...)\n",
    "# Then through some Python magic from the PyTorch devs, your module\n",
    "# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters\n",
    "for param in model.parameters():\n",
    "    print(param)\n",
    "\n",
    "# To run the model, pass in a BoW vector\n",
    "# Here we don't need to train, so the code is wrapped in torch.no_grad()\n",
    "with torch.no_grad():\n",
    "    sample = data[0]\n",
    "    bow_vector = make_bow_vector(sample[0], word_to_ix)\n",
    "    log_probs = model(bow_vector)\n",
    "    print(log_probs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Which of the above values corresponds to the log probability of ENGLISH,\n",
    "and which to SPANISH? We never defined it, but we need to if we want to\n",
    "train the thing.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "label_to_ix = {\"SPANISH\": 0, \"ENGLISH\": 1}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "So lets train! To do this, we pass instances through to get log\n",
    "probabilities, compute a loss function, compute the gradient of the loss\n",
    "function, and then update the parameters with a gradient step. Loss\n",
    "functions are provided by Torch in the nn package. nn.NLLLoss() is the\n",
    "negative log likelihood loss we want. It also defines optimization\n",
    "functions in torch.optim. Here, we will just use SGD.\n",
    "\n",
    "Note that the *input* to NLLLoss is a vector of log probabilities, and a\n",
    "target label. It doesn\\'t compute the log probabilities for us. This is\n",
    "why the last layer of our network is log softmax. The loss function\n",
    "nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log\n",
    "softmax for you.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Run on test data before we train, just to see a before-and-after\n",
    "with torch.no_grad():\n",
    "    for instance, label in test_data:\n",
    "        bow_vec = make_bow_vector(instance, word_to_ix)\n",
    "        log_probs = model(bow_vec)\n",
    "        print(log_probs)\n",
    "\n",
    "# Print the matrix column corresponding to \"creo\"\n",
    "print(next(model.parameters())[:, word_to_ix[\"creo\"]])\n",
    "\n",
    "loss_function = nn.NLLLoss()\n",
    "optimizer = optim.SGD(model.parameters(), lr=0.1)\n",
    "\n",
    "# Usually you want to pass over the training data several times.\n",
    "# 100 is much bigger than on a real data set, but real datasets have more than\n",
    "# two instances.  Usually, somewhere between 5 and 30 epochs is reasonable.\n",
    "for epoch in range(100):\n",
    "    for instance, label in data:\n",
    "        # Step 1. Remember that PyTorch accumulates gradients.\n",
    "        # We need to clear them out before each instance\n",
    "        model.zero_grad()\n",
    "\n",
    "        # Step 2. Make our BOW vector and also we must wrap the target in a\n",
    "        # Tensor as an integer. For example, if the target is SPANISH, then\n",
    "        # we wrap the integer 0. The loss function then knows that the 0th\n",
    "        # element of the log probabilities is the log probability\n",
    "        # corresponding to SPANISH\n",
    "        bow_vec = make_bow_vector(instance, word_to_ix)\n",
    "        target = make_target(label, label_to_ix)\n",
    "\n",
    "        # Step 3. Run our forward pass.\n",
    "        log_probs = model(bow_vec)\n",
    "\n",
    "        # Step 4. Compute the loss, gradients, and update the parameters by\n",
    "        # calling optimizer.step()\n",
    "        loss = loss_function(log_probs, target)\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "\n",
    "with torch.no_grad():\n",
    "    for instance, label in test_data:\n",
    "        bow_vec = make_bow_vector(instance, word_to_ix)\n",
    "        log_probs = model(bow_vec)\n",
    "        print(log_probs)\n",
    "\n",
    "# Index corresponding to Spanish goes up, English goes down!\n",
    "print(next(model.parameters())[:, word_to_ix[\"creo\"]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We got the right answer! You can see that the log probability for\n",
    "Spanish is much higher in the first example, and the log probability for\n",
    "English is much higher in the second for the test data, as it should be.\n",
    "\n",
    "Now you see how to make a PyTorch component, pass some data through it\n",
    "and do gradient updates. We are ready to dig deeper into what deep NLP\n",
    "has to offer.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}