{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# For tips on running notebooks in Google Colab, see\n",
    "# https://pytorch.org/tutorials/beginner/colab\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Word Embeddings: Encoding Lexical Semantics\n",
    "===========================================\n",
    "\n",
    "Word embeddings are dense vectors of real numbers, one per word in your\n",
    "vocabulary. In NLP, it is almost always the case that your features are\n",
    "words! But how should you represent a word in a computer? You could\n",
    "store its ascii character representation, but that only tells you what\n",
    "the word *is*, it doesn\\'t say much about what it *means* (you might be\n",
    "able to derive its part of speech from its affixes, or properties from\n",
    "its capitalization, but not much). Even more, in what sense could you\n",
    "combine these representations? We often want dense outputs from our\n",
    "neural networks, where the inputs are $|V|$ dimensional, where $V$ is\n",
    "our vocabulary, but often the outputs are only a few dimensional (if we\n",
    "are only predicting a handful of labels, for instance). How do we get\n",
    "from a massive dimensional space to a smaller dimensional space?\n",
    "\n",
    "How about instead of ascii representations, we use a one-hot encoding?\n",
    "That is, we represent the word $w$ by\n",
    "\n",
    "$$\\overbrace{\\left[ 0, 0, \\dots, 1, \\dots, 0, 0 \\right]}^\\text{|V| elements}$$\n",
    "\n",
    "where the 1 is in a location unique to $w$. Any other word will have a 1\n",
    "in some other location, and a 0 everywhere else.\n",
    "\n",
    "There is an enormous drawback to this representation, besides just how\n",
    "huge it is. It basically treats all words as independent entities with\n",
    "no relation to each other. What we really want is some notion of\n",
    "*similarity* between words. Why? Let\\'s see an example.\n",
    "\n",
    "Suppose we are building a language model. Suppose we have seen the\n",
    "sentences\n",
    "\n",
    "-   The mathematician ran to the store.\n",
    "-   The physicist ran to the store.\n",
    "-   The mathematician solved the open problem.\n",
    "\n",
    "in our training data. Now suppose we get a new sentence never before\n",
    "seen in our training data:\n",
    "\n",
    "-   The physicist solved the open problem.\n",
    "\n",
    "Our language model might do OK on this sentence, but wouldn\\'t it be\n",
    "much better if we could use the following two facts:\n",
    "\n",
    "-   We have seen mathematician and physicist in the same role in a\n",
    "    sentence. Somehow they have a semantic relation.\n",
    "-   We have seen mathematician in the same role in this new unseen\n",
    "    sentence as we are now seeing physicist.\n",
    "\n",
    "and then infer that physicist is actually a good fit in the new unseen\n",
    "sentence? This is what we mean by a notion of similarity: we mean\n",
    "*semantic similarity*, not simply having similar orthographic\n",
    "representations. It is a technique to combat the sparsity of linguistic\n",
    "data, by connecting the dots between what we have seen and what we\n",
    "haven\\'t. This example of course relies on a fundamental linguistic\n",
    "assumption: that words appearing in similar contexts are related to each\n",
    "other semantically. This is called the [distributional\n",
    "hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).\n",
    "\n",
    "Getting Dense Word Embeddings\n",
    "-----------------------------\n",
    "\n",
    "How can we solve this problem? That is, how could we actually encode\n",
    "semantic similarity in words? Maybe we think up some semantic\n",
    "attributes. For example, we see that both mathematicians and physicists\n",
    "can run, so maybe we give these words a high score for the \\\"is able to\n",
    "run\\\" semantic attribute. Think of some other attributes, and imagine\n",
    "what you might score some common words on those attributes.\n",
    "\n",
    "If each attribute is a dimension, then we might give each word a vector,\n",
    "like this:\n",
    "\n",
    "$$q_\\text{mathematician} = \\left[ \\overbrace{2.3}^\\text{can run},\n",
    "\\overbrace{9.4}^\\text{likes coffee}, \\overbrace{-5.5}^\\text{majored in Physics}, \\dots \\right]$$\n",
    "\n",
    "$$q_\\text{physicist} = \\left[ \\overbrace{2.5}^\\text{can run},\n",
    "\\overbrace{9.1}^\\text{likes coffee}, \\overbrace{6.4}^\\text{majored in Physics}, \\dots \\right]$$\n",
    "\n",
    "Then we can get a measure of similarity between these words by doing:\n",
    "\n",
    "$$\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = q_\\text{physicist} \\cdot q_\\text{mathematician}$$\n",
    "\n",
    "Although it is more common to normalize by the lengths:\n",
    "\n",
    "$$\\text{Similarity}(\\text{physicist}, \\text{mathematician}) = \\frac{q_\\text{physicist} \\cdot q_\\text{mathematician}}\n",
    "{\\| q_\\text{physicist} \\| \\| q_\\text{mathematician} \\|} = \\cos (\\phi)$$\n",
    "\n",
    "Where $\\phi$ is the angle between the two vectors. That way, extremely\n",
    "similar words (words whose embeddings point in the same direction) will\n",
    "have similarity 1. Extremely dissimilar words should have similarity -1.\n",
    "\n",
    "You can think of the sparse one-hot vectors from the beginning of this\n",
    "section as a special case of these new vectors we have defined, where\n",
    "each word basically has similarity 0, and we gave each word some unique\n",
    "semantic attribute. These new vectors are *dense*, which is to say their\n",
    "entries are (typically) non-zero.\n",
    "\n",
    "But these new vectors are a big pain: you could think of thousands of\n",
    "different semantic attributes that might be relevant to determining\n",
    "similarity, and how on earth would you set the values of the different\n",
    "attributes? Central to the idea of deep learning is that the neural\n",
    "network learns representations of the features, rather than requiring\n",
    "the programmer to design them herself. So why not just let the word\n",
    "embeddings be parameters in our model, and then be updated during\n",
    "training? This is exactly what we will do. We will have some *latent\n",
    "semantic attributes* that the network can, in principle, learn. Note\n",
    "that the word embeddings will probably not be interpretable. That is,\n",
    "although with our hand-crafted vectors above we can see that\n",
    "mathematicians and physicists are similar in that they both like coffee,\n",
    "if we allow a neural network to learn the embeddings and see that both\n",
    "mathematicians and physicists have a large value in the second\n",
    "dimension, it is not clear what that means. They are similar in some\n",
    "latent semantic dimension, but this probably has no interpretation to\n",
    "us.\n",
    "\n",
    "In summary, **word embeddings are a representation of the \\*semantics\\*\n",
    "of a word, efficiently encoding semantic information that might be\n",
    "relevant to the task at hand**. You can embed other things too: part of\n",
    "speech tags, parse trees, anything! The idea of feature embeddings is\n",
    "central to the field.\n",
    "\n",
    "Word Embeddings in Pytorch\n",
    "--------------------------\n",
    "\n",
    "Before we get to a worked example and an exercise, a few quick notes\n",
    "about how to use embeddings in Pytorch and in deep learning programming\n",
    "in general. Similar to how we defined a unique index for each word when\n",
    "making one-hot vectors, we also need to define an index for each word\n",
    "when using embeddings. These will be keys into a lookup table. That is,\n",
    "embeddings are stored as a $|V| \\times D$ matrix, where $D$ is the\n",
    "dimensionality of the embeddings, such that the word assigned index $i$\n",
    "has its embedding stored in the $i$\\'th row of the matrix. In all of my\n",
    "code, the mapping from words to indices is a dictionary named\n",
    "word\\_to\\_ix.\n",
    "\n",
    "The module that allows you to use embeddings is torch.nn.Embedding,\n",
    "which takes two arguments: the vocabulary size, and the dimensionality\n",
    "of the embeddings.\n",
    "\n",
    "To index into this table, you must use torch.LongTensor (since the\n",
    "indices are integers, not floats).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# Author: Robert Guthrie\n",
    "\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "\n",
    "torch.manual_seed(1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "word_to_ix = {\"hello\": 0, \"world\": 1}\n",
    "embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings\n",
    "lookup_tensor = torch.tensor([word_to_ix[\"hello\"]], dtype=torch.long)\n",
    "hello_embed = embeds(lookup_tensor)\n",
    "print(hello_embed)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An Example: N-Gram Language Modeling\n",
    "====================================\n",
    "\n",
    "Recall that in an n-gram language model, given a sequence of words $w$,\n",
    "we want to compute\n",
    "\n",
    "$$P(w_i | w_{i-1}, w_{i-2}, \\dots, w_{i-n+1} )$$\n",
    "\n",
    "Where $w_i$ is the ith word of the sequence.\n",
    "\n",
    "In this example, we will compute the loss function on some training\n",
    "examples and update the parameters with backpropagation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "CONTEXT_SIZE = 2\n",
    "EMBEDDING_DIM = 10\n",
    "# We will use Shakespeare Sonnet 2\n",
    "test_sentence = \"\"\"When forty winters shall besiege thy brow,\n",
    "And dig deep trenches in thy beauty's field,\n",
    "Thy youth's proud livery so gazed on now,\n",
    "Will be a totter'd weed of small worth held:\n",
    "Then being asked, where all thy beauty lies,\n",
    "Where all the treasure of thy lusty days;\n",
    "To say, within thine own deep sunken eyes,\n",
    "Were an all-eating shame, and thriftless praise.\n",
    "How much more praise deserv'd thy beauty's use,\n",
    "If thou couldst answer 'This fair child of mine\n",
    "Shall sum my count, and make my old excuse,'\n",
    "Proving his beauty by succession thine!\n",
    "This were to be new made when thou art old,\n",
    "And see thy blood warm when thou feel'st it cold.\"\"\".split()\n",
    "# we should tokenize the input, but we will ignore that for now\n",
    "# build a list of tuples.\n",
    "# Each tuple is ([ word_i-CONTEXT_SIZE, ..., word_i-1 ], target word)\n",
    "ngrams = [\n",
    "    (\n",
    "        [test_sentence[i - j - 1] for j in range(CONTEXT_SIZE)],\n",
    "        test_sentence[i]\n",
    "    )\n",
    "    for i in range(CONTEXT_SIZE, len(test_sentence))\n",
    "]\n",
    "# Print the first 3, just so you can see what they look like.\n",
    "print(ngrams[:3])\n",
    "\n",
    "vocab = set(test_sentence)\n",
    "word_to_ix = {word: i for i, word in enumerate(vocab)}\n",
    "\n",
    "\n",
    "class NGramLanguageModeler(nn.Module):\n",
    "\n",
    "    def __init__(self, vocab_size, embedding_dim, context_size):\n",
    "        super(NGramLanguageModeler, self).__init__()\n",
    "        self.embeddings = nn.Embedding(vocab_size, embedding_dim)\n",
    "        self.linear1 = nn.Linear(context_size * embedding_dim, 128)\n",
    "        self.linear2 = nn.Linear(128, vocab_size)\n",
    "\n",
    "    def forward(self, inputs):\n",
    "        embeds = self.embeddings(inputs).view((1, -1))\n",
    "        out = F.relu(self.linear1(embeds))\n",
    "        out = self.linear2(out)\n",
    "        log_probs = F.log_softmax(out, dim=1)\n",
    "        return log_probs\n",
    "\n",
    "\n",
    "losses = []\n",
    "loss_function = nn.NLLLoss()\n",
    "model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)\n",
    "optimizer = optim.SGD(model.parameters(), lr=0.001)\n",
    "\n",
    "for epoch in range(10):\n",
    "    total_loss = 0\n",
    "    for context, target in ngrams:\n",
    "\n",
    "        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words\n",
    "        # into integer indices and wrap them in tensors)\n",
    "        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)\n",
    "\n",
    "        # Step 2. Recall that torch *accumulates* gradients. Before passing in a\n",
    "        # new instance, you need to zero out the gradients from the old\n",
    "        # instance\n",
    "        model.zero_grad()\n",
    "\n",
    "        # Step 3. Run the forward pass, getting log probabilities over next\n",
    "        # words\n",
    "        log_probs = model(context_idxs)\n",
    "\n",
    "        # Step 4. Compute your loss function. (Again, Torch wants the target\n",
    "        # word wrapped in a tensor)\n",
    "        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))\n",
    "\n",
    "        # Step 5. Do the backward pass and update the gradient\n",
    "        loss.backward()\n",
    "        optimizer.step()\n",
    "\n",
    "        # Get the Python number from a 1-element Tensor by calling tensor.item()\n",
    "        total_loss += loss.item()\n",
    "    losses.append(total_loss)\n",
    "print(losses)  # The loss decreased every iteration over the training data!\n",
    "\n",
    "# To get the embedding of a particular word, e.g. \"beauty\"\n",
    "print(model.embeddings.weight[word_to_ix[\"beauty\"]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exercise: Computing Word Embeddings: Continuous Bag-of-Words\n",
    "============================================================\n",
    "\n",
    "The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep\n",
    "learning. It is a model that tries to predict words given the context of\n",
    "a few words before and a few words after the target word. This is\n",
    "distinct from language modeling, since CBOW is not sequential and does\n",
    "not have to be probabilistic. Typically, CBOW is used to quickly train\n",
    "word embeddings, and these embeddings are used to initialize the\n",
    "embeddings of some more complicated model. Usually, this is referred to\n",
    "as *pretraining embeddings*. It almost always helps performance a couple\n",
    "of percent.\n",
    "\n",
    "The CBOW model is as follows. Given a target word $w_i$ and an $N$\n",
    "context window on each side, $w_{i-1}, \\dots, w_{i-N}$ and\n",
    "$w_{i+1}, \\dots, w_{i+N}$, referring to all context words collectively\n",
    "as $C$, CBOW tries to minimize\n",
    "\n",
    "$$-\\log p(w_i | C) = -\\log \\text{Softmax}\\left(A(\\sum_{w \\in C} q_w) + b\\right)$$\n",
    "\n",
    "where $q_w$ is the embedding for word $w$.\n",
    "\n",
    "Implement this model in Pytorch by filling in the class below. Some\n",
    "tips:\n",
    "\n",
    "-   Think about which parameters you need to define.\n",
    "-   Make sure you know what shape each operation expects. Use .view() if\n",
    "    you need to reshape.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right\n",
    "raw_text = \"\"\"We are about to study the idea of a computational process.\n",
    "Computational processes are abstract beings that inhabit computers.\n",
    "As they evolve, processes manipulate other abstract things called data.\n",
    "The evolution of a process is directed by a pattern of rules\n",
    "called a program. People create programs to direct processes. In effect,\n",
    "we conjure the spirits of the computer with our spells.\"\"\".split()\n",
    "\n",
    "# By deriving a set from `raw_text`, we deduplicate the array\n",
    "vocab = set(raw_text)\n",
    "vocab_size = len(vocab)\n",
    "\n",
    "word_to_ix = {word: i for i, word in enumerate(vocab)}\n",
    "data = []\n",
    "for i in range(CONTEXT_SIZE, len(raw_text) - CONTEXT_SIZE):\n",
    "    context = (\n",
    "        [raw_text[i - j - 1] for j in range(CONTEXT_SIZE)]\n",
    "        + [raw_text[i + j + 1] for j in range(CONTEXT_SIZE)]\n",
    "    )\n",
    "    target = raw_text[i]\n",
    "    data.append((context, target))\n",
    "print(data[:5])\n",
    "\n",
    "\n",
    "class CBOW(nn.Module):\n",
    "\n",
    "    def __init__(self):\n",
    "        pass\n",
    "\n",
    "    def forward(self, inputs):\n",
    "        pass\n",
    "\n",
    "# Create your model and train. Here are some functions to help you make\n",
    "# the data ready for use by your module.\n",
    "\n",
    "\n",
    "def make_context_vector(context, word_to_ix):\n",
    "    idxs = [word_to_ix[w] for w in context]\n",
    "    return torch.tensor(idxs, dtype=torch.long)\n",
    "\n",
    "\n",
    "make_context_vector(data[0][0], word_to_ix)  # example"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}