.. note::
    :class: sphx-glr-download-link-note

    Click :ref:`here <sphx_glr_download_beginner_torchtext_translation_tutorial.py>` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_beginner_torchtext_translation_tutorial.py:


Language Translation with TorchText
===================================

This tutorial shows how to use ``torchtext`` to preprocess data from a well-known dataset
containing sentences in both English and German, and use it to train a
sequence-to-sequence model with attention that can translate German sentences into English.

It is based on `this tutorial `__ from PyTorch community member
`Ben Trevett <https://github.com/bentrevett>`__ with Ben's permission.
We have updated the tutorial by removing some legacy code.

By the end of this tutorial, you will be able to preprocess sentences into tensors for
NLP modeling and use
`torch.utils.data.DataLoader <https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader>`__
for training and validating the model.

Data Processing
----------------

``torchtext`` has utilities for creating datasets that can be easily iterated through for the
purposes of creating a language translation model. In this example, we show how to tokenize
a raw text sentence, build a vocabulary, and numericalize tokens into tensors.

Note: the tokenization in this tutorial requires `Spacy <https://spacy.io>`__.
We use Spacy because it provides strong support for tokenization in languages other than English.
``torchtext`` provides a ``basic_english`` tokenizer and supports other tokenizers for English
(e.g. `Moses `__), but for language translation, where multiple languages are required,
Spacy is your best bet.

To run this tutorial, first install ``spacy`` using ``pip`` or ``conda``.
Next, download the English and German Spacy tokenizer models:

::

   python -m spacy download en
   python -m spacy download de


.. code-block:: default


    import torchtext
    import torch
    from torchtext.data.utils import get_tokenizer
    from collections import Counter
    from torchtext.vocab import Vocab
    from torchtext.utils import download_from_url, extract_archive
    import io

    url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
    train_urls = ('train.de.gz', 'train.en.gz')
    val_urls = ('val.de.gz', 'val.en.gz')
    test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')

    train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
    val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
    test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

    de_tokenizer = get_tokenizer('spacy', language='de')
    en_tokenizer = get_tokenizer('spacy', language='en')

    def build_vocab(filepath, tokenizer):
        counter = Counter()
        with io.open(filepath, encoding="utf8") as f:
            for string_ in f:
                counter.update(tokenizer(string_))
        return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

    de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
    en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

    def data_process(filepaths):
        raw_de_iter = iter(io.open(filepaths[0], encoding="utf8"))
        raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
        data = []
        for (raw_de, raw_en) in zip(raw_de_iter, raw_en_iter):
            de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de)],
                                      dtype=torch.long)
            en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en)],
                                      dtype=torch.long)
            data.append((de_tensor_, en_tensor_))
        return data

    train_data = data_process(train_filepaths)
    val_data = data_process(val_filepaths)
    test_data = data_process(test_filepaths)
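
Before moving on, it can help to sanity-check the preprocessing. The following optional
snippet is not part of the original tutorial; it simply reuses ``de_tokenizer``,
``de_vocab``, and ``train_data`` built above to tokenize and numericalize one made-up
German sentence and to inspect the first training pair:

.. code-block:: default


    # Optional sanity check (not part of the original tutorial).
    sample = "Zwei Hunde spielen im Schnee."       # illustrative sentence only
    tokens = de_tokenizer(sample)                  # list of string tokens
    indices = [de_vocab[token] for token in tokens]  # list of vocabulary indices
    print(tokens)
    print(indices)

    # Each training example is a (German tensor, English tensor) pair of token indices.
    de_tensor, en_tensor = train_data[0]
    print(de_tensor.shape, en_tensor.shape)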
``DataLoader``
----------------

The last ``torch``-specific feature we'll use is the ``DataLoader``, which is easy to use since
it takes the data as its first argument. Specifically, as the docs say: ``DataLoader`` combines
a dataset and a sampler, and provides an iterable over the given dataset. The ``DataLoader``
supports both map-style and iterable-style datasets with single- or multi-process loading,
customizing loading order and optional automatic batching (collation) and memory pinning.

Please pay attention to ``collate_fn`` (optional), which merges a list of samples to form a
mini-batch of Tensor(s). It is used when loading batches from a map-style dataset.


.. code-block:: default


    import torch

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    BATCH_SIZE = 128
    PAD_IDX = de_vocab['<pad>']
    BOS_IDX = de_vocab['<bos>']
    EOS_IDX = de_vocab['<eos>']

    from torch.nn.utils.rnn import pad_sequence
    from torch.utils.data import DataLoader

    def generate_batch(data_batch):
        de_batch, en_batch = [], []
        for (de_item, en_item) in data_batch:
            de_batch.append(torch.cat([torch.tensor([BOS_IDX]), de_item, torch.tensor([EOS_IDX])], dim=0))
            en_batch.append(torch.cat([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
        de_batch = pad_sequence(de_batch, padding_value=PAD_IDX)
        en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
        return de_batch, en_batch

    train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                            shuffle=True, collate_fn=generate_batch)
    valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                            shuffle=True, collate_fn=generate_batch)
    test_iter = DataLoader(test_data, batch_size=BATCH_SIZE,
                           shuffle=True, collate_fn=generate_batch)
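
To see what ``generate_batch`` produces, you can pull a single batch from ``train_iter`` and
inspect its shape. This quick check is not part of the original tutorial. Note that
``pad_sequence`` puts the sequence dimension first, so each batch has shape
``(sequence_length, batch_size)``, where the sequence length is the longest sentence in the
batch plus two (for the ``<bos>`` and ``<eos>`` tokens):

.. code-block:: default


    # Optional: inspect one padded batch (not part of the original tutorial).
    src_batch, trg_batch = next(iter(train_iter))
    print(src_batch.shape)   # (longest German sentence + 2, BATCH_SIZE)
    print(trg_batch.shape)   # (longest English sentence + 2, BATCH_SIZE)
    print((src_batch == PAD_IDX).sum().item(), "padding tokens in the source batch")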
Defining our ``nn.Module`` and ``Optimizer``
--------------------------------------------

That's mostly it from a ``torchtext`` perspective: with the dataset built and the iterator
defined, the rest of this tutorial simply defines our model as an ``nn.Module``, along with an
``Optimizer``, and then trains it.

Specifically, our model follows the architecture described `here `__ (you can find a
significantly more commented version `here `__).

Note: this model is just an example model that can be used for language translation; we chose
it because it is a standard model for the task, not because it is the recommended model to use
for translation. As you're likely aware, state-of-the-art models are currently based on
Transformers; you can see PyTorch's capabilities for implementing Transformer layers
`here <https://pytorch.org/docs/stable/nn.html#transformer-layers>`__. In particular, the
"attention" used in the model below is different from the multi-headed self-attention present
in a Transformer model.


.. code-block:: default


    import random
    from typing import Tuple

    import torch.nn as nn
    import torch.optim as optim
    import torch.nn.functional as F
    from torch import Tensor


    class Encoder(nn.Module):
        def __init__(self,
                     input_dim: int,
                     emb_dim: int,
                     enc_hid_dim: int,
                     dec_hid_dim: int,
                     dropout: float):
            super().__init__()

            self.input_dim = input_dim
            self.emb_dim = emb_dim
            self.enc_hid_dim = enc_hid_dim
            self.dec_hid_dim = dec_hid_dim
            self.dropout = dropout

            self.embedding = nn.Embedding(input_dim, emb_dim)
            self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
            self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
            self.dropout = nn.Dropout(dropout)

        def forward(self, src: Tensor) -> Tuple[Tensor]:
            embedded = self.dropout(self.embedding(src))
            outputs, hidden = self.rnn(embedded)
            # combine the final forward and backward hidden states
            hidden = torch.tanh(self.fc(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)))
            return outputs, hidden


    class Attention(nn.Module):
        def __init__(self,
                     enc_hid_dim: int,
                     dec_hid_dim: int,
                     attn_dim: int):
            super().__init__()

            self.enc_hid_dim = enc_hid_dim
            self.dec_hid_dim = dec_hid_dim

            self.attn_in = (enc_hid_dim * 2) + dec_hid_dim
            self.attn = nn.Linear(self.attn_in, attn_dim)

        def forward(self,
                    decoder_hidden: Tensor,
                    encoder_outputs: Tensor) -> Tensor:
            src_len = encoder_outputs.shape[0]
            repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)
            encoder_outputs = encoder_outputs.permute(1, 0, 2)
            energy = torch.tanh(self.attn(torch.cat((
                repeated_decoder_hidden,
                encoder_outputs),
                dim=2)))
            attention = torch.sum(energy, dim=2)
            return F.softmax(attention, dim=1)


    class Decoder(nn.Module):
        def __init__(self,
                     output_dim: int,
                     emb_dim: int,
                     enc_hid_dim: int,
                     dec_hid_dim: int,
                     dropout: int,
                     attention: nn.Module):
            super().__init__()

            self.emb_dim = emb_dim
            self.enc_hid_dim = enc_hid_dim
            self.dec_hid_dim = dec_hid_dim
            self.output_dim = output_dim
            self.dropout = dropout
            self.attention = attention

            self.embedding = nn.Embedding(output_dim, emb_dim)
            self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)
            self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)
            self.dropout = nn.Dropout(dropout)

        def _weighted_encoder_rep(self,
                                  decoder_hidden: Tensor,
                                  encoder_outputs: Tensor) -> Tensor:
            a = self.attention(decoder_hidden, encoder_outputs)
            a = a.unsqueeze(1)
            encoder_outputs = encoder_outputs.permute(1, 0, 2)
            weighted_encoder_rep = torch.bmm(a, encoder_outputs)
            weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)
            return weighted_encoder_rep

        def forward(self,
                    input: Tensor,
                    decoder_hidden: Tensor,
                    encoder_outputs: Tensor) -> Tuple[Tensor]:
            input = input.unsqueeze(0)
            embedded = self.dropout(self.embedding(input))
            weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden,
                                                              encoder_outputs)
            rnn_input = torch.cat((embedded, weighted_encoder_rep), dim=2)
            output, decoder_hidden = self.rnn(rnn_input, decoder_hidden.unsqueeze(0))
            embedded = embedded.squeeze(0)
            output = output.squeeze(0)
            weighted_encoder_rep = weighted_encoder_rep.squeeze(0)
            output = self.out(torch.cat((output,
                                         weighted_encoder_rep,
                                         embedded), dim=1))
            return output, decoder_hidden.squeeze(0)


    class Seq2Seq(nn.Module):
        def __init__(self,
                     encoder: nn.Module,
                     decoder: nn.Module,
                     device: torch.device):
            super().__init__()

            self.encoder = encoder
            self.decoder = decoder
            self.device = device

        def forward(self,
                    src: Tensor,
                    trg: Tensor,
                    teacher_forcing_ratio: float = 0.5) -> Tensor:
            batch_size = src.shape[1]
            max_len = trg.shape[0]
            trg_vocab_size = self.decoder.output_dim

            outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
            encoder_outputs, hidden = self.encoder(src)

            # first input to the decoder is the <bos> token
            output = trg[0, :]

            for t in range(1, max_len):
                output, hidden = self.decoder(output, hidden, encoder_outputs)
                outputs[t] = output
                teacher_force = random.random() < teacher_forcing_ratio
                top1 = output.max(1)[1]
                output = (trg[t] if teacher_force else top1)

            return outputs


    INPUT_DIM = len(de_vocab)
    OUTPUT_DIM = len(en_vocab)
    # ENC_EMB_DIM = 256
    # DEC_EMB_DIM = 256
    # ENC_HID_DIM = 512
    # DEC_HID_DIM = 512
    # ATTN_DIM = 64
    # ENC_DROPOUT = 0.5
    # DEC_DROPOUT = 0.5

    ENC_EMB_DIM = 32
    DEC_EMB_DIM = 32
    ENC_HID_DIM = 64
    DEC_HID_DIM = 64
    ATTN_DIM = 8
    ENC_DROPOUT = 0.5
    DEC_DROPOUT = 0.5

    enc = Encoder(INPUT_DIM, ENC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, ENC_DROPOUT)

    attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)

    dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, ENC_HID_DIM, DEC_HID_DIM, DEC_DROPOUT, attn)

    model = Seq2Seq(enc, dec, device).to(device)


    def init_weights(m: nn.Module):
        for name, param in m.named_parameters():
            if 'weight' in name:
                nn.init.normal_(param.data, mean=0, std=0.01)
            else:
                nn.init.constant_(param.data, 0)


    model.apply(init_weights)

    optimizer = optim.Adam(model.parameters())


    def count_parameters(model: nn.Module):
        return sum(p.numel() for p in model.parameters() if p.requires_grad)


    print(f'The model has {count_parameters(model):,} trainable parameters')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    The model has 3,491,552 trainable parameters
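
To build intuition for the shapes flowing through the encoder and the attention module, you can
run a small fake batch through the ``enc`` and ``attn`` instances created above. This check is
not part of the original tutorial, and the source length and batch size below are arbitrary,
chosen only for illustration:

.. code-block:: default


    # Optional shape check (not part of the original tutorial).
    # Dropout is still active here, but it does not change the shapes.
    with torch.no_grad():
        fake_src = torch.randint(0, INPUT_DIM, (7, 2)).to(device)  # (src_len=7, batch_size=2)
        enc_outs, enc_hidden = enc(fake_src)
        print(enc_outs.shape)     # torch.Size([7, 2, 128]) -> (src_len, batch, 2 * ENC_HID_DIM)
        print(enc_hidden.shape)   # torch.Size([2, 64])     -> (batch, DEC_HID_DIM)

        attn_weights = attn(enc_hidden, enc_outs)
        print(attn_weights.shape)       # torch.Size([2, 7]) -> (batch, src_len)
        print(attn_weights.sum(dim=1))  # each row of attention weights sums to 1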
Note: when scoring the performance of a language translation model in particular, we have to
tell the ``nn.CrossEntropyLoss`` function to ignore the indices where the target is simply
padding.


.. code-block:: default


    PAD_IDX = en_vocab.stoi['<pad>']

    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
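
The training loop below reshapes the decoder outputs and the targets before computing the loss:
the first time step (the ``<bos>`` position, which the model never predicts) is dropped, and the
remaining steps are flattened across the batch. The following standalone sketch, with made-up
sizes, shows what that reshaping and ``ignore_index`` do; it is not part of the original
tutorial:

.. code-block:: default


    # Standalone illustration with made-up sizes (not part of the original tutorial).
    fake_logits = torch.randn(5, 2, 10)       # (trg_len, batch_size, trg_vocab_size)
    fake_trg = torch.randint(2, 10, (5, 2))   # (trg_len, batch_size)
    fake_trg[-1, 1] = PAD_IDX                 # pretend one sequence ends early with padding

    flat_logits = fake_logits[1:].view(-1, fake_logits.shape[-1])  # ((trg_len - 1) * batch, vocab)
    flat_trg = fake_trg[1:].view(-1)                               # ((trg_len - 1) * batch,)
    print(criterion(flat_logits, flat_trg))   # padded positions contribute nothing to the loss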
Finally, we can train and evaluate this model:


.. code-block:: default


    import math
    import time


    def train(model: nn.Module,
              iterator: torch.utils.data.DataLoader,
              optimizer: optim.Optimizer,
              criterion: nn.Module,
              clip: float):
        model.train()

        epoch_loss = 0

        for _, (src, trg) in enumerate(iterator):
            src, trg = src.to(device), trg.to(device)

            optimizer.zero_grad()

            output = model(src, trg)

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

            optimizer.step()

            epoch_loss += loss.item()

        return epoch_loss / len(iterator)


    def evaluate(model: nn.Module,
                 iterator: torch.utils.data.DataLoader,
                 criterion: nn.Module):
        model.eval()

        epoch_loss = 0

        with torch.no_grad():
            for _, (src, trg) in enumerate(iterator):
                src, trg = src.to(device), trg.to(device)

                output = model(src, trg, 0)  # turn off teacher forcing

                output = output[1:].view(-1, output.shape[-1])
                trg = trg[1:].view(-1)

                loss = criterion(output, trg)

                epoch_loss += loss.item()

        return epoch_loss / len(iterator)


    def epoch_time(start_time: int, end_time: int):
        elapsed_time = end_time - start_time
        elapsed_mins = int(elapsed_time / 60)
        elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
        return elapsed_mins, elapsed_secs


    N_EPOCHS = 10
    CLIP = 1

    best_valid_loss = float('inf')

    for epoch in range(N_EPOCHS):
        start_time = time.time()

        train_loss = train(model, train_iter, optimizer, criterion, CLIP)
        valid_loss = evaluate(model, valid_iter, criterion)

        end_time = time.time()

        epoch_mins, epoch_secs = epoch_time(start_time, end_time)

        print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
        print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

    test_loss = evaluate(model, test_iter, criterion)

    print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Epoch: 01 | Time: 0m 58s
            Train Loss: 5.763 | Train PPL: 318.278
             Val. Loss: 5.281 |  Val. PPL: 196.603
    Epoch: 02 | Time: 0m 58s
            Train Loss: 4.777 | Train PPL: 118.805
             Val. Loss: 5.041 |  Val. PPL: 154.694
    Epoch: 03 | Time: 0m 59s
            Train Loss: 4.526 | Train PPL:  92.411
             Val. Loss: 4.935 |  Val. PPL: 139.111
    Epoch: 04 | Time: 0m 58s
            Train Loss: 4.367 | Train PPL:  78.774
             Val. Loss: 4.823 |  Val. PPL: 124.304
    Epoch: 05 | Time: 0m 58s
            Train Loss: 4.249 | Train PPL:  70.070
             Val. Loss: 4.766 |  Val. PPL: 117.415
    Epoch: 06 | Time: 0m 57s
            Train Loss: 4.174 | Train PPL:  64.978
             Val. Loss: 4.735 |  Val. PPL: 113.907
    Epoch: 07 | Time: 0m 58s
            Train Loss: 4.065 | Train PPL:  58.237
             Val. Loss: 4.643 |  Val. PPL: 103.859
    Epoch: 08 | Time: 0m 58s
            Train Loss: 3.975 | Train PPL:  53.270
             Val. Loss: 4.606 |  Val. PPL: 100.102
    Epoch: 09 | Time: 0m 58s
            Train Loss: 3.913 | Train PPL:  50.069
             Val. Loss: 4.540 |  Val. PPL:  93.690
    Epoch: 10 | Time: 0m 58s
            Train Loss: 3.831 | Train PPL:  46.103
             Val. Loss: 4.499 |  Val. PPL:  89.940
    | Test Loss: 4.477 | Test PPL:  87.958 |


Next steps
--------------

- Check out the rest of Ben Trevett's tutorials using ``torchtext`` `here <https://github.com/bentrevett>`__
- Stay tuned for a tutorial using other ``torchtext`` features along with ``nn.Transformer`` for language modeling via next word prediction!


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 10 minutes  5.766 seconds)


.. _sphx_glr_download_beginner_torchtext_translation_tutorial.py:


.. only :: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example


  .. container:: sphx-glr-download

     :download:`Download Python source code: torchtext_translation_tutorial.py <torchtext_translation_tutorial.py>`


  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: torchtext_translation_tutorial.ipynb <torchtext_translation_tutorial.ipynb>`


.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_