Language Translation with TorchText

This tutorial shows how to use torchtext to preprocess data from a well-known dataset containing sentences in both English and German and use it to train a sequence-to-sequence model with attention that can translate German sentences into English.

It is based off of this tutorial from PyTorch community member Ben Trevett with Ben’s permission. We update the tutorials by removing some legacy code.

By the end of this tutorial, you will be able to preprocess sentences into tensors for NLP modeling and use for training and validing the model.

Data Processing

torchtext has utilities for creating datasets that can be easily iterated through for the purposes of creating a language translation model. In this example, we show how to tokenize a raw text sentence, build vocabulary, and numericalize tokens into tensor.

Note: the tokenization in this tutorial requires Spacy We use Spacy because it provides strong support for tokenization in languages other than English. torchtext provides a basic_english tokenizer and supports other tokenizers for English (e.g. Moses) but for language translation - where multiple languages are required - Spacy is your best bet.

To run this tutorial, first install spacy using pip or conda. Next, download the raw data for the English and German Spacy tokenizers:

python -m spacy download en
python -m spacy download de
import torchtext
import torch
from import get_tokenizer
from collections import Counter
from torchtext.vocab import Vocab
from torchtext.utils import download_from_url, extract_archive
import io

url_base = ''
train_urls = ('', 'train.en.gz')
val_urls = ('', 'val.en.gz')
test_urls = ('', 'test_2016_flickr.en.gz')

train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')

def build_vocab(filepath, tokenizer):
  counter = Counter()
  with, encoding="utf8") as f:
    for string_ in f:
  return Vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
en_vocab = build_vocab(train_filepaths[1], en_tokenizer)

def data_process(filepaths):
  raw_de_iter = iter([0], encoding="utf8"))
  raw_en_iter = iter([1], encoding="utf8"))
  data = []
  for (raw_de, raw_en) in zip(raw_de_iter, raw_en_iter):
    de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de)],
    en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en)],
    data.append((de_tensor_, en_tensor_))
  return data

train_data = data_process(train_filepaths)
val_data = data_process(val_filepaths)
test_data = data_process(test_filepaths)


The last torch specific feature we’ll use is the DataLoader, which is easy to use since it takes the data as its first argument. Specifically, as the docs say: DataLoader combines a dataset and a sampler, and provides an iterable over the given dataset. The DataLoader supports both map-style and iterable-style datasets with single- or multi-process loading, customizing loading order and optional automatic batching (collation) and memory pinning.

Please pay attention to collate_fn (optional) that merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset.

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

PAD_IDX = de_vocab['<pad>']
BOS_IDX = de_vocab['<bos>']
EOS_IDX = de_vocab['<eos>']

from torch.nn.utils.rnn import pad_sequence
from import DataLoader

def generate_batch(data_batch):
  de_batch, en_batch = [], []
  for (de_item, en_item) in data_batch:
    de_batch.append([torch.tensor([BOS_IDX]), de_item, torch.tensor([EOS_IDX])], dim=0))
    en_batch.append([torch.tensor([BOS_IDX]), en_item, torch.tensor([EOS_IDX])], dim=0))
  de_batch = pad_sequence(de_batch, padding_value=PAD_IDX)
  en_batch = pad_sequence(en_batch, padding_value=PAD_IDX)
  return de_batch, en_batch

train_iter = DataLoader(train_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
valid_iter = DataLoader(val_data, batch_size=BATCH_SIZE,
                        shuffle=True, collate_fn=generate_batch)
test_iter = DataLoader(test_data, batch_size=BATCH_SIZE,
                       shuffle=True, collate_fn=generate_batch)

Defining our nn.Module and Optimizer

That’s mostly it from a torchtext perspecive: with the dataset built and the iterator defined, the rest of this tutorial simply defines our model as an nn.Module, along with an Optimizer, and then trains it.

Our model specifically, follows the architecture described here (you can find a significantly more commented version here).

Note: this model is just an example model that can be used for language translation; we choose it because it is a standard model for the task, not because it is the recommended model to use for translation. As you’re likely aware, state-of-the-art models are currently based on Transformers; you can see PyTorch’s capabilities for implementing Transformer layers here; and in particular, the “attention” used in the model below is different from the multi-headed self-attention present in a transformer model.

import random
from typing import Tuple

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch import Tensor

class Encoder(nn.Module):
    def __init__(self,
                 input_dim: int,
                 emb_dim: int,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 dropout: float):

        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout

        self.embedding = nn.Embedding(input_dim, emb_dim)

        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional = True)

        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)

        self.dropout = nn.Dropout(dropout)

    def forward(self,
                src: Tensor) -> Tuple[Tensor]:

        embedded = self.dropout(self.embedding(src))

        outputs, hidden = self.rnn(embedded)

        hidden = torch.tanh(self.fc([-2,:,:], hidden[-1,:,:]), dim = 1)))

        return outputs, hidden

class Attention(nn.Module):
    def __init__(self,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 attn_dim: int):

        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim

        self.attn_in = (enc_hid_dim * 2) + dec_hid_dim

        self.attn = nn.Linear(self.attn_in, attn_dim)

    def forward(self,
                decoder_hidden: Tensor,
                encoder_outputs: Tensor) -> Tensor:

        src_len = encoder_outputs.shape[0]

        repeated_decoder_hidden = decoder_hidden.unsqueeze(1).repeat(1, src_len, 1)

        encoder_outputs = encoder_outputs.permute(1, 0, 2)

        energy = torch.tanh(self.attn(
            dim = 2)))

        attention = torch.sum(energy, dim=2)

        return F.softmax(attention, dim=1)

class Decoder(nn.Module):
    def __init__(self,
                 output_dim: int,
                 emb_dim: int,
                 enc_hid_dim: int,
                 dec_hid_dim: int,
                 dropout: int,
                 attention: nn.Module):

        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.attention = attention

        self.embedding = nn.Embedding(output_dim, emb_dim)

        self.rnn = nn.GRU((enc_hid_dim * 2) + emb_dim, dec_hid_dim)

        self.out = nn.Linear(self.attention.attn_in + emb_dim, output_dim)

        self.dropout = nn.Dropout(dropout)

    def _weighted_encoder_rep(self,
                              decoder_hidden: Tensor,
                              encoder_outputs: Tensor) -> Tensor:

        a = self.attention(decoder_hidden, encoder_outputs)

        a = a.unsqueeze(1)

        encoder_outputs = encoder_outputs.permute(1, 0, 2)

        weighted_encoder_rep = torch.bmm(a, encoder_outputs)

        weighted_encoder_rep = weighted_encoder_rep.permute(1, 0, 2)

        return weighted_encoder_rep

    def forward(self,
                input: Tensor,
                decoder_hidden: Tensor,
                encoder_outputs: Tensor) -> Tuple[Tensor]:

        input = input.unsqueeze(0)

        embedded = self.dropout(self.embedding(input))

        weighted_encoder_rep = self._weighted_encoder_rep(decoder_hidden,

        rnn_input =, weighted_encoder_rep), dim = 2)

        output, decoder_hidden = self.rnn(rnn_input, decoder_hidden.unsqueeze(0))

        embedded = embedded.squeeze(0)
        output = output.squeeze(0)
        weighted_encoder_rep = weighted_encoder_rep.squeeze(0)

        output = self.out(,
                                     embedded), dim = 1))

        return output, decoder_hidden.squeeze(0)

class Seq2Seq(nn.Module):
    def __init__(self,
                 encoder: nn.Module,
                 decoder: nn.Module,
                 device: torch.device):

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self,
                src: Tensor,
                trg: Tensor,
                teacher_forcing_ratio: float = 0.5) -> Tensor:

        batch_size = src.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        encoder_outputs, hidden = self.encoder(src)

        # first input to the decoder is the <sos> token
        output = trg[0,:]

        for t in range(1, max_len):
            output, hidden = self.decoder(output, hidden, encoder_outputs)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.max(1)[1]
            output = (trg[t] if teacher_force else top1)

        return outputs

INPUT_DIM = len(de_vocab)
OUTPUT_DIM = len(en_vocab)
# ENC_EMB_DIM = 256
# DEC_EMB_DIM = 256
# ENC_HID_DIM = 512
# DEC_HID_DIM = 512
# ATTN_DIM = 64



attn = Attention(ENC_HID_DIM, DEC_HID_DIM, ATTN_DIM)


model = Seq2Seq(enc, dec, device).to(device)

def init_weights(m: nn.Module):
    for name, param in m.named_parameters():
        if 'weight' in name:
            nn.init.normal_(, mean=0, std=0.01)
            nn.init.constant_(, 0)


optimizer = optim.Adam(model.parameters())

def count_parameters(model: nn.Module):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')


The model has 3,491,552 trainable parameters

Note: when scoring the performance of a language translation model in particular, we have to tell the nn.CrossEntropyLoss function to ignore the indices where the target is simply padding.

PAD_IDX = en_vocab.stoi['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)

Finally, we can train and evaluate this model:

import math
import time

def train(model: nn.Module,
          optimizer: optim.Optimizer,
          criterion: nn.Module,
          clip: float):


    epoch_loss = 0

    for _, (src, trg) in enumerate(iterator):
        src, trg =,


        output = model(src, trg)

        output = output[1:].view(-1, output.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(output, trg)


        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)


        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def evaluate(model: nn.Module,
             criterion: nn.Module):


    epoch_loss = 0

    with torch.no_grad():

        for _, (src, trg) in enumerate(iterator):
            src, trg =,

            output = model(src, trg, 0) #turn off teacher forcing

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(output, trg)

            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

def epoch_time(start_time: int,
               end_time: int):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):

    start_time = time.time()

    train_loss = train(model, train_iter, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iter, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

test_loss = evaluate(model, test_iter, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')


Epoch: 01 | Time: 0m 58s
        Train Loss: 5.763 | Train PPL: 318.278
         Val. Loss: 5.281 |  Val. PPL: 196.603
Epoch: 02 | Time: 0m 58s
        Train Loss: 4.777 | Train PPL: 118.805
         Val. Loss: 5.041 |  Val. PPL: 154.694
Epoch: 03 | Time: 0m 59s
        Train Loss: 4.526 | Train PPL:  92.411
         Val. Loss: 4.935 |  Val. PPL: 139.111
Epoch: 04 | Time: 0m 58s
        Train Loss: 4.367 | Train PPL:  78.774
         Val. Loss: 4.823 |  Val. PPL: 124.304
Epoch: 05 | Time: 0m 58s
        Train Loss: 4.249 | Train PPL:  70.070
         Val. Loss: 4.766 |  Val. PPL: 117.415
Epoch: 06 | Time: 0m 57s
        Train Loss: 4.174 | Train PPL:  64.978
         Val. Loss: 4.735 |  Val. PPL: 113.907
Epoch: 07 | Time: 0m 58s
        Train Loss: 4.065 | Train PPL:  58.237
         Val. Loss: 4.643 |  Val. PPL: 103.859
Epoch: 08 | Time: 0m 58s
        Train Loss: 3.975 | Train PPL:  53.270
         Val. Loss: 4.606 |  Val. PPL: 100.102
Epoch: 09 | Time: 0m 58s
        Train Loss: 3.913 | Train PPL:  50.069
         Val. Loss: 4.540 |  Val. PPL:  93.690
Epoch: 10 | Time: 0m 58s
        Train Loss: 3.831 | Train PPL:  46.103
         Val. Loss: 4.499 |  Val. PPL:  89.940
| Test Loss: 4.477 | Test PPL:  87.958 |

Next steps

  • Check out the rest of Ben Trevett’s tutorials using torchtext here
  • Stay tuned for a tutorial using other torchtext features along with nn.Transformer for language modeling via next word prediction!

Total running time of the script: ( 10 minutes 5.766 seconds)

Gallery generated by Sphinx-Gallery


Access comprehensive developer documentation for PyTorch

View Docs


Get in-depth tutorials for beginners and advanced developers

View Tutorials


Find development resources and get your questions answered

View Resources