torchtext.datasets

General use cases are as follows:

# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
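
As a follow-on to the loop above, the collected tokens can be turned into frequency counts. The sketch below uses a small in-memory list of (label, line) pairs as a stand-in for train_iter so that it runs without downloading IMDB. The sample rows and the count_tokens helper are illustrative assumptions, not part of torchtext.

```python
from collections import Counter

# Hypothetical stand-in for IMDB(split='train'): any iterable of (label, line) pairs.
sample_iter = [
    ("pos", "a great great film"),
    ("neg", "a dull film"),
]


def count_tokens(data_iter):
    """Count whitespace-separated tokens across all (label, line) pairs."""
    counts = Counter()
    for _label, line in data_iter:
        counts.update(line.split())
    return counts


token_counts = count_tokens(sample_iter)
```

Passing the real train_iter in place of sample_iter should work the same way, provided it yields (label, line) pairs as in the loop above.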

The following datasets are available:

Text Classification

AG_NEWS

torchtext.datasets.AG_NEWS(root='.data', split=('train', 'test'))

AG_NEWS dataset

Separately returns the train/test split

Number of lines per split:

train: 120000

test: 7600

Number of classes

4

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
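
The split parameter accepts either a single name or a tuple of names. The following sketch illustrates that calling convention with a hypothetical stand-in function, toy_dataset, rather than torchtext itself. It assumes that a string yields one iterator and that a tuple yields one iterator per name, in order, which is how the examples on this page use the return value.

```python
# Hypothetical in-memory data; the real datasets save their files under `root`.
_TOY_DATA = {
    "train": [(1, "first training line"), (2, "second training line")],
    "test": [(1, "a test line")],
}


def toy_dataset(split=("train", "test")):
    """Stand-in mimicking the documented `split` convention (not torchtext code)."""
    if isinstance(split, str):
        # A single split name returns a single iterator.
        return iter(_TOY_DATA[split])
    # A tuple of names returns one iterator per name, in the same order.
    return tuple(iter(_TOY_DATA[name]) for name in split)


train_only = toy_dataset(split="train")   # one iterator
train_iter, test_iter = toy_dataset()     # default: one iterator per split
first_label, first_line = next(train_only)
```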

SogouNews

torchtext.datasets.SogouNews(root='.data', split=('train', 'test'))

SogouNews dataset

Separately returns the train/test split

Number of lines per split:

train: 450000

test: 60000

Number of classes

5

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

DBpedia

torchtext.datasets.DBpedia(root='.data', split=('train', 'test'))

DBpedia dataset

Separately returns the train/test split

Number of lines per split:

train: 560000

test: 70000

Number of classes

14

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

YelpReviewPolarity

torchtext.datasets.YelpReviewPolarity(root='.data', split=('train', 'test'))

YelpReviewPolarity dataset

Separately returns the train/test split

Number of lines per split:

train: 560000

test: 38000

Number of classes

2

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

YelpReviewFull

torchtext.datasets.YelpReviewFull(root='.data', split=('train', 'test'))

YelpReviewFull dataset

Separately returns the train/test split

Number of lines per split:

train: 650000

test: 50000

Number of classes

5

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

YahooAnswers

torchtext.datasets.YahooAnswers(root='.data', split=('train', 'test'))

YahooAnswers dataset

Separately returns the train/test split

Number of lines per split:

train: 1400000

test: 60000

Number of classes

10

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

AmazonReviewPolarity

torchtext.datasets.AmazonReviewPolarity(root='.data', split=('train', 'test'))

AmazonReviewPolarity dataset

Separately returns the train/test split

Number of lines per split:

train: 3600000

test: 400000

Number of classes

2

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

AmazonReviewFull

torchtext.datasets.AmazonReviewFull(root='.data', split=('train', 'test'))

AmazonReviewFull dataset

Separately returns the train/test split

Number of lines per split:

train: 3000000

test: 650000

Number of classes

5

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')

IMDb

torchtext.datasets.IMDB(root='.data', split=('train', 'test'))

IMDB dataset

Separately returns the train/test split

Number of lines per split:

train: 25000

test: 25000

Number of classes

2

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
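
The text classification iterators are consumed one (label, line) pair at a time, as in the general use case at the top of this page. A minimal sketch of grouping such pairs into fixed-size batches follows. The batched_pairs helper and the sample rows are hypothetical and only illustrate the idea; they are not a torchtext API.

```python
from itertools import islice


def batched_pairs(data_iter, batch_size):
    """Yield (labels, lines) lists holding up to `batch_size` items each."""
    iterator = iter(data_iter)
    while True:
        # Pull the next `batch_size` pairs; the last chunk may be shorter.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        labels, lines = zip(*chunk)
        yield list(labels), list(lines)


# Hypothetical stand-in for a classification split.
sample_pairs = [(1, "a"), (2, "b"), (1, "c")]
batches = list(batched_pairs(sample_pairs, batch_size=2))
```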

Language Modeling

WikiText-2

torchtext.datasets.WikiText2(root='.data', split=('train', 'valid', 'test'))

WikiText2 dataset

Separately returns the train/valid/test split

Number of lines per split:

train: 36718

valid: 3760

test: 4358

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')

WikiText103

torchtext.datasets.WikiText103(root='.data', split=('train', 'valid', 'test'))

WikiText103 dataset

Separately returns the train/valid/test split

Number of lines per split:

train: 1801350

valid: 3760

test: 4358

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')

PennTreebank

torchtext.datasets.PennTreebank(root='.data', split=('train', 'valid', 'test'))

PennTreebank dataset

Separately returns the train/valid/test split

Number of lines per split:

train: 42068

valid: 3370

test: 3761

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')
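
The language modeling splits are counted in lines. One common preprocessing step, sketched here under the assumption that each item produced by a split is a raw text string, is to flatten the lines into a single stream of token ids. The sample lines, the <eos> marker, and the build_id_stream helper are illustrative and not part of torchtext.

```python
def build_id_stream(lines):
    """Map whitespace tokens to integer ids and concatenate all lines."""
    vocab = {}
    stream = []
    for line in lines:
        # Append an end-of-sentence marker so line boundaries survive flattening.
        for token in line.split() + ["<eos>"]:
            # Assign the next free id the first time a token is seen.
            token_id = vocab.setdefault(token, len(vocab))
            stream.append(token_id)
    return vocab, stream


# Hypothetical stand-in for a language modeling split.
sample_lines = ["the cat sat", "the dog sat"]
vocab, stream = build_id_stream(sample_lines)
```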

Machine Translation

Multi30k

torchtext.datasets.Multi30k(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))

Multi30k dataset

Reference: http://www.statmt.org/wmt16/multimodal-task.html#task1

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')

  • language_pair – Tuple or list containing the source and target languages. Available options are ('de', 'en') and ('en', 'de')

IWSLT2016

torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')

IWSLT2016 dataset

The available datasets include the following:

Language pairs:

      'en'  'fr'  'de'  'cs'  'ar'
'en'         x     x     x     x
'fr'   x
'de'   x
'cs'   x
'ar'   x

valid/test sets: ['dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013', 'tst2014']

For additional details refer to source website: https://wit3.fbk.eu/2016-01

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')

  • language_pair – Tuple or list containing the source and target languages.

  • valid_set – A string identifying the validation set.

  • test_set – A string identifying the test set.

Examples

>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(train_iter)
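
The language-pair table and the list of valid/test sets above can be checked before the dataset is constructed. The sketch below assumes, following that table, that every supported pair has 'en' on one side. The is_valid_iwslt2016_config helper is hypothetical, and it tests only membership in the lists on this page; the library may apply further constraints.

```python
IWSLT2016_LANGUAGES = {"en", "fr", "de", "cs", "ar"}
IWSLT2016_EVAL_SETS = ["dev2010", "tst2010", "tst2011", "tst2012", "tst2013", "tst2014"]


def is_valid_iwslt2016_config(language_pair, valid_set="tst2013", test_set="tst2014"):
    """Check a configuration against the pairs and set names listed on this page."""
    src, tgt = language_pair  # accepts a tuple or a list of two language codes
    pair_ok = (
        src != tgt
        and {src, tgt} <= IWSLT2016_LANGUAGES
        and "en" in (src, tgt)  # assumption: English is always one side
    )
    sets_ok = valid_set in IWSLT2016_EVAL_SETS and test_set in IWSLT2016_EVAL_SETS
    return pair_ok and sets_ok


default_ok = is_valid_iwslt2016_config(("de", "en"))
```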

IWSLT2017

torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))

IWSLT2017 dataset

The available datasets include the following:

Language pairs:

      'en'  'nl'  'de'  'it'  'ro'
'en'         x     x     x     x
'nl'   x           x     x     x
'de'   x     x           x     x
'it'   x     x     x           x
'ro'   x     x     x     x

For additional details refer to source website: https://wit3.fbk.eu/2017-01

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')

  • language_pair – Tuple or list containing the source and target languages.

Examples

>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(train_iter)
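
The IWSLT2017 language-pair table marks every combination of two different languages. Under that reading, the supported pairs can be enumerated as ordered pairs of distinct language codes. The constant names in this sketch are illustrative and not part of torchtext.

```python
from itertools import permutations

IWSLT2017_LANGUAGES = ("en", "nl", "de", "it", "ro")

# Every ordered (src, tgt) pair of two distinct languages from the table.
IWSLT2017_PAIRS = set(permutations(IWSLT2017_LANGUAGES, 2))

default_pair_supported = ("de", "en") in IWSLT2017_PAIRS
```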

Sequence Tagging

UDPOS

torchtext.datasets.UDPOS(root='.data', split=('train', 'valid', 'test'))

UDPOS dataset

Separately returns the train/valid/test split

Number of lines per split:

train: 12543

valid: 2002

test: 2077

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'valid', 'test')

CoNLL2000Chunking

torchtext.datasets.CoNLL2000Chunking(root='.data', split=('train', 'test'))

CoNLL2000Chunking dataset

Separately returns the train/test split

Number of lines per split:

train: 8936

test: 2012

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'test')
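
This page does not describe the layout of a sequence tagging sample. The sketch below assumes each sample is a sequence of parallel columns, with the words first and one or more tag columns after them, and pairs every word with its tag. Both the layout and the pair_tokens_with_tags helper are assumptions made for illustration.

```python
# Hypothetical sample in the assumed layout: parallel columns of equal length.
sample = [
    ["The", "cat", "sleeps"],   # words
    ["DET", "NOUN", "VERB"],    # tags
]


def pair_tokens_with_tags(columns, tag_column=1):
    """Zip the word column with the chosen tag column."""
    words = columns[0]
    tags = columns[tag_column]
    if len(words) != len(tags):
        raise ValueError("word and tag columns differ in length")
    return list(zip(words, tags))


tagged = pair_tokens_with_tags(sample)
```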

Question Answer

SQuAD 1.0

torchtext.datasets.SQuAD1(root='.data', split=('train', 'dev'))

SQuAD1 dataset

Separately returns the train/dev split

Number of lines per split:

train: 87599

dev: 10570

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'dev')

SQuAD 2.0

torchtext.datasets.SQuAD2(root='.data', split=('train', 'dev'))

SQuAD2 dataset

Separately returns the train/dev split

Number of lines per split:

train: 130319

dev: 11873

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train', 'dev')
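
SQuAD-style data pairs each answer with a character offset into its context. As a sanity check when working with either dataset, the offset can be verified against the context text. The sketch below uses a made-up context and answer; how torchtext arranges these fields within a sample is not stated on this page, so the code takes them as plain arguments, and the answer_matches_context helper is hypothetical.

```python
def answer_matches_context(context, answer, answer_start):
    """Check that `answer` occurs in `context` at character offset `answer_start`."""
    return context[answer_start:answer_start + len(answer)] == answer


# Hypothetical example values, not taken from SQuAD.
context = "PyTorch is an open source machine learning framework."
answer = "open source"
start = context.index(answer)

span_ok = answer_matches_context(context, answer, start)
```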

Unsupervised Learning

EnWik9

torchtext.datasets.EnWik9(root='.data', split=('train',))

EnWik9 dataset

Separately returns the train split

Number of lines per split:

train: 13147026

Parameters
  • root – Directory where the datasets are saved. Default: .data

  • split – Split or splits to be returned. Can be a string or a tuple of strings. Default: ('train',)
