torchtext.datasets
General use cases are as follows:
# import datasets
from torchtext.datasets import IMDB
train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
The following datasets are available:
Datasets
Text Classification
AG_NEWS

torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
AG_NEWS Dataset
For additional details refer to https://paperswithcode.com/dataset/ag-news
- Number of lines per split:
train: 120000
test: 7600
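Like the other classification datasets below, AG_NEWS yields (label, text) tuples, so a quick class-balance check is a short pass over the iterator. A minimal sketch; the stand-in samples are illustrative, not real AG_NEWS rows:

```python
from collections import Counter

def label_distribution(data_iter):
    """Count how many examples fall under each integer label."""
    return Counter(label for label, _ in data_iter)

# With torchtext installed you would pass the real DataPipe:
#   from torchtext.datasets import AG_NEWS
#   counts = label_distribution(AG_NEWS(split='train'))
# Stand-in samples with the same (label, text) shape:
samples = [(1, "world news"), (3, "markets rally"), (1, "summit held")]
print(label_distribution(samples))  # Counter({1: 2, 3: 1})
```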
AmazonReviewFull

torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
AmazonReviewFull Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3000000
test: 650000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review title and text
- Return type
(int, str)

AmazonReviewPolarity

torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
AmazonReviewPolarity Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3600000
test: 400000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review title and text
- Return type
(int, str)

DBpedia

torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
DBpedia Dataset
For additional details refer to https://www.dbpedia.org/resources/latest-core/
- Number of lines per split:
train: 560000
test: 70000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 14) and text containing the news title and contents
- Return type
(int, str)

IMDB

torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
IMDB Dataset
For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/
- Number of lines per split:
train: 25000
test: 25000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the movie review
- Return type
(int, str)
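IMDB's labels are 1 and 2, while binary losses typically expect 0/1 targets, so a small remapping step is common. A sketch assuming the (label, text) tuple shape described above (and assuming 1 is negative, 2 positive):

```python
def to_binary(data_iter):
    """Map IMDB-style labels (1, 2) to 0/1 targets, keeping the text."""
    for label, text in data_iter:
        yield label - 1, text

# e.g. to_binary(IMDB(split='train')); stand-in rows shown here:
rows = [(1, "dull plot"), (2, "loved it")]
print(list(to_binary(rows)))  # [(0, 'dull plot'), (1, 'loved it')]
```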
SogouNews

torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
SogouNews Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 450000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the news title and contents
- Return type
(int, str)
SST2

torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))[source]
SST2 Dataset
For additional details refer to https://nlp.stanford.edu/sentiment/
- Number of lines per split:
train: 67349
dev: 872
test: 1821
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and/or label (1 to 4). The test split only returns text.
- Return type
(str, int)
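Since the SST2 test split yields bare text while the other splits yield text paired with a label, downstream code often normalizes both shapes. A hedged sketch; the text-then-label tuple order is an assumption based on the Returns description above:

```python
def split_items(data_iter):
    """Yield (text, label) pairs, using label=None for label-free splits."""
    for item in data_iter:
        if isinstance(item, tuple):
            text, label = item
            yield text, label
        else:
            yield item, None

# Stand-ins for labeled and unlabeled splits:
train_like = [("a gripping film", 1), ("tedious", 0)]
test_like = ["a gripping film"]
print(list(split_items(test_like)))  # [('a gripping film', None)]
```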
YahooAnswers

torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
YahooAnswers Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 1400000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 10) and text containing the question title, question content, and best answer
- Return type
(int, str)

YelpReviewFull

torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
YelpReviewFull Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 650000
test: 50000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review
- Return type
(int, str)

YelpReviewPolarity

torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
YelpReviewPolarity Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 560000
test: 38000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review
- Return type
(int, str)

Language Modeling

PennTreebank

torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
PennTreebank Dataset
For additional details refer to https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html
- Number of lines per split:
train: 42068
valid: 3370
test: 3761
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from the Treebank corpus
- Return type
str

WikiText-2

torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
WikiText2 Dataset
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 36718
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
str
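Language-modeling datasets such as WikiText2 yield raw text lines, which are typically tokenized and folded into a vocabulary. A minimal pure-Python sketch; the whitespace tokenization, min_freq cutoff, and <unk> special token are illustrative choices, not part of the torchtext API shown here:

```python
from collections import Counter

def build_vocab(lines, min_freq=1, specials=("<unk>",)):
    """Map each sufficiently frequent token to an integer id."""
    freqs = Counter(tok for line in lines for tok in line.split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, n in freqs.most_common():
        if n >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# e.g. build_vocab(WikiText2(split='train')); stand-in lines:
vocab = build_vocab(["the cat sat", "the dog"], min_freq=1)
print(vocab["the"])  # 1  (most frequent token after the specials)
```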
WikiText103

torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
WikiText103 Dataset
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 1801350
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
str

Machine Translation

IWSLT2016

torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')[source]
IWSLT2016 dataset
For additional details refer to https://wit3.fbk.eu/2016-01
The available datasets include the following:

Language pairs:

        “en”  “fr”  “de”  “cs”  “ar”
“en”           x     x     x     x
“fr”     x
“de”     x
“cs”     x
“ar”     x

valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
valid_set – a string to identify validation set.
test_set – a string to identify test set.
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
Examples
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
IWSLT2017

torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))[source]
IWSLT2017 dataset
For additional details refer to https://wit3.fbk.eu/2017-01
The available datasets include the following:

Language pairs:

        “en”  “nl”  “de”  “it”  “ro”
“en”           x     x     x     x
“nl”     x           x     x     x
“de”     x     x           x     x
“it”     x     x     x           x
“ro”     x     x     x     x
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
Examples
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
Multi30k

torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))[source]
Multi30k dataset
For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1
- Number of lines per split:
train: 29000
valid: 1014
test: 1000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language. Available options are (‘de’,’en’) and (‘en’, ‘de’)
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
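Translation DataPipes such as Multi30k yield (source, target) sentence pairs, which are usually tokenized and padded to a common length before batching. A minimal padding sketch; the <pad> token and whitespace tokenization are illustrative assumptions:

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Whitespace-tokenize and right-pad a batch to equal length."""
    toks = [s.split() for s in sentences]
    width = max(len(t) for t in toks)
    return [t + [pad_token] * (width - len(t)) for t in toks]

# e.g. sources, targets = zip(*Multi30k(split='valid')); stand-ins:
batch = pad_batch(["ein Hund läuft", "zwei Männer"])
print(batch[1])  # ['zwei', 'Männer', '<pad>']
```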
Sequence Tagging
CoNLL2000Chunking

torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
CoNLL2000Chunking Dataset
For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/
- Number of lines per split:
train: 8936
test: 2012
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields list of words along with corresponding parts-of-speech tags and chunk tags
- Return type
[list(str), list(str), list(str)]
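Each CoNLL2000Chunking item carries parallel lists with one entry per token, so a common first step is zipping them back into per-token records. A sketch assuming the (words, pos_tags, chunk_tags) triple described above:

```python
def to_records(sample):
    """Zip parallel (words, pos_tags, chunk_tags) lists into token records."""
    words, pos_tags, chunk_tags = sample
    return list(zip(words, pos_tags, chunk_tags))

# Stand-in sample in the documented shape:
sample = (["He", "runs"], ["PRP", "VBZ"], ["B-NP", "B-VP"])
print(to_records(sample))  # [('He', 'PRP', 'B-NP'), ('runs', 'VBZ', 'B-VP')]
```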
UDPOS

torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
UDPOS Dataset
- Number of lines per split:
train: 12543
valid: 2002
test: 2077
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields list of words along with corresponding parts-of-speech tags
- Return type
[list(str), list(str)]

Question Answer

SQuAD 1.0

torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))[source]
SQuAD1 Dataset
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 87599
dev: 10570
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from the SQuAD1 dataset, which consist of context, question, list of answers and corresponding index in context
- Return type
(str, str, list(str), list(int))
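SQuAD items pair each answer with its character index in the context, which allows a quick consistency check: the answer should read back out of the context at that offset. A sketch assuming the (context, question, answers, indices) shape described above:

```python
def answers_align(context, answers, indices):
    """Check each answer string appears in the context at its stated offset."""
    return all(context[i:i + len(a)] == a for a, i in zip(answers, indices))

# Toy example, not a real SQuAD record:
context = "The Eiffel Tower is in Paris."
print(answers_align(context, ["Paris"], [23]))  # True
```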
SQuAD 2.0

torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))[source]
SQuAD2 Dataset
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 130319
dev: 11873
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from the SQuAD2 dataset, which consist of context, question, list of answers and corresponding index in context
- Return type
(str, str, list(str), list(int))

Unsupervised Learning

CC100

torchtext.datasets.CC100(root: str, language_code: str = 'en')[source]
CC100 Dataset
For additional details refer to https://data.statmt.org/cc-100/
EnWik9

torchtext.datasets.EnWik9(root: str)[source]
EnWik9 dataset
For additional details refer to http://mattmahoney.net/dc/textdata.html
Number of lines in dataset: 13147026
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
- Returns
DataPipe that yields raw text rows from the EnWik9 dataset
- Return type
str