torchtext.datasets
General use cases are as follows:
# import datasets
from torchtext.datasets import IMDB
train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
The following datasets are available:
Datasets
Text Classification
AG_NEWS

torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
AG_NEWS Dataset
For additional details refer to https://paperswithcode.com/dataset/ag-news
- Number of lines per split:
train: 120000
test: 7600
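Like the other classification datasets below, AG_NEWS yields (label, text) tuples, so a quick class-balance check is a short pass over the iterator. A minimal sketch; the stand-in samples are illustrative, not real AG_NEWS rows:

```python
from collections import Counter

def label_distribution(data_iter):
    """Count how many examples fall under each integer label."""
    return Counter(label for label, _ in data_iter)

# With torchtext installed you would pass the real DataPipe:
#   from torchtext.datasets import AG_NEWS
#   counts = label_distribution(AG_NEWS(split='train'))
# Stand-in samples with the same (label, text) shape:
samples = [(1, "world news"), (3, "markets rally"), (1, "summit held")]
print(label_distribution(samples))  # Counter({1: 2, 3: 1})
```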
AmazonReviewFull

torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
AmazonReviewFull Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3000000
test: 650000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review title and text
- Return type
(int, str)

AmazonReviewPolarity

torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
AmazonReviewPolarity Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3600000
test: 400000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review title and text
- Return type
(int, str)

DBpedia

torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
DBpedia Dataset
For additional details refer to https://www.dbpedia.org/resources/latest-core/
- Number of lines per split:
train: 560000
test: 70000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 14) and text containing the news title and contents
- Return type
(int, str)

IMDB

torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
IMDB Dataset
For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/
- Number of lines per split:
train: 25000
test: 25000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the movie review
- Return type
(int, str)
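IMDB's labels are 1 and 2, while binary losses typically expect 0/1 targets, so a small remapping step is common. A sketch assuming the (label, text) tuple shape described above (and assuming 1 is negative, 2 positive):

```python
def to_binary(data_iter):
    """Map IMDB-style labels (1, 2) to 0/1 targets, keeping the text."""
    for label, text in data_iter:
        yield label - 1, text

# e.g. to_binary(IMDB(split='train')); stand-in rows shown here:
rows = [(1, "dull plot"), (2, "loved it")]
print(list(to_binary(rows)))  # [(0, 'dull plot'), (1, 'loved it')]
```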
SogouNews

torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
SogouNews Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 450000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the news title and contents
- Return type
(int, str)
SST2

torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))[source]
SST2 Dataset
For additional details refer to https://nlp.stanford.edu/sentiment/
- Number of lines per split:
train: 67349
dev: 872
test: 1821
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and/or label (1 to 4). The test split only returns text.
- Return type
(str, int)
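Since the SST2 test split yields bare text while the other splits yield text paired with a label, downstream code often normalizes both shapes. A hedged sketch; the text-then-label tuple order is an assumption based on the Returns description above:

```python
def split_items(data_iter):
    """Yield (text, label) pairs, using label=None for label-free splits."""
    for item in data_iter:
        if isinstance(item, tuple):
            text, label = item
            yield text, label
        else:
            yield item, None

# Stand-ins for labeled and unlabeled splits:
train_like = [("a gripping film", 1), ("tedious", 0)]
test_like = ["a gripping film"]
print(list(split_items(test_like)))  # [('a gripping film', None)]
```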
YahooAnswers

torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
YahooAnswers Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 1400000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 10) and text containing the question title, question content, and best answer
- Return type
(int, str)

YelpReviewFull

torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
YelpReviewFull Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 650000
test: 50000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review
- Return type
(int, str)

YelpReviewPolarity

torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
YelpReviewPolarity Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 560000
test: 38000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review
- Return type
(int, str)

Language Modeling

PennTreebank

torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
PennTreebank Dataset
For additional details refer to https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html
- Number of lines per split:
train: 42068
valid: 3370
test: 3761
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from the Treebank corpus
- Return type
str

WikiText-2

torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
WikiText2 Dataset
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 36718
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
str
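Language-modeling datasets such as WikiText2 yield raw text lines, which are typically tokenized and folded into a vocabulary. A minimal pure-Python sketch; the whitespace tokenization, min_freq cutoff, and <unk> special token are illustrative choices, not part of the torchtext API shown here:

```python
from collections import Counter

def build_vocab(lines, min_freq=1, specials=("<unk>",)):
    """Map each sufficiently frequent token to an integer id."""
    freqs = Counter(tok for line in lines for tok in line.split())
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tok, n in freqs.most_common():
        if n >= min_freq and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

# e.g. build_vocab(WikiText2(split='train')); stand-in lines:
vocab = build_vocab(["the cat sat", "the dog"], min_freq=1)
print(vocab["the"])  # 1  (most frequent token after the specials)
```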
WikiText103

torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
WikiText103 Dataset
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 1801350
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
str

Machine Translation

IWSLT2016

torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')[source]
IWSLT2016 dataset
For additional details refer to https://wit3.fbk.eu/2016-01
The available datasets include the following:

Language pairs:

        “en”  “fr”  “de”  “cs”  “ar”
“en”           x     x     x     x
“fr”     x
“de”     x
“cs”     x
“ar”     x

valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
valid_set – a string to identify validation set.
test_set – a string to identify test set.
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
Examples
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
IWSLT2017

torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))[source]
IWSLT2017 dataset
For additional details refer to https://wit3.fbk.eu/2017-01
The available datasets include the following:

Language pairs:

        “en”  “nl”  “de”  “it”  “ro”
“en”           x     x     x     x
“nl”     x           x     x     x
“de”     x     x           x     x
“it”     x     x     x           x
“ro”     x     x     x     x
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
Examples
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
Multi30k

torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))[source]
Multi30k dataset
For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1
- Number of lines per split:
train: 29000
valid: 1014
test: 1000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language. Available options are (‘de’,’en’) and (‘en’, ‘de’)
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
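Translation DataPipes such as Multi30k yield (source, target) sentence pairs, which are usually tokenized and padded to a common length before batching. A minimal padding sketch; the <pad> token and whitespace tokenization are illustrative assumptions:

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Whitespace-tokenize and right-pad a batch to equal length."""
    toks = [s.split() for s in sentences]
    width = max(len(t) for t in toks)
    return [t + [pad_token] * (width - len(t)) for t in toks]

# e.g. sources, targets = zip(*Multi30k(split='valid')); stand-ins:
batch = pad_batch(["ein Hund läuft", "zwei Männer"])
print(batch[1])  # ['zwei', 'Männer', '<pad>']
```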
Sequence Tagging
CoNLL2000Chunking

torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]
CoNLL2000Chunking Dataset
For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/
- Number of lines per split:
train: 8936
test: 2012
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields list of words along with corresponding parts-of-speech tags and chunk tags
- Return type
[list(str), list(str), list(str)]
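Each CoNLL2000Chunking item carries parallel lists with one entry per token, so a common first step is zipping them back into per-token records. A sketch assuming the (words, pos_tags, chunk_tags) triple described above:

```python
def to_records(sample):
    """Zip parallel (words, pos_tags, chunk_tags) lists into token records."""
    words, pos_tags, chunk_tags = sample
    return list(zip(words, pos_tags, chunk_tags))

# Stand-in sample in the documented shape:
sample = (["He", "runs"], ["PRP", "VBZ"], ["B-NP", "B-VP"])
print(to_records(sample))  # [('He', 'PRP', 'B-NP'), ('runs', 'VBZ', 'B-VP')]
```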
UDPOS

torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]
UDPOS Dataset
- Number of lines per split:
train: 12543
valid: 2002
test: 2077
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields list of words along with corresponding parts-of-speech tags
- Return type
[list(str), list(str)]

Question Answer

SQuAD 1.0

torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))[source]
SQuAD1 Dataset
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 87599
dev: 10570
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from the SQuAD1 dataset, which consist of context, question, list of answers and corresponding index in context
- Return type
(str, str, list(str), list(int))
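SQuAD items pair each answer with its character index in the context, which allows a quick consistency check: the answer should read back out of the context at that offset. A sketch assuming the (context, question, answers, indices) shape described above:

```python
def answers_align(context, answers, indices):
    """Check each answer string appears in the context at its stated offset."""
    return all(context[i:i + len(a)] == a for a, i in zip(answers, indices))

# Toy example, not a real SQuAD record:
context = "The Eiffel Tower is in Paris."
print(answers_align(context, ["Paris"], [23]))  # True
```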
SQuAD 2.0

torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))[source]
SQuAD2 Dataset
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 130319
dev: 11873
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from the SQuAD2 dataset, which consist of context, question, list of answers and corresponding index in context
- Return type
(str, str, list(str), list(int))

Unsupervised Learning

CC100

torchtext.datasets.CC100(root: str, language_code: str = 'en')[source]
CC100 Dataset
For additional details refer to https://data.statmt.org/cc-100/
EnWik9

torchtext.datasets.EnWik9(root: str)[source]
EnWik9 dataset
For additional details refer to http://mattmahoney.net/dc/textdata.html
Number of lines in dataset: 13147026
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
- Returns
DataPipe that yields raw text rows from the EnWik9 dataset
- Return type
str