torchtext.datasets¶
General use cases are as follows:
# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
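A minimal sketch (not part of the torchtext docs): these datasets are iterable DataPipes, so they can be wrapped in a torch.utils.data.DataLoader for batching. Only the collate function below is exercised here; it assumes each sample is a (label, text) tuple as yielded by IMDB.

```python
def collate(batch):
    """Split a batch of (label, text) pairs into parallel lists."""
    labels, texts = zip(*batch)
    return list(labels), list(texts)

def make_loader(batch_size=8):
    # Requires torch and torchtext; IMDB downloads the data on first use.
    from torch.utils.data import DataLoader
    from torchtext.datasets import IMDB
    return DataLoader(IMDB(split='train'), batch_size=batch_size,
                      collate_fn=collate)
```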
The following datasets are available:
Text Classification¶
AG_NEWS¶
torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
AG_NEWS Dataset
- Number of lines per split:
train: 120000
test: 7600
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 4) and text containing the news title and contents
- Return type
(int, str)
AmazonReviewFull¶
torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
AmazonReviewFull Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3000000
test: 650000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review title and text
- Return type
(int, str)
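A hedged sketch, not part of the torchtext API: collapsing the 1-to-5 star labels AmazonReviewFull yields into a binary sentiment target. The 1..5 encoding comes from the entry above; the neutral-review handling is an assumption.

```python
def to_binary_sentiment(star):
    """Map a 1-5 star label to 0 (negative) / 1 (positive); 3 stars -> None."""
    if star <= 2:
        return 0
    if star >= 4:
        return 1
    return None  # neutral reviews are often dropped in binary setups
```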
AmazonReviewPolarity¶
torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
AmazonReviewPolarity Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3600000
test: 400000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review title and text
- Return type
(int, str)
DBpedia¶
torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
DBpedia Dataset
For additional details refer to https://www.dbpedia.org/resources/latest-core/
- Number of lines per split:
train: 560000
test: 70000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 14) and text containing the news title and contents
- Return type
(int, str)
IMDb¶
torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
IMDB Dataset
For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/
- Number of lines per split:
train: 25000
test: 25000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the movie review
- Return type
(int, str)
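A hedged sketch: IMDB yields (label, line) pairs with label 1 or 2. The mapping below (1 -> negative, 2 -> positive) is an assumption about the encoding, chosen to produce the 0-based targets most loss functions expect.

```python
LABEL_TO_TARGET = {1: 0, 2: 1}  # assumed: 1 = negative, 2 = positive

def to_target(sample):
    """Convert a (label, text) IMDB sample to a (0/1 target, text) pair."""
    label, text = sample
    return LABEL_TO_TARGET[label], text
```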
SogouNews¶
torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
SogouNews Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 450000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the news title and contents
- Return type
(int, str)
SST2¶
torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))
SST2 Dataset
For additional details refer to https://nlp.stanford.edu/sentiment/
- Number of lines per split:
train: 67349
dev: 872
test: 1821
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and/or label (0 and 1). The test split only returns text.
- Return type
(str, int)
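A hedged sketch: per the entry above, SST2's train/dev splits yield (text, label) pairs while the test split yields only text. The adapter below normalizes both shapes; the exact shape of a test-split sample is an assumption.

```python
def normalize(sample):
    """Return (text, label), where label is None for unlabeled samples."""
    if isinstance(sample, tuple):
        if len(sample) == 2:
            return sample[0], sample[1]
        return sample[0], None  # assumed: test split may yield 1-tuples
    return sample, None  # assumed: or bare strings
```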
YahooAnswers¶
torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
YahooAnswers Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 1400000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 10) and text containing the question title, question content, and best answer
- Return type
(int, str)
YelpReviewFull¶
torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
YelpReviewFull Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 650000
test: 50000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review
- Return type
(int, str)
YelpReviewPolarity¶
torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
YelpReviewPolarity Dataset
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 560000
test: 38000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review
- Return type
(int, str)
Language Modeling¶
PennTreebank¶
torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
PennTreebank Dataset
For additional details refer to https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html
- Number of lines per split:
train: 42068
valid: 3370
test: 3761
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from the Treebank corpus
- Return type
str
WikiText-2¶
torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
WikiText2 Dataset
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 36718
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
str
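A hedged sketch: building a whitespace-token vocabulary from the raw text lines a language-modeling DataPipe such as WikiText2 yields (each sample is one string). The min_freq cutoff and the frequency-then-alphabetical ordering are illustrative choices, not part of the torchtext API.

```python
from collections import Counter

def build_vocab(lines, min_freq=1):
    """Map tokens to ids, most frequent first (ties broken alphabetically)."""
    counts = Counter(tok for line in lines for tok in line.split())
    kept = [(tok, c) for tok, c in counts.items() if c >= min_freq]
    kept.sort(key=lambda kv: (-kv[1], kv[0]))
    return {tok: idx for idx, (tok, _) in enumerate(kept)}
```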
WikiText103¶
torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
WikiText103 Dataset
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 1801350
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
str
Machine Translation¶
IWSLT2016¶
torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')
IWSLT2016 dataset
For additional details refer to https://wit3.fbk.eu/2016-01
The available datasets include the following:
Language pairs:

       “en”  “fr”  “de”  “cs”  “ar”
“en”          x     x     x     x
“fr”    x
“de”    x
“cs”    x
“ar”    x

valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
valid_set – a string to identify validation set.
test_set – a string to identify test set.
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
Examples
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
IWSLT2017¶
torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
IWSLT2017 dataset
For additional details refer to https://wit3.fbk.eu/2017-01
The available datasets include the following:
Language pairs:

       “en”  “nl”  “de”  “it”  “ro”
“en”          x     x     x     x
“nl”    x           x     x     x
“de”    x     x           x     x
“it”    x     x     x           x
“ro”    x     x     x     x
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
Examples
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
Multi30k¶
torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))
Multi30k dataset
For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1
- Number of lines per split:
train: 29000
valid: 1014
test: 1000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language. Available options are (‘de’,’en’) and (‘en’, ‘de’)
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
(str, str)
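A hedged sketch: each Multi30k sample is a (src, tgt) sentence pair, so a common preprocessing step is dropping pairs where either side exceeds a token budget. The max_len value and whitespace tokenization are illustrative choices.

```python
def filter_pairs(pairs, max_len=50):
    """Keep only (src, tgt) pairs with at most max_len tokens on each side."""
    return [(src, tgt) for src, tgt in pairs
            if len(src.split()) <= max_len and len(tgt.split()) <= max_len]
```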
Sequence Tagging¶
CoNLL2000Chunking¶
torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
CoNLL2000Chunking Dataset
For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/
- Number of lines per split:
train: 8936
test: 2012
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields list of words along with corresponding Parts-of-speech tag and chunk tag
- Return type
[list(str), list(str), list(str)]
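A hedged sketch: each CoNLL2000Chunking sample is assumed to be three parallel lists (words, POS tags, chunk tags), as described above; zipping them gives one (word, pos, chunk) triple per token.

```python
def to_token_rows(sample):
    """Convert a (words, pos_tags, chunk_tags) sample to per-token triples."""
    words, pos_tags, chunk_tags = sample
    return list(zip(words, pos_tags, chunk_tags))
```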
UDPOS¶
torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
UDPOS Dataset
- Number of lines per split:
train: 12543
valid: 2002
test: 2077
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields list of words along with corresponding parts-of-speech tags
- Return type
[list(str), list(str)]
Question Answer¶
SQuAD 1.0¶
torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))
SQuAD1 Dataset
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 87599
dev: 10570
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from the SQuAD1 dataset, each consisting of context, question, list of answers, and their corresponding indices in the context
- Return type
(str, str, list(str), list(int))
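A hedged sketch: each SQuAD1 sample is assumed to be (context, question, answers, answer_start_indices) per the entry above; the helper recovers each answer's (start, end) character span within the context.

```python
def answer_spans(sample):
    """Return (start, end) character offsets for each answer in the context."""
    context, question, answers, starts = sample
    return [(start, start + len(ans)) for ans, start in zip(answers, starts)]
```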
SQuAD 2.0¶
torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))
SQuAD2 Dataset
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 130319
dev: 11873
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from the SQuAD2 dataset, each consisting of context, question, list of answers, and their corresponding indices in the context
- Return type
(str, str, list(str), list(int))
Unsupervised Learning¶
CC100¶
torchtext.datasets.CC100(root: str, language_code: str = 'en')
CC100 Dataset
For additional details refer to https://data.statmt.org/cc-100/
- Parameters
root – Directory where the datasets are saved.
language_code – the language of the dataset. Default: en
- Returns
DataPipe that yields tuple of language code and text
- Return type
(str, str)
EnWik9¶
torchtext.datasets.EnWik9(root: str)
EnWik9 dataset
For additional details refer to http://mattmahoney.net/dc/textdata.html
Number of lines in dataset: 13147026
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
- Returns
DataPipe that yields raw text rows from the EnWik9 dataset
- Return type
str