torchtext.datasets
General use cases are as follows:
    # import datasets
    from torchtext.datasets import IMDB

    train_iter = IMDB(split='train')

    def tokenize(label, line):
        return line.split()

    tokens = []
    for label, line in train_iter:
        tokens += tokenize(label, line)
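The loop above gathers every token into a single list. A minimal sketch of the same pattern that tallies token frequencies instead, assuming only that the iterator yields (label, line) pairs as in the example. The `count_tokens` helper and the in-memory `samples` list are illustrative stand-ins, not part of torchtext.

```python
from collections import Counter

def count_tokens(pairs):
    """Tally whitespace-separated tokens over an iterable of (label, line) pairs."""
    counts = Counter()
    for _label, line in pairs:
        counts.update(line.split())
    return counts

# Stand-in for IMDB(split='train'); any iterable of (label, line) pairs works.
samples = [("pos", "a great great film"), ("neg", "a dull film")]
token_counts = count_tokens(samples)
print(token_counts.most_common(2))
```

Passing the dataset iterator directly avoids holding all tokens in memory at once, which matters for the larger corpora listed below.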
The following datasets are available:
Text Classification
AG_NEWS
torchtext.datasets.AG_NEWS(root='.data', split=('train', 'test'))
AG_NEWS dataset
Separately returns the train/test split
- Number of lines per split:
train: 120000
test: 7600
- Number of classes
4
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
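Because `split` accepts either a string or a tuple of strings, a call with one split name gives back one iterator (as with `IMDB(split='train')` in the general use case), while a tuple gives back one iterator per name. The sketch below mimics that calling convention over in-memory data. `select_splits` and `toy` are hypothetical names for illustration; this is not torchtext's implementation.

```python
def select_splits(available, split=("train", "test")):
    """Return one iterator for a string `split`, or a tuple of iterators for a tuple."""
    names = (split,) if isinstance(split, str) else tuple(split)
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError(f"unknown split(s): {unknown}")
    iterators = tuple(iter(available[name]) for name in names)
    return iterators[0] if isinstance(split, str) else iterators

# Hypothetical in-memory data standing in for a downloaded dataset.
toy = {"train": [(1, "first"), (2, "second")], "test": [(1, "third")]}

train_iter = select_splits(toy, split="train")   # single iterator
train_iter2, test_iter = select_splits(toy)      # one iterator per default split
```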
SogouNews
torchtext.datasets.SogouNews(root='.data', split=('train', 'test'))
SogouNews dataset
Separately returns the train/test split
- Number of lines per split:
train: 450000
test: 60000
- Number of classes
5
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
DBpedia
torchtext.datasets.DBpedia(root='.data', split=('train', 'test'))
DBpedia dataset
Separately returns the train/test split
- Number of lines per split:
train: 560000
test: 70000
- Number of classes
14
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
YelpReviewPolarity
torchtext.datasets.YelpReviewPolarity(root='.data', split=('train', 'test'))
YelpReviewPolarity dataset
Separately returns the train/test split
- Number of lines per split:
train: 560000
test: 38000
- Number of classes
2
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
YelpReviewFull
torchtext.datasets.YelpReviewFull(root='.data', split=('train', 'test'))
YelpReviewFull dataset
Separately returns the train/test split
- Number of lines per split:
train: 650000
test: 50000
- Number of classes
5
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
YahooAnswers
torchtext.datasets.YahooAnswers(root='.data', split=('train', 'test'))
YahooAnswers dataset
Separately returns the train/test split
- Number of lines per split:
train: 1400000
test: 60000
- Number of classes
10
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
AmazonReviewPolarity
torchtext.datasets.AmazonReviewPolarity(root='.data', split=('train', 'test'))
AmazonReviewPolarity dataset
Separately returns the train/test split
- Number of lines per split:
train: 3600000
test: 400000
- Number of classes
2
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
AmazonReviewFull
torchtext.datasets.AmazonReviewFull(root='.data', split=('train', 'test'))
AmazonReviewFull dataset
Separately returns the train/test split
- Number of lines per split:
train: 3000000
test: 650000
- Number of classes
5
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
IMDb
torchtext.datasets.IMDB(root='.data', split=('train', 'test'))
IMDB dataset
Separately returns the train/test split
- Number of lines per split:
train: 25000
test: 25000
- Number of classes
2
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
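Each text classification dataset above lists a number of classes, so a quick check of how the labels are distributed is a common first step. A minimal sketch, assuming the iterator yields (label, line) pairs as in the general use case. `label_distribution` and the `samples` list are illustrative and not part of torchtext.

```python
from collections import Counter

def label_distribution(pairs):
    """Return the fraction of samples per label for an iterable of (label, line) pairs."""
    counts = Counter(label for label, _line in pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Stand-in for a classification split such as AG_NEWS(split='train').
samples = [(1, "stocks rise"), (1, "markets fall"), (1, "rates hold"), (2, "team wins")]
distribution = label_distribution(samples)
```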
Language Modeling
WikiText-2
torchtext.datasets.WikiText2(root='.data', split=('train', 'valid', 'test'))
WikiText2 dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 36718
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
WikiText103
torchtext.datasets.WikiText103(root='.data', split=('train', 'valid', 'test'))
WikiText103 dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 1801350
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
PennTreebank
torchtext.datasets.PennTreebank(root='.data', split=('train', 'valid', 'test'))
PennTreebank dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 42068
valid: 3370
test: 3761
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
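Language modeling corpora are usually turned into one long stream of token ids before batching. A minimal sketch of that step, under the assumption (not stated on this page) that each item yielded by these datasets is one raw line of text. `build_vocab`, `encode`, and the `lines` list are illustrative names only.

```python
def build_vocab(lines):
    """Map each distinct whitespace token to an integer id, in order of first appearance."""
    vocab = {}
    for line in lines:
        for token in line.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(lines, vocab):
    """Flatten the lines into one list of token ids."""
    return [vocab[token] for line in lines for token in line.split()]

# Stand-in for list(WikiText2(split='train')); materialized because it is read twice.
lines = ["the cat sat", "", "the dog sat"]
vocab = build_vocab(lines)
ids = encode(lines, vocab)
```

An iterator can only be consumed once, so the lines are collected into a list here before the vocabulary pass and the encoding pass.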
Machine Translation
Multi30k
torchtext.datasets.Multi30k(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
Multi30k dataset
Reference: http://www.statmt.org/wmt16/multimodal-task.html#task1
- Parameters
root – Directory where the datasets are saved. Default: “.data”
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language. Available options are (‘de’, ‘en’) and (‘en’, ‘de’)
IWSLT2016
torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')
IWSLT2016 dataset
The available datasets include the following:
Language pairs (x marks an available pair):

        ‘en’  ‘fr’  ‘de’  ‘cs’  ‘ar’
  ‘en’         x     x     x     x
  ‘fr’   x
  ‘de’   x
  ‘cs’   x
  ‘ar’   x

valid/test sets: [‘dev2010’, ‘tst2010’, ‘tst2011’, ‘tst2012’, ‘tst2013’, ‘tst2014’]
For additional details refer to source website: https://wit3.fbk.eu/2016-01
- Parameters
root – Directory where the datasets are saved. Default: “.data”
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
valid_set – a string to identify validation set.
test_set – a string to identify test set.
Examples
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(train_iter)
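The language-pair table for IWSLT2016 pairs ‘en’ with each of the other four languages, in either direction, and nothing else. A small sketch that encodes this rule so a pair can be checked before it is passed as `language_pair`. The `is_iwslt2016_pair` helper is hypothetical and not provided by torchtext.

```python
IWSLT2016_LANGUAGES = ("en", "fr", "de", "cs", "ar")

def is_iwslt2016_pair(src, tgt):
    """True when exactly one side is 'en' and the other is one of fr/de/cs/ar."""
    others = set(IWSLT2016_LANGUAGES) - {"en"}
    return (src == "en" and tgt in others) or (tgt == "en" and src in others)

print(is_iwslt2016_pair("de", "en"))
print(is_iwslt2016_pair("fr", "de"))
```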
IWSLT2017
torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
IWSLT2017 dataset
The available datasets include the following:
Language pairs (x marks an available pair):

        ‘en’  ‘nl’  ‘de’  ‘it’  ‘ro’
  ‘en’         x     x     x     x
  ‘nl’   x           x     x     x
  ‘de’   x     x           x     x
  ‘it’   x     x     x           x
  ‘ro’   x     x     x     x

For additional details refer to source website: https://wit3.fbk.eu/2017-01
- Parameters
root – Directory where the datasets are saved. Default: “.data”
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
Examples
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(train_iter)
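For IWSLT2017 the table marks every ordered pair of two different languages among the five listed. A short sketch that enumerates those pairs, which can be handy for looping over all translation directions. The names `IWSLT2017_LANGUAGES` and `IWSLT2017_PAIRS` are illustrative and not part of torchtext.

```python
from itertools import permutations

IWSLT2017_LANGUAGES = ("en", "nl", "de", "it", "ro")

# Every ordered (src, tgt) pair of two distinct languages.
IWSLT2017_PAIRS = set(permutations(IWSLT2017_LANGUAGES, 2))

print(len(IWSLT2017_PAIRS))
```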
Sequence Tagging
UDPOS
torchtext.datasets.UDPOS(root='.data', split=('train', 'valid', 'test'))
UDPOS dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 12543
valid: 2002
test: 2077
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
CoNLL2000Chunking
torchtext.datasets.CoNLL2000Chunking(root='.data', split=('train', 'test'))
CoNLL2000Chunking dataset
Separately returns the train/test split
- Number of lines per split:
train: 8936
test: 2012
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘test’)
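For sequence tagging data, a tag inventory with counts is often needed before building a model. The sketch below assumes each sample is a list of parallel columns, with the tokens first and one list per tag type after them. That layout is an assumption about the raw iterator and is not stated on this page. `tag_counts` and the `samples` list are illustrative only.

```python
from collections import Counter

def tag_counts(samples, tag_column=1):
    """Count tags in the chosen column across samples shaped [tokens, tags, ...]."""
    counts = Counter()
    for columns in samples:
        counts.update(columns[tag_column])
    return counts

# Stand-in for a tagging split; each sample is [tokens, tags].
samples = [
    [["The", "dog", "barks"], ["DET", "NOUN", "VERB"]],
    [["A", "cat"], ["DET", "NOUN"]],
]
counts = tag_counts(samples)
```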
Question Answer
SQuAD 1.0
torchtext.datasets.SQuAD1(root='.data', split=('train', 'dev'))
SQuAD1 dataset
Separately returns the train/dev split
- Number of lines per split:
train: 87599
dev: 10570
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘dev’)
SQuAD 2.0
torchtext.datasets.SQuAD2(root='.data', split=('train', 'dev'))
SQuAD2 dataset
Separately returns the train/dev split
- Number of lines per split:
train: 130319
dev: 11873
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘dev’)
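Question answering samples pair answers with character offsets into a context passage, and a sanity check that the offsets line up is cheap to run. The sketch assumes a sample provides a context string, a list of answer strings, and a list of start offsets. That layout is an assumption and is not stated on this page. `answer_spans_match` is a hypothetical helper.

```python
def answer_spans_match(context, answers, answer_starts):
    """True when every answer text appears in the context at its start offset."""
    return all(
        context[start:start + len(answer)] == answer
        for answer, start in zip(answers, answer_starts)
    )

# Stand-in sample; the offset points at the first character of the answer.
context = "The sky is blue."
ok = answer_spans_match(context, ["blue"], [11])
bad = answer_spans_match(context, ["blue"], [4])
```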
Unsupervised Learning
EnWik9
torchtext.datasets.EnWik9(root='.data', split=('train', ))
EnWik9 dataset
Separately returns the train split
- Number of lines per split:
train: 13147026
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’,)