torchtext.datasets
General use cases are as follows:
# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
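The loop above gathers every token into one list. A common next step is to count token frequencies, sketched below. To stay runnable without downloading anything, `sample_iter` is a hypothetical in-memory stand-in for `IMDB(split='train')`, and the sketch assumes each item is a `(label, line)` pair, as in the loop above.

```python
from collections import Counter

# Hypothetical stand-in for IMDB(split='train'): an iterator of (label, line) pairs.
sample_iter = iter([
    ("pos", "a great great film"),
    ("neg", "a dull film"),
])

def tokenize(label, line):
    # Whitespace tokenization, as in the example above; the label is unused.
    return line.split()

counts = Counter()
for label, line in sample_iter:
    counts.update(tokenize(label, line))

print(counts.most_common(2))
```

Using a `Counter` instead of one long list keeps memory proportional to the vocabulary size rather than to the corpus size.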
The following datasets are available:
Datasets
Text Classification
AG_NEWS
torchtext.datasets.AG_NEWS(root='.data', split=('train', 'test'))
AG_NEWS dataset
Separately returns the train/test split
- Number of lines per split:
train: 120000
test: 7600
- Number of classes
4
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
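The `split` parameter accepts either one split name or a tuple of names. The sketch below illustrates that calling convention with a hypothetical `load_splits` helper backed by in-memory data. It is not torchtext's implementation. The behaviour it mimics (a string gives a single iterator, a tuple gives one iterator per name) is inferred from the `IMDB(split='train')` usage at the top of this page and the `train_iter, valid_iter, test_iter = IWSLT2016()` example further down.

```python
# Hypothetical in-memory data standing in for downloaded files.
_FAKE_DATA = {
    "train": [(1, "first training line"), (2, "second training line")],
    "test": [(1, "a test line")],
}

def load_splits(split=("train", "test")):
    """Return one iterator for a string, or a tuple of iterators for a tuple."""
    names = (split,) if isinstance(split, str) else tuple(split)
    unknown = [name for name in names if name not in _FAKE_DATA]
    if unknown:
        raise ValueError(f"unknown split(s): {unknown}")
    iterators = tuple(iter(_FAKE_DATA[name]) for name in names)
    return iterators[0] if isinstance(split, str) else iterators

# A single name returns a single iterator.
train_iter = load_splits(split="train")

# The default tuple returns one iterator per split, in the order given.
train_iter_again, test_iter = load_splits()
```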
SogouNews
torchtext.datasets.SogouNews(root='.data', split=('train', 'test'))
SogouNews dataset
Separately returns the train/test split
- Number of lines per split:
train: 450000
test: 60000
- Number of classes
5
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
DBpedia
torchtext.datasets.DBpedia(root='.data', split=('train', 'test'))
DBpedia dataset
Separately returns the train/test split
- Number of lines per split:
train: 560000
test: 70000
- Number of classes
14
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
YelpReviewPolarity
torchtext.datasets.YelpReviewPolarity(root='.data', split=('train', 'test'))
YelpReviewPolarity dataset
Separately returns the train/test split
- Number of lines per split:
train: 560000
test: 38000
- Number of classes
2
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
YelpReviewFull
torchtext.datasets.YelpReviewFull(root='.data', split=('train', 'test'))
YelpReviewFull dataset
Separately returns the train/test split
- Number of lines per split:
train: 650000
test: 50000
- Number of classes
5
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
YahooAnswers
torchtext.datasets.YahooAnswers(root='.data', split=('train', 'test'))
YahooAnswers dataset
Separately returns the train/test split
- Number of lines per split:
train: 1400000
test: 60000
- Number of classes
10
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
AmazonReviewPolarity
torchtext.datasets.AmazonReviewPolarity(root='.data', split=('train', 'test'))
AmazonReviewPolarity dataset
Separately returns the train/test split
- Number of lines per split:
train: 3600000
test: 400000
- Number of classes
2
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
AmazonReviewFull
torchtext.datasets.AmazonReviewFull(root='.data', split=('train', 'test'))
AmazonReviewFull dataset
Separately returns the train/test split
- Number of lines per split:
train: 3000000
test: 650000
- Number of classes
5
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
IMDb
torchtext.datasets.IMDB(root='.data', split=('train', 'test'))
IMDB dataset
Separately returns the train/test split
- Number of lines per split:
train: 25000
test: 25000
- Number of classes
2
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
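Each text classification dataset above lists a fixed number of classes. Models usually expect class indices that start at zero, so a typical preprocessing step maps the raw labels to contiguous indices. The sketch below assumes each item is a `(label, line)` pair, as in the usage example at the top of this page. `build_label_index` and the sample data are hypothetical, and the actual label values depend on the dataset.

```python
def build_label_index(pairs):
    """Map each distinct raw label to a zero-based class index."""
    # Sort by string form so integer and string labels are both handled.
    labels = sorted({label for label, _ in pairs}, key=str)
    return {label: index for index, label in enumerate(labels)}

# Hypothetical samples; real labels depend on the dataset.
samples = [
    (3, "stocks rally"),
    (1, "election results"),
    (3, "markets dip"),
    (4, "new chip announced"),
]

label_index = build_label_index(samples)
encoded = [(label_index[label], line) for label, line in samples]
```

Because the dataset objects are iterators, materialize the items in a list (as `samples` is here) before making more than one pass over them.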
Language Modeling
WikiText-2
torchtext.datasets.WikiText2(root='.data', split=('train', 'valid', 'test'))
WikiText2 dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 36718
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
WikiText-103
torchtext.datasets.WikiText103(root='.data', split=('train', 'valid', 'test'))
WikiText103 dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 1801350
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
PennTreebank
torchtext.datasets.PennTreebank(root='.data', split=('train', 'valid', 'test'))
PennTreebank dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 42068
valid: 3370
test: 3761
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
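The language modeling datasets above are measured in lines. Assuming each item yielded by these iterators is one line of raw text (an assumption, since this page does not state the item type), a common preparation step is to flatten the lines into a single token stream and cut it into fixed-length chunks. The sketch below uses hypothetical in-memory lines and plain whitespace tokenization. `chunk_token_stream` is an illustrative helper, not a torchtext function.

```python
def chunk_token_stream(lines, chunk_len):
    """Flatten lines into one token list and split it into full-length chunks."""
    stream = [token for line in lines for token in line.split()]
    n_chunks = len(stream) // chunk_len  # the incomplete tail is dropped
    return [stream[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# Hypothetical stand-in for a language modeling split; empty lines are harmless.
lines = ["the cat sat", "", "on the mat today"]
chunks = chunk_token_stream(lines, chunk_len=3)
```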
Machine Translation
IWSLT2016
torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')
IWSLT2016 dataset
The available datasets include the following:
Language pairs (x marks an available pair, - an unavailable one):
        'en'  'fr'  'de'  'cs'  'ar'
'en'    -     x     x     x     x
'fr'    x     -     -     -     -
'de'    x     -     -     -     -
'cs'    x     -     -     -     -
'ar'    x     -     -     -     -
valid/test sets: ['dev2010', 'tst2010', 'tst2011', 'tst2012', 'tst2013', 'tst2014']
For additional details refer to source website: https://wit3.fbk.eu/2016-01
- Parameters
root – Directory where the datasets are saved. Default: '.data'
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
language_pair – tuple or list containing src and tgt language
valid_set – a string to identify validation set.
test_set – a string to identify test set.
Examples
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(train_iter)
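The IWSLT2016 language-pair table can be encoded as a small lookup, so that an unsupported `language_pair` is rejected before any download starts. This is only a sketch: `is_supported_2016` is a hypothetical helper, not part of torchtext, and it relies on the table's pattern that every available pair combines 'en' with one of the other four languages.

```python
NON_ENGLISH_2016 = {"fr", "de", "cs", "ar"}

def is_supported_2016(language_pair):
    """Return True if the pair combines 'en' with one of the other languages."""
    src, tgt = language_pair
    return (src == "en" and tgt in NON_ENGLISH_2016) or (
        tgt == "en" and src in NON_ENGLISH_2016
    )

# The documented default pair is available; a pair without 'en' is not.
print(is_supported_2016(("de", "en")))
print(is_supported_2016(("fr", "de")))
```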
IWSLT2017
torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
IWSLT2017 dataset
The available datasets include the following:
Language pairs (x marks an available pair, - an unavailable one):
        'en'  'nl'  'de'  'it'  'ro'
'en'    -     x     x     x     x
'nl'    x     -     x     x     x
'de'    x     x     -     x     x
'it'    x     x     x     -     x
'ro'    x     x     x     x     -
For additional details refer to source website: https://wit3.fbk.eu/2017-01
- Parameters
root – Directory where the datasets are saved. Default: '.data'
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
language_pair – tuple or list containing src and tgt language
Examples
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(train_iter)
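In the IWSLT2017 table each language is paired with every other language, so the set of valid `language_pair` values can be generated rather than listed by hand. The sketch below assumes the table is read that way. `SUPPORTED_PAIRS_2017` is an illustrative name, not a torchtext constant.

```python
from itertools import permutations

LANGUAGES_2017 = ("en", "nl", "de", "it", "ro")

# Every ordered (src, tgt) pair of two distinct languages.
SUPPORTED_PAIRS_2017 = set(permutations(LANGUAGES_2017, 2))

print(len(SUPPORTED_PAIRS_2017))
print(("de", "en") in SUPPORTED_PAIRS_2017)
```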
Sequence Tagging
UDPOS
torchtext.datasets.UDPOS(root='.data', split=('train', 'valid', 'test'))
UDPOS dataset
Separately returns the train/valid/test split
- Number of lines per split:
train: 12543
valid: 2002
test: 2077
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'valid', 'test')
CoNLL2000Chunking
torchtext.datasets.CoNLL2000Chunking(root='.data', split=('train', 'test'))
CoNLL2000Chunking dataset
Separately returns the train/test split
- Number of lines per split:
train: 8936
test: 2012
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'test')
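This page does not describe the items yielded by the sequence tagging datasets. The sketch below assumes, purely for illustration, that each item is a list of parallel columns with the tokens first and one or more tag sequences after them. Under that assumption, a quick sanity check is to confirm that tokens and tags line up and to count how often each tag occurs. `tag_counts` and the sample data are hypothetical.

```python
from collections import Counter

def tag_counts(examples, tag_column=1):
    """Count tags, checking that each tag column aligns with its token column."""
    counts = Counter()
    for columns in examples:
        tokens, tags = columns[0], columns[tag_column]
        if len(tokens) != len(tags):
            raise ValueError("token and tag columns differ in length")
        counts.update(tags)
    return counts

# Hypothetical examples in the assumed [tokens, tags] layout.
examples = [
    [["He", "runs"], ["PRON", "VERB"]],
    [["She", "sleeps", "."], ["PRON", "VERB", "PUNCT"]],
]
pos_counts = tag_counts(examples)
```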
Question Answering
SQuAD 1.0
torchtext.datasets.SQuAD1(root='.data', split=('train', 'dev'))
SQuAD1 dataset
Separately returns the train/dev split
- Number of lines per split:
train: 87599
dev: 10570
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'dev')
SQuAD 2.0
torchtext.datasets.SQuAD2(root='.data', split=('train', 'dev'))
SQuAD2 dataset
Separately returns the train/dev split
- Number of lines per split:
train: 130319
dev: 11873
- Parameters
root – Directory where the datasets are saved. Default: .data
split – split or splits to be returned. Can be a string or tuple of strings. Default: ('train', 'dev')
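This page does not state the item layout for the SQuAD datasets. The sketch below assumes each item carries a context string, a question, a list of answer strings, and a list of character offsets into the context. Under that assumption, each answer can be turned into a `(start, end)` character span and checked against the context. `answer_spans` and the sample data are hypothetical.

```python
def answer_spans(context, answers, answer_starts):
    """Return (start, end) character spans, validating each against the context."""
    spans = []
    for text, start in zip(answers, answer_starts):
        end = start + len(text)
        if context[start:end] != text:
            raise ValueError(f"answer {text!r} not found at offset {start}")
        spans.append((start, end))
    return spans

# Hypothetical item in the assumed (context, question, answers, answer_starts) layout.
context = "The cat sat on the mat."
question = "Where did the cat sit?"
answers = ["the mat"]
answer_starts = [15]

spans = answer_spans(context, answers, answer_starts)
```

Validating offsets up front catches misaligned answers before they reach a model that predicts start and end positions.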