torchtext.datasets
Warning
The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status. This means that the API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata.
Here are a few recommendations regarding the use of datapipes:
- To shuffle the datapipe, do so in the DataLoader: DataLoader(dp, shuffle=True). You do not need to call dp.shuffle(), because torchtext has already done that for you. Note, however, that the datapipe won't be shuffled unless you explicitly pass shuffle=True to the DataLoader.
- When using multi-processing (num_workers=N), use the built-in worker_init_fn:

      from torch.utils.data.backward_compatibility import worker_init_fn
      DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True)

  This will ensure that data isn't duplicated across workers.
- We also recommend using drop_last=True. Without this, the batch sizes at the end of an epoch may be very small in some cases (smaller than with other map-style datasets). This might affect accuracy greatly, especially when batch-norm is used. drop_last=True ensures that all batch sizes are equal.
- Distributed training with DistributedDataParallel is not yet entirely stable / supported, and we don't recommend it at this point. It will be better supported in DataLoaderV2. If you still wish to use DDP, make sure that:
  - All workers (DDP workers and DataLoader workers) see a different part of the data. The datasets are already wrapped inside ShardingFilter, and you may need to call dp.apply_sharding(num_shards, shard_id) in order to shard the data across ranks (DDP workers) and DataLoader workers. One way to do this is to create a worker_init_fn that calls apply_sharding with the appropriate number of shards (DDP workers * DataLoader workers) and shard ID (inferred from the rank and the worker ID of the corresponding DataLoader within that rank). Note, however, that this assumes an equal number of DataLoader workers for all ranks.
  - All DDP workers work on the same number of batches. One way to do this is to limit the size of the datapipe within each worker to len(datapipe) // num_ddp_workers, but this might not suit all use-cases.
  - The shuffling seed is the same across all workers. You might need to call torch.utils.data.graph_settings.apply_shuffle_seed(dp, rng).
  - The shuffling seed is different across epochs.
  - The rest of the RNG (typically used for transformations) is different across workers, for maximal entropy and optimal accuracy.
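The shard numbering described for DDP can be sketched in plain Python. This is only an illustration under stated assumptions: `global_shard` is a hypothetical helper (not part of torchtext or torchdata), and it assumes every rank runs the same number of DataLoader workers. The pair it returns is what a custom worker_init_fn would pass to dp.apply_sharding(num_shards, shard_id).

```python
def global_shard(rank, world_size, worker_id, num_workers):
    """Return (num_shards, shard_id) for one DataLoader worker of one DDP rank.

    Hypothetical helper: assumes every rank uses the same num_workers.
    """
    if not (0 <= rank < world_size and 0 <= worker_id < num_workers):
        raise ValueError("rank or worker_id out of range")
    # Total shards = DDP workers * DataLoader workers.
    num_shards = world_size * num_workers
    # Workers of rank 0 take the first ids, workers of rank 1 the next, etc.
    shard_id = rank * num_workers + worker_id
    return num_shards, shard_id


# Two DDP ranks with three DataLoader workers each give six distinct shards.
shards = [global_shard(r, 2, w, 3) for r in range(2) for w in range(3)]
shard_ids = [shard_id for _, shard_id in shards]
```

Inside a real worker_init_fn, the rank and world size would typically come from torch.distributed.get_rank() and torch.distributed.get_world_size(), and the worker ID from torch.utils.data.get_worker_info().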
General use cases are as follows:
# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
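The same iteration pattern can feed a token frequency count. The sketch below uses a small in-memory list as a stand-in for a datapipe such as IMDB(split='train'), so it runs without downloading anything; the sample rows are invented for illustration.

```python
from collections import Counter

# Stand-in for a datapipe: any iterable of (label, text) pairs works the same way.
samples = [
    (1, "a dull and slow film"),
    (2, "a bright and funny film"),
]

counts = Counter()
for label, line in samples:
    # Lowercase and split on whitespace, then add the tokens to the tally.
    counts.update(line.lower().split())

vocabulary = sorted(counts)
```

Replacing `samples` with a real datapipe leaves the loop unchanged, since datapipes are consumed by plain iteration.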
The following datasets are available:
Text Classification
AG_NEWS
torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
AG_NEWS Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://paperswithcode.com/dataset/ag-news
- Number of lines per split:
train: 120000
test: 7600
AmazonReviewFull
torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
AmazonReviewFull Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3000000
test: 650000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review title and text
- Return type
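Because the labels here start at 1, code that expects zero-based class indices needs a shift. A minimal sketch, assuming rows shaped like the (label, text) tuples described above; `to_zero_based` is a hypothetical helper and the sample rows are invented.

```python
def to_zero_based(rows, num_classes):
    """Yield (label - 1, text), checking that each label lies in 1..num_classes."""
    for label, text in rows:
        if not 1 <= label <= num_classes:
            raise ValueError(f"unexpected label {label!r}")
        yield label - 1, text


# Invented sample rows in the (label, text) shape.
rows = [(5, "Great value. Works as described."), (1, "Broke quickly. Would not buy again.")]
shifted = list(to_zero_based(rows, num_classes=5))
```

The same shift applies to the other datasets in this section whose labels are documented as starting at 1.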
AmazonReviewPolarity
torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
AmazonReviewPolarity Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 3600000
test: 400000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review title and text
- Return type
CoLA
torchtext.datasets.CoLA(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev', 'test'))
CoLA dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://nyu-mll.github.io/CoLA/
- Number of lines per split:
train: 8551
dev: 527
test: 516
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields rows from CoLA dataset (source (str), label (int), sentence (str))
- Return type
DBpedia
torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
DBpedia Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://www.dbpedia.org/resources/latest-core/
- Number of lines per split:
train: 560000
test: 70000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 14) and text containing the news title and contents
- Return type
IMDb
torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
IMDB Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/
- Number of lines per split:
train: 25000
test: 25000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the movie review
- Return type
MNLI
torchtext.datasets.MNLI(root='.data', split=('train', 'dev_matched', 'dev_mismatched'))
MNLI Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://cims.nyu.edu/~sbowman/multinli/
- Number of lines per split:
train: 392702
dev_matched: 9815
dev_mismatched: 9832
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev_matched, dev_mismatched)
- Returns
DataPipe that yields tuple of text and label (0 to 2).
- Return type
MRPC
torchtext.datasets.MRPC(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
MRPC Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://www.microsoft.com/en-us/download/details.aspx?id=52398
- Number of lines per split:
train: 4076
test: 1725
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields data points from MRPC dataset which consist of label, sentence1, sentence2
- Return type
QNLI
torchtext.datasets.QNLI(root='.data', split=('train', 'dev', 'test'))
QNLI Dataset
For additional details refer to https://arxiv.org/pdf/1804.07461.pdf (from GLUE paper)
- Number of lines per split:
train: 104743
dev: 5463
test: 5463
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and label (0 and 1).
- Return type
QQP
torchtext.datasets.QQP(root: str)
QQP dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs
RTE
torchtext.datasets.RTE(root='.data', split=('train', 'dev', 'test'))
RTE Dataset
For additional details refer to https://aclweb.org/aclwiki/Recognizing_Textual_Entailment
- Number of lines per split:
train: 2490
dev: 277
test: 3000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and/or label (0 and 1). The test split only returns text.
- Return type
SogouNews
torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
SogouNews Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 450000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the news title and contents
- Return type
(int, str)
SST2
torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))
SST2 Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://nlp.stanford.edu/sentiment/
- Number of lines per split:
train: 67349
dev: 872
test: 1821
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and/or label (0 to 1). The test split only returns text.
- Return type
STSB
torchtext.datasets.STSB(root='.data', split=('train', 'dev', 'test'))
STSB Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark
- Number of lines per split:
train: 5749
dev: 1500
test: 1379
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of (index (int), label (float), sentence1 (str), sentence2 (str))
- Return type
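Since each STSB row carries a float similarity score alongside the sentence pair, a common preprocessing step is to unpack the tuple and rescale the score. The sketch below assumes the (index, label, sentence1, sentence2) layout given above and a score range of 0 to 5; `normalize_row` is a hypothetical helper and the sample row is invented.

```python
def normalize_row(row, max_score=5.0):
    """Split one row into a sentence pair and a score rescaled to [0, 1]."""
    index, label, sentence1, sentence2 = row
    if not 0.0 <= label <= max_score:
        raise ValueError(f"score {label!r} outside 0..{max_score}")
    return (sentence1, sentence2), label / max_score


# Invented sample row in the documented tuple layout.
row = (0, 2.5, "A man is playing a guitar.", "A person plays an instrument.")
pair, target = normalize_row(row)
```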
WNLI
torchtext.datasets.WNLI(root='.data', split=('train', 'dev', 'test'))
WNLI Dataset
For additional details refer to https://arxiv.org/pdf/1804.07461v3.pdf
- Number of lines per split:
train: 635
dev: 71
test: 146
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)
- Returns
DataPipe that yields tuple of text and/or label (0 to 1). The test split only returns text.
- Return type
YahooAnswers
torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
YahooAnswers Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 1400000
test: 60000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 10) and text containing the question title, question content, and best answer
- Return type
YelpReviewFull
torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
YelpReviewFull Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 650000
test: 50000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 5) and text containing the review
- Return type
YelpReviewPolarity
torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
YelpReviewPolarity Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://arxiv.org/abs/1509.01626
- Number of lines per split:
train: 560000
test: 38000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields tuple of label (1 to 2) and text containing the review
- Return type
Language Modeling
PennTreebank
torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
PennTreebank Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html
- Number of lines per split:
train: 42068
valid: 3370
test: 3761
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from the Treebank corpus
- Return type
WikiText-2
torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
WikiText2 Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 36718
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
WikiText103
torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
WikiText103 Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
- Number of lines per split:
train: 1801350
valid: 3760
test: 4358
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields text from Wikipedia articles
- Return type
Machine Translation
IWSLT2016
torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')
IWSLT2016 dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://wit3.fbk.eu/2016-01
The available datasets include the following:
Language pairs:

        “en”  “fr”  “de”  “cs”  “ar”
“en”          x     x     x     x
“fr”    x
“de”    x
“cs”    x
“ar”    x

valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
valid_set – a string to identify validation set.
test_set – a string to identify test set.
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
Examples
>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
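The language-pair table for IWSLT2016 pairs English with each of the other four languages, in either direction. A small pre-check can catch an unsupported pair before any download starts. This is a sketch only: `is_supported_pair` is a hypothetical helper that mirrors the table, not a torchtext function.

```python
# Languages that the table pairs with English, in either direction.
NON_ENGLISH = ("fr", "de", "cs", "ar")


def is_supported_pair(language_pair):
    """Return True if (src, tgt) is marked in the IWSLT2016 language-pair table."""
    src, tgt = language_pair
    return (src == "en" and tgt in NON_ENGLISH) or (tgt == "en" and src in NON_ENGLISH)


default_ok = is_supported_pair(("de", "en"))   # the documented default pair
cross_ok = is_supported_pair(("fr", "de"))     # neither side is English
```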
IWSLT2017
torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))
IWSLT2017 dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://wit3.fbk.eu/2017-01
The available datasets include the following:
Language pairs:

        “en”  “nl”  “de”  “it”  “ro”
“en”          x     x     x     x
“nl”    x           x     x     x
“de”    x     x           x     x
“it”    x     x     x           x
“ro”    x     x     x     x

- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
Examples
>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter))
Multi30k
torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))
Multi30k dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1
- Number of lines per split:
train: 29000
valid: 1014
test: 1000
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)
language_pair – tuple or list containing src and tgt language. Available options are (‘de’,’en’) and (‘en’, ‘de’)
- Returns
DataPipe that yields tuple of source and target sentences
- Return type
Sequence Tagging
CoNLL2000Chunking
torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))
CoNLL2000Chunking Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/
- Number of lines per split:
train: 8936
test: 2012
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)
- Returns
DataPipe that yields list of words along with corresponding Parts-of-speech tag and chunk tag
- Return type
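To work with one tagged sentence token by token, the parallel lists can be zipped together. The sketch below assumes each item is a list of three equal-length lists in the order words, part-of-speech tags, chunk tags; that layout is an assumption drawn from the description above, and the sample sentence is invented.

```python
# Invented sample item: [words, pos_tags, chunk_tags], all the same length.
sample = [
    ["He", "reckons", "the", "deficit"],
    ["PRP", "VBZ", "DT", "NN"],
    ["B-NP", "B-VP", "B-NP", "I-NP"],
]

words, pos_tags, chunk_tags = sample
if not len(words) == len(pos_tags) == len(chunk_tags):
    raise ValueError("tag lists must align with the word list")

# One (word, pos, chunk) triple per token.
triples = list(zip(words, pos_tags, chunk_tags))

# Words that begin a noun phrase chunk.
np_starts = [word for word, _, chunk in triples if chunk == "B-NP"]
```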
UDPOS
torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))
UDPOS Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
- Number of lines per split:
train: 12543
valid: 2002
test: 2077
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)
- Returns
DataPipe that yields list of words along with corresponding parts-of-speech tags
- Return type
Question Answer
SQuAD 1.0
torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))
SQuAD1 Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 87599
dev: 10570
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from SQuAD1 dataset which consist of context, question, list of answers and corresponding index in context
- Return type
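The answer indices make it possible to locate each answer inside the context string. The sketch below assumes each data point unpacks as (context, question, answers, answer_starts) with character offsets; that ordering is an assumption based on the description above, `answer_spans` is a hypothetical helper, and the sample data is invented.

```python
def answer_spans(context, answers, answer_starts):
    """Return (start, end) character spans, checking each against the context."""
    spans = []
    for answer, start in zip(answers, answer_starts):
        end = start + len(answer)
        if context[start:end] != answer:
            raise ValueError(f"answer {answer!r} not found at offset {start}")
        spans.append((start, end))
    return spans


# Invented sample data point.
context = "The Amazon rainforest covers much of the Amazon basin."
question = "What does the Amazon rainforest cover?"
answers = ["much of the Amazon basin"]
answer_starts = [29]

spans = answer_spans(context, answers, answer_starts)
```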
SQuAD 2.0
torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))
SQuAD2 Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/
- Number of lines per split:
train: 130319
dev: 11873
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)
- Returns
DataPipe that yields data points from SQuAD2 dataset which consist of context, question, list of answers and corresponding index in context
- Return type
Unsupervised Learning
CC100
torchtext.datasets.CC100(root: str, language_code: str = 'en')
CC100 Dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to https://data.statmt.org/cc-100/
EnWik9
torchtext.datasets.EnWik9(root: str)
EnWik9 dataset
Warning
Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.
For additional details refer to http://mattmahoney.net/dc/textdata.html
Number of lines in dataset: 13147026
- Parameters
root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)
- Returns
DataPipe that yields raw text rows from EnWik9 dataset
- Return type