torchtext.datasets

Warning

The datasets supported by torchtext are datapipes from the torchdata project, which is still in Beta status. This means that the API is subject to change without deprecation cycles. In particular, we expect a lot of the current idioms to change with the eventual release of DataLoaderV2 from torchdata.

Here are a few recommendations regarding the use of datapipes:

  • To shuffle the datapipe, do so in the DataLoader: DataLoader(dp, shuffle=True). You do not need to call dp.shuffle(), because torchtext has already done that for you. Note, however, that the datapipe won’t be shuffled unless you explicitly pass shuffle=True to the DataLoader.

  • When using multi-processing (num_workers=N), use the builtin worker_init_fn:

    from torch.utils.data.backward_compatibility import worker_init_fn
    DataLoader(dp, num_workers=4, worker_init_fn=worker_init_fn, drop_last=True)
    

    This will ensure that data isn’t duplicated across workers.

  • We also recommend using drop_last=True. Without this, the batch sizes at the end of an epoch may be very small in some cases (smaller than with other map-style datasets). This might greatly affect accuracy, especially when batch-norm is used. drop_last=True ensures that all batch sizes are equal.

  • Distributed training with DistributedDataParallel is not yet entirely stable / supported, and we don’t recommend it at this point. It will be better supported in DataLoaderV2. If you still wish to use DDP, make sure that:

    • All workers (DDP workers and DataLoader workers) see a different part of the data. The datasets are already wrapped inside ShardingFilter, and you may need to call dp.apply_sharding(num_shards, shard_id) in order to shard the data across ranks (DDP workers) and DataLoader workers. One way to do this is to create a worker_init_fn that calls apply_sharding with the appropriate number of shards (DDP workers * DataLoader workers) and shard id (inferred from the rank and the worker ID of the corresponding DataLoader within that rank). Note, however, that this assumes an equal number of DataLoader workers for all the ranks.

    • All DDP workers work on the same number of batches. One way to do this is to limit the size of the datapipe within each worker to len(datapipe) // num_ddp_workers, but this might not suit all use-cases.

    • The shuffling seed is the same across all workers. You might need to call torch.utils.data.graph_settings.apply_shuffle_seed(dp, rng).

    • The shuffling seed is different across epochs.

    • The rest of the RNG (typically used for transformations) is different across workers, for maximal entropy and optimal accuracy.
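The sharding arithmetic described in the DDP recommendations can be sketched as follows. This is a minimal illustration, assuming every rank runs the same number of DataLoader workers. The helper names global_shard and shard_datapipe are hypothetical; only apply_sharding(num_shards, shard_id) comes from the text above. Inside a real worker_init_fn, the worker ID and worker count would typically come from torch.utils.data.get_worker_info(), and the rank and world size from torch.distributed.

```python
def global_shard(rank, world_size, worker_id, num_workers):
    """Return (num_shards, shard_id) for one DataLoader worker on one DDP rank.

    Assumes every rank uses the same number of DataLoader workers.
    """
    num_shards = world_size * num_workers
    shard_id = rank * num_workers + worker_id
    return num_shards, shard_id


def shard_datapipe(dp, rank, world_size, worker_id, num_workers):
    """Hypothetical helper: apply the computed sharding to a datapipe."""
    num_shards, shard_id = global_shard(rank, world_size, worker_id, num_workers)
    dp.apply_sharding(num_shards, shard_id)
    return dp


# With 2 ranks and 4 workers per rank, the 8 shard ids are all distinct.
ids = [global_shard(r, 2, w, 4)[1] for r in range(2) for w in range(4)]
```

Numbering shards as rank * num_workers + worker_id keeps the workers of one rank on consecutive shard ids, which only yields a valid partition when num_workers is identical on every rank.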

General use cases are as follows:

# import datasets
from torchtext.datasets import IMDB

train_iter = IMDB(split='train')

def tokenize(label, line):
    return line.split()

tokens = []
for label, line in train_iter:
    tokens += tokenize(label, line)
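
To illustrate one thing that can be done with the tokens, the following sketch counts token frequencies. It uses a small in-memory list, sample_iter, as a stand-in for train_iter so that it runs without downloading IMDB; the sample rows and the name count_tokens are made up for illustration.

```python
from collections import Counter


def tokenize(label, line):
    return line.split()


def count_tokens(data_iter):
    """Count whitespace-separated tokens over (label, line) pairs."""
    counts = Counter()
    for label, line in data_iter:
        counts.update(tokenize(label, line))
    return counts


# Stand-in for train_iter: (label, text) pairs with made-up content.
sample_iter = [(1, "a dull film"), (2, "a great film")]
counts = count_tokens(sample_iter)
```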

The following datasets are currently available. If you would like to contribute new datasets to the repo or work with your own custom datasets, please refer to the CONTRIBUTING_DATASETS.md guide.

Text Classification

AG_NEWS

torchtext.datasets.AG_NEWS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

AG_NEWS Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://paperswithcode.com/dataset/ag-news

Number of lines per split:
  • train: 120000

  • test: 7600

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 4) and text

Return type:

(int, str)
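
Because the labels run from 1 to 4, they need shifting to 0 to 3 before being passed to a loss that expects zero-based class indices, such as PyTorch's cross-entropy loss. A minimal sketch, using made-up rows in place of the real datapipe; to_zero_based is a hypothetical helper, not part of torchtext.

```python
def to_zero_based(rows):
    """Shift 1-based labels in (label, text) rows to 0-based class indices."""
    for label, text in rows:
        yield label - 1, text


# Made-up rows standing in for items yielded by AG_NEWS(split='train').
rows = [(3, "Stocks rally on earnings"), (1, "Talks resume after ceasefire")]
shifted = list(to_zero_based(rows))
```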

AmazonReviewFull

torchtext.datasets.AmazonReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

AmazonReviewFull Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:
  • train: 3000000

  • test: 650000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 5) and text containing the review title and text

Return type:

(int, str)

AmazonReviewPolarity

torchtext.datasets.AmazonReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

AmazonReviewPolarity Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:
  • train: 3600000

  • test: 400000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 2) and text containing the review title and text

Return type:

(int, str)

CoLA

torchtext.datasets.CoLA(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev', 'test'))[source]

CoLA dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://nyu-mll.github.io/CoLA/

Number of lines per split:
  • train: 8551

  • dev: 527

  • test: 516

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields rows from CoLA dataset (source (str), label (int), sentence (str))

Return type:

(str, int, str)

DBpedia

torchtext.datasets.DBpedia(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

DBpedia Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://www.dbpedia.org/resources/latest-core/

Number of lines per split:
  • train: 560000

  • test: 70000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 14) and text containing the news title and contents

Return type:

(int, str)

IMDb

torchtext.datasets.IMDB(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

IMDB Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to http://ai.stanford.edu/~amaas/data/sentiment/

Number of lines per split:
  • train: 25000

  • test: 25000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 2) and text containing the movie review

Return type:

(int, str)

Tutorials using IMDB:
T5-Base Model for Summarization, Sentiment Classification, and Translation


MNLI

torchtext.datasets.MNLI(root='.data', split=('train', 'dev_matched', 'dev_mismatched'))[source]

MNLI Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://cims.nyu.edu/~sbowman/multinli/

Number of lines per split:
  • train: 392702

  • dev_matched: 9815

  • dev_mismatched: 9832

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev_matched, dev_mismatched)

Returns:

DataPipe that yields tuple of text and label (0 to 2).

Return type:

Tuple[int, str, str]

MRPC

torchtext.datasets.MRPC(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

MRPC Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://www.microsoft.com/en-us/download/details.aspx?id=52398

Number of lines per split:
  • train: 4076

  • test: 1725

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields data points from MRPC dataset which consist of label, sentence1, sentence2

Return type:

(int, str, str)

QNLI

torchtext.datasets.QNLI(root='.data', split=('train', 'dev', 'test'))[source]

QNLI Dataset

For additional details refer to https://arxiv.org/pdf/1804.07461.pdf (from GLUE paper)

Number of lines per split:
  • train: 104743

  • dev: 5463

  • test: 5463

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and label (0 and 1).

Return type:

(int, str, str)

QQP

torchtext.datasets.QQP(root: str)[source]

QQP dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

Returns:

DataPipe that yields rows from QQP dataset (label (int), question1 (str), question2 (str))

Return type:

(int, str, str)

RTE

torchtext.datasets.RTE(root='.data', split=('train', 'dev', 'test'))[source]

RTE Dataset

For additional details refer to https://aclweb.org/aclwiki/Recognizing_Textual_Entailment

Number of lines per split:
  • train: 2490

  • dev: 277

  • test: 3000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and/or label (0 and 1). The test split only returns text.

Return type:

Union[(int, str, str), (str, str)]
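
Since the test split yields rows without a label, code that handles several splits has to accept both shapes. A minimal sketch, assuming labelled rows are 3-tuples and unlabelled rows are 2-tuples, as the return type above suggests; split_row and the sample rows are hypothetical.

```python
def split_row(row):
    """Return (label, sentence1, sentence2), with label None for test rows."""
    if len(row) == 3:
        label, sentence1, sentence2 = row
        return label, sentence1, sentence2
    sentence1, sentence2 = row
    return None, sentence1, sentence2


labelled = split_row((1, "A man is eating.", "Someone eats."))
unlabelled = split_row(("A dog runs.", "An animal moves."))
```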

SogouNews

torchtext.datasets.SogouNews(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

SogouNews Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:
  • train: 450000

  • test: 60000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 5) and text containing the news title and contents

Return type:

(int, str)

SST2

torchtext.datasets.SST2(root='.data', split=('train', 'dev', 'test'))[source]

SST2 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://nlp.stanford.edu/sentiment/

Number of lines per split:
  • train: 67349

  • dev: 872

  • test: 1821

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and/or label (0 and 1). The test split only returns text.

Return type:

Union[(int, str), (str,)]

Tutorials using SST2:
SST-2 Binary text classification with XLM-RoBERTa model


STSB

torchtext.datasets.STSB(root='.data', split=('train', 'dev', 'test'))[source]

STSB Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://ixa2.si.ehu.eus/stswiki/index.php/STSbenchmark

Number of lines per split:
  • train: 5749

  • dev: 1500

  • test: 1379

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of (index (int), label (float), sentence1 (str), sentence2 (str))

Return type:

(int, float, str, str)

WNLI

torchtext.datasets.WNLI(root='.data', split=('train', 'dev', 'test'))[source]

WNLI Dataset

For additional details refer to https://arxiv.org/pdf/1804.07461v3.pdf

Number of lines per split:
  • train: 635

  • dev: 71

  • test: 146

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev, test)

Returns:

DataPipe that yields tuple of text and/or label (0 to 1). The test split only returns text.

Return type:

Union[(int, str, str), (str, str)]

YahooAnswers

torchtext.datasets.YahooAnswers(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

YahooAnswers Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:
  • train: 1400000

  • test: 60000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 10) and text containing the question title, question content, and best answer

Return type:

(int, str)

YelpReviewFull

torchtext.datasets.YelpReviewFull(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

YelpReviewFull Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:
  • train: 650000

  • test: 50000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 5) and text containing the review

Return type:

(int, str)

YelpReviewPolarity

torchtext.datasets.YelpReviewPolarity(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

YelpReviewPolarity Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://arxiv.org/abs/1509.01626

Number of lines per split:
  • train: 560000

  • test: 38000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields tuple of label (1 to 2) and text containing the review

Return type:

(int, str)

Language Modeling

PennTreebank

torchtext.datasets.PennTreebank(root='.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]

PennTreebank Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://catalog.ldc.upenn.edu/docs/LDC95T7/cl93.html

Number of lines per split:
  • train: 42068

  • valid: 3370

  • test: 3761

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)

Returns:

DataPipe that yields text from the Treebank corpus

Return type:

str

WikiText-2

torchtext.datasets.WikiText2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]

WikiText2 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/

Number of lines per split:
  • train: 36718

  • valid: 3760

  • test: 4358

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)

Returns:

DataPipe that yields text from Wikipedia articles

Return type:

str

WikiText103

torchtext.datasets.WikiText103(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]

WikiText103 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/

Number of lines per split:
  • train: 1801350

  • valid: 3760

  • test: 4358

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)

Returns:

DataPipe that yields text from Wikipedia articles

Return type:

str

Machine Translation

IWSLT2016

torchtext.datasets.IWSLT2016(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014')[source]

IWSLT2016 dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://wit3.fbk.eu/2016-01

The available datasets include the following:

Language pairs:

        “en”  “fr”  “de”  “cs”  “ar”
“en”           x     x     x     x
“fr”     x
“de”     x
“cs”     x
“ar”     x

valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]
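
The listing above implies that English must be one side of every language pair and that the validation and test set names come from a fixed list. A small sketch of a pre-check that could be run before constructing the dataset; is_supported_pair and is_supported_set are hypothetical helpers, not part of torchtext, and they encode only what is listed above.

```python
OTHER_LANGUAGES = {"fr", "de", "cs", "ar"}
VALID_TEST_SETS = {"dev2010", "tst2010", "tst2011", "tst2012", "tst2013", "tst2014"}


def is_supported_pair(language_pair):
    """Check a (src, tgt) pair: English paired with one of the other languages."""
    src, tgt = language_pair
    return (src == "en" and tgt in OTHER_LANGUAGES) or (
        tgt == "en" and src in OTHER_LANGUAGES
    )


def is_supported_set(name):
    """Check a valid_set or test_set name against the listed options."""
    return name in VALID_TEST_SETS
```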

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)

  • language_pair – tuple or list containing src and tgt language

  • valid_set – a string to identify validation set.

  • test_set – a string to identify test set.

Returns:

DataPipe that yields tuple of source and target sentences

Return type:

(str, str)

Examples

>>> from torchtext.datasets import IWSLT2016
>>> train_iter, valid_iter, test_iter = IWSLT2016()
>>> src_sentence, tgt_sentence = next(iter(train_iter))

IWSLT2017

torchtext.datasets.IWSLT2017(root='.data', split=('train', 'valid', 'test'), language_pair=('de', 'en'))[source]

IWSLT2017 dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://wit3.fbk.eu/2017-01

The available datasets include the following:

Language pairs:

        “en”  “nl”  “de”  “it”  “ro”
“en”           x     x     x     x
“nl”     x           x     x     x
“de”     x     x           x     x
“it”     x     x     x           x
“ro”     x     x     x     x

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)

  • language_pair – tuple or list containing src and tgt language

Returns:

DataPipe that yields tuple of source and target sentences

Return type:

(str, str)

Examples

>>> from torchtext.datasets import IWSLT2017
>>> train_iter, valid_iter, test_iter = IWSLT2017()
>>> src_sentence, tgt_sentence = next(iter(train_iter))

Multi30k

torchtext.datasets.Multi30k(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'))[source]

Multi30k dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://www.statmt.org/wmt16/multimodal-task.html#task1

Number of lines per split:
  • train: 29000

  • valid: 1014

  • test: 1000

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (‘train’, ‘valid’, ‘test’)

  • language_pair – tuple or list containing src and tgt language. Available options are (‘de’,’en’) and (‘en’, ‘de’)

Returns:

DataPipe that yields tuple of source and target sentences

Return type:

(str, str)

Tutorials using Multi30k:
T5-Base Model for Summarization, Sentiment Classification, and Translation


Sequence Tagging

CoNLL2000Chunking

torchtext.datasets.CoNLL2000Chunking(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'test'))[source]

CoNLL2000Chunking Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://www.clips.uantwerpen.be/conll2000/chunking/

Number of lines per split:
  • train: 8936

  • test: 2012

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, test)

Returns:

DataPipe that yields list of words along with corresponding Parts-of-speech tag and chunk tag

Return type:

[list(str), list(str), list(str)]
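
Each item holds three parallel lists, so a per-token view can be obtained with zip. A minimal sketch with a made-up item; per_token is a hypothetical helper and the tag values shown are illustrative only.

```python
def per_token(item):
    """Turn [words, pos_tags, chunk_tags] into a list of (word, pos, chunk) triples."""
    words, pos_tags, chunk_tags = item
    return list(zip(words, pos_tags, chunk_tags))


# Made-up item standing in for one element yielded by the datapipe.
item = [["He", "reckons"], ["PRP", "VBZ"], ["B-NP", "B-VP"]]
triples = per_token(item)
```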

UDPOS

torchtext.datasets.UDPOS(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'valid', 'test'))[source]

UDPOS Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

Number of lines per split:
  • train: 12543

  • valid: 2002

  • test: 2077

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, valid, test)

Returns:

DataPipe that yields list of words along with corresponding parts-of-speech tags

Return type:

[list(str), list(str)]

Question Answer

SQuAD 1.0

torchtext.datasets.SQuAD1(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))[source]

SQuAD1 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/

Number of lines per split:
  • train: 87599

  • dev: 10570

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)

Returns:

DataPipe that yields data points from the SQuAD1 dataset, which consist of context, question, list of answers and corresponding index in context

Return type:

(str, str, list(str), list(int))
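
A quick sanity check on a data point is to confirm that each answer appears in the context at its stated index. A minimal sketch, assuming the indices are character offsets into the context (as with the answer_start field of the original SQuAD JSON); answers_match and the sample data are hypothetical.

```python
def answers_match(context, answers, answer_starts):
    """Return True if every answer occurs in context at its start offset."""
    return all(
        context[start:start + len(answer)] == answer
        for answer, start in zip(answers, answer_starts)
    )


# Made-up context and answers, not taken from the dataset.
context = "Paris is the capital of France."
ok = answers_match(context, ["Paris", "France"], [0, 24])
```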

SQuAD 2.0

torchtext.datasets.SQuAD2(root: str = '.data', split: Union[Tuple[str], str] = ('train', 'dev'))[source]

SQuAD2 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://rajpurkar.github.io/SQuAD-explorer/

Number of lines per split:
  • train: 130319

  • dev: 11873

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • split – split or splits to be returned. Can be a string or tuple of strings. Default: (train, dev)

Returns:

DataPipe that yields data points from the SQuAD2 dataset, which consist of context, question, list of answers and corresponding index in context

Return type:

(str, str, list(str), list(int))

Unsupervised Learning

CC100

torchtext.datasets.CC100(root: str, language_code: str = 'en')[source]

CC100 Dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to https://data.statmt.org/cc-100/

Parameters:
  • root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

  • language_code – the language of the dataset

Returns:

DataPipe that yields tuple of language code and text

Return type:

(str, str)

EnWik9

torchtext.datasets.EnWik9(root: str)[source]

EnWik9 dataset

Warning

Using datapipes is still currently subject to a few caveats. If you wish to use this dataset with shuffling, multi-processing, or distributed learning, please see this note for further instructions.

For additional details refer to http://mattmahoney.net/dc/textdata.html

Number of lines in dataset: 13147026

Parameters:

root – Directory where the datasets are saved. Default: os.path.expanduser(‘~/.torchtext/cache’)

Returns:

DataPipe that yields raw text rows from the EnWik9 dataset

Return type:

str
