torchtext.datasets¶
All datasets are subclasses of torchtext.data.Dataset, which inherits from torch.utils.data.Dataset; i.e., they have split and iters methods implemented.
General use cases are as follows:
Approach 1, splits:
# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# build the vocabulary
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)
# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)
Approach 2, iters:
# use default configurations
train_iter, test_iter = datasets.IMDB.iters(batch_size=4)
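The iterators from either approach yield Batch objects with one attribute per field. A minimal sketch of consuming them, assuming the Approach 1 field setup above (so batch.text is a (tensor, lengths) pair and batch.label is a 1-D tensor):
# sketch only: attribute names follow the IMDB fields defined in Approach 1
for batch in train_iter:
    (text, lengths), labels = batch.text, batch.label
    # text: LongTensor of token ids, shape (batch_size, max_len) with batch_first=True
    # lengths: LongTensor of sequence lengths, shape (batch_size,)
    # labels: LongTensor of label ids, shape (batch_size,)
    break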
The following datasets are available:
Language Modeling¶
Language modeling datasets are subclasses of the LanguageModelingDataset class.
class torchtext.datasets.LanguageModelingDataset(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
Defines a dataset for language modeling.
__init__(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
Create a LanguageModelingDataset given a path and a field.
- Parameters
path – Path to the data file.
text_field – The field that will be used for text data.
newline_eos – Whether to add an <eos> token for every newline in the data file. Default: True.
Remaining keyword arguments – Passed to the constructor of data.Dataset.
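For instance, a plain-text corpus on disk can be wrapped directly; a minimal sketch, where corpus.txt is a hypothetical local file with one or more sentences per line:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
# 'corpus.txt' is a placeholder path; the whole file becomes one long text example
lm_data = datasets.LanguageModelingDataset(path='corpus.txt', text_field=TEXT)
TEXT.build_vocab(lm_data)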
WikiText-2¶
class torchtext.datasets.WikiText2(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the WikiText-2 dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-2 subdirectory.
wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the text field. The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)
Create dataset objects for splits of the WikiText-2 dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for text data.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-2 subdirectory.
train – The filename of the train data. Default: ‘wiki.train.tokens’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘wiki.valid.tokens’.
test – The filename of the test data, or None to not load the test set. Default: ‘wiki.test.tokens’.
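A sketch of the splits workflow for WikiText-2, assuming the legacy torchtext.data API; BPTTIterator produces batches with text and target attributes for language modeling:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
train, valid, test = datasets.WikiText2.splits(TEXT)  # downloads into .data/wikitext-2
TEXT.build_vocab(train)

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=32, bptt_len=35, device=-1)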
WikiText103¶
class torchtext.datasets.WikiText103(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the WikiText-103 dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-103 subdirectory.
wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the text field. The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)
Create dataset objects for splits of the WikiText-103 dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for text data.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-103 subdirectory.
train – The filename of the train data. Default: ‘wiki.train.tokens’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘wiki.valid.tokens’.
test – The filename of the test data, or None to not load the test set. Default: ‘wiki.test.tokens’.
PennTreebank¶
class torchtext.datasets.PennTreebank(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
The Penn Treebank dataset. A relatively small dataset originally created for POS tagging.
References
Marcus, Mitchell P., Marcinkiewicz, Mary Ann & Santorini, Beatrice (1993). Building a Large Annotated Corpus of English: The Penn Treebank
classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the Penn Treebank dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory where the data files will be stored.
wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the text field. The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='ptb.train.txt', validation='ptb.valid.txt', test='ptb.test.txt', **kwargs)
Create dataset objects for splits of the Penn Treebank dataset.
- Parameters
text_field – The field that will be used for text data.
root – The root directory where the data files will be stored.
train – The filename of the train data. Default: ‘ptb.train.txt’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘ptb.valid.txt’.
test – The filename of the test data, or None to not load the test set. Default: ‘ptb.test.txt’.
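As with the other language modeling datasets, the iters shortcut builds the field, vocabulary, and iterators in one call; a minimal sketch under the default settings:
from torchtext import datasets

# downloads the Penn Treebank files into .data and returns BPTT-style iterators
train_iter, valid_iter, test_iter = datasets.PennTreebank.iters(
    batch_size=32, bptt_len=35, device=-1)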
Sentiment Analysis¶
SST¶
class torchtext.datasets.SST(path, text_field, label_field, subtrees=False, fine_grained=False, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the SST dataset.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its trees subdirectory.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train.txt', validation='dev.txt', test='test.txt', train_subtrees=False, **kwargs)
Create dataset objects for splits of the SST dataset.
- Parameters
text_field – The field that will be used for the sentence.
label_field – The field that will be used for label data.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its trees subdirectory.
train – The filename of the train data. Default: ‘train.txt’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘dev.txt’.
test – The filename of the test data, or None to not load the test set. Default: ‘test.txt’.
train_subtrees – Whether to use all subtrees in the training set. Default: False.
Remaining keyword arguments – Passed to the splits method of Dataset.
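A sketch of the splits workflow for SST, leaving fine_grained and train_subtrees at their defaults (coarse labels, no subtrees):
from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

train, val, test = datasets.SST.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32, device=-1)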
IMDb¶
class torchtext.datasets.IMDB(path, text_field, label_field, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the IMDB dataset.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that contains the imdb dataset subdirectory.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train', test='test', **kwargs)
Create dataset objects for splits of the IMDB dataset.
- Parameters
text_field – The field that will be used for the sentence.
label_field – The field that will be used for label data.
root – Root dataset storage directory. Default is ‘.data’.
train – The directory that contains the training examples. Default: ‘train’.
test – The directory that contains the test examples. Default: ‘test’.
Remaining keyword arguments – Passed to the splits method of Dataset.
Text Classification¶
TextClassificationDataset¶
class torchtext.datasets.TextClassificationDataset(vocab, data, labels)
Defines an abstract text classification dataset. Currently, we only support the following datasets:
AG_NEWS
SogouNews
DBpedia
YelpReviewPolarity
YelpReviewFull
YahooAnswers
AmazonReviewPolarity
AmazonReviewFull
__init__(vocab, data, labels)
Initialize a text-classification dataset.
- Parameters
vocab – Vocabulary object used for the dataset.
data – a list of (label, tokens) tuples, where tokens is a tensor produced by numericalizing the string tokens and label is an integer, e.g. [(label1, tokens1), (label2, tokens2), (label2, tokens3)].
labels – a set of the labels, e.g. {label1, label2}.
Examples
See the examples in examples/text_classification/
AG_NEWS¶
torchtext.datasets.AG_NEWS(*args, **kwargs)
Defines AG_NEWS datasets.
The labels include:
0 : World
1 : Sports
2 : Business
3 : Sci/Tech
Create supervised learning dataset: AG_NEWS
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.AG_NEWS(ngrams=3)
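These datasets yield (label, tokens) pairs, so they can be batched with a standard DataLoader. A sketch following the pattern in examples/text_classification (the offsets layout suits EmbeddingBag-style models; the function and variable names here are illustrative):
import torch
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

train_dataset, test_dataset = AG_NEWS(ngrams=2)

def collate_batch(batch):
    # each element is a (label, tokens) pair; concatenate the token tensors
    # and record where each example starts
    labels = torch.tensor([label for label, _ in batch])
    texts = [tokens for _, tokens in batch]
    offsets = torch.tensor([0] + [len(t) for t in texts[:-1]]).cumsum(dim=0)
    return labels, torch.cat(texts), offsets

loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
                    collate_fn=collate_batch)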
SogouNews¶
torchtext.datasets.SogouNews(*args, **kwargs)
Defines SogouNews datasets.
The labels include:
0 : Sports
1 : Finance
2 : Entertainment
3 : Automobile
4 : Technology
Create supervised learning dataset: SogouNews
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.SogouNews(ngrams=3)
DBpedia¶
torchtext.datasets.DBpedia(*args, **kwargs)
Defines DBpedia datasets.
The labels include:
0 : Company
1 : EducationalInstitution
2 : Artist
3 : Athlete
4 : OfficeHolder
5 : MeanOfTransportation
6 : Building
7 : NaturalPlace
8 : Village
9 : Animal
10 : Plant
11 : Album
12 : Film
13 : WrittenWork
Create supervised learning dataset: DBpedia
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.DBpedia(ngrams=3)
YelpReviewPolarity¶
torchtext.datasets.YelpReviewPolarity(*args, **kwargs)
Defines YelpReviewPolarity datasets.
The labels include:
0 : Negative polarity.
1 : Positive polarity.
Create supervised learning dataset: YelpReviewPolarity
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.YelpReviewPolarity(ngrams=3)
YelpReviewFull¶
torchtext.datasets.YelpReviewFull(*args, **kwargs)
Defines YelpReviewFull datasets.
The labels include:
0 - 4 : rating classes (4 is highly recommended).
Create supervised learning dataset: YelpReviewFull
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.YelpReviewFull(ngrams=3)
YahooAnswers¶
torchtext.datasets.YahooAnswers(*args, **kwargs)
Defines YahooAnswers datasets.
The labels include:
0 : Society & Culture
1 : Science & Mathematics
2 : Health
3 : Education & Reference
4 : Computers & Internet
5 : Sports
6 : Business & Finance
7 : Entertainment & Music
8 : Family & Relationships
9 : Politics & Government
Create supervised learning dataset: YahooAnswers
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.YahooAnswers(ngrams=3)
AmazonReviewPolarity¶
torchtext.datasets.AmazonReviewPolarity(*args, **kwargs)
Defines AmazonReviewPolarity datasets.
The labels include:
0 : Negative polarity
1 : Positive polarity
Create supervised learning dataset: AmazonReviewPolarity
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.AmazonReviewPolarity(ngrams=3)
AmazonReviewFull¶
torchtext.datasets.AmazonReviewFull(*args, **kwargs)
Defines AmazonReviewFull datasets.
The labels include:
0 - 4 : rating classes (4 is highly recommended)
Create supervised learning dataset: AmazonReviewFull
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.AmazonReviewFull(ngrams=3)
Question Classification¶
TREC¶
class torchtext.datasets.TREC(path, text_field, label_field, fine_grained=False, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the TREC dataset.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that contains the trec dataset subdirectory.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train_5500.label', test='TREC_10.label', **kwargs)
Create dataset objects for splits of the TREC dataset.
- Parameters
text_field – The field that will be used for the sentence.
label_field – The field that will be used for label data.
root – Root dataset storage directory. Default is ‘.data’.
train – The filename of the train data. Default: ‘train_5500.label’.
test – The filename of the test data, or None to not load the test set. Default: ‘TREC_10.label’.
Remaining keyword arguments – Passed to the splits method of Dataset.
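A sketch of the splits workflow for TREC; fine_grained=True keeps the fine-grained question classes and is forwarded to the dataset constructor via the remaining keyword arguments:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

train, test = datasets.TREC.splits(TEXT, LABEL, fine_grained=True)
TEXT.build_vocab(train)
LABEL.build_vocab(train)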
Entailment¶
SNLI¶
class torchtext.datasets.SNLI(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, trees=False, **kwargs)
Create iterator objects for splits of the SNLI dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
trees – Whether to include shift-reduce parser transitions. Default: False.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, parse_field=None, root='.data', train='snli_1.0_train.jsonl', validation='snli_1.0_dev.jsonl', test='snli_1.0_test.jsonl')
Create dataset objects for splits of the SNLI dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for premise and hypothesis data.
label_field – The field that will be used for label data.
parse_field – The field that will be used for shift-reduce parser transitions, or None to not include them.
extra_fields – A dict[json_key: Tuple(field_name, Field)]
root – The root directory that the dataset’s zip archive will be expanded into.
train – The filename of the train data. Default: ‘snli_1.0_train.jsonl’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘snli_1.0_dev.jsonl’.
test – The filename of the test data, or None to not load the test set. Default: ‘snli_1.0_test.jsonl’.
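A sketch of the splits workflow for SNLI; each resulting example exposes premise, hypothesis, and label attributes:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

train, val, test = datasets.SNLI.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32, device=-1)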
MultiNLI¶
class torchtext.datasets.MultiNLI(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, trees=False, **kwargs)
Create iterator objects for splits of the MultiNLI dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
trees – Whether to include shift-reduce parser transitions. Default: False.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, parse_field=None, genre_field=None, root='.data', train='multinli_1.0_train.jsonl', validation='multinli_1.0_dev_matched.jsonl', test='multinli_1.0_dev_mismatched.jsonl')
Create dataset objects for splits of the MultiNLI dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for premise and hypothesis data.
label_field – The field that will be used for label data.
parse_field – The field that will be used for shift-reduce parser transitions, or None to not include them.
extra_fields – A dict[json_key: Tuple(field_name, Field)]
root – The root directory that the dataset’s zip archive will be expanded into.
train – The filename of the train data. Default: ‘multinli_1.0_train.jsonl’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘multinli_1.0_dev_matched.jsonl’.
test – The filename of the test data, or None to not load the test set. Default: ‘multinli_1.0_dev_mismatched.jsonl’.
Machine Translation¶
Machine translation datasets are subclasses of the TranslationDataset class.
class torchtext.datasets.TranslationDataset(path, exts, fields, **kwargs)
Defines a dataset for machine translation.
__init__(path, exts, fields, **kwargs)
Create a TranslationDataset given paths and fields.
- Parameters
path – Common prefix of paths to the data files for both languages.
exts – A tuple containing the file extension for each language; each extension is appended to path to locate that language’s data file.
fields – A tuple containing the fields that will be used for data in each language.
Remaining keyword arguments – Passed to the constructor of data.Dataset.
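A sketch of wrapping a local parallel corpus; data/europarl.de and data/europarl.en are hypothetical files with line-aligned sentences:
from torchtext import data, datasets

SRC = data.Field(lower=True)
TRG = data.Field(lower=True, init_token='<sos>', eos_token='<eos>')

# expects 'data/europarl.de' and 'data/europarl.en' (placeholder names)
mt_data = datasets.TranslationDataset(
    path='data/europarl', exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(mt_data)
TRG.build_vocab(mt_data)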
Multi30k¶
class torchtext.datasets.Multi30k(path, exts, fields, **kwargs)
The small-dataset WMT 2016 multimodal task, also known as Flickr30k.
classmethod splits(exts, fields, root='.data', train='train', validation='val', test='test2016', **kwargs)
Create dataset objects for splits of the Multi30k dataset.
- Parameters
exts – A tuple containing the file extension for each language.
fields – A tuple containing the fields that will be used for data in each language.
root – Root dataset storage directory. Default is ‘.data’.
train – The prefix of the train data. Default: ‘train’.
validation – The prefix of the validation data. Default: ‘val’.
test – The prefix of the test data. Default: ‘test2016’.
Remaining keyword arguments – Passed to the splits method of Dataset.
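A sketch of the splits workflow for the German-English Multi30k task; tokenization is kept trivial here, and any tokenizer accepted by Field could be substituted:
from torchtext import data, datasets

SRC = data.Field(tokenize=str.split, lower=True)
TRG = data.Field(tokenize=str.split, lower=True, init_token='<sos>', eos_token='<eos>')

train, val, test = datasets.Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train, min_freq=2)
TRG.build_vocab(train, min_freq=2)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=64, device=-1)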
IWSLT¶
class torchtext.datasets.IWSLT(path, exts, fields, **kwargs)
The IWSLT 2016 TED talk translation task.
classmethod splits(exts, fields, root='.data', train='train', validation='IWSLT16.TED.tst2013', test='IWSLT16.TED.tst2014', **kwargs)
Create dataset objects for splits of the IWSLT dataset.
- Parameters
exts – A tuple containing the file extension for each language.
fields – A tuple containing the fields that will be used for data in each language.
root – Root dataset storage directory. Default is ‘.data’.
train – The prefix of the train data. Default: ‘train’.
validation – The prefix of the validation data. Default: ‘IWSLT16.TED.tst2013’.
test – The prefix of the test data. Default: ‘IWSLT16.TED.tst2014’.
Remaining keyword arguments – Passed to the splits method of Dataset.
WMT14¶
class torchtext.datasets.WMT14(path, exts, fields, **kwargs)
The WMT 2014 English-German dataset, as preprocessed by Google Brain.
Though this download contains test sets from 2015 and 2016, the train set differs slightly from WMT 2015 and 2016 and significantly from WMT 2017.
classmethod splits(exts, fields, root='.data', train='train.tok.clean.bpe.32000', validation='newstest2013.tok.bpe.32000', test='newstest2014.tok.bpe.32000', **kwargs)
Create dataset objects for splits of the WMT 2014 dataset.
- Parameters
exts – A tuple containing the extensions for each language. Must be either (‘.en’, ‘.de’) or the reverse.
fields – A tuple containing the fields that will be used for data in each language.
root – Root dataset storage directory. Default is ‘.data’.
train – The prefix of the train data. Default: ‘train.tok.clean.bpe.32000’.
validation – The prefix of the validation data. Default: ‘newstest2013.tok.bpe.32000’.
test – The prefix of the test data. Default: ‘newstest2014.tok.bpe.32000’.
Remaining keyword arguments – Passed to the splits method of Dataset.
Sequence Tagging¶
Sequence tagging datasets are subclasses of the SequenceTaggingDataset class.
class torchtext.datasets.SequenceTaggingDataset(path, fields, encoding='utf-8', separator='\t', **kwargs)
Defines a dataset for sequence tagging. Examples in this dataset contain paired lists, e.g. a list of words paired with a list of tags.
For example, in the case of part-of-speech tagging, an example is of the form [I, love, PyTorch, .] paired with [PRON, VERB, PROPN, PUNCT]
See torchtext/test/sequence_tagging.py on how to use this class.
__init__(path, fields, encoding='utf-8', separator='\t', **kwargs)
Create a dataset from a list of Examples and Fields.
- Parameters
examples – List of Examples.
fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field.
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
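A sketch of loading a local CoNLL-style file; tagged.tsv is a hypothetical file with one token per line, tab-separated columns, and blank lines between sentences:
from torchtext import data, datasets

WORD = data.Field(lower=True)
TAG = data.Field()

# the column order in the file must match the order of the fields list
tagging_data = datasets.SequenceTaggingDataset(
    path='tagged.tsv', fields=[('word', WORD), ('tag', TAG)])
WORD.build_vocab(tagging_data)
TAG.build_vocab(tagging_data)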
UDPOS¶
CoNLL2000Chunking¶
Question Answering¶
BABI20¶
class torchtext.datasets.BABI20(path, text_field, only_supporting=False, **kwargs)

__init__(path, text_field, only_supporting=False, **kwargs)
Create a dataset from a list of Examples and Fields.
- Parameters
examples – List of Examples.
fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field.
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
classmethod splits(text_field, path=None, root='.data', task=1, joint=False, tenK=False, only_supporting=False, train=None, validation=None, test=None, **kwargs)
Create Dataset objects for multiple splits of a dataset.
- Parameters
path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
root (str) – Root dataset storage directory. Default is ‘.data’.
train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
Remaining keyword arguments – Passed to the constructor of the Dataset (sub)class being used.
- Returns
Datasets for train, validation, and test splits in that order, if provided.
- Return type
Tuple[Dataset]
Unsupervised Learning¶
EnWik9¶
class torchtext.datasets.EnWik9(begin_line=0, num_lines=6348957, root='.data')
The first 10^9 bytes of enwiki-20060303-pages-articles.xml. It is part of the Large Text Compression Benchmark project.
__init__(begin_line=0, num_lines=6348957, root='.data')
Initialize the EnWik9 dataset.
- Parameters
begin_line – the line number to begin reading from. Default: 0
num_lines – the number of lines to be loaded. Default: 6348957
root – Directory where the datasets are saved. Default: “.data”
Examples
>>> from torchtext.datasets import EnWik9
>>> enwik9 = EnWik9(num_lines=20000)
>>> vocab = enwik9.get_vocab()