torchtext.datasets¶
All datasets are subclasses of torchtext.data.Dataset, which inherits from torch.utils.data.Dataset; i.e., they have split and iters methods implemented.
General use cases are as follows:
Approach 1, splits:
# set up fields
TEXT = data.Field(lower=True, include_lengths=True, batch_first=True)
LABEL = data.Field(sequential=False)
# make splits for data
train, test = datasets.IMDB.splits(TEXT, LABEL)
# build the vocabulary
TEXT.build_vocab(train, vectors=GloVe(name='6B', dim=300))
LABEL.build_vocab(train)
# make iterator for splits
train_iter, test_iter = data.BucketIterator.splits(
    (train, test), batch_size=3, device=0)
Approach 2, iters:
# use default configurations
train_iter, test_iter = datasets.IMDB.iters(batch_size=4)
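The iterators from either approach yield Batch objects with one attribute per field. A minimal sketch of consuming them, assuming the Approach 1 field setup above (so batch.text is a (tensor, lengths) pair and batch.label is a 1-D tensor):
# sketch only: attribute names follow the IMDB fields defined in Approach 1
for batch in train_iter:
    (text, lengths), labels = batch.text, batch.label
    # text: LongTensor of token ids, shape (batch_size, max_len) with batch_first=True
    # lengths: LongTensor of sequence lengths, shape (batch_size,)
    # labels: LongTensor of label ids, shape (batch_size,)
    break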
The following datasets are available:
Language Modeling¶
Language modeling datasets are subclasses of the LanguageModelingDataset class.
class torchtext.datasets.LanguageModelingDataset(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
Defines a dataset for language modeling.
__init__(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
Create a LanguageModelingDataset given a path and a field.
- Parameters
path – Path to the data file.
text_field – The field that will be used for text data.
newline_eos – Whether to add an <eos> token for every newline in the data file. Default: True.
Remaining keyword arguments – Passed to the constructor of data.Dataset.
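For instance, a plain-text corpus on disk can be wrapped directly; a minimal sketch, where corpus.txt is a hypothetical local file with one or more sentences per line:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
# 'corpus.txt' is a placeholder path; the whole file becomes one long text example
lm_data = datasets.LanguageModelingDataset(path='corpus.txt', text_field=TEXT)
TEXT.build_vocab(lm_data)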
WikiText-2¶
class torchtext.datasets.WikiText2(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the WikiText-2 dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-2 subdirectory.
wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the text field. The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)
Create dataset objects for splits of the WikiText-2 dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for text data.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-2 subdirectory.
train – The filename of the train data. Default: ‘wiki.train.tokens’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘wiki.valid.tokens’.
test – The filename of the test data, or None to not load the test set. Default: ‘wiki.test.tokens’.
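A sketch of the splits workflow for WikiText-2, assuming the legacy torchtext.data API; BPTTIterator produces batches with text and target attributes for language modeling:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
train, valid, test = datasets.WikiText2.splits(TEXT)  # downloads into .data/wikitext-2
TEXT.build_vocab(train)

train_iter, valid_iter, test_iter = data.BPTTIterator.splits(
    (train, valid, test), batch_size=32, bptt_len=35, device=-1)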
WikiText103¶
class torchtext.datasets.WikiText103(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)

classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the WikiText-103 dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-103 subdirectory.
wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the text field. The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='wiki.train.tokens', validation='wiki.valid.tokens', test='wiki.test.tokens', **kwargs)
Create dataset objects for splits of the WikiText-103 dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for text data.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its wikitext-103 subdirectory.
train – The filename of the train data. Default: ‘wiki.train.tokens’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘wiki.valid.tokens’.
test – The filename of the test data, or None to not load the test set. Default: ‘wiki.test.tokens’.
PennTreebank¶
class torchtext.datasets.PennTreebank(path, text_field, newline_eos=True, encoding='utf-8', **kwargs)
The Penn Treebank dataset. A relatively small dataset originally created for POS tagging.
References
Marcus, Mitchell P., Marcinkiewicz, Mary Ann & Santorini, Beatrice (1993). Building a Large Annotated Corpus of English: The Penn Treebank
classmethod iters(batch_size=32, bptt_len=35, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the Penn Treebank dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory where the data files will be stored.
wv_dir, wv_type, wv_dim – Passed to the Vocab constructor for the text field. The word vectors are accessible as train.dataset.fields[‘text’].vocab.vectors.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, root='.data', train='ptb.train.txt', validation='ptb.valid.txt', test='ptb.test.txt', **kwargs)
Create dataset objects for splits of the Penn Treebank dataset.
- Parameters
text_field – The field that will be used for text data.
root – The root directory where the data files will be stored.
train – The filename of the train data. Default: ‘ptb.train.txt’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘ptb.valid.txt’.
test – The filename of the test data, or None to not load the test set. Default: ‘ptb.test.txt’.
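As with the other language modeling datasets, the iters shortcut builds the field, vocabulary, and iterators in one call; a minimal sketch under the default settings:
from torchtext import datasets

# downloads the Penn Treebank files into .data and returns BPTT-style iterators
train_iter, valid_iter, test_iter = datasets.PennTreebank.iters(
    batch_size=32, bptt_len=35, device=-1)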
Sentiment Analysis¶
SST¶
class torchtext.datasets.SST(path, text_field, label_field, subtrees=False, fine_grained=False, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the SST dataset.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its trees subdirectory.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train.txt', validation='dev.txt', test='test.txt', train_subtrees=False, **kwargs)
Create dataset objects for splits of the SST dataset.
- Parameters
text_field – The field that will be used for the sentence.
label_field – The field that will be used for label data.
root – The root directory that the dataset’s zip archive will be expanded into; the data files will be stored in its trees subdirectory.
train – The filename of the train data. Default: ‘train.txt’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘dev.txt’.
test – The filename of the test data, or None to not load the test set. Default: ‘test.txt’.
train_subtrees – Whether to use all subtrees in the training set. Default: False.
Remaining keyword arguments – Passed to the splits method of Dataset.
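A sketch of the splits workflow for SST, leaving fine_grained and train_subtrees at their defaults (coarse labels, no subtrees):
from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

train, val, test = datasets.SST.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32, device=-1)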
IMDb¶
class torchtext.datasets.IMDB(path, text_field, label_field, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the IMDB dataset.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that contains the imdb dataset subdirectory.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train', test='test', **kwargs)
Create dataset objects for splits of the IMDB dataset.
- Parameters
text_field – The field that will be used for the sentence.
label_field – The field that will be used for label data.
root – Root dataset storage directory. Default is ‘.data’.
train – The directory that contains the training examples. Default: ‘train’.
test – The directory that contains the test examples. Default: ‘test’.
Remaining keyword arguments – Passed to the splits method of Dataset.
Text Classification¶
TextClassificationDataset¶
class torchtext.datasets.TextClassificationDataset(vocab, data, labels)
Defines an abstract text classification dataset. Currently, we only support the following datasets:
AG_NEWS
SogouNews
DBpedia
YelpReviewPolarity
YelpReviewFull
YahooAnswers
AmazonReviewPolarity
AmazonReviewFull
__init__(vocab, data, labels)
Initialize a text-classification dataset.
- Parameters
vocab – Vocabulary object used for the dataset.
data – a list of (label, tokens) tuples, where tokens is a tensor produced by numericalizing the string tokens and label is an integer, e.g. [(label1, tokens1), (label2, tokens2), (label2, tokens3)].
labels – a set of the labels, e.g. {label1, label2}.
Examples
See the examples in examples/text_classification/
AG_NEWS¶
torchtext.datasets.AG_NEWS(*args, **kwargs)
Defines AG_NEWS datasets.
The labels include:
0 : World
1 : Sports
2 : Business
3 : Sci/Tech
Create supervised learning dataset: AG_NEWS
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.AG_NEWS(ngrams=3)
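These datasets yield (label, tokens) pairs, so they can be batched with a standard DataLoader. A sketch following the pattern in examples/text_classification (the offsets layout suits EmbeddingBag-style models; the function and variable names here are illustrative):
import torch
from torch.utils.data import DataLoader
from torchtext.datasets import AG_NEWS

train_dataset, test_dataset = AG_NEWS(ngrams=2)

def collate_batch(batch):
    # each element is a (label, tokens) pair; concatenate the token tensors
    # and record where each example starts
    labels = torch.tensor([label for label, _ in batch])
    texts = [tokens for _, tokens in batch]
    offsets = torch.tensor([0] + [len(t) for t in texts[:-1]]).cumsum(dim=0)
    return labels, torch.cat(texts), offsets

loader = DataLoader(train_dataset, batch_size=16, shuffle=True,
                    collate_fn=collate_batch)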
SogouNews¶
torchtext.datasets.SogouNews(*args, **kwargs)
Defines SogouNews datasets.
The labels include:
0 : Sports
1 : Finance
2 : Entertainment
3 : Automobile
4 : Technology
Create supervised learning dataset: SogouNews
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.SogouNews(ngrams=3)
DBpedia¶
torchtext.datasets.DBpedia(*args, **kwargs)
Defines DBpedia datasets.
The labels include:
0 : Company
1 : EducationalInstitution
2 : Artist
3 : Athlete
4 : OfficeHolder
5 : MeanOfTransportation
6 : Building
7 : NaturalPlace
8 : Village
9 : Animal
10 : Plant
11 : Album
12 : Film
13 : WrittenWork
Create supervised learning dataset: DBpedia
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.DBpedia(ngrams=3)
YelpReviewPolarity¶
torchtext.datasets.YelpReviewPolarity(*args, **kwargs)
Defines YelpReviewPolarity datasets.
The labels include:
0 : Negative polarity.
1 : Positive polarity.
Create supervised learning dataset: YelpReviewPolarity
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.YelpReviewPolarity(ngrams=3)
YelpReviewFull¶
torchtext.datasets.YelpReviewFull(*args, **kwargs)
Defines YelpReviewFull datasets.
The labels include:
0 - 4 : rating classes (4 is highly recommended).
Create supervised learning dataset: YelpReviewFull
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.YelpReviewFull(ngrams=3)
YahooAnswers¶
torchtext.datasets.YahooAnswers(*args, **kwargs)
Defines YahooAnswers datasets.
The labels include:
0 : Society & Culture
1 : Science & Mathematics
2 : Health
3 : Education & Reference
4 : Computers & Internet
5 : Sports
6 : Business & Finance
7 : Entertainment & Music
8 : Family & Relationships
9 : Politics & Government
Create supervised learning dataset: YahooAnswers
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.YahooAnswers(ngrams=3)
AmazonReviewPolarity¶
torchtext.datasets.AmazonReviewPolarity(*args, **kwargs)
Defines AmazonReviewPolarity datasets.
The labels include:
0 : Negative polarity
1 : Positive polarity
Create supervised learning dataset: AmazonReviewPolarity
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.AmazonReviewPolarity(ngrams=3)
AmazonReviewFull¶
torchtext.datasets.AmazonReviewFull(*args, **kwargs)
Defines AmazonReviewFull datasets.
The labels include:
0 - 4 : rating classes (4 is highly recommended)
Create supervised learning dataset: AmazonReviewFull
Separately returns the training and test datasets
- Parameters
root – Directory where the datasets are saved. Default: “.data”
ngrams – a contiguous sequence of n items from the string text. Default: 1
vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.
include_unk – include unknown token in the data (Default: False)
Examples
>>> train_dataset, test_dataset = torchtext.datasets.AmazonReviewFull(ngrams=3)
Question Classification¶
TREC¶
class torchtext.datasets.TREC(path, text_field, label_field, fine_grained=False, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, **kwargs)
Create iterator objects for splits of the TREC dataset.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that contains the trec dataset subdirectory.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, root='.data', train='train_5500.label', test='TREC_10.label', **kwargs)
Create dataset objects for splits of the TREC dataset.
- Parameters
text_field – The field that will be used for the sentence.
label_field – The field that will be used for label data.
root – Root dataset storage directory. Default is ‘.data’.
train – The filename of the train data. Default: ‘train_5500.label’.
test – The filename of the test data, or None to not load the test set. Default: ‘TREC_10.label’.
Remaining keyword arguments – Passed to the splits method of Dataset.
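A sketch of the splits workflow for TREC; fine_grained=True keeps the fine-grained question classes and is forwarded to the dataset constructor via the remaining keyword arguments:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

train, test = datasets.TREC.splits(TEXT, LABEL, fine_grained=True)
TEXT.build_vocab(train)
LABEL.build_vocab(train)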
Entailment¶
SNLI¶
class torchtext.datasets.SNLI(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, trees=False, **kwargs)
Create iterator objects for splits of the SNLI dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
trees – Whether to include shift-reduce parser transitions. Default: False.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, parse_field=None, root='.data', train='snli_1.0_train.jsonl', validation='snli_1.0_dev.jsonl', test='snli_1.0_test.jsonl')
Create dataset objects for splits of the SNLI dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for premise and hypothesis data.
label_field – The field that will be used for label data.
parse_field – The field that will be used for shift-reduce parser transitions, or None to not include them.
extra_fields – A dict[json_key: Tuple(field_name, Field)]
root – The root directory that the dataset’s zip archive will be expanded into.
train – The filename of the train data. Default: ‘snli_1.0_train.jsonl’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘snli_1.0_dev.jsonl’.
test – The filename of the test data, or None to not load the test set. Default: ‘snli_1.0_test.jsonl’.
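A sketch of the splits workflow for SNLI; each resulting example exposes premise, hypothesis, and label attributes:
from torchtext import data, datasets

TEXT = data.Field(lower=True)
LABEL = data.Field(sequential=False)

train, val, test = datasets.SNLI.splits(TEXT, LABEL)
TEXT.build_vocab(train)
LABEL.build_vocab(train)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=32, device=-1)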
MultiNLI¶
class torchtext.datasets.MultiNLI(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)

classmethod iters(batch_size=32, device=0, root='.data', vectors=None, trees=False, **kwargs)
Create iterator objects for splits of the MultiNLI dataset.
This is the simplest way to use the dataset, and assumes common defaults for field, vocabulary, and iterator parameters.
- Parameters
batch_size – Batch size.
device – Device to create batches on. Use -1 for CPU and None for the currently active GPU device.
root – The root directory that the dataset’s zip archive will be expanded into.
vectors – one of the available pretrained vectors or a list with each element one of the available pretrained vectors (see Vocab.load_vectors)
trees – Whether to include shift-reduce parser transitions. Default: False.
Remaining keyword arguments – Passed to the splits method.
classmethod splits(text_field, label_field, parse_field=None, genre_field=None, root='.data', train='multinli_1.0_train.jsonl', validation='multinli_1.0_dev_matched.jsonl', test='multinli_1.0_dev_mismatched.jsonl')
Create dataset objects for splits of the MultiNLI dataset.
This is the most flexible way to use the dataset.
- Parameters
text_field – The field that will be used for premise and hypothesis data.
label_field – The field that will be used for label data.
parse_field – The field that will be used for shift-reduce parser transitions, or None to not include them.
extra_fields – A dict[json_key: Tuple(field_name, Field)]
root – The root directory that the dataset’s zip archive will be expanded into.
train – The filename of the train data. Default: ‘multinli_1.0_train.jsonl’.
validation – The filename of the validation data, or None to not load the validation set. Default: ‘multinli_1.0_dev_matched.jsonl’.
test – The filename of the test data, or None to not load the test set. Default: ‘multinli_1.0_dev_mismatched.jsonl’.
Machine Translation¶
Machine translation datasets are subclasses of the TranslationDataset class.
class torchtext.datasets.TranslationDataset(path, exts, fields, **kwargs)
Defines a dataset for machine translation.
__init__(path, exts, fields, **kwargs)
Create a TranslationDataset given paths and fields.
- Parameters
path – Common prefix of paths to the data files for both languages.
exts – A tuple containing the file extension for each language; each extension is appended to path to locate that language’s data file.
fields – A tuple containing the fields that will be used for data in each language.
Remaining keyword arguments – Passed to the constructor of data.Dataset.
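A sketch of wrapping a local parallel corpus; data/europarl.de and data/europarl.en are hypothetical files with line-aligned sentences:
from torchtext import data, datasets

SRC = data.Field(lower=True)
TRG = data.Field(lower=True, init_token='<sos>', eos_token='<eos>')

# expects 'data/europarl.de' and 'data/europarl.en' (placeholder names)
mt_data = datasets.TranslationDataset(
    path='data/europarl', exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(mt_data)
TRG.build_vocab(mt_data)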
Multi30k¶
class torchtext.datasets.Multi30k(path, exts, fields, **kwargs)
The small-dataset WMT 2016 multimodal task, also known as Flickr30k.
classmethod splits(exts, fields, root='.data', train='train', validation='val', test='test2016', **kwargs)
Create dataset objects for splits of the Multi30k dataset.
- Parameters
exts – A tuple containing the file extension for each language.
fields – A tuple containing the fields that will be used for data in each language.
root – Root dataset storage directory. Default is ‘.data’.
train – The prefix of the train data. Default: ‘train’.
validation – The prefix of the validation data. Default: ‘val’.
test – The prefix of the test data. Default: ‘test2016’.
Remaining keyword arguments – Passed to the splits method of Dataset.
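A sketch of the splits workflow for the German-English Multi30k task; tokenization is kept trivial here, and any tokenizer accepted by Field could be substituted:
from torchtext import data, datasets

SRC = data.Field(tokenize=str.split, lower=True)
TRG = data.Field(tokenize=str.split, lower=True, init_token='<sos>', eos_token='<eos>')

train, val, test = datasets.Multi30k.splits(exts=('.de', '.en'), fields=(SRC, TRG))
SRC.build_vocab(train, min_freq=2)
TRG.build_vocab(train, min_freq=2)

train_iter, val_iter, test_iter = data.BucketIterator.splits(
    (train, val, test), batch_size=64, device=-1)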
IWSLT¶
class torchtext.datasets.IWSLT(path, exts, fields, **kwargs)
The IWSLT 2016 TED talk translation task.
classmethod splits(exts, fields, root='.data', train='train', validation='IWSLT16.TED.tst2013', test='IWSLT16.TED.tst2014', **kwargs)
Create dataset objects for splits of the IWSLT dataset.
- Parameters
exts – A tuple containing the file extension for each language.
fields – A tuple containing the fields that will be used for data in each language.
root – Root dataset storage directory. Default is ‘.data’.
train – The prefix of the train data. Default: ‘train’.
validation – The prefix of the validation data. Default: ‘IWSLT16.TED.tst2013’.
test – The prefix of the test data. Default: ‘IWSLT16.TED.tst2014’.
Remaining keyword arguments – Passed to the splits method of Dataset.
WMT14¶
class torchtext.datasets.WMT14(path, exts, fields, **kwargs)
The WMT 2014 English-German dataset, as preprocessed by Google Brain.
Though this download contains test sets from 2015 and 2016, the train set differs slightly from WMT 2015 and 2016 and significantly from WMT 2017.
classmethod splits(exts, fields, root='.data', train='train.tok.clean.bpe.32000', validation='newstest2013.tok.bpe.32000', test='newstest2014.tok.bpe.32000', **kwargs)
Create dataset objects for splits of the WMT 2014 dataset.
- Parameters
exts – A tuple containing the extensions for each language. Must be either (‘.en’, ‘.de’) or the reverse.
fields – A tuple containing the fields that will be used for data in each language.
root – Root dataset storage directory. Default is ‘.data’.
train – The prefix of the train data. Default: ‘train.tok.clean.bpe.32000’.
validation – The prefix of the validation data. Default: ‘newstest2013.tok.bpe.32000’.
test – The prefix of the test data. Default: ‘newstest2014.tok.bpe.32000’.
Remaining keyword arguments – Passed to the splits method of Dataset.
Sequence Tagging¶
Sequence tagging datasets are subclasses of the SequenceTaggingDataset class.
class torchtext.datasets.SequenceTaggingDataset(path, fields, encoding='utf-8', separator='\t', **kwargs)
Defines a dataset for sequence tagging. Examples in this dataset contain paired lists, e.g. a list of words paired with a list of tags.
For example, in the case of part-of-speech tagging, an example is of the form [I, love, PyTorch, .] paired with [PRON, VERB, PROPN, PUNCT]
See torchtext/test/sequence_tagging.py on how to use this class.
__init__(path, fields, encoding='utf-8', separator='\t', **kwargs)
Create a dataset from a list of Examples and Fields.
- Parameters
examples – List of Examples.
fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field.
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
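A sketch of loading a local CoNLL-style file; tagged.tsv is a hypothetical file with one token per line, tab-separated columns, and blank lines between sentences:
from torchtext import data, datasets

WORD = data.Field(lower=True)
TAG = data.Field()

# the column order in the file must match the order of the fields list
tagging_data = datasets.SequenceTaggingDataset(
    path='tagged.tsv', fields=[('word', WORD), ('tag', TAG)])
WORD.build_vocab(tagging_data)
TAG.build_vocab(tagging_data)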
UDPOS¶
CoNLL2000Chunking¶
Question Answering¶
BABI20¶
class torchtext.datasets.BABI20(path, text_field, only_supporting=False, **kwargs)

__init__(path, text_field, only_supporting=False, **kwargs)
Create a dataset from a list of Examples and Fields.
- Parameters
examples – List of Examples.
fields (List(tuple(str, Field))) – The Fields to use in this tuple. The string is a field name, and the Field is the associated field.
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
classmethod splits(text_field, path=None, root='.data', task=1, joint=False, tenK=False, only_supporting=False, train=None, validation=None, test=None, **kwargs)
Create Dataset objects for multiple splits of a dataset.
- Parameters
path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
root (str) – Root dataset storage directory. Default is ‘.data’.
train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
Remaining keyword arguments – Passed to the constructor of the Dataset (sub)class being used.
- Returns
Datasets for train, validation, and test splits in that order, if provided.
- Return type
Tuple[Dataset]
Unsupervised Learning¶
EnWik9¶
class torchtext.datasets.EnWik9(begin_line=0, num_lines=6348957, root='.data')
The first 10^9 bytes of enwiki-20060303-pages-articles.xml. It is part of the Large Text Compression Benchmark project.
__init__(begin_line=0, num_lines=6348957, root='.data')
Initialize the EnWik9 dataset.
- Parameters
begin_line – the line number to begin reading from. Default: 0
num_lines – the number of lines to be loaded. Default: 6348957
root – Directory where the datasets are saved. Default: “.data”
Examples
>>> from torchtext.datasets import EnWik9
>>> enwik9 = EnWik9(num_lines=20000)
>>> vocab = enwik9.get_vocab()