
torchtext.experimental.datasets

The following datasets have been rewritten to be more compatible with torch.utils.data. General use cases are as follows:

# import datasets
from torchtext.experimental.datasets import IMDB

# set up the tokenizer (the default one is the basic_english tokenizer)
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer("spacy")

# obtain data and vocab with a custom tokenizer
train_dataset, test_dataset = IMDB(tokenizer=tokenizer)
vocab = train_dataset.get_vocab()

# use the default tokenizer
train_dataset, test_dataset = IMDB()
vocab = train_dataset.get_vocab()
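
Because these datasets behave like regular torch.utils.data.Dataset objects, they can be passed straight to a DataLoader. Below is a minimal sketch, assuming each sample is a (label, token_ids) pair with token_ids a 1-D LongTensor; the exact sample layout can vary between releases, so inspect one item first.

import torch
from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Pad variable-length token sequences into a dense batch.
def collate_batch(batch):
    labels = torch.tensor([label for label, _ in batch]).long()
    texts = pad_sequence([tokens for _, tokens in batch],
                         batch_first=True, padding_value=0)
    return labels, texts

loader = DataLoader(train_dataset, batch_size=8, shuffle=True,
                    collate_fn=collate_batch)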

The following datasets are available:

Sentiment Analysis

IMDb

class torchtext.experimental.datasets.IMDB[source]
Defines IMDB datasets.
The labels include:
  • 0 : Negative

  • 1 : Positive

Create sentiment analysis dataset: IMDB

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • removed_tokens – tokens removed from the output dataset (Default: [])

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens (see the sketch after this parameter list).

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.
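
Since a custom tokenizer is just a callable from a string to a token list, a plain Python function is enough. A small sketch (whitespace splitting is for illustration only):

# Any callable from str to a list of tokens can serve as a tokenizer.
def whitespace_tokenizer(line):
    return line.lower().split()

train_dataset, test_dataset = IMDB(tokenizer=whitespace_tokenizer)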

Examples

>>> from torchtext.experimental.datasets import IMDB
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = IMDB(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = IMDB(tokenizer=tokenizer)
>>> train, = IMDB(tokenizer=tokenizer, data_select='train')
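
As the data_select note above says, building a non-train split on its own requires a pre-built vocabulary. A short sketch that numericalizes only the test split with the train vocabulary:

# Build the vocabulary once from the train split, then reuse it for test.
train, = IMDB(data_select='train')
vocab = train.get_vocab()
test, = IMDB(vocab=vocab, data_select='test')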

Text Classification

AG_NEWS

The AG_NEWS dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.AG_NEWS[source]
Defines AG_NEWS datasets.
The labels include:
  • 1 : World

  • 2 : Sports

  • 3 : Business

  • 4 : Sci/Tech

Create text classification dataset: AG_NEWS

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import AG_NEWS
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = AG_NEWS(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = AG_NEWS(tokenizer=tokenizer)
>>> train, = AG_NEWS(tokenizer=tokenizer, data_select='train')
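
For bag-of-words models built on nn.EmbeddingBag, a common alternative to padding is to concatenate all token ids and track offsets. A sketch under the same (label, token_ids) sample assumption as the IMDB example earlier; the documented labels start at 1, so they are shifted to 0-based here (drop the shift if your version already returns 0-based labels).

import torch

# Concatenate token ids and record where each sample starts.
def collate_offsets(batch):
    labels = torch.tensor([label - 1 for label, _ in batch]).long()
    texts = torch.cat([tokens for _, tokens in batch])
    offsets = torch.tensor([0] + [len(tokens) for _, tokens in batch[:-1]]).cumsum(dim=0)
    return labels, texts, offsets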

SogouNews

The SogouNews dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.SogouNews[source]
Defines SogouNews datasets.
The labels include:
  • 1 : Sports

  • 2 : Finance

  • 3 : Entertainment

  • 4 : Automobile

  • 5 : Technology

Create text classification dataset: SogouNews

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import SogouNews
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = SogouNews(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = SogouNews(tokenizer=tokenizer)
>>> train, = SogouNews(tokenizer=tokenizer, data_select='train')

DBpedia

The DBpedia dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.DBpedia[source]
Defines DBpedia datasets.
The labels include:
  • 1 : Company

  • 2 : EducationalInstitution

  • 3 : Artist

  • 4 : Athlete

  • 5 : OfficeHolder

  • 6 : MeanOfTransportation

  • 7 : Building

  • 8 : NaturalPlace

  • 9 : Village

  • 10 : Animal

  • 11 : Plant

  • 12 : Album

  • 13 : Film

  • 14 : WrittenWork

Create text classification dataset: DBpedia

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import DBpedia
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = DBpedia(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = DBpedia(tokenizer=tokenizer)
>>> train, = DBpedia(tokenizer=tokenizer, data_select='train')

YelpReviewPolarity

The YelpReviewPolarity dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.YelpReviewPolarity[source]
Defines YelpReviewPolarity datasets.
The labels include:
  • 1 : Negative polarity.

  • 2 : Positive polarity.

Create text classification dataset: YelpReviewPolarity

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import YelpReviewPolarity
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = YelpReviewPolarity(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = YelpReviewPolarity(tokenizer=tokenizer)
>>> train, = YelpReviewPolarity(tokenizer=tokenizer, data_select='train')

YelpReviewFull

The YelpReviewFull dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.YelpReviewFull[source]
Defines YelpReviewFull datasets.
The labels include:

1 - 5 : rating classes (5 is highly recommended).

Create text classification dataset: YelpReviewFull

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import YelpReviewFull
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = YelpReviewFull(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = YelpReviewFull(tokenizer=tokenizer)
>>> train, = YelpReviewFull(tokenizer=tokenizer, data_select='train')

YahooAnswers

The YahooAnswers dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.YahooAnswers[source]
Defines YahooAnswers datasets.
The labels include:
  • 1 : Society & Culture

  • 2 : Science & Mathematics

  • 3 : Health

  • 4 : Education & Reference

  • 5 : Computers & Internet

  • 6 : Sports

  • 7 : Business & Finance

  • 8 : Entertainment & Music

  • 9 : Family & Relationships

  • 10 : Politics & Government

Create text classification dataset: YahooAnswers

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import YahooAnswers
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = YahooAnswers(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = YahooAnswers(tokenizer=tokenizer)
>>> train, = YahooAnswers(tokenizer=tokenizer, data_select='train')

AmazonReviewPolarity

The AmazonReviewPolarity dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.AmazonReviewPolarity[source]
Defines AmazonReviewPolarity datasets.
The labels include:
  • 1 : Negative polarity

  • 2 : Positive polarity

Create text classification dataset: AmazonReviewPolarity

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import AmazonReviewPolarity
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = AmazonReviewPolarity(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = AmazonReviewPolarity(tokenizer=tokenizer)
>>> train, = AmazonReviewPolarity(tokenizer=tokenizer, data_select='train')

AmazonReviewFull

The AmazonReviewFull dataset is a subclass of the TextClassificationDataset class.

class torchtext.experimental.datasets.AmazonReviewFull[source]
Defines AmazonReviewFull datasets.
The labels include:

1 - 5 : rating classes (5 is highly recommended)

Create text classification dataset: AmazonReviewFull

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • ngrams – a contiguous sequence of n items from a given string text. Default: 1

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import AmazonReviewFull
>>> from torchtext.data.utils import get_tokenizer
>>> train, test = AmazonReviewFull(ngrams=3)
>>> tokenizer = get_tokenizer("spacy")
>>> train, test = AmazonReviewFull(tokenizer=tokenizer)
>>> train, = AmazonReviewFull(tokenizer=tokenizer, data_select='train')

Language Modeling

Language modeling datasets are subclasses of the LanguageModelingDataset class.

class torchtext.experimental.datasets.LanguageModelingDataset(data, vocab, transforms, single_line)[source]

Defines a dataset for language modeling. Currently, we only support the following datasets:

  • WikiText2

  • WikiText103

  • PennTreebank

  • WMTNewsCrawl

__init__(data, vocab, transforms, single_line)[source]

Initiate language modeling dataset.

Parameters
  • data – a tensor of token ids, i.e. the ids obtained by numericalizing the string tokens, e.g. torch.tensor([token_id_1, token_id_2, token_id_3, token_id_4]).long()

  • vocab – Vocabulary object used for dataset.

  • transforms – Text string transforms.

  • single_line – whether to return all tokens in a single line (Default: True).

WikiText-2

class torchtext.experimental.datasets.WikiText2[source]

Defines WikiText2 datasets.

Create language modeling dataset: WikiText2. Separately returns the train/test/valid set

Parameters
  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’, ‘valid’)). By default, all three datasets (train, test, valid) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

  • single_line – whether to return all tokens in a single line (Default: True). By default, all lines in the raw text file are concatenated into a single line. Use single_line=False to get the data line by line.

Examples

>>> from torchtext.experimental.datasets import WikiText2
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = WikiText2(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = WikiText2(tokenizer=tokenizer, vocab=vocab,
                               data_select='valid')
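
For language modeling, the usual next step is to reshape a split's flat token stream into (seq_len, batch) form. A sketch, assuming you can materialize the split as one 1-D LongTensor of token ids; how to obtain that tensor depends on the torchtext version, and the comprehension below is only one hypothetical way.

import torch

def batchify(ids, batch_size):
    # ids: 1-D LongTensor with the token ids of a whole split.
    nbatch = ids.size(0) // batch_size
    ids = ids.narrow(0, 0, nbatch * batch_size)       # drop the ragged tail
    return ids.view(batch_size, -1).t().contiguous()  # shape: (seq_len, batch)

# With single_line=True the corpus is one long line of tokens.
train_ids = torch.tensor([int(tok) for tok in train_dataset]).long()
train_data = batchify(train_ids, batch_size=20)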

WikiText103

class torchtext.experimental.datasets.WikiText103[source]

Defines WikiText103 datasets.

Create language modeling dataset: WikiText103. Separately returns the train/test/valid set

Parameters
  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’, ‘valid’)). By default, all three datasets (train, test, valid) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

  • single_line – whether to return all tokens in a single line (Default: True). By default, all lines in the raw text file are concatenated into a single line. Use single_line=False to get the data line by line.

Examples

>>> from torchtext.experimental.datasets import WikiText103
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = WikiText103(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = WikiText103(tokenizer=tokenizer, vocab=vocab,
                                 data_select='valid')

PennTreebank

class torchtext.experimental.datasets.PennTreebank[source]

Defines PennTreebank datasets.

Create language modeling dataset: PennTreebank. Separately returns the train/test/valid set

Parameters
  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’, ‘valid’)). By default, all three datasets (train, test, valid) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

  • single_line – whether to return all tokens in a single line (Default: True). By default, all lines in the raw text file are concatenated into a single line. Use single_line=False to get the data line by line.

Examples

>>> from torchtext.experimental.datasets import PennTreebank
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, test_dataset, valid_dataset = PennTreebank(tokenizer=tokenizer)
>>> vocab = train_dataset.get_vocab()
>>> valid_dataset, = PennTreebank(tokenizer=tokenizer, vocab=vocab,
                                  data_select='valid')

WMTNewsCrawl

class torchtext.experimental.datasets.WMTNewsCrawl[source]

Defines WMTNewsCrawl datasets.

Create language modeling dataset: WMTNewsCrawl. Returns the train set

Parameters
  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well (see example below). A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’,))

  • single_line – whether to return all tokens in a single line (Default: True). By default, all lines in the raw text file are concatenated into a single line. Use single_line=False to get the data line by line.

Examples

>>> from torchtext.experimental.datasets import WMTNewsCrawl
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = get_tokenizer("spacy")
>>> train_dataset, = WMTNewsCrawl(tokenizer=tokenizer, data_select='train')

Machine Translation

Machine translation datasets are subclasses of the TranslationDataset class.

Multi30k

class torchtext.experimental.datasets.Multi30k[source]
Define translation datasets: Multi30k

Separately returns train/valid/test datasets as a tuple

Parameters
  • train_filenames – the source and target filenames for training. Default: (‘train.de’, ‘train.en’)

  • valid_filenames – the source and target filenames for valid. Default: (‘val.de’, ‘val.en’)

  • test_filenames – the source and target filenames for test. Default: (‘test2016.de’, ‘test2016.en’)

  • tokenizer – the tokenizer used to preprocess source and target raw text data. It has to be in the form of a tuple. Default: (get_tokenizer(“spacy”, language=’de_core_news_sm’), get_tokenizer(“spacy”, language=’en_core_web_sm’))

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Source and target Vocabulary objects used for dataset. If None, it will generate a new vocabulary based on the train data set. It has to be in the form of a tuple. Default: (None, None)

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘valid’, ‘test’)). By default, all three datasets (train, valid, test) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

  • removed_tokens – tokens removed from the output dataset (Default: ‘<unk>’)

The available datasets include: test_2016_flickr.cs test_2016_flickr.de test_2016_flickr.en test_2016_flickr.fr test_2017_flickr.de test_2017_flickr.en test_2017_flickr.fr test_2017_mscoco.de test_2017_mscoco.en test_2017_mscoco.fr test_2018_flickr.en train.cs train.de train.en train.fr val.cs val.de val.en val.fr test_2016.1.de test_2016.1.en test_2016.2.de test_2016.2.en test_2016.3.de test_2016.3.en test_2016.4.de test_2016.4.en test_2016.5.de test_2016.5.en train.1.de train.1.en train.2.de train.2.en train.3.de train.3.en train.4.de train.4.en train.5.de train.5.en val.1.de val.1.en val.2.de val.2.en val.3.de val.3.en val.4.de val.4.en val.5.de val.5.en

Examples

>>> from torchtext.experimental.datasets import Multi30k
>>> from torchtext.data.utils import get_tokenizer
>>> tokenizer = (get_tokenizer("spacy", language='de'),
                 get_tokenizer("basic_english"))
>>> train_dataset, valid_dataset, test_dataset = Multi30k(tokenizer=tokenizer)
>>> src_vocab, tgt_vocab = train_dataset.get_vocab()
>>> src_data, tgt_data = train_dataset[10]
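
A hedged sketch of batching these (src, tgt) pairs with padding; the padding index 0 is an assumption, so look up the actual pad token in src_vocab and tgt_vocab for your setup.

from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence

# Pad source and target sequences independently; shapes are (seq_len, batch).
def collate_pairs(batch):
    srcs = pad_sequence([src for src, _ in batch], padding_value=0)
    tgts = pad_sequence([tgt for _, tgt in batch], padding_value=0)
    return srcs, tgts

loader = DataLoader(train_dataset, batch_size=32, collate_fn=collate_pairs)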

IWSLT

class torchtext.experimental.datasets.IWSLT[source]
Define translation datasets: IWSLT

Separately returns train/valid/test datasets

Parameters
  • train_filenames – the source and target filenames for training. Default: (‘train.de-en.de’, ‘train.de-en.en’)

  • valid_filenames – the source and target filenames for valid. Default: (‘IWSLT16.TED.tst2013.de-en.de’, ‘IWSLT16.TED.tst2013.de-en.en’)

  • test_filenames – the source and target filenames for test. Default: (‘IWSLT16.TED.tst2014.de-en.de’, ‘IWSLT16.TED.tst2014.de-en.en’)

  • tokenizer – the tokenizer used to preprocess source and target raw text data. It has to be in the form of a tuple. Default: (get_tokenizer(“spacy”, language=’de_core_news_sm’), get_tokenizer(“spacy”, language=’en_core_web_sm’))

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Source and target Vocabulary objects used for dataset. If None, it will generate a new vocabulary based on the train data set. It has to be in the form of a tuple. Default: (None, None)

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘valid’, ‘test’)). By default, all three datasets (train, valid, test) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

  • removed_tokens – tokens removed from the output dataset (Default: ‘<unk>’)

The available datasets include: IWSLT16.TED.dev2010.ar-en.ar IWSLT16.TED.dev2010.ar-en.en IWSLT16.TED.dev2010.cs-en.cs IWSLT16.TED.dev2010.cs-en.en IWSLT16.TED.dev2010.de-en.de IWSLT16.TED.dev2010.de-en.en IWSLT16.TED.dev2010.en-ar.ar IWSLT16.TED.dev2010.en-ar.en IWSLT16.TED.dev2010.en-cs.cs IWSLT16.TED.dev2010.en-cs.en IWSLT16.TED.dev2010.en-de.de IWSLT16.TED.dev2010.en-de.en IWSLT16.TED.dev2010.en-fr.en IWSLT16.TED.dev2010.en-fr.fr IWSLT16.TED.dev2010.fr-en.en IWSLT16.TED.dev2010.fr-en.fr IWSLT16.TED.tst2010.ar-en.ar IWSLT16.TED.tst2010.ar-en.en IWSLT16.TED.tst2010.cs-en.cs IWSLT16.TED.tst2010.cs-en.en IWSLT16.TED.tst2010.de-en.de IWSLT16.TED.tst2010.de-en.en IWSLT16.TED.tst2010.en-ar.ar IWSLT16.TED.tst2010.en-ar.en IWSLT16.TED.tst2010.en-cs.cs IWSLT16.TED.tst2010.en-cs.en IWSLT16.TED.tst2010.en-de.de IWSLT16.TED.tst2010.en-de.en IWSLT16.TED.tst2010.en-fr.en IWSLT16.TED.tst2010.en-fr.fr IWSLT16.TED.tst2010.fr-en.en IWSLT16.TED.tst2010.fr-en.fr IWSLT16.TED.tst2011.ar-en.ar IWSLT16.TED.tst2011.ar-en.en IWSLT16.TED.tst2011.cs-en.cs IWSLT16.TED.tst2011.cs-en.en IWSLT16.TED.tst2011.de-en.de IWSLT16.TED.tst2011.de-en.en IWSLT16.TED.tst2011.en-ar.ar IWSLT16.TED.tst2011.en-ar.en IWSLT16.TED.tst2011.en-cs.cs IWSLT16.TED.tst2011.en-cs.en IWSLT16.TED.tst2011.en-de.de IWSLT16.TED.tst2011.en-de.en IWSLT16.TED.tst2011.en-fr.en IWSLT16.TED.tst2011.en-fr.fr IWSLT16.TED.tst2011.fr-en.en IWSLT16.TED.tst2011.fr-en.fr IWSLT16.TED.tst2012.ar-en.ar IWSLT16.TED.tst2012.ar-en.en IWSLT16.TED.tst2012.cs-en.cs IWSLT16.TED.tst2012.cs-en.en IWSLT16.TED.tst2012.de-en.de IWSLT16.TED.tst2012.de-en.en IWSLT16.TED.tst2012.en-ar.ar IWSLT16.TED.tst2012.en-ar.en IWSLT16.TED.tst2012.en-cs.cs IWSLT16.TED.tst2012.en-cs.en IWSLT16.TED.tst2012.en-de.de IWSLT16.TED.tst2012.en-de.en IWSLT16.TED.tst2012.en-fr.en IWSLT16.TED.tst2012.en-fr.fr IWSLT16.TED.tst2012.fr-en.en IWSLT16.TED.tst2012.fr-en.fr IWSLT16.TED.tst2013.ar-en.ar IWSLT16.TED.tst2013.ar-en.en IWSLT16.TED.tst2013.cs-en.cs IWSLT16.TED.tst2013.cs-en.en IWSLT16.TED.tst2013.de-en.de IWSLT16.TED.tst2013.de-en.en IWSLT16.TED.tst2013.en-ar.ar IWSLT16.TED.tst2013.en-ar.en IWSLT16.TED.tst2013.en-cs.cs IWSLT16.TED.tst2013.en-cs.en IWSLT16.TED.tst2013.en-de.de IWSLT16.TED.tst2013.en-de.en IWSLT16.TED.tst2013.en-fr.en IWSLT16.TED.tst2013.en-fr.fr IWSLT16.TED.tst2013.fr-en.en IWSLT16.TED.tst2013.fr-en.fr IWSLT16.TED.tst2014.ar-en.ar IWSLT16.TED.tst2014.ar-en.en IWSLT16.TED.tst2014.de-en.de IWSLT16.TED.tst2014.de-en.en IWSLT16.TED.tst2014.en-ar.ar IWSLT16.TED.tst2014.en-ar.en IWSLT16.TED.tst2014.en-de.de IWSLT16.TED.tst2014.en-de.en IWSLT16.TED.tst2014.en-fr.en IWSLT16.TED.tst2014.en-fr.fr IWSLT16.TED.tst2014.fr-en.en IWSLT16.TED.tst2014.fr-en.fr IWSLT16.TEDX.dev2012.de-en.de IWSLT16.TEDX.dev2012.de-en.en IWSLT16.TEDX.tst2013.de-en.de IWSLT16.TEDX.tst2013.de-en.en IWSLT16.TEDX.tst2014.de-en.de IWSLT16.TEDX.tst2014.de-en.en train.ar train.ar-en.ar train.ar-en.en train.cs train.cs-en.cs train.cs-en.en train.de train.de-en.de train.de-en.en train.en train.en-ar.ar train.en-ar.en train.en-cs.cs train.en-cs.en train.en-de.de train.en-de.en train.en-fr.en train.en-fr.fr train.fr train.fr-en.en train.fr-en.fr train.tags.ar-en.ar train.tags.ar-en.en train.tags.cs-en.cs train.tags.cs-en.en train.tags.de-en.de train.tags.de-en.en train.tags.en-ar.ar train.tags.en-ar.en train.tags.en-cs.cs train.tags.en-cs.en train.tags.en-de.de train.tags.en-de.en train.tags.en-fr.en train.tags.en-fr.fr train.tags.fr-en.en train.tags.fr-en.fr

Examples

>>> from torchtext.experimental.datasets import IWSLT
>>> from torchtext.data.utils import get_tokenizer
>>> src_tokenizer = get_tokenizer("spacy", language='de')
>>> tgt_tokenizer = get_tokenizer("basic_english")
>>> train_dataset, valid_dataset, test_dataset = IWSLT(tokenizer=(src_tokenizer,
                                                                  tgt_tokenizer))
>>> src_vocab, tgt_vocab = train_dataset.get_vocab()
>>> src_data, tgt_data = train_dataset[10]

WMT14

class torchtext.experimental.datasets.WMT14[source]
Define translation datasets: WMT14

Separately returns train/valid/test datasets. The available datasets include:

newstest2016.en newstest2016.de newstest2015.en newstest2015.de newstest2014.en newstest2014.de newstest2013.en newstest2013.de newstest2012.en newstest2012.de newstest2011.tok.de newstest2011.en newstest2011.de newstest2010.tok.de newstest2010.en newstest2010.de newstest2009.tok.de newstest2009.en newstest2009.de newstest2016.tok.de newstest2015.tok.de newstest2014.tok.de newstest2013.tok.de newstest2012.tok.de newstest2010.tok.en newstest2009.tok.en newstest2015.tok.en newstest2014.tok.en newstest2013.tok.en newstest2012.tok.en newstest2011.tok.en newstest2016.tok.en newstest2009.tok.bpe.32000.en newstest2011.tok.bpe.32000.en newstest2010.tok.bpe.32000.en newstest2013.tok.bpe.32000.en newstest2012.tok.bpe.32000.en newstest2015.tok.bpe.32000.en newstest2014.tok.bpe.32000.en newstest2016.tok.bpe.32000.en train.tok.clean.bpe.32000.en newstest2009.tok.bpe.32000.de newstest2010.tok.bpe.32000.de newstest2011.tok.bpe.32000.de newstest2013.tok.bpe.32000.de newstest2012.tok.bpe.32000.de newstest2014.tok.bpe.32000.de newstest2016.tok.bpe.32000.de newstest2015.tok.bpe.32000.de train.tok.clean.bpe.32000.de

Parameters
  • train_filenames – the source and target filenames for training. Default: (‘train.tok.clean.bpe.32000.de’, ‘train.tok.clean.bpe.32000.en’)

  • valid_filenames – the source and target filenames for valid. Default: (‘newstest2013.tok.bpe.32000.de’, ‘newstest2013.tok.bpe.32000.en’)

  • test_filenames – the source and target filenames for test. Default: (‘newstest2014.tok.bpe.32000.de’, ‘newstest2014.tok.bpe.32000.en’)

  • tokenizer – the tokenizer used to preprocess source and target raw text data. It has to be in the form of a tuple. Default: (get_tokenizer(“spacy”, language=’de_core_news_sm’), get_tokenizer(“spacy”, language=’en_core_web_sm’))

  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Source and target Vocabulary objects used for dataset. If None, it will generate a new vocabulary based on the train data set. It has to be in the form of a tuple. Default: (None, None)

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘valid’, ‘test’)). By default, all three datasets (train, valid, test) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

  • removed_tokens – tokens removed from the output dataset (Default: ‘<unk>’)

Examples

>>> from torchtext.experimental.datasets import WMT14
>>> from torchtext.data.utils import get_tokenizer
>>> src_tokenizer = get_tokenizer("spacy", language='de')
>>> tgt_tokenizer = get_tokenizer("basic_english")
>>> train_dataset, valid_dataset, test_dataset = WMT14(tokenizer=(src_tokenizer,
                                                                  tgt_tokenizer))
>>> src_vocab, tgt_vocab = train_dataset.get_vocab()
>>> src_data, tgt_data = train_dataset[10]

Sequence Tagging

Sequence tagging datasets are subclasses of the SequenceTaggingDataset class.

UDPOS

class torchtext.experimental.datasets.UDPOS[source]

Universal Dependencies English Web Treebank

Separately returns the training, validation, and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • vocabs – A list of vocabularies, one for each column in the dataset. Must be an instance of List. Default: None

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘valid’, ‘test’)). By default, all three datasets (train, valid, test) are generated. Users could also choose any one or two of them, for example (‘train’, ‘test’) or just the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the valid and/or test data.

Examples

>>> from torchtext.experimental.datasets import UDPOS
>>> train_dataset, valid_dataset, test_dataset = UDPOS()
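
The per-column layout is easiest to confirm by inspecting a sample. A small sketch, assuming each item is a tuple of numericalized tensors, one per annotation column (the exact columns vary, so check before relying on them):

# Unpack the word column and whatever tag columns the dataset carries.
words, *tag_columns = train_dataset[0]
print(words.shape, [t.shape for t in tag_columns])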

CoNLL2000Chunking

class torchtext.experimental.datasets.CoNLL2000Chunking[source]

CoNLL 2000 Chunking Dataset

Separately returns the training and test dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • vocabs – A list of vocabularies, one for each column in the dataset. Must be an instance of List. Default: None

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘test’)). By default, both datasets (train and test) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the test data.

Examples

>>> from torchtext.experimental.datasets import CoNLL2000Chunking
>>> train_dataset, test_dataset = CoNLL2000Chunking()

Question Answer

Question answer datasets are subclasses of the QuestionAnswerDataset class.

SQuAD 1.0

class torchtext.experimental.datasets.SQuAD1[source]

Defines SQuAD1 datasets.

Create question answer dataset: SQuAD1

Separately returns the train and dev dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘dev’)). By default, both datasets (train and dev) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the dev data.

Examples

>>> from torchtext.experimental.datasets import SQuAD1
>>> from torchtext.data.utils import get_tokenizer
>>> train, dev = SQuAD1()
>>> tokenizer = get_tokenizer("spacy")
>>> train, dev = SQuAD1(tokenizer=tokenizer)
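
The field order of a SQuAD1 sample is version-dependent; this sketch only inspects one item, assuming it is a tuple of numericalized fields (e.g. context, question, answers, answer positions):

# Print the type and, where available, the shape of each field.
sample = train[0]
for field in sample:
    print(type(field), getattr(field, 'shape', None))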

SQuAD 2.0

class torchtext.experimental.datasets.SQuAD2[source]

Defines SQuAD2 datasets.

Create question answer dataset: SQuAD2

Separately returns the train and dev dataset

Parameters
  • root – Directory where the datasets are saved. Default: “.data”

  • vocab – Vocabulary used for dataset. If None, it will generate a new vocabulary based on the train data set.

  • tokenizer – the tokenizer used to preprocess raw text data. The default one is the basic_english tokenizer in fastText. The spacy tokenizer is supported as well. A custom tokenizer is a callable function that takes a string as input and returns a list of tokens.

  • data_select – a string or tuple for the returned datasets (Default: (‘train’, ‘dev’)). By default, both datasets (train and dev) are generated. Users could also choose just one of them, for example the string ‘train’. If ‘train’ is not in the tuple or string, a vocab object should be provided, which will be used to process the dev data.

Examples

>>> from torchtext.experimental.datasets import SQuAD2
>>> from torchtext.data.utils import get_tokenizer
>>> train, dev = SQuAD2()
>>> tokenizer = get_tokenizer("spacy")
>>> train, dev = SQuAD2(tokenizer=tokenizer)
