torchtext.data¶
The data module provides the following:
Ability to define a preprocessing pipeline
Batching, padding, and numericalizing (including building a vocabulary object)
Wrapper for dataset splits (train, validation, test)
Loader for a custom NLP dataset
Dataset, Batch, and Example¶
Dataset¶
class torchtext.data.Dataset(examples, fields, filter_pred=None)
Defines a dataset composed of Examples along with its Fields.
- Variables
sort_key (callable) – A key to use for sorting dataset examples so that examples with similar lengths are batched together, minimizing padding.
examples (list(Example)) – The examples in this dataset.
fields (dict[str, Field]) – Contains the name of each column or field, together with the corresponding Field object. Two fields with the same Field object will have a shared vocabulary.
__init__(examples, fields, filter_pred=None)
Create a dataset from a list of Examples and Fields.
- Parameters
examples – List of Examples.
fields (list(tuple(str, Field))) – The Fields to use in this dataset. The string is a field name, and the Field is the associated field.
filter_pred (callable or None) – Use only examples for which filter_pred(example) is True, or use all examples if None. Default is None.
classmethod download(root, check=None)
Download and unzip an online archive (.zip, .gz, or .tgz).
filter_examples(field_names)
Remove unknown words from dataset examples with respect to the given fields.
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
Create train/test (and optionally validation) splits from the instance's examples.
- Parameters
split_ratio (float or list of floats) – a number in [0, 1] denoting the fraction of the data to use for the training split (the rest is used for test), or a list of numbers denoting the relative sizes of the train, test, and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
stratified (bool) – whether the sampling should be stratified. Default is False.
strata_field (str) – name of the examples' Field to stratify over. Default is ‘label’ for the conventional label field.
random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().
- Returns
Datasets for train, validation, and test splits in that order, if the splits are provided.
- Return type
Tuple[Dataset]
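For instance, a minimal sketch of splitting a small in-memory dataset; the field names, example sentences, and labels below are purely illustrative, not part of the API:

import random
from torchtext.data import Dataset, Example, Field

# Hypothetical fields for a tiny sentiment dataset.
TEXT = Field(sequential=True, lower=True)
LABEL = Field(sequential=False, use_vocab=False)
fields = [('text', TEXT), ('label', LABEL)]

examples = [Example.fromlist([s, l], fields)
            for s, l in [('a great movie', 1), ('truly awful', 0),
                         ('not bad at all', 1), ('boring and long', 0)]]
dataset = Dataset(examples, fields)

# A float ratio yields (train, test); a list of ratios would yield
# (train, valid, test) instead.
train_set, test_set = dataset.split(split_ratio=0.75,
                                    random_state=random.getstate())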
classmethod splits(path=None, root='.data', train=None, validation=None, test=None, **kwargs)
Create Dataset objects for multiple splits of a dataset.
- Parameters
path (str) – Common prefix of the splits’ file paths, or None to use the result of cls.download(root).
root (str) – Root dataset storage directory. Default is ‘.data’.
train (str) – Suffix to add to path for the train set, or None for no train set. Default is None.
validation (str) – Suffix to add to path for the validation set, or None for no validation set. Default is None.
test (str) – Suffix to add to path for the test set, or None for no test set. Default is None.
Remaining keyword arguments – Passed to the constructor of the Dataset (sub)class being used.
- Returns
Datasets for train, validation, and test splits in that order, if provided.
- Return type
Tuple[Dataset]
TabularDataset¶
class torchtext.data.TabularDataset(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
Defines a Dataset of columns stored in CSV, TSV, or JSON format.
__init__(path, format, fields, skip_header=False, csv_reader_params={}, **kwargs)
Create a TabularDataset given a path, file format, and field list.
- Parameters
path (str) – Path to the data file.
format (str) – The format of the data file. One of “CSV”, “TSV”, or “JSON” (case-insensitive).
fields (list(tuple(str, Field)) or dict[str, tuple(str, Field)]) – If using a list, the format must be CSV or TSV, and the values of the list should be tuples of (name, field). The fields should be in the same order as the columns in the CSV or TSV file, while tuples of (name, None) represent columns that will be ignored.
If using a dict, the keys should be a subset of the JSON keys or CSV/TSV columns, and the values should be tuples of (name, field). Keys not present in the input dictionary are ignored. This allows the user to rename columns from their JSON/CSV/TSV key names and also enables selecting a subset of columns to load.
skip_header (bool) – Whether to skip the first line of the input file.
csv_reader_params (dict) – Parameters to pass to the csv reader. Only relevant when format is csv or tsv. See https://docs.python.org/3/library/csv.html#csv.reader for more details.
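As an illustration, a sketch of loading CSV splits; the directory, file names, column layout, and field names here are assumptions for the example, not part of torchtext:

from torchtext.data import Field, TabularDataset

TEXT = Field(sequential=True, tokenize=str.split, lower=True)
LABEL = Field(sequential=False, use_vocab=True)

# Columns are assumed to appear in the order: id, text, label.
# A (name, None) tuple skips the id column entirely.
fields = [('id', None), ('text', TEXT), ('label', LABEL)]

train_data, valid_data = TabularDataset.splits(
    path='data',                         # hypothetical directory
    train='train.csv', validation='valid.csv',
    format='csv', fields=fields, skip_header=True)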
Batch¶
class torchtext.data.Batch(data=None, dataset=None, device=None)
Defines a batch of examples along with its Fields.
- Variables
batch_size – Number of examples in the batch.
dataset – A reference to the dataset object the examples come from (which itself contains the dataset’s Field objects).
train – Deprecated: this attribute is kept for backwards compatibility, but it is UNUSED as of the merger with PyTorch 0.4.
input_fields – The names of the fields that are used as input for the model.
target_fields – The names of the fields that are used as targets during model training.
Also stores the Variable for each column in the batch as an attribute.
Fields¶
RawField¶
class torchtext.data.RawField(preprocessing=None, postprocessing=None, is_target=False)
Defines a general datatype.
Every dataset consists of one or more types of data. For instance, a text classification dataset contains sentences and their classes, while a machine translation dataset contains paired examples of text in two languages. Each of these types of data is represented by a RawField object. A RawField object does not assume any property of the data type and it holds parameters relating to how a datatype should be processed.
- Variables
preprocessing – The Pipeline that will be applied to examples using this field before creating an example. Default: None.
postprocessing – A Pipeline that will be applied to a list of examples using this field before assigning to a batch. Function signature: (batch(list)) -> object. Default: None.
is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
__init__(preprocessing=None, postprocessing=None, is_target=False)
Initialize self. See help(type(self)) for accurate signature.
Field¶
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.
The Field class models common text-processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
- Variables
sequential – Whether the datatype represents sequential data. If False, no tokenization is applied. Default: True.
use_vocab – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
lower – Whether to lowercase the text in this field. Default: False.
tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
tokenizer_language – The language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
batch_first – Whether to produce tensors with the batch dimension first. Default: False.
pad_token – The string token used as padding. Default: “<pad>”.
unk_token – The string token used to represent OOV words. Default: “<unk>”.
pad_first – Whether to pad the sequence at the beginning. Default: False.
truncate_first – Whether to truncate the sequence at the beginning. Default: False.
stop_words – Tokens to discard during the preprocessing step. Default: None.
is_target – Whether this field is a target variable. Affects iteration over batches. Default: False.
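A short sketch of a typical Field configuration; the tokenizer choice and the special-token strings are illustrative choices, not required values:

from torchtext.data import Field

# Word-level field: tokenized, lowercased, with sentence boundary tokens,
# returning (padded_batch, lengths) because include_lengths=True.
TEXT = Field(sequential=True, tokenize=str.split, lower=True,
             init_token='<sos>', eos_token='<eos>',
             include_lengths=True, batch_first=False)

# Label field: not sequential, so no tokenization or padding is applied.
LABEL = Field(sequential=False, use_vocab=True, unk_token=None, is_target=True)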
__init__(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='<pad>', unk_token='<unk>', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Initialize self. See help(type(self)) for accurate signature.
build_vocab(*args, **kwargs)
Construct the Vocab object for this field from one or more datasets.
- Parameters
Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for this field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
Remaining keyword arguments – Passed to the constructor of Vocab.
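For example, a sketch of building vocabularies from a training split; min_freq and max_size are forwarded to the Vocab constructor, and train_data is assumed to be a Dataset whose 'text' and 'label' columns are bound to TEXT and LABEL (see the TabularDataset sketch above):

TEXT.build_vocab(train_data, max_size=25000, min_freq=2)
LABEL.build_vocab(train_data)

print(len(TEXT.vocab))             # vocabulary size, including special tokens
print(TEXT.vocab.stoi['<pad>'])    # string-to-index lookup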
numericalize(arr, device=None)
Turn a batch of examples that use this field into a Variable.
If the field has include_lengths=True, a tensor of lengths will be included in the return value.
- Parameters
arr (List[List[str]], or tuple of (List[List[str]], List[int])) – List of tokenized and padded examples, or tuple of List of tokenized and padded examples and List of lengths of each example if self.include_lengths is True.
device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
pad(minibatch)
Pad a batch of examples using this field.
Pads to self.fix_length if provided, otherwise pads to the length of the longest example in the batch. Prepends self.init_token and appends self.eos_token if those attributes are not None. Returns a tuple of the padded list and a list containing lengths of each example if self.include_lengths is True and self.sequential is True, else just returns the padded list. If self.sequential is False, no padding is applied.
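A small, self-contained sketch of pad and numericalize chained by hand (process, below, performs the same two steps in one call); the tokens are arbitrary:

from torchtext.data import Field

TEXT = Field(include_lengths=True, pad_token='<pad>')
TEXT.build_vocab([['the', 'cat', 'sat'], ['a', 'dog']])

minibatch = [['the', 'cat', 'sat'], ['a', 'dog']]
padded, lengths = TEXT.pad(minibatch)   # include_lengths=True -> (padded, lengths)
# padded == [['the', 'cat', 'sat'], ['a', 'dog', '<pad>']], lengths == [3, 2]

tensor, length_tensor = TEXT.numericalize((padded, lengths))
# tensor has shape (max_len, batch_size) since batch_first=False by default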
preprocess(x)
Load a single example using this field, tokenizing if necessary.
If sequential=True, the input will be tokenized. Then the input will be optionally lowercased and passed to the user-provided preprocessing Pipeline.
process(batch, device=None)
Process a list of examples to create a torch.Tensor.
Pad, numericalize, and postprocess a batch and create a tensor.
vocab_cls
alias of torchtext.vocab.Vocab
ReversibleField¶
SubwordField¶
class torchtext.data.SubwordField(**kwargs)
segment(*args)
Segment one or more datasets with this subword field.
- Parameters
Positional arguments – Dataset objects or other indexable mutable sequences to segment. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
vocab_cls
alias of torchtext.vocab.SubwordVocab
NestedField¶
class torchtext.data.NestedField(nesting_field, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, tokenize=None, tokenizer_language='en', include_lengths=False, pad_token='<pad>', pad_first=False, truncate_first=False)
A nested field.
A nested field holds another field (called the nesting field), accepts an untokenized string or a list of string tokens, and groups and treats them as one field as described by the nesting field. Every token will be preprocessed, padded, etc. in the manner specified by the nesting field. Note that this means a nested field always has sequential=True. The two fields’ vocabularies will be shared. Their numericalization results will be stacked into a single tensor. NestedField also shares include_lengths with nesting_field, so one shouldn’t specify include_lengths in the nesting_field. This field is primarily used to implement character embeddings. See tests/data/test_field.py for examples on how to use this field.
- Parameters
nesting_field (Field) – A field contained in this nested field.
use_vocab (bool) – Whether to use a Vocab object. If False, the data in this field should already be numerical. Default: True.
init_token (str) – A token that will be prepended to every example using this field, or None for no initial token. Default: None.
eos_token (str) – A token that will be appended to every example using this field, or None for no end-of-sentence token. Default: None.
fix_length (int) – A fixed length that all examples using this field will be padded to, or None for flexible sequence lengths. Default: None.
dtype – The torch.dtype class that represents a batch of examples of this kind of data. Default: torch.long.
preprocessing (Pipeline) – The Pipeline that will be applied to examples using this field after tokenizing but before numericalizing. Many Datasets replace this attribute with a custom preprocessor. Default: None.
postprocessing (Pipeline) – A Pipeline that will be applied to examples using this field after numericalizing but before the numbers are turned into a Tensor. The pipeline function takes the batch as a list, and the field’s Vocab. Default: None.
include_lengths – Whether to return a tuple of a padded minibatch and a list containing the lengths of each example, or just a padded minibatch. Default: False.
tokenize – The function used to tokenize strings using this field into sequential examples. If “spacy”, the SpaCy tokenizer is used. If a non-serializable function is passed as an argument, the field will not be able to be serialized. Default: string.split.
tokenizer_language – The language of the tokenizer to be constructed. Various languages are currently supported only in SpaCy.
pad_token (str) – The string token used as padding. If nesting_field is sequential, this will be set to its pad_token. Default: "<pad>".
pad_first (bool) – Whether to pad the sequence at the beginning. Default: False.
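A minimal sketch of the character-embedding use case mentioned above; the names and the tiny in-memory data source are illustrative assumptions:

from torchtext.data import Field, NestedField

# Inner field handles the characters of each word; the outer NestedField
# splits a sentence into words and delegates each word to the inner field.
CHAR_NESTING = Field(tokenize=list, pad_token='<c>')
CHARS = NestedField(CHAR_NESTING, init_token='<s>', eos_token='</s>')

# The vocab is built over characters. In practice the data source would
# normally be a Dataset column rather than this hand-written list.
CHARS.build_vocab([[list('john'), list('loves'), list('mary')]])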
__init__(nesting_field, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, tokenize=None, tokenizer_language='en', include_lengths=False, pad_token='<pad>', pad_first=False, truncate_first=False)
Initialize self. See help(type(self)) for accurate signature.
build_vocab(*args, **kwargs)
Construct the Vocab object for the nesting field and combine it with this field’s vocab.
- Parameters
Positional arguments – Dataset objects or other iterable data sources from which to construct the Vocab object that represents the set of possible values for the nesting field. If a Dataset object is provided, all columns corresponding to this field are used; individual columns can also be provided directly.
Remaining keyword arguments – Passed to the constructor of Vocab.
numericalize(arrs, device=None)
Convert a padded minibatch into a variable tensor.
Each item in the minibatch will be numericalized independently and the resulting tensors will be stacked at the first dimension.
pad(minibatch)
Pad a batch of examples using this field.
If self.nesting_field.sequential is False, each example in the batch must be a list of string tokens, and pads them as if by a Field with sequential=True. Otherwise, each example must be a list of list of tokens. Using self.nesting_field, pads the list of tokens to self.nesting_field.fix_length if provided, or otherwise to the length of the longest list of tokens in the batch. Next, using this field, pads the result by filling short examples with self.nesting_field.pad_token.
Example
>>> import pprint
>>> pp = pprint.PrettyPrinter(indent=4)
>>>
>>> nesting_field = Field(pad_token='<c>', init_token='<w>', eos_token='</w>')
>>> field = NestedField(nesting_field, init_token='<s>', eos_token='</s>')
>>> minibatch = [
...     [list('john'), list('loves'), list('mary')],
...     [list('mary'), list('cries')],
... ]
>>> padded = field.pad(minibatch)
>>> pp.pprint(padded)
[   [   ['<w>', '<s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
        ['<w>', 'j', 'o', 'h', 'n', '</w>', '<c>'],
        ['<w>', 'l', 'o', 'v', 'e', 's', '</w>'],
        ['<w>', 'm', 'a', 'r', 'y', '</w>', '<c>'],
        ['<w>', '</s>', '</w>', '<c>', '<c>', '<c>', '<c>']],
    [   ['<w>', '<s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
        ['<w>', 'm', 'a', 'r', 'y', '</w>', '<c>'],
        ['<w>', 'c', 'r', 'i', 'e', 's', '</w>'],
        ['<w>', '</s>', '</w>', '<c>', '<c>', '<c>', '<c>'],
        ['<c>', '<c>', '<c>', '<c>', '<c>', '<c>', '<c>']]]
preprocess(xs)
Preprocess a single example.
First, tokenization and the supplied preprocessing pipeline are applied. Since this field is always sequential, the result is a list. Then, each element of the list is preprocessed using self.nesting_field.preprocess and the resulting list is returned.
Iterators¶
Iterator¶
class torchtext.data.Iterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
Defines an iterator that loads batches of data from a Dataset.
- Variables
dataset – The Dataset object to load Examples from.
batch_size – Batch size.
batch_size_fn – Function of three arguments (new example to add, current count of examples in the batch, and current effective batch size) that returns the new effective batch size resulting from adding that example to a batch. This is useful for dynamic batching, where the function would add the number of tokens in the new example to the current effective batch size; see the sketch after this list.
sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. The sort_key provided to the Iterator constructor overrides the sort_key attribute of the Dataset, or defers to it if None.
train – Whether the iterator represents a train set.
repeat – Whether to repeat the iterator for multiple epochs. Default: False.
shuffle – Whether to shuffle examples between epochs.
sort – Whether to sort examples according to self.sort_key. Note that shuffle and sort default to train and (not train).
sort_within_batch – Whether to sort (in descending order according to self.sort_key) within each batch. If None, defaults to self.sort. If self.sort is True and this is False, the batch is left in the original (ascending) sorted order.
device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
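As a sketch of dynamic token-based batching via batch_size_fn; train_data and the 'text' attribute name are assumptions about the dataset's fields, not part of the API:

from torchtext.data import Iterator

def by_tokens(new_example, count, size_so_far):
    # Treat batch_size as a token budget: the effective batch size grows by
    # the number of tokens in each newly added example.
    return size_so_far + len(new_example.text)

train_iter = Iterator(train_data, batch_size=1000,   # roughly 1000 tokens per batch
                      batch_size_fn=by_tokens,
                      sort_key=lambda ex: len(ex.text),
                      train=True, sort_within_batch=True)

for batch in train_iter:
    text = batch.text   # tensor (or (tensor, lengths) if the field uses include_lengths)
    break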
__init__(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
Initialize self. See help(type(self)) for accurate signature.
classmethod splits(datasets, batch_sizes=None, **kwargs)
Create Iterator objects for multiple splits of a dataset.
- Parameters
datasets – Tuple of Dataset objects corresponding to the splits. The first such object should be the train set.
batch_sizes – Tuple of batch sizes to use for the different splits, or None to use the same batch_size for all splits.
Remaining keyword arguments – Passed to the constructor of the iterator class being used.
BucketIterator¶
class torchtext.data.BucketIterator(dataset, batch_size, sort_key=None, device=None, batch_size_fn=None, train=True, repeat=False, shuffle=None, sort=None, sort_within_batch=None)
Defines an iterator that batches examples of similar lengths together.
Minimizes the amount of padding needed while producing freshly shuffled batches for each new epoch. See pool for the bucketing procedure used.
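A sketch of the usual setup; the dataset variables and the 'text' field name are assumptions carried over from the earlier examples:

from torchtext.data import BucketIterator

train_iter, valid_iter = BucketIterator.splits(
    (train_data, valid_data),
    batch_sizes=(32, 64),
    sort_key=lambda ex: len(ex.text),   # bucket by text length
    sort_within_batch=True,             # e.g. needed for packed sequences
    device='cpu')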
BPTTIterator¶
class torchtext.data.BPTTIterator(dataset, batch_size, bptt_len, **kwargs)
Defines an iterator for language modeling tasks that use BPTT.
Provides contiguous streams of examples together with targets that are one timestep further forward, for language modeling training with backpropagation through time (BPTT). Expects a Dataset with a single example and a single field called ‘text’ and produces Batches with text and target attributes.
- Variables
dataset – The Dataset object to load Examples from.
batch_size – Batch size.
bptt_len – Length of sequences for backpropagation through time.
sort_key – A key to use for sorting examples in order to batch together examples with similar lengths and minimize padding. The sort_key provided to the Iterator constructor overrides the sort_key attribute of the Dataset, or defers to it if None.
train – Whether the iterator represents a train set.
repeat – Whether to repeat the iterator for multiple epochs. Default: False.
shuffle – Whether to shuffle examples between epochs.
sort – Whether to sort examples according to self.sort_key. Note that shuffle and sort default to train and (not train).
device (str or torch.device) – A string or instance of torch.device specifying which device the Variables are going to be created on. If left as default, the tensors will be created on cpu. Default: None.
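A hedged sketch of BPTT batching for language modeling; the use of torchtext.datasets.LanguageModelingDataset and the corpus file name are assumptions about how the single-example 'text' dataset was built:

from torchtext.data import Field, BPTTIterator
from torchtext.datasets import LanguageModelingDataset

TEXT = Field(lower=True)
# One long 'text' example, as BPTTIterator expects.
corpus = LanguageModelingDataset(path='corpus.txt', text_field=TEXT)  # hypothetical file
TEXT.build_vocab(corpus)

lm_iter = BPTTIterator(corpus, batch_size=32, bptt_len=30)
for batch in lm_iter:
    inputs, targets = batch.text, batch.target   # target is text shifted one step forward
    break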
Pipeline¶
Pipeline¶
class torchtext.data.Pipeline(convert_token=None)
Defines a pipeline for transforming sequence data.
The input is assumed to be utf-8 encoded str.
- Variables
convert_token – The function to apply to input sequence data.
pipes – The Pipelines that will be applied to input sequence data in order.
__init__(convert_token=None)
Create a pipeline.
- Parameters
convert_token – The function to apply to input sequence data. If None, the identity function is used. Default: None
add_after(pipeline)
Add a Pipeline to be applied after this processing pipeline.
- Parameters
pipeline – The Pipeline or callable to apply after this Pipeline.
add_before(pipeline)
Add a Pipeline to be applied before this processing pipeline.
- Parameters
pipeline – The Pipeline or callable to apply before this Pipeline.
call(x, *args)
Apply only the convert_token function of the current pipeline to the input. If the input is a list, a list with the results of applying the convert_token function to all input elements is returned.
- Parameters
x – The input to apply the convert_token function to.
Positional arguments – Forwarded to the convert_token function of the current Pipeline.
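A brief sketch of composing Pipelines; the token functions here are arbitrary examples:

from torchtext.data import Pipeline

lowercase = Pipeline(convert_token=str.lower)
strip_punct = Pipeline(convert_token=lambda tok: tok.strip('.,!?'))

# Calling the pipeline applies every stage in order, element-wise on lists.
pipe = lowercase.add_after(strip_punct)
print(pipe(['Hello,', 'World!']))   # ['hello', 'world']

# call() applies only this pipeline's own convert_token, ignoring added stages.
print(lowercase.call('HELLO'))      # 'hello'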
Functions¶
batch¶
pool¶
torchtext.data.pool(data, batch_size, key, batch_size_fn=<function <lambda>>, random_shuffler=None, shuffle=False, sort_within_batch=False)
Sort within buckets, then batch, then shuffle batches.
Partitions data into chunks of size 100*batch_size, sorts examples within each chunk using sort_key, then batches these examples and shuffles the batches.
get_tokenizer¶
torchtext.data.get_tokenizer(tokenizer, language='en')
Generate tokenizer function for a string sentence.
- Parameters
tokenizer – the name of the tokenizer function. If None, it returns the split() function, which splits the string sentence by space. If basic_english, it returns the _basic_english_normalize() function, which normalizes the string first and then splits it by space. If a callable function, it will return the function. If a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), it returns the corresponding tokenizer from that library.
language – Default: en.
Examples
>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
interleave_keys¶
torchtext.data.interleave_keys(a, b)
Interleave bits from two sort keys to form a joint sort key.
Examples that are similar in both of the provided keys will have similar values for the key defined by this function. Useful for tasks with two text fields like machine translation or natural language inference.
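For example, a sketch of using it as a sort_key for a translation dataset; the 'src' and 'trg' attribute names and train_data are assumptions about the example fields:

from torchtext.data import BucketIterator, interleave_keys

# Examples close in both source and target length land in the same bucket,
# keeping padding low on both sides of a translation batch.
sort_key = lambda ex: interleave_keys(len(ex.src), len(ex.trg))

train_iter = BucketIterator(train_data, batch_size=64, sort_key=sort_key)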