torchtext.vocab
Vocab
class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=('<unk>', '<pad>'), vectors=None, unk_init=None, vectors_cache=None, specials_first=True)
Defines a vocabulary object that will be used to numericalize a field.
- Variables
freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.
stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.
itos – A list of token strings indexed by their numerical identifiers.
__init__(counter, max_size=None, min_freq=1, specials=('<unk>', '<pad>'), vectors=None, unk_init=None, vectors_cache=None, specials_first=True)
Create a Vocab object from a collections.Counter.
- Parameters
counter – collections.Counter object holding the frequencies of each value found in the data.
max_size – The maximum size of the vocabulary, or None for no maximum. Default: None.
min_freq – The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.
specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary. Default: ['<unk>', '<pad>'].
vectors – One of either the available pretrained vectors or custom pretrained vectors (see Vocab.load_vectors); or a list of aforementioned vectors
unk_init (callable) – By default, out-of-vocabulary word vectors are initialized to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.zeros.
vectors_cache – Directory for cached vectors. Default: '.vector_cache'.
specials_first – Whether to place the special tokens at the beginning of the vocabulary. If False, they are appended at the end instead. Default: True.
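As a usage sketch (the token counts below are illustrative, not from any particular dataset):
>>> from collections import Counter
>>> from torchtext.vocab import Vocab
>>> counter = Counter({'hello': 4, 'world': 3, 'rare': 1})
>>> v = Vocab(counter, min_freq=2, specials=('<unk>', '<pad>'))
>>> v.itos              # specials first, then tokens in decreasing frequency order
['<unk>', '<pad>', 'hello', 'world']
>>> v.stoi['world']
3
>>> 'rare' in v.stoi    # dropped: its frequency is below min_freq
False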
load_vectors(vectors, **kwargs)
- Parameters
vectors – One of, or a list containing instantiations of, the GloVe, CharNGram, or Vectors classes. Alternatively, one of, or a list of, the available pretrained vector names:
charngram.100d
fasttext.en.300d
fasttext.simple.300d
glove.42B.300d
glove.840B.300d
glove.twitter.27B.25d
glove.twitter.27B.50d
glove.twitter.27B.100d
glove.twitter.27B.200d
glove.6B.50d
glove.6B.100d
glove.6B.200d
glove.6B.300d
Remaining keyword arguments – Passed to the constructor of the Vectors classes.
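A minimal sketch of loading pretrained vectors into a Vocab (the small Counter is illustrative; the first call downloads and caches the named vectors):
>>> from collections import Counter
>>> from torchtext.vocab import Vocab, GloVe
>>> v = Vocab(Counter({'hello': 4, 'world': 3}))
>>> v.load_vectors('glove.6B.50d')               # by pretrained name
>>> v.load_vectors(GloVe(name='6B', dim=50))     # or pass an instantiated Vectors object
>>> v.vectors.shape                              # one row per entry in itos
torch.Size([4, 50])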
set_vectors(stoi, vectors, dim, unk_init=torch.Tensor.zero_)
Set the vectors for the Vocab instance from a collection of Tensors.
- Parameters
stoi – A dictionary of string to the index of the associated vector in the vectors input argument.
vectors – An indexed iterable (or other structure supporting __getitem__) that, given an input index, returns a FloatTensor representing the vector for the token associated with that index. For example, vectors[stoi["string"]] should return the vector for "string".
dim – The dimensionality of the vectors.
unk_init (callable) – By default, out-of-vocabulary word vectors are initialized to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.zeros.
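A sketch of wiring a hand-built embedding table into a Vocab via set_vectors (my_stoi and my_vectors are illustrative names, not part of the API):
>>> import torch
>>> from collections import Counter
>>> from torchtext.vocab import Vocab
>>> v = Vocab(Counter({'hello': 4, 'world': 3}))
>>> my_stoi = {'hello': 0, 'world': 1}     # maps tokens to rows of my_vectors
>>> my_vectors = torch.randn(2, 8)         # illustrative 8-dimensional embeddings
>>> v.set_vectors(my_stoi, my_vectors, dim=8)
>>> v.vectors.shape                        # one row per token in v.itos; tokens missing from my_stoi are zeroed by unk_init
torch.Size([4, 8])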
SubwordVocab
class torchtext.vocab.SubwordVocab(counter, max_size=None, specials='<pad>', vectors=None, unk_init=torch.Tensor.zero_)
__init__(counter, max_size=None, specials='<pad>', vectors=None, unk_init=torch.Tensor.zero_)
Create a revtok subword vocabulary from a collections.Counter.
- Parameters
counter – collections.Counter object holding the frequencies of each word found in the data.
max_size – The maximum size of the subword vocabulary, or None for no maximum. Default: None.
specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token.
vectors – One of either the available pretrained vectors or custom pretrained vectors (see Vocab.load_vectors); or a list of aforementioned vectors
unk_init (callable) – By default, out-of-vocabulary word vectors are initialized to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size. Default: torch.zeros.
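A minimal construction sketch, assuming the optional revtok package is installed (the Counter contents are illustrative):
>>> from collections import Counter
>>> from torchtext.vocab import SubwordVocab
>>> counter = Counter({'lower': 2, 'lowest': 1, 'newer': 3})
>>> sv = SubwordVocab(counter, max_size=100)   # caps the subword vocabulary at 100 entries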
Vectors
class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)
__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)
- Parameters
name – name of the file that contains the vectors
cache – directory for cached vectors
url – url for download if vectors not found in cache
unk_init (callable) – By default, out-of-vocabulary word vectors are initialized to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size.
max_vectors (int) – Limits the number of pretrained vectors loaded. Most pretrained vector sets are sorted in descending order of word frequency, so when the entire set does not fit in memory, or is not needed for another reason, passing max_vectors limits the size of the loaded set.
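A sketch of loading custom vectors from a local file; the file name below is hypothetical, and the file is assumed to hold one "token value value ..." line per word:
>>> from torchtext.vocab import Vectors
>>> vecs = Vectors(name='my_embeddings.vec', cache='.vector_cache', max_vectors=50000)
>>> hello_vec = vecs['hello']   # indexing returns the token's vector, or an unk_init-initialized one for OOV tokens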
get_vecs_by_tokens(tokens, lower_case_backup=False)
Look up embedding vectors of tokens.
- Parameters
tokens – A token or a list of tokens. If tokens is a string, returns a 1-D tensor of shape self.dim; if tokens is a list of strings, returns a 2-D tensor of shape (len(tokens), self.dim).
lower_case_backup – Whether to fall back to the lower-cased token. If False, each token is looked up in its original case only; if True, each token is first looked up in its original case and, if not found among the keys of the stoi property, looked up again in lower case. Default: False.
Examples
>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = torchtext.vocab.GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)
Pretrained Word Embeddings
build_vocab_from_iterator
torchtext.vocab.build_vocab_from_iterator(iterator, num_lines=None)
Build a Vocab from an iterator.
- Parameters
iterator – Iterator used to build the Vocab. Must yield a list or an iterator of tokens.
num_lines – The expected number of elements returned by the iterator. If known, passing it allows more accurate progress reporting. Default: None.
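A minimal sketch, assuming a small in-memory iterable of token lists:
>>> from torchtext.vocab import build_vocab_from_iterator
>>> lines = [['hello', 'world'], ['hello', 'torchtext']]
>>> vocab = build_vocab_from_iterator(lines, num_lines=len(lines))
>>> vocab.stoi['hello']   # indices 0 and 1 are taken by the default '<unk>' and '<pad>' specials
2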