torchtext.vocab

Vocab

class torchtext.vocab.Vocab(counter, max_size=None, min_freq=1, specials=('<unk>', '<pad>'), vectors=None, unk_init=None, vectors_cache=None, specials_first=True)

Defines a vocabulary object that will be used to numericalize a field.

Variables
• Vocab.freqs – A collections.Counter object holding the frequencies of tokens in the data used to build the Vocab.

• Vocab.stoi – A collections.defaultdict instance mapping token strings to numerical identifiers.

• Vocab.itos – A list of token strings indexed by their numerical identifiers.

__init__(counter, max_size=None, min_freq=1, specials=('<unk>', '<pad>'), vectors=None, unk_init=None, vectors_cache=None, specials_first=True)

Create a Vocab object from a collections.Counter.

Parameters
• counter – collections.Counter object holding the frequencies of each value found in the data.

• max_size – The maximum size of the vocabulary, or None for no maximum. Default: None.

• min_freq – The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.

• specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary. Default: ('<unk>', '<pad>').

• vectors – One of the available pretrained vectors, a set of custom pretrained vectors (see Vocab.load_vectors), or a list of such vectors.

• unk_init (callback) – A function that takes in a Tensor and returns a Tensor of the same size, used to initialize out-of-vocabulary word vectors. Default: None, in which case out-of-vocabulary word vectors are initialized to zero vectors.

• vectors_cache – directory for cached vectors. Default: ‘.vector_cache’

• specials_first – Whether to place the special tokens at the beginning of the vocabulary. If False, they are appended at the end. Default: True.
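The interaction of min_freq, max_size, specials, and specials_first can be illustrated with a minimal pure-Python sketch. This is not torchtext's implementation: the helper name build_index is hypothetical, and both the ordering (descending frequency, ties broken alphabetically) and the choice not to count the special tokens against max_size are assumptions made for the illustration.

```python
from collections import Counter

def build_index(counter, max_size=None, min_freq=1,
                specials=('<unk>', '<pad>'), specials_first=True):
    """Hypothetical helper: return (itos, stoi) for a token Counter."""
    min_freq = max(min_freq, 1)  # values below 1 are treated as 1
    # Drop specials from the counts so they are not added twice.
    counts = {tok: n for tok, n in counter.items() if tok not in specials}
    # Assumed ordering: most frequent first, ties broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = [tok for tok, n in ordered if n >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]  # assumption: specials do not count toward max_size
    itos = list(specials) + kept if specials_first else kept + list(specials)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

counter = Counter("the cat sat on the mat the end".split())
itos, stoi = build_index(counter, max_size=3, min_freq=1)
```

With specials_first=True the special tokens occupy the lowest indices, so the first counted token starts at index len(specials).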

load_vectors(vectors, **kwargs)
Parameters
• vectors – One of, or a list containing, instantiations of the GloVe, CharNGram, or Vectors classes. Alternatively, one of, or a list of, the available pretrained vectors.

• Remaining keyword arguments – Passed to the constructor of Vectors classes.

set_vectors(stoi, vectors, dim, unk_init=torch.Tensor.zero_)

Set the vectors for the Vocab instance from a collection of Tensors.

Parameters
• stoi – A dictionary of string to the index of the associated vector in the vectors input argument.

• vectors – An indexed iterable (or other structure supporting __getitem__) that, given an input index, returns a FloatTensor representing the vector for the token associated with that index. For example, vectors[stoi["string"]] should return the vector for "string".

• dim – The dimensionality of the vectors.

• unk_init (callback) – A function that takes in a Tensor and returns a Tensor of the same size, used to initialize out-of-vocabulary word vectors. By default they are initialized to zero vectors. Default: torch.Tensor.zero_
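The roles of stoi, vectors, dim, and unk_init described above can be sketched without torch. The function below is a hypothetical stand-in, not the library's code: it uses lists of floats in place of Tensors, takes the vocabulary's own token list as an explicit itos argument, and returns a new table rather than mutating its input.

```python
def set_vectors_sketch(itos, stoi, vectors, dim, unk_init=None):
    """Hypothetical stand-in for set_vectors that uses lists of floats, not Tensors."""
    def zeros(row):
        return [0.0] * len(row)

    init = zeros if unk_init is None else unk_init
    table = []
    for token in itos:
        idx = stoi.get(token)
        if idx is None:
            # Token has no pretrained vector: fall back to unk_init.
            table.append(init([0.0] * dim))
        else:
            table.append(list(vectors[idx]))
    return table

vocab_itos = ['<unk>', 'cat', 'dog']       # the vocabulary's own ordering
pretrained_stoi = {'dog': 0, 'cat': 1}     # ordering of the pretrained vectors
pretrained = [[0.1, 0.2], [0.3, 0.4]]
table = set_vectors_sketch(vocab_itos, pretrained_stoi, pretrained, dim=2)
```

Note that stoi here indexes into the vectors argument, which need not share the vocabulary's ordering; the result is aligned with the vocabulary's itos.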

SubwordVocab

class torchtext.vocab.SubwordVocab(counter, max_size=None, specials='<pad>', vectors=None, unk_init=torch.Tensor.zero_)
__init__(counter, max_size=None, specials='<pad>', vectors=None, unk_init=torch.Tensor.zero_)

Create a revtok subword vocabulary from a collections.Counter.

Parameters
• counter – collections.Counter object holding the frequencies of each word found in the data.

• max_size – The maximum size of the subword vocabulary, or None for no maximum. Default: None.

• specials – The list of special tokens (e.g., padding or eos) that will be prepended to the vocabulary in addition to an <unk> token.

• vectors – One of either the available pretrained vectors or custom pretrained vectors (see Vocab.load_vectors); or a list of aforementioned vectors

• unk_init (callback) – A function that takes in a Tensor and returns a Tensor of the same size, used to initialize out-of-vocabulary word vectors. By default they are initialized to zero vectors. Default: torch.Tensor.zero_

Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)
__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)
Parameters
• name – name of the file that contains the vectors

• cache – directory for cached vectors

• unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size

• max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
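The effect of max_vectors can be sketched as an early stop while reading a vector file. The reader below is hypothetical and is not the loader used by Vectors; it assumes a plain-text format with one "token v1 v2 ..." entry per line, sorted by descending word frequency.

```python
import io

def read_vectors(lines, max_vectors=None):
    """Hypothetical reader for 'token v1 v2 ...' lines; stops after max_vectors entries."""
    itos, vectors = [], []
    for line in lines:
        if max_vectors is not None and len(itos) >= max_vectors:
            break  # the remaining (rarer) tokens are never parsed
        parts = line.rstrip().split(' ')
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        itos.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return itos, vectors

# In-memory stand-in for a vector file sorted by descending frequency.
sample = io.StringIO("the 0.1 0.2\nof 0.3 0.4\nzygote 0.5 0.6\n")
itos, vectors = read_vectors(sample, max_vectors=2)
```

Because the file is assumed to be frequency-sorted, truncating it this way keeps the most common tokens and discards the rare ones.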

get_vecs_by_tokens(tokens, lower_case_backup=False)

Look up embedding vectors of tokens.

Parameters
• tokens – A token or a list of tokens. If tokens is a string, a 1-D tensor of shape self.dim is returned; if tokens is a list of strings, a 2-D tensor of shape (len(tokens), self.dim) is returned.

• lower_case_backup – Whether to fall back to the lowercase form of a token. If False, each token is looked up in its original case only. If True, each token is looked up in its original case first and, if it is not found among the keys of the stoi property, looked up again in lowercase. Default: False.

Examples

>>> import torchtext
>>> tokens = ['chip', 'baby', 'Beautiful']
>>> vec = torchtext.vocab.GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(tokens, lower_case_backup=True)
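The lower_case_backup rule can be isolated in a few lines that need no pretrained download. The helper lookup is hypothetical and works on a plain dict; returning None for an unknown token is an assumption of this sketch, not the behavior of get_vecs_by_tokens.

```python
def lookup(stoi, token, lower_case_backup=False):
    """Hypothetical helper: return the index for token, or None if it is unknown."""
    if token in stoi:
        return stoi[token]  # the original case always takes priority
    if lower_case_backup and token.lower() in stoi:
        return stoi[token.lower()]  # fall back to the lowercase form
    return None

stoi = {'chip': 0, 'baby': 1, 'beautiful': 2}
strict = lookup(stoi, 'Beautiful')
relaxed = lookup(stoi, 'Beautiful', lower_case_backup=True)
```

Here the strict lookup misses because only the lowercase key exists, while the relaxed lookup resolves to the entry for 'beautiful'.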


GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)

FastText

class torchtext.vocab.FastText(language='en', **kwargs)

CharNGram

class torchtext.vocab.CharNGram(**kwargs)

build_vocab_from_iterator

torchtext.vocab.build_vocab_from_iterator(iterator, num_lines=None)

Build a Vocab from an iterator.

Parameters
• iterator – Iterator used to build the Vocab. Must yield a list or an iterator of tokens.

• num_lines – The expected number of elements returned by the iterator. If known, it can be passed to this factory function for improved progress reporting. Default: None.
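The kind of input this function expects can be shown with the standard library alone. The sketch below only illustrates the counting step, under the assumption that token frequencies are accumulated into a collections.Counter; the helper name count_tokens is hypothetical and progress reporting is omitted.

```python
from collections import Counter

def count_tokens(iterator):
    """Hypothetical helper: accumulate token frequencies from an iterator of token lists."""
    counter = Counter()
    for tokens in iterator:
        counter.update(tokens)  # each element is a list (or iterator) of tokens
    return counter

lines = ["the cat sat", "the dog sat"]
# A generator that yields one list of tokens per line.
counter = count_tokens(line.split() for line in lines)
```

The resulting Counter has the shape that the Vocab constructor takes as its counter argument.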