torchtext.vocab

Vocab

class torchtext.vocab.Vocab(vocab)[source]

__contains__(token: str) → bool[source]

Parameters: token – The token whose membership is checked.
Returns: Whether the token is a member of the vocab.

__getitem__(token: str) → int[source]

Parameters: token – The token used to look up the corresponding index.
Returns: The index corresponding to the given token.

__init__(vocab) → None[source]: Initialize internal Module state, shared by both nn.Module and ScriptModule.

__jit_unused_properties__ = ['is_jitable']

Creates a vocab object which maps tokens to indices.

Parameters: vocab (torch.classes.torchtext.Vocab or torchtext._torchtext.Vocab) – A C++ vocab object.

__len__() → int[source]

Returns: The length of the vocab.

__prepare_scriptable__()[source]: Return a JITable Vocab.

append_token(token: str) → None[source]

Parameters: token – The token to append to the end of the vocab.
Raises: RuntimeError – If token already exists in the vocab.

forward(tokens: List[str]) → List[int][source]

Calls the lookup_indices method.

Parameters: tokens – A list of tokens used to look up their corresponding indices.
Returns: The indices associated with the list of tokens.

get_default_index() → Optional[int][source]

Returns: The value of the default index if it is set, otherwise None.

get_itos() → List[str][source]

Returns: List mapping indices to tokens.

get_stoi() → Dict[str, int][source]

Returns: Dictionary mapping tokens to indices.

insert_token(token: str, index: int) → None[source]

Parameters:

token – The token to insert into the vocab.
index – The index at which to insert the token.

Raises:

RuntimeError – If index is not in range [0, Vocab.size()] or if token already exists in the vocab.

lookup_indices(tokens: List[str]) → List[int][source]

Parameters: tokens – The tokens used to look up their corresponding indices.
Returns: The indices associated with tokens.

lookup_token(index: int) → str[source]

Parameters: index – The index of the token to look up.
Returns: The token stored at the given index.
Return type: str
Raises: RuntimeError – If index is not in range [0, itos.size()).

lookup_tokens(indices: List[int]) → List[str][source]

Parameters: indices – The indices used to look up their corresponding tokens.
Returns: The tokens associated with indices.
Raises: RuntimeError – If an index within indices is not in range [0, itos.size()).

set_default_index(index: Optional[int]) → None[source]

Parameters: index – Value of the default index. This index is returned when an out-of-vocabulary (OOV) token is queried.
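
The lookup, insertion, and default-index rules documented for this class can be sketched in plain Python. The block below is a minimal illustrative model, not torchtext code: ToyVocab and its itos attribute are hypothetical names, and the assumption that insert_token shifts later tokens up by one position is not stated on this page.

```python
from typing import List, Optional


class ToyVocab:
    """Illustrative stand-in for the documented Vocab behaviour (not torchtext code)."""

    def __init__(self, tokens: List[str]) -> None:
        self.itos: List[str] = list(tokens)
        self.default_index: Optional[int] = None

    def __len__(self) -> int:
        return len(self.itos)

    def __contains__(self, token: str) -> bool:
        return token in self.itos

    def __getitem__(self, token: str) -> int:
        if token in self.itos:
            return self.itos.index(token)
        if self.default_index is None:
            # Mirrors the documented RuntimeError for OOV lookups without a default.
            raise RuntimeError(f"Token {token!r} not found and default index is not set")
        return self.default_index

    def set_default_index(self, index: Optional[int]) -> None:
        self.default_index = index

    def append_token(self, token: str) -> None:
        if token in self.itos:
            raise RuntimeError(f"Token {token!r} already exists in the vocab")
        self.itos.append(token)

    def insert_token(self, token: str, index: int) -> None:
        # The valid range is [0, len(vocab)], so inserting at the very end is allowed.
        if not 0 <= index <= len(self.itos):
            raise RuntimeError(f"Index {index} is out of range")
        if token in self.itos:
            raise RuntimeError(f"Token {token!r} already exists in the vocab")
        # Assumption: tokens at or after `index` shift up by one position.
        self.itos.insert(index, token)


v = ToyVocab(["a", "b"])
v.insert_token("<unk>", 0)       # "<unk>" takes index 0; "a" and "b" move up
v.append_token("c")              # "c" is added at the end
v.set_default_index(v["<unk>"])
oov_index = v["not in vocab"]    # falls back to the default index
```

Pointing the default index at the unknown token, as in the last two lines, is the same pattern the vocab factory example on this page uses.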

vocab

torchtext.vocab.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) → Vocab[source]

Factory method for creating a vocab object which maps tokens to indices.

Note that the order in which key-value pairs were inserted into ordered_dict is respected when building the vocab. If indices should follow token frequency, build ordered_dict from tokens already sorted by frequency.

Parameters:

ordered_dict – Ordered dictionary mapping tokens to their corresponding occurrence frequencies.
min_freq – The minimum frequency needed to include a token in the vocabulary.
specials – Special symbols to add. The order of supplied tokens will be preserved.
special_first – Whether to insert the special symbols at the beginning (True) or at the end (False) of the vocab.

Returns:

A Vocab object

Return type:

torchtext.vocab.Vocab

Examples

>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a']) #prints 1
>>> print(v1['out of vocab']) #raises RuntimeError since default index is not set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> #adding <unk> token and default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>']) #prints 0
>>> print(v2['out of vocab']) #prints -1
>>> #make default index same as index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> print(v2['out of vocab'] == v2[unk_token]) #prints True
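
How min_freq, specials, and special_first combine to decide index order can be sketched without torchtext. The helper below is a hypothetical illustration of the documented parameters, not the library's implementation. It assumes that a special symbol which also appears in ordered_dict is listed only once.

```python
from collections import OrderedDict
from typing import Dict, List, Optional


def sketch_vocab_order(
    ordered_dict: Dict[str, int],
    min_freq: int = 1,
    specials: Optional[List[str]] = None,
    special_first: bool = True,
) -> List[str]:
    """Return tokens in index order, following the documented parameters."""
    specials = specials or []
    # Keep tokens that meet min_freq, preserving insertion order.
    # Assumption: a special that also appears in ordered_dict is not repeated.
    tokens = [
        tok for tok, freq in ordered_dict.items()
        if freq >= min_freq and tok not in specials
    ]
    return specials + tokens if special_first else tokens + specials


freqs = OrderedDict([("b", 3), ("a", 2), ("c", 1)])
front = sketch_vocab_order(freqs, min_freq=2, specials=["<unk>"])
back = sketch_vocab_order(freqs, min_freq=2, specials=["<unk>"], special_first=False)
print(front)  # ['<unk>', 'b', 'a']
print(back)   # ['b', 'a', '<unk>']
```

The position of each token in the returned list corresponds to the index it would receive, so "c" is dropped by min_freq=2 and the special symbol lands at index 0 or at the last index.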

build_vocab_from_iterator

torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → Vocab[source]

Build a Vocab from an iterator.

Parameters:

iterator – Iterator used to build Vocab. Must yield a list or iterator of tokens.
min_freq – The minimum frequency needed to include a token in the vocabulary.
specials – Special symbols to add. The order of supplied tokens will be preserved.
special_first – Whether to insert the special symbols at the beginning (True) or at the end (False) of the vocab.
max_tokens – If provided, creates the vocab from the max_tokens - len(specials) most frequent tokens.

Returns:

A Vocab object

Return type:

torchtext.vocab.Vocab

Examples

>>> # generating vocab from a text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
...     with io.open(file_path, encoding='utf-8') as f:
...         for line in f:
...             yield line.strip().split()
>>> # file_path is assumed to be the path to a UTF-8 text file
>>> vocab = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
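
The counting step and the max_tokens budget can be illustrated with the standard library alone. This is a rough sketch under stated assumptions rather than torchtext's code: sketch_count_tokens and num_specials are hypothetical names, and breaking frequency ties alphabetically is an assumption made here to keep the result deterministic.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def sketch_count_tokens(
    iterator: Iterable[Iterable[str]],
    num_specials: int = 0,
    max_tokens: Optional[int] = None,
) -> List[Tuple[str, int]]:
    """Count tokens and keep the most frequent ones within the max_tokens budget."""
    counter: Counter = Counter()
    for tokens in iterator:
        counter.update(tokens)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ranked = sorted(counter.items(), key=lambda item: (-item[1], item[0]))
    if max_tokens is not None:
        if num_specials >= max_tokens:
            raise ValueError("max_tokens leaves no room for regular tokens")
        # Specials count against the budget, as the max_tokens description states.
        ranked = ranked[: max_tokens - num_specials]
    return ranked


lines = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
ranked = sketch_count_tokens(lines, num_specials=1, max_tokens=3)
print(ranked)  # [('the', 3), ('sat', 2)]
```

With one special symbol and max_tokens=3, only the two most frequent regular tokens survive.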

Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)[source]

__init__(name, cache=None, url=None, unk_init=None, max_vectors=None) → None[source]

Parameters:

name – name of the file that contains the vectors
cache – directory for cached vectors
url – url for download if vectors not found in cache
unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size
max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.

get_vecs_by_tokens(tokens, lower_case_backup=False)[source]

Look up embedding vectors of tokens.

Parameters:

tokens – A token or a list of tokens. If tokens is a string, a 1-D tensor of shape self.dim is returned; if tokens is a list of strings, a 2-D tensor of shape (len(tokens), self.dim) is returned.
lower_case_backup – Whether to fall back to the lower-case form of a token. If False, each token is looked up in its original case only; if True, each token is first looked up in its original case and, if it is not found among the keys of the stoi property, looked up again in lower case. Default: False.

Examples

>>> from torchtext.vocab import GloVe
>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)
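
The lower_case_backup fallback can be pictured with plain dictionaries and lists in place of tensors. The function below is a hypothetical illustration of the documented lookup order, not the Vectors implementation. The table contents are made-up values, and unknown tokens get a zero vector to match the documented unk_init default.

```python
from typing import Dict, List


def sketch_lookup(
    table: Dict[str, List[float]],
    tokens: List[str],
    dim: int,
    lower_case_backup: bool = False,
) -> List[List[float]]:
    """Return one vector per token, optionally retrying in lower case."""
    rows = []
    for token in tokens:
        key = token
        # Try the original case first; only fall back when it is missing.
        if lower_case_backup and token not in table:
            key = token.lower()
        # Unknown tokens get a zero vector, like the documented unk_init default.
        rows.append(table.get(key, [0.0] * dim))
    return rows


table = {"chip": [0.1, 0.2], "beautiful": [0.3, 0.4]}  # made-up 2-D vectors
strict = sketch_lookup(table, ["chip", "Beautiful"], dim=2)
relaxed = sketch_lookup(table, ["chip", "Beautiful"], dim=2, lower_case_backup=True)
print(strict)   # [[0.1, 0.2], [0.0, 0.0]]
print(relaxed)  # [[0.1, 0.2], [0.3, 0.4]]
```

Without the fallback, "Beautiful" is treated as unknown; with it, the lower-case entry is found.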

Pretrained Word Embeddings

GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)[source]

FastText

class torchtext.vocab.FastText(language='en', **kwargs)[source]

CharNGram

class torchtext.vocab.CharNGram(**kwargs)[source]
