torchtext.vocab

Vocab

class torchtext.vocab.Vocab(vocab)[source]
__contains__(token: str) → bool[source]
Parameters

token – The token for which to check the membership.

Returns

Whether the token is a member of the vocab.

__getitem__(token: str) → int[source]
Parameters

token – The token used to look up the corresponding index.

Returns

The index corresponding to the associated token.

__init__(vocab)[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

__len__() → int[source]
Returns

The length of the vocab.

__prepare_scriptable__()[source]

Return a JITable Vocab.

append_token(token: str) → None[source]
Parameters

token – The token to be appended to the vocab.

Raises

RuntimeError – If token already exists in the vocab

forward(tokens: List[str]) → List[int][source]

Calls the lookup_indices method.

Parameters

tokens – A list of tokens used to look up their corresponding indices.

Returns

The indices associated with a list of tokens.

get_default_index() → Optional[int][source]
Returns

Value of default index if it is set.

get_itos() → List[str][source]
Returns

List mapping indices to tokens.

get_stoi() → Dict[str, int][source]
Returns

Dictionary mapping tokens to indices.

insert_token(token: str, index: int) → None[source]
Parameters
  • token – The token to insert into the vocab.

  • index – The index at which to insert the token.

Raises

RuntimeError – If index is not in range [0, Vocab.size()] or if token already exists in the vocab.

lookup_indices(tokens: List[str]) → List[int][source]
Parameters

tokens – The tokens used to look up their corresponding indices.

Returns

The indices associated with the tokens.

lookup_token(index: int) → str[source]
Parameters

index – The index corresponding to the associated token.

Returns

The token corresponding to the given index.

Return type

token

Raises

RuntimeError – If index is not in range [0, itos.size()).

lookup_tokens(indices: List[int]) → List[str][source]
Parameters

indices – The indices used to look up their corresponding tokens.

Returns

The tokens associated with indices.

Raises

RuntimeError – If an index within indices is not in range [0, itos.size()).

set_default_index(index: Optional[int]) → None[source]
Parameters

index – Value of default index. This index will be returned when OOV token is queried.

vocab

torchtext.vocab.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) → torchtext.vocab.vocab.Vocab[source]

Factory method for creating a vocab object which maps tokens to indices.

Note that the ordering in which key-value pairs were inserted into the ordered_dict is respected when building the vocab. Therefore, if sorting by token frequency is important, the ordered_dict should be created in a way that reflects this.

Parameters
  • ordered_dict – Ordered dictionary mapping tokens to their corresponding occurrence frequencies.

  • min_freq – The minimum frequency needed to include a token in the vocabulary.

  • specials – Special symbols to add. The order of supplied tokens will be preserved.

  • special_first – Indicates whether to insert symbols at the beginning or at the end.

Returns

A Vocab object

Return type

torchtext.vocab.Vocab

Examples

>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a']) #prints 1
>>> print(v1['out of vocab']) #raises RuntimeError since default index is not set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> #adding <unk> token and default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>']) #prints 0
>>> print(v2['out of vocab']) #prints -1
>>> #make default index same as index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> v2['out of vocab'] == v2[unk_token] #prints True

build_vocab_from_iterator

torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → torchtext.vocab.vocab.Vocab[source]

Build a Vocab from an iterator.

Parameters
  • iterator – Iterator used to build the Vocab. Must yield lists or iterators of tokens.

  • min_freq – The minimum frequency needed to include a token in the vocabulary.

  • specials – Special symbols to add. The order of supplied tokens will be preserved.

  • special_first – Indicates whether to insert symbols at the beginning or at the end.

  • max_tokens – If provided, creates the vocab from the max_tokens - len(specials) most frequent tokens.

Returns

A Vocab object

Return type

torchtext.vocab.Vocab

Examples

>>> #generating vocab from text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
>>>     with io.open(file_path, encoding='utf-8') as f:
>>>         for line in f:
>>>             yield line.strip().split()
>>> vocab = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])

Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)[source]
__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)[source]
Parameters
  • name – name of the file that contains the vectors

  • cache – directory for cached vectors

  • url – url for download if vectors not found in cache

  • unk_init (callback) – by default, initialize out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size

  • max_vectors (int) – this can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in the descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.

get_vecs_by_tokens(tokens, lower_case_backup=False)[source]

Look up embedding vectors of tokens.

Parameters
  • tokens – A token or a list of tokens. If tokens is a string, returns a 1-D tensor of shape self.dim; if tokens is a list of strings, returns a 2-D tensor of shape (len(tokens), self.dim).

  • lower_case_backup – Whether to fall back to a lower-case lookup. If False, each token is looked up in its original case only; if True, each token is looked up in its original case first and, if not found among the keys of the stoi property, looked up again in lower case. Default: False.

Examples

>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = torchtext.vocab.GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)

Pretrained Word Embeddings

GloVe

class torchtext.vocab.GloVe(name='840B', dim=300, **kwargs)[source]

FastText

class torchtext.vocab.FastText(language='en', **kwargs)[source]

CharNGram

class torchtext.vocab.CharNGram(**kwargs)[source]
