torchtext.experimental.vocab

Vocab

class torchtext.experimental.vocab.Vocab(ordered_dict, min_freq=1, unk_token='<unk>')[source]

Creates a vocab object which maps tokens to indices.

Note that the order in which key-value pairs were inserted into the ordered_dict is respected when building the vocab. If sorting by token frequency matters, the ordered_dict should therefore be created in frequency order. Additionally, if the unk_token isn’t found in the ordered_dict, it is added to the end of the vocab.

Parameters
  • ordered_dict (collections.OrderedDict) – object holding the frequencies of each token found in the data.

  • min_freq – The minimum frequency needed to include a token in the vocabulary. Values less than 1 will be set to 1. Default: 1.

  • unk_token – The default unknown token to use. Default: ‘<unk>’.

Raises

ValueError – if a default unk_token isn’t provided.

Examples

>>> from torchtext.experimental.vocab import Vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = Vocab(ordered_dict)
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> v2 = Vocab(OrderedDict([(token, 1) for token in tokens]))
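The ordering rule described above can be modelled without torchtext. The following is a pure-Python sketch, not the library's implementation: `build_itos` is a hypothetical helper, and it assumes that tokens below `min_freq` are dropped before the `unk_token` check is made.

```python
from collections import Counter, OrderedDict


def build_itos(ordered_dict, min_freq=1, unk_token="<unk>"):
    """Hypothetical helper: build the index-to-token list the docs describe."""
    if not unk_token:
        raise ValueError("A default unk_token must be provided.")
    min_freq = max(min_freq, 1)  # values below 1 are treated as 1
    # Keep insertion order and drop rare tokens (assumed order of steps).
    itos = [tok for tok, freq in ordered_dict.items() if freq >= min_freq]
    if unk_token not in itos:
        itos.append(unk_token)  # added at the end, as the description states
    return itos


counter = Counter(["a", "a", "b", "b", "b"])
by_freq = OrderedDict(sorted(counter.items(), key=lambda x: x[1], reverse=True))
itos_all = build_itos(by_freq)
itos_min3 = build_itos(by_freq, min_freq=3)
print(itos_all)
print(itos_min3)
```

Because the caller controls the insertion order, the same helper yields a frequency-sorted or an arbitrary vocabulary depending only on how the OrderedDict was built.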
__getitem__(token: str) → int[source]
Parameters

token (str) – the token used to look up the corresponding index.

Returns

the index corresponding to the associated token.

Return type

index (int)
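A minimal sketch of token-to-index lookup follows. `TinyVocab` is a hypothetical stand-in rather than the torchtext class, and it assumes, as the `unk_token` parameter suggests but this page does not state outright, that a token missing from the vocab resolves to the index of `unk_token`.

```python
class TinyVocab:
    """Hypothetical stand-in used only to illustrate token-to-index lookup."""

    def __init__(self, tokens, unk_token="<unk>"):
        self.itos = list(tokens)
        if unk_token not in self.itos:
            self.itos.append(unk_token)
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}
        self.unk_index = self.stoi[unk_token]

    def __getitem__(self, token: str) -> int:
        # Assumed behaviour: unknown tokens fall back to the unk index.
        return self.stoi.get(token, self.unk_index)

    def __len__(self) -> int:
        return len(self.itos)


v = TinyVocab(["e", "d", "c", "b", "a"])
known = v["c"]
unknown = v["zzz"]
print(known, unknown, len(v))
```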

__init__(ordered_dict, min_freq=1, unk_token='<unk>')[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

__len__() → int[source]

Returns

the length of the vocab.

Return type

length (int)

append_token(token: str) → None[source]
Parameters

token (str) – the token to append to the vocab.

get_itos() → List[str][source]
Returns

list mapping indices to tokens.

Return type

itos (List[str])

get_stoi() → Dict[str, int][source]
Returns

dictionary mapping tokens to indices.

Return type

stoi (Dict[str, int])

insert_token(token: str, index: int) → None[source]
Parameters
  • token (str) – the token to insert into the vocab.

  • index (int) – the index at which to insert the token.

Raises

RuntimeError – if index is not in the range [0, Vocab.size()] or if token already exists in the vocab.
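The two error conditions above can be illustrated with a plain list. This `insert_token` function is a hypothetical stand-in that operates on a Python list, not the library method, and it assumes that inserting a token shifts every later token up by one index.

```python
from typing import List


def insert_token(itos: List[str], token: str, index: int) -> None:
    """Hypothetical stand-in mirroring the documented error conditions."""
    if index < 0 or index > len(itos):
        raise RuntimeError(f"index {index} is outside [0, {len(itos)}]")
    if token in itos:
        raise RuntimeError(f"token {token!r} already exists in the vocab")
    itos.insert(index, token)  # assumed: later tokens shift up by one


itos = ["b", "a", "<unk>"]
insert_token(itos, "<pad>", 0)
stoi = {tok: i for i, tok in enumerate(itos)}
print(itos)
```

Note that the upper bound is inclusive here, so inserting at `len(itos)` behaves like an append.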

lookup_indices(tokens: List[str]) → List[int][source]
Parameters

tokens (List[str]) – the tokens used to look up their corresponding indices.

Returns

the indices associated with tokens.

Return type

indices (List[int])

lookup_token(index: int) → str[source]
Parameters

index (int) – the index used to look up the corresponding token.

Returns

the token corresponding to the given index.

Return type

token (str)

Raises

RuntimeError – if index is not in the range [0, itos.size()].

lookup_tokens(indices: List[int]) → List[str][source]
Parameters

indices (List[int]) – the indices used to look up their corresponding tokens.

Returns

the tokens associated with indices.

Return type

tokens (List[str])

Raises

RuntimeError – if an index within indices is not in the range [0, itos.size()].
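The batch lookup methods above are inverses of each other for known tokens. The sketch below models that round trip with plain Python containers. The functions are hypothetical stand-ins, not the library methods, and two assumptions are made: unknown tokens map to the index of `<unk>`, and a valid index satisfies `0 <= i < len(itos)`.

```python
from typing import Dict, List

UNK = "<unk>"
itos: List[str] = ["b", "a", UNK]
stoi: Dict[str, int] = {tok: i for i, tok in enumerate(itos)}


def lookup_indices(tokens: List[str]) -> List[int]:
    # Assumed behaviour: unknown tokens resolve to the unk index.
    return [stoi.get(tok, stoi[UNK]) for tok in tokens]


def lookup_tokens(indices: List[int]) -> List[str]:
    # Validate every index first so a bad one raises before any result is built.
    for i in indices:
        if not 0 <= i < len(itos):
            raise RuntimeError(f"index {i} is out of range")
    return [itos[i] for i in indices]


indices = lookup_indices(["a", "b", "zzz"])
tokens = lookup_tokens(indices)
print(indices, tokens)
```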

vocab_from_file_object

torchtext.experimental.vocab.vocab_from_file_object(file_like_object, **kwargs)[source]

Create a Vocab object from a file-like object.

The file_like_object should contain tokens separated by new lines. Note that the vocab will be created in the order in which the tokens first appear in the file (and not by token frequency).

Format for txt file:

token1
token2
…
token_n

Parameters
  • file_like_object (FileObject) – a file-like object to read data from.

  • **kwargs – remaining keyword arguments, passed to the constructor of the Vocab class.

Returns

a Vocab object.

Return type

Vocab

Examples

>>> from torchtext.experimental.vocab import vocab_from_file_object
>>> with open('vocab.txt', 'r') as f:
...     v = vocab_from_file_object(f, min_freq=1, unk_token='<unk>')
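The first-appearance ordering described above can be sketched with an in-memory file. `ordered_dict_from_lines` is a hypothetical helper, not part of torchtext. It assumes that blank lines are skipped and that repeated tokens increase a count rather than create new entries. The resulting OrderedDict has the shape the Vocab constructor expects.

```python
import io
from collections import OrderedDict


def ordered_dict_from_lines(file_like_object):
    """Hypothetical helper: one token per line, ordered by first appearance."""
    ordered = OrderedDict()
    for line in file_like_object:
        token = line.rstrip("\n")
        if not token:
            continue  # assumed: blank lines carry no token
        # A repeated token keeps its original position; only its count grows.
        ordered[token] = ordered.get(token, 0) + 1
    return ordered


f = io.StringIO("hello\nworld\nhello\n<unk>\n")
ordered = ordered_dict_from_lines(f)
tokens_in_order = list(ordered)
print(tokens_in_order)
```

Using `io.StringIO` keeps the example independent of any file on disk. An object returned by `open('vocab.txt', 'r')` can be iterated in the same way.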
