torchtext.vocab

Vocab

class torchtext.vocab.Vocab(vocab)
__contains__(token: str) → bool

Parameters:
    token – The token for which to check the membership.
Returns:
    Whether the token is a member of the vocab or not.
__getitem__(token: str) → int

Parameters:
    token – The token used to look up the corresponding index.
Returns:
    The index corresponding to the associated token.
__init__(vocab)

Initializes internal Module state, shared by both nn.Module and ScriptModule.
append_token(token: str) → None

Parameters:
    token – The token to append to the end of the vocab.
Raises:
    RuntimeError – If token already exists in the vocab.
forward(tokens: List[str]) → List[int]

Calls the lookup_indices method.

Parameters:
    tokens – A list of tokens used to look up their corresponding indices.
Returns:
    The indices associated with a list of tokens.
insert_token(token: str, index: int) → None

Parameters:
    token – The token to insert into the vocab.
    index – The index at which to insert the token.
Raises:
    RuntimeError – If index is not in range [0, Vocab.size()] or if token already exists in the vocab.
lookup_indices(tokens: List[str]) → List[int]

Parameters:
    tokens – The tokens used to look up their corresponding indices.
Returns:
    The indices associated with the tokens.
lookup_token(index: int) → str

Parameters:
    index – The index of the token to look up.
Returns:
    The token corresponding to the given index.
Return type:
    str
Raises:
    RuntimeError – If index is not in range [0, itos.size()).
lookup_tokens(indices: List[int]) → List[str]

Parameters:
    indices – The indices used to look up their corresponding tokens.
Returns:
    The tokens associated with the indices.
Raises:
    RuntimeError – If an index within indices is not in range [0, itos.size()).
vocab

torchtext.vocab.vocab(ordered_dict: Dict, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True) → torchtext.vocab.vocab.Vocab

Factory method for creating a vocab object which maps tokens to indices.
Note that the ordering in which key value pairs were inserted in the ordered_dict will be respected when building the vocab. Therefore if sorting by token frequency is important to the user, the ordered_dict should be created in a way to reflect this.
Parameters:
    ordered_dict – Ordered dictionary mapping tokens to their corresponding occurrence frequencies.
    min_freq – The minimum frequency needed to include a token in the vocabulary.
    specials – Special symbols to add. The order of supplied tokens will be preserved.
    special_first – Indicates whether to insert the special symbols at the beginning or at the end.
Returns:
    A Vocab object.
Return type:
    Vocab

Examples
>>> from torchtext.vocab import vocab
>>> from collections import Counter, OrderedDict
>>> counter = Counter(["a", "a", "b", "b", "b"])
>>> sorted_by_freq_tuples = sorted(counter.items(), key=lambda x: x[1], reverse=True)
>>> ordered_dict = OrderedDict(sorted_by_freq_tuples)
>>> v1 = vocab(ordered_dict)
>>> print(v1['a'])  # prints 1
>>> print(v1['out of vocab'])  # raises RuntimeError since default index is not set
>>> tokens = ['e', 'd', 'c', 'b', 'a']
>>> # adding <unk> token and default index
>>> unk_token = '<unk>'
>>> default_index = -1
>>> v2 = vocab(OrderedDict([(token, 1) for token in tokens]), specials=[unk_token])
>>> v2.set_default_index(default_index)
>>> print(v2['<unk>'])  # prints 0
>>> print(v2['out of vocab'])  # prints -1
>>> # make default index same as index of unk_token
>>> v2.set_default_index(v2[unk_token])
>>> v2['out of vocab'] is v2[unk_token]  # prints True
build_vocab_from_iterator

torchtext.vocab.build_vocab_from_iterator(iterator: Iterable, min_freq: int = 1, specials: Optional[List[str]] = None, special_first: bool = True, max_tokens: Optional[int] = None) → torchtext.vocab.vocab.Vocab

Build a Vocab from an iterator.
Parameters:
    iterator – Iterator used to build the Vocab. Must yield lists or iterators of tokens.
    min_freq – The minimum frequency needed to include a token in the vocabulary.
    specials – Special symbols to add. The order of supplied tokens will be preserved.
    special_first – Indicates whether to insert the special symbols at the beginning or at the end.
    max_tokens – If provided, creates the vocab from the max_tokens - len(specials) most frequent tokens.
Returns:
    A Vocab object.
Return type:
    Vocab

Examples
>>> # generating vocab from a text file
>>> import io
>>> from torchtext.vocab import build_vocab_from_iterator
>>> def yield_tokens(file_path):
>>>     with io.open(file_path, encoding='utf-8') as f:
>>>         for line in f:
>>>             yield line.strip().split()
>>> vocab = build_vocab_from_iterator(yield_tokens(file_path), specials=["<unk>"])
Vectors

class torchtext.vocab.Vectors(name, cache=None, url=None, unk_init=None, max_vectors=None)
__init__(name, cache=None, url=None, unk_init=None, max_vectors=None)

Parameters:
    name – Name of the file that contains the vectors.
    cache – Directory for cached vectors.
    url – URL for download if the vectors are not found in the cache.
    unk_init (callback) – By default, initializes out-of-vocabulary word vectors to zero vectors; can be any function that takes in a Tensor and returns a Tensor of the same size.
    max_vectors (int) – This can be used to limit the number of pre-trained vectors loaded. Most pre-trained vector sets are sorted in descending order of word frequency. Thus, in situations where the entire set doesn’t fit in memory, or is not needed for another reason, passing max_vectors can limit the size of the loaded set.
get_vecs_by_tokens(tokens, lower_case_backup=False)

Look up embedding vectors of tokens.

Parameters:
    tokens – A token or a list of tokens. If tokens is a string, returns a 1-D tensor of shape self.dim; if tokens is a list of strings, returns a 2-D tensor of shape (len(tokens), self.dim).
    lower_case_backup – Whether to look up the token in lower case. If False, each token in the original case will be looked up; if True, each token in the original case will be looked up first, and if it is not found in the keys of the property stoi, the token in lower case will be looked up. Default: False.
Examples

>>> examples = ['chip', 'baby', 'Beautiful']
>>> vec = torchtext.vocab.GloVe(name='6B', dim=50)
>>> ret = vec.get_vecs_by_tokens(examples, lower_case_backup=True)