torchtext.data.utils
get_tokenizer
torchtext.data.utils.get_tokenizer(tokenizer, language='en')
Generate a tokenizer function for a string sentence.
Parameters
tokenizer – the name of the tokenizer function. If None, returns the split() function, which splits the sentence on whitespace. If "basic_english", returns the _basic_english_normalize() function, which normalizes the string first and then splits on whitespace. If a callable, that callable is returned as-is. If the name of a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), returns the corresponding library's tokenizer.
language – the language of the tokenizer. Default: en
Examples
>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
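To illustrate what the basic_english normalization does without requiring torchtext, here is a rough pure-Python sketch: lowercase the string, pad common punctuation with spaces, then split on whitespace. This is a simplified approximation, not the actual _basic_english_normalize() implementation, which applies a fixed list of regex substitutions and may differ in details.

```python
import re


def basic_english_sketch(line):
    # Simplified stand-in for basic_english tokenization (assumption:
    # the real _basic_english_normalize() uses its own substitution list).
    line = line.lower()
    # Put spaces around common punctuation so it becomes separate tokens.
    line = re.sub(r"([.,!?()'])", r" \1 ", line)
    return line.split()


tokens = basic_english_sketch("You can now install TorchText using pip!")
# → ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
```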
ngrams_iterator
torchtext.data.utils.ngrams_iterator(token_list, ngrams)
Return an iterator that yields the given tokens and their ngrams.
Parameters
token_list – a list of tokens
ngrams – the maximum length of the ngrams to yield.
Examples
>>> token_list = ['here', 'we', 'are']
>>> list(ngrams_iterator(token_list, 2))
['here', 'here we', 'we', 'we are', 'are']
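The behavior shown above can be sketched in plain Python: for each starting position, yield every ngram of length 1 up to ngrams that fits. This is a minimal reimplementation matching the documented output order, not the actual torchtext source.

```python
def ngrams_sketch(token_list, ngrams):
    # Minimal sketch of ngrams_iterator's documented behavior (assumption:
    # torchtext's own implementation may produce items in a different order).
    for i in range(len(token_list)):
        for n in range(1, ngrams + 1):
            if i + n <= len(token_list):
                # Join consecutive tokens into a space-separated ngram.
                yield " ".join(token_list[i : i + n])


result = list(ngrams_sketch(["here", "we", "are"], 2))
# → ['here', 'here we', 'we', 'we are', 'are']
```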