torchtext.data.utils
get_tokenizer
- torchtext.data.utils.get_tokenizer(tokenizer, language='en')
Generate tokenizer function for a string sentence.
- Parameters:
tokenizer – the name of the tokenizer function. If None, returns the split() function, which splits the sentence by whitespace. If basic_english, returns the _basic_english_normalize() function, which normalizes the string first and then splits it by whitespace. If a callable, returns that callable unchanged. If the name of a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), returns the tokenizer from that library.
language – the language of the text to tokenize. Default: en
Examples
>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
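The dispatch rules described for the tokenizer parameter can be illustrated with a small pure-Python sketch. It is not the library's implementation: get_tokenizer_sketch and simple_english_normalize are hypothetical names, the normalization is a simplified stand-in for _basic_english_normalize() (lowercasing plus spacing out a few punctuation marks), and the library-backed tokenizers (spacy, moses, and so on) are left out.

```python
import re


def simple_english_normalize(line):
    """Simplified stand-in (assumption) for the library's basic_english normalization."""
    line = line.lower()
    # Surround common punctuation with spaces so each mark becomes its own token.
    line = re.sub(r"([.,!?()])", r" \1 ", line)
    return line.split()


def get_tokenizer_sketch(tokenizer=None):
    """Hypothetical helper mirroring the documented dispatch on `tokenizer`."""
    if tokenizer is None:
        # No name given: fall back to plain whitespace splitting.
        return str.split
    if tokenizer == "basic_english":
        return simple_english_normalize
    if callable(tokenizer):
        # A user-supplied callable is returned unchanged.
        return tokenizer
    # Library-backed tokenizers are outside the scope of this sketch.
    raise ValueError(f"unsupported tokenizer in this sketch: {tokenizer!r}")


tokenizer = get_tokenizer_sketch("basic_english")
tokens = tokenizer("You can now install TorchText using pip!")
print(tokens)
```

Passing a callable such as str.upper returns it as-is, which is convenient when a project already has its own tokenizer and only needs a uniform entry point.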
ngrams_iterator
- torchtext.data.utils.ngrams_iterator(token_list, ngrams)
Return an iterator that yields the given tokens and their ngrams.
- Parameters:
token_list – A list of tokens
ngrams – the maximum n-gram length. N-grams of every length from 1 (the tokens themselves) up to this value are yielded.
Examples
>>> from torchtext.data.utils import ngrams_iterator
>>> token_list = ['here', 'we', 'are']
>>> list(ngrams_iterator(token_list, 2))
['here', 'here we', 'we', 'we are', 'are']
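To make the n-gram construction concrete, here is a minimal pure-Python sketch that joins adjacent tokens with a single space. The function name ngrams_sketch is hypothetical, and the yield order (all n-grams starting at each position, shortest first) is an assumption chosen to match the example output shown above; the order produced by an installed version of the library may differ.

```python
def ngrams_sketch(token_list, ngrams):
    """Hypothetical generator: yields tokens and their n-grams up to length `ngrams`."""
    for start in range(len(token_list)):
        for n in range(1, ngrams + 1):
            # Only emit an n-gram if it fits inside the token list.
            if start + n <= len(token_list):
                yield " ".join(token_list[start:start + n])


token_list = ['here', 'we', 'are']
bigram_features = list(ngrams_sketch(token_list, 2))
print(bigram_features)
```

With ngrams set to 1 the generator yields only the original tokens, so the same code path can serve both unigram and higher-order feature extraction.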