torchtext.data.utils

get_tokenizer

torchtext.data.utils.get_tokenizer(tokenizer, language='en')

Return a tokenizer function that splits a string sentence into tokens.

Parameters
  • tokenizer – the name of the tokenizer function. If None, returns the split() function, which splits the sentence on whitespace. If basic_english, returns the _basic_english_normalize() function, which normalizes the string first and then splits it on whitespace. If a callable, returns that callable unchanged. If the name of a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), returns a tokenizer backed by the corresponding library.

  • language – the language passed to the tokenizer. Default: en

Examples

>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
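
The dispatch rules listed under Parameters can be sketched in plain Python, without torchtext installed. The names pick_tokenizer and simple_english_normalize are hypothetical, and the normalization here is a simplified stand-in (lowercasing and padding a few punctuation marks with spaces), not the library's actual _basic_english_normalize(). Library-backed tokenizers such as spacy are left out of the sketch.

```python
import re

# Hypothetical, simplified stand-in for basic_english normalization:
# lowercase the text and surround common punctuation with spaces.
_PUNCT = re.compile(r"([.,!?()])")


def simple_english_normalize(line):
    line = _PUNCT.sub(r" \1 ", line.lower())
    return line.split()


def pick_tokenizer(tokenizer):
    """Illustrative dispatch only; not torchtext's implementation."""
    if tokenizer is None:
        return str.split  # plain whitespace split
    if tokenizer == "basic_english":
        return simple_english_normalize
    if callable(tokenizer):
        return tokenizer  # a callable is returned unchanged
    raise ValueError(f"unsupported tokenizer in this sketch: {tokenizer!r}")


sentence = "You can now install TorchText using pip!"
print(pick_tokenizer(None)(sentence))
# ['You', 'can', 'now', 'install', 'TorchText', 'using', 'pip!']
print(pick_tokenizer("basic_english")(sentence))
# ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
```

The first call shows why normalization matters: a plain whitespace split keeps the original casing and leaves the exclamation mark attached to the last word.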

ngrams_iterator

torchtext.data.utils.ngrams_iterator(token_list, ngrams)

Return an iterator that yields the given tokens and their ngrams.

Parameters
  • token_list – A list of tokens

  • ngrams – the maximum n-gram size. N-grams of every size from 1 (the tokens themselves) up to this value are yielded.

Examples

>>> from torchtext.data.utils import ngrams_iterator
>>> token_list = ['here', 'we', 'are']
>>> list(ngrams_iterator(token_list, 2))
['here', 'here we', 'we', 'we are', 'are']
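
To make the meaning of the ngrams argument concrete, here is a minimal pure-Python sketch that reproduces the output listed in the example above. The function name ngrams_sketch is hypothetical and this is not the library's implementation. The installed torchtext version may yield the same n-grams in a different order, for instance all single tokens first.

```python
def ngrams_sketch(token_list, ngrams):
    """Yield each token, followed by the longer n-grams that start at it."""
    for start in range(len(token_list)):
        for n in range(1, ngrams + 1):
            # Skip n-grams that would run past the end of the list.
            if start + n <= len(token_list):
                yield " ".join(token_list[start:start + n])


print(list(ngrams_sketch(["here", "we", "are"], 2)))
# ['here', 'here we', 'we', 'we are', 'are']
print(list(ngrams_sketch(["here", "we", "are"], 3)))
# ['here', 'here we', 'here we are', 'we', 'we are', 'are']
```

Raising ngrams from 2 to 3 adds only the single trigram, since a three-token list contains just one.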
