
torchtext.experimental.transforms

BasicEnglishNormalize

class torchtext.experimental.transforms.BasicEnglishNormalize[source]

Basic normalization for a string sentence.

Normalization includes
  • lowercasing

  • the following basic text normalization steps for English:
    • add spaces before and after "'"

    • remove '"'

    • add spaces before and after '.'

    • replace '<br />' with a single space

    • add spaces before and after ','

    • add spaces before and after '('

    • add spaces before and after ')'

    • add spaces before and after '!'

    • add spaces before and after '?'

    • replace ';' with a single space

    • replace ':' with a single space

    • replace multiple spaces with a single space

Examples

>>> import torch
>>> from torchtext.experimental.transforms import BasicEnglishNormalize
>>> test_sample = 'Basic English Normalization for a Line of Text'
>>> basic_english_normalize = BasicEnglishNormalize()
>>> jit_basic_english_normalize = torch.jit.script(basic_english_normalize)
>>> tokens = jit_basic_english_normalize(test_sample)
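The rule list above can be sketched in pure Python with the standard `re` module. This is an illustrative re-implementation of the documented rules only, not the actual torchtext implementation (which is backed by C++ and scriptable with TorchScript):

```python
import re

# Each (pattern, replacement) pair mirrors one documented rule, in order.
_REPLACEMENTS = [
    (r"\'", " ' "),     # add spaces before and after '
    (r"\"", ""),        # remove "
    (r"\.", " . "),     # add spaces before and after .
    (r"<br \/>", " "),  # replace <br /> with a single space
    (r",", " , "),      # add spaces before and after ,
    (r"\(", " ( "),     # add spaces before and after (
    (r"\)", " ) "),     # add spaces before and after )
    (r"\!", " ! "),     # add spaces before and after !
    (r"\?", " ? "),     # add spaces before and after ?
    (r"\;", " "),       # replace ; with a single space
    (r"\:", " "),       # replace : with a single space
    (r"\s+", " "),      # replace multiple spaces with a single space
]

def basic_english_normalize(line: str) -> list:
    """Lowercase, apply the replacements above, then split on whitespace."""
    line = line.lower()
    for pattern, replacement in _REPLACEMENTS:
        line = re.sub(pattern, replacement, line)
    return line.split()

print(basic_english_normalize("Hello, World! (a test)"))
# → ['hello', ',', 'world', '!', '(', 'a', 'test', ')']
```

Note that punctuation such as ',' and '!' becomes a standalone token because the surrounding spaces make the final whitespace split separate it from adjacent words.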
__init__()[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(line: str) → List[str][source]
Parameters

line (str) – a line of text to tokenize.

Returns

a list of tokens after normalizing and splitting on whitespace.

Return type

List[str]

RegexTokenizer

class torchtext.experimental.transforms.RegexTokenizer(patterns_list: List[Tuple[str, str]])[source]

Regex tokenizer for a string sentence that applies all regex replacements defined in patterns_list.

Parameters
  • patterns_list (List[Tuple[str, str]]) – a list of tuples (ordered pairs) containing the regex pattern string as the first element and the replacement string as the second element.

Examples

>>> import torch
>>> from torchtext.experimental.transforms import RegexTokenizer
>>> test_sample = 'Basic Regex Tokenization for a Line of Text'
>>> patterns_list = [
...     (r'\'', ' \'  '),
...     (r'\"', '')]
>>> regex_tokenizer = RegexTokenizer(patterns_list)
>>> jit_regex_tokenizer = torch.jit.script(regex_tokenizer)
>>> tokens = jit_regex_tokenizer(test_sample)
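The documented behavior (apply each regex replacement in order, then split on whitespace) can be sketched in pure Python. This is a hypothetical stand-in class for illustration, not the torchtext implementation, which runs in C++ and supports torch.jit.script:

```python
import re
from typing import List, Tuple

class RegexTokenizerSketch:
    """Illustrative sketch: applies each (pattern, replacement) pair from
    patterns_list in order via re.sub, then splits the result on whitespace.
    Unlike BasicEnglishNormalize, no lowercasing is applied."""

    def __init__(self, patterns_list: List[Tuple[str, str]]):
        self.patterns_list = patterns_list

    def __call__(self, line: str) -> List[str]:
        for pattern, replacement in self.patterns_list:
            line = re.sub(pattern, replacement, line)
        return line.split()

# Same patterns as the example above: space out apostrophes, drop double quotes.
patterns_list = [(r"\'", " ' "), (r"\"", "")]
tokenizer = RegexTokenizerSketch(patterns_list)
print(tokenizer('Don\'t "quote" me'))
# → ['Don', "'", 't', 'quote', 'me']
```

Because the pairs are applied in list order, a later pattern can match text produced by an earlier replacement, so the ordering of patterns_list is significant.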
__init__(patterns_list: List[Tuple[str, str]])[source]

Initializes internal Module state, shared by both nn.Module and ScriptModule.

forward(line: str) → List[str][source]
Parameters

line (str) – a line of text to tokenize.

Returns

a list of tokens after normalizing and splitting on whitespace.

Return type

List[str]
