torchtext.transforms

This module provides common text transforms. They can be chained together using torch.nn.Sequential, or using torchtext.transforms.Sequential when TorchScript support is needed.

SentencePieceTokenizer

class torchtext.transforms.SentencePieceTokenizer(sp_model_path: str)

Transform for the SentencePiece tokenizer, loaded from a pre-trained SentencePiece model.

Additional details: https://github.com/google/sentencepiece

Parameters

sp_model_path (str) – Path to pre-trained sentencepiece model

Example
>>> from torchtext.transforms import SentencePieceTokenizer
>>> transform = SentencePieceTokenizer("spm_model")
>>> transform(["hello world", "attention is all you need!"])

forward(input: Any) → Any
Parameters

input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.

Returns

tokenized text

Return type

Union[List[str], List[List[str]]]

GPT2BPETokenizer

class torchtext.transforms.GPT2BPETokenizer(encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False)

forward(input: Any) → Any
Parameters

input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.

Returns

tokenized text

Return type

Union[List[str], List[List[str]]]

CLIPTokenizer

class torchtext.transforms.CLIPTokenizer(merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None, return_tokens: bool = False)

forward(input: Any) → Any
Parameters

input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.

Returns

tokenized text

Return type

Union[List[str], List[List[str]]]

VocabTransform

class torchtext.transforms.VocabTransform(vocab: torchtext.vocab.vocab.Vocab)

Vocab transform that converts an input batch of tokens into the corresponding token ids.

Parameters

vocab – An instance of the torchtext.vocab.Vocab class.

Example

>>> import torch
>>> from torchtext.vocab import vocab
>>> from torchtext.transforms import VocabTransform
>>> from collections import OrderedDict
>>> vocab_obj = vocab(OrderedDict([('a', 1), ('b', 1), ('c', 1)]))
>>> vocab_transform = VocabTransform(vocab_obj)
>>> output = vocab_transform([['a','b'],['a','b','c']])
>>> jit_vocab_transform = torch.jit.script(vocab_transform)

forward(input: Any) → Any
Parameters

input (Union[List[str], List[List[str]]]) – Input batch of tokens to convert to corresponding token ids

Returns

Converted input into corresponding token ids

Return type

Union[List[int], List[List[int]]]

ToTensor

class torchtext.transforms.ToTensor(padding_value: Optional[int] = None, dtype: torch.dtype = torch.int64)

Convert input to torch tensor

Parameters
  • padding_value (Optional[int]) – Value used to pad each sequence in the batch to the length of the longest sequence in the batch.

  • dtype (torch.dtype) – torch.dtype of output tensor

forward(input: Any) → torch.Tensor
Parameters

input (Union[List[int], List[List[int]]]) – Sequence or batch of token ids

Return type

Tensor
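
The padding behaviour described for padding_value can be sketched in plain Python. This is only an illustration of the semantics, assuming shorter sequences are padded on the right; pad_batch is a hypothetical helper, not part of torchtext, and the real transform returns a torch.Tensor rather than nested lists.

```python
from typing import List


def pad_batch(batch: List[List[int]], padding_value: int) -> List[List[int]]:
    """Pad every sequence on the right to the length of the longest one.

    Hypothetical helper that mirrors the documented padding semantics.
    """
    longest = max((len(seq) for seq in batch), default=0)
    return [seq + [padding_value] * (longest - len(seq)) for seq in batch]


token_ids = [[1, 2], [3, 4, 5]]
padded = pad_batch(token_ids, padding_value=0)
print(padded)  # [[1, 2, 0], [3, 4, 5]]

# With torchtext installed, the corresponding call would be
# ToTensor(padding_value=0)(token_ids), which returns a torch.Tensor.
```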

LabelToIndex

class torchtext.transforms.LabelToIndex(label_names: Optional[List[str]] = None, label_path: Optional[str] = None, sort_names=False)

Transform labels from string names to ids.

Parameters
  • label_names (Optional[List[str]]) – a list of unique label names

  • label_path (Optional[str]) – a path to a file containing unique label names, one label per line. Supply either label_names or label_path, but not both.

forward(input: Any) → Any
Parameters

input (Union[str, List[str]]) – Input labels to convert to corresponding ids

Return type

Union[int, List[int]]
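
As a rough illustration of the label-to-id mapping, the sketch below uses a plain dictionary. It assumes that each label's id is its position in label_names, which the text above does not state explicitly; build_label_index and label_to_index are hypothetical names, not torchtext functions.

```python
from typing import Dict, List, Union


def build_label_index(label_names: List[str]) -> Dict[str, int]:
    """Assign each unique label name an id equal to its list position (assumed)."""
    return {name: idx for idx, name in enumerate(label_names)}


def label_to_index(
    labels: Union[str, List[str]], index: Dict[str, int]
) -> Union[int, List[int]]:
    """Convert one label or a list of labels to ids."""
    if isinstance(labels, str):
        return index[labels]
    return [index[label] for label in labels]


index = build_label_index(["neg", "neutral", "pos"])
single = label_to_index("pos", index)
batch = label_to_index(["neg", "pos", "neg"], index)
print(single, batch)  # 2 [0, 2, 0]

# With torchtext installed, the corresponding call would be
# LabelToIndex(label_names=["neg", "neutral", "pos"])(["neg", "pos", "neg"]).
```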

Truncate

class torchtext.transforms.Truncate(max_seq_len: int)

Truncate input sequence

Parameters

max_seq_len (int) – The maximum allowable length for input sequence

forward(input: Any) → Any
Parameters

input (Union[List[Union[str, int]], List[List[Union[str, int]]]]) – Input sequence or batch of sequences to be truncated

Returns

Truncated sequence

Return type

Union[List[Union[str, int]], List[List[Union[str, int]]]]
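
A minimal plain-Python sketch of truncation follows. It assumes that elements beyond max_seq_len are dropped from the end of each sequence; truncate is a hypothetical stand-in, not the library's implementation.

```python
from typing import List, Union

Token = Union[str, int]


def truncate(seq_or_batch: list, max_seq_len: int) -> list:
    """Keep at most max_seq_len leading items of a sequence or of each sequence in a batch."""
    if seq_or_batch and isinstance(seq_or_batch[0], list):
        # Batch input: truncate every sequence independently.
        return [seq[:max_seq_len] for seq in seq_or_batch]
    return seq_or_batch[:max_seq_len]


single = truncate(["a", "b", "c", "d"], max_seq_len=2)
batch = truncate([[1, 2, 3, 4], [5, 6]], max_seq_len=3)
print(single)  # ['a', 'b']
print(batch)   # [[1, 2, 3], [5, 6]]

# With torchtext installed, the corresponding call would be
# Truncate(max_seq_len=3)([[1, 2, 3, 4], [5, 6]]).
```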

AddToken

class torchtext.transforms.AddToken(token: Union[int, str], begin: bool = True)

Add token to beginning or end of sequence

Parameters
  • token (Union[int, str]) – The token to be added

  • begin (bool, optional) – Whether to insert the token at the start or the end of the sequence, defaults to True

forward(input: Any) → Any
Parameters

input (Union[List[Union[str, int]], List[List[Union[str, int]]]]) – Input sequence or batch
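
The effect of the begin flag can be illustrated with a small plain-Python sketch. The helper add_token is hypothetical and only mirrors the documented behaviour of prepending or appending one token to a sequence or to each sequence in a batch.

```python
from typing import Union


def add_token(seq_or_batch: list, token: Union[int, str], begin: bool = True) -> list:
    """Insert token at the start (begin=True) or the end (begin=False)."""

    def add_one(seq: list) -> list:
        return [token] + seq if begin else seq + [token]

    if seq_or_batch and isinstance(seq_or_batch[0], list):
        # Batch input: add the token to every sequence.
        return [add_one(seq) for seq in seq_or_batch]
    return add_one(seq_or_batch)


with_bos = add_token([5, 6], token=0)
with_eos = add_token([[5], [6, 7]], token=2, begin=False)
print(with_bos)  # [0, 5, 6]
print(with_eos)  # [[5, 2], [6, 7, 2]]

# With torchtext installed, the corresponding calls would be
# AddToken(token=0, begin=True)(...) and AddToken(token=2, begin=False)(...).
```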

Sequential

class torchtext.transforms.Sequential(*args: torch.nn.modules.module.Module)
class torchtext.transforms.Sequential(arg: OrderedDict[str, Module])

A container to host a sequence of text transforms.

forward(input: Any) → Any
Parameters

input (Any) – Input sequence or batch. The input type must be supported by the first transform in the sequence.
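
Chaining can be pictured as feeding the output of each transform into the next one. The sketch below uses plain Python callables in place of torch modules; compose and the three lambda steps are hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable


def compose(*transforms: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a callable that applies the given transforms in order."""

    def pipeline(value: Any) -> Any:
        for transform in transforms:
            value = transform(value)
        return value

    return pipeline


# Stand-ins for Truncate(3), AddToken(0, begin=True) and AddToken(2, begin=False)
# operating on a single sequence of token ids.
pipeline = compose(
    lambda seq: seq[:3],
    lambda seq: [0] + seq,
    lambda seq: seq + [2],
)

result = pipeline([5, 6, 7, 8])
print(result)  # [0, 5, 6, 7, 2]

# With torchtext installed, the corresponding pipeline would be
# Sequential(Truncate(3), AddToken(0, begin=True), AddToken(2, begin=False)).
```

The first step must accept the raw input type, and each later step must accept whatever the previous step returns.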

PadTransform

class torchtext.transforms.PadTransform(max_length: int, pad_value: int)

Pad tensor to a fixed length with given padding value.

Parameters
  • max_length (int) – Maximum length to pad to

  • pad_value (int) – Value to pad the tensor with

forward(x: torch.Tensor) → torch.Tensor
Parameters

x (Tensor) – The tensor to pad

Returns

Tensor padded up to max_length with pad_value

Return type

Tensor
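
Fixed-length padding can be sketched on a plain list as follows. Two assumptions are made that the text above does not confirm: padding is added on the right, and inputs already longer than max_length are returned unchanged. pad_to_length is a hypothetical helper that works on lists rather than tensors.

```python
from typing import List


def pad_to_length(seq: List[int], max_length: int, pad_value: int) -> List[int]:
    """Right-pad seq with pad_value until it has max_length items.

    Assumption: sequences at or above max_length are left as they are.
    """
    missing = max_length - len(seq)
    if missing > 0:
        return seq + [pad_value] * missing
    return seq


short = pad_to_length([1, 2, 3], max_length=5, pad_value=0)
long = pad_to_length([1, 2, 3], max_length=2, pad_value=0)
print(short)  # [1, 2, 3, 0, 0]
print(long)   # [1, 2, 3]

# With torchtext installed, the corresponding call would be
# PadTransform(max_length=5, pad_value=0)(torch.tensor([1, 2, 3])).
```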

StrToIntTransform

class torchtext.transforms.StrToIntTransform

Convert string tokens to integers (either single sequence or batch).

forward(input: Any) → Any
Parameters

input (Union[List[str], List[List[str]]]) – sequence or batch of string tokens to convert

Returns

sequence or batch converted into corresponding token ids

Return type

Union[List[int], List[List[int]]]
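
The conversion can be sketched in plain Python using the built-in int. The helper str_to_int is hypothetical and assumes every token is a valid decimal integer string.

```python
from typing import List, Union


def str_to_int(
    seq_or_batch: Union[List[str], List[List[str]]]
) -> Union[List[int], List[List[int]]]:
    """Convert string tokens to integers for a sequence or a batch of sequences."""
    if seq_or_batch and isinstance(seq_or_batch[0], list):
        # Batch input: convert each sequence separately.
        return [[int(tok) for tok in seq] for seq in seq_or_batch]
    return [int(tok) for tok in seq_or_batch]


single = str_to_int(["1", "22", "333"])
batch = str_to_int([["1", "2"], ["3"]])
print(single)  # [1, 22, 333]
print(batch)   # [[1, 2], [3]]

# With torchtext installed, the corresponding call would be
# StrToIntTransform()([["1", "2"], ["3"]]).
```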
