torchtext.transforms

This module provides common text transforms. They can be chained together using torch.nn.Sequential, or using torchtext.transforms.Sequential to support TorchScript.

SentencePieceTokenizer

class torchtext.transforms.SentencePieceTokenizer(sp_model_path: str)[source]

Transform for the SentencePiece tokenizer, backed by a pre-trained SentencePiece model.

Additional details: https://github.com/google/sentencepiece

Parameters: sp_model_path (str) – Path to pre-trained sentencepiece model

Example

>>> from torchtext.transforms import SentencePieceTokenizer
>>> transform = SentencePieceTokenizer("spm_model")
>>> transform(["hello world", "attention is all you need!"])

Tutorials using SentencePieceTokenizer:

SST-2 Binary text classification with XLM-RoBERTa model

forward(input: Any) → Any[source]

Parameters: input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.
Returns: tokenized text
Return type: Union[List[str], List[List[str]]]

GPT2BPETokenizer

class torchtext.transforms.GPT2BPETokenizer(encoder_json_path: str, vocab_bpe_path: str)[source]

forward(input: Any) → Any[source]

Parameters: input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.
Returns: tokenized text
Return type: Union[List[str], List[List[str]]]

CLIPTokenizer

class torchtext.transforms.CLIPTokenizer(merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None)[source]

forward(input: Any) → Any[source]

Parameters: input (Union[str, List[str]]) – Input sentence or list of sentences on which to apply tokenizer.
Returns: tokenized text
Return type: Union[List[str], List[List[str]]]

VocabTransform

class torchtext.transforms.VocabTransform(vocab: torchtext.vocab.vocab.Vocab)[source]

Vocab transform to convert input batch of tokens into corresponding token ids

Parameters: vocab – an instance of the torchtext.vocab.Vocab class.

Example

>>> import torch
>>> from torchtext.vocab import vocab
>>> from torchtext.transforms import VocabTransform
>>> from collections import OrderedDict
>>> vocab_obj = vocab(OrderedDict([('a', 1), ('b', 1), ('c', 1)]))
>>> vocab_transform = VocabTransform(vocab_obj)
>>> output = vocab_transform([['a','b'],['a','b','c']])
>>> jit_vocab_transform = torch.jit.script(vocab_transform)

Tutorials using VocabTransform: SST-2 Binary text classification with XLM-RoBERTa model

forward(input: Any) → Any[source]

Parameters: input (Union[List[str], List[List[str]]]) – Input batch of tokens to convert to corresponding token ids
Returns: Converted input into corresponding token ids
Return type: Union[List[int], List[List[int]]]

ToTensor

class torchtext.transforms.ToTensor(padding_value: Optional[int] = None, dtype: torch.dtype = torch.int64)[source]

Convert input to torch tensor

Parameters

padding_value (Optional[int]) – Value used to pad each sequence in the batch to the length of the longest sequence in the batch.
dtype (torch.dtype) – torch.dtype of output tensor

forward(input: Any) → torch.Tensor [source]

Parameters: input (Union[List[int], List[List[int]]]) – Sequence or batch of token ids
Return type: Tensor
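
The padding described for padding_value can be illustrated without torch. The sketch below is plain Python, not torchtext's implementation: pad_batch is a hypothetical helper, and it assumes padding is appended on the right. In the real transform the padded rows would then be converted to a tensor of the requested dtype.

```python
from typing import List, Optional


def pad_batch(batch: List[List[int]], padding_value: Optional[int] = None) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one (hypothetical helper)."""
    if padding_value is None:
        # No pad value given: rows are copied unchanged, so they must already
        # share one length if they are to form a rectangular tensor.
        return [list(seq) for seq in batch]
    max_len = max((len(seq) for seq in batch), default=0)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in batch]


padded = pad_batch([[1, 2], [1, 2, 3]], padding_value=0)
print(padded)  # [[1, 2, 0], [1, 2, 3]]
```

Every row in padded has the same length, which is what allows the batch to be stored as a single 2-D tensor.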

LabelToIndex

class torchtext.transforms.LabelToIndex(label_names: Optional[List[str]] = None, label_path: Optional[str] = None, sort_names=False)[source]

Transform labels from string names to ids.

Parameters

label_names (Optional[List[str]]) – a list of unique label names
label_path (Optional[str]) – path to a file containing unique label names, one label per line. Either label_names or label_path must be supplied, but not both.

forward(input: Any) → Any[source]

Parameters: input (Union[str, List[str]]) – Input labels to convert to corresponding ids
Return type: Union[int, List[int]]
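
A minimal plain-Python sketch of the label-to-id mapping follows. It is an illustration rather than the library's code: build_label_index and label_to_index are hypothetical names, and the sketch assumes each label's id is its position in label_names, after sorting when sort_names is True.

```python
from typing import Dict, List, Union


def build_label_index(label_names: List[str], sort_names: bool = False) -> Dict[str, int]:
    """Map each unique label name to its position in the (optionally sorted) list."""
    names = sorted(label_names) if sort_names else list(label_names)
    if len(set(names)) != len(names):
        raise ValueError("label names must be unique")
    return {name: idx for idx, name in enumerate(names)}


def label_to_index(labels: Union[str, List[str]], mapping: Dict[str, int]) -> Union[int, List[int]]:
    """Convert a single label or a list of labels to ids."""
    if isinstance(labels, str):
        return mapping[labels]
    return [mapping[label] for label in labels]


mapping = build_label_index(["neg", "pos"])
ids = label_to_index(["pos", "neg", "pos"], mapping)
single = label_to_index("pos", mapping)
print(ids, single)  # [1, 0, 1] 1
```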

Truncate

class torchtext.transforms.Truncate(max_seq_len: int)[source]

Truncate input sequence

Parameters: max_seq_len (int) – The maximum allowable length for input sequence

Tutorials using Truncate: SST-2 Binary text classification with XLM-RoBERTa model

forward(input: Any) → Any[source]

Parameters: input (Union[List[Union[str, int]], List[List[Union[str, int]]]]) – Input sequence or batch of sequences to be truncated
Returns: Truncated sequence
Return type: Union[List[Union[str, int]], List[List[Union[str, int]]]]
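
The truncation behaviour for both accepted input shapes can be sketched in plain Python. The function below is a hypothetical stand-in for the transform, assuming that tokens beyond max_seq_len are dropped from the end of each sequence.

```python
from typing import List, Union

Token = Union[str, int]


def truncate(seq_or_batch: Union[List[Token], List[List[Token]]], max_seq_len: int):
    """Keep at most max_seq_len leading items of a sequence, or of each sequence in a batch."""
    if seq_or_batch and isinstance(seq_or_batch[0], list):
        # Batch input: truncate every sequence independently.
        return [seq[:max_seq_len] for seq in seq_or_batch]
    # Single sequence input.
    return seq_or_batch[:max_seq_len]


print(truncate([1, 2, 3, 4], 2))                  # [1, 2]
print(truncate([["a", "b", "c"], ["d"]], 2))      # [['a', 'b'], ['d']]
```

Sequences that are already shorter than max_seq_len pass through unchanged.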

AddToken

class torchtext.transforms.AddToken(token: Union[int, str], begin: bool = True)[source]

Add token to beginning or end of sequence

Parameters

token (Union[int, str]) – The token to be added
begin (bool, optional) – Whether to insert the token at the start or the end of the sequence, defaults to True

Tutorials using AddToken: SST-2 Binary text classification with XLM-RoBERTa model

forward(input: Any) → Any[source]

Parameters: input (Union[List[Union[str, int]], List[List[Union[str, int]]]]) – Input sequence or batch
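
The effect of the begin flag is easiest to see on a small input. This is a plain-Python sketch with a hypothetical add_token function, not the library's implementation; the "</s>" token and the id 0 are placeholder values chosen for illustration.

```python
from typing import List, Union

Token = Union[str, int]


def add_token(seq_or_batch: Union[List[Token], List[List[Token]]], token: Token, begin: bool = True):
    """Prepend (begin=True) or append (begin=False) a token to a sequence or to each sequence in a batch."""

    def add_one(seq: List[Token]) -> List[Token]:
        return [token] + seq if begin else seq + [token]

    if seq_or_batch and isinstance(seq_or_batch[0], list):
        return [add_one(seq) for seq in seq_or_batch]
    return add_one(seq_or_batch)


print(add_token([5, 6], 0))                                # [0, 5, 6]
print(add_token([["a"], ["b", "c"]], "</s>", begin=False))  # [['a', '</s>'], ['b', 'c', '</s>']]
```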

Sequential

class torchtext.transforms.Sequential(*args: torch.nn.modules.module.Module)[source]

class torchtext.transforms.Sequential(arg: OrderedDict[str, Module])

A container to host a sequence of text transforms.

Tutorials using Sequential: SST-2 Binary text classification with XLM-RoBERTa model

forward(input: Any) → Any[source]

Parameters: input (Any) – Input sequence or batch. The input type must be supported by the first transform in the sequence.
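
Conceptually, the container feeds the output of each transform into the next one. The sketch below shows that idea with ordinary callables; chain is a hypothetical helper, the lambdas stand in for transforms such as Truncate and AddToken, and the ids 0 and 2 are placeholder begin and end markers.

```python
from typing import Any, Callable


def chain(*transforms: Callable[[Any], Any]) -> Callable[[Any], Any]:
    """Return a callable that applies the given transforms in order."""

    def apply(x: Any) -> Any:
        for transform in transforms:
            x = transform(x)
        return x

    return apply


pipeline = chain(
    lambda batch: [seq[:3] for seq in batch],    # keep at most 3 ids per sequence
    lambda batch: [[0] + seq for seq in batch],  # prepend placeholder begin id 0
    lambda batch: [seq + [2] for seq in batch],  # append placeholder end id 2
)

result = pipeline([[11, 12, 13, 14], [21]])
print(result)  # [[0, 11, 12, 13, 2], [0, 21, 2]]
```

Order matters: truncating first and adding the markers afterwards guarantees the markers are never cut off.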
