torchtext.transforms
Transforms are common text transforms. They can be chained together using torch.nn.Sequential or torchtext.transforms.Sequential to support torch-scriptability.
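For instance, a tokenizer, a vocabulary lookup, and tensor conversion can be composed into a single scriptable pipeline. The sketch below is illustrative rather than canonical: the "spm_model" path and the toy vocabulary are placeholders, and the individual transforms are documented in the sections that follow.

    >>> import torch
    >>> import torchtext.transforms as T
    >>> from torchtext.vocab import vocab
    >>> from collections import OrderedDict
    >>> # Toy vocabulary; a real pipeline would build this from a corpus.
    >>> vocab_obj = vocab(OrderedDict([('hello', 1), ('world', 1)]))
    >>> vocab_obj.set_default_index(0)  # map out-of-vocabulary tokens to index 0
    >>> text_transform = T.Sequential(
    ...     T.SentencePieceTokenizer("spm_model"),  # str -> list of token strings
    ...     T.VocabTransform(vocab_obj),            # token strings -> token ids
    ...     T.ToTensor(padding_value=0),            # pad batch and convert to tensor
    ... )
    >>> jit_text_transform = torch.jit.script(text_transform)  # scriptable via torchtext.transforms.Sequential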
SentencePieceTokenizer
class torchtext.transforms.SentencePieceTokenizer(sp_model_path: str)[source]

    Transform for SentencePiece tokenizer from a pre-trained sentencepiece model.

    Additional details: https://github.com/google/sentencepiece

    Parameters:
        sp_model_path (str) – Path to pre-trained sentencepiece model

    Example:
        >>> from torchtext.transforms import SentencePieceTokenizer
        >>> transform = SentencePieceTokenizer("spm_model")
        >>> transform(["hello world", "attention is all you need!"])
GPT2BPETokenizer

CLIPTokenizer
VocabTransform
class torchtext.transforms.VocabTransform(vocab: torchtext.vocab.vocab.Vocab)[source]

    Vocab transform to convert an input batch of tokens into corresponding token ids.

    Parameters:
        vocab – an instance of the torchtext.vocab.Vocab class.

    Example:
        >>> import torch
        >>> from torchtext.vocab import vocab
        >>> from torchtext.transforms import VocabTransform
        >>> from collections import OrderedDict
        >>> vocab_obj = vocab(OrderedDict([('a', 1), ('b', 1), ('c', 1)]))
        >>> vocab_transform = VocabTransform(vocab_obj)
        >>> output = vocab_transform([['a','b'],['a','b','c']])
        >>> jit_vocab_transform = torch.jit.script(vocab_transform)
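If the input can contain tokens that are not in the vocabulary, set a default index on the Vocab object before wrapping it in VocabTransform; otherwise looking up an unknown token raises an error. A minimal sketch, with an illustrative '<unk>' entry:

    >>> vocab_obj = vocab(OrderedDict([('<unk>', 1), ('a', 1), ('b', 1), ('c', 1)]))
    >>> vocab_obj.set_default_index(vocab_obj['<unk>'])
    >>> vocab_transform = VocabTransform(vocab_obj)
    >>> vocab_transform([['a', 'd']])  # 'd' falls back to the index of '<unk>'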
ToTensor
class torchtext.transforms.ToTensor(padding_value: Optional[int] = None, dtype: torch.dtype = torch.int64)[source]

    Convert input to a torch tensor.

    Parameters:
        padding_value (Optional[int]) – Pad value to make each input in the batch of length equal to the longest sequence in the batch.
        dtype (torch.dtype) – torch.dtype of output tensor

    forward(input: Any) → torch.Tensor[source]
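A minimal usage sketch (the token ids are illustrative): with padding_value set, shorter sequences in a batch are padded up to the length of the longest sequence.

    >>> from torchtext.transforms import ToTensor
    >>> to_tensor = ToTensor(padding_value=0)
    >>> to_tensor([[1, 2], [3, 4, 5]])  # expected: tensor([[1, 2, 0], [3, 4, 5]])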