torchtext.transforms
Transforms are common text-processing operations. They can be chained together using torch.nn.Sequential, or using torchtext.transforms.Sequential to support TorchScript.
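To illustrate the chaining idea, the sketch below composes three stand-in steps in plain Python. The `Lookup`, `TruncateTo`, and `Prepend` classes and the `chain` helper are hypothetical names invented for this illustration. They only mimic the pattern of passing each step's output to the next step and are not torchtext classes.

```python
from typing import Callable, Dict, List

Batch = List[list]  # a batch of token (or token id) sequences


class Lookup:
    """Hypothetical stand-in: map each token to an integer id."""

    def __init__(self, table: Dict[str, int]) -> None:
        self.table = table

    def __call__(self, batch: Batch) -> Batch:
        return [[self.table[tok] for tok in seq] for seq in batch]


class TruncateTo:
    """Hypothetical stand-in: keep at most max_len items per sequence."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len

    def __call__(self, batch: Batch) -> Batch:
        return [seq[: self.max_len] for seq in batch]


class Prepend:
    """Hypothetical stand-in: put a marker id at the start of each sequence."""

    def __init__(self, marker: int) -> None:
        self.marker = marker

    def __call__(self, batch: Batch) -> Batch:
        return [[self.marker] + seq for seq in batch]


def chain(*steps: Callable[[Batch], Batch]) -> Callable[[Batch], Batch]:
    """Return a callable that applies the steps in order."""

    def run(batch: Batch) -> Batch:
        for step in steps:
            batch = step(batch)
        return batch

    return run


pipeline = chain(Lookup({"a": 0, "b": 1, "c": 2}), TruncateTo(2), Prepend(9))
result = pipeline([["a", "b"], ["a", "b", "c"]])
print(result)
```

With the real library, the same shape would be expressed by passing transform instances to one of the Sequential containers named above.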
SentencePieceTokenizer
class torchtext.transforms.SentencePieceTokenizer(sp_model_path: str)
Transform for Sentence Piece tokenizer from pre-trained sentencepiece model
Additional details: https://github.com/google/sentencepiece
- Parameters
sp_model_path (str) – Path to pre-trained sentencepiece model
- Example
>>> from torchtext.transforms import SentencePieceTokenizer
>>> transform = SentencePieceTokenizer("spm_model")
>>> transform(["hello world", "attention is all you need!"])
GPT2BPETokenizer
class torchtext.transforms.GPT2BPETokenizer(encoder_json_path: str, vocab_bpe_path: str, return_tokens: bool = False)
Transform for GPT-2 BPE Tokenizer.
Reimplements the OpenAI GPT-2 BPE in TorchScript. Original OpenAI implementation: https://github.com/openai/gpt-2/blob/master/src/encoder.py
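- Parameters
encoder_json_path (str) – Path to GPT-2 BPE encoder json file.
vocab_bpe_path (str) – Path to bpe vocab file.
return_tokens (bool) – Indicate whether to return split tokens. If False, it will return encoded token IDs as strings (default: False)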
CLIPTokenizer
class torchtext.transforms.CLIPTokenizer(merges_path: str, encoder_json_path: Optional[str] = None, num_merges: Optional[int] = None, return_tokens: bool = False)
Transform for CLIP Tokenizer. Based on Byte-Level BPE.
Reimplements CLIP Tokenizer in TorchScript. Original implementation: https://github.com/mlfoundations/open_clip/blob/main/src/clip/tokenizer.py
This tokenizer has been trained to treat spaces as parts of the tokens (a bit like sentencepiece), so a word will be encoded differently depending on whether it is at the beginning of the sentence (without a space) or not.
The code snippet below shows how to use the CLIP tokenizer with the encoder and merges files taken from the original paper implementation.
- Example
>>> from torchtext.transforms import CLIPTokenizer
>>> MERGES_FILE = "http://download.pytorch.org/models/text/clip_merges.bpe"
>>> ENCODER_FILE = "http://download.pytorch.org/models/text/clip_encoder.json"
>>> tokenizer = CLIPTokenizer(merges_path=MERGES_FILE, encoder_json_path=ENCODER_FILE)
>>> tokenizer("the quick brown fox jumped over the lazy dog")
- Parameters
merges_path (str) – Path to bpe merges file.
encoder_json_path (str) – Optional, path to BPE encoder json file. When specified, this is used to infer num_merges.
num_merges (int) – Optional, number of merges to read from the bpe merges file.
return_tokens (bool) – Indicate whether to return split tokens. If False, it will return encoded token IDs as strings (default: False)
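The BPE merging that this tokenizer is based on can be sketched in plain Python: starting from single symbols, the highest-priority adjacent pair from a ranked merge list is merged repeatedly until no listed pair remains. The `apply_merges` helper and the tiny merge list are made up for illustration, and the sketch works on characters rather than bytes for brevity. It is not the TorchScript implementation.

```python
from typing import List, Tuple


def apply_merges(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Greedily apply ranked BPE merges to the characters of text."""
    # Earlier entries in the merge list have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(text)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no adjacent pair appears in the merge list
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge list; the last rule attaches a leading space to the word.
toy_merges = [("d", "o"), ("do", "g"), (" ", "dog")]
without_space = apply_merges("dog", toy_merges)
with_space = apply_merges(" dog", toy_merges)
print(without_space, with_space)
```

Because the space takes part in the merges, the same word yields a different token when it is preceded by a space, which is the effect described above.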
BERTTokenizer
class torchtext.transforms.BERTTokenizer(vocab_path: str, do_lower_case: bool = True, strip_accents: Optional[bool] = None, return_tokens=False)
Transform for BERT Tokenizer.
Based on WordPiece algorithm introduced in paper: https://static.googleusercontent.com/media/research.google.com/ja//pubs/archive/37842.pdf
The backend kernel implementation is taken and modified from https://github.com/LieluoboAi/radish.
See PR https://github.com/pytorch/text/pull/1707 summary for more details.
The code snippet below shows how to use the BERT tokenizer with the pre-trained vocab files.
- Example
>>> from torchtext.transforms import BERTTokenizer
>>> VOCAB_FILE = "https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt"
>>> tokenizer = BERTTokenizer(vocab_path=VOCAB_FILE, do_lower_case=True, return_tokens=True)
>>> tokenizer("Hello World, How are you!") # single sentence input
>>> tokenizer(["Hello World","How are you!"]) # batch input
- Parameters
vocab_path (str) – Path to pre-trained vocabulary file. The path can be either local or URL.
do_lower_case (bool) – Indicate whether to lowercase the input. (default: True)
strip_accents (Optional[bool]) – Indicate whether to strip accents. (default: None)
return_tokens (bool) – Indicate whether to return tokens. If False, returns corresponding token IDs as strings (default: False)
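The WordPiece splitting mentioned above is commonly done at inference time with a greedy longest-match-first search over the vocabulary. The sketch below shows that search for a single word in plain Python. The `wordpiece` helper and the toy vocabulary are assumptions for illustration, the `##` continuation prefix and `[UNK]` token follow the usual BERT vocabulary convention, and this is not the backend kernel the library uses.

```python
from typing import List, Optional, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a word-internal piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"un", "##aff", "##able", "hello"}  # hypothetical vocabulary
pieces = wordpiece("unaffable", toy_vocab)
print(pieces)
```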
VocabTransform
class torchtext.transforms.VocabTransform(vocab: torchtext.vocab.vocab.Vocab)
Vocab transform to convert input batch of tokens into corresponding token ids
- Parameters
vocab – an instance of torchtext.vocab.Vocab class.
- Example
>>> import torch
>>> from torchtext.vocab import vocab
>>> from torchtext.transforms import VocabTransform
>>> from collections import OrderedDict
>>> vocab_obj = vocab(OrderedDict([('a', 1), ('b', 1), ('c', 1)]))
>>> vocab_transform = VocabTransform(vocab_obj)
>>> output = vocab_transform([['a','b'],['a','b','c']])
>>> jit_vocab_transform = torch.jit.script(vocab_transform)
ToTensor
class torchtext.transforms.ToTensor(padding_value: Optional[int] = None, dtype: torch.dtype = torch.int64)
Convert input to torch tensor
- Parameters
padding_value (Optional[int]) – Value used to pad each input in the batch to the length of the longest sequence in the batch.
dtype (torch.dtype) – torch.dtype of output tensor
forward(input: Any) → torch.Tensor
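The effect of padding_value described above can be sketched in plain Python: every sequence in the batch is extended on the right until it is as long as the longest one, which makes the batch rectangular. The `pad_to_longest` helper is a hypothetical name used only for this illustration. It works on nested lists and does not produce a torch tensor.

```python
from typing import List


def pad_to_longest(batch: List[List[int]], padding_value: int) -> List[List[int]]:
    """Right-pad every sequence to the length of the longest one."""
    longest = max((len(seq) for seq in batch), default=0)
    return [seq + [padding_value] * (longest - len(seq)) for seq in batch]


padded = pad_to_longest([[1, 2], [1, 2, 3]], padding_value=0)
print(padded)
```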
LabelToIndex
Truncate
AddToken
Sequential
PadTransform
class torchtext.transforms.PadTransform(max_length: int, pad_value: int)
Pad tensor to a fixed length with given padding value.
- Parameters
max_length (int) – Maximum length to pad to
pad_value (int) – Value to pad the tensor with
forward(x: torch.Tensor) → torch.Tensor
- Parameters
x (Tensor) – The tensor to pad
- Returns
Tensor padded up to max_length with pad_value
- Return type
Tensor