torchtext.data.functional
generate_sp_model
torchtext.data.functional.generate_sp_model(filename, vocab_size=20000, model_type='unigram', model_prefix='m_user')
Train a SentencePiece tokenizer.
- Parameters
filename – the data file used to train the SentencePiece model.
vocab_size – the size of the vocabulary (Default: 20,000).
model_type – the type of SentencePiece model: unigram, bpe, char, or word (Default: 'unigram').
model_prefix – the prefix of the files in which the model and vocab are saved (Default: 'm_user').
- Outputs:
The model and vocab are saved in two separate files whose names start with model_prefix.
Examples
>>> from torchtext.data.functional import generate_sp_model
>>> generate_sp_model('test.csv', vocab_size=23456, model_prefix='spm_user')
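As a rough sketch of what to look for after training: SentencePiece conventionally names the two output files after the prefix, with a .model and a .vocab extension. The helper below only builds those two paths so they can be checked (for example with Path.exists()) once training has run. The name expected_sp_files is hypothetical and not part of torchtext, and the naming scheme is an assumption about SentencePiece's output rather than something this page states.

```python
from pathlib import Path


def expected_sp_files(model_prefix="m_user"):
    """Return the (model, vocab) paths that training is expected to write.

    Hypothetical helper for illustration; it is not part of torchtext.
    Assumes SentencePiece's "<prefix>.model" / "<prefix>.vocab" naming.
    """
    return Path(f"{model_prefix}.model"), Path(f"{model_prefix}.vocab")


# Paths matching the example call with model_prefix='spm_user'.
model_path, vocab_path = expected_sp_files("spm_user")
print(model_path, vocab_path)
```

The model path is the one that would then be passed to load_sp_model.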
load_sp_model
torchtext.data.functional.load_sp_model(spm)
Load a SentencePiece model from a file.
- Parameters
spm – the file path of the saved SentencePiece model, or a file object opened on it.
- Outputs:
output: a SentencePiece model.
Examples
>>> from torchtext.data.functional import load_sp_model
>>> sp_model = load_sp_model("m_user.model")
>>> sp_model = load_sp_model(open("m_user.model", 'rb'))
sentencepiece_numericalizer
torchtext.data.functional.sentencepiece_numericalizer(sp_model)
Use a SentencePiece model to numericalize text sentences into a generator over the ids.
- Parameters
sp_model – a SentencePiece model.
- Outputs:
output: a function that takes an iterable of text sentences and returns a generator over the corresponding ids, based on the SentencePiece model.
Examples
>>> from torchtext.data.functional import sentencepiece_numericalizer
>>> sp_id_generator = sentencepiece_numericalizer(sp_model)
>>> list_a = ["sentencepiece encode as pieces", "examples to try!"]
>>> list(sp_id_generator(list_a))
[[9858, 9249, 1629, 1305, 1809, 53, 842], [2347, 13, 9, 150, 37]]
sentencepiece_tokenizer
torchtext.data.functional.sentencepiece_tokenizer(sp_model)
Use a SentencePiece model to tokenize text sentences into a generator over the tokens.
- Parameters
sp_model – a SentencePiece model.
- Outputs:
output: a function that takes an iterable of text sentences and returns a generator over the corresponding tokens, based on the SentencePiece model.
Examples
>>> from torchtext.data.functional import sentencepiece_tokenizer
>>> sp_tokens_generator = sentencepiece_tokenizer(sp_model)
>>> list_a = ["sentencepiece encode as pieces", "examples to try!"]
>>> list(sp_tokens_generator(list_a))
[['_sentence', 'piece', '_en', 'co', 'de', '_as', '_pieces'], ['_example', 's', '_to', '_try', '!']]
custom_replace
torchtext.data.functional.custom_replace(replace_pattern)
A transform that converts text strings by applying a list of (pattern, replacement) pairs.
Examples
>>> from torchtext.data.functional import custom_replace
>>> custom_replace_transform = custom_replace([(r'S', 's'), (r'\s+', ' ')])
>>> list_a = ["Sentencepiece encode aS pieces", "exampleS to try!"]
>>> list(custom_replace_transform(list_a))
['sentencepiece encode as pieces', 'examples to try!']
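To make the behaviour in the example concrete, here is a minimal pure-Python sketch of a transform of this kind, using only the standard re module. It is an illustration under assumptions, not torchtext's implementation: the name custom_replace_sketch is hypothetical, and it assumes each pair is a regular expression plus a replacement string, applied in the order given.

```python
import re


def custom_replace_sketch(replace_pattern):
    """Stand-in for illustration only; not torchtext's implementation."""
    # Compile each (regex, replacement) pair once, up front.
    compiled = [(re.compile(pattern), repl) for pattern, repl in replace_pattern]

    def transform(txt_iter):
        for line in txt_iter:
            # Apply the substitutions in the order they were given.
            for pattern, repl in compiled:
                line = pattern.sub(repl, line)
            yield line

    return transform


transform = custom_replace_sketch([(r'S', 's'), (r'\s+', ' ')])
# The first input has a double space to exercise the whitespace rule.
result = list(transform(["Sentencepiece  encode aS pieces", "exampleS to try!"]))
print(result)
```

Because the pairs are applied in order, a later pattern sees the output of the earlier ones, so the order of the list can change the result.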
simple_space_split
torchtext.data.functional.simple_space_split(iterator)
A transform that splits each text string by spaces.
Examples
>>> from torchtext.data.functional import simple_space_split
>>> list_a = ["Sentencepiece encode as pieces", "example to try!"]
>>> list(simple_space_split(list_a))
[['Sentencepiece', 'encode', 'as', 'pieces'], ['example', 'to', 'try!']]
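The example wraps the call in list(), which suggests the transform is lazy. A small sketch of that pattern follows, under the assumption that splitting is done with str.split() and no arguments (so runs of whitespace count as one separator); space_split_sketch is a hypothetical name and the library's behaviour may differ in edge cases.

```python
def space_split_sketch(iterator):
    """Stand-in for illustration only; not torchtext's implementation."""
    for line in iterator:
        # str.split() with no arguments splits on any run of whitespace.
        yield line.split()


splits = space_split_sketch(["Sentencepiece encode as pieces", "example to try!"])
first = next(splits)  # lines are split one at a time, on demand
rest = list(splits)   # consume whatever is left
print(first, rest)
```

Yielding one list per line lets the transform be chained with other iterator-based transforms without materializing the whole dataset.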
numericalize_tokens_from_iterator
torchtext.data.functional.numericalize_tokens_from_iterator(vocab, iterator, removed_tokens=None)
Yield a list of ids from a token iterator, using a vocab.
- Parameters
vocab – the vocabulary that converts tokens into ids.
iterator – the iterator that yields lists of tokens.
removed_tokens – tokens to remove from the output dataset (Default: None).
Examples
>>> from torchtext.data.functional import simple_space_split
>>> from torchtext.data.functional import numericalize_tokens_from_iterator
>>> vocab = {'Sentencepiece' : 0, 'encode' : 1, 'as' : 2, 'pieces' : 3}
>>> ids_iter = numericalize_tokens_from_iterator(vocab,
...                                              simple_space_split(["Sentencepiece as pieces",
...                                                                  "as pieces"]))
>>> for ids in ids_iter:
...     print([num for num in ids])
[0, 2, 3]
[2, 3]
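The example above does not exercise removed_tokens. The sketch below shows one plausible reading of that parameter, assuming the listed tokens are dropped before the vocab lookup and that each yielded item is itself an iterator over ids. numericalize_sketch is a hypothetical stand-in written in plain Python, not torchtext's implementation.

```python
def numericalize_sketch(vocab, iterator, removed_tokens=None):
    """Stand-in for illustration only; not torchtext's implementation."""
    # Assumption: removed tokens are filtered out before the lookup.
    removed = set(removed_tokens or ())
    for tokens in iterator:
        yield iter(vocab[token] for token in tokens if token not in removed)


vocab = {'Sentencepiece': 0, 'encode': 1, 'as': 2, 'pieces': 3}
token_lists = [["Sentencepiece", "as", "pieces"], ["as", "pieces"]]

# Without removed_tokens every token is mapped to its id.
all_ids = [list(ids) for ids in numericalize_sketch(vocab, token_lists)]
# With removed_tokens=['as'] the token 'as' never reaches the vocab.
kept_ids = [list(ids) for ids in numericalize_sketch(vocab, token_lists, removed_tokens=['as'])]
print(all_ids, kept_ids)
```

With a plain dict as the vocab, a token that is neither in the vocab nor in removed_tokens raises KeyError in this sketch.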