I am trying to run the HuggingFace GPT2 model on a smartphone. To tokenize the input text, I need HuggingFace's GPT2 tokenizer. Unfortunately, HuggingFace only offers a Python version and declined a request for a C/C++ version in March 2020.
HuggingFace does offer a Rust version, but I worry that bringing Rust into a mobile build would create more problems than it solves.
According to its documentation, NVIDIA's NeMo project uses Google's SentencePiece tokenizer library, which is written in C++. However, reading the source code of HuggingFace's GPT2 tokenizer, I noticed that it is a different kind of tokenizer: a byte-pair encoding (BPE) tokenizer. According to HuggingFace's nice guide to tokenizers,
Each model has its own tokenizer type. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.
This guide also explains the three main types of tokenizers used with Transformer models: byte-pair encoding (BPE), WordPiece, and SentencePiece.
Therefore, I decided to translate HuggingFace’s GPT2 tokenizer from Python to C++.
To figure out which part of HuggingFace's Python tokenizer project I need to port, I wrote a simple Python script that calls the tokenizer and used pdb to step into the function calls. I then noticed that this eight-line Python method is all I need to translate to run the GPT2 model.
def _tokenize(self, text):
    """Tokenize a string."""
    bpe_tokens = []
    for token in re.findall(self.pat, text):
        token = "".join(
            self.byte_encoder[b] for b in token.encode("utf-8")
        )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
        bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
    return bpe_tokens
From the above code snippet, we can see that it does three things:

- Roughly segments the input text into candidate tokens using regular expression matching via re.findall.
- Maps every byte b of each candidate token to a printable Unicode character through self.byte_encoder, so that spaces and control bytes never appear literally inside a BPE token (see the sketch after this list).
- Maps each candidate token into one or more BPE tokens via self.bpe.
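To make the first two steps concrete, here is a small, self-contained sketch. It is my paraphrase of the pre-tokenization pattern and the bytes_to_unicode table as I read them in tokenization_gpt2.py, not the library's own code, so treat it as an illustration only.

```python
import regex  # transformers uses the third-party `regex` package; stdlib `re` lacks \p{L}

# Step 1: the pre-tokenization pattern (paraphrased from tokenization_gpt2.py).
pat = regex.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

# Step 2: build a byte -> printable-character table. Printable Latin-1 bytes map to
# themselves; the remaining bytes (space, control bytes, ...) are shifted to code
# points >= 256, so no raw whitespace ever appears inside a BPE token.
def bytes_to_unicode():
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

for token in regex.findall(pat, "zero one two three four"):
    mapped = "".join(byte_encoder[b] for b in token.encode("utf-8"))
    print(repr(token), "->", repr(mapped))  # e.g. ' one' -> 'Ġone'
```

The leading space of each word survives as the character Ġ, which is why GPT2's vocabulary is full of entries like Ġone; step 3 then runs the BPE merges (self.bpe) on these mapped strings.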
In subsequent notes, I will explain how I ported these three steps. In the rest of this article, I describe how I identified that the above function is all I need to port.
I git-cloned HuggingFace's Transformers repository.
git clone https://github.com/huggingface/transformers
Following the Editable Install section, I ran the following commands:
cd transformers
pip install -e .
This process might complain about missing dependencies; just install them. After that, we should be able to import transformers:
python -c 'import transformers'
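As an extra check (not strictly required), one can also confirm that the editable install is the copy actually being imported; the printed path should point into the cloned repository:
python -c 'import transformers; print(transformers.__file__)'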
I wrote the following script, t.py, to tokenize a string:
import transformers
import builtins
from os import path

def load_gpt2_tokenizer() -> transformers.GPT2Tokenizer:
    # Save builtins.open and restore it after constructing the tokenizer.
    builtins.open, tmp_open = open, builtins.open
    # Directory holding the downloaded vocab.json and merges.txt of GPT2.
    gpt2_dir = "/Users/y/w/gpt2cpp/assets"
    tokenizer = transformers.GPT2Tokenizer(
        vocab_file=path.join(gpt2_dir, 'vocab.json'),
        merges_file=path.join(gpt2_dir, 'merges.txt'))
    builtins.open = tmp_open
    return tokenizer

tknzr = load_gpt2_tokenizer()
print(tknzr("zero one two three four"))
My purpose is to use pdb to trace into the call to tknzr("zero one two three four").
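Before stepping through with pdb, a quick sanity check (my own addition, not part of the original trace) is to append the following lines to t.py; for plain text with no special tokens such as <|endoftext|>, the public entry point should reduce to the eight-line method:

```python
# Hypothetical check appended to t.py: for plain text without special tokens,
# tokenize() and _tokenize() should produce the same list of BPE token strings.
print(tknzr.tokenize("zero one two three four"))
print(tknzr._tokenize("zero one two three four"))
```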
Before showing the tracing result, it helps to know the class hierarchy: GPT2Tokenizer derives from PreTrainedTokenizer, which in turn derives from PreTrainedTokenizerBase.
Running the above driver script using the following command
python -m pdb t.py
reveals the following calling chain:
- PreTrainedTokenizerBase.__call__
  - PreTrainedTokenizerBase._switch_to_input_mode is an empty implementation
  - PreTrainedTokenizerBase._call_one
    - PreTrainedTokenizerBase.encode_plus
      - PreTrainedTokenizerBase._get_padding_truncation_strategies
      - PreTrainedTokenizer._encode_plus
        - the nested function get_input_ids
          - PreTrainedTokenizer.tokenize
            - all_special_tokens_extended = {'<|endoftext|>': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}
            - text, kwargs = self.prepare_for_tokenization(text, **kwargs) does nothing.
            - no_split_token = set(self.unique_no_split_tokens) returns {}
            - tokens = self.tokens_trie.split(text) returns ['zero one two three four']
            - tokenized_text = []
            - tokenized_text.extend(self._tokenize(token))
              - GPT2Tokenizer._tokenize

Here we reached GPT2Tokenizer._tokenize. And before this invocation, all of the stack frames above do basically nothing.