I am trying to run the HuggingFace GPT2 model on a smartphone. To tokenize the input text, I need HuggingFace's GPT2 tokenizer. Unfortunately, HuggingFace only offers a Python version and declined a request for a C/C++ version in March 2020.
HuggingFace does offer a Rust version, but I worry that bringing Rust into a mobile build would create more problems than it solves.
According to its documentation, NVIDIA's NeMo project uses Google's SentencePiece tokenizer library, which is written in C++. However, reading the source code of HuggingFace's GPT2 tokenizer, I noticed that it is a different kind of tokenizer: a byte-pair encoding (BPE) tokenizer. According to HuggingFace's nice guide to tokenizers,
Each model has its own tokenizer type. A pretrained model only performs properly if you feed it an input that was tokenized with the same rules that were used to tokenize its training data.
This guide also explains the three main types of tokenizers used with Transformer models: byte-pair encoding (BPE), WordPiece, and SentencePiece.
Therefore, I decided to translate HuggingFace’s GPT2 tokenizer from Python to C++.
To figure out which part of HuggingFace's Python tokenizer project I need to port, I wrote a simple Python script that calls the tokenizer and used pdb to step into the function calls. I then noticed that this eight-line Python method is all I need to translate to run the GPT2 model.
def _tokenize(self, text):
    """Tokenize a string."""
    bpe_tokens = []
    for token in re.findall(self.pat, text):
        token = "".join(
            self.byte_encoder[b] for b in token.encode("utf-8")
        )  # Maps all our bytes to unicode strings, avoiding control tokens of the BPE (spaces in our case)
        bpe_tokens.extend(bpe_token for bpe_token in self.bpe(token).split(" "))
    return bpe_tokens
From the above code snippet, we can see that it does three things:

- Roughly segments the input text into candidate tokens using regular expression matching via re.findall.
- Maps every byte b of each candidate token to a printable Unicode character through self.byte_encoder, so that spaces and control bytes never appear literally inside a BPE token (see the sketch after this list).
- Maps each candidate token into one or more BPE tokens via self.bpe.
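To make the first two steps concrete, here is a small, self-contained sketch. It is my paraphrase of the pre-tokenization pattern and the bytes_to_unicode table as I read them in tokenization_gpt2.py, not the library's own code, so treat it as an illustration only.

```python
import regex  # transformers uses the third-party `regex` package; stdlib `re` lacks \p{L}

# Step 1: the pre-tokenization pattern (paraphrased from tokenization_gpt2.py).
pat = regex.compile(r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+""")

# Step 2: build a byte -> printable-character table. Printable Latin-1 bytes map to
# themselves; the remaining bytes (space, control bytes, ...) are shifted to code
# points >= 256, so no raw whitespace ever appears inside a BPE token.
def bytes_to_unicode():
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(2**8):
        if b not in bs:
            bs.append(b)
            cs.append(2**8 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))

byte_encoder = bytes_to_unicode()

for token in regex.findall(pat, "zero one two three four"):
    mapped = "".join(byte_encoder[b] for b in token.encode("utf-8"))
    print(repr(token), "->", repr(mapped))  # e.g. ' one' -> 'Ġone'
```

The leading space of each word survives as the character Ġ, which is why GPT2's vocabulary is full of entries like Ġone; step 3 then runs the BPE merges (self.bpe) on these mapped strings.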
In subsequent notes, I will explain how I ported these three steps. In the rest of this article, I describe how I identified that the above function is all I need to port.
I git-cloned HuggingFace's Transformers repository.
git clone https://github.com/huggingface/transformers
Following the Editable Install section, I ran the following commands:
cd transformers
pip install -e .
This process might complain about missing dependencies; just install them. After that, we should be able to import transformers:
python -c 'import transformers'
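As an extra check (not strictly required), one can also confirm that the editable install is the copy actually being imported; the printed path should point into the cloned repository:
python -c 'import transformers; print(transformers.__file__)'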
I wrote the following script, t.py, to tokenize a string:
import transformers
import builtins
from os import path

def load_gpt2_tokenizer() -> transformers.GPT2Tokenizer:
    # Save builtins.open and restore it after constructing the tokenizer.
    builtins.open, tmp_open = open, builtins.open
    # Directory holding the downloaded vocab.json and merges.txt of GPT2.
    gpt2_dir = "/Users/y/w/gpt2cpp/assets"
    tokenizer = transformers.GPT2Tokenizer(
        vocab_file=path.join(gpt2_dir, 'vocab.json'),
        merges_file=path.join(gpt2_dir, 'merges.txt'))
    builtins.open = tmp_open
    return tokenizer

tknzr = load_gpt2_tokenizer()
print(tknzr("zero one two three four"))
My purpose is to use pdb to trace into the call to tknzr("zero one two three four").
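Before stepping through with pdb, a quick sanity check (my own addition, not part of the original trace) is to append the following lines to t.py; for plain text with no special tokens such as <|endoftext|>, the public entry point should reduce to the eight-line method:

```python
# Hypothetical check appended to t.py: for plain text without special tokens,
# tokenize() and _tokenize() should produce the same list of BPE token strings.
print(tknzr.tokenize("zero one two three four"))
print(tknzr._tokenize("zero one two three four"))
```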
Before showing the tracing result, it helps to know the class hierarchy: GPT2Tokenizer derives from PreTrainedTokenizer, which in turn derives from PreTrainedTokenizerBase.
Running the above driver script using the following command
python -m pdb t.py
reveals the following calling chain:
- PreTrainedTokenizerBase.__call__
  - PreTrainedTokenizerBase._switch_to_input_mode is an empty implementation
  - PreTrainedTokenizerBase._call_one
    - PreTrainedTokenizerBase.encode_plus
      - PreTrainedTokenizerBase._get_padding_truncation_strategies
      - PreTrainedTokenizer._encode_plus
        - the nested function get_input_ids
          - PreTrainedTokenizer.tokenize
            - all_special_tokens_extended = {'<|endoftext|>': AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True)}
            - text, kwargs = self.prepare_for_tokenization(text, **kwargs) does nothing.
            - no_split_token = set(self.unique_no_split_tokens) returns {}
            - tokens = self.tokens_trie.split(text) returns ['zero one two three four']
            - tokenized_text = []
            - tokenized_text.extend(self._tokenize(token))
              - GPT2Tokenizer._tokenize

Here we reached GPT2Tokenizer._tokenize. And before this invocation, all of the stack frames above do basically nothing.