ByteLevelBPETokenizer output seems weird #203
Comments
I also have a similar issue with Persian texts.
TL;DR: This is how byte-level BPE works. Its main advantages are smaller vocabularies and no need for an "unknown" token.
This is totally expected behavior. The byte-level BPE converts every Unicode code point into one or more byte-level characters, which is why the vocabulary, the merges, and the encoded tokens no longer look like the original script.
The point of doing so is that you start from an initial alphabet of only 256 tokens. These 256 tokens can then be merged together to represent any other token in the vocabulary. This results in smaller vocabularies that never need an "unknown" token.
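To make the conversion concrete, here is a minimal sketch of a byte-to-character mapping in the style popularized by GPT-2. It assumes the tokenizer uses a mapping of this kind; the names `bytes_to_unicode`, `to_byte_level`, and `from_byte_level` are illustrative helpers, not part of the library's API. It shows why Amharic text looks unreadable after the byte-level step, and that nothing is lost.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a distinct printable character.

    Sketch of a GPT-2 style table; assumed, not taken from the library.
    """
    # Byte values that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    shift = 0
    for b in range(256):
        if b not in printable:
            # Control bytes, space, etc. are moved to code points >= 256.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def to_byte_level(text):
    """Encode text as UTF-8 and replace every byte with its visible character."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))


def from_byte_level(symbols):
    """Invert to_byte_level: visible characters back to bytes, then to text."""
    return bytes(CHAR_TO_BYTE[c] for c in symbols).decode("utf-8")


text = "ሰላም"  # three Amharic characters, three UTF-8 bytes each
symbols = to_byte_level(text)
print(len(text), len(symbols))   # 3 9
print(symbols)                   # looks like gibberish, but is lossless
print(from_byte_level(symbols))  # the original text again
```

In this sketch, ASCII letters map to themselves, which is why English vocabularies stay mostly readable while scripts such as Amharic or Persian, whose characters take several UTF-8 bytes, do not. Reversing the mapping recovers the original text exactly, so the unreadable files are a display matter rather than a loss of information.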
I use the ByteLevelBPETokenizer to train a custom tokenizer for the Amharic language (a low-resource language). The merges.txt and vocab.json files I obtained are not human-readable.
Also, encoding text produces the same unreadable output.
Is this the expected behavior? I will later use this tokenizer to train a RoBERTa model using the run_language_modeling.py script. Thanks.