Direct tokenization #64

Open · wants to merge 7 commits into main
Conversation

kl-thamm

I had an issue with the t2v-transformers container today:

I created embeddings with the same sentence-transformers model twice: once using the sentence-transformers Python library and once using the t2v-transformers container.
The cosine distance between the resulting vectors was up to 0.16.

@antas-marcin helped me greatly and quickly by suggesting I set "T2V_TRANSFORMERS_DIRECT_TOKENIZE=true". This reduced the cosine distance to almost 0.
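
For reference, a minimal sketch of the comparison I ran (assumptions: the container listens on localhost:8080 and exposes a POST /vectors endpoint returning a JSON body with a "vector" field; the model name is just an example):

```python
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

text = "Weaviate is an open source vector database. It stores objects and vectors."

# Embedding via the sentence-transformers library
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
lib_vec = model.encode(text)

# Embedding via the running t2v-transformers container (endpoint assumed)
resp = requests.post("http://localhost:8080/vectors", json={"text": text})
ct_vec = np.array(resp.json()["vector"])

# Cosine distance = 1 - cosine similarity
cos_sim = np.dot(lib_vec, ct_vec) / (np.linalg.norm(lib_vec) * np.linalg.norm(ct_vec))
print("cosine distance:", 1 - cos_sim)
```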

Looking into what this setting does, I noticed two things:

  1. It's a bit difficult to understand in the code because "tokenize" actually has two meanings
  2. T2V_TRANSFORMERS_DIRECT_TOKENIZE is not well documented, even though it could theoretically be very important

Regarding 1:
In the context of this program, "tokenize" means both splitting the input into sentences and applying the transformers tokenizer.
I suggest renaming direct_tokenize to shall_split_in_sentences or something similar. shall_embed_sentence_per_sentence might be even more precise, but it is a bit verbose. Other suggestions are very welcome; this is just the general idea.
The environment variable would then become T2V_SHALL_SPLIT_IN_SENTENCES.
(see the commit)
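
To illustrate the intended semantics (hypothetical sketch, not the actual diff in the commit), the renamed setting could be read like this:

```python
import os

def shall_split_in_sentences() -> bool:
    # Defaults to true when unset; setting it to "false" disables
    # sentence splitting and embeds the raw input instead.
    return os.getenv("T2V_SHALL_SPLIT_IN_SENTENCES", "true").strip().lower() != "false"
```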

Regarding 2:
This setting seems important to me and should be documented somewhere.
I don't know how to suggest edits for the documentation, so I am writing down here what I think would be helpful:

Environment Settings
T2V_SHALL_SPLIT_IN_SENTENCES: Defaults to true if not set. If set to false, the raw input is used.

By default, all t2v-transformers containers split the input into sentences using nltk with English punctuation rules and calculate the mean over the sentence embeddings. This allows embedding inputs of arbitrary length, but it will produce unexpected results if your text does not use the expected punctuation.
Embedding on a per-sentence level could also, at least in theory, degrade the embedding model's performance if the model produces better results with longer inputs.

(Also, could this be significantly slower, embedding sentence by sentence rather than a larger input at once?)
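
To make the difference concrete, here is a simplified approximation of the two modes (not the container's actual code; the model name is just an example):

```python
import nltk
import numpy as np
from sentence_transformers import SentenceTransformer

nltk.download("punkt", quiet=True)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def embed_with_splitting(text: str) -> np.ndarray:
    # Default behaviour: split into sentences (relies on English
    # punctuation), embed each sentence, then mean-pool the vectors.
    sentences = nltk.sent_tokenize(text)
    return np.mean(model.encode(sentences), axis=0)

def embed_direct(text: str) -> np.ndarray:
    # With T2V_TRANSFORMERS_DIRECT_TOKENIZE=true: embed the raw input
    # in one pass (subject to the model's maximum sequence length).
    return model.encode(text)
```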

@weaviate-git-bot

To avoid any confusion in the future about your contribution to Weaviate, we work with a Contributor License Agreement. If you agree, you can simply add a comment to this PR that you agree with the CLA so that we can merge.

beep boop - the Weaviate bot 👋🤖

PS:
Are you already a member of the Weaviate Slack channel?

(Resolved, outdated review comment on config.py)
@antas-marcin
Contributor

@kl-thamm if you want us to be able to merge your PR, you need to agree to the CLA.
Simply replying here with a comment "I agree to CLA" will let us merge your PR.

kl-thamm (Author) commented Jun 29, 2023

@antas-marcin Thanks! I agree to CLA.
The problem is that the smoke test runs fine for me locally with the model I use, and some tests passed here as well. If the tests now fail after the additional commits I made, I would be unsure how to proceed :)
