
[FEAT] Support Cohere Embeddings for SemanticChunker and SDPMChunker #118 #130

Open
wants to merge 1 commit into `development`
Conversation

Udayk02
Copy link
Contributor

@Udayk02 Udayk02 commented Jan 4, 2025

  • added support for all Cohere embedding models.
  • added Cohere to the registry so AutoEmbeddings can resolve it.
  • added tests for the above.

key things to notice:

  1. some models do not have tokenizers available; used the default "embed-english-light-v3.0" tokenizer for those. (please review and conclude)
  2. there is a limit of 96 documents per embed API call.
  3. the model context length mentioned in the Cohere documentation is 512, but they do not state anywhere that longer texts are truncated, only that exceeding it is not optimal. reflected the same in the code.
  4. used the `tokenizers` library, loading the tokenizer files directly from Cohere's official URLs, as I faced an issue loading them through HF via autotiktokenizer.
  5. there isn't a way to check whether the client was initialized with an api_key (`check_api_key` is a deprecated function), and Cohere does not provide retry functionality, so retries are handled on the chonkie end.
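Points 2 and 5 above can be sketched roughly as follows. This is a minimal illustration, not chonkie's actual code: the names `batched`, `embed_with_retry`, and `embed_fn` are assumptions, and the real implementation would pass each batch to the Cohere client's embed call.

```python
import time
from typing import Callable, List, Sequence

MAX_DOCS_PER_CALL = 96  # Cohere's per-request document limit (point 2)


def batched(texts: Sequence[str], size: int = MAX_DOCS_PER_CALL):
    """Yield successive slices of at most `size` texts."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]


def embed_with_retry(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    retries: int = 3,
    backoff: float = 1.0,
) -> List[List[float]]:
    """Embed texts in batches of <= 96, retrying each batch with
    exponential backoff since Cohere provides no built-in retries
    (point 5). `embed_fn` stands in for the actual Cohere embed call."""
    embeddings: List[List[float]] = []
    for batch in batched(texts):
        for attempt in range(retries):
            try:
                embeddings.extend(embed_fn(list(batch)))
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries, surface the error
                time.sleep(backoff * (2 ** attempt))
    return embeddings
```

With 200 input texts this would issue three API calls of sizes 96, 96, and 8.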

@bhavnicksm
Copy link
Collaborator

Hey @Udayk02!

Thanks for opening a PR 🚀

I am curious about point 4; you mentioned:

used the `tokenizers` library, loading the tokenizer files directly from Cohere's official URLs, as I faced an issue loading them through HF via autotiktokenizer.

Could you describe the issue you faced with autotiktokenizer?

Thanks ☺️

@Udayk02
Copy link
Contributor Author

Udayk02 commented Jan 5, 2025

As I delved into autotiktokenizer, I found the bug. It is very minor: `vocab` is always assumed to be a dict, but in the Cohere tokenizer.json it is presented as a list of lists. I raised an issue there. Please check!
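A minimal sketch of the mismatch described above, assuming the list form is `[token, score]` pairs as in unigram-style tokenizer files; `normalize_vocab` is an illustrative helper, not actual autotiktokenizer code:

```python
def normalize_vocab(vocab):
    """Accept either a {token: id} dict or a list of [token, ...] pairs.

    A dict is returned unchanged; a list-of-lists vocab (as in Cohere's
    tokenizer.json) is converted by assigning ids from list position.
    """
    if isinstance(vocab, dict):
        return vocab
    # list-of-lists form: the first element of each entry is the token
    return {entry[0]: idx for idx, entry in enumerate(vocab)}
```

For example, `normalize_vocab([["a", 0.1], ["b", 0.2]])` yields `{"a": 0, "b": 1}`.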

- cohere embeddings support
- added to the registry for autoembeddings
- added tests