
[FEAT] Support Cohere Embeddings for SemanticChunker and SDPMChunker #118 #130

Open
wants to merge 1 commit into `development`
Conversation

Udayk02
Copy link
Contributor

@Udayk02 Udayk02 commented Jan 4, 2025

  • added support for all Cohere embedding models.
  • added Cohere to the registry so AutoEmbeddings can resolve it.
  • added tests for the above.

key things to notice:

  1. some models do not have tokenizers available; used the default "embed-english-light-v3.0" tokenizer for those. (please review and conclude)
  2. there is a limit of 96 documents per embed API call.
  3. the model context length mentioned in the Cohere documentation is 512, but they do not state anywhere that longer texts are truncated, only that exceeding it is not optimal. reflected the same in the code.
  4. used the `tokenizers` library, loading the tokenizer files directly from Cohere's official URLs, as I faced an issue loading them through HF via autotiktokenizer.
  5. there isn't a way to check whether the client was initialized with an api_key (`check_api_key` is a deprecated function), and Cohere does not provide retry functionality, so retries are handled on the chonkie end.
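Points 2 and 5 above can be sketched roughly as follows. This is a minimal illustration, not chonkie's actual code: the names `batched`, `embed_with_retry`, and `embed_fn` are assumptions, and the real implementation would pass each batch to the Cohere client's embed call.

```python
import time
from typing import Callable, List, Sequence

MAX_DOCS_PER_CALL = 96  # Cohere's per-request document limit (point 2)


def batched(texts: Sequence[str], size: int = MAX_DOCS_PER_CALL):
    """Yield successive slices of at most `size` texts."""
    for i in range(0, len(texts), size):
        yield texts[i:i + size]


def embed_with_retry(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    retries: int = 3,
    backoff: float = 1.0,
) -> List[List[float]]:
    """Embed texts in batches of <= 96, retrying each batch with
    exponential backoff since Cohere provides no built-in retries
    (point 5). `embed_fn` stands in for the actual Cohere embed call."""
    embeddings: List[List[float]] = []
    for batch in batched(texts):
        for attempt in range(retries):
            try:
                embeddings.extend(embed_fn(list(batch)))
                break
            except Exception:
                if attempt == retries - 1:
                    raise  # out of retries, surface the error
                time.sleep(backoff * (2 ** attempt))
    return embeddings
```

With 200 input texts this would issue three API calls of sizes 96, 96, and 8.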

@bhavnicksm
Copy link
Collaborator

Hey @Udayk02!

Thanks for opening a PR 🚀

I am curious about point 4; you mentioned:

used the `tokenizers` library, loading the tokenizer files directly from Cohere's official URLs, as I faced an issue loading them through HF via autotiktokenizer.

Could you describe the issue you faced with autotiktokenizer?

Thanks ☺️

@Udayk02
Copy link
Contributor Author

Udayk02 commented Jan 5, 2025

As I delved into autotiktokenizer, I found the bug. It is very minor: `vocab` is always assumed to be a dict, but in the Cohere tokenizer.json it is presented as a list of lists. I raised an issue there. Please check!
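A minimal sketch of the mismatch described above, assuming the list form is `[token, score]` pairs as in unigram-style tokenizer files; `normalize_vocab` is an illustrative helper, not actual autotiktokenizer code:

```python
def normalize_vocab(vocab):
    """Accept either a {token: id} dict or a list of [token, ...] pairs.

    A dict is returned unchanged; a list-of-lists vocab (as in Cohere's
    tokenizer.json) is converted by assigning ids from list position.
    """
    if isinstance(vocab, dict):
        return vocab
    # list-of-lists form: the first element of each entry is the token
    return {entry[0]: idx for idx, entry in enumerate(vocab)}
```

For example, `normalize_vocab([["a", 0.1], ["b", 0.2]])` yields `{"a": 0, "b": 1}`.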

- cohere embeddings support
- added to the registry for autoembeddings
- added tests