Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BaseEmbeddings.embed should be treated as callable #92

Closed
ascendant512 opened this issue Dec 13, 2024 · 4 comments
Closed

BaseEmbeddings.embed should be treated as callable #92

ascendant512 opened this issue Dec 13, 2024 · 4 comments
Assignees
Labels
bug Something isn't working

Comments

@ascendant512
Copy link

I tried to write a BaseEmbeddings implementation and it crashed due to this line:

elif inspect.isfunction(tokenizer_or_token_counter):

It should be elif callable(tokenizer_or_token_counter):
The test fails because BaseEmbeddings.embed is a method, which inspect.isfunction does not consider to be a function.

The symptom of the crash was in the line Tokenizer backend {str(type(self.tokenizer))} not supported where it referred to my class and its method embed.

FYI, here is my class, I don't really know yet if it's correct, but I know that my change allowed it to run:

@dataclass
class OllamaEmbeddings(BaseEmbeddings):
	client: PatchedClient
	model: str

	def count_tokens(self, text: str) -> int:
		tokenized = self.client.tokenize(model=self.model, prompt=text)
		return len(tokenized.tokens)

	def embed(self, text: str) -> ndarray:
		embedded = self.client.embed(self.model, text)
		return array(embedded.embeddings)

	@property
	def dimension(self) -> int:
		return 2

	def embed_batch(self, texts: List[str]) -> List[ndarray]:
		return super().embed_batch(texts)

	def count_tokens_batch(self, texts: List[str]) -> List[int]:
		return super().count_tokens_batch(texts)

	def similarity(self, u: ndarray, v: ndarray) -> float:
		numerator = matmul(u, v.T) #i do not understand why it needs .T
		denominator = norm(u) * norm(v)
		return numerator / denominator
@bhavnicksm bhavnicksm self-assigned this Dec 14, 2024
@bhavnicksm
Copy link
Collaborator

Hey @ascendant512!

Thanks for opening an issue! 😊

Chonkie had that line as a callable before but the reason why it was changed to the current method is because all tokenizers are technically callable since they implement the __call__ fn in their classes, which causes that conditional to error out.

If you can do one check, that would be great! Could you see by adding the get_tokenizer_or_token_counter() funtion into your class as:

def get_tokenizer_or_token_counter(self): 
    return self.count_tokens

Since this returns a function that counts tokens, it should by-pass the issue.

Please let me know the results!

Thanks! 😊

@ascendant512
Copy link
Author

Hi, thanks for the quick reply!

I haven't use the other tokenizers yet, so I didn't know how they worked. However, I can confirm that your suggestion also didn't work. I get a different error:
ValueError: Tokenizer backend <class 'method'> not supported
I think the underlying issue is the same, relating to inspect.isfunction only returning true for a function that's not bound to a class. This does work as a substitute for callable, with or without with your suggested edit:
inspect.ismethod(tokenizer_or_token_counter)

@shreyashnigam shreyashnigam added the bug Something isn't working label Dec 16, 2024
@bhavnicksm
Copy link
Collaborator

Hey @ascendant512!

Sorry for the delay and thanks for trying it out; ismethod also sounds like a good option. I'll try evaluating the options we have and get a fix pushed ASAP.

Thanks~ 😊

bhavnicksm added a commit that referenced this issue Jan 2, 2025
[FIX]  Support class methods as `token_counter` objects for `CustomEmbeddings` (#92)
bhavnicksm added a commit that referenced this issue Jan 2, 2025
[Fix] Add fix for #92: Support `class.method` as a Tokenizer for `CustomEmbedding` +. minor changes
@bhavnicksm
Copy link
Collaborator

Hey @ascendant512!

The patch has been merged into main and can be tested with an install from source. It would also be available from the next release onwards in Chonkie. Hopefully, it would resolve the errors you were seeing!

Closing issue for now; please re-open if you are still seeing the issue~

Thanks! 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants