
[BUG] Am I doing something wrong here? Possible issue with start and end index? #116

Closed
KamarulAdha opened this issue Dec 31, 2024 · 15 comments
Assignees
Labels
bug Something isn't working good first issue Good for newcomers in progress Actively looking into the issue

Comments

@KamarulAdha

Describe the bug
Possible problem with TokenChunker and WordChunker. The actual text looks fine, but the start and end indexes look fishy. The first chunk starts at index 0, but the second chunk starts at index -1?

To Reproduce
https://github.com/KamarulAdha/chonkie-trial-0/blob/main/first.ipynb
For TokenChunker and WordChunker, when I print the results, some of the chunks have a starting index of -1.

Expected behavior
Shouldn't the starting index keep increasing rather than dropping to -1? Also, the end index of the previous chunk should fall between the start_index and the end_index of the next chunk.

Screenshots
(attached screenshot: WhatsApp Image 2024-12-31 at 13 21 13)

Thanks~

@KamarulAdha KamarulAdha added the bug Something isn't working label Dec 31, 2024
@bhavnicksm
Collaborator

bhavnicksm commented Dec 31, 2024

Hey @KamarulAdha!

Thanks for opening an issue 😄

Yes, that's definitely odd behaviour. It should not act like this; thanks for pointing it out—I will try to add a patch for this soon.

Also, the notebook is really helpful! (^-^)

Thanks 😊

@bhavnicksm bhavnicksm added the good first issue Good for newcomers label Dec 31, 2024
@bhavnicksm
Collaborator

Opening up as a good first issue, as it seems like a simple issue~ If someone wishes to make a PR, I'd be happy to guide them!

@KamarulAdha
Author

Hola @bhavnicksm

Thanks for checking this issue out. I'm a bit more curious now to be honest. Do you mind pointing me in the right direction to solve this problem? Many thanks~

@bhavnicksm
Collaborator

bhavnicksm commented Dec 31, 2024

Hey @KamarulAdha,

This function is probably where the issue exists.

We use decoded_text to denote the entire text after a round trip through the tokenizer (i.e. tokenizer.decode(tokenizer.encode(text))), which sometimes does not match the original text or the concatenation of the chunk strings, though for an ideal tokenizer it should. Because of this difference, we have to search within the decoded_text, which probably doesn't match the chunk text.

We use Python's str.find() to get the start_index, which returns -1 only when it cannot find the text.
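For reference, a tiny illustration (plain Python, not the chunker code) of the two ways find() can end up returning -1 here:

decoded_text = "the quick brown fox jumps over the lazy dog"

# Normal case: the substring is found and find() returns its character offset.
print(decoded_text.find("brown fox"))      # 10

# If the chunk text doesn't appear verbatim in decoded_text, find() gives -1.
print(decoded_text.find("brown  fox"))     # -1 (extra space, no exact match)

# find() also gives -1 if the search starts past the substring's position.
print(decoded_text.find("brown fox", 20))  # -1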

I'd suggest recreating the function in an ipynb notebook with the same text, comparing the decoded_text with the concatenation of the chunk texts, and checking whether the chunk texts exist in the decoded_text. That would confirm this hypothesis.

If it's not this, then I would need to think about other potential areas where the problem could be, but I am about 90% certain it's this.

Please let me know if this makes sense or if you have any questions!

Thanks 😊

@KamarulAdha
Author

Hola @bhavnicksm

I've tried to figure out the problem and came up with a temporary solution for now.

For the chunk function, I added token_groups as an argument at the end:
chunks = self._create_chunks(chunk_texts, token_counts, decoded_text, self.chunk_overlap, token_groups)

Then, in the _create_chunks function, I added these:
for chunk_text, token_count, token_group in zip(chunk_texts, token_counts, token_groups):
and
end_index = start_index + len(token_group)

Although this is not ideal, it should prevent the starting index from becoming -1. In the previous implementation, the current_index would end up too far ahead for the next token group, so the string couldn't be found and -1 was returned. The easiest but least optimal fix was to reset the index back to 0, but that would search from the very beginning every time.

My temp fix is to set the current index just before the starting index of the next token group. The ideal scenario is to always get the exact starting index, but I think that would require rewriting the chunk and _create_chunks functions.
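Roughly, the idea looks like this (a simplified sketch, not the actual chonkie code; I'm appending plain tuples instead of Chunk objects for brevity):

def _create_chunks(self, chunk_texts, token_counts, decoded_text, chunk_overlap, token_groups):
    chunks = []
    current_index = 0
    for chunk_text, token_count, token_group in zip(chunk_texts, token_counts, token_groups):
        start_index = decoded_text.find(chunk_text, current_index)
        end_index = start_index + len(chunk_text)
        chunks.append((chunk_text, start_index, end_index, token_count))
        # Advance the cursor by the number of tokens rather than the number of
        # characters: a token usually spans more than one character, so this
        # lands at or before the next chunk's real start and find() still works.
        current_index = start_index + len(token_group)
    return chunks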

For the ideal implementation, I was thinking of capturing the indexes immediately within the chunk function.

Do correct me where I'm wrong. Thanks!

@Udayk02
Contributor

Udayk02 commented Jan 2, 2025

@bhavnicksm I think the major bug here resides in the _create_chunks function. You are assigning end_index to the current_index variable, which is incorrect because there is chunk_overlap as well. When finding the next chunk_text in the decoded_text, you use current_index as the search start. This obviously leads to a problem: current_index has to be set so that it also takes the overlapped tokens into account.

@bhavnicksm
Collaborator

bhavnicksm commented Jan 2, 2025

Hey @KamarulAdha and @Udayk02,

You are correct! The issue lies in the improper handling of the overlap, along with the indexing missing the starting point. Eventually most chunkers will have their overlap removed, and we'll default to a refinery to handle it (see OverlapRefinery).

But for now, the suggestion by @KamarulAdha makes sense for estimating where to search from, possibly by doing current_index -= chunk_overlap * MAX_CHARS_PER_TOKEN. MAX_CHARS_PER_TOKEN is an estimate from the tokenizer vocabulary, which usually comes out to 6 or 6.5 (based on GPT2/LLaMa3 vocabs).
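Something like this (just a sketch of the estimate, untested):

MAX_CHARS_PER_TOKEN = 6  # rough estimate based on typical tokenizer vocabularies

# Instead of searching from the previous end_index, back off by an estimate of
# how many characters the overlapping tokens could occupy, then search forward.
current_index = max(0, end_index - chunk_overlap * MAX_CHARS_PER_TOKEN)
start_index = decoded_text.find(chunk_text, current_index)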

Feel free to open a PR for this~

Thanks!

@bhavnicksm bhavnicksm added the in progress Actively looking into the issue label Jan 2, 2025
@Udayk02
Contributor

Udayk02 commented Jan 2, 2025

But that will still give -1 at certain times, correct? When the estimate doesn't match the actual? I actually modified the function to remove the find; it is almost as efficient as the previous method. I am raising a PR now. Kindly check it. @KamarulAdha @bhavnicksm

Udayk02 added a commit to Udayk02/chonkie that referenced this issue Jan 2, 2025
@KamarulAdha
Author

The current implementation uses the end index as the reference (current index) for the next iteration. This causes a problem since sometimes it won't be able to find the correct index, hence giving us -1.

An alternate solution is to set the current index using the start index plus the length of the token group. I've tested this; it won't get you to the exact token location, but it's close enough to be in range for searching. (Haven't tried with other text; I only played around with the chunk size and overlap.)

Another solution is to always set the current index to zero, thus searching from the very beginning. Unsure how this will affect the computation, though.

And @Udayk02 is suggesting removing the find entirely. But will we still be able to get the start and end indexes?

Sorry, trying to understand how different approaches would work for this. 😅

@Udayk02
Contributor

Udayk02 commented Jan 2, 2025

Hi @KamarulAdha, no problem. I directly decoded the overlapping tokens each time and subtracted the space they occupy from the end_index. That directly gives you the start_index for the next chunk. Please feel free to correct me if I am wrong. I tested it and it is working fine. In terms of efficiency, it is on par with the fixed older method (I fixed the older method with what you suggested and tested it as well).

Kindly check here for the commit.

On top of this, what @bhavnicksm suggested is to use an estimated space to subtract from the end_index each time, instead of the number of tokens in the token_group (which is not much use here, as we are dealing with characters instead of tokens).
Generally, for GPT-3 or llama, the estimated number of characters per token is between 6 and 7. So we would use end_index - chunk_overlap * 6 as the next current_index. But this will fail in cases where the estimate does not match the actual token space.
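In other words, something along these lines (a simplified sketch, assuming a tokenizer object with a decode method; the actual commit differs in details):

current_index = 0
for chunk_text, token_group in zip(chunk_texts, token_groups):
    start_index = current_index
    end_index = start_index + len(chunk_text)

    # Decode only the overlapping tail of this token group to measure how many
    # characters the overlap occupies in the text.
    overlap_tokens = token_group[-chunk_overlap:] if chunk_overlap > 0 else []
    overlap_text = tokenizer.decode(overlap_tokens) if overlap_tokens else ""

    # The next chunk starts where this chunk's overlap region begins,
    # so no find() call is needed at all.
    current_index = end_index - len(overlap_text)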

Udayk02 added a commit to Udayk02/chonkie that referenced this issue Jan 3, 2025
- removed the unnecessary `join` as there is only one token_group.
- replaced `_decode_batch` with `_decode`
Udayk02 added a commit to Udayk02/chonkie that referenced this issue Jan 3, 2025
- `start_index` remains 0 when `chunk_overlap` is 0, fixed it.
Udayk02 added a commit to Udayk02/chonkie that referenced this issue Jan 4, 2025
- applies only when chunk_overlap > 0
- batch decoding for overlap texts
bhavnicksm added a commit that referenced this issue Jan 4, 2025
[FIX] #116: Incorrect `start_index` when `chunk_overlap` is not 0
bhavnicksm added a commit that referenced this issue Jan 4, 2025
[FIX] `start_index` incorrect when `chunk_overlap` is not 0 (#116)
@bhavnicksm
Collaborator

Hey @KamarulAdha!

The patch has been merged and can be used with a source install. It will also be available from the next release onwards.

Closing issue for now, please re-open the issue if you are still facing the problem~

Thanks! 😊

@Udayk02
Contributor

Udayk02 commented Jan 5, 2025

Hey @bhavnicksm

I see that you have changed the logic as below to calculate the overlap_lengths:

overlap_texts = self._decode_batch([token_group[-self.chunk_overlap:]
                                    if (len(token_group) > self.chunk_overlap)
                                    else token_group
                                    for token_group in token_groups])
overlap_lengths = [len(overlap_text) for overlap_text in overlap_texts]

But this is wrong. In my code, I took overlap_length as:

 overlap_length = self.chunk_overlap - (self.chunk_size - len(token_group))

This is required because of the following.
For example:
chunk_size and chunk_overlap are 128 and 64 respectively. Let us say the number of tokens in the second-to-last chunk is 116; it came out that way because there are not enough tokens remaining to give a chunk of size 128.

Now, the remaining tokens for the last chunk in this scenario are only 52, because the way we are creating chunks uses this logic:

for start_index in range(0, len(text_tokens), self.chunk_size - self.chunk_overlap)]

This implies that we always step forward 128 - 64 = 64 positions, which leaves only 52 tokens for the last chunk. It further implies that we need to step back 52 positions from the end of the second-to-last chunk to get the start_index for the last chunk.

Now, if we take end_index - overlap_length as the start_index for the last chunk, where overlap_length comes out as 64 based on your logic, that leaves the current_index 12 tokens behind the actual starting point of the last chunk.

This will give wrong start_index and end_index for the last chunk.

So, we need to subtract self.chunk_size - len(token_group) from chunk_overlap to get the exact starting position of the last chunk.
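Plugging in the numbers from the example above, just to check the arithmetic (token counts, not characters):

chunk_size, chunk_overlap = 128, 64
len_token_group = 116  # tokens in the second-to-last chunk

overlap_from_current_logic = chunk_overlap                        # 64
overlap_needed = chunk_overlap - (chunk_size - len_token_group)   # 52

print(overlap_from_current_logic - overlap_needed)  # 12 tokens too far back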

@bhavnicksm
Collaborator

Hey @Udayk02!

You're right about the index for the last chunk being off; thanks for bringing this up!

I just noticed that when I do this,

chunker = TokenChunker(chunk_size=128, chunk_overlap=0)
chunks = chunker(story)
[c.token_count for c in chunks]

# output:
# [128, 128, 128, 128, 116]

But when I put in the numbers you suggested, we get:

chunker = TokenChunker(chunk_size=128, chunk_overlap=64)
chunks = chunker(story)
[c.token_count for c in chunks]

# Output:
# [128, 128, 128, 128, 128, 128, 128, 128, 116, 52]

Ideally, we don't want the last chunk that has 52 tokens, since that is created entirely from overlap. Overlap should stop at the chunk containing 116 tokens.

In fact, if we run the following piece of code, we see that the last chunk is entirely contained in the second last one.

print(chunks[-1].text in chunks[-2].text)

# Output
# True

This is an anti-pattern for the TokenChunker. I believe we need to make sure that even when chunk_overlap is set pretty high, the last chunk is not created entirely and redundantly from overlap. While the other chunks each carry a unique combination of information, the last chunk here is entirely contained in the second-to-last chunk.

Upon running a few experiments, this doesn't happen with chunk_overlap=57 but does with chunk_overlap=56, where the last chunk ends up entirely redundant due to overlap.
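For reference, the chunk sizes above can be reproduced with a quick back-of-the-envelope sketch (assuming the story is 628 tokens long, as implied by the chunk_overlap=0 output):

def expected_token_counts(total_tokens, chunk_size, chunk_overlap):
    # Mirrors the stepping logic quoted earlier: starts advance by
    # chunk_size - chunk_overlap, and each chunk takes up to chunk_size tokens.
    step = chunk_size - chunk_overlap
    return [min(chunk_size, total_tokens - start) for start in range(0, total_tokens, step)]

print(expected_token_counts(628, 128, 0))   # [128, 128, 128, 128, 116]
print(expected_token_counts(628, 128, 64))  # [128, ..., 128, 116, 52] -> last chunk is pure overlap
print(expected_token_counts(628, 128, 57))  # last chunk has 60 tokens, not fully redundant
print(expected_token_counts(628, 128, 56))  # last chunk has 52 tokens, fully inside the previous one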

I will add a patch fix to make sure this redundant last chunk doesn't get created.

Thanks again for pointing this out! 🚀😊

@Udayk02
Contributor

Udayk02 commented Jan 5, 2025

Yes, you are absolutely correct; that makes total sense.
The last chunk doesn't carry any significance here. I pointed this out only because of the way we are creating token_groups currently. If we fix that, there won't be any issue on the overlap side.

@bhavnicksm
Collaborator

Hey @Udayk02! Added a patch to main; have a look~
Thanks
