[BUG] Am I doing something wrong here? Possible issue with start and end index? #116
Hey @KamarulAdha! Thanks for opening an issue 😄 Yes, that's definitely odd behaviour; it should not act like this. Thanks for pointing it out; I will try to add a patch for this soon. Also, the notebook is really helpful! (^-^) Thanks 😊
Opening this up as a good first issue, as it seems like a simple fix~ If someone wishes to make a PR, I'd be happy to guide them!
Hola @bhavnicksm! Thanks for checking this issue out. I'm a bit more curious now, to be honest. Do you mind pointing me in the right direction to solve this problem? Many thanks~
Hey @KamarulAdha, The `_create_chunks` function is probably where the issue exists. We use Python's `str.find` with a `current_index` offset to locate each chunk's text inside the decoded text, and `find` returns -1 whenever the search offset has already moved past the match. I'd suggest recreating the function in an ipynb notebook with the same text to see the difference between the `find` results and the indexes you expect. If it's not this, then I would need to think about other potential problem areas, but I am about 90% certain it's this. Please let me know if this makes sense or if you have any questions! Thanks 😊
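A minimal, self-contained illustration of that failure mode (toy strings, not Chonkie's actual data):

```python
# str.find(sub, offset) returns -1 when `offset` is already past the match.
text = "the quick brown fox jumps over the lazy dog"
chunk = "brown fox jumps"   # next chunk's text; it starts at index 10

current_index = 25          # offset advanced too far by the previous chunk
print(text.find(chunk, current_index))  # -1: the search starts after the match
print(text.find(chunk))                 # 10: searching from 0 finds it
```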
Hola @bhavnicksm I've tried to figure out the problem and came up with a temporary solution for now. For each token group, I estimate where its text should start. Then, in the `find` call, I begin the search from just before that estimated position instead of from the previous chunk's end index. Although this is not ideal, it should prevent the starting index from coming back as -1. In the previous implementation, `current_index` would be too far ahead for the next token group; hence why it could not find the string and returned -1. The easiest and least optimal solution was to re-index back to 0, but that searches from the very beginning every time. My temp fix is to set the current index just before the starting index of the next token group. The ideal scenario is to always get the exact starting index, but I think that would require rewriting the chunking loop. For the ideal implementation, I was thinking of immediately capturing the indexes while the token groups are being built. Do correct me where I'm wrong. Thanks!
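A rough sketch of one reading of this temporary fix (hypothetical names; the exact rewind rule is an assumption based on the description above and the recap later in the thread):

```python
def locate_chunks(decoded_text, chunk_texts, token_groups):
    """Find (start, end) character spans for each chunk's text."""
    current_index = 0
    spans = []
    for chunk_text, token_group in zip(chunk_texts, token_groups):
        start = decoded_text.find(chunk_text, current_index)
        end = start + len(chunk_text)
        spans.append((start, end))
        # Advance by the token COUNT rather than the character length:
        # a token usually decodes to more than one character, so this
        # estimate tends to land just before the next chunk's true start,
        # keeping find() in range -- though, as noted below, it can still
        # misfire for extreme chunk_size/chunk_overlap combinations.
        current_index = start + len(token_group)
    return spans
```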
@bhavnicksm I think the major bug here resides in the `_create_chunks` function. You are assigning `end_index` to the `current_index` variable, which is incorrect because there is `chunk_overlap` as well. While finding the next `chunk_text` in the `decoded_text`, you are using `current_index` as the search start. This obviously leads to a problem: `current_index` has to be computed so that it also takes the overlapped tokens into account.
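A runnable toy demonstration of exactly this buggy pattern (hypothetical data, not Chonkie's exact code):

```python
decoded_text = "aaa bbb ccc ddd"
chunk_texts = ["aaa bbb ccc", "ccc ddd"]  # "ccc" is the overlapped region

current_index = 0
for chunk_text in chunk_texts:
    # With chunk_overlap > 0, the next chunk starts BEFORE the previous
    # chunk's end_index, so searching from end_index skips past the match.
    start_index = decoded_text.find(chunk_text, current_index)
    end_index = start_index + len(chunk_text)
    print(chunk_text, start_index)   # second chunk prints -1
    current_index = end_index        # BUG: should rewind by the overlap length
```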
Hey @KamarulAdha and @Udayk02, You are correct! The issue lies in the improper handling of the overlap, along with the indexing missing the starting point. Eventually most chunkers will have their overlap removed, and we'll default to a refinery to handle it. But for now, the suggestion by @KamarulAdha makes sense on estimating where to search from, possibly by starting the search a little before the estimated position. Feel free to open a PR for this~ Thanks!
But that will still give a -1 at certain times, correct? When the estimation doesn't match the actual position? I actually modified the function, removing the `find` entirely. It is almost as efficient as the previous method. I am raising a PR now; kindly check it. @KamarulAdha @bhavnicksm
The current implementation uses the end index as the reference (`current_index`) for the next iteration. This causes a problem since sometimes it won't be able to find the correct index, hence giving us -1. An alternate solution is to set the current index using the start index plus the length of the token group. I've tested this, and it won't get you to the exact token location, but it's close enough to be in range for searching. (Haven't tried with other text; only played around with the chunking size and overlap.) Another solution is to always set the current index to zero, thus searching from the very beginning. Unsure how this will affect the computation, though. And @Udayk02 is suggesting removing the `find` entirely. But will we still be able to get the start and end indexes? Sorry, trying to understand how the different approaches would work for this. 😅
Hi @KamarulAdha, no problem. I directly decoded the overlapping tokens each time and subtracted the occupied character length from the chunk's end index, which gives the next chunk's start index without any searching. Kindly check the commit. On top of this, what @bhavnicksm suggested is to use batch decoding for the overlap texts to keep it efficient.
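A sketch of that find()-free approach under Chonkie-like assumptions (`decode` stands in for the tokenizer's decode; identifiers are illustrative, not the merged code):

```python
def create_chunks(token_groups, decode, chunk_overlap):
    """Compute chunk spans directly from decoded lengths; no find(), no -1."""
    chunks = []
    current_index = 0
    for token_group in token_groups:
        chunk_text = decode(token_group)
        start_index = current_index
        end_index = start_index + len(chunk_text)
        chunks.append((start_index, end_index, chunk_text))
        # The next chunk starts before this one ends, by exactly the
        # character length of the tokens shared with the next group.
        # NOTE: a short final group shares fewer tokens -- see the
        # discussion of the last chunk further down the thread.
        overlap_text = decode(token_group[-chunk_overlap:])
        current_index = end_index - len(overlap_text)
    return chunks
```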
- removed the unnecessary `join` as there is only one token_group
- replaced `_decode_batch` with `_decode`
- `start_index` remains 0 when `chunk_overlap` is 0, fixed it.
- applies only when chunk_overlap > 0
- batch decoding for overlap texts
[FIX] #116: Incorrect `start_index` when `chunk_overlap` is not 0
[FIX] `start_index` incorrect when `chunk_overlap` is not 0 (#116)
Hey @KamarulAdha! The patch has been merged and can be used with a source install. It will also be available from the next release onwards. Closing the issue for now; please re-open it if you are still facing the problem~ Thanks! 😊
Hey @bhavnicksm I see that you have changed the logic like below to calculate the overlap lengths:

```python
overlap_texts = self._decode_batch([token_group[-self.chunk_overlap:]
                                    if (len(token_group) > self.chunk_overlap)
                                    else token_group
                                    for token_group in token_groups])
overlap_lengths = [len(overlap_text) for overlap_text in overlap_texts]
```

But this is wrong. In my code, I have taken:

```python
overlap_length = self.chunk_overlap - (self.chunk_size - len(token_group))
```

This is required because of how the token groups are formed:

```python
token_groups = [text_tokens[start_index : start_index + self.chunk_size]
                for start_index in range(0, len(text_tokens), self.chunk_size - self.chunk_overlap)]
```

This implies that we always step forward by `chunk_size - chunk_overlap` tokens, while a group near the end of the text can come out shorter than `chunk_size`. Now, if we consider such a short group, the tokens it actually shares with the next group are only its last `len(token_group) - (self.chunk_size - self.chunk_overlap)` tokens, not its last `chunk_overlap` tokens. Taking `token_group[-self.chunk_overlap:]` will therefore give a wrong `start_index` for the last chunk. So, we need to subtract the shortfall, `self.chunk_size - len(token_group)`, from `self.chunk_overlap`.
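A hypothetical worked example of that arithmetic (the numbers here are chosen for illustration and are not necessarily the ones from the original exchange):

```python
chunk_size, chunk_overlap = 512, 128
step = chunk_size - chunk_overlap              # 384 tokens per stride
n_tokens = 850

starts = list(range(0, n_tokens, step))        # [0, 384, 768]
groups = [(s, min(s + chunk_size, n_tokens)) for s in starts]
print(groups)                                  # [(0, 512), (384, 850), (768, 850)]

# The second-last group holds 850 - 384 = 466 tokens, short of chunk_size.
# Tokens it shares with the last group: 850 - 768 = 82, which matches
# chunk_overlap - (chunk_size - 466) = 128 - 46 = 82 -- not the naive
# token_group[-chunk_overlap:] count of 128.
# Note the last group (768, 850) also lies entirely inside the second-last
# group (384, 850), which is where the next comment picks up.
```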
Hey @Udayk02! You're right about the index for the last chunk being off; thanks for bringing this up! I just noticed it myself when reproducing this, and when I put in the numbers you suggested, the last chunk's `start_index` indeed comes out wrong. Ideally, though, we don't want a last chunk that carries nothing beyond the overlap in the first place: checking the spans shows that such a last chunk is entirely contained in the second-last one. This is an anti-pattern for the chunker. Upon running a few experiments, it doesn't happen in every configuration, but when it does, the last chunk is pure redundancy. I would add a patch fix to make sure the last chunk created due to redundancy doesn't occur. Thanks again for pointing this out! 🚀😊
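A minimal sketch of such a guard (hypothetical; not necessarily the patch that was merged), reusing the `(start, end)` span convention from the example above:

```python
def drop_redundant_last_group(groups):
    """Drop the final (start, end) group if it is fully contained in the
    one before it, since it would yield a chunk with no new tokens."""
    if len(groups) >= 2:
        (prev_start, prev_end), (last_start, last_end) = groups[-2], groups[-1]
        if prev_start <= last_start and last_end <= prev_end:
            return groups[:-1]
    return groups

print(drop_redundant_last_group([(0, 512), (384, 850), (768, 850)]))
# [(0, 512), (384, 850)] -- the contained last group is removed
```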
Yes, you are absolutely correct. That makes total sense.
Hey @Udayk02! Added a patch to main; have a look~
Describe the bug

Possible problem with `TokenChunker` and `WordChunker`. The actual text looks fine, but the start and end indexes look fishy: the first chunk starts at index 0, yet the second chunk starts at index -1?

To Reproduce
https://github.com/KamarulAdha/chonkie-trial-0/blob/main/first.ipynb
For `TokenChunker` and `WordChunker`, when I print the results, some of the chunks have a -1 starting index.

Expected behavior
Shouldn't the starting index keep increasing rather than dropping to -1? Also, the end index of the previous chunk should fall within the range of the next chunk's start_index and end_index.
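Put concretely, these are the invariants being described (a sketch assuming chunk objects with `start_index`/`end_index` fields, as Chonkie's chunks expose):

```python
def check_chunk_indexes(chunks):
    """Assert the expected index invariants for overlapping chunks."""
    for prev, curr in zip(chunks, chunks[1:]):
        assert curr.start_index >= 0                  # never -1
        assert curr.start_index >= prev.start_index   # monotonically increasing
        # With overlap, the previous chunk's end should land inside the
        # current chunk's [start_index, end_index] range.
        assert curr.start_index <= prev.end_index <= curr.end_index
```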
Thanks~