Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce a more flexible overlapping iterator API #102

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aneubeck
Copy link

Please let me know if you would be willing to accept such a change.

The existing overlapping iterator implementation could be simplified with this API as well.

@vbkaisetsu
Copy link
Member

@aneubeck
Thank you for sending a pull request.

We have considered it, but find it unnecessary to implement this feature.
This pull request provides a function that allows a PMA to consume specified characters one by one, but this can be done by using find_overlapping_iter_from_iter() and implementing an iterator that allows flexible control over the characters given to the PMA, as follows:

impl Iterator for MyIterator {
    fn next(&mut self) -> Option<u8> {
        let c = ...;
        return c;
    }
}

let it = MyIterator { ... };

for m in pma.find_overlapping_iter_from_iter(it) {

}

@aneubeck
Copy link
Author

aneubeck commented Oct 8, 2024

Thanks for taking a look!
It turns out that for another use case, we also needed access to the actual state id...

Since we have another crate (https://github.com/github/rust-gems/tree/main/crates/bpe) which requires this functionality, we are publishing a fork to crates.io so that we can depend on it as a work-around: https://github.com/aneubeck/daachorse/pull/1/files.

Let us know if you have any concerns or if you have a proposal how we could get our use case implemented on top of main.

@vbkaisetsu
Copy link
Member

vbkaisetsu commented Oct 12, 2024

@aneubeck How about the following implementation?

use std::cell::RefCell;

use daachorse::DoubleArrayAhoCorasick;

struct RefCellIter<'a>(&'a RefCell<Option<u8>>);

impl Iterator for RefCellIter<'_> {
    type Item = u8;
    fn next(&mut self) -> Option<Self::Item> {
        self.0.borrow_mut().take()
    }
}

fn main() {
    let next_char = RefCell::new(None);
    let pma = DoubleArrayAhoCorasick::<u32>::new(&["a", "abcd", "ab", "bc"]).unwrap();
    let mut it = pma.find_overlapping_iter_from_iter(
        b"ab".iter().cloned().chain(RefCellIter(&next_char)),
    );
    println!("iterates 'ab'");
    for m in &mut it {
        println!("{m:?}");
    }
    println!("consumes 'c'");
    next_char.borrow_mut().replace(b'c');
    for m in &mut it {
        println!("{m:?}");
    }
    println!("consumes 'd'");
    next_char.borrow_mut().replace(b'd');
    for m in &mut it {
        println!("{m:?}");
    }
}
iterates 'ab'
Match { length: 1, end: 1, value: 0 }
Match { length: 2, end: 2, value: 2 }
consumes 'c'
Match { length: 2, end: 3, value: 3 }
consumes 'd'
Match { length: 4, end: 4, value: 1 }

@aneubeck
Copy link
Author

Thanks for the proposal!
Here is the actual use case: https://github.com/github/rust-gems/blob/main/crates/bpe/src/appendable_encoder.rs#L39

The challenge is that we want to be able to go back to an earlier snapshot of the processing which means we need to somehow recover also now the state of the aho-corasick iterator.
While that would be solveable if the current iterator would be cloneable, it would become problematic to get your trick working.
Also, it would consume much more space compared to the single u32 that we have to remember now.

I know it's not great to leak such an implementation detail.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants