Normalizer for russian #296
@aignatovich thanks for your PR, let us know when it is ready for review 😊 I see there is currently a Rustfmt issue (cf. CI).
Hello @aignatovich, let me know if this modification works. See you!
Hi @ManyTheFish, the solution that was proposed (adding Cyrillic to the predicate in NonspacingMarkNormalizer) did not produce the expected result. The solution this PR proposes does not change search behavior as expected either. Conclusions from debugging the Case 2 scenario in the PR description: during indexing, "Ё" is normalized to "e" (yes, that is the Latin letter), but this does not happen for the input search term; instead, "Е" (Cyrillic) is normalized to "е" (Cyrillic). I will post an update as soon as I know more.
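The mismatch described in this comment can be reproduced with the standard library alone. The sketch below is not charabia code: the function names are hypothetical, and the two tokens are written by hand from the behavior reported above (a Latin "e" on the index side, a lowercased Cyrillic "е" on the query side). Prefix matching stands in for a search hit.

```rust
/// Hypothetical helper: list the code points of `s` as "U+XXXX" strings.
fn code_points(s: &str) -> Vec<String> {
    s.chars().map(|c| format!("U+{:04X}", c as u32)).collect()
}

/// Index side, as reported above: "Ёжик" comes out with a Latin "e" (U+0065).
fn indexed_token() -> String {
    String::from("\u{0065}\u{0436}\u{0438}\u{043A}") // e (Latin) + жик
}

/// Query side, as reported above: "Еж" is only lowercased, so its first
/// letter stays Cyrillic "е" (U+0435).
fn query_token() -> String {
    "\u{0415}\u{0436}".to_lowercase() // Еж -> еж
}

/// Prefix match on normalized tokens, a simple stand-in for a search hit.
fn matches(indexed: &str, query: &str) -> bool {
    indexed.starts_with(query)
}
```

Calling `matches(&indexed_token(), &query_token())` returns `false`: the two first letters render identically but are different code points, which is consistent with Case 2 not returning "Ёжик".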
Merge conflict.
bors merge
bors merge
Already running a review
296: Normalizer for russian r=ManyTheFish a=aignatovich

# Pull Request - Normalizer for russian

## Related issue
- No related issue.

## Why this changes could be helpful?
- In written Russian it is permissible to use "е" in words that contain the diacritical version (e.g. "ёжик" -> "ежик").
- Below is the current search behavior with the latest Meilisearch version available to date; it is questionable.
  - Case 1: Search Query: "Ёж", Indexed: ["Ежик", "Ёжик"], Result: "Ёжик", Expected: Both
  - Case 2: Search Query: "Еж", Indexed: ["Ежик", "Ёжик"], Result: "Ежик", Expected: Both
  - Case 3: Search Query: "ёж", Indexed: ["Ежик", "Ёжик"], Result: "Ежик", Expected: Both, or at least "Ёжик". This one seems to be incorrect.

If my assumptions are correct, this change may impact some of the cases above, though that has to be validated.

## What does this PR do?
- Performs a grammatically permissible normalization of "ё" into "е" for Russian, given that compatibility decomposition already replaces the single-code-point form with the two-code-point form.

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ❓ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ 🟢 ] Have you read the contributing guidelines?
- [ 🟢 ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

Co-authored-by: Arty I <[email protected]>
Co-authored-by: Many the fish <[email protected]>
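The behavior the description asks for can be sketched with the standard library only. This is not the PR's implementation: `fold_yo` and `search` are hypothetical names, and prefix matching is a simplification of how Meilisearch actually matches tokens. The fold accepts both the composed "ё" (U+0451) and the decomposed form ("е" followed by U+0308).

```rust
/// Fold "ё"/"Ё" to "е" and lowercase the rest (sketch, not the PR's code).
fn fold_yo(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    for c in input.chars() {
        match c {
            // Composed forms: "ё" (U+0451) and "Ё" (U+0401).
            '\u{0451}' | '\u{0401}' => out.push('\u{0435}'),
            // Decomposed form: drop the combining diaeresis that follows "е".
            '\u{0308}' if out.ends_with('\u{0435}') => {}
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}

/// Prefix search over folded text; the same fold is applied to the
/// documents and to the query.
fn search<'a>(docs: &[&'a str], query: &str) -> Vec<&'a str> {
    let q = fold_yo(query);
    docs.iter()
        .copied()
        .filter(|d| fold_yo(d).starts_with(q.as_str()))
        .collect()
}
```

With `["Ежик", "Ёжик"]` as documents, the queries "Ёж", "Еж", and "ёж" from Cases 1 to 3 each return both entries, because the fold is applied identically on both sides.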
This PR was included in a batch that successfully built, but then failed to merge into main. It will not be retried. Additional information: {"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches","status":"422"}
Hi @ManyTheFish, I still have work to do in this PR to solve the issue described. Could you indicate which behavior is preferred in each of those scenarios?
To the best of my understanding, this inconsistency between the normalization of the query and that of the document is the cause of the issue.
- Should Cyrillic -> Latin normalization take place for the input query?
Hello @aignatovich, I don't understand why my suggestions in my comments don't work:

Rereading everything, I understand that a normalizer that converts Cyrillic characters that look like Latin ones into those Latin characters should work if put after the … Something like:

```rust
use std::collections::HashMap;
use once_cell::sync::Lazy;

// `CharNormalizer`, `CharOrStr`, `Script` and `Token` are charabia items,
// assumed to be in scope.

/// Cyrillic characters mapped to their Latin look-alikes.
static SPOOFING_VARIANTS: Lazy<HashMap<char, char>> = Lazy::new(|| {
    [
        ('е', 'e'), // Cyrillic U+0435 -> Latin U+0065
    ]
    .into_iter()
    .collect()
});

pub struct CyrillicVariantsNormalizer;

impl CharNormalizer for CyrillicVariantsNormalizer {
    fn normalize_char(&self, c: char) -> Option<CharOrStr> {
        match SPOOFING_VARIANTS.get(&c) {
            // `get` yields `&char`; bind by value before converting.
            Some(&replacement) => Some(replacement.into()),
            None => Some(c.into()),
        }
    }

    fn should_normalize(&self, token: &Token) -> bool {
        token.script == Script::Cyrillic
            && token.lemma.chars().any(|c| SPOOFING_VARIANTS.contains_key(&c))
    }
}
```

Converting into Latin is fine with me.
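Why the position of such a normalizer matters can be shown with a standalone sketch. It assumes, and this is only an assumption, that the earlier step is one that removes the diaeresis from "ё"; `strip_marks` is a hand-written stand-in for that step, not charabia's implementation, and all function names are hypothetical.

```rust
use std::collections::HashMap;

/// Stage 1, a hand-written stand-in for a diacritic-removal step:
/// lowercases, then turns "ё" into a bare Cyrillic "е".
fn strip_marks(s: &str) -> String {
    s.to_lowercase()
        .chars()
        .filter_map(|c| match c {
            '\u{0451}' => Some('\u{0435}'), // composed "ё" -> "е"
            '\u{0308}' => None,             // diaeresis of a decomposed "ё"
            other => Some(other),
        })
        .collect()
}

/// Stage 2: map Cyrillic letters to their Latin look-alikes.
fn map_variants(s: &str) -> String {
    // Same single entry as the table above: Cyrillic "е" (U+0435) -> Latin "e".
    let variants: HashMap<char, char> = HashMap::from([('\u{0435}', 'e')]);
    s.chars()
        .map(|c| variants.get(&c).copied().unwrap_or(c))
        .collect()
}

/// Both stages, with the look-alike mapping running last.
fn normalize(s: &str) -> String {
    map_variants(&strip_marks(s))
}
```

Run on its own, `map_variants` leaves "ё" untouched because only the bare "е" is in the table. Run after `strip_marks`, it sends "Ёж", "Еж", and "ёж" to the same string, provided the same pipeline is applied to both documents and queries.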