Normalizer for russian #296

Open: 3 commits to be merged into base: main

aignatovich

Pull Request

  • Normalizer for russian

Related issue

  • No related issue.

Why could this change be helpful?

  • In written Russian it is permissible to use "е" in place of its diacritical variant "ё" (e.g. "ёжик" -> "ежик").

  • Below is the current search behavior with the latest version of Meilisearch available to date; it is questionable.

    • Case 1: Search Query: "Ёж", Indexed: ["Ежик", "Ёжик"], Result: "Ёжик", Expected: Both
    • Case 2: Search Query: "Еж", Indexed: ["Ежик", "Ёжик"], Result: "Ежик", Expected: Both
    • Case 3: Search Query: "ёж", Indexed: ["Ежик", "Ёжик"], Result: "Ежик", Expected: Both, or at least "Ёжик". This one seems to be incorrect.

If my assumptions are correct, this change may impact some of the cases above, though it has to be validated.

What does this PR do?

  • Performs a grammatically permissible normalization of "ё" into "е" for the Russian language, given that compatibility decomposition already replaces the 1-codepoint form of "ё" with its 2-codepoint form (base letter plus combining diaeresis).
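As a rough illustration of the normalization described above, here is a minimal std-only sketch. It is not the code from this PR: the function name `fold_io` and the constants are hypothetical, and it assumes the decomposed form of "ё" is the base letter U+0435 followed by the combining diaeresis U+0308.

```rust
// Code points are written as escapes so Cyrillic and Latin look-alikes cannot be confused.
const CYRILLIC_SMALL_IE: char = '\u{0435}'; // е
const CYRILLIC_CAPITAL_IE: char = '\u{0415}'; // Е
const CYRILLIC_SMALL_IO: char = '\u{0451}'; // ё
const CYRILLIC_CAPITAL_IO: char = '\u{0401}'; // Ё
const COMBINING_DIAERESIS: char = '\u{0308}';

/// Hypothetical helper: folds "ё" into "е" for both the precomposed form
/// and the decomposed form (base letter followed by U+0308).
pub fn fold_io(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut prev: Option<char> = None;
    for c in input.chars() {
        match c {
            // Precomposed forms map directly to the plain letter.
            CYRILLIC_SMALL_IO => out.push(CYRILLIC_SMALL_IE),
            CYRILLIC_CAPITAL_IO => out.push(CYRILLIC_CAPITAL_IE),
            // Decomposed form: drop the diaeresis only right after е / Е.
            COMBINING_DIAERESIS
                if matches!(prev, Some(CYRILLIC_SMALL_IE) | Some(CYRILLIC_CAPITAL_IE)) => {}
            other => out.push(other),
        }
        prev = Some(c);
    }
    out
}
```

Restricting the diaeresis removal to the position right after е / Е keeps the sketch from touching marks on other letters, which a language-specific normalizer would presumably want.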

PR checklist

Please check if your PR fulfills the following requirements:

  • [ ❓ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • [ 🟢 ] Have you read the contributing guidelines?
  • [ 🟢 ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@curquiza
Member

curquiza commented Jul 4, 2024

@aignatovich thanks for your PR, let us know when your PR is ready for review 😊

I see there is currently a Rustfmt issue (see CI).

@ManyTheFish
Member

Hello @aignatovich,
Thank you for your PR. Your PR seems good to me.
However, I think there is already a normalizer that covers the work of the Russian normalizer: the NonspacingMarkNormalizer. Its goal is to remove all nonspacing marks, including diacritics. The only change missing to activate it is adding Cyrillic to the eligible scripts in its should_normalize function.
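To make the idea of nonspacing-mark removal concrete, here is a small std-only approximation. It is an assumption-laden stand-in, not the NonspacingMarkNormalizer itself: the function name `strip_combining_marks` is hypothetical, and it only covers the Combining Diacritical Marks block (U+0300 to U+036F).

```rust
/// Hypothetical stand-in for nonspacing-mark removal: drops characters from
/// the Combining Diacritical Marks block (U+0300..=U+036F) and nothing else.
pub fn strip_combining_marks(input: &str) -> String {
    input
        .chars()
        .filter(|c| !('\u{0300}'..='\u{036F}').contains(c))
        .collect()
}
```

In this sketch, mark removal only affects text that has already been decomposed: the decomposed sequence U+0435 U+0308 loses its diaeresis, while the precomposed "ё" (U+0451) passes through unchanged.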

Let me know if this modification works.

See you!

@aignatovich
Author

Hi @ManyTheFish,

The proposed solution (adding Cyrillic to the predicate in NonspacingMarkNormalizer) did not achieve the expected result. The solution this PR proposes does not change search behavior as expected either.

Conclusions from debugging the Case 2 scenario above: during indexing, "Ё" is normalized to "e" (yes, that is Latin), but this does not happen for the input search term; there, "Е" (Cyrillic) is normalized to "е" (Cyrillic).
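The mismatch is easy to miss because the two letters render identically. A minimal sketch of why the tokens cannot match, assuming tokens are compared code point by code point (the helper names `code_point` and `is_prefix_match` are hypothetical):

```rust
/// Formats a character as its Unicode code point, e.g. "U+0435".
pub fn code_point(c: char) -> String {
    format!("U+{:04X}", c as u32)
}

/// Plain prefix comparison on code points, standing in for how a normalized
/// query token would be matched against a normalized indexed token.
pub fn is_prefix_match(query: &str, indexed: &str) -> bool {
    indexed.starts_with(query)
}
```

With these helpers, Cyrillic "е" reports U+0435 and Latin "e" reports U+0065, so an indexed token that starts with the Latin letter is not a prefix match for a query that starts with the Cyrillic one.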

I will update as soon as I know more.

ManyTheFish
ManyTheFish previously approved these changes Aug 28, 2024
Member

@ManyTheFish ManyTheFish left a comment

Hello @aignatovich,

thank you for your contribution,

let's merge this!

bors merge

Contributor

meili-bors bot commented Aug 28, 2024

Merge conflict.

@ManyTheFish
Member

bors merge

Member

@ManyTheFish ManyTheFish left a comment

bors merge

Contributor

meili-bors bot commented Aug 28, 2024

Already running a review

meili-bors bot added a commit that referenced this pull request Aug 28, 2024
296: Normalizer for russian r=ManyTheFish a=aignatovich

Co-authored-by: Arty I <[email protected]>
Co-authored-by: Many the fish <[email protected]>
Contributor

meili-bors bot commented Aug 28, 2024

This PR was included in a batch that successfully built, but then failed to merge into main. It will not be retried.

Additional information:

{"message":"Changes must be made through a pull request.","documentation_url":"https://docs.github.com/articles/about-protected-branches","status":"422"}

@aignatovich
Author

Hi @ManyTheFish,

I still have work to do in this PR to solve the issue described.

Would you be able to indicate what behavior is preferred in each of those scenarios?

Conclusions from debugging the Case 2 scenario above: during indexing, "Ё" is normalized to "e" (yes, that is Latin), but this does not happen for the input search term; there, "Е" (Cyrillic) is normalized to "е" (Cyrillic).

This inconsistency between normalization of a query and the document, to the best of my understanding, is the cause of the issue.

- Should normalization of Cyrillic -> Latin take place for the input query?
OR
- Should normalization of "Ё" (Cyrillic) -> "e" (Latin) not take place during document indexing?

@aignatovich aignatovich marked this pull request as ready for review August 30, 2024 16:39
@ManyTheFish
Member

Hello @aignatovich,

I don't understand why the suggestion from my earlier comment doesn't work:

> Thank you for your PR. Your PR seems good to me.
> However, I think there is already a normalizer that covers the work of the Russian normalizer: the NonspacingMarkNormalizer. Its goal is to remove all nonspacing marks, including diacritics. The only change missing to activate it is adding Cyrillic to the eligible scripts in its should_normalize function.

Re-reading everything, I understand that a normalizer that converts Cyrillic characters resembling Latin ones into those Latin characters should work if it is placed after the Lowercase normalizer.

Something like:

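use std::collections::HashMap;

use once_cell::sync::Lazy;

// Import paths below are assumptions: they presume this file sits next to the other normalizers.
use super::{CharNormalizer, CharOrStr};
use crate::{Script, Token};

// Cyrillic characters that look like a Latin character, mapped to that Latin character.
static SPOOFING_VARIANTS: Lazy<HashMap<char, char>> = Lazy::new(|| {
    [
        ('\u{0435}', 'e'), // Cyrillic "е" (U+0435) -> Latin "e" (U+0065)
    ]
    .into_iter()
    .collect()
});

pub struct CyrillicVariantsNormalizer;

impl CharNormalizer for CyrillicVariantsNormalizer {
    fn normalize_char(&self, c: char) -> Option<CharOrStr> {
        // Replace known look-alikes; pass every other character through unchanged.
        let normalized = SPOOFING_VARIANTS.get(&c).copied().unwrap_or(c);
        Some(normalized.into())
    }

    fn should_normalize(&self, token: &Token) -> bool {
        // Only Cyrillic tokens that actually contain a mapped character need work.
        token.script == Script::Cyrillic
            && token.lemma.chars().any(|c| SPOOFING_VARIANTS.contains_key(&c))
    }
}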

> Conclusions from debugging the Case 2 scenario above: during indexing, "Ё" is normalized to "e" (yes, that is Latin), but this does not happen for the input search term; there, "Е" (Cyrillic) is normalized to "е" (Cyrillic).

Converting into Latin sounds good to me.
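For reference, the effect of that ordering can be sketched without the tokenizer. The following is a std-only approximation under stated assumptions, not the library's pipeline: `normalize` is a hypothetical function, dropping U+0308 stands in for nonspacing-mark removal, and "ё" is folded directly rather than through a separate decomposition step.

```rust
/// Hypothetical pipeline applied identically to documents and queries:
/// lowercase, drop the combining diaeresis, then map Cyrillic е / ё to Latin e.
pub fn normalize(input: &str) -> String {
    input
        .chars()
        // Lowercase first, so only the lowercase look-alike needs a mapping.
        .flat_map(char::to_lowercase)
        // Stand-in for nonspacing-mark removal on decomposed input.
        .filter(|&c| c != '\u{0308}')
        .map(|c| match c {
            // Cyrillic "ё" (U+0451) or "е" (U+0435) -> Latin "e" (U+0065).
            '\u{0451}' | '\u{0435}' => 'e',
            other => other,
        })
        .collect()
}
```

Because the same function runs on both sides, "Ёжик" and "Ежик" normalize to the same token, and the queries from Cases 1 to 3 all become a prefix of it.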
