Plateau: searching for a part of camelCased word does not yield results #23

Open
ReesePlews opened this issue Dec 10, 2024 · 3 comments
Labels
plateau MLIT Plateau Project

Comments

@ReesePlews

Search works much better in the dev branch than before, but I still have some questions about the search rules; the results are not always what one would expect.

input of "c" returns
image

but input of "park" or "_park" returns
image

"*park" returns
image

How can a search for "park" be made to return better results?

@ReesePlews ReesePlews added the plateau MLIT Plateau Project label Dec 10, 2024
@strogonoff
Contributor

Good point. It looks like the tokenizer should be improved to split camelCased words into distinct tokens (both in the index and in the query). cc @ronaldtse this is a search problem that could probably be delegated.
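
To make the idea concrete, here is a minimal sketch of such a split, written in Python purely for illustration. The names `split_camel_case` and `tokenize` are hypothetical and are not taken from the project's actual search code; the sketch assumes the same function would be applied to both indexed text and queries.

```python
import re
from typing import List

# Hypothetical helper, not the project's real tokenizer.
# Matches the empty position between a lowercase letter or digit and an
# uppercase letter ("sP" in "BusinessPark"), or between an acronym and a
# following capitalized word ("LP" in "XMLParser").
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def split_camel_case(word: str) -> List[str]:
    """Split a camelCased word at case boundaries and lowercase the parts."""
    return [part.lower() for part in _CAMEL_BOUNDARY.split(word) if part]


def tokenize(text: str) -> List[str]:
    """Emit each whole word plus its camelCase parts, all lowercased."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9]+", text):
        tokens.append(word.lower())  # keep the whole word for exact searches
        parts = split_camel_case(word)
        if len(parts) > 1:
            tokens.extend(parts)
    return tokens
```

With this sketch, `tokenize("DistributionBusinessPark")` includes `"park"` alongside the full lowercased word, so a query tokenized the same way as `"park"` would share a token with the indexed text.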

@strogonoff strogonoff added the bug Something isn't working label Dec 10, 2024
@strogonoff
Contributor

strogonoff commented Dec 10, 2024

> input of "c" returns

I am not sure that example shows a problem, @ReesePlews.

Regarding “park”, that is a problem, and I will rename the issue to reflect it. The cause is that “park” only occurs as part of a longer word (e.g., “DistributionBusinessPark”) and never appears in the document as a standalone word. So either the entire word needs to be searched for or, as you have noticed, a wildcard must be used for a partial match.
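
The behaviour described above can be reproduced with a toy whole-word index. This is only a sketch under assumptions: `build_index`, `search`, and the document ID `doc-1` are made up for illustration, and the real search engine may differ in how it tokenizes and how it handles wildcards.

```python
import fnmatch
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased whole word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Exact token lookup; a '*' in the query switches to wildcard matching."""
    query = query.lower()
    if "*" in query:
        hits = set()
        for token, doc_ids in index.items():
            if fnmatch.fnmatchcase(token, query):
                hits |= doc_ids
        return hits
    return set(index.get(query, set()))


# Hypothetical one-document corpus; "park" never occurs as a standalone word.
docs = {"doc-1": "DistributionBusinessPark is one of the listed values"}
index = build_index(docs)

print(search(index, "park"))   # set(): no token equals "park"
print(search(index, "*park"))  # {'doc-1'}: the wildcard matches the long token
```

Because the index only contains whole words, the plain query finds nothing while the wildcard query matches the token ending in "park".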

We can address that by splitting camel-cased words into tokens, and/or by adding help text about wildcard support.

This is not very typical for an English document (where “park” would be used numerous times, and therefore be found), but we should support this case.

@strogonoff strogonoff changed the title from "plateau: [dev branch] various issues with search" to "Plateau: searching for a part of camelCased word does not yield results" Dec 10, 2024
@strogonoff
Contributor

strogonoff commented Dec 10, 2024

@ReesePlews “c” is found in clause 4.25.4.6.9 because “C” does appear as a standalone word in the table in that clause. Elsewhere it appears as part of words that contain numbers, and I assume the tokenizer splits a word into multiple tokens when it encounters a number in it (e.g., in 6.3.1 it appears in “C01”).
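
To illustrate that assumption (it has not been checked against the actual tokenizer), a split at letter/digit boundaries could look like the sketch below. The function names `split_alnum` and `matches` are hypothetical.

```python
import re


def split_alnum(word):
    """Split a word into runs of letters and runs of digits, lowercased."""
    return [run.lower() for run in re.findall(r"[A-Za-z]+|[0-9]+", word)]


def matches(query, word):
    """True if the lowercased query equals one of the word's tokens."""
    return query.lower() in split_alnum(word)


print(split_alnum("C01"))   # ['c', '01']
print(matches("c", "C01"))  # True: "c" is a token of "C01" after the split
print(matches("c", "C"))    # True: standalone "C" is its own token
```

Under this assumption, a query for "c" would match both the standalone "C" and codes such as "C01", which would explain the results in the first screenshot.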

@strogonoff strogonoff removed the bug Something isn't working label Dec 10, 2024