Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: optimize unoptimizable token types #2072

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

msujew
Copy link
Collaborator

@msujew msujew commented Oct 30, 2024

Currently, when the lexer cannot apply the "first character" optimization to any token type, it will use an un-optimized algorithm for all token types. This behavior can also be seen when using the safeMode: true property on lexers. While this definitely makes sense in general, this leads to major performance regressions (possibly in the order of magnitudes) in case the token type cannot be equipped with the start_chars_hint field (like for unicode regexp).

This PR applies a major optimization in case the "first character" optimization fails: When encountering a non-optimizable token type the lexer will now attempt to match this token type on any tokenization attempt in addition to the optimized token types. More generally, the lexer will attempt all non-optimizable token types together with the optimized token types in the order in which they appear (this is important as we likely get unexpected behavior otherwise!).

If there is no optimized token type for the character at the current position, the lexer will now attempt to match all non-optimizable token types, only yielding a lexer error in case non of them match.

This new behavior mirrors the current behavior perfectly, and none of the existing tests can actually determine that something has changed in the lexer behavior. This change is a pure performance improvement for non-optimizable token types. Of course I added new tests to assert that the new change behaves as expected. Furthermore, using ensureOptimizations: true when creating a lexer still throws an error when attempting to use non-optimizable tokens, as it's not a perfect optimization (the best optimization is still to use the start_chars_hint for those tokens).

I haven't written a benchmark for Chevrotain yet to tell whether this yields the expected performance improvements for non-optimizable tokens, but I can already tell that it does not decrease performance for the normal use case (I'm not sure why the performance for JSON increases, I believe the first test that I run is always a bit slower, even when running everything multiple times):

image

Note that currently lexer.ts does not have a 100% test coverage. My new tests cover the newly added/changed lines, but do not increase the test coverage to 100%.

@msujew msujew requested a review from bd82 October 30, 2024 13:46
@bd82
Copy link
Member

bd82 commented Nov 4, 2024

Thanks @msujew I will look at this soon (hopefully on the weekend...)

@msujew
Copy link
Collaborator Author

msujew commented Jan 13, 2025

@bd82 You think you will have time to look at this soon? We're currently building a language server for PL/I which features configurable logical or and not tokens. Since this requires function based matching with unknown start chars (user supplied via config), this currently makes the lexer unbearably slow. This change would drastically improve lexing performance :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants