Add better URL detection in ClickableHook #136
base: master
This would theoretically allow invalid URLs. |
Your current approach isn't better. It's not like there would be a buffer overflow if there's an invalid URL. OK, I'll try to make something better. |
Here, I've fixed it. |
Do you have example URLs that currently don't pass the validation? |
Why don't you want to merge it? FILTER_VALIDATE_URL checks URLs strictly according to RFC 2396; I doubt you can achieve that level of validation with a regex. It's probably also much faster, and it's much easier to maintain and read than the regex you use. This will also fix #129 and #130 (with bracket support). |
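For readers unfamiliar with the filter being discussed, here is a minimal sketch of how `FILTER_VALIDATE_URL` is typically called. The helper name `isValidUrl` is hypothetical and is not part of Decoda or of this PR; only `filter_var()` and the `FILTER_VALIDATE_URL` constant are PHP built-ins.

```php
<?php
// Hypothetical helper for illustration only; not taken from the PR.
function isValidUrl(string $candidate): bool
{
    // filter_var() returns the filtered string on success and false on failure,
    // so compare strictly against false.
    return filter_var($candidate, FILTER_VALIDATE_URL) !== false;
}

var_dump(isValidUrl('http://example.com/path?x=1')); // bool(true)
var_dump(isValidUrl('not a url'));                   // bool(false)
```

Note that this filter expects a scheme, so a bare `example.com` is rejected, and per the PHP manual it only accepts ASCII URLs, so internationalized domain names fail unless they are converted first.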
I have to agree with @Hyleus here, @milesj. This is a much, much better approach. |
My only issue with the implementation is the massive O(N) loop at the beginning, which traverses the entire content char by char. This hurts performance on large posts and/or with multiple Decoda instances on the page. I think a better approach would be to search for |
But isn't |
Yes but native code > user-land code in most situations IMO. |
Running
So you see there's no big difference here, run it multiple times and it'll average the same probably. |
You're testing in a best/perfect case, which is irrelevant to Big-O. Let me put it this way. Your current loop would do a |
Yeah, but we need to search for multiple split chars, so in the end if we determine the split_chars first it would have the same performance, because we would no matter what search for every split char in the string. No, the amount of lookups doesn't change, we would still run it 5 times for the split_chars, because there are 5 split chars. |
True, what about |
I think the regex would still work the same way as my loop internally. Or do you think there would be a difference? |
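To make the trade-off under discussion concrete, here is a rough sketch of the "split in native code, then validate each token" idea. It is not the PR's implementation: the function name `findUrls` and the delimiter set are assumptions, since the five split characters mentioned above are not shown in this thread.

```php
<?php
// Assumed delimiter set (whitespace, square brackets, angle brackets);
// the PR's actual split characters are not visible in this conversation.
const SPLIT_PATTERN = '/[\s\[\]<>]+/';

// Hypothetical helper: returns every token that PHP's validator accepts as a URL.
function findUrls(string $content): array
{
    // preg_split() walks the content once inside the regex engine and
    // returns the non-empty tokens between delimiters.
    $tokens = preg_split(SPLIT_PATTERN, $content, -1, PREG_SPLIT_NO_EMPTY);

    $urls = [];
    foreach ($tokens as $token) {
        if (filter_var($token, FILTER_VALIDATE_URL) !== false) {
            $urls[] = $token;
        }
    }
    return $urls;
}

print_r(findUrls('See http://example.com/a/b?c=d and [b]bold[/b] text'));
```

Either way the content is still scanned in linear time; the difference is only whether the scan runs in the regex engine or in a user-land loop, so any claimed speedup would need to be measured on realistic posts.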
My only issue is that we're looping over the entire content twice now, once for the initial lexer, and again for this hook. That's quite a change. I agree that this is a better way of finding URLs, I'm just worried about the perf changes. Ideally this would hook into the lexer somehow, but that's probably more work than necessary. |
Could this be a feature flag/option to turn on if you want it? Then people could decide on their own if it is worth sacrificing a bit of runtime in favor of more correct parsing? |
Yes, that could be a solution. First, there are no test cases in this PR: the first step is to add a failing test, then make the test pass, then measure the performance impact (#136 (comment)). Second, there may be a better and more elegant way to achieve the same thing, but the most important thing missing here is a way to reproduce the issue. So it needs at least one day of focused work. Both issues mentioned are already fixed. |
This will detect full URLs; currently it seems that only the first part, up to the slash, gets matched by the current regex.