Optimize #id .class selector performance #2254
Hi there, I don't see Do you have a specific page as an example we can review? I won't just remove the function as it is a key goal of jsoup to parse to HTML5 compliance. If it is an actual hotspot that we can improve perf on, I am certainly keen to look at that. As you noted, you can avoid the entire HTML5 tree parsing rules by just using the simpler XML parser. One thought I have had previously for the Parser is to use a different data structure than a simple ArrayList to maintain the stack of open elements. One that would allow an O(1) lookup to see if a given element by name is on the stack, and if so, where. Or at least something better than the current O(n) scan. E.g. perhaps an ancillary HashMap with a name -> list index. On your second question WRT |
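The ancillary-index idea mentioned above could look roughly like the following. This is only a minimal sketch under stated assumptions, not jsoup's actual code: `IndexedNameStack` and its methods are hypothetical names, elements are reduced to their tag names, and only push and pop are handled (a real tree builder also removes elements from the middle of the stack, and scope checks have to stop at boundary elements, neither of which is covered here).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: a stack of open element names plus a name -> positions index. */
final class IndexedNameStack {
    private final List<String> stack = new ArrayList<>();
    // For each name, the stack positions where it currently occurs; most recent is last.
    private final Map<String, Deque<Integer>> positions = new HashMap<>();

    /** Pushes a name and records its stack position in the index. */
    void push(String name) {
        positions.computeIfAbsent(name, k -> new ArrayDeque<>()).addLast(stack.size());
        stack.add(name);
    }

    /** Pops the top name and keeps the index in sync. Assumes the stack is not empty. */
    String pop() {
        String name = stack.remove(stack.size() - 1);
        Deque<Integer> indexes = positions.get(name);
        indexes.removeLast();
        if (indexes.isEmpty()) positions.remove(name);
        return name;
    }

    /** Position of the most recently opened element with this name, or -1 if absent. */
    int lastIndexOf(String name) {
        Deque<Integer> indexes = positions.get(name);
        return indexes == null ? -1 : indexes.peekLast();
    }

    /** Expected O(1) membership check, instead of an O(n) scan of the list. */
    boolean contains(String name) {
        return positions.containsKey(name);
    }

    int size() {
        return stack.size();
    }
}
```

The trade-off is a little extra bookkeeping on every push and pop in exchange for constant-time "is this name on the stack, and where" lookups.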
@jhy The HTML I used is a normal Google SERP; here is the zip file and test code.
This CSS selector is just for testing: 200 queries take about 10s in version 1.13 and 47s in version 1.18. I am currently using jsoup 1.13 with the HTML parser. I want to switch to the XML parser, but there is an XML parser bug in 1.13, so I want to update to 1.18. |
Hmm, that Measurements via JMH operations per second;
Will need to dig deeper to understand what's happening here. Off the top of my head the main changes in the Selector in that period have been:
Both the step change and the new variance is quite unexpected here. Here's a longer run on the current head, just for the
@saselovejulie some Qs:
(Edit) |
@jhy
Let me test our production situation to check whether normal queries still show a performance difference. |
[Worklog]
With the sort disabled and in forward:
And in reverse sorted order:
With sort() called, but in forward eval order:
With sort() called, but in (previous) reverse eval order:
So, need to review the cost function / execution plan for this query; hitting the ID first should be more performant than scanning for classes. Perhaps we end up running more evals because of the |
Original plan of the query:
<And css="#id .class" cost="10">
  <Parent css="#id " cost="4">
    <Id css="#id" cost="2"></Id>
  </Parent>
  <Class css=".class" cost="6"></Class>
</And>
I updated the cost such that the ancestor evaluator has an appropriately higher cost than the class evaluator, as the ancestor will have to scan up the tree; even if each ID check is relatively low cost, they will accumulate. I also renamed it from Parent to Ancestor, to make its behavior clearer.
New:
<And css="#id .class" cost="22">
  <Class css=".class" cost="6"></Class>
  <Ancestor css="#id " cost="16">
    <Id css="#id" cost="2"></Id>
  </Ancestor>
</And>
Completed run:
|
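To illustrate why the ordering inside the And matters, here is a minimal sketch of cost-ordered evaluation. It is not jsoup's implementation: `CostedCheck` and `CostOrderedAnd` are hypothetical stand-ins for evaluators, the checks are plain predicates over strings, and only the cost values 6 and 16 are taken from the plan above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for an evaluator: a predicate with an estimated cost. */
final class CostedCheck<T> {
    final String label;
    final int cost;
    final Predicate<T> test;
    int calls = 0; // counts how often this check actually ran

    CostedCheck(String label, int cost, Predicate<T> test) {
        this.label = label;
        this.cost = cost;
        this.test = test;
    }

    boolean matches(T candidate) {
        calls++;
        return test.test(candidate);
    }
}

/** Hypothetical And: runs the cheapest checks first and stops at the first failure. */
final class CostOrderedAnd<T> {
    private final List<CostedCheck<T>> checks;

    CostOrderedAnd(List<CostedCheck<T>> checks) {
        this.checks = new ArrayList<>(checks);
        // Stable sort by ascending cost, so cheap checks can reject candidates early.
        this.checks.sort(Comparator.comparingInt(c -> c.cost));
    }

    boolean matches(T candidate) {
        for (CostedCheck<T> check : checks) {
            if (!check.matches(candidate)) return false; // short-circuit
        }
        return true;
    }

    /** Total cost, modelled here as the sum of the child costs. */
    int cost() {
        int total = 0;
        for (CostedCheck<T> check : checks) total += check.cost;
        return total;
    }

    /** Labels in evaluation order, for inspecting the plan. */
    List<String> order() {
        List<String> labels = new ArrayList<>();
        for (CostedCheck<T> check : checks) labels.add(check.label);
        return labels;
    }
}
```

With the ancestor check costed above the class check, a candidate that fails the cheap class test never triggers the tree walk at all, which is the effect the revised plan is aiming for.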
Thanks for the feedback here, fixed |
When I profile the jsoup HtmlParser with JProfiler, I find that the .inButtonScope("p") method takes too much time:
every tag has to be checked for whether it is inside a <p> tag.
Skipping this method improves performance by over 20%.
Is it worth it just for the HTML5 rules on the p tag?
Thank you