Some invalid HTML documents yield Failure("require_current_element: None") #67
Comments
I haven't looked into the error in detail yet, but, indeed, exceptions should not leak to the library user, and the parser should be able to recover at least somehow from all errors.
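Until the parser recovers on its own, a caller can contain the leak on their side. The following is only a minimal sketch, assuming the exception is a plain Failure as in the issue title. The names guard and failing_parse are hypothetical and not part of Markup.ml; failing_parse merely stands in for a parse that hits the bug, and the input string is a placeholder.

```ocaml
(* Run [parse input], turning a leaked Failure into an Error value.
   [guard] is a hypothetical helper, not a Markup.ml function. *)
let guard (parse : string -> 'a) (input : string) : ('a, string) result =
  try Ok (parse input) with Failure msg -> Error msg

(* Stand-in for a parser that hits the bug; not the real Markup.ml code. *)
let failing_parse (_ : string) : unit =
  failwith "require_current_element: None"

let () =
  match guard failing_parse "(placeholder input)" with
  | Ok () -> print_endline "parsed"
  | Error msg -> Printf.printf "parser failed: %s\n" msg
```

This only hides the symptom from the client; the parser still needs to perform real error recovery.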
You may be able to follow along with these rules in the HTML5 spec to get an idea of what the parser is required by the spec to do (that's what I'll do when debugging). The input is quite small (thank you!), so it should be doable. If you can spot the point at which the parser diverges, that would probably be the bug. The error makes me suspect that this is an interaction with the "adoption agency algorithm", called on by the spec in various places. It is used to reorganize the document tree in response to various misnested elements. Markup.ml creates local subtrees to "buffer" parser stream output in case those subtrees might be affected by the "adoption agency algorithm", and streams those subtrees out on its final output stream once it is sure that the subtree is safe from modification by the algorithm.
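The buffering idea described above can be illustrated with a toy model. This is a sketch under stated assumptions, not Markup.ml's actual implementation: the signal type, needs_buffering, and make_buffered_sink are hypothetical names, and the list of elements that trigger buffering is chosen only for illustration. Signals are emitted immediately unless they belong to a subtree that might still be rearranged; that subtree is held back and flushed in order once its outermost element is closed.

```ocaml
(* Simplified signals; a hypothetical type, not Markup.ml's own. *)
type signal = Start of string | End | Text of string

(* Elements that, in this toy model, open a subtree that must be held
   back. The list is an assumption made for illustration. *)
let needs_buffering name = List.mem name ["a"; "b"; "i"]

(* Returns [(push, flush)]. [push] forwards a signal to [emit] at once
   unless it belongs to a held-back subtree; [flush] releases whatever
   is still held, for example at end of input. *)
let make_buffered_sink (emit : signal -> unit)
    : (signal -> unit) * (unit -> unit) =
  let buffer = ref [] in  (* held-back signals, newest first *)
  let depth = ref 0 in    (* nesting depth inside the subtree; 0 = none *)
  let flush () =
    List.iter emit (List.rev !buffer);
    buffer := [];
    depth := 0
  in
  let push s =
    if !depth = 0 then begin
      match s with
      | Start name when needs_buffering name ->
        buffer := [s];
        depth := 1
      | _ -> emit s
    end else begin
      buffer := s :: !buffer;
      (match s with
       | Start _ -> incr depth
       | End -> decr depth
       | Text _ -> ());
      (* The subtree is closed, so it is safe to stream it out. *)
      if !depth = 0 then flush ()
    end
  in
  (push, flush)
```

In this model a consumer sees nothing of a buffered subtree until its closing signal arrives, which is the price paid for being able to restructure it beforehand.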
If you do try to follow the spec, and run into the "foster parenting algorithm," which is required in misnesting inside As an aside, it is possible to implement the algorithm as a separate pass over the signal stream for people that need this level of conformance (at the price of buffering the document), but I don't think that's ever been done for Markup.ml.
The following document (reduced as much as possible) makes parse_html raise a failure "require_current_element: None". Note that the document is indeed invalid, since xmp elements are both deprecated and should not appear inside a td element, but one should still be able to perform error recovery. I'm guessing the failure exception should not leak to the client? I tested it with markup 1.0.2, using the following program: