Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AST validation #425

Merged
merged 2 commits into from
Jul 8, 2024
Merged

AST validation #425

merged 2 commits into from
Jul 8, 2024

Conversation

yannham
Copy link
Contributor

@yannham yannham commented Jul 7, 2024

This draft PR implements a simple validation procedure for CommonMark ASTs, following the discussions in #371. The idea is to perform the additional checks that the nature of a block and the nature of its children (or the absence thereof) match, which isn't currently possible to enforce (or at least not in a agreable way) using static types.

@kivikakk The logic is pretty simple, but the tests seem to loop forever, so it sounds like there's an infinite loop somewhere. However the function is a pretty dumb tree traversal, so I wonder if I'm missing something (like children() is actually returning descendants, but given the API and the existence of descendants(), this doesn't seem to be the case).

@kivikakk
Copy link
Owner

kivikakk commented Jul 7, 2024

I think there's quite a few places in the new validate function where self should be node; the infinite loop happens where we stack.extend(self.children()) on every iteration, thus making it inexhaustible.

@kivikakk
Copy link
Owner

kivikakk commented Jul 7, 2024

After that, I think you'll find most checks raised by the tests are of the InlineNodeWithChildren kind, but it's not true that inline nodes can't have children — they certainly can (and most do). (Basic examples of each: Emph and Strong are inlines, and they can contain Text, other Emph and Strong, Link, most (all?) other inlines. Text is an inline which cannot contain any other node.) nodes::can_contain_type should enumerate all the valid nestings, and should be used by the validate function.

The commonmark_avoids_spurious_backslash test constructs an invalid AST (that LineBreak should go on the end of p1, not as a child of the document) and is correctly flagged!

@yannham
Copy link
Contributor Author

yannham commented Jul 7, 2024

I think there's quite a few places in the new validate function where self should be node; the infinite loop happens where we stack.extend(self.children()) on every iteration, thus making it inexhaustible.

Oh, right. I started with a recursive function but it blew the stack and there's a specific test case to avoid that, so I switched to an iterative version which introduced node but I must have forgotten to change self to node. Thanks

@yannham
Copy link
Contributor Author

yannham commented Jul 7, 2024

It seems that a few dozen remaining tests still don't pass. Most of them have a Document node with Text or variants (like SpoileredText) as direct children, but this isn't allowed according to can_contain_type and the documentation (and the markdown spec I believe, which agrees that documents should only have blocks as children: https://github.com/commonmark/commonmark-spec/blob/9103e341a973013013bb1a80e13567007c5cef6f/CommonMark.dtd#L8).

What do you think? Should we relax can_contain_type, or should we fix the fact that such ASTs are ever produced?

@yannham
Copy link
Contributor Author

yannham commented Jul 7, 2024

For the record, failing tests are:

failures:
    tests::commonmark::commonmark_avoids_spurious_backslash
    tests::footnotes::footnote_in_table
    tests::footnotes::footnote_with_superscript
    tests::spoiler::mismatched_spoilers
    tests::spoiler::spoiler
    tests::spoiler::spoiler_in_table
    tests::spoiler::spoiler_regressions
    tests::strikethrough::strikethrough
    tests::superscript::superscript
    tests::underline::underline

@kivikakk
Copy link
Owner

kivikakk commented Jul 7, 2024

Nice, thank you so much!

Of the remaining, some of them are legit (like FootnoteReference in a TableCell) and need can_contain_type to be adjusted, and some of them (like Text in Document) aren't. If it's all good with you, I'll make the remaining adjustments myself?

@yannham
Copy link
Contributor Author

yannham commented Jul 7, 2024

Sure!

@kivikakk
Copy link
Owner

kivikakk commented Jul 7, 2024

Thanks! The spoiler ones were particularly fun: half of the issues raised were legit, but half unearthed a weirdness in what gets added to the AST when a spoiler doesn't fully match. I've left a hack in can_contain_type for that for now since fixing that is out of scope for this PR. If both you and CI like where it's at, it should be good to go!

yannham and others added 2 commits July 8, 2024 10:16
This commit adds a `validate()` method to check that AST are
well-formed. This can be useful when generating ASTs programmatically,
where it's possible to violate the markdown specification, leading to
invalid output if rendered. Additionally, this validation is done
automatically when formatting an AST in debug mode.
@yannham yannham force-pushed the task/document-validation branch from 17a0ca6 to 7460a2f Compare July 8, 2024 08:16
@yannham yannham marked this pull request as ready for review July 8, 2024 08:16
@yannham
Copy link
Contributor Author

yannham commented Jul 8, 2024

I did a bit of clean up, and simplified further the validation function: the first test that some nodes don't have children at all was superfluous, because it's also subsumed by can_contain_node which would return false on such an unexpected child. So now, validate mostly just maps can_contain_child on each pair (child, parent) of the tree.

It's good to go on my end 🙂

@kivikakk
Copy link
Owner

kivikakk commented Jul 8, 2024

Really clean now. Thank you so much!

@kivikakk kivikakk enabled auto-merge July 8, 2024 08:53
@kivikakk kivikakk merged commit 878e601 into kivikakk:main Jul 8, 2024
20 checks passed
@kivikakk
Copy link
Owner

This just unearthed some AST mistakes I was making! Thanks again! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants