Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: resolve inconsistency in banned surrogate chars in escaped form #450

Merged
merged 1 commit into from
Dec 24, 2024

Conversation

eugenesvk
Copy link
Contributor

Also add it explicitly to the spec grammar to make it clear that escaped forms are also affected

Since currently Disallowed Literal Code Points text allows surrogates in \u form, also the spec grammar doesn't specify that disallowed-literal-code-points should be excluded even in their hex code form

Also add it explicitly to the spec grammar to make it clear that escaped forms are also affected
@zkat
Copy link
Member

zkat commented Dec 23, 2024

I really didn’t want to change the spec at this stage.

I think we need to address this with a more explicit errata system that leaves the spec itself intact.

The nice thing about this particular change is that most parsers out there will almost certainly already fail if someone tries to do this (except maybe JS parsers that don’t parse docs as actual utf-8). It’s also a significant security issue so we should definitely address it.

@eugenesvk what are your thoughts on how to organize errata? Maybe we can add a section to the SPEC.md file for it? Or a separate file?

@tabatkins do you have any thoughts/experience with how other standards might handle errata without publishing new versions? My only real experience with it is Common Lisp: https://www.cliki.net/ANSI%20Clarifications%20and%20Errata

@eilvelia
Copy link
Contributor

eilvelia commented Dec 23, 2024

The nice thing about this particular change is that most parsers out there will almost certainly already fail if someone tries to do this (except maybe JS parsers that don’t parse docs as actual utf-8). It’s also a significant security issue so we should definitely address it.

It looks like this is a change in wording only, not in the actual behavior, since the \u escape already prohibits surrogates

Unicode Escape	\u{(1-6 hex chars)}	Code point described by hex characters, as long as it represents a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value)

I think the grammar change also doesn't produce any differences in behavior; the grammar only tells how escapes are analyzed, not what they are resolved into, and the process of resolving escapes is not necessarily total

(On an unrelated note, could #449 be merged? it's currently the only test that's incorrect)

edit: If an implementation allows surrogates in \u, that would also contradict the Strings in KDL represent textual UTF-8 Values part, because those are invalid UTF-8.

@eugenesvk
Copy link
Contributor Author

eugenesvk commented Dec 23, 2024

For me errata only makes sense if it has the most recent grammar in its full form as otherwise it'd be confusing trying to parse the rules. And grammar's design should get continuous improvements so that whoever decides later to use it to build some new parser won't have to rediscover the issues (or worse - not discover them).

(for example, while updating the syntax highlighter I've WIP converted the grammar to KDL and made the element hierarchy explicit to make it easier to use as a reference)

And the current Changelog could continue adding updates similar to how it tracked draft revisions

(And yes, the JS parser had this bug, not sure about others as most don't have easily runnable binaries/scripts published)

@zkat
Copy link
Member

zkat commented Dec 23, 2024

Ahhh, so we do specify that already. I’ll let someone else chime in first so it’s not just me but I’m ok with this change.

@bgotink
Copy link
Contributor

bgotink commented Dec 23, 2024

Sounds good to me, though I'd like to see a test added as well so other implementations can easily detect if they've made the same mistake as I have.

@eugenesvk
Copy link
Contributor Author

added a test separately

@zkat
Copy link
Member

zkat commented Dec 24, 2024

I’ll merge this for now, then

@zkat
Copy link
Member

zkat commented Dec 24, 2024

Me: “I’ll merge this now”
ADHD: “ok but have you considered looking at mastodon for a minute?”

@zkat zkat merged commit 76dc3e3 into kdl-org:main Dec 24, 2024
@eugenesvk eugenesvk deleted the pr-doc branch December 24, 2024 08:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants