doc: resolve inconsistency in banned surrogate chars in escaped form #450

eugenesvk · 2024-12-23T14:45:33Z

Also add it explicitly to the spec grammar to make it clear that escaped forms are also affected

Since currently Disallowed Literal Code Points text allows surrogates in \u form, also the spec grammar doesn't specify that disallowed-literal-code-points should be excluded even in their hex code form

Also add it explicitly to the spec grammar to make it clear that escaped forms are also affected

zkat · 2024-12-23T17:59:03Z

I really didn’t want to change the spec at this stage.

I think we need to address this with a more explicit errata system that leaves the spec itself intact.

The nice thing about this particular change is that most parsers out there will almost certainly already fail if someone tries to do this (except maybe JS parsers that don’t parse docs as actual utf-8). It’s also a significant security issue so we should definitely address it.

@eugenesvk what are your thoughts on how to organize errata? Maybe we can add a section to the SPEC.md file for it? Or a separate file?

@tabatkins do you have any thoughts/experience with how other standards might handle errata without publishing new versions? My only real experience with it is Common Lisp: https://www.cliki.net/ANSI%20Clarifications%20and%20Errata

eilvelia · 2024-12-23T18:36:33Z

The nice thing about this particular change is that most parsers out there will almost certainly already fail if someone tries to do this (except maybe JS parsers that don’t parse docs as actual utf-8). It’s also a significant security issue so we should definitely address it.

It looks like this is a change in wording only, not in the actual behavior, since the \u escape already prohibits surrogates

Unicode Escape	\u{(1-6 hex chars)}	Code point described by hex characters, as long as it represents a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value)

I think the grammar change also doesn't produce any differences in behavior; the grammar only tells how escapes are analyzed, not what they are resolved into, and the process of resolving escapes is not necessarily total

(On an unrelated note, could #449 be merged? it's currently the only test that's incorrect)

edit: If an implementation allows surrogates in \u, that would also contradict the Strings in KDL represent textual UTF-8 Values part, because those are invalid UTF-8.

eugenesvk · 2024-12-23T19:33:41Z

For me errata only makes sense if it has the most recent grammar in its full form as otherwise it'd be confusing trying to parse the rules. And grammar's design should get continuous improvements so that whoever decides later to use it to build some new parser won't have to rediscover the issues (or worse - not discover them).

(for example, while updating the syntax highlighter I've WIP converted the grammar to KDL and made the element hierarchy explicit to make it easier to use as a reference)

And the current Changelog could continue adding updates similar to how it tracked draft revisions

(And yes, the JS parser had this bug, not sure about others as most don't have easily runnable binaries/scripts published)

zkat · 2024-12-23T19:50:14Z

Ahhh, so we do specify that already. I’ll let someone else chime in first so it’s not just me but I’m ok with this change.

bgotink · 2024-12-23T23:24:19Z

Sounds good to me, though I'd like to see a test added as well so other implementations can easily detect if they've made the same mistake as I have.

eugenesvk · 2024-12-24T04:43:05Z

added a test separately

zkat · 2024-12-24T05:01:41Z

I’ll merge this for now, then

zkat · 2024-12-24T07:57:03Z

Me: “I’ll merge this now”
ADHD: “ok but have you considered looking at mastodon for a minute?”

doc: resolve inconsistency in banned chars in escaped form

52a95a6

Also add it explicitly to the spec grammar to make it clear that escaped forms are also affected

zkat approved these changes Dec 23, 2024

View reviewed changes

eugenesvk mentioned this pull request Dec 24, 2024

Add a test for escaped surrogates #451

Closed

zkat merged commit 76dc3e3 into kdl-org:main Dec 24, 2024

eugenesvk deleted the pr-doc branch December 24, 2024 08:01

eugenesvk mentioned this pull request Dec 25, 2024

Exclude hex above max Unicode Scalar Value #456

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

doc: resolve inconsistency in banned surrogate chars in escaped form #450

doc: resolve inconsistency in banned surrogate chars in escaped form #450

eugenesvk commented Dec 23, 2024

zkat commented Dec 23, 2024

eilvelia commented Dec 23, 2024 •

edited

Loading

eugenesvk commented Dec 23, 2024 •

edited

Loading

zkat commented Dec 23, 2024

bgotink commented Dec 23, 2024

eugenesvk commented Dec 24, 2024

zkat commented Dec 24, 2024

zkat commented Dec 24, 2024

doc: resolve inconsistency in banned surrogate chars in escaped form #450

doc: resolve inconsistency in banned surrogate chars in escaped form #450

Conversation

eugenesvk commented Dec 23, 2024

zkat commented Dec 23, 2024

eilvelia commented Dec 23, 2024 • edited Loading

eugenesvk commented Dec 23, 2024 • edited Loading

zkat commented Dec 23, 2024

bgotink commented Dec 23, 2024

eugenesvk commented Dec 24, 2024

zkat commented Dec 24, 2024

zkat commented Dec 24, 2024

eilvelia commented Dec 23, 2024 •

edited

Loading

eugenesvk commented Dec 23, 2024 •

edited

Loading