Insufficient definition of quoted-string
#71
Comments
On a related note, I think the specification should mention that the value of a |
It appears those definitions were copied verbatim from RFC 2616 except that WARC defines CHAR as a UTF-8 character instead of an ASCII octet. The newer RFC 7230 indeed excludes \ (0x5C) and " (0x22) from qdtext and also has a little more explanation of how a quoted-string should be interpreted:
It also defines other important details that WARC does not, such as that leading and trailing whitespace in field values is removed during parsing, and that one reason to use quoted-string is to preserve such whitespace when necessary. Presumably it is used in WARC-Filename so that filenames that start or end with whitespace, or that contain CTLs, can be preserved.
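To make the RFC 7230 reading concrete, here is a minimal Python sketch of a decoder for its `quoted-string` grammar, where `qdtext` excludes `\` and `"` and a `quoted-pair` always decodes to the character after the backslash. The function names are hypothetical, and treating every code point at or above 0x80 as `obs-text` is an assumption made to accommodate UTF-8 values; this is not taken from any WARC implementation.

```python
def _is_qdtext(c: str) -> bool:
    # RFC 7230: qdtext = HTAB / SP / %x21 / %x23-5B / %x5D-7E / obs-text
    o = ord(c)
    return c in "\t " or o == 0x21 or 0x23 <= o <= 0x5B or 0x5D <= o <= 0x7E or o >= 0x80


def _is_pair_char(c: str) -> bool:
    # RFC 7230: quoted-pair = "\" ( HTAB / SP / VCHAR / obs-text )
    o = ord(c)
    return c in "\t " or 0x21 <= o <= 0x7E or o >= 0x80


def parse_quoted_string(s: str) -> str:
    """Decode a quoted-string, given including its surrounding double quotes."""
    if len(s) < 2 or s[0] != '"' or s[-1] != '"':
        raise ValueError("not a quoted-string")
    out = []
    i, end = 1, len(s) - 1  # 'end' is the index of the closing quote
    while i < end:
        c = s[i]
        if c == "\\":
            # quoted-pair: drop the backslash, keep the following character.
            if i + 1 >= end or not _is_pair_char(s[i + 1]):
                raise ValueError("invalid quoted-pair")
            out.append(s[i + 1])
            i += 2
        elif _is_qdtext(c):
            out.append(c)
            i += 1
        else:
            raise ValueError(f"character not allowed unescaped: {c!r}")
    return "".join(out)


# Outer whitespace is trimmed by the field parser; whitespace inside the quotes survives.
raw_value = '  " crawl 1.warc "  '
print(repr(parse_quoted_string(raw_value.strip(" \t"))))  # ' crawl 1.warc '
```

Because a backslash can only ever start a `quoted-pair` under this grammar, each valid input has exactly one decoding, and `"\"` is rejected rather than silently accepted.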
By the way, I realise that these definitions were taken directly from RFC 2616. The RFC has a bit of clarification but also doesn't avoid this ambiguity. Specifically, it states that 'The backslash character ("\") MAY be used as a single-character quoting mechanism only within quoted-string and comment constructs.', but that only specifies where the backslash may be used as an escape character and does not reject its use as a literal backslash in RFC 7230 redefined |
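The ambiguity can be shown with a small Python sketch that enumerates every decoding of a quoted-string body when `qdtext` still admits the backslash, as in the RFC 2616 style grammar. The function name `interpretations` is hypothetical, and the sketch simplifies `qdtext` to "any character except `"`" (it ignores the CTL restrictions of `TEXT`). It only covers decoding of the body between the outer quotes, not the separate problem of deciding where the quoted-string ends.

```python
def interpretations(body: str) -> set:
    """Return all decoded values of a quoted-string body (text between the quotes)
    when a backslash may be either literal qdtext or the start of a quoted-pair."""
    results = set()

    def walk(i: int, acc: str) -> None:
        if i == len(body):
            results.add(acc)
            return
        c = body[i]
        if c != '"':
            # qdtext branch: take the character literally, backslash included.
            walk(i + 1, acc + c)
        if c == "\\" and i + 1 < len(body):
            # quoted-pair branch: drop the backslash, keep the next character.
            walk(i + 2, acc + body[i + 1])

    walk(0, "")
    return results


print(interpretations("\\\\"))  # body of "\\": one backslash or two
print(interpretations("\\a"))   # body of "\a": 'a' or backslash followed by 'a'
```

Any body without backslashes yields exactly one interpretation, which is why the problem only surfaces for values such as Windows-style paths.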
Aah, crossfire. Yeah, the 7230 definition makes much more sense. It probably can't simply be taken over to WARC though due to the continuation lines, which were deprecated in 7230, and indeed the |
Actually, the line folding does have an undesired effect: if you want to represent any white space other than a single space in a However, there is also an issue with 7230's definition: control characters cannot be included in a [0] Edit: Minor correction: you can also represent N spaces with N line folds. For example, a field |
First, these are the relevant definitions from WARC/1.1:
The issue here is that the backslash is not excluded in `qdtext`. This means that any `quoted-string` value involving backslashes is ambiguous:

- `"\\"` could mean either two backslashes (via `qdtext`) or one (via `quoted-pair`).
- `"\a"` could mean `\a` or just `a`.
- `"\"` could be a single backslash or the beginning of a `quoted-string` with the first character being a double quote; this could lead to all sorts of parser errors later on.

I propose to redefine `qdtext` as: