WARC-Protocol field proposal #42
Comments
What should we say about other protocols not in your list? Seems to me it is desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could say, please file a github issue here to propose a new protocol id, before you use it. Then at least there is one place to check for prior art. |
I think that's a great idea. I've updated the proposal text to include a link to an issue template. |
h2c and h2 are obvious odd ones out in the list as they don't follow the general name/version form and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field. Also I just made up the TLS protocol identifiers as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc seems somewhat common in software though (Java, OpenSSL) so I can see an argument that might be a better choice. I don't think there's a right answer here, the slash form is better in the sense that you could consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever TLS library you're using says. I couldn't see one argument as particularly more compelling than the other so just picked one. |
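To illustrate the point about chopping the version off, here is a minimal Python sketch. The function name split_protocol_id is hypothetical, and the sample identifiers simply follow the name/version form discussed above; this is not part of the proposal text.

```python
from typing import Optional, Tuple


def split_protocol_id(protocol_id: str) -> Tuple[str, Optional[str]]:
    """Split a 'name/version' identifier into (name, version).

    Identifiers without a slash (such as 'h2') have no separable version,
    so the version is returned as None.
    """
    name, sep, version = protocol_id.partition("/")
    return (name, version if sep else None)


print(split_protocol_id("tls/1.2"))
print(split_protocol_id("h2"))
```

With the "TLSvX" form the same split would need a per-protocol rule, which is the trade-off described above.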
After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, being also a list of version numbers, should be consistent with it. I keep going back and forth on it as I think there are arguments either way. In favour of a single field in the style of User-Agent:
In favour of repeated fields:
|
I have a question on which record types the
The most similar, already defined header I could think of to this is (Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the It feels like the |
Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and note the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol. |
I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’). Some reasons for allowing it on multiple record types:
It's likely because:
|
Excellent, thanks for the context. I ended up including them on both |
It seems like this hasn't been decided one way or another, but would very much be in favor of a single field, as that makes representing WARC headers as dictionary object much easier and more concise. Are there other WARC headers that allow repetition currently? The repeatable |
WARC-Concurrent-To is the only one in the standard:
The only other standard headers that would seem to make sense to repeat are the payload/block digest headers for different algorithms. But that's not allowed currently. Repetition of extension headers was also discussed in #95. I haven't seen any other extension headers in the wild that use repetition or comma separated lists so far. It's not WARC record headers but Heritrix uses repeated fields in application/warc-field metadata records to record extracted links. |
I'm in favor of a single field, comma-separated. Note that the clock has pretty much ticked out on this discussion... the minute that a large web player starts discriminating against crawling with http/1.1 and less so against crawling with http/2, we have to switch immediately. |
- HTTP headers: replace HTTP/2 and alike by HTTP/1.1 to ensure backward-compatibility for WARC readers, see iipc/warc-specifications#15 - store protocol versions and cipher suites in WARC headers WARC-Protocol and WARC-Cipher-Suite, see iipc/warc-specifications#42 iipc/warc-specifications#86 - allow multiple WARC headers of the same name (WARC-Protocol may occur twice to hold the HTTP and TLS version)
To clarify, are multiple headers required, eg:
or is:
also acceptable? The former seems to introduce unnecessary complexity, as we can't use a regular map to store the values, and we would very much prefer to use the latter. Could we clarify that the above are equivalent? We'd prefer to use the latter in Browsertrix Crawler, but if others feel strongly that this should not be supported, we can do the former, reluctantly. |
The WARC-Concurrent-To header already accommodates multiple values by repeating the header, per section 5.7:
This means that storing WARC headers using a simple string-to-string map is already insufficient, since WARC-Concurrent-To header can have multiple values represented by repeating the header name. Since header order isn't important you can use a regular hash or tree map, but it has to be a string to list of strings map (e.g. If we want to enable using a string to string map, we could consider introducing a comma concatenation rule like HTTP has and update WARC-Concurrent-To to allow comma separated values. However commas can appear in URIs and thus WARC-Concurrent-To values, so the splitting would need to be more complicated to disambiguate whether the comma is part of the URI. Such as by splitting on Consequently my position is preferring repeated headers over commas for consistency with WARC-Concurrent-To. However if everyone else prefers commas I won't stand in the way of it. |
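As a rough sketch of the string-to-list-of-strings map described above, the following Python parses a header block into a dictionary of lists. The function name parse_warc_headers and the sample record IDs are made up for illustration, and the sketch ignores folded continuation lines.

```python
from collections import defaultdict
from typing import Dict, List


def parse_warc_headers(header_block: str) -> Dict[str, List[str]]:
    """Parse 'Name: value' lines into a name -> list-of-values map."""
    headers: Dict[str, List[str]] = defaultdict(list)
    for line in header_block.splitlines():
        if not line.strip() or ":" not in line:
            continue  # skip blank lines and the version line, which has no colon
        name, _, value = line.partition(":")
        # Field names are lower-cased so lookups are case-insensitive.
        headers[name.strip().lower()].append(value.strip())
    return dict(headers)


sample = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Concurrent-To: <urn:uuid:11111111-1111-1111-1111-111111111111>\r\n"
    "WARC-Concurrent-To: <urn:uuid:22222222-2222-2222-2222-222222222222>\r\n"
)
headers = parse_warc_headers(sample)
print(headers["warc-concurrent-to"])

# Naive comma splitting cuts a value in two when the URI itself has a comma.
uri_with_comma = "<http://example.org/a,b>"
print(uri_with_comma.split(","))
```

Because every value is a list, a repeated field needs no special handling, while a single-valued field is just a list of length one.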
My personal thoughts are to support not just the single entry, but a strict ordering as well, and change the grammar accordingly.
If we anticipate a situation where we would record the transport layer data without any associated application layer protocol then we can make at least any one value as mandatory. If both are present (or if we can think of any other reasonable combinations of more than two of the available IDs) then we can further categorize them and propose a strict hierarchy, making it easier to parse. |
Hi, my personal preference goes to multiple headers instead of comma-separated values. It's easier to parse and Edit: to be clear, I think there MIGHT be a way to fit it in 1 header. But I think it should not be comma-separated. We can come up with a different pattern or use multiple headers. (which I prefer personally) |
To my knowledge no other field in the WARC spec has such comma-separated values so using the same header multiple times makes more sense |
Regarding @ibnesayeed's idea of separating application and transport protocol ids:
If we're going to do that, I think it would be simpler to define separate headers for them:
It is conceivable to have more than two levels of nesting though. One common example perhaps being DNS over HTTPS and a more exotic one being HTTP over TLS over SCTP. I could well be overthinking things though and maybe two is enough in practice. |
Per @ato's above message regarding the insufficiency of a string-to-string map already, I do not agree with your claim that multiple headers would "introduce unnecessary complexity." Furthermore, I'm curious about webrecorder's current support for multiple headers given that Something like @ibnesayeed proposed with a well-defined syntax could work (similar to User Agent, as brought up by @ato above). But I agree with @equals215 that a comma-separated header introduces its own version of a special case that requires additional consideration for reasons of parsing and security, whereas multiple headers have precedent and should already be supported by most WARC tools. I therefore favor multiple headers for reasons of precedent and my interpretation of KISS principles in this situation. |
I would fully endorse purpose-specific separate headers for various categories of protocols, which would be simpler to parse and would allow easier extension in the future. Comma-separated values with only a single key would be my second choice, especially, if there can be cases where multiple protocols of the same category are present. Repeated headers should be avoided as much as possible, even though historically they have been present in both WARC and HTTP worlds. I can say a lot more on this in a separate post. |
Updating my position to ranked vote: purpose-specific separate headers, repeated headers, application-comma-transport, commas. On further reflection I find the purpose-specific separate headers option compelling because:
Admittedly, I have seen over and over again with both WARC and HTTP someone make the incorrect assumption they can't be repeated and later had to retrofit them in awkwardly. For example somehow the Chrome Devtools Protocol has ended up with four different ways of representing HTTP headers as JSON (array of name/value objects, object with null separated string values, object with comma separated string values, base64 encoded binary string with null separated lines like "a: 1\0a: 2\0b: foo") I guess the fact so many people overlook repeated headers may be reason enough to avoid them. Although I think quoting/escaping issues with commas is probably just as often a problem. I guess many implementations also silently (and sometimes dangerously) ignore unexpected repeat headers whereas unexpected commas are more likely to throw an error (usually safer). |
I would also vote for purpose-specific headers being the best option. Upon reflection, I agree that comma-separation is not the best option for the reasons mentioned (new semantics, errors in parsing). If a single header must be used to store multiple values, probably the best option is to use a well-defined format like JSON for the value so that there's no confusion about parsing. Obviously, it is too late to change WARC-Concurrent-To, but I would strongly support not adding any more repeated headers, as they are just prone to errors, as @ato mentioned above. (I believe HTTP has stopped adding repeated headers after Set-Cookie). I think For example, knowing that Would be in favor of these two headers, or other suggestions! |
re: distinct I also think strict ordering of headers to show "nesting" places requirements on header order that are not currently present in the WARC spec (and existing tools that don't explicitly preserve order). I haven't heard anyone share a clear use case for accurately recording the nesting of protocols, or when that might not be determined from the protocols listed themselves, so this seems to be a lot of added complexity for little gain. But I could just be missing something, so let me know. I think the simplicity of a generic Only show the transport protocol, because the application protocol is defined the
Application and protocol together? Great!:
Application protocol by itself? Sure, it could be redundant, but be bold WARC author!
Deeper? Go for it!
or
What about a "higher" protocol on top of HTTP, as @ato mentioned above? Sure, go for it:
Let's make something simple and flexible. Personally I prefer multiple headers, since there is precedent for that with |
- add new HeadersMultiMap to support set-cookie, as well as warc-concurrent-to headers with multiple values (store internally as map, convert to array for multi value headers, override iterator) - update tests to check for multiple warc-concurrent-to - also ensure multiple Set-Cookie works with case sensitive headers - fixes #32 - ready to support warc-protocol from iipc/warc-specifications#42
@willmhowes This is now fixed as of warcio.js 2.4.0 - thanks for reminder! @acidus99 Sounds like you're arguing for supporting as many variations, including commas and multiple headers, so
could also be written as:
or
or
Having multiple ways of specifying the same thing seems like the wrong approach imo. A more concise approach with distinct headers seems to make the most sense. I think it should be clarified if the purpose of this field is entirely informational and does not imply anything else about the rest of the WARC data. As mentioned above, the HTTP protocol version does affect the semantics of the HTTP request/response data found in the WARC, while the transport protocol does not. For example, does specifying There is no current way to store binary H2 / H3 data in WARCs, so it must be converted to HTTP/1.1, and my understanding was that this header would be used to indicate this. An indicator of H2/H3 -> HTTP/1.1 conversion seems more critical from a semantic perspective than information-only application / transport protocol info which is 'nice to have' additional metadata. |
Good point, we haven't considered QUIC and should. So HTTP/3 uses QUIC as a transport layer protocol, and QUIC also incorporates TLS, however using its own framing over UDP, not DTLS or the TLS over TCP record protocol. A strict layer ordering between QUIC and TLS does not seem obvious to me, although I guess we could just define an ordering if we want to. There is also already a QUIC 2 which explicitly can be used under the "h3" or "doq" (DNS over QUIC) protocols. While QUIC requires TLS 1.3 or newer and there isn't a TLS 1.4 yet, presumably there will be at some point. So you may wish to record the version used of TLS and QUIC separately. So either we hit the same problem with Under the current repeated-header form of the proposal you could use:
I'll add
I think the |
So that'd be amending |
OK, the week deadline I initially set has passed, so I'll recap where we are. On the original binary question about repeated headers vs comma separated list there was more support amongst IIPC members for repeated headers. So by the original process and deadline I suggested the proposal remains as is. However, Sawood made a new suggestion about separating the application and transport protocol ids. He suggested using fixed order subfields within WARC-Protocol but from that a fourth option naturally followed of moving the transport protocol to a separate header such as WARC-Transport. There was some support expressed for that as a way to avoid the need for multi valued headers entirely, although only a few members have expressed an opinion either way on the new options. acidus99 pointed out though that H3/QUIC/TLS already wouldn't fit well with those options, at least as they've been defined so far. Given the QUIC problem, my personal preference has moved back to repeated headers while acknowledging they're a gotcha because they can be easily overlooked by implementers. Maybe we can alleviate that a little bit with a community annotation and/or additional explanatory text in the next version of the standard. To keep things moving while also giving the newer ideas a chance to be properly considered, as they weren't in my initial request for comments, I'll set another deadline, this time 3 days from now. Any IIPC member can call to extend that to 1 week if they need a few more days to make an important point of argument or to draft a new option (e.g. a 3 header version that accounts for QUIC). Otherwise if the deadline is reached without a call for delay or call for a vote on a new set of options we'll consider this question as decided in favor of repeated headers and implementers can go ahead on that basis. |
@ikreymer To be clear, I was/am arguing against distinct @ato, while I'm not an IIPC member, I would be in favor of "multiple
The main reasons are:
More generally, I would suggest this "multiple duplicate headers each with a single value" pattern should be the preferred approach for situations in the future with headers that can have multiple values. The only downside I can think of to this approach is that order is not preserved, since header order isn't defined as meaningful in the WARC spec, but that seems a reasonable tradeoff. |
Motivation:
For example it was proposed in WARC revision 1.1 (modification): support of HTTP 2.X protocol in WARC format. #15 and WARC Extensions for HTTP/2 proposal #41 to allow HTTP/2 messages to be represented as application/http.
WARC-Protocol field definition
The WARC-Protocol field denotes the protocol(s) of the original network message
this record holds information about.
If the protocol you wish to record is not on the list above please file an issue to
propose a protocol identifier before using it.
The WARC-Protocol field may be omitted when the protocol is unknown or can be
unambiguously determined from some combination of the scheme portion of the
WARC-Target-URI field, the Content-Type field and the message in the record
block itself.
Multiple WARC-Protocol fields may be present to indicate protocol layering. For
example HTTP/1.1 over TLS 1.0 would be indicated by:
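A small Python sketch of how a writer might emit one WARC-Protocol field per layer follows. The helper name protocol_header_lines is hypothetical, and the identifier spellings http/1.1 and tls/1.0 are assumed from the name/version form discussed in this thread rather than copied from the identifier list.

```python
from typing import Iterable, List


def protocol_header_lines(protocol_ids: Iterable[str]) -> List[str]:
    """Return one 'WARC-Protocol: <id>' line per protocol identifier."""
    return [f"WARC-Protocol: {pid}" for pid in protocol_ids]


# Assumed identifiers for HTTP/1.1 carried over TLS 1.0.
lines = protocol_header_lines(["http/1.1", "tls/1.0"])
print("\r\n".join(lines))
```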
The WARC-Protocol field does not indicate the format of the record block and
is not a replacement for the Content-Type field. For example the use of an
extended text format that includes HTTP/2 pseudo-headers should be indicated by a
new value of the Content-Type field, not the presence of WARC-Protocol: h2.
Different protocols may reuse the same media type. There are also situations
where it may be desirable to represent the same message of a particular protocol
using different types such as semantically equivalent text and binary forms.
The WARC-Protocol field may be used in 'request', 'response',
'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo',
'conversion' and 'continuation' records.
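The record-type rule above could be checked with something like the following Python sketch. The function name warc_protocol_status and the three return strings are made up for illustration; record types that the text does not mention are reported as unspecified rather than guessed.

```python
# Record types taken from the paragraph above.
ALLOWED_TYPES = frozenset({"request", "response", "resource", "metadata", "revisit"})
FORBIDDEN_TYPES = frozenset({"warcinfo", "conversion", "continuation"})


def warc_protocol_status(record_type: str) -> str:
    """Classify whether a record of this WARC-Type may carry WARC-Protocol."""
    if record_type in ALLOWED_TYPES:
        return "allowed"
    if record_type in FORBIDDEN_TYPES:
        return "forbidden"
    return "unspecified"  # the text says nothing about other record types


print(warc_protocol_status("revisit"))
print(warc_protocol_status("warcinfo"))
```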
Determining the protocol in the absence of WARC-Protocol
† Not a registered media type but has been used in the wild.
When the WARC-Protocol field is present it takes precedence over the rules in the table above.
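The precedence rule can be sketched in Python as below. The name resolve_protocols is hypothetical, headers is assumed to be a lower-cased name to list-of-values map, and infer_from_record is a placeholder callback standing in for the table rules, which are not reproduced here.

```python
from typing import Callable, Dict, List

Headers = Dict[str, List[str]]


def resolve_protocols(
    headers: Headers,
    infer_from_record: Callable[[Headers], List[str]],
) -> List[str]:
    """Prefer explicit WARC-Protocol values; otherwise fall back to inference."""
    explicit = headers.get("warc-protocol", [])
    if explicit:
        return list(explicit)  # an explicit field overrides any inferred value
    return infer_from_record(headers)


# Placeholder fallback, used only to show the control flow.
def fallback(_headers: Headers) -> List[str]:
    return ["inferred-placeholder"]


print(resolve_protocols({"warc-protocol": ["h2"]}, fallback))
print(resolve_protocols({}, fallback))
```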
Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.
Edit 2024-07-15: Added h3 (HTTP/3)
Edit 2024-11-18: Added quic/1 and quic/2. Added clarifying example about pseudo-headers.