WARC-Protocol field proposal #42

Open
ato opened this issue Jul 13, 2018 · 28 comments

@ato
Member

ato commented Jul 13, 2018

Motivation:

  • To allow the recording of messages using a different representation to their wire message format as
  • To allow the presence of layered protocols like TLS to be recorded.
  • To allow readers of WARC files to be able to determine the protocol of a message without having to know how to parse the record block.
  • To disambiguate when the protocol cannot be determined from the message itself. Many protocols, including HTTP/2 and SPDY, negotiate protocol version up front and subsequent messages are not tagged with a protocol identifier.

WARC-Protocol field definition

The WARC-Protocol field denotes the protocol(s) of the original network message
this record holds information about.

WARC-Protocol = "WARC-Protocol" ":" protocol-id
protocol-id = "dns"      ; DNS [RFC 1035]
            | "ftp"      ; FTP [RFC 959]
            | "gemini"   ; Gemini
            | "gopher"   ; Gopher [RFC 1436]
            | "http/0.9" ; HTTP/0.9
            | "http/1.0" ; HTTP/1.0 [RFC 1945]
            | "http/1.1" ; HTTP/1.1 [RFC 7230]
            | "h2"       ; HTTP/2 over TLS [RFC 7540]
            | "h2c"      ; HTTP/2 over cleartext TCP [RFC 7540]
            | "h3"       ; HTTP/3 [RFC 9114]
            | "quic/1"   ; QUIC version 1 [RFC 9000]
            | "quic/2"   ; QUIC version 2 [RFC 9369]
            | "spdy/1"   ; SPDY/1
            | "spdy/2"   ; SPDY/2
            | "spdy/3"   ; SPDY/3
            | "ssl/2"    ; SSLv2 aka SSL 0.2
            | "ssl/3"    ; SSLv3 aka SSL 3.0 [RFC 6101]
            | "tls/1.0"  ; TLS 1.0 [RFC 2246]
            | "tls/1.1"  ; TLS 1.1 [RFC 4346]
            | "tls/1.2"  ; TLS 1.2 [RFC 5246]
            | "tls/1.3"  ; TLS 1.3

If the protocol you wish to record is not on the list above please file an issue to
propose a protocol identifier before using it.

The WARC-Protocol field may be omitted when the protocol is unknown or can be
unambiguously determined from some combination of the scheme portion of the
WARC-Target-URI field, the Content-Type field and the message in the record
block itself.

Multiple WARC-Protocol fields may be present to indicate protocol layering. For
example HTTP/1.1 over TLS 1.0 would be indicated by:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0

The WARC-Protocol field does not indicate the format of the record block and
is not a replacement for the Content-Type field. For example the use of an
extended text format that includes HTTP/2 pseudo-headers should be indicated by a
new value of the Content-Type field, not by the presence of WARC-Protocol: h2.

Different protocols may reuse the same media type. There are also situations
where it may be desirable to represent the same message of a particular protocol
using different types such as semantically equivalent text and binary forms.

The WARC-Protocol field may be used in 'request', 'response',
'resource', 'metadata' and 'revisit' records and shall not be used in 'warcinfo',
'conversion' and 'continuation' records.
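
As a rough illustration of the definition above, the following Python sketch checks a record's WARC-Protocol values against the proposed identifier list and the permitted record types. The function name `check_warc_protocol` is hypothetical, and treating identifiers as exact, case-sensitive matches is an assumption; the proposal text does not say how case should be handled.

```python
# Identifiers copied from the protocol-id grammar in the proposal.
PROTOCOL_IDS = frozenset({
    "dns", "ftp", "gemini", "gopher",
    "http/0.9", "http/1.0", "http/1.1", "h2", "h2c", "h3",
    "quic/1", "quic/2",
    "spdy/1", "spdy/2", "spdy/3",
    "ssl/2", "ssl/3",
    "tls/1.0", "tls/1.1", "tls/1.2", "tls/1.3",
})

# Record types on which the proposal permits WARC-Protocol.
ALLOWED_RECORD_TYPES = frozenset(
    {"request", "response", "resource", "metadata", "revisit"}
)


def check_warc_protocol(record_type, values):
    """Return a list of problems; an empty list means none were detected.

    `values` holds one string per WARC-Protocol field found in the record.
    Exact, case-sensitive matching is an assumption of this sketch.
    """
    problems = []
    if values and record_type not in ALLOWED_RECORD_TYPES:
        problems.append(f"WARC-Protocol not allowed on '{record_type}' records")
    for value in values:
        if value.strip() not in PROTOCOL_IDS:
            problems.append(f"unknown protocol-id: {value!r}")
    return problems
```

A record carrying `http/1.1` and `tls/1.0` on a 'response' record would pass, while any value on a 'warcinfo' record, or a value such as `TLSv1.2` that is not in the list, would be reported.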

Determining the protocol in the absence of WARC-Protocol

URI Scheme   Content-Type           Header version   Protocol
dns          text/dns               -                dns      ; transport unknown
ftp          -                      -                ftp      ; over cleartext TCP
gemini       application/gemini †   -                gemini   ; over TLS #85
gopher       application/gopher †   -                gopher   ; over cleartext TCP
http         application/http       absent           http/0.9 ; over cleartext TCP
http         application/http       "HTTP/1.0"       http/1.0 ; over cleartext TCP
http         application/http       "HTTP/1.1"       http/1.1 ; over cleartext TCP
https        application/http       "HTTP/1.0"       http/1.0 ; over TLS
https        application/http       "HTTP/1.1"       http/1.1 ; over TLS

† Not a registered media type but has been used in the wild.

When the WARC-Protocol field is present it takes precedence over the rules in the table above.
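
The fallback rules in the table could be sketched in Python roughly as follows, for use only when a record has no WARC-Protocol field. The function name `infer_protocol`, its parameters, and the `(protocol, transport)` return shape are assumptions made for illustration, as is matching the ftp row regardless of Content-Type (the table gives none for it).

```python
from urllib.parse import urlsplit

# Header-version strings from the table mapped to protocol identifiers.
_HTTP_VERSIONS = {"HTTP/1.0": "http/1.0", "HTTP/1.1": "http/1.1"}
_HTTP_TRANSPORTS = {"http": "cleartext TCP", "https": "TLS"}


def infer_protocol(target_uri, content_type=None, header_version=None):
    """Return (protocol-id, transport note) per the table, or None if no row matches.

    `header_version` is the version string from the message's first line,
    or None when the message carries no version (the "absent" row).
    """
    scheme = urlsplit(target_uri).scheme.lower()
    # Drop media type parameters such as "; msgtype=response".
    media_type = (content_type or "").split(";")[0].strip().lower()

    if scheme == "dns" and media_type == "text/dns":
        return ("dns", "unknown")
    if scheme == "ftp":
        return ("ftp", "cleartext TCP")
    if scheme == "gemini" and media_type == "application/gemini":
        return ("gemini", "TLS")
    if scheme == "gopher" and media_type == "application/gopher":
        return ("gopher", "cleartext TCP")
    if scheme in _HTTP_TRANSPORTS and media_type == "application/http":
        if header_version is None:
            # The table lists an absent version only for the plain http scheme.
            return ("http/0.9", "cleartext TCP") if scheme == "http" else None
        protocol = _HTTP_VERSIONS.get(header_version)
        if protocol is not None:
            return (protocol, _HTTP_TRANSPORTS[scheme])
    return None
```

Returning None for anything outside the table keeps the sketch from guessing; a reader would then have to treat the protocol as unknown.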

Edit 2023-05-31: Added 'revisit' to list of allowed records.
Edit 2023-06-01: Added Gemini protocol as proposed by @acidus99 in #85.
Edit 2023-06-02: Added Gopher protocol as proposed by @TheTechRobo in #87.
Edit 2024-07-15: Added h3 (HTTP/3)
Edit 2024-11-18: Added quic/1 and quic/2. Added clarifying example about pseudo-headers.

@nlevitt
Member

nlevitt commented Jul 16, 2018

What should we say about other protocols not in your list? Seems to me it is desirable to allow other values, but we also want to avoid a complete free-for-all. Maybe we could say, please file a github issue here to propose a new protocol id, before you use it. Then at least there is one place to check for prior art.

@ato
Member Author

ato commented Jul 17, 2018

Maybe we could say, please file a github issue here to propose a new protocol id, before you use it.

I think that's a great idea. I've updated the proposal text to include a link to an issue template.

@ato
Member Author

ato commented Jul 17, 2018

h2c and h2 are obvious odd ones out in the list as they don't follow the general name/version form and h2 vs h2c is somewhat redundant with specifying the TLS version. I did it that way for consistency with the identifiers the RFC itself says to use in the HTTP Upgrade header and the ALPN protocol identifier field.

Also I just made up the TLS protocol identifiers as I couldn't find anything semi-official. "SSLv3", "TLSv1.1" etc seems somewhat common in software though (Java, OpenSSL) so I can see an argument that might be a better choice. I don't think there's a right answer here, the slash form is better in the sense that you could consistently chop the version off. The "TLSvX" form is better in that you might not have to convert from whatever TLS library you're using says. I couldn't see one argument as particularly more compelling than the other so just picked one.
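
To make the trade-off concrete, here is a small Python sketch of both directions: chopping the version off a slash-form identifier, and converting from the "TLSvX" style names that TLS libraries commonly report (for example, the strings returned by Python's `ssl.SSLSocket.version()`). The helper names `split_protocol_id` and `to_protocol_id` and the mapping table are illustrative assumptions, not part of the proposal.

```python
# Library-style names mapped to the slash-form identifiers in the proposal.
_LIBRARY_NAMES = {
    "SSLv2": "ssl/2",
    "SSLv3": "ssl/3",
    "TLSv1": "tls/1.0",
    "TLSv1.1": "tls/1.1",
    "TLSv1.2": "tls/1.2",
    "TLSv1.3": "tls/1.3",
}


def to_protocol_id(library_name):
    """Convert a name like 'TLSv1.2' to 'tls/1.2'; raises KeyError if unknown."""
    return _LIBRARY_NAMES[library_name]


def split_protocol_id(protocol_id):
    """Split 'tls/1.2' into ('tls', '1.2'); ids without a slash get version None."""
    name, _, version = protocol_id.partition("/")
    return name, version or None
```

With the slash form the split is a single `partition` call, whereas the library-style names need a lookup table because "TLSv1" carries no minor version. Identifiers such as `h2` and `h2c` do not follow the name/version form, so they come back with no version.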

@ato
Member Author

ato commented Mar 6, 2019

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, as it is also a list of version numbers, should be consistent with it. I keep going back and forth on it as I think there are arguments either way.

In favour of a single field in the style of User-Agent:

  • It makes WARC fields easier to deal with in most programming languages as you can just dump them into a hash table (with the exception of WARC-Concurrent-To).
  • I like the idea of using a consistent mini-language across all three headers (User-Agent, WARC-Software-Version, WARC-Protocol) for specifying component version numbers. It also leads to the obvious extension of allowing comments with more details for diagnostic/troubleshooting purposes.
  • It's more concise which makes records more human readable.

In favour of repeated fields:

  • It doesn't require field-specific parsing.
  • WARC does allow specific fields to be repeated so that's something readers have to account for anyway.
  • It's simpler to write a matching expression for generic filtering tools.

@acidus99

acidus99 commented May 30, 2023

I have a question about which record types the WARC-Protocol header, as well as the WARC-TLS-Cipher-Suite header mentioned/proposed by @ato here, should appear on.

  • Both a request and a response can travel on top of a TLS connection, so presumably these headers could appear on both the request and response records. But should they?
  • A client cannot change the TLS version or cipher suite between a request and a response, so the header values would be identical for request/response record pairs. Including it on both seems like needless duplication, especially if the records are linked with a WARC-Concurrent-To.

The most similar, already defined header I could think of to this is WARC-IP-Address. Section 5.10 of the 1.1 spec says "the numeric Internet address contacted to retrieve any included content" and can be associated with request and response records. But all the examples in the spec only show the WARC-IP-Address header on response records, and I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

(Which is kind of weird if you think about it from an order-of-operations perspective. The IP address of the system must be known before the request is made, so it's odd that the convention is to include the WARC-IP-Address header on response instead of the request.)

It feels like the WARC-Protocol and WARC-TLS-Cipher-Suite headers should go where the WARC-IP-Address header goes, but I really am curious about the community's feedback.

@JustAnotherArchivist

I haven't ever encountered any WARCs in the real world that use WARC-IP-Address on the request records.

Here are some tools that do: wget, wpull, qwarc, Zeno, warcio (at least when using warcio.capture_http). I'm sure there are more. Heritrix and warcprox don't. If you want some real-world example WARCs, the ArchiveTeam collection on the Internet Archive is full of them.

I think that they should be allowed on both request and response records. As for why you might want to record it on the request record: consider the case where you send a request but never receive a response. It is still worth recording this attempted request (and note the lack of a response in the log accompanying the crawl), including the relevant details like IP and protocol.

@ato
Member Author

ato commented May 31, 2023

I just realized I missed the 'revisit' record type in the WARC-Protocol proposal, so have edited it to be included. After this edit WARC-Protocol is allowed on the same record types as WARC-IP-Address (‘response’, ‘resource’, ‘request’, ‘metadata’, and ‘revisit’).

Some reasons for allowing it on multiple record types:

  • In some cases the request and response may use different protocol versions. (e.g. http/1.0 vs http/1.1)
  • You may have information about the protocol that was used but not have the actual request or response message. This can occur for example when converting to WARC from another format or due to tool limitations (e.g. in-browser archiving).

it's odd that the convention is to include the WARC-IP-Address header on response instead of the request

It's likely because:

  1. The older ARC file format did not store the request but did store the IP address.
  2. Before the advent of browser-based crawling, request records were usually completely ignored and not indexed for replay. So if you're going to put it in just one record then choosing the response record would make it more easily accessible to replay tools.

@acidus99

Excellent, thanks for the context. I ended up including them on both request and response records.

@ikreymer
Member

After proposing in #52 that WARC-Software-Version follow the format of HTTP User-Agent, I find myself thinking that WARC-Protocol, as it is also a list of version numbers, should be consistent with it. I keep going back and forth on it as I think there are arguments either way.

It seems like this hasn't been decided one way or another, but I would very much be in favor of a single field, as that makes representing WARC headers as a dictionary object much easier and more concise. Are there other WARC headers that allow repetition currently?

The repeatable Set-Cookie and Link HTTP headers require special parsing, but also have custom semantics that make it sensible to keep them separate. As this is a much simpler header, I think a comma-separated value list makes a lot of sense, in line with other headers like Accept*, Vary, etc.

@ato
Member Author

ato commented Jul 13, 2024

Are there other WARC headers that allow repetition currently?

WARC-Concurrent-To is the only one in the standard:

As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same WARC record.

The only other standard headers that would seem to make sense to repeat are the payload/block digest headers for different algorithms. But that's not allowed currently.

Repetition of extension headers was also discussed in #95. I haven't seen any other extension headers in the wild that use repetition or comma separated lists so far.

It's not WARC record headers but Heritrix uses repeated fields in application/warc-field metadata records to record extracted links.

@wumpus

wumpus commented Jul 16, 2024

I'm in favor of a single field, comma-separated.

Note that the clock has pretty much ticked out on this discussion... the minute that a large web player starts discriminating against crawling with http/1.1 and less so against crawling with http/2, we have to switch immediately.

sebastian-nagel added a commit to commoncrawl/nutch that referenced this issue Jul 18, 2024
- HTTP headers: replace HTTP/2 and alike by HTTP/1.1 to
  ensure backward-compatibility for WARC readers, see
   iipc/warc-specifications#15
- store protocol versions and cipher suites in WARC headers
  WARC-Protocol and WARC-Cipher-Suite, see
   iipc/warc-specifications#42
   iipc/warc-specifications#86
- allow multiple WARC headers of the same name (WARC-Protocol
  may occur twice to hold the HTTP and TLS version)
@ikreymer
Member

ikreymer commented Nov 7, 2024

To clarify, are multiple headers required, eg:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0

or is:

WARC-Protocol: http/1.1, tls/1.0

also acceptable?

The former seems to introduce unnecessary complexity, as we can't use a regular map to store the values, and we would very much prefer to use the latter. Could we clarify that the above are equivalent?

We'd prefer to use the latter in Browsertrix Crawler, but if others feel strongly that this should not be supported, can do the former, reluctantly.

@ato
Member Author

ato commented Nov 11, 2024

The WARC-Concurrent-To header already accommodates multiple values by repeating the header, per section 5.7:

As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same WARC record.

This means that storing WARC headers using a simple string-to-string map is already insufficient, since WARC-Concurrent-To header can have multiple values represented by repeating the header name. Since header order isn't important you can use a regular hash or tree map, but it has to be a string to list of strings map (e.g. Map<String,List<String>>) not a string to string map (Map<String,String>).
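
The same string-to-list-of-strings idea can be sketched in Python as below. The function name `parse_warc_fields` is hypothetical, and the sketch makes simplifying assumptions: it treats field names as case-insensitive by lowercasing them, skips any line without a colon (such as the version line), and does not handle folded continuation lines.

```python
from collections import defaultdict


def parse_warc_fields(header_text):
    """Parse 'Name: value' lines into a dict of lowercased name -> list of values."""
    fields = defaultdict(list)
    for line in header_text.splitlines():
        if ":" not in line:
            continue  # skips blank lines and the "WARC/1.1" version line
        # Split on the first colon only, so URIs in the value stay intact.
        name, _, value = line.partition(":")
        fields[name.strip().lower()].append(value.strip())
    return dict(fields)


example = (
    "WARC/1.1\r\n"
    "WARC-Type: response\r\n"
    "WARC-Protocol: http/1.1\r\n"
    "WARC-Protocol: tls/1.0\r\n"
)
fields = parse_warc_fields(example)
```

Every field maps to a list, so single-valued fields such as WARC-Type simply hold a one-element list and repeated fields need no special casing.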

If we want to enable using a string to string map, we could consider introducing a comma concatenation rule like HTTP has and update WARC-Concurrent-To to allow comma separated values. However, commas can appear in URIs and thus in WARC-Concurrent-To values, so the splitting would need to be more complicated to disambiguate whether a comma is part of the URI, such as by splitting on , outside of < and > pairs. Also, I'm wary that comma concatenation has been a frequent source of security issues in HTTP implementations.
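
For illustration only, the extra splitting logic such a rule would require might look like the Python sketch below. No comma concatenation rule exists for WARC fields today, so the function name `split_outside_angle_brackets` and its behaviour are hypothetical.

```python
def split_outside_angle_brackets(value):
    """Split on commas that are not enclosed in a <...> pair."""
    parts, current, depth = [], [], 0
    for ch in value:
        if ch == "<":
            depth += 1
        elif ch == ">" and depth > 0:
            depth -= 1
        if ch == "," and depth == 0:
            # A top-level comma ends the current value.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    # Drop empty entries left by leading, trailing, or doubled commas.
    return [part for part in parts if part]
```

Even this small function has to decide what to do with unbalanced brackets and empty entries, which hints at why comma concatenation tends to produce parsing differences between implementations.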

Consequently my position is preferring repeated headers over commas for consistency with WARC-Concurrent-To. However if everyone else prefers commas I won't stand in the way of it.

@ibnesayeed
Contributor

My personal thoughts are to support not just the single entry, but a strict ordering as well, and change the grammar accordingly.

WARC-Protocol = "WARC-Protocol" ":" application-protocol-id [ "," transport-protocol-id ]

If we anticipate a situation where we would record the transport layer data without any associated application layer protocol, then we can require that at least one of the two values be present. If both are present (or if we can think of any other reasonable combinations of more than two of the available IDs) then we can further categorize them and propose a strict hierarchy, making it easier to parse.

@CorentinB

CorentinB commented Nov 12, 2024

Hi, my personal preference goes to multiple headers instead of comma-separated values. It's easier to parse and WARC-Concurrent-To already allows repetition of headers, so for me it just naturally makes sense to apply it to other headers if needed.

Edit: to be clear, I think there MIGHT be a way to fit it in 1 header. But I think it should not be comma-separated. We can come up with a different pattern or use multiple headers. (which I prefer personally)

@equals215

To my knowledge no other field in the WARC spec has such comma-separated values, so using the same header multiple times makes more sense.

@ato
Member Author

ato commented Nov 12, 2024

Regarding @ibnesayeed's idea of separating application and transport protocol ids:

WARC-Protocol = "WARC-Protocol" ":" application-protocol-id [ "," transport-protocol-id ]

If we're going to do that, I think it would be simpler to define separate headers for them:

WARC-Protocol  = "WARC-Protocol" ":" application-protocol-id
WARC-Transport = "WARC-Transport" ":" transport-protocol-id

It is conceivable to have more than two levels of nesting though. One common example perhaps being DNS over HTTPS and a more exotic one being HTTP over TLS over SCTP. I could well be overthinking things though and maybe two is enough in practice.

@willmhowes

willmhowes commented Nov 12, 2024

@ikreymer

Per @ato's above message regarding the insufficiency of a string-to-string map already, I do not agree with your claim that multiple headers would "introduce unnecessary complexity."

Furthermore, I'm curious about webrecorder's current support for multiple headers given that WARC-Concurrent-To can already have multiple records, but your warcio.js library appears to lack support, per this issue. Do you know if this shortcoming is present in other webrecorder packages? Full disclosure: a similar flaw was discovered in Zeno recently, and a fix is in the works over the next few weeks.

Something like @ibnesayeed proposed with a well-defined syntax could work (similar to User Agent, as brought up by @ato above). But I agree with @equals215 that a comma-separated header introduces its own version of a special case that requires additional consideration for reasons of parsing and security, whereas multiple records have precedent and should already be supported by most WARC tools.

I therefore favor multiple headers for reasons of precedent and my interpretation of KISS principles in this situation.

@ibnesayeed
Contributor

Regarding @ibnesayeed's idea of separating application and transport protocol ids:

WARC-Protocol = "WARC-Protocol" ":" application-protocol-id [ "," transport-protocol-id ]

If we're going to do that, I think it would be simpler to define separate headers for them:

WARC-Protocol  = "WARC-Protocol" ":" application-protocol-id
WARC-Transport = "WARC-Transport" ":" transport-protocol-id

It is conceivable to have more than two levels of nesting though. One common example perhaps being DNS over HTTPS and a more exotic one being HTTP over TLS over SCTP. I could well be overthinking things though and maybe two is enough in practice.

I would fully endorse purpose-specific separate headers for various categories of protocols, which would be simpler to parse and would allow easier extension in the future. Comma-separated values with only a single key would be my second choice, especially, if there can be cases where multiple protocols of the same category are present. Repeated headers should be avoided as much as possible, even though historically they have been present in both WARC and HTTP worlds. I can say a lot more on this in a separate post.

@ato
Member Author

ato commented Nov 12, 2024

Updating my position to ranked vote: purpose-specific separate headers, repeated headers, application-comma-transport, commas.

On further reflection I find the purpose-specific separate headers option compelling because:

  • it sidesteps the multiple values disagreement
  • readers can pick the layer they're interested in
  • no additional parsing
  • no potential for order confusion
  • additional headers can be defined for stuff like SCTP or "something over HTTP" when someone actually has a use case for them

Repeated headers should be avoided as much as possible, even though historically they have been present in both WARC and HTTP worlds. I can say a lot more on this in a separate post.

Admittedly, I have seen over and over again with both WARC and HTTP someone make the incorrect assumption that they can't be repeated and later have to retrofit them in awkwardly. For example, somehow the Chrome Devtools Protocol has ended up with four different ways of representing HTTP headers as JSON (array of name/value objects, object with null separated string values, object with comma separated string values, base64 encoded binary string with null separated lines like "a: 1\0a: 2\0b: foo").

I guess the fact so many people overlook repeated headers may be reason enough to avoid them. Although I think quoting/escaping issues with commas is probably just as often a problem. I guess many implementations also silently (and sometimes dangerously) ignore unexpected repeat headers whereas unexpected commas are more likely to throw an error (usually safer).

@ikreymer
Member

I would also vote for purpose-specific headers being the best option. Upon reflection, I agree that comma-separation is not the best option for the reasons mentioned (new semantics, errors in parsing). If a single header must be used to store multiple values, probably the best option is to use a well-defined format like JSON for the value so that there's no confusion about parsing.

Obviously, too late to change WARC-Concurrent-To but, I would strongly support not adding any more repeated headers, as they are just prone to errors, as @ato mentioned above. (I believe HTTP has stopped adding repeated headers after Set-Cookie).

I think the WARC-Protocol and WARC-Transport distinction definitely makes sense, because they convey different things about the rest of the WARC content.

For example, knowing that WARC-Protocol is h2 might imply other things about the HTTP content (certain HTTP headers may be present, the headers are case-insensitive, etc.) while knowing that WARC-Transport is tls/1.2 or tls/1.3 is generally informational only, since TLS transport data is not included in WARCs at all.

Would be in favor of these two headers, or other suggestions!

@acidus99

acidus99 commented Nov 12, 2024

re: distinct WARC-Protocol and WARC-Transport headers. Protocols like QUIC/H3 are blending those concepts anyway, so not sure if the distinction is valuable. Also different people may disagree on what info to put where leading to ambiguity and having to treat them both as the same thing anyway, defeating the concept.

I also think strict ordering of headers to show "nesting" places requirements on header order that are not currently present in the WARC spec (or in existing tools that don't explicitly preserve order). I haven't heard anyone share a clear use case for accurately recording the nesting of protocols, or a case where the nesting could not be determined from the listed protocols themselves, so this seems to be a lot of added complexity for little gain. But I could just be missing something, so let me know.

I think the simplicity of a generic WARC-Protocol header, which through some mechanism can have multiple values, provides a lot of value. It also lets the WARC author decide how deep or high into the protocol stack they want to go:

Only show the transport protocol, because the application protocol is defined by the WARC-Target-URI header? Fine:

WARC-Protocol: tls/1.3

Application and transport protocol together? Great:

WARC-Protocol: gemini, tls/1.3

Application protocol by itself? Sure, it could be redundant, but be bold WARC author!

WARC-Protocol: gopher

Deeper? Go for it!

WARC-Protocol: http/1.1, tls/1.0, tcp

or

WARC-Protocol: h3, udp

What about a "higher" protocol on top of HTTP, as @ato mentioned above? Sure, go for it:

WARC-Protocol: WebDAV, h2, tls/1.3, tcp

Let's make something simple and flexible. Personally I prefer multiple headers, since there is precedent for that with WARC-Concurrent-To, and as @ato mentioned above, header concatenation is the source of many security problems in HTTP. But regardless of the mechanism for multiple values, I think a single defined header, without worrying about nesting order, is a good balance.

ikreymer added a commit to webrecorder/warcio.js that referenced this issue Nov 14, 2024
- add new HeadersMultiMap to support set-cookie, as well as
warc-concurrent-to headers with multiple values (store internally as
map, convert to array for multi value headers, override iterator)
- update tests to check for multiple warc-concurrent-to
- also ensure multiple Set-Cookie works with case sensitive headers
- fixes #32
- ready to support warc-protocol from
iipc/warc-specifications#42
@ikreymer
Member

Furthermore, I'm curious about webrecorder's current support for multiple headers given that WARC-Concurrent-To can already have multiple records, but your warcio.js library appears to lack support, per this issue. Do you know if this shortcoming is present in other webrecorder packages? Full disclosure: a similar flaw was discovered in Zeno recently, and a fix is in the works over the next few weeks.

@willmhowes This is now fixed as of warcio.js 2.4.0 - thanks for the reminder!
We haven't really used multiple WARC-Concurrent-To headers and no other library was affected, but of course they should be supported for accuracy. Also added support for WARC-Protocol as it seems like people are using the multiple header version already, so we'll support parsing it.

@acidus99 Sounds like you're arguing for supporting as many variations as possible, including commas and multiple headers, so effectively there are many permutations of the same thing, eg:

WARC-Protocol: http/1.1, tls/1.0, tcp

could also be written as:

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0, tcp

or

WARC-Protocol: http/1.1, tls/1.0
WARC-Protocol: tcp

or

WARC-Protocol: http/1.1
WARC-Protocol: tls/1.0
WARC-Protocol: tcp

Having multiple ways of specifying the same thing seems like the wrong approach imo. A more concise approach with distinct headers seems to make the most sense.
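
To show what accepting every permutation would cost a reader, here is a hedged Python sketch of a tolerant parser that flattens repeated fields and comma-separated values into one list. It is purely illustrative: the function name `protocol_values` is hypothetical, and nothing in the proposal says the forms above are equivalent.

```python
def protocol_values(header_lines):
    """Collect WARC-Protocol values from repeated and/or comma-separated fields."""
    values = []
    for line in header_lines:
        name, _, raw = line.partition(":")
        if name.strip().lower() != "warc-protocol":
            continue  # ignore every other field
        # One field may carry several values, so split and trim each one.
        values.extend(part.strip() for part in raw.split(",") if part.strip())
    return values


# The four spellings listed above, each as a list of header lines.
forms = [
    ["WARC-Protocol: http/1.1, tls/1.0, tcp"],
    ["WARC-Protocol: http/1.1", "WARC-Protocol: tls/1.0, tcp"],
    ["WARC-Protocol: http/1.1, tls/1.0", "WARC-Protocol: tcp"],
    ["WARC-Protocol: http/1.1", "WARC-Protocol: tls/1.0", "WARC-Protocol: tcp"],
]
normalized = [protocol_values(form) for form in forms]
```

All four spellings collapse to the same three values, but only because the reader handles both mechanisms at once; allowing a single mechanism would remove that burden.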

I think it should be clarified if the purpose of this field is entirely informational and does not imply anything else about the rest of the WARC data. As mentioned above, the HTTP protocol version does affect the semantics of the HTTP request/response data found in the WARC, while the transport protocol does not.

For example, does specifying WARC-Protocol: h2 imply that certain H2-specific properties, like pseudo-headers, may be found in the HTTP headers? Or does it still presume the data is entirely HTTP/1.1 compliant (current assumption)? Should pseudo-headers be allowed/expected if and only if H2+ is set as protocol?

There is no current way to store binary H2 / H3 data in WARCs, so it must be converted to HTTP/1.1, and my understanding was that this header would be used to indicate this. An indicator of H2/H3 -> HTTP/1.1 conversion seems more critical from a semantic perspective than information-only application / transport protocol info, which is 'nice to have' additional metadata.

@ato
Member Author

ato commented Nov 18, 2024

[acidus99] re: distinct WARC-Protocol and WARC-Transport headers. Protocols like QUIC/H3 are blending those concepts anyway, so not sure if the distinction is valuable.

Good point, we haven't considered QUIC and should.

So HTTP/3 uses QUIC as a transport layer protocol, and QUIC also incorporates TLS, although using its own framing over UDP rather than DTLS or the TLS over TCP record protocol. A strict layer ordering between QUIC and TLS does not seem obvious to me, although I guess we could just define an ordering if we want to.

There is also already a QUIC 2 which explicitly can be used under the "h3" or "doq" (DNS over QUIC) protocols. While QUIC requires TLS 1.3 or newer and there isn't a TLS 1.4 yet, presumably there will be at some point. So you may wish to record the versions of TLS and QUIC used separately. So either we hit the same problem with WARC-Transport needing multiple values or we'd need separate headers for TLS and QUIC.

Under the current repeated-header form of the proposal you could use:

WARC-Protocol: h3
WARC-Protocol: quic/2
WARC-Protocol: tls/1.3

I'll add quic/1 and quic/2 to the allowed values in the proposal.

[ikreymer] I think it should be clarified if the purpose of this field is entirely informational and does not imply anything else about the rest of the WARC data.

I think the WARC-Protocol header should merely record the protocols used on the wire and be independent from the format of the content block. If we define a standard way to record a binary h2 or h3 message or an extended text format that can represent pseudo-headers, then I think that should be indicated by a new value for the Content-Type header not by the WARC-Protocol header. I'll add a sentence to the proposal clarifying that.

@ikreymer
Member

I think the WARC-Protocol header should merely record the protocols used on the wire and be independent from the format of the content block. If we define a standard way to record a binary h2 or h3 message or an extended text format that can represent pseudo-headers, then I think that should be indicated by a new value for the Content-Type header not by the WARC-Protocol header. I'll add a sentence to the proposal clarifying that.

So that'd be amending application/http; msgtype=response to something like application/http+h2; msgtype=response? I can see that making sense, but also have mixed feelings about this - there's not a good way to specify version, and seems like potential duplication and conflict with WARC-Protocol header. Maybe this discussion should be moved to its own issue, though.

@ato
Member Author

ato commented Nov 18, 2024

OK, the week deadline I initially set has passed, so I'll recap where we are. On the original binary question about repeated headers vs comma separated list there was more support amongst IIPC members for repeated headers. So by the original process and deadline I suggested the proposal remains as is.

However, Sawood made a new suggestion about separating the application and transport protocol ids. He suggested using fixed order subfields within WARC-Protocol but from that a fourth option naturally followed of moving the transport protocol to a separate header such as WARC-Transport. There was some support expressed for that as a way to avoid the need for multi valued headers entirely, although only a few members have expressed an opinion either way on the new options. acidus99 pointed out though that H3/QUIC/TLS already wouldn't fit well with those options, at least as they've been defined so far.

Given the QUIC problem, my personal preference has moved back to repeated headers while acknowledging they're a gotcha because they can be easily overlooked by implementers. Maybe we can alleviate that a little bit with a community annotation and/or additional explanatory text in the next version of the standard.

To keep things moving while also giving the newer ideas a chance to be properly considered, as they weren't in my initial request for comments, I'll set another deadline, this time 3 days from now. Any IIPC member can call to extend that to 1 week if they need a few more days to make an important point of argument or to draft a new option (e.g. a 3 header version that accounts for QUIC). Otherwise if the deadline is reached without a call for delay or call for a vote on a new set of options we'll consider this question as decided in favor of repeated headers and implementers can go ahead on that basis.

@acidus99

acidus99 commented Nov 18, 2024

@ikreymer To be clear, I was/am arguing against distinct WARC-Protocol and WARC-Transport headers. How to specify multiple values is a different matter. I was using the "Single header, comma separated list of values" approach just as an example. I would be for something simple that doesn't support different permutations which could lead to errors or security issues.

@ato, while I'm not an IIPC member, I would be in favor of "multiple WARC-Protocol headers each with a single value" as the best choice to implement multiple values in this case. As in (copying your explicit example):

WARC-Protocol: h3
WARC-Protocol: quic/2
WARC-Protocol: tls/1.3

The main reasons are:

  1. It aligns with the approach used by WARC-Concurrent-To, which is the only other multi-value header in the specification.
  2. It is simple and unambiguous.

More generally, I would suggest this "multiple duplicate headers each with a single value" pattern should be the preferred approach for situations in the future with headers that can have multiple values.

The only downside I can think of to this approach is that order is not preserved, since header order isn't defined as meaningful in the WARC spec, but that seems a reasonable tradeoff.
