Add implementation of `MultipartFormData` streaming encoder #200

andreiltd · 2025-01-21T14:21:31Z

I did a quick comparison of the encoded FormData between SM and Chrome using the following JS code:

Simple FormData

async function readStream(stream) {
 const reader = stream.getReader();
 const chunks = [];

 while (true) {
   const { done, value } = await reader.read();
   if (done) {
     break;
   }
   chunks.push(value);
 }

 let totalLen = 0;
 for (const chunk of chunks) {
   totalLen += chunk.length;
 }

 const result = new Uint8Array(totalLen);

 let offset = 0;
 for (const chunk of chunks) {
   result.set(chunk, offset);
   offset += chunk.length;
 }

 return result;
}

async function main() {
 const form = new FormData();
 form.append('field1', 'value1');
 form.append('field2', 'value2');

 const file = new File(['Hello World!'], 'dummy.txt', { type: 'foo' });
 form.append('file1', file);

 const req = new Request('https://example.com', {
   method: 'POST',
   body: form,
 });

 const type = req.headers.get('Content-Type');
 const boundary = type.split('boundary=')[1];

 // assert(type.startsWith('multipart/form-data; boundary='), "Content-Type is multipart/form-data");
 const data = await readStream(req.body);
 const dec = new TextDecoder();
 const bodyStr = dec.decode(data);

 console.log(bodyStr);
}

main();

The results are:

Chrome:

------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="field1"

value1
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="field2"

value2
------WebKitFormBoundaryhpShnP1JqrBTVTnC
Content-Disposition: form-data; name="file1"; filename="dummy.txt"
Content-Type: foo

Hello World!
------WebKitFormBoundaryhpShnP1JqrBTVTnC--

SM:

----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="field1"

value1
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="field2"

value2
----BoundaryjXo5N4HEAXWcKrw7
Content-Disposition: form-data; name="file1"; filename="dummy.txt"
Content-Type: foo

Hello World!
----BoundaryjXo5N4HEAXWcKrw7--

I also did some manual fuzzing of the streaming logic by varying the size of the buffer the encoder writes into. Specifically, I tested buffer sizes of 1, 2, 4, 8, 32, 1024, and 8192 bytes by changing the size in the implementation. Having a unit test framework for this would be nice, though.
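A rough sketch of how that buffer-size check could be automated. It uses a toy stand-in for the encoder, not the code in this PR: `ToyChunkedEncoder`, `encode_with_buffer_size` and `output_is_buffer_size_independent` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the encoder: emits `parts` in order into
// caller-provided buffers of limited capacity and carries over whatever
// did not fit into the next call.
class ToyChunkedEncoder {
public:
  explicit ToyChunkedEncoder(std::vector<std::string> parts) : parts_(std::move(parts)) {}

  // Writes up to `capacity` bytes into `out` and returns the number written.
  // Returns 0 once everything has been emitted (or if `capacity` is 0).
  size_t read(char *out, size_t capacity) {
    size_t written = 0;
    while (written < capacity) {
      if (remainder_.empty()) {
        if (next_ == parts_.size()) {
          break; // nothing left to emit
        }
        remainder_ = parts_[next_++];
      }
      const size_t n = std::min(capacity - written, remainder_.size());
      std::copy_n(remainder_.data(), n, out + written);
      remainder_.erase(0, n);
      written += n;
    }
    return written;
  }

private:
  std::vector<std::string> parts_;
  std::string remainder_;
  size_t next_ = 0;
};

// Drains the toy encoder through a buffer of `buffer_size` bytes.
std::string encode_with_buffer_size(const std::vector<std::string> &parts, size_t buffer_size) {
  ToyChunkedEncoder enc(parts);
  std::vector<char> buf(buffer_size);
  std::string result;
  for (;;) {
    const size_t n = enc.read(buf.data(), buf.size());
    if (n == 0) {
      break;
    }
    result.append(buf.data(), n);
  }
  return result;
}

// True if every buffer size from the list above yields the same bytes.
bool output_is_buffer_size_independent(const std::vector<std::string> &parts) {
  std::string expected;
  for (const auto &part : parts) {
    expected += part;
  }
  const size_t sizes[] = {1, 2, 4, 8, 32, 1024, 8192};
  for (size_t size : sizes) {
    if (encode_with_buffer_size(parts, size) != expected) {
      return false;
    }
  }
  return true;
}
```

With a real unit test setup, the toy class would be swapped for the actual encoder and the expected bytes taken from a known-good encoding.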

Relevant RFC: https://www.rfc-editor.org/rfc/rfc2046#section-5.1.1

andreiltd · 2025-01-22T14:04:46Z

~~It looks like the CI failure is unrelated to this PR (I've observed the same failure on the main branch). This is the problematic test:~~

    const response = await fetch("https://http-me.glitch.me/meow?header=cat:é");
    strictEqual(response.headers.get('cat'), "é");

tschneidereit · 2025-01-24T14:11:59Z

@andreiltd, is this ready to review, or are you still expecting to make changes to it?

andreiltd · 2025-01-24T16:38:53Z

Yes, this is ready to review :)

tschneidereit

Either I'm missing something in looking at the diff, or there is some code missing—see the inline comment on form-data-encoder.h.

One thing that I can't fully evaluate, because it'll presumably be part of the implementation of encode_stream: IIUC, field names will always first have their newlines normalized to CRLF, and then have those escaped. It'd be good to fold those into one operation, if possible.
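To illustrate what folding the two steps could look like, here is a minimal sketch. It assumes the escaping rules quoted in the diff (`%0A`, `%0D`, `%22`); the function name `normalize_and_escape_name` is hypothetical, and the sketch ignores allocation-failure handling entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not from the PR): normalizes newlines to CRLF and
// percent-escapes CR, LF and '"' in a single pass over a field name.
std::string normalize_and_escape_name(std::string_view name) {
  std::string out;
  out.reserve(name.size());
  for (size_t i = 0; i < name.size(); ++i) {
    const char c = name[i];
    if (c == '\r' || c == '\n') {
      // A lone CR, a lone LF, or a CRLF pair all normalize to CRLF,
      // which then escapes to "%0D%0A".
      out += "%0D%0A";
      if (c == '\r' && i + 1 < name.size() && name[i + 1] == '\n') {
        ++i; // consume the LF of an existing CRLF pair
      }
    } else if (c == '"') {
      out += "%22";
    } else {
      out += c;
    }
  }
  return out;
}
```

Because every newline form ends up as the same six-byte sequence, no intermediate normalized string is needed.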

Otherwise, I left a few comments and suggestions. I'm a bit concerned about allocation and general failure handling, so it'd be good to go over those aspects in some detail.

tschneidereit · 2025-01-25T15:53:52Z

builtins/web/form-data/form-data-encoder.cpp

+      if (i + 1 < src.size() && src[i + 1] == LF) {
+        len += newline_len;
+        i++;
+      } else {
+        len += newline_len;
+      }


Suggested change

-      if (i + 1 < src.size() && src[i + 1] == LF) {
-        len += newline_len;
-        i++;
-      } else {
-        len += newline_len;
-      }
+      len += newline_len;
+      if (i + 1 < src.size() && src[i + 1] == LF) {
+        i++;
+      }

tschneidereit · 2025-01-25T15:56:40Z

builtins/web/form-data/form-data-encoder.cpp

+const char CR = '\r';
+const char *CRLF = "\r\n";
+
+size_t compute_normalized_len(std::string_view src, const char *newline) {


It seems like this is only ever called with CRLF as the input for newline. Given that, does it make sense to remove the newline parameter and hardcode use of CRLF instead? Chances are LLVM emits the same code anyway, but it'd be better not to have to rely on that.
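A sketch of what the hardcoded variant might look like. The function name matches the diff, but the body is an assumption: the PR's actual loop is only partially visible here, and `kCrlfLen` is a hypothetical constant.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Length of `src` after every lone CR, lone LF, or CRLF pair has been
// normalized to CRLF. Sketch only; not the PR's actual implementation.
size_t compute_normalized_len(std::string_view src) {
  constexpr size_t kCrlfLen = 2; // hypothetical constant: strlen("\r\n")
  size_t len = 0;
  for (size_t i = 0; i < src.size(); ++i) {
    if (src[i] == '\r') {
      len += kCrlfLen;
      if (i + 1 < src.size() && src[i + 1] == '\n') {
        ++i; // CRLF pair already normalized; skip its LF
      }
    } else if (src[i] == '\n') {
      len += kCrlfLen;
    } else {
      ++len;
    }
  }
  return len;
}
```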

tschneidereit · 2025-01-25T16:13:55Z

builtins/web/form-data/form-data-encoder.cpp

+// `%0A`, 0x0D (CR) with `%0D` and 0x22 (") with `%22`.
+//
+// https://html.spec.whatwg.org/multipage/form-control-infrastructure.html#multipart-form-data
+std::optional<std::string> escape_newlines(std::string_view str) {


Since this isn't just for newlines, it'd be good to give it a different name. Perhaps escape_name, given that this applies escaping of a set of characters in names?

tschneidereit · 2025-01-25T16:24:03Z

builtins/web/form-data/form-data-encoder.h

+  static size_t query_length(JSContext *cx, HandleObject self);
+  static JSObject *encode_stream(JSContext *cx, HandleObject self);
+  static JSObject *create(JSContext *cx, HandleObject form_data);


The implementation of these seems to be missing? (And in the case of query_length I also can't find any uses.)

tschneidereit · 2025-01-25T16:28:17Z

builtins/web/form-data/form-data-encoder.cpp

+bool MultipartFormDataImpl::handle_entry_header(JSContext *cx, StreamContext &stream) {
+  auto entry = stream.entries->begin()[chunk_idx_];
+  auto header = fmt::memory_buffer();
+  auto name = escape_newlines(entry.name).value();


I think this needs a null check?
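For illustration, a minimal sketch of checking the `std::optional` before using it. Everything here is hypothetical: this `escape_name` takes a `max_len` cap purely as a stand-in for a real failure such as running out of memory, and `build_entry_header` is not the PR's `handle_entry_header`.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical fallible escaper. The length cap only simulates a failure
// condition (the real code would fail on allocation, for example).
std::optional<std::string> escape_name(std::string_view name, size_t max_len) {
  if (name.size() > max_len) {
    return std::nullopt;
  }
  std::string out;
  for (char c : name) {
    if (c == '\n') {
      out += "%0A";
    } else if (c == '\r') {
      out += "%0D";
    } else if (c == '"') {
      out += "%22";
    } else {
      out += c;
    }
  }
  return out;
}

// Builds the Content-Disposition line, or returns nullopt if escaping failed.
// The optional is checked before being dereferenced instead of calling
// .value() on it unconditionally.
std::optional<std::string> build_entry_header(std::string_view name, size_t max_len) {
  auto escaped = escape_name(name, max_len);
  if (!escaped) {
    return std::nullopt; // propagate the failure; run no further fallible code
  }
  std::string header = "Content-Disposition: form-data; name=\"";
  header += *escaped;
  header += "\"\r\n";
  return header;
}
```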

tschneidereit · 2025-01-25T16:30:05Z

builtins/web/form-data/form-data-encoder.cpp

+    RootedString type_str(cx, Blob::type(obj));
+    auto type = core::encode(cx, type_str);
+
+    if (!filename || !type) {


The filename check needs to happen right after the call to escape_newlines: if that operation errors, we don't want to run any more fallible code.

tschneidereit · 2025-01-25T16:32:59Z

builtins/web/form-data/form-data-encoder.cpp

+  // Hex encode bytes to string
+  auto bytes = std::move(res.unwrap());
+  auto bytes_str = std::string_view((char *)(bytes.ptr.get()), bytes.size());
+  auto base64_str = base64::forgivingBase64Encode(bytes_str, base64::base64EncodeTable);


Does this need a null check?

tschneidereit · 2025-01-25T16:34:42Z

builtins/web/form-data/form-data-encoder.cpp

+    return outbuf.size() - read;
+  }
+
+  template <typename I> size_t write(I first, I last) {


IIUC, this method is deliberately infallible? If so, could you document that fact?

tschneidereit · 2025-01-25T16:36:55Z

builtins/web/form-data/form-data-encoder.cpp

+
+  bool is_draining() { return (file_leftovers_ || remainder_.size()); };
+
+  template <typename I> size_t write_and_cache_remainder(StreamContext &stream, I first, I last);


The return value isn't ever used, it seems. Maybe remove it?

tschneidereit · 2025-01-25T16:43:11Z

builtins/web/form-data/form-data-encoder.cpp

+  auto leftover = datasz - written;
+  if (leftover > 0) {
+    MOZ_ASSERT(remainder_.empty());
+    remainder_.assign(first + written, last);


Is this an infallible operation? It seems like it should need to allocate memory, so I'm not sure I understand how that works. And ideally, if it does allocate, we should make that fallible—or at least assert that it didn't fail and abort execution.
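One possible shape for a fallible version, sketched in plain standard C++ under the assumption that the remainder is a byte buffer. `RemainderBuffer` is a hypothetical class, not a type from this PR or from SpiderMonkey. It uses `new (std::nothrow)` so that an allocation failure surfaces as a `false` return rather than an abort or exception.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>
#include <utility>

// Hypothetical remainder cache whose assign() reports allocation failure.
class RemainderBuffer {
public:
  // Copies [first, last) into a fresh allocation. Returns false if the
  // allocation fails, leaving the previous contents untouched.
  bool assign(const uint8_t *first, const uint8_t *last) {
    const size_t n = static_cast<size_t>(last - first);
    std::unique_ptr<uint8_t[]> buf(new (std::nothrow) uint8_t[n]);
    if (!buf) {
      return false;
    }
    if (n > 0) {
      std::memcpy(buf.get(), first, n);
    }
    data_ = std::move(buf);
    size_ = n;
    return true;
  }

  bool empty() const { return size_ == 0; }
  size_t size() const { return size_; }
  const uint8_t *data() const { return data_.get(); }

private:
  std::unique_ptr<uint8_t[]> data_;
  size_t size_ = 0;
};
```

A caller would then check the result, for example `if (!remainder.assign(first + written, last)) { /* report OOM */ }`, or assert on it if aborting is the intended policy.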

andreiltd marked this pull request as draft January 21, 2025 14:21

andreiltd force-pushed the form-data-request branch from 591475c to 1c57fa0 Compare January 22, 2025 13:17

andreiltd marked this pull request as ready for review January 22, 2025 14:02

andreiltd mentioned this pull request Jan 22, 2025

Add just recipe for serving the js script #202

Merged

Add FormData streaming encoder

655b57d

andreiltd force-pushed the form-data-request branch from 1c57fa0 to 655b57d Compare January 23, 2025 08:04

andreiltd added 3 commits January 23, 2025 10:46

Fix setting content type and use base64 for boundary encoding

493a46c

Update wpt expectations

0243ac2

Remove unused header

58e0404

tschneidereit requested changes Jan 25, 2025
