Use serde instead of hand-rolled serialization, take 2 #59
base: main
I guess the package version should be bumped too. Totally forgot about that.
Huh, it's surprising we no longer see the performance difference from that previous change.
Almost certainly yes.
I'm mildly positive on this. Would you try bincode instead of cbor? It feels silly to have names for the fields here given we version the whole file.
    File(String),

    #[serde(rename = "b")]
    Build {
I never knew you could use named fields in an enum like this! TIL
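For readers who have not seen struct-like enum variants before, here is a minimal stdlib-only sketch. The variant names `File` and `Build` come from the diff above; the field names `ins` and `outs` and the `describe` helper are assumptions made purely for illustration, and the serde attributes are left out so the example compiles without dependencies.

```rust
// Hypothetical record type loosely modeled on the diff above; the field
// names `ins` and `outs` are assumptions made for illustration.
#[derive(Debug, PartialEq)]
enum Record {
    // Tuple variant: a single unnamed field.
    File(String),
    // Struct-like variant: fields are named, as in a regular struct.
    Build { ins: Vec<u32>, outs: Vec<u32> },
}

// Describe a record by pattern matching; named fields are destructured by name.
fn describe(record: &Record) -> String {
    match record {
        Record::File(name) => format!("file {}", name),
        Record::Build { ins, outs } => {
            format!("build with {} inputs and {} outputs", ins.len(), outs.len())
        }
    }
}

fn main() {
    let records = vec![
        Record::File("foo.c".to_string()),
        Record::Build { ins: vec![0], outs: vec![1, 2] },
    ];
    for record in &records {
        println!("{}", describe(record));
    }
}
```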
    .discovered_ins()
    .iter()
    .map(|&file_id| self.ensure_id(graph, file_id))
    .collect::<anyhow::Result<Vec<_>>>()?;
I note that your new code allocs more (the intermediate Vecs here), but I think writing isn't on the critical path anyway.
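The pattern in the snippet above, collecting an iterator of fallible results into a single `Result<Vec<_>>`, can be sketched with only the standard library. The `lookup_id` and `lookup_all` functions and the `HashMap` table are hypothetical stand-ins for `ensure_id` and the graph, and `String` stands in for `anyhow::Error`; this is not the PR's code.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the `ensure_id` call in the snippet above: look up
// a file id and fail if it is unknown.
fn lookup_id(table: &HashMap<u32, u32>, file_id: u32) -> Result<u32, String> {
    table
        .get(&file_id)
        .copied()
        .ok_or_else(|| format!("unknown file id {}", file_id))
}

// Map every id through the fallible lookup. Collecting into `Result<Vec<_>, _>`
// allocates one intermediate Vec and stops at the first error.
fn lookup_all(table: &HashMap<u32, u32>, ids: &[u32]) -> Result<Vec<u32>, String> {
    ids.iter().map(|&id| lookup_id(table, id)).collect()
}

fn main() {
    let table = HashMap::from([(1, 10), (2, 20)]);
    println!("{:?}", lookup_all(&table, &[1, 2]));
    println!("{:?}", lookup_all(&table, &[1, 3]));
}
```

The intermediate `Vec` is the extra allocation mentioned above; writing each id straight into the output buffer would avoid it at the cost of more verbose error handling.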
Looking through the bincode docs, it's a little disappointing that even it uses large fixed-width sizes like u64 for lengths, but it probably doesn't matter too much, and maybe using their varint encoding would end up a net win?
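To make the fixed-width versus varint trade-off concrete, here is a rough sketch of a variable-length integer encoding. It uses unsigned LEB128 as an illustration, which is an assumption on my part and not bincode's own varint scheme; the function names `write_varint` and `read_varint` are hypothetical.

```rust
// Encode `value` as unsigned LEB128: 7 payload bits per byte, with the high
// bit set on every byte except the last.
fn write_varint(mut value: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

// Decode one varint from the front of `input`, returning the value and the
// number of bytes consumed, or None if the input is truncated or longer than
// 10 bytes. Overlong encodings are not rejected in this sketch.
fn read_varint(input: &[u8]) -> Option<(u64, usize)> {
    let mut value = 0u64;
    for (i, &byte) in input.iter().enumerate().take(10) {
        value |= u64::from(byte & 0x7f) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

fn main() {
    // A small length costs one byte as a varint versus eight as a fixed u64.
    let mut varint = Vec::new();
    write_varint(5, &mut varint);
    let fixed = 5u64.to_le_bytes();
    println!("varint: {} byte(s), fixed: {} bytes", varint.len(), fixed.len());
}
```

Small lengths shrink from eight bytes to one, but every read and write now pays for a loop with shifts and masks, which is the cost being weighed in this thread.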
If you build the At the point where my build is at as I type this:
My .n2_db is 3.9M, which is large enough that we care more about performance.
95e6f5f to 1b31c58
A comparison between "baseline" e256bc6 (manual parsing), "serde" 8d42961 (serde + cbor), and "bincode" 77e2d4a (serde + bincode):
Database sizes (more or less the same content, so this drastic size difference is explained solely by the format)
Binary sizes (stripped, obviously).
serde + cbor seems to be faster than the manual implementation on small databases, but significantly slower on bigger ones. It also produces the largest binary (but the most compact databases). serde + bincode beats the manual implementation on big DBs; I haven't tested it on small ones. Its binary is only slightly larger than the baseline (unlike the cbor one), although its databases are significantly bigger.
Co-authored-by: Pavel Grigorenko <[email protected]>
I've rebased on top of main (e73378d)
Based on this branch I created a PR that has just the serde parts without the refactoring: Some random thoughts:
It's pretty interesting to me how the perf is a wash, given that the serde codepath appears to be doing a lot more (creating intermediate vecs and strings etc.).
Maybe this can be combated with
My completely arbitrary (and probably wrong) guess is that it's actually faster to store the values uncompressed, with all the excess zeroes, because we no longer need to do bit shifts, and maybe also because the reads and writes are better aligned? Maybe?
8cafbda to d43e8ca
You may try msgpack and bitcode as other alternatives, potentially even avoiding the serde dependency if it is not needed by other parts of the code (e.g. to parse ninja files).
Note that this log file needs to be chunked and loaded in parallel to get good load times, which often precludes using all-in-one serialization solutions.
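As a rough illustration of what chunked, parallel loading could look like, here is a stdlib-only sketch. It assumes a hypothetical fixed-width format (a flat sequence of little-endian u32 records), which is not the actual .n2_db layout; `RECORD_SIZE`, `parse_chunk`, and `load_parallel` are names invented for this example.

```rust
use std::thread;

// Hypothetical fixed-width format: the log is a flat sequence of
// little-endian u32 records. A real log with variable-length records would
// first need an index of record boundaries to split on.
const RECORD_SIZE: usize = 4;

// Parse one chunk; a trailing partial record is ignored by `chunks_exact`.
fn parse_chunk(chunk: &[u8]) -> Vec<u32> {
    chunk
        .chunks_exact(RECORD_SIZE)
        .map(|b| u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}

// Split the buffer on record boundaries, parse each chunk on its own scoped
// thread, then concatenate the results in their original order.
fn load_parallel(data: &[u8], records_per_chunk: usize) -> Vec<u32> {
    let chunk_bytes = records_per_chunk.max(1) * RECORD_SIZE;
    thread::scope(|scope| {
        // Spawn every parser before joining any of them.
        let handles: Vec<_> = data
            .chunks(chunk_bytes)
            .map(|chunk| scope.spawn(move || parse_chunk(chunk)))
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("parser thread panicked"))
            .collect()
    })
}

fn main() {
    let data: Vec<u8> = (0u32..8).flat_map(|n| n.to_le_bytes()).collect();
    println!("{:?}", load_parallel(&data, 3));
}
```

The sketch only works because chunk boundaries are known without parsing; an all-in-one serializer that produces a single opaque stream gives no such split points, which is the concern raised above.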
But does it need to be all in-memory at the same time?
I figured out how to do cool git commands and set a commit's author and co-author, so here we go.
This basically supersedes #18 (I mean, it's literally a port of that PR).
So here's a comparison between e256bc6 (baseline) and 8d42961 (serde):
Byte counts (stripped binaries):
Yes, the size still goes up, sadly.
But as I've already mentioned in #18 (comment), this allows us to get rid of some probably-UB unsafe code like
from here.