
speed up new router implementation #279

Open · wants to merge 2 commits into base: main

Conversation

kyleplump

related to #278

in the new Trie leaf implementation, Mustermann matchers were created at evaluation time, causing a semi-significant slowdown. we already have all of the information needed to create the matcher when building the trie - so this fix creates the matcher at leaf-creation time (leaving a fallback in #match), while maintaining the benefits of having leaves as introduced in 2.2
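a minimal sketch of the idea (a plain Regexp stands in for Mustermann here, and the class name is illustrative - this isn't the actual hanami-router code):

```ruby
# Sketch: build the matcher when the leaf is created, keep a lazy
# fallback in #match. A trivial Regexp stands in for Mustermann.
class Leaf
  def initialize(route)
    @route = route
    @matcher = compile(route) # eager: built while the trie is constructed
  end

  def match(path)
    @matcher ||= compile(@route) # fallback: compile lazily if needed
    @matcher.match(path)
  end

  private

  # stand-in for something like Mustermann.new(route, type: :rails)
  def compile(route)
    Regexp.new("\\A" + route.gsub(/:(\w+)/) { "(?<#{$1}>[^/]+)" } + "\\z")
  end
end

leaf = Leaf.new("/entity/:id")
leaf.match("/entity/42") # MatchData with :id captured
```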

this brings us back to performance similar to hanami-router v2.1, with the exception of startup (this fix is temporarily called 2.2.1):

[benchmark charts: rps, log_rps, runtime_with_startup]

this slowdown on boot is expected - more work (creating leaves / Mustermann matchers) is happening up front. the way to get around this would likely be a new structure other than a trie (maybe someday!), or maybe evaluating / building the trie async (😬)

please let me know if you'd like more test cases (especially looking for thoughts from @dcr8898). thanks!

@dcr8898
Contributor

dcr8898 commented Jan 1, 2025 via email

@cllns
Member

cllns commented Jan 1, 2025

Sweet! The 10+ second startup time for 10,000 routes is still worrisome to me. Is there any way we could bring that down?

Is it possible to postpone creating leaves until we need them, or batch them up somehow? I guess that's what you mean by making it async, but I could be wrong.

@dcr8898
Contributor

dcr8898 commented Jan 1, 2025 via email

@kyleplump
Author

@cllns @dcr8898

howdy gentlemen

thank you for the thoughts! i think i made some headway here

context

i did some poking around to see if i could figure out where the slowdown on boot was, and surprise surprise, it's our friend Mustermann.

Damian is correct - in theory we're creating significantly fewer matchers, since we're only creating them for defined routes rather than for each individual dynamic segment (for reference, the original design).

so in 2.1, we would have matchers for things like :id, or :uuid instead of full paths like /entity/:id or /some-route/:unique.

[screenshot: Mustermann documentation]

in this section of the mustermann docs we learn two things:

  1. creating pattern matching objects is (intentionally) expensive
  2. when calling #new and passing the same arguments, mustermann might just return an already-created object instead of initializing a new one.

point #2 is the important part. when given dynamic segments in 2.1, it was very possible we were sharing matchers between routes: a matcher would match :id regardless of which route that dynamic segment was used in, utilizing the same underlying object. so in theory - fewer matchers. in practice - probably not.
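to illustrate the deduplication i mean, here's a toy identity map keyed on constructor arguments - same arguments, same already-created object (PatternFactory is made up for this sketch; it's not Mustermann's real internals):

```ruby
# Sketch: an identity map so ":id" yields one shared "matcher" no
# matter how many routes use it. A frozen copy of the source string
# stands in for a real compiled pattern.
class PatternFactory
  @cache = {}

  def self.new_pattern(source)
    # same arguments -> the same, previously created object
    @cache[source] ||= source.dup.freeze
  end
end

a = PatternFactory.new_pattern(":id")
b = PatternFactory.new_pattern(":id")       # same object as a
c = PatternFactory.new_pattern("/entity/:id") # distinct object
```

with segment-level patterns (2.1), many routes hit the same cache key; with full-path patterns (2.2), nearly every route gets its own entry - hence more objects and more memory.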

i confirmed this by looking at the memory usage of 2.2.1:

[chart: memory usage of 2.2.1]

in Sean's original test, memory was similar between 2.1 and 2.2. this is because we were creating the matchers in the leaves at runtime. when 'precompiling' the routes, we see that initial memory usage is much higher. more evidence that we're initializing many more objects

proposal

route caching 🤔

i think there could be a world where we pre-compile the Trie on the first boot and then save it to a gitignore'd file on disk. on subsequent loads, we could check whether the routes have changed. if they haven't, we should have constant boot times. if they have changed, compile the changed routes into a new version of that file. i think this could be an acceptable solution, since long startup times really only take effect when you have a significant number of routes, and i imagine very few people are going to add 10k routes before starting the app for the first time. the consequence, though, would be adding more complexity to the Hanami boot process. if this is a direction people are interested in, i'd love to create a separate issue for tracking that effort
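roughly what i'm picturing (all names here are hypothetical, and this assumes the trie is Marshal-able, which is an open question - Mustermann patterns may hold procs that won't serialize):

```ruby
require "digest"
require "tmpdir"

# Sketch: fingerprint the route definitions; rebuild and rewrite the
# serialized trie only when the fingerprint changes.
CACHE = File.join(Dir.tmpdir, "hanami_routes_cache_#{Process.pid}.bin")

def load_or_build_trie(routes)
  digest = Digest::SHA256.hexdigest(routes.join("\n"))
  if File.exist?(CACHE)
    stored_digest, blob = Marshal.load(File.binread(CACHE))
    return Marshal.load(blob) if stored_digest == digest # routes unchanged
  end
  trie = build_trie(routes) # the expensive precompilation step
  File.binwrite(CACHE, Marshal.dump([digest, Marshal.dump(trie)]))
  trie
end

# stand-in for the real trie builder
def build_trie(routes)
  routes.each_with_object({}) { |r, t| t[r] = r.split("/") }
end
```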

lmk what you think! thank you as always

@kyleplump
Author

kyleplump commented Jan 3, 2025

@cllns @dcr8898

follow up here: i ended up squeezing some more juice out of the 2.2.1 implementation (now named 2.2.12, although idk if this is the correct usage of semver or not 🤷)

updated graphs (on my machine at least):
[benchmark charts: rps, log_rps, runtime_with_startup]

[chart: memory usage]

as you can see, memory usage is pretty much the same - this is still because of the issue discussed above.

i did some extra benchmarking and realized we were leaving a lot on the table with Trie#segments_from. it turns out this function was taking most of the time in our implementation. we should be splitting based on a string instead of a regex (i didn't just know this off the top of my head - i found this and then tried it), and we should be memoizing segments (#put can do the split and cache, #find can just read segments from this cache)
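a sketch of both optimizations together (method names mirror the discussion, but this is my assumption of the shape, not the real Trie code):

```ruby
# Sketch: split on a plain string delimiter (not %r{/}), and memoize
# segments so #put pays the cost once and #find reads from the cache.
class SegmentCache
  def initialize
    @segments = {}
  end

  # write path: split and cache
  def put(path)
    @segments[path] ||= path.split("/").reject(&:empty?)
  end

  # read path: served straight from the cache when possible
  def find(path)
    @segments.fetch(path) { put(path) }
  end
end
```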

please let me know if you see any issues with this updated approach!

@dcr8898
Contributor

dcr8898 commented Jan 6, 2025

@kyleplump

Thank you for continuing to look into this! I've been swamped on my end through the holidays.

Optimizing Hanami Router 2.2

Your switch from a RegEx to string splitting is what I meant about finding the most performant implementation for our approach. It's good to see that 2.2.1 is faster than 2.1 when implemented properly. There may be more wins to be had with similar optimizations. Jeremy Evans' book has a lot of tips for this.

Using Mustermann will always require a trade-off

In terms of boot time vs response time, I believe this will always be a trade-off as long as we are using Mustermann. I don't know if it would be possible to pre-compile the trie structure somehow. I've never seen that done. You would essentially have to serialize Ruby objects and then de-serialize them at runtime. Not sure if this would be possible or faster. Interesting thought.

At present, I consider a longer boot-time for faster run-time in 2.2.1 to be the better trade-off (compared to the opposite in 2.2).

Mustermann discussion

With regard to Mustermann, I would bet that Hanami's use of tries allows it to compare favorably with any other framework that also uses Mustermann. This doesn't apply to Roda, since Roda does not seem to use Mustermann.

However, it would be hard for Hanami Router to match Roda's speed, because Roda's routing config is essentially a trie structure implemented in code that does not use Mustermann (see the "Faster" section here). Therefore, it's faster both at boot (there is no routing compile step at all) and during run time. (You can read a comprehensive write-up of Roda here.)

Hanami uses Mustermann because of the power and flexibility it provides out of the box, and because it's battle tested. With that said, there are other options.

Options for moving forward

I recommend that we continue to use Mustermann for now, as improved in this PR, but we should consider other options that could further improve Hanami Router, such as:

  • Contribute improvements to Mustermann.
  • Consider using a different Mustermann type (other than Rails), or even creating a Hanami Mustermann type.
  • Drop Mustermann and create a new matching strategy, perhaps using Roda for inspiration.

Other thoughts

As a general rule, the more flexible we allow our routes to be, the more overhead we should expect. Part of the reason Roda is fast is because it implements a small set of routing options to start (for example, only GET and POST are available by default). More complex options are available as plugins, but users don't incur their added overhead unless they choose to use them. If we do refactor Hanami Router along a different path, we could consider a similar approach of simple first.

Other considerations include developer experience. Hanami Router's configuration of routes is very similar to that of Rails, and (I think) very readable. Roda's presentation is very different, and perhaps not as readable (could just be my lack of familiarity). The fastest possible implementation would have developers specify a routing trie data structure directly in the routes file, but the DX for this would be poor. Trade-offs of some kind will always exist.

@kyleplump do you have time to discuss your code changes before we recommend merging them?

@cllns
Member

cllns commented Jan 6, 2025

Thanks for finding this and writing up your thoughts @kyleplump @dcr8898!

Pre-compiling the trie is an interesting idea but adds too much complexity. It'd be a cool experiment for sure, but it's too novel to add at this point.

In terms of versioning proposals, I think it'd be better to refer to them via something like 2.2.0.proposal1 and 2.2.0.proposal2 or .fixN etc. Just something that signifies that it's not an actual gem release. At some point we'll make a 2.2.1 gem release and then the chart & discussion will be inaccurately labeled, which is confusing for future readers.

Is there some way we could take a hybrid approach here? That is, building Mustermann leaves ahead of time for the first N routes, then building the matchers lazily after that. We could use benchmarking to figure out what N should be by default, and allow that N to be configurable.

Similarly, could we make pre-building leaves an option? i.e. N=0 would be valid and implement (essentially) the previous behavior from 2.1.

I don't know if it would be safe, but I could see wanting lazily built matchers in dev and test environments, then building them ahead of time in production for increased performance (at the expense of slower startup time).
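Something like this is what I have in mind (class and method names are invented for the sake of the sketch, and a plain Regexp stands in for Mustermann):

```ruby
# Sketch of the hybrid: eagerly compile matchers for the first N routes,
# build the rest lazily on first use. eager_limit: 0 degenerates to
# fully lazy behavior.
class Router
  def initialize(routes, eager_limit: 1000)
    @matchers = {}
    routes.first(eager_limit).each { |r| @matchers[r] = compile(r) } # eager
    @routes = routes
  end

  def matcher_for(route)
    @matchers[route] ||= compile(route) # lazy fallback past the limit
  end

  private

  # stand-in for Mustermann pattern compilation
  def compile(route)
    Regexp.new("\\A" + route.gsub(/:(\w+)/) { "(?<#{$1}>[^/]+)" } + "\\z")
  end
end
```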

@dcr8898
Contributor

dcr8898 commented Jan 6, 2025

@cllns These are some interesting ideas for sure.

My initial feeling is that this would be a case of premature optimization. I don't know how many apps there are out there with 10K routes. I would think a huge app would have 1-2K routes (based on my limited experience). If the graph above is logarithmic, this would imply startup times of 1-2 seconds for such apps. Do you think that would be too long for dev & test environments? My feeling is that we should wait until developers are actually feeling that pain before we introduce more complexity.

For example, I would like to speak with @kyleplump more to see if we can squeeze the current memory usage a little more, and maybe the startup time, with further optimizations.

If we do explore your suggestions, what would be the gain of compiling some matchers ahead of time and not others? As far as this benchmark is concerned, I presume it attempts to exercise all of the generated routes more or less equally. This is actually one factor that makes the r10k tool a little unrealistic.

In a real-world app, there would definitely be some distribution of popular and less-popular routes, but this distribution would not be captured by an arbitrary division of route handling based on a number. What might work in this case is the original 2.2 strategy and some kind of fast, in-memory cache of created Mustermann matchers. I don't know if this is worthwhile, but it would provide a compromise approach. (EDIT: I have no idea how this would work in a concurrent environment. Maybe it is best to compile the routes first, and then potentially freeze the router. 🤷 )
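For the in-memory cache idea, the concurrency question could be handled with a simple lock around the build step. This is only an illustration of the shape, not code from hanami-router:

```ruby
# Sketch: a matcher cache guarded by a Mutex, so concurrent requests
# that race on the same uncompiled route build the matcher only once.
class MatcherCache
  def initialize
    @mutex = Mutex.new
    @cache = {}
  end

  def fetch(source)
    @mutex.synchronize do
      @cache[source] ||= yield(source) # build once, under the lock
    end
  end
end
```

(Freezing the router after compiling everything up front would sidestep the lock entirely, which may be the simpler answer.)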

The second part of your suggestion, handling dev and test environments differently than production, could be implemented pretty easily (I think 🤔 ), but I still wouldn't recommend introducing that additional complexity until actual developer experience starts to suffer.

@kyleplump
Author

@cllns @dcr8898

i appreciate you both thinking about this so deeply! i love hearing your perspectives - trying to keep up with you two is helping me learn a ton :)

i like this idea of pre-generating them all on production builds. is there a world where we keep it simple at first and just lazy-load the matchers when running hanami dev, and pre-generate them on prod builds? as long as we document it, and maybe even provide the option to override this behavior, i feel like that's a big win. it allows developers the ability to see everything 'live' and un-cached, but once they're ready to ship a build we can eagerly optimize. i've worked with some tools that do similar-ish things, so as long as we're transparent, there's a precedent.

regardless - i'd love to keep poking around and see if we can be more 'clever' with our optimizations and make this router blazingly fast (side note: do people still say this?). happy to bounce ideas off of anyone, but def @dcr8898 if you're volunteering

on versioning: once we iron out a game plan, i can make the modifications, close this PR and create a formal 'proposal' laying out changes made if that would be better? 2.2 fixes the bug, we're now just talking about optimization (which is the name of the game for a project like this). point being: if we bundle it all up nicely it can get merged whenever.

thanks again!

@dcr8898
Contributor

dcr8898 commented Jan 7, 2025

@kyleplump I'm in favor of polishing this PR up and merging it to fix the problem. Then we can experiment with further changes at our leisure. Let me know when you want to meet up. I think we can make a few more optimizations before submitting.

I might still say blazing fast, but I'm likely an anomaly. 🤔
