Metrics scalability performance issue in 6.0 #366
Comments
Agreed — I'm relieved because we have so many options here. I also like the idea of implementing proper request-driven scraping in 6.1 instead of a constant push from all nodes to the manager.
@Neverlord would be great to hear your thoughts here — could
I'll look into it.
I'm not convinced that this is a good route for a distributed system with loose coupling like Broker. At the pub/sub layer, we don't know how many responses we should expect. There is no central registry of metric sources. Even if there were one, we would still have to guard against all sorts of partial errors, ultimately with some sort of timeout for the operation. A loosely coupled push model like we have now is much more robust.

If the once-per-second updates introduce significant traffic, I think we can instead optimize that update. I haven't looked at the actual metrics yet, but are all workers actually using all the metrics? Maybe we could omit metrics with zero values or use some sort of "compression" / better encoding. But let's fix the obvious performance bugs first. 🙂
Full ACK on the strategy. 👍 PR #367 is getting rid of the
The interval is already configurable via the
* origin/topic/neverlord/gh-366:
  Use a set for caching metric instances
  Optimize metric_collector::insert_or_update lookup
Move the telemetry/cluster.zeek file over into policy/frameworks/telemetry/prometheus.zeek. Mention it in local.zeek. Relates to zeek/broker#366.
At the Zeek level, the nodes we expect metrics from are fixed. All nodes should also have consistent metrics (types, help, labels). In fact, in much larger setups it might be best to forgo the centralization aspect of either push-based or pull-based collection altogether and use configuration management to set Prometheus up for the individual nodes accordingly.
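As a rough sketch of that per-node option (node addresses, ports, and labels here are made up; in practice a configuration-management tool would template them from the cluster layout), Prometheus would simply scrape each node's own metrics endpoint directly:

```yaml
# prometheus.yml fragment (sketch; hostnames and ports are hypothetical)
scrape_configs:
  - job_name: "zeek"
    static_configs:
      - targets:
          - "worker-1.example.com:9911"
          - "worker-2.example.com:9912"
          - "logger-1.example.com:9913"
        labels:
          cluster: "zeek-prod"
```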
That seems quite okay and pragmatic. If a node fails to provide metrics within 1 or 2 seconds, then it timed out, and that's a signal, too.

I have prototyped the request-response/pull-based approach here: https://github.com/awelzel/zeek-js-metrics

This triggers metric collection as Zeek events over broker and collects the results (pre-rendered Prometheus lines) before replying to an HTTP request handled in JavaScript on the manager. With 24 workers and a large number of metrics, there is zero overhead or extra cluster communication when no scraping happens, and significantly lower overhead on the manager when scraping happens at 1-second intervals (still high, but comparable to broker's default). In this artificial setup, broker's centralization causes the manager to use 30% CPU by default; with the pull approach, usage is only at 10%.

This doesn't necessarily mean we should require JS for this, but I think it's reasonable to use it to compare approaches and expectations.
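For illustration only (this is not the code from the linked prototype; the module, event names, and the HTTP hand-off are hypothetical), the request/response round trip over Broker could look roughly like this in Zeek script:

```zeek
module MetricsPull;

export {
	## Manager -> workers: request this node's metrics (hypothetical event).
	global collect_request: event(req_id: count);
	## Worker -> manager: pre-rendered Prometheus text for one node.
	global collect_response: event(req_id: count, node: string, body: string);
}

@if ( Cluster::is_enabled() && Cluster::local_node_type() == Cluster::MANAGER )
# Called by whatever serves the HTTP scrape endpoint on the manager.
function start_scrape(req_id: count)
	{
	Broker::publish(Cluster::worker_topic, collect_request, req_id);
	}

event MetricsPull::collect_response(req_id: count, node: string, body: string)
	{
	# Accumulate per-node bodies; answer the pending HTTP request once all
	# expected nodes replied or a short timeout fired.
	print fmt("req %d: %d bytes of metrics from %s", req_id, |body|, node);
	}
@else
event MetricsPull::collect_request(req_id: count)
	{
	# Placeholder: render this node's metrics as Prometheus text lines here.
	local body = "# node-local metrics would be rendered here\n";
	Broker::publish(Cluster::manager_topic, collect_response, req_id, Cluster::node, body);
	}
@endif
```

The point of the sketch is only that nothing gets published unless a scrape actually triggers `collect_request`, which matches the zero-idle-overhead observation above.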
I disagree with this. Very much. By opening up the system via WebSocket and ultimately via ALM, we are no longer limited to the rigid structure of a Zeek cluster. To me, this is not a pragmatic solution. On the contrary, it will tie down Broker and directly contradicts many design decisions and idioms of a loosely coupled, distributed system.

It also brings more issues with it. Tying up a scraper for up to 1-2s because a single node is lagging behind is unacceptable. The scraper probably times out at that point, ultimately causing it to have no metrics at all instead of at least having the n-1 metrics.

I very much appreciate your efforts in quantifying the problem. But please let's not commit to problematic workarounds that compromise the system architecture and come with their own bag of problems. Let's fix the actual problem here: poor performance in Broker. This is purely a performance bug. Currently, the metrics are heavily nested, and Broker is really, really bad at handling this efficiently (unfortunately). #368 could be a big part of a solution here, to make this simply a non-issue. Pre-rendering the metrics instead of shipping "neat"

The central collection of metrics was something I put together a while ago after some internal discussion that this would be nice to have. Then it basically didn't get used until you tried it after adding hundreds of metrics to Zeek. My design was assuming maybe a couple dozen metrics per node. Let me fix that. 🙂

We have already disabled central collection by default again, right? Is this still something we would consider urgent?
Yes, it's disabled.
I have been observing this discussion from the sidelines; just a few comments.

For me, reading metrics from the manager is a convenience feature for users who do not want to set up proper scraping of individual, decoupled components. Re: your comments on dynamic cluster layouts, we should have discoverability tooling for working with such clusters anyway, so there should (eventually?) be a way to set up such scraping; if anything is missing here I'd focus on that. The current implementation has some issues:

I'd argue that if this is a concern for users, they should scrape individual nodes.
In general: yes, and full ACK on all of your other observations. The issue we are trying to tackle is that configuring a Zeek cluster with metrics should be as easy as possible, to make it simple to actually get the metrics.

In a Zeek cluster, we have users that run > 60 workers and (much) more. It gets very fiddly to configure a port for each of the Zeek processes. And then you also have to configure your Prometheus accordingly, which usually runs on a different host. You quickly end up with hundreds of HTTP ports that you need to make accessible over the network and configure in at least two places (Zeek's cluster config plus Prometheus). The setup just becomes much more manageable if there is a single port to get metrics from a Zeek cluster.

However, maybe we thought about this from the wrong perspective and are tackling the problem on the wrong layer. Instead of having the cluster collect the metrics and having the user pick ports, we could actually turn this around. When booting up the manager with metrics enabled, it could instruct all of the workers/proxies/loggers to turn on metrics too, with a port chosen by the OS. Then, the manager could collect all ports and make the list of all the available endpoints/ports available to Prometheus: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#http_sd_config. The manager would only have to respond to trivial HTTP GET requests that tell Prometheus what ports to scrape. The response is basically just a JSON list. That way, Prometheus would scrape all of the individual processes with minimal setup by the user. If the actual Prometheus runs on a different node and the user doesn't want to open up so many ports to the network, I think federation could help with that.

With this setup, we would have to manage a few extra steps when spinning up the cluster and teach Broker (or Zeek) to open up an extra HTTP port where Prometheus can collect the scrape targets. I think that should be very minimal extra overhead, and no extra work is performed if no Prometheus is actually scraping. Thoughts?
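To make that concrete, here is a rough sketch of both sides; the manager address, port, discovery path, and labels are made up for illustration. Prometheus would point `http_sd_configs` at the manager's discovery endpoint:

```yaml
# prometheus.yml fragment (sketch; hostname, port, and path are hypothetical)
scrape_configs:
  - job_name: "zeek-cluster"
    http_sd_configs:
      - url: "http://zeek-manager.example.com:9991/scrape-targets"
        refresh_interval: 30s
```

and the manager would answer that URL with the standard HTTP SD payload, one entry per node with its OS-chosen port:

```json
[
  { "targets": ["10.0.0.10:41231"], "labels": { "zeek_node": "worker-1" } },
  { "targets": ["10.0.0.10:41232"], "labels": { "zeek_node": "worker-2" } },
  { "targets": ["10.0.0.10:41233"], "labels": { "zeek_node": "logger-1" } }
]
```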
To tackle this, we need to serve HTTP replies from Zeek. @timwoj It seems like prometheus-cpp isn't available in Zeek yet (for the 3rd-party HTTP server). Is this on its way? If not, do we just pull in civetweb-cpp ourselves, or do we delay this issue?

We're not planning for any work on telemetry to land until at least 7.0, so if there's something else we can do in the meantime, go for it. I'm also still planning to bring in open-telemetry, since we'd like to tinker with the tracing functionality at some point too, and writing this once for prometheus-cpp and then having to rewrite it again for opentelemetry isn't worth it when I could just do it for otel first.
#418 made this obsolete. |
This has been reported during Zeek 6.0 rc1/rc2 testing as manager memory growing steadily until crashing due to OOM.
Further, when the manager's memory usage is growing, it does not serve requests to `curl http://localhost:9911/metrics` anymore. The connection is accepted, but no response is produced.

The following supervisor setup reproduces the issue here. It creates two counter families with 200 counter instances for each family. With 8 workers and 3 other processes, this results in enough metrics to cause manager overload. Either increasing the number of workers (`nworkers`) or increasing `ncounters` or `nfamilies` should reproduce the issue if you have a more powerful system.

Deployments with 64 or 128 workers may trigger this observation even if the number of instances per metric is small (10 or 20), as it's multiplied by the number of endpoints, if I read the code right. Let alone setups with 500+ workers (#352).
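Purely as an illustration (this is not the original reproduction script), counter families and instances of that shape can be created via Zeek's Telemetry script API; `nfamilies`, `ncounters`, and the metric prefix/names below are hypothetical knobs mirroring the ones mentioned above:

```zeek
# Sketch only: generate nfamilies counter families with ncounters
# instances each, so every node exports nfamilies * ncounters metrics.
const nfamilies = 2;
const ncounters = 200;

event zeek_init()
	{
	local i = 0;
	while ( i < nfamilies )
		{
		local cf = Telemetry::register_counter_family([
			$prefix="repro",
			$name=fmt("family_%d", i),
			$unit="1",
			$help_text="Synthetic counters for the scalability test",
			$label_names=vector("instance")]);

		local j = 0;
		while ( j < ncounters )
			{
			# Incrementing a new label combination creates a counter instance.
			Telemetry::counter_family_inc(cf, vector(fmt("instance_%d", j)));
			++j;
			}

		++i;
		}
	}
```

Running something along these lines on every supervised node (8 workers plus the other processes) yields the per-node metric counts described above.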
On the Zeek side, we could consider removing the metrics for log streams/writers and/or event invocations for 6.0, or reducing the synchronization interval, but this seems mostly like a performance bug.

Further down the road, it may also be more efficient to query workers on demand rather than having workers publish their metrics every second (which most of the time will be overwritten again), at the expense of a small delay.