
Proposal: Inference-perf loadgen component to be based on Grafana k6 load testing tool #2

SachinVarghese opened this issue Jan 20, 2025 · 3 comments

@SachinVarghese

The inference-perf proposal doc describes several components that are vital to its functioning. This issue recommends building some of that capability on top of k6, an existing, mature load-generation tool. Given the current requirements and constraints, a k6-based wrapper design would let us quickly build and provide the following capabilities from the initial proposal.

Load Generator
The Load Generator is the component that generates different traffic patterns based on user input. k6 can generate fixed or custom load patterns for a defined duration, as the requirement dictates.
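As a minimal sketch (not a definitive implementation), the script below uses k6's `ramping-arrival-rate` executor to ramp to a sustained request rate, hold it, and ramp down. `MODEL_SERVER_URL` and the `/health` path are placeholder assumptions about the server under test.

```javascript
import http from 'k6/http';

export const options = {
  scenarios: {
    inference_load: {
      // Open-model load: the arrival rate is independent of response times.
      executor: 'ramping-arrival-rate',
      startRate: 1,
      timeUnit: '1s',
      preAllocatedVUs: 50,
      stages: [
        { target: 20, duration: '2m' }, // ramp up to 20 req/s
        { target: 20, duration: '5m' }, // hold steady
        { target: 0,  duration: '1m' }, // ramp down
      ],
    },
  },
};

export default function () {
  // Placeholder endpoint; each iteration issues one request.
  http.get(`${__ENV.MODEL_SERVER_URL}/health`);
}
```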

Request Processor
The Request Processor provides a way to support different model servers and their corresponding request payloads with configurable parameters. k6 supports HTTP- and gRPC-based requests for both direct and distributed testing.
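A sketch of what a parameterized request could look like, assuming an OpenAI-compatible `/v1/completions` endpoint; the payload shape, `MODEL_NAME`, and `MAX_TOKENS` environment variables are illustrative assumptions, and each model server would get its own variant.

```javascript
import http from 'k6/http';
import { check } from 'k6';

export default function () {
  // Request parameters are configurable via environment variables.
  const payload = JSON.stringify({
    model: __ENV.MODEL_NAME || 'my-model',
    prompt: 'Explain load testing in one sentence.',
    max_tokens: Number(__ENV.MAX_TOKENS || 128),
  });

  const res = http.post(`${__ENV.MODEL_SERVER_URL}/v1/completions`, payload, {
    headers: { 'Content-Type': 'application/json' },
  });

  check(res, { 'status is 200': (r) => r.status === 200 });
}
```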

Response Processor / Data Collector
The Response Processor / Data Collector component allows us to process the response and measure the actual performance of the model server in terms of request latency, TPOT (time per output token), TTFT (time to first token), and throughput. k6 scripting can be leveraged for advanced data/metrics computation.
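A sketch of custom metric computation with k6's `Trend` metrics. It assumes a streaming response, so that time-to-first-byte approximates TTFT, and a response body that reports generated token counts in a `usage.completion_tokens` field; both the endpoint and that field name are assumptions about the server.

```javascript
import http from 'k6/http';
import { Trend } from 'k6/metrics';

// Custom metrics; the `true` flag marks them as time values in the summary.
const ttft = new Trend('ttft_ms', true);
const tpot = new Trend('tpot_ms', true);

export default function () {
  const res = http.post(
    `${__ENV.MODEL_SERVER_URL}/v1/completions`,
    JSON.stringify({ model: 'my-model', prompt: 'Hello', max_tokens: 64 }),
    { headers: { 'Content-Type': 'application/json' } }
  );

  // timings.waiting is time-to-first-byte, a proxy for TTFT when streaming.
  ttft.add(res.timings.waiting);

  // Derive a mean TPOT from the token count the server reports (assumed field).
  const usage = JSON.parse(res.body).usage;
  if (usage && usage.completion_tokens > 1) {
    tpot.add(
      (res.timings.duration - res.timings.waiting) / (usage.completion_tokens - 1)
    );
  }
}
```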

Report Generator / Metrics Exporter
The Report Generator / Metrics Exporter generates a report from the data collected during benchmarking. It can also export the collected metrics to Prometheus, where they can be consumed by other monitoring or visualization solutions. k6 supports real-time metrics streaming to services like Prometheus, New Relic, and others.
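For end-of-run reports, k6 provides a built-in `handleSummary` hook; the file name and output shape below are illustrative.

```javascript
// Called once at the end of the run with all collected metrics.
export function handleSummary(data) {
  return {
    // Machine-readable report for post-processing (e.g. by a Python layer):
    'benchmark-summary.json': JSON.stringify(data, null, 2),
    // Short console note printed when the run finishes:
    stdout: `Run complete: ${data.metrics.http_reqs.values.count} requests\n`,
  };
}
```

For real-time export, recent k6 releases ship a Prometheus remote-write output that is enabled per run (e.g. `k6 run -o experimental-prometheus-rw script.js`), so no extra scripting is needed on that path.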

Key Benefits

Key advantages of building on top of k6:

  • Existing mature OSS ecosystem
  • Support for custom load-generation patterns
  • Support for HTTP and gRPC request processing
  • Built-in Kubernetes-based distributed testing via the associated k8s operator
  • Real-time metrics collection and export to a variety of data stores
  • Many built-in memory optimizations, such as the ability to discard response bodies (see the sketch after this list)
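Illustrating the last point, `discardResponseBodies` drops bodies globally to keep memory flat under high throughput, with a per-request opt-in where the body must be parsed; the endpoints and `MODEL_SERVER_URL` are placeholders.

```javascript
import http from 'k6/http';

// Discard response bodies by default.
export const options = { discardResponseBodies: true };

export default function () {
  // Body discarded: fine for requests we only time.
  http.get(`${__ENV.MODEL_SERVER_URL}/health`);

  // Opt back in per request when the body is actually needed.
  const res = http.post(
    `${__ENV.MODEL_SERVER_URL}/v1/completions`,
    JSON.stringify({ prompt: 'Hello', max_tokens: 8 }),
    { responseType: 'text', headers: { 'Content-Type': 'application/json' } }
  );
}
```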
@SachinVarghese (Author)

An example from the industry: Hugging Face TGI uses k6 for its benchmarking results.

@achandrasekar (Contributor)

Like the idea of using a well-tested loadgen. But we need to make sure that the core benchmarking library is Python-based and can be used as such if needed. I'm not sure if we can instrument a k6 loadgen via Python, but I would be interested in learning more and discussing the options we have.

@SachinVarghese (Author)

Yes, with this proposal the benchmarking library can be Python-based. There are many reasons to prefer Python for this project (data manipulation, tokenization, reporting, etc.), and k6 would merely bring an underlying set of utilities aimed squarely at load design and request processing. Such a model would let us leverage the best of both worlds.

In many load-generation cases, a single node cannot produce or sustain production-grade loads, especially long-context LLM loads, and distributed testing then becomes a necessity. The initial project proposal also identified distributed testing on Kubernetes as a key differentiating factor, and many existing LLM perf tools fall short in exactly this area. A huge benefit of using k6 here is the distributed testing we get out of the box with minimal lift (sketched below). There are also extensions for scripting in Python if needed, but the key is to leverage the right set of tools.
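For a sense of how that works, k6 partitions one logical test across instances via execution segments; the k6-operator assigns these per pod automatically, and they are shown explicitly here, as an illustration, for one of four nodes.

```javascript
export const options = {
  executionSegment: '0:1/4',                   // this instance: first quarter of the load
  executionSegmentSequence: '0,1/4,2/4,3/4,1', // full partition across four instances
};
```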
