CoLearn-Dev/deserve

DeServe

DeServe is an offline-serving framework for decentralized inference of large language models. In high-latency networks, DeServe achieves up to 12.6x higher throughput than vLLM with pipeline parallelism. The following features are key to this performance:

  • KV Cache Swapping: Maximizes GPU computation utilization by enlarging the KV cache size through swapping microbatch memory between CPU and GPU.
  • Microbatch Scheduling: Allocates microbatches within the pipeline according to the network latency to maximize throughput.
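
To illustrate why the two features work together, here is a minimal sketch of a toy pipeline model. It is not DeServe's scheduler: the function name, the parameters, and the assumption that every stage has the same compute time and the same one-way link latency are all hypothetical. In this model a microbatch needs `num_stages * (compute_ms + latency_ms)` to travel once around the pipeline, so keeping every stage busy requires that many milliseconds' worth of other microbatches in flight.

```python
import math


def min_microbatches_in_flight(num_stages: int, compute_ms: float, latency_ms: float) -> int:
    """Toy estimate (hypothetical helper, not part of DeServe).

    Assumes identical stages arranged in a ring: each stage computes for
    compute_ms, then sends its output over a link with latency_ms delay.
    """
    # Time for one microbatch to pass through every stage and link once.
    round_trip_ms = num_stages * (compute_ms + latency_ms)
    # Each stage finishes one microbatch every compute_ms, so this many
    # microbatches must be circulating to avoid idle time.
    return math.ceil(round_trip_ms / compute_ms)


if __name__ == "__main__":
    for latency in (0, 16, 64, 256):
        needed = min_microbatches_in_flight(num_stages=4, compute_ms=10, latency_ms=latency)
        print(f"latency {latency} ms: at least {needed} microbatches in flight")
```

Under these assumptions the required number of in-flight microbatches grows linearly with latency. Each extra microbatch needs its own KV cache, which suggests why enlarging the effective cache through CPU-GPU swapping matters on slow links.
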
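| Latency (ms)  | real-world centralized | real-world east-west | sim 16 | sim 32 | sim 64 | sim 256 |
|---------------|------------------------|----------------------|--------|--------|--------|---------|
| vLLM (tp)     | 253.0                  | failed               | /      | /      | /      | /       |
| vLLM (pp)     | 89.1                   | 37.3                 | 68.8   | 55.3   | 36.1   | /       |
| DeServe (pp)  | 194.6                  | 138.4                | 182.3  | 163.7  | 133.7  | /       |
| DeServe (opt) | 445.2                  | 434.1                | 458.5  | 457.3  | 456.8  | 442.9   |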
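
As a quick check on the headline number, the sketch below divides the DeServe (opt) row by the vLLM (pp) row for every column where both report a value. It assumes the table entries are throughput figures (the table does not state their unit) and the variable names are illustrative only. The largest ratio falls in the sim 64 column, at roughly 12.6 to 12.7, which appears to be the source of the "up to 12.6x" claim.

```python
# Figures copied from the table above; the sim 256 column is omitted
# because vLLM (pp) has no value there.
columns = ["real-world centralized", "real-world east-west", "sim 16", "sim 32", "sim 64"]
vllm_pp = [89.1, 37.3, 68.8, 55.3, 36.1]
deserve_opt = [445.2, 434.1, 458.5, 457.3, 456.8]

# Ratio of DeServe (opt) to vLLM (pp) per column.
speedups = {col: opt / base for col, opt, base in zip(columns, deserve_opt, vllm_pp)}
best_column = max(speedups, key=speedups.get)

for col, ratio in speedups.items():
    print(f"{col}: {ratio:.2f}x")
print("largest ratio:", best_column)
```
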

To start the experiments, please refer to deserve_exp/readme.md.