Efficiently serving LLMs is a challenging task. I use a custom vLLM server to achieve this; it's relatively fast and efficient, especially with AWQ models.
I recommend using Docker to serve your model in a contained environment, but you can always try to install vLLM with pip or from source; results are not guaranteed.
You'll need a working Docker installation with the NVIDIA Container Toolkit and at least one GPU with a CUDA compute capability of 7.0 or higher and enough memory to load the model. You can try using the CPU instead, but at the cost of inference speed.
See the vLLM Docker installation documentation for more information.
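Before building anything, it's worth confirming that your GPU meets the requirement and that Docker can actually see it. This is just a quick sanity check; the compute_cap query needs a reasonably recent NVIDIA driver, and the CUDA image tag below is only an example.
# Check the compute capability of your GPU(s) (requires a recent NVIDIA driver).
nvidia-smi --query-gpu=name,compute_cap --format=csv
# Check that the NVIDIA Container Toolkit exposes the GPU inside containers.
# Any CUDA base image will do; this tag is just an example.
docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi
If both commands list your GPU, you are ready to build the image: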
docker build \
--rm \
-f Dockerfile \
-t vllm-inference-api:latest .
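If the build succeeds, the image should show up locally:
# List the freshly built image.
docker images vllm-inference-api:latest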
Expose port 5085 (or $PORT), specify GPUs, allocate shared memory, and mount your local HuggingFace cache inside the container so that you don't have to download the model every time. Use $MODEL_ID to set which model to load.
docker run \
-it \
-p 5085:5085 \
--gpus=all \
--privileged \
--shm-size=8g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--mount type=bind,src="$HOME/.cache/huggingface/hub/",dst=/root/.cache/huggingface/hub/ \
--env PORT="5085" --env MODEL_ID="wasertech/assistant-llama2-chat-awq" \
--env QUANT="awq" --env DTYPE="half" \
vllm-inference-api:latest
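After starting the container, give the server a moment to download (on the first run) and load the model. You can watch the logs to see when it is ready; the filter below assumes the container was started from the image tag used above.
# Follow the server logs until the model has finished loading.
docker logs -f $(docker ps -q --filter ancestor=vllm-inference-api:latest)
Once the server is up, query its /generate endpoint and stream the response: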
curl -N -X POST \
-H "Accept: text/event-stream" -H "Content-Type: application/json" \
-d '{"prompt": "<s>[INST] Greeting Assistant [/INST] ", "temperature": 0.5, "max_tokens": 2, "stop": ["</s>",]}' \
http://0.0.0.0:8000/generate
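For day-to-day use it can be handy to wrap that request in a small helper. This is only a sketch: it assumes the same /generate endpoint and request schema as above, and the max_tokens value is an arbitrary choice.
# Hypothetical helper: send a prompt to the running server and stream the answer.
# Assumes the /generate endpoint and JSON schema shown above; max_tokens is arbitrary.
generate() {
  local prompt="$1"
  curl -N -X POST \
    -H "Accept: text/event-stream" -H "Content-Type: application/json" \
    -d "{\"prompt\": \"<s>[INST] ${prompt} [/INST] \", \"temperature\": 0.5, \"max_tokens\": 256, \"stop\": [\"</s>\"]}" \
    "http://0.0.0.0:${PORT:-5085}/generate"
}
generate "What can you help me with?"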