ECS GenAI Inference: vLLM on AWS Inferentia with Neuron #250
Description
This new example in the ECS Blueprints project demonstrates how to set up infrastructure for running GenAI inference using vLLM with AWS Neuron on Inferentia 2 instances. It creates an ECS cluster with an Auto Scaling group of inf2 instances, deploys a vLLM service to handle inference requests, and sets up an Application Load Balancer to expose the service endpoint. The solution uses pre-compiled Neuron-compatible models and is designed for scalable GenAI workloads.

The blueprint includes steps for preparing a custom Docker image with vLLM and the necessary dependencies, deploys the infrastructure with Terraform, and provides an example of how to send inference requests to the deployed service. It offers a streamlined way to leverage AWS's specialized AI hardware for efficient large language model inference within an ECS environment.
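As a rough illustration of the inference-request step described above, the sketch below builds a request against vLLM's OpenAI-compatible `/v1/completions` API. The endpoint and model name are placeholders, not values from this blueprint: substitute the ALB DNS name and the pre-compiled Neuron model configured in your deployment.

```python
import json
import urllib.request

def build_completion_request(endpoint: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for vLLM's OpenAI-compatible /v1/completions API."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": 64,
        "temperature": 0.7,
    }
    return urllib.request.Request(
        f"{endpoint}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_completion_request(
    "http://<alb-dns-name>",  # placeholder: ALB DNS name from the Terraform outputs
    "<neuron-model-name>",    # placeholder: the pre-compiled Neuron model served by vLLM
    "What is AWS Inferentia?",
)
# To actually send it (requires the deployed service to be reachable):
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```

The request shape is the standard OpenAI completions payload that vLLM accepts; only the host and model name are deployment-specific.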
Motivation and Context
This change addresses a significant gap in the available examples for implementing inference workloads on ECS, particularly when compared to existing resources for EKS and EC2. Multiple AWS partners have requested comparable ECS-based solutions, especially for projects like vLLM. This example fills that need by providing a functional, ECS-specific implementation modeled after recent examples for other platforms. It ensures that ECS users have access to up-to-date, practical guidance for deploying GenAI inference workloads, bringing ECS documentation in line with resources available for other AWS compute services.
How Has This Been Tested?
Ran `pre-commit run -a` on the `examples/*` projects in my pull request - NOTE: this also applied changes to files in other examples within the project, which explains the changes made outside of my added example.