vLLM is an open-source library for fast, easy-to-use LLM inference and serving. It optimizes hundreds of language models across diverse data-center hardware, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel CPUs, using innovations such as PagedAttention, chunked prefill, multi-LoRA, and automatic prefix caching. It is designed to serve large-scale production traffic through an OpenAI-compatible server and offline batch inference, and it scales to multi-node deployments. As a community-driven project, vLLM collaborates with foundation model labs, hardware vendors, and AI infrastructure companies to develop cutting-edge features.
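As an illustration of the offline batch-inference path, the sketch below uses vLLM's Python API; the model name, prompts, and sampling settings are placeholders chosen for brevity rather than recommended values.

```python
from vllm import LLM, SamplingParams

# Offline batch inference with vLLM's Python API.
# "facebook/opt-125m" is only an illustrative model; any model
# supported by vLLM can be substituted.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "PagedAttention speeds up LLM serving because",
]

# generate() schedules the whole batch and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the same model can instead be exposed through the OpenAI-compatible HTTP server (for example, via the `vllm serve` command in recent releases), so existing OpenAI client code can point at the local endpoint.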
The University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.