vLLM is an open-source library for fast, easy-to-use LLM inference and serving. It optimizes hundreds of language models across diverse data-center hardware, including NVIDIA and AMD GPUs, Google TPUs, AWS Trainium, and Intel CPUs, using innovations such as PagedAttention, chunked prefill, multi-LoRA, and automatic prefix caching. It is designed to serve large-scale production traffic through an OpenAI-compatible server and offline batch inference, and it scales to multi-node deployments. As a community-driven project, vLLM collaborates with foundation model labs, hardware vendors, and AI infrastructure companies to develop cutting-edge features.
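As an illustration of the offline batch-inference path, the sketch below uses vLLM's Python API; the model name, prompts, and sampling settings are placeholders chosen for brevity rather than recommended values.

```python
from vllm import LLM, SamplingParams

# Offline batch inference with vLLM's Python API.
# "facebook/opt-125m" is only an illustrative model; any model
# supported by vLLM can be substituted.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "The capital of France is",
    "PagedAttention speeds up LLM serving because",
]

# generate() schedules the whole batch and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)
```

For online serving, the same model can instead be exposed through the OpenAI-compatible HTTP server (for example, via the `vllm serve` command in recent releases), so existing OpenAI client code can point at the local endpoint.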
The University of California, Berkeley contributed vLLM to the Linux Foundation in July 2024.