
PyTorch Foundation Welcomes vLLM as a Hosted Project

The PyTorch Foundation is excited to welcome vLLM as a PyTorch Foundation-hosted project. Contributed by the University of California, Berkeley, vLLM is a high-throughput, memory-efficient inference and serving engine designed for LLMs. vLLM has always had a strong connection with the PyTorch project: it is deeply integrated with PyTorch, using it as a unified interface to support a wide array of hardware backends, including NVIDIA GPUs, AMD GPUs, Google Cloud TPUs, Intel GPUs, Intel CPUs, Intel Gaudi HPUs, and AWS Neuron, among others. This tight coupling with PyTorch ensures seamless compatibility and performance optimization across diverse hardware platforms.

The PyTorch Foundation recently announced its expansion into an umbrella foundation to accelerate AI innovation and is pleased to welcome vLLM as one of the first new projects. Foundation-Hosted Projects fall under this umbrella and are officially governed and administered under the PyTorch Foundation's neutral and transparent governance model.

What is vLLM?

Running large language models (LLMs) is both resource-intensive and complex, especially as these models scale to hundreds of billions of parameters. That’s where vLLM comes in. Originally built around the innovative PagedAttention algorithm, vLLM has grown into a comprehensive, state-of-the-art inference engine. A thriving community is also continuously adding new features and optimizations to vLLM, including pipeline parallelism, chunked prefill, speculative decoding, and disaggregated serving.
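
To make that concrete, here is a minimal offline-inference sketch using vLLM's Python API; the model id and sampling settings are illustrative placeholders rather than recommendations, so check the vLLM documentation for the options available in your installed version.

    # Minimal offline-inference sketch using vLLM's Python API.
    # The model id and sampling settings below are illustrative only.
    from vllm import LLM, SamplingParams

    prompts = [
        "Explain PagedAttention in one sentence.",
        "What does speculative decoding speed up?",
    ]

    # Decoding parameters; tune these for your use case.
    sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

    # The LLM engine loads the model and manages KV-cache memory with PagedAttention.
    llm = LLM(model="facebook/opt-125m")

    # generate() batches and schedules requests for high-throughput inference.
    outputs = llm.generate(prompts, sampling_params)
    for output in outputs:
        print(output.prompt, "->", output.outputs[0].text)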

Since its release, vLLM has garnered significant attention, earning more than 46,500 GitHub stars and attracting over 1,000 contributors, a testament to its popularity and thriving community. This milestone marks an exciting chapter for vLLM as we continue to empower developers and researchers with cutting-edge tools for efficient and scalable AI deployment. Welcome to the next era of LLM inference!

Key features of vLLM include (a short configuration sketch follows the list):

  • Extensive Model Support: Powers 100+ LLM architectures with multi-modal capabilities for image and video, while supporting specialized architectures like sparse attention, Mamba, BERT, Whisper, embedding, and classification models.
  • Comprehensive Hardware Compatibility: Runs on NVIDIA GPUs through Blackwell, with official support for AMD, Google TPU, AWS Neuron, Intel CPU/XPU/HPU, and ARM. Third-party accelerators like IBM Spyre and Huawei Ascend easily integrate via our plugin system.
  • Highly Extensible: Enables custom model implementations, hardware plugins, torch.compile optimizations, and configurable scheduling policies to match your specific needs.
  • Optimized for Response Speed: Delivers minimal latency through speculative decoding, quantization, prefix caching, and CUDA graph acceleration.
  • Engineered for Maximum Throughput: Achieves peak performance with tensor/pipeline parallelism and specialized kernels.
  • Seamless RLHF Integration: Provides first-class support for reinforcement learning from human feedback and common post-training frameworks.
  • Enterprise-Scale Distributed Inference: Enables cluster-wide scaling through KV cache offloading, intelligent routing, and prefill-decode disaggregation.
  • Production-Hardened: Delivers enterprise-grade security, comprehensive observability, and battle-tested operational reliability.
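
As a hedged illustration of how a few of these features are switched on in practice, the sketch below enables tensor parallelism and prefix caching through vLLM's engine arguments. The argument names reflect recent vLLM releases and the model id is a placeholder; consult the documentation for your installed version before relying on them.

    # Hedged sketch: enabling a few of the features listed above via engine arguments.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model id
        tensor_parallel_size=2,          # shard the model across two GPUs
        enable_prefix_caching=True,      # reuse KV cache across shared prompt prefixes
        gpu_memory_utilization=0.90,     # fraction of GPU memory reserved for the engine
    )

    outputs = llm.generate(
        ["Summarize the benefits of paged KV-cache management."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)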

Accelerating Open Source AI Together

By becoming a PyTorch Foundation project, vLLM will collaborate closely with the PyTorch team on feature development. For example:

  • vLLM will ensure its code runs against PyTorch nightly builds, and the PyTorch team will monitor the results to ensure all tests pass.
  • The PyTorch team is enhancing torch.compile and FlexAttention support for vLLM.
  • The two projects will collaborate closely on, and support, PyTorch-native libraries such as TorchTune, TorchAO, and FBGEMM.

The partnership creates significant advantages for both vLLM and PyTorch core. vLLM gains a committed steward in the Foundation, ensuring long-term codebase maintenance, production stability, and transparent community governance. Meanwhile, PyTorch benefits as vLLM expands PyTorch adoption across diverse accelerator platforms and drives innovation in cutting-edge features that strengthen the entire ecosystem.