On September 17, 2025, PyTorch ATX partnered with the vLLM community and Red Hat to host “The Future of Inferencing” at Capital Factory’s Voltron room in downtown Austin. The gathering brought together leading experts working on vLLM—including core committers, project creators, and deployment specialists—to explore cutting-edge techniques powering modern LLM inference at scale and to strengthen Austin’s growing inference optimization community.

Over 90 attendees filled the Voltron room for technical deep-dives into high-throughput LLM serving. Topics spanned INT4/INT8 quantization, pruning strategies, PagedAttention memory management, continuous batching, speculative decoding, and multi-node deployment architectures.

Jason Meaux kicked off the evening with updates on PyTorch ATX member projects, highlighting local work on diffusion models, NanoGPT speedruns using the Muon optimizer, state space models, BERT classification, and the robotics paper club.

Steve Watt, a PyTorch ambassador, introduced vLLM and walked through two hands-on demos showing how to deploy it on AWS with NVIDIA hardware and on the AMD Developer Cloud.
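For readers who want to try the demos themselves, here is a minimal offline-inference sketch using vLLM’s Python API. The model name is illustrative (any vLLM-supported checkpoint works), and the same code runs on NVIDIA (CUDA) or AMD (ROCm) builds of vLLM:

```python
from vllm import LLM, SamplingParams

# Load a model; vLLM handles KV-cache paging and batching internally.
# The checkpoint name is illustrative; substitute any supported model.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain paged attention in one paragraph."], params)

for out in outputs:
    print(out.outputs[0].text)
```

For serving rather than offline batch inference, recent vLLM releases also ship a CLI (`vllm serve <model>`) that exposes an OpenAI-compatible HTTP endpoint.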

Luka Govedič, a vLLM core committer, presented an intermediate-level session on PagedAttention, quantization approaches, speculative decoding, and continuous batching. He also previewed his recent work on torch.compile integration with vLLM.
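As a rough sketch of how some of those techniques surface in vLLM’s Python API: PagedAttention and continuous batching are always on and need no flags, while quantization is a constructor argument. Parameter names follow recent vLLM releases and the checkpoint is illustrative, so check the docs for your version:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",  # illustrative pre-quantized checkpoint
    quantization="awq",               # 4-bit weight quantization
    gpu_memory_utilization=0.90,      # fraction of VRAM for weights + KV-cache pages
    max_num_seqs=256,                 # upper bound on continuously batched sequences
)
# Speculative decoding is configured through a separate, version-dependent
# option; see the vLLM documentation for the release you are running.
```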

Huamin Chen, creator of vLLM Semantic Router (boasting over 1,700 GitHub stars), explained his intent-aware “mixture-of-models” router. The system uses ModernBERT to semantically classify incoming requests and route them to the most suitable model or reasoning path, making inference serving more cost-effective and accurate.
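The routing idea is easy to picture with a toy example. The sketch below is a hypothetical stand-in only: the real Semantic Router uses a fine-tuned ModernBERT classifier, whereas the keyword matcher, intent labels, and model names here are invented for illustration:

```python
# Toy intent-aware "mixture-of-models" router (hypothetical names throughout).
ROUTES = {
    "code": "qwen-coder-32b",   # imaginary code-specialist model
    "math": "deepseek-r1",      # imaginary long-reasoning model
    "chat": "llama-3.1-8b",     # imaginary low-cost generalist
}

def classify_intent(prompt: str) -> str:
    """Stand-in for a fine-tuned ModernBERT sequence classifier."""
    text = prompt.lower()
    if any(kw in text for kw in ("def ", "compile", "stack trace")):
        return "code"
    if any(kw in text for kw in ("prove", "integral", "solve")):
        return "math"
    return "chat"

def route(prompt: str) -> str:
    return ROUTES[classify_intent(prompt)]

print(route("Solve the integral of x^2 dx"))  # -> deepseek-r1
```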

Greg Pereira, an llm-d maintainer, explored distributed inference challenges through the llm-d architecture and its schedulers. His closing demo showed KV cache management and prefill/decode disaggregation in action.
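Conceptually, disaggregation splits serving into a compute-bound prefill stage and a memory-bound decode stage running on separate workers, with the KV cache handed off between them. The sketch below is not llm-d code, just a minimal picture of that hand-off:

```python
# Conceptual prefill/decode disaggregation (stand-in values, not llm-d code).

def prefill_worker(prompt_tokens: list[int]) -> dict:
    """Process the full prompt in one compute-heavy pass; emit the KV cache."""
    return {"kv_cache": [t * 2 for t in prompt_tokens]}  # stand-in for tensors

def decode_worker(state: dict, max_new_tokens: int) -> list[int]:
    """Generate token by token against the transferred KV cache."""
    generated = []
    for step in range(max_new_tokens):
        generated.append(step)          # stand-in for sampling a real token
        state["kv_cache"].append(step)  # the cache grows one entry per token
    return generated

state = prefill_worker([101, 2023, 102])       # runs on the prefill pool
print(decode_worker(state, max_new_tokens=4))  # runs on the decode pool
```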

All session videos can be found here. Attendees left with both conceptual frameworks and actionable strategies for building production-ready inference systems.

Looking ahead, we’re preparing our next major gathering in Austin: the Robotics & Edge Inference Conference in February 2026! We’ll cover the complete stack from microcontrollers to Jetson modules, including compilers & runtimes, ROS 2, 3D perception, navigation, and diffusion policies, featuring live demos from Austin’s leading robotics companies. Sign up here.