
Measuring Intelligence Summit at PyTorch Conference

October 1, 2025


The Measuring Intelligence Summit on October 21 in San Francisco, co-located with PyTorch Conference 2025, brings together experts in AI evaluation to discuss the critical question: how do we effectively measure intelligence in both foundation models and agentic systems? 

As AI systems become more capable and more widely deployed, evaluation methods must evolve just as rapidly. This half-day summit covers key topics such as evaluating reasoning models, superintelligence, and the evolution of AI benchmarks. Attendees will gain insight into state-of-the-art evaluation methods, explore the challenges of assessing AI capabilities, and join discussions, led by the experts driving this field, that will shape the future of AI evaluation.

Top 3 Reasons to Attend

  1. Engage with leading voices in AI evaluation – Hear directly from researchers at OpenAI, Stanford, Meta, and more as they share insights into the latest methods for evaluating reasoning, intelligence, and agentic behavior in advanced AI systems.
  2. Be part of shaping the future of benchmarks – From debates on whether benchmarks truly capture intelligence to discussions on practical, real-world evaluation, you’ll gain a front-row seat to conversations that will guide how our community measures progress in AI.
  3. Connect with the leaders driving innovation – The summit offers a unique opportunity to meet others working at the intersection of research and application, building networks that extend beyond the conference and into the broader AI ecosystem.

Program Highlights

Keynotes

  • Framing the Frontier of Machine Intelligence – Joe Spisak, Meta
  • A conversation on the state of the art in reasoning, planning, and inference-time scaling, and the new methods for measuring intelligence in this regime – Noam Brown, OpenAI, in conversation with Joe Spisak, Meta

Sessions

  • Weaver: Shrinking the Generation-Verification Gap with Weak Verifiers – Jon Saad-Falcon, Stanford
  • Holistic Evaluation of Language Models (HELM) – Yifan Mai, Stanford University
  • Scaling Agentic Intelligence from Pre-Training to RL – Aakanksha Chowdhery, Reflection AI & Stanford University
  • LMArena: The Reliability Standard for AI – Anastasios Angelopoulos, LMArena

Panels

Are We Measuring Intelligence or Just Benchmarks?

  • Sara Hooker
  • Vivienne Zhang, NVIDIA
  • Baber Abbasi, EleutherAI
  • Nathan Habib, Hugging Face
  • Carlos Jimenez, Princeton University / SWE-bench

Beyond the Leaderboard: Practical Intelligence in the Wild

  • Shishir Patil, Meta
  • Haifeng Xu, ProphetArena / UChicago
  • Tatiana Shavrina, Meta
  • Lisa Dunlap, UC Berkeley / LMSYS
  • Rebecca Qian, Patronus AI

Register by adding the Measuring Intelligence Summit to your PyTorch Conference registration.