Accelerate Your AI: PyTorch 2.4 Now Supports Intel GPUs for Faster Workloads (Blog)
We have exciting news! PyTorch 2.4 now supports Intel® Data Center GPU Max Series and…
the PyTorch Team at Intel | August 29, 2024

Enabling Fast Gradient Clipping and Ghost Clipping in Opacus (Blog)
Introduction and Context: Differentially Private Stochastic Gradient Descent (DP-SGD) is the canonical method for training machine…
Enayat Ullah, Huanyu Zhang, Will Bullock, Ilya Mironov | August 20, 2024

FlexAttention: The Flexibility of PyTorch with the Performance of FlashAttention (Blog)
In theory, Attention is All You Need. In practice, however, we also need optimized attention…
Team PyTorch: Driss Guessous, Yanbo Liang, Joy Dong, Horace He | August 7, 2024

Quantization-Aware Training for Large Language Models with PyTorch (Blog)
In this blog, we present an end-to-end Quantization-Aware Training (QAT) flow for large language models…
Andrew Or, Jerry Zhang, Evan Smothers, Kartikay Khandelwal, Supriya Rao | July 30, 2024

PyTorch 2.4 Release Blog (Blog)
We are excited to announce the release of PyTorch® 2.4 (release note)! PyTorch 2.4 adds…
PyTorch Foundation | July 24, 2024

Deep Dive on the Hopper TMA Unit for FP8 GEMMs (Blog)
Abstract: The Hopper (H100) GPU architecture, billed as the “first truly asynchronous GPU”, includes a…
Adnan Hoque, Less Wright, Chih-Chieh Yang | July 22, 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (Blog)
Attention, as a core layer of the ubiquitous Transformer architecture, is a bottleneck for large…
Jay Shah and Ganesh Bikshandi (Colfax Research); Ying Zhang (Meta); Vijay Thakkar and Pradeep Ramani (NVIDIA); Tri Dao (TogetherAI and Princeton University) | July 11, 2024

Learn how to develop Android applications with ExecuTorch and Llama models (Blog)
This blog is courtesy of the PyTorch team at Arm. More details can be found here.…
Arm | July 10, 2024
Accelerated PyTorch inference with torch.compile on AWS Graviton processors (Blog)
Summary: Originally, PyTorch used an eager mode where each PyTorch operation that forms the model…
Sunita Nadampalli | July 9, 2024
Training MoEs at Scale with PyTorch (Blog)
Over the past year, Mixture of Experts (MoE) models have surged in popularity, fueled by…
Brian Chu, Mihir Patel, Less Wright, Vitaliy Chiley, Evan Racah, Wanchao Liang, Iris Zhang, Andrew Gu | June 23, 2024

Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity (Blog)
Over the past year, we’ve added support for semi-structured (2:4) sparsity into PyTorch. With just…
Jesse Cai, Daniel Haziza, Supriya Rao | June 20, 2024

Reducing Model Checkpointing Times by Over 10x with PyTorch Distributed Asynchronous Checkpointing (Blog)
Summary: With PyTorch distributed’s new asynchronous checkpointing feature, developed with feedback from IBM, we show how…
Meta: Lucas Pasqualin, Less Wright, Iris Zhang (PyTorch), Chien-Chin Huang; IBM Research: Swaminathan Sundararaman, Saransh Gupta, Raghu Ganti | June 12, 2024

INT4 Decoding GQA CUDA Optimizations for LLM Inference (Blog)
An efficient decoding Grouped-Query Attention with low-precision KV cache. Introduction: Generative AI has taken the…
Sarunya Pumma, Jongsoo Park, Jianyu Huang, Amy Yang, Jaewon Lee, Daniel Haziza, Grigory Sizov, Jeremy Reizenstein, Jeff Johnson, Ying Zhang | June 6, 2024

Maximizing Training Throughput Using PyTorch FSDP and Torch.compile (Blog)
Recently, we demonstrated how FSDP and selective activation checkpointing can be used to achieve 57% MFU…
Team PyTorch at IBM and Team PyTorch at Meta | May 21, 2024

Achieving Sustainability Goals with PyTorch and Intel AI (Blog)
This post was contributed by Intel AI in partnership with the PyTorch Foundation. In 2017,…
PyTorch Foundation | May 15, 2024
Speeding up ViTs using Block Sparsity (Blog)
TLDR: We show promising results of up to a 1.46x speedup with <2% drop in accuracy on float32…
Mostafa Elhoushi (FAIR at Meta); Syed Shakib Sarwar, Aaryan Kothapalli, Mia Kasperek, Barbara De Salvo (Sensors and Systems at Meta Reality Labs Research); Christian Puhrsch, Jesse Cai, Joe Isaacson (PyTorch at Meta); Andrew James, Pearu Peterson, Nikita Vedeneev (Quansight) | May 14, 2024
Introducing depyf: mastering torch.compile with ease (Community)
We are thrilled to introduce depyf, a new project to the PyTorch ecosystem designed to help…
Kaichao You | May 11, 2024

Deep Learning Energy Measurement and Optimization (Community)
This post is authored by Jae-Won Chung, a PhD student at the University of Michigan and…
Jae-Won Chung | May 11, 2024

A Hitchhiker’s Guide to Speculative Decoding (Blog)
Speculative decoding is an optimization technique for inference that makes educated guesses about future tokens…
Team PyTorch at IBM | May 2, 2024
Accelerating Llama3 FP8 Inference with Triton Kernels (Blog)
1.0 Summary: We present an optimized Triton FP8 GEMM (General Matrix-Matrix Multiply) kernel TK-GEMM, which…
Adnan Hoque, Less Wright, Chih-Chieh Yang | May 1, 2024