TL;DR
Combining 2:4 sparsity with quantization is a powerful way to compress large language models (LLMs) for efficient deployment, balancing accuracy with hardware-accelerated performance. However, enhanced tool support in GPU libraries and programming interfaces is essential to fully realize its potential.
Overview of LLM Compression Techniques
Despite their success in natural language understanding and generation, large language models (LLMs) are often prohibitively expensive to run due to their massive parameter counts. This leads to significant memory overhead and high inference costs, particularly during deployment. To address these challenges, model compression techniques, such as quantization and pruning, have emerged, aiming to reduce inference costs while preserving model accuracy as much as possible, though often with trade-offs compared to their dense counterparts.
Quantization: Although high-precision formats are essential during training, LLMs can often retain their accuracy at inference time using much lower bitwidths. Quantizing LLMs to 8-bit integer or floating-point formats is relatively straightforward, and recent methods like GPTQ and AWQ demonstrate promising accuracy even at 4-bit precision. However, pushing below 4 bits remains challenging: methods like AQLM often suffer from inference slowdowns on modern GPUs, while others like QUIP# rely on complex, custom preprocessing kernels. These limitations suggest that quantization alone may not suffice for aggressive compression, prompting the need for complementary techniques such as sparsity.
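For concreteness, here is a minimal sketch of round-to-nearest absmax weight quantization (the AbsMax scheme referenced in the tables below); GPTQ and AWQ add error-compensating updates and activation-aware scaling on top of ideas like this. The function names and the 128-element group size are illustrative choices, not taken from any of the cited methods.

```python
import torch

def absmax_quantize(weight: torch.Tensor, bits: int = 4, group_size: int = 128):
    """Round-to-nearest absmax quantization with one scale per contiguous group."""
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed integers
    # Pick the scale so the largest magnitude in each group maps to qmax.
    scale = w.abs().amax(dim=-1, keepdim=True).clamp_min(1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q.reshape(out_features, in_features), scale

def absmax_dequantize(q: torch.Tensor, scale: torch.Tensor, group_size: int = 128):
    out_features, in_features = q.shape
    w = q.reshape(out_features, in_features // group_size, group_size) * scale
    return w.reshape(out_features, in_features)

w = torch.randn(4096, 4096)
q, scale = absmax_quantize(w, bits=4)
w_hat = absmax_dequantize(q, scale)
print("mean abs quantization error:", (w - w_hat).abs().mean().item())
```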
Unstructured Sparsity: Sparsity offers an orthogonal path for compressing LLMs when the benefits of quantization begin to plateau. Unstructured sparsity, where non-zero elements can appear anywhere in a matrix, allows models to retain high accuracy even with up to 50% of weights pruned. Recent methods like SparseGPT and Wanda enable such pruning with minimal degradation in performance. However, despite its compression benefits, unstructured sparsity is difficult to accelerate on modern GPUs due to its irregular memory access patterns. Most hardware-optimized methods, such as FlashLLM, only deliver inference speedups at extreme sparsity levels (typically 80% or more). This gap between accuracy and hardware efficiency motivates the use of semi-structured sparsity formats like 2:4, which offer a better trade-off between performance and deployability.
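As a simplified illustration of unstructured 50% pruning, the sketch below zeroes out the smallest-magnitude weights of a layer. SparseGPT and Wanda use calibration-aware criteria and weight updates rather than plain magnitude, so treat this only as a baseline sketch; `magnitude_prune` is a hypothetical helper, not an API of either method.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so that `sparsity` of the weights are zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # The k-th smallest magnitude acts as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)

w = torch.randn(4096, 4096)
w_sparse = magnitude_prune(w, sparsity=0.5)
print("fraction of zeros:", (w_sparse == 0).float().mean().item())  # ~0.5
```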
Semi-structured Sparsity: Semi-structured sparsity formats, such as the 2:4 sparsity supported by NVIDIA and AMD GPUs, offer a promising balance between compression and speedup by aligning the sparsity pattern with what the hardware can accelerate. While semi-structured sparsity constrains where weights can be pruned, recent methods like MaskLLM use learnable masks to recover accuracy, achieving performance comparable to unstructured sparsity. Additionally, research demonstrates that sparse matrix multiplications, particularly those with predictable operand patterns such as runs of zeros, can significantly reduce GPU power consumption by minimizing transistor switching, improving energy efficiency during inference. This makes 2:4 sparsity a practical alternative for deploying compressed LLMs, especially when combined with other techniques like quantization.
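To make the 2:4 constraint concrete, the sketch below keeps the two largest-magnitude values in every contiguous group of four along the input dimension. This is the simplest, magnitude-based way to impose the pattern; MaskLLM instead learns the masks end-to-end. `to_2_4_sparse` is an illustrative helper name, not part of any cited method.

```python
import torch

def to_2_4_sparse(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in each group of 4 along the last dimension."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the top-2 magnitudes within every group of 4.
    keep = groups.abs().topk(k=2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_features, in_features)

w = torch.randn(4096, 4096)
w_24 = to_2_4_sparse(w)
# Exactly 2 non-zeros per group of 4, i.e. 50% sparsity with a hardware-friendly structure.
print("fraction of zeros:", (w_24 == 0).float().mean().item())  # 0.5
```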
Sparsity in Pretraining: Although this post focuses on reducing inference costs, it is worth noting that sparsity is also a powerful tool for training. Weight sparsity in pretraining has been explored in previous work such as SLoPe and FST, and a recent contribution from the PyTorch team demonstrates that 2:4 weight sparsity can accelerate training without any loss in model quality. Furthermore, recent work from Meta has shown that activation sparsity can preserve model accuracy while accelerating both training and inference of LLMs. This body of work underscores that sparsity is a fundamental tool for the entire model lifecycle. Having established its value in training, we now turn our focus to quantifying its impact on inference, where combining sparsity with quantization provides a powerful solution to today’s deployment challenges.
Sparsity vs. Quantization at Inference
To empirically compare standalone quantization against quantization combined with sparsity, we conducted experiments on the LLaMA-2 7B model. Our goal was to evaluate both approaches at an equivalent theoretical 8× compression ratio, specifically comparing 2-bit quantization against 4-bit quantization combined with 50% sparsity (in both unstructured and 2:4 formats). Ignoring metadata and scale overheads, 4-bit weights at 50% density amount to roughly 4 × 0.5 = 2 effective bits per weight, the same budget as dense 2-bit quantization and an 8× reduction from 16-bit weights. The compression ratios reported in the tables below are model sizes relative to the FP16 dense baseline (dense = 1.0).
Our experiment leverages state-of-the-art methods to represent each strategy. For pure sub-4-bit quantization, we selected AQLM and QUIP#. For the hybrid approach, we used SparseGPT for unstructured sparsity and MaskLLM for the hardware-friendly 2:4 structured format, both combined with 4-bit quantization. Finally, to showcase the power of composing techniques, we also applied SLiM, a zero-shot low-rank adapter, on top of the sparse models to measure the potential for further accuracy recovery.
Our experiments on LLaMA-2-7B demonstrate that combining 4-bit quantization with 50% sparsity consistently outperforms standalone 2-bit quantization in accuracy, despite both achieving an equivalent theoretical 8× compression ratio. Among sparse methods, 2:4 structured sparsity, especially when enhanced with low-rank adapters like SLiM, not only preserves accuracy but also benefits from hardware acceleration support on modern GPUs. This makes 2:4 sparsity a particularly compelling choice, not just for model accuracy but also for practical deployment on existing GPU hardware.
LLaMA-2-7B
| Quantization | Pruning | Bitwidth | Sparsity | Compression Ratio | ArcC | ArcE | PiQA | Wino | Average |
|---|---|---|---|---|---|---|---|---|---|
| – | Dense | 16 | – | 1.0 | 40.0 | 69.3 | 78.5 | 67.3 | 63.8 |
| AQLM | – | 2 | – | 0.18 | 33.6 | 62.8 | 73.5 | 64.6 | 58.6 |
| QUIP# | – | 2 | – | 0.18 | 34.6 | 64.6 | 75.1 | 64.9 | 59.8 |
| GPTQ | SparseGPT* | 4 | Unstructured | 0.18 | 35.3 | 68.1 | 74.2 | 67.7 | 61.3 |
| AbsMax | MaskLLM** | 4 | 2:4 | 0.18 | 33.2 | 68.4 | 74.5 | 65.0 | 60.3 |
| AbsMax | MaskLLM + SLiM-LoRA (r=0.1) | 4 | 2:4 | 0.22 | 38.0 | 70.9 | 77.2 | 70.6 | 64.2 |
* State-of-the-art unstructured sparsity method.
** State-of-the-art 2:4 sparsity method.
While low-bit quantization offers compelling compression, its effectiveness can hit limits when pursuing aggressive compression ratios on more recent models. For instance, on LLaMA-3-8B, the 2-bit AQLM method only reaches a compression ratio of 0.25 (i.e., one quarter of the dense model size) while attempting to retain acceptable accuracy. (Note: QUIP# did not have an open-source checkpoint available for a direct comparison.) In contrast, combining sparsity, 4-bit quantization, and low-rank approximation achieves higher accuracy on the same LLaMA-3-8B model at the same 0.25 compression ratio, as shown in the table below. This example underscores that relying solely on quantization can limit both aggressive compression and accuracy in contemporary LLMs, and it highlights the need for robust hardware and software tools to deploy such combined compression techniques effectively, which we explore in the next section.
LLaMA-3-8B
| Quantization | Pruning | Bitwidth | Sparsity | Compression Ratio | ArcC | ArcE | PiQA | Wino | Average |
|---|---|---|---|---|---|---|---|---|---|
| – | Dense | 16 | – | 1.0 | 50.4 | 80.1 | 79.7 | 72.6 | 70.7 |
| AQLM | – | 2 | – | 0.25 | 41.3 | 74.3 | 77.8 | 72.0 | 66.4 |
| AbsMax | MaskLLM + SLiM-LoRA (r=0.2) | 4 | 2:4 | 0.25 | 42.9 | 75.2 | 77.8 | 71.2 | 66.8 |
Available Tools for Model Acceleration
Several GPU libraries now support 2:4 sparsity with efficient matrix multiplication kernels, easing the deployment of compressed models. Notably, high-performance support for 2:4 sparsity using standard data types is now available through both cuSPARSELt and the CUTLASS template library. The torchao team has integrated these kernels into the PyTorch framework, eliminating the need for custom CUDA or C++ extensions and streamlining adoption.
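As a sketch of how little code the PyTorch-native path requires, the example below (adapted from the pattern shown in PyTorch's 2:4 sparsity documentation) converts a 2:4-pruned weight into the compressed semi-structured format and runs an accelerated linear layer. It assumes PyTorch 2.1+ and an NVIDIA GPU with sparse tensor cores (Ampere or newer); the shapes and the fixed [0, 0, 1, 1] mask tile are purely illustrative.

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor

# Optionally force the CUTLASS-backed kernels instead of cuSPARSELt (private flag
# used in the PyTorch tutorial; behavior may vary across versions).
SparseSemiStructuredTensor._FORCE_CUTLASS = True

# A valid 2:4 pattern: two non-zeros in every group of four input weights.
mask = torch.tensor([0, 0, 1, 1], dtype=torch.bool, device="cuda").tile((3072, 2560))
linear = torch.nn.Linear(10240, 3072).half().cuda().eval()

# Zero out the pruned weights, then compress them into the 2:4 semi-structured format.
linear.weight = torch.nn.Parameter(
    to_sparse_semi_structured(linear.weight.masked_fill(~mask, 0))
)

x = torch.rand(3072, 10240, dtype=torch.float16, device="cuda")
with torch.inference_mode():
    y = linear(x)  # dispatches to the 2:4 sparse matmul kernel under the hood
print(y.shape)  # torch.Size([3072, 3072])
```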
However, despite this existing support, these libraries have several limitations, particularly when it comes to extending kernel support for hybrid compression. A significant gap is the lack of fused quantization and dequantization operations, which are critical for minimizing memory bandwidth usage and latency in modern compression pipelines. Current sparse and quantized inference methods often have to load tensors with differing or incompatible data types into shared memory, explicitly dequantize them, and cast them to a common format before an efficient matrix multiplication can run on the tensor cores (the sketch below illustrates this non-fused pattern). Furthermore, the scarce documentation of the 2:4 sparsity metadata layout in CUTLASS and cuSPARSELt makes writing custom CUDA kernels that combine quantization and sparsity a time- and labor-intensive endeavor. As a result, 2:4 sparsity remains unsupported in most custom quantization kernels, preventing users from fully leveraging the accuracy gains achieved on the modeling side and slowing the development of new compression techniques.
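To illustrate why the lack of fusion matters, the sketch below shows the non-fused pattern current kernels force: a quantized weight is fully materialized in fp16 in global memory before the matmul, costing extra memory traffic that a fused kernel would avoid by dequantizing tiles on the fly. The int8 storage, group size, and shapes are hypothetical and do not reflect the layout of any specific library.

```python
import torch

# Hypothetical 4-bit weights stored as int8 values plus per-group fp16 scales.
out_f, in_f, group = 4096, 4096, 128
q_weight = torch.randint(-8, 8, (out_f, in_f), dtype=torch.int8, device="cuda")
scales = torch.rand(out_f, in_f // group, 1, dtype=torch.float16, device="cuda")
x = torch.randn(16, in_f, dtype=torch.float16, device="cuda")

# Non-fused path: explicitly dequantize the whole weight to fp16 first ...
w_fp16 = (
    q_weight.reshape(out_f, in_f // group, group).to(torch.float16) * scales
).reshape(out_f, in_f)

# ... and only then run the matmul on tensor cores.
y = x @ w_fp16.t()

# A fused sparse-quantized kernel would instead keep the compressed weight in global
# memory and dequantize per-tile in shared memory inside the matmul itself.
print(y.shape)  # torch.Size([16, 4096])
```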
To address these challenges, external libraries such as SGLang and vLLM have added custom quantization kernels like Sparse Marlin. These kernels add sparsity support to quantization methods and are designed to integrate smoothly with PyTorch, offering users a more plug-and-play experience for sparse-quantized inference. However, these solutions are narrowly targeted and do not cover all use cases: they typically support only a restricted range of data types (e.g., W4A16 in the case of Sparse Marlin) and quantization schemes (e.g., 1-D group quantization). Because they are implemented in highly specialized custom CUDA code, they are also inherently difficult to extend or adapt to new methods. Compounding this, the maintenance overhead of porting such kernels to newer hardware generations means these libraries often lag behind, supporting only older architectures; for instance, Sparse Marlin’s high performance remains limited to Ampere GPUs.
A further challenge shared by all of the methods above is the preprocessing overhead required to prepare matrices and their associated metadata, as the timing sketch below illustrates. This cost largely confines them to static sparsity and quantization, limiting their applicability in the dynamic or adaptive compression scenarios that future LLMs will need. In response, the PyTorch team has developed custom kernels that substantially reduce this overhead for both weight sparsity and activation sparsity. Comprehensive quantization support within these new kernels, however, is still under development and remains a key area for continued work.
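One rough way to see the preprocessing cost in practice is to time the compression step against the matmul it enables, as in the hedged sketch below (PyTorch 2.1+ and an Ampere-or-newer GPU assumed; absolute numbers depend heavily on hardware and shapes). The point is simply that a non-trivial one-time conversion makes these kernels awkward for dynamic or adaptive sparsity.

```python
import torch
import torch.nn.functional as F
from torch.sparse import to_sparse_semi_structured

def time_cuda_ms(fn, iters=20):
    """Average CUDA time of fn() in milliseconds."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# A 2:4-pruned fp16 weight and an activation batch (illustrative shapes).
mask = torch.tensor([1, 1, 0, 0], dtype=torch.bool, device="cuda").tile((4096, 1024))
w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda").masked_fill(~mask, 0)
x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

w_sparse = to_sparse_semi_structured(w)
print("preprocessing (compress weight):", time_cuda_ms(lambda: to_sparse_semi_structured(w)), "ms")
print("sparse linear (reused weight):  ", time_cuda_ms(lambda: F.linear(x, w_sparse)), "ms")
```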
Conclusion
We believe that combining 2:4 sparsity with quantization holds substantial potential for pushing the boundaries of large language model compression. However, as demonstrated above, the current ecosystem of tools and GPU programming interfaces remains a significant limiting factor in realizing that potential. In particular, flexible GPU programming interfaces such as Triton and ThunderKittens currently lack robust, native support for 2:4 sparsity, and their integration with many quantization methods is still limited. Enhancing these tools to natively support 2:4 sparsity and diverse quantization schemes is therefore essential to unlock this potential and accelerate innovation in model compression.