# Rotation-based Quantization for LLMs
Rotation-based quantization is a family of techniques where you apply an orthogonal transformation (typically a randomized Hadamard transform) to weight or activation vectors before quantizing them. The goal is to smooth out outlier elements that would otherwise dominate the quantization range. If you quantize vectors with large outliers directly, the scale is set by the extremes: most values get crushed into a narrow band (often rounding to zero, leaving an effectively sparse quantized vector) while the few extreme values waste the dynamic range. Because a rotation mixes every input coordinate into every output coordinate, the energy of the dominant elements gets spread across all entries, making the distribution much more quantization-friendly.
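The effect is easy to demonstrate. The numpy sketch below injects one outlier into a Gaussian vector and compares round-trip error of a simple 4-bit absmax quantizer before and after a randomized Hadamard transform (the quantizer and dimensions are illustrative, not taken from any particular paper):

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # orthonormal rows/columns

def quant_error(v, bits=4):
    # Symmetric uniform ("absmax") quantization round-trip MSE.
    scale = np.abs(v).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(v / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return np.mean((v - q * scale) ** 2)

rng = np.random.default_rng(0)
d = 64
x = rng.normal(size=d)
x[3] = 50.0  # inject a large outlier

# Randomized Hadamard transform: random sign flips, then Hadamard.
s = rng.choice([-1.0, 1.0], size=d)
Hx = hadamard(d) @ (s * x)

print(quant_error(x), quant_error(Hx))  # rotated error is much smaller
```

The rotation is orthogonal, so the vector's norm is unchanged; only the shape of the distribution improves.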
The foundational work for LLMs is [QuaRot](https://arxiv.org/abs/2404.00456) (NeurIPS 2024), which showed that randomized Hadamard rotations can be fused into the weight matrices of a transformer without changing the model's output — exploiting a computational invariance property. QuaRot demonstrated end-to-end 4-bit quantization of all weights, activations, and the KV cache, retaining 99% of zero-shot accuracy on LLaMA-2 models. Related earlier work includes QuIP# (Hadamard incoherence processing) and SliceGPT (the computational invariance idea).
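The invariance involved can be stated in two lines of algebra: for any orthogonal matrix Q, inserting Q Qᵀ = I between a weight and its input leaves the layer's output unchanged, so the rotation can be folded into the stored weights. A toy sketch (not QuaRot's actual fusion code):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
W = rng.normal(size=(d, d))   # a linear layer's weight
x = rng.normal(size=d)        # its input activation

# Any orthogonal Q satisfies Q.T @ Q = I, so it can be inserted for free.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix

y_original = W @ x
y_rotated = (W @ Q) @ (Q.T @ x)  # fused weight W@Q sees rotated input Q.T@x

print(np.allclose(y_original, y_rotated))  # True
```

In practice the fused weight `W @ Q` is precomputed offline, and only the rotated (outlier-free) activations are quantized at runtime.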
Since QuaRot, there have been many rotation-based quantization publications. Two notable recent ones are Google's [TurboQuant](https://arxiv.org/abs/2504.19874) (ICLR 2026) and [RotorQuant](https://www.scrya.com/rotorquant/) (a community project, March 2026). TurboQuant is a two-stage online vector quantization method. The first stage, [PolarQuant](https://arxiv.org/abs/2502.02617), randomly rotates input vectors and applies MSE-optimal Lloyd-Max scalar quantizers per coordinate (an MSE-optimal quantizer is one that minimizes the mean squared error between the original values and their quantized reconstructions). The second stage applies a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to the residual, correcting the bias that MSE-optimal quantizers introduce in inner product estimation, which is critical because attention scores are dot products.
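Lloyd-Max codebooks can be found with Lloyd's alternating algorithm: assign each sample to its nearest codeword, then move each codeword to the mean of its cell. The sketch below is a generic scalar version for illustration, not TurboQuant's implementation; the Gaussian input reflects the fact that post-rotation coordinates are approximately Gaussian:

```python
import numpy as np

def lloyd_max(samples, levels=4, iters=50):
    # Alternate nearest-codeword assignment with centroid updates;
    # converges to a local minimum of the mean squared error.
    c = np.quantile(samples, np.linspace(0.1, 0.9, levels))  # init codebook
    for _ in range(iters):
        idx = np.argmin(np.abs(samples[:, None] - c[None, :]), axis=1)
        for k in range(levels):
            if np.any(idx == k):
                c[k] = samples[idx == k].mean()
    return np.sort(c)

rng = np.random.default_rng(2)
data = rng.normal(size=10_000)           # coordinates after random rotation
codebook = lloyd_max(data, levels=4)     # 2-bit (4-level) quantizer

idx = np.argmin(np.abs(data[:, None] - codebook[None, :]), axis=1)
mse = np.mean((data - codebook[idx]) ** 2)
print(codebook, mse)  # 2-bit Gaussian optimum is roughly ±0.45, ±1.51
```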
TurboQuant is data-oblivious (no calibration needed) and achieves near-optimal distortion rates. RotorQuant extends this by replacing TurboQuant's dense 128×128 rotation matrix with small Clifford rotors from the geometric algebra Cl(3,0), splitting each 128-dim vector into groups of 3 and rotating each group independently. This reduces parameter count by ~44× and speeds up the rotation step 10-19× on NVIDIA GPUs, while matching TurboQuant's attention fidelity.
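Since a rotor in Cl(3,0) acts on vectors as an ordinary 3D rotation, the block-diagonal scheme can be sketched with 3×3 rotation matrices. Note the assumptions: how RotorQuant handles the two leftover dimensions of a 128-dim vector (128 is not divisible by 3) is not specified here, so this sketch simply leaves them unrotated:

```python
import numpy as np

def random_rotation_3d(rng):
    # A Cl(3,0) rotor applied to a vector is an ordinary 3D rotation;
    # sample one via QR, fixing the determinant to +1.
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] *= -1.0
    return Q

rng = np.random.default_rng(3)
d = 128
x = rng.normal(size=d)

# Block-diagonal rotation: independent 3D rotation per group of 3.
# (Assumption: the 128 % 3 == 2 leftover dims are left unrotated.)
y = x.copy()
for i in range(0, d - d % 3, 3):
    y[i:i + 3] = random_rotation_3d(rng) @ x[i:i + 3]

print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))  # norm preserved
```

Storage-wise, a dense 128×128 rotation holds 16,384 entries, while 42 independent 3×3 blocks hold 378, which is consistent with the ~44× parameter reduction cited above.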
As of April 2026, rotation-based KV cache quantization has not yet been merged into major inference engines like vLLM or SGLang, though both have active pull requests for TurboQuant support. This reflects the difficulty of integrating novel quantization schemes into the bespoke CUDA/Triton kernel stacks these engines rely on. Once adopted, methods like TurboQuant promise ~6× KV cache memory reduction at 3-bit precision with near-zero quality loss. This is significant because the KV cache is the dominant memory consumer during long-context LLM inference.
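For scale, a back-of-the-envelope calculation with Llama-2-7B-style dimensions (32 layers, 32 KV heads, head dimension 128, fp16, batch size 1) shows why the KV cache dominates at long context:

```python
# KV cache size at fp16 for Llama-2-7B-like dims; at long context this
# rivals or exceeds the ~13 GB of fp16 weights themselves.
layers, kv_heads, head_dim = 32, 32, 128
bytes_fp16 = 2            # bytes per element
seq_len, batch = 32_768, 1

per_token = 2 * layers * kv_heads * head_dim * bytes_fp16  # K and V
total = per_token * seq_len * batch
print(per_token / 1024, total / 2**30)  # 512.0 KiB/token, 16.0 GiB total
```

At 3-bit precision that 16 GiB footprint shrinks by roughly the factor the paper cites, which is what makes rotation-based KV cache quantization attractive despite the kernel integration cost.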