2026

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
kv-cache
quantization

Abstract

The growing context lengths and batch sizes in LLM inference are pushing memory capacity to its limits. To address this, we propose XQuant, a method that trades increased computation for reduced memory operations. Instead of caching Keys and Values directly, XQuant quantizes and caches the layer input activations X, and then rematerializes the Keys and Values on the fly during inference. This achieves up to ~7.7× memory savings with <0.1 perplexity degradation compared to the FP16 baseline. An extended variant, XQuant-CL, exploits cross-layer similarity for up to 10× memory savings relative to the FP16 baseline with only 0.01 perplexity degradation.
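The idea in the abstract, caching quantized X and recomputing Keys and Values when they are needed, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `XCache`, `quantize_row`, `dequantize_row`, and `matmul` are hypothetical, and the per-token min/max uniform quantizer is an assumption chosen for brevity, since the abstract does not specify the quantization scheme. Plain Python lists stand in for tensors.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def quantize_row(row, bits):
    """Uniform asymmetric quantization of one token's activation vector.

    Returns integer codes plus the (scale, offset) needed to undo them.
    Per-token min/max scaling is an assumption of this sketch.
    """
    lo, hi = min(row), max(row)
    levels = (1 << bits) - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Clamp to guard against floating-point overshoot at the extremes.
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in row]
    return codes, scale, lo


def dequantize_row(codes, scale, lo):
    """Map integer codes back to approximate floating-point activations."""
    return [c * scale + lo for c in codes]


class XCache:
    """Caches quantized layer inputs X instead of Keys and Values."""

    def __init__(self, w_k, w_v, bits=4):
        self.w_k = w_k      # key projection, shape (d_model, d_k)
        self.w_v = w_v      # value projection, shape (d_model, d_v)
        self.bits = bits
        self.entries = []   # one (codes, scale, offset) tuple per token

    def append(self, x_row):
        """Quantize and store the layer input for one new token."""
        self.entries.append(quantize_row(x_row, self.bits))

    def rematerialize(self):
        """Recompute approximate K and V from the cached, quantized X."""
        x_hat = [dequantize_row(*entry) for entry in self.entries]
        return matmul(x_hat, self.w_k), matmul(x_hat, self.w_v)
```

In this sketch each decoding step pays for two extra matrix multiplications in `rematerialize`, which is the compute-for-memory trade the abstract describes. For standard multi-head attention, X has the same per-token size as K or V alone, so storing only X halves the number of cached values before any quantization is applied.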

Related Publications

Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

TMLR
2026
search
transformer

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

EACL
2026
speculative-decoding
vision-language

Draft-based Approximate Inference for LLMs

ICLR
2026
speculative-decoding
kv-cache