
2026

XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization

Aditya Tomar, Coleman Hooper, Minjae Lee, Haocheng Xi, Rishabh Tiwari, Wonjun Kang, Luca Manolache, Michael W. Mahoney, Kurt Keutzer, Amir Gholami

Preprint

kv-cache

quantization

Abstract

Growing context lengths and batch sizes in LLM inference are pushing memory capacity to its limits. To address this, we propose XQuant, a method that trades increased computation for reduced memory operations. Instead of caching Keys and Values directly, XQuant quantizes and caches the layer input activations X, then rematerializes the Keys and Values on the fly during inference. This achieves up to ~7.7× memory savings with <0.1 perplexity degradation compared to the FP16 baseline. An extended variant, XQuant-CL, exploits cross-layer similarity for up to 10× memory savings relative to the FP16 baseline with only 0.01 perplexity degradation.
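The core idea in the abstract, caching a quantized X and recomputing the Keys and Values from it, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the per-token asymmetric 4-bit quantizer, the random projection matrices `W_k` and `W_v`, the tensor sizes, and the helper names `quantize` and `dequantize` are all assumptions chosen for illustration. The sketch also ignores positional encodings, multiple heads, and the cross-layer scheme used by XQuant-CL.

```python
import numpy as np

def quantize(x, bits=4):
    """Per-token (per-row) asymmetric uniform quantization.

    Assumed scheme for illustration only. Codes are stored in uint8 here;
    a real kernel would pack two 4-bit codes per byte.
    """
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    levels = 2**bits - 1
    # Guard against a constant row, which would give a zero scale.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
seq_len, d_model = 16, 64  # hypothetical sizes

# Layer input activations and (hypothetical) Key/Value projection weights.
X = rng.standard_normal((seq_len, d_model)).astype(np.float32)
W_k = rng.standard_normal((d_model, d_model)).astype(np.float32) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)).astype(np.float32) / np.sqrt(d_model)

# Baseline: compute K and V once and cache both tensors.
K_ref = X @ W_k
V_ref = X @ W_v

# X-caching variant: store only the quantized X ...
x_cache = quantize(X, bits=4)

# ... and rematerialize K and V from it when attention needs them.
X_hat = dequantize(*x_cache)
K_hat = X_hat @ W_k
V_hat = X_hat @ W_v

rel_err_k = np.linalg.norm(K_hat - K_ref) / np.linalg.norm(K_ref)
rel_err_v = np.linalg.norm(V_hat - V_ref) / np.linalg.norm(V_ref)
print(f"relative error in K: {rel_err_k:.4f}")
print(f"relative error in V: {rel_err_v:.4f}")
```

In this toy setup K and V each have the same shape as X, so caching X stores one tensor per layer instead of two, before any savings from the lower bit width. The price is two extra matrix multiplications each time the cache is read, which is the computation-for-memory trade the abstract describes.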

Related Publications

Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

TMLR

2026

transformer

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

EACL

2026

speculative-decoding

vision-language

Draft-based Approximate Inference for LLMs

ICLR

2026

speculative-decoding

kv-cache