XQuant: Breaking the Memory Wall for LLM Inference with KV Cache Rematerialization
Abstract
Growing context lengths and batch sizes in LLM inference are pushing memory capacity to its limits. To address this, we propose XQuant, a method that trades increased computation for reduced memory operations. Instead of caching Keys and Values directly, XQuant quantizes and caches the layer input activations X, and then rematerializes the Keys and Values on-the-fly during inference. XQuant achieves up to ~7.7× memory savings with <0.1 perplexity degradation compared to the FP16 baseline. An extended variant, XQuant-CL, exploits cross-layer similarity for up to 10× memory savings relative to the FP16 baseline with only 0.01 perplexity degradation.
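The core idea of caching quantized X and rematerializing Keys and Values can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration rather than the paper's implementation: the helper names `quantize_uniform` and `dequantize_uniform`, the per-token asymmetric 4-bit uniform quantizer, the single square projections `W_k` and `W_v`, and the toy dimensions.

```python
import numpy as np

def quantize_uniform(x, bits=4):
    """Per-row asymmetric uniform quantization (assumed scheme, for illustration)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Guard against constant rows, where hi == lo would give a zero step size.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0).astype(np.float32)
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_uniform(codes, scale, lo):
    """Map integer codes back to approximate floating-point activations."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
d_model, n_tokens = 64, 16  # toy sizes, chosen arbitrarily

# Hypothetical key/value projection weights for one attention layer.
W_k = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)).astype(np.float32)
W_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)).astype(np.float32)
X = rng.standard_normal((n_tokens, d_model)).astype(np.float32)

# Baseline: compute K and V once and keep both tensors in the cache.
K_ref = X @ W_k
V_ref = X @ W_v

# X-caching: store only the quantized layer input.
codes, scale, lo = quantize_uniform(X, bits=4)

# At attention time: dequantize X, then recompute K and V with extra matmuls.
X_hat = dequantize_uniform(codes, scale, lo)
K_hat = X_hat @ W_k
V_hat = X_hat @ W_v

rel_err_k = np.linalg.norm(K_hat - K_ref) / np.linalg.norm(K_ref)
rel_err_v = np.linalg.norm(V_hat - V_ref) / np.linalg.norm(V_ref)
print(f"relative error  K: {rel_err_k:.3f}  V: {rel_err_v:.3f}")
```

In this toy setup the cache holds one tensor (the codes for X, plus a scale and offset per token) instead of two (K and V), at the cost of two additional matrix multiplications whenever the Keys and Values are needed. That is the compute-for-memory trade described above.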