Draft-based Approximate Inference for LLMs
Abstract
Optimizing inference for long-context Large Language Models (LLMs) is increasingly important because Transformers incur compute that grows quadratically, and memory that grows linearly, with context length. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a framework for approximate LLM inference that leverages small draft models to predict the importance of tokens and KV pairs more accurately. Within this framework, we present:
1. SpecKV: the first method to use lookahead with a small draft model to enable precise KV cache dropping.
2. SpecPC: a method that uses the draft model's attention activations to identify and discard less important prompt tokens.
3. SpecKV-PC: a cascaded compression strategy that combines both techniques for superior results.
We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines while retaining the same improvements in memory usage, latency, and throughput.
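To make the idea of lookahead-guided KV cache dropping concrete, the following is a minimal sketch rather than the method itself. It assumes that each prompt position is scored by the largest attention weight it receives from any lookahead query, and that only the highest-scoring positions are kept within a fixed budget. The abstract does not specify the scoring rule, so the max aggregation, the single-head setup, and the names `importance_scores` and `select_kv_to_keep` are all illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def importance_scores(lookahead_queries, prompt_keys):
    """Score each prompt position by the maximum attention weight it
    receives from any lookahead query (assumed aggregation rule)."""
    dim = len(prompt_keys[0])
    scores = [0.0] * len(prompt_keys)
    for query in lookahead_queries:
        logits = [dot(query, key) / math.sqrt(dim) for key in prompt_keys]
        weights = softmax(logits)
        scores = [max(s, w) for s, w in zip(scores, weights)]
    return scores


def select_kv_to_keep(scores, budget):
    """Return the indices of the `budget` highest-scoring positions,
    sorted so the retained KV pairs stay in their original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example: four prompt keys and two hypothetical lookahead queries.
prompt_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
lookahead_queries = [[4.0, 0.0], [0.0, 4.0]]

scores = importance_scores(lookahead_queries, prompt_keys)
kept = select_kv_to_keep(scores, budget=2)
print(kept)
```

In this toy setup the queries stand in for the tokens a draft model would generate ahead of the target model. The positions they attend to most strongly are retained, and the remaining KV pairs would be dropped from the cache.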