Draft-based Approximate Inference for LLMs

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, Kangwook Lee

speculative-decoding

kv-cache

Abstract

Optimizing inference for long-context Large Language Models (LLMs) is increasingly important due to the quadratic compute and linear memory complexity of Transformers. Existing approximation methods, such as key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on rough predictions of token or KV pair importance. We propose a framework for approximate LLM inference that leverages small draft models to more accurately predict the importance of tokens and KV pairs. Within this framework, we present:

1. SpecKV: The first method to use lookahead with a small draft model to enable precise KV cache dropping.
2. SpecPC: Uses the draft model's attention activations to identify and discard less important prompt tokens.
3. SpecKV-PC: A cascaded compression strategy combining both techniques for superior results.

We motivate our methods with theoretical and empirical analyses, and show a strong correlation between the attention patterns of draft and target models. Extensive experiments on long-context benchmarks show that our methods consistently achieve higher accuracy than existing baselines, while preserving the same improvements in memory usage, latency, and throughput.
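
To make the general idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a draft model has already produced a small matrix of attention weights (rows are draft queries, columns are prompt tokens), scores each prompt token by the maximum attention it receives, and keeps the highest-scoring positions. The function name `select_important_tokens`, the max-based aggregation, and the toy weights are all assumptions made for illustration; the abstract does not specify how the scores are computed or aggregated.

```python
def select_important_tokens(draft_attention, keep):
    """Return the positions of the `keep` highest-scoring prompt tokens.

    draft_attention: list of rows, one per draft-model query, each holding
    attention weights over the prompt tokens (hypothetical input format).
    """
    num_keys = len(draft_attention[0])
    # Assumed aggregation: a token's score is the largest attention weight
    # it receives from any draft query.
    scores = [max(row[j] for row in draft_attention) for j in range(num_keys)]
    # Rank positions by score, highest first, then keep the top `keep`.
    ranked = sorted(range(num_keys), key=lambda j: scores[j], reverse=True)
    # Restore original order so the kept tokens stay in sequence.
    return sorted(ranked[:keep])


# Toy attention weights: 2 draft queries over 4 prompt tokens.
attn = [
    [0.70, 0.10, 0.15, 0.05],
    [0.05, 0.05, 0.80, 0.10],
]
kept = select_important_tokens(attn, keep=2)
print(kept)  # positions of the tokens (or KV pairs) that would be retained
```

In this sketch the same index list could be used either to retain KV pairs in a target model's cache or to filter prompt tokens before they reach the target model; how the actual methods differ on that point is not described in the abstract.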

Related Publications

Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

TMLR

2026

transformer

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

EACL

2026

speculative-decoding

vision-language