---
type: concept
tags: [quantization, kv-cache, on-device, llm, memory, optimization, optimization-techniques]
related: [[on-device-inference-memory-pressure]], [[lcsb-finetuning-ondevice]], [[edgeflow-cold-start]]
sources:
  - url: https://arxiv.org/abs/2604.04722v1
    title: "Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs"
    date: 2026-04
created: 2026-04-14
---
# Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
## Core Problem
Large Language Models (LLMs) have achieved remarkable progress across reasoning, generation, and decision-making tasks, yet deploying them on mobile, embedded, and edge devices remains particularly challenging.
## Method / Architecture

Based on the paper's abstract, the method includes the following key innovations:
- On-device LLM inference is heavily constrained by the memory and bandwidth overhead of the key-value (KV) cache, which grows linearly with context length and often dominates decoding cost.
- Inspired by Huffman coding's principle of variable-length allocation, the authors propose adaptive KV-cache quantization: a learned policy that assigns bit-width in proportion to token importance, minimizing expected memory and latency while maintaining competitive accuracy.
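The Huffman-style allocation above can be sketched as follows: tokens judged important keep more bits for their KV entries, while the rest get coarser uniform quantization. Everything in this sketch, including the quantile thresholds and the per-vector min-max quantizer, is an illustrative assumption, not the paper's learned policy.

```python
import numpy as np

def quantize(v, bits):
    """Uniform min-max quantization of a KV vector to the given bit-width.
    16 bits is treated as FP16 passthrough (no quantization)."""
    if bits >= 16:
        return v.copy()
    lo, hi = float(v.min()), float(v.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    q = np.round((v - lo) / scale)          # map to integer grid
    return q * scale + lo                   # dequantize back to float

def allocate_bits(importance):
    """Assign a bit-width from {2, 4, 8, 16} by importance quantile, so the
    most important tokens get the most bits (hypothetical thresholds:
    top 5% -> 16, next 20% -> 8, next 50% -> 4, bottom 25% -> 2)."""
    n = len(importance)
    ranks = importance.argsort().argsort() / max(n - 1, 1)  # in [0, 1]
    return np.where(ranks >= 0.95, 16,
           np.where(ranks >= 0.75, 8,
           np.where(ranks >= 0.25, 4, 2)))
```

With these thresholds the expected bit-width is about 4.9 bits/token, roughly a 3x KV-memory reduction versus FP16 while keeping the highest-importance tokens at full precision.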
## Experimental Results

The paper reports the following main results:
- Our framework extracts lightweight token-level features, including token frequency, quality score, attention variance, and entropy-based uncertainty, and feeds them into a compact data-driven controller that dynamically selects KV precision from {2-bit, 4-bit, 8-bit, FP16} during decoding.
- This adaptive precision policy reduces KV memory footprint and latency, improves accuracy over static KV quantization and rule-based baselines, and remains close to FP16 inference accuracy across standard LLM benchmarks.
- Extensive experiments across multiple commonsense reasoning benchmarks using SmolLM-135M, SmolLM-360M, and SmolLM-1.7B demonstrate that our controller consistently improves the accuracy-latency trade-off.
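The feature-to-precision pipeline described above can be sketched as a small controller. The feature names (frequency, attention variance, entropy-based uncertainty) come from the paper; the feature formulas and the linear controller with random weights are placeholders for the learned, data-driven policy.

```python
import numpy as np

PRECISIONS = (2, 4, 8, 16)  # 16 stands in for FP16 passthrough

def token_features(attn_col, token_count, total_tokens):
    """Lightweight per-token features (names per the paper; formulas assumed):
    token frequency, attention variance, entropy-based uncertainty."""
    freq = token_count / max(total_tokens, 1)
    var = float(np.var(attn_col))
    p = attn_col / (attn_col.sum() + 1e-9)
    entropy = float(-(p * np.log(p + 1e-9)).sum())
    return np.array([freq, var, entropy])

class PrecisionController:
    """A compact linear controller scoring each precision class from the
    features. The paper learns this policy from data; random initialization
    here is only a structural placeholder."""
    def __init__(self, n_features=3, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((len(PRECISIONS), n_features)) * 0.1
        self.b = np.zeros(len(PRECISIONS))

    def select(self, features):
        logits = self.W @ features + self.b
        return PRECISIONS[int(np.argmax(logits))]
```

At decode time such a controller would run once per token before the KV entry is written, so its cost must stay negligible next to the attention computation itself.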
## Why It Matters

The significance of this work:
- For instance, with SmolLM-360M on HellaSwag, our method reduces decoding latency (ms/token) by 17.75% relative to static KV quantization, improves accuracy by 7.60 points, and remains within only 0.30 points of FP16 inference.
## Connections

Based on the paper's content and research area, this work relates to the following concepts:
- [[on-device-inference-memory-pressure]]
## References

- Original paper: https://arxiv.org/abs/2604.04722