vrypan |--o--| pfp
vrypan |--o--|

@vrypan.eth

> During autoregressive decoding, consecutive tokens produce nearly identical KV cache values. The hidden state for "The cat sat on the mat" differs from "The cat sat on the rug" by only ~1% at most dimensions. > Standard KV cache quantization (Q4_0) compresses absolute values to 4 bits. Delta-KV compresses the tiny difference between tokens to 4 bits instead. Same bits, vastly less error. https://github.com/cenconq25/delta-compress-llm
2 replies
1 recast
7 reactions