Nvidia says it can shrink LLM memory 20x without changing model weights


Source: venturebeat.com

Nvidia researchers have introduced a new technique that dramatically reduces how much memory large language models need to track conversation history — by as much as 20x — without modifying the model itself. The method, called KV Cache Transform Coding (KVTC), applies ideas from media compression formats like JPEG to shrink the key-value cache behind multi-turn AI systems, lowering GPU memory demands and speeding up time-to-first-token by up to 8x.

For enterprise AI applications that rely on agents and long contexts, this translates to reduced GPU memory costs, better prompt reuse, and up to an 8x reduction in latency by avoiding the need to recompute dropped KV cache values.

Serving large language models at scale requires managing a massive amount of data, especially for multi-turn conversations and long coding sessions. Every time a user adds to a prompt, the system relies on stored memory to avoid recomputing the entire conversation history from scratch.

However, this memory footprint
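The article does not detail KVTC's internals, but the JPEG analogy can be illustrated: transform coding maps correlated values into a frequency basis where most of the energy lands in a few coefficients, so a coarse quantizer zeroes out the rest and the data becomes highly compressible. The sketch below is a minimal, illustrative pure-Python example of that general idea (a 1D DCT on a toy vector) — it is not Nvidia's actual KVTC pipeline, which operates on full KV cache tensors:

```python
import math

def dct(x):
    """DCT-II: decorrelates a signal so energy concentrates in few coefficients."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * k * (n + 0.5) / N) for n in range(N))
            for k in range(N)]

def idct(X):
    """Inverse of the DCT-II above."""
    N = len(X)
    return [X[0] / N + (2 / N) * sum(X[k] * math.cos(math.pi * k * (n + 0.5) / N)
                                     for k in range(1, N))
            for n in range(N)]

def quantize(X, step):
    # Coarse rounding: small coefficients collapse to zero.
    return [round(c / step) for c in X]

def dequantize(q, step):
    return [v * step for v in q]

# Toy stand-in for one KV vector: smooth, correlated values (hypothetical data).
kv = [math.sin(0.3 * n) for n in range(16)]

step = 0.5
q = quantize(dct(kv), step)
recon = idct(dequantize(q, step))

nonzero = sum(1 for v in q if v != 0)
err = max(abs(a - b) for a, b in zip(kv, recon))
print(f"nonzero coefficients: {nonzero}/{len(q)}, max reconstruction error: {err:.3f}")
```

Only the nonzero quantized coefficients need storing (plus their positions), which is where the memory saving comes from; the reconstruction stays close to the original because the discarded coefficients carried little energy.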