IndexCache, a new sparse attention optimizer, delivers 1.82x faster inference on long-context AI models

By Crystal Osprey · March 27, 2026 · 1 min read

technology

Processing 200,000 tokens through a large language model is expensive and slow: the longer the context, the faster the costs spiral. Researchers at Tsinghua University and Z.ai have built a technique called IndexCache that cuts up to 75% of the redundant computation in sparse attention models, delivering up to 1.82x faster time-to-first-token and 1.48x faster generation throughput at that context length.

The technique applies to models using the DeepSeek Sparse Attention architecture, including the latest DeepSeek and GLM families. It can help enterprises provide faster user experiences for production-scale, long-context models, a capability already proven in preliminary tests on the 744-billion-parameter GLM-5 model.

The DSA bottleneck

Large language models rely on the self-attention mechanism, a process where the model computes the relationship between every token in its context and all the preceding ones to predict the next token.

However, self-attention has a severe limitation. Its com
