MLKV: Multi-layer Key-value Heads For Memory Efficient Transformer Decoding

Zuhri Zayd Muhammad Kawakibi, Adilazuarda Muhammad Farid, Purwarianti Ayu, Aji Alham Fikri. Arxiv 2024

[Paper] [Code]
Attention Mechanism Has Code Model Architecture Pretraining Methods Transformer

Auto-regressive inference of transformers benefit greatly from Key-Value (KV) caching, but can lead to major memory bottlenecks as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach extending KV sharing across transformer layers to reduce memory usage beyond what was possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing KV cache size down to a factor of 6x compared to MQA. These results highlight MLKV’s potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv

The Large Language Model Bible

MLKV: Multi-layer Key-value Heads For Memory Efficient Transformer Decoding

Zuhri Zayd Muhammad Kawakibi, Adilazuarda Muhammad Farid, Purwarianti Ayu, Aji Alham Fikri. Arxiv 2024

Similar Work