
Vtensor: Flexible Virtual Tensor Management For Efficient LLM Serving

Xu Jiale, Zhang Rui, Guo Cong, Hu Weiming, Liu Zihan, Wu Feiyang, Feng Yu, Sun Shixuan, Shao Changxu, Guo Yuhong, Zhao Junping, Zhang Ke, Guo Minyi, Leng Jingwen. arXiv 2024

Tags: Attention Mechanism, Model Architecture, RAG, Tools, Transformer

Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests. This surge in demand poses significant challenges in optimizing throughput and latency while keeping costs manageable. The Key-Value (KV) cache, a standard method for retaining previous computations, makes LLM inference heavily memory-bound. While batching strategies can enhance performance, they frequently lead to significant memory fragmentation. Even though cutting-edge systems like vLLM mitigate KV cache fragmentation using PagedAttention, they still suffer from inefficient memory and computational operations due to the tight coupling between page management and computation kernels. This study introduces vTensor, an innovative tensor structure for LLM inference based on GPU virtual memory management (VMM). vTensor addresses existing limitations by decoupling computation from memory defragmentation and offering dynamic extensibility. Our framework employs a CPU-GPU heterogeneous approach, ensuring efficient, fragmentation-free memory management while accommodating various computation kernels across different LLM architectures. Experimental results indicate that vTensor achieves an average speedup of 1.86x across different models, with up to 2.42x in multi-turn chat scenarios. Additionally, vTensor provides average speedups of 2.12x and 3.15x in kernel evaluation, reaching up to 3.92x and 3.27x compared to the SGLang Triton prefix-prefilling kernels and the vLLM PagedAttention kernel, respectively. Furthermore, it frees approximately 71.25% (57 GB) of memory on the NVIDIA A100 GPU compared to vLLM, enabling more memory-intensive workloads.
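To make the VMM idea concrete, the sketch below illustrates the underlying CUDA driver mechanism that a vTensor-style design can build on: reserve a large contiguous virtual address range for a KV-cache tensor up front, then create and map small physical chunks on demand as a sequence grows. This is not the paper's implementation; the `GrowableTensor` type, its `grow_to` method, and the chosen sizes are hypothetical, error checking is omitted, and only standard CUDA low-level virtual memory APIs (`cuMemAddressReserve`, `cuMemCreate`, `cuMemMap`, `cuMemSetAccess`) are used.

```cpp
// Hypothetical sketch: a growable, fragmentation-free device tensor using the
// CUDA driver's low-level virtual memory management API. Compute kernels see a
// single contiguous pointer; extending the tensor maps new physical chunks at
// the end of the reserved range, with no copying or defragmentation.
#include <cuda.h>
#include <cstdio>
#include <vector>

struct GrowableTensor {
    CUdeviceptr base = 0;        // contiguous virtual address range
    size_t reserved = 0;         // total virtual bytes reserved
    size_t mapped = 0;           // bytes currently backed by physical memory
    size_t granularity = 0;      // minimum mapping granularity on this device
    int device = 0;
    std::vector<CUmemGenericAllocationHandle> handles;

    void init(int dev, size_t max_bytes) {
        device = dev;
        CUmemAllocationProp prop = {};
        prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
        prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
        prop.location.id = device;
        cuMemGetAllocationGranularity(&granularity, &prop,
                                      CU_MEM_ALLOC_GRANULARITY_MINIMUM);
        reserved = ((max_bytes + granularity - 1) / granularity) * granularity;
        // Reserve virtual addresses only; no physical memory is consumed yet.
        cuMemAddressReserve(&base, reserved, 0, 0, 0);
    }

    // Extend the physically backed region to cover at least `needed` bytes.
    void grow_to(size_t needed) {
        while (mapped < needed && mapped < reserved) {
            CUmemAllocationProp prop = {};
            prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
            prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
            prop.location.id = device;
            CUmemGenericAllocationHandle h;
            cuMemCreate(&h, granularity, &prop, 0);          // one physical chunk
            cuMemMap(base + mapped, granularity, 0, h, 0);   // map at end of range
            CUmemAccessDesc access = {};
            access.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
            access.location.id = device;
            access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
            cuMemSetAccess(base + mapped, granularity, &access, 1);
            handles.push_back(h);
            mapped += granularity;
        }
    }
};

int main() {
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    GrowableTensor kv;
    kv.init(0, 1ull << 30);      // reserve 1 GiB of virtual space for one sequence
    kv.grow_to(16ull << 20);     // back the first 16 MiB as tokens arrive
    printf("mapped %zu of %zu reserved bytes at %p\n",
           kv.mapped, kv.reserved, (void*)kv.base);
    return 0;
}
```

Because the virtual range stays fixed while physical backing grows, attention kernels can address the KV cache through ordinary contiguous pointers rather than page tables, which is the decoupling of computation from memory management that the abstract describes.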
