• Author(s): Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, Song Han

Quantization can accelerate large language model (LLM) inference. Going beyond INT8, the research community is actively exploring even lower precision, such as INT4. However, state-of-the-art INT4 quantization techniques only speed up low-batch, edge LLM inference and fail to deliver performance gains in large-batch, cloud-based LLM serving. Existing INT4 quantization methods incur significant runtime overhead (20-90%) when dequantizing either weights or partial sums on GPUs.
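To make the overhead concrete, here is a minimal NumPy sketch of the weight-only (W4A16) pattern the text criticizes: before every matrix multiply, the INT4 weights must be expanded back to floating point, one subtract-and-multiply per weight element. On a GPU this dequantization runs on the slower CUDA cores ahead of the tensor-core matmul. The function name and layout are illustrative, not from the paper.

```python
import numpy as np

def w4a16_gemm(x_fp16, w_int4, scales, zeros):
    """Illustrative weight-only GEMM: dequantize INT4 weights, then matmul."""
    # Dequantize: one subtract-and-multiply per weight element.
    # On a GPU this per-element work lands on CUDA cores and is the
    # runtime overhead the text describes.
    w = (w_int4.astype(np.float32) - zeros) * scales
    # The actual matmul (tensor-core work on a real GPU).
    return x_fp16 @ w
```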

To address this challenge, the authors introduce QoQ, a W4A8KV4 quantization algorithm with 4-bit weights, 8-bit activations, and a 4-bit KV cache. QoQ stands for quattuor-octo-quattuor, Latin for 4-8-4. QoQ is implemented in the QServe inference library, which achieves measured (rather than merely theoretical) speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by operations on low-throughput CUDA cores.
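The W4A8KV4 naming can be sketched with a plain symmetric quantizer applied at the three precisions it names. This is a generic uniform-quantization sketch to unpack the notation, not QoQ's actual per-channel/per-group scheme.

```python
import numpy as np

def quantize_sym(x, n_bits):
    """Symmetric uniform quantization to n_bits signed integers (a sketch)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

# W4A8KV4: 4-bit weights, 8-bit activations, 4-bit KV cache.
w_q, w_s = quantize_sym(np.random.randn(64, 64), n_bits=4)    # weights     -> INT4
a_q, a_s = quantize_sym(np.random.randn(8, 64), n_bits=8)     # activations -> INT8
kv_q, kv_s = quantize_sym(np.random.randn(8, 64), n_bits=4)   # KV cache    -> INT4
```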

Building on this insight, the QoQ algorithm introduces progressive quantization, which keeps dequantization overhead low in W4A8 GEMM. It also introduces SmoothAttention, which effectively mitigates the accuracy degradation incurred by 4-bit KV quantization. On the systems side, QServe performs compute-aware weight reordering and exploits register-level parallelism to reduce dequantization latency. QServe also makes fused attention memory-bound, harnessing the performance gain brought by KV4 quantization.
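The progressive-quantization idea can be sketched as two-level weight quantization: first a per-channel float-scaled INT8 step, then a per-group integer-scaled UINT4 step, so that the runtime UINT4-to-INT8 dequantization inside the GEMM needs only integer multiply-adds. The function names, group size, and scale formulas below are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def progressive_quantize(w, group_size=128):
    """Two-level ("progressive") weight quantization sketch."""
    # Level 1: per-output-channel symmetric INT8 quantization, float scales.
    s_ch = np.abs(w).max(axis=1, keepdims=True) / 127.0
    w_i8 = np.clip(np.round(w / s_ch), -127, 127).astype(np.int32)

    # Level 2: per-group asymmetric UINT4 quantization of the INT8 values,
    # with integer group scales and zero points.
    rows, cols = w_i8.shape
    g = w_i8.reshape(rows, cols // group_size, group_size)
    lo = g.min(axis=2, keepdims=True)                # integer zero point
    hi = g.max(axis=2, keepdims=True)
    s_g = np.maximum((hi - lo + 14) // 15, 1)        # integer scale, codes 0..15
    w_u4 = np.clip((g - lo) // s_g, 0, 15).astype(np.uint8)
    return w_u4, s_g, lo, s_ch

def dequant_uint4_to_int8(w_u4, s_g, z_g):
    """Runtime dequant: UINT4 -> INT8 with integer multiply-adds only."""
    # Cheap compared to the floating-point dequantization a W4A16 scheme
    # must perform before every GEMM.
    g = w_u4.astype(np.int32) * s_g + z_g
    return g.reshape(g.shape[0], -1).astype(np.int8)
```

The INT8 result then feeds the GPU's INT8 tensor-core GEMM directly; the float per-channel scales are applied once to the accumulated output rather than per weight element.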

As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by 1.2x on A100 and 1.4x on L40S, and of Qwen1.5-72B by 2.4x on A100 and 3.5x on L40S, compared to TensorRT-LLM. Remarkably, QServe on an L40S GPU achieves even higher throughput than TensorRT-LLM on an A100. QServe thus effectively reduces the dollar cost of LLM serving by 3x. For more details, please refer to the source.