• Author(s): Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, Furu Wei

The paper introduces YOCO, a novel decoder-decoder architecture for large language models that is unique in caching key-value pairs only once. YOCO is composed of two main components: a self-decoder and a cross-decoder. The self-decoder efficiently encodes a global key-value cache, which the cross-decoder then reuses through cross-attention.
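
To make the layout concrete, here is a minimal PyTorch sketch of the decoder-decoder structure described above. It is an illustrative reconstruction, not the authors' code: plain causal attention stands in for the efficient self-attention variants the paper uses in the self-decoder, and the module names and sizes (`YOCOSketch`, `d=256`, 8 layers split in half) are assumptions made for this example.

```python
# Minimal sketch of YOCO's decoder-decoder layout (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDecoderLayer(nn.Module):
    """Pre-norm block with causal self-attention (a stand-in for the paper's
    efficient self-attention) followed by a feed-forward network."""
    def __init__(self, d, heads):
        super().__init__()
        self.heads = heads
        self.qkv, self.out = nn.Linear(d, 3 * d), nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.n1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.heads, -1).transpose(1, 2) for z in (q, k, v))
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.out(a.transpose(1, 2).reshape(b, t, d))
        return x + self.ffn(self.n2(x))

class CrossDecoderLayer(nn.Module):
    """Projects only queries; keys/values come from the shared global cache."""
    def __init__(self, d, heads):
        super().__init__()
        self.heads = heads
        self.q, self.out = nn.Linear(d, d), nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, k, v, is_causal=True):
        b, t, d = x.shape
        q = self.q(self.n1(x)).view(b, t, self.heads, -1).transpose(1, 2)
        a = F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
        x = x + self.out(a.transpose(1, 2).reshape(b, t, d))
        return x + self.ffn(self.n2(x))

class YOCOSketch(nn.Module):
    """Lower half: self-decoder. Upper half: cross-decoder reusing one global KV."""
    def __init__(self, d=256, heads=4, layers=8):
        super().__init__()
        half = layers // 2
        self.heads = heads
        self.self_dec = nn.ModuleList([SelfDecoderLayer(d, heads) for _ in range(half)])
        self.kv = nn.Linear(d, 2 * d)   # produces the global keys/values exactly once
        self.cross_dec = nn.ModuleList([CrossDecoderLayer(d, heads) for _ in range(half)])

    def forward(self, x):               # x: (batch, seq_len, d)
        b, t, _ = x.shape
        for layer in self.self_dec:
            x = layer(x)
        k, v = self.kv(x).chunk(2, dim=-1)               # the only KV kept at inference
        k = k.view(b, t, self.heads, -1).transpose(1, 2)
        v = v.view(b, t, self.heads, -1).transpose(1, 2)
        for layer in self.cross_dec:
            x = layer(x, k, v)
        return x
```

The key point is in `YOCOSketch.forward`: keys and values are produced once from the self-decoder output, and every cross-decoder layer attends to that single shared cache.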

Despite this unique design, the overall model behaves like a decoder-only Transformer; the advantage is that key-value pairs only need to be cached once. This design significantly reduces GPU memory demand while retaining the capability for global attention.
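
A rough back-of-the-envelope comparison shows why caching once matters: a conventional decoder-only Transformer stores keys and values for every layer, so its cache grows with depth, whereas YOCO keeps a single global cache. The layer count, head configuration, and fp16 storage below are illustrative assumptions, not the paper's exact settings.

```python
# Back-of-envelope KV-cache size: a conventional decoder caches K/V in every layer,
# while YOCO keeps one global K/V. All numbers below are illustrative assumptions.
def kv_cache_bytes(seq_len, n_kv_heads, head_dim, n_cached_layers, bytes_per_elem=2):
    # Factor 2 accounts for keys plus values; bytes_per_elem=2 assumes fp16/bf16.
    return 2 * seq_len * n_kv_heads * head_dim * n_cached_layers * bytes_per_elem

seq_len, n_kv_heads, head_dim, n_layers = 65_536, 8, 128, 32

transformer = kv_cache_bytes(seq_len, n_kv_heads, head_dim, n_layers)  # cached in every layer
yoco        = kv_cache_bytes(seq_len, n_kv_heads, head_dim, 1)         # cached once

print(f"Transformer KV cache: {transformer / 2**30:.2f} GiB")  # 8.00 GiB
print(f"YOCO KV cache:        {yoco / 2**30:.2f} GiB")         # 0.25 GiB
```

The gap scales with the number of layers whose keys and values would otherwise have to be cached.
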
An additional feature of YOCO is its computation flow, which allows the prefill stage to exit early without altering the final output, greatly accelerating prefilling.
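
The early exit follows from the structure sketched earlier: during prefill, only the self-decoder needs to process the whole prompt to build the global key-value cache, while the cross-decoder only needs the final position to predict the next token, so the rest of its computation can be skipped without changing the output. Below is a hedged sketch built on the hypothetical `YOCOSketch` module above (again my naming, not the paper's code).

```python
import torch

@torch.no_grad()
def prefill(model: "YOCOSketch", x):
    """Early-exit prefill sketch: run the self-decoder over the full prompt,
    build the global KV cache once, then run the cross-decoder on the last
    position only. x has shape (batch, prompt_len, d)."""
    b, t, _ = x.shape
    for layer in model.self_dec:                 # full pass over the prompt
        x = layer(x)
    k, v = model.kv(x).chunk(2, dim=-1)          # global KV cache, built once
    k = k.view(b, t, model.heads, -1).transpose(1, 2)
    v = v.view(b, t, model.heads, -1).transpose(1, 2)
    h = x[:, -1:, :]                             # early exit: keep only the last position
    for layer in model.cross_dec:
        h = layer(h, k, v, is_causal=False)      # single query attends to all cached KV
    return h, (k, v)                             # next-token hidden state + reusable cache
```

This sketch covers only the prefill step; during decoding, each new token would likewise pass through the self-decoder to extend the cache before reaching the cross-decoder.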

Experimental results show that YOCO outperforms the Transformer in various settings, including scaling up the model size and the number of training tokens. YOCO has also been extended to a 1M-token context length, achieving near-perfect needle retrieval accuracy. Profiling results further indicate that YOCO significantly improves inference memory usage, prefill latency, and throughput across various context lengths and model sizes. Overall, the paper presents a promising advancement in the field of large language models.
