• Author(s): Ali Safaya, Deniz Yuret

“Neurocache: Efficient Vector Retrieval for Long-range Language Modeling” introduces Neurocache, a novel approach designed to extend the effective context size of large language models (LLMs). This method addresses the challenge of maintaining long-range dependencies in language models, which is crucial for tasks that require understanding and generating coherent text over extended sequences.

Neurocache maintains an external vector memory that stores past states of the language model, allowing relevant past information to be retrieved efficiently and the model's context window to be extended. Retrieval is performed with a k-nearest-neighbor (kNN) algorithm, which quickly and accurately identifies the most relevant past states; these retrieved states are then incorporated into the attention mechanism of the language model.
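
To make the mechanism concrete, here is a minimal sketch of kNN retrieval over a cache of past hidden states, with the retrieved states folded back in through a small cross-attention step. The shapes, cosine similarity metric, and projection matrices are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch: retrieve the k nearest cached states per token, then attend to them.
import torch
import torch.nn.functional as F

def knn_retrieve(cache: torch.Tensor, query: torch.Tensor, k: int) -> torch.Tensor:
    """cache: (cache_size, d), query: (seq_len, d) -> (seq_len, k, d)."""
    # Cosine-style similarity between each query and every cached state.
    sims = F.normalize(query, dim=-1) @ F.normalize(cache, dim=-1).T
    top = sims.topk(k, dim=-1).indices            # (seq_len, k)
    return cache[top]                             # gather the k neighbors

def attend_to_retrieved(query, retrieved, w_k, w_v):
    """Cross-attention from each token to its k retrieved states."""
    k_proj = retrieved @ w_k                      # (seq_len, k, d)
    v_proj = retrieved @ w_v                      # (seq_len, k, d)
    scores = torch.einsum("td,tkd->tk", query, k_proj) / query.shape[-1] ** 0.5
    weights = scores.softmax(dim=-1)              # (seq_len, k)
    return torch.einsum("tk,tkd->td", weights, v_proj)

d, cache_size, seq_len, k = 64, 4096, 8, 4
cache = torch.randn(cache_size, d)                # cached past states
query = torch.randn(seq_len, d)                   # current hidden states
w_k, w_v = torch.randn(d, d), torch.randn(d, d)   # toy projections
context = attend_to_retrieved(query, knn_retrieve(cache, query, k), w_k, w_v)
print(context.shape)                              # torch.Size([8, 64])
```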

A key advancement of Neurocache is that it stores compressed states: compressing the cached vectors significantly reduces the size of the cache and its memory footprint. In addition, the method performs a single retrieval operation per token, which speeds up inference. This efficiency is particularly beneficial for real-time applications where quick response times are essential.
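
The storage saving is easy to picture with a small sketch. The snippet below assumes a plain linear projection as the compression operator and a fixed-capacity cache with oldest-first eviction; the actual operator, dimensions, and eviction policy in the paper may differ.

```python
# Sketch: compress hidden states before caching, then cap the cache size.
import torch

d_model, d_compressed, cache_size = 1024, 256, 16384
compress = torch.nn.Linear(d_model, d_compressed, bias=False)  # toy compressor

cache = torch.empty(0, d_compressed)   # grows as text is processed

def update_cache(hidden: torch.Tensor) -> None:
    """Append compressed states; evict the oldest entries beyond capacity."""
    global cache
    with torch.no_grad():
        cache = torch.cat([cache, compress(hidden)], dim=0)[-cache_size:]

hidden = torch.randn(128, d_model)     # hidden states for one segment
update_cache(hidden)
print(cache.shape)                     # torch.Size([128, 256])
# Storage shrinks by d_model / d_compressed (4x here), and each new token
# then needs only a single kNN lookup against this smaller cache.
```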

Neurocache also extends the retrieval window to include neighboring states, which improves both language modeling and downstream task accuracy. This extended retrieval window allows the model to capture more context, leading to better performance in tasks that require long-range dependencies.
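
One way to picture this is a gather that pulls in a small window of entries around each retrieved index, as in the sketch below; the symmetric window and the clamping at cache boundaries are assumptions, and the paper's exact window definition may differ.

```python
# Sketch: for each retrieved index, also gather its neighboring cache entries.
import torch

def gather_with_neighbors(cache: torch.Tensor, indices: torch.Tensor,
                          window: int = 1) -> torch.Tensor:
    """cache: (n, d), indices: (t, k) -> (t, k * (2*window + 1), d)."""
    offsets = torch.arange(-window, window + 1)          # e.g. [-1, 0, 1]
    expanded = indices.unsqueeze(-1) + offsets           # (t, k, 2w+1)
    expanded = expanded.clamp(0, cache.shape[0] - 1)     # stay inside the cache
    return cache[expanded.reshape(indices.shape[0], -1)]

cache = torch.randn(1000, 64)
indices = torch.randint(0, 1000, (8, 4))   # top-4 retrieved indices per token
states = gather_with_neighbors(cache, indices, window=1)
print(states.shape)                        # torch.Size([8, 12, 64])
```
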
The paper provides extensive experimental results demonstrating the effectiveness of Neurocache. The authors evaluate the approach both on models trained from scratch and on pre-trained models such as Llama2-7B and Mistral-7B, showing consistent improvements in language modeling and downstream task performance. Neurocache also outperforms traditional text retrieval methods on single-document question answering and few-shot learning tasks.

“Neurocache: Efficient Vector Retrieval for Long-range Language Modeling” presents a significant advancement in language modeling. By combining an efficient vector retrieval mechanism with compressed cached states, the authors offer a practical tool for extending the context size of LLMs. This work has important implications for applications such as natural language processing pipelines, real-time text generation, and interactive AI systems, where handling long-range dependencies is essential.