Here are the top 10 machine learning and AI research papers from September 16 to September 23, 2024. The selection spans real-time speech dialogue, self-correcting language models, code generation, reasoning frameworks, quantization, and LLM safety. Each summary below outlines the problem a paper tackles, the approach it takes, and its main results.

1. Moshi

  • Author(s): Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour

Overview of Moshi

“Moshi: A Speech-Text Foundation Model for Real-Time Dialogue” introduces Moshi, a novel speech-to-speech model designed to enhance real-time spoken dialogue systems. Traditional systems often suffer from latency, reliance on intermediate text, and turn-based interactions, which hinder natural conversation flow. Moshi addresses these challenges by integrating a text-based large language model (LLM) with a smaller audio language model, allowing it to process and generate audio directly. This approach removes the bottleneck of converting speech to text and back, enabling the model to understand and produce speech in real time with low latency. The architecture of Moshi features a streaming, hierarchical design that processes audio inputs and outputs simultaneously, supporting full-duplex communication in which the model can listen and speak concurrently. This is achieved through a multi-stream audio language model that handles multiple audio streams in parallel. The paper details the development of Moshi’s components, including Helium, a text LLM pre-trained on extensive datasets, and Mimi, a neural audio codec that provides high-quality audio tokenization. The model’s performance is evaluated on metrics such as speech intelligibility, consistency, and spoken question answering, demonstrating state-of-the-art capabilities in speech-text modeling.
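
To make the multi-stream idea concrete, here is a toy sketch (entirely my own illustration, not code from the paper): at each time step the model sees one text token from its inner monologue plus parallel codec tokens for its own and the user's audio, all flattened into a single autoregressive sequence. The `Frame` structure, the stream ordering, and the token values are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Frame:
    text_token: int         # inner-monologue text stream
    user_audio: list[int]   # codec tokens for the incoming (user) stream
    model_audio: list[int]  # codec tokens for the outgoing (model) stream

def flatten(frames: list[Frame]) -> list[int]:
    """Interleave all streams into one sequence the LM models step by step."""
    seq: list[int] = []
    for f in frames:
        seq.append(f.text_token)   # text token first within each frame
        seq.extend(f.model_audio)  # then the model's own audio tokens
        seq.extend(f.user_audio)   # then the user's audio tokens
    return seq

frames = [Frame(7, [3, 9], [5, 1]), Frame(2, [8, 4], [6, 0])]
print(flatten(frames))  # [7, 5, 1, 3, 9, 2, 6, 0, 8, 4]
```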

2. Training Language Models to Self-Correct via Reinforcement Learning

  • Author(s): Aviral Kumar, Vincent Zhuang, Rishabh Agarwal, Yi Su, John D Co-Reyes, Avi Singh, Kate Baumli, Shariq Iqbal, Colton Bishop, Rebecca Roelofs, Lei M Zhang, Kay McKinney, Disha Shrivastava, Cosmin Paduraru, George Tucker, Doina Precup, Feryal Behbahani, and Aleksandra Faust

Training LLMs to Self-Correct via RL

“Training Language Models to Self-Correct via Reinforcement Learning” introduces a novel approach to enhancing the self-correction capabilities of large language models (LLMs). Traditional methods for training self-correction often require multiple models or rely on external supervision, which can be inefficient. To address these limitations, the authors propose SCoRe, a multi-turn online reinforcement learning strategy that leverages entirely self-generated data to improve self-correction. The study highlights the inadequacies of supervised fine-tuning (SFT) on offline model-generated correction traces, which often results in a distribution mismatch or ineffective correction behaviors at test time. SCoRe overcomes these challenges by training under the model’s own distribution of correction traces and employing regularization to steer learning toward effective self-correction strategies. This involves an initial phase of reinforcement learning on a base model to establish a robust policy initialization, followed by a reward bonus that amplifies self-correction during training. Applying SCoRe to Gemini 1.0 Pro and 1.5 Flash models yields significant gains, improving self-correction by 15.6% and 9.1% on the MATH and HumanEval benchmarks, respectively.
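
The reward-shaping intuition can be sketched in a few lines. This is a hedged simplification of the idea, not the authors' implementation: the second attempt earns the task reward plus a bonus proportional to its improvement over the first attempt, so merely repeating a correct first answer earns no bonus and regressing is penalized. The function name and the `alpha` weight are my own.

```python
def score_style_reward(correct_t1: bool, correct_t2: bool, alpha: float = 1.0) -> float:
    """Reward for a two-attempt episode: second-attempt reward plus a
    progress bonus (my simplification of the paper's shaping term)."""
    r1, r2 = float(correct_t1), float(correct_t2)
    return r2 + alpha * (r2 - r1)  # bonus for wrong->right, penalty for right->wrong

print(score_style_reward(correct_t1=False, correct_t2=True))   # 2.0: genuine correction
print(score_style_reward(correct_t1=True,  correct_t2=True))   # 1.0: no bonus for repeating
print(score_style_reward(correct_t1=True,  correct_t2=False))  # -1.0: regression penalized
```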

3. Qwen2.5-Coder Technical Report

  • Author(s): Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Kai Dang, An Yang, Rui Men, Fei Huang, Xingzhang Ren, Xuancheng Ren, Jingren Zhou, and Junyang Lin

Qwen2.5-Coder

“Qwen2.5-Coder Technical Report” presents the Qwen2.5-Coder series, an advancement over the previous CodeQwen1.5 model. The series comprises two models, Qwen2.5-Coder-1.5B and Qwen2.5-Coder-7B, designed specifically for code-related tasks. Built on the Qwen2.5 architecture, these models are pretrained on a corpus exceeding 5.5 trillion tokens. The development process involved rigorous data cleaning, scalable synthetic data generation, and balanced data mixing, which together yield robust code generation capabilities while preserving general versatility. The Qwen2.5-Coder models have been evaluated across a wide range of code-related tasks and achieve state-of-the-art performance on more than ten benchmarks, including code generation, completion, reasoning, and repair, consistently outperforming even larger models. The report suggests that the release of the Qwen2.5-Coder series, with its permissive licensing, will advance research in code intelligence and promote wider adoption by developers in real-world applications.
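
For readers who want to try the models, a minimal usage sketch with Hugging Face transformers follows, assuming the checkpoints are published under the repository id below (check the official model card for the exact names and license terms).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B-Instruct"  # assumed repo id; verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user",
             "content": "Write a Python function that reverses a linked list."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```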

4. On the Diagram of Thought

  • Author(s): Yifan Zhang, Yang Yuan, and Andrew Chi-Chih Yao

Diagram of Thought (DoT) enhances the reasoning capabilities of LLMs through mathematical rigor.

“On the Diagram of Thought” introduces the Diagram of Thought (DoT) framework, which models iterative reasoning in large language models (LLMs) as a directed acyclic graph (DAG). This approach differs from traditional methods that represent reasoning as linear chains or trees. In DoT, propositions, critiques, refinements, and verifications are organized into a cohesive DAG structure, allowing models to explore complex reasoning pathways while maintaining logical consistency. Each node in the diagram corresponds to a proposition that has been proposed, critiqued, refined, or verified, enabling iterative improvement through natural language feedback. The framework utilizes auto-regressive next-token prediction with role-specific tokens to facilitate seamless transitions between proposing ideas and critically evaluating them, offering richer feedback than binary signals. Additionally, DoT is formalized using Topos Theory, providing a mathematical foundation that ensures logical consistency and soundness in the reasoning process. This framework enhances both training and inference within a single LLM, eliminating the need for multiple models or external control mechanisms. DoT serves as a conceptual framework for designing next-generation reasoning-specialized models, emphasizing training efficiency and robust reasoning capabilities. The code supporting this research is accessible online.
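
A minimal data-structure sketch may help picture the DAG; this is my own illustration of the structure the paper describes, not the authors' code. Each node is a proposition tagged with a role, and its parent indices record which earlier nodes it builds on, keeping the trace acyclic and auditable.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    text: str
    role: str                                         # propose | critique | refine | verify
    parents: list[int] = field(default_factory=list)  # indices of earlier nodes

dot = [
    Node("x^2 - 5x + 6 = 0 has roots 2 and 4", "propose"),
    Node("Check: 4^2 - 5*4 + 6 = 2, not 0, so 4 is not a root", "critique", [0]),
    Node("Corrected roots: 2 and 3", "refine", [0, 1]),
    Node("Both 2 and 3 satisfy the equation", "verify", [2]),
]
for i, node in enumerate(dot):
    print(f"{i}: [{node.role}] {node.text}  <- parents {node.parents}")
```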

5. Agents in Software Engineering: Survey, Landscape, and Vision

  • Author(s): Yanlin Wang, Wanjun Zhong, Yanxian Huang, Ensheng Shi, Min Yang, Jiachi Chen, Hui Li, Yuchi Ma, Qianxiang Wang, and Zibin Zheng

Agents in Software Engineering

“Agents in Software Engineering: Survey, Landscape, and Vision” examines the integration of large language models (LLMs) with software engineering (SE) tasks through the use of agent technologies. Despite the growing application of LLMs in SE, there has been a lack of comprehensive surveys that explore how these models are combined with agent frameworks to enhance various SE tasks. This study provides the first in-depth survey of the intersection between LLM-based agents and SE, presenting a structured framework comprising three key components: perception, memory, and action. The paper analyzes existing research to understand how LLM-based agents optimize SE tasks and identifies current challenges in merging these fields. Additionally, it proposes future opportunities to address these challenges. By offering a detailed overview, the paper aims to clarify the role of LLM-based agents in SE and guide future research in this area. A GitHub repository is maintained to support this survey, providing access to related works and fostering further exploration in the field.
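
The perception-memory-action decomposition can be illustrated with a toy agent skeleton. This framing is mine, not code from the survey; a real SE agent would back `act` with an LLM call and emit concrete actions such as file edits or shell commands.

```python
class SEAgent:
    def __init__(self) -> None:
        self.memory: list[str] = []              # memory: past observations

    def perceive(self, observation: str) -> None:
        self.memory.append(observation)          # perception: ingest new input

    def act(self) -> str:
        context = " | ".join(self.memory[-3:])   # action: plan from recent memory
        return f"PLAN: investigate -> {context}"

agent = SEAgent()
agent.perceive("CI: test_login fails with KeyError: 'token'")
agent.perceive("git blame: auth.py changed in commit abc123")
print(agent.act())
```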

6. To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

  • Author(s): Zayne Sprague, Fangcong Yin, Juan Diego Rodriguez, Dongwei Jiang, Manya Wadhwa, Prasann Singhal, Xinyu Zhao, Xi Ye, Kyle Mahowald, and Greg Durrett

To CoT or not to CoT?

“To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning” investigates the efficacy of chain-of-thought (CoT) prompting for enhancing the reasoning capabilities of large language models (LLMs). Through a meta-analysis of over 100 papers and evaluations on 20 datasets across 14 models, the study finds that CoT significantly benefits tasks involving mathematics and logic but provides minimal improvement on other task types. Notably, on the MMLU benchmark, generating answers directly without CoT achieves accuracy similar to CoT unless the question involves symbolic operations, as indicated by the presence of an equals sign. The analysis further separates the planning and execution phases of CoT and compares them with tool-augmented LLMs, revealing that while CoT improves symbolic execution, it remains less effective than a dedicated symbolic solver. The authors suggest that CoT can be applied selectively to maintain performance while reducing inference costs, and they highlight the need to move beyond prompt-based CoT toward new paradigms that better utilize intermediate computation across LLM applications.
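
The two prompting conditions being compared are easy to picture with generic templates (my own wording, not the paper's exact prompts):

```python
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

direct_prompt = f"{question}\nAnswer with only the final number."
cot_prompt = f"{question}\nLet's think step by step, then give the final answer."

# The paper's rough finding: on math/symbolic questions like this one the CoT
# prompt helps; on most knowledge questions (e.g., much of MMLU) direct
# answering matches it at a fraction of the inference cost.
print(direct_prompt, cot_prompt, sep="\n\n")
```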

7. A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B

  • Author(s): Jemin Lee, Sihyeong Park, Jinse Kwon, Jihun Oh, and Yongin Kwon

A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs

“A Comprehensive Evaluation of Quantized Instruction-Tuned Large Language Models: An Experimental Analysis up to 405B” provides an in-depth assessment of instruction-tuned large language models (LLMs) under various quantization methods, including GPTQ, AWQ, SmoothQuant, and FP8. The study covers models ranging from 7 billion to 405 billion parameters, such as the Llama 3.1 family, evaluated across 13 benchmarks spanning six task types: commonsense question answering, knowledge and language understanding, instruction following, hallucination detection, mathematics, and dialogue. The findings indicate that quantizing a larger LLM down to the size of a smaller FP16 model generally yields superior performance on most benchmarks, with the exceptions of hallucination detection and instruction following. Performance is significantly influenced by the choice of quantization method, model size, and bit width, with weight-only methods often providing better outcomes in larger models. The research also suggests that task difficulty does not substantially affect the accuracy degradation caused by quantization, and it critiques the MT-Bench evaluation method for its limited ability to differentiate among recent high-performing LLMs.
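
To ground the terminology, here is a toy weight-only quantization round-trip. GPTQ and AWQ are far more sophisticated (calibration data, per-group scales, activation-aware scaling), but the core int4 mapping looks like this:

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization: map floats into [-8, 7]."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale  # weights recovered at inference time

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, scale = quantize_int4(w)
print("max abs reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```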

8. Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning

  • Author(s): Santosh Kumar Radha, Yasamin Nouri Jelyani, Ara Ghukasyan, and Oktay Goktas

Proposes the Iteration of Thought (IoT) framework to enhance LLM responses and reasoning capabilities with adaptive reasoning paths.

“Iteration of Thought: Leveraging Inner Dialogue for Autonomous Large Language Model Reasoning” introduces the Iteration of Thought (IoT) framework, designed to enhance the reasoning capabilities of large language models (LLMs). Unlike static methods such as Chain of Thought (CoT) or Tree of Thoughts (ToT), IoT dynamically adapts its reasoning path based on the evolving context of a conversation. The framework comprises three key components: an inner dialogue agent (IDA) that generates context-specific prompts, an LLM agent (LLMA) that refines responses, and an iterative prompting loop facilitating interaction between these agents. Two variants are proposed: autonomous iteration of thought (AIoT), where the LLM autonomously decides when to stop iterating, and guided iteration of thought (GIoT), which enforces a fixed number of iterations. The IoT framework has been tested across various datasets, including complex reasoning tasks from the GPQA dataset, problem-solving in Game of 24, puzzle-solving in Mini Crosswords, and multi-hop question answering from the HotpotQA dataset. Results demonstrate that IoT significantly improves LLM response quality over CoT, offering a more adaptive and efficient reasoning system with minimal human intervention.
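
The loop structure of AIoT can be sketched as follows. This is my reconstruction of the framework's shape from the paper's description, with stub agents standing in for real LLM calls; the stopping heuristic is invented for illustration.

```python
def aiot(question: str, llma, ida, max_iters: int = 5) -> str:
    """Autonomous Iteration of Thought: the IDA decides when to stop."""
    answer = llma(question)
    for _ in range(max_iters):
        prompt, done = ida(question, answer)  # IDA: next prompt + stop signal
        if done:
            break
        answer = llma(prompt)                 # LLMA: refine under new guidance
    return answer

# Stub agents so the sketch runs; real ones would each wrap an LLM call.
def llma(prompt: str) -> str:
    return f"draft answer for: {prompt}"

def ida(question: str, answer: str) -> tuple[str, bool]:
    done = "step-by-step" in answer           # invented stopping heuristic
    return f"Refine step-by-step: {answer}", done

print(aiot("Use 3, 3, 8, 8 to make 24", llma, ida))
```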

9. Schrodinger’s Memory: Large Language Models

  • Author(s): Wei Wang, Qing Li

Schrodinger's Memory: Large Language Models

“Schrodinger’s Memory: Large Language Models” explores the concept of memory within large language models (LLMs) and examines whether these models possess memory capabilities akin to those of humans. The study employs the Universal Approximation Theorem (UAT) to explain the mechanism behind LLM memory, proposing that LLMs exhibit a form of memory that becomes evident only when a specific memory is queried. This phenomenon is likened to Schrödinger’s memory: the presence of a memory is determined only by the model’s output in response to a query, and remains indeterminate otherwise. The authors conduct experiments to evaluate the memory capabilities of various LLMs and introduce a novel method for assessing these abilities. Additionally, the paper compares the memory functions of human brains with those of LLMs, highlighting both similarities and differences in their operational mechanisms. The research aims to deepen the understanding of how LLMs store and retrieve information, providing insight into their potential applications and limitations in tasks requiring memory.

10. Jailbreaking Large Language Models with Symbolic Mathematics

  • Author(s): Emet Bethany, Mazal Bethany, Juan Arturo Nolazco Flores, Sumit Kumar Jha, and Peyman Najafirad

An interesting jailbreaking technique using math-encoded prompts.

“Jailbreaking Large Language Models with Symbolic Mathematics” explores a novel method to bypass the safety mechanisms of large language models (LLMs) using symbolic mathematics. It highlights how LLMs, despite their advanced capabilities, can be vulnerable to specific types of attacks that exploit their mathematical reasoning abilities. The study demonstrates that symbolic mathematics can be used to generate prompts that effectively circumvent the built-in safety protocols of these models. By leveraging mathematical expressions and operations, the authors show that it is possible to induce LLMs to produce outputs that they would typically be restricted from generating. This approach not only reveals potential weaknesses in the current safety measures but also emphasizes the need for more robust defenses against such vulnerabilities. The paper provides a detailed analysis of the methods used to create these jailbreak prompts and evaluates their effectiveness across different LLM architectures. The findings underscore the importance of continuously improving the security frameworks of LLMs to prevent unauthorized manipulation and ensure safe deployment in various applications.