Artificial Intelligence (AI) and Machine Learning (ML) are changing how we live and work every day. From helping businesses run more smoothly to improving the technologies we use daily, these fields are constantly evolving. In this blog, we’ve handpicked the top 10 AI and machine learning research papers from September 23 to September 29, 2024. These papers introduce new ideas, tools, and systems that show the exciting potential of AI and ML to solve real-world problems, with fresh insights into what’s next for business, healthcare, and beyond. Whether you’re an expert or just curious, this list makes it easy to see how artificial intelligence is growing, how it might affect everyday life, and why these breakthroughs are worth your time.

1. Llama 3.2

  • Author(s): Meta

Revolutionizing edge AI and vision with open, customizable models

The “Llama 3.2” release introduces the latest advancements in edge AI and vision capabilities through open, customizable models. Building on the success of previous Llama models, Llama 3.2 includes small and medium-sized vision-language models (11B and 90B) and lightweight text-only models (1B and 3B) designed for edge and mobile devices. The vision models enhance image reasoning tasks such as document understanding, image captioning, and visual grounding, while the lightweight models excel at multilingual text generation and tool use, enabling personalized applications that run entirely on-device. Processing data locally delivers faster response times and preserves user privacy, since nothing needs to be sent to the cloud. Evaluated against leading foundation models, the Llama 3.2 models demonstrate competitive performance in image recognition and visual understanding tasks. The development process used structured pruning and knowledge distillation to create efficient models without compromising performance. The release aims to foster innovation by giving developers tools for building safe and responsible systems, supported by collaborations with major tech companies like Qualcomm and MediaTek. This open approach encourages widespread access to AI opportunities while supporting equitable and safe technology deployment.
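To make the on-device story concrete, here is a minimal sketch of running one of the lightweight text models locally with the Hugging Face transformers library. The model ID, gated-access requirement, and pipeline options are assumptions based on Meta’s release, not code from the report:

```python
# Minimal sketch: local text generation with a lightweight Llama 3.2 model.
# Assumes the `transformers` library (recent version with chat-style pipelines)
# and access to the gated "meta-llama/Llama-3.2-1B-Instruct" checkpoint.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-1B-Instruct",
    device_map="auto",       # falls back to CPU on devices without a GPU
    torch_dtype="bfloat16",  # reduced precision keeps the memory footprint small
)

messages = [
    {"role": "system", "content": "You are a concise on-device assistant."},
    {"role": "user", "content": "Draft a two-sentence summary of my day."},
]

# All computation stays on the local machine; no user data leaves the device.
output = generator(messages, max_new_tokens=128)
print(output[0]["generated_text"][-1]["content"])  # the assistant's reply
```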

2. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

  • Author(s): Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, Jiasen Lu, Taira Anderson, Erin Bransom, Kiana Ehsani, Huong Ngo, YenSung Chen, Ajay Patel, Mark Yatskar, Chris Callison-Burch, Andrew Head, Rose Hendrix, Favyen Bastani, Eli VanderBilt, Nathan Lambert, Yvonne Chou, Arnavi Chheda, Jenna Sparks, Sam Skjonsberg, Michael Schmitz, Aaron Sarnat, Byron Bischoff, Pete Walsh, Chris Newell, Piper Wolters, Tanmay Gupta, Jon Borchardt, Dirk Groeneveld, Jen Dumas, Crystal Nam, Sophie Lebrecht, Caitlin Wittlif, Carissa Schoenick, Oscar Michel, Ranjay Krishna, Luca Weihs, Noah Smith, Hannaneh Hajishirzi, Ross Girshick, Ali Farhadi, and Aniruddha Kembhavi

Meet Molmo: a family of open, state-of-the-art multimodal AI models.

“Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models” introduces the Molmo family of vision-language models (VLMs) alongside PixMo, the openly released collection of datasets used to train them, offering open access to both model weights and training data. These models are designed to advance the development of VLMs without relying on proprietary synthetic data. A key innovation is a novel image caption dataset collected from human annotators using speech-based descriptions, which enhances the richness and detail of the data and avoids the common pitfalls of synthetic data generated by proprietary systems. The Molmo models are trained with a streamlined process that combines a pre-trained vision encoder with a language model, followed by supervised fine-tuning on diverse datasets. These datasets include unique elements like 2D pointing data, which improves the models’ ability to perform tasks such as counting and visual grounding. The Molmo-72B model, in particular, outperforms other open-weight models and even rivals proprietary systems like GPT-4o and Claude 3.5 on both academic benchmarks and human evaluations. The release of Molmo aims to foster innovation and accessibility in multimodal AI research by providing comprehensive resources for further exploration.
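As a rough illustration of the recipe described above, a pre-trained vision encoder bridged into a language model, here is a hedged toy sketch in PyTorch. The dimensions and the two-layer projector are illustrative assumptions, not Molmo’s actual architecture:

```python
# Toy sketch of the VLM recipe: features from a (frozen, omitted) vision
# encoder pass through a small projector into the token embedding space of
# a (likewise omitted) language model.
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Bridges vision-encoder features into the LM embedding space."""
    def __init__(self, vision_dim=1024, lm_dim=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features, text_embeddings):
        # Map image patch features into the LM's embedding space, then
        # prepend them so the LM attends over image and text jointly.
        image_tokens = self.net(image_features)
        return torch.cat([image_tokens, text_embeddings], dim=1)

# 256 image patches and 32 text tokens, with made-up dimensions.
fused = Projector()(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))
print(fused.shape)  # torch.Size([1, 288, 4096])
```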

3. AlphaChip

  • Author(s): Anna Goldie, Azalia Mirhoseini, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nova, Jiwoo Pak, Andy Tong, Kavya Srinivasa, William Hang, Emre Tuncer, Quoc V. Le, James Laudon, Richard Ho, Roger Carpenter and Jeff Dean

AlphaChip

“AlphaChip: A Graph Placement Methodology for Fast Chip Design” presents AlphaChip, a pioneering deep reinforcement learning (RL) method designed to generate chip layouts that surpass human capabilities. Initially introduced in 2020, AlphaChip has sparked significant interest and advancements in AI-driven chip design. The methodology employs pre-training to improve speed, reliability, and placement quality, akin to techniques used in large language models, and an open-source software repository allows external researchers to reproduce the methods and apply pre-trained models to new chip blocks. The study highlights how AlphaChip’s performance scales with computational resources, using multiple GPUs and CPUs to fine-tune specific blocks, and an ablation study confirmed that performance holds up without the initial placement step. AlphaChip has demonstrated superior results on TPU blocks with sub-10 nm technology node sizes, outperforming human experts in wire length reduction across successive generations of Google’s Tensor Processing Unit (TPU). The paper underscores AlphaChip’s potential to transform the entire chip design process through automation, accelerating design cycles and enhancing performance, and it invites the community to further develop AI methods that integrate hardware, software, and machine learning models for optimized chip design.
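To give a feel for the objective such an RL agent optimizes, here is a toy sketch that scores candidate placements by negative half-perimeter wirelength (HPWL), with random search standing in for the learned policy. The netlist and grid are invented for illustration:

```python
# Toy sketch of the placement objective: put blocks on a grid and score the
# layout by negative HPWL over the netlist. A real system like AlphaChip
# learns a policy over graph embeddings; this only illustrates the reward.
import random

def hpwl(placement, nets):
    """Sum of half-perimeter bounding boxes, one per net."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

blocks = ["cpu", "cache", "io", "dram_ctrl"]
nets = [("cpu", "cache"), ("cpu", "io"), ("cache", "dram_ctrl")]

best = None
for _ in range(1000):
    # Random search stands in for the learned RL policy.
    cells = random.sample([(x, y) for x in range(8) for y in range(8)], len(blocks))
    layout = dict(zip(blocks, cells))
    reward = -hpwl(layout, nets)
    if best is None or reward > best[0]:
        best = (reward, layout)

print(best)
```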

4. LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench

  • Author(s): Karthik Valmeekam, Kaya Stechly, Subbarao Kambhampati

LLMs Still Can’t Plan

“LLMs Still Can’t Plan; Can LRMs? A Preliminary Evaluation of OpenAI’s o1 on PlanBench” examines the planning capabilities of large language models (LLMs) and of a new type of model, the Large Reasoning Model (LRM), specifically OpenAI’s o1 (Strawberry). Planning, a fundamental capability of intelligent agents, has long been a core area of artificial intelligence research. Despite the rapid development of LLMs since the release of GPT-3, progress on PlanBench, a benchmark developed in 2022 to assess the planning capabilities of LLMs, has been limited. The paper finds that while OpenAI’s o1 shows significant improvement on PlanBench, surpassing previous models, it still falls short of saturating the benchmark. This advancement raises important considerations regarding accuracy, efficiency, and reliability before such systems can be widely deployed, and the study underscores the need for continued evaluation and enhancement of these models to achieve more effective planning capabilities in AI systems.
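PlanBench-style evaluation boils down to executing a model’s proposed plan and checking whether the goal state is reached. The sketch below uses a deliberately simplified Blocksworld encoding (stacks as lists, moves as stack-index pairs), a simplifying assumption rather than PlanBench’s actual format:

```python
# Sketch of the kind of check PlanBench performs: simulate a candidate
# plan step by step and test whether the goal configuration is reached.
def execute(stacks, plan):
    for src, dst in plan:          # move the top block of stack src onto stack dst
        if not stacks[src]:
            return None            # an illegal move invalidates the plan
        stacks[dst].append(stacks[src].pop())
    return stacks

initial = [["A", "B"], ["C"], []]  # B sits on A; C is alone; one empty spot
goal = [["A"], [], ["C", "B"]]     # target: B stacked on C

candidate_plan = [(1, 2), (0, 2)]  # e.g., a plan proposed by the model under test
result = execute([s[:] for s in initial], candidate_plan)
print("valid plan" if result == goal else "invalid plan")
```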

5. Scaled-up Instructable Models Become Less Reliable

  • Author(s): Lexin Zhou, Wout Schellaert, Fernando Martínez-Plumed, Yael Moros-Daval, Cèsar Ferri and José Hernández-Orallo

Scaled-up Instructable Models Become Less Reliable

“Larger and More Instructable Language Models Become Less Reliable” examines the reliability of scaled-up language models, which have been enhanced through increased size, data volume, and computational resources. Despite these improvements, the study finds that larger, more instructable models are in some ways less reliable. The research investigates difficulty concordance, task avoidance, and prompting stability across several language model families. It reveals that although scaled-up models handle many difficult tasks well, they still fail on some very easy ones, leaving no low-difficulty region where errors are absent or easily spotted by human supervision. Furthermore, these models give seemingly correct but inaccurate answers more frequently than earlier versions, especially on questions that human supervisors might not catch. While scaling and shaping interventions improve stability against variations in question phrasing, inconsistencies persist across difficulty levels. These findings suggest a need for a fundamental shift in designing general-purpose AI systems, particularly in high-stakes areas where a predictable error distribution is crucial, and the paper emphasizes developing new strategies to make AI systems more reliable as they spread into everyday applications.
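A minimal sketch of the difficulty-concordance analysis might look like the following: bucket benchmark items by rated difficulty and ask whether the easy bucket is actually error-free. The records are illustrative placeholders, not data from the paper:

```python
# Hedged sketch of a difficulty-concordance check: if a model were reliable
# in the paper's sense, the "easy" bucket would contain no errors.
from collections import defaultdict

records = [
    {"difficulty": 0.1, "correct": True},
    {"difficulty": 0.2, "correct": False},  # failures even on easy items
    {"difficulty": 0.7, "correct": True},   # while some hard items succeed
    {"difficulty": 0.9, "correct": False},
]

buckets = defaultdict(list)
for r in records:
    buckets["hard" if r["difficulty"] >= 0.5 else "easy"].append(r["correct"])

for label in ("easy", "hard"):
    outcomes = buckets[label]
    print(f"{label}: accuracy {sum(outcomes) / len(outcomes):.2f}, "
          f"error-free: {all(outcomes)}")
```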

6. Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models

  • Author(s): Tongxuan Liu, Wenjiang Xu, Weizhe Huang, Xingyu Wang, Jiaxing Wang, Hailong Yang and Jing Li

Logic-of-Thought

“Logic-of-Thought: Injecting Logic into Contexts for Full Reasoning in Large Language Models” addresses the limitations of large language models (LLMs) in handling complex logical reasoning tasks. Current methods like Chain-of-Thought (CoT) improve reasoning but often produce conclusions that do not align with the generated reasoning chain. The study introduces Logic-of-Thought (LoT) prompting, which uses propositional logic to expand the logical information present in the input context, enhancing logical reasoning capabilities. The approach integrates seamlessly with existing prompting methods, adding logical information to input prompts without relying solely on symbolic solvers and thus avoiding information loss. Extensive experiments demonstrate that LoT significantly boosts the performance of various prompting methods across five logical reasoning tasks: it improves CoT on the ReClor dataset by 4.35%, CoT with Self-Consistency on LogiQA by 5%, and Tree-of-Thoughts on the ProofWriter dataset by 8%. The findings suggest that LoT can be effectively combined with existing methods to strengthen logical reasoning in LLMs, offering a robust way to enhance their capabilities on complex reasoning tasks and to advance the development of more reliable and accurate AI systems.
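A minimal sketch of the LoT idea, under the assumption that implications have already been extracted from the context (a step the paper performs with an LLM), expands them with simple propositional laws and appends the derived facts to the prompt:

```python
# Hedged sketch of Logic-of-Thought: expand extracted implications with
# contraposition and transitivity, then inject the derived facts into the
# prompt. The extraction step is stubbed out with hand-written rules.
def expand(implications):
    derived = set(implications)
    # Contraposition: (p -> q) entails (not q -> not p).
    for p, q in implications:
        derived.add((f"not {q}", f"not {p}"))
    # Transitivity: (p -> q) and (q -> r) entail (p -> r).
    for p, q in implications:
        for q2, r in implications:
            if q == q2:
                derived.add((p, r))
    return derived

context_rules = [("it rains", "the ground is wet"),
                 ("the ground is wet", "the match is postponed")]

extra = expand(context_rules) - set(context_rules)
augmented_prompt = "Known implications: " + "; ".join(
    f"if {p} then {q}" for p, q in sorted(extra)
)
print(augmented_prompt)  # appended to the original question before prompting
```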

7. Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make your LLMs use External Data More Wisely

  • Author(s): Siyun Zhao, Yuqing Yang, Zilong Wang, Zhiyuan He, Luna K. Qiu, Lili Qiu

The main focus of the four levels of queries

“Retrieval Augmented Generation (RAG) and Beyond: A Comprehensive Survey on How to Make Your LLMs Use External Data More Wisely” explores the integration of external data into large language models (LLMs) to enhance their performance on real-world tasks. Techniques such as Retrieval-Augmented Generation (RAG) and fine-tuning are increasingly popular, yet deploying data-augmented LLMs in specialized fields remains challenging: relevant data must be retrieved, user intent interpreted accurately, and the reasoning capabilities of LLMs fully utilized for complex tasks. The paper argues against a one-size-fits-all approach, noting that underperformance often results from misidentifying the task’s core focus or from failing to recognize that a task requires multiple capabilities that must be disentangled. To address these issues, the authors propose a RAG task categorization method that classifies user queries into four levels based on the type of external data required: explicit fact queries, implicit fact queries, interpretable rationale queries, and hidden rationale queries. The paper provides relevant datasets, summarizes the key challenges at each level along with effective techniques for overcoming them, and discusses three main ways of integrating external data into LLMs (context, small model, and fine-tuning), highlighting their strengths and limitations. This work aims to guide the systematic development of LLM applications by addressing data requirements and key bottlenecks.
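The four-level taxonomy naturally suggests a routing step before retrieval. The sketch below encodes it with naive keyword heuristics as placeholders; the survey envisions much richer classification, typically performed by an LLM:

```python
# Hedged sketch of the survey's four-level query taxonomy as a router:
# classify a user query, then pick a data strategy to match. The keyword
# heuristics are illustrative stand-ins, not the paper's method.
LEVELS = {
    "explicit_fact": "retrieve the passage that states the answer directly",
    "implicit_fact": "retrieve several passages and combine them",
    "interpretable_rationale": "retrieve domain guidelines and follow them",
    "hidden_rationale": "mine patterns from many examples or fine-tune",
}

def classify(query: str) -> str:
    q = query.lower()
    if q.startswith(("who", "when", "where")):
        return "explicit_fact"
    if "compare" in q or "how many" in q:
        return "implicit_fact"
    if "should" in q or "diagnose" in q:
        return "interpretable_rationale"
    return "hidden_rationale"

query = "Compare the revenue of the two subsidiaries in 2023."
level = classify(query)
print(level, "->", LEVELS[level])
```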

8. A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

  • Author(s): Yunfei Xie, Juncheng Wu, Haoqin Tu, Siwei Yang, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou

A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

The paper “A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?” investigates the capabilities of OpenAI’s latest large language model, o1, in the medical field. The model is notable for internalizing chain-of-thought reasoning through reinforcement learning strategies. While o1 has shown strong performance on general language tasks, its effectiveness in specialized areas like medicine has not been fully explored. The study evaluates o1 across various medical scenarios, focusing on understanding, reasoning, and multilingual capabilities, using six tasks over data from 37 medical datasets, including two newly developed question-answering tasks based on quizzes from the New England Journal of Medicine and The Lancet. These datasets pose more clinically relevant challenges than standard benchmarks like MedQA. The findings indicate that o1’s enhanced reasoning ability significantly improves its understanding of medical instructions and complex clinical scenarios, outperforming GPT-4 in accuracy by an average of 6.2% across 19 datasets and by 6.6% on the two newly created QA tasks. However, the study also identifies weaknesses such as hallucination, inconsistent multilingual performance, and discrepancies across evaluation metrics. The authors have released the raw data and model outputs to support future research in this area.

9. Small Language Models: Survey, Measurements, and Insights

  • Author(s): Zhenyan Lu, Xiang Li, Dongqi Cai, Rongjie Yi, Fangming Liu, Xiwen Zhang, Nicholas D. Lane, Mengwei Xu

Small Language Models: Survey, Measurements, and Insights

“Small Language Models: Survey, Measurements, and Insights” explores the role and development of small language models (SLMs), which are increasingly used in smart devices but have not received as much academic focus as large language models (LLMs). While LLMs are typically deployed in data centers to pursue artificial general intelligence, SLMs aim to make machine intelligence more accessible and efficient for everyday tasks. This study focuses on transformer-based, decoder-only language models ranging from 100 million to 5 billion parameters. It surveys 59 state-of-the-art open-source SLMs, examining their innovations in architecture, training datasets, and algorithms. The paper evaluates these models’ capabilities in areas such as commonsense reasoning, in-context learning, mathematics, and coding. Additionally, it benchmarks their on-device runtime costs by analyzing inference latency and memory usage. Through this comprehensive evaluation, the study provides valuable insights to advance the development and application of SLMs, highlighting their potential to offer efficient solutions for a wide range of tasks while maintaining affordability and accessibility. The findings aim to guide future research efforts in optimizing SLM performance and expanding their practical applications.
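The runtime measurements the survey reports can be approximated on a desktop with a few lines of PyTorch. The model ID below is one plausible sub-billion-parameter SLM and the fp32 CPU setup is an assumption for illustration; the paper itself benchmarks on actual edge devices:

```python
# Hedged sketch of an on-device cost measurement: time generation and
# estimate the weight footprint for a small model. Assumes `transformers`
# and the "Qwen/Qwen2-0.5B" checkpoint as an example SLM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen2-0.5B"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float32)

inputs = tok("The capital of France is", return_tensors="pt")

start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
latency = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
params = sum(p.numel() for p in model.parameters())
print(f"{new_tokens / latency:.1f} tokens/s, "
      f"{params * 4 / 1e9:.2f} GB of weights at fp32")
```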

10. Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts

  • Author(s): Ming Wang, Yuanzhong Liu, Xiaoyu Liang, Yijie Huang, Daling Wang, Xiaocui Yang, Sijia Shen, Shi Feng, Xiaoming Zhang, Chaofeng Guan, Yifei Zhang

The overall framework of Minstrel, a structural prompt generation framework with multi-agent collaboration

“Minstrel: Structural Prompt Generation with Multi-Agents Coordination for Non-AI Experts” addresses the challenge of creating effective prompts for large language models (LLMs), particularly for people without expertise in artificial intelligence. While LLMs perform strongly across many tasks, designing high-quality prompts remains difficult for non-experts because optimization principles are scattered and existing prompt optimizers depend heavily on empirical tuning. These methods often lack structured design, leading to high learning costs and difficulty in updating prompts iteratively. Inspired by structured programming languages, the authors propose LangGPT, a framework for structural prompt design, and introduce Minstrel, a multi-generative agent system that automates the creation of structural prompts through reflection. Experiments and case studies show that structural prompts, whether generated by Minstrel or crafted manually, significantly improve LLM performance, and a user survey conducted in an online community suggests the structural prompts are easy to use. This research aims to make prompt generation more accessible and efficient for non-AI experts, enhancing the usability and effectiveness of LLMs across applications.
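To show what a structural prompt looks like in practice, here is a hedged sketch that assembles one from named sections in the spirit of the paper’s structured-programming analogy. The section names and rendering format are assumptions, not the exact LangGPT schema:

```python
# Illustrative LangGPT-style structural prompt, assembled in code. Section
# names (Role, Profile, Rules, Workflow) are plausible placeholders.
SECTIONS = {
    "Role": "Travel Planner",
    "Profile": "An assistant that builds day-by-day itineraries.",
    "Rules": [
        "Stay within the user's stated budget.",
        "Never book anything; only propose options.",
    ],
    "Workflow": [
        "Ask for destination, dates, and budget if missing.",
        "Propose an itinerary, then refine it from feedback.",
    ],
}

def render(sections):
    """Flatten the named sections into a single prompt string."""
    lines = []
    for name, body in sections.items():
        lines.append(f"# {name}")
        if isinstance(body, list):
            lines.extend(f"- {item}" for item in body)
        else:
            lines.append(body)
    return "\n".join(lines)

print(render(SECTIONS))  # the structured prompt sent to the LLM
```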

As AI and machine learning continue to grow, these research papers from September 23 to 29, 2024, show what’s coming next. Whether you’re a business owner looking for smarter ways to work, a tech enthusiast, or just curious about how AI is changing the world, these papers offer something for everyone. Keeping up with these ideas helps you stay informed and ready for the future as AI becomes an even bigger part of our daily lives.