• Author(s): Sachit Menon, Richard Zemel, Carl Vondrick

“Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities” introduces Whiteboard-of-Thought (WoT), an approach to multimodal reasoning. The method aims to make AI systems more interpretable and transparent by letting them reason step by step across modalities such as text, images, and sketches. WoT draws inspiration from the way humans use whiteboards to break complex problems into smaller, more manageable steps. The framework lets AI models mimic this process by generating a sequence of intermediate reasoning steps, each represented by a combination of text, images, and sketches. The steps are generated dynamically from the input query and the available knowledge sources.
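To make the idea of a sequence of intermediate steps concrete, here is a minimal Python sketch of a whiteboard that accumulates steps, each pairing a text note with an optional visual. It is an illustration under stated assumptions, not the authors' implementation: the names `WhiteboardStep`, `Whiteboard`, `reason`, and `toy_next_step` are hypothetical, the "sketch" is a plain string standing in for an image, and the scripted step generator stands in for a real multimodal model.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class WhiteboardStep:
    """One intermediate reasoning step: a text note plus an optional visual."""
    text: str
    sketch: Optional[str] = None  # hypothetical stand-in for an image or drawing


@dataclass
class Whiteboard:
    """Accumulates steps in order so the full reasoning trace can be inspected."""
    query: str
    steps: List[WhiteboardStep] = field(default_factory=list)

    def render(self) -> str:
        """Return a plain-text view of the query and every step so far."""
        lines = [f"Query: {self.query}"]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"Step {i}: {step.text}")
            if step.sketch is not None:
                lines.append(f"  [sketch] {step.sketch}")
        return "\n".join(lines)


# Assumed interface: given the board so far, return the next step,
# or None when the reasoning is finished.
StepFn = Callable[[Whiteboard], Optional[WhiteboardStep]]


def reason(query: str, next_step: StepFn, max_steps: int = 8) -> Whiteboard:
    """Build a whiteboard by repeatedly asking for the next step."""
    board = Whiteboard(query)
    for _ in range(max_steps):
        step = next_step(board)
        if step is None:
            break
        board.steps.append(step)
    return board


def toy_next_step(board: Whiteboard) -> Optional[WhiteboardStep]:
    """Scripted placeholder for a model; a real system would condition on the board."""
    scripted = [
        WhiteboardStep("Draw the grid described in the query", sketch="3x3 grid"),
        WhiteboardStep("Mark the start cell and trace the moves",
                       sketch="path: right, right, down"),
        WhiteboardStep("Read the final position off the drawing"),
    ]
    n = len(board.steps)
    return scripted[n] if n < len(scripted) else None


if __name__ == "__main__":
    print(reason("Where does the path end?", toy_next_step).render())
```

Keeping every step on the board, rather than only the final answer, is what would let a reader retrace how a conclusion was reached.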

One of the key advantages of WoT is its ability to provide a clear and intuitive explanation of the reasoning process. By presenting the intermediate steps in a visually rich format, users can easily understand how the AI system arrived at its final conclusion. This transparency is crucial for building trust in AI systems and facilitating human-AI collaboration. The WoT framework is designed to be modular and extensible, allowing it to incorporate various knowledge sources and reasoning techniques. It can handle a wide range of tasks, including question-answering, problem-solving, and decision-making. The authors demonstrate the effectiveness of WoT through extensive experiments on multiple datasets, showcasing its ability to generate accurate and interpretable reasoning steps.

“Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities” presents a significant advancement in multimodal reasoning. By enabling AI systems to reason step by step across different modalities, WoT makes AI decision-making processes more interpretable and transparent. This research has important implications for developing more trustworthy and collaborative AI systems that can communicate their reasoning effectively to human users.