• Author(s): Jing Lin, Yao Feng, Weiyang Liu, Michael J. Black

The paper “ChatHuman: A Language-Driven Human Understanding System” starts from the observation that many methods exist to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion, but that these methods typically work in isolation rather than synergistically.

To address this, the paper introduces ChatHuman, a language-driven human understanding system that integrates the skills of many such methods by fine-tuning a Large Language Model (LLM) to select and apply a wide variety of existing tools in response to user inputs. As a result, ChatHuman can combine information from multiple tools to solve problems more accurately than the individual tools themselves, and it can leverage tool output to improve its ability to reason about humans.
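To make the tool-driven workflow concrete, here is a minimal sketch of a language-driven tool dispatcher. The tool registry, stub tools, and keyword-overlap selection heuristic are illustrative assumptions standing in for the fine-tuned LLM and the real 3D human-analysis models, not the paper's implementation.

```python
# Hypothetical sketch of ChatHuman-style tool dispatch: a selection step
# (mocked here by keyword overlap) picks a human-analysis tool for a user
# query, runs it, and composes an answer from its output. In the actual
# system the fine-tuned LLM makes this decision and reasons over the result.
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]  # takes an image path, returns a text summary

# Illustrative registry; real tools would wrap pose, contact, or emotion models.
TOOLS: Dict[str, Tool] = {
    "pose_estimator": Tool(
        "pose_estimator",
        "estimates 3D body pose and shape from an image",
        lambda img: f"3D pose and shape parameters for {img}"),
    "emotion_recognizer": Tool(
        "emotion_recognizer",
        "recognizes facial emotion from an image",
        lambda img: f"emotion label for {img}"),
}

def select_tool(user_query: str) -> Tool:
    """Stand-in for the LLM's tool-selection step: naive word overlap."""
    query_words = set(user_query.lower().replace("?", "").split())
    best: Optional[Tool] = None
    best_overlap = 0
    for tool in TOOLS.values():
        overlap = len(query_words & set(tool.description.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = tool, overlap
    return best or TOOLS["pose_estimator"]  # arbitrary fallback

def answer(user_query: str, image_path: str) -> str:
    tool = select_tool(user_query)
    tool_output = tool.run(image_path)
    # ChatHuman would feed this output back to the LLM for further reasoning;
    # here we simply return it with the chosen tool's name.
    return f"[{tool.name}] {tool_output}"

if __name__ == "__main__":
    print(answer("What emotion is this person showing?", "person.jpg"))
```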

ChatHuman’s novel features include leveraging academic publications to guide the application of 3D human-related tools, employing retrieval-augmented generation to produce in-context-learning examples for handling new tools, and discriminating among and integrating tool results to enhance 3D human understanding.
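The retrieval-augmented mechanism for unseen tools might be sketched as follows. The example store, bag-of-words similarity, and prompt format below are illustrative assumptions; the paper's system draws its examples from the tools' academic publications rather than the hand-written snippets used here.

```python
# Minimal sketch (not the paper's code) of retrieval-augmented in-context
# example selection: given a new, unseen tool, retrieve the most similar
# documented tool-use examples and prepend them to the LLM prompt.
import math
from collections import Counter
from typing import List, Tuple

# Hypothetical store of (tool description, worked usage example) pairs.
EXAMPLE_STORE: List[Tuple[str, str]] = [
    ("estimate 3D body pose and shape from a single image",
     "Call pose_tool(image); it returns body pose and shape parameters."),
    ("detect human-object contact regions on the body",
     "Call contact_tool(image); it returns per-region contact probabilities."),
    ("recognize the emotion expressed by a person's face",
     "Call emotion_tool(image); it returns a distribution over emotion labels."),
]

def _bow(text: str) -> Counter:
    """Bag-of-words term counts, a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve_examples(new_tool_description: str, k: int = 2) -> List[str]:
    """Rank stored examples by similarity to the new tool's description."""
    query = _bow(new_tool_description)
    ranked = sorted(EXAMPLE_STORE,
                    key=lambda pair: _cosine(query, _bow(pair[0])),
                    reverse=True)
    return [example for _, example in ranked[:k]]

def build_prompt(new_tool_description: str, user_query: str) -> str:
    """Compose an in-context prompt for handling the unseen tool."""
    examples = "\n".join(retrieve_examples(new_tool_description))
    return (f"Here are examples of using similar tools:\n{examples}\n\n"
            f"New tool: {new_tool_description}\n"
            f"User request: {user_query}\n"
            f"Decide how to call the new tool.")

if __name__ == "__main__":
    print(build_prompt("estimate hand pose from an image",
                       "What is this person doing with their hands?"))
```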

The paper presents experiments showing that ChatHuman outperforms existing models in both tool selection accuracy and performance across multiple 3D human-related tasks. Thus, ChatHuman represents a significant step towards consolidating diverse methods for human analysis into a single, powerful system for 3D human reasoning.