• Author(s): Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, Tong Zhang

“Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts” introduces a new approach to learning and representing human preferences in reinforcement learning (RL) systems. The goal is a more interpretable and expressive way to capture complex, potentially conflicting preferences, which is essential for aligning AI systems with human values.

The proposed method combines two key components: multi-objective reward modeling and a mixture-of-experts architecture. Multi-objective reward modeling represents multiple, possibly competing objectives that reflect different aspects of human preferences; by modeling these objectives explicitly, the system captures a more nuanced picture of what humans value in a given context. The mixture-of-experts architecture learns a diverse set of reward functions, each representing a specific preference or objective, which lets the system cover a wide range of preferences and handle conflicts between them. It also provides a degree of interpretability, since each expert corresponds to a distinct preference that can be analyzed and understood independently.
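To make this architecture concrete, here is a minimal sketch of how per-objective reward heads and a gating network can be combined into a single scalar reward. It assumes a frozen encoder that maps each prompt-response pair to a hidden vector; the class name, dimensions, and objective count are illustrative assumptions, not the paper's released implementation.

```python
# Minimal sketch (PyTorch) of a multi-objective reward head with a gating network.
# All names, dimensions, and the number of objectives are illustrative assumptions.
import torch
import torch.nn as nn


class MultiObjectiveRewardModel(nn.Module):
    def __init__(self, hidden_dim: int = 768, num_objectives: int = 5):
        super().__init__()
        # One linear head predicts a reward per objective (e.g. helpfulness, safety, ...).
        self.objective_heads = nn.Linear(hidden_dim, num_objectives)
        # A gating network produces weights over objectives, conditioned on the
        # same representation (the mixture-of-experts idea).
        self.gating = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_objectives),
        )

    def forward(self, h: torch.Tensor) -> dict:
        objective_rewards = self.objective_heads(h)             # (batch, num_objectives)
        weights = torch.softmax(self.gating(h), dim=-1)         # (batch, num_objectives), sums to 1
        scalar_reward = (weights * objective_rewards).sum(-1)   # (batch,)
        return {
            "objective_rewards": objective_rewards,  # interpretable per-objective scores
            "weights": weights,                      # which objectives matter for this input
            "reward": scalar_reward,                 # final scalar used for ranking or RL
        }


if __name__ == "__main__":
    model = MultiObjectiveRewardModel()
    h = torch.randn(2, 768)  # stand-in for encoder outputs on two prompt-response pairs
    out = model(h)
    print(out["reward"].shape, out["weights"].shape)
```

Because the scalar reward is an explicit weighted sum, both the per-objective scores and the gating weights can be inspected directly, which is where the interpretability claim comes from.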

To train the multi-objective reward model and the mixture-of-experts architecture, the authors introduce a learning algorithm that leverages human feedback and demonstrations. The reward functions are refined iteratively on the collected data, so the system gradually improves its model of human preferences. The training process is designed to be efficient and scalable, allowing the system to learn from a limited amount of human input.
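The following is a minimal training sketch under the assumption that human feedback arrives in two forms: per-objective ratings, used to supervise the objective heads by regression, and pairwise preferences, used to fit the combined scalar reward with a Bradley-Terry-style loss. The loss choices, data shapes, and synthetic batches are illustrative assumptions rather than a reproduction of the paper's exact procedure; the sketch reuses the MultiObjectiveRewardModel class defined above.

```python
# Illustrative training loop; losses and data are stand-ins, not the paper's exact recipe.
import torch
import torch.nn.functional as F

# Reuses MultiObjectiveRewardModel from the previous sketch.
model = MultiObjectiveRewardModel(hidden_dim=768, num_objectives=5)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(100):
    # --- synthetic stand-ins for a real batch of human feedback ---
    h_rated = torch.randn(8, 768)      # encoder outputs for rated responses
    ratings = torch.rand(8, 5)         # human ratings per objective, in [0, 1]
    h_chosen = torch.randn(8, 768)     # encoder outputs for preferred responses
    h_rejected = torch.randn(8, 768)   # encoder outputs for dispreferred responses

    # Regression loss ties each objective head to its human rating.
    rating_loss = F.mse_loss(model(h_rated)["objective_rewards"], ratings)

    # Bradley-Terry-style preference loss on the combined scalar reward.
    margin = model(h_chosen)["reward"] - model(h_rejected)["reward"]
    preference_loss = -F.logsigmoid(margin).mean()

    loss = rating_loss + preference_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```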

The paper presents extensive experiments demonstrating the effectiveness of the proposed method. The authors evaluate it on a range of RL tasks and compare it against existing preference-learning techniques. The results show that the multi-objective reward model with the mixture-of-experts architecture consistently outperforms the baselines in both performance and interpretability, and the learned reward functions offer meaningful insight into the underlying human preferences, supporting better understanding of, and trust in, the system’s decision-making.

“Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts” presents a significant advancement in the field of preference learning for RL systems. By combining multi-objective reward modeling and a mixture-of-experts architecture, the proposed method enables the learning of more expressive and interpretable representations of human preferences. This research has important implications for developing AI systems that align with human values and can be trusted to make decisions in complex real-world environments.