• Author(s): Shenao Zhang, Donghan Yu, Hiteshi Sharma, Ziyi Yang, Shuohang Wang, Hany Hassan, Zhaoran Wang

Preference optimization techniques, such as Reinforcement Learning from Human Feedback (RLHF), have demonstrated remarkable success in aligning large language models (LLMs) with human intentions. In contrast to offline alignment methods that rely on fixed datasets, collecting online feedback from humans or AI on model generations has proven more effective, yielding more accurate reward models and better-aligned LLMs through an iterative process. However, building a globally accurate reward model requires systematic exploration that generates diverse responses covering the vast space of natural language, a requirement that random sampling from standard reward-maximizing LLMs alone cannot meet.

To tackle this challenge, this paper introduces a bilevel objective that is optimistically biased towards potentially high-reward responses, actively exploring out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, called Self-Exploring Language Models (SELM), eliminates the need for a separate reward model and iteratively updates the LLM with a simple, direct objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favoring of unseen extrapolations and improves exploration efficiency.
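To make this concrete, below is a minimal sketch of how such an objective could be implemented, assuming the optimism bonus is applied to the implicit (reparameterized) reward of the chosen response on top of a standard DPO loss. The function name, hyperparameter values, and exact form of the bonus are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch: DPO loss plus an optimism bonus on the implicit reward.
# This is not the paper's implementation; the exact weighting and form may differ.
import torch
import torch.nn.functional as F

def selm_loss(policy_chosen_logps: torch.Tensor,
              policy_rejected_logps: torch.Tensor,
              ref_chosen_logps: torch.Tensor,
              ref_rejected_logps: torch.Tensor,
              beta: float = 0.1,
              alpha: float = 0.005) -> torch.Tensor:
    """Combine the DPO preference loss with an optimism term that biases
    updates toward potentially high-reward responses.

    All inputs are summed log-probabilities of full responses, shape (batch,).
    """
    # Implicit rewards under the DPO reparameterization: r(x, y) = beta * log(pi / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Standard DPO term: maximize the log-sigmoid of the reward margin.
    dpo_term = -F.logsigmoid(chosen_rewards - rejected_rewards)

    # Optimism term (assumed form): increase the implicit reward of the chosen
    # response, nudging the policy toward regions it currently deems promising.
    optimism_term = -alpha * chosen_rewards

    return (dpo_term + optimism_term).mean()
```

In this sketch, alpha trades off optimism-driven exploration against fidelity to the observed preference data; setting alpha to zero recovers the plain DPO loss.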

The paper presents a comprehensive evaluation of the SELM approach, demonstrating that it generates diverse, high-quality responses while remaining aligned with human preferences. Experiments across a range of datasets and benchmarks showcase the robustness and generalizability of the method, highlighting the potential of SELM to improve LLM performance in applications such as dialogue systems, content generation, and task-oriented interactions.

Furthermore, the paper discusses the implications of the SELM approach for the development of more reliable and trustworthy AI systems: by actively exploring and refining the reward model, SELM contributes to ongoing efforts to deploy LLMs safely and responsibly in real-world scenarios. In summary, the paper introduces self-exploring language models as a novel approach to active preference elicitation for the online alignment of LLMs, addressing the limitations of existing techniques and paving the way for more effective and efficient alignment of language models with human intentions.