TY - GEN
T1 - FLIP
T2 - 2025 International Joint Conference on Neural Networks, IJCNN 2025
AU - Liu, Ziang
AU - Wu, Xingjiao
AU - Chen, Hongxin
AU - Xiao, Luwei
AU - Yang, Jing
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
AB - Preference-based Reinforcement Learning (PBRL) relies on the efficient collection and use of preference data to train accurate reward functions, enabling agents to learn directly from human preferences. This process allows agents to better understand human intentions while effectively reducing biases inherent in AI systems. While the pairwise comparison method gathers diverse preference data and SeqRank expands preference datasets through transitivity, both fail to establish preference relationships across different rounds of labeling. This limitation can result in fragmented preference signals and slow convergence toward the optimal policy. To address this, we propose the Global Tree (GTree), a method built on the SeqRank framework that integrates trajectory preferences across multiple rounds, providing a unified representation of global preferences. Moreover, we posit that different trajectory comparison methods offer distinct advantages depending on the task and the stage of training. To fully exploit these strengths, we introduce FLIP, an adaptive strategy that dynamically selects either the pairwise method or GTree based on historical performance, optimizing method use for each task and training stage. Our evaluations demonstrate that integrating cross-round preferences accelerates the convergence of the reward function, while the FLIP strategy further enhances learning efficiency and overall performance, thereby enabling agents to better understand human intentions.
KW - Human-AI Interaction
KW - Preference-based Reinforcement Learning
KW - Reinforcement Learning
UR - https://www.scopus.com/pages/publications/105023985802
U2 - 10.1109/IJCNN64981.2025.11228808
DO - 10.1109/IJCNN64981.2025.11228808
M3 - Conference contribution
AN - SCOPUS:105023985802
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - International Joint Conference on Neural Networks, IJCNN 2025 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 30 June 2025 through 5 July 2025
ER -