
De-Pessimism Offline Reinforcement Learning via Value Compensation

  • Zhenbo Huang
  • Jing Zhao
  • Shiliang Sun*
  • *Corresponding author for this work
  • East China Normal University
  • Shanghai Jiao Tong University

Research output: Contribution to journal › Article › peer-review

Abstract

Offline reinforcement learning (RL) has been widely used in practice due to its efficient data utilization, but it still faces the challenge of training vulnerability caused by policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q-value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q-value estimation. We propose the de-pessimism (DEP) operator, which estimates Q values with either the optimal Bellman operator or the compensation operator, according to whether the actions are in the behavior support set. The compensation operator qualitatively determines whether an out-of-distribution (OOD) action is positive or negative based on its performance relative to the behavior actions. It leverages differences in state values to compensate the Q values of positive OOD actions, thereby alleviating pessimism. We theoretically demonstrate the convergence of DEP and its effectiveness in policy improvement. To further advance practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding the value-compensated de-pessimism offline RL algorithm (DoRL-VC). Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across MuJoCo locomotion, Maze 2-D, and challenging Adroit tasks, illustrating the efficacy of DEP in mitigating pessimism.
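The abstract's description of the DEP operator suggests a simple case split: in-support actions receive the standard optimal Bellman backup, while OOD actions are first judged positive or negative against the behavior action and, if positive, have their Q targets compensated with a state-value difference. The following is a minimal tabular sketch of that split; the function names (`dep_backup`, `in_support`, `behavior_action`) and the exact form of the compensation rule are illustrative assumptions, not the paper's actual definitions.

```python
# A minimal tabular sketch of the DEP-style backup described in the abstract.
# The names and the precise compensation formula below are assumptions for
# illustration only; the paper defines the actual operator.
import numpy as np

def dep_backup(Q, V, r, s, a, s_next, gamma, in_support, behavior_action):
    """One hypothetical DEP-style target for a single transition (s, a, r, s')."""
    if in_support(s, a):
        # In-support action: apply the optimal Bellman operator,
        # r + gamma * max_a' Q(s', a').
        return r + gamma * np.max(Q[s_next])
    b = behavior_action(s)
    if Q[s, a] >= Q[s, b]:
        # "Positive" OOD action (estimated to outperform the behavior action):
        # compensate its target with a state-value difference (assumed form).
        return Q[s, b] + (V[s_next] - V[s])
    # "Negative" OOD action: anchor the target to the behavior action's value.
    return Q[s, b]

# Toy usage on a 3-state, 2-action MDP (purely illustrative).
Q = np.zeros((3, 2))
V = np.zeros(3)
target = dep_backup(Q, V, r=1.0, s=0, a=1, s_next=2, gamma=0.99,
                    in_support=lambda s, a: a == 0,
                    behavior_action=lambda s: 0)
```

In practice the support check and behavior action would come from the offline dataset or a learned behavior model rather than the hard-coded lambdas above.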

Original language: English
Pages (from-to): 12655-12667
Number of pages: 13
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 36
Issue number: 7
DOI
Publication status: Published - 2025
