TY - JOUR
T1 - De-Pessimism Offline Reinforcement Learning via Value Compensation
AU - Huang, Zhenbo
AU - Zhao, Jing
AU - Sun, Shiliang
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Offline reinforcement learning (RL) has been widely used in practice due to its efficient data utilization, but it still faces the challenge of training vulnerability caused by policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q-value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q-value estimation. We propose the de-pessimism (DEP) operator to estimate Q-values using either the optimal Bellman operator or the compensation operator, according to whether the actions are in the behavior support set. The compensation operator qualitatively determines the positive or negative nature of out-of-distribution (OOD) actions based on their performance compared with the behavior actions. It leverages differences in state values to compensate for the Q-values of positive OOD actions, thereby alleviating pessimism. We theoretically demonstrate the convergence of DEP and its effectiveness in policy improvement. To further advance the practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding the value-compensated de-pessimism offline RL (DoRL-VC) algorithm. Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across MuJoCo locomotion, Maze 2-D, and challenging Adroit tasks, illustrating the efficacy of DEP in mitigating pessimism.
AB - Offline reinforcement learning (RL) has been widely used in practice due to its efficient data utilization, but it still faces the challenge of training vulnerability caused by policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q-value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q-value estimation. We propose the de-pessimism (DEP) operator to estimate Q-values using either the optimal Bellman operator or the compensation operator, according to whether the actions are in the behavior support set. The compensation operator qualitatively determines the positive or negative nature of out-of-distribution (OOD) actions based on their performance compared with the behavior actions. It leverages differences in state values to compensate for the Q-values of positive OOD actions, thereby alleviating pessimism. We theoretically demonstrate the convergence of DEP and its effectiveness in policy improvement. To further advance the practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding the value-compensated de-pessimism offline RL (DoRL-VC) algorithm. Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across MuJoCo locomotion, Maze 2-D, and challenging Adroit tasks, illustrating the efficacy of DEP in mitigating pessimism.
KW - De-pessimism (DEP)
KW - offline reinforcement learning (RL)
KW - out-of-distribution (OOD) action
KW - value compensation
UR - https://www.scopus.com/pages/publications/85218978467
U2 - 10.1109/TNNLS.2024.3443082
DO - 10.1109/TNNLS.2024.3443082
M3 - Article
C2 - 39178073
AN - SCOPUS:85218978467
SN - 2162-237X
VL - 36
SP - 12655
EP - 12667
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 7
ER -