De-Pessimism Offline Reinforcement Learning via Value Compensation

  • Zhenbo Huang
  • Jing Zhao
  • Shiliang Sun*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Offline reinforcement learning (RL) has been widely adopted in practice due to its efficient data utilization, but it remains vulnerable during training because of policy deviation. Existing offline RL methods that add policy constraints or perform conservative Q-value estimation are pessimistic, making the learned policy suboptimal. In this article, we address the pessimism problem by focusing on accurate Q-value estimation. We propose the de-pessimism (DEP) operator, which estimates Q values with either the optimal Bellman operator or a compensation operator, depending on whether the actions lie in the behavior support set. The compensation operator qualitatively determines whether an out-of-distribution (OOD) action is positive or negative based on its performance relative to the behavior actions, and it leverages differences in state values to compensate the Q values of positive OOD actions, thereby alleviating pessimism. We theoretically demonstrate the convergence of DEP and its effectiveness in policy improvement. To further advance practical application, we integrate DEP into the soft actor-critic (SAC) algorithm, yielding value-compensated de-pessimism offline RL (DoRL-VC). Experimentally, DoRL-VC achieves state-of-the-art (SOTA) performance across MuJoCo locomotion, Maze2D, and challenging Adroit tasks, illustrating the efficacy of DEP in mitigating pessimism.
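The abstract describes a two-branch backup: in-support actions receive the usual optimal Bellman target, while OOD actions are first classified as positive or negative relative to the behavior action and, if positive, have their Q value compensated by a state-value difference. The toy tabular sketch below illustrates only that branching structure; the support test, the positivity criterion, and the exact compensation term (`Q` of the behavior action plus `V[s_next] - V[s]`) are assumptions for illustration, not the paper's actual operator.

```python
import numpy as np

def dep_backup(Q, V, r, s, a, s_next, support, behavior_a, gamma=0.99):
    """Illustrative DEP-style target for Q[s, a] (hypothetical form)."""
    if a in support[s]:
        # In-support action: standard optimal Bellman backup.
        return r + gamma * np.max(Q[s_next])
    # OOD action: compare against the behavior action's value at this state.
    if Q[s, a] > Q[s, behavior_a[s]]:
        # "Positive" OOD action: compensate the behavior action's Q value
        # with a state-value difference instead of suppressing it.
        return Q[s, behavior_a[s]] + (V[s_next] - V[s])
    # "Negative" OOD action: fall back to the behavior action's value.
    return Q[s, behavior_a[s]]

# Tiny 2-state, 2-action example: action 1 is OOD in state 0.
Q = np.array([[1.0, 2.0], [0.5, 0.0]])
V = np.array([0.5, 1.5])
support = {0: {0}, 1: {0, 1}}
behavior_a = {0: 0, 1: 0}

t_in = dep_backup(Q, V, r=1.0, s=1, a=0, s_next=0,
                  support=support, behavior_a=behavior_a)
t_ood = dep_backup(Q, V, r=1.0, s=0, a=1, s_next=1,
                   support=support, behavior_a=behavior_a)
print(t_in, t_ood)  # in-support Bellman target vs. compensated OOD target
```

Here the positive OOD action's target is lifted by `V[s_next] - V[s] = 1.0` above the behavior action's Q value, rather than being penalized, which is the de-pessimism effect the abstract describes.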

Original language: English
Pages (from-to): 12655-12667
Number of pages: 13
Journal: IEEE Transactions on Neural Networks and Learning Systems
Volume: 36
Issue number: 7
DOIs
State: Published - 2025

Keywords

  • De-pessimism (DEP)
  • offline reinforcement learning (RL)
  • out-of-distribution (OOD) action
  • value compensation

