Abstract
Offline reinforcement learning (RL) learns from pre-recorded decision data without online interaction. In this setting, accurately evaluating out-of-distribution (OOD) actions is challenging and often leads to overoptimistic value estimates. Some methods mitigate this issue by avoiding out-of-sample actions entirely during training, i.e., in-sample learning. While these methods safely exploit the behavior in the dataset, their generalization capacity is inevitably compromised. To bridge this gap, we observe that the value function derived from a well-designed in-sample learning method can effectively constrain OOD action-values. Building on this insight, we develop a concise and effective In-sample Expectile Value Regularization (IEVR) method, which restricts OOD action-values using the in-sample expectile value while preserving standard Bellman updates for in-sample actions. We provide a theoretical analysis of the convergence of IEVR and substantiate the effectiveness of in-sample expectile values as a form of regularization through error bounds and experiments. Finally, extensive experimental results demonstrate that IEVR achieves significant performance improvements over existing methods across a diverse array of tasks in the D4RL benchmark.
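As a rough illustration of the two ingredients named in the abstract, the sketch below fits an expectile of a set of in-sample returns with the standard asymmetric squared loss, then uses that value as a ceiling for OOD action-values. The abstract does not state the loss used by IEVR, so everything here is an assumption: the function names, the value `tau=0.7`, the gradient-descent fit, and especially `ood_penalty`, which is a hypothetical one-sided penalty and not the regularizer from the paper.

```python
import numpy as np


def expectile_loss(diff, tau=0.7):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, averaged over samples."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))


def fit_expectile(samples, tau=0.7, lr=0.1, steps=2000):
    """Find the scalar v minimizing the expectile loss of (samples - v) by gradient descent."""
    v = float(np.mean(samples))
    for _ in range(steps):
        diff = samples - v
        weight = np.where(diff < 0, 1.0 - tau, tau)
        # d/dv mean(weight * (samples - v)**2) = -2 * mean(weight * diff)
        grad = -2.0 * float(np.mean(weight * diff))
        v -= lr * grad
    return v


def ood_penalty(q_ood, v_in_sample):
    """Hypothetical regularizer (an assumption, not the paper's formula):
    penalize OOD Q-values only where they exceed the in-sample value."""
    excess = np.maximum(q_ood - v_in_sample, 0.0)
    return float(np.mean(excess ** 2))


if __name__ == "__main__":
    returns = np.array([0.0, 1.0, 2.0, 3.0, 4.0])  # toy in-sample targets
    v_tau = fit_expectile(returns, tau=0.9)        # upper expectile of the data
    q_ood = np.array([1.0, 5.0])                   # toy OOD action-value estimates
    print("in-sample expectile value:", v_tau)
    print("OOD penalty:", ood_penalty(q_ood, v_tau))
```

With `tau = 0.5` the fitted value is the sample mean; raising `tau` moves it toward the largest in-sample target, so it approximates the best value supported by the data without querying any action outside the dataset.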
| Original language | English |
|---|---|
| Article number | 108763 |
| Journal | Neural Networks |
| Volume | 200 |
| DOIs | |
| State | Accepted/In press - 2026 |
Keywords
- Expectile value regularization
- In-sample learning
- Offline reinforcement learning
- Reinforcement learning
Fingerprint
Dive into the research topics of 'Mitigating OOD overoptimism via in-sample value function in offline reinforcement learning'. Together they form a unique fingerprint.