TY - JOUR
T1 - Off-policy evaluation for tabular reinforcement learning with synthetic trajectories
AU - Wang, Weiwei
AU - Li, Yuqiang
AU - Wu, Xianyi
N1 - Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2024/2
Y1 - 2024/2
AB - This paper addresses the problem of offline policy evaluation in tabular reinforcement learning (RL). We propose a novel method that leverages synthetic trajectories constructed by sampling with replacement from the available data, combining the advantages of model-based and Monte Carlo policy evaluation. The method is accompanied by theoretically derived finite-sample upper error bounds, offering performance guarantees and allowing a trade-off between statistical efficiency and computational cost. Computational experiments demonstrate that our method consistently achieves tighter upper error bounds and lower relative mean squared errors than importance sampling, doubly robust methods, and other existing approaches. Furthermore, it attains these results in significantly shorter running times than traditional model-based approaches. These findings highlight the effectiveness and efficiency of the synthetic trajectory method for accurate offline policy evaluation in RL.
KW - Importance sampling
KW - Markov decision process
KW - Off-policy evaluation
KW - Reinforcement learning
KW - Synthetic trajectories
UR - https://www.scopus.com/pages/publications/85176926160
U2 - 10.1007/s11222-023-10351-y
DO - 10.1007/s11222-023-10351-y
M3 - Article
AN - SCOPUS:85176926160
SN - 0960-3174
VL - 34
JO - Statistics and Computing
JF - Statistics and Computing
IS - 1
M1 - 41
ER -