TY - JOUR
T1 - Robust sequential design for piecewise-stationary multi-armed bandit problem in the presence of outliers
AU - Wang, Yaping
AU - Peng, Zhicheng
AU - Zhang, Riquan
AU - Xiao, Qian
N1 - Publisher Copyright: © 2021 East China Normal University.
PY - 2021
Y1 - 2021
AB - The multi-armed bandit (MAB) problem studies sequential decision making under uncertainty with partial feedback on rewards. Its name comes from imagining a gambler at a row of slot machines who must decide how many times, and in what order, to play each machine. It is a classic reinforcement learning problem that underlies many online learning problems. In many practical applications of the MAB, the reward distributions may change at unknown time steps, and outliers (extreme rewards) are often present. Current sequential design strategies may struggle in such cases, as they tend to infer additional change points to fit the outliers. In this paper, we propose a robust change-detection upper confidence bound (RCD-UCB) algorithm that can distinguish real change points from outliers in piecewise-stationary MAB settings. We show that the proposed RCD-UCB algorithm achieves a nearly optimal regret bound on the order of (Formula presented.), where T is the number of time steps, K is the number of arms, and S is the number of stationary segments. We demonstrate its superior performance compared with some state-of-the-art algorithms in both simulation experiments and real data analysis. (See https://github.com/woaishufenke/MAB_STRF.git for the code used in this paper.)
KW - Change-point detection
KW - noisy data
KW - online learning
KW - truncated loss
UR - https://www.scopus.com/pages/publications/85104303521
U2 - 10.1080/24754269.2021.1902687
DO - 10.1080/24754269.2021.1902687
M3 - Article
AN - SCOPUS:85104303521
SN - 2475-4269
VL - 5
SP - 122
EP - 133
JO - Statistical Theory and Related Fields
JF - Statistical Theory and Related Fields
IS - 2
ER -