TY - JOUR
T1 - FIRE
T2 - A Failure-Adaptive RL Framework for Edge Computing Migrations
AU - Siew, Marie
AU - Sharma, Shikhar
AU - Li, Zekai
AU - Guo, Kun
AU - Xu, Chao
AU - Lorido-Botran, Tania
AU - Quek, Tony Q.S.
AU - Joe-Wong, Carlee
N1 - Publisher Copyright:
© 2008-2012 IEEE.
PY - 2025
Y1 - 2025
N2 - In edge computing, users’ service profiles are migrated between edge servers due to user mobility. Reinforcement Learning (RL) frameworks have been proposed to do so, often trained on simulated data. However, existing RL frameworks overlook occasional server failures, which, although rare, impact latency-sensitive applications like AR/VR and real-time obstacle detection. These rare failures, not being adequately represented in historical training data, pose a challenge for data-driven RL algorithms. We introduce FIRE, a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose FIRE-ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. FIRE considers delay, migration, failure, and backup placement costs across individual and shared service profiles. We prove FIRE-ImRE’s boundedness and convergence to optimality. Next, we introduce novel deep Q-learning (FIRE-ImDQL) and actor-critic (FIRE-ImACRE) versions of our algorithm to enhance scalability. We extend our framework to accommodate users with varying risk tolerances for rare failure events. Through trace-driven experiments, we show that FIRE reduces edge computing costs compared to vanilla RL and the greedy baseline in the event of failures.
AB - In edge computing, users’ service profiles are migrated between edge servers due to user mobility. Reinforcement Learning (RL) frameworks have been proposed to do so, often trained on simulated data. However, existing RL frameworks overlook occasional server failures, which, although rare, impact latency-sensitive applications like AR/VR and real-time obstacle detection. These rare failures, not being adequately represented in historical training data, pose a challenge for data-driven RL algorithms. We introduce FIRE, a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose FIRE-ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function. FIRE considers delay, migration, failure, and backup placement costs across individual and shared service profiles. We prove FIRE-ImRE’s boundedness and convergence to optimality. Next, we introduce novel deep Q-learning (FIRE-ImDQL) and actor-critic (FIRE-ImACRE) versions of our algorithm to enhance scalability. We extend our framework to accommodate users with varying risk tolerances for rare failure events. Through trace-driven experiments, we show that FIRE reduces edge computing costs compared to vanilla RL and the greedy baseline in the event of failures.
KW - Edge computing
KW - reinforcement learning
KW - resilient resource allocation
KW - service migration
UR - https://www.scopus.com/pages/publications/105020755446
U2 - 10.1109/TSC.2025.3626791
DO - 10.1109/TSC.2025.3626791
M3 - Article
AN - SCOPUS:105020755446
SN - 1939-1374
JO - IEEE Transactions on Services Computing
JF - IEEE Transactions on Services Computing
ER -