TY - JOUR
T1 - VME-Transformer
T2 - Enhancing Visual Memory Encoding for Navigation in Interactive Environments
AU - Shen, Jiwei
AU - Lou, Pengjie
AU - Yuan, Liang
AU - Lyu, Shujing
AU - Lu, Yue
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2024/1/1
Y1 - 2024/1/1
N2 - The efficiency of a robotic system is primarily determined by its ability to navigate complex and interactive environments. In real-world scenarios, cluttered surroundings are common, requiring a robot to navigate diverse spaces and displace objects to pave a path towards its objective. Consequently, 'Visual Interactive Navigation' presents several challenges, including how to retain historical exploration information from partially observable visual signals, and how to utilize sparse rewards in reinforcement learning to simultaneously learn a latent representation and a control policy. Addressing these challenges, we introduce a Transformer-based Visual Memory Encoder (VME-Transformer), capable of embedding both recent and long-term exploration information into memory. Additionally, we explicitly estimate the robot's next pose, conditioned on the impending action, to bootstrap the learning process of the high-capacity VME-Transformer. We further regularize the value function by introducing input perturbations, thereby enhancing its generalization capabilities in previously unseen environments. In the Visual Interactive Navigation tasks within the iGibson environment, the VME-Transformer demonstrates superior performance compared to state-of-the-art methods, underlining its effectiveness.
KW - Visual interactive navigation
KW - long-term memory encoding
KW - reinforcement learning
KW - transformer
UR - https://www.scopus.com/pages/publications/85177065650
U2 - 10.1109/LRA.2023.3333238
DO - 10.1109/LRA.2023.3333238
M3 - Article
AN - SCOPUS:85177065650
SN - 2377-3766
VL - 9
SP - 643
EP - 650
JO - IEEE Robotics and Automation Letters
JF - IEEE Robotics and Automation Letters
IS - 1
ER -