TY - JOUR
T1 - Transformer Memory for Interactive Visual Navigation in Cluttered Environments
AU - Li, Weiyuan
AU - Hong, Ruoxin
AU - Shen, Jiwei
AU - Yuan, Liang
AU - Lu, Yue
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2023/3/1
Y1 - 2023/3/1
N2 - Substantial progress has been achieved in embodied visual navigation based on reinforcement learning (RL). These studies presume that the environment is stationary, with all obstacles static. In real cluttered scenes, however, interactable objects (e.g., shoes and boxes) blocking the robot's path make the environment non-stationary. As a result, the ego-centric visual agent easily gets stuck when finding the next waypoint, struggling to decide whether to push the obstacles ahead. To handle this predicament, we formulate interactive visual navigation as a Partially Observable Markov Decision Process (POMDP). Since the transformer encoder has demonstrated a superior ability to capture spatial-temporal dependencies in natural language processing, we propose a transformer-based memory that enables the agent to utilize historical interaction information. However, directly leveraging the transformer architecture in RL settings is highly unstable, so we further propose a surrogate objective that predicts the next waypoint as an auxiliary task, which facilitates representation learning and bootstraps the RL. We evaluate our method in the iGibson environment, and experimental results show a significant improvement over the Interactive Gibson Benchmark and a related recurrent RL policy, both on validation (seen) scenes and test (unseen) scenes.
AB - Substantial progress has been achieved in embodied visual navigation based on reinforcement learning (RL). These studies presume that the environment is stationary, with all obstacles static. In real cluttered scenes, however, interactable objects (e.g., shoes and boxes) blocking the robot's path make the environment non-stationary. As a result, the ego-centric visual agent easily gets stuck when finding the next waypoint, struggling to decide whether to push the obstacles ahead. To handle this predicament, we formulate interactive visual navigation as a Partially Observable Markov Decision Process (POMDP). Since the transformer encoder has demonstrated a superior ability to capture spatial-temporal dependencies in natural language processing, we propose a transformer-based memory that enables the agent to utilize historical interaction information. However, directly leveraging the transformer architecture in RL settings is highly unstable, so we further propose a surrogate objective that predicts the next waypoint as an auxiliary task, which facilitates representation learning and bootstraps the RL. We evaluate our method in the iGibson environment, and experimental results show a significant improvement over the Interactive Gibson Benchmark and a related recurrent RL policy, both on validation (seen) scenes and test (unseen) scenes.
KW - Vision-based navigation
KW - reinforcement learning
KW - representation learning
UR - https://www.scopus.com/pages/publications/85148432269
U2 - 10.1109/LRA.2023.3241803
DO - 10.1109/LRA.2023.3241803
M3 - Article
AN - SCOPUS:85148432269
SN - 2377-3766
VL - 8
SP - 1731
EP - 1738
JO - IEEE Robotics and Automation Letters
JF - IEEE Robotics and Automation Letters
IS - 3
ER -