Transformer Memory for Interactive Visual Navigation in Cluttered Environments

  • Weiyuan Li
  • Ruoxin Hong
  • Jiwei Shen
  • Liang Yuan
  • Yue Lu*
  • *Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

21 Scopus citations

Abstract

Substantial progress has been achieved in embodied visual navigation based on reinforcement learning (RL). These studies presume that the environment is stationary, with all obstacles static. However, in real cluttered scenes, interactable objects (e.g., shoes and boxes) blocking the robot's path make the environment non-stationary. Consequently, an ego-centric visual agent easily gets stuck when choosing the next waypoint, as it struggles to decide whether to push the obstacles ahead. To handle this predicament, we formulate interactive visual navigation as a Partially Observable Markov Decision Process (POMDP). Because the transformer encoder has demonstrated a superior ability to capture spatial-temporal dependencies in natural language processing, we propose a transformer-based memory that enables the agent to exploit historical interaction information. However, directly applying the transformer architecture in RL settings is highly unstable. We therefore propose a surrogate objective that predicts the next waypoint as an auxiliary task, which facilitates representation learning and bootstraps the RL. We demonstrate our method in the iGibson environment; experimental results show a significant improvement over the Interactive Gibson Benchmark and the related recurrent RL policy in both the seen validation scenes and the unseen test scenes.

Original language: English
Pages (from-to): 1731-1738
Number of pages: 8
Journal: IEEE Robotics and Automation Letters
Volume: 8
Issue number: 3
State: Published - 1 Mar 2023

Keywords

  • Vision-based navigation
  • reinforcement learning
  • representation learning
