VME-Transformer: Enhancing Visual Memory Encoding for Navigation in Interactive Environments

  • Jiwei Shen
  • Pengjie Lou
  • Liang Yuan
  • Shujing Lyu*
  • Yue Lu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

16 Scopus citations

Abstract

The efficiency of a robotic system is primarily determined by its ability to navigate complex and interactive environments. In real-world scenarios, cluttered surroundings are common, requiring a robot to navigate diverse spaces and displace objects to pave a path towards its objective. Consequently, 'Visual Interactive Navigation' presents several challenges, including how to retain historical exploration information from partially observable visual signals, and how to utilize sparse rewards in reinforcement learning to simultaneously learn a latent representation and a control policy. Addressing these challenges, we introduce a Transformer-based Visual Memory Encoder (VME-Transformer), capable of embedding both recent and long-term exploration information into memory. Additionally, we explicitly estimate the robot's next pose, conditioned on the impending action, to bootstrap the learning process of the high-capacity VME-Transformer. We further regularize the value function by introducing input perturbations, thereby enhancing its generalization capabilities in previously unseen environments. In the Visual Interactive Navigation tasks within the iGibson environment, the VME-Transformer demonstrates superior performance compared to state-of-the-art methods, underlining its effectiveness.
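
The abstract states that the value function is regularized through input perturbations but does not give the exact formulation. As a rough illustration only, one common way to express such a regularizer is a consistency penalty between the value predicted for an observation and for a noisy copy of it. The sketch below assumes that form; the names `linear_value`, `perturbed_value_loss`, `noise_std`, and `reg_weight` are hypothetical and are not taken from the paper, and the linear value function is a stand-in for a learned network.

```python
import random


def linear_value(weights, obs):
    """Hypothetical stand-in value function: dot product of weights and observation."""
    return sum(w * x for w, x in zip(weights, obs))


def perturbed_value_loss(weights, obs, target, noise_std=0.1, reg_weight=0.5, rng=None):
    """Squared value error plus a consistency penalty under Gaussian input noise.

    This is an assumed formulation for illustration, not the paper's loss.
    """
    rng = rng or random.Random(0)
    # Perturb each input feature with zero-mean Gaussian noise.
    noisy_obs = [x + rng.gauss(0.0, noise_std) for x in obs]
    v_clean = linear_value(weights, obs)
    v_noisy = linear_value(weights, noisy_obs)
    # Standard regression term against the value target.
    fit_term = (v_clean - target) ** 2
    # Penalize changes in the predicted value caused only by the perturbation.
    consistency_term = (v_noisy - v_clean) ** 2
    return fit_term + reg_weight * consistency_term
```

With `noise_std=0.0` the penalty vanishes and the loss reduces to the plain squared error, which makes the effect of the regularizer easy to isolate when experimenting.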

Original language: English
Pages (from-to): 643-650
Number of pages: 8
Journal: IEEE Robotics and Automation Letters
Volume: 9
Issue number: 1
State: Published - 1 Jan 2024

Keywords

  • Visual interactive navigation
  • long-term memory encoding
  • reinforcement learning
  • transformer
