Tina: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation

  • Dingbang Li
  • Wenzhou Chen
  • Xin Lin*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Scopus citations

Abstract

Zero-shot navigation is a critical challenge in Vision-Language Navigation (VLN) tasks, where the ability to adapt to unfamiliar instructions and to act in unknown environments is essential. Existing supervised learning-based models, trained using annotated data through reinforcement learning, exhibit limitations in generalization capabilities. Large Language Models (LLMs), with their extensive knowledge and emergent reasoning abilities, present a potential pathway for achieving zero-shot navigation. This paper presents a VLN agent based on LLMs, exploring approaches to the zero-shot navigation problem. To compensate for the shortcomings of LLMs in environmental perception, we propose the Thinking, Interacting, and Action (TINA) framework. TINA enables the agent to scrutinize perceptual information and autonomously query key clues within the environment through an introduced question-answering module, thereby aligning instructions with specific perceptual data. The navigation agent's perceptual abilities are enhanced through the TINA framework, while the explicit thought and query processes also improve the navigational procedure's explainability and transparency. We evaluate the performance of our method on the Room-to-Room dataset. The experimental results indicate that our approach improves the navigation performance of LLM-based agents. Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.

Original language: English
Title of host publication: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Publisher: IEEE Computer Society
ISBN (Electronic): 9798350390155
State: Published - 2024
Event: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024 - Niagara Falls, Canada
Duration: 15 Jul 2024 to 19 Jul 2024

Publication series

Name: Proceedings - IEEE International Conference on Multimedia and Expo
ISSN (Print): 1945-7871
ISSN (Electronic): 1945-788X

Conference

Conference: 2024 IEEE International Conference on Multimedia and Expo, ICME 2024
Country/Territory: Canada
City: Niagara Falls
Period: 15/07/24 to 19/07/24

Keywords

  • Agent
  • Large language model
  • Navigation
  • Vision and language
  • Zero-shot
