Enhancing Vision and Language Navigation With Prompt-Based Scene Knowledge

Abstract:

A challenging task in embodied artificial intelligence is enabling a robot to carry out a navigational task by following natural language instructions. In this task, the navigator needs to understand objects, directions, and room types, all of which serve as landmarks for navigation. Although it is easy to encode objects and directions with an external encoder such as an object detector, current navigators struggle to encode room type information properly because of the low accuracy of existing classifiers. This inadequacy creates confusion that navigators find difficult to overcome. Even humans may sometimes fail to determine the exact type of a room, since multiple room types may exist in one panorama. To mitigate this problem, we propose to encode room type information in a prompt-based manner. Specifically, we first establish multi-modal, learnable prompt pools containing knowledge of room types. By querying the prompt pools, the navigator can obtain room-type prompts for the current view and incorporate them into the navigator using a prompt-based learning method. Experimental results on the REVERIE, R2R and SOON datasets demonstrate the effectiveness of our approach.
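The abstract does not say how the prompt pools are queried, so the following is only a minimal sketch of one common way such a lookup could work: each pool entry pairs a key vector with a prompt vector, and the entries whose keys are most similar to the current view feature are returned. The names `cosine_similarity` and `query_prompt_pool`, the key-and-prompt layout, the cosine metric, and the toy vectors are all assumptions made for illustration, not the authors' method; in a real system the keys and prompts would be learned parameters rather than fixed lists.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical pool layout: (key vector, prompt vector) pairs.
PromptPool = List[Tuple[Vector, Vector]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def query_prompt_pool(view_feature: Vector, pool: PromptPool, top_k: int = 1) -> List[Vector]:
    """Return the prompts whose keys best match the view feature, best match first."""
    ranked = sorted(
        pool,
        key=lambda entry: cosine_similarity(view_feature, entry[0]),
        reverse=True,
    )
    return [prompt for _key, prompt in ranked[:top_k]]


# Toy pool with three illustrative room types (values are made up).
toy_pool: PromptPool = [
    ([1.0, 0.0, 0.0], [0.1, 0.2]),  # "kitchen" key and prompt
    ([0.0, 1.0, 0.0], [0.3, 0.4]),  # "bedroom" key and prompt
    ([0.0, 0.0, 1.0], [0.5, 0.6]),  # "bathroom" key and prompt
]

# A view that looks mostly like a kitchen with a hint of bedroom.
view = [0.9, 0.1, 0.0]
selected = query_prompt_pool(view, toy_pool, top_k=2)
```

Returning the top-k prompts instead of a single hard label reflects the abstract's point that one panorama may contain several room types; the selected prompts would then be passed to the navigator alongside its other inputs.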

Published in: IEEE Transactions on Circuits and Systems for Video Technology (Volume: 34, Issue: 10, October 2024)

Page(s): 9745 - 9756

Date of Publication: 15 May 2024

DOI: 10.1109/TCSVT.2024.3401451

Cites in Papers - IEEE (1)

Heqian Qiu, Lanxiao Wang, Taijin Zhao, Fanman Meng, Qingbo Wu, Hongliang Li, "MCCE-REC: MLLM-Driven Cross-Modal Contrastive Entropy Model for Zero-Shot Referring Expression Comprehension", IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 754-768, 2025.
