
RES-StS: Referring Expression Speaker via Self-Training With Scorer for Goal-Oriented Vision-Language Navigation


Abstract:

Finding a specified target object via autonomous exploration guided by natural language descriptions in an unstructured environment is a practical but difficult task. Since human-annotated data is expensive to gather for the goal-oriented vision-language navigation (GVLN) task, the standard dataset is small, which has significantly limited the accuracy of previous techniques. In this work, we aim to improve the robustness and generalization of the navigator by dynamically providing high-quality pseudo-instructions through the proposed RES-StS paradigm. Specifically, we establish a referring expression speaker (RES) that predicts descriptive instructions for a given path to the goal object. Based on an environment-and-object fusion (EOF) module, RES derives spatial representations from the input trajectories, which are subsequently encoded by a number of transformer layers. In addition, since the quality of pseudo-labels is critical for data augmentation, while the limited dataset may also hinder RES learning, we equip RES with a more effective generation ability through self-training. A trajectory-instruction matching scorer (TIMS) network based on contrastive learning is proposed to selectively rehearse prior knowledge. Finally, all network modules in the system are integrated through a multi-stage training strategy, allowing them to assist one another and thus enhance performance on the GVLN task. Experimental results demonstrate the effectiveness of our approach. Compared with SOTA methods, our method improves SR, SPL, and RGS by 4.72%, 2.55%, and 3.45%, respectively, on the REVERIE dataset, and by 4.58%, 3.75%, and 3.14%, respectively, on the SOON dataset.
Page(s): 3441 - 3454
Date of Publication: 02 January 2023
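
To make the scorer-filtered self-training paradigm summarized in the abstract concrete, the following is a minimal Python sketch: a speaker generates pseudo-instructions for unlabeled trajectories, a matching scorer keeps only well-matched pairs, and the navigator trains on the union of real and filtered pseudo data. This is an illustration under assumptions, not the authors' implementation; the Pair container, the speaker/scorer/navigator interfaces, and the threshold cutoff are all hypothetical placeholders.

    # A minimal sketch (not the authors' implementation) of scorer-filtered
    # self-training for data augmentation. All names below (Pair, speaker,
    # scorer, navigator, threshold) are hypothetical placeholders.

    from dataclasses import dataclass
    from typing import List


    @dataclass
    class Pair:
        trajectory: object  # a sequence of panoramic observations / viewpoints
        instruction: str    # a natural-language referring expression


    def self_train_round(speaker, scorer, navigator,
                         labeled: List[Pair],
                         unlabeled_trajs: List[object],
                         threshold: float = 0.7) -> List[Pair]:
        # 1) Generate a pseudo-instruction for every unlabeled trajectory.
        pseudo = [Pair(t, speaker.generate(t)) for t in unlabeled_trajs]

        # 2) Keep only the pairs the scorer judges well matched; this
        #    filtering guards the navigator against noisy pseudo-labels.
        kept = [p for p in pseudo
                if scorer.score(p.trajectory, p.instruction) >= threshold]

        # 3) Train the navigator on real plus filtered pseudo data.
        navigator.fit(labeled + kept)

        # 4) Optionally fine-tune the speaker on the retained pairs,
        #    closing the self-training loop.
        speaker.fit(labeled + kept)
        return kept

The key design point is the filtering step: only pseudo-pairs rated above the cutoff are rehearsed, which is how the scorer protects pseudo-label quality during augmentation.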


I. Introduction

Since instructions based on referring expressions, such as “Please go to my office and bring me the document on the table,” are common in everyday conversation, it is fundamental for embodied AI agents to be able to accurately follow such guidance and discern the goal object in perceptually rich environments. Goal-oriented Vision-Language Navigation (GVLN) setups in real environments [1], [2] take a step toward this goal. Compared with the Vision-and-Language Navigation (VLN) task [3], which focuses on building a model capable of following fine-grained instructions, the GVLN task is more practical, since fine-grained instructions are difficult to obtain in real life; it has therefore drawn increasing interest from various research fields. Although numerous methods have explored visual and language cues to assist goal navigation [1], [2], [4], [5], [6], [7], the severe overfitting caused by a small dataset with monotonous environments remains challenging. This problem becomes more prominent as network scale increases, resulting in weak generalization.
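
The abstract states that the trajectory-instruction matching scorer (TIMS) is trained with contrastive learning. A common instantiation of such a matching objective is the symmetric InfoNCE loss sketched below in PyTorch; the encoder outputs, in-batch negative construction, and temperature value are assumptions, and this is not necessarily the paper's exact TIMS loss.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(traj_emb: torch.Tensor, instr_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
        # Symmetric InfoNCE over a batch of matched trajectory/instruction
        # embeddings, both of shape [B, D]. Matched pairs lie on the diagonal
        # of the similarity matrix; all other in-batch pairs act as negatives.
        traj_emb = F.normalize(traj_emb, dim=-1)
        instr_emb = F.normalize(instr_emb, dim=-1)
        logits = traj_emb @ instr_emb.t() / temperature  # [B, B] similarities
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_t2i = F.cross_entropy(logits, targets)      # trajectory -> instruction
        loss_i2t = F.cross_entropy(logits.t(), targets)  # instruction -> trajectory
        return 0.5 * (loss_t2i + loss_i2t)

After training with such an objective, the scorer's similarity for a single pair can serve directly as the quality score used to filter pseudo-instructions.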
