Audio-Visual Grounding Referring Expression for Robotic Manipulation

Abstract:

Referring expressions are commonly used to refer to a specific target in everyday dialogue. In this paper, we develop a novel task of audio-visual grounding referring expression for robotic manipulation. The robot leverages both audio and visual information to understand the referring expression in a given manipulation instruction and then carries out the corresponding manipulation. To solve the proposed task, we propose an audio-visual framework for visual localization and sound recognition. We also establish a dataset containing visual data, auditory data, and manipulation instructions for evaluation. Finally, extensive offline and online experiments verify the effectiveness of the proposed framework and demonstrate that the robot performs better with audio-visual data than with visual data alone.

Published in: 2022 International Conference on Robotics and Automation (ICRA)

Date of Conference: 23-27 May 2022

Date Added to IEEE Xplore: 12 July 2022

DOI: 10.1109/ICRA46639.2022.9811895

Conference Location: Philadelphia, PA, USA

I. Introduction

Referring expressions are commonly used when people talk with each other to specify particular objects in a scene, for example, “the cup next to the computer” or “the brown bag on the chair”. By understanding the referring expression, the target object can be localized in the scene from a natural language description. Unlike traditional visual perception tasks, which rely on predefined object labels, the referring expression task involves more complex language and visual semantics, which makes it more challenging. The task has recently attracted attention from both the computer vision and natural language processing communities, and various methods and datasets have been proposed for it [1]–[4].
