Target Sound Extraction with Variable Cross-Modality Clues

Abstract:

Automatic target sound extraction (TSE) is a machine learning approach that mimics the human auditory ability to attend to a sound source of interest within a mixture of sources. It often uses a model conditioned on a fixed form of target sound clue, such as a sound class label, which limits the ways in which users can interact with the model to specify the target sounds. To leverage a variable number of clues across modalities available in the inference phase, including a video, a sound event class, and a text caption, we propose a unified transformer-based TSE model architecture in which a multi-clue attention module integrates all the clues across the modalities. Since there is no off-the-shelf benchmark to evaluate our proposed approach, we build a dataset based on public corpora, AudioSet and AudioCaps. Experimental results for seen and unseen target-sound evaluation sets show that our proposed TSE model can effectively deal with a varying number of clues, which improves the TSE performance and robustness against partially compromised clues.

Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 04-10 June 2023

Date Added to IEEE Xplore: 05 May 2023

DOI: 10.1109/ICASSP49357.2023.10095266

Conference Location: Rhodes Island, Greece

1. INTRODUCTION

People can focus their auditory attention on a sound of interest in a complex acoustic environment [1]. Researchers have attempted to endow machines with a similar capability through audio source separation, the process of separating all audio sources from their mixture. Audio source separation includes speech separation [2]–[4], music separation [5]–[7], and universal sound separation [8]–[10].
