High-Resolution Embedding Extractor for Speaker Diarisation

Abstract:

Speaker embedding extractors significantly influence the performance of clustering-based speaker diarisation systems. Conventionally, only one embedding is extracted from each speech segment. However, because of the sliding window approach, a segment can easily include two or more speakers when it spans a speaker change point. This study proposes a novel embedding extractor architecture, referred to as a high-resolution embedding extractor (HEE), which extracts multiple high-resolution embeddings from each speech segment. HEE consists of a feature-map extractor and an enhancer, where the enhancer with the self-attention mechanism is the key to success. The enhancer of HEE replaces the aggregation process; instead of a global pooling layer, the enhancer combines related information into each frame via attention that leverages the global context. Extracted dense frame-level embeddings can each represent a speaker. Thus, multiple speakers can be represented by different frame-level features in each segment. We also propose a training framework that uses artificially generated mixture data to train the proposed HEE. In experiments on five evaluation sets, including four public datasets, the proposed HEE demonstrates at least 10% improvement on each evaluation set except one, which our analysis suggests contains few rapid speaker changes.
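
The contrast the abstract draws between a global pooling layer and a self-attention enhancer can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the single attention head, the random projection matrices `w_q`, `w_k`, `w_v`, and the sizes `num_frames` and `dim` are all assumptions chosen for illustration. It only shows that pooling collapses a segment to one vector, whereas attention returns one context-aware vector per frame.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_average_pool(frames):
    """Conventional aggregation: collapse all frames into one embedding."""
    return frames.mean(axis=0)


def self_attention_enhance(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the frames.

    Hypothetical stand-in for an attention-based enhancer: every output
    frame is a weighted mix of all frames, so the frame axis is preserved.
    """
    q = frames @ w_q
    k = frames @ w_k
    v = frames @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (num_frames, num_frames)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v                       # (num_frames, dim)


# Illustrative sizes and random weights (assumptions, not from the paper).
rng = np.random.default_rng(0)
num_frames, dim = 50, 16
frames = rng.standard_normal((num_frames, dim))
w_q, w_k, w_v = (
    rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3)
)

pooled = global_average_pool(frames)
enhanced = self_attention_enhance(frames, w_q, w_k, w_v)

print(pooled.shape)    # (16,)    one embedding for the whole segment
print(enhanced.shape)  # (50, 16) one embedding per frame
```

Because the frame axis survives, frames on either side of a speaker change point can in principle carry different speaker representations, which a single pooled vector cannot.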

Published in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Date of Conference: 04-10 June 2023

Date Added to IEEE Xplore: 05 May 2023

DOI: 10.1109/ICASSP49357.2023.10097190

Conference Location: Rhodes Island, Greece

1. INTRODUCTION

Speaker diarisation is the task of finding ‘who spoke when’, which is essential for a range of applications, including speech dictation systems [1], [2]. It is usually applied as a pre-processing step for a speech recognition system, dividing long speech recordings into short speaker-homogeneous segments [3], [4]. The field is advancing rapidly, driven by progress in deep learning, and diarisation error rate (DER) is widely adopted as the primary metric.
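
DER is the sum of missed speech, false-alarm speech and speaker-confusion time, divided by the total reference speech time. The sketch below computes a simplified frame-level version of it. It assumes one label per frame (no overlapping speech), applies no forgiveness collar, and finds the speaker mapping by brute force, so it suits only a handful of speakers. The function name `frame_level_der` and the toy labels are hypothetical, not taken from the paper or from any scoring toolkit.

```python
from itertools import permutations


def frame_level_der(reference, hypothesis):
    """Simplified DER over per-frame labels, where None means non-speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have equal length")
    ref_speech = sum(r is not None for r in reference)
    if ref_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)

    # Frames where both reference and hypothesis contain speech.
    both = [(r, h) for r, h in pairs if r is not None and h is not None]
    ref_labels = sorted({r for r, _ in both})
    hyp_labels = sorted({h for _, h in both})

    # Try every one-to-one mapping of hypothesis labels to reference labels;
    # pad with None so surplus hypothesis speakers can stay unmatched.
    padded = ref_labels + [None] * max(0, len(hyp_labels) - len(ref_labels))
    best_correct = 0
    for perm in permutations(padded, len(hyp_labels)):
        mapping = dict(zip(hyp_labels, perm))
        correct = sum(mapping[h] == r for r, h in both)
        best_correct = max(best_correct, correct)

    confusion = len(both) - best_correct
    return (missed + false_alarm + confusion) / ref_speech


# Toy example: 8 reference speech frames, one missed, one false alarm,
# one confused frame at the speaker change point.
reference = ["A", "A", "A", "A", "B", "B", "B", "B", None, None]
hypothesis = ["x", "x", "x", "y", "y", "y", "y", None, None, "y"]
print(frame_level_der(reference, hypothesis))  # 0.375
```

In the toy example the only confused frame sits at the change from speaker A to speaker B, which is the kind of error that a single embedding per segment tends to produce.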
