Augmented Transformer for Speech Detection in Adverse Acoustical Conditions


Abstract:

In this work, we present a study of speech signal detection in adverse acoustic conditions. We prepared a dedicated dataset in which fragments of monologues and dialogues were mixed with various background noises at five SNR levels. We then used a vision transformer adapted to audio signals to determine the speech regions in an audio signal. To cope with the adverse acoustic conditions, we added an augmentation module with low-pass and band-pass filters as an extra head in the transformer. As the conducted experiments show, the proposed AugViT architecture improves speech detection accuracy compared with the baseline ViT transformer.
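The abstract outlines two concrete processing steps: mixing speech recordings with background noise at five SNR levels to build the dataset, and applying low-pass or band-pass filters as an augmentation in the transformer. Below is a minimal waveform-level sketch of both steps; the helper names, filter order, cut-off frequencies, and SNR values are illustrative assumptions, not taken from the paper.

import numpy as np
from scipy.signal import butter, sosfiltfilt

def mix_at_snr(speech, noise, snr_db):
    """Mix a speech segment with background noise at a target SNR in dB."""
    # Tile or trim the noise so it covers the whole speech segment.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def filter_augment(signal, fs=16000, kind="bandpass"):
    """Low-pass or band-pass filtering as a waveform-level augmentation.

    The filter order and cut-off frequencies are assumed values; the paper
    does not specify them in the abstract.
    """
    if kind == "lowpass":
        sos = butter(4, 3400, btype="lowpass", fs=fs, output="sos")
    else:
        sos = butter(4, [300, 3400], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)

# Example: build mixtures at five illustrative SNR levels, then filter one.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)  # stand-in for a 1 s speech clip at 16 kHz
noise = rng.standard_normal(8000)    # stand-in for a background-noise recording
mixtures = {snr: mix_at_snr(speech, noise, snr) for snr in (-5, 0, 5, 10, 15)}
augmented = filter_augment(mixtures[0], kind="lowpass")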
Date of Conference: 20-22 September 2023
Date Added to IEEE Xplore: 10 October 2023
Print on Demand (PoD) ISBN: 979-8-3503-0498-5

Conference Location: Poznan, Poland

I. Introduction

Voice activity detection (VAD) is essential to speech enhancement, coding, and recognition tasks. Its quality directly influences the effectiveness of these systems, since it selects the parts of the input audio signal that contain speech. The task has been applied in many speech-based systems for years, especially in speaker diarisation, speech transmission, voice interaction systems, and automatic speech recognition (ASR). Many VAD systems have been developed because speech detection significantly impacts the quality and efficiency of voice-based tasks. In the initial phase, research focused on finding attributes of the speech signal that unambiguously indicate its presence in the acoustic stream. Later, some systems [1]–[3] introduced a noise reduction stage to reconstruct the speech signal, and research turned to attributes of the speech signal that are more robust to changing acquisition conditions, including the type and intensity of other sound sources. Existing VADs use various methods to determine speech regions in the audio signal: they may rely on statistical techniques or on machine learning with a supervised or unsupervised approach. Currently deployed VAD systems exploit the paradigm of deep neural networks.
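Classical detectors of the attribute-based kind mentioned above often reduce to a frame-level decision on a hand-crafted feature such as short-time energy. As a point of reference, the sketch below shows a minimal energy-threshold VAD; it is a generic baseline, not the transformer-based detector proposed in this paper, and the frame length, hop size, and threshold are assumed values.

import numpy as np

def energy_vad(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Frame-level energy-threshold VAD (a classical baseline).

    Marks a frame as speech when its log-energy, relative to the loudest
    frame, exceeds threshold_db. All parameters are illustrative.
    """
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop : i * hop + frame_len]
        energies[i] = np.mean(frame ** 2) + 1e-12
    log_energy = 10 * np.log10(energies / energies.max())
    return log_energy > threshold_db  # True where the frame likely holds speech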

References

References are not available for this document.