I. Introduction
Voice activity detection (VAD) is essential to speech enhancement, coding, and recognition tasks. Because it selects the parts of the input audio signal that contain speech, its accuracy directly influences the effectiveness of these tasks. VAD has been applied in many speech-based systems for years, especially in speaker diarisation, speech transmission, voice interaction systems, and automatic speech recognition (ASR). Many VAD systems have been developed because speech detection significantly impacts the quality and efficiency of voice-based tasks. Early research focused on finding attributes of the speech signal that unambiguously indicate its presence in the acoustic stream. Later, some systems [1]-[3] introduced a noise reduction stage to reconstruct the speech signal, and research turned to speech attributes that are more robust to changing acquisition conditions, including the type and intensity of other sound sources. Existing VADs use various methods to determine speech regions in the audio signal: statistical techniques or machine learning with a supervised or unsupervised approach. Current VAD systems are built on deep neural networks.