I. Introduction
Maintaining a safe living environment has become an increasing priority in society, and the early detection of suspicious or dangerous situations is essential for efficient rescue operations. Shouted speech detection is a fundamental task in many audio surveillance systems, and several machine learning approaches based on labeled datasets have been proposed [1]–[9].

Most conventional studies have treated shouted speech detection as a binary classification between "normal" and "shouted" speech. Pohjalainen et al. [1] and Nandwana et al. [2] used Gaussian mixture models, in which normal and shouted speech were modeled separately using acoustic features. Similarly, Huang et al. [3] and Sharma et al. [8] employed support vector machines to find a decision boundary between shouted and normal speech. More recent studies have employed deep learning approaches to improve classification performance [4]–[6]. For example, Baghel et al. [6] trained convolutional neural networks (CNNs) to capture local features that were effective for classification. Saeed et al. [4] introduced long short-term memory (LSTM) networks into the classifier to capture temporal structure. Our previous study [5] presented an architecture combining CNNs with gated recurrent units (GRUs), achieving state-of-the-art classification performance.
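For concreteness, the sketch below illustrates the general shape of such a CNN–GRU binary classifier in PyTorch: a convolutional front end extracts local time-frequency patterns from a log-mel spectrogram, and a GRU summarizes the frame sequence before a final linear layer produces the two class logits. All layer sizes, the input representation, and the class names are illustrative assumptions, not the exact configuration reported in [5].

```python
import torch
import torch.nn as nn

class CNNGRUClassifier(nn.Module):
    """Illustrative CNN + GRU binary classifier ("normal" vs. "shouted").

    Input: log-mel spectrogram of shape (batch, 1, n_mels, n_frames).
    Layer sizes are assumptions for illustration, not the settings of [5].
    """

    def __init__(self, n_mels=64, hidden=128):
        super().__init__()
        # CNN front end: captures local time-frequency patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 2)),
        )
        # GRU back end: models temporal structure across frames.
        # After two 2x2 poolings the frequency axis is n_mels // 4.
        self.gru = nn.GRU(64 * (n_mels // 4), hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 2)  # logits for "normal" / "shouted"

    def forward(self, x):
        h = self.cnn(x)                                  # (B, C, F, T)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)   # (B, T, C*F)
        _, last = self.gru(h)                            # (1, B, hidden)
        return self.fc(last.squeeze(0))                  # (B, 2)

# Usage: a batch of 64-bin log-mel spectrograms with 200 frames.
model = CNNGRUClassifier()
logits = model(torch.randn(8, 1, 64, 200))
print(logits.shape)  # torch.Size([8, 2])
```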