1. Introduction
Classification of audio signals is an important issue in many applications, such as speech enhancement, audio retrieval, event detection, and audio compression. In the field of joint audio and speech compression, in order to exploit the advantages of different compression techniques, the input signal is typically preprocessed and classified into different types or components, such as speech and music, the two major types, which have quite different properties and compression models. With the deployment of wideband wireless communication systems, joint speech and music coding is becoming a new trend [1], since many new applications strongly demand high quality across a diverse range of audio types. This makes a highly accurate real-time audio classification module necessary.

For example, the new MPEG Unified Speech and Audio Coding (USAC) scheme [2] has been actively developed for multimedia communication networks with diverse signal types and variable bandwidth. Its core reference model is essentially based on adaptive selection and switching between AMR-WB+ [3] and HE-AAC [4]. The signal classifier in USAC controls two switches: the first selects between the two core coders, the frequency-domain (FD) coder for music and the linear-prediction-domain (LPD) coder for speech; the second is triggered only within the LPD coder and separates speech segments from non-speech segments through a closed-loop search, similar to the mode selection in AMR-WB+ (a minimal sketch of this control flow is given below). Since the resulting sound quality depends strongly on the coding mode and can be highly sensitive to misclassification, several factors are crucial for the classifier in practical scenarios.
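To make the two-level switching concrete, the following Python sketch mimics the control flow just described. All names here (classify_frame, fd_encode, acelp_encode, tcx_encode, and the rate-distortion-style cost) are hypothetical placeholders introduced for illustration; they are not the USAC reference-model API, and the real classifier and coders are far more complex.

    from dataclasses import dataclass

    @dataclass
    class Coded:
        mode: str
        cost: float  # e.g., a rate-distortion cost; lower is better

    # Hypothetical stand-ins for the feature-based classifier and core coders.
    def classify_frame(frame) -> str:
        return "music" if max(frame) > 0.5 else "speech"  # toy decision rule

    def fd_encode(frame) -> Coded:
        return Coded("FD", cost=1.0)      # frequency-domain coder (music path)

    def acelp_encode(frame) -> Coded:
        return Coded("ACELP", cost=0.8)   # speech mode inside the LPD coder

    def tcx_encode(frame) -> Coded:
        return Coded("TCX", cost=0.9)     # non-speech mode inside the LPD coder

    def encode_frame(frame) -> Coded:
        # Switch 1: the classifier routes the frame to the FD coder (music)
        # or to the LPD coder (speech).
        if classify_frame(frame) == "music":
            return fd_encode(frame)
        # Switch 2: within the LPD coder, a closed-loop search compares the
        # speech (ACELP) and non-speech (TCX) candidates and keeps the one
        # with the lower cost, similar to the mode selection in AMR-WB+.
        candidates = [acelp_encode(frame), tcx_encode(frame)]
        return min(candidates, key=lambda c: c.cost)

Under these toy definitions, encode_frame([0.1, 0.2]) takes the LPD path and returns the ACELP candidate, while a frame with larger amplitude would be routed to the FD coder.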