1. Introduction
Handset effects and the mismatch between training and testing conditions caused by additive white Gaussian noise (AWGN) are two of the most important challenges in speaker identification. Reverberation is another problem; to achieve Robust Speaker Identification (RSI), Zhao et al. [1] suggested using binary masking with a deep neural network. Alternatively, denoising algorithms can be used to reduce the system complexity. According to [2], RSI in noisy environments can be accomplished with Cochlear Filter Cepstral Coefficients (CFCCs). In contrast to these approaches, further improvement can be obtained by fusing different types of features. Several researchers have focused only on mismatched noise conditions [3], [4] in order to improve speaker recognition in both verification and identification tasks. Togneri and Pullella presented an overview paper [5] addressing two major issues, the accuracy and the robustness of speaker identification. However, that work used only limited populations, and modern strategies such as fusion are missing. In general, several researchers in the speaker identification field have concentrated on the front-end (feature extraction) in the presence of AWGN, such as [6], [8], to improve system performance. On the other hand, [9] focused on the back-end (classifier) under different types of noise to improve RSI. The main drawback of these papers is the limited number of samples used. Although the authors of [10] employed fusion strategies to improve the robustness of the speaker identification system, noise and channel effects were not investigated extensively. This study provides new investigations using combinations of two main feature types: Power-Normalized Cepstral Coefficient (PNCC) features, which provide noise robustness in the speaker identification system [11], [13], and Mel-Frequency Cepstral Coefficient (MFCC) features, which typically perform better on clean speech.
In addition, two feature compensation methods, Feature Warping (FW) and Cepstral Mean and Variance Normalization (CMVN), discussed in [5], [14], are applied to reduce noise and handset channel effects and to alleviate linear and non-linear channel effects. Furthermore, fusion techniques that depend on the feature dimension are considered: early feature fusion (32-dimensional features), late score fusion (16-dimensional features), and a combination of feature-based early fusion and score-based late fusion (32-dimensional features). The major contribution of this work is a thorough evaluation of the scheme first proposed in our previous work [15], conducted with more sophisticated fusion schemes in the presence of handset effects and AWGN.
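To make the terms concrete, the following is a minimal Python (NumPy) sketch of per-utterance CMVN, early feature fusion, and late score fusion. It is an illustration under stated assumptions, not the implementation evaluated in this paper: the function names (`cmvn`, `early_fusion`, `late_fusion`) are hypothetical, the 32-dimensional early-fusion vector is assumed to be the concatenation of 16 MFCC and 16 PNCC coefficients per frame, late fusion is assumed to be a weighted sum of per-speaker scores with an arbitrary weight of 0.5, and random numbers and made-up scores stand in for real features and classifier outputs.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Per-utterance CMVN on an array of shape (num_frames, num_coeffs).

    Each coefficient track is shifted to zero mean and scaled to unit variance.
    """
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)


def early_fusion(mfcc, pncc):
    """Early (feature-level) fusion: concatenate the two streams frame by frame."""
    if mfcc.shape[0] != pncc.shape[0]:
        raise ValueError("both feature streams must have the same number of frames")
    return np.concatenate([mfcc, pncc], axis=1)


def late_fusion(scores_a, scores_b, weight=0.5):
    """Late (score-level) fusion: weighted sum of per-speaker scores.

    The weighted sum and the default weight are assumptions for illustration.
    """
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    return weight * scores_a + (1.0 - weight) * scores_b


# Placeholder data: 200 frames of 16-dimensional "MFCC" and "PNCC" features.
rng = np.random.default_rng(0)
mfcc = rng.normal(loc=3.0, scale=2.0, size=(200, 16))
pncc = rng.normal(loc=-1.0, scale=0.5, size=(200, 16))

# Early fusion of the normalized streams gives 32-dimensional frames.
fused = early_fusion(cmvn(mfcc), cmvn(pncc))

# Made-up per-speaker scores from two hypothetical 16-dimensional systems.
scores_mfcc = np.array([-41.2, -39.8, -44.0])
scores_pncc = np.array([-40.5, -42.1, -43.3])
fused_scores = late_fusion(scores_mfcc, scores_pncc)
identified = int(np.argmax(fused_scores))  # index of the best-scoring speaker
```

In this sketch the combined scheme would apply `late_fusion` to the scores of systems trained on the 32-dimensional output of `early_fusion`; Feature Warping is not shown.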