I. Introduction
Sound Source Localization (SSL) is the task of finding the position or direction from which a sound originates. Determining a sound source's location and angle with reference to a microphone or sensor network is essential for applications such as audio scene analysis, voice recognition, and acoustic tracking, where SSL directly improves the user experience. Using a pair of microphones, the location of the sound source is ascertained from the direction of arrival (DOA) of the sound waves. Several techniques, such as time delay estimation, phase analysis, intensity disparities, and spectral cues, pinpoint the source's position based on differences in the arrival time, amplitude, and frequency content of the signals recorded by the sensors.

In [1], SSL for Indoors using Deep Learning (SSLIDE), based on CNNs and DNNs with an encoder and two decoders, is proposed; location prediction and mitigation of multipath artefacts are performed simultaneously. The results show that SSLIDE outperforms a plain CNN in terms of Mean Absolute Error (MAE). In [2], GCC-PHAT (Generalized Cross-Correlation with Phase Transform) combined with room impulse responses (RIRs) is used for feature extraction to train a 3D CNN. TIMIT speech data is used with an array of six omni-directional microphones forming a hexagon with a side length of 0.15 m. The model is tested under different reverberation times and SNRs and achieves 95.8% accuracy, making the CNN a strong candidate algorithm for SSL. In [3], a technique that differs from blind source separation (BSS) yet accurately separates sources is introduced: delay-and-sum beamforming is used for source identification, side lobes are disregarded to obtain a more precise signal, and each sound source's signal is then recovered by signal reconstruction. The research in [4] tested, in the circular harmonic domain, a microphone array of 4 microphones with a 2 cm radius against one of 12 microphones with an 11.9 cm radius; the directivity of the 12-microphone, 11.9 cm array proved better.
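As a concrete illustration of the time-delay-estimation idea above, the following minimal NumPy sketch implements GCC-PHAT for one microphone pair. The function name and the synthetic two-microphone signal are illustrative assumptions, not code from [1] or [2]:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time delay of `sig` relative to `ref` with GCC-PHAT.

    The Phase Transform (PHAT) whitens the cross-power spectrum so that
    only the phase (i.e. the relative delay) drives the correlation peak,
    which makes the estimate more robust to reverberation.
    """
    n = len(sig) + len(ref)                    # zero-pad to avoid circular wrap
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12                     # PHAT weighting
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift  # lag of the correlation peak
    return shift / fs                          # delay in seconds

# Synthetic check: white noise reaching "mic 2" 25 samples after "mic 1"
fs = 16000
s = np.random.default_rng(0).standard_normal(4096)
mic1 = np.concatenate((s, np.zeros(25)))
mic2 = np.concatenate((np.zeros(25), s))
tau = gcc_phat(mic2, mic1, fs)                 # ~25 / 16000 s
```

The PHAT normalisation is the key design choice: because the peak location depends only on phase and not on the coloured magnitude spectrum, the estimate degrades gracefully under reverberation, which is why GCC-PHAT is a common front end for CNN-based SSL.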
The authors, however, contributed towards the 4-microphone, 2 cm radius arrangement, whose sensitivity to reverberation and noise degrades the DOA estimate; even so, the deep-learning SSL approach with the CH-E-MMP-CNN method achieves a remarkable accuracy of 86.36% with good stability. In [5], the DSCNN was used for the first time for SSL. The TIMIT Acoustic-Phonetic Continuous Speech Corpus was used with a tetrahedral array of four cardioid microphones. 3D and 2D CNNs with RNNs are employed; the 3D CNN achieves the highest accuracy, with DOA errors of 9.39 and 10.29 degrees, but at very high model complexity, whereas the DSCNN maintains a balance between accuracy and complexity. The authors in [6] used convolutional neural networks (CNNs) to localize responses with the lowest variance and least distortion; by boosting data fusion and reducing the negative effects of noise and reverberation artefacts on the localization technique, that work examines how CNNs improve DOA estimation with a uniform linear array (ULA) in noisy and reverberant environments. As in [6], the method in [7] introduces 3D SSL with a tetrahedral microphone array, using a CNN with STFT phase input features and experimenting on a semi-synthetic audio dataset. The experiments yield at least 31% lower MAE for SSL; for active speech, the azimuth MAE is 18.97 degrees and the elevation MAE is 48.49 degrees. Furthermore, the method can be extended to multiple active sound sources. The paper in [8], which heavily influences our comparative analysis, uses an 8-microphone Uniform Linear Array (ULA) and proposes a CNN-based classification method for broadband DOA estimation of a single continuous sound source under noisy and reverberant conditions.
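The STFT phase input features used by such CNN-based methods can be sketched as follows. This is a generic illustration of phase-map feature extraction, not the exact pipeline of [7] or [8]; the function name, frame parameters, and synthetic 4-channel signal are assumptions:

```python
import numpy as np

def stft_phase_features(mics, n_fft=256, hop=128):
    """Stack the STFT phase of each channel into a (mics, freq, frames) tensor.

    Phase-map DOA methods feed the network only the phase component of the
    STFT, since inter-channel phase differences encode the inter-microphone
    time delays and hence the direction of the source.
    """
    window = np.hanning(n_fft)
    feats = []
    for x in mics:                                   # one row per microphone
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop:i * hop + n_fft] * window
                           for i in range(n_frames)])
        spec = np.fft.rfft(frames, axis=1)           # (frames, freq bins)
        feats.append(np.angle(spec).T)               # discard the magnitude
    return np.stack(feats)

# Example: a 4-microphone array, 1 s of audio at 16 kHz
mics = np.random.default_rng(1).standard_normal((4, 16000))
phase = stft_phase_features(mics)                    # shape (4, 129, 124)
```

Discarding the magnitude makes the representation largely independent of the source spectrum, so the network learns the geometry-dependent phase structure rather than the content of the speech.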
In contrast to our proposed algorithm, that system is trained using synthetic noise signals rather than real-time sound sources as inputs, and it performs very well in terms of accuracy. In [9], DOA estimation for a single sound source employs the designated phase difference between the signals of each microphone for each direction, a representation previously used by the same authors and also adopted in [7], [8]. In [10], CNNs were employed for MVDR-based (Minimum Variance Distortionless Response) localization methods, concentrating in particular on the SRP-WMVDR (Steered Response Power Weighted MVDR) beamformer to improve accuracy in single-source scenarios without interferers. By properly allocating component weights, the CNNs successfully boosted coherent frequency fusion of the narrowband response power, improving localization performance and reducing noise and reverberation artefacts. In [11], using convolutional recurrent neural networks (CRNNs) and Time Difference of Arrival (TDOA) estimation on a 4-microphone array, the study offers a system for joint sound source localization and identification; by integrating these two strategies, the proposed method outperforms the DCASE 2019 baseline system in recognising and localising sound events from multichannel recordings.
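The steered-response-power principle behind beamforming-based localization can be illustrated with a plain delay-and-sum SRP grid search for one microphone pair; this is a minimal sketch of the generic SRP idea, not the weighted MVDR variant of [10], and the function name, array geometry, and synthetic signal are assumptions:

```python
import numpy as np

C = 343.0                                    # speed of sound in air, m/s

def srp_doa(mic1, mic2, d, fs, angles_deg):
    """Grid-search the DOA of one source with a steered delay-and-sum beamformer.

    For every candidate azimuth the far-field inter-microphone delay is
    compensated in the frequency domain (allowing fractional-sample delays),
    and the broadband power of the delay-and-sum output is evaluated; the
    steering angle maximising this steered response power (SRP) is the
    DOA estimate.
    """
    freqs = np.fft.rfftfreq(len(mic1), d=1.0 / fs)
    X1, X2 = np.fft.rfft(mic1), np.fft.rfft(mic2)
    best_angle, best_power = None, -np.inf
    for theta in angles_deg:
        tau = d * np.cos(np.radians(theta)) / C   # far-field delay model
        steered = X1 + X2 * np.exp(2j * np.pi * freqs * tau)
        power = np.sum(np.abs(steered) ** 2)
        if power > best_power:
            best_angle, best_power = theta, power
    return best_angle

# Synthetic check: white noise arriving from 60 degrees at a 10 cm mic pair
fs, d = 16000, 0.10
s = np.random.default_rng(2).standard_normal(8192)
freqs = np.fft.rfftfreq(len(s), d=1.0 / fs)
tau_true = d * np.cos(np.radians(60.0)) / C
mic1 = s
mic2 = np.fft.irfft(np.fft.rfft(s) * np.exp(-2j * np.pi * freqs * tau_true),
                    n=len(s))                 # mic2 hears the source later
est = srp_doa(mic1, mic2, d, fs, np.arange(0, 181, 5))
```

Methods such as SRP-WMVDR refine this scheme by weighting the narrowband response powers before fusing them across frequency, which is exactly the stage where [10] lets a CNN assign the weights.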