
Convolutional Neural Networks for Direction of Arrival Estimation Compared to Classical Estimators and Bounds



Abstract:

Recently, there has been a proliferation of applied machine learning (ML) research, including the use of convolutional neural networks (CNNs) for direction of arrival (DoA) estimation. With the increasing amount of research in this area, it is important to weigh the performance and computational costs of CNNs against classical methods of DoA estimation such as Multiple Signal Classification (MUSIC). We outline the performance of both methods of DoA estimation for single-source and two-source cases under multiple array conditions. The results are also compared to the Cramer-Rao lower bound (CRLB) and conventional beamforming. For each source case, CNNs were trained on a perfect uniform linear array (ULA) and tested against data from a perfect ULA, perturbed ULAs, ULAs with missing sensors, and ULAs with muffled sensors. We show that for the single-source case, the CNNs do not offer any performance improvement relative to MUSIC at low signal-to-noise ratio (SNR). For the two-source case, the CNNs perform better than MUSIC, but only at low SNR. For the remaining array cases, the CNNs outperform MUSIC. These results indicate that the performance improvements from CNNs are highest in situations where there is signal model to data mismatch (imperfect information). This work also illustrates that the CNN estimators developed here can exceed the CRLB and are therefore biased estimators, a consequence of the absence of an unbiased constraint in the loss function used to train the CNNs.
Published in: IEEE Access (Volume 13)
Page(s): 25533 - 25545
Date of Publication: 04 February 2025
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Direction of arrival (DoA) estimation is the process of determining the location of signals arriving at a group of sensors. DoA estimation is performed in many fields, such as radar, telecommunications, speech, and sonar [1], [2]. This problem has been studied for many decades, beginning with the development of conventional beamforming (CBF) by Bartlett in 1948 [3], followed by Capon’s beamformer in 1969 [4], the minimum norm method [5], [6] in 1979, multiple signal classification (MUSIC) [7] in 1986, estimation of signal parameters via rotational invariance techniques (ESPRIT) [8] in 1989, and more recently compressive sensing [9]. Starting in the 1990s, machine learning (ML) based approaches began to be applied to source localization [10]; however, the use of ML methods (also known as model-free or data-driven methods) has increased significantly in the last decade due to both the advancement and use of graphics processing units (GPUs) for ML and the ready availability of off-the-shelf ML packages such as TensorFlow, PyTorch, and MATLAB’s own toolboxes.

Many different types of ML approaches have been studied for DoA estimation such as multi-layer perceptrons (MLPs), support vector machines (SVMs), convolutional neural networks (CNNs), autoencoders, residual networks (ResNets), and vision transformers (ViTs) along with hybrid approaches combining multiple network types.

MLPs were among the first ML methods used for DoA estimation, in [11], which used an MLP with non-linear activation functions to estimate DoA for signals in random noise. MLPs continue to be used today; for example, [12] developed a deep neural network (DNN), which is functionally similar to an MLP, to perform DoA estimation for up to four sources while also estimating the number of sources present. Additionally, MLPs have been used to perform DoA and source-number estimation in non-Gaussian noise, as illustrated in [13], and they have been extended to subarray sampling use cases [14].

SVMs have also been applied to the DoA estimation problem: [15] developed a multi-class SVM algorithm for DoA estimation, and [16] used support vector regression in wireless communication, which was later validated with experimental data in [17]. More recently, a sensitivity study of SVM error performance with respect to hyperparameters has been performed [18].

Work using CNNs includes [19], which developed a CNN focused on DoA estimation at low SNR that performed well for low numbers of snapshots and in cases of SNR mismatch, while also estimating the number of sources present. Reference [20] studied the improvement in CNN classification accuracy when imposing the natural shift-invariant structure present in ULAs (and other arrays) on the input data to the CNN. Reference [21] then extended this work to CNNs but posed the ML problem as regression. Reference [22] extended CNNs to three-dimensional DoA estimation while also developing an output layer formulation that yields a probability at each output neuron, providing the ability to apply confidence intervals to predictions.

Autoencoders, more specifically denoising autoencoders, have been leveraged in DoA estimation to improve the estimate of the sample covariance matrix [23]. The denoised sample covariance matrix is then used in classical covariance-based DoA estimation schemes. An autoencoder was also used in [24] to perform DoA estimation while including array imperfections, seeking to account for real-world scenarios. ResNets have also been applied to DoA estimation in [25]; ResNets were originally developed to overcome the vanishing gradient problem [26] by including skip connections across single or multiple layers of a neural network (NN). More recently, ViTs have been used for multi-class classification of DoA in [27]. This work grew out of the success of transformers in natural language processing [28] and their extension to image classification with ViTs, which have been shown to outperform CNNs in image classification [29], [30], making ViTs a natural extension to DoA estimation given the highly successful results from CNNs.

In addition to these different machine learning architectures, there are different approaches in the use of each architecture. For example, NNs can be developed to directly calculate the DoA via regression, or to output probabilities for each possible DoA within a discretized domain, known as classification. Work such as [23] leverages NNs to denoise the estimate of the covariance matrix, which can then be fed to classical covariance-based DoA estimation methods. Other works leverage the structure of classical methods and seek to improve on them, such as [31] and [32], which both utilize the structure of MUSIC in the development of their ML-based approach. More recently, there has been work specifically designing NNs for DoA estimation that account for imperfect data, whether due to array perturbations, sensor mutual coupling, or gain and phase errors [24], [32].

For a more comprehensive review of the current status of ML-based DoA estimation methods, the reader is referred to the review papers [33] and [34].

The rapid expansion in the use of ML for DoA estimation has produced a significant amount of work identifying reductions in DoA error relative to classical methods without discussing the trade-offs (other than computational cost and training time). Specifically, the missing unbiased constraint in the development of ML-based DoA estimators is a significant trade-off that needs to be studied. This is especially true in relatively simple problems with perfect information where efficient estimators are well known, for example, CBF for a single source in additive white Gaussian noise (AWGN) [1]. ML-based methods may still be applied in cases where efficient estimators exist, but a discussion of the bias-variance trade-off should be included and is often missing.

It is the authors’ belief that a more thorough discussion of the trade-offs of ML-based methods is required. For example, the use of ML-based methods in imperfect information cases (such as perturbed arrays, arrays with missing sensors, muffled arrays, or mutual sensor coupling) is a promising use case because efficient estimators do not exist due to the unknown state of the array. Additionally, the use of ML-based methods such as denoising autoencoders to improve the estimate of the sample covariance matrix prior to traditional covariance-based DoA estimation [23], as discussed above, is another interesting application.

The advantage of ML-based methods is that no assumptions about the signal model need to be made a priori. ML algorithms use the training data to learn a model that captures the physics underlying that data. The key goal of training is to develop a model that generalizes to all data, which is especially important if the training data does not span the full physics of the system. This is where ML methods have the greatest advantage over classical methods: no signal structure is assumed in training, and developing models of signal structure can be difficult when considering real-world effects. CBF, Capon’s beamformer, and MUSIC all make assumptions about the state of the sensors when estimating DoA, which is problematic when there is a mismatch between the assumed physical model and the actual physical model. This mismatch can be caused by failed sensors, sensor perturbations, sensor calibration drift, non-isotropic sensors, near-field sources, and other effects.

Furthermore, in conventional estimation problems, when developing an estimator it is common to first attempt to derive an unbiased estimator [35]. Unbiased estimators are typically desired because the expected value of the estimator, $E(\hat{\theta})$, equals the true value $\theta$, while a biased estimator converges only to $\theta + b$, where $b$ is the bias of the estimator. If an unbiased estimator cannot be derived, the unbiased constraint can be removed from the derivation of the estimator. Although an unbiased constraint can be easily applied in classical estimation theory, ML-based methods utilize gradient-descent methods, such as Adam [36], to iteratively converge to a solution using training data and a loss function, typically root mean square error (RMSE) in regression problems. This can result in a biased ML estimator for two reasons. First, if the training data do not span the full statistics of the test data, predictions on the test data will be biased because the statistics of the training data do not match those of the test data. Second, using RMSE or mean square error (MSE) as the loss function imposes no unbiased constraint on the weight updates during back propagation. For example, using MSE as defined in (1), the gradient-descent algorithm seeks to minimize MSE without any explicit constraint on the bias term. The ML-based method therefore finds an estimator that achieves the lowest error, but that estimator is not guaranteed to be efficient and therefore not guaranteed to be unbiased.
\begin{equation*} \mathrm{MSE} = \mathrm{bias}^{2} + \mathrm{variance} \tag{1} \end{equation*}

This can allow ML methods to achieve significantly lower MSE than classical methods, but the improvement is obtained by accepting bias in exchange for reduced variance. This becomes especially important when coupled with the overuse of ML-based methods in settings where classical approaches are well suited.

The specific contributions of this research are:

  1. To illustrate that ML-based methods are unnecessary for DoA estimation problems with perfect a priori information when unbiased estimators are desired.

  2. To demonstrate the ability of ML-based methods (in this case a CNN) to generalize to imperfect information cases better than classical methods such as MUSIC and CBF.

  3. To show that CNNs can learn a more general function for DoA estimation compared to classical methods when accounting for imperfect information.

  4. To illustrate that the CNNs can converge to a biased estimator.

  5. To show that MUSIC generally has lower DoA estimation errors for muffled sensors than for missing sensors.

This paper is organized as follows: Section II outlines the signal model used to generate the training, validation, and test data for this work including details of the different imperfect information cases; Section III reviews important background topics; Section IV provides an overview of the CNN architecture and the training, validation, and prediction process; Section V presents the results and discussion; and Section VI provides the conclusions.

Conventions: Boldface lowercase symbols denote vectors and boldface uppercase symbols denote matrices. $^{H}$ denotes the Hermitian transpose. $\odot$ denotes the Hadamard product. $\mathbf{a}\sim\mathcal{CN}(\boldsymbol{\mu},\mathbf{C})$ indicates that $\mathbf{a}$ is a complex random vector with normal distribution, mean $\boldsymbol{\mu}$, and covariance $\mathbf{C}$. $\mathrm{diag}([a_{0},a_{1},\ldots,a_{L-1}])$ is an $L\times L$ diagonal matrix containing the elements $a_{0},a_{1},\ldots,a_{L-1}$ along its diagonal. $\mathbf{I}$ denotes an identity matrix. $K$ denotes the number of snapshots. $\lambda$ indicates wavelength. $^{\perp}$ denotes the orthogonal complement.

SECTION II.

Signal Model

This paper utilizes a uniform linear array (ULA) with $L$ sensors and inter-sensor spacing $d=\lambda/2$. The sensors are placed on the z-axis at coordinates $0,\lambda/2,\ldots,(L-1)\lambda/2$. The plane-wave assumption is made, so that a given wave arrives at an incident angle $\theta_{s}$ with the array, with direction cosine $u_{s}=\cos(\theta_{s})$, where $s$ denotes the specific plane wave [1]. The frequency-domain signal snapshot model is given by (2),
\begin{equation*} \mathbf{X}=\sum_{m=1}^{M} \mathbf{v}_{m} \mathbf{s}_{m} + \mathbf{N} \tag{2} \end{equation*}

where the array manifold vector is denoted by $\mathbf{v}$ with size $L\times 1$, with each row entry corresponding to a specific direction cosine $u_{s}$; the $i$th row of $\mathbf{v}$ is $\exp(j\pi u_{s}(i-1))$ [1]. The vector $\mathbf{s}_{m}$ is the complex amplitude of the signal, with distribution $\mathcal{CN}(0,\sigma_{s}^{2})$ and size $1\times K$. $\mathbf{N}$ is AWGN with size $L\times K$. Lastly, $m$ denotes the specific source.

When multiple independent snapshots are combined, the result is an $L\times K$ matrix given by (3).
\begin{equation*} \mathbf{X}=\begin{bmatrix} \mathbf{x}_{1} & \mathbf{x}_{2} & \ldots & \mathbf{x}_{K} \end{bmatrix} \tag{3} \end{equation*}
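As a concrete illustration, the following is a minimal NumPy sketch of this snapshot model. It assumes unit noise power so that SNR equals $\sigma_{s}^{2}/\sigma_{w}^{2}$, and uses $L=10$ sensors and $K=10$ snapshots, consistent with the $20\times 20$ CNN input described in Section IV; the function and parameter names are ours, not the authors'.

```python
import numpy as np

rng = np.random.default_rng(0)

def manifold(u, L):
    """Array manifold vector for a half-wavelength ULA: i-th row is exp(j*pi*u*(i-1))."""
    return np.exp(1j * np.pi * u * np.arange(L))[:, None]          # L x 1

def snapshots(us, L=10, K=10, snr_db=0.0):
    """Draw an L x K frequency-domain snapshot matrix X per (2)-(3)."""
    sigma_s2 = 10.0 ** (snr_db / 10.0)                             # unit noise power assumed
    X = np.zeros((L, K), dtype=complex)
    for u in us:                                                   # one term per source m
        s = np.sqrt(sigma_s2 / 2.0) * (rng.standard_normal((1, K))
                                       + 1j * rng.standard_normal((1, K)))
        X += manifold(u, L) @ s                                    # v_m s_m
    N = np.sqrt(0.5) * (rng.standard_normal((L, K))
                        + 1j * rng.standard_normal((L, K)))        # AWGN
    return X + N

X = snapshots([0.3], snr_db=0.0)                                   # one source at u = 0.3
```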

This signal model is then used as the basis for four different array conditions (shown in Figure 1) used to evaluate MUSIC, CBF, and the CNNs for both the single-source and two-source cases. The following subsections outline the four array cases, the reason for studying each case, and the modifications made to the signal model for each.

FIGURE 1. Array cases.

A. Perfect Array Data

The perfect array data case is used to train, validate, and test the ML models. No modifications to the signal model outlined above are required.

For the single-source case, the u-space is discretized into 100 bins and 1,000 data samples are created per bin, each representing a source at that DoA. This data is used for training and validation of the ML models. The process is repeated for all integer signal-to-noise ratio (SNR) values between −10 and 10 decibels (dB). A test data set is then created using the same SNR range, u-space discretization, and number of samples per bin, with a different seed for the random parameters.

For the two-source case, the same process is followed except for the locations of the sources. For the training and validation data, each data sample randomly places two uncorrelated sources of equal strength in the u-space domain by sampling uniformly from the u-space discretization. For the test data, the equal strength sources are located at the half-power beam width (HPBW) of the array for each data sample.

B. Perturbed Array Data

The perturbed array data case is used only as a test case for the ML models and classical methods. The purpose of introducing this test case is to demonstrate the ability of ML methods to generalize to cases of imperfect information. The perturbed case models the real-world scenario where over time sensor positions will drift, resulting in decreased array performance. The number of test data samples and u-space discretization for the perturbed case is the same as for the perfect array test cases. The perturbed array data is generated by adding a perturbation value to each sensor location in the array manifold vector. The array perturbation values are calculated by sampling a Gaussian random variable, with a mean of zero and variance of 0.4, independently for each sensor. This sampling operation is conducted independently for each test data sample.
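A minimal sketch of how such a perturbed test sample might be generated is below, continuing from the signal-model sketch above. The paper does not state the units of the variance 0.4; here we assume the perturbation applies to the sensor indices (positions in units of $\lambda/2$), which is an illustrative assumption.

```python
import numpy as np

def perturbed_manifold(u, L=10, var=0.4, rng=np.random.default_rng()):
    """Manifold vector with an independent zero-mean Gaussian offset per sensor."""
    pos = np.arange(L) + rng.normal(0.0, np.sqrt(var), L)   # perturbed sensor positions
    return np.exp(1j * np.pi * u * pos)[:, None]
```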

C. Missing Array Data

The missing array data case is used only as a test case for the ML models. The purpose of introducing this test case is to demonstrate the ability of ML methods to generalize to cases of imperfect information where random sensors in the array may fail over time. The missing array data is generated by setting the rows of the data matrix that correspond to missing sensors to zero, using the same u-space discretization and number of data samples as the previous cases.

The missing sensor cases analyzed in this work are all combinations of 4 failed sensors in a 10 sensor array resulting in 210 array combinations. The majority of the figures throughout this paper will be presented for a specific SNR and all array combinations. Each combination has been assigned a configuration number to condense the plots. The mapping between array combinations and configuration number is provided in Figure 2.
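For reference, the 210 configurations follow from $\binom{10}{4}=210$ and can be enumerated and applied in a few lines of Python; the helper below is a hypothetical sketch, not the authors' code.

```python
import numpy as np
from itertools import combinations

configs = list(combinations(range(10), 4))      # C(10, 4) = 210 failed-sensor sets
assert len(configs) == 210

def apply_missing(X, failed):
    """Zero the snapshot-matrix rows corresponding to failed sensors."""
    Xm = X.copy()
    Xm[list(failed), :] = 0.0
    return Xm
```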

FIGURE 2. Sensor configuration key.

D. Muffled Array Data

The muffled array data case is used as a test case for the ML models. The purpose of introducing this test case is to demonstrate the ability of ML methods to generalize to the scenario of sensor performance decreasing due to increased noise on the sensors. The muffled array data is generated by multiplying the muffled sensor data by 0.1 after the frequency domain snapshot data matrix is calculated. This represents the effects of increased noise on the desired sensors. The muffled array data is generated using the same u-space discretization and number of data samples as the previous cases. The same sensor combinations as those used for the missing sensor array are used for the muffled sensor array except each configuration represents muffled sensors.
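The muffled case differs from the missing case only in the row scaling applied after the snapshot matrix is formed; a sketch under the same assumptions as above:

```python
def apply_muffled(X, muffled, gain=0.1):
    """Scale the rows for muffled sensors by 0.1 after forming X."""
    Xm = X.copy()
    Xm[list(muffled), :] *= gain
    return Xm
```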

SECTION III.

Review of Important Topics

A. Conventional Beamforming

CBF is a classical approach used to estimate DoA. In the frequency domain, it relies on the assumed constant phase shift between each successive sensor in the array [37]. Thus, CBF works by scanning across the entire u-space domain and for each u-space discretization: (a) applies the appropriate phase shift to each sensor, (b) coherently sums the phase-shifted data, and (c) divides the sum by the number of snapshots. The resulting level is then set as the spectral amplitude of that DoA. For the single-source case, the maximum amplitude in the resulting spectrum is located at the predicted DoA. CBF is conducted for all test cases, as it is expected to have poor performance in the imperfect information cases since it relies on the constant phase-shift assumption, which fails when the array is not perfect.
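A minimal NumPy sketch of this scan is given below, computing the average beam power over snapshots for each candidate direction cosine as one common way to realize steps (a)-(c); it continues from the signal-model sketch, and the names are ours.

```python
import numpy as np

def cbf_spectrum(X, u_grid):
    """Conventional beamformer: phase-align, coherently sum, average over snapshots."""
    L, K = X.shape
    V = np.exp(1j * np.pi * np.outer(np.arange(L), u_grid))   # L x G steering matrix
    B = V.conj().T @ X / L                                    # G x K beam outputs
    return np.mean(np.abs(B) ** 2, axis=1)                    # average power per u bin

u_grid = np.linspace(-1.0, 1.0, 100)
u_hat = u_grid[np.argmax(cbf_spectrum(X, u_grid))]            # single-source estimate
```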

B. Overview of Multiple Signal Classification

MUSIC is a subspace-based (or super-resolution) method for DoA estimation [1] and is used as a benchmark for the CNNs developed in this work. MUSIC relies on the orthogonality of the signal-plus-noise and noise subspaces in the frequency-domain snapshot model. MUSIC works by computing the eigendecomposition of the sample covariance matrix given by (4) and sorting the eigenvalues (and corresponding eigenvectors) from largest to smallest. The first $M$ eigenvectors are the signal eigenvectors and the remaining $L-M$ are the noise eigenvectors, collected in $\mathbf{E}_{n}$. The DoA is then estimated by sweeping over all possible DoAs and projecting the array manifold vector onto the noise subspace as shown in (5). The resulting pseudo-spectrum, $\mathbf{f}$, tends toward infinity at the estimated DoA. The MUSIC algorithm is applied to all test cases as a benchmark for super-resolution algorithms.
\begin{align*} \mathbf{C_{s}} & = \mathbf{X} \mathbf{X}^{H}/K \tag{4}\\ \mathbf{f} & = \frac{1}{\mathbf{v}^{H} \mathbf{E}_{n} \mathbf{E}_{n}^{H} \mathbf{v}} \tag{5} \end{align*}
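A compact NumPy sketch of (4)-(5), assuming the number of sources $M$ is known and continuing from the sketches above:

```python
import numpy as np

def music_spectrum(X, u_grid, M):
    """MUSIC pseudo-spectrum per (4)-(5)."""
    L, K = X.shape
    Cs = X @ X.conj().T / K                             # sample covariance, (4)
    _, E = np.linalg.eigh(Cs)                           # eigenvalues ascending
    En = E[:, : L - M]                                  # noise eigenvectors (smallest L-M)
    V = np.exp(1j * np.pi * np.outer(np.arange(L), u_grid))
    num = np.sum(np.abs(En.conj().T @ V) ** 2, axis=0)  # ||En^H v(u)||^2
    return 1.0 / num                                    # peaks at the source DoAs, (5)

f = music_spectrum(X, u_grid, M=1)
u_hat = u_grid[np.argmax(f)]
```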

C. Overview of Cramer-Rao Lower Bound

The Cramer-Rao Lower Bound (CRLB) represents the lower bound on the variance of a minimum variance unbiased estimator (MVUE) [35]. The CRLB is given by (6), assuming that the regularity condition in (7) is satisfied, where $p$ is the probability density function (PDF) of the random variable $\mathbf{x}$ with unknown parameter $\theta$. The denominator of (6) is the negative expected value of the second derivative of the log-likelihood of $\mathbf{x}$ with respect to the unknown parameter, $\theta$.
\begin{align*} \mathrm{var}(\hat{\theta}) & \geq \frac{1}{-E\left[\dfrac{\partial^{2} \ln p(\mathbf{x};\theta)}{\partial \theta^{2}}\right]} \tag{6}\\ E\left[\dfrac{\partial \ln p(\mathbf{x};\theta)}{\partial \theta}\right] & = 0 \tag{7} \end{align*}

The CRLB for DoA estimation with a line array for plane waves in AWGN is given by (8) [1], where $\sigma^{2}_{w}$ is the noise variance, $\mathbf{S_{f}}$ is the signal covariance matrix, $\mathbf{V}$ is the array manifold matrix, $\mathbf{S_{x}}$ is the input spectral matrix defined in (9), $\mathbf{P_{V}}^{\perp}$ is the projection onto the noise subspace defined in (10), $\mathbf{H}$ is defined in (11), and $\mathbf{D}$ is the derivative of the array manifold vector.
\begin{align*} \mathbf{C}_{CR}(u) & = \frac{\sigma^{2}_{w}}{2K} \left\{ \Re\left[\left(\mathbf{S_{f}} \mathbf{V}^{H} \mathbf{S_{x}}^{-1} \mathbf{V} \mathbf{S_{f}}\right) \odot \mathbf{H}^{T}\right] \right\}^{-1} \tag{8}\\ \mathbf{S_{x}} & = \mathbf{V} \mathbf{S_{f}} \mathbf{V}^{H} + \sigma^{2}_{w} \mathbf{I} \tag{9}\\ \mathbf{P_{V}}^{\perp} & = \mathbf{I} - \mathbf{V}\left(\mathbf{V}^{H}\mathbf{V}\right)^{-1}\mathbf{V}^{H} \tag{10}\\ \mathbf{H} & \triangleq \mathbf{D}^{H} \mathbf{P_{V}}^{\perp} \mathbf{D} \tag{11} \end{align*}
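The bound in (8)-(11) can be evaluated numerically. Below is a sketch for uncorrelated, equal-power sources in unit-power noise (our assumptions), with the manifold and its derivative taken from the signal model in Section II.

```python
import numpy as np

def crlb_u(us, L=10, K=10, snr_db=0.0):
    """Evaluate the CRLB matrix (8)-(11) in u-space for sources at direction cosines `us`."""
    sw2 = 1.0                                        # unit noise power assumed
    sf2 = 10.0 ** (snr_db / 10.0)
    n = np.arange(L)[:, None]
    V = np.exp(1j * np.pi * n * np.asarray(us))      # L x M manifold matrix
    D = (1j * np.pi * n) * V                         # element-wise derivative dv/du
    Sf = sf2 * np.eye(len(us))                       # uncorrelated, equal-power sources
    Sx = V @ Sf @ V.conj().T + sw2 * np.eye(L)       # (9)
    Pperp = np.eye(L) - V @ np.linalg.solve(V.conj().T @ V, V.conj().T)  # (10)
    H = D.conj().T @ Pperp @ D                       # (11)
    A = np.real((Sf @ V.conj().T @ np.linalg.solve(Sx, V) @ Sf) * H.T)   # Hadamard with H^T
    return sw2 / (2.0 * K) * np.linalg.inv(A)        # (8)
```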

The CRLB for each array case is calculated by modifying the array manifold vector as outlined in Section II. Note that the standard CRLB does not apply to the muffled array case; the CRLB for the missing sensor case is therefore used as a reference CRLB for the muffled case throughout this paper. Additionally, no CRLB is presented for the perturbed case because each array manifold perturbation value is itself a random variable, so generating Q perturbed data samples would yield Q distinct CRLBs, which is impractical to plot.

SECTION IV.

Convolutional Neural Network Details

A CNN is an ML network architecture that utilizes filters to extract information from input data [38] in order to learn a generalized model of the data. The layers in which these filters are applied are called convolutional layers. These convolutional layers increase the dimensionality of the data, as multiple filters of a defined size are applied to the input data. The convolutional layers are typically followed by normalization, activation, pooling, and dropout layers. Batch normalization layers reduce the impact of the changing statistical distribution of each layer during training, which can lead to the vanishing gradient problem [26]; this change in statistical properties during training is commonly called internal covariate shift [39]. Activation layers allow the network to learn non-linear trends [40], and pooling layers can be used to reduce the dimensionality of the data [41]. These layers are typically followed by a dropout layer to aid the generalization capabilities of the neural network [42]. These operations can be grouped together and repeated, allowing each group of operations to learn different features of the data.

In supervised learning, training and validation data with known outputs are fed to the network, which makes predictions from the input training data. The error between the predicted output of the neural network and the known output is calculated, and the weights and biases are updated based on the magnitude of the error and the learning rate through a process called back propagation [43]. A block diagram of this process is provided in Figure 3. The process is repeated by looping over all the data (each loop is called an epoch) until the optimal performance for the given network architecture and initialization is reached. The optimal performance is found by leveraging the validation data, which is not used to update the weights and biases; instead, it is used to periodically check the error of the network during training. Separate validation data is needed to confirm the accuracy of a network because the neural network would eventually memorize the input data, a phenomenon called overfitting [40]. The optimal network is typically the one that produces the minimum error on the validation data, not on the training data, as illustrated in Figure 4.

FIGURE 3. CNN training block diagram.

FIGURE 4. CNN training and validation.

A. Overview of Network Structure & Training Parameters

An overview of the NN architecture used in this work is provided in Figure 5. The network consists of an input layer followed by a two-dimensional convolutional layer, a batch normalization layer, an activation layer, a dropout layer, and a max pooling layer. These five layers are repeated to form three blocks, with the only change being a doubling of the number of filters in each successive convolutional layer. The blocks are followed by another dropout layer, a flattening layer, and a final fully connected layer with the number of neurons equal to the number of sources.

FIGURE 5. The designed CNN, where 2D conv indicates a 2D convolutional layer with the number of filters and kernel size specified, m indicates the slope of the negative portion of leaky ReLU, p denotes the percentage of neurons dropped in a dropout layer, and max pooling indicates a maximum pooling layer with the indicated window and stride size.

The filter size used for all convolutional layers in this network is $20\times 20$, and the first convolutional layer has 8 filters. The activation function is the leaky rectified linear unit (ReLU) [44], a modification of ReLU in which inputs less than zero are mapped onto a line through the origin with a specified slope; the slope chosen for this work was 0.4. The dropout layers are parameterized by the percentage of neurons randomly set to zero, 20% in this work. The max pooling pool size and stride were both set to 2, so the pooling window is a $2\times 2$ matrix and each pooling layer halves the dimensionality of the data. This layer structure, specifically the doubling of the number of filters combined with the halving from pooling, gives the data a pyramidal structure: the length and width of the data matrix shrink as the data passes through the network while the depth increases.
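For concreteness, a minimal PyTorch sketch of this architecture is given below. The paper does not specify a framework; the class name and the padding='same' choice (used so the stated $20\times 20$ kernels fit the $20\times 20$ input) are our assumptions.

```python
import torch
import torch.nn as nn

def block(c_in, c_out):
    # conv -> batch norm -> leaky ReLU (slope 0.4) -> dropout (20%) -> 2x2 max pool
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=20, padding='same'),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.4),
        nn.Dropout(0.2),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class DoACNN(nn.Module):
    """Three conv blocks with doubling filter counts, then dropout, flatten, linear."""
    def __init__(self, num_sources=1):
        super().__init__()
        self.features = nn.Sequential(block(1, 8), block(8, 16), block(16, 32))
        self.head = nn.Sequential(
            nn.Dropout(0.2),
            nn.Flatten(),
            nn.Linear(32 * 2 * 2, num_sources),   # 20x20 input -> 2x2x32 after 3 pools
        )

    def forward(self, x):                         # x: (batch, 1, 2L, K) = (batch, 1, 20, 20)
        return self.head(self.features(x))

y = DoACNN(num_sources=2)(torch.randn(4, 1, 20, 20))   # -> shape (4, 2)
```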

The network size and hyperparameters were determined by training a few trial CNNs and relying on past work by the authors. The trial CNNs used one to three of the previously described blocks of layers, and the validation error was monitored to select the final number of blocks. Additional trial CNNs were developed to select the dropout percentage and the slope parameter for leaky ReLU. The size of the convolutional filters was selected by leveraging previous research indicating that larger filter sizes directly lead to lower error in DoA estimation [21]; a filter size of 20 was selected because the $20\times 20$ input data in this work limits the maximum filter size. The pooling parameters were selected to reduce the dimensionality of the data, and previous work in [21] showed that a kernel size of $2\times 2$ and a stride of 2 produced reasonable results. The authors note that an extensive hyperparameter tuning effort was not pursued, as the CNNs described in this section achieved errors sufficient to support our claims.

B. Training and Validation Results

For this work, SNR-specific and source-count-specific CNNs were trained using the perfect array data generated by the signal model outlined in Section II. Each data sample generated by the signal model is an $L\times K$ complex data matrix. The CNNs require real-valued data; thus, the real and imaginary parts of each data sample are concatenated, producing a $2L\times K$ real-valued matrix. This matrix is then normalized to contain only values between −1 and 1 using (12), where $\mathbf{X}$ is the input data and $i$ indexes a specific sample. The resulting normalized matrix is used as the input to the CNN. The CNN is trained using root mean square error (RMSE) as the loss function and Adam as the optimization routine [36]. The network is trained for 10 epochs with a mini-batch size of 32. Additionally, 5 cross-validation folds are performed to verify that the network trains to a consistent minimum. This process is repeated for each SNR, resulting in five trained networks at each of the 21 SNR values. The cross-validation statistics, provided in Figure 6, show that the networks for each SNR converged to similar means and variances, indicating that the training process is consistent.
\begin{equation*} \mathbf{X}(i) = \frac{2\left(\mathbf{X}(i) - \min(\mathbf{X})\right)}{\max(\mathbf{X}) - \min(\mathbf{X})} - 1 \tag{12} \end{equation*}
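A sketch of the real-valued conversion and the min-max scaling in (12) follows; for test data, the extrema come from the training set, so outputs may fall outside [−1, 1] (see Section IV-C). The helper names are ours.

```python
import numpy as np

def to_real(X):
    """Stack real and imaginary parts into a 2L x K real-valued matrix."""
    return np.concatenate([X.real, X.imag], axis=0)

def normalize(Xr, x_min, x_max):
    """Min-max scale to [-1, 1] per (12); pass training-set extrema for test data."""
    return 2.0 * (Xr - x_min) / (x_max - x_min) - 1.0
```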

FIGURE 6. Cross-validation statistics for all SNR and both source cases.

The training and validation loss curves, plotted on a semi-log scale, for a CNN trained on the single-source 0 dB SNR case for 100 epochs are shown in Figure 7. The plot shows two training plateaus, one between 5 and 20 epochs and a second above 20 epochs. This indicates that training beyond the 10 epochs used for all the networks may have led to improved results on the test data. However, longer training was not pursued, as the performance of the CNNs after 10 epochs was sufficient to demonstrate the desired results without a prohibitive increase in training time across all 210 CNNs. Note that this loss plot does not show the typical parabola-shaped validation loss because the training and validation data are generated from the same signal model, so the training data spans the statistics of the validation data. If a different validation set were used, for example the perturbed array data, the validation loss curve would have the parabolic shape indicating over-fitting.

FIGURE 7. Training and validation loss curves for the single-source 0 dB SNR CNN trained for 100 epochs.

C. Predictions

After training, the CNNs are used to make predictions on the test data. Predictions are generated for the perfect, perturbed, missing, and muffled array cases for both the single-source and two-source cases. Before making predictions, as outlined above, the complex-valued data is converted to real values and normalized. Note that in this step the normalized data is not bounded to the range −1 to 1, because the maximum and minimum values used in (12) are those of the training data. This is done because the test data may lie outside the bounds of the training data, and compressing it to the training range would not be statistically valid. The MSE in dB between the neural network prediction and the true value is then calculated by averaging the error across all test samples, accounting for the periodicity of direction cosines as shown in (13).
\begin{equation*} \mathrm{MSE} = 10\log_{10}\left(\frac{1}{N}\sum_{i=1}^{N}\min\left((\hat{\theta}-\theta)^{2},\,(\hat{\theta}-(\theta+2))^{2},\,(\hat{\theta}-(\theta-2))^{2}\right)\right) \tag{13} \end{equation*}
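A direct NumPy implementation of (13), where u_hat and u_true are arrays of predicted and true direction cosines (hypothetical names):

```python
import numpy as np

def mse_db(u_hat, u_true):
    """MSE in dB per (13), accounting for the period-2 wrap of direction cosines."""
    d = np.stack([(u_hat - u_true) ** 2,
                  (u_hat - (u_true + 2.0)) ** 2,
                  (u_hat - (u_true - 2.0)) ** 2])
    return 10.0 * np.log10(np.mean(np.min(d, axis=0)))
```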

SECTION V.

Results and Discussion

This section compares the MSE in dB for the CNNs, MUSIC, and CBF against the CRLB.

A. Perfect Array Results

Figure 8 plots the MSE in dB against SNR for the single-source (top) and two-source (bottom) cases. For the single-source case, at low SNR the CNNs and MUSIC are nearly equivalent. At higher SNR, the CNNs outperform MUSIC and exceed the CRLB. For the two-source case, the CNNs outperform MUSIC only at lower SNR; however, the low SNR case is typically the most relevant, as locating high SNR signals is less challenging. Additionally, MUSIC is expected to perform well in the perfect data case, as there is no model mismatch (perfect information) between the assumptions in MUSIC and the actual data. For both the single-source and two-source cases, both MUSIC and the CNNs outperform CBF, as expected.

FIGURE 8. Comparison of MSE between MUSIC, CBF, and the CNNs with the CRLB for a perfect array with 1 source (top) and 2 sources (bottom).

For both source cases, the CNNs fall below the CRLB for at least a portion of the SNR range, indicating that the CNNs are biased estimators. This becomes apparent from the definition of MSE in terms of bias and variance in (1): since the CRLB is the lower bound on the variance of an unbiased estimator, the only way to reduce the MSE below the CRLB is to introduce bias into the estimator [45]. Additionally, at lower SNR where the CNNs are above the CRLB, the CNN-based estimator may still be biased because the training data is not guaranteed to have the same statistical properties as the test data, which would result in a biased estimator for the test data.

B. Perturbed Array Results

Figure 9 plots the results of estimating DoA for a randomly perturbed array using CNNs trained on perfect array data, MUSIC, and CBF. Unlike the perfect array case, there is a model mismatch (imperfect information) between the data and assumptions in all estimation methods. This results in significantly degraded performance of MUSIC and CBF for both signal cases. The CNNs also suffer from degraded performance relative to the perfect array case but outperform MUSIC and CBF. This indicates that the CNNs have learned a more generalizable function for DoA estimation compared to MUSIC. However, this is likely at the cost of bias, as discussed in the perfect array results section. This case highlights the trade-offs of utilizing CNNs. The size of the CNNs can be made arbitrarily large to drive the MSE lower while also being more generalizable than classical methods, but this comes with the trade-off of bias in the estimator which cannot be removed with additional sampling [35]. The CNNs will converge to their biased estimate, while an unbiased estimator will converge to the true answer as the number of snapshots increases.

FIGURE 9. Comparison of MSE between MUSIC, CBF, and the CNNs with the CRLB for a perturbed array with 1 source (top) and 2 sources (bottom).

C. Missing Sensor Array Results

Figures 10 and 11 present the results at −10, 0, and 10 dB SNR for all 210 missing sensor combinations, as outlined in Section II. The single-source results show that at −10 dB SNR the CNNs outperform MUSIC and CBF for all possible missing sensor combinations; the performance advantage of the CNNs relative to MUSIC then decreases as SNR increases. For the two-source case, the CNNs outperform MUSIC for almost all missing sensor combinations up to 0 dB SNR. At higher SNRs the performance of MUSIC improves faster than that of the CNNs; however, as stated previously, the low SNR cases are typically of greater interest. It should additionally be noted that for the two-source case the CNNs exceed the CRLB, indicating they are guaranteed to be biased estimators.

FIGURE 10. Comparison of MSE between MUSIC, CBF, and the CNNs with the CRLB for 1 source at −10 (top), 0 (middle), and 10 (bottom) dB for all missing sensor combinations.

FIGURE 11. Comparison of MSE between MUSIC, CBF, and the CNNs with the CRLB for 2 sources at HPBW at −10 (top), 0 (middle), and 10 (bottom) dB for all missing sensor combinations.

The performance differences between MUSIC and the CNNs are better visualized in the colormaps provided in Figure 12. Each green cell indicates an SNR and array configuration for which a CNN has lower MSE than MUSIC. These results further indicate that at lower SNR values the CNNs are able to learn a more generalizable function for DoA estimation than MUSIC.

FIGURE 12. Heatmaps indicating in green the SNR values and missing sensor configurations where the CNNs outperform MUSIC for 1 source (top) and 2 sources at HPBW (bottom).

The colormaps also highlight two trends in the data. First, for both the single-source and two-source cases, there are threshold SNR values at which large numbers of array configurations switch from being better estimated by MUSIC to being better estimated by the CNNs. Second, the vertical striping indicates that some missing sensor combinations are represented very well by the CNNs and are almost always better than MUSIC. Further research is required to understand these two trends.

D. Muffled Sensor Array Results

Figures 13 and 14 present the results at −10, 0, and 10 dB SNR for all 210 muffled sensor combinations, as outlined in Section II. The single-source muffled results follow the same trends as the single-source missing sensor case: the CNNs outperform MUSIC at low SNR for all muffled sensor combinations, but as SNR increases MUSIC’s error decreases more rapidly than the CNNs’. For the two-source case, the CNNs outperform MUSIC for almost all muffled sensor combinations at −10 dB SNR. The performance difference between MUSIC and the CNNs is then configuration dependent as SNR increases, until MUSIC performs better than all CNNs at high SNR; this is shown more clearly in Figure 15. These results further indicate that, at lower SNR values, CNNs are able to learn a more generalizable function for DoA estimation than MUSIC. They also indicate that MUSIC is more robust to muffled sensors than to missing sensors, as the MSE for MUSIC in the muffled sensor cases is generally lower.

FIGURE 13. Comparison of MSE between MUSIC, CBF, and the CNNs with the CRLB for 1 source at −10 (top), 0 (middle), and 10 (bottom) dB for all muffled sensor combinations.

FIGURE 14. Comparison of MSE between MUSIC, CBF, and the CNNs with the CRLB for 2 sources at HPBW at −10 (top), 0 (middle), and 10 (bottom) dB for all muffled sensor combinations.

FIGURE 15. Heatmaps indicating in green the SNR values and muffled sensor configurations where the CNNs outperform MUSIC for 1 source (top) and 2 sources at HPBW (bottom).

The colormaps also illustrate the same threshold SNR trend seen in the missing sensor case. The vertical striping is also present for the muffled case, but only for the single-source case unlike in the missing sensor case. Further research is required to understand these two trends.

SECTION VI.

Conclusion

In this study, the trade-offs between ML-based and classical methods for DoA estimation were examined. We trained SNR-specific and source-count-specific CNNs on perfect array data for the ML-based approach. We then tested the CNNs against a test data set that included perfect, perturbed, missing, and muffled array cases. The MUSIC and CBF algorithms were also evaluated against these same four array cases. The results demonstrate two findings. First, we illustrated that ML-based methods are unnecessary for DoA estimation problems with perfect a priori information when unbiased estimators are desired. For the perfect array case, MUSIC achieved results similar to the CNNs developed for this work, because there was no model to data mismatch.

Second, we found that for the other array cases (perturbed, missing, and muffled), there is a model to data mismatch, resulting in an imperfect information scenario. In these cases, we have shown that CNNs generally performed better than MUSIC and CBF at low SNR. This indicates that the model-free method (the CNN) learned a more general model of the ULA than the assumptions in MUSIC. These results illustrate that CNNs can offer improved robustness to real-world array cases with imperfect information, because the CNNs were trained only on the perfect array case. Although the use of CNNs for DoA estimation in imperfect information cases offers improved accuracy compared to classical methods, this comes at the cost of the CNNs being biased estimators, as illustrated by the CNN results at times exceeding the CRLB; thus, the estimator will not converge to the true answer as the number of snapshots increases. Additionally, it should be noted that the CNN results could be improved further (via a larger network and additional hyperparameter tuning), which would further highlight this trend. Lastly, this paper demonstrated that MUSIC is more robust to muffled sensors than to missing sensors.

This paper has shown there is minimal benefit to utilizing CNNs for perfect array geometry. The utility of CNNs is their ability to learn generalized functions, allowing something trained on a specific signal model (in this case the perfect array) to be applied to a different signal model (perturbed, missing, and muffled signal models) and achieve satisfactory results.
