I. Introduction
Estimation of the direction of arrival (DoA) of an acoustic source is prevalent in signal processing; it is an important step in many tasks, such as source localization, beamforming, source separation, spectrum sensing, and speech enhancement [1], to name but a few. Despite the large research attention it has drawn in the past decades, acoustic DoA estimation is still considered a challenging open problem. Especially in noisy and reverberant environments and in the presence of interference sources, it continues to be an active research field.
Acoustic source localization, and particularly DoA estimation, are often addressed using beamforming [2]. Many beamformers have been proposed over the years for these tasks. One class of beamformers is based on the steered response power (SRP) of a beamformer output. For example, considering the maximum likelihood criterion for a single source, the output power of the beamformer from all the directions is computed, and the DoA is identified as the direction with the maximal power [3], [4], [5], [6]. Another example is the Minimum Variance Distortionless Response (MVDR) beamformer [7], [8], [9], which was first introduced by Capon [10]. The MVDR beamformer extracts the DoA of each of the existing sources, maintaining a unit gain at their direction while minimizing the response from other directions. An important generalization of the MVDR beamformer is the Linearly Constrained Minimum Variance (LCMV) beamformer [11], obtained by minimizing the output power under multiple linear constraints, and can be used for DoA estimation as well [12]. Another line of beamformers is derived based on a subspace approach, i.e., by identifying the subspace of the desired sources, which is assumed to contain only a small portion of the noise and the interference sources. A prominent subspace method, which is also used for DoA estimation, is MUltiple Signal Classification (MUSIC) [13], [14], [15], [16].
A notable algorithm for acoustic source localization is the Steered-Response Power Phase Transform (SRP-PHAT) algorithm proposed in [17]. In SRP-PHAT, the phase transform is used to normalize the different frequencies, such that only their phase information is considered. This allows for the fusion of the different frequencies when considering a broadband signal. A popular time-domain implementation of the SRP-PHAT is the generalized cross-correlation with phase transform (GCC-PHAT) proposed in [18], which normalizes each cross-correlation using the phase transform. In recent years, sparse signal recovery methods have been proposed for DoA estimation [19], [20], [21]. In particular, one approach for solving the sparse recovery problem is the sparse Bayesian learning approach proposed in [22], which was adopted for the DoA estimation problem as well, for example in [23], [24].
In this paper, we consider DoA estimation in a reverberant enclosure consisting of desired sources along with interference sources. We assume that the desired sources are constantly active, whereas the interference sources are only intermittently active. The number of sources, their locations, and their times of activity are all unknown. Consequently, their identification as desired or interference is unknown as well. The power of the different sources is also unknown, and the interference sources could, in fact, be stronger than the desired sources with overlapping activity periods. Our goal is to estimate the DoA of the desired sources in the presence of possibly simultaneously active, multiple interference sources.
This setting poses a major challenge to the common practice in existing methods that rely on maximal power because estimating the DoA of the strongest sources might result in distinct beams in the direction of the interference sources rather than the desired sources. Furthermore, these beams could mask the beams pointing at the directions of desired sources.
We address this challenge from a geometric standpoint. Our approach relies on the observation that the frequently-used beamformers implicitly consider Euclidean geometry when processing sample correlation matrices. Therefore, since the sample correlation matrices are Hermitian Positive Definite (HPD) matrices, important geometric information is not fully utilized. Instead, we propose a new approach for beamforming design that is based on the Riemannian geometry of the manifold of HPD matrices [25], [26], [27]. Concretely, we analyze the received signal in short time windows and consider the Riemannian mean [25] of the sample correlation matrices in these windows. Then, we leverage particular spectral properties of the Riemannian mean. In [28], it was shown that the Riemannian mean of HPD matrices preserves shared spectral components and attenuates unshared spectral components. Consequently, the continual activity of the desired sources and the intermittent activity of the interference sources enable us to associate desired sources with shared spectral components and interference sources with unshared spectral components. By combining the above, we show that the incorporation of the Riemannian mean into the beamformer design leads to interference rejection, i.e., gives rise to a spatial spectrum that implicitly rejects the beams pointing at the interference sources and preserves the beams pointing at the desired sources. The resulting spectrum is, in turn, used for the estimation of the DoA of the desired sources. Importantly, our approach is applicable to a large number of beamformers used for DoA estimation.
By incorporating our Riemannian approach, we present new implementations of several beamformers: the Delay and Sum (DS) beamformer, subspace-based beamformers, and the MVDR beamformer, as well as the Bayesian learning method for DoA estimation proposed in [23].
The main contributions of this article are as follows. First, we observe that although signal correlation matrices are commonly used in array processing in general, and for DoA estimation in particular, their HPD structure is not fully exploited. We identify that when the computation of the sample correlation matrix is divided into short-time segments, the sample correlation matrix can be viewed as the Euclidean mean of the sample correlation matrices of the segments. This observation enables the introduction of the Riemannian geometry of HPD matrices; in particular, it allows us to introduce a rather simple approach that replaces the Euclidean mean with the Riemannian mean and to show its multiple merits. Second, we demonstrate that the proposed approach can naturally be applied to a broad range of DoA estimation methods that use the signal sample correlation matrix, and we show that it is computationally efficient in the sense that it typically does not change the order of the computational complexity of the DoA estimation methods. Third, we theoretically analyze the proposed approach and show that in the case of the DS beamformer, it results in higher signal-to-interference ratio (SIR) values that lead to better DoA estimation accuracy. In addition, we present a noise sensitivity analysis. Fourth, we present empirical results in adverse conditions that include multiple simultaneously active interference sources. We showcase the applicability of our approach to both classical and recent DoA estimation methods and demonstrate a large performance improvement.
We conclude the introduction with three remarks. First, a similar setting to ours, consisting of desired sources accompanied by interference sources, was considered in [29] and [30] but in the context of signal enhancement. In [29], a single desired source and a single interference source were considered, and in [30], multiple desired sources and multiple interference sources were considered. However, in both works, it was assumed that there is at least one segment for each source, desired or interference, in which it is the only active source. Furthermore, in [30], the number of the desired sources and their activity patterns were assumed to be known. Second, in the context of radar, the Riemannian geometry of the Toeplitz HPD matrices was used in [31] and [32] for target detection by comparing Riemannian distances to a threshold. In the radar settings, [33] estimated the correlation matrix as a linear combination of correlation matrices, with weights that are based on the Riemannian distance. Third, in this article, we demonstrate the Riemannian approach for designing beamformers that reject interference sources for DoA estimation. However, other applications, e.g., signal enhancement, could also benefit from spectra that reject interference sources.
This paper is organized as follows. In Section II, we present a brief background on the HPD manifold. In Section III, we formulate the problem and the setting. In Section IV we describe the proposed approach and present the algorithm for DoA estimation. In Section V, we provide a theoretical analysis of the proposed approach for the DS beamformer. In Section VI, extensions of the approach to other DoA estimation methods are presented. Section VII shows simulation results demonstrating our Riemannian approach. Lastly, we conclude the work in Section VIII.
II. Background on the HPD Manifold
An HPD matrix,
\begin{align*} d^{2}_{\text{R}}(\mathbf{\Gamma}_{1},\mathbf{\Gamma}_{2})=\Big{\|}\log\left (\mathbf{\Gamma}_{2}^{-\frac{1}{2}}\mathbf{\Gamma}_{1}\mathbf{\Gamma}_{2}^{- \frac{1}{2}}\right)\Big{\|}_{F}^{2},\tag{1} \end{align*}
where
The tangent space to a point on a manifold
The Riemannian mean,
\begin{align*} \mathbf{\Gamma}_{\text{R}}\equiv\operatorname*{arg min}_{\mathbf{\Gamma}\in \mathcal{M}}\sum_{i}d^{2}_{\text{R}}(\mathbf{\Gamma},\mathbf{\Gamma}_{i}).\tag{2} \end{align*}
In general, there is no closed-form expression for the Riemannian mean of more than two matrices, and a solution can be found using an iterative procedure [35], described in Algorithm 1.
Algorithm 1: Riemannian mean for the HPD manifold [36]
Input: a set of
Output: the Riemannian mean
Compute
do
Compute the Euclidean mean in the tangent plane:
\boldsymbol{\overline{P}}=\frac{1}{K}\sum_{j=1}^{K}\text{Log}_{\mathbf{\Gamma}_{\text{R}}}(\mathbf{\Gamma}_{j})
Update
\mathbf{\Gamma}_{\text{R}}=\text{Exp}_{\mathbf{\Gamma}_{\text{R}}}(\boldsymbol{\overline{P}})
Stop if
\|\boldsymbol{\overline{P}}\|_{\text{F}}<\epsilon
The computation of the Riemannian mean requires two maps. The Logarithm map, which maps an HPD matrix
\begin{align*} \text{Log}_{\mathbf{\Gamma}}(\mathbf{\Gamma}_{i})=\mathbf{\Gamma}^{\frac{1}{2} }\log(\mathbf{\Gamma}^{-\frac{1}{2}}\mathbf{\Gamma}_{i}\mathbf{\Gamma}^{-\frac {1}{2}})\mathbf{\Gamma}^{\frac{1}{2}}.\tag{3} \end{align*}
The Exponential map, which maps a vector
\begin{align*} \text{Exp}_{\mathbf{\Gamma}}({\boldsymbol{T}})=\mathbf{\Gamma}^{\frac{1}{2}}\exp (\mathbf{\Gamma}^{-\frac{1}{2}}{\boldsymbol{T}}\mathbf{\Gamma}^{-\frac{1}{2}})\mathbf{ \Gamma}^{\frac{1}{2}}.\tag{4} \end{align*}
We note that there exists an efficient iterative estimator for the Riemannian mean [37]. It is described in Appendix C. A discussion about the choice of the Affine Invariant metric appears in Appendix D.
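The maps in (3) and (4) and the iteration of Algorithm 1 can be sketched numerically as follows. This is a minimal NumPy sketch and not the authors' implementation. All function names are hypothetical, the Euclidean-mean initialization is an assumption of the sketch, and the step-halving safeguard is an addition that is not part of Algorithm 1 (with `step = 1` the update coincides with the one in Algorithm 1).

```python
import numpy as np

def herm_fun(A, f):
    """Apply the scalar function f to a Hermitian matrix via its eigenvalues."""
    A = 0.5 * (A + A.conj().T)                  # remove numerical asymmetry
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def inv_sqrt(w):
    return 1.0 / np.sqrt(w)

def log_map(G, Gi):
    """Logarithm map (3): HPD matrix Gi -> tangent vector at G."""
    G_h, G_ih = herm_fun(G, np.sqrt), herm_fun(G, inv_sqrt)
    return G_h @ herm_fun(G_ih @ Gi @ G_ih, np.log) @ G_h

def exp_map(G, T):
    """Exponential map (4): tangent vector T at G -> HPD matrix."""
    G_h, G_ih = herm_fun(G, np.sqrt), herm_fun(G, inv_sqrt)
    return G_h @ herm_fun(G_ih @ T @ G_ih, np.exp) @ G_h

def riemannian_distance(G1, G2):
    """Square root of the squared distance in (1)."""
    G2_ih = herm_fun(G2, inv_sqrt)
    return np.linalg.norm(herm_fun(G2_ih @ G1 @ G2_ih, np.log), "fro")

def tangent_norm(G, T):
    """Affine-invariant norm of a tangent vector T at G (used by the safeguard only)."""
    G_ih = herm_fun(G, inv_sqrt)
    return np.linalg.norm(G_ih @ T @ G_ih, "fro")

def riemannian_mean(mats, eps=1e-9, max_iter=500):
    mats = np.asarray(mats)
    G = mats.mean(axis=0)                        # assumed initialization: Euclidean mean
    P = np.mean([log_map(G, Gi) for Gi in mats], axis=0)
    step = 1.0                                   # step = 1 gives the update of Algorithm 1
    for _ in range(max_iter):
        if np.linalg.norm(P, "fro") < eps:       # stopping rule of Algorithm 1
            break
        G_try = exp_map(G, step * P)
        P_try = np.mean([log_map(G_try, Gi) for Gi in mats], axis=0)
        if tangent_norm(G_try, P_try) < tangent_norm(G, P):
            G, P = G_try, P_try                  # accept the update
        else:
            step *= 0.5                          # safeguard, not part of Algorithm 1
    return 0.5 * (G + G.conj().T)

# Usage: for two matrices the mean is the geodesic midpoint, so it is equidistant from both.
rng = np.random.default_rng(0)

def random_hpd(M):
    X = rng.standard_normal((M, 3 * M)) + 1j * rng.standard_normal((M, 3 * M))
    return X @ X.conj().T / (3 * M)

A, B = random_hpd(4), random_hpd(4)
G = riemannian_mean([A, B])
print(riemannian_distance(G, A), riemannian_distance(G, B))
```

The safeguard accepts an update only when the affine-invariant norm of the mean tangent vector decreases. It is included because a fixed unit step is not guaranteed to behave well when the matrices have a large eigenvalue spread.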
III. Problem Formulation
We consider the problem of localizing
\begin{align*} z_{m}(n)&=\sum_{j=1}^{N_{\text{D}}}s_{j}^{d}(n)*h_{jm}^{d}(n)+ \sum_{j=1}^{N_{\text{I}}}s_{j}^{i}(n)*h_{jm}^{i}(n) \\ &\quad + v_{m}(n),\tag{5} \end{align*}
where
The sources are characterized by their activation times. The desired sources are active during the entire interval. In contrast, the interference sources are only partially active, namely, active during segments of the interval. We emphasize that although we consider intermittent interference sources, longer signals might contain a larger number of interference sources as well as more occurrences of active interference sources. We do not assume that there is a segment in which only the desired sources are active. Longer intervals do not necessarily increase the probability of such a segment, since new interference sources could emerge in such a scenario. Consequently, even a long signal still poses a significant challenge to DoA estimation. Additionally, we assume that the desired sources, the interference sources, and the noise are all uncorrelated. The noise is assumed to be spatially white. We note that these assumptions are made for simplicity. Empirically, the proposed approach leads to superior results for correlated signals and for noise with an arbitrary covariance matrix as well.
The received signal is processed using the Short-Time Fourier Transform (STFT). We denote by
\begin{align*} z_{m}(l,k)&=\sum_{j=1}^{N_{\text{D}}}s_{j}^{d}(l,k)h_{jm}^{d}(l,k) +\sum_{j=1}^{N_{\text{I}}}s_{j}^{i}(l,k)h_{jm}^{i}(l,k) \\ &\quad +v_{m}(l,k),\tag{6} \end{align*}
where we assume the length of the window is much larger than the AIR length.
We stack the received signals,
\begin{align*} \begin{split}{\boldsymbol{z}}(l,k)=[z_{1}(l,k)\;\;...\;\;z_{M}(l,k)]^{ \top}.
\end{split}\tag{7} \end{align*}
Its explicit expression is
\begin{align*} \begin{split}{\boldsymbol{z}}(l,k)={\boldsymbol{H}}^{d}(l,k){\boldsymbol{s}}^{d}(l,k)+{ \boldsymbol{H}}^{i}(l,k){\boldsymbol{s}}^{i}(l,k)+{\boldsymbol{v}}(l,k),
\end{split}\tag{8} \end{align*}
where
\begin{align*} \begin{split}{\boldsymbol{s}}^{d}(l,k) & =[s^{d}_{1}(l,k)\;\;...\;\;s^{d}_{N_{\text{D}}}(l,k)]^{\top} \\ {\boldsymbol{s}}^{i}(l,k) & =[s^{i}_{1}(l,k)\;\;...\;\;s^{i}_{ N_{\text{I}}}(l,k)]^{\top},
\end{split}\tag{9} \end{align*}
and the noise term is
\begin{align*} \begin{split}{\boldsymbol{v}}(l,k)=[v_{1}(l,k)\;\;...\;\;v_{M}(l,k)]^{ \top}.
\end{split}\tag{10} \end{align*}
The Acoustic Transfer Functions (ATFs) from the
\begin{align*} \begin{split}{\boldsymbol{h}}_{j}^{d}(l,k) & =[h_{j1}^{d}(l,k) \;\;...\;\;h_{jM}^{d}(l,k)]^{\top}\;\;\;\;j=1,...,N_{\text{D}} \\ {\boldsymbol{h}}_{j}^{i}(l,k) & =[h_{j1}^{i}(l,k)\;\;...\;\;h_ {jM}^{i}(l,k)]^{\top}\;\;\;\;j=1,...,N_{\text{I}},
\end{split}\tag{11} \end{align*}
and in a matrix form
\begin{align*} \begin{split}{\boldsymbol{H}}^{d}(l,k) & =[{\boldsymbol{h}}_{1}^{d}(l, k)\;\;...\;\;{\boldsymbol{h}}_{{N}_{\text{D}}}^{d}(l,k)] \\ {\boldsymbol{H}}^{i}(l,k) & =[{\boldsymbol{h}}_{1}^{i}(l,k)\;\;...\;\; {\boldsymbol{h}}_{{N}_{\text{I}}}^{i}(l,k)].
\end{split}\tag{12} \end{align*}
Henceforth, we focus on a single frequency bin and omit the frequency index. Throughout the article, we refer to
In this work, we focus on a single frequency to demonstrate the Riemannian approach, which is suitable for narrowband signals. For broadband signals, this is the formulation of only a single frequency. We note that the fusion of the different frequencies is of great importance, and normalization techniques for the different beamformers have been proposed [38] and can be applied after our method.
Our goal is to estimate the direction to the desired sources, given
IV. Proposed Approach
Typically, the DoA estimation of a desired source is based on the output of a beamformer. In this section, we present the proposed approach applied to the Delay-and-Sum (DS) beamformer. In Section VI, we extend the proposed approach to other DoA estimation methods.
We consider arbitrary indexing of the microphones in the array and designate the first microphone as the reference microphone. Let
\begin{align*}
{\boldsymbol{d}}(\theta)=[1,e^{j\phi_{2}(\theta)},...,e^{j\phi_{M}(\theta)}]^{\top},\tag{13} \end{align*}
where
The SRP of the DS beamformer is given by
\begin{align*} P_{\text{DS}}(\theta;\mathbf{\Gamma})={\boldsymbol{d}}^{H}(\theta)\mathbf{\Gamma}{\boldsymbol{ d}}(\theta),\tag{14} \end{align*}
where
\begin{align*}
\mathbf{\Gamma}=\mathbb{E}[{\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l)]\tag{15} \end{align*}
is the population correlation matrix. Since the desired source is constantly active and assumed to be at a fixed location during the entire interval, we estimate the population correlation matrix
\begin{align*} \begin{split}\widehat{\mathbf{\Gamma}}_{i}=\frac{1}{L_{\text{w}}} \sum_{l=(i-1)\cdot L_{\text{w}}+1}^{i\cdot L_{\text{w}}}{\boldsymbol{z}}(l){\boldsymbol{z}}^{H }(l),
\end{split}\tag{16} \end{align*}
where
The incorporation of the Riemannian geometry is realized by viewing each matrix
\begin{align*} \widehat{\mathbf{\Gamma}}_{\text{R}}=\operatorname*{argmin}_{\mathbf{\Gamma} \in\mathcal{M}}\sum_{i=1}^{L_{\text{s}}}d_{\text{R}}^{2}(\mathbf{\Gamma}, \widehat{\mathbf{\Gamma}}_{i}).\tag{17} \end{align*}
In general, there is no closed-form solution to (17) on the HPD manifold for more than two points [35]. Therefore, Algorithm 1 proposed in [36] is used to compute the Riemannian mean of the
Once
\begin{align*} P_{\text{DS}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{R}})={\boldsymbol{d}}^{H}(\theta) \widehat{\mathbf{\Gamma}}_{\text{R}}{\boldsymbol{d}}(\theta).\tag{18} \end{align*}
In the case of a single desired source and assuming the direct path is dominant in the AIR, the direction to it is set as the direction achieving the maximum value of the SRP of the DS beamformer, i.e.,
\begin{align*} \hat{\theta}=\operatorname*{argmax}_{\theta}P_{\text{DS}}(\theta;\widehat{ \mathbf{\Gamma}}_{\text{R}}).\tag{19} \end{align*}
We note that (19) is used since the DoA estimation is based on a single STFT frequency bin. In the case of
Algorithm 2: Direction estimation in the presence of multiple interference sources
Input: the received signal in the STFT domain
Output: the estimated direction of the desired source
Divide
For each segment
Compute
Compute
Return
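The steps above, namely the per-segment sample correlation matrices (16), their Riemannian mean (17), the DS spectrum (18), and the maximization (19), can be sketched end to end on synthetic data. This is an illustrative sketch under stated assumptions and not the authors' code. The array is assumed to be a uniform linear array with half-wavelength spacing in an anechoic setting, so that the ATFs equal the steering vectors. All function names, source directions, powers, and segment sizes are hypothetical choices. The step-halving safeguard in `riemannian_mean` is an addition of the sketch, and its stopping rule uses the whitened tangent vector rather than the Frobenius norm of Algorithm 1.

```python
import numpy as np

def herm_fun(A, f):
    """Apply the scalar function f to a Hermitian matrix via its eigenvalues."""
    w, V = np.linalg.eigh(0.5 * (A + A.conj().T))
    return (V * f(w)) @ V.conj().T

def whitened_gradient(G, mats):
    """Return G^{1/2} and J = mean_i log(G^{-1/2} Gamma_i G^{-1/2})."""
    G_h, G_ih = herm_fun(G, np.sqrt), herm_fun(G, lambda w: w ** -0.5)
    J = np.mean([herm_fun(G_ih @ Gi @ G_ih, np.log) for Gi in mats], axis=0)
    return G_h, J

def riemannian_mean(mats, eps=1e-9, max_iter=500):
    """Iterative Riemannian mean with a step-halving safeguard (addition of this sketch)."""
    G = np.mean(mats, axis=0)                       # assumed initialization
    G_h, J = whitened_gradient(G, mats)
    step = 1.0
    for _ in range(max_iter):
        if np.linalg.norm(J) < eps:
            break
        G_try = G_h @ herm_fun(step * J, np.exp) @ G_h   # Exp_G(step * G^{1/2} J G^{1/2})
        G_try_h, J_try = whitened_gradient(G_try, mats)
        if np.linalg.norm(J_try) < np.linalg.norm(J):
            G, G_h, J = G_try, G_try_h, J_try
        else:
            step *= 0.5
    return G

def steering(theta_deg, M):
    """Assumed ULA, half-wavelength spacing: phase pi * m * sin(theta), m = 0..M-1."""
    return np.exp(1j * np.pi * np.arange(M) * np.sin(np.deg2rad(theta_deg)))

def ds_spectrum(Gamma, grid_deg):
    """DS steered response power d^H(theta) Gamma d(theta) on a grid, as in (18)."""
    D = np.stack([steering(t, Gamma.shape[0]) for t in grid_deg], axis=1)
    return np.real(np.einsum("mg,mn,ng->g", D.conj(), Gamma, D))

def segment_correlations(z, L_w):
    """Sample correlation matrix of each length-L_w segment, as in (16)."""
    M, L = z.shape
    L_s = L // L_w
    segs = z[:, :L_s * L_w].reshape(M, L_s, L_w)
    return [segs[:, i] @ segs[:, i].conj().T / L_w for i in range(L_s)]

def estimate_doa(z, L_w, grid_deg, mean="riemannian"):
    mats = segment_correlations(z, L_w)
    Gamma = riemannian_mean(mats) if mean == "riemannian" else np.mean(mats, axis=0)
    return grid_deg[np.argmax(ds_spectrum(Gamma, grid_deg))]    # as in (19)

# Synthetic single-bin data: desired source at 20 degrees (always active), two stronger
# interference sources at -40 and 60 degrees, each active during half of the interval.
rng = np.random.default_rng(0)
M, L_s, L_w, noise_pow, interf_pow = 8, 10, 200, 0.01, 4.0
L = L_s * L_w

def cgauss(*shape):
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

z = np.sqrt(noise_pow) * cgauss(M, L) + np.outer(steering(20.0, M), cgauss(L))
z[:, : L // 2] += np.sqrt(interf_pow) * np.outer(steering(-40.0, M), cgauss(L // 2))
z[:, L // 2 :] += np.sqrt(interf_pow) * np.outer(steering(60.0, M), cgauss(L - L // 2))

grid = np.arange(-90, 91)
theta_R = estimate_doa(z, L_w, grid, mean="riemannian")
theta_E = estimate_doa(z, L_w, grid, mean="euclidean")
print("Riemannian:", theta_R, "Euclidean:", theta_E)
```

Passing `mean="euclidean"` reproduces the baseline that averages the segment matrices arithmetically, so the two estimates can be compared on the same data.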
As a baseline, we consider the common practice of the computation of the SRP of the DS beamformer, which is typically based on the sample correlation matrix over the entire interval, i.e.,
\begin{align*} \widehat{\mathbf{\Gamma}}_{\text{E}}=\frac{1}{L_{\text{STFT}}}\sum_{l=1}^{L_{ \text{STFT}}}{\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l).\tag{20} \end{align*}
We observe that computing the Euclidean mean of the sample correlation matrices per segment,
We will show that our Riemannian approach exploits the assumption that the desired source is constantly active and at a fixed location, whereas the interference sources are intermittent. More specifically, we will show both theoretically in Section V and empirically in Section VII that the Riemannian mean attenuates the intermittent interferences while preserving the constantly active sources. In contrast, the standard Euclidean mean accumulates all the sources, and as a result, the main lobe could deviate from the direction of a desired source, and even focus on an interference source.
We show in Section V and Section VII that our proposed approach results in an SRP that rejects the interference sources, allowing the beamformer to extract the DoA of the desired sources.
We remark that the proposed approach only requires that the desired sources are the only sources active during the entire interval. The rank of the signal matrix is neither known nor needs to be estimated, and the number of interference sources is unknown as well. This is by virtue of the Riemannian mean. Unlike other works (e.g., [30]), we do not need to know the activation times of each interference source. Furthermore, we do not assume that there exists a segment in which a desired source is the only active source; namely, it could always be accompanied by interference sources.
In terms of complexity, the Riemannian approach requires the computation of the Riemannian mean of the correlation matrices, which is more complex than the computation of the Euclidean mean. However, the excess complexity depends on the number of microphones and is typically low compared to the complexity of computing the correlation matrix, which depends on the number of signal samples. Consequently, the excess complexity is negligible. In particular, following the iterative computation of the Riemannian mean in Algorithm 3 in Appendix C, at each iteration, the computation involves the eigenvalue decomposition of two complex matrices of dimension
To evaluate the performance of the proposed approach, we define the output Signal to Interference Ratio (SIR) as follows:
\begin{align*} \begin{split}\text{SIR}_{j}(\widehat{\mathbf{\Gamma}})=\frac{P (\theta^{d};\widehat{\mathbf{\Gamma}})}{P(\theta^{i}_{j};\widehat{\mathbf{ \Gamma}})},
\end{split}\tag{21} \end{align*}
where
\begin{align*} \begin{split}\text{SIR}_{j}(\widehat{\mathbf{\Gamma}})=\frac{{\boldsymbol{d}}^{H}(\theta^{d})\widehat{\mathbf{\Gamma}}{\boldsymbol{d}}(\theta^{d})}{{\boldsymbol{d}}^{H }(\theta^{i}_{j})\widehat{\mathbf{\Gamma}}{\boldsymbol{d}}(\theta^{i}_{j})}.
\end{split}\tag{22} \end{align*}
This measure of performance is used because the main challenge in this setting is the presence of interference sources rather than the microphone noise.
V. Analysis
In this section, we analyze the proposed approach, which is based on Riemannian geometry, and compare it to its Euclidean counterpart. The proofs of the statements appear in the Supplementary Material (SM). In the analysis, we consider the population correlation matrix of the received signal, neglecting the estimation errors stemming from the finite sample in a segment. We note that Sections IV, VI, and VII consider the sample correlation matrices, and only Section V considers the population correlation matrix.
We begin with a short derivation demonstrating that Riemannian geometry better preserves the desired source subspace than Euclidean geometry. Consider a single desired source and assume its ATF, denoted by
\begin{align*}
\mathbf{\Gamma}_{\text{R}}\preceq\mathbf{\Gamma}_{\text{E}}.\tag{23} \end{align*}
We get that
\begin{align*} \frac{\lambda_{0}(\mathbf{\Gamma}_{\text{R}})}{\sum_{i=1}^{M-1}\lambda_{i} (\mathbf{\Gamma}_{\text{R}})}\geq\frac{\lambda_{0}(\mathbf{\Gamma}_{\text{E}})} {\sum_{i=1}^{M-1}\lambda_{i}(\mathbf{\Gamma}_{\text{E}})},\tag{24} \end{align*}
due to the equality of the numerators and (23). The inequality in (24) implies that the desired source subspace is more dominant than the subspace of the interference and noise in the Riemannian mean compared to the Euclidean mean.
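The ordering in (23) and the ratio inequality in (24) can be checked numerically on a small example. This is a sketch under assumptions that are not taken from the text. Only two segments are used, so that the Riemannian mean has the closed form A^{1/2}(A^{-1/2} B A^{-1/2})^{1/2} A^{1/2}. The interference vectors `u1` and `u2` are hypothetical and are projected to be orthogonal to the desired ATF `h0`, so that `h0` is an eigenvector of both segment matrices. All names and numerical values are illustrative.

```python
import numpy as np

def herm_fun(A, f):
    """Apply the scalar function f to a Hermitian matrix via its eigenvalues."""
    w, V = np.linalg.eigh(0.5 * (A + A.conj().T))
    return (V * f(w)) @ V.conj().T

def geometric_mean(A, B):
    """Riemannian mean of two HPD matrices (closed form, valid for two matrices only)."""
    A_h, A_ih = herm_fun(A, np.sqrt), herm_fun(A, lambda w: w ** -0.5)
    return A_h @ herm_fun(A_ih @ B @ A_ih, np.sqrt) @ A_h

rng = np.random.default_rng(1)
M, noise = 6, 0.1

def cvec():
    return rng.standard_normal(M) + 1j * rng.standard_normal(M)

def outer(v):
    return np.outer(v, v.conj())

h0 = cvec()                                       # desired ATF (hypothetical)

def orth_to_h0(u):
    """Remove the component of u along h0 (assumption of this sketch)."""
    return u - h0 * (h0.conj() @ u) / (h0.conj() @ h0)

u1, u2 = orth_to_h0(cvec()), orth_to_h0(cvec())   # one interference ATF per segment
G1 = outer(h0) + 3.0 * outer(u1) + noise * np.eye(M)
G2 = outer(h0) + 3.0 * outer(u2) + noise * np.eye(M)
G_R, G_E = geometric_mean(G1, G2), 0.5 * (G1 + G2)

# (23): G_E - G_R should be positive semidefinite.
min_eig = np.linalg.eigvalsh(0.5 * ((G_E - G_R) + (G_E - G_R).conj().T)).min()

def dominance(G):
    """Eigenvalue along h0 divided by the sum of the remaining eigenvalues, as in (24)."""
    lam0 = np.real(h0.conj() @ G @ h0) / np.real(h0.conj() @ h0)
    return lam0 / (np.real(np.trace(G)) - lam0)

print("min eig of G_E - G_R:", min_eig)
print("dominance (Riemannian, Euclidean):", dominance(G_R), dominance(G_E))
```

In this construction the eigenvalue along `h0` is the same for both means, so the comparison in (24) reduces to a comparison of the traces.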
In the remainder of this section, we extend this analysis and present additional results.
A. Assumptions
To make the analysis tractable, we consider a single desired source and multiple interference sources. Therefore, we simplify the notations by omitting the superscripts
For the purpose of analysis, we make the following assumptions.
Assumption 1:
Assumption 2:
It follows from Assumption 1 and Assumption 2 that the ATFs associated with the desired source and the interference sources are all uncorrelated. These are common assumptions; see, e.g., [30]. The assumptions are made only for the purpose of analysis, whereas in the experimental results, we consider acoustic signals in a reverberant environment without any such assumptions. We note that we do not assume that there exists a segment in which only one of the sources is active (e.g., as in [30]). If an interference source is only partially active during a segment, we consider it active during the entire segment.
The population correlation matrix of the
\begin{align*} \begin{split}\mathbf{\Gamma}_{i}=\sigma_{0}^{2}{\boldsymbol{h}}_{0}{\boldsymbol{h }}_{0}^{H}+{\boldsymbol{H}}\Lambda_{i}{\boldsymbol{H}}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}_{M\times M},
\end{split}\tag{25} \end{align*}
where
\begin{align*} \begin{split}\Lambda_{i}=\text{diag}\left(\sigma_{1}^{2}(i)\cdot \mathcal{I}_{i\in\mathcal{L}_{1}},\;...\;,\sigma_{N_{\text{I}}}^{2}(i)\cdot \mathcal{I}_{i\in\mathcal{L}_{N_{\text{I}}}}\right),
\end{split}\tag{26} \end{align*}
where
We continue with defining the Signal to Noise Ratio (SNR) at the
\begin{align*} \begin{split}\text{SNR}{{}_{m}}=\frac{\sigma_{0}^{2}{|{\boldsymbol{h}}_{0 }[m]|^{2}}}{\sigma_{v}^{2}},
\end{split}\tag{27} \end{align*}
where
To capture the correlation between the steering vectors and the ATFs, we define
\begin{align*} \begin{split}\rho_{rs}=\frac{|\langle{\boldsymbol{d}}_{r},{\boldsymbol{h}}_{s} \rangle|^{2}}{\|{\boldsymbol{d}}_{r}\|^{2}\cdot\|{\boldsymbol{h}}_{s}\|^{2}}=\frac{|\langle{ \boldsymbol{d}}_{r},{\boldsymbol{h}}_{s}\rangle|^{2}}{M\|{\boldsymbol{h}}_{s}\|^{2}},
\end{split}\tag{28} \end{align*}
where
We conclude the preliminaries of the analysis with two additional assumptions.
Assumption 3:
Assumption 3 implies that the correlation between the ATFs and the steering vectors depends only on whether they are associated with the same source or not. Following Assumption 3, henceforth we denote
Assumption 4:
Assumption 4 is typically made in the context of source localization. It implies that the correlation between a steering vector to a source and the ATF associated with that source is higher than the correlation between a steering vector to a source and the ATF associated with a different source.
B. Main Results
Our first result states that the output SIR (22) of the Riemannian-based DS beamformer is higher than the output SIR of the Euclidean-based DS beamformer.
Proposition 1:
For every interference source
\begin{align*} \text{SIR}_{j}(\mathbf{\Gamma}_{\text{R}}) > \text{SIR}_{j}(\mathbf{\Gamma}_{ \text{E}}),\;\;\;\tag{29} \end{align*}
for any number of microphones in the array.
Examining the dependency of the output SIR on the noise power
Proposition 2:
If
\begin{align*} \begin{split}\sigma_{0}^{2}\|{\boldsymbol{h}}_{0}\|^{2}\geq\sigma_{j}^{2} \tau_{j}\|{\boldsymbol{h}}_{j}\|^{2},\;\;\forall j,
\end{split}\tag{30} \end{align*}
then
\begin{align*} \frac{\partial}{\partial\sigma_{v}^{2}}\text{SIR}_{j}(\mathbf{\Gamma}_{\text{R }}) < \frac{\partial}{\partial\sigma_{v}^{2}}\text{SIR}_{j}(\mathbf{\Gamma}_{ \text{E}}) < 0.\tag{31} \end{align*}
Namely, the lower the noise power is the higher the output SIR is, and the improvement in
The proofs of Proposition 1 and Proposition 2 rely on the following lemma, which is important in its own right.
Lemma 1:
The Riemannian or the Euclidean mean of the population correlation matrices of the segments (25) over the entire interval can be written in the same parametric form as
\begin{align*} \begin{split}\mathbf{\Gamma}=\sigma_{0}^{2}{\boldsymbol{h}}_{0}{\boldsymbol{h}}_{ 0}^{H}+\sum_{j=1}^{N_{\text{I}}}\mu^{2}_{j}{\boldsymbol{h}}_{j}{\boldsymbol{h}}_{j}^{H}+\sigma _{v}^{2}{\boldsymbol{I}}.
\end{split}\tag{32} \end{align*}
The Riemannian mean
\begin{align*} \mu_{j}^{2}=\frac{(\sigma_{j}^{2}\|{\boldsymbol{h}}_{j}\|^{2}+\sigma_{v}^{2})^{\tau_{j }}(\sigma_{v}^{2})^{1-\tau_{j}}-\sigma_{v}^{2}}{\|{\boldsymbol{h}}_{j}\|^{2}},\tag{33} \end{align*}
and the Euclidean mean
\begin{align*}
\mu_{j}^{2}=\sigma_{j}^{2}\tau_{j}.\tag{34} \end{align*}
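The coefficients in (33) and (34) can be compared against numerically computed means. The sketch below is illustrative and rests on assumptions not taken from the text: the ATFs are scaled columns of a DFT matrix, which makes them mutually orthogonal, the interference powers are constant over the segments in which a source is active, and the activity patterns are arbitrary. All names and values are hypothetical, and the step-halving safeguard in `riemannian_mean` is an addition of the sketch.

```python
import numpy as np

def herm_fun(A, f):
    """Apply the scalar function f to a Hermitian matrix via its eigenvalues."""
    w, V = np.linalg.eigh(0.5 * (A + A.conj().T))
    return (V * f(w)) @ V.conj().T

def whitened_gradient(G, mats):
    G_h, G_ih = herm_fun(G, np.sqrt), herm_fun(G, lambda w: w ** -0.5)
    J = np.mean([herm_fun(G_ih @ Gi @ G_ih, np.log) for Gi in mats], axis=0)
    return G_h, J

def riemannian_mean(mats, eps=1e-9, max_iter=500):
    """Iterative Riemannian mean with a step-halving safeguard (addition of this sketch)."""
    G = np.mean(mats, axis=0)
    G_h, J = whitened_gradient(G, mats)
    step = 1.0
    for _ in range(max_iter):
        if np.linalg.norm(J) < eps:
            break
        G_try = G_h @ herm_fun(step * J, np.exp) @ G_h
        G_try_h, J_try = whitened_gradient(G_try, mats)
        if np.linalg.norm(J_try) < np.linalg.norm(J):
            G, G_h, J = G_try, G_try_h, J_try
        else:
            step *= 0.5
    return G

M, L_s, noise = 8, 10, 0.1
F = np.exp(2j * np.pi * np.outer(np.arange(M), np.arange(M)) / M)  # orthogonal columns
h0 = F[:, 0]                                    # desired ATF, sigma_0^2 = 1
H = F[:, 1:3] * np.array([0.8, 1.2])            # two interference ATFs with different norms
sigma2 = np.array([5.0, 2.0])                   # interference powers when active
active = np.zeros((2, L_s), dtype=bool)
active[0, :3] = True                            # source 1 active in 3 of 10 segments
active[1, 2:8] = True                           # source 2 active in 6 of 10 segments
tau = active.mean(axis=1)                       # activity fractions

# Segment correlation matrices in the form of (25)-(26).
mats = [np.outer(h0, h0.conj()) + (H * (sigma2 * active[:, i])) @ H.conj().T
        + noise * np.eye(M) for i in range(L_s)]

norm2 = np.sum(np.abs(H) ** 2, axis=0)          # ||h_j||^2

def mu2(G):
    """Coefficient of h_j h_j^H in the parametric form (32), read off from h_j^H G h_j."""
    quad = np.real(np.einsum("mj,mn,nj->j", H.conj(), G, H))
    return (quad / norm2 - noise) / norm2

mu2_R = mu2(riemannian_mean(mats))
mu2_E = mu2(np.mean(mats, axis=0))
mu2_R_formula = ((sigma2 * norm2 + noise) ** tau * noise ** (1 - tau) - noise) / norm2  # (33)
mu2_E_formula = sigma2 * tau                                                           # (34)
print(mu2_R, mu2_R_formula)
print(mu2_E, mu2_E_formula)
```

Because the ATFs are orthogonal in this construction, all segment matrices share the same eigenvectors and the Riemannian mean acts as a geometric mean on each eigenvalue.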
We note that only Assumptions 1 and 2 are necessary for this lemma to hold. In addition, we note that if the interference sources are always active, i.e.,
Lemma 1 shows that both
Furthermore, considering
Next, we examine a family of correlation matrices that pertain to the same parametric form as in (32) in Lemma 1, i.e.,
\begin{align*} \mathbf{\Gamma}_{\boldsymbol{a}}={\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\sum_{j=1}^{N_{\text{I}} }a_{j}{\boldsymbol{h}}_{j}{\boldsymbol{h}}_{j}^{H}+\sigma_{v}^{2}{\boldsymbol{I}},\tag{35} \end{align*}
for some coefficients
For any
\begin{align*} \mathbf{\Gamma}_{\text{opt}}\equiv\operatorname*{argmax}_{\mathbf{\Gamma}_{ \boldsymbol{a}}}\text{SIR}_{j}(\mathbf{\Gamma}_{\boldsymbol{a}})={\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H }+\sigma_{v}^{2}{\boldsymbol{I}},\tag{36} \end{align*}
where
\begin{align*} \begin{split}\text{SIR}_{j}(\mathbf{\Gamma}_{\text{opt}})=\frac{{ \boldsymbol{d}}_{0}^{H}({\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}){\boldsymbol{d}}_{ 0}}{{\boldsymbol{d}}_{j}^{H}({\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}){\boldsymbol{ d}}_{j}}.
\end{split}\tag{37} \end{align*}
Considering vanishing noise, i.e., when the noise power approaches zero, the following result stems from Lemma 1 by considering the limit
Corollary 1:
\begin{align*} \begin{split}\lim_{\sigma_{v}^{2}\rightarrow 0}\mathbf{\Gamma}_{ \text{R}}=\mathbf{\Gamma}_{\text{opt}}.
\end{split}\tag{38} \end{align*}
According to Corollary 1, the Riemannian mean approaches the optimal correlation matrix as the noise becomes negligible. By adding a condition on the presence of the interference sources, from Lemma 1 and (33) we also have the following.
Corollary 2:
For any interference source
\begin{align*} \begin{split}\lim_{\sigma_{j}^{2}\rightarrow\infty,\sigma_{v}^{2} \rightarrow 0}\mathbf{\Gamma}_{\text{R}}=\mathbf{\Gamma}_{\text{opt}}.
\end{split}\tag{39} \end{align*}
Additionally, if
\begin{align*} \begin{split}\lim_{\sigma_{j}^{2}\rightarrow\infty\forall j=1, \ldots,N_{\text{I}},\sigma_{v}^{2}\rightarrow 0}\mathbf{\Gamma}_{\text{R}}= \mathbf{\Gamma}_{\text{opt}}.
\end{split}\tag{40} \end{align*}
Corollary 2 implies that for vanishing noise, even when all the interference sources have infinite power, the desired source is still the dominant source in the SRP of the DS beamformer using the Riemannian mean. Following (37) it holds that
To illustrate the obtained expressions for the Riemannian and the Euclidean SIR, we present the following simple example.
Example 1:
Consider an anechoic environment without attenuation, for which
\begin{align*}
\text{SIR}_{j}(\mathbf{\Gamma}_{\text{R}})=\sqrt{\frac{M}{\sigma_{v}^{2}}+1},\tag{41} \end{align*}
and for the Euclidean geometry, we have:
\begin{align*} \text{SIR}(\mathbf{\Gamma}_{\text{E}})=\frac{2(M+\sigma_{v}^{2})}{M+2\sigma_{v }^{2}}.\tag{42} \end{align*}
Therefore, in the limit of
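The closed-form expressions (41) and (42) can be reproduced numerically. The setup below is an assumption of this sketch, chosen so that the two expressions are recovered: a single interference source that is active in one of two equal-length segments, unit source powers, ATFs equal to the steering vectors, and mutually orthogonal steering vectors taken as columns of a DFT matrix. All names and values are hypothetical.

```python
import numpy as np

def herm_fun(A, f):
    """Apply the scalar function f to a Hermitian matrix via its eigenvalues."""
    w, V = np.linalg.eigh(0.5 * (A + A.conj().T))
    return (V * f(w)) @ V.conj().T

def geometric_mean(A, B):
    """Riemannian mean of two HPD matrices (closed form, valid for two matrices only)."""
    A_h, A_ih = herm_fun(A, np.sqrt), herm_fun(A, lambda w: w ** -0.5)
    return A_h @ herm_fun(A_ih @ B @ A_ih, np.sqrt) @ A_h

def outer(v):
    return np.outer(v, v.conj())

M, noise = 8, 0.05
F = np.exp(2j * np.pi * np.outer(np.arange(M), np.arange(M)) / M)
d0, d1 = F[:, 0], F[:, 1]            # assumed orthogonal steering vectors, equal to the ATFs

G1 = outer(d0) + outer(d1) + noise * np.eye(M)   # segment with the interference active
G2 = outer(d0) + noise * np.eye(M)               # segment with the interference inactive

def sir(G):
    """Output SIR of the DS beamformer as in (22)."""
    power = lambda d: np.real(d.conj() @ G @ d)
    return power(d0) / power(d1)

sir_R = sir(geometric_mean(G1, G2))
sir_E = sir(0.5 * (G1 + G2))
print("Riemannian:", sir_R, "formula (41):", np.sqrt(M / noise + 1))
print("Euclidean: ", sir_E, "formula (42):", 2 * (M + noise) / (M + 2 * noise))
```

Varying `noise` in this sketch shows how the two SIR values behave as the noise power changes.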
We conclude this analysis with a few remarks. First, we note that
C. Relation to Signal Enhancement
For signal enhancement in reverberant environments, the estimation of the ATF of the desired source is typically required. In our setting, there is no segment at which the desired source is the only active source, and therefore, the ATF estimation is done in the presence of the interference sources. In such a case, the following quantity could be of interest
\begin{align*} \begin{split}\overline{\text{SIR}}_{j}(\mathbf{\Gamma})=\frac{{ \boldsymbol{h}}_{0}^{H}\mathbf{\Gamma}{\boldsymbol{h}}_{0}}{{\boldsymbol{h}}_{j}^{H}\mathbf{\Gamma}{ \boldsymbol{h}}_{j}},
\end{split}\tag{43} \end{align*}
which differs from (22) in the use of the ATFs instead of the steering vectors.
Similarly to Proposition 1, the following Proposition 3 examines the performance in terms of the SIR defined in (43). Here, Assumptions 2-4 are not required, and therefore, the ATFs of the interference sources could be correlated, and the number of sources is not limited by the number of microphones in the array.
Proposition 3:
Under Assumption 1, for all
\begin{align*} \begin{split}\overline{\text{SIR}}_{j}(\mathbf{\Gamma}_{\text{R}}) \geq\overline{\text{SIR}}_{j}(\mathbf{\Gamma}_{\text{E}}).
\end{split}\tag{44} \end{align*}
Another interesting component in signal enhancement is the Relative Transfer Function (RTF) between different microphones [40], [41], [42]. We compute the RTFs with respect to the first microphone, i.e.,
D. The Segments and the Interference Sources Activity
In this section, we investigate the effect of misalignment between the segments and the activity of the interference sources. We consider two interference sources and two segments. We denote by
\begin{align*} \begin{split}\mathbf{\Gamma}_{1}(\alpha) & =\sigma_{0} ^{2}{\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\alpha^{2}\sigma_{1}^{2}{\boldsymbol{h}}_{1}{\boldsymbol{h}}_ {1}^{H}+(1-\alpha)^{2}\sigma_{2}^{2}{\boldsymbol{h}}_{2}{\boldsymbol{h}}_{2}^{H}+\sigma_{v}^{2 }{\boldsymbol{I}} \\ \mathbf{\Gamma}_{2}(\alpha) & =\sigma_{0}^{2}{\boldsymbol{h}}_ {0}{\boldsymbol{h}}_{0}^{H}+(1-\alpha)^{2}\sigma_{1}^{2}{\boldsymbol{h}}_{1}{\boldsymbol{h}}_{1}^{H}+ \alpha^{2}\sigma_{2}^{2}{\boldsymbol{h}}_{2}{\boldsymbol{h}}_{2}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}.
\end{split}\tag{45} \end{align*}
The correlation matrices in (45) depend on
Examining the dependency of the SIR on
Proposition 4:
For any
\begin{align*} \text{SIR}(\mathbf{\Gamma}_{\text{R}}(\alpha))\geq\text{SIR}(\mathbf{\Gamma}_{ \text{E}}(\alpha)).\tag{46} \end{align*}
Proposition 4 states that for every misalignment between the segments and the activity of the interference sources, the Riemannian mean leads to higher SIR in comparison to its Euclidean counterpart. Equality in (46) is obtained for
Empirically, we found that the advantage of the Riemannian mean over the Euclidean mean decreases as the offset between the segments and the activity of the interference sources increases. We leave the question of optimal partitioning of the STFT windows into segments to future work.
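The misalignment model in (45) can be explored numerically with a short sketch. It builds the two segment matrices for a given offset, forms the Euclidean mean and the Riemannian mean of the two matrices (using the closed-form geometric mean of two HPD matrices), and evaluates an ATF-based ratio in the style of (43). Note that Proposition 4 is stated for the SIR in (22), which uses steering vectors; the ATF-based ratio is used here only because it is self-contained. All function names, the random ATFs, and the default variances are assumptions.

```python
import numpy as np

def hpd_power(a, p):
    """Real power of a Hermitian positive-definite matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * w**p) @ v.conj().T

def segment_matrices(alpha, h0, h1, h2, s0=1.0, s1=1.0, s2=1.0, sv=0.1):
    """Gamma_1(alpha) and Gamma_2(alpha) as in (45); s0, s1, s2, sv are variances."""
    def outer(h):
        return np.outer(h, h.conj())
    eye = np.eye(len(h0))
    g1 = s0 * outer(h0) + alpha**2 * s1 * outer(h1) + (1 - alpha)**2 * s2 * outer(h2) + sv * eye
    g2 = s0 * outer(h0) + (1 - alpha)**2 * s1 * outer(h1) + alpha**2 * s2 * outer(h2) + sv * eye
    return g1, g2

def riemannian_mean_two(g1, g2):
    """Affine-invariant mean of two HPD matrices (closed-form geometric mean)."""
    r, r_inv = hpd_power(g1, 0.5), hpd_power(g1, -0.5)
    inner = r_inv @ g2 @ r_inv
    inner = 0.5 * (inner + inner.conj().T)  # remove round-off asymmetry
    return r @ hpd_power(inner, 0.5) @ r

def atf_ratio(gamma, h0, hj):
    """Ratio h0^H Gamma h0 / hj^H Gamma hj, in the style of (43)."""
    return np.real(np.vdot(h0, gamma @ h0)) / np.real(np.vdot(hj, gamma @ hj))

# Random ATFs for M = 4 microphones (illustrative assumption).
rng = np.random.default_rng(0)
h0, h1, h2 = (rng.standard_normal(4) + 1j * rng.standard_normal(4) for _ in range(3))
for alpha in (0.5, 0.75, 1.0):
    g1, g2 = segment_matrices(alpha, h0, h1, h2)
    g_r = riemannian_mean_two(g1, g2)
    g_e = 0.5 * (g1 + g2)
    print(alpha, atf_ratio(g_r, h0, h1), atf_ratio(g_e, h0, h1))
```

At an offset of one half, the two matrices in (45) coincide, so both means reduce to the same matrix and the two ratios are identical.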
Extension to Other DoA Estimation Methods
In this section, we broaden the applicability of the Riemannian approach by demonstrating its incorporation into other beamformers. Each beamformer generates a spatial spectrum, from which the directions of the desired sources are estimated according to the highest peaks in the spectrum.
As a subspace (SbSp) approach, we implement MUSIC [13] in the following way. Given
\begin{align*} {\boldsymbol{P}}_{\text{SbSp}}(\theta;\widehat{\mathbf{\Gamma}})={\boldsymbol{d}}^{H}(\theta){ \boldsymbol{U}}(\widehat{\mathbf{\Gamma}}){\boldsymbol{U}}^{H}(\widehat{\mathbf{\Gamma}}){\boldsymbol{ d}}(\theta).\tag{47} \end{align*}
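A minimal sketch of evaluating (47) on a grid of directions is given below. It assumes that the columns of U are the dominant eigenvectors of the correlation matrix estimate, and it assumes a far-field, half-wavelength uniform linear array for the steering vector; neither the array geometry nor the function names are taken from the text.

```python
import numpy as np

def ula_steering(theta, m):
    """Assumed far-field steering vector of a half-wavelength uniform linear array."""
    return np.exp(-1j * np.pi * np.arange(m) * np.cos(theta))

def sbsp_spectrum(gamma, k, thetas):
    """d^H(theta) U U^H d(theta) as in (47), with U the k dominant eigenvectors."""
    _, v = np.linalg.eigh(gamma)       # eigenvalues are returned in ascending order
    u = v[:, -k:]                      # keep the k dominant eigenvectors
    d = np.stack([ula_steering(t, gamma.shape[0]) for t in thetas], axis=1)  # M x T
    proj = u.conj().T @ d              # K x T projection coefficients U^H d
    return np.sum(np.abs(proj) ** 2, axis=0)

# Toy example: one source at 60 degrees, M = 4 microphones, weak white noise.
thetas = np.linspace(0.0, np.pi, 181)
d0 = ula_steering(thetas[60], 4)
gamma = np.outer(d0, d0.conj()) + 0.01 * np.eye(4)
spectrum = sbsp_spectrum(gamma, 1, thetas)
theta_hat = thetas[np.argmax(spectrum)]
```

The same function can be applied to either the Riemannian or the Euclidean mean of the segment correlation matrices; only the matrix passed as `gamma` changes.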
We note that the appropriate number of eigenvectors
Similarly to the DS beamformer based on the SRP in (18) and (19), the Riemannian and the Euclidean SbSp methods are given by
Our approach is also applicable to the MVDR beamformer [10], whose spectrum is given by
\begin{align*} {\boldsymbol{P}}_{\text{MVDR}}(\theta;\widehat{\mathbf{\Gamma}})=\frac{1}{{\boldsymbol{d}}^{H} (\theta)\widehat{\mathbf{\Gamma}}^{-1}{\boldsymbol{d}}(\theta)}.\tag{48} \end{align*}
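The MVDR spectrum in (48) can be sketched in the same way. The steering vector again assumes a far-field, half-wavelength uniform linear array, which is an illustrative choice and not the array used in the experiments; the function names are hypothetical.

```python
import numpy as np

def ula_steering(theta, m):
    """Assumed far-field steering vector of a half-wavelength uniform linear array."""
    return np.exp(-1j * np.pi * np.arange(m) * np.cos(theta))

def mvdr_spectrum(gamma, thetas):
    """1 / (d^H(theta) Gamma^{-1} d(theta)) as in (48), for every theta on the grid."""
    d = np.stack([ula_steering(t, gamma.shape[0]) for t in thetas], axis=1)  # M x T
    sol = np.linalg.solve(gamma, d)    # Gamma^{-1} d without forming the inverse
    denom = np.real(np.sum(d.conj() * sol, axis=0))
    return 1.0 / denom

# Toy example: one source at 60 degrees, M = 4 microphones.
thetas = np.linspace(0.0, np.pi, 181)
d0 = ula_steering(thetas[60], 4)
gamma = 4.0 * np.outer(d0, d0.conj()) + 0.1 * np.eye(4)
spectrum = mvdr_spectrum(gamma, thetas)
```

Solving the linear system instead of inverting the matrix is a numerical design choice; the correlation matrix estimate must be full rank in either case.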
The typical spectrum of the MVDR beamformer is obtained by using
In principle, many DoA estimation methods that employ the sample correlation matrix could benefit from the proposed approach, even if they are not based on a beamformer. For example, the Bayesian learning method for signal recovery for DoA estimation proposed in [23] employs the sample correlation matrix. Its Riemannian alternative is implemented by following the algorithm in [23], but with the Riemannian mean in place of the Euclidean mean. The empirical results appear in Section VII.
Simulation Results
In this section, we demonstrate the performance of the proposed approach based on Riemannian geometry, and compare it to the Euclidean geometry implicitly assumed by common practice. Additionally, we compare our approach to a heuristic method, based on the intersection of subspaces. The intersection leads to the rejection of non-common components, such as the interference sources subspace, and preserves common components, such as the desired source subspace. We refer to it as the intersection beamformer (see Appendix A for more details).
We consider a reverberant enclosure of dimensions
All the sources are positioned on a
The reverberant room with the microphone array (blue circles), the desired source (red star), and the interference sources (green squares). (a) A 3D view. (b) A 2D view.
We examine the performance of both the DS and the SbSp methods. Algorithm 2 is used for the proposed Riemannian DS, and the common practice is implemented by replacing step
For quantitative evaluation, we use the root mean square error (RMSE) and the accuracy, for which DoA estimation error that is smaller than
\begin{align*} \text{SIR}=\frac{1}{N_{\text{I}}}\sum_{j=1}^{N_{\text{I}}}\frac{P({\theta}^{d}) }{P(\theta^{i}_{j})},\tag{49} \end{align*}
where
\begin{align*} \mathcal{D}(\widehat{\mathbf{\Gamma}})=\frac{P({\theta}^{d})}{\frac{1}{2}\int_ {0}^{\pi}P(\theta)\sin(\theta)d\theta}.\tag{50} \end{align*}
The proposed approach results in inherent interference rejection, which is its greatest merit. The attenuation of the interference sources by our approach allows for accurate DoA estimation even in the presence of strong interference sources. The measure of SIR is indicative of the amount of interference rejection.
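The two measures in (49) and (50) can be computed from a spectrum sampled on a grid of directions, as in the following sketch. It assumes that the spectrum is sampled on an increasing grid over the interval from 0 to pi, that the desired and interference directions fall on grid points, and that the integral in (50) is approximated by the trapezoidal rule. The function names are hypothetical, and the values are returned on a linear scale (not in dB).

```python
import numpy as np

def output_sir(p, idx_desired, idx_interf):
    """Average of P(theta_d) / P(theta_i_j) over the interference directions, as in (49)."""
    return float(np.mean(p[idx_desired] / p[np.asarray(idx_interf)]))

def directivity(p, thetas, idx_desired):
    """P(theta_d) divided by half the integral of P(theta) sin(theta), as in (50)."""
    f = p * np.sin(thetas)
    integral = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(thetas))  # trapezoidal rule
    return float(p[idx_desired] / (0.5 * integral))

# Toy spectra (illustrative assumptions).
thetas = np.linspace(0.0, np.pi, 1801)
flat = np.ones_like(thetas)                       # isotropic response
d_flat = directivity(flat, thetas, 900)           # close to 1 for an isotropic response
sir_toy = output_sir(np.array([1.0, 2.0, 4.0, 8.0]), 3, [0, 1])
```

For an isotropic spectrum the integral of sin over the interval equals 2, so the directivity is 1 up to the discretization error, which gives a quick sanity check of the normalization in (50).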
In the first experiment, we consider two interference sources, each active at a single, but disjoint, segment, resulting in a signal of
We start with an example of the SRP of the DS beamformer (see (18)), computed using
The SRP of the DS beamformer using (a)
Next, we randomly generate
Fig. 3 presents the mean output SIR (a) and the directivity (b) for the DS method using the correlation matrix estimates:
(a) The mean output SIR and (b) the directivity for two interference sources, for the Riemannian and the Euclidean DS method. The x-axis indicates the input SIR, and the y-axis indicates the output SIR. The box indicates the
We see that the Riemannian DS method attains high output SIR values, even for strong interference sources (low input SIR). In contrast, the Euclidean DS method results in relatively low output SIR values. The gap in the output SIR values between the Riemannian DS and the Euclidean DS is up to
Fig. 4 is the same as Fig. 3, but presenting the SbSp method with the addition of the intersection method, which appears in orange. Fig. 4(a) presents the results for the practical implementation that includes estimating the dimension, whereas Fig. 4(b) presents the results for the oracle. We see that the Riemannian approach outperforms its Euclidean counterpart by approximately
The mean output SIR of the SbSp methods. (a) Practical implementation. (b) Using an oracle. The box indicates the
We continue with examining the direction estimation to the desired source. The estimated direction is defined as the direction leading to the maximal value of the SRP, namely
\begin{align*}
\hat{\theta}^{d}=\operatorname*{argmax}_{\theta}P(\theta).\tag{51} \end{align*}
Fig. 5 shows the estimated direction to the desired source for the Riemannian DS method (blue square), the Euclidean DS method (red circle), and the intersection method (orange star). The black solid line marks the true location of the desired source (at
Estimation of the DoA to the desired source for (a) input SIR of
We repeat the experiment for
(a) RMSE and (b) accuracy versus the number of microphones for two interference sources with input SIR of
Next, we examine the sensitivity of the proposed approach to the SNR and the reverberation time. We repeat the setting of the two interferences, as described in the first experiment. The results are presented in Fig. 7. In Fig. 7(a), the mean output SIR for the DS method is presented as a function of the reverberation time for a fixed SNR of
The mean output SIR as a function of (a) the reverberation time and (b) the SNR, for two interference sources. The Riemannian DS appears in blue, whereas the Euclidean DS appears in red. Several input SIR values are presented:
We examine the performance of the MVDR beamformer, given by (48). The MVDR beamformer is popular when interference sources are present, thanks to its distortionless response in the direction of the desired source and its typically narrow beams. However, in our setting, since the desired source is accompanied by interference sources, and the directions to all the sources are unknown, the DoA estimation of the desired source using the MVDR beamformer is outperformed by the DS and the SbSp methods. We also examine the case of two desired sources. The results show similar trends and appear in Appendix B, due to space considerations.
In the second experiment, we examine a multiple interference setting, by considering
(a) Activation map for the
We note that Fig. 8(b) serves as an example of an increased number of segments in comparison to Fig. 3. Since there are on average
We remark that the empirical results of the proposed approach applied to the typically-used beamformers are not sensitive to the number of samples used for the computation of the sample correlation matrix. Since, in the experiments, additional samples also include samples from at least one interference source, adding samples increases the effect of the interference sources, resulting in weak dependency on the number of samples.
Next, we evaluate the performance of the proposed approach applied to the Bayesian learning method proposed in [23] for sparse signal recovery for DoA estimation. This is done by replacing the typically-used (Euclidean) sample correlation matrix with the Riemannian mean of sample correlation matrices computed in short-time segments. We repeat the same experiment as in [23] with
The RMSE of the Bayesian learning method for
Conclusion
We present a Riemannian approach for the design of beamformers and DoA estimation methods for interference rejection in reverberant environments. Specifically, the Riemannian mean of the segment correlation matrices is used in place of the sample correlation matrix for DoA estimation of the desired source, as it inherently rejects the interference sources. We analytically show that the DS beamformer, based on the Riemannian geometry of the HPD manifold, results in a higher output SIR than the typical DS beamformer, which implicitly considers the Euclidean geometry. We extend our approach to other beamformers, such as subspace-based beamformers and the MVDR, as well as a Bayesian learning method, experimentally demonstrating superior output SIR and more accurate DoA estimates in comparison to their Euclidean counterparts.
ACKNOWLEDGMENT
The authors wish to thank the associate editor and the anonymous reviewers for their useful comments which helped to improve this manuscript.
Appendix A
The Intersection Beamformer
Another beamformer we examine is based on the observation that the desired signal subspace is the intersection of subspaces spanned by eigenvectors of the correlation matrix of the different segments. From each segment,
\begin{align*} {\boldsymbol{P}}_{\text{sig}}(\widehat{\mathbf{\Gamma}}_{i})={\boldsymbol{V}}(\widehat{\mathbf {\Gamma}}_{i})\left({\boldsymbol{V}}^{H}(\widehat{\mathbf{\Gamma}}_{i}){\boldsymbol{V}} (\widehat{\mathbf{\Gamma}}_{i})\right)^{-1}{\boldsymbol{V}}^{H}(\widehat{\mathbf{\Gamma }}_{i}).\tag{52} \end{align*}
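The projection in (52) can be sketched as follows, under the assumption that the columns of V are the dominant eigenvectors of the segment correlation matrix (the number of retained eigenvectors, `k`, is a free parameter in this sketch). The function name and the random test matrix are hypothetical.

```python
import numpy as np

def signal_projection(gamma, k):
    """Projection V (V^H V)^{-1} V^H as in (52), with V the k dominant eigenvectors."""
    _, v = np.linalg.eigh(gamma)           # eigenvalues in ascending order
    vk = v[:, -k:]                         # k dominant eigenvectors
    gram = vk.conj().T @ vk                # V^H V (identity for orthonormal columns)
    return vk @ np.linalg.solve(gram, vk.conj().T)

# Random HPD matrix standing in for a segment correlation matrix (assumption).
rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
gamma = a @ a.conj().T / 8
p_sig = signal_projection(gamma, 2)
```

When the columns of V are orthonormal, as eigenvectors returned by an eigendecomposition are, the Gram matrix is the identity and (52) reduces to the product of V and its conjugate transpose; the general form is kept here to mirror the equation.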
Since each desired source is active during all the segments, its ATF,
\begin{align*} {\boldsymbol{P}}_{\text{Intersect}}(\theta)={\boldsymbol{d}}^{H}(\theta){\boldsymbol{P}}_{\text{sig},N _{\text{D}}}{\boldsymbol{P}}_{\text{sig},N_{\text{D}}}^{H}{\boldsymbol{d}}(\theta).\tag{53} \end{align*}
We note that in addition to the dimension of the signal space of the matrix
Appendix B
Additional Experimental Results
We repeat the setting of the first experiment, only with an additional desired source, located at
Output SIR of the different beamformers for two desired sources and two interference sources. (a) The input SIR is
Next, we examine the performance of the Riemannian MVDR and the Euclidean MVDR beamformers, given by (48), for
The mean output SIR for the Riemannian and the Euclidean MVDR beamformers in the presence of two interference sources. The x-axis indicates the input SIR, and the y-axis indicates the output SIR.
We repeat the experiment with speech signals from the TIMIT dataset. For each source, the speaker and the time interval are chosen uniformly at random. Fig. 12 presents the results. We see that the proposed approach leads to improved results in comparison to the typically-used beamformers also for speech signals.
Estimation of the DoA of the desired source for speech signals where (a) the input SIR is -6dB, and (b) the input SIR is
Appendix C
Extension to a Streaming Data Setting
In a streaming data setting, we do not have access in advance to the entire signal, and the direction estimation is updated as more samples become available. Therefore, in this setting, we cannot compute the Riemannian mean using (17). To circumvent this, we turn to the estimator of the Riemannian mean proposed in [37], [46], which is updated after every received segment.
Each segment,
\begin{align*} \hat{{\boldsymbol{R}}}_{i}=\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}(\hat{{\boldsymbol{R}}}_{i-1}^{- \frac{1}{2}}\widehat{\mathbf{\Gamma}}_{i}\hat{{\boldsymbol{R}}}_{i-1}^{-\frac{1}{2}})^ {\frac{1}{i}}\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}.\tag{54} \end{align*}
Algorithm 3 Streaming DoA estimation in the presence of multiple interferences
Input: the current result of the STFT window of the received signal
Output: the estimated direction of the desired source
Set
Repeat
  Accumulate $L_{\text{w}}$ STFT windows (that form a segment)
  Compute its sample correlation matrix, $\widehat{\mathbf{\Gamma}}_{i}$, using (16)
  Compute $\hat{{\boldsymbol{R}}}_{i}=\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}(\hat{{\boldsymbol{R}}}_{i-1}^{-\frac{1}{2}}\widehat{\mathbf{\Gamma}}_{i}\hat{{\boldsymbol{R}}}_{i-1}^{-\frac{1}{2}})^{\frac{1}{i}}\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}$
  Compute $P_{\text{DS}}(\theta;\hat{{\boldsymbol{R}}}_{i})$ using (18)
  Return $\hat{\theta}=\text{argmax}_{\theta}P_{\text{DS}}(\theta;\hat{{\boldsymbol{R}}}_{i})$
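The steps of Algorithm 3 can be sketched in code as follows. Several ingredients are assumptions, since (16) and (18) are not reproduced in this appendix: the sample correlation matrix is taken to be the average of the outer products of the STFT snapshots in the segment, the DS spectrum is taken to be the quadratic form of the steering vector with the correlation matrix estimate, and the steering vector is that of a far-field, half-wavelength uniform linear array. The sketch also assumes that each segment holds at least as many windows as microphones, so that the matrices are invertible. All function names are hypothetical.

```python
import numpy as np

def hpd_power(a, p):
    """Real power of a Hermitian positive-definite matrix via its eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * w**p) @ v.conj().T

def update_riemannian_mean(r_prev, gamma_i, i):
    """One recursion of (54); i is the 1-based segment index."""
    r_half = hpd_power(r_prev, 0.5)
    r_inv_half = hpd_power(r_prev, -0.5)
    inner = r_inv_half @ gamma_i @ r_inv_half
    inner = 0.5 * (inner + inner.conj().T)   # remove round-off asymmetry
    return r_half @ hpd_power(inner, 1.0 / i) @ r_half

def ula_steering(theta, m):
    """Assumed far-field steering vector of a half-wavelength uniform linear array."""
    return np.exp(-1j * np.pi * np.arange(m) * np.cos(theta))

def ds_spectrum(r, thetas):
    """Assumed DS steered response power, d^H(theta) R d(theta)."""
    d = np.stack([ula_steering(t, r.shape[0]) for t in thetas], axis=1)
    return np.real(np.sum(d.conj() * (r @ d), axis=0))

def streaming_doa(windows, l_w, thetas):
    """windows: (L, M) array whose rows are STFT snapshots z(l) at one frequency bin."""
    m = windows.shape[1]
    r = np.eye(m, dtype=complex)   # any HPD start works: for i = 1, (54) returns Gamma_1
    estimates = []
    for i in range(1, len(windows) // l_w + 1):
        seg = windows[(i - 1) * l_w : i * l_w]
        gamma_i = seg.T @ seg.conj() / l_w   # assumed form of the sample correlation (16)
        r = update_riemannian_mean(r, gamma_i, i)
        estimates.append(thetas[np.argmax(ds_spectrum(r, thetas))])
    return estimates
```

For the first segment the exponent in (54) equals one, so the update returns the first segment's correlation matrix irrespective of the initial matrix; the identity is used above only as a convenient placeholder.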
The SRP of the DS beamformer is computed using (18) with
We note that the current correlation matrix estimate has a full rank if
As for the Euclidean alternative, its estimator is updated in finer granularity at every STFT window,
\begin{align*}
{\boldsymbol{E}}_{l}=\frac{n-1}{n}{\boldsymbol{E}}_{l-1}+\frac{1}{n}{\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l).\tag{55} \end{align*}
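A minimal sketch of the recursion in (55) is given below. In the usage example, the weight index is set equal to the window index, which turns the recursion into a plain running average; this choice is an assumption made for illustration, and the function name is hypothetical.

```python
import numpy as np

def update_euclidean_mean(e_prev, z, n):
    """One step of (55): E_l = (n - 1)/n * E_{l-1} + 1/n * z z^H."""
    return (n - 1) / n * e_prev + np.outer(z, z.conj()) / n

# Toy stream of 10 snapshots from M = 3 microphones (illustrative assumption).
rng = np.random.default_rng(2)
snapshots = rng.standard_normal((10, 3)) + 1j * rng.standard_normal((10, 3))
e = np.zeros((3, 3), dtype=complex)
for l, z in enumerate(snapshots, start=1):
    e = update_euclidean_mean(e, z, n=l)   # n = l gives a running average
```

With this choice, the estimate after the last window coincides with the batch sample correlation matrix of all the windows received so far.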
Setting
Appendix D
On the Particular Choice of the Riemannian Metric
In this work, we consider the Affine Invariant metric (also called Fisher Information metric) [47]. Another commonly-used metric in the space of HPD matrices is the Log-Euclidean metric [48], which could be viewed as a computationally efficient local approximation of the Affine Invariant metric. The induced Log-Euclidean distance is given by
\begin{align*} d_{\text{LE}}^{2}(\mathbf{\Gamma}_{1},\mathbf{\Gamma}_{2})=\|\log(\mathbf{ \Gamma}_{1})-\log(\mathbf{\Gamma}_{2})\|_{F}^{2}.\tag{56} \end{align*}
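The distance in (56) can be computed with an eigendecomposition-based matrix logarithm, as sketched below. The sketch also includes the mean induced by this distance, which has the standard closed form of the matrix exponential of the average of the matrix logarithms; this closed form is a known property of the Log-Euclidean metric rather than something stated in the text above, and the function names are hypothetical.

```python
import numpy as np

def hpd_log(a):
    """Matrix logarithm of a Hermitian positive-definite matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.log(w)) @ v.conj().T

def herm_exp(a):
    """Matrix exponential of a Hermitian matrix."""
    w, v = np.linalg.eigh(a)
    return (v * np.exp(w)) @ v.conj().T

def log_euclidean_dist_sq(g1, g2):
    """Squared Log-Euclidean distance as in (56)."""
    return float(np.linalg.norm(hpd_log(g1) - hpd_log(g2), "fro") ** 2)

def log_euclidean_mean(mats):
    """Closed-form Log-Euclidean mean: exp of the average of the matrix logarithms."""
    return herm_exp(sum(hpd_log(g) for g in mats) / len(mats))

# Toy diagonal example (illustrative assumption).
g1 = np.diag([1.0, 4.0])
g2 = np.diag([4.0, 1.0])
dist_sq = log_euclidean_dist_sq(g1, g2)
g_le = log_euclidean_mean([g1, g2])
```

For commuting matrices, such as the diagonal pair above, the Log-Euclidean mean reduces to the elementwise geometric mean of the eigenvalues.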
In the context of this work, the Affine Invariant metric is advantageous over the Log-Euclidean metric because it better enhances the desired source subspace relative to the interference and noise subspace, as we show next. We remark that the derivation is similar to the derivation in Section V, where we demonstrate the advantage of the Affine Invariant metric over Euclidean geometry. First, we note that the following holds [49]:
\begin{align*}
\text{tr}(\mathbf{\Gamma}_{\text{R}})\leq\text{tr}(\mathbf{\Gamma}_{\text{LE}}),\tag{57} \end{align*}
where
Lemma 2:
Let
\begin{align*} \begin{split}{\boldsymbol{h}}_{0}^{H}\mathbf{\Gamma}_{\text{R}}{\boldsymbol{h}}_{ 0}={\boldsymbol{h}}_{0}^{H}\mathbf{\Gamma}_{\text{LE}}{\boldsymbol{h}}_{0}={\boldsymbol{h}}_{0}^{H} \mathbf{\Gamma}_{\text{E}}{\boldsymbol{h}}_{0}=\|{\boldsymbol{h}}_{0}\|^{2}\lambda_{0},
\end{split}\tag{58} \end{align*}
where
The rest of the eigenvectors of the correlation matrices span the interference and noise subspace. We recall that
\begin{align*} \frac{\lambda_{0}(\mathbf{\Gamma}_{\text{R}})}{\sum_{i=1}^{M-1}\lambda_{i} (\mathbf{\Gamma}_{\text{R}})}\geq\frac{\lambda_{0}(\mathbf{\Gamma}_{\text{LE}}) }{\sum_{i=1}^{M-1}\lambda_{i}(\mathbf{\Gamma}_{\text{LE}})},\tag{59} \end{align*}
which follows from Lemma 2 and (57).
We see that the Riemannian mean induced by the Affine Invariant metric better captures the desired signal subspace than both the Riemannian mean induced by the Log-Euclidean metric and the Euclidean mean (according to (24)), which is an advantage in SbSp methods, for example.
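The trace relation in (57) can be spot-checked numerically for the case of two matrices, where the Affine Invariant mean has a closed form (the geometric mean of two HPD matrices). The sketch below is only an illustration on random matrices, not a proof; the function names and the way the random HPD matrices are generated are assumptions.

```python
import numpy as np

def hpd_fun(a, fun):
    """Apply a scalar function to the eigenvalues of a Hermitian matrix."""
    w, v = np.linalg.eigh(a)
    return (v * fun(w)) @ v.conj().T

def affine_invariant_mean_two(g1, g2):
    """Closed-form Affine Invariant (geometric) mean of two HPD matrices."""
    r = hpd_fun(g1, np.sqrt)
    r_inv = hpd_fun(g1, lambda w: 1.0 / np.sqrt(w))
    inner = r_inv @ g2 @ r_inv
    inner = 0.5 * (inner + inner.conj().T)   # remove round-off asymmetry
    return r @ hpd_fun(inner, np.sqrt) @ r

def log_euclidean_mean_two(g1, g2):
    """Log-Euclidean mean of two HPD matrices: exp of the averaged logarithms."""
    return hpd_fun(0.5 * (hpd_fun(g1, np.log) + hpd_fun(g2, np.log)), np.exp)

def random_hpd(rng, m):
    """Random HPD matrix with a small diagonal loading (illustrative assumption)."""
    a = rng.standard_normal((m, 2 * m)) + 1j * rng.standard_normal((m, 2 * m))
    return a @ a.conj().T / (2 * m) + 0.1 * np.eye(m)

rng = np.random.default_rng(3)
g1, g2 = random_hpd(rng, 4), random_hpd(rng, 4)
tr_r = np.real(np.trace(affine_invariant_mean_two(g1, g2)))
tr_le = np.real(np.trace(log_euclidean_mean_two(g1, g2)))
print(tr_r, tr_le)
```

When the two matrices commute, the two means coincide and the traces are equal, so a gap between the printed values reflects the non-commutativity of the inputs.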