Loading [MathJax]/extensions/TeX/boldsymbol.js
On Interference-Rejection Using Riemannian Geometry for Direction of Arrival Estimation | IEEE Journals & Magazine | IEEE Xplore

On Interference-Rejection Using Riemannian Geometry for Direction of Arrival Estimation

CodeAvailable
Open Access

Abstract:

We consider the problem of estimating the direction of arrival of desired acoustic sources in the presence of multiple acoustic interference sources. All the sources are ...Show More

Abstract:

We consider the problem of estimating the direction of arrival of desired acoustic sources in the presence of multiple acoustic interference sources. All the sources are located in noisy and reverberant environments and are received by a microphone array. We propose a new approach for designing beamformers and DoA estimation methods based on the Riemannian geometry of the manifold of Hermitian positive definite matrices. Specifically, we show theoretically that incorporating the Riemannian mean of the spatial correlation matrices into frequently-used beamformers gives rise to spatial spectra that reject the directions of interference sources and result in a higher signal-to-interference ratio. We experimentally demonstrate the advantages of our approach in designing several beamformers and a recent DoA estimation method in the presence of simultaneously active multiple interference sources.
Published in: IEEE Transactions on Signal Processing ( Volume: 72)
Page(s): 260 - 274
Date of Publication: 12 October 2023

ISSN Information:

Funding Agency:


CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.
SECTION I.

Introduction

Estimation of the direction of arrival (DoA) of an acoustic source is prevalent in signal processing; it is an important step in many tasks, such as source localization, beamforming, source separation, spectrum sensing, and speech enhancement [1], to name but a few. Despite the large research attention it has drawn in the past decades, acoustic DoA estimation is still considered a challenging open problem. Especially in noisy and reverberant environments and in the presence of interference sources, it continues to be an active research field.

Acoustic source localization, and particularly DoA estimation, are often addressed using beamforming [2]. Many beamformers have been proposed over the years for these tasks. One class of beamformers is based on the steered response power (SRP) of a beamformer output. For example, considering the maximum likelihood criterion for a single source, the output power of the beamformer from all the directions is computed, and the DoA is identified as the direction with the maximal power [3], [4], [5], [6]. Another example is the Minimum Variance Distortionless Response (MVDR) beamformer [7], [8], [9], which was first introduced by Capon [10]. The MVDR beamformer extracts the DoA of each of the existing sources, maintaining a unit gain at their direction while minimizing the response from other directions. An important generalization of the MVDR beamformer is the Linearly Constrained Minimum Variance (LCMV) beamformer [11], obtained by minimizing the output power under multiple linear constraints, and can be used for DoA estimation as well [12]. Another line of beamformers is derived based on a subspace approach, i.e., by identifying the subspace of the desired sources, which is assumed to contain only a small portion of the noise and the interference sources. A prominent subspace method, which is also used for DoA estimation, is MUltiple Signal Classification (MUSIC) [13], [14], [15], [16].

A notable algorithm for acoustic source localization is the Steered-Response Power Phase Transform (SRP-PHAT) algorithm proposed in [17]. In SRP-PHAT, the phase transform is used to normalize the different frequencies, such that only their phase information is considered. This allows for the fusion of the different frequencies when considering a broadband signal. A popular time-domain implementation of the SRP-PHAT is the generalized cross-correlation with phase transform (GCC-PHAT) proposed in [18], which normalizes each cross-correlation using the phase transform. In recent years, sparse signal recovery methods have been proposed for DoA estimation [19], [20], [21]. In particular, one approach for solving the sparse recovery problem is the sparse Bayesian learning approach proposed in [22], which was adopted for the DoA estimation problem as well, for example in [23], [24].

In this paper, we consider DoA estimation in a reverberant enclosure consisting of desired sources along with interference sources. We assume that the desired sources are constantly active, whereas the interference sources are only intermittently active. The number of sources, their locations, and their times of activity are all unknown. Consequently, their identification as desired or interference is unknown as well. The power of the different sources is also unknown, and the interference sources could, in fact, be stronger than the desired sources with overlapping activity periods. Our goal is to estimate the DoA of the desired sources in the presence of possibly simultaneously active, multiple interference sources.

This setting poses a major challenge to the common practice in existing methods that rely on maximal power because estimating the DoA of the strongest sources might result in distinct beams in the direction of the interference sources rather than the desired sources. Furthermore, these beams could mask the beams pointing at the directions of desired sources.

We address this challenge from a geometric standpoint. Our approach relies on the observation that the frequently-used beamformers implicitly consider Euclidean geometry when processing sample correlation matrices. Therefore, since the sample correlation matrices are Hermitian Positive Definite (HPD) matrices, important geometric information is not fully utilized. Instead, we propose a new approach for beamforming design that is based on the Riemannian geometry of the manifold of HPD matrices [25], [26], [27]. Concretely, we analyze the received signal in short time windows and consider the Riemannian mean [25] of the sample correlation matrices in these windows. Then, we leverage particular spectral properties of the Riemannian mean. In [28], it was shown that the Riemannian mean of HPD matrices preserves shared spectral components and attenuates unshared spectral components. Consequently, the continual activity of the desired sources and the intermittent activity of the interference sources enable us to associate desired sources with shared spectral components and interference sources with unshared spectral components. By combining the above, we show that the incorporation of the Riemannian mean into the beamformer design leads to interference rejection, i.e., gives rise to a spatial spectrum that implicitly rejects the beams pointing at the interference sources and preserves the beams pointing at the desired sources. The resulting spectrum is, in turn, used for the estimation of the DoA of the desired sources. Importantly, our approach is applicable to a large number of beamformers used for DoA estimation.

By incorporating our Riemannian approach, we present new implementations of several beamformers: the Delay and Sum (DS) beamformer, subspace-based beamformers, and the MVDR beamformer, as well as the Bayesian learning method for DoA estimation proposed in [23].

The main contributions of this article are as follows. First, we observe that despite being commonly-used in array processing in general, and for DoA estimation in particular, the HPD structure of the signal correlation matrices is not fully exploited. Here, we identify that by dividing the computation of the sample correlation matrix into short-time segments, it can be viewed as the Euclidean mean of the sample correlation matrices of the segments. This observation enables the introduction of the Riemannian geometry of HPD matrices, and in particular, it allows us to introduce a rather simple approach that promotes the use of the Riemannian mean instead of the Euclidean mean and to show its multiple merits. Second, we demonstrate that the proposed approach can naturally be applied to a broad range of DoA estimation methods that use the signal sample correlation matrix, and we show that it is computationally efficient in the sense that it typically does not change the order of the computational complexity of the DoA estimation methods. Third, we theoretically analyze the proposed approach and show that in the case of the DS beamformer, it results in higher SIR values that lead to better DoA estimation accuracy. In addition, we present a noise sensitivity analysis. Fourth, we present empirical results in adverse conditions that include simultaneously active multiple interference sources. We showcase the applicability of our approach to both classical and recent DoA estimation methods and demonstrate that the obtained performance improvement is by a large margin.

We conclude the introduction with three remarks. First, a similar setting to ours, consisting of desired sources accompanied by interference sources, was considered in [29] and [30] but in the context of signal enhancement. In [29], a single desired source and a single interference source were considered, and in [30], multiple desired sources and multiple interference sources were considered. However, in both works, it was assumed that there is at least one segment for each source, desired or interference, in which it is the only active source. Furthermore, in [30], the number of the desired sources and their activity patterns were assumed to be known. Second, in the context of radar, the Riemannian geometry of the Toeplitz HPD matrices was used in [31] and [32] for target detection by comparing Riemannian distances to a threshold. In the radar settings, [33] estimated the correlation matrix as a linear combination of correlation matrices, with weights that are based on the Riemannian distance. Third, in this article, we demonstrate the Riemannian approach for designing beamformers that reject interference sources for DoA estimation. However, other applications, e.g., signal enhancement, could also benefit from spectra that reject interference sources.

This paper is organized as follows. In Section II, we present a brief background on the HPD manifold. In Section III, we formulate the problem and the setting. In Section IV we describe the proposed approach and present the algorithm for DoA estimation. In Section V, we provide a theoretical analysis of the proposed approach for the DS beamformer. In Section VI, extensions of the approach to other DoA estimation methods are presented. Section VII shows simulation results demonstrating our Riemannian approach. Lastly, we conclude the work in Section VIII.

SECTION II.

Background on the HPD Manifold

An HPD matrix, \mathbf{\Gamma}\in{\mathbb{C}}^{n\times n}, is a Hermitian matrix, i.e. \mathbf{\Gamma}=\mathbf{\Gamma}^{H}, where \mathbf{\Gamma}^{H} is the conjugate transpose of \mathbf{\Gamma}, whose real eigenvalues are strictly positive. Associating the space of HPD matrices with the Affine Invariant metric [34] constitutes a Riemannian manifold, \mathcal{M}. The distance between two matrices \mathbf{\Gamma}_{1} and \mathbf{\Gamma}_{2}, induced by the Affine Invariant metric, is given by \begin{align*} d^{2}_{\text{R}}(\mathbf{\Gamma}_{1},\mathbf{\Gamma}_{2})=\Big{\|}\log\left (\mathbf{\Gamma}_{2}^{-\frac{1}{2}}\mathbf{\Gamma}_{1}\mathbf{\Gamma}_{2}^{- \frac{1}{2}}\right)\Big{\|}_{F}^{2},\tag{1} \end{align*} View SourceRight-click on figure for MathML and additional features.

where \|\cdot\|_{F} is the Frobenius norm.

The tangent space to a point on a manifold \mathbf{\Gamma}\in\mathcal{M} is a vector space that can be viewed as a local approximation of the manifold around \mathbf{\Gamma}. For the HPD manifold, the tangent space around each \mathbf{\Gamma}\in\mathcal{M} is the vector space of Hermitian matrices of dimensions n\times n.

The Riemannian mean, \mathbf{\Gamma}_{\text{R}}, of a set of point \{\mathbf{\Gamma}_{i}|\mathbf{\Gamma}_{i}\in\mathcal{M}\} is defined by the Fréchet mean as the point minimizing the sum of the distances from all the points in the set as follows \begin{align*} \mathbf{\Gamma}_{\text{R}}\equiv\operatorname*{arg min}_{\mathbf{\Gamma}\in \mathcal{M}}\sum_{i}d^{2}_{\text{R}}(\mathbf{\Gamma},\mathbf{\Gamma}_{i}).\tag{2} \end{align*} View SourceRight-click on figure for MathML and additional features.

In general, there is no closed-form expression for the Riemannian mean of more than two matrices, and a solution can be found using an iterative procedure [35], described in Algorithm 1.

SECTION Algorithm 1

Riemannian mean for the HPD manifold [36]

Input: a set of K HPD matrices \{\mathbf{\Gamma}_{j}\}_{j=1}^{K}

Output: the Riemannian mean \mathbf{\Gamma}_{\text{R}}

1:

Compute {\mathbf{\Gamma}_{\text{R}}}=\frac{1}{K}\sum_{j=1}^{K}\mathbf{\Gamma}_{j}

2:

do

  1. Compute the Euclidean mean in the tangent plane: \boldsymbol{\overline{P}}=\frac{1}{K}\sum_{j=1}^{K}\text{Log}_{{\mathbf{\Gamma }_{\text{R}}}}(\mathbf{\Gamma}_{j})

  2. Update {\mathbf{\Gamma}_{\text{R}}}=\text{Exp}_{\mathbf{\Gamma}_{\text{R}}} (\boldsymbol{\overline{P}})

  3. Stop if \|\overline{P}\|_{\text{F}}<\epsilon

The computation of the Riemannian mean requires two maps. The Logarithm map, which maps an HPD matrix \mathbf{\Gamma}_{i}\in\mathcal{M} to the tangent space of the HPD manifold at \mathbf{\Gamma}, is given by \begin{align*} \text{Log}_{\mathbf{\Gamma}}(\mathbf{\Gamma}_{i})=\mathbf{\Gamma}^{\frac{1}{2} }\log(\mathbf{\Gamma}^{-\frac{1}{2}}\mathbf{\Gamma}_{i}\mathbf{\Gamma}^{-\frac {1}{2}})\mathbf{\Gamma}^{\frac{1}{2}}.\tag{3} \end{align*} View SourceRight-click on figure for MathML and additional features.

The Exponential map, which maps a vector {\boldsymbol{T}} from the tangent space at \mathbf{\Gamma}, is given by \begin{align*} \text{Exp}_{\mathbf{\Gamma}}({\boldsymbol{T}})=\mathbf{\Gamma}^{\frac{1}{2}}\exp (\mathbf{\Gamma}^{-\frac{1}{2}}{\boldsymbol{T}}\mathbf{\Gamma}^{-\frac{1}{2}})\mathbf{ \Gamma}^{\frac{1}{2}}.\tag{4} \end{align*} View SourceRight-click on figure for MathML and additional features.

We note that there exists an efficient iterative estimator for the Riemannian mean [37]. It is described in Appendix C. A discussion about the choice of the Affine Invariant metric appears in Appendix D.

SECTION III.

Problem Formulation

We consider the problem of localizing N_{\text{D}} desired sources in the presence of N_{\text{I}} interference sources. All the sources are static and located in a reverberant environment. The signals are received at a noisy microphone array of M microphones, which are positioned in known, but possibly arbitrary, positions. The acoustic environment between each source and each microphone is modeled by the Acoustic Impulse Response (AIR). The signal at the mth microphone, considering a far-field setting, is given by \begin{align*} z_{m}(n)&=\sum_{j=1}^{N_{\text{D}}}s_{j}^{d}(n)*h_{jm}^{d}(n)+ \sum_{j=1}^{N_{\text{I}}}s_{j}^{i}(n)*h_{jm}^{i}(n) \\ &\quad + v_{m}(n),\tag{5} \end{align*} View SourceRight-click on figure for MathML and additional features.

where s_{j}^{d}(n) is the jth desired source, s_{j}^{i}(n) is the jth interference source, and v_{m}(n) is the mth microphone noise. We denote by h_{jm}^{d}(n) the AIR between the mth microphone and the jth desired source. Similarly, we denote by h_{jm}^{i}(n) the AIR between the mth microphone and the jth interference source.

The sources are characterized by their activation times. The desired sources are active during the entire interval. In contrast, the interference sources are only partially active, namely active during segments of the interval. We emphasize that albeit we consider intermittent interference sources, longer signals might consist of a larger number of interference sources in addition to more occurrences of active interference sources. We do not assume there is a segment with only the desired source. Longer intervals do not necessarily increase the probability of such a segment since new interference sources could emerge in such a scenario. Consequently, even a long signal still poses a significant challenge to the DoA estimation. Additionally, we assume that the desired sources, the interference sources, and the noise are all uncorrelated. The noise is assumed to be spatially white. We note that these assumptions are made for simplicity. Empirically, the proposed approach leads to superior results for correlated signals and for noise with arbitrary covariance matrix as well.

The received signal is processed using the Short-Time Fourier Transform (STFT). We denote by s_{j}^{d}(l,k) the STFT at the lth window and the kth frequency of s_{j}^{d}(n). The notation for s_{j}^{i}(l,k), h_{jm}^{d}(l,k), h_{jm}^{i}(l,k), and v_{m}(l,k) follows similarly. Then, the STFT at the lth window and the kth frequency of the received signal is given by \begin{align*} z_{m}(l,k)&=\sum_{j=1}^{N_{\text{D}}}s_{j}^{d}(l,k)h_{jm}^{d}(l,k) +\sum_{j=1}^{N_{\text{I}}}s_{j}^{i}(l,k)h_{jm}^{i}(l,k) \\ &\quad +v_{m}(l,k),\tag{6} \end{align*} View SourceRight-click on figure for MathML and additional features.

where we assume the length of the window is much larger than the AIR length.

We stack the received signals, \{z_{m}(l,k)\}_{m}, from all the microphones to obtain a column vector {\boldsymbol{z}}(l,k)\in{\mathbb{C}}^{M\times 1} \begin{align*} \begin{split}{\boldsymbol{z}}(l,k)=[z_{1}(l,k)\;\;...\;\;z_{M}(l,k)]^{ \top}. \end{split}\tag{7} \end{align*} View SourceRight-click on figure for MathML and additional features.

Its explicit expression is \begin{align*} \begin{split}{\boldsymbol{z}}(l,k)={\boldsymbol{H}}^{d}(l,k){\boldsymbol{s}}^{d}(l,k)+{ \boldsymbol{H}}^{i}(l,k){\boldsymbol{s}}^{i}(l,k)+{\boldsymbol{v}}(l,k), \end{split}\tag{8} \end{align*} View SourceRight-click on figure for MathML and additional features.

where {\boldsymbol{s}}^{d}(l,k) and {\boldsymbol{s}}^{i}(l,k) denote the stacked STFT representations of the desired sources and the interference sources, respectively, and are given by \begin{align*} \begin{split}{\boldsymbol{s}}^{d}(l,k) & =[s^{d}_{1}(l,k)\;\;...\;\;s^{d}_{N_{\text{D}}}(l,k)]^{\top} \\ {\boldsymbol{s}}^{i}(l,k) & =[s^{i}_{1}(l,k)\;\;...\;\;s^{i}_{ N_{\text{I}}}(l,k)]^{\top}, \end{split}\tag{9} \end{align*} View SourceRight-click on figure for MathML and additional features.

and the noise term is \begin{align*} \begin{split}{\boldsymbol{v}}(l,k)=[v_{1}(l,k)\;\;...\;\;v_{M}(l,k)]^{ \top}. \end{split}\tag{10} \end{align*} View SourceRight-click on figure for MathML and additional features.

The Acoustic Transfer Functions (ATFs) from the jth desired source and the jth interference source to the microphone array are \begin{align*} \begin{split}{\boldsymbol{h}}_{j}^{d}(l,k) & =[h_{j1}^{d}(l,k) \;\;...\;\;h_{jM}^{d}(l,k)]^{\top}\;\;\;\;j=1,...,N_{\text{D}} \\ {\boldsymbol{h}}_{j}^{i}(l,k) & =[h_{j1}^{i}(l,k)\;\;...\;\;h_ {jM}^{i}(l,k)]^{\top}\;\;\;\;j=1,...,N_{\text{I}}, \end{split}\tag{11} \end{align*} View SourceRight-click on figure for MathML and additional features.

and in a matrix form \begin{align*} \begin{split}{\boldsymbol{H}}^{d}(l,k) & =[{\boldsymbol{h}}_{1}^{d}(l, k)\;\;...\;\;{\boldsymbol{h}}_{{N}_{\text{D}}}^{d}(l,k)] \\ {\boldsymbol{H}}^{i}(l,k) & =[{\boldsymbol{h}}_{1}^{i}(l,k)\;\;...\;\; {\boldsymbol{h}}_{{N}_{\text{I}}}^{i}(l,k)]. \\ \end{split}\tag{12} \end{align*} View SourceRight-click on figure for MathML and additional features.

Henceforth, we focus on a single frequency bin and omit the frequency index. Throughout the article, we refer to {\boldsymbol{z}}(l) as the received signal. Since all the sources are static the ATFs do not change over time, so in the following, we omit their STFT window index l.

In this work, we focus on a single frequency to demonstrate the Riemannian approach, which is suitable for narrowband signals. For broadband signals, this is the formulation of only a single frequency. We note that the fusion of the different frequencies is of great importance, and normalization techniques for the different beamformers have been proposed [38] and can be applied after our method.

Our goal is to estimate the direction to the desired sources, given {\boldsymbol{z}}(l),\;\;l=1,\ldots,L_{\text{STFT}}, where L_{\text{STFT}} is the number of STFT windows. The main challenge is the existence of interference sources, positioned at unknown locations with possibly high signal power.

SECTION IV.

Proposed Approach

Typically, the DoA estimation of a desired source is based on the output of a beamformer. In this section, we present the proposed approach applied to the Delay-and-Sum (DS) beamformer. In Section VI, we extend the proposed approach to other DoA estimation methods.

We consider arbitrary indexing of the microphones in the array and designate the first microphone as the reference microphone. Let {\boldsymbol{d}}(\theta) denote the steering vector of the array to direction \theta relative to the first (reference) microphone, which is given by \begin{align*} {\boldsymbol{d}}(\theta)=[1,e^{j\phi_{2}(\theta)},...,e^{j\phi_{M}(\theta)}]^{\top},\tag{13} \end{align*} View SourceRight-click on figure for MathML and additional features.

where \phi_{m}(\theta) is the phase of the received signal at the mth microphone with respect to the first microphone. For example, for a uniform linear array and the typical microphone indexing, we have \phi_{m}(\theta)=2\pi\cdot m\frac{\delta}{\lambda}\sin\theta, where \lambda is the wavelength of the received signal, and \delta is the distance between the microphones.

The SRP of the DS beamformer is given by \begin{align*} P_{\text{DS}}(\theta;\mathbf{\Gamma})={\boldsymbol{d}}^{H}(\theta)\mathbf{\Gamma}{\boldsymbol{ d}}(\theta),\tag{14} \end{align*} View SourceRight-click on figure for MathML and additional features.

where \begin{align*} \mathbf{\Gamma}=\mathbb{E}[{\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l)]\tag{15} \end{align*} View SourceRight-click on figure for MathML and additional features.

is the population covariance matrix. Since the desired source is constantly active and assumed to be at a fixed location during the entire interval, we estimate the population correlation matrix \mathbf{\Gamma} using the sample correlation matrix \widehat{\mathbf{\Gamma}} by averaging {\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l) over multiple STFT windows. We divide the STFT of the signal into L_{\text{s}} disjoint segments, each consisting of L_{\text{w}} consecutive STFT windows, i.e., L_{\text{STFT}}=L_{\text{s}}\cdot L_{\text{w}} (for more details regarding the partitioning see Section V-D). Then, the sample correlation matrix over each segment is computed as follows \begin{align*} \begin{split}\widehat{\mathbf{\Gamma}}_{i}=\frac{1}{L_{\text{w}}} \sum_{l=(i-1)\cdot L_{\text{w}}+1}^{i\cdot L_{\text{w}}}{\boldsymbol{z}}(l){\boldsymbol{z}}^{H }(l), \\ \end{split}\tag{16} \end{align*} View SourceRight-click on figure for MathML and additional features.

where i is the segment index. To obtain a full rank correlation matrix, the number of STFT windows is set to be larger than the dimension of the matrix (the number of microphones), i.e., L_{\text{w}}\geq M.

The incorporation of the Riemannian geometry is realized by viewing each matrix \widehat{\mathbf{\Gamma}}_{i} as a point on the HPD manifold [34] and considering their Riemannian mean, denoted as \widehat{\mathbf{\Gamma}}_{\text{R}} and given by \begin{align*} \widehat{\mathbf{\Gamma}}_{\text{R}}=\operatorname*{argmin}_{\mathbf{\Gamma} \in\mathcal{M}}\sum_{i=1}^{L_{\text{s}}}d_{\text{R}}^{2}(\mathbf{\Gamma}, \widehat{\mathbf{\Gamma}}_{i}).\tag{17} \end{align*} View SourceRight-click on figure for MathML and additional features.

In general, there is no closed-form solution to (17) on the HPD manifold for more than two points [35]. Therefore, Algorithm 1 proposed in [36] is used to compute the Riemannian mean of the L_{\text{s}} correlation matrices.

Once \widehat{\mathbf{\Gamma}}_{\text{R}} is at hand, the SRP of the DS beamformer, P_{\text{DS}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{R}}), is computed by \begin{align*} P_{\text{DS}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{R}})={\boldsymbol{d}}^{H}(\theta) \widehat{\mathbf{\Gamma}}_{\text{R}}{\boldsymbol{d}}(\theta).\tag{18} \end{align*} View SourceRight-click on figure for MathML and additional features.

In the case of a single desired source and assuming the direct path is dominant in the AIR, the direction to it is set as the direction achieving the maximum value of the SRP of the DS beamformer, i.e., \begin{align*} \hat{\theta}=\operatorname*{argmax}_{\theta}P_{\text{DS}}(\theta;\widehat{ \mathbf{\Gamma}}_{\text{R}}).\tag{19} \end{align*} View SourceRight-click on figure for MathML and additional features.

We note that (19) is used since the DoA estimation is based on a single STFT frequency bin. In the case of N_{\text{D}} desired sources, the N_{\text{D}} directions are set according to the N_{\text{D}} strongest lobes in the SRP. The algorithm for a single desired source is described in Algorithm 2.

SECTION Algorithm 2

Direction estimation in the presence of multiple interference sources

Input: the received signal in the STFT domain \{{\boldsymbol{z}}(l)\}_{l=1}^{L_{\text{STFT}}}

Output: the estimated direction of the desired source \hat{\theta}

1:

Divide \{{\boldsymbol{z}}(l)\}_{l=1}^{L_{\text{STFT}}} into L_{\text{s}} consecutive segments

2:

For each segment i, compute the sample correlation matrix \widehat{\mathbf{\Gamma}}_{i} using (16)

3:

Compute \mathbf{\Gamma}_{\text{R}} of the set \{\widehat{\mathbf{\Gamma}}_{i}\}_{i=1}^{L_{\text{S}}} using Algorithm 1

4:

Compute P_{\text{DS}}(\theta;\mathbf{\Gamma}_{\text{R}}) according to (18)

5:

Return \hat{\theta}=\text{argmax}_{\theta}P_{\text{DS}}(\theta;\mathbf{\Gamma}_{\text {R}})

As a baseline, we consider the common practice of the computation of the SRP of the DS beamformer, which is typically based on the sample correlation matrix over the entire interval, i.e., P_{\text{DS}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{E}}), where \begin{align*} \widehat{\mathbf{\Gamma}}_{\text{E}}=\frac{1}{L_{\text{STFT}}}\sum_{l=1}^{L_{ \text{STFT}}}{\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l).\tag{20} \end{align*} View SourceRight-click on figure for MathML and additional features.

We observe that computing the Euclidean mean of the sample correlation matrices per segment, \{\widehat{\mathbf{\Gamma}}_{i}\}_{i=1}^{L_{\text{s}}}, results in \widehat{\mathbf{\Gamma}}_{\text{E}} in (20). So, \widehat{\mathbf{\Gamma}}_{\text{E}} in (20) is the Euclidean counterpart of \widehat{\mathbf{\Gamma}}_{\text{R}}, the sample correlation matrix resulting from the Riemannian approach.

We will show that our Riemannian approach exploits the assumption that the desired source is constantly active and at a fixed location, whereas the interference sources are intermittent. More specifically, we will show both theoretically in Section V and empirically in Section VII that the Riemannian mean attenuates the intermittent interferences while preserving the constantly active sources. In contrast, the standard Euclidean mean accumulates all the sources, and as a result, the main lobe could deviate from the direction of a desired source, and even focus on an interference source.

We show in Section V and Section VII that our proposed approach results in a SRP that rejects the interference sources, allowing the beamformer to extract the DoA of the desired sources.

We remark that the proposed approach only requires that the desired sources are the only sources active during the entire interval. The rank of the signal matrix is not known nor needs to be estimated. The number of interference sources is unknown as well. This is by virtue of the Riemannian mean. Unlike other works (e.g. [30]), we do not need to know the activation times of each interference, nor the number of interference sources. Furthermore, we do not assume that there exists a segment, at which a desired source is the only active source, namely, it could always be accompanied by interference sources.

In terms of complexity, the Riemannian approach requires the computation of the Riemannian mean of the correlation matrices, which is more complex than the Euclidean mean. However, the excess complexity depends on the number of microphones, which is typically not high compared to the complexity of computing the correlation matrix which depends on the number of signal samples. Consequently, the excess complexity is negligible. In particular, following the iterative computation of the Riemannian mean in Algorithm 3 in Appendix C, at each iteration, the computation involves the eigenvalue decomposition of two complex matrices of dimension M\times M, which is O(M^{3}), and the computation of 6 matrix products with the complexity of O(M^{2}). So, at each iteration, the overall excess complexity is O(M^{3}). We recall that for N samples, the complexity of computing the sample correlation matrix is O(M^{2}N) which is much larger than O(M^{3}) since typically N>>M. Empirically, we found that Algorithm 3 converges very quickly after several iterations. To summarize, the proposed approach leads to improved performance with only negligible additional computational cost. In Appendix C, we present the estimator along with an implementation for the streaming data setting.

To evaluate the performance of the proposed approach, we define the output Signal to Interference Ratio (SIR) as follows: \begin{align*} \begin{split}\text{SIR}_{j}(\widehat{\mathbf{\Gamma}})=\frac{P (\theta^{d};\widehat{\mathbf{\Gamma}})}{P(\theta^{i}_{j};\widehat{\mathbf{ \Gamma}})}, \end{split}\tag{21} \end{align*} View SourceRight-click on figure for MathML and additional features.

where P(\theta;\widehat{\mathbf{\Gamma}}) is the SRP computed using the sample correlation matrix \widehat{\mathbf{\Gamma}}, \theta^{d} is the direction of a desired source, and \theta^{i}_{j} is the direction of the jth interference. When using the DS beamformer, the output SIR becomes \begin{align*} \begin{split}\text{SIR}_{j}(\widehat{\mathbf{\Gamma}})=\frac{{\boldsymbol{d}}^{H}(\theta^{d})\widehat{\mathbf{\Gamma}}{\boldsymbol{d}}(\theta^{d})}{{\boldsymbol{d}}^{H }(\theta^{i}_{j})\widehat{\mathbf{\Gamma}}{\boldsymbol{d}}(\theta^{i}_{j})}. \end{split}\tag{22} \end{align*} View SourceRight-click on figure for MathML and additional features.

This measure of performance is used because the main challenge in this setting is the presence of interference sources rather than the microphone noise.

SECTION V.

Analysis

In this section, we analyze the proposed approach which is based on Riemannian geometry and compare it to its Euclidean counterpart. The proofs of the statements appear in the Supplementary Material (SM). In the analysis, we consider the population correlation matrix of the received signal, neglecting the estimation errors stemming from the finite sample in a segment. We note that sections IV, VI, and VII consider the sample correlation matrices, and only Section V considers the population correlation matrix.

We begin with a short derivation, demonstrating that Riemannian geometry preserves better the desired source subspace in comparison to Euclidean geometry. Consider a single desired source and assume its ATF, denoted by {\boldsymbol{h}}_{0}, is a common eigenvector of all the correlation matrices per segment associated with the same eigenvalue (this assumption is made formal in Assumption 1 in the sequel). In this case, according to Lemma 2 (see Appendix D), {\boldsymbol{h}}_{0} is an eigenvector of both means and it is associated with the same eigenvalue, namely \lambda_{0}(\mathbf{\Gamma}_{\text{R}})=\lambda_{0}(\mathbf{\Gamma}_{\text{E}}), where \lambda_{i}(\mathbf{\Gamma}) is the ith eigenvalue of \mathbf{\Gamma}. In addition, all other eigenvectors span the interference and noise subspace. Under the aforementioned assumption, and by the following known property of the Riemannian mean [39] \begin{align*} \mathbf{\Gamma}_{\text{R}}\preceq\mathbf{\Gamma}_{\text{E}}.\tag{23} \end{align*} View SourceRight-click on figure for MathML and additional features.

We get that \begin{align*} \frac{\lambda_{0}(\mathbf{\Gamma}_{\text{R}})}{\sum_{i=1}^{M-1}\lambda_{i} (\mathbf{\Gamma}_{\text{R}})}\geq\frac{\lambda_{0}(\mathbf{\Gamma}_{\text{E}})} {\sum_{i=1}^{M-1}\lambda_{i}(\mathbf{\Gamma}_{\text{E}})},\tag{24} \end{align*} View SourceRight-click on figure for MathML and additional features.

due to the equality of the numerators and (23). The inequality in (24) implies that the desired source subspace is more dominant than the subspace of the interference and noise in the Riemannian mean compared to the Euclidean mean.

In the remainder of this section, we extend this analysis and present additional results.

A. Assumptions

To make the analysis tractable, we consider a single desired source and multiple interference sources. Therefore, we simplify the notations by omitting the superscripts (\cdot)^{i} and (\cdot)^{d} associated with the interference sources and the desired sources and setting the index of the desired source to 0.

For the purpose of analysis, we make the following assumptions.

Assumption 1:

{\boldsymbol{h}}_{0}^{H}{\boldsymbol{h}}_{j}=0,\;\;\;\forall j=1,...,N_{\text{ I}}.

Assumption 2:

{\boldsymbol{h}}_{l}^{H}{\boldsymbol{h}}_{j}=0,\;\;\;\forall l\neq j.

It follows from Assumption 1 and Assumption 2 that the ATFs, associated with the desired source and the interference sources are all uncorrelated. These are common assumptions, e.g., see [30]. The assumptions are made only for analysis purposes, whereas in the experimental results, we consider acoustic signals in a reverberant environment without any assumptions. We note that we do not assume there exists a segment at which only one of the sources is active (e.g., as in [30]). In case an interference source is only partially active during a segment, we consider it active during the entire segment.

The population correlation matrix of the ith segment is given by \begin{align*} \begin{split}\mathbf{\Gamma}_{i}=\sigma_{0}^{2}{\boldsymbol{h}}_{0}{\boldsymbol{h }}_{0}^{H}+{\boldsymbol{H}}\Lambda_{i}{\boldsymbol{H}}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}_{M\times M}, \end{split}\tag{25} \end{align*} View SourceRight-click on figure for MathML and additional features.

where \sigma_{0}^{2} and \sigma_{v}^{2} are the power of the desired source and the power of the noise, respectively. The diagonal matrix \Lambda_{i} captures the signal power of the interference sources and is given by: \begin{align*} \begin{split}\Lambda_{i}=\text{diag}\left(\sigma_{1}^{2}(i)\cdot \mathcal{I}_{i\in\mathcal{L}_{1}},\;...\;,\sigma_{N_{\text{I}}}^{2}(i)\cdot \mathcal{I}_{i\in\mathcal{L}_{N_{\text{I}}}}\right), \end{split}\tag{26} \end{align*} View SourceRight-click on figure for MathML and additional features.

where {\boldsymbol{H}}={\boldsymbol{H}}^{i}(l,k) due to the omission of the indices and \sigma_{j}^{2}(i)=\mathbb{E}[|s_{j}(n)|^{2}|n\in i\text{th segment}] is the expected signal power of the jth interference source at the ith segment, \mathcal{L}_{j} is the set of segments at which the jth interference source is active, and \mathcal{I}_{i\in\mathcal{L}_{j}} is an indicator function, attaining the value of 1 when the jth interference is active during the ith segment and 0 otherwise. We denote by \tau_{j}=\frac{|\mathcal{L}_{j}|}{L_{\text{s}}} the relative number of segments during which the jth interference source is active. We assume the same expected power at all the segments in the interval, i.e., \sigma_{j}^{2}(i)=\sigma_{j}^{2}\; for all j=1,\ldots,N_{\text{I}} and i=1,\ldots,L_{\text{s}}.

We continue with defining the Signal to Noise Ratio (SNR) at the mth microphone as \begin{align*} \begin{split}\text{SNR}{{}_{m}}=\frac{\sigma_{0}^{2}{|{\boldsymbol{h}}_{0 }[m]|^{2}}}{\sigma_{v}^{2}}, \end{split}\tag{27} \end{align*} View SourceRight-click on figure for MathML and additional features.

where |{\boldsymbol{h}}_{0}[m]|^{2} is the attenuation of the signal due to the acoustic channel between the desired source and the mth microphone. Since we focus on a single frequency bin, (27) is the narrowband SNR.

To capture the correlation between the steering vectors and the ATFs, we define \begin{align*} \begin{split}\rho_{rs}=\frac{|\langle{\boldsymbol{d}}_{r},{\boldsymbol{h}}_{s} \rangle|^{2}}{\|{\boldsymbol{d}}_{r}\|^{2}\cdot\|{\boldsymbol{h}}_{s}\|^{2}}=\frac{|\langle{ \boldsymbol{d}}_{r},{\boldsymbol{h}}_{s}\rangle|^{2}}{M\|{\boldsymbol{h}}_{s}\|^{2}}, \end{split}\tag{28} \end{align*} View SourceRight-click on figure for MathML and additional features.

where r,s=0,1,2,...,N_{\text{I}}, indicating the desired source or an interference source.

We conclude the preliminaries of the analysis with two additional assumptions.

Assumption 3:

\rho_{rr} is fixed \forall r, and \rho_{rs} is fixed \forall r\neq s.

Assumption 3 implies that the correlation between the ATFs and the steering vectors depends only on whether they are associated with the same source or not. Following Assumption 3, henceforth we denote \kappa=\rho_{rr} and \rho=\rho_{rs} for r\neq s.

Assumption 4:

\kappa>\rho.

Assumption 4 is typically made in the context of source localization. It implies that the correlation between a steering vector to a source and the ATF associated with that source is higher than the correlation between a steering vector to a source and the ATF associated with a different source.

B. Main Results

Our first result states that the output SIR (22) of the Riemannian-based DS beamformer is higher than the output SIR of the Euclidean-based DS beamformer.

Proposition 1:

For every interference source j, the following holds \begin{align*} \text{SIR}_{j}(\mathbf{\Gamma}_{\text{R}}) > \text{SIR}_{j}(\mathbf{\Gamma}_{ \text{E}}),\;\;\;\tag{29} \end{align*} View SourceRight-click on figure for MathML and additional features.

for any number of microphones in the array.

Examining the dependency of the output SIR on the noise power \sigma_{v}^{2} leads to the following result.

Proposition 2:

If \begin{align*} \begin{split}\sigma_{0}^{2}\|{\boldsymbol{h}}_{0}\|^{2}\geq\sigma_{j}^{2} \tau_{j}\|{\boldsymbol{h}}_{j}\|^{2},\;\;\forall j, \end{split}\tag{30} \end{align*} View SourceRight-click on figure for MathML and additional features.

then \begin{align*} \frac{\partial}{\partial\sigma_{v}^{2}}\text{SIR}_{j}(\mathbf{\Gamma}_{\text{R }}) < \frac{\partial}{\partial\sigma_{v}^{2}}\text{SIR}_{j}(\mathbf{\Gamma}_{ \text{E}}) < 0.\tag{31} \end{align*} View SourceRight-click on figure for MathML and additional features.

Namely, the lower the noise power is the higher the output SIR is, and the improvement in \text{SIR}_{j}(\mathbf{\Gamma}_{\text{R}}) is greater than the improvement in \text{SIR}_{j}(\mathbf{\Gamma}_{\text{E}}). Since we established that the Riemannian approach is better than the Euclidean one in terms of the SIR in Proposition 1, Proposition 2 implies that increasing the SNR further increases the gap between the two approaches. Nevertheless, it also indicates that the performance of the Riemannian approach in terms of the SIR is more sensitive to noise compared to the Euclidean counterpart. Note that this statement holds under condition (30), which implies that the received power of the desired source is stronger than the received power of each interference source, considering the attenuation stemming from the activity duration. See more details in Section I-C in the SM.

The proofs of Proposition 1 and Proposition 2 rely on the following lemma, which is important in its own right.

Lemma 1:

The Riemannian or the Euclidean mean of the population correlation matrices of the segments (25) over the entire interval can be written in the same parametric form as \begin{align*} \begin{split}\mathbf{\Gamma}=\sigma_{0}^{2}{\boldsymbol{h}}_{0}{\boldsymbol{h}}_{ 0}^{H}+\sum_{j=1}^{N_{\text{I}}}\mu^{2}_{j}{\boldsymbol{h}}_{j}{\boldsymbol{h}}_{j}^{H}+\sigma _{v}^{2}{\boldsymbol{I}}. \end{split}\tag{32} \end{align*} View SourceRight-click on figure for MathML and additional features.

The Riemannian mean \mathbf{\Gamma}_{\text{R}} is obtained by setting the parameters \mu_{j} to \begin{align*} \mu_{j}^{2}=\frac{(\sigma_{j}^{2}\|{\boldsymbol{h}}_{j}\|^{2}+\sigma_{v}^{2})^{\tau_{j }}(\sigma_{v}^{2})^{1-\tau_{j}}-\sigma_{v}^{2}}{\|{\boldsymbol{h}}_{j}\|^{2}},\tag{33} \end{align*} View SourceRight-click on figure for MathML and additional features.

and the Euclidean mean \mathbf{\Gamma}_{\text{E}} is obtained by setting \begin{align*} \mu_{j}^{2}=\sigma_{j}^{2}\tau_{j}.\tag{34} \end{align*} View SourceRight-click on figure for MathML and additional features.

We note that only assumptions 1 and 2 are necessary for this lemma to hold. In addition, we note that if the interference sources are always active, i.e., |\mathcal{L}_{j}|=L_{\text{s}},\;\forall j, it holds that \mathbf{\Gamma}_{\text{R}}=\mathbf{\Gamma}_{\text{E}}.

Lemma 1 shows that both \mathbf{\Gamma}_{\text{R}} and \mathbf{\Gamma}_{\text{E}}, i.e., the population correlation matrices of the segments in (25), can be decomposed into three terms associated with the desired source, the interference sources, and the noise. By (32), the desired source term and the noise term (the first and third terms) are the same in both the Riemannian and the Euclidean means. In contrast, the coefficients in (33) and (34) imply that the amplitude of the interference sources term (the second term) depends on the used geometry. By further inspecting the expressions of \{\mu_{j}^{2}\}, we see that the interference attenuation using the Riemannian geometry in (33) is more involved than its Euclidean counterpart in (34), depending not only on the interference power and the duration of activity but also on the noise power and the corresponding ATF.

Furthermore, considering \mu_{j} in (34), the condition (30) could be viewed as the dominance of the desired source after the attenuation of the Euclidean mean. Discussion about the condition (30) for \mu_{j} in the Riemannian case in (33) appears in Section I-C in the SM.

Next, we examine a family of correlation matrices that pertain to the same parametric form as in (32) in Lemma 1, i.e., \begin{align*} \mathbf{\Gamma}_{\boldsymbol{a}}={\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\sum_{j=1}^{N_{\text{I}} }a_{j}{\boldsymbol{h}}_{j}{\boldsymbol{h}}_{j}^{H}+\sigma_{v}^{2}{\boldsymbol{I}},\tag{35} \end{align*} View SourceRight-click on figure for MathML and additional features.

for some coefficients {\boldsymbol{a}}=[a_{1},a_{2},...,a_{N_{\text{I}}}]. Without loss of generality, we set the coefficient of {\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H} to 1. We note that \mathbf{\Gamma}_{\boldsymbol{a}} is in accordance with (25).

For any j, we have that \begin{align*} \mathbf{\Gamma}_{\text{opt}}\equiv\operatorname*{argmax}_{\mathbf{\Gamma}_{ \boldsymbol{a}}}\text{SIR}_{j}(\mathbf{\Gamma}_{\boldsymbol{a}})={\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H }+\sigma_{v}^{2}{\boldsymbol{I}},\tag{36} \end{align*} View SourceRight-click on figure for MathML and additional features.

where {\boldsymbol{a}}=\mathbf{0}. Consequently, \begin{align*} \begin{split}\text{SIR}_{j}(\mathbf{\Gamma}_{\text{opt}})=\frac{{ \boldsymbol{d}}_{0}^{H}({\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}){\boldsymbol{d}}_{ 0}}{{\boldsymbol{d}}_{j}^{H}({\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}){\boldsymbol{ d}}_{j}}. \end{split}\tag{37} \end{align*} View SourceRight-click on figure for MathML and additional features.

Considering vanishing noise, i.e., when the noise power approaches zero, the following result stems from Lemma 1 by considering the limit \lim_{\sigma_{v}^{2}\rightarrow 0}\mu_{j}^{2}=0 using (33).

Corollary 1:

\begin{align*} \begin{split}\lim_{\sigma_{v}^{2}\rightarrow 0}\mathbf{\Gamma}_{ \text{R}}=\mathbf{\Gamma}_{\text{opt}}. \end{split}\tag{38} \end{align*} View SourceRight-click on figure for MathML and additional features.

According to Corollary 1, the Riemannian mean approaches the optimal correlation matrix as the noise becomes negligible. By adding a condition on the presence of the interference sources, from Lemma 1 and (33) we also have the following.

Corollary 2:

For any interference source j, if \tau_{j}<\frac{1}{2}, then \begin{align*} \begin{split}\lim_{\sigma_{j}^{2}\rightarrow\infty,\sigma_{v}^{2} \rightarrow 0}\mathbf{\Gamma}_{\text{R}}=\mathbf{\Gamma}_{\text{opt}}. \end{split}\tag{39} \end{align*} View SourceRight-click on figure for MathML and additional features.

Additionally, if \tau_{j}<\frac{1}{2} for all j, then \begin{align*} \begin{split}\lim_{\sigma_{j}^{2}\rightarrow\infty\forall j=1, \ldots,N_{\text{I}},\sigma_{v}^{2}\rightarrow 0}\mathbf{\Gamma}_{\text{R}}= \mathbf{\Gamma}_{\text{opt}}. \end{split}\tag{40} \end{align*} View SourceRight-click on figure for MathML and additional features.

Corollary 2 implies that for vanishing noise, even when all the interference sources have infinite power, the desired source is still the dominant source in the SRP of the DS beamformer using the Riemannian mean. Following (37) it holds that \lim_{\sigma_{j}^{2}\rightarrow\infty,\sigma_{v}^{2}\rightarrow 0}\text{SIR}_{ j}(\mathbf{\Gamma}_{\text{opt}})=\frac{\kappa}{\rho}>1 for all j. For the Euclidean mean it holds that \lim_{\sigma_{j}^{2}\rightarrow\infty,\sigma_{v}^{2}\rightarrow 0}\text{SIR}_{ j}(\mathbf{\Gamma}_{\text{E}})=\frac{\rho}{\kappa}<1<\frac{\kappa}{\rho}=\lim_ {\sigma_{j}^{2}\rightarrow\infty,\sigma_{v}^{2}\rightarrow 0}\text{SIR}_{j} (\mathbf{\Gamma}_{\text{opt}}) for all j. Note that we consider noise power approaching zero rather than strictly zero, because when \sigma_{v}^{2}=0, the correlation matrix is singular, and therefore, lies outside the HPD manifold. Additionally, in practice, noise is always present.

To illustrate the obtained expressions for the Riemannian and the Euclidean SIR, we present the following simple example.

Example 1:

Consider an anechoic environment without attenuation, for which \kappa=1 and \rho=0, and two interference sources. Each interference source is active at a different segment, i.e., L_{\text{s}}=2, \mathcal{L}_{1}=\{1\}, and \mathcal{L}_{2}=\{2\}. All the sources have the same power. In this setting, for the Riemannian geometry, we have \begin{align*} \text{SIR}_{j}(\mathbf{\Gamma}_{\text{R}})=\sqrt{\frac{M}{\sigma_{v}^{2}}+1},\tag{41} \end{align*} View SourceRight-click on figure for MathML and additional features.

and for the Euclidean geometry, we have: \begin{align*} \text{SIR}(\mathbf{\Gamma}_{\text{E}})=\frac{2(M+\sigma_{v}^{2})}{M+2\sigma_{v }^{2}}.\tag{42} \end{align*} View SourceRight-click on figure for MathML and additional features.

Therefore, in the limit of \sigma_{v}^{2}\rightarrow 0, or M\rightarrow\infty, we have \text{SIR}(\mathbf{\Gamma}_{\text{R}})\rightarrow\infty, whereas \text{SIR}(\mathbf{\Gamma}_{\text{E}})\approx 2.

We conclude this analysis with a few remarks. First, we note that \mathbf{\Gamma}_{\text{E}} leads to the ML estimator by taking \hat{\theta}_{0}=\operatorname*{argmax}_{\theta}P_{\text{DS}}(\theta;\mathbf{ \Gamma}_{\text{E}}) for the interference-free setting. In this case, the Riemannian mean coincides with the Euclidean mean, i.e., the proposed method coincides with the ML estimator. The main advantage of the proposed method lies in attenuating the interference sources while preserving the desired source. Second, under assumptions 1 and 2, the number of sources, both desired and interference, are limited by the number of microphones in the array, i.e., N_{\text{I}}+N_{\text{D}}<M. In Section V-C we alleviate Assumption 2, which removes this restriction. Third, following the same techniques in the proof of Proposition 1 and Proposition 2, similar results are derived for an alternative definition of the SIR: \text{SIR}_{\text{tot}}(\mathbf{\Gamma})\equiv\frac{{\boldsymbol{d}}_{0}^{H} \mathbf{\Gamma}{\boldsymbol{d}}_{0}}{\sum_{j=1}^{N_{\text{I}}}{\boldsymbol{d}} _{j}^{H}\mathbf{\Gamma}{\boldsymbol{d}}_{j}}, which captures the ratio between the desired source and the sum of all interference sources. See Section II in the SM for more details.

C. Relation to Signal Enhancement

For signal enhancement in reverberant environments, the estimation of the ATF of the desired source is typically required. In our setting, there is no segment at which the desired source is the only active source, and therefore, the ATF estimation is done in the presence of the interference sources. In such a case, the following quantity could be of interest \begin{align*} \begin{split}\overline{\text{SIR}}_{j}(\mathbf{\Gamma})=\frac{{ \boldsymbol{h}}_{0}^{H}\mathbf{\Gamma}{\boldsymbol{h}}_{0}}{{\boldsymbol{h}}_{j}^{H}\mathbf{\Gamma}{ \boldsymbol{h}}_{j}}, \end{split}\tag{43} \end{align*} View SourceRight-click on figure for MathML and additional features.

which is different than (22) in the use of the ATFs instead of the steering vectors.

Similarly to Proposition 1, the following Proposition 3 examines the performance in terms of the SIR defined in (43). Here, assumptions 2-​4 are not required, and therefore, the ATFs of the interference sources could be correlated, and the number of sources is not limited by the number of microphones in the array.

Proposition 3:

Under Assumption 1, for all j we have \begin{align*} \begin{split}\overline{\text{SIR}}_{j}(\mathbf{\Gamma}_{\text{R}}) \geq\overline{\text{SIR}}_{j}(\mathbf{\Gamma}_{\text{E}}). \end{split}\tag{44} \end{align*} View SourceRight-click on figure for MathML and additional features.

Another interesting component in signal enhancement is the Relative Transfer Function (RTF) between different microphones [40], [41], [42]. We compute the RTFs with respect to the first microphone, i.e., \frac{{\boldsymbol{h}}_{j}}{{\boldsymbol{h}}_{j}(1)}. Since the RTFs are proportional to the ATFs, assumption 1 holds for the RTFs, and therefore, all the derived results apply to the RTFs as well.

D. The Segments and the Interference Sources Activity

In this section, we investigate the effect of misalignment between the segments and the activity of the interference sources. We consider two interference sources and two segments. We denote by \alpha the offset between the segments and the activity of the interference sources. For simplicity, we consider alternately active interference sources. Suppose the first interference source is active during \alpha\in[0,1] of the first segment and during 1-\alpha of the second segment, and suppose the second interference source is active during 1-\alpha and \alpha of the first and second segments, respectively. In this case, the population correlation matrices of the two segments are given by \begin{align*} \begin{split}\mathbf{\Gamma}_{1}(\alpha) & =\sigma_{0} ^{2}{\boldsymbol{h}}_{0}{\boldsymbol{h}}_{0}^{H}+\alpha^{2}\sigma_{1}^{2}{\boldsymbol{h}}_{1}{\boldsymbol{h}}_ {1}^{H}+(1-\alpha)^{2}\sigma_{2}^{2}{\boldsymbol{h}}_{2}{\boldsymbol{h}}_{2}^{H}+\sigma_{v}^{2 }{\boldsymbol{I}} \\ \mathbf{\Gamma}_{2}(\alpha) & =\sigma_{0}^{2}{\boldsymbol{h}}_ {0}{\boldsymbol{h}}_{0}^{H}+(1-\alpha)^{2}\sigma_{1}^{2}{\boldsymbol{h}}_{1}{\boldsymbol{h}}_{1}^{H}+ \alpha^{2}\sigma_{2}^{2}{\boldsymbol{h}}_{2}{\boldsymbol{h}}_{2}^{H}+\sigma_{v}^{2}{\boldsymbol{I}}. \end{split}\tag{45} \end{align*} View SourceRight-click on figure for MathML and additional features.

The correlation matrices in (45) depend on \alpha, and as a result, their Riemannian mean \mathbf{\Gamma}_{\text{R}}(\alpha) and their Euclidean mean \mathbf{\Gamma}_{\text{E}}(\alpha) depend on \alpha as well.

Examining the dependency of the SIR on \alpha leads to the following result.

Proposition 4:

For any \alpha\in[0,1], we have \begin{align*} \text{SIR}(\mathbf{\Gamma}_{\text{R}}(\alpha))\geq\text{SIR}(\mathbf{\Gamma}_{ \text{E}}(\alpha)).\tag{46} \end{align*} View SourceRight-click on figure for MathML and additional features.

Proposition 4 states that for every misalignment between the segments and the activity of the interference sources, the Riemannian mean leads to higher SIR in comparison to its Euclidean counterpart. Equality in (46) is obtained for \alpha=\frac{1}{2}, which means 50\% offset. In this case, it holds that \mathbf{\Gamma}_{1}=\mathbf{\Gamma}_{2}, and both means are the same.

Empirically, we found that the advantage of the Riemannian mean over the Euclidean mean decreases as the offset between the segments and the activity of the interference sources increases. We leave the question of optimal partitioning of the STFT windows into segments to future work.

SECTION VI.

Extension to Other DoA Estimation Methods

In this section, to broaden its applicability, we demonstrate the incorporation of the Riemannian approach in other beamformers. Each beamformer generates a spatial spectrum from which the directions to the desired sources are estimated according to the highest peaks in the spectrum.

As a subspace (SbSp) approach, we implement MUSIC [13] in the following way. Given N_{\text{D}} desired sources, we take the leading N_{\text{D}} eigenvectors of the sample correlation matrix, \widehat{\mathbf{\Gamma}}, and construct the signal subspace matrix, {\boldsymbol{U}}(\widehat{\mathbf{\Gamma}})\in{\mathbb{C}}^{M\times N_{\text{D }}}, whose columns are the {N}_{\text{D}} eigenvectors. Then, the SbSp spectrum is defined by \begin{align*} {\boldsymbol{P}}_{\text{SbSp}}(\theta;\widehat{\mathbf{\Gamma}})={\boldsymbol{d}}^{H}(\theta){ \boldsymbol{U}}(\widehat{\mathbf{\Gamma}}){\boldsymbol{U}}^{H}(\widehat{\mathbf{\Gamma}}){\boldsymbol{ d}}(\theta).\tag{47} \end{align*} View SourceRight-click on figure for MathML and additional features.

We note that the appropriate number of eigenvectors N_{\text{D}} (the dimension of the subspace) needs to be estimated.

Similarly to the DS beamformer based on the SRP in (18) and (19), the Riemannian and the Euclidean SbSp methods are given by {\boldsymbol{P}}_{\text{SbSp}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{R}}) and {\boldsymbol{P}}_{\text{SbSp}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{E}}), respectively, according to (47). The SbSp method could also benefit from our Riemannian approach. Recalling assumptions 1 and 2 and the structure of the mean correlation matrices \mathbf{\Gamma}_{\text{R}} and \mathbf{\Gamma}_{\text{E}} in (32), we see that {\boldsymbol{h}}_{0} is an eigenvector of both \mathbf{\Gamma}_{\text{R}} and \mathbf{\Gamma}_{\text{E}}, spanning the signal subspace (we assume a single desired source for simplicity). Moreover, the vectors \{{\boldsymbol{h}}_{j}\} are also eigenvectors of \mathbf{\Gamma}_{\text{R}} and \mathbf{\Gamma}_{\text{E}}, spanning the subspace of the interference sources. SbSp methods typically focus on the principal eigenvector. Following (32), the principal eigenvector is determined according to the largest coefficient among \sigma_{0} and \{\mu_{j}\}_{j=1}^{N_{\text{I}}}. The parameter \mu_{j} in (33) for the Riemannian mean \mathbf{\Gamma}_{\text{R}} is smaller than \mu_{j} in (34) for the Euclidean mean \mathbf{\Gamma}_{\text{E}}, as proven in Proposition 1. As a result, the signal subspace is more dominant relative to the interference subspace when considering the Riemannian mean in comparison to the Euclidean mean, implying better results for the Riemannian SbSp approach. For an interference source with sufficiently high power, the leading eigenvector of \mathbf{\Gamma}_{\text{E}} could span the interference subspace rather than the signal subspace, whereas the leading eigenvector of \mathbf{\Gamma}_{\text{R}} spans the signal subspace. In Section VII, we demonstrate these SbSp methods empirically and show the advantage of the Riemannian approach.

Our approach is also applicable to the MVDR beamformer [10], whose spectrum is given by \begin{align*} {\boldsymbol{P}}_{\text{MVDR}}(\theta;\widehat{\mathbf{\Gamma}})=\frac{1}{{\boldsymbol{d}}^{H} (\theta)\widehat{\mathbf{\Gamma}}^{-1}{\boldsymbol{d}}(\theta)}.\tag{48} \end{align*} View SourceRight-click on figure for MathML and additional features.

The typical spectrum of the MVDR beamformer is obtained by using \widehat{\mathbf{\Gamma}}_{\text{E}} in (48), namely {\boldsymbol{P}}_{\text{MVDR}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{E}}). We propose to use the Riemannian mean by setting \widehat{\mathbf{\Gamma}} to be \widehat{\mathbf{\Gamma}}_{\text{R}} in (48) to obtain {\boldsymbol{P}}_{\text{MVDR}}(\theta;\widehat{\mathbf{\Gamma}}_{\text{R}}).

In principle, many DoA estimation methods that employ the sample correlation matrix could potentially benefit from the proposed approach, even if it is not based on a beamformer. For example, the Bayesian learning method for signal recovery for DoA estimation proposed in [23] employs the sample correlation matrix. Its Riemannian alternative is implemented by following their algorithm only with the Riemannian mean instead of the Euclidean mean. The empirical results appear in Section VII.

SECTION VII.

Simulation Results

In this section, we demonstrate the performance of the proposed approach based on Riemannian geometry, and compare it to Euclidean geometry, implicitly considered by the common practice1. Additionally, we compare our approach to a heuristic method, based on the intersection of subspaces. The intersection leads to the rejection of non-common components, such as the interference sources subspace, and preserves common components, such as the desired source subspace. We refer to it as the intersection beamformer (see Appendix A for more details).

We consider a reverberant enclosure of dimensions 5\text{m}\times 4\text{m}\times 3.5\text{m} consisting of a microphone array with \text{M}=12 microphones. The AIRs between the different sources and the array are generated based on the image method [43], as implemented by the simulator in [44]. The sampling frequency is 16\text{KHz}, and the length of the AIRs is set to 2048 samples. The received signal is transformed to the time-frequency domain using STFT with a window size of 1024 samples with 50\% overlap. We test all methods using a single frequency bin, of index 250, chosen according to the microphone spacing, which is 4.36\text{cm}. Here, the correlation matrix estimation is based on 16 STFT windows, which results in a segment duration of 1.024\text{s}. The emitted signals are generated as white Gaussian noise. The desired source is constantly active, whereas the interference sources are active only intermittently.

All the sources are positioned on a 140^{\circ} arc of radius 2\text{m} from the center of the array on the XY plane. The heights of the interference sources vary randomly, uniformly distributed between 0.5\text{m} and 3\text{m}. The height of the desired source is set to 1.8\text{m}. Fig. 1 presents the room layout, where the (two) interferences are marked by green squares, the desired source by a red star, and the microphone array by blue circles. The leftmost microphone is positioned at (2.0436\text{m},1\text{m},2\text{m}), and the rest are positioned 4.36\text{cm} apart along the x-axis. While we focus on this specific configuration, we note that we tested other configurations that yielded similar results.

Fig. 1. - The reverberant room with the microphone array (blue circles), the desired source (red star), and the interference sources (green squares). (a) A 3D view. (b) A 2D view.
Fig. 1.

The reverberant room with the microphone array (blue circles), the desired source (red star), and the interference sources (green squares). (a) A 3D view. (b) A 2D view.

We examine the performance of both the DS and the SbSp methods. Algorithm 2 is used for the proposed Riemannian DS, and the common practice is implemented by replacing step 3 in Algorithm 2 with (20). The SbSp methods require knowing the dimension of the signal space of the mean correlation matrix. For the intersection method only, the dimension of the signal space of each segment is also required. Since the desired source may not be the strongest source received at the array, we need to consider all the active sources, and not merely the strongest one when estimating the dimension of the signal subspace. To estimate the dimension, we implement a heuristic algorithm, based on the spectral gap. We consider the dimension of the signal space to be the number of eigenvalues higher than a threshold. For all methods, the threshold for the mean correlation matrix is the mean value plus the standard deviation of the eigenvalues (normalized to a unit sum). For the intersection method, the threshold for the sample correlation matrix of each segment is 1.5 times the mean of the eigenvalues of the sample correlation matrix. Apart from this practical implementation, we also present results for an oracle implementation, assuming the dimension of the signal space is perfectly known.

For quantitative evaluation, we use the root mean square error (RMSE) and the accuracy, for which DoA estimation error that is smaller than 3^{\circ} (arbitrarily chosen) is considered accurate. In addition, we use two other metrics. The first is the mean of the empirical output SIR with respect to all the interference sources, which is given by \begin{align*} \text{SIR}=\frac{1}{N_{\text{I}}}\sum_{j=1}^{N_{\text{I}}}\frac{P({\theta}^{d}) }{P(\theta^{i}_{j})},\tag{49} \end{align*} View SourceRight-click on figure for MathML and additional features.

where P is the spectrum computed using the evaluated method. The second metric is the directivity [45, ch.2], which is given by \begin{align*} \mathcal{D}(\widehat{\mathbf{\Gamma}})=\frac{P({\theta}^{d})}{\frac{1}{2}\int_ {0}^{\pi}P(\theta)\sin(\theta)d\theta}.\tag{50} \end{align*} View SourceRight-click on figure for MathML and additional features.

The proposed approach results in inherent interference rejection, which is its greatest merit. The attenuation of the interference sources by our approach allows for accurate DoA estimation even in the presence of strong interference sources. The measure of SIR is indicative of the amount of interference rejection.

In the first experiment, we consider two interference sources, each active at a single, but disjoint, segment, resulting in a signal of 2.048\text{s}. The reverberation time is set to 150\text{ms}, and the SNR is {2}0\text{dB}.

We start with an example of the SRP of the DS beamformer (see (18)), computed using \widehat{\mathbf{\Gamma}}_{\text{R}} and \widehat{\mathbf{\Gamma}}_{\text{E}}, which is presented at the top of Fig. 2(a). The SRP of the Riemannian DS beamformer is shown in solid blue and the SRP of the Euclidean DS beamformer is shown in dashed red. Both SRPs are in a dB (log) scale. The directions to the desired source and the interference sources are represented by a solid black line and a dashed black line, respectively. We see that by using \widehat{\mathbf{\Gamma}}_{\text{R}} the main lobe is directed towards the desired source. In contrast, the SRP using \widehat{\mathbf{\Gamma}}_{\text{E}} is peaked at two different directions, none of which is the direction to the desired source. The bottom of Fig. 2(b) is the same as Fig. 2(a), only with the SRP computed using the sample correlation matrix of each of the two segments, presented in different shades of orange. Even though the main lobes of the two SRPs are not pointing toward the desired source, the Riemannian mean leads to a SRP with the main lobe directed at the desired source, whereas the lobes to other directions are highly attenuated. We emphasize that viewing the SRP of the Euclidean DS beamformer in addition to the SRP of each segment separately does not allow correct identification of the direction to the desired source. We note that since the population correlation matrix is approximately low rank and the number of microphones is typically small, the estimation of the correlation matrix based on a small sample is feasible.

Fig. 2. - The SRP of the DS beamformer using (a) $\widehat{\mathbf{\Gamma}}_{\text{R}}$ in solid blue and $\widehat{\mathbf{\Gamma}}_{\text{E}}$ in dashed red, and (b) $\widehat{\mathbf{\Gamma}}_{1}$ and $\widehat{\mathbf{\Gamma}}_{2}$ in different shades of orange. The black solid line indicates the direction of the desired source, and the dashed black lines indicate the directions to the interference sources. The Input SIR is $-6\text{dB}$.
Fig. 2.

The SRP of the DS beamformer using (a) \widehat{\mathbf{\Gamma}}_{\text{R}} in solid blue and \widehat{\mathbf{\Gamma}}_{\text{E}} in dashed red, and (b) \widehat{\mathbf{\Gamma}}_{1} and \widehat{\mathbf{\Gamma}}_{2} in different shades of orange. The black solid line indicates the direction of the desired source, and the dashed black lines indicate the directions to the interference sources. The Input SIR is -6\text{dB}.

Next, we randomly generate 200 different pairs of positions for the interference sources. For each pair, the desired source is located at 20 different equally spaced directions along the arc (with the height of 1.8\text{m}). Thus, in total, 4000 different scenarios are examined.

Fig. 3 presents the mean output SIR (a) and the directivity (b) for the DS method using the correlation matrix estimates: \widehat{\mathbf{\Gamma}}_{\text{R}} (based on Riemannian geometry) in blue and \widehat{\mathbf{\Gamma}}_{\text{E}} (based on Euclidean geometry) in red. The box indicates the 25th and 75th percentiles, and the line marks the median. We test the different methods, Riemannian or Euclidean, in varying input SIR values, i.e. the SIR with respect to the source's power (excluding the AIRs and the beamformer processing).

Fig. 3. - (a) The mean output SIR and (b) the directivity for two interference sources, for the Riemannian and the Euclidean DS method. The x-axis indicates the input SIR, and the y-axis indicates the output SIR. The box indicates the $25$th and $75$th percentiles, and the central line marks the median. Several input SIR values are presented.
Fig. 3.

(a) The mean output SIR and (b) the directivity for two interference sources, for the Riemannian and the Euclidean DS method. The x-axis indicates the input SIR, and the y-axis indicates the output SIR. The box indicates the 25th and 75th percentiles, and the central line marks the median. Several input SIR values are presented.

We see that the Riemannian DS method attains high output SIR values, even for strong interference sources (high input SIR). In contrast, the Euclidean DS method results in relatively low output SIR values. The gap in the output SIR values between the Riemannian DS, and the Euclidean DS is up to 10\text{dB}. These results coincide with Proposition 1, stating that \text{SIR}_{j}(\mathbf{\Gamma}_{\text{R}})>\text{SIR}_{j}(\mathbf{\Gamma}_{ \text{E}}), for every interference source j. We emphasize that the Euclidean mean, \widehat{\mathbf{\Gamma}}_{\text{E}}, is equivalent to the common practice of using the entire signal for a single correlation matrix estimation. Since both the mean output SIR and the directivity present similar trends, and due to space considerations, in the following, we only present the mean output SIR.

Fig. 4 is the same as Fig. 3, but presenting the SbSp method with the addition of the intersection method, which appears in orange. Fig. 4(a) presents the results for the practical implementation that includes estimating the dimension, whereas Fig. 4(b) presents the results for the oracle. We see that the Riemannian approach outperforms its Euclidean counterpart by approximately 20\text{dB}. In addition, the oracle SbSp method is better than the practical SbSp. In comparison to Fig. 3(a), it can be seen that the Riemannian SbSp method results in higher output SIRs than the Riemannian DS method. In contrast, for the Euclidean approach, the SbSp method yields slightly lower SIRs than the DS method. The reason is that the Riemannian mean better attenuates the interference sources, allowing for a better estimation of the signal subspace than the Euclidean mean.

Fig. 4. - The mean output SIR of the SbSp methods. (a) Practical implementation. (b) Using an oracle. The box indicates the $25$th and $75$th percentiles and the central line marks the median. The x-axis indicates the input SIR, and the y-axis indicates the output SIR. Several input SIR values are presented.
Fig. 4.

The mean output SIR of the SbSp methods. (a) Practical implementation. (b) Using an oracle. The box indicates the 25th and 75th percentiles and the central line marks the median. The x-axis indicates the input SIR, and the y-axis indicates the output SIR. Several input SIR values are presented.

We continue with examining the direction estimation to the desired source. The estimated direction is defined as the direction leading to the maximal value of the SRP, namely \begin{align*} \hat{\theta}^{d}=\operatorname*{argmax}_{\theta}P(\theta).\tag{51} \end{align*} View SourceRight-click on figure for MathML and additional features.

Fig. 5 shows the estimated direction to the desired source for the Riemannian DS method (blue square), the Euclidean DS method (red circle), and the intersection method (orange star). The black solid line marks the true location of the desired source (at 20 different positions). The dashed lines mark the fixed location of the interference sources. The results for input SIR -6\text{dB} and -10\text{dB} appear in Fig. 5(a) and Fig. 5(b), respectively. We see that using the Riemannian mean, the direction estimation follows the desired source. In contrast, using the Euclidean mean (the entire signal) results in estimating the direction of one of the interference sources. The intersection method is also inferior to the proposed approach, resulting in direction estimation to the desired source or an interference source depending on the input SIR. We note that since the DoA is estimated using (51), only the main lobe of the SRP is considered for the DoA estimation (for simplicity, we assume a single desired source). Typically-used beamformers, based on Euclidean geometry, result in a main lobe that is pointing toward the strongest source which is an interference source. Since the number of sources is unknown, and their relative positioning is unknown as well, considering even lower lobes does not allow for estimating the DoA of the desired source using the SRP. Even if an interference source does not mask the desired source (which is often the case), we cannot differentiate between an interference source and the desired source, and cannot assign the correct direction to the desired source. In contrast, the Riemannian approach rejects the interference sources, resulting in a main lobe that points toward only the desired source only. We report that similar results as in Fig. 5 are obtained also for an SNR value of 0\text{dB} and for interference sources with similar DoA. Results for speech signals from the TIMIT dataset appear in Appendix B where similar trends are demonstrated.

Fig. 5. - Estimation of the DoA to the desired source for (a) input SIR of $-6\text{dB}$ and (b) input SIR of $-10\text{dB}$.
Fig. 5.

Estimation of the DoA to the desired source for (a) input SIR of -6\text{dB} and (b) input SIR of -10\text{dB}.

We repeat the experiment for 200 iterations, where at each iteration the desired source is positioned at 20 different positions as before. Fig. 6 presents the RMSE and the accuracy for a different number of microphones for the Riemannian approach (blue circles) and its Euclidean counterpart (red squares). We recall that DoA estimation error that is smaller than 3^{\circ} (arbitrarily chosen) is considered accurate. We see that the proposed approach leads to significant improvements for all the tested number of microphones.

Fig. 6. - (a) RMSE and (b) accuracy versus the number of microphones for two interference sources with input SIR of $-10$dB.
Fig. 6.

(a) RMSE and (b) accuracy versus the number of microphones for two interference sources with input SIR of -10dB.

Next, we examine the sensitivity of the proposed approach to the SNR and the reverberation time. We repeat the setting of the two interferences, as described in the first experiment. The results are presented in Fig. 7. In Fig. 7(a), the mean output SIR for the DS method is presented as a function of the reverberation time for a fixed SNR of {2}0\text{dB}. Several input SIR values are shown: 0\text{dB} (asterisk), -6\text{dB} (circle), and -10\text{dB} (triangle). The results for the Riemannian and the Euclidean DS appear in blue and red, respectively. Fig. 7(b) is the same as Fig. 7(a), only for different SNR values for a fixed reverberation time of \beta=150\text{ms}. We see that the smaller the reverberation time is, the higher the output SIR is for the Riemannian DS method. In contrast, the Euclidean DS is less affected by the reverberation time, resulting in relatively low output SIRs. For all values of reverberation times, the Riemannian DS results in higher output SIR than its Euclidean counterpart. From Fig. 7(b), we see that the SNR has a large impact on the performance of the Riemannian DS; the higher the SNR is, the higher the output SIR becomes. Conversely, the Euclidean DS is moderately affected by the SNR, resulting in much lower output SIR values. A possible explanation is that the main phenomenon limiting the performance of the Euclidean approach is the existence of interference sources. In addition, as Proposition 2 predicts, the sensitivity of the Riemannian DS to the SNR is higher than the Euclidean DS, and the higher the SNR is, the larger the gap in the performance between the Riemannian and the Euclidean approaches.

Fig. 7. - The mean output SIR as a function of (a) the reverberation time and (b) the SNR, for two interference sources. The Riemannian DS appears in blue, whereas the Euclidean DS appears in red. Several input SIR values are presented: $0\text{dB}$ (asterisk), $-6\text{dB}$ (circle), and $-10\text{dB}$ (triangle).
Fig. 7.

The mean output SIR as a function of (a) the reverberation time and (b) the SNR, for two interference sources. The Riemannian DS appears in blue, whereas the Euclidean DS appears in red. Several input SIR values are presented: 0\text{dB} (asterisk), -6\text{dB} (circle), and -10\text{dB} (triangle).

We examine the performance of the MVDR beamformer, given by (48). The MVDR beamformer is popular when interference sources are present, thanks to its distortionless response in the direction of the desired source, and its typical narrow beams. However, in our setting, since the desired source is accompanied by interference sources, and the directions to all the sources are unknown, the DoA estimation of the desired source using the MVDR beamformer is outperformed by the DS and the SbSp methods. We also examine the case of two desired sources. The results show similar trends and appear in Appendix B, due to space considerations.

In the second experiment, we examine a multiple interference setting, by considering N_{\text{I}}=14 interference sources. We note that the number of interference sources is larger than the number of microphones, N_{\text{I}}>M=12, which typically limits the number of interference sources that can be accommodated (e.g. see [30]). Furthermore, assumptions 1 and 2 implicitly restrict the number of interference sources to be bounded by M-N_{\text{D}}=11. This limitation is only for the analysis, and, in practice, improved results are obtained even for a larger number of interference sources. The signal contains 10 segments, demonstrating the number of segments could be smaller than the number of interference sources. The duration of the emitted signal is 10.24\text{s}. The input SIR for each interference is -6\text{dB}. Each interference has a 30\% probability of being active at each segment. The position of all the sources is set at random on an arc as described in the first experiment. The activation map of the interference sources at the first Monte Carlo iteration appears in Fig. 8(a), where light blue indicates ‘active’. On average, {4.2} interference sources are active at the same time at each segment. An interference source that is partially active during a segment is considered active during the entire segment (a worst case). Note that there are interference sources active continuously during more than one segment, so their activation is not necessarily related to the division of the received signal into segments. Additionally, there exists no segment in which only one source is active. The output SIRs are presented in Fig. 8(b). The Riemannian DS and SbSp methods appear in blue, the Euclidean DS and SbSp methods are in red, and the intersection method is in orange. It can be seen that the Riemannian approach is superior to the Euclidean one, resulting in higher output SIRs.

Fig. 8. - (a) Activation map for the $14$ interference sources during the observed $10$ time segments (blue indicates ‘active’). (b) Output SIR of the different beamformers. The input SIR of each interference is $-6\text{dB}$.
Fig. 8.

(a) Activation map for the 14 interference sources during the observed 10 time segments (blue indicates ‘active’). (b) Output SIR of the different beamformers. The input SIR of each interference is -6\text{dB}.

We note that Fig. 8(b) serves as an example of an increased number of segments in comparison to Fig. 3. Since there are on average 4.2 active interference sources at every segment, the results are not improved as the number of segments is increased.

We remark that the empirical results of the proposed approach applied to the typically-used beamformers are not sensitive to the number of samples used for the computation of the sample correlation matrix. Since, in the experiments, additional samples also include samples from at least one interference source, adding samples increases the effect of the interference sources, resulting in weak dependency on the number of samples.

Next, we evaluate the performance of the proposed approach applied to the Bayesian learning method proposed in [23] for sparse signal recovery for DoA estimation. This is done by replacing the typically-used (Euclidean) sample correlation matrix with the Riemannian mean of sample correlation matrices computed in short-time segments. We repeat the same experiment as in [23] with 6 desired sources, only we consider two disjointly active interference sources with SIR -6dB and -10dB. The SNR is 20dB. The number of desired sources is unknown. We evaluate the performance using 3000 experiments, where at each experiment the direction of the interference sources is generated uniformly at random between [-50^{\circ},50^{\circ}]. Fig. 9 presents the RMSE for a different number of STFT windows, which are equivalent to the number of samples for the sample correlation matrix computation of each segment (there are two segments in the experiment). We see that applying the proposed Riemannian approach leads to improved performance for all the tested number of samples.

Fig. 9. - The RMSE of the Bayesian learning method for $6$ desired sources with and without the proposed approach in blue and red, respectively, for (a) SIR $-6$dB and (b) SIR $-10$dB.
Fig. 9.

The RMSE of the Bayesian learning method for 6 desired sources with and without the proposed approach in blue and red, respectively, for (a) SIR -6dB and (b) SIR -10dB.

SECTION VIII.

Conclusion

We present a Riemannian approach for the design of beamformers and DoA estimation methods for interference rejection in reverberant environments. Specifically, the Riemannian mean is incorporated instead of the sample correlation matrix for DoA estimation of the desired source, as it inherently rejects the interference sources. We analytically show that the DS beamformer, based on the Riemannian geometry of the HPD manifold, results in a higher output SIR than the typical DS beamformer, which implicitly considers the Euclidean geometry. We extend our approach to other beamformers, such as subspace-based beamformers and the MVDR, as well as a Bayesian learning method, experimentally demonstrating superior output SIR and better DoA estimations in comparison to their Euclidean counterparts.

ACKNOWLEDGMENT

The authors wish to thank the associate editor and the anonymous reviewers for their useful comments which helped to improve this manuscript.

Appendix A

The Intersection Beamformer

Another beamformer we examine is based on the observation that the desired signal subspace is the intersection of subspaces spanned by eigenvectors of the correlation matrix of the different segments. From each segment, i, we extract the desired signal subspace from the sample correlation matrix, \widehat{\mathbf{\Gamma}}_{i}, to obtain {\boldsymbol{V}}(\widehat{\mathbf{\Gamma}}_{i}) whose columns are the N_{\text{D}} leading eigenvectors. The projection matrix onto the signal subspace of \widehat{\mathbf{\Gamma}}_{i} is computed as follows \begin{align*} {\boldsymbol{P}}_{\text{sig}}(\widehat{\mathbf{\Gamma}}_{i})={\boldsymbol{V}}(\widehat{\mathbf {\Gamma}}_{i})\left({\boldsymbol{V}}^{H}(\widehat{\mathbf{\Gamma}}_{i}){\boldsymbol{V}} (\widehat{\mathbf{\Gamma}}_{i})\right)^{-1}{\boldsymbol{V}}^{H}(\widehat{\mathbf{\Gamma }}_{i}).\tag{52} \end{align*} View SourceRight-click on figure for MathML and additional features.

Since each desired source is active during all the segments, its ATF, {\boldsymbol{h}}^{d}_{j}, is an eigenvector of {\boldsymbol{V}}(\widehat{\mathbf{\Gamma}}_{i}). Consequently, it is also an eigenvector of {\boldsymbol{P}}_{\text{sig}}(\widehat{\mathbf{\Gamma}}_{i}) for all i with an eigenvalue 1, i.e. {\boldsymbol{P}}_{\text{sig}}(\widehat{\mathbf{\Gamma}}_{i}){\boldsymbol{h}}^{ d}_{j}={\boldsymbol{h}}^{d}_{j}. As a result, it is an eigenvector of {\boldsymbol{P}}_{\text{sig}}=\prod_{i=1}^{L_{\text{s}}}{\boldsymbol{P}}_{ \text{sig}}(\widehat{\mathbf{\Gamma}}_{i}) with eigenvalue 1 as well since {\boldsymbol{P}}_{\text{sig}}{\boldsymbol{h}}^{d}_{j}=\left(\prod_{i=1}^{L_{ \text{s}}}{\boldsymbol{P}}_{\text{sig}}(\widehat{\mathbf{\Gamma}}_{i})\right){ \boldsymbol{h}}^{d}_{j}={\boldsymbol{h}}^{d}_{j}. We consider the leading N_{\text{D}}\vphantom{{}^{0}} eigenvectors of {\boldsymbol{P}}_{\text{sig}} and form the matrix {\boldsymbol{P}}_{\text{sig},N_{\text{D}}}\in{\mathbb{C}}^{M\times N_{\text{D}}}, whose columns are the eigenvectors. The intersection spectrum is defined as follows \begin{align*} {\boldsymbol{P}}_{\text{Intersect}}(\theta)={\boldsymbol{d}}^{H}(\theta){\boldsymbol{P}}_{\text{sig},N _{\text{D}}}{\boldsymbol{P}}_{\text{sig},N_{\text{D}}}^{H}{\boldsymbol{d}}(\theta).\tag{53} \end{align*} View SourceRight-click on figure for MathML and additional features.

We note that in addition to the dimension of the signal space of the matrix {\boldsymbol{P}}_{\text{sig}}, the intersection method requires the estimation of the dimension of the signal space of the correlation matrix for each segment (in order to compute {\boldsymbol{V}}(\widehat{\mathbf{\Gamma}}_{i})).

Appendix B

Additional Experimental Results

We repeat the setting of the first experiment, only with an additional desired source, located at (2\text{m},3.5\text{m},2.5\text{m}). Fig. 10 is the same as Fig. 3, but the mean output SIR is taken over the two interferences and the two desired sources. Fig. 10(a) presents results for input SIR of -10\text{dB}, and Fig. 10(b) presents results for input SIR of -6\text{dB}. We see that the Riemannian approach is superior to the Euclidean one, resulting in higher SIRs. Also, the intersection method is more sensitive to the input SIR, resulting in higher output SIR for -6\text{dB}. Similar to the setting of one desired source, the Riemannian SbSp method is superior to the Riemannian DS method, as opposed to the Euclidean approach, for which they are on par. This is a consequence of the higher interference attenuation of the Riemannian approach, resulting in a better estimate of the signal space.

Fig. 10. - Output SIR of the different beamformers for two desired sources and two interference sources. (a) The input SIR is $-10$dB. (b) The input SIR is $-6$dB.
Fig. 10.

Output SIR of the different beamformers for two desired sources and two interference sources. (a) The input SIR is -10dB. (b) The input SIR is -6dB.

Next, we examine the performance of the Riemannian MVDR and the Euclidean MVDR beamformers, given by (48), for \widehat{\mathbf{\Gamma}}_{\text{R}} and \widehat{\mathbf{\Gamma}}_{\text{E}}, respectively. Fig. 11 presents the results. It is the same as Fig. 3, only for the MVDR beamformer. We see that the Riemannian approach is superior to the Euclidean one, however, with a slight advantage.

Fig. 11. - The mean output SIR for the Riemannian and the Euclidean MVDR beamformers in the presence of two interference sources. The x-axis indicates the input SIR, and the y-axis indicates the output SIR.
Fig. 11.

The mean output SIR for the Riemannian and the Euclidean MVDR beamformers in the presence of two interference sources. The x-axis indicates the input SIR, and the y-axis indicates the output SIR.

We repeat the experiment with speech signals from the TIMIT dataset. For each source, the speaker and the time interval are chosen uniformly at random. Fig. 12 presents the results. We see that the proposed approach leads to improved results in comparison to the typically-used beamformers also for speech signals.

Fig. 12. - Estimation of the DoA of the desired source for speech signals where (a) the input SIR is -6dB, and (b) the input SIR is $-10$dB.
Fig. 12.

Estimation of the DoA of the desired source for speech signals where (a) the input SIR is -6dB, and (b) the input SIR is -10dB.

Appendix C

Extension to a Streaming Data Setting

In a streaming data setting, we do not have access in advance to the entire signal, and the direction estimation is updated as more samples become available. Therefore, in this setting, we cannot compute the Riemannian mean using (17). To circumvent this, we turn to the estimator of the Riemannian mean proposed in [37], [46], which is updated after every received segment.

Each segment, i, is processed separately using L_{\text{w}} STFT windows, and an estimation of the correlation matrix, \widehat{\mathbf{\Gamma}}_{i}, is computed using (16). Next, the estimation of the Riemannian mean, denoted as \hat{{\boldsymbol{R}}}_{i}, which is the adaptive counterpart of \mathbf{\Gamma}_{\text{R}}, is updated using the following update step \begin{align*} \hat{{\boldsymbol{R}}}_{i}=\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}(\hat{{\boldsymbol{R}}}_{i-1}^{- \frac{1}{2}}\widehat{\mathbf{\Gamma}}_{i}\hat{{\boldsymbol{R}}}_{i-1}^{-\frac{1}{2}})^ {\frac{1}{i}}\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}.\tag{54} \end{align*} View SourceRight-click on figure for MathML and additional features.

Algorithm 3 Streaming DoA estimation in the presence of multiple interferences

Input: the current result of the STFT window of the received signal

Output: the estimated direction of the desired source \hat{\theta}

1:

Set \hat{{\boldsymbol{R}}}_{0}={\boldsymbol{I}}

2:

Repeat

  1. Accumulate L_{\text{w}} STFT windows (that form a segment)

  2. Compute its sample correlation matrix, \widehat{\mathbf{\Gamma}}_{i}, using (16)

  3. Compute \hat{{\boldsymbol{R}}}_{i}=\hat{{\boldsymbol{R}}}_{i-1}^{\frac{1}{2}}(\hat{{ \boldsymbol{R}}}_{i-1}^{-\frac{1}{2}}\widehat{\mathbf{\Gamma}}_{i}\hat{{ \boldsymbol{R}}}_{i-1}^{-\frac{1}{2}})^{\frac{1}{i}}\hat{{\boldsymbol{R}}}_{i- 1}^{\frac{1}{2}}

  4. Compute P_{\text{DS}}(\theta;\hat{{\boldsymbol{R}}}_{i}) using (18)

  5. Return \hat{\theta}=\text{argmax}_{\theta}P_{\text{DS}}(\theta;\hat{{\boldsymbol{R}}} _{i})

The SRP of the DS beamformer is computed using (18) with \hat{{\boldsymbol{R}}}_{i} as an estimate of \widehat{\mathbf{\Gamma}}_{\text{R}}, and the direction to the desired source is set using (19). The algorithm is described in Algorithm 3.

We note that the current correlation matrix estimate has a full rank if L_{\text{w}}>M. This implies a latency of at least M times the duration of the STFT window. In case the STFT is performed with overlap this latency is reduced accordingly.

As for the Euclidean alternative, its estimator is updated in finer granularity at every STFT window, l. The estimator at the lth update is denoted by {\boldsymbol{E}}_{l} and is computed as follows \begin{align*} {\boldsymbol{E}}_{l}=\frac{n-1}{n}{\boldsymbol{E}}_{l-1}+\frac{1}{n}{\boldsymbol{z}}(l){\boldsymbol{z}}^{H}(l).\tag{55} \end{align*} View SourceRight-click on figure for MathML and additional features.

Setting L_{\text{w}}=1 in step 2.1 in Algorithm 3, and substituting (55) in step 2.3 in Algorithm 3, results in the streaming version of the Euclidean counterpart.

Appendix D

On the Particular Choice of the Riemannian Metric

In this work, we consider the Affine Invariant metric (also called Fisher Information metric) [47]. Another commonly-used metric in the space of HPD matrices is the Log-Euclidean metric [48], which could be viewed as a computationally efficient local approximation of the Affine Invariant metric. The induced Log-Euclidean distance is given by \begin{align*} d_{\text{LE}}^{2}(\mathbf{\Gamma}_{1},\mathbf{\Gamma}_{2})=\|\log(\mathbf{ \Gamma}_{1})-\log(\mathbf{\Gamma}_{2})\|_{F}^{2}.\tag{56} \end{align*} View SourceRight-click on figure for MathML and additional features.

In the context of this work, the Affine Invariant metric is advantageous over the Log Euclidean because it better enhances the desired source subspace relative to the interference and noise subspace, as we show next. We remark that the derivation is similar to the derivation in Section V, where we demonstrate the advantage of the Affine Invariant metric over Euclidean geometry. First, we note that the following holds [49]: \begin{align*} \text{tr}(\mathbf{\Gamma}_{\text{R}})\leq\text{tr}(\mathbf{\Gamma}_{\text{LE}}),\tag{57} \end{align*} View SourceRight-click on figure for MathML and additional features.

where \mathbf{\Gamma}_{\text{LE}} denotes the Riemannian mean based on the Log-Euclidean metric. Second, the ATF to the desired source, {\boldsymbol{h}}_{0}, is a common eigenvector of all the population correlation matrices per segment. Consequently, according to Lemma 2 below, the mean correlation matrices based on both metrics have the same eigenvalue associated with the common eigenvector {\boldsymbol{h}}_{0}. More specifically, denoting \lambda_{0}(\mathbf{\Gamma})\equiv\frac{{\boldsymbol{h}}_{0}^{H}\mathbf{\Gamma }{\boldsymbol{h}}_{0}}{\|{\boldsymbol{h}}_{0}\|^{2}}, we get \lambda_{0}(\mathbf{\Gamma}_{\text{R}})=\lambda_{0}(\mathbf{\Gamma}_{\text{LE}}).

Lemma 2:

Let \{\mathbf{\Gamma}_{j}\}_{j} be the set of HPD matrices having a common eigenvalue \lambda_{0} associated with the common eigenvector {\boldsymbol{h}}_{0}. Then \begin{align*} \begin{split}{\boldsymbol{h}}_{0}^{H}\mathbf{\Gamma}_{\text{R}}{\boldsymbol{h}}_{ 0}={\boldsymbol{h}}_{0}^{H}\mathbf{\Gamma}_{\text{LE}}{\boldsymbol{h}}_{0}={\boldsymbol{h}}_{0}^{H} \mathbf{\Gamma}_{\text{E}}{\boldsymbol{h}}_{0}=\|{\boldsymbol{h}}_{0}\|^{2}\lambda_{0}, \end{split}\tag{58} \end{align*} View SourceRight-click on figure for MathML and additional features.

where \mathbf{\Gamma}_{\text{R}}, \mathbf{\Gamma}_{\text{LE}}, and \mathbf{\Gamma}_{\text{E}} are the Riemannian means based on the Affine Invariant and the Log-Euclidean metrics, and the Euclidean mean.

The rest of the eigenvectors of the correlation matrices span the interference and noise subspace. We recall that \lambda_{i}(\mathbf{\Gamma}) is the ith eigenvalue, so we get \begin{align*} \frac{\lambda_{0}(\mathbf{\Gamma}_{\text{R}})}{\sum_{i=1}^{M-1}\lambda_{i} (\mathbf{\Gamma}_{\text{R}})}\geq\frac{\lambda_{0}(\mathbf{\Gamma}_{\text{LE}}) }{\sum_{i=1}^{M-1}\lambda_{i}(\mathbf{\Gamma}_{\text{LE}})},\tag{59} \end{align*} View SourceRight-click on figure for MathML and additional features.

which follows from Lemma 2 and (57).

We see that the Riemannian mean induced by the Affine Invariant metric captures better the desired signal subspace in comparison to the Riemannian mean induced by the Log Euclidean metric and the Euclidean mean (according to (24)), entailing an advantage in SbSp methods, for example.

References

References is not available for this document.