Introduction
Emotion recognition is a key technology in brain-computer interfaces, enabling computers to interpret human emotions. Recent research in this field can be broadly categorized by the type of data used. The first category employs non-physiological signals [1]-[5] to recognize and classify emotions, but these signals can be intentionally concealed, limiting their ability to reflect genuine emotions. The second category focuses on physiological signals such as electroencephalography (EEG) [6], [7], electrocardiogram (ECG) [8], electromyogram (EMG) [9], and galvanic skin response (GSR) [10]. Among these, EEG is particularly advantageous for emotion recognition, as it captures emotion-related brain activity that cannot be masked, improving the reliability and accuracy of identification. Consequently, EEG-based emotion recognition has gained significant attention in both research and applications [7], [11]-[13].
With the significant success of deep learning, many researchers have applied these methods to emotion recognition. For example, Salama et al. [12] used CNNs as feature extractors to capture emotion-related features, while Li et al. [13] proposed a bi-hemispheric difference model (BiHDM) for emotion recognition. However, the aforementioned methods cannot effectively model the complex interactions between different electrodes at both local and long-range scales, resulting in suboptimal emotion recognition performance.
To leverage the correlations between electrodes, graph convolutional networks (GCNs) [14] have been widely applied in EEG-based research, because they can exploit the connectivity between electrodes [35]. As pioneering research, Song et al. [16] proposed a dynamical graph convolutional neural network (DGCNN) for emotion recognition. Chen et al. [17] introduced a static adjacency matrix in a GCN for extracting the inter-correlation features of multichannel EEG. Li et al. [18] proposed a residual graph convolutional broad network for emotion recognition. However, these methods do not consider correlation analysis between different EEG electrodes, even though research has shown that correlation analysis plays a crucial role in emotion recognition [21]. Additionally, they ignore the varying activation of different brain regions during emotional processes [19], [20]. In fact, neuroscience studies have demonstrated a strong association between emotions and different regions of the cerebral cortex [22]-[24]. Therefore, both aspects are critical for emotion recognition [25], [26].
Fig. 1. The overall structure of CAGLE-net. GARO is the Graph-Attention Readout, SERO is the Squeeze-Excitation Readout, and LEFL is the local enhanced fusion layer.
It is hypothesized that using correlation analysis to construct electrode adjacency and aggregating local brain features for each brain region can effectively decode emotions from EEG. Based on this, we propose an EEG correlation analysis-guided graph local enhanced feature learning network (CAGLE-net). In CAGLE-net, the Pearson correlation coefficient (PCC) and the Spearman correlation coefficient (SCC) are used to measure correlations between electrodes from linear and rank correlation perspectives, respectively. To capture differences in local activation, we design a local enhanced embedding layer (LEEL) to aggregate locally enhanced features. A cross-attention fusion mechanism, implemented in a local enhanced fusion layer (LEFL), is then employed to fully utilize these features, yielding more discriminative representations for emotion recognition. The main contributions of this work are summarized as follows:
The PCC and the SCC are employed to guide the learning of dynamic directed adjacency matrices from two complementary perspectives, better capturing emotion-related brain functional networks.
The LEEL is used to aggregate and fuse emotion-related locally enhanced features from different brain regions for more accurate emotion classification.
METHODOLOGY
The overall structure of CAGLE-net is shown in Fig. 1. The main modules are explained in detail below.
A. Correlation Analysis Guided Dynamic Directed Adjacency Matrix Learning
The brain's structure and functional connectivity are crucial for exploring correlations between electrodes and decoding emotion-related EEG signals [27]. In EEG correlation analysis, the PCC assesses the linear correlation between electrodes $X$ and $Y$, ranging from -1 (negative) to 1 (positive), with 0 indicating no correlation. The PCC is computed as follows:
\begin{equation*}\operatorname{PCC}(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_X \cdot \sigma_Y}\tag{1}\end{equation*}
where $\operatorname{cov}(X,Y)$ is the covariance between $X$ and $Y$, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.
The SCC evaluates the correlation between the ranks of electrodes $X$ and $Y$, without assuming continuity or linearity in the signals. Its values, like those of the PCC, range from -1 (negative) to 1 (positive). The SCC is computed as follows:
\begin{equation*}\operatorname{SCC} = 1 - \frac{6\sum_{i=1}^{N} d_i^2}{N\left(N^2 - 1\right)}\tag{2}\end{equation*}
where $d_i$ is the difference between the ranks of the $i$-th pair of observations and $N$ is the number of observations.
These two complementary methods assess electrode correlations from distinct perspectives but provide only static measurements.
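For concreteness, below is a minimal sketch of how the two static correlation matrices could be computed with SciPy; the array shapes, function name, and variable names are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def correlation_matrices(eeg):
    """Compute static PCC and SCC matrices between electrodes.

    eeg: array of shape (C, T) -- C electrodes, T time samples.
    Returns two (C, C) matrices; the diagonal is left at zero,
    mirroring Fig. 2, where self-correlation is not computed.
    """
    C = eeg.shape[0]
    A_pcc = np.zeros((C, C))
    A_scc = np.zeros((C, C))
    for i in range(C):
        for j in range(C):
            if i == j:
                continue
            A_pcc[i, j], _ = pearsonr(eeg[i], eeg[j])   # Eq. (1)
            A_scc[i, j], _ = spearmanr(eeg[i], eeg[j])  # Eq. (2)
    return A_pcc, A_scc
```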
To model dynamic EEG connectivity, we use a dynamic directed adjacency matrix $A_D \in \mathbb{R}^{C \times C}$, where $C$ is the number of electrodes, which learns inherent electrode connectivity patterns during training. Leveraging the static correlation matrices $A_{PCC}$ and $A_{SCC}$ as prior knowledge to guide the learning of $A_D$ (as shown in Equation 5) enables the capture of more discriminative, emotion-related features.
\begin{align*} & A_{PCC_{ij}}^{\prime} = \begin{cases} 1, & \text{if } \left| A_{PCC_{ij}} \right| > 0 \\ 0, & \text{otherwise} \end{cases}\tag{3} \\ & A_{SCC_{ij}}^{\prime} = \begin{cases} 1, & \text{if } \left| A_{SCC_{ij}} \right| > 0 \\ 0, & \text{otherwise} \end{cases}\tag{4} \\ & A_{PCC}^{D} = A_{D} \times A_{PCC}^{\prime}, \quad A_{SCC}^{D} = A_{D} \times A_{SCC}^{\prime}\tag{5}\end{align*}
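A brief sketch of this guided masking follows, reading the product in Equation 5 as element-wise masking of the learnable matrix (an assumption on our part; all names are illustrative):

```python
import numpy as np

def guided_adjacency(A_D, A_pcc, A_scc):
    """Mask the learnable dynamic matrix A_D with binary masks
    derived from the static correlation matrices (Eqs. 3-5)."""
    mask_pcc = (np.abs(A_pcc) > 0).astype(np.float32)  # Eq. (3)
    mask_scc = (np.abs(A_scc) > 0).astype(np.float32)  # Eq. (4)
    # Element-wise masking keeps only edges supported by the prior.
    A_D_pcc = A_D * mask_pcc   # Eq. (5)
    A_D_scc = A_D * mask_scc
    return A_D_pcc, A_D_scc
```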
Fig. 2. Visualization of the PCC and SCC matrices. Deeper red indicates stronger correlation, while deeper blue indicates weaker correlation. Blank areas represent the absence of correlation; the diagonal is left blank because self-correlation was not computed.
B. Graph Convolutional Network
In this paper, the GCN is used to capture the spatial structural features between electrodes. Specifically, we use the adjacency matrices $A_{PCC}^{D}$ and $A_{SCC}^{D}$ obtained in Equation 5 to propagate features across electrodes. Each graph convolution layer is computed as follows:
\begin{equation*}H^{(l)} = \sigma\left(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l-1)}W^{(l)}\right)\tag{6}\end{equation*}
where $\tilde{A}$ is the adjacency matrix with added self-loops, $\tilde{D}$ is its degree matrix, $W^{(l)}$ is a learnable weight matrix, $\sigma(\cdot)$ is a nonlinear activation, and $H^{(0)} = X$ is the input DE feature matrix. The outputs of two stacked layers are concatenated with the input features in each branch:
\begin{align*} & {H_{{\text{PCC}}}} = {\text{concatenate}}\left({X,H_{{\text{PCC}}}^{(1)},H_{{\text{PCC}}}^{(2)}}\right)\tag{7} \\ & {H_{{\text{SCC}}}} = {\text{concatenate}}\left({X,H_{{\text{SCC}}}^{(1)},H_{{\text{SCC}}}^{(2)}}\right)\tag{8}\end{align*}
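The following PyTorch sketch illustrates one plausible reading of Equations 6-8: a renormalized graph convolution stacked twice, with the input and both layer outputs concatenated. It assumes symmetric normalization even for the directed adjacency, which is our simplification; all names are illustrative.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution layer in the form of Eq. (6)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, H, A):
        A_tilde = A + torch.eye(A.size(0), device=A.device)   # add self-loops
        deg = A_tilde.sum(dim=1).clamp(min=1e-6)
        D_inv_sqrt = torch.diag(deg.pow(-0.5))
        A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt             # normalized adjacency
        return torch.relu(self.W(A_hat @ H))

def branch_features(X, A, layer1, layer2):
    """Concatenate the input with two stacked layer outputs (Eqs. 7-8)."""
    H1 = layer1(X, A)
    H2 = layer2(H1, A)
    return torch.cat([X, H1, H2], dim=-1)
```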
C. Local Enhanced Embedding Layer (LEEL)
The visualizations of the PCC and SCC matrices in Fig. 2 highlight distinct local connectivity patterns. Consequently, we divide the electrodes into five regions corresponding to the brain's physiological divisions: the frontal, temporal, central, parietal, and occipital lobes, as detailed in Table I. To aggregate each region's features into a single representation, Graph-Attention Readout (GARO) and Squeeze-Excitation Readout (SERO) are employed [29], both of which perform exceptionally well on graph structures. GARO leverages the Transformer attention mechanism using key-query embeddings [30], while SERO employs MLP-based attention from Squeeze-and-Excitation Networks [31]. The computations for GARO (Equations 9-11) and SERO (Equation 12) are as follows:
\begin{align*} & K = W_{key}H_{partial}\tag{9} \\ & Q = W_{query}\,\phi_{mean}\left(H_{partial}\right)\tag{10} \\ & Z = \operatorname{sigmoid}\left(\frac{Q^{T}K}{\sqrt{C}}\right)\tag{11}\end{align*}
\begin{equation*}Z = \operatorname{sigmoid}\left(W_{2}\,\sigma\left(W_{1}\,\phi_{mean}\left(H_{partial}\right)\right)\right)\tag{12}\end{equation*}
where $H_{partial}$ denotes the features of the electrodes within a single brain region, $\phi_{mean}$ is mean pooling over those electrodes, and $Z$ yields the attention weights used to aggregate the region's features into a locally enhanced representation.
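Below is a sketch of the two readouts for a single brain region, following our reading of Equations 9-12; taking $\phi_{mean}$ as mean pooling and using the weights $Z$ for a weighted sum are assumptions, as is every name in the code.

```python
import torch
import torch.nn as nn

class GARO(nn.Module):
    """Graph-Attention Readout (Eqs. 9-11): key-query attention over nodes."""
    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim, bias=False)
        self.query = nn.Linear(dim, dim, bias=False)

    def forward(self, H):                             # H: (num_nodes, dim)
        K = self.key(H)                               # Eq. (9)
        q = self.query(H.mean(dim=0))                 # Eq. (10): pool, then project
        z = torch.sigmoid(K @ q / K.size(-1) ** 0.5)  # Eq. (11): per-node weights
        return (z.unsqueeze(-1) * H).sum(dim=0)       # weighted readout vector

class SERO(nn.Module):
    """Squeeze-Excitation Readout (Eq. 12): MLP attention over nodes."""
    def __init__(self, dim, num_nodes):
        super().__init__()
        self.W1 = nn.Linear(dim, dim)
        self.W2 = nn.Linear(dim, num_nodes)

    def forward(self, H):                             # H: (num_nodes, dim)
        z = torch.sigmoid(self.W2(torch.relu(self.W1(H.mean(dim=0)))))
        return (z.unsqueeze(-1) * H).sum(dim=0)
```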
D. Local Enhanced Fusion Layer (LEFL)
To fully utilize the locally enhanced features learned in LEEL, a fusion approach based on cross-attention mechanisms with multi-head attention (MHA) is employed [30]. The structure of LEFL is illustrated in Fig. 3. The three inputs are processed through linear layers to obtain $X_Q$, $X_K$, and $X_V$. The computation for cross-attention fusion is as follows:
\begin{align*} & \operatorname{attn}\left(X_{Q},X_{K}\right) = \operatorname{softmax}\left(\frac{X_{Q}X_{K}^{T}}{\sqrt{d}}\right)\tag{13} \\ & out_{attn} = \operatorname{attn}\left(X_{Q},X_{K}\right)X_{V}\tag{14}\end{align*}
where $d$ is the feature dimension.
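A minimal PyTorch sketch of this fusion is shown below; how the three locally enhanced feature streams map onto query, key, and value is our assumption, and PyTorch's built-in multi-head attention stands in for the MHA of [30].

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Cross-attention fusion over locally enhanced features (Eqs. 13-14)."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        self.proj_v = nn.Linear(dim, dim)
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, xq, xk, xv):        # each: (batch, num_regions, dim)
        XQ, XK, XV = self.proj_q(xq), self.proj_k(xk), self.proj_v(xv)
        out, _ = self.mha(XQ, XK, XV)     # softmax(XQ XK^T / sqrt(d)) XV
        return out
```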
In addition to the methods described above, an $\ell_1$-norm constraint is applied to $A_D$ to mitigate the over-smoothing issue in GCNs, resulting in the sparse loss term denoted as $\mathcal{L}_{sparse}$ [32]. The cross-entropy loss $\mathcal{L}_{cross}$ is used to measure the discrepancy between the predicted labels $y_p$ and the ground-truth labels $y_i$. The final loss function is defined as follows:
\begin{equation*}Loss = \alpha \mathcal{L}_{sparse} + \beta \mathcal{L}_{cross}\tag{15}\end{equation*}
where $\alpha$ and $\beta$ are trade-off hyperparameters balancing the two terms.
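Put together, the training objective of Equation 15 might look like the following sketch; the weight values are placeholders, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, labels, A_D, alpha=0.01, beta=1.0):
    """Eq. (15): sparsity penalty on A_D plus cross-entropy.

    alpha and beta are trade-off hyperparameters; the defaults here
    are illustrative, not the paper's values.
    """
    l_sparse = A_D.abs().sum()                 # l1-norm of learned adjacency
    l_cross = F.cross_entropy(logits, labels)  # classification loss
    return alpha * l_sparse + beta * l_cross
```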
EXPERIMENTS
A. Dataset
The proposed model is evaluated on the SEED dataset [6], which includes EEG data from 15 subjects (7 males and 8 females). Each subject viewed 15 emotion-eliciting video clips (5 positive, 5 neutral, and 5 negative), each approximately 4 minutes long. The subjects participated in three experimental sessions, with consecutive sessions spaced at least one week apart.
B. Experiment Setup
To ensure a fair comparison of our results with other methods, we follow the experimental setup outlined in [6], [16], [33]. Specifically, we use the first 9 trials as the training set and the remaining 6 trials as the test set, with all trials coming from the same session. We computed the mean accuracy and standard deviation across all subjects over two sessions. Differential entropy (DE) features have been extensively utilized in previous studies [13], [16], [32], [34]; in this work, we directly use the DE features provided by the dataset.
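As a sketch, the subject-dependent split might look like the following; the 'de_LDS{k}' key pattern matches how SEED's precomputed feature files are commonly organized, but it should be verified against your copy of the dataset, and all names here are ours.

```python
from scipy.io import loadmat

def load_session_split(mat_path):
    """First 9 trials for training, remaining 6 for testing (same session).

    Assumes SEED's ExtractedFeatures .mat layout with per-trial DE
    features under keys 'de_LDS1' ... 'de_LDS15'; verify before use.
    """
    data = loadmat(mat_path)
    trials = [data[f"de_LDS{k}"] for k in range(1, 16)]  # 15 trials of DE features
    return trials[:9], trials[9:]                        # train, test
```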
C. Experiment Result
Table II presents the emotion classification accuracy of our model on the SEED dataset, compared to several baseline methods. Our model demonstrates superior performance, achieving higher accuracy and lower standard deviation, which supports the validity of our hypothesis. Notably, it surpasses the deep learning models BiHDM [13] and BiDANN [33] by 2.65% and 1.91%, respectively. This improvement highlights our method's effectiveness in capturing dynamic directed connections between electrodes and identifying emotion-related spatial topological features. Compared to graph-based methods, CAGLE-net achieves the highest accuracy and lowest standard deviation, underscoring its proficiency in capturing spatial brain features and aggregating locally enhanced features from different brain regions. The model not only extracts locally enhanced features effectively but also integrates them through cross-attention fusion, leading to more discriminative features. Fig. 4 illustrates the classification accuracy of CAGLE-net across the 15 subjects. Except for Subjects 2 and 10, the accuracy for all subjects exceeds 90%. This performance highlights the model's capability to generalize well across individuals, further validating its effectiveness in practical applications.
D. Ablation Studies
To validate the effectiveness of LEEL and LEFL, we perform ablation experiments based on the original model structure, with results presented in Table III. The table shows a progressive improvement in accuracy as LEEL and LEFL are incrementally incorporated. When both are used in CAGLE-net, the accuracy reaches its highest level and the standard deviation its lowest. This indicates that CAGLE-net not only effectively captures EEG spatial correlation features but also efficiently aggregates locally enhanced features, and, by integrating cross-attention fusion, extracts more discriminative features for emotion recognition. Moreover, when the PCC or the SCC is used individually, accuracy decreases; when both are used together, accuracy reaches its peak. This observation indicates that the two methods guide the learning of the adjacency matrix in a complementary manner, covering both linear and rank correlation perspectives.
CONCLUSION
This paper proposes CAGLE-net, an EEG correlation analysis-guided graph local enhanced feature learning network. In CAGLE-net, the PCC and SCC guide the learning of the dynamic directed adjacency matrix, capturing the topological structure between electrodes; LEEL effectively aggregates emotion-related locally enhanced features from different brain regions; and LEFL fully leverages the features learned by LEEL, fusing them through the cross-attention mechanism to improve emotion recognition accuracy.
ACKNOWLEDGMENT
The authors sincerely thank all anonymous reviewers for their evaluations and insightful comments, which have contributed to improving the quality of this paper.