Introduction
Aspect-Based Sentiment Analysis (ABSA) is a fine-grained task that focuses on identifying the sentiments expressed toward specific aspect terms [1]–[3]. Typically, to determine the sentiment corresponding to an aspect term, it is necessary to identify the sentiment-laden words associated with it in the sentence, referred to as opinion terms. Therefore, to accomplish the ABSA task, several subtasks have been introduced, including Aspect Term Extraction (ATE) [4]–[7], Opinion Term Extraction (OTE) [8]–[11], and the more recently developed Aspect Sentiment Triplet Extraction (ASTE) [12]–[15]. Among these subtasks, ATE focuses on identifying and extracting aspect terms from a sentence, while OTE handles the extraction of opinion terms. ASTE combines both by identifying aspect terms, opinion terms, and the sentiment polarity expressed toward each aspect. Fig. 1 illustrates examples of the different subtasks.
Fig. 1. An example of ABSA including ATE, OTE, and ASTE. The orange words represent aspects, and the blue ones represent opinions.
The ASTE task was first proposed by Peng et al. [12], who used a two-stage pipeline approach: in the first stage, potential aspect and opinion terms are extracted via sequence labeling; in the second stage, the aspect and opinion terms are paired, and a classifier determines their corresponding sentiments. Building on Peng et al.'s pipeline, many studies have investigated the interaction among subtasks to design end-to-end triplet extraction models [13], [15]–[20]. Notably, prior works such as [16], [18], [19], and [20] handle triplet extraction with the table-filling method. In this approach, aspect and opinion terms are extracted from the diagonal elements of the table, while sentiment is represented as relational tags in the non-diagonal elements. The table-filling method enables comprehensive word-pair calculations and, being end-to-end, avoids the error propagation often seen in pipeline approaches. It also simplifies the identification of relationships between words, making the process more efficient.
However, the table-filling method, which represents each triplet as a relation region in the 2D table and casts the ASTE task as the detection and classification of relation regions, centers mainly on word-level interactions. Such methods excel at capturing fine-grained representations between individual words, but they often fail to incorporate sentence-level representations. This limits the model's capacity to grasp global contextual information, which is crucial for understanding the meaning of the entire sentence. As a result, when handling more complex sentences containing multi-word aspect and opinion terms, these models struggle to accurately capture the relationships between the terms and the overall sentiment, leading to performance drawbacks.
In this paper, building on the boundary-driven table-filling framework for aspect sentiment triplet extraction, we propose a cross-granularity contrastive learning (CCL) mechanism designed to enhance the semantic consistency between the sentence-level representation and the word-level representations. We name our method BTF-CCL. By constructing positive and negative sample pairs, BTF-CCL is compelled to learn the intricate relationships and associations between the sentence-level representation and the word-level representations, improving its ability to capture both fine-grained and broad contextual information. This ensures the model can better align local word-level details with the overall sentence-level meaning, leading to more accurate triplet extraction. In addition to cross-granularity contrastive learning, a multi-scale, multi-granularity convolutional method (MMCNN) is introduced to enable the model to capture rich semantic information across different levels of granularity. Experimental results show that our method outperforms existing state-of-the-art approaches.
Proposed Method
A. Task Definition
Let X = {x_1, x_2, …, x_n} denote a sentence consisting of n words. The ASTE task aims to extract triplets of the form (aspect, opinion, sentiment), where the sentiment is Positive, Negative, or Neutral; for example, "The battery life is great" yields the triplet (battery life, great, Positive). As illustrated in Fig. 2, a triplet is depicted as a region within a 2D table. The boundaries of this region specify the positions of the aspect term and the opinion term, while its type indicates the sentiment. The regions are marked by boundary tags: 'S' for the upper-left and 'E' for the lower-right corner.
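For concreteness, the following minimal sketch shows how a gold triplet maps onto an 'S'/'E'-bounded region of the table. The sentence and the indexing convention (rows index aspect words, columns index opinion words) are our own illustrative assumptions, since Fig. 2 is not reproduced here:

```python
# Hypothetical example: mapping a gold triplet onto the 2D table.
sentence = ["The", "battery", "life", "is", "great"]

# Aspect "battery life" spans words 1-2; opinion "great" is word 4.
triplet = {"aspect": (1, 2), "opinion": (4, 4), "sentiment": "Positive"}

# The triplet occupies the table region whose upper-left corner 'S'
# pairs the first aspect word with the first opinion word, and whose
# lower-right corner 'E' pairs the last words of each span.
S = (triplet["aspect"][0], triplet["opinion"][0])  # (1, 4)
E = (triplet["aspect"][1], triplet["opinion"][1])  # (2, 4)
print(S, E, triplet["sentiment"])  # (1, 4) (2, 4) Positive
```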
B. Model Architecture
The architecture of our model BTF-CCL is shown in Fig. 3. We use BERT to encode input sentences and generate a 2D table representing word relationships. This table is further processed by a multi-scale, multi-granularity CNN (MMCNN) to capture local semantic information better. Cross-granularity contrastive learning is then applied between the word-level representations from the MMCNN and sentence-level representations, and all candidate regions are detected and classified for sentiment polarity.
1) Representation Learning
For an input sentence X = {x_1, x_2, …, x_n}, we utilize the pre-trained BERT language model [21] to generate contextual embeddings. By extracting the hidden states from the final layer of the encoder, we obtain the full sentence representation H = {h_1, h_2, …, h_n}, where h_i ∈ R^d. The relation representation between two words (x_i, x_j) at position (i, j) of the table is computed as
\begin{equation*}r_{ij}^{(0)} = f\left( {\operatorname{Linear} \left( {\left[ {{h_i};{h_j};{c_{ij}};{t_{ij}}} \right]} \right)} \right)\tag{1}\end{equation*}
where f is a nonlinear activation, [·;·] denotes concatenation, and c_ij and t_ij are auxiliary pairwise features.
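As an illustration, a minimal PyTorch sketch of Eq. (1) follows. The activation f and the exact form of the pairwise features c_ij and t_ij are not specified in this excerpt, so we treat f as ReLU and the features as precomputed tensors; all names and dimensions are our assumptions:

```python
import torch
import torch.nn as nn

class PairRepresentation(nn.Module):
    """Sketch of Eq. (1): builds the initial table entry r_ij^(0) from the
    word embeddings h_i and h_j plus auxiliary pairwise features c_ij and
    t_ij, which we treat here as precomputed tensors."""

    def __init__(self, d: int, c_dim: int, t_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(2 * d + c_dim + t_dim, out_dim)
        self.f = nn.ReLU()  # the paper's activation f; ReLU is our assumption

    def forward(self, H, C, T):
        # H: (n, d) word embeddings; C: (n, n, c_dim); T: (n, n, t_dim).
        n = H.size(0)
        hi = H.unsqueeze(1).expand(n, n, -1)  # h_i, broadcast across columns
        hj = H.unsqueeze(0).expand(n, n, -1)  # h_j, broadcast across rows
        return self.f(self.linear(torch.cat([hi, hj, C, T], dim=-1)))
```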
2) MMCNN
The relation representations between word pairs form a 2D matrix whose entries carry dependencies, such as the boundary tags 'S' and 'E' and the shared sentiment within a region. Our architecture applies a ResNet-style CNN with different convolution kernel sizes (3 × 3, 5 × 5) and dilation rates (1, 2, 3) to the table representation, referred to as multi-scale, multi-granularity convolutions (MMCNN), enabling the capture of richer local information and long-range dependencies.
\begin{align*} & T' = \operatorname{ReLU} \left( {\operatorname{Conv}}_{1 \times 1} \left( T^{(l-1)} \right) \right)\tag{2} \\ & T'_{3 \times 3} = \operatorname{ReLU} \left( {\operatorname{Conv}}_{3 \times 3,\, d \in \{1,2,3\}} \left( T' \right) \right)\tag{3} \\ & T'' = T'_{3 \times 3} + T'_{5 \times 5}\tag{4} \\ & T''' = \operatorname{ReLU} \left( {\operatorname{Conv}}_{1 \times 1} \left( T'' \right) \right)\tag{5} \\ & T^{(l)} = T''' + T^{(l-1)}\tag{6}\end{align*}
The 5 × 5 branch uses the same structure and dilation rates as the 3 × 3 branch in Eq. (3) and is therefore omitted.
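A minimal PyTorch sketch of one MMCNN block under these equations follows. The channel count and the way the three dilated outputs per kernel size are merged are not specified in this excerpt, so we keep channels fixed and merge by summation:

```python
import torch.nn as nn

class MMCNNBlock(nn.Module):
    """Sketch of Eqs. (2)-(6): a ResNet-style block with multi-scale
    (3x3, 5x5) and multi-granularity (dilation 1, 2, 3) convolutions.
    Fixed channel count and summing the dilated branches are our
    assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)  # Eq. (2)
        # padding = dilation * (kernel - 1) // 2 preserves the n x n size.
        self.conv3 = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3))
        self.conv5 = nn.ModuleList(
            nn.Conv2d(channels, channels, 5, padding=2 * d, dilation=d)
            for d in (1, 2, 3))
        self.expand = nn.Conv2d(channels, channels, kernel_size=1)  # Eq. (5)
        self.relu = nn.ReLU()

    def forward(self, t_prev):
        # t_prev: (batch, channels, n, n) table from the previous layer.
        h = self.relu(self.reduce(t_prev))                   # Eq. (2)
        b3 = sum(self.relu(conv(h)) for conv in self.conv3)  # Eq. (3), 3x3
        b5 = sum(self.relu(conv(h)) for conv in self.conv5)  # 5x5 branch
        merged = b3 + b5                                     # Eq. (4)
        out = self.relu(self.expand(merged))                 # Eq. (5)
        return out + t_prev                                  # Eq. (6), residual
```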
3) Global and Local Representation Alignment with Cross-granularity Contrastive Learning
We use cross-granularity contrastive learning to align the global sentence-level representation h_cls with the local word-level representations produced by the MMCNN.
Positive sample pairs: The global representation h_cls is paired with the local representation h_pos of the same sentence, obtained by average-pooling all entries of the table T:
\begin{equation*}{h_{{\text{pos}}}} = \frac{1}{{{n^2}}}\sum\limits_{i = 1}^n {\sum\limits_{j = 1}^n {{T_{ij}}} } \tag{7}\end{equation*}
Fig. 3. The overview of BTF-CCL. The sentence is encoded by the BERT encoder, enriched with word-level representations via the MMCNN, and contrastive learning is applied between sentence-level and word-level representations. Finally, triplets are extracted via region detection and classification.
This represents the alignment between the global context and the local information within the same sentence.
Negative sample pairs: To construct the negative samples, we pair h_cls with local representations h_neg that are pooled, as in Eq. (7), from the tables of non-corresponding sentences.
We define the cross-granularity contrastive learning loss using a margin-based ranking function. The goal is to minimize the distance between the global representation and the corresponding local representation (positive sample) while maximizing the distance between the global representation and the non-corresponding local representation (negative sample). The cross-granularity contrastive learning loss is defined as follows:
\begin{equation*}{L_{CL}} = \max \left( {0,m + d\left( {{h_{cls}},{h_{pos}}} \right) - d\left( {{h_{cls}},{h_{neg}}} \right)} \right)\tag{8}\end{equation*}
where m is the margin and d(·,·) denotes the distance between two representations.
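A minimal PyTorch sketch of Eqs. (7)–(8) follows. The Euclidean distance and the margin value are our assumptions, as the excerpt only states a margin-based ranking loss:

```python
import torch

def ccl_loss(h_cls, table, h_neg, margin=0.5):
    """Sketch of Eqs. (7)-(8). h_cls: (d,) global representation;
    table: (n, n, d) word-level table T of the same sentence;
    h_neg: (d,) representation pooled from a non-corresponding sentence.
    Euclidean distance and margin=0.5 are our assumptions."""
    h_pos = table.mean(dim=(0, 1))     # Eq. (7): average over all n^2 cells
    d_pos = torch.norm(h_cls - h_pos)  # d(h_cls, h_pos)
    d_neg = torch.norm(h_cls - h_neg)  # d(h_cls, h_neg)
    return torch.clamp(margin + d_pos - d_neg, min=0.0)  # Eq. (8)
```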
4) Region Detection and Classification
For each element r_ij^(l) of the table, we predict the probability that it is the upper-left ('S') or the lower-right ('E') boundary of a region:
\begin{equation*}P_{ij}^S = \sigma \left( {\operatorname{Linear} \left( {r_{ij}^{(l)}} \right)} \right),\quad P_{ij}^E = \sigma \left( {\operatorname{Linear} \left( {r_{ij}^{(l)}} \right)} \right)\tag{9}\end{equation*}
where σ denotes the sigmoid function.
Given a candidate region with boundaries S(a, b) and E(c, d), the representation r_abcd is constructed by concatenating the S and E representations with the max-pooling result over the region matrix:
\begin{equation*}{r_{abcd}} = \left[ {r_{ab}^{(l)};r_{cd}^{(l)};\operatorname{MaxPool} \left( {\left\{ {r_{ij}^{(l)}} \right\}_{a \le i \le c,\, b \le j \le d}} \right)} \right]\tag{10}\end{equation*}
The sentiment polarity SP ∈ {Positive, Negative, Neutral, Invalid} is then predicted through a softmax classifier:
\begin{equation*}{P_{abcd}}(SP) = \operatorname{Softmax} \left( {\operatorname{Linear} \left( {{r_{abcd}}} \right)} \right)\tag{11}\end{equation*}
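A minimal PyTorch sketch of Eqs. (9)–(11) follows; the use of two separate linear scorers for 'S' and 'E' and the hidden sizes are our assumptions:

```python
import torch
import torch.nn as nn

class RegionModule(nn.Module):
    """Sketch of Eqs. (9)-(11): per-cell boundary scoring and region-level
    sentiment classification over the table r^(l)."""

    def __init__(self, d: int, num_labels: int = 4):  # Pos/Neg/Neu/Invalid
        super().__init__()
        self.start = nn.Linear(d, 1)                    # 'S' scorer, Eq. (9)
        self.end = nn.Linear(d, 1)                      # 'E' scorer, Eq. (9)
        self.classifier = nn.Linear(3 * d, num_labels)  # Eq. (11)

    def boundaries(self, table):
        # table: (n, n, d) -> sigmoid scores P^S, P^E of shape (n, n).
        p_s = torch.sigmoid(self.start(table)).squeeze(-1)
        p_e = torch.sigmoid(self.end(table)).squeeze(-1)
        return p_s, p_e

    def classify(self, table, a, b, c, d):
        # Eq. (10): concatenate the S cell, the E cell, and the
        # max-pooling of all cells inside the region.
        region = table[a:c + 1, b:d + 1].reshape(-1, table.size(-1))
        r_abcd = torch.cat([table[a, b], table[c, d],
                            region.max(dim=0).values])
        return torch.softmax(self.classifier(r_abcd), dim=-1)  # Eq. (11)
```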
5) Training and Decoding
During training, the loss for boundary detection is calculated with binary cross-entropy:
\begin{align*} & {L_S} = \operatorname{BCEWithLogitsLoss} \left( {P_{ij}^S,y_{ij}^S} \right)\tag{12} \\ & {L_E} = \operatorname{BCEWithLogitsLoss} \left( {P_{ij}^E,y_{ij}^E} \right)\tag{13}\end{align*}
For region classification, the ground-truth region sentiment SP∗ is used to compute the loss with cross-entropy:
\begin{equation*}{L_{SP}} = - \sum\limits_{abcd} {\log } {P_{abcd}}\left( {S{P^{\ast}}} \right)\tag{14}\end{equation*}
The total loss is:
\begin{equation*}{L_{Total}} = {L_{CL}} + {L_S} + {L_E} + {L_{SP}}\tag{15}\end{equation*}
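The training objective can be sketched as follows; note that BCEWithLogitsLoss expects pre-sigmoid logits, so the sketch passes the raw linear outputs from Eq. (9) rather than the probabilities:

```python
import torch.nn.functional as F

def total_loss(s_logits, e_logits, y_s, y_e, sp_logits, sp_gold, l_cl):
    """Sketch of Eqs. (12)-(15). s_logits/e_logits: raw (pre-sigmoid)
    boundary scores with 0/1 targets y_s/y_e; sp_logits/sp_gold: region
    classification scores and gold labels; l_cl: the loss from Eq. (8)."""
    l_s = F.binary_cross_entropy_with_logits(s_logits, y_s)  # Eq. (12)
    l_e = F.binary_cross_entropy_with_logits(e_logits, y_e)  # Eq. (13)
    l_sp = F.cross_entropy(sp_logits, sp_gold)               # Eq. (14)
    return l_cl + l_s + l_e + l_sp                           # Eq. (15)
```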
During decoding, candidate regions are identified, and the sentiment polarity is predicted, yielding the triplet (aspect, opinion, sentiment).
Experiments
A. Datasets
We evaluate our model on four datasets presented by [12]: 14Res, 14Lap, 15Res, and 16Res. Three are from the restaurant domain and one from the laptop domain. All four originate from the SemEval challenges [3] and were refined by Xu et al. [13] based on the earlier version from Peng et al. [12]. Table I summarizes the statistics of these benchmark datasets.
B. Implementation Details
We employ the pre-trained "BERT-base-uncased" model, which consists of 110 million parameters, 12 attention heads, 12 hidden layers, and a hidden size of 768, as our encoding layer. The model is trained for 10 epochs; the parameters achieving the highest F1 score on the validation set are selected and then used to evaluate performance on the test set.
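For reference, a minimal sketch of this encoding layer using the HuggingFace transformers library (our choice of library; the implementation stack is not named in this excerpt):

```python
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")  # 110M params, hidden size 768

inputs = tokenizer("The battery life is great", return_tensors="pt")
# Final-layer hidden states: the representation H used in Eq. (1).
hidden_states = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
```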
C. Baselines
Table-filling methods frame aspect terms, opinion terms, and their sentiment as word-pair interactions. [16] presented the Grid Tagging Scheme (GTS), coupled with an inference technique to exploit interdependencies between opinion elements. [18] proposed the Dual-Encoder to leverage table-sequence encoders for capturing both sequence and table representations. [22] developed the Enhanced Multi-Channel Graph Convolutional Network (EMC-GCN) to incorporate linguistic features. [19] introduced the boundary-driven table-filling method (BDTF), emphasizing word relationships.
Other end-to-end methods handle triplet extraction by learning a joint model. [23] proposed a bidirectional machine reading comprehension (BMRC) framework that uses three query types to capture links between subtasks. [17] presented a span-level ASTE method with a POS filter and contrastive learning, further improving model performance. [24] reformulated all ABSA subtasks into a unified generative approach utilizing BART.
D. Main Results
The results in Table II show that our model outperforms the baseline models in Precision (P), Recall (R), and F1 across the four datasets. Specifically, it improves the F1 score by 1.09 and 0.7 points on 14Res and 14Lap, and by 1.56 and 1.53 points on 15Res and 16Res, respectively, compared with the previous best results from Span ASTE (POS&CL) [17] and BDTF [19]. While Span ASTE (POS&CL) achieves higher recall on 14Res and 14Lap than our model, ours excels in precision. This suggests that BTF-CCL strikes a better balance between precision and recall, resulting in stronger overall performance.
E. Ablation Study
We performed an ablation study to evaluate our model’s performance under various configurations. The results, shown in Table III, summarize the outcomes for each setup. The details of the ablation experiments and their analysis are as follows:
Contrastive learning: Without contrastive learning, the semantic consistency between the global representation and the local representation is lost. This ablation shows a significant decline in the model's performance across all datasets, suggesting that contrastive learning strengthens the guidance that the sentence-level global context provides to word-level local features.
MMCNN: The MMCNN is used to capture rich semantic information across different levels of granularity. To verify its effectiveness, we removed it in the ablation experiment, and the results show that the F1 scores of the model decreased across all four datasets without MMCNN.
Conclusion
In this work, we propose boundary-driven table-filling with cross-granularity contrastive learning (BTF-CCL) to enhance the semantic consistency between sentence-level and word-level representations. By constructing positive and negative sample pairs, the model is forced to learn the associations between the sentence level and the word level. Additionally, a multi-scale, multi-granularity convolutional method is proposed to better capture rich semantic information. Our approach captures sentence-level contextual information more effectively while maintaining sensitivity to local details. Together, these contributions enhance the performance of table-filling ASTE models, highlighting the effectiveness of our approach in tackling key challenges in aspect-based sentiment analysis.