Introduction
Aspect-Based Sentiment Analysis (ABSA) is a fine-grained task that focuses on identifying the sentiments expressed toward specific aspect terms [1]–[3]. Typically, to determine the sentiment corresponding to an aspect term, it is necessary to identify the sentiment-laden words associated with it in the sentence, referred to as opinion terms. Therefore, to accomplish the ABSA task, several subtasks have been introduced, including Aspect Term Extraction (ATE) [4]–[7], Opinion Term Extraction (OTE) [8]–[11], and the more recently developed Aspect Sentiment Triplet Extraction (ASTE) [12]–[15]. Among these subtasks, ATE focuses on identifying and extracting aspect terms from a sentence, while OTE handles the extraction of opinion terms. ASTE combines both by identifying aspect terms, opinion terms, and the sentiment polarity expressed toward each aspect. Fig. 1 illustrates examples of the different subtasks.
Fig. 1. An example of ABSA including ATE, OTE, and ASTE. The orange words represent aspects, and the blue ones represent opinions.
The ASTE task was first proposed by Peng et al. [12], who used a two-stage pipeline approach: in the first stage, potential aspect and opinion terms are extracted via sequence labeling; in the second stage, the aspect and opinion terms are paired, and a classifier determines their corresponding sentiments. Building on Peng et al.'s pipeline, many studies have investigated the interaction among subtasks to design end-to-end triplet extraction models [13], [15]–[20]. Notably, prior works such as [16], [18], [19], and [20] handle triplet extraction with the table-filling method. In this approach, aspect and opinion terms are extracted from the diagonal elements of the table, while sentiment is represented as relational tags in the non-diagonal elements. The table-filling method enables comprehensive word-pair calculations and, being end-to-end, avoids the error propagation often seen in pipeline approaches. It also simplifies the identification of relationships between words, making the process more efficient.
However, the table-filling method, which represents each triplet as a relation region in the 2D table and casts the ASTE task as the detection and classification of relation regions, centers mainly on word-level interactions. Such methods excel at capturing fine-grained representations between individual words, but they often fail to incorporate sentence-level representations. This limits the model's capacity to grasp global contextual information, which is crucial for understanding the meaning of the entire sentence. As a result, when handling more complex sentences containing multi-word aspect and opinion terms, these models struggle to accurately capture the relationships between the terms and the overall sentiment, leading to performance drawbacks.
In this paper, building on the boundary-driven table-filling framework for aspect sentiment triplet extraction, we propose a cross-granularity contrastive learning (CCL) mechanism designed to enhance the semantic consistency between the sentence-level representation and the word-level representations. We name our method BTF-CCL. By constructing positive and negative sample pairs, BTF-CCL is compelled to learn the intricate relationships and associations between the sentence-level representation and the word-level representations, improving its ability to capture both fine-grained and broad contextual information. This ensures the model can better align local word-level details with the overall sentence-level meaning, leading to more accurate triplet extraction. In addition to cross-granularity contrastive learning, a multi-scale, multi-granularity convolutional method (MMCNN) is introduced to enable the model to capture rich semantic information across different levels of granularity. Experimental results show that our method outperforms existing state-of-the-art approaches.
Proposed Method
A. Task Definition
Let X = {x_1, x_2, …, x_n} denote a sentence consisting of n words. The ASTE task aims to extract triplets of the form (aspect, opinion, sentiment), where the sentiment is Positive, Negative, or Neutral; for example, "The battery life is great" yields the triplet (battery life, great, Positive). As illustrated in Fig. 2, a triplet is depicted as a region within a 2D table. The boundaries of this region specify the positions of the aspect term and the opinion term, while its type indicates the sentiment. The regions are marked by boundary tags: 'S' for the upper-left and 'E' for the lower-right corner.
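For concreteness, the following minimal sketch shows how a gold triplet maps onto an 'S'/'E'-bounded region of the table. The sentence and the indexing convention (rows index aspect words, columns index opinion words) are our own illustrative assumptions, since Fig. 2 is not reproduced here:

```python
# Hypothetical example: mapping a gold triplet onto the 2D table.
sentence = ["The", "battery", "life", "is", "great"]

# Aspect "battery life" spans words 1-2; opinion "great" is word 4.
triplet = {"aspect": (1, 2), "opinion": (4, 4), "sentiment": "Positive"}

# The triplet occupies the table region whose upper-left corner 'S'
# pairs the first aspect word with the first opinion word, and whose
# lower-right corner 'E' pairs the last words of each span.
S = (triplet["aspect"][0], triplet["opinion"][0])  # (1, 4)
E = (triplet["aspect"][1], triplet["opinion"][1])  # (2, 4)
print(S, E, triplet["sentiment"])  # (1, 4) (2, 4) Positive
```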
B. Model Architecture
The architecture of our model BTF-CCL is shown in Fig. 3. We use BERT to encode input sentences and generate a 2D table representing word relationships. This table is further processed by a multi-scale, multi-granularity CNN (MMCNN) to capture local semantic information better. Cross-granularity contrastive learning is then applied between the word-level representations from the MMCNN and sentence-level representations, and all candidate regions are detected and classified for sentiment polarity.
1) Representation Learning
For an input sentence X = {x_1, x_2, …, x_n}, we utilize the pre-trained BERT language model [21] to generate contextual embeddings. By extracting the hidden states from the final layer of the encoder, we obtain the full sentence representation H = {h_1, h_2, …, h_n}, where h_i ∈ R^d. The relation representation between two words (x_i, x_j) at position (i, j) of the table is computed as
\begin{equation*}r_{ij}^{(0)} = f\left( {\operatorname{Linear} \left( {\left[ {{h_i};{h_j};{c_{ij}};{t_{ij}}} \right]} \right)} \right)\tag{1}\end{equation*}
where f is a nonlinear activation, [·;·] denotes concatenation, and c_ij and t_ij are auxiliary pairwise features.
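As an illustration, a minimal PyTorch sketch of Eq. (1) follows. The activation f and the exact form of the pairwise features c_ij and t_ij are not specified in this excerpt, so we treat f as ReLU and the features as precomputed tensors; all names and dimensions are our assumptions:

```python
import torch
import torch.nn as nn

class PairRepresentation(nn.Module):
    """Sketch of Eq. (1): builds the initial table entry r_ij^(0) from the
    word embeddings h_i and h_j plus auxiliary pairwise features c_ij and
    t_ij, which we treat here as precomputed tensors."""

    def __init__(self, d: int, c_dim: int, t_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(2 * d + c_dim + t_dim, out_dim)
        self.f = nn.ReLU()  # the paper's activation f; ReLU is our assumption

    def forward(self, H, C, T):
        # H: (n, d) word embeddings; C: (n, n, c_dim); T: (n, n, t_dim).
        n = H.size(0)
        hi = H.unsqueeze(1).expand(n, n, -1)  # h_i, broadcast across columns
        hj = H.unsqueeze(0).expand(n, n, -1)  # h_j, broadcast across rows
        return self.f(self.linear(torch.cat([hi, hj, C, T], dim=-1)))
```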
2) MMCNN
The relation representations between word pairs form a 2D matrix whose entries carry dependencies, such as the boundary tags 'S' and 'E' and the shared sentiment within a region. Our architecture applies a ResNet-style CNN with different convolution kernel sizes (3 × 3, 5 × 5) and dilation rates (1, 2, 3) to the table representation, referred to as multi-scale, multi-granularity convolutions (MMCNN), enabling the capture of richer local information and long-range dependencies.
\begin{align*} & T' = \operatorname{ReLU} \left( {\operatorname{Conv}}_{1 \times 1} \left( T^{(l-1)} \right) \right)\tag{2} \\ & T'_{3 \times 3} = \operatorname{ReLU} \left( {\operatorname{Conv}}_{3 \times 3,\, d \in \{1,2,3\}} \left( T' \right) \right)\tag{3} \\ & T'' = T'_{3 \times 3} + T'_{5 \times 5}\tag{4} \\ & T''' = \operatorname{ReLU} \left( {\operatorname{Conv}}_{1 \times 1} \left( T'' \right) \right)\tag{5} \\ & T^{(l)} = T''' + T^{(l-1)}\tag{6}\end{align*}
The 5 × 5 branch uses the same structure and dilation rates as the 3 × 3 branch in Eq. (3) and is therefore omitted.
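A minimal PyTorch sketch of one MMCNN block under these equations follows. The channel count and the way the three dilated outputs per kernel size are merged are not specified in this excerpt, so we keep channels fixed and merge by summation:

```python
import torch.nn as nn

class MMCNNBlock(nn.Module):
    """Sketch of Eqs. (2)-(6): a ResNet-style block with multi-scale
    (3x3, 5x5) and multi-granularity (dilation 1, 2, 3) convolutions.
    Fixed channel count and summing the dilated branches are our
    assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.reduce = nn.Conv2d(channels, channels, kernel_size=1)  # Eq. (2)
        # padding = dilation * (kernel - 1) // 2 preserves the n x n size.
        self.conv3 = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3))
        self.conv5 = nn.ModuleList(
            nn.Conv2d(channels, channels, 5, padding=2 * d, dilation=d)
            for d in (1, 2, 3))
        self.expand = nn.Conv2d(channels, channels, kernel_size=1)  # Eq. (5)
        self.relu = nn.ReLU()

    def forward(self, t_prev):
        # t_prev: (batch, channels, n, n) table from the previous layer.
        h = self.relu(self.reduce(t_prev))                   # Eq. (2)
        b3 = sum(self.relu(conv(h)) for conv in self.conv3)  # Eq. (3), 3x3
        b5 = sum(self.relu(conv(h)) for conv in self.conv5)  # 5x5 branch
        merged = b3 + b5                                     # Eq. (4)
        out = self.relu(self.expand(merged))                 # Eq. (5)
        return out + t_prev                                  # Eq. (6), residual
```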
3) Global and Local Representation Alignment with Cross-granularity Contrastive Learning
We use cross-granularity contrastive learning to align the global sentence-level representation h_cls with the local word-level representations produced by the MMCNN.
Positive sample pairs: The global representation h_cls is paired with the local representation h_pos of the same sentence, obtained by average-pooling all entries of the table T:
\begin{equation*}{h_{{\text{pos}}}} = \frac{1}{{{n^2}}}\sum\limits_{i = 1}^n {\sum\limits_{j = 1}^n {{T_{ij}}} } \tag{7}\end{equation*}
Fig. 3. The overview of BTF-CCL. The sentence is encoded by the BERT encoder, enriched with word-level representations via the MMCNN, and contrastive learning is applied between sentence-level and word-level representations. Finally, triplets are extracted via region detection and classification.
This represents the alignment between the global context and the local information within the same sentence.
Negative sample pairs: To construct the negative samples, we pair h_cls with local representations h_neg that are pooled, as in Eq. (7), from the tables of non-corresponding sentences.
We define the cross-granularity contrastive learning loss using a margin-based ranking function. The goal is to minimize the distance between the global representation and the corresponding local representation (positive sample) while maximizing the distance between the global representation and the non-corresponding local representation (negative sample). The cross-granularity contrastive learning loss is defined as follows:
\begin{equation*}{L_{CL}} = \max \left( {0,m + d\left( {{h_{cls}},{h_{pos}}} \right) - d\left( {{h_{cls}},{h_{neg}}} \right)} \right)\tag{8}\end{equation*}
where m is the margin and d(·,·) denotes the distance between two representations.
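A minimal PyTorch sketch of Eqs. (7)–(8) follows. The Euclidean distance and the margin value are our assumptions, as the excerpt only states a margin-based ranking loss:

```python
import torch

def ccl_loss(h_cls, table, h_neg, margin=0.5):
    """Sketch of Eqs. (7)-(8). h_cls: (d,) global representation;
    table: (n, n, d) word-level table T of the same sentence;
    h_neg: (d,) representation pooled from a non-corresponding sentence.
    Euclidean distance and margin=0.5 are our assumptions."""
    h_pos = table.mean(dim=(0, 1))     # Eq. (7): average over all n^2 cells
    d_pos = torch.norm(h_cls - h_pos)  # d(h_cls, h_pos)
    d_neg = torch.norm(h_cls - h_neg)  # d(h_cls, h_neg)
    return torch.clamp(margin + d_pos - d_neg, min=0.0)  # Eq. (8)
```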
4) Region Detection and Classification
For each element r_ij^(l) of the table, we predict the probability that it is the upper-left ('S') or the lower-right ('E') boundary of a region:
\begin{equation*}P_{ij}^S = \sigma \left( {\operatorname{Linear} \left( {r_{ij}^{(l)}} \right)} \right),\quad P_{ij}^E = \sigma \left( {\operatorname{Linear} \left( {r_{ij}^{(l)}} \right)} \right)\tag{9}\end{equation*}
where σ denotes the sigmoid function.
Given a candidate region with boundaries S(a, b) and E(c, d), the representation r_abcd is constructed by concatenating the S and E representations with the max-pooling result over the region matrix:
\begin{equation*}{r_{abcd}} = \left[ {r_{ab}^{(l)};r_{cd}^{(l)};\operatorname{MaxPool} \left( {\left\{ {r_{ij}^{(l)}} \right\}_{a \le i \le c,\, b \le j \le d}} \right)} \right]\tag{10}\end{equation*}
The sentiment polarity SP ∈ {Positive, Negative, Neutral, Invalid} is then predicted through a softmax classifier:
\begin{equation*}{P_{abcd}}(SP) = \operatorname{Softmax} \left( {\operatorname{Linear} \left( {{r_{abcd}}} \right)} \right)\tag{11}\end{equation*}
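A minimal PyTorch sketch of Eqs. (9)–(11) follows; the use of two separate linear scorers for 'S' and 'E' and the hidden sizes are our assumptions:

```python
import torch
import torch.nn as nn

class RegionModule(nn.Module):
    """Sketch of Eqs. (9)-(11): per-cell boundary scoring and region-level
    sentiment classification over the table r^(l)."""

    def __init__(self, d: int, num_labels: int = 4):  # Pos/Neg/Neu/Invalid
        super().__init__()
        self.start = nn.Linear(d, 1)                    # 'S' scorer, Eq. (9)
        self.end = nn.Linear(d, 1)                      # 'E' scorer, Eq. (9)
        self.classifier = nn.Linear(3 * d, num_labels)  # Eq. (11)

    def boundaries(self, table):
        # table: (n, n, d) -> sigmoid scores P^S, P^E of shape (n, n).
        p_s = torch.sigmoid(self.start(table)).squeeze(-1)
        p_e = torch.sigmoid(self.end(table)).squeeze(-1)
        return p_s, p_e

    def classify(self, table, a, b, c, d):
        # Eq. (10): concatenate the S cell, the E cell, and the
        # max-pooling of all cells inside the region.
        region = table[a:c + 1, b:d + 1].reshape(-1, table.size(-1))
        r_abcd = torch.cat([table[a, b], table[c, d],
                            region.max(dim=0).values])
        return torch.softmax(self.classifier(r_abcd), dim=-1)  # Eq. (11)
```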
5) Training and Decoding
During training, the loss for boundary detection is calculated with binary cross-entropy:
\begin{align*} & {L_S} = \operatorname{BCEWithLogitsLoss} \left( {P_{ij}^S,y_{ij}^S} \right)\tag{12} \\ & {L_E} = \operatorname{BCEWithLogitsLoss} \left( {P_{ij}^E,y_{ij}^E} \right)\tag{13}\end{align*}
For region classification, the ground-truth region sentiment SP∗ is used to compute the loss with cross-entropy:
\begin{equation*}{L_{SP}} = - \sum\limits_{abcd} {\log } {P_{abcd}}\left( {S{P^{\ast}}} \right)\tag{14}\end{equation*}
The total loss is:
\begin{equation*}{L_{Total}} = {L_{CL}} + {L_S} + {L_E} + {L_{SP}}\tag{15}\end{equation*}
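The training objective can be sketched as follows; note that BCEWithLogitsLoss expects pre-sigmoid logits, so the sketch passes the raw linear outputs from Eq. (9) rather than the probabilities:

```python
import torch.nn.functional as F

def total_loss(s_logits, e_logits, y_s, y_e, sp_logits, sp_gold, l_cl):
    """Sketch of Eqs. (12)-(15). s_logits/e_logits: raw (pre-sigmoid)
    boundary scores with 0/1 targets y_s/y_e; sp_logits/sp_gold: region
    classification scores and gold labels; l_cl: the loss from Eq. (8)."""
    l_s = F.binary_cross_entropy_with_logits(s_logits, y_s)  # Eq. (12)
    l_e = F.binary_cross_entropy_with_logits(e_logits, y_e)  # Eq. (13)
    l_sp = F.cross_entropy(sp_logits, sp_gold)               # Eq. (14)
    return l_cl + l_s + l_e + l_sp                           # Eq. (15)
```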
During decoding, candidate regions are identified, and the sentiment polarity is predicted, yielding the triplet (aspect, opinion, sentiment).
Experiments
A. Datasets
We evaluate our model on four datasets presented by [12]: 14Res, 14Lap, 15Res, and 16Res. Three are from the restaurant domain and one from the laptop domain. All four originate from the SemEval challenges [3] and were refined by Xu et al. [13] based on the earlier version from Peng et al. [12]. Table I summarizes the statistics of these benchmark datasets.
B. Implementation Details
We employ the pre-trained "BERT-base-uncased" model, which consists of 110 million parameters, 12 attention heads, 12 hidden layers, and a hidden size of 768, as our encoding layer. The model is trained for 10 epochs; the parameters achieving the highest F1 score on the validation set are selected and then used to evaluate performance on the test set.
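For reference, a minimal sketch of this encoding layer using the HuggingFace transformers library (our choice of library; the implementation stack is not named in this excerpt):

```python
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")  # 110M params, hidden size 768

inputs = tokenizer("The battery life is great", return_tensors="pt")
# Final-layer hidden states: the representation H used in Eq. (1).
hidden_states = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
```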
C. Baselines
Table-filling methods frame aspect terms, opinion terms, and their sentiment as word-pair interactions. [16] presented the Grid Tagging Scheme (GTS), coupled with an inference technique to exploit interdependencies between opinion elements. [18] proposed the Dual-Encoder to leverage table-sequence encoders for capturing both sequence and table representations. [22] developed the Enhanced Multi-Channel Graph Convolutional Network (EMC-GCN) to incorporate linguistic features. [19] introduced the boundary-driven table-filling method (BDTF), emphasizing word relationships.
Other end-to-end methods handle triplet extraction by learning a joint model. [23] proposed a bidirectional machine reading comprehension (BMRC) framework that uses three query types to capture links between subtasks. [17] presented a span-level ASTE method with a POS filter and contrastive learning, further improving model performance. [24] reformulated all ABSA subtasks into a unified generative approach utilizing BART.
D. Main Results
The results in Table II show that our model outperforms the baseline models in Precision (P), Recall (R), and F1 across the four datasets. Specifically, it improves the F1 score by 1.09 and 0.7 points on 14Res and 14Lap, and by 1.56 and 1.53 points on 15Res and 16Res, respectively, compared with the previous best results from Span ASTE (POS&CL) [17] and BDTF [19]. While Span ASTE (POS&CL) achieves higher recall on 14Res and 14Lap than our model, ours excels in precision. This suggests that BTF-CCL strikes a better balance between precision and recall, resulting in stronger overall performance.
E. Ablation Study
We performed an ablation study to evaluate our model’s performance under various configurations. The results, shown in Table III, summarize the outcomes for each setup. The details of the ablation experiments and their analysis are as follows:
Contrastive learning: Without contrastive learning, the semantic consistency between the global representation and the local representation is lost. This ablation shows a significant decline in the model's performance across all datasets, suggesting that contrastive learning strengthens the guidance that the sentence-level global context provides to word-level local features.
MMCNN: The MMCNN is used to capture rich semantic information across different levels of granularity. To verify its effectiveness, we removed it in the ablation experiment, and the results show that the F1 scores of the model decreased across all four datasets without MMCNN.
Conclusion
In this work, we propose boundary-driven table-filling with cross-granularity contrastive learning (BTF-CCL) to enhance the semantic consistency between sentence-level and word-level representations. By constructing positive and negative sample pairs, the model is forced to learn the associations between the sentence level and the word level. Additionally, a multi-scale, multi-granularity convolutional method is proposed to better capture rich semantic information. Our approach captures sentence-level contextual information more effectively while maintaining sensitivity to local details. Together, these contributions enhance the performance of table-filling ASTE models, highlighting the effectiveness of our approach in tackling key challenges in aspect-based sentiment analysis.