Introduction
Biomedical event extraction aims to uncover complex interactions between the molecular level and larger entities like cells, tissues, organs and even organisms. The utility of event extraction includes many applications in the biomedical domain [1]. Consequently, various relevant entities and event types are involved in the process, making event extraction in this domain a challenging task. An event trigger is a textual term that relates to an event type and indicates the occurrence of an event. Thus, the identification of the trigger word is critical to the extraction process [2].
An event in biomedical text is composed of an event trigger and one or more arguments. The event trigger is a word or phrase that signals the occurrence of a biomedical event; statistically, it is usually a verb or gerund. An argument is a related biomedical entity or another biomedical event. For example, Figure 1 shows a biomedical event with an event trigger and two participants. The trigger word “treated” is annotated with its type “Planned_process”. Each event has one and only one trigger, but it can have multiple participants (event arguments). In this example there are two arguments, “Hamster” with its type “Organism” and “DMBA” with its type “Drug_or_compound”.
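The structure of an event described above can be written down as a small data type. The following is a minimal sketch, not the annotation format of any corpus: the class and field names (`Argument`, `Event`, `trigger_type`, and so on) are hypothetical, and only the values are taken from the Figure 1 example quoted in the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Argument:
    text: str         # surface form of the participant
    entity_type: str  # annotated entity type


@dataclass
class Event:
    trigger: str       # each event has exactly one trigger
    trigger_type: str  # event type signalled by the trigger
    arguments: List[Argument] = field(default_factory=list)  # one or more participants


# The event of Figure 1, using the annotations quoted in the text.
figure1_event = Event(
    trigger="treated",
    trigger_type="Planned_process",
    arguments=[
        Argument("Hamster", "Organism"),
        Argument("DMBA", "Drug_or_compound"),
    ],
)
```

Trigger identification, the task studied here, amounts to predicting the `trigger_type` field (or a non-trigger tag) for each candidate word.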
Biomedical event trigger identification is a fundamental task in natural language processing and has attracted the attention of many researchers. It is usually treated as a multi-class classification problem in which the number of samples per category is imbalanced. Traditional methods mainly rely on manual feature engineering [3], using features such as bag-of-words, position and path. However, feature engineering is costly, which makes trigger identification models difficult to adapt to new tasks or new domains. In recent years, deep learning methods have been used to recognize event triggers. Both convolutional neural networks [4], [5] and long short-term memory networks achieve competitive performance, but they still face two challenges. First, although manual features are not required, deep learning methods must still learn sufficiently good features to ensure good performance. To identify the various biomedical event triggers in a complex context, a method needs to acquire both local and contextual features. Second, once the high-level features have been learned automatically, a more powerful and suitable classifier becomes critical for multi-class, imbalanced data. The support vector machine (SVM) is usually used for binary classification, and softmax, as an activation function, is rarely used for classification outside deep learning. Improving biomedical event trigger identification therefore requires both a suitable feature extraction model and a suitable classification function, such as the extreme learning machine (ELM) or the weighted extreme learning machine (WELM).
Based on the two points analyzed above, we present an end-to-end Bidirectional Long Short-Term Memory Convolution Neural Network Weighted Extreme Learning Machine (BC-WELM) architecture for recognizing biomedical event triggers. Our proposed framework includes two stages: automatic feature extraction by Bi-LSTM and CNN (BC), and multi-class classification by WELM.
The BC combination of Bi-LSTM and CNN automatically learns latent semantic features from the distributed representations of words. Bi-LSTM is adopted for its ability to model long-term contextual information, so we first use it to encode each word into a contextual representation. Because a CNN obtains local information through its sliding window, the outputs of the Bi-LSTM are then fed into the CNN to model the local information of the text. In short, the proposed model builds the context representation with Bi-LSTM over the different embeddings, and then builds the local representation with CNN for detecting biomedical event triggers.
WELM is based on the original unweighted ELM for multiclass classification tasks [6], [7]. This model is simple in theory and fast in operation. Moreover, compared with traditional machine learning methods, it has better generalization performance. To alleviate the bias in performance caused by an imbalanced class distribution, WELM assigns an extra weight to each sample, strengthening the impact of the minority classes while weakening the relative impact of the majority classes. In this paper, we use WELM as the classifier for biomedical event triggers on top of the learned semantic features. Analysis of the experimental results demonstrates that BC-WELM models contextual and local representations, copes with the imbalanced data problem, and identifies biomedical event triggers better than traditional methods.
The contributions of our paper are as follows:
We propose a novel end-to-end Bidirectional Long Short-Term Memory Convolution Neural Network Weighted Extreme Learning Machine (BC-WELM), which identifies the biomedical event trigger from contextual and local representation.
The weighted loss of WELM addresses the class imbalance in biomedical event texts when training the BC-WELM framework, and improves the performance of biomedical event trigger identification.
Experimental results demonstrate that our method improves on the state-of-the-art baseline by 1.71% in F1 score, and an analysis of experimental examples verifies the effectiveness of BC-WELM for biomedical event trigger identification.
The structure of our paper is as follows: Section 2 introduces the related work on biomedical event trigger identification. Section 3 gives a detailed description of our BC-WELM model for detecting biomedical event triggers. Section 4 presents the extensive experiments to evaluate the effectiveness of our model, and Section 5 sums up the work and outlines potential future directions.
Related Work
To address the need for capturing biomedical entities and their complex relations in biomedical literature, methods of biomedical event extraction have been widely studied in recent years.
Rule-based methods are proposed to deal with the limited resources of annotated texts. These methods commonly employ pattern recognition techniques to generate a series of word patterns with a pre-defined trigger words dictionary. In paper [8], high-precision rules of event extraction were constructed by biologists. Bui et al. [9] presented an approach in which the terms of each sentence were matched against a dictionary to detect candidate event triggers. Although these hand-tailored matching rules usually identify event triggers accurately, constructing rules is time consuming and it is difficult to cover all types of trigger pattern.
Machine learning-based methods usually depend on extracting rich features from the text, instead of patterns, to train classifiers. Pyysalo et al. [3] presented a typical example of event extraction in their work, where events on biological organization ranging from the subcellular to the organism level were categorized into 19 classes. Based on the rich features extracted from the text, an SVM was adopted to classify triggers and arguments. Zhou and Zhong [10] used k-nearest neighbors (KNN) to add unlabeled data to the training set, and then combined this with SVM to construct a semi-supervised event extraction framework. By adding a large amount of semantic information not included in the training dataset, their framework significantly improved the recall. He et al. [11] adopted an SVM-based method integrating feature selection and word embeddings. These models show that machine learning methods, mainly SVM, act as effective classifiers on this task. However, machine learning-based methods require the construction of a large number of features, and their generalization is limited.
On the other hand, various neural network-based methods have achieved promising results in text mining tasks in recent years. These methods learn features from text automatically and so can be applied to biomedical event extraction. Huang et al. [12] applied multi-layer perceptron (MLP) for the task of event trigger identification, whereby the cross-entropy error function is used as the cost function. Wang et al. [13] employed CNN to exploit higher-level features. Without the use of external NLP tools, their method avoids the noise caused by extra processing. Rahul et al. [14] adopted bidirectional gated recurrent units (Bi-GRU) and variants of a bidirectional recurrent neural network (Bi-RNN) for event trigger identification. Li et al. [15] proposed a multi-pooling CNN model with dependency-based word embeddings as its input. The rich semantic features contained in their dependency-based word embeddings help to improve the performance of the model. Most of these neural network models use the softmax layer for classification. Although softmax is used as a cost function for probabilistic multiclass classification, by itself it is not a classifier. A major problem with the softmax cost function is that it does not optimize the features associated with minority samples.
In recent years, to take advantage of traditional machine learning methods on small datasets, a scheme to combine these methods with neural networks is beginning to emerge. The combination of CNN and SVM is most common. Ebert et al. [16] presented an approach which integrates CNN and SVM for the emotional classification task on tweets. Huang et al. [17] presented a two-stage method of combining LSTM and SVM for drug-drug interaction (DDI) extraction. Matsugu et al. [18] adopted a hybrid model for face detection, using an SVM being fed with a feature vector generated from CNN. Duan et al. [19] combined CNN and ELM to propose a hybrid framework for predicting age and gender. In their framework, CNN was used to extract the features from the input while ELM classified the intermediate results.
The above hybrid models extract features through deep learning and then use SVM or ELM for classification. Although these methods achieve promising results in various tasks, they do not consider the impact of data imbalance, so their improvements on a small corpus are limited. Our study takes into account both the advantages of the hybrid model and the challenge of imbalanced data, and proposes a novel, computable and effective hybrid model, BC-WELM, for identifying biomedical event triggers.
Methodology
In this section, we first describe the biomedical event trigger identification task. Then, we analyze the proposed BC-WELM model in detail. Finally, we introduce the training details.
A. The Definition of Biomedical Event Trigger Identification
In general, event trigger detection is used to confirm whether a word is a trigger or not, and then identify its type. Biomedical events are often characterized by complex, nested argument structures involving several entities or relations. Among the elements of an event, the trigger is crucial because it affects the type of event and subsequent event arguments detection. For each event trigger candidate, by assigning a tag that represents the type of trigger, or a non-trigger tag, we make event trigger detection a classification problem. We convert the raw annotation sentence S into a sequence of terms
B. BC-WELM for Biomedical Event Trigger Identification
In this section, we introduce our BC-WELM model based on the Bi-LSTM representation and CNN representation with the balanced loss function of WELM. The overall architecture of the BC-WELM model for detecting a biomedical event trigger is shown in Figure 2. Our network is composed of five parts: (1) an embedding layer with different dimensions of embeddings, (2) a contextual modeling layer with Bi-LSTM model, (3) a local modeling layer with a CNN model, (4) a pooling layer, and (5) a WELM classification layer. We will present the design of each part in the following sections.
1) Embedding Layer
Suppose that a context consists of n words. Features along different dimensions of a biomedical event text are represented by four embeddings: word embedding, entity embedding, position embedding and part-of-speech (POS) embedding. Together, these embeddings represent the latent semantic space for biomedical event triggers.
Word Embedding. Every word feature of biomedical text can be mapped to a high dimension feature space in this layer for capturing the meaningful semantic regularities. Here, the word embedding model GloVe [20] is applied as the pre-trained word vector in order to produce the word embedding for detecting the biomedical event trigger.
Entity Embedding. This layer captures event entity information as additional features. For each entity type, a vector is randomly initialized and then updated during training.
Position Embedding. Position information can be used to detect event triggers in biomedical text. Therefore, we use the position embedding to represent the distance between words and entities in the biomedical context. The position embeddings are randomly initialized vectors.
POS Embedding. Part-of-speech information is a critical clue for recognizing biomedical event triggers. POS tags are represented in a dense semantic space by randomly initialized vectors.
2) Contextual Modeling Layer
Generally, the recurrent neural network (RNN) is an effective sequential model for learning useful features, and is popular in natural language processing. Long Short-Term Memory (LSTM) networks, proposed by Hochreiter and Schmidhuber [21], model the temporal interactions among words from their embeddings and perform well in many NLP tasks. Bi-LSTM, proposed by Graves et al. [22], combines a forward LSTM and a backward LSTM to better represent the context.
In this framework, Bi-LSTM is used because it learns long-term dependencies while avoiding the vanishing and exploding gradient problems. The design contains three gates and one cell to build the semantic relationship. The inputs are the word embedding, entity embedding, position embedding and POS embedding. The update formulas of the forward LSTM network are as follows: \begin{align*} X=&\begin{bmatrix} h_{s-1} \\ x_{s}\end{bmatrix} \tag{1}\\ f_{s}=&\sigma (W_{f}\cdot X+b_{f}) \tag{2}\\ i_{s}=&\sigma (W_{i}\cdot X+b_{i}) \tag{3}\\ o_{s}=&\sigma (W_{o}\cdot X+b_{o}) \tag{4}\\ c_{s}=&f_{s}\odot c_{s-1}+i_{s}\odot \tanh (W_{c}\cdot X+b_{c}) \tag{5}\\ h_{s}=&o_{s}\odot \tanh (c_{s})\tag{6}\end{align*}
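To make the update in Eqs. (1)-(6) concrete, the following NumPy sketch implements one forward LSTM step, using tanh for the candidate cell state as is standard for LSTM. It also runs the same update right-to-left and concatenates the two directions to form a bidirectional output. The helper names (`lstm_step`, `init_params`, `run_lstm`, `bilstm`), the toy dimensions and the random inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_s, h_prev, c_prev, p):
    """One LSTM update following Eqs. (1)-(6)."""
    X = np.concatenate([h_prev, x_s])                      # Eq. (1)
    f = sigmoid(p["W_f"] @ X + p["b_f"])                   # Eq. (2), forget gate
    i = sigmoid(p["W_i"] @ X + p["b_i"])                   # Eq. (3), input gate
    o = sigmoid(p["W_o"] @ X + p["b_o"])                   # Eq. (4), output gate
    c = f * c_prev + i * np.tanh(p["W_c"] @ X + p["b_c"])  # Eq. (5), cell state
    h = o * np.tanh(c)                                     # Eq. (6), hidden state
    return h, c


def init_params(input_dim, hidden_dim, rng):
    """Hypothetical initializer: uniform weights, zero biases."""
    p = {}
    for gate in "fioc":
        p[f"W_{gate}"] = rng.uniform(-0.1, 0.1, (hidden_dim, hidden_dim + input_dim))
        p[f"b_{gate}"] = np.zeros(hidden_dim)
    return p


def run_lstm(xs, p, hidden_dim):
    """Apply lstm_step over a sequence and return all hidden states."""
    h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        states.append(h)
    return np.stack(states)


def bilstm(xs, fwd, bwd, hidden_dim):
    """Concatenate forward states with re-aligned backward states."""
    forward = run_lstm(xs, fwd, hidden_dim)
    backward = run_lstm(xs[::-1], bwd, hidden_dim)[::-1]
    return np.concatenate([forward, backward], axis=1)


rng = np.random.default_rng(0)
n_words, input_dim, hidden_dim = 5, 8, 4   # toy sizes, not the paper's settings
xs = rng.normal(size=(n_words, input_dim))
H_ctx = bilstm(xs, init_params(input_dim, hidden_dim, rng),
               init_params(input_dim, hidden_dim, rng), hidden_dim)
```

Each row of `H_ctx` is the contextual representation of one word, with twice the hidden size because the two directions are concatenated.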
A similar process is used for the backward LSTM. Then the forward and backward LSTM are concatenated. With different dimension contextual modeling for four types of embeddings, we obtain the final sentence contextual output
3) Local Modeling Layer
In addition to the contextual modeling of biomedical text, biomedical event trigger identification also needs to learn local semantic features through the convolution operation [23].
In general, the convolutional layer is composed of a filter and a feature map. A filter w mainly uses a window of h words to produce a new feature map for generating the local features. Here, a feature map \begin{equation*} c_{i}=f(w\times x_{i:i+h-1}+b) \tag{7}\end{equation*}
4) Pooling Layer
After compressing the different dimensions of semantics into feature maps using the local convolution layer, we use the max-pooling operation to capture the critical feature in each feature map. As a result, sentences of different lengths are reduced to integrated feature vectors of the same size.
Max-pooling operation is employed for each feature map \begin{equation*} p_{i} = max(c_{i})\tag{8}\end{equation*}
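The convolution of Eq. (7) and the pooling of Eq. (8) can be sketched in a few lines of NumPy. The choice of tanh for the nonlinearity f, the zero bias, the toy sizes and the function names are assumptions made for illustration; the source does not specify them.

```python
import numpy as np


def conv_feature_map(x, w, b, f=np.tanh):
    """Eq. (7): c_i = f(w . x_{i:i+h-1} + b) for every window position i.

    x has shape (n_words, dim); w has shape (h, dim), i.e. a window of h words.
    """
    n = x.shape[0]
    h = w.shape[0]
    return np.array([f(np.sum(w * x[i:i + h]) + b) for i in range(n - h + 1)])


def max_pool(c):
    """Eq. (8): keep the strongest response of one feature map."""
    return np.max(c)


rng = np.random.default_rng(0)
n_words, dim, window, n_filters = 7, 6, 3, 4   # toy sizes
x = rng.normal(size=(n_words, dim))            # stand-in for the Bi-LSTM outputs
filters = rng.uniform(-0.1, 0.1, size=(n_filters, window, dim))

# One pooled value per filter, regardless of the sentence length.
pooled = np.array([max_pool(conv_feature_map(x, w, 0.0)) for w in filters])
```

Because max-pooling keeps one value per filter, a sentence of any length (at least as long as the window) yields a vector of length `n_filters`, which is what the classifier consumes.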
5) WELM Classification Layer
To classify the features obtained by the pooling layer, and because softmax adapts poorly to imbalanced datasets, we introduce WELM to deal with these high-dimensional features. In this subsection, we start with an introduction to the basic ELM model, and then explain the improvement made by the WELM model.
6) ELM
ELM [24] is a generalized single hidden layer feed forward network, which can be used for classification. ELM exhibits better generalization performance and gets rid of the iterative, time-consuming training process [25].
For the samples \begin{equation*} h(x_{i})=g(Vx_{i}+b) \tag{9}\end{equation*}
\begin{align*}&\text {Minimize: }\frac {1}{2} ||\beta ||^{2}+C\frac {1}{2} \sum _{i=1}^{N}||\xi _{i}||^{2} \tag{10}\\&\text {Subject to: }h(x_{i})\beta =t_{i}^{T}-\xi _{i}^{T} \tag{11}\end{align*}
\begin{equation*} \beta =\begin{cases} H^{T}{\left({\dfrac {I}{C}+HH^{T}}\right)}^{-1}T, & \text {if }~ N < L \\ {\left({\dfrac {I}{C}+H^{T}H}\right)}^{-1}H^{T}T, & \text {if }~ N > L \\ \end{cases}\tag{12}\end{equation*}
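The closed-form ELM solution can be illustrated with a short NumPy sketch. It assumes that H is the N x L hidden-layer output matrix whose rows are h(x_i), that T stacks one-hot targets, and that g is tanh; these choices, the function names and the toy data are illustrative assumptions. For N > L the sketch inverts the L x L matrix I/C + H^T H, which keeps the matrix shapes consistent and gives the same β as the N < L form.

```python
import numpy as np


def hidden_layer(X, V, b):
    """Eq. (9) for all samples at once: row i is h(x_i) = g(V x_i + b), g = tanh."""
    return np.tanh(X @ V + b)


def elm_output_weights(H, T, C):
    """Closed-form output weights beta for regularization constant C."""
    N, L = H.shape
    if N < L:
        # Invert an N x N matrix when there are fewer samples than hidden nodes.
        return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)
    # Otherwise invert an L x L matrix.
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)


rng = np.random.default_rng(0)
# Toy data: two well-separated 2-D classes (purely illustrative).
X = np.vstack([rng.normal(-2.0, 0.5, (40, 2)), rng.normal(2.0, 0.5, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
T = np.eye(2)[y]                      # one-hot targets t_i

L = 20                                # number of hidden nodes
V = rng.uniform(-1.0, 1.0, (2, L))    # random input weights, never trained
b = rng.uniform(-1.0, 1.0, L)         # random hidden biases
H = hidden_layer(X, V, b)
beta = elm_output_weights(H, T, C=100.0)
train_accuracy = np.mean(np.argmax(H @ beta, axis=1) == y)
```

Only β is fitted, in a single linear solve, which is why ELM avoids the iterative training process mentioned above.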
7) WELM
In order to enhance the robustness of the ELM on unbalanced data, Graves et al. [22] proposed the weighted extreme learning machine (WELM). The WELM work proposed and evaluated two generalized weighting schemes that assign weights to instances according to their class distribution. The first weighting scheme is: \begin{equation*} W_{ii}=\frac {1}{q_{k}} \tag{13}\end{equation*}
The second weighting scheme is: \begin{align*} q_{avg}=&\sum _{k=1}^{m}\frac {q_{k}}{m} \tag{14}\\ W_{ii}=&\begin{cases} \dfrac {1}{q_{k}}, & \text {if }~ q_{k} \leq q_{avg} \\[7pt] \dfrac {0.618}{q_{k}}, & \text {if }~ q_{k} > q_{avg} \\ \end{cases} \tag{15}\end{align*}
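The two weighting schemes of Eqs. (13)-(15) can be computed directly from the training labels. The sketch below assumes that q_k is the number of training samples in the class k of sample i and that m is the number of classes; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def welm_weights(labels, scheme=1):
    """Return the diagonal entries W_ii, one per training sample."""
    counts = Counter(labels)                     # q_k for every class k
    q_avg = sum(counts.values()) / len(counts)   # Eq. (14)
    weights = []
    for label in labels:
        q_k = counts[label]
        if scheme == 2 and q_k > q_avg:
            weights.append(0.618 / q_k)          # Eq. (15), above-average class
        else:
            weights.append(1.0 / q_k)            # Eq. (13) and Eq. (15), otherwise
    return weights


# Toy label list: six majority samples and two minority samples.
labels = ["majority"] * 6 + ["minority"] * 2
w1 = welm_weights(labels, scheme=1)   # 1/6 for majority, 1/2 for minority
w2 = welm_weights(labels, scheme=2)   # 0.618/6 for majority, 1/2 for minority
```

Here q_avg is 4, so only the majority class is above average and the second scheme shrinks its weight further, by the factor 0.618.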
\begin{align*}&\text {Minimize: }\frac {1}{2} ||\beta ||^{2}+CW\frac {1}{2} \sum _{i=1}^{N}||\xi _{i}||^{2} \tag{16}\\&\text {Subject to: }h(x_{i})\beta =t_{i}^{T}-\xi _{i}^{T} \tag{17}\end{align*}
According to the literature [22], the output weights β can be obtained as follows: \begin{equation*} \beta =\begin{cases} H^{T}{\left({\dfrac {I}{C}+WHH^{T}}\right)}^{-1}WT, & \text {if }~ N < L \\ {\left({\dfrac {I}{C}+H^{T}WH}\right)}^{-1}H^{T}WT, & \text {if }~ N > L \\ \end{cases} \tag{18}\end{equation*}
\begin{equation*} f(x)=\begin{cases} \operatorname{sign}\, h(x)H^{T}{\left({\dfrac {I}{C}+WHH^{T}}\right)}^{-1}WT, & \text {if }~ N < L \\ \operatorname{sign}\, h(x){\left({\dfrac {I}{C}+H^{T}WH}\right)}^{-1}H^{T}WT, & \text {if }~ N > L \\ \end{cases} \tag{19}\end{equation*}
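A minimal NumPy sketch of the weighted solution follows. It assumes the standard WELM closed form, in which the sample weights W multiply both the Gram matrix and the targets T, and it replaces the binary sign decision with an argmax over classes for the multi-class case. The function names, toy features and labels are illustrative assumptions.

```python
import numpy as np


def welm_output_weights(H, T, w, C):
    """Weighted closed form; w holds the diagonal of W (one weight per sample)."""
    N, L = H.shape
    W = np.diag(w)
    if N < L:
        return H.T @ np.linalg.solve(np.eye(N) / C + W @ H @ H.T, W @ T)
    return np.linalg.solve(np.eye(L) / C + H.T @ W @ H, H.T @ W @ T)


def welm_predict(H_new, beta):
    """Multi-class decision: the class with the largest output h(x) beta."""
    return np.argmax(H_new @ beta, axis=1)


rng = np.random.default_rng(0)
H = np.tanh(rng.normal(size=(12, 4)))   # stand-in for hidden-layer outputs h(x_i)
y = np.array([0] * 9 + [1] * 3)         # imbalanced toy labels
T = np.eye(2)[y]                        # one-hot targets
counts = np.bincount(y)                 # q_k per class
w = 1.0 / counts[y]                     # first weighting scheme, W_ii = 1 / q_k

beta = welm_output_weights(H, T, w, C=10.0)
pred = welm_predict(H, beta)
```

With all weights equal to one, the function reduces to the unweighted ELM solution, so the weighting is the only difference between the two classifiers.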
C. Model Training
In the BC-WELM framework, the training stage and the testing stage proceed differently in order to identify biomedical event triggers successfully.
For the training process, we optimize all the parameters of our network. We apply cross entropy with L2 regularization as the loss function, which is defined as: \begin{equation*} J=-\sum _{i=1}^{C}y_{i}\log (y_{i}')+\lambda _{r}\Big(\sum _{\theta \in \Theta }\theta ^{2}\Big)\tag{20}\end{equation*}
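The loss in Eq. (20) can be evaluated for a single example as follows. The sketch assumes that the predicted distribution y' comes from a softmax over class scores and that Θ is given as a list of parameter arrays; both are assumptions for illustration, as are the function names.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a vector of class scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def training_loss(y_onehot, scores, params, lam):
    """Eq. (20): cross entropy plus L2 penalty with coefficient lam."""
    y_pred = softmax(scores)                               # y'
    cross_entropy = -np.sum(y_onehot * np.log(y_pred))     # first term
    l2_penalty = lam * sum(np.sum(p ** 2) for p in params)  # second term
    return cross_entropy + l2_penalty


# Toy check: uniform scores over 4 classes give cross entropy log(4);
# the penalty is 0.1 * (1^2 + 2^2) = 0.5.
y = np.array([0.0, 1.0, 0.0, 0.0])
J = training_loss(y, np.zeros(4), [np.array([1.0, 2.0])], lam=0.1)
```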
We apply the back-propagation method to find the gradients and update all the parameters during the training process. To avoid overfitting, a dropout strategy is employed, which randomly omits feature detectors on each training case.
For the testing process, we incorporate WELM with its loss function to prevent overfitting. The high-dimensional features extracted by contextual modeling and local modeling are fed to WELM after pooling. The event trigger type is then calculated by Eq. (19).
Experiments
In this section, we evaluate the performances of our proposed model BC-WELM for biomedical event trigger identification.
A. Experiment Preparation
In this section, we introduce the dataset, evaluation metrics and hyperparameter settings in detail.
1) Dataset
We conducted all the experiments on the MLEE [3] dataset to validate the effectiveness of biomedical event trigger identification. This dataset is composed of 295 event documents which consist of 2608 sentences and 6677 biomedical events. Table 1 shows the statistical information of our dataset. The biomedical events are divided into 19 subclasses as a multicategory classification task. The aim was to recognize the correct trigger sub-category of a biomedical event.
2) Evaluation Metric
To evaluate the performance of biomedical event trigger identification, we adopted Precision (P), Recall (R) and F-measure (F1) as the evaluation metrics. They are commonly used in other biomedical event tasks, so we can compare our results with other strong baselines.
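For reference, the three metrics can be computed from gold and predicted trigger labels as sketched below. The sketch assumes micro-averaging over trigger classes with the non-trigger tag excluded from the counts; the tag name `"O"`, the class names and the function name are hypothetical.

```python
def precision_recall_f1(gold, pred, negative="O"):
    """Micro-averaged P, R and F1 over trigger classes (non-trigger tag excluded)."""
    true_pos = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    n_pred = sum(1 for p in pred if p != negative)   # predicted triggers
    n_gold = sum(1 for g in gold if g != negative)   # annotated triggers
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: one of two predicted triggers is correct,
# and one of two annotated triggers is found.
gold = ["O", "A", "O", "B", "O"]
pred = ["O", "A", "B", "O", "O"]
p, r, f1 = precision_recall_f1(gold, pred)   # (0.5, 0.5, 0.5)
```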
3) Hyperparameters Settings
In our experiments, all word embeddings were initialized with GloVe vectors pre-trained on the PubMed corpus. The dimension of the word embedding was set to 300, and the dimensions of the entity, position and POS embeddings were set to 50, 20 and 20, respectively. All out-of-vocabulary words were initialized by sampling from the uniform distribution U(−0.1, 0.1). We applied dropout, set to 0.2, to avoid overfitting. The convolution layer used 100 filters, and the Bi-LSTM used 150 units. All weight matrices were initialized from the uniform distribution U(−0.1, 0.1) and all biases were set to zero. The mini-batch size was set to 32 instances.
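The settings above can be collected in one configuration object, sketched below. The class and field names are hypothetical, and the `token_dim` property assumes that the four embeddings are concatenated per token, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BCWELMConfig:
    word_dim: int = 300      # GloVe word embedding
    entity_dim: int = 50     # entity type embedding
    position_dim: int = 20   # position embedding
    pos_dim: int = 20        # part-of-speech embedding
    dropout: float = 0.2
    num_filters: int = 100   # convolution filters
    lstm_units: int = 150    # Bi-LSTM units as reported (per direction or total is not stated)
    init_range: float = 0.1  # weights and OOV words drawn from U(-0.1, 0.1)
    batch_size: int = 32

    @property
    def token_dim(self) -> int:
        # Assumption: the four embeddings are concatenated for each token.
        return self.word_dim + self.entity_dim + self.position_dim + self.pos_dim


config = BCWELMConfig()
```

Under the concatenation assumption, each token would be represented by a 390-dimensional vector.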
B. Model Comparisons
In order to comprehensively evaluate the performance of BC-WELM, we list some of the baseline approaches for model comparison. The baselines are as follows:
SVM uses an SVM-based classifier to detect the biomedical event trigger based on manual features [3].
CNN adopts the convolution neural network to model the local features and classify the event trigger labels [9].
LSTM uses the long short-term memory network to model the semantic context as the input of word embedding and entity type embedding [10].
GRU employs the gated recurrent unit network for semantic sentence modeling to reduce the training time and achieve state-of-the-art performance [10].
BC-S uses Bi-LSTM combined with CNN to extract features. Softmax is used in the classification.
BC-ELM uses Bi-LSTM combined with CNN to extract features. ELM is used in the classification.
BC-WELM uses Bi-LSTM combined with CNN to extract features. WELM is used in the classification to address the unbalanced data problem.
Table 2 shows the performance comparison of BC-WELM with strong baselines. From these results, we infer that:
The SVM approach obtains the lowest precision of all the baseline methods, because its hand-crafted features are complex and must be tuned specifically for detecting biomedical event triggers, whereas neural network methods learn features automatically. However, the recall of the SVM classifier is better than that of the other methods, which suggests a direction for future improvement.
Both LSTM and GRU outperform CNN, by 1.29% on F1 on average, because recurrent neural networks capture contextual information in biomedical event text. CNN outperforms SVM by 2.15% on F1 because it captures local information.
Our BC-WELM model takes a further step by combining contextual and local modeling with the balanced loss function of WELM. BC-WELM achieves the best performance, outperforming the strongest baseline by 1.71% on F1. With this design, the model learns contextual and local representations from the different embeddings. Moreover, both BC-ELM and BC-WELM outperform BC-S on all of the metrics. However, ELM does not target the imbalanced data problem, so it has only a limited effect on classification performance: BC-WELM outperforms BC-ELM by 1.62% on F1. This improvement verifies the effectiveness of BC-WELM for biomedical event trigger identification.
C. Detailed Analysis
1) Impact of Different Features for BC-WELM
In this section, we compare a series of models that differ in their fine-grained embeddings to verify the effectiveness of our BC-WELM model. All is our proposed model, which combines the contextual and local representations with a balanced classification function to detect biomedical event triggers. All-word replaces the GloVe word embedding with randomly initialized vectors. All-POS omits the POS embedding. All-position omits the position embedding. All-entity omits the entity type embedding. We use the same parameters for all experiments. The results are shown in Table 3.
From these results, we learn that: (1) Pre-trained word embedding outperforms random initialization, which confirms that GloVe provides a useful semantic space for detecting biomedical event triggers. (2) Triggers and event participants differ across event types, and POS embedding enables our structure to identify biomedical event triggers using semantic information. (3) Position embedding encodes the position relative to the candidate event trigger, capturing structural information about biomedical events that is a significant feature for improving performance. (4) Because there are multiple levels of biological organization, there are many types of event entities in a biomedical text, and entity embedding is a helpful factor for identifying biomedical event triggers. (5) None of the ablated models matches the full BC-WELM model, which demonstrates the effectiveness of combining all four embeddings.
2) Analysis of Category Wise Performance
To understand the behavior of BC-WELM in more detail, we analyze its fine-grained category-wise performance; the results are shown in Table 4. For comparability with Rahul et al. [14], we use micro-averages as the evaluation metrics in Table 4.
Generally, the results of our proposed BC-WELM model and the approach of Rahul et al. follow a similar distribution. Performance in the anatomical and molecular categories is clearly better than in the general and planned categories, which points to the importance of the training dataset scale and of the balanced classification function of WELM. In addition, BC-WELM is 1.87, 3.78, 1.55 and 5.07 percentage points higher on F1 than Rahul et al.'s results in the anatomical, molecular, general and planned categories, respectively. On every evaluation metric, our BC-WELM model exceeds the method of Rahul et al. The reason is that our approach captures contextual and local features and combines them with the balanced classification cost function of WELM to better detect biomedical event triggers.
3) Case Study
As demonstrated above, our model achieves state-of-the-art performance. We analyze the example in Figure 1 as a case study to show the effectiveness of the proposed model. We apply BC-WELM to find the biomedical event trigger type “Planned_process”. In this example, the common words “were”, “with” and “or” and the punctuation mark “.” contribute little to judging trigger types. Therefore, word information alone is far from sufficient for biomedical event trigger identification, and POS embedding, position embedding and entity embedding play an important role in this task. The trigger word “treated” is strongly associated with “Hamsters” and “DMBA”, and Bi-LSTM can better mine the relations between them. Furthermore, the combination of Bi-LSTM and CNN gives our model a deeper network structure, which helps extract latent semantic features from contextual and local information. In particular, our model exhibits excellent performance on the “Planned_process” trigger type, which contains a relatively large number of instances. This also suggests our model can outperform other models on imbalanced data sets.
Conclusion
In this paper, we present an effective end-to-end Bidirectional Long Short-Term Memory Convolution Neural Network Weighted Extreme Learning Machine (BC-WELM) for recognizing biomedical event triggers. This framework models the contextual representation and the local representation, and classifies them with a balanced classification function. Experimental results on a real-world dataset (MLEE) demonstrate the effectiveness of our proposed BC-WELM model.
Looking forward, we will try to combine different effective deep learning methods with useful biomedical event characteristics to identify biomedical event triggers and to address other biomedical tasks.