Introduction
An event is a specific occurrence of something that happens at a certain time and in a certain place, involving one or more participants, and it can frequently be described as a change of state [1]. The goal of event extraction is to detect event instances in text and, if any exist, to identify the event type as well as all of its participants and attributes. Although different event types may be defined with different arguments, event extraction can be summarized as obtaining structured representations of events from unstructured natural language, so as to help answer the “5W1H” questions (who, when, where, what, why and how) about real-world events reported in numerous text sources, such as news articles and social media posts.
As an important information extraction task in natural language processing (NLP), event extraction has many applications in diverse domains. For example, structured events can be directly used to expand knowledge bases, upon which further logical reasoning and inference can be performed [2], [3]. Event detection and monitoring have long been a focus of public affairs management for governments, as timely knowledge of the outbreak and evolution of popular social events helps the authorities respond promptly [4]–[8]. In the business and financial domain, event extraction can help companies quickly discover market responses to their products and infer signals for risk analysis and trading suggestions [9]–[11]. In the biomedical domain, event extraction can be used to identify alterations in the state of a biomolecule (e.g., a gene or protein) or interactions between two or more biomolecules, as described in natural language in the scientific literature, in order to understand physiological and pathogenesis mechanisms [12]. In short, many domains can benefit from advances in event extraction techniques and systems.
Despite its promising applications, event extraction remains a rather challenging task, as events have different structures and components, while natural languages often exhibit semantic ambiguities and varied discourse styles. Furthermore, event extraction is closely related to other NLP tasks, such as named entity recognition (NER), part-of-speech (POS) tagging and syntactic parsing, which can either boost or harm its performance, depending on how well these tasks perform and how their outputs are exploited. To promote the development and application of event extraction, many public evaluation programs have been conducted to provide task definitions, annotated corpora and open contests, which have attracted many researchers to contribute novel algorithms, techniques and systems. We next briefly introduce these well-known programs.
A. Public Evaluation Programs
The Message Understanding Conference (MUC) [13], [14], organized and sponsored by the Defense Advanced Research Projects Agency (DARPA), is generally recognized as the first public evaluation program for information extraction. It was held seven times from 1987 to 1997. The MUC aimed at extracting information from unstructured text and populating a slot-value structure in predefined schemas. Common slots include entities, attributes, entity relations and events. In 1997, DARPA, Carnegie Mellon University, Dragon Systems, and the University of Massachusetts at Amherst co-founded another public evaluation program, called Topic Detection and Tracking (TDT), to promote finding and following new events in a stream of broadcast news articles [15], [16]. Later on, the National Institute of Standards and Technology (NIST) established a complete evaluation system for the TDT program.
The Automatic Content Extraction (ACE) program is the most influential public evaluation program to date. It was proposed by NIST in 1999 and was incorporated into a new public evaluation program, the Text Analysis Conference (TAC), in 2009. From 2000 to 2004, the ACE was devoted to entity and relation detection and tracking; the event extraction task was added in 2005 [17]. Following the ACE, the Deep Exploration and Filtering of Text (DEFT) program of DARPA proposed the Entities, Relations, Events (ERE) standard for text annotation and information extraction. The Light ERE was defined as a simplified version of the ACE annotation in order to rapidly produce consistently labeled data. Subsequently, the Light ERE was extended to the more complex Rich ERE specification [18].
Moreover, event extraction has served as a mainstream task of the Knowledge Base Population (KBP) public evaluation program, which has been held four times since 2014. The KBP is now integrated into the TAC, with the aim of extracting information from a large text corpus to fill in missing elements of knowledge bases [19]. In addition, there are other public evaluation programs for event extraction in specific domains, such as the BioNLP in the biomedical domain [20] and the TimeBANK for extracting temporal information of events [21].
B. Summary of This Survey
This article provides an up-to-date survey of event extraction from text. There are some related survey articles on this task, yet each has a particular focus on a specific application domain. Hogenboom et al. [22], [23] reviewed text mining techniques for event extraction in various decision support systems. Vanegas et al. [12] mainly reviewed biomolecular event extraction, while Zhang et al. [24] mainly focused on open-domain event extraction. Others have focused on event extraction from social media, especially from Twitter [25], [26].
Compared with the aforementioned articles, we try to provide a more comprehensive survey and a systematic taxonomy of techniques for event extraction from text, not only covering its task definitions, data sources and performance evaluations, but also categorizing the main approaches from the viewpoint of its whole development history. Furthermore, we present and analyze the most representative methods in each class of techniques, especially their origins, basics, advantages and disadvantages. We also discuss promising directions for future research on event extraction.
The rest of the article is organized as follows. Section II introduces the main task definitions in both closed-domain and open-domain event extraction, and Section III introduces the most commonly used corpora. We categorize the main technical approaches into five groups. For closed-domain event extraction, these are the earlier approaches of event pattern matching in Section IV, machine learning methods in Section V, deep learning models in Section VI and semi-supervised learning schemes in Section VII; for open-domain event extraction, they are the unsupervised learning approaches in Section VIII. We also compare the extraction performance of the algorithms evaluated on the ACE corpus in Section IX. Finally, Section X concludes the survey with a discussion of future research directions.
Event Extraction Tasks
Event extraction aims at detecting the existence of an event reported in text and, if one exists, discovering event-related information from the text, such as the “5W1H” about the event (i.e., who, when, where, what, why and how). Sometimes, particular event structures are predefined, which include not only event types but also the roles of event arguments. In this case, event extraction needs not only to detect an event but also to extract the corresponding characters, words or phrases to fill in the given event structure, so as to output the event in a structured form. This is normally called closed-domain event extraction, as different domains may need different event structures. On the other hand, open-domain event extraction does not assume predefined event structures, and its main task is to detect the existence of an event in text. In many cases, it also extracts keywords about an event and clusters similar events.
A. Closed-Domain Event Extraction
Closed-domain event extraction uses a predefined event schema to discover and extract desired events of particular types from text. An event schema contains several event types and their corresponding event structures. We use the ACE [1] terminology to introduce an event structure as follows:
Event mention: a phrase or sentence describing an event, including a trigger and several arguments.
Event trigger: the main word that most clearly expresses an event occurrence, typically a verb or a noun.
Event argument: an entity mention, temporal expression or value that serves as a participant or attribute with a specific role in an event.
Argument role: the relationship between an argument and the event in which it participates.
Ahn [27] first proposed to divide the ACE event extraction task into four subtasks: trigger detection, event/trigger type identification, event argument detection, and argument role identification. For example, consider the following sentence:
Sentence 1: At daybreak on the 9th, the terrorists set off a truck bomb attack in Nazareth.
Sentence 1 contains an event of type “Conflict/Attack”. An event extractor should discover this event and identify its type by detecting the trigger word “attack” in the sentence and classifying it as the event type “Conflict/Attack”. It should next extract all arguments related to this event type from the text and identify their respective roles according to the predefined event structure.
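The structured output expected for Sentence 1 can be pictured with a minimal Python sketch. The class and field names are illustrative assumptions and are not part of the ACE specification; the arguments and their roles (Attacker, Instrument, Place, Time) are filled in by hand for illustration rather than produced by an extractor.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Argument:
    text: str  # span of the sentence that fills the role
    role: str  # argument role within the event


@dataclass
class EventMention:
    sentence: str
    trigger: str     # word that most clearly expresses the event
    event_type: str  # type assigned to the trigger
    arguments: List[Argument] = field(default_factory=list)

    def roles(self) -> Dict[str, str]:
        """Return a role -> argument text mapping."""
        return {a.role: a.text for a in self.arguments}


SENTENCE_1 = ("At daybreak on the 9th, the terrorists set off "
              "a truck bomb attack in Nazareth.")

# Hand-filled result (assumption); a real extractor would predict these values.
event = EventMention(
    sentence=SENTENCE_1,
    trigger="attack",
    event_type="Conflict/Attack",
    arguments=[
        Argument("the terrorists", "Attacker"),
        Argument("a truck bomb", "Instrument"),
        Argument("Nazareth", "Place"),
        Argument("At daybreak on the 9th", "Time"),
    ],
)

print(event.event_type, event.trigger, event.roles())
```

Keeping the trigger, the event type and the role-labeled arguments in one record mirrors the four subtasks: each subtask fills in one part of the structure.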
Fig. 1 illustrates closed-domain extraction of structured events. The left part shows some predefined event schemas in ACE 2005, while the right part shows the extraction results of the four subtasks: trigger detection, event type identification, argument detection and argument role identification.
Fig. 1. Illustration of closed-domain event extraction. The left part shows some predefined event schemas in ACE 2005, while the right part shows the extraction results of the four subtasks: trigger detection, event type identification, argument detection and argument role identification.
Following definitions similar to those of ACE [1], many other event types and structures have been defined and adopted in the field, such as those defined by ERE and TAC-KBP [17]–[19]. Besides organizations, individual researchers have also defined event types and structures for specific domains. Petroni et al. [28] defined structures for breaking events, including 7 event types such as “Floods”, “Storms” and “Fires”, as well as their “5W1H” attributes, so as to extract breaking events from news reports and social media. Yang et al. [29] focused on extracting events in the financial domain to help with tasks such as stock market prediction and investment decision support. They defined 9 financial event types, such as “Equity Pledge” and “Equity Freeze”, as well as their corresponding arguments with different roles. Han et al. [30] defined a taxonomy of business events containing 8 event types and 16 subtypes, together with their corresponding arguments, such as tense, time, result and entity.
B. Open-Domain Event Extraction
Without predefined event schemas, open-domain event extraction aims at detecting events in text and, in most cases, also clustering similar events via extracted event keywords. Event keywords are the words or phrases that best describe an event, and sometimes keywords are further divided into triggers and arguments.
The TDT public evaluation program aims at automatically spotting previously unreported events, or following the progress of previously spotted events, in news articles [15]. Besides events, the TDT also defines a story as a segment of a news article describing a specific event, and a topic as a set of events in articles that are strongly related to some real-world topic. Based on these definitions, it defines the following tasks:
Story segmentation: detecting the boundaries of a story from news articles.
First story detection: detecting the story that discusses a new topic in the stream of news.
Topic detection: grouping the stories based on the topics they discuss.
Topic tracking: detecting stories that discuss a previously known topic.
Story link detection: deciding whether a pair of stories discuss the same topic.
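As a rough illustration of the last task, story link detection can be approximated by comparing the word distributions of two stories. The sketch below uses cosine similarity over raw term counts with a fixed threshold; the similarity measure, the threshold value and the example stories are all assumptions made for illustration and are not prescribed by the TDT program.

```python
import math
import re
from collections import Counter


def term_counts(text: str) -> Counter:
    """Lowercased word counts of a story."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def same_topic(story_a: str, story_b: str, threshold: float = 0.5) -> bool:
    """Decide whether two stories are linked (threshold is an assumption)."""
    return cosine(term_counts(story_a), term_counts(story_b)) >= threshold


# Hypothetical stories, written for this example only.
story_1 = "A truck bomb exploded in Nazareth"
story_2 = "The bomb in Nazareth exploded near a truck"
story_3 = "The team won the football match"

print(same_topic(story_1, story_2), same_topic(story_1, story_3))
```

The same pairwise comparison could serve as a building block for first story detection, by flagging a story whose similarity to all earlier stories stays below the threshold.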
Besides the TDT tasks, many other studies have been conducted on detecting and clustering open-domain events from news articles [6], [31]–[34]. For example, the Joint Research Centre of the European Commission investigated extracting violent events with keywords such as killed, injured and kidnapped from online news for global crisis surveillance [6], [31]. Yu and Wu [33] aggregated news articles about the same event into a topic-centered set, while Liu et al. [34] clustered news articles according to daily significant events in areas such as politics, economics, society, sports and entertainment.
Some work has focused on sentence-level event detection and clustering [35]–[37]. For example, Naughton et al. [35] grouped sentences in news articles that refer to the same event, removing non-event sentences before the clustering process. They used a set of news stories collected from different sources describing events related to the Iraq war. Moreover, they designed clustering labels, such as terrorist attack, bombing, shooting and air attack, for grouping event-related sentences. Besides event detection and clustering, Wang et al. [36] also proposed to extract keywords for each event, such as its type, location, time and people.
Beyond newswire articles, online social media such as Twitter and Facebook provide abundant and timely information about diverse types of events. Detecting and extracting events from social media has recently become an important task [28], [38]–[43]. It is worth noting that, as posts in social networks are informal texts with many abbreviations, misspellings and grammatical errors, extracting events from such posts is more challenging than extracting events from news articles.
Although detecting and clustering events are the main tasks in open-domain event extraction, some researchers have also proposed to further construct event schemas from the clustered event-related sentences and documents by assigning each event cluster an event type label as well as one or more event attribute labels [44]–[52]. Note that such cluster labels are better understood as a kind of semantic synthesis from the keywords of each cluster, rather than as predefined labels with a clear structure like those in closed-domain event extraction.
Event Extraction Corpus
This section introduces the corpus resources for event extraction tasks. Generally, public evaluation programs provide several corpora for evaluating event extraction. These corpora are manually annotated according to the task definitions and are also used for model training and validation in machine learning methods. Sample annotation is completed by professionals or experts with domain knowledge, so annotated samples can be regarded as carrying ground truth labels. However, as the annotation process is cost-prohibitive, many public corpora are small in size and low in coverage.
A. The ACE Event Corpus
The ACE program [1] provided annotated data and evaluation tools for various extraction tasks, including entity, time, value, relation and event extraction. Entities in ACE fall into 7 types (person, organization, location, geo-political entity, facility, vehicle and weapon), each with a number of subtypes. Time is annotated according to the TIMEX2 standard [53], [54], which is a rich specification language for event and temporal expressions in natural language text. Every text sample was annotated by two independent annotators, and a senior annotator adjudicated the discrepancies between their versions.
Events in the ACE corpus have complex structures and arguments involving entities, times and values. The ACE 2005 event corpus defines 8 event types and 33 subtypes, with each event subtype corresponding to a set of argument roles. There are in total 36 argument roles for all event subtypes. In most studies based on the ACE corpus, the 33 event subtypes are treated separately, without considering their hierarchical structure. Table 1 lists these event types and their corresponding subtypes. An example of an annotated event is provided in Fig. 2. The ACE 2005 corpus contains in total 599 annotated documents and about 6000 labeled events, including English, Arabic and Chinese events from different media sources such as newswire articles and broadcast news. Table 2 provides their source statistics.
Fig. 2. Examples of event annotation. In Chinese, events are annotated at the character level with BIO labels (begin/inside/outside).
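Character-level BIO labeling can be sketched as follows. The example sentence, the label name "Attack" and the helper function are hypothetical and are not taken from the ACE corpus; the sketch only shows how one annotated span maps to per-character tags, with B marking the first character of the span, I the remaining characters of the span, and O every character outside it.

```python
from typing import List


def bio_tags(text: str, start: int, end: int, label: str) -> List[str]:
    """Tag the characters text[start:end] as one span of type `label`."""
    tags = ["O"] * len(text)
    for i in range(start, end):
        tags[i] = ("B-" if i == start else "I-") + label
    return tags


# Hypothetical sentence meaning "Terrorists attacked Nazareth".
sentence = "恐怖分子袭击了拿撒勒"
trigger = "袭击"  # the two-character trigger "attack"
start = sentence.index(trigger)
tags = bio_tags(sentence, start, start + len(trigger), "Attack")

for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Labeling characters rather than words avoids a dependence on Chinese word segmentation, since a trigger may not align with the boundaries produced by a segmenter.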
B. The TAC-KBP Corpus
The event nugget detection task in TAC-KBP focuses on detecting explicit mentions of events with their types and subtypes as defined in Rich ERE. The TAC-KBP 2015 corpus provided by the Linguistic Data Consortium (LDC) includes 158 documents as a prior training set and 202 additional documents as a test set for the formal evaluation, drawn from newswire articles and discussion forums [55]. The event types and subtypes in TAC-KBP (Rich ERE) are defined with reference to the ACE corpus and comprise 9 event types and 38 subtypes. In addition, each event mention must be assigned one of three REALIS values: ACTUAL (actually occurred), GENERIC (without a specific time or place) and OTHER (non-generic events, such as failed events, future events and conditional statements) [19]. The TAC-KBP 2015 corpus covers only English; Chinese and Spanish were added in TAC-KBP 2016 for all tasks.
C. The TDT Corpus
For open-domain event extraction, the LDC also provides a series of corpora to support TDT research, from TDT-1 to TDT-5, including both text and speech in both English and Chinese (Mandarin) [15], [56]. Each TDT corpus contains millions of news stories annotated with hundreds of topics (events), collected from multiple sources such as newswire and broadcast articles. All the audio materials are transcribed into text by the LDC. Each story-topic pair is assigned a tag value of YES if the story discusses the target topic, or BRIEF if that discussion comprises less than 10% of the story. Otherwise, the (default) tag value is NO.
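The tagging rule can be written down as a small function. In the corpus the tags are assigned by human annotators; the numeric input below (the fraction of a story devoted to the topic) and the function name are hypothetical, used only to make the YES/BRIEF/NO rule explicit.

```python
def story_topic_tag(on_topic_fraction: float,
                    brief_threshold: float = 0.10) -> str:
    """Map the share of a story that discusses a topic to a TDT-style tag."""
    if not 0.0 <= on_topic_fraction <= 1.0:
        raise ValueError("on_topic_fraction must lie in [0, 1]")
    if on_topic_fraction == 0.0:
        return "NO"      # default: the story does not discuss the topic
    if on_topic_fraction < brief_threshold:
        return "BRIEF"   # the topic takes up less than 10% of the story
    return "YES"         # the story discusses the topic


for fraction in (0.0, 0.05, 0.6):
    print(fraction, story_topic_tag(fraction))
```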
D. Other Domain-Specific Corpora
Besides the aforementioned well-known corpora, some domain-specific event corpora have also been established and released. The BioNLP Shared Task (BioNLP-ST) is defined for fine-grained biomolecular event extraction from scientific documents in the biomedical domain, and it compiled various manually annotated biological corpora, including the GENIA event corpus, the BioInfer corpus, the Gene regulation event corpus, the GeneReg corpus and PPI corpora [12], [20]. The TERQAS (Time and Event Recognition for Question Answering Systems) workshop built a corpus, called TimeBANK, which was annotated for events, times and temporal relations from a wide variety of media sources for breaking news event extraction [21]. Meng et al. [57] also annotated breaking news events reported in Chinese, producing the CEC (Chinese Event Corpus). Other domain-specific corpora include the MUC series corpora for event extraction in the fields of military intelligence, terrorist attacks, chip technology and finance [13], [14], and Ding et al.’s corpus for event extraction in the field of music [58].
Event Extraction Based on Pattern Matching
The earlier approaches to event extraction are based on pattern matching: they first construct specific event templates and then perform template matching to extract an event with a single argument from text. As illustrated in Fig. 3, event templates can be constructed from raw text or annotated text, yet both require professional knowledge. In the online extraction phase, an event and its argument are extracted if they match a predefined template.
Fig. 3. Illustration of event pattern construction and event extraction based on pattern matching.
A. Manual Pattern Construction
We first review some typical pattern-based event extraction systems, which employ experts with professional knowledge to manually construct event patterns for different application domains. The first pattern-based extraction system can be dated back to the AutoSlog, developed in 1993 by Riloff et al. [59], which was a domain-specific system for extracting terrorist events. The AutoSlog exploited a small set of linguistic patterns and a manually annotated corpus to obtain event patterns. As presented in Table 3, 13 linguistic patterns were defined in total, such as “<subject> passive-verb”, meaning a grammatical element acting as the subject followed by a verb in passive form. Note that linguistic patterns are distinct from event patterns in AutoSlog: the linguistic patterns are used to automatically derive event patterns from the manually annotated corpus, while the event patterns are used for event extraction. Moreover, the AutoSlog was designed to extract an event with a single event argument, so its corpus was annotated with a single argument for each event. For example, in the following sentence, the phrase “public buildings” is annotated as an argument with the “target” role in an event of the “bombing” type.
Sentence 2: In La Oroya, Junin department, in the central Peruvian mountain range, public buildings (target, bombing) were bombed and a car-bomb was detonated.
Specifically, the AutoSlog employed a syntactic analyzer, CIRCUS [60], to identify the syntactic constituents of each sentence in the manually annotated corpus, such as subjects, predicates, objects and prepositions. Subsequently, it generates a trigger word dictionary and concept nodes, which can be viewed as event patterns involving an event trigger, event type, event argument and argument role. In Sentence 2, the phrase “public buildings” is identified as a subject by CIRCUS. Since it is followed by the passive verb “bombed”, it matches the predefined linguistic pattern “<subject> passive-verb”. As a result, the AutoSlog adds “bombed” to the trigger word dictionary and generates an event pattern “<target> was bombed” of the bombing type.
In the process of pattern matching (event extraction), the AutoSlog first uses the trigger word dictionary to locate candidate event sentences, and then analyzes the candidate sentences with CIRCUS. It next matches the syntactic features surrounding the trigger word (i.e., the output of CIRCUS) against the event patterns to extract the argument of the event and its role. For example, given a predefined event pattern “took <victim>” of the kidnapping type, its trigger word “took” would locate the candidate sentence “they took 2-year-old Gilberto Molasco, son of Patricio Rodriguez”. The syntactic analysis by CIRCUS further identifies “Gilberto Molasco” as a dobj (direct object). Finally, combining the analysis results with the predefined event pattern, the AutoSlog extracts an event of the kidnapping type with the trigger word “took” and the victim argument “Gilberto Molasco”.
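The matching step can be sketched in a few lines of Python. This is a simplified illustration, not the AutoSlog implementation: the `EventPattern` class and the `extract` function are hypothetical, the syntactic analysis is supplied as a hand-written dictionary standing in for the output of a parser such as CIRCUS, and the sketch does not distinguish active from passive verb forms.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class EventPattern:
    trigger: str     # entry in the trigger word dictionary
    event_type: str  # event type signalled by the trigger
    slot: str        # syntactic slot that holds the argument
    role: str        # role assigned to that argument


PATTERNS = (
    EventPattern("bombed", "bombing", "subject", "target"),  # "<target> was bombed"
    EventPattern("took", "kidnapping", "dobj", "victim"),    # "took <victim>"
)


def extract(sentence: str, parse: Dict[str, str],
            patterns: Sequence[EventPattern] = PATTERNS) -> List[dict]:
    """Return one event per pattern whose trigger and slot both match."""
    words = set(re.findall(r"[\w-]+", sentence.lower()))
    events = []
    for p in patterns:
        # Step 1: the trigger dictionary locates a candidate sentence.
        # Step 2: the pattern's slot is looked up in the syntactic analysis.
        if p.trigger in words and p.slot in parse:
            events.append({"type": p.event_type, "trigger": p.trigger,
                           "role": p.role, "argument": parse[p.slot]})
    return events


sentence = "they took 2-year-old Gilberto Molasco, son of Patricio Rodriguez"
parse = {"subject": "they", "dobj": "Gilberto Molasco"}  # stand-in parser output
print(extract(sentence, parse))
```

Because each pattern names exactly one slot and one role, every match yields an event with a single argument, which reflects the single-argument design described above.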
Inspired by the AutoSlog system, many pattern-based event extraction systems have been developed for different application domains, including biomedical event extraction [61]–[64] and financial event extraction [65], [66]. For example, Cohen et al. [61] extracted biomedical events via biomedical ontology analysis by exploiting the OpenDMAP semantic parser [67], which provides various high-quality ontological templates for biomedical concepts and their properties. Casillas et al. [62] applied the Kybots (Knowledge Yielding Robots) developed by the KYOTO project to extract biomedical events. The Kybots system follows a rule-based approach in which event patterns are built manually. In the financial domain, Borsje et al. [65] proposed a semi-automatic method for extracting financial events from news feeds based on lexico-semantic patterns. Arendarenko and Kakkonen [66] developed an ontology-based event extraction system, called BEECON, to extract business events from online news. Since constructing event patterns often requires professional knowledge, these pattern-based extraction systems are normally designed for a specific application domain. For applications involving multiple domains, Cao et al. [68] proposed to build event patterns by combining the ACE corpus with other expert-defined patterns, such as the TABARI corpus, to produce more event patterns.
Besides the many systems designed for English event extraction, some pattern-based extraction systems have been designed for other languages [69], [70]. Tran et al. [69] developed a real-time extraction framework, called VnLoc, which combines lexico-semantic rules with machine learning to extract events from Vietnamese news. Saroj et al. [70] designed a rule-based system for extracting events from Indian-language newswire and social media text. Valenzuela-Escárcega et al. [71] observed that there is no standard language for expressing patterns, which could hinder the development of pattern-based event extraction. They proposed a domain-independent, rule-based framework and designed a rapid development environment for event extraction to reduce the cost for newcomers.
Although pattern-based event extraction can achieve high extraction accuracy, pattern construction suffers from a scalability problem, in that the patterns are often dependent on the application domain. To increase scalability, Kim et al. [72] designed the PALKA (Parallel Automatic Linguistic Knowledge Acquisition) system to automatically acquire event patterns from an annotated corpus. They defined a specialized representation for event patterns, called the FP-structure (Frame-Phrasal pattern structure). In PALKA, event patterns are constructed in the form of FP-structures and further tuned through the generalization of semantic constraints. Furthermore, FP-structures extend the single-argument event pattern to event patterns with multiple arguments. For example, in addition to the “target” argument, the “bombing” event in Sentence 2 may have other potential arguments, such as “agent”, “patient”, “instrument” and “effect”. The PALKA also converts multiple clauses in one sentence into a multi-sentence form for extracting multiple events. In addition, Aone and Ramos-Santacruz [73] developed a large-scale event extraction system, called REES, which can extract up to 100 event types by establishing event and relation ontology schemas for generating new event patterns, yet the new patterns are still subject to manual validation through the graphical interface of their system.
B. Automatic Pattern Construction
Because event patterns manually constructed by experts with professional knowledge are well defined and of high quality, event extraction based on pattern matching can often achieve high accuracy in domain-specific applications. However, manually constructing event patterns is time-consuming and labor-intensive, leading not only to a scalability problem in producing large pattern databases but also to an adaptability problem when applying event patterns to other domains. Some researchers have therefore proposed weakly supervised or bootstrapping methods to obtain more patterns automatically, using only a small pre-classified training corpus or a few seed patterns.
Riloff et al. [74] developed the AutoSlog-TS event extraction system, an extension of their previous AutoSlog system, to enable automatic pattern construction. In particular, based on the thirteen linguistic patterns defined in the AutoSlog system, it applies the syntactic analyzer CIRCUS to obtain new event patterns from an untagged corpus. The following example illustrates its pattern construction process.
Sentence 3: World trade center was bombed by terrorists.
Via syntactic analysis, we obtain the subject “World trade center”, the verb phrase “was bombed” and the prepositional phrase “by terrorists”. Combining these with the linguistic patterns “<subject> passive-verb” and “passive-verb prep <np>”, we obtain two potential event patterns: “<x> was bombed” and “bombed by <y>”. Whether a new event pattern is retained depends on its score, which is computed from statistics over both domain-dependent and domain-independent documents. Experiments on the MUC-4 terrorism dataset validated that the AutoSlog-TS dictionary performs comparably to a hand-crafted dictionary. Although some manual intervention is still required, the AutoSlog-TS can significantly reduce the workload of creating a large training corpus.
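The two steps for Sentence 3, candidate generation and scoring, can be sketched as follows. The clause dictionary is a hand-written stand-in for the syntactic analysis, and the function names are hypothetical. The scoring formula is an assumption chosen for illustration (the share of a pattern's occurrences found in domain-relevant documents, weighted by the logarithm of that count); the exact ranking function of AutoSlog-TS should be checked against [74].

```python
import math
from typing import Dict, List


def candidate_patterns(clause: Dict[str, str]) -> List[str]:
    """Instantiate two linguistic patterns from one analyzed passive clause."""
    patterns = []
    if clause.get("subject"):
        # "<subject> passive-verb"
        patterns.append(f"<x> {clause['aux']} {clause['verb']}")
    if clause.get("prep") and clause.get("np"):
        # "passive-verb prep <np>"
        patterns.append(f"{clause['verb']} {clause['prep']} <y>")
    return patterns


def score(relevant_freq: int, total_freq: int) -> float:
    """Illustrative score: relevance rate times log2 of the relevant count."""
    if relevant_freq <= 0 or total_freq <= 0:
        return 0.0
    return (relevant_freq / total_freq) * math.log2(relevant_freq)


# Hand-written analysis of Sentence 3 (stand-in for parser output).
clause = {"subject": "World trade center", "aux": "was", "verb": "bombed",
          "prep": "by", "np": "terrorists"}
print(candidate_patterns(clause))

# Hypothetical counts: (occurrences in relevant documents, occurrences overall).
print(score(8, 8), score(1, 8))
```

Under this assumed score, a pattern seen only in relevant documents and seen often ranks highest, while a pattern that appears mostly in domain-independent text is pushed towards zero.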
Many other approaches have been proposed to facilitate automatic pattern construction by designing machine learning algorithms that learn new patterns from a few seed patterns [6], [8], [31], [75]–[84]. For example, the ExDisco system proposed by Yangarber et al. [75] uses a small set of seed patterns, instead of linguistic patterns, to obtain potential event patterns. Besides seed patterns, seed terms and seed event instances have also been exploited to construct potential event patterns [78], [81]–[84]. The NEXUS system developed by Piskorski et al. [8], [31] learns candidate event patterns from a small annotated corpus via an entropy maximization-based machine learning algorithm, yet the candidate patterns are still manually checked and modified before being included in the pattern database. Cao et al. [79] proposed a pattern construction technique that uses active learning to import frequent patterns from external corpora. Li et al. [80] proposed a minimally supervised model for Chinese event extraction from multiple views, including the pattern similarity view (PSV), the semantic relationship view (SRV) and the morphological structure view (MSV). Each view can also be used to extract event patterns. The PSV ranks each candidate pattern according to its structural similarity to existing ones, and then accepts the top-ranked new patterns. In contrast, the SRV captures relevant event mentions from relevant documents, while the MSV is incorporated to infer new patterns.
Pattern-based event extraction has been applied in many industrial applications because of the high extraction accuracy achieved with high-quality event patterns established by domain experts. However, constructing event patterns on a large scale is cost-prohibitive. As such, recent years have witnessed the rapid development of various machine learning-based event extraction techniques.
Event Extraction Based on Machine Learning
This section reviews approaches that use traditional machine learning algorithms, such as the support vector machine (SVM) and maximum entropy (ME), for event extraction. Approaches that use neural network techniques (so-called deep learning) are reviewed in the next section. Note that both are supervised learning techniques and require training data with ground truth labels, which are normally annotated by experts with professional knowledge. How events and their arguments are annotated has been introduced in Section III.
The basic idea of these machine learning approaches is much the same: learn classifiers from training data and apply them to extract events from new text. Fig. 4 illustrates the overall structure of event extraction based on machine learning. Event extraction based on machine learning can be generally divided into two stages and four subtasks:
Stage I includes two subtasks: (1) trigger detection, i.e., detect whether an event exists in a text and, if so, the corresponding event trigger; and (2) trigger/event type identification, i.e., classify the trigger/event as one of the given event types.
Stage II includes two subtasks: (3) argument detection, i.e., detect which entities, times and values are arguments; and (4) argument role identification, i.e., classify the arguments’ roles according to the identified event type.
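The two stages can be sketched as a small pipeline. The lookup tables below are hypothetical stand-ins for trained classifiers, and the entity, time and value mentions are assumed to be given as input together with coarse types (PER, WEA, GPE, TIME); none of these names comes from the surveyed systems.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon and role table standing in for trained classifiers.
TRIGGER_TYPES = {"attack": "Conflict/Attack"}
ROLE_TABLE = {
    "Conflict/Attack": {"PER": "Attacker", "WEA": "Instrument",
                        "GPE": "Place", "TIME": "Time"},
}


def stage_one(tokens: List[str]) -> Optional[Tuple[str, str]]:
    """Subtasks (1) and (2): find a trigger and assign its event type."""
    for tok in tokens:
        event_type = TRIGGER_TYPES.get(tok.lower())
        if event_type is not None:
            return tok, event_type
    return None


def stage_two(event_type: str,
              mentions: List[Tuple[str, str]]) -> Dict[str, str]:
    """Subtasks (3) and (4): keep mentions that are arguments, label roles."""
    roles = ROLE_TABLE.get(event_type, {})
    return {roles[kind]: text for text, kind in mentions if kind in roles}


def extract_event(tokens: List[str],
                  mentions: List[Tuple[str, str]]) -> Optional[dict]:
    found = stage_one(tokens)
    if found is None:
        return None  # no event detected, so Stage II is skipped
    trigger, event_type = found
    return {"trigger": trigger, "type": event_type,
            "arguments": stage_two(event_type, mentions)}


tokens = ("At daybreak on the 9th , the terrorists set off "
          "a truck bomb attack in Nazareth .").split()
mentions = [("the terrorists", "PER"), ("truck bomb", "WEA"),
            ("Nazareth", "GPE"), ("daybreak on the 9th", "TIME")]
result = extract_event(tokens, mentions)
print(result)
```

Since Stage II conditions on the event type chosen in Stage I, a mistake in trigger classification propagates to the argument roles in this sequential arrangement.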
Illustration of event extraction based on machine learning. Word/phrase features are obtained through feature engineering and then input to classifiers, which output event triggers and arguments.
For either pipeline or joint execution, training classifiers first requires some feature engineering work, i.e., extracting features from texts as inputs to the classification model. So in this section, we first introduce some common features used for classifier training; then we review and compare the pipeline and joint classification approaches in the literature.
A. Text Feature for Learning Models
The most commonly used text features can be generally divided into three types: lexical, syntactic and semantic features. Most of them can be obtained via open-source NLP tools.
Some commonly used lexical features include: (1) full word, lowercase word and proximity word; (2) lemmatized word, which reduces the different forms of a single word to its root form. For example, the word computers is an inflected form of computer; (3) POS tag, which marks up a word in a corpus as corresponding to a particular part of speech such as noun, verb, adjective, adverb, etc.
The syntactic features are obtained from dependency parsing, which is to work out the lexical dependency relation structure of sentences. In brief, it creates edges between words in the sentences denoting different types of relations and organizes them as a tree structure. Some commonly used syntactic features include: (1) the label of dependency path; (2) the dependency word and its lexical features; (3) the depth of candidate word in a dependency tree.
Some commonly used semantic features include: (1) synonyms in linguistic dictionaries and their lexical features; (2) event and entity type features, which are often used in argument identification.
Each feature can be represented as a binary vector based on the bag-of-words model. Feature engineering is about how to select the most important features and how to integrate them into a high-dimensional vector to represent each word in a sentence. Finally, text features together with their corresponding labels from training datasets are used to train event extraction classifiers.
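As a rough illustration of the binary representation, the following sketch turns a few lexical features of each token into a 0/1 vector. The feature names, the helper functions and the example POS tags are assumptions made for this example, not a prescribed feature set.

```python
def token_features(tokens, pos_tags, i):
    """Collect a few lexical features for token i (word, lowercase, POS, neighbours)."""
    feats = {
        f"word={tokens[i]}",
        f"lower={tokens[i].lower()}",
        f"pos={pos_tags[i]}",
    }
    if i > 0:
        feats.add(f"prev={tokens[i - 1].lower()}")
    if i + 1 < len(tokens):
        feats.add(f"next={tokens[i + 1].lower()}")
    return feats


def build_feature_index(feature_sets):
    """Assign one vector dimension to every distinct feature seen in training."""
    names = sorted({f for feats in feature_sets for f in feats})
    return {name: dim for dim, name in enumerate(names)}


def to_binary_vector(features, index):
    """Set a dimension to 1 if the feature is present; unseen features are ignored."""
    vec = [0] * len(index)
    for f in features:
        dim = index.get(f)
        if dim is not None:
            vec[dim] = 1
    return vec


tokens = ["Troops", "attacked", "the", "village"]
pos_tags = ["NNS", "VBD", "DT", "NN"]  # assumed tagger output
feature_sets = [token_features(tokens, pos_tags, i) for i in range(len(tokens))]
index = build_feature_index(feature_sets)
vectors = [to_binary_vector(feats, index) for feats in feature_sets]
```

Each vector has one dimension per distinct feature, so the representation grows with the vocabulary and quickly becomes sparse.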
B. Pipeline Classification Model
The pipeline classification normally trains a set of independent classifiers, one for each subtask; yet the output of one classifier can also serve as a part of the input to its successive classifier. To train classifiers, diverse features can be used, including local features within sentences and global features within documents. According to how local and global features are applied for classifier training, we divide pipeline classification approaches into sentence level event extraction and document level event extraction.
1) Sentence Level Event Extraction
In sentence level event extraction, a sentence is first tokenized into discrete tokens, and then each token is represented by a feature vector based on the result of feature engineering. In English, a token is normally a word; yet in some languages, a token could be one character or multiple characters forming a word or a phrase. Classifiers are trained from annotated text in a corpus and then used to determine whether a token is a trigger word (or event argument) and, if so, its event type (or argument role).
David Ahn [27] presented a typical pipeline processing framework consisting of two consecutive classifiers: the first one, called TiMBL, applies a nearest neighbor learning algorithm for detecting trigger(s); the second classifier, called MegaM, adopts a maximum entropy learner for identifying argument(s). To train the TiMBL classifier, lexical features, WordNet features, context features, dependency features and related entity features are used; the features of trigger word, event type, entity mention, entity type, and dependency path between trigger word and entity are used for training the MegaM classifier. Many different pipeline classifiers have been proposed, and they are trained with diverse types of features. For example, Chieu and Ng [85] added unigrams, bigrams, etc. as features and employed a maximum entropy classifier. In the biomedical domain, more domain-specific features and professional knowledge are exploited for classifier training, like frequency features, token features, path features, etc. [86]–[93].
Some researchers have proposed to integrate pattern matching into the machine learning framework [94]–[96]. As reviewed in the previous section, event patterns can provide more precise event structures with explicit relations between some particular triggers and their associated arguments, though such relations are manually designed and offer limited extensibility. Grishman et al. [94], [97] proposed to first perform pattern matching so as to preassign some potential event types, and then apply a classifier to identify the remaining event mentions. The two approaches are complementary to each other to augment event extraction. Besides, pattern matching can also be executed after a machine learning classifier has detected a trigger [95]. Since event arguments are closely related to the trigger type, applying some well-established trigger-argument relations can help improve argument identification. Furthermore, event patterns can also be merged with token features as input for machine learning classifiers [96].
As shown in Fig. 2, sentence tokenization is a trivial task in some languages, like English, as words are explicitly separated by delimiters. However, in some other languages, like Chinese [98]–[100] and Japanese [101], a sentence consists of consecutive characters without delimiters to separate words. Therefore, word segmentation is normally required to first divide a sentence into discrete tokens (words/phrases). As word segmentation is performed independently of the event extraction task, segmentation errors, if any, can be propagated to the downstream tasks and degrade their performance [99], [100]. Furthermore, due to the ambiguity of natural language, it is likely that after word segmentation, a word contains two or more triggers, or a trigger is divided into two or more words [99], [100].
To solve such tokenization problems, Zhao et al. [98] proposed, for each word after sentence segmentation, to first obtain potential triggers, each with its corresponding event type, from a dictionary of synonyms. Chen and Ji [99] proposed to employ both word-level and character-level trigger labeling strategies. Furthermore, they proposed to use character-level features, including the current character, its previous and next characters, etc., to train a maximum entropy Markov model classifier for trigger detection. Li et al. [100] employed compositional semantics inside triggers and discourse consistency between trigger mentions to augment Chinese trigger identification. Specifically, a verb is a trigger if its components (one or more Chinese characters) are labeled as Chinese triggers. For discourse consistency, a single character of a verb, if it belongs to labeled triggers, is merged with its previous and next characters to form candidate triggers.
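The character-level view can be illustrated with a short sketch that builds window features for each character and derives per-character labels from annotated trigger spans. The boundary symbols, the B/I/O label scheme and both function names are assumptions for illustration and are not taken from the cited systems.

```python
def char_window_features(sentence, i):
    """Features for character i: the character itself plus its previous and next character."""
    prev_c = sentence[i - 1] if i > 0 else "<BOS>"               # assumed boundary symbol
    next_c = sentence[i + 1] if i + 1 < len(sentence) else "<EOS>"
    return {"cur": sentence[i], "prev": prev_c, "next": next_c}


def bio_labels(length, trigger_spans):
    """Label each character B (trigger start), I (inside a trigger) or O (outside).

    trigger_spans holds (start, end) character offsets with an exclusive end.
    """
    labels = ["O"] * length
    for start, end in trigger_spans:
        labels[start] = "B"
        for i in range(start + 1, end):
            labels[i] = "I"
    return labels


sentence = "他们袭击了村庄"   # "They attacked the village"
spans = [(2, 4)]              # the trigger 袭击 ("attack") covers characters 2 and 3
features = [char_window_features(sentence, i) for i in range(len(sentence))]
labels = bio_labels(len(sentence), spans)
```

Because the labels are assigned per character, a trigger no longer has to coincide with a word produced by the segmenter.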
2) Document Level Event Extraction
In the aforementioned approaches, only local information of the words or phrases within a sentence and sentence-level contexts are applied to train classifiers. However, if we put the event extraction task, even for extracting an event from only one sentence, against a larger background, like a document with multiple sentences or a collection with multiple documents, much global information is ready to be exploited to improve the extraction accuracy. For document level event extraction, two key design issues are: what kind of global information can be used; and how to apply it to assist event extraction. For the first design issue, global information, like word sense, entity type and argument mention, can be mined from cross-document, cross-sentence, cross-event and cross-entity inferences. For the second, global information can be used either as a complementary module to local classifiers or as global features in local classifiers.
Global information can be exploited to build an additional inference model to augment local classifiers through evaluating the confidence of local classifiers’ outputs [102]–[105]. Ji and Grishman [102] observed that event arguments would maintain some consistency across sentences and documents, like word sense consistency in different sentences and related documents, and consistency of arguments and roles across different mentions of the same or related events. To exploit such observations, they proposed to establish two global inference rules for a cluster of topically-related documents, namely, one trigger sense per cluster and one argument role per cluster, to help improve the extraction confidence of sentence-level classifiers. Liao and Grishman [103] later extended such applications of global information inference and further proposed to apply document-level cross-event information. Liu et al. [105] explored two types of global information, namely, event-event association and topic-event association, to build a probabilistic soft logic (PSL) model for further processing the initial judgements from local classifiers to generate final extraction results.
Global information can also be exploited as global features together with sentence-level features to train local event extraction classifiers [106]–[108]. Hong et al. [106] proposed a cross-entity inference model to extract entity-type consistency as relation features. They argued that entities of the same type normally participate in similar events in the same role. To this end, they defined 9 new global features, including entity subtype, entity-subtype co-occurrence in domain, entity-subtype of arguments, etc., to train a set of sentence-level SVM classifiers. Similar approaches have also been adopted in [107], [108]. Liao and Grishman [107] proposed to first compute topical distributions for each document via the latent Dirichlet allocation (LDA) model [109] from a collection of documents and encoded such topic features into local extraction classifiers. In [108], three types of global features are defined, including lexical bridge features, discourse bridge features and role filler distribution features, and they are used together with local features to train extraction classifiers.
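The idea of one trigger sense per cluster can be sketched as a confidence-weighted vote over the outputs of a local classifier. This is a deliberately simplified stand-in rather than the inference rules of [102]: the threshold min_conf, the voting scheme and the function name are assumptions.

```python
from collections import defaultdict


def one_sense_per_cluster(predictions, min_conf=0.5):
    """Replace low-confidence trigger labels by the dominant label of the same word.

    predictions: (word, event_type, confidence) tuples produced by a local
    classifier over one cluster of topically related documents.
    """
    # Accumulate confidence mass for each (word, event type) pair in the cluster.
    score = defaultdict(lambda: defaultdict(float))
    for word, etype, conf in predictions:
        score[word][etype] += conf
    dominant = {w: max(types, key=types.get) for w, types in score.items()}
    # Keep confident local decisions; override the uncertain ones.
    return [
        (word, etype if conf >= min_conf else dominant[word], conf)
        for word, etype, conf in predictions
    ]


local = [
    ("fire", "Attack", 0.9),
    ("fire", "End-Position", 0.3),  # uncertain local decision
    ("fire", "Attack", 0.8),
]
revised = one_sense_per_cluster(local)
```

In this toy cluster, the uncertain End-Position label for "fire" is overridden by the Attack sense that dominates the other mentions, while confident local decisions are left untouched.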
C. Joint Classification Model
The aforementioned pipeline classification models could suffer from the error propagation problem, where errors in an upstream classifier are easily propagated to the downstream classifiers and could degrade their performance. On the other hand, a downstream classifier cannot influence the decisions of its previous classifiers, and inter-dependencies of different subtasks cannot be well exploited. To address such pipeline problems, joint classification has been proposed as a promising alternative to benefit from the close interactions between two or more subtasks, as useful information from one subtask can be carried both forward to the next ones and backward to the previous ones.
Joint classification models can be trained for multiple subtasks in each event extraction stage [110]–[112]. For example, Li et al. [110] proposed a joint model for trigger detection and event type identification, where an inference framework based on integer linear programming (ILP) is used to integrate two types of classifiers. In particular, they proposed two trigger detection models: one is based on the conditional random field (CRF) and the other on maximum entropy (ME). The ILP-based framework can help find the optimal values for constrained variables to minimize a weighted objective function. Therefore, multiple subtasks can be jointly trained within the ILP inference framework, yet each with different constraints and weights. Similar approaches have also been adopted by Li et al. [111] to train a joint model for discourse-level argument determination and role identification. Chen and Ng [112] trained two SVM-based joint classifiers, one for the subtasks of each stage, yet with more linguistic features.
Joint classification models can also be trained to simultaneously extract the trigger and the corresponding arguments of an event according to predefined event structures [113], [114]. For example, Li et al. [113] formulated event extraction as a structured learning problem, and proposed a joint extraction algorithm integrating both local and global features into a structured perceptron model [115] to predict event triggers and arguments simultaneously. In particular, the outcome of the entire sentence can be considered as a graph in which each trigger or argument is represented as a node, and each argument role is represented as a typed edge from a trigger to its argument. They applied beam search to perform inexact decoding on the graph to capture the dependencies between triggers and arguments. Judea and Strube [114] observed that event extraction is structurally identical to frame-semantic parsing, which is to extract semantic predicate-argument structures from text. As such, they optimized and retrained SEMAFOR [116], a frame-semantic parsing system, for structural event extraction.
The structured prediction approaches have also been widely used in joint extraction models in the biomedical domain [117]–[127]. For example, Riedel et al. [117], [118] represented events as relationally structured tokens of a sentence, and applied a joint probabilistic model based on Markov logic for biomedical event extraction. Venugopal et al. [119] combined the Markov logic networks (MLNs) and SVMs as a joint extraction model, where they leveraged SVM classifiers to handle high-dimensional features and modeled relational dependencies by MLNs. Vlachos et al. [122] employed a search-based structured prediction framework to provide high modeling flexibility. McClosky et al. [124] exploited the tree structures of event-argument in a re-ranking dependency parser to capture global event structure properties.
Besides event extraction, joint classification models can also be trained with other NLP tasks, like named entity recognition, event co-reference resolution, event relation extraction, etc. Some researchers have proposed to train a single model to jointly execute these tasks [128]–[132]. For example, Li et al. [128] proposed a framework to jointly execute these tasks together with event extraction in a single model. In particular, they represented entities, relations, and events as an information network and extracted all of them with one single model based on structured prediction. The aforementioned joint extraction approaches only operate at the sentence level and may miss valuable information from the document level. On the basis of Li’s model [128], Judea and Strube [129] presented a global inference model to incorporate the global and document context into the intra-sentential base system to extract entity mentions, events and relations jointly. In addition, Araki and Mitamura [131] found that events and their co-references offer useful semantic and discourse information for both tasks. They proposed a document-level joint model to capture the interactions between event triggers and event co-references so as to improve the performance of event trigger identification and event co-reference resolution simultaneously.
Event Extraction Based on Deep Learning
Feature engineering is the main challenge of event extraction based on machine learning. As reviewed in the previous section, although diverse features such as lexical, syntactic and semantic features can be crafted as classifiers’ inputs, their construction requires linguistic knowledge and domain expertise, which might limit the applicability and adaptability of the trained classification models. Furthermore, such features often each use a one-hot representation, which not only suffers from the data sparsity problem but also complicates the feature selection for classification model training.
Recently, deep learning techniques, which use multiple layers of connected artificial neurons to construct an artificial neural network, have been intensively studied for various classification tasks. In an artificial neural network, the lowest layer can take raw data with a very simple representation as its input. Each layer can learn to transform its lower layer input into a more abstract and composite representation, which is then input to its own higher layer, until the highest layer whose output feature is then used for classification. Compared with the classical machine learning techniques, deep learning can help to greatly reduce the difficulties of feature engineering.
Deep learning has been successfully applied in various NLP tasks, such as named entity recognition (NER) [133], search query retrieval and question answering [134], [135], sentence classification [136], [137], name tagging and semantic role labeling [138], and relation extraction [139], [140]. For event extraction, many deep learning schemes have also been proposed recently [141]–[146]. The general process is to build a neural network that takes word embeddings as input and outputs a classification result for each word, namely, classifying whether a word is an event trigger (or an event argument), and if so, its event type (or argument role).
How to design an efficient neural network architecture is the main challenge for event extraction based on deep learning. We will review some typical neural networks in the rest of this section. Before that, we briefly introduce the word embedding technique, as most neural networks for event extraction share a common approach of using word embeddings as the raw data input. Word embedding techniques are used to convert a word or a phrase in the vocabulary into a low-dimensional and real-valued vector. In practice, the vector representation of a word is trained from a large-scale corpus, and various word embedding models have been proposed, such as the continuous bag-of-words model (CBOW) and continuous skip-gram model (SKIP-GRAM) [147], [148].
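To make the difference between the two embedding models concrete, the following sketch only generates their training examples from a token sequence; the actual embedding training is omitted. The window size and function names are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each centre word is used to predict every word in its context window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: the words in the context window are used together to predict the centre word."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples


tokens = ["troops", "attacked", "the", "village"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
```

Skip-gram yields one example per (centre, context) pair, whereas CBOW yields one example per position with the whole context as input.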
A. Convolutional Neural Networks
One of the most commonly used neural network structures is the convolutional neural network (CNN), which typically stacks convolution and pooling layers before one or more fully connected layers. In a convolution layer, each neuron is connected only to a local window of its lower layer and shares its weights across positions; in a fully connected layer, each neuron in the lower layer is connected to all neurons in its upper layer. As a CNN is capable of learning hidden text features based on the continuous and generalized word embeddings, it has been proven to be effective in capturing the syntax and semantics of a sentence [136].
Nguyen and Grishman [141] were arguably the first to design a CNN for event detection, viz. identifying a trigger and its event type in a sentence. As illustrated by Fig. 5, each word is first transformed into a real-valued vector representation, which is a concatenation of the word embedding, its position embedding and entity type embedding, as the network input. The CNN consists of, from input to output, a convolution layer, a max pooling layer and a softmax layer, and outputs the classification result for each word. Many typical techniques can be used for training the CNN, including gradient back-propagation, dropout regularization, stochastic gradient descent with shuffled mini-batches, AdaDelta adaptive learning rates and weight optimization.
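The layer sequence described above (concatenated embeddings, convolution, max pooling, softmax) can be sketched as a forward pass in plain Python. All dimensions, weights and function names below are toy assumptions, the tanh non-linearity is our choice, and no training is shown; this is not the configuration used in [141].

```python
import math


def concat_inputs(word_emb, pos_emb, ent_emb):
    """Concatenate word, position and entity type embeddings for every token."""
    return [w + p + e for w, p, e in zip(word_emb, pos_emb, ent_emb)]


def conv1d(x, filt, bias=0.0):
    """Slide a filter of k rows over k consecutive token vectors; tanh non-linearity."""
    k = len(filt)
    feature_map = []
    for i in range(len(x) - k + 1):
        s = bias
        for j in range(k):
            s += sum(a * b for a, b in zip(x[i + j], filt[j]))
        feature_map.append(math.tanh(s))
    return feature_map


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def cnn_forward(x, filters, W, b):
    """Convolution, max pooling over each feature map, then a softmax output layer."""
    pooled = [max(conv1d(x, f)) for f in filters]
    logits = [sum(w * p for w, p in zip(row, pooled)) + bi for row, bi in zip(W, b)]
    return softmax(logits)


# Toy input: 4 tokens, 2-dim word, 1-dim position and 1-dim entity type embeddings.
word_emb = [[0.1, 0.3], [0.7, -0.2], [0.0, 0.5], [0.4, 0.4]]
pos_emb = [[-1.0], [0.0], [1.0], [2.0]]
ent_emb = [[1.0], [0.0], [0.0], [1.0]]
x = concat_inputs(word_emb, pos_emb, ent_emb)

filters = [                                     # two filters over windows of 3 tokens
    [[0.2, -0.1, 0.0, 0.3]] * 3,
    [[-0.3, 0.4, 0.1, 0.0]] * 3,
]
W = [[0.5, -0.5], [-0.2, 0.8]]                  # 2 output classes x 2 pooled features
b = [0.0, 0.1]
probs = cnn_forward(x, filters, W, b)
```

The max pooling step keeps a single value per filter regardless of sentence length, which is what lets a fixed-size softmax layer follow the convolution.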
The typical CNN structure often employs a max-pooling layer with a max operation over the representation of an entire sentence. However, a sentence may contain more than one event sharing arguments yet with diverse roles. Chen et al. [149] proposed a Dynamic Multi-Pooling Convolutional Neural Network (DMCNN) to evaluate each part of a sentence via a dynamic multi-pooling layer, extracting both lexical-level and sentence-level features. In DMCNN, each feature map is divided into three parts according to the predicted trigger and the candidate argument, and the max value of each part is kept to preserve more valuable information than a single max-pooling value. Furthermore, their CNN model also uses a skip-gram model to capture meaningful semantic regularities for words.
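A minimal sketch of dynamic multi-pooling over one feature map, assuming the two split points are the positions of the predicted trigger and of one candidate argument. The exact segment boundaries, the zero fallback for an empty segment and the function name are illustrative assumptions rather than the authors' implementation.

```python
def dynamic_multi_pool(feature_map, trigger_idx, arg_idx):
    """Split a feature map at two positions and keep the max of each of the three parts."""
    lo, hi = sorted((trigger_idx, arg_idx))
    parts = [
        feature_map[:lo + 1],        # up to and including the first split point
        feature_map[lo + 1:hi + 1],  # between the split points, including the second
        feature_map[hi + 1:],        # after the second split point
    ]
    # An empty segment (e.g. a split point at the sentence end) contributes 0.0.
    return [max(p) if p else 0.0 for p in parts]


feature_map = [0.1, 0.9, 0.3, 0.7, 0.2, 0.5]
pooled = dynamic_multi_pool(feature_map, trigger_idx=1, arg_idx=3)
single = max(feature_map)  # what ordinary max pooling would keep
```

Here ordinary max pooling keeps only 0.9, while the three-way pooling also retains 0.7 and 0.5 from the later parts of the sentence.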
In both [141] and [149], the convolution operation linearly maps the vectors for the
Some other improvements of the typical CNN model have also been proposed [151]–[154]. For example, Burel et al. [152] designed a semantically enhanced deep-learning model, called Dual-CNN, which adds a semantic layer in a typical CNN to capture contextual information. Li et al. [153] proposed a parallel multi-pooling convolutional neural network (PMCNN), which can capture the compositional semantic features of sentences for biomedical event extraction. The PMCNN also utilizes dependency-based embedding for word semantic and syntactic representations and employs a rectified linear unit as a nonlinear function. Kodelja et al. [154] built the representation of global contexts following a bootstrapping approach, and integrated the representation into a CNN model for event extraction.
B. Recurrent Neural Networks
Many CNN-based event extraction schemes suffer from the error propagation problem due to their pipeline execution of the two subtasks, namely, event detection first and argument identification second. Furthermore, the CNN structure normally takes the concatenation of words’ embeddings as input, and the convolution operation is executed over consecutive words to capture contextual relations of the current word to its neighboring words. As such, they cannot well capture potential interdependencies between distant words to exploit a sentence as a whole for jointly extracting triggers and arguments, whereas joint extraction, as discussed in the previous section, can exploit the relations between event triggers and event arguments so that the two individual subtasks benefit each other.
In language modelling, a sentence is often regarded as a sequence of words, i.e., one word after another from the start to the end of a sentence. The recurrent neural network (RNN) structure, which consists of a series of connected neurons, can effectively make use of such sequential inputs. As illustrated by Fig. 6, a simple RNN consists of a series of connected long short-term memory (LSTM) neurons, where the output of an LSTM neuron also serves as the input of its sequentially connected LSTM neuron. As such, the RNN structure can exploit the potential dependencies between any two words, either directly or indirectly connected, which enables its wide application in many NLP tasks [155], including named entity recognition [156], part-of-speech tagging [157], relation extraction [158], sentence parsing [159], sequence labeling [160], etc. Furthermore, the RNN structure also produces an output for each word in the sequence, each of which could be predicted as a trigger or an argument, whereby the two tasks of trigger detection and argument identification can be jointly executed.
For event extraction, some RNN-based models have been proposed to exploit words’ interdependencies by inputting words according to their sequential order, either forward or backward, in a sentence [161]–[164]. For example, Nguyen et al. [161] designed a bi-directional RNN architecture for joint event extraction, as illustrated by Fig. 6. Their model runs over sentences in both forward and reverse directions by two individual RNNs, each consisting of a series of gated recurrent units (GRU) [165]. The joint extraction consists of two phases: the encoding phase and the prediction phase. In the encoding phase, different from the CNN model, it does not employ position features but replaces them with binary vectors that represent dependency features, so as to predict event triggers and arguments jointly. In the prediction phase, the model classifies the dependencies between triggers and arguments into three categories: (1) the dependencies among trigger subtypes; (2) the dependencies among argument roles; (3) the dependencies between trigger subtypes and argument roles.
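The bidirectional encoding can be illustrated with a minimal sketch. To keep it short, a plain tanh recurrent unit is used as a stand-in for the GRU of [161], and the weights, dimensions and function names are toy assumptions.

```python
import math


def rnn_step(x, h, W_x, W_h):
    """One recurrent step: the new hidden state mixes the current input and the previous state."""
    return [
        math.tanh(sum(wx * xi for wx, xi in zip(W_x[k], x))
                  + sum(wh * hi for wh, hi in zip(W_h[k], h)))
        for k in range(len(h))
    ]


def run_rnn(xs, W_x, W_h):
    """Feed the inputs one by one and collect the hidden state after each word."""
    h = [0.0] * len(W_h)
    states = []
    for x in xs:
        h = rnn_step(x, h, W_x, W_h)
        states.append(h)
    return states


def bidirectional(xs, fwd, bwd):
    """Concatenate, per word, the forward state and the state of a right-to-left pass."""
    forward = run_rnn(xs, *fwd)
    backward = run_rnn(xs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Toy parameters: 2-dim inputs, 2-dim hidden state for each direction.
fwd = ([[0.5, -0.3], [0.2, 0.4]], [[0.1, 0.0], [0.0, 0.1]])
bwd = ([[-0.4, 0.6], [0.3, 0.1]], [[0.1, 0.0], [0.0, 0.1]])
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = bidirectional(xs, fwd, bwd)
```

Each word representation thus depends on the words before it through the forward half and on the words after it through the backward half, which is how distant words can influence one another.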
Syntactic dependencies between words can also be used to augment the basic RNN structure. For example, Sha et al. [166] designed a dbRNN (dependency bridge RNN) by adding syntactic dependency connections between RNN neurons into a bidirectional RNN. As shown in Fig. 7, the syntactic dependency of
The dependency bridge on LSTM. Apart from the last LSTM cell, each cell also receives information from former syntactically related cells.
Besides using dependency bridges, the syntactic dependency tree of a sentence can also be directly exploited to build a tree-structured RNN [167]. Upon the typical Bi-LSTM (bidirectional LSTM), Zhang et al. [168] further constructed a Tree-LSTM yet centered at the target word by transforming the original dependency tree of a syntactic dependency analyzer for Chinese event detection. Li et al. [169] proposed to further augment a Tree-LSTM with external entity ontological knowledge for biomedical event extraction.
The RNN structure can be applied not only to sentence level but also document level event extraction. Duan et al. designed a DLRNN model (document level RNN) [170] to extract cross-sentence or even cross-document clues by using a distributed vector for document representation, which is to capture the topic distribution of a document by the unsupervised learning PV-DM model [171]. All the words in a document use the same document vector, and the concatenation of word embedding and document vector is used as the input of a Bi-LSTM model.
The RNN structure can also be applied to train a joint classification model for not only event extraction including trigger detection and argument identification, but also entity mention detection [172]. The three subtasks are executed in a pipeline way, in the order of entity mention detector, trigger classifier and argument role classifier, by training a single Bi-GRU (bidirectional GRU) network model, where the network hidden representations are shared for all the three subtasks to exploit some common knowledge across subtasks and potential dependencies or interactions in between the subtasks.
The aforementioned RNN structures have adopted the GRU or LSTM as the basic composing unit, which uses a gating strategy to control the information processing in the neural network. However, the gate computation cannot be done in parallel and is time-consuming. Zhang et al. [173] proposed to use a kind of simple recurrent unit (SRU) as the basic composing neuron for its capability of reducing gate computation complexity without incurring the multiplication operation dependent on previous units [174]. They built two bidirectional SRU models (Bi-SRU): one for learning word-level representations, and another for character-level representations.
C. Graph Neural Networks
Recently, many neural networks operating on graphs, generally referred to as graph neural networks (GNNs), have been proposed for a wide range of application domains [175]–[177]. Simply put, a GNN applies multiple neurons operating on a graph structure to enable so-called geometric deep learning in non-Euclidean spaces. Traditional neuron operations, like the recurrent kernels and convolutional kernels that have been widely used in RNNs and CNNs, can also be applied to a graph structure so as to learn various deep features embedded within a graph for diverse tasks.
Some researchers have attempted to apply GNN models for event extraction [178]–[180]. The core issue of such approaches is to first construct a graph for words in text. Rao et al. [178] adopted a semantic analysis technique, called abstract meaning representation (AMR) [181], which can normalize many lexical and syntactic variations in text and output a directed acyclic graph to capture the notions of “who did what to whom” in text. Furthermore, Rao et al. argued that an event structure is a subgraph of an AMR graph and cast the event extraction task as a subgraph identification problem. They trained a graph LSTM model to identify such an event subgraph for biomedical event extraction.
Another approach of graph construction is based on some transformation of the syntactic dependency tree of a sentence [179], [180]. Notice that a syntactic dependency can be regarded as a directed edge from a head word to its dependent word, with an edge label as the dependency type. At first, for each word, a new self-loop edge as proposed in [182] is created and included into the dependency tree, that is, an edge starting from a word and ending at the same word. All such self-loop edges share the same new edge type. Then for each dependency edge
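As a rough illustration of a convolution-style operation on a graph, the following sketch performs one layer of neighbourhood averaging followed by a linear map and a ReLU. It is a generic simplification, not the model of any cited work; treating edges as undirected, including each node in its own neighbourhood and the function name are assumptions.

```python
def gcn_layer(features, edges, weight):
    """One graph layer: average each node's neighbourhood, apply a linear map, then ReLU."""
    n = len(features)
    neighbors = {v: {v} for v in range(n)}   # every node also sees itself
    for u, v in edges:                        # edges are treated as undirected here
        neighbors[u].add(v)
        neighbors[v].add(u)
    dim = len(features[0])
    out = []
    for v in range(n):
        nb = sorted(neighbors[v])
        mean = [sum(features[u][d] for u in nb) / len(nb) for d in range(dim)]
        out.append([max(0.0, sum(w * m for w, m in zip(row, mean))) for row in weight])
    return out


# Toy graph: a chain of three word nodes, 0 - 1 - 2, with 2-dim node features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
edges = [(0, 1), (1, 2)]
identity = [[1.0, 0.0], [0.0, 1.0]]
hidden = gcn_layer(features, edges, identity)
```

Stacking several such layers lets information travel along longer paths in the graph, one hop per layer.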
D. Hybrid Neural Network Models
The aforementioned three basic types of neural network architecture each have their own merits and demerits when used to capture diverse features, relations and dependencies in text for event extraction. Many researchers have proposed hybrid neural network models which combine different neural networks to exploit their respective strengths. A common approach of building such a hybrid model is to use different neural networks to learn different types of word representations. For example, both [179] and [180] first applied a Bi-LSTM model to obtain initial word representations before performing graph convolution operations.
Some have proposed to combine the CNN and RNN as a hybrid model [183]–[189]. For example, Zeng et al. [183] proposed to first use a CNN for learning the local contextual representation of each word, which is concatenated with the output of another Bi-LSTM to obtain the final word representation for classification. Similar approaches of concatenating CNN output and Bi-LSTM hidden layers have also been applied to event extraction for Chinese, Indian and other languages [184]–[187]. Nurdin and Manlidevi [188] also designed a hybrid model consisting of a CNN and a Bi-LSTM for extracting events from Indonesian news with the following arguments: who, did what, when, where, why and how. Liu et al. [189] also used a CNN to obtain the local contextual representation for each word; yet they proposed to use a Bi-LSTM to obtain a document representation as a weighted sum of concatenated hidden states of the forward and backward layers. The word local representation and document representation are then further concatenated for event trigger detection.
Another important type of hybrid model is the generative adversarial network (GAN) [190], which normally consists of two neural networks contesting each other, one dubbed as a generator
Some researchers have proposed to apply the GAN framework for event extraction [191]–[193]. They adopted an RNN structure like Bi-LSTM for the generator and discriminator networks. In the training process, Hong et al. [191] proposed to regulate the learning process with a two-channel self-regulated learning strategy. In the self-regulation process, the generator is trained to produce the most spurious features, while the discriminator with a memory suppressor is trained to eliminate the fakes. Liu et al. [192] proposed an adversarial imitation strategy to incorporate a knowledge distillation module into the feature encoding procedure. In their hybrid model, a teacher encoder and a student encoder, each being a Bi-GRU network, are used. The teacher encoder is trained with gold annotations, yet the student encoder is trained by minimizing the distance between its output and that of the teacher encoder via adversarial imitation learning. Zhang et al. [193] used the reinforcement learning (RL) strategy to update a Q-table during the training process, where the Q-table records the reward values computed from the system states and actions.
E. Attention Mechanism
The attention mechanism first appeared in the field of computer vision, and its purpose was to emulate the visual attention mechanism of the human brain. Recently, the attention mechanism has been widely used in many NLP tasks [194]–[196]. Simply put, attention is a discrimination mechanism to guide a neural model to treat each component of the input unequally according to its importance to a given task. Although it is implemented by assigning different weights to different neurons’ states or outputs, the weights are actually self-learned from the model training process.
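The weighting idea can be sketched as follows: each word state is scored against a query vector, the scores are normalized by a softmax, and the states are summed with the resulting weights. The dot-product scoring and the fixed toy query are assumptions for illustration; in a trained model the scoring parameters are learned.

```python
import math


def attention(states, query):
    """Score each state against the query, normalise with softmax, return weights and context."""
    scores = [sum(s * q for s, q in zip(state, query)) for state in states]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(w * state[d] for w, state in zip(weights, states)) for d in range(dim)]
    return weights, context


states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy hidden states of three words
query = [1.0, 1.0]                              # toy task-specific query vector
weights, context = attention(states, query)
```

The third state is most aligned with the query and therefore receives the largest weight, so it dominates the context vector.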
Many word-level attention mechanisms have been proposed to learn the importance of each word in a sentence, like distinguishing different argument words, word types and word relations [197]–[202]. They mainly differ in which elements should be given more attention and how to train the attention vectors. For example, Liu et al. [197] argued that the argument words of a trigger should receive more attention than other words. To this end, they first constructed gold attention vectors to encode only the annotated argument words and their contextual words for each annotated trigger. Furthermore, they designed two contextual attention vectors for each word: one is based on its contextual words, and the other is based on its contextual entity types encoded in a transformed entity type space. The two attention vectors are then concatenated and trained together with the event detector to minimize a weighted loss of both event detection and attention discrepancy. Wu et al. [198] applied argument information to train attentions with a Bi-LSTM network.
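The idea of supervising attention with gold vectors can be sketched as follows. This is a simplified illustration under our own assumptions, not the exact formulation of [197]: the gold vector spreads attention uniformly over the annotated argument positions, the discrepancy is a squared error, and the hypothetical weight `lam` balances the two loss terms.

```python
import math

def gold_attention(num_words, argument_positions):
    """Uniform gold attention over annotated argument positions, zero elsewhere."""
    if not argument_positions:
        return [0.0] * num_words
    share = 1.0 / len(argument_positions)
    return [share if i in argument_positions else 0.0 for i in range(num_words)]

def attention_discrepancy(gold, predicted):
    """Squared-error distance between the gold and predicted attention vectors."""
    return sum((g - p) ** 2 for g, p in zip(gold, predicted))

def joint_loss(prob_of_gold_label, gold, predicted, lam=0.5):
    """Negative log-likelihood of the event label plus weighted attention discrepancy."""
    detection_loss = -math.log(prob_of_gold_label)
    return detection_loss + lam * attention_discrepancy(gold, predicted)

# "He died in the hospital": trigger "died" (index 1), arguments at indices 0 and 4.
gold = gold_attention(5, {0, 4})
predicted = [0.4, 0.1, 0.1, 0.1, 0.3]  # made-up model attention
loss = joint_loss(prob_of_gold_label=0.8, gold=gold, predicted=predicted, lam=0.5)
```

Minimizing such a joint loss pushes the model both to predict the correct event type and to focus on the words that the annotation marks as arguments.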
In word-level attention mechanisms, entity relations from a syntactic parser can also be used to train attentions [199], [200]. The basic idea is that a syntactic dependency can provide the connection between two possibly nonconsecutive and distant words, while the dependency type can help to distinguish the syntactic importance between words. For Chinese event extraction, as there is no explicit word segmentation as in English, Wu et al. [202] also proposed a character-level attention mechanism to distinguish the importance of each character in a Chinese word.
Some researchers have also proposed to integrate both word-level and sentence-level attentions to augment event extraction in multi-sentence documents [203]–[205]. Zhao et al. [203] argued that in many multi-sentence documents, sentences in one document are often correlated with respect to the document theme, although they may contain different types of events. They proposed a DEEB-RNN model with a hierarchical and supervised attention mechanism, which pays word-level attention to event triggers and sentence-level attention to those sentences containing events. To this end, they constructed two gold attention vectors: one for word-level attention based on the sentence trigger, and another for sentence-level attention that marks each sentence containing a trigger word. Besides sentence interdependencies in one document, it could also be the case that multiple events are embedded within a single sentence. Chen et al. [204] argued that events mentioned in the same sentence tend to be semantically coherent. To capture both intra-sentence dependency and inter-sentence correlation, they proposed a HBTNGMA model with a gated multi-level attention mechanism for extracting and fusing intra-sentence and inter-sentence contextual information to augment event detection.
Besides exploiting words and sentences, some researchers have proposed to integrate extra knowledge for attentions, like using multilingual knowledge [206] or an a priori trigger corpus [207]. Most event extraction models are trained and applied for a particular language, and may suffer from the typical ambiguity problem of a single word having different meanings in different contexts. Liu et al. [206] examined the annotated events in ACE 2005 and observed that 57% of the trigger words are ambiguous. They argued that using a multilingual approach can help to deal with the ambiguity problem based on the observations of multilingual consistency and multilingual complementation. As such, they proposed a gated multilingual attention framework which contains both monolingual context attention and gated cross-lingual attention. For the latter, they first applied a machine translator to obtain the translated text in another language. Li et al. [207] designed a prior knowledge integration network to encode their collected keywords as a prior knowledge representation. Such knowledge representations are then integrated with a self-attention network structure.
Event Extraction Based on Semi-Supervised Learning
The aforementioned algorithms based on machine learning and deep learning techniques are supervised learning approaches, which require a labeled corpus for model training. As deep learning approaches normally involve a large number of parameters in a neural network, generally the larger the labeled corpus, the better a model can be trained. However, obtaining a labeled corpus is cost-prohibitive because of its time-consuming and labor-intensive annotation process, which in most cases also requires domain expertise and professional knowledge. Due to this, many labeled corpora are small in size and low in coverage. For example, in the ACE 2005 corpus, only 33 event types of interest were defined, a very low coverage for diverse applications. Furthermore, among the labeled events, about 60% of the event types have fewer than 100 instances, and 3 event types have even fewer than 10 instances.
How to improve the extraction accuracy from a small set of labeled gold data has become a critical challenge. A straightforward solution is to first automatically produce more training data, and then use mixture data containing both the original gold data and the newly generated ones for model training. In this section, we review such solutions in the literature, though they may have used different names, like semi-supervision, weak supervision and distant supervision. We note that although using mixture data affects the model training process, the basics of these learning-to-classify algorithms are similar to those reviewed in the previous two sections. In this section, we mainly focus on how they expand a small set of labeled data to a larger corpus, and how extraction models can be trained from mixture data.
A. Joint Data Expansion and Model Training
The set of labeled gold data may be small; however, it can be used iteratively for model training. Such iterative model training via data expansion is a variant of the well-known technique called bootstrapping [208]. The basic idea is to first train a classifier with a small set of labeled data and use it to classify new unlabeled data. Besides the classified labels, a classifier also outputs a classification confidence for the new data. The new data with very high confidence can then be added to the training data for the next round of model training.
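The bootstrapping loop described above can be sketched as follows. The word-count classifier, the confidence measure (vote share of the winning label), the threshold and the toy sentences are all hypothetical choices made only for illustration; the reviewed systems use much stronger extraction models.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Count how often each word co-occurs with each label (a toy classifier)."""
    counts = defaultdict(Counter)
    for text, label in labeled:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts

def predict(model, text):
    """Return (label, confidence); confidence is the winning label's vote share."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(model.get(word, {}))
    if not votes:
        return None, 0.0
    label, top = votes.most_common(1)[0]
    return label, top / sum(votes.values())

def bootstrap(labeled, unlabeled, threshold=0.9, rounds=3):
    """Repeatedly add confidently classified samples to the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        model = train(labeled)
        confident, rest = [], []
        for text in pool:
            label, conf = predict(model, text)
            (confident if conf >= threshold else rest).append((text, label))
        if not confident:
            break  # nothing new to add, so further rounds would not change the model
        labeled.extend(confident)
        pool = [text for text, _ in rest]
    return labeled, pool

seed = [("troops attacked a village", "Attack"),
        ("company hired new manager", "Start-Position")]
unlabeled = ["rebels attacked a town", "firm hired new engineer", "it rained yesterday"]
expanded, remaining = bootstrap(seed, unlabeled)
```

Here the two sentences sharing words with the seed data are added with their predicted labels, while the unrelated sentence stays in the unlabeled pool.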
The key challenge of employing classification results for data expansion lies in how to evaluate the classification confidence for new data [209]–[212]. As events have complicated structures, like different event types containing different arguments with different roles, the computed confidence of an event extraction is normally not very accurate. Liao and Grishman [209], [210] proposed to use only a part of the extraction results from new samples for data expansion, in particular, the most confident trigger with its most confident argument. Based on their extraction model [94], the most confident “<role, trigger>” pairs are selected based on the product of the probabilities from their trigger classifier and argument classifier. Moreover, they employed an information retrieval system, called INDRI [213], to collect a cluster of related documents, and applied the cross-document inference algorithm proposed in [102] to include new training data with high confidence.
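The selection of the most confident “<role, trigger>” pairs by a probability product can be illustrated with the short sketch below. The dictionary fields and the probability values are hypothetical; in practice the two probabilities would come from the trigger classifier and the argument classifier, respectively.

```python
def rank_role_trigger_pairs(candidates, top_k=1):
    """Rank candidate pairs by trigger probability times argument probability."""
    scored = [
        (c["trigger_prob"] * c["argument_prob"], c["role"], c["trigger"])
        for c in candidates
    ]
    scored.sort(reverse=True)  # highest joint confidence first
    return scored[:top_k]

# Made-up classifier outputs for one unlabeled document.
candidates = [
    {"trigger": "attacked", "role": "Attacker", "trigger_prob": 0.9, "argument_prob": 0.8},
    {"trigger": "attacked", "role": "Place", "trigger_prob": 0.9, "argument_prob": 0.5},
    {"trigger": "met", "role": "Entity", "trigger_prob": 0.6, "argument_prob": 0.9},
]
best = rank_role_trigger_pairs(candidates, top_k=1)
```

Only the top-ranked pair would be added to the training data, which keeps the expansion conservative at the cost of adding fewer samples per round.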
Wang et al. [211] proposed to use trigger-based latent instances to discover unlabeled data for expansion. If a word serves as a trigger in the gold dataset, all instances mentioning this word in the unlabeled dataset might also express the same latent instance. Based on this assumption, they proposed an adversarial training process to filter out noisy instances while distilling informative instances to include as new data. Although including new data from classifiers’ outputs is cost-effective, it cannot guarantee that the newly included data are annotated completely correctly. In this regard, manual checking is still necessary. Liao and Grishman [212] proposed an active learning strategy, called pseudo co-testing, to annotate only those new data with high confidence using minimal manual effort.
The aforementioned algorithms have focused on expanding training data with the same event types as the labeled gold data. Recently, some researchers have proposed transfer learning algorithms to expand training data with event types different from those in the gold dataset [214]–[218]. For example, Nguyen et al. [216] proposed a two-stage algorithm to train a CNN model that can effectively transfer knowledge from old event types to the target type (new type). Specifically, the first stage is to train a CNN model based on labeled data of old event types with randomly initialized weight matrices. The second stage is to train the CNN model based on a small set of labeled data of the target types with the weight matrices initialized by the first stage. Finally, the two-stage trained CNN model is used for both target type and old type event detection.
Besides utilizing a small set of labeled data with new types, Huang et al. [218] proposed a zero-shot transfer learning approach for extracting events with new unseen types, which only needs a manually structured definition of new event types (e.g., event type names and argument role names from the event schema). In particular, they first constructed the vector representations of the event mention structure (trigger, arguments and their relation structure from an event instance) and the event type structure (type, roles and their relation structure from the event schema) by training a CNN model on the gold data. They next used the optimized CNN model to represent the event mention structure and event type structure of new data, and found the closest event type for each new event mention.
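The final matching step, finding the closest event type for a new event mention, can be sketched as a nearest-neighbor search in a shared vector space. Cosine similarity and the three-dimensional toy vectors are our assumptions for illustration; in [218] the representations are produced by the trained CNN model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def closest_event_type(mention_vec, type_vecs):
    """Return the event type whose structure vector is most similar to the mention."""
    return max(type_vecs, key=lambda name: cosine(mention_vec, type_vecs[name]))

# Made-up structure vectors for three event types and one unseen mention.
type_vecs = {
    "Attack": [0.9, 0.1, 0.0],
    "Transport": [0.1, 0.9, 0.2],
    "Elect": [0.0, 0.2, 0.9],
}
mention_vec = [0.2, 0.8, 0.1]
predicted_type = closest_event_type(mention_vec, type_vecs)
```

Because only the type representation is needed, a new event type can be added to `type_vecs` without any labeled mentions of that type.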
B. Data Expansion from Knowledge Bases
Many existing knowledge bases, such as FrameNet, Freebase, Wikipedia and WordNet, store a large amount of structured information, which can be exploited to generate new labeled data as training data for event extraction.
FrameNet defines many complete semantic frames, each of which consists of a lexical unit and a set of frame elements. Such frames share highly similar structures with events. As many frames in FrameNet actually express certain events, some researchers have explored mapping frames to events for data expansion [219]–[221]. For example, Liu et al. [220] expanded the ACE training data by using events detected from the exemplar sentences given in FrameNet. Specifically, they first learned an event detection model based on the ACE labeled data, which is then used to yield initial judgements for the exemplar sentences. Then, a set of soft constraints is applied for global inference based on the following hypothesis: “The same lexical unit, the same frame and the related frames tend to express the same event.” The initial judgements and soft constraints are then formalized as first-order formulas and modeled by probabilistic soft logic (PSL) [222] for global inference. Finally, they detected events from the exemplar sentences for data expansion.
Analogously, the compound value types (CVTs) in Freebase (a semantic knowledge base) can be regarded as event templates, and CVT instances can be regarded as event instances. The types, values and roles of CVTs are regarded as event types, event arguments and the roles that arguments play in events, respectively. Zeng et al. [223] exploited the structural information of CVTs in Freebase to automatically annotate event mentions for data expansion. They first identified the key arguments of a CVT, which play an important role in one event. If a sentence contains all key arguments of a CVT, it is likely to express the event represented by the CVT. As a result, they recorded the words or phrases in this sentence that match the CVT properties as the involved arguments with their roles for annotation.
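The key-argument heuristic can be illustrated with the following sketch. The CVT record, its role names, the invented person names and the plain substring matching are hypothetical simplifications for illustration, not the actual Freebase schema or the matching procedure of [223].

```python
def annotate_with_cvt(sentence, cvt):
    """Label a sentence with the CVT's event type if it mentions every key argument."""
    text = sentence.lower()
    if not all(cvt["arguments"][role].lower() in text for role in cvt["key_roles"]):
        return None
    # Record every CVT property value found in the sentence as a role-labeled argument.
    found = {role: value for role, value in cvt["arguments"].items()
             if value.lower() in text}
    return {"event_type": cvt["type"], "arguments": found}

# A made-up marriage-like CVT instance with two key arguments.
cvt = {
    "type": "Marriage",
    "key_roles": ["spouse_a", "spouse_b"],
    "arguments": {"spouse_a": "Alice Smith", "spouse_b": "Bob Jones",
                  "date": "2010", "location": "Paris"},
}
hit = annotate_with_cvt("Alice Smith married Bob Jones in 2010.", cvt)
miss = annotate_with_cvt("Alice Smith visited Paris.", cvt)
```

The first sentence contains both key arguments and is annotated with three arguments, while the second mentions only one key argument and is therefore left unlabeled.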
Moreover, Chen et al. [221] proposed to expand training data by exploiting both FrameNet and Freebase. Araki and Mitamura [224] utilized WordNet and Wikipedia to generate new training data. In addition to using general knowledge bases, some studies have focused on using relevant knowledge bases for domain-specific event extraction [29], [178], [225], [226]. For example, Rao et al. [178] used the biological pathway exchange (BioPax) knowledge database, which contains relations between proteins, to expand training data from PubMed Central articles. In the financial domain, Yang et al. [29] utilized a financial event knowledge database for data expansion, which contains 9 common financial event types.
C. Data Expansion From Multi-Language Data
Motivated by the facts that the same event may be described in different languages and that the labeled data in one language are highly likely to convey similar information in another language, some approaches have been proposed to utilize such multi-language information to address the data sparseness and low-coverage problems [227]–[231]. For example, Zhu et al. [227] employed Google Translate to eliminate the language gap between Chinese and English, and produced uniform text representations with bilingual word features. In this manner, the training data from both languages can be merged for model training.
In comparison, some researchers have proposed to bootstrap event extraction by exploiting cross-lingual data, i.e., cross-lingual bootstrapping, without using machine translation or a manually aligned knowledge base [228]–[230]. For example, Chen and Ji [228] proposed a co-training bootstrapping framework, which contains two monolingual self-training bootstrapping event extraction systems, one for English and another for Chinese. A labeled event can be transformed from one language to another, so as to produce so-called projected triggers and arguments for event extraction in the other language. Hsi et al. [229], [230] proposed to augment a standard event extraction pipeline of classifiers by leveraging multilingual training via the use of multilingual features, such as universal POS tags, universal dependencies, bilingual dictionaries and multilingual word embeddings.
Event Extraction Based on Unsupervised Learning
Unlike supervised and semi-supervised learning, unsupervised learning does not train event extraction models based on labeled corpus. Instead, unsupervised learning approaches mainly focus on open-domain event extraction tasks, like detecting trigger and arguments based on word distributional representations and clustering event instances and mentions according to their similarities.
A. Event Mention Detection and Tracking
An event mention is a collection of keywords that can describe an event from one or more sentences. The tasks include detecting event mentions in an article and tracking similar event mentions in different articles. Notice that classifications of trigger type and/or argument role are not required, as types and roles are usually not predefined in such tasks.
The TDT program [15] defined a topic as a set of news and/or stories that are strongly related by some seminal real-world event. It then defined the TDT task as determining whether a given article is related to a clustering of events of the same topic and provided a TDT corpus with simple labels like {
In many TDT algorithms, sentences are firstly converted into vector representations, and the vector distance is computed to measure the similarity to some topic for event detection [232]–[236]. For example, Yang et al. [232] and Nallapati et al. [233] represented documents by their TF-IDF vectors, a conventional vector space model which uses the bag-of-terms representation. Specifically, terms (words or phrases) in documents are statistically weighted using the term frequency (TF) and inverse document frequency (IDF). They keep the
Stokes and Carthy proposed to represent documents by lexical chains [237], which explore the cohesion structure of a text to create a sequence of semantically related words [234]. For example, the lexical chain of a document concerning airplanes might consist of the following words: plane, airplane, pilot, cockpit, airhostess, wing, engine. They identified the lexical chains in documents using WordNet, which represents synonymous words in terms of a single unique identifier.
Following the task definition of the TDT program, other studies have been conducted to detect whether new articles on various websites are related to some already identified event, without using the TDT corpus [32]–[37], [238]. For example, Naughton et al. [35] proposed to vectorize sentences from news articles using the bag-of-words encoding and clustered the sentences using the agglomerative hierarchical clustering algorithm [239].
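As a rough illustration of this kind of pipeline, the sketch below encodes sentences as bag-of-words count vectors and merges the most similar clusters until no pair exceeds a similarity threshold. The average-link criterion, the cosine measure, the threshold value and the toy headlines are our assumptions and may differ from the configuration used in [35].

```python
import math
from collections import Counter

def bag_of_words(sentence):
    """Map a sentence to its word-count vector."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def agglomerative(sentences, threshold=0.3):
    """Average-link agglomerative clustering; returns lists of sentence indices."""
    vecs = [bag_of_words(s) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                sims = [cosine(vecs[i], vecs[j])
                        for i in clusters[x] for j in clusters[y]]
                score = sum(sims) / len(sims)
                if score > best:
                    best, pair = score, (x, y)
        if best < threshold:
            break  # the remaining clusters are too dissimilar to merge
        x, y = pair
        clusters[x] += clusters.pop(y)
    return clusters

headlines = [
    "earthquake hits coastal city",
    "strong earthquake hits city overnight",
    "team wins football final",
    "football team wins cup final",
]
clusters = agglomerative(headlines)
```

With these toy headlines, the two earthquake sentences and the two football sentences end up in separate clusters, each of which can be read as one event.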
Besides using word and sentence embeddings, some researchers have proposed to exploit additional information in news articles, like time and location, to augment event mention detection. Ribeiro et al. [32] proposed to integrate time, location and content dimensions into the text representation, and applied an all-pairs similarity search algorithm and a Markov clustering algorithm to cluster news articles describing the same event. Likewise, Yu and Wu [33] proposed a Time2Vec representation technique that constructs an article representation from a context vector and a time vector, and employed a dual-level clustering algorithm for event detection.
Recently, several novel methods have been proposed for event detection and tracking, including the negative-examples-pruning support vector machine [238], multiple instance learning model based on convolutional neural networks [36], and weighted undirected bipartite graph based ranking framework [34].
B. Event Extraction and Clustering
The task of event mention detection mainly focuses on detecting event keywords in order to cluster sentences or articles expressing the same event. Some studies have proposed to further discriminate event triggers from arguments for clustering similar events and constructing event schemas for similar events [44]–[51].
A straightforward approach is to regard the verb of a sentence as an event trigger. For example, Rusu et al. [47] used verbs in sentences as event triggers and identified event arguments by utilizing the dependency paths between triggers and named entities, time expressions, sentence subjects and sentence objects. Moreover, some knowledge bases can be applied to augment the discrimination of triggers and arguments. Chambers and Jurafsky [44] considered verbs and their synsets in WordNet as triggers, while the entities of syntactic objects are arguments. Huang et al. [45] considered all noun and verb concepts in OntoNotes and verbal and lexical units in FrameNet as candidate event triggers. Furthermore, they regarded all concepts with semantic relations to candidate triggers as candidate arguments. They then computed the similarity between each pair of candidate trigger and argument to identify the final triggers and arguments.
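The verb-as-trigger heuristic can be sketched on a hand-annotated dependency parse as follows. The token format, the Universal Dependencies style labels (`nsubj`, `obj`, `obl`) and the restriction to direct dependents of the verb are our simplifying assumptions; the reviewed systems rely on a real parser and on longer dependency paths.

```python
def extract_events(tokens):
    """tokens: list of dicts with 'text', 'pos', 'head' (index or None) and 'dep'."""
    argument_deps = {"nsubj": "subject", "obj": "object", "obl": "modifier"}
    events = []
    for i, tok in enumerate(tokens):
        if tok["pos"] != "VERB":
            continue  # only verbs are treated as triggers in this sketch
        args = {}
        for t in tokens:
            # keep direct dependents of the verb that carry an argument-like relation
            if t["head"] == i and t["dep"] in argument_deps:
                args[argument_deps[t["dep"]]] = t["text"]
        events.append({"trigger": tok["text"], "arguments": args})
    return events

# Hand-annotated parse of "Rebels attacked the town yesterday" (illustrative only).
parsed = [
    {"text": "Rebels", "pos": "NOUN", "head": 1, "dep": "nsubj"},
    {"text": "attacked", "pos": "VERB", "head": None, "dep": "root"},
    {"text": "the", "pos": "DET", "head": 3, "dep": "det"},
    {"text": "town", "pos": "NOUN", "head": 1, "dep": "obj"},
    {"text": "yesterday", "pos": "NOUN", "head": 1, "dep": "obl"},
]
events = extract_events(parsed)
```

The sketch yields a single event triggered by "attacked" with its subject, object and temporal modifier as arguments, without requiring any predefined event type.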
Based on the extracted triggers and arguments, event instances can be clustered into different event groups each with a latent yet distinguished topic. For example, Chambers and Jurafsky [44] proposed to use a probabilistic latent Dirichlet allocation topic model to compute event similarities for their clustering. Romadhony et al. [49] proposed to utilize structured knowledge bases to cluster triggers and arguments. Jans et al. [240] identified and clustered event chains, which can be viewed as a partial event structure consisting of a verb and its dependency actor based on the skip-gram statistics.
Furthermore, for each event group, an event schema can also be established with a slot-value structure, where a slot can represent some argument role and a value the corresponding argument of an event instance. Yuan et al. [46] first introduced a new event profiling problem to fill a slot-value event structure from open-domain news documents. They also proposed a schema induction framework by exploiting entity co-occurrence information to extract slot patterns. Glavaš and Šnajder [52] argued that events often contain some common argument roles like agent, target, time and location and proposed to construct an event graph structure to identify the same event mentioned in different documents.
C. Event Extraction From Social Media
Many online social networks, such as Twitter and Facebook, present a large amount of up-to-date information. Twitter is a representative one. It has been reported that about 200 million tweets are posted every day [241] and that 1% of tweets cover 95% of the events also reported on newswire [242]. We next take Twitter as an example and review some work on event extraction from social media.
Compared with newswire articles, tweets have their own characteristics, and event extraction from them faces new challenges. Tweets are mostly published by individual users, and each tweet has a character limit. As a result, tweets often contain abbreviations, misspellings and grammatical errors, which leads to fragmented and noisy text without enough context for event extraction [28], [38]–[41]. In contrast to event extraction from newswire, entities, dates, locations and keywords are the main components to be extracted from tweets. Since tweets are short, all entities in a tweet are considered as participants of an event of interest.
Weng and Lee [42] proposed to first analyze individual words with their frequencies and filter out trivial words based on their correlations. After that, the remaining words are clustered to form events by a modularity-based graph partitioning technique. Ritter et al. [38] utilized a named entity tagger and the TempEx tool [243] to resolve temporal expressions. Zhou et al. [40], [41] proposed to filter out noisy tweets through lexicon matching. The lexicon contains event keywords (or triggers) which are extracted from newswire articles published in about the same period as the tweets. Furthermore, Zhou et al. [40] proposed to first identify named entities from newswire and build a dictionary to match the named entities in tweets. In particular, they employed SUTime [244] to resolve the ambiguity of time expressions and proposed an unsupervised latent event model to extract events from tweets. Guille et al. [43] proposed a mention-anomaly-based event detection algorithm that detects events by computing the anomaly in the occurrence frequency of a word over a contiguous sequence of time slices.
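To illustrate the notion of a frequency anomaly over contiguous time slices, the sketch below computes, for one word, the difference between its observed and expected counts in each slice and then finds the contiguous interval with the largest summed anomaly. The expected-count model, the maximum-sum interval search and the toy counts are our simplifying assumptions and are not necessarily the exact formulation of [43].

```python
def anomalies(word_counts, total_counts):
    """Observed minus expected count of a word in each time slice."""
    rate = sum(word_counts) / sum(total_counts)  # overall share of tweets with the word
    return [w - rate * n for w, n in zip(word_counts, total_counts)]

def most_anomalous_interval(values):
    """Contiguous slice range (start, end) with the largest summed anomaly."""
    best_sum, best_range = values[0], (0, 0)
    current, start = values[0], 0
    for i in range(1, len(values)):
        if current < 0:
            current, start = values[i], i  # a negative prefix never helps; restart here
        else:
            current += values[i]
        if current > best_sum:
            best_sum, best_range = current, (start, i)
    return best_range, best_sum

# Made-up counts: tweets mentioning a word vs. all tweets, for six time slices.
word_counts = [2, 1, 3, 20, 25, 4]
total_counts = [100, 100, 100, 100, 100, 100]
interval, score = most_anomalous_interval(anomalies(word_counts, total_counts))
```

In this toy series the burst in the fourth and fifth slices is returned as the most anomalous interval, which would be taken as the time span of a candidate event.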
Event Extraction Performance Comparison
As reviewed in the previous sections, event extraction may have different task definitions and may have been evaluated on different corpora. As such, a fair comparison across all algorithms is unlikely to be possible. But thanks to the public evaluation programs, the standardization of performance metrics and the open datasets have made it possible for researchers to compare their algorithms. In this section, we mainly compare the algorithms evaluated on the public ACE 2005 dataset with the standard evaluation procedures as follows:
Trigger detection: A trigger is correctly detected if its offsets (viz., the position of the trigger word in text) match a reference trigger.
Type identification: An event type is correctly identified if both the trigger’s offset and event type match a reference trigger and its event type.
Argument detection: An argument is correctly detected if its offsets match any of the reference argument mentions (viz., correctly recognizing participants in an event).
Role identification: An argument role is correctly identified if its event type, offsets, and role match any of the reference argument mentions.
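The matching criteria above can be turned into a small scoring routine. The sketch below covers trigger detection and type identification; the tuple layout `(start, end, event_type)` and the toy annotations are hypothetical, and the official scorers may handle details such as duplicates differently.

```python
def prf(predicted, gold):
    """Precision, recall and F1 for two sets of items compared by exact match."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Each trigger is (start_offset, end_offset, event_type); the values are made up.
gold_triggers = {(10, 16, "Attack"), (40, 44, "Die"), (70, 78, "Transport")}
pred_triggers = {(10, 16, "Attack"), (40, 44, "Injure"), (90, 95, "Meet")}

# Trigger detection: only the offsets must match a reference trigger.
detection = prf({t[:2] for t in pred_triggers}, {t[:2] for t in gold_triggers})
# Type identification: the offsets and the event type must both match.
identification = prf(pred_triggers, gold_triggers)
```

In this toy case two of the three predicted triggers have correct offsets but only one also has the correct type, so the type identification scores are lower than the trigger detection scores. Argument detection and role identification can be scored in the same way by extending the tuples with the event type and the role.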
Table 4 and Table 5 present the event extraction results reported by different algorithms on the English and Chinese datasets, respectively, in terms of the standard F-Measure (F1), which is computed from two commonly used performance metrics, viz., Precision and Recall. We note that not all algorithms reported all the results, as some of them were only designed for some of the subtasks. It is also worth noting that event extraction depends on the results of some upstream tasks, like named entity recognition and entity mention classification. Most of these algorithms have assumed that the gold annotations for entities, times and values in ACE 2005 can be directly used as a part of the input to the event extraction task. However, we note that ACE 2005 is a small dataset that even contains a few erroneous annotations.
From the tables, we can observe that, in general, algorithms on English achieve better performance than those on Chinese. This may be because event extraction for Chinese sentences also depends on the word segmentation results. Furthermore, we can also note that recent advanced neural models achieve better performance, which can be attributed to their powerful capabilities for learning deep yet more comprehensive context-aware and/or syntactic-aware word and sentence representations. Finally, we can observe that none of the subtasks has achieved very high F1 values, which, on the one hand, indicates the difficulty of event extraction tasks, and on the other hand, motivates the development of more advanced algorithms.
Conclusion and Discussion
Event extraction is an important task in natural language processing, with the objectives of detecting whether sentences mention some real-world event, and if so, classifying event types and identifying event arguments. Owing to its diverse applications, event extraction has been intensively researched for decades, and recently it has been attracting more research interest than ever due to the fast development of many novel techniques like deep learning.
In this article, we have tried to provide a comprehensive yet up-to-date review for event extraction from text. We first introduced the public evaluation programs as well as their task definitions and annotated datasets for both closed-domain and open-domain event extraction. We divided the solution approaches into five groups, including pattern matching algorithms, machine learning methods, deep learning models, semi-supervised learning techniques, and unsupervised learning schemes. We have presented and analyzed the most representative methods in each group, especially their origins, basics, strengths and weaknesses. In addition, we introduced evaluation methods and compared typical algorithms experimented on the ACE 2005 corpus.
The pattern matching approaches can normally achieve high extraction accuracy; however, pattern construction comes with the prohibitive cost of human effort and professional knowledge. The supervised learning approaches, including machine learning and deep learning, on the other hand, depend on the availability of a large annotated corpus, although they seem to achieve better performance. Furthermore, these closed-domain extraction algorithms are normally trained for specific domains, and may not be directly applied to other domains. For open-domain event extraction, how to deal with noisy text like social media posts and how to organize temporal-spatial events are still great challenges in practice. In what follows, we discuss possible future research directions for event extraction.
A. Knowledge-Boosted Deep Learning
Although deep learning techniques have proven to be a powerful tool in automatic feature learning for event extraction, those neural models normally have too many learning parameters as well as network configurations, which not only demand a huge amount of annotated raw data but also require careful tuning of numerous network configurations. On the other hand, pattern matching can well exploit experts’ knowledge for specifying accurate event patterns, though at the cost of increased human effort. Although the two approaches seem contrary to each other, a promising direction is to boost deep learning models with the inclusion of experts’ knowledge. While an initial attempt can be envisioned as using patterns in the neural model training process, more effort needs to be devoted to exploiting high-quality human knowledge.
B. Domain-Adaptive Transfer Learning
In closed-domain event extraction, various event schemas are usually predefined with detailed event types and event arguments’ roles. Although schemas provide structured representations for events, clearly defining schemas involves domain-specific knowledge. Furthermore, domain-specific schemas are not easily extended from one domain to other domains, although many arguments with different labels share similar roles even in different types of events. Transfer learning seems to be a promising approach for developing domain-adaptive event extraction systems, where an extraction model trained in one domain can be easily applied to another domain with only a few adjustments. Nonetheless, it could be an insightful attempt to first transfer the commonly trained models for multi-type classification into multi-task learning models.
C. Resource-Aware Event Clustering
In open-domain event extraction, events are first detected from different sources, and event mentions or keywords are next extracted. Then event clustering is often performed according to the similarities between event keywords, that is, a kind of topic-centered event clustering is enforced. However, we notice that events can be described not only by their keywords; other resources in texts, like the publisher, date of publication, spatial tagging, and various relations between subjects and/or objects, can also be exploited to enrich the clustering process. On the one hand, we agree that topic-centered event clustering should still be the basic operation for open-domain event extraction; on the other hand, we note that with the inclusion of other resources, multi-focus clustering can be implemented, which might further promote other tasks, like event reasoning for answering why such events happen, and event inference for answering what kind of events are expected next.