
Building Character Graphs and Dividing Communities in Chinese Novels Based on Graph Data Extraction: Community Division for Character Emotional Polarity Networks


The roles of the co-occurrence network in RenMinDeMingYi and community division performed by CDC method shown as original result where pink nodes represent the positive c...


Abstract:

Chinese authors indirectly portray real social relationships in society by incorporating their own experiences and feelings about politics, economic life, and cultural habits. Social and natural knowledge are mixed and hidden in natural literary language. Therefore, by clearly defining the type of original text, the pipeline for processing the text, and the rules for building the graph, we can extract enough valid and useful information. Based on this higher-level information, we can build a literary character graph to promote computers’ comprehension of literature. The identification of opposing characters is challenging because character relationships are intricate. Cryptic expression in literary works can enhance the readability of the plot, but it also increases the cost of understanding. Even for human readers, different types of characters are difficult to classify and identify, and it usually takes several careful readings to grasp the subtle differences among the characters. Although extracting knowledge from Chinese literature is complicated and difficult, this paper models the graph of polar literary characters and divides polar communities by extracting polar vertices and polar edges based on the concept of literary sentiment polarity. The results show that using a long-sentence window is a good trade-off between the accuracy of the interactions captured in a window and the sparsity of the information it contains. Experiments on modern Chinese polar literature show that the accuracy of the community division method based on the integrated graph polarity is clearly better than that of the method based on the co-occurrence network, and that it can automatically match the positive and negative communities as well as build the complete graph structure. It clearly reflects the meaning of literary works from the perspective of polarity. This systematic model describes general standard operation steps for the machine to understand complex Chinese literature.
The experiments conducted on seven benchmark Chinese novel datasets demonstrate that...


Published in: IEEE Access ( Volume: 8)

Page(s): 95559 - 95573

Date of Publication: 19 May 2020

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2020.2995738


CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Literature is an important form for people to spread culture and exchange ideas by shaping images and reflecting social life through language [1]. The rich text mining value of long texts carries and records much information related to human society in the history of Chinese culture. With literary long texts, many political, economic and cultural content descriptions in human life are preserved for posterity and can be studied. Digital technology has promoted the increase in electronic long-text resource reserves, is user-friendly, provides a clear definition of copyright, and provides more opportunities for scientific research and general applications [2].

The exponential increase in the amount of electronic text has made fast and effective retrieval of target information from massive texts an urgent need [3]. Text mining techniques can be used on demand to acquire knowledge of potential value for interested users [4] and to extract previously unknown, understandable, and potentially useful patterns and knowledge from large long-text corpora [5]. They can address the problems of managing, utilizing, and applying many unorganized electronic long texts. Analyzing the network of persons in literary works [6] helps publishers and readers find the information they need in long texts, supports publishing decisions, mines long-text knowledge bases, and effectively improves the accuracy of text retrieval and plot understanding.

A novel is a kind of literary narrative with large capacity, complete layout, plot development, and various themes. By describing the story of a character and shaping the character’s image, the character and destiny of the character can be represented in detail, intricate contradictions can be expressed, and the social environment in which the character is located can also be described. Based on these literary contents, combined with the field of social network analysis, the value of characters and their relationships within novels can be discovered.

A typical research area in social network analysis is community division, also known as community discovery [7]. Through user behavior analysis and social network sentiment analysis [8], the closely connected node groups in social networks are determined. Modern Chinese literature is a very good dataset. It reflects real life and reveals the contradictory aspects of modern Chinese society [9]. This type of literature has rich emotions and obvious community differentiation. This article uses “polarity” to describe this sharp opposition in emotions and communities.

Community detection is an important tool for analyzing hidden information, such as functional modules and topological structure, in complex networks [10]. Traditional social network analysis of literary works relies on co-occurrences to determine the relationships between characters in a novel and to construct and analyze the character network [1], which usually discards the emotions, plots, and characterization in the works. To mine this rich internal textual information, the intrinsic linguistic connection with the literary language of the plot must be explored first.

Although novels are fictional, authors draw fully on their understanding of life and their social experience in real society during the creative process. Therefore, their works are a reflection of society, and analysis based on them is reasonable and can indirectly reflect the social value of the author’s cognitive perspective.

Previous studies [11]–[14] mostly focused on selecting influential nodes in social network data independently, and they still need to identify other classes of nodes for pivot selection. Reference [15] simply used node centrality to generate a structure-based network and did not consider context. Reference [16] provided an overview of the characters and the relationships between them from narrative texts, but presented no further analysis of the network structure. In [17], the authors conducted community detection and character categorization to identify the protagonists and antagonists. This dimension gives limited, frequency-based information about the original literature. Overall, few previous works have completed graph structure extraction [18] and community division from extracted polar nodes. Extracting knowledge from Chinese literature is a complex and difficult topic for the following reasons:

First, to improve the attractiveness and readability of the works in the creative process, the author usually deliberately conceals, covers, reverses, reveals, and darkens the relationship between the antagonist and the protagonist, which makes it difficult to divide the characters into subgroups [19].

Second, the novel’s natural language has complex semantics and many twists and turns. Although many existing techniques can be used for reference, named entity recognition [20] and emotional polarity confirmation still need fuller theoretical and experimental verification.

Third, although the amount of digital text is sufficient, the proportion of the labeled text is low, manual labeling confirmation lacks a uniformly standard definition, and the cost is high, which brings difficulties to our work.

In addition, the characters in novels usually have dynamic personalities, which are hidden in various aspects of expression. Even for human readers, it is difficult to classify and identify different types of characters, and it usually takes several careful readings to grasp the subtle differences among those characters.

Finally, only a limited number of datasets are available for applying machine learning algorithms [21], [22], and producing annotations on a large scale is time-consuming [23]. Labeling prior knowledge manually is laborious in scenarios where prior knowledge is hard to obtain [24].

Our work combines theory from natural language processing, knowledge graphs, and social networks. We systematically propose an integrated framework, conduct experiments on typical datasets, present the outcomes, and address this open topic by tackling problems from different fields together. We establish the systematic method and start from relatively polar literature that consists of characters with steady and clear personalities. It is a novel attempt at community division with character graph building and polarity information extraction. This paper uses natural language processing based on the concept of literary sentiment polarity to select a representative polar literary dataset, extract characters and sentiment indication information in literary space, and mine the node polarity and link polarity contained in the text. After modeling the network of polar literary characters, we divide polar communities in the network. This study is a comprehensive application in the field of text classification and information extraction that integrates natural language processing and social network theory, combined with extracting graph data and visualizing these data and their knowledge structure.

The results show that the use of long-sentence windows ensures the accuracy of the literary interactions captured in a window during information extraction while keeping the information in the window from being too sparse, which better ensures the accuracy of network modeling. As there is no related benchmark for this work, we select the basic community detection method with a general co-occurrence parameter as the baseline. The accuracy of the community division results of the integrated network polarity method is significantly improved compared to the results of the method based on the co-occurrence network. It can automatically match the positive and negative communities and includes the complete link structure of the network that clearly reflects the literature from the perspective of polarity, which is the intention of this work.

Our main contributions in this paper are as follows. First, a systematic model is defined for processing long natural language text. It provides general standard operation steps for the machine to understand complex Chinese literature and extract graph data. Second, we propose four methods, together with an evaluation standard for parameter selection, to build a character graph and divide communities. They combine emotional information with social network computing to form a character network with emotional knowledge. Third, experiments conducted on seven benchmark Chinese novel datasets report the accuracy and reliability of the methods, provide visualized graph data structures that make them more applicable, and demonstrate that the method based on emotional polarity shows a significant improvement over the baseline. Instead of spending manpower on analysing graph knowledge in texts as before, the automatic methods provide more opportunities for application in data mining. For texts in the literature industry, the methods are designed to benefit writers, readers, and publishers. For other compatible types of text, the methods can benefit politicians, governments, the media industry, and knowledge digitalization.

SECTION II.

Graph Data Extraction and Character Graph Building

A. Polarity Data and the Polar Literary Character Network

Literary emotional polarity (LEP) refers to the obvious literary emotional tendencies of the characters, and of the relationships between them, that a literary work exhibits under a specific classification criterion based on community factions (such as protagonist versus antagonist) when the author develops the plot and describes the characters’ behaviors.

The discrimination of literary polarity provides a sense of community division within the literature. As characters in a literary work can be determined by emotional tendencies, character relationships can also be determined by emotional tendencies and values. Literary works of polarity are referred to as polar literature (PL).

The content of PL shows polarity. Therefore, the nodes, relationships, and communities extracted according to the content also show polarity. Together, they form a network model containing polarity information and therefore we can build the knowledge graph [25]. We will define and explain these concepts as below.

Definition 1 (Polarity of Vertices, PoV):

In PL, all the characters described by the author form the literary character set $V=\{v_{1},v_{2},\ldots,v_{N} \}$ . If the literary content has a clear positive or negative tendency toward a certain node $v$ in $V$ , $v$ has vertex polarity, and $v$ is called a polar vertex (PV). For each polar vertex, the degree to which the literary content praises or denounces it can be determined and recorded as the vertex polarity value $p(v)\in R$ , which is also called the character polarity.

When the author’s description and the reader’s understanding reach a consensus that the emotions are positive, $p(v)$ is larger than a polar threshold value $thresh$ , and the node is known as a positive node, denoted by $v^{+}$ ; a negative emotion corresponds to a negative node, where $p(v)$ is less than $thresh$ , denoted by $v^{-}$ . If we confirm that a node has polarity but have not completed the community division work and have not specifically determined its polarity, then we consider the polar node an unclassified node $v'$ . If the emotions of node $v$ tend to be neutral, multifaceted, or changeable, or if the authors and readers cannot reach a consensus on $p(v)$ , or if $p(v)$ is close to $thresh$ , then $v$ does not have polarity and is called a nonpolar vertex, denoted by $v^{0}$ .

In this paper, it is necessary to remove $v^{0}$ from the network. In PL, the number of $v^{0}$ nodes is small, and the frequency is low, so it has little effect on community division [26].
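
The three-way labeling in Definition 1 can be illustrated with a minimal sketch. The function names, the default `thresh`, and the neutral band `eps` (our reading of "close to $thresh$") are assumptions made for illustration; the paper does not specify how closeness is measured.

```python
from typing import Dict


def classify_vertices(p: Dict[str, float], thresh: float = 0.0,
                      eps: float = 0.5) -> Dict[str, str]:
    """Label each character 'v+', 'v-' or 'v0' from its polarity value p(v).

    eps is a hypothetical neutral band around thresh (an assumption).
    """
    labels = {}
    for name, value in p.items():
        if abs(value - thresh) <= eps:
            labels[name] = "v0"   # too close to thresh: treated as nonpolar
        elif value > thresh:
            labels[name] = "v+"   # positive vertex
        else:
            labels[name] = "v-"   # negative vertex
    return labels


def polar_subset(labels: Dict[str, str]) -> Dict[str, str]:
    """Drop the nonpolar vertices v0, as done before building G'."""
    return {name: lab for name, lab in labels.items() if lab != "v0"}


scores = {"A": 12.0, "B": -7.5, "C": 0.2}   # illustrative values only
labels = classify_vertices(scores)
polar = polar_subset(labels)
```

Here character `C` falls inside the neutral band and is removed, which mirrors the exclusion of $v^{0}$ from the network.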

Definition 2 (Polarity of Edges, PoE):

In PL, the relationship between all the characters described by the author is called a link set $E=\{e_{1},e_{2},\ldots,e_{M} \}$ , where any link $e$ corresponds to a node tuple $(v_{i},v_{j})$ . If the link $e$ of a certain pair of polar nodes $(v'_{i},v'_{j})$ in the literary content has a clear tendency of emotional attraction or repulsion, then $e$ has a link polarity, and $e$ is called a polar edge (PE). For each link with polarity, the degree of emotional attraction or repulsion can be determined and recorded as the edge polarity value $p[e(v'_{i},v'_{j})]\in R$ . PoE is also known as the polarity of relations.

If $v'_{i}$ and $v'_{j}$ of $e$ correspond to the same polarity, then $p[e(v'_{i},v'_{j})]>0$ , and $e$ is called a positive link; otherwise, $p[e(v'_{i},v'_{j})]< 0$ , and $e$ is called a negative link. Because the edge polarity value indicates the edge polarity, in this paper the link polarity value is simply referred to as the edge polarity. If the emotional tendency of the link $e$ is neutral, if readers differ in their understanding of the description, if the emotion cannot be determined to obtain $p[e(v'_{i},v'_{j})]$ , or if $p[e(v'_{i},v'_{j})]=0$ , then the link does not have polarity, and the edge $e(v'_{i},v'_{j})$ is called a nonpolar link. Because a nonpolar node’s emotional tendency is fuzzy and its emotional relationship with other nodes (polar or nonpolar) is also fuzzy, the nodes in the two-tuple of a polar link are always polar nodes.

Definition 3 (Literary Character Network $G(V,E))$ :

It refers to the simple undirected positive-weighted network composed of the character set $V$ and the link set $E$ in the literature.

Definition 4 (Polarized Literature Character Network $G'(V',E'))$ :

It refers to the undirected network composed of PV set and PE set in PL. In this paper, $v^{0}$ will be excluded from $V$ to get $V'$ . For literature with rich polarity, each node has one and only one polarity $v'\in \{v^{+},v^{-}\},v'\ne v^{0}$ , and each PoE has a nonzero value $p[e(v'_{i},v'_{j})]\ne 0$ , and all $v'$ and $p[e(v'_{i},v'_{j})]$ sufficiently characterize all the information in $G'$ . Not all literary character networks can construct polar literary character networks.

Definition 5 (Characters Community, CC):

CC is also called the literary vertices sub-network and refers to the sub-network corresponding to the characters that are closely connected in $G(V,E)$ .

The phenomenon of communities in the network diagram is called the community structure, and the community structure is a common feature of networks. The general literary character network comprises two types of communities: communities that do not intersect with each other are called nonoverlapping (disjoint) communities, and communities containing intersections are called overlapping communities [27]. Given a network, the process of determining the community structure is called community detection. The detected communities are denoted by $G_{1},G_{2},\ldots,G_{C}$ , $C\ge 2$ , $G_{1} \cup G_{2} \cup \ldots \cup G_{C} =G$ . For nonoverlapping communities, $G_{i} \cap G_{j} =\emptyset$ for all $i\ne j$ .
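
The two conditions above (the communities cover $G$, and nonoverlapping communities are pairwise disjoint) can be checked with a small sketch. It works at the vertex level only, representing each community as a set of character names; the function names are illustrative assumptions.

```python
from itertools import combinations
from typing import Iterable, List, Set


def covers_network(communities: List[Set[str]], vertices: Iterable[str]) -> bool:
    """True if the union of all communities equals the vertex set of G."""
    return set().union(*communities) == set(vertices)


def is_nonoverlapping(communities: List[Set[str]]) -> bool:
    """True if G_i and G_j share no vertex for every pair i != j."""
    return all(a.isdisjoint(b) for a, b in combinations(communities, 2))


disjoint = [{"A", "B"}, {"C", "D"}]
overlapping = [{"A", "B", "C"}, {"C", "D"}]   # "C" belongs to both
```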

Definition 6 (Polarity of Community, PoC):

In $G'(V',E')$ , the two nonoverlapping communities that consist, respectively, of the positive vertices (which the literary content tends to praise) and the negative vertices (which it tends to denounce) are referred to as polar communities (PC).

A polar literary character network $G'$ has exactly two polar communities, one positive and one negative. The positive community $G^{+}$ is composed of positive vertices, and the negative community $G^{-}$ is composed of negative vertices. In the polarized literary character network, every vertex must belong to either the positive or the negative community. In this study, the network has at least one positive vertex and one negative vertex.

The goal of community division (CD) is to extract the information inherent in the literature by identifying PV, PE, and classifying the vertices into the corresponding PC. A key issue of literary polarity is the classification criteria used to determine the polarity. The construction of a literary character network is based on the reading experience and plot development. In the novels from the new era, literature has developed in a pluralistic trend, and various plot modes have been blended with each other, thereby comprehensively and profoundly showing the complex and multifaceted spiritual world of human beings [28]. There are usually two different frameworks for character classification in novels: one framework is the nature of the character itself, and the other framework is the role of the character in the work [29].

The criterion of actionability is not suitable for the classification standard of literary polarity. The judgment of social morality may be different for different readers, so the polarity should be judged by the author. From the author’s specific perspective, readers and authors can reach a consensus and obtain reliable emotional orientation results, which is a good polarity division standard, and a literary work applicable to this standard is a polarized literary work. This paper’s work is different from analyzing general social network communities; the two communities have obvious polar distinctions.

B. Text Space and the Text Space Entropy

Definition 7 (Morpheme $c$ ):

In morphology, morphemes are the smallest grammatical units with a combination of phonetics and semantics.

Morphemes can be divided into free morphemes, bound morphemes, and semi-free morphemes, according to whether they can form words alone [30]. According to the different information planes used in this text, we define three types of text spaces: a natural morpheme space, a character-indicating text space, and an emotion-indicating text space [31]. According to the extraction order of text information, we first introduce the natural morpheme sequence:

Definition 8 (Natural Morpheme Sequence, NMS):

$S_{N} =\{c_{i} \},i=1,2,3,\ldots$ . NMS is a finite sequence of all morphemes $c$ in a natural language text. It is the embodiment of the unprocessed natural language of a long-text novel. It contains complete morphemes and does not include high-level information such as the section, segment, emotion, and social network level. NMS is characterized by large morpheme capacity but sparse information content.

Definition 9 (Character-Indicating Sequence, CIS):

$S_{V} =\{c_{Vi} \},i=1,2,3,\ldots$ , in which $c_{Vi} \in V$ and $c_{Vi} \in S_{N}$ , $S_{V} \subset S_{N}$ . CIS consists of the specific morphemes $c_{Vi}$ , extracted from $S_{N}$ , that indicate the appearance of a character; the order of the original morpheme sequence $S_{N}$ is retained to form $S_{V}$ .

Definition 10 (Emotion-Indicating Sequence, EIS):

$S'=\{c'_{i} \},i=1,2,3,\ldots$ , in which $c'_{i} \in S_{N}$ , $S'\subset S_{N}$ . EIS contains the morphemes $c'_{i}$ , extracted from $S_{N}$ , that can express emotional polarity. NMS, CIS, and EIS are independent finite-length sequences of morphemes. To characterize the interaction and correlation between morphemes, it is also necessary to indicate the scope of morphemes, as defined below.

Definition 11 (Literacy Field Window, LFW):

An LFW is a segment obtained by cutting the sequence under a given segmentation mode (chapter identifiers, paragraph delimiters, or periods and other punctuation). It represents the literary unit that bounds the largest distance over which morphemes interact.

The segmentation mode refers to the methods of semiotics. This paper does not consider the influence of inner punctuation on semantics [32]. The size of the LFW is called granularity, denoted as $Q$ . When a different $Q$ is selected, the amounts of information contained in the text space generated by the text sequence are different. The chapter window is divided according to the chapter identifier, the paragraph window is divided according to the paragraph separator, the long-sentence window is divided according to the end of the sentence, and the short-sentence window is divided according to other punctuation [33].
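
The window granularities described above can be sketched as a simple splitter. This is a minimal illustration, not the authors' implementation: the delimiter sets and the function name are assumptions, since the paper does not list the exact punctuation characters, and the chapter window is omitted because chapter identifiers vary between novels.

```python
import re
from typing import List

# Assumed delimiter sets (hypothetical; not given in the paper).
LONG_SENTENCE_END = "。！？!?"
SHORT_SENTENCE_END = LONG_SENTENCE_END + "，；：,;:"


def split_windows(text: str, q: str) -> List[str]:
    """Cut raw text into literary field windows of granularity q."""
    if q == "paragraph":
        parts = text.split("\n")
    elif q == "long":
        parts = re.split("[" + re.escape(LONG_SENTENCE_END) + "]", text)
    elif q == "short":
        parts = re.split("[" + re.escape(SHORT_SENTENCE_END) + "]", text)
    else:
        raise ValueError("unknown granularity: " + q)
    # Drop empty fragments left behind by consecutive delimiters.
    return [p.strip() for p in parts if p.strip()]


sample = "高兴，他笑了。她很生气！\n第二段。"
paragraphs = split_windows(sample, "paragraph")
long_sentences = split_windows(sample, "long")
short_sentences = split_windows(sample, "short")
```

A finer $Q$ yields more, smaller windows from the same text, which is why the information content of the resulting text space depends on the granularity.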

Definition 12 (Polarity of Window, PoW):

PoW is the tendency of the LEP expressed by the entire text in the LFW. When $Q$ is a short sentence, it is called the short-sentence polarity; when $Q$ is a long sentence, the long-sentence polarity; when $Q$ is a paragraph, the paragraph polarity; and when $Q$ is a chapter, the chapter polarity. PoW is calculated as

$\begin{equation*} p(LFW)=p(S_{j} \subseteq S)=\sum _{c_{i} \in S_{j}} p(c_{i}).\tag{1}\end{equation*}$
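
Equation (1) amounts to summing morpheme polarities over a window. The sketch below assumes the window has already been segmented into morphemes and that a polarity lexicon mapping emotion-indicating morphemes to real values is available; both the lexicon and the function name are hypothetical.

```python
from typing import Dict, Iterable


def window_polarity(window: Iterable[str], lexicon: Dict[str, float]) -> float:
    """PoW: sum of p(c) over the morphemes c in one window.

    Morphemes missing from the lexicon contribute 0 (an assumption).
    """
    return sum(lexicon.get(c, 0.0) for c in window)


lexicon = {"高兴": 1.0, "生气": -1.0}   # illustrative values only
pow_positive = window_polarity(["他", "很", "高兴"], lexicon)
pow_negative = window_polarity(["高兴", "生气", "生气"], lexicon)
```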

Definition 13 (Literacy Filter, LF):

Under an LFW of a certain granularity $Q$ , the morpheme sequence can be divided into a series of LFW units $t=\{c_{i},c_{i+1},\ldots,c_{j} \}$ . The morpheme sequence is uniquely mapped to a text space $T=\{t_{i} \},i=1,2,3,\ldots$ , composed of literary interaction units. This extraction is written $T=F(S,Q)$ , and the LF process is denoted as

$\begin{equation*} T=F(S,Q)=LFW_{Q} (S)=\{S_{j} \subseteq S\}_{Q}.\tag{2}\end{equation*}$

Under a set of $F$ with different $Q$ , $S_{N}$ , $S_{V}$ , and $S'$ can be used to obtain the natural morpheme text space $T_{N}$ , character-indicating text space $T_{V}$ , and emotion-indicating text space $T'$ , respectively. The unit $t_{N}$ of the natural morpheme text space $T_{N}$ is called the natural morpheme window, the unit $t_{V}$ of the character-indicating space $T_{V}$ is called the character-indicating window, and the unit $t'$ of the emotion-indicating space $T'$ is called the emotion-indicating window. $t_{N}$ , $t_{V}$ and $t'$ are composed of morphemes $c$ , character-indicating morphemes $c_{V}$ , and emotion-indicating morphemes $c'$ . The granularity $Q$ of $F$ determines the information content of the filtered text space $T$ . Therefore, when $Q$ is selected, the information contained in the text space $T$ needs to be evaluated, as defined below.
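
As a rough sketch of how $T_{V}$ and $T'$ could be derived from the same windows, the function below keeps, for each natural morpheme window, only the character-indicating or only the emotion-indicating morphemes, preserving their order. The character set and lexicon are assumed inputs, and the function name is illustrative.

```python
from typing import Dict, List, Set, Tuple


def literary_filter(
    windows: List[List[str]], characters: Set[str], lexicon: Dict[str, float]
) -> Tuple[List[List[str]], List[List[str]]]:
    """Return (T_V, T'): per-window character and emotion morphemes."""
    t_v = [[c for c in w if c in characters] for w in windows]
    t_e = [[c for c in w if c in lexicon] for w in windows]
    return t_v, t_e


t_n = [["A", "praised", "B"], ["C", "hated", "A"]]   # toy morpheme windows
t_v, t_e = literary_filter(t_n, {"A", "B", "C"}, {"praised": 1.0, "hated": -1.0})
```

Because both outputs are indexed by the same windows, the $i$-th character-indicating window and the $i$-th emotion-indicating window can later be coupled directly.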

Definition 14 (Text Space Entropy, TSE):

The information content of the text space $T$ divided by the granularity $Q$ is referred to as the TSE, denoted as $H$ .

The text space entropy is used to evaluate the information content of the text space when extracting subspaces. Information theory states that the amount of information is related to changes in the uncertainty, and the changes in uncertainty are related to the number and probability of possible outcomes [34],

$\begin{equation*} H(T)=E[I(T)]=E[-\ln (P(X))].\tag{3}\end{equation*}$

Here, $X$ is the variable of interest when extracting information from the literary interaction unit $t$ , $P$ is the probability mass function of $X$ [35], $E$ is the expectation operator, and $I(T)$ is the amount of spatial information in $T$ , called the spatial self-information. In particular, since $X$ takes finitely many values, when $X$ represents the number of $v^{+}$ and $v^{-}$ in a character-indicating window $t_{V}$ , the entropy of a text space can be written as

$\begin{equation*} H(T)=\sum \limits _{i} P(x_{i})I(x_{i})=-\sum \limits _{i} P(x_{i})\ln P(x_{i}).\tag{4}\end{equation*}$

The information content differs depending on whether chapter identifiers, paragraph separators, periods, or other punctuation are used as segmentation identifiers for the smallest unit of the natural morpheme space. When $Q$ is selected, the text space entropy above is used to evaluate the granularity.
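
Equation (4) can be evaluated empirically from the windows of a text space. In this sketch, $X$ is taken to be the pair (number of $v^{+}$, number of $v^{-}$) observed in each character-indicating window, and $P$ is estimated by relative frequency; both choices are our assumptions about how the estimate would be made.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Iterable, List, Tuple


def polarity_counts(t_v: List[List[str]], labels: Dict[str, str]) -> List[Tuple[int, int]]:
    """Observation X per window: (count of v+, count of v-)."""
    return [
        (sum(labels.get(c) == "v+" for c in w), sum(labels.get(c) == "v-" for c in w))
        for w in t_v
    ]


def text_space_entropy(observations: Iterable[Hashable]) -> float:
    """H(T) = -sum_i P(x_i) ln P(x_i), with P estimated by frequency."""
    counts = Counter(observations)
    n = sum(counts.values())
    return -sum((k / n) * math.log(k / n) for k in counts.values())


labels = {"A": "v+", "B": "v-"}
t_v = [["A"], ["A"], ["B"], ["B"]]
h = text_space_entropy(polarity_counts(t_v, labels))
```

Computing this value for each candidate granularity gives one number per $Q$ that can be compared when choosing the window.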

In the following section, we divide the community based on the natural morpheme space $T_{N}$ , character-indicating space $T_{V}$ , and emotion-indicating space $T'$ to obtain the polar community division result.

SECTION III.

Building Character Graph and Dividing Communities Using Emotional Data of Character Network

A. Community Division Based on a Co-Occurrence Network Structure

In $T_{V}$ , the community division method based on the co-occurrence of character nodes uses the GN algorithm proposed by Girvan and Newman [36] as its network model. It produces the sub-communities using edge weights that are positive integers.

Representing the entire network through co-occurrences and then running this algorithm to complete community discovery is called community division based on co-occurrences (CDC).

We divide our methods into four parts to demonstrate, qualitatively and quantitatively, that CDEP performs much better than CDC, which is GN community discovery based on co-occurrence frequency. CDC is a general but coarse method for transforming natural text into a knowledge graph.

The community division of the co-occurrence network structure is completed based on the general literary character network $G(V,E)$ . Regardless of the emotional polarity information, the literary character set $V$ and the link set $E$ are extracted from $T_{V}$ , and the weights of the links $w[e(v'_{i},v'_{j})]$ in $E$ indicate the interaction strength between nodes.
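
A minimal sketch of how the co-occurrence weights could be counted from $T_{V}$ is given below. It assumes one count per window for each unordered pair of distinct characters appearing together; whether repeated mentions within a window count more than once is not stated in the paper.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence_weights(t_v: List[List[str]]) -> Dict[Tuple[str, str], int]:
    """Weight of each character pair = number of windows they share."""
    weights: Counter = Counter()
    for window in t_v:
        # sorted(set(...)) gives each unordered pair a canonical key
        for a, b in combinations(sorted(set(window)), 2):
            weights[(a, b)] += 1
    return dict(weights)


t_v = [["A", "B"], ["A", "B", "C"], ["C"]]
w = cooccurrence_weights(t_v)
```

The resulting weights are positive integers, as required by the weighted GN step.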

In this way, a simple community division result is obtained, that is, two community subnets. However, since $E$ does not contain emotional polarity information, these two subnets are the final division results, and the positive and negative polarities need to be manually matched. This method has flaws because it only considers the co-occurrence network structure and does not consider the context semantics of the co-occurrence of character nodes (i.e. co-occurring words, the emotional tendency of co-occurrence paragraphs) [37]. For example, the polarities are not clearly marked, and the division of polar communities is not accurate. We will extract more context and emotional information modularly in the following methods to improve the performance.
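
The GN step that CDC relies on can be sketched in plain Python. This is a simplified illustration under stated assumptions: betweenness is computed on unweighted shortest paths and then divided by the edge weight, edges are removed until the network first splits into `k` components, and the modularity bookkeeping of the full algorithm is omitted. Edge weights are assumed to be positive.

```python
from collections import deque
from typing import Dict, FrozenSet, List, Set

Adj = Dict[str, Set[str]]


def components(adj: Adj) -> List[Set[str]]:
    """Connected components found by breadth-first search."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        comps.append(comp)
    return comps


def edge_betweenness(adj: Adj) -> Dict[FrozenSet[str], float]:
    """Shortest-path edge betweenness (Brandes-style accumulation)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {u: 0.0 for u in order}
        for v in reversed(order):
            for u in preds[v]:
                share = sigma[u] / sigma[v] * (1.0 + delta[v])
                bet[frozenset((u, v))] += share
                delta[u] += share
    # every vertex pair was counted once from each endpoint
    return {e: b / 2.0 for e, b in bet.items()}


def gn_split(weights: Dict[FrozenSet[str], float], k: int = 2) -> List[Set[str]]:
    """Remove the edge with the largest B(e)/w(e) until k components exist."""
    adj: Adj = {}
    for e in weights:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    while len(components(adj)) < k and any(adj.values()):
        bet = edge_betweenness(adj)
        worst = max(bet, key=lambda e: bet[e] / weights[e])
        u, v = tuple(worst)
        adj[u].discard(v)
        adj[v].discard(u)
    return components(adj)


pairs = [("A", "B", 3), ("B", "C", 3), ("A", "C", 3), ("C", "D", 1),
         ("D", "E", 3), ("E", "F", 3), ("D", "F", 3)]
toy = {frozenset((a, b)): w for a, b, w in pairs}
communities = gn_split(toy)
```

In the toy network, two tightly connected triangles are joined by one weak link, so that link has both the highest betweenness and the lowest weight and is removed first. The split yields two unlabeled communities, which illustrates why CDC still needs the polarities to be matched manually.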

B. Community Classification Based on the Node Polarity

Community division based on node polarity is completed based on $G'(V',E')$ . When the network is modeled, $T_{V}$ and $T'$ are coupled within the corresponding window; i.e., when the emotion-indicating words in $t'$ point to a vertex $v$ in $t_{V}$ , we calculate $p(v)$ for this vertex. Then a suitable polarity threshold value $thresh$ is determined to confirm the polarity $\{v^{+},v^{-}\}$ of each vertex $v$ in set $V'$ .

Classifying all the positive nodes directly into the positive community $G^{+}$ and all the negative nodes directly into the negative community $G^{-}$ is called community division based on the vertices’ polarity (CDVP).

This method not only considers the co-occurrence network structure but also fuses the co-occurrence context semantics [38] of character nodes into the network through the PoW, completes the division of polar communities, and improves the utilization of the NMS. Therefore, the contribution to the accuracy of the model is greatly improved. However, the nodes are directly classified into the community, and the result does not include the graph structure of the network.
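
The CDVP procedure can be sketched as follows. The sketch assumes that the $i$-th windows of $T_{V}$ and $T'$ are aligned, that each character in a window receives the full window polarity once, and that `thresh` is supplied by the caller; the paper does not spell out how the threshold is computed, so the default of 0 is purely illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def cdvp(
    t_v: List[List[str]], t_e: List[List[str]],
    lexicon: Dict[str, float], thresh: float = 0.0,
) -> Tuple[Dict[str, float], Set[str], Set[str]]:
    """Accumulate p(v) window by window, then split into G+ and G-."""
    p: Dict[str, float] = defaultdict(float)
    for chars, emotions in zip(t_v, t_e):
        window_polarity = sum(lexicon[c] for c in emotions)
        for v in set(chars):          # once per window (assumption)
            p[v] += window_polarity
    g_pos = {v for v, value in p.items() if value > thresh}
    g_neg = {v for v, value in p.items() if value < thresh}
    return dict(p), g_pos, g_neg


lexicon = {"good": 1.0, "bad": -1.0}
t_v = [["A"], ["B"], ["A", "B"]]
t_e = [["good"], ["bad", "bad"], ["good"]]
p, g_pos, g_neg = cdvp(t_v, t_e, lexicon)
```

The output is two vertex sets without any links between them, which matches the remark above that CDVP does not produce the graph structure of the network.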

C. Community Classification Based on the Link Polarity

Community division based on the link polarity is also completed based on $G'(V',E')$ . While the network is being modeled, $T_{V}$ and $T'$ are coupled on the corresponding timing window. The emotion-indicating words in $t'$ point to the links $e$ between all possible pairs of nodes in $t_{V}$ . We then calculate the value of each link and use it as the weight in the GN algorithm to complete community discovery, so that the community is divided based on the link polarity. This process is called polar community division based on the edges’ polarity (CDEP).

Algorithm 3 CDEP

Input: $T_{V}$ , $T'$

Output: $E'=\{e(v'_{i},v'_{j})\}$ , where each $p[e(v'_{i},v'_{j})]$ is a real number

for each pair of corresponding windows $t_{V}$ in $T_{V}$ and $t'$ in $T'$ :
    for $c'$ in $t'$ :
        $p(c') =$ polarity judgment value of $c'$
        for $e(v'_{i},v'_{j})$ in $t_{V}$ :
            $p[e(v'_{i},v'_{j})]+=p(c')$

The set of polar links indicates the polarity interaction between nodes in the network of polar literary characters. Although it contains emotional polarity information, these two communities are the final division results, and their positive and negative polarities still need to be matched to complete the evaluation of the algorithm model. CDEP uses the PoW to measure the distance between nodes so that the communities in the network can be identified, and the community polarities can be divided by simple matching.

The magnitude of the link polarity value indicates the interaction strength of the attraction or repulsion of the node. The link polarity value is also a kind of co-occurrence number. What has been improved is that it is the number of co-occurrences after emotion indication space filtering from $T'$ .
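
The accumulation step of CDEP can be sketched as below. It assumes aligned windows and treats every unordered pair of distinct characters in a window as a candidate link, which is our reading of "all possible pairs of nodes"; the function name and lexicon are illustrative.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def edge_polarity(
    t_v: List[List[str]], t_e: List[List[str]], lexicon: Dict[str, float]
) -> Dict[Tuple[str, str], float]:
    """Add p(c') of every emotion word to every character pair in its window."""
    p: Dict[Tuple[str, str], float] = defaultdict(float)
    for chars, emotions in zip(t_v, t_e):
        pairs = list(combinations(sorted(set(chars)), 2))
        for c in emotions:
            for pair in pairs:
                p[pair] += lexicon[c]
    return dict(p)


lexicon = {"good": 1.0, "bad": -1.0}
t_v = [["A", "B"], ["A", "B", "C"]]
t_e = [["good", "good"], ["bad"]]
p_edge = edge_polarity(t_v, t_e, lexicon)
```

Compared with plain co-occurrence counting, a window contributes to a link only through its emotion-indicating words, so windows without emotional content leave the link values unchanged.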

D. Community Classification Based on the Polarity of the Entire Network

Community division based on the polarity of the integrated network, called polar community division based on the network’s polarity (CDNP), takes into account both the link polarity and the node polarity.

In addition, when coupling $T_{V}$ and $T'$ in the corresponding windows in the time sequence, each emotion-indicating word in $t'$ is directed to the vertices in $t_{V}$ to calculate $p(v)$ and is also directed to the links in $t_{V}$ to calculate $p(e)$ .

Algorithm 4 CDNP

Input:

$T_{V}$ , $T'$

Output:

$V'=\{v'_{i} \}$ , $v'_{i}$ is in $\{v^{+},v^{-}\}$

$E'=\{e(v'_{i},v'_{j})\}$ , $p[e(v'_{i},v'_{j})]$ is a real number

for $t_{V}$ in $T_{V}$ and $t'$ in $T'$:
    for $c'$ in $t'$:
        $p(c') =$ polarity judgment value of $c'$
        for $v'$ in $t_{V}$:
            $p(v') += p(c')$
for $v'$ in $V'$:
    if $p(v') > thresh$:
        $v'$ is $v^{+}$
    else if $p(v') < thresh$:
        $v'$ is $v^{-}$
for $(v'_{i},v'_{j})$ in $E'$:
    if $p(v'_{i})$ is the same as $p(v'_{j})$:
        $p[e(v'_{i},v'_{j})]--$
    else if $p(v'_{i})$ is not the same as $p(v'_{j})$:
        $p[e(v'_{i},v'_{j})]++$

First, an appropriate polarity threshold $th$ is determined, so that each PoV can be confirmed in advance. Then, based on the assumption that nodes with the same polarity repel each other and nodes with different polarities attract each other, this interaction is fused into the link polarity. Finally, the community classification is handled by Algorithm 1a.
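The node labeling and link adjustment steps of Algorithm 4 can be sketched as below. This is only an illustration that follows the pseudocode literally (decrement for equal labels, increment for different labels); the function name `cdnp_adjust_links`, the dictionary inputs, and the unit step size are assumptions, and the node and link polarities are taken as already accumulated.

```python
def cdnp_adjust_links(node_polarity, link_polarity, thresh):
    """Label nodes by threshold, then adjust each link polarity by the label pair.

    node_polarity -- dict: character -> accumulated p(v')       (assumed input)
    link_polarity -- dict: (v_i, v_j) -> accumulated p[e]        (assumed input)
    thresh        -- polarity threshold separating v+ from v-
    """
    labels = {}
    for v, p in node_polarity.items():
        if p > thresh:
            labels[v] = "+"
        elif p < thresh:
            labels[v] = "-"
        # p == thresh is left unlabeled, as in the pseudocode

    adjusted = {}
    for (u, v), p in link_polarity.items():
        if u in labels and v in labels:
            if labels[u] == labels[v]:
                p -= 1  # same polarity: p[e]-- in Algorithm 4
            else:
                p += 1  # different polarity: p[e]++ in Algorithm 4
        adjusted[(u, v)] = p
    return labels, adjusted


if __name__ == "__main__":
    labels, links = cdnp_adjust_links(
        {"A": 2.0, "B": 3.0, "C": -1.0},
        {("A", "B"): 5.0, ("A", "C"): 1.0},
        thresh=0.0,
    )
    print(labels, links)
```

The adjusted link values would then be passed on as weights to the community discovery step.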

Algorithm 1a GN Community Discovery

Input:

$G(V,E),E_{W} =\{w(e_{i})\vert e_{i} \in E\}$

Output:

$G_{1},G_{2},\ldots,G_{C}$ , $C\ge 2$

def GN($G(V,E)$ , $E_{W}$ ):
    repeat:
        for $e_{i}$ in $E$:
            $B(e) =$ edge betweenness function of $e_{i}$
            $C_{B}(e)=B(e)/w[e(v_{i},v_{j})]$
            if $C_{B}(e)$ is max $\{C_{B}(e_{1}),C_{B}(e_{2}),\ldots,C_{B}(e_{M})\}$:
                remove($e_{i}$)
                $E=\{e_{1},e_{2},\ldots,e_{i-1},e_{i+1},\ldots,e_{M}\}$
        mod_all.append(modularity of $G(V,E)$)
        write_treestructure($G(V,E)$)
        if $E=\emptyset$:
            break
    for modularity in mod_all:
        if $C=2$:
            modular_polar.append(modularity)
    when mod_all.index(max(modular_polar)):
        $G_{1}$ , $G_{2} =$ read_tree($G(V,E)$)
    return ($G_{1}$ , $G_{2}$ , max(modular_polar))
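A simplified, dependency-free Python sketch of the edge-removal loop in Algorithm 1a is given below. It is not the authors' code: it computes unweighted shortest-path edge betweenness with Brandes' accumulation, divides by the link weight as in $C_{B}(e)=B(e)/w$, and stops as soon as the graph splits into two components instead of recording the full tree and the modularity values. The function names and the `{(u, v): weight}` input format are assumptions.

```python
from collections import deque


def edge_betweenness(adj):
    """Unweighted shortest-path edge betweenness (each pair counted in both directions)."""
    bet = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate dependencies back towards s
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                bet[frozenset((v, w))] += c
                delta[v] += c
    return bet


def components(adj):
    """Connected components of an undirected adjacency dict."""
    seen, comps = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = set(), [s]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


def gn_two_communities(weights):
    """Remove the edge with the largest B(e)/w(e) until two communities appear."""
    adj, w = {}, {}
    for (u, v), val in weights.items():
        if val <= 0 or u == v:
            continue  # only positive weights are accepted; self-loops are ignored
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
        w[frozenset((u, v))] = val
    while w and len(components(adj)) < 2:
        bet = edge_betweenness(adj)
        worst = max(w, key=lambda e: bet[e] / w[e])
        u, v = tuple(worst)
        adj[u].discard(v)
        adj[v].discard(u)
        del w[worst]
    return components(adj)


if __name__ == "__main__":
    # Two triangles joined by one weak bridge (c, d).
    edges = {("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 3, ("c", "d"): 1,
             ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3}
    print(gn_two_communities(edges))
```

Stopping at the first split is a shortcut chosen for brevity; the pseudocode above instead keeps removing edges, records the modularity at every step, and reads the two-community partition back from the stored tree.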

Algorithm 1b CDC

Input:

$T_{V}$

Output:

$G(V,E),E_{W} =\{w(e_{i})\vert e_{i} \in E\}$

for $t_{V}$ in $T_{V}$:
    if the length of $t_{V} < 2$:
        $H(X)\propto 0$
    else if the length of $t_{V} \ge 2$:
        for $(v_{i},v_{j})$ in $t_{V}$:
            $w[e(v_{i},v_{j})]+=1$
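The co-occurrence counting of Algorithm 1b can be sketched in a few lines of Python. The function name `cdc_cooccurrence_weights` and the list-of-lists window format are assumptions, as is the choice to count each distinct character pair at most once per window.

```python
from collections import Counter
from itertools import combinations


def cdc_cooccurrence_weights(char_windows):
    """Count, for every character pair, the number of windows in which both appear."""
    weights = Counter()
    for t_v in char_windows:
        chars = sorted(set(t_v))  # assumption: repeated mentions count once per window
        if len(chars) < 2:
            continue              # a window with fewer than two characters adds no link
        for pair in combinations(chars, 2):
            weights[pair] += 1
    return dict(weights)


if __name__ == "__main__":
    print(cdc_cooccurrence_weights([["A", "B"], ["A"], ["A", "B", "C"]]))
```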

Algorithm 2 CDVP

Input:

$T_{V}$ , $T',V'=\{v'_{i} \},v'_{i}$ is not identified

Output:

$V'=\{v'_{i} \}$ , $v'_{i}$ is in $\{v^{+},v^{-}\}$

for $t_{V}$ in $T_{V}$ and $t'$ in $T'$:
    for $c'$ in $t'$:
        $p(c') =$ polarity judgment value of $c'$
        for $v'$ in $t_{V}$:
            $p(v') += p(c')$
for $v'$ in $V'$:
    if $p(v') > thresh$:
        $v'$ is $v^{+}$
    else if $p(v') < thresh$:
        $v'$ is $v^{-}$
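A minimal Python sketch of the node polarity accumulation and thresholding in Algorithm 2 follows. It assumes signed polarity values per emotion word and aligned window lists; the experiments described later quantize node polarity with a ratio of negative to positive words instead, so the names `cdvp_node_polarity` and `word_polarity` and the default threshold are illustrative assumptions only.

```python
from collections import defaultdict


def cdvp_node_polarity(char_windows, emotion_windows, word_polarity, thresh=0.0):
    """Accumulate p(v') per character and label each node v+ or v-.

    word_polarity is a hypothetical lookup: emotion word -> signed value p(c').
    """
    polarity = defaultdict(float)
    for t_v, t_e in zip(char_windows, emotion_windows):
        window_score = sum(word_polarity.get(word, 0.0) for word in t_e)
        for v in set(t_v):
            polarity[v] += window_score  # every character in the window receives p(c')

    labels = {}
    for v, p in polarity.items():
        if p > thresh:
            labels[v] = "+"
        elif p < thresh:
            labels[v] = "-"
    return dict(polarity), labels


if __name__ == "__main__":
    scores, labels = cdvp_node_polarity(
        [["A"], ["B", "A"], ["B"]],
        [["good", "good"], ["bad"], ["bad"]],
        {"good": 1.0, "bad": -1.0},
    )
    print(scores, labels)
```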

The result shows the comprehensive polarity interaction between the nodes in $G'$ , which combines the node polarity and link polarity information. Adding node polarities to CDEP not only retains the emotional polarity information but also automatically matches the positive and negative polarities in the final community division. Adding the link polarity gives the result the graph structure of the network.

SECTION IV.

Experiments and Analysis

A. Data Acquisition and Character Dictionary

In our experiments, seven novels with characters that have obvious positive and negative polarities in the social network of the text were selected. The text and text expansion materials were used to determine the node dictionary and to label the node polarities. Depending on the length of the text and the complexity of the plot, the number of nodes is between 6 and 22.

Text-based statistical results can help retrieve most of the information about the person’s name in the text [39] and avoid omissions when constructing the entries, but a small amount of information may be duplicated or even wrong. The text expansion material helps us match the main role of the character and other roles into the same character entry [40].

In literary works, the evaluative tendency toward the main characters is obvious, and their polarity is easy to distinguish; secondary characters are mostly used to assist the protagonist and the main plot. During the extraction of $S_{V}$ from $S_{N}$ , the secondary characters corresponding to nonpolar nodes are removed and the protagonists are labeled. Finally, the character dictionary used in the experiment is obtained. Taking RenMinDeMingYi as an example, the selection of the primary and secondary characters and the polarities of the primary characters are shown in Table 1.

TABLE 1 Primary and Secondary Characters in RenMinDeMingYi

B. Selection of the Granularity of the Literary Interaction Window

The CDC, CDVP, CDEP, and CDNP algorithms all need to complete the extraction of $t_{V}$ and $t'$ from $T_{V}$ and $T'$ , and then continue to complete the division based on these windows. The selection of the window granularity has a large impact on the performance of the algorithm.

We use the text space entropy to evaluate the information content of windows obtained by dividing the text space $T$ with different $Q$ . In particular, we choose the number of $v^{+}$ and $v^{-}$ in $t_{V}$ as the variable of interest $X$ . Figure 1 shows the probability mass function distribution graph of $X$ with different $Q$ . It maps the probability mass function on the left axis, which is the overall distribution of $X$ . The right axis and the line graph represent the frequency cumulative percentage of the information entropy calculation. The sum of the corresponding frequencies of all the $X$ values is the total number of windows in the entire literary work.
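As a rough illustration of this measurement, the sketch below computes a Shannon entropy over the empirical distribution of $X$ across windows. The exact definition of the text space entropy is given earlier in the paper and is not reproduced here, so the base-2 Shannon form, the function name `text_space_entropy`, and counting distinct polar characters per window are all assumptions.

```python
import math
from collections import Counter


def text_space_entropy(char_windows, polar_characters):
    """Shannon entropy (bits) of X = number of distinct polar characters per window."""
    counts = Counter(
        sum(1 for v in set(t_v) if v in polar_characters) for t_v in char_windows
    )
    total = sum(counts.values())
    # Probability mass function of X, estimated from window frequencies.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


if __name__ == "__main__":
    windows = [[], ["A"], ["A", "B"], ["A"]]
    print(text_space_entropy(windows, {"A", "B"}))
```

Under this reading, a granularity where almost every window has $X=0$ yields an entropy close to zero, which matches the observation that short-sentence windows contribute little information.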

FIGURE 1.

The information content of text space through different filters in RenMinDeMingYi. The ordinates on the two sides show the frequency and the percentage of X with different windows.

As shown in Figure 1, most short-sentence windows contain no character-indicating morpheme; they only serve plot development and logical connection. For this study, such windows carry no usable information and are invalid because the window is too short. A small number of windows contain one character-indicating morpheme, from which the vertex polarity can be calculated but not the link polarity, so the proportion of active windows is small. In summary, the information content contributed by the short-sentence granularity is insufficient, and this granularity is not suitable for this work.

The chapter window regards each chapter as the smallest unit for analyzing the status of the characters and the literary polarities. With the chapter-granularity literary space extractor, the amount of contributed information is sufficient, but the accuracy of the polarity indication within the literary interaction window and the window error must also be considered. If the interaction window is too long, the data become noisy and unsuitable. Table 2 further illustrates the calculated indexes when different windows are selected for the 7 novels: the text length $l$ , the number of windows $n$ , the window word length $l$ , the text space entropy, and the maximum number of morphemes in $t_{V}$ and $t'$ .

TABLE 2 Index and Entropy of Long, Short Sentence and Paragraph Windows

The information content, measured by the text space entropy, and the accuracy of the polar action, measured by the number of character-indicating and emotion-indicating morphemes contained in the window, cannot be optimized simultaneously. In terms of the text space entropy and polarity accuracy, long-sentence windows and paragraph windows are both suitable. Taken together, the percentage of invalid windows is low for paragraphs, while long-sentence windows yield more valid windows, which keeps the literary interactions within a window accurate and the information content of the windows from being too sparse. This better supports network construction and the accuracy of the model. The long-sentence granularity is therefore adopted for the extractor to obtain $T_{N}$ , $T_{V}$ and $T'$ .

C. Character-Indicating Morpheme Recognition

In the data acquisition and processing phase, we constructed the node dictionary, and the extraction of nodes is based on this dictionary. The word segmentation results of the natural morphemes are compared with the node dictionary; the matching morphemes are kept and arranged in their original order to form a new sequence, the CIS. Before word segmentation, a long-sentence-granularity literary text space extractor generates a natural morpheme space from the natural morpheme sequence. We then perform word segmentation on each natural morpheme window in this space and match the result against the node dictionary. Finally, a high-confidence character-indicating text space is obtained.
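The dictionary matching step can be sketched as follows. The sketch assumes the windows have already been segmented into tokens by some word segmenter, and that the node dictionary maps every alias of a character to one canonical entry; the function name and the example names are hypothetical.

```python
def character_indicating_sequence(segmented_windows, node_dictionary):
    """Keep only tokens found in the node dictionary, in their original order.

    segmented_windows -- list of windows, each a list of tokens (assumed pre-segmented)
    node_dictionary   -- hypothetical mapping: alias -> canonical character entry
    """
    return [
        [node_dictionary[token] for token in window if token in node_dictionary]
        for window in segmented_windows
    ]


if __name__ == "__main__":
    # Hypothetical aliases; a real dictionary would be built from the novel and its expansion material.
    dictionary = {"Zhang": "Zhang", "Lao Zhang": "Zhang", "Li": "Li"}
    windows = [["Lao Zhang", "met", "Li", "yesterday"], ["it", "rained"]]
    print(character_indicating_sequence(windows, dictionary))
```

Windows that contain no dictionary entry come out empty, which corresponds to the invalid windows discussed in the granularity analysis.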

D. Emotion-Indicating Morpheme Recognition

The analysis and recognition of polar morphemes belong to the field of sentiment analysis, and databases of Chinese sentiment dictionaries are relatively scarce. To improve the accuracy of polar morpheme recognition, we synthesize several Chinese polarity sentiment dictionaries and build a special polarity sentiment thesaurus. The NTU Sentiment Dictionary [41] and the Hownet Sentiment Dictionary [42] are Chinese emotional polarity analysis keyword sets with a total of 6110 positive words and 11152 negative words.

The natural morphemes that match the sentiment thesaurus are retained and arranged in their original order to form a new sequence, the EIS. Correspondingly, the NMS can obtain the EMS.

E. Community Discovery and Community Matching

Regardless of the emotion-indicating morphemes, CDC builds a simple literary character network. Figure 2 shows the visual result of the CDC method, which serves as the baseline for performance comparison. Without emotional knowledge being extracted, the CDC result is 50.00% (avg. weight) and 44.86% (link weight); see Table 7. Li D.K. is classified as the largest positive node, shown in pink, and is surrounded by other good characters such as Sha R.J. and Wang D.L. Hou L.P. co-occurs frequently with the negative characters shown as green nodes, so he is placed in the negative community even though he is a good character, because no polarity information is used at this step. Gao Y.L. is the head of the negative community; other bad characters such as Qi T.W. and Ding Y.Z. are distributed around him.

TABLE 3 Examples of Character Morpheme Recognition in RenMinDeMingYi

TABLE 4 Keywords Set for the Chinese Emotional Polarity Analysis (High Frequency)

TABLE 5 Examples of Polar Morpheme Recognition in RenMinDeMingYi

TABLE 6 Graph Nodes and Emotional Value in RenMinDeMingYi

TABLE 7 Number of Graph Nodes and Emotional Value in the Dataset

FIGURE 2.

The roles of the co-occurrence network in RenMinDeMingYi and community division performed by CDC. The names of characters are shown as the original processing result in Chinese. In the graph, pink nodes are assigned to the positive community, while green nodes are assigned to the negative community. The larger a node is, the more frequently the character appears in the text. The thicker an edge is, the stronger the connection (whether good or bad) between the two nodes it joins.

The CDVP algorithm considers the emotion-indicating morphemes. In this paper, the ratio of the number of negative words to the number of positive words in all the matching polarity indication windows of the node is selected as the polarity quantization value of the node. Tables 6 and 7 list the node polarity quantization values and the node polarity statistics in all the novels, including the number of positive nodes, the number of negative nodes, the maximum and minimum node polarity quantization values, and the threshold values.

The CDEP algorithm matches the links in the character-indicating window with the polarity words in the corresponding emotion-indicating window. Figure 3 shows a typical timing diagram of the link polarity for RenMinDeMingYi. The cumulative result is used as the weight of the link in the polar literary character network. In Algorithm 1a, the weight of a network link only accepts positive values. Because a negative link indicates a repellent relationship between nodes, weakening that relationship to a link disconnection has no effect on the community discovery results.

FIGURE 3.

The interaction data between characters along the timeline in RenMinDeMingYi. The red lines show the positive interactions between two nodes, indicating how friendly they are to each other, while the green lines represent negative and hostile interactions. The gray lines are the superposition of all their interactions. Over the entire timeline, the accumulated value is calculated as the PoE in the graph.

Finally, the scattered nodes left isolated by these disconnections during community discovery are classified into the negative community. The average weight method does not consider the importance of a character in the text; it treats all nodes equally when evaluating the overall effect of the model.

F. Evaluation of the Methods

The node weight method takes the frequency of a character’s appearance in the entire literary space as a weight to evaluate the importance of the node and uses the weight to assign different discrimination scores to the nodes.
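The difference between an unweighted and a frequency-weighted score can be illustrated with the short sketch below. The exact scoring formula is not spelled out in the text, so the normalized weighted sum used here, the function name `weighted_accuracy`, and the input dictionaries are assumptions for illustration.

```python
def weighted_accuracy(predicted, truth, weights=None):
    """Share of correctly labeled nodes, optionally weighted by node importance.

    predicted, truth -- dict: character -> "+" or "-"
    weights          -- dict: character -> weight (e.g. appearance frequency);
                        None gives every node the same weight (assumed average method)
    """
    nodes = list(truth)
    if weights is None:
        weights = {v: 1.0 for v in nodes}
    total = sum(weights[v] for v in nodes)
    correct = sum(weights[v] for v in nodes if predicted.get(v) == truth[v])
    return correct / total


if __name__ == "__main__":
    truth = {"A": "+", "B": "-", "C": "+"}
    predicted = {"A": "+", "B": "+", "C": "+"}
    print(weighted_accuracy(predicted, truth))
    print(weighted_accuracy(predicted, truth, {"A": 10, "B": 5, "C": 5}))
```

With frequency weights, misclassifying a frequently appearing character costs more than misclassifying a minor one.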

Table 8 shows the accuracy of the four methods on the seven novels when dividing the polar communities, evaluated with both the average weight method and the link weight method. The accuracy of CDC is obtained from the common community detection method with general co-occurrence calculation, and it is used as the benchmark for the methods proposed in this paper. CDC can reflect the character network in the literature to some extent, but it is not suitable for polar community division. The CDVP algorithm first determines the PoV and then builds the network structure from the positive and negative communities. The CDEP algorithm divides the community with the common community discovery algorithm after weakening the negative link weights. The CDNP algorithm takes both the node polarity and the link polarity into account. The statistical test ( $t$ -test) results given at the bottom of Table 8 show that CDNP performs better than the basic CDC under both the average weight and the link weight evaluations. These statistical tests support extending the experimental results beyond the specific experiments reported here.

TABLE 8 Precision Results of Divided Communities in Chinese Novels From Extracted Polar Graph Data

The performance of the CDEP algorithm is slightly lower than that of CDNP for the following reasons. First, when a polarity indicator is applied to links, it points to all possible character pairs in the window. The indication relationship is therefore unclear, even though this effect is reduced as much as possible through the choice of the window granularity; links are attached to the corresponding characters without verifying that the relationship actually exists or is reliable. Second, the sentiment polarity keyword set may not be well suited to the link polarity context. Most existing sentiment polarity keywords are attached to a single subject, that is, a node. Link polarity would ideally rely on words that apply to two or more subjects, but few words express such a dual-agent or multiagent relationship. With all of these words pointing to links together, there is considerable noise in the link polarity calculation.

The CDVP calculation does not face the two problems above and achieves better test performance. In the CDNP algorithm, the influence weight of the node polarity is set to a higher value, so that the network structure information is included while the polar communities are divided more accurately, and the positive and negative nodes can be connected to form a complete network structure. For instance, in the graph of RenMinDeMingYi in Figure 4, Zhao L.C. and Gao Y.L. are regarded as bad characters, shown as square nodes, because both have polarity values smaller than the threshold. They are closely connected because their interactions consistently indicate positive emotions towards each other. Ouyang J. and Ding Y.Z. are also bad characters and are even more closely connected, while Lu Y.K. and Ji C.M. are connected good characters, shown as round nodes, because their polarity values are higher than the threshold. Because Hou L.P. and Gao Y.L. frequently occur next to each other and the EISs indicate negative polarity, a few edges are misleading and Hou L.P. ends up as a false negative node in the network. In the graph of ZhiQuWeiHuShan (see Figure 4), Zuo S.D., Ma X.S. and Xu D.M.B. are clustered in the negative community, while Shao J.B., Li Y.Q. and Yang Z.R. are connected in the positive community.

FIGURE 4.

The graph extraction and network building result of six Chinese novels, where square nodes are good characters and round nodes are bad characters. Separate nodes are set as bad ones because they have few interactions with either good or bad characters, which means they tend to be influenced by others. With CDNP, we can get the result of community division together with the graph structure of these literary works.

By analyzing the accuracy results, it can also be found that in the novels of the 1970s and 1980s, the emotional tendency was more obvious, the polarity of the natural language nodes became stronger, and the model discrimination accuracy was higher. At the same time, when constructing the novel character network, the longer the novel is, the greater the number of nodes, the more information and the higher the accuracy.

As for the efficiency of the methods, the classification of the graph is relatively simple, but a major difficulty is that the detection time grows linearly with the number of emotional words. We chose a basic, naive filter to search for emotional morphemes, which takes relatively more time but is easy to implement and treats all words equally. The search can be replaced by a deterministic finite automaton (DFA) or a Bayesian spam filter (BSF) to extract graph knowledge from very large volumes of literary material.
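One simple automaton-style replacement for the naive filter is a trie (prefix tree) scanned left to right, so that the cost per text position no longer depends on the number of emotional words. The sketch below is an illustration of that idea rather than the method used in the experiments; the function names, the `"$end"` marker key, and the leftmost-longest matching policy are assumptions.

```python
def build_trie(words):
    """Build a nested-dict trie; the multi-character key "$end" marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = word
    return root


def find_emotion_words(text, trie):
    """Scan text once, returning leftmost-longest dictionary matches in order."""
    found, i = [], 0
    while i < len(text):
        node, j, last = trie, i, None
        # Single characters can never collide with the "$end" marker key.
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if "$end" in node:
                last = (node["$end"], j)  # remember the longest match so far
        if last:
            found.append(last[0])
            i = last[1]                   # continue after the matched word
        else:
            i += 1
    return found


if __name__ == "__main__":
    trie = build_trie(["good", "goodness", "bad"])
    print(find_emotion_words("goodness is not bad", trie))
```

If the text has already been segmented into morphemes, a hash-set lookup per morpheme would serve the same purpose with even less code.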

SECTION V.

Conclusion

Chinese literary works are precious cultural treasures and important datasets for natural language processing. However, the data labeling and cleaning phase is complex and time-consuming, and there is a lack of sufficiently labeled sets to use machine learning methods for analysis.

This paper identifies the characters and emotional morphemes from natural text data and divides the literary interaction window. Based on the concept of literary sentiment polarity, the graph model of polar literary characters is built, and the polar communities are divided based on the extraction of polar nodes and polar links. The literary space entropy clearly defines the process of extracting information from the natural text in long texts, and the literary interaction window defines the range of nodes and polar morpheme indicators.

The graphs built with CDC serve as the baseline for performance comparison. Most previous work on network analysis is based on co-occurrence only, so the analysis is limited to interaction frequency and ignores how emotional interactions affect relationships. The results of those community division methods therefore describe only part of the original literature. In our methods, we attach emotional polarity information to nodes and edges, so that characters and their relationships carry rich knowledge about how they act and interact with each other. This multidimensional information contributes significantly to graph building, social network computing, and social network analysis tasks such as community division. The experimental results show that the accuracy of the CDNP algorithm is much higher than the baseline. The key work of this paper focuses on the network structure and uses a dictionary to achieve an effective division of opposing characters. Although the method based on the link polarity has some limitations, it contributes valuable information about the network structure, which can be combined with the method based on node polarity. The CDNP method combines the advantages of both: it automatically matches the positive and negative communities and includes the complete link structure of the network. It clearly reflects the connotation of literary works from the perspective of polarity.

This systematic model provides general standard operation steps for the machine to understand complex Chinese literature and extract graph data. The experiments conducted on seven benchmark Chinese novel datasets demonstrate that the method based on emotional polarity significantly improves on the baseline performance. It is a novel and beneficial effort to analyze polar characters in literature. This work serves as a model for further work on long-text literature understanding and is a meaningful reference for future research on natural language processing.

As an interdisciplinary problem, it took considerable effort to fuse and connect semantics, network theory, and emotion quantification and recognition. Extracting knowledge from literature is a complicated and difficult topic. Especially for Chinese literature, rhetorical sentences, polysemy, the lack of labeled data, and the dynamic personalities of characters make it even more laborious to convert natural language into semantic data and calculate the relationships among entities. We manage to overcome these difficulties and to simplify and model the problem as a standard operation algorithm. In practice, the method lets readers quickly find the positive and negative nodes in a literary work, with a small amount of manual assistance, to understand the characters and plots, and it can assist decision making for writing and publishing [43]. It is a bold experiment and a novel attempt for computers to understand natural-language literary works. These methods can serve as a reference when integrating short texts and analyzing polar networks of public opinion. Although extracting emotional information, recognizing character graphs, and dividing communities into specific sub-communities remain tricky problems, the methods in this paper obtain satisfying results for them.

There is a saying in Chinese, "In the book, there is a house of gold; in the book, there is a shade of jade," which conveys that tremendous value remains untapped in text databases. The identification of positive and negative characters is a hard topic because the character relationships are intricate and complex, and even human readers find them difficult to identify. This paper has completed the analysis of novels with distinctive personalities. We hope this work can help researchers analyze more subtle and delicate novels in the future. It provides a benchmark for graph knowledge extraction from literature, which aims to improve the efficiency of understanding. Although the experiments are conducted on Chinese datasets, we believe that the proposed model and methods can serve as a reference for research in other languages.

For further research, time can be considered as a variable in character graph analysis [44], which would capture the character-spaces as the narrative unfolds [45]. Well-defined temporal centrality measures could categorize the actors in different time spans. More stylistic analysis tailored to the literature of different languages will enhance the flexibility of our methods [46], [47]. The approach can also be used with online datasets to prioritize responses and better manage numerous posts [48]. For the basic supporting natural language processing steps, a customized entity recognition method will make the pipeline more efficient [49], and multi-polar emotion calculation will represent multi-dimensional knowledge for large bodies of literature. Combined with the character recognition method [50], our work can contribute to its promising performance and provide a good benchmark to assist investigations of intelligent heritage in historical document images. Along with methods that identify and analyze other types of relationships, it will assist in extracting knowledge from films, because the adaptation of a film is the transfer of a novel to its visual medium [50]–[52]. We believe that in the next few years NLP will be a promising technology for extracting more interesting results from the huge volume of literary works in human history.

This article includes datasets hosted on IEEE DataPort™, a data repository created by IEEE to facilitate research reproducibility, or on another IEEE-approved repository.

Dataset Name: Demonstrating Dataset Used for Character Graphs and Dividing Communities in Chinese Novels
