Introduction
Every day millions of people use social media and produce a huge amount of digital data that can be effectively exploited to extract valuable information concerning human dynamics and behaviors. Such data, commonly referred to as Big Data, contains valuable information about user activities, interests, and behaviors, which makes it intrinsically suited to a very large set of applications [1]. Big Social Data analysis is a subfield of Big Data analysis aimed at studying the interactions of users on social media for extracting useful information, such as moods or opinions on topics or events of interest [2].
This paper presents a new methodology, namely IOM-NN (Iterative Opinion Mining using Neural Networks), for estimating the polarization of public opinion on political events characterized by the competition of factions or parties. It can be considered an alternative technique to traditional opinion polls, since it is able to capture the opinion of a larger number of people more quickly and at a lower cost. In particular, IOM-NN uses an automatic incremental procedure based on feed-forward neural networks for analyzing the posts published by social media users. Starting from a limited set of classification rules, created from a small subset of hashtags that are notoriously in favor of specific factions, our methodology iteratively generates new classification rules. A classification rule makes it possible to determine whether a post is in favor of a faction based on the words/hashtags it contains. Then, such rules are used to determine the polarization of social media users, who wrote posts about the political event, towards a faction.
The proposed methodology has been applied to two case studies for analyzing the polarization of a large number of Twitter users during the 2018 Italian general election and the 2016 US presidential election. The results obtained by IOM-NN have been compared to opinion polls collected before voting and the most relevant techniques used in the literature (i.e., sentiment analysis with NLP [3], adaptive sentiment analysis [4], emoji-based polarization [5], hashtag-based polarization [6]). The results achieved by IOM-NN are very close to the real ones and more accurate than opinion polls and other relevant techniques, revealing the high accuracy and effectiveness of the proposed approach. For example, considering the 2018 Italian general election and the four parties that received the highest number of votes (M5S, PD, LEGA, FI), IOM-NN achieved a mean absolute error (MAE) of 1.13 percentage points and a log accuracy ratio (LogAcc) very close to 1. Opinion polls achieved an MAE of 3.74 percentage points and a LogAcc of 0.81. Compared with the other existing techniques, IOM-NN turned out to be the most accurate in forecasting the winning candidate. For example, considering the 2016 US presidential election, IOM-NN has been able to correctly identify the winning candidate in 8 out of 10 states, while the other techniques identified the winner in up to 6 out of 10 states.
Compared to existing techniques, our methodology includes the following innovative aspects:
This manuscript significantly extends a previous work [7] in the following main aspects:
The remainder of the paper is organized as follows. Section II discusses related work. Section III describes the proposed methodology. Section IV presents the case studies and Section V concludes the paper.
Related Work
In recent years, social media analysis has aroused great interest in various scientific fields, such as sociology, political science, linguistics, and computer science [8]. In this section, we focus on the main techniques and algorithms proposed for measuring public opinion and predicting election results through social media. As suggested in [9] and [10], the existing techniques can be divided into three main categories: volume-based, sentiment-based and network-based. For each category, the main proposed solutions and their differences with respect to our technique are discussed.
Volume-based techniques count the number of mentions (e.g., posts, likes, retweets) related to a candidate/party for predicting the election results. For example, Gaurav et al. [11] proposed a technique based on moving average aggregate probability, which infers the results of an election by counting how many times a candidate’s name is mentioned in tweets. Tumasjan et al. [12] used micro-blogging data for understanding how people express their political orientation, showing a close match between the volume of tweets mentioning a party and the election results. Burnap et al. [13] analyzed the volume of mentions for calculating an overall score for each party.
Unlike volume-based techniques, which consider the number of posts in favor of a faction, our technique takes into account the number of users supporting a faction. In this way, IOM-NN obtains more accurate results since it is not influenced by users who published a large number of posts.
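The difference between counting posts and counting users can be illustrated with a minimal Python sketch. The data below are hypothetical and are not taken from the case studies; the rule that assigns a user to the faction of most of their posts is also an assumption made only for this illustration.

```python
from collections import Counter, defaultdict

# Hypothetical classified posts as (user, faction) pairs; not data from the paper.
classified_posts = [
    ("alice", "A"), ("alice", "A"), ("alice", "A"), ("alice", "A"),
    ("bob", "B"), ("carol", "B"),
]

def volume_share(posts):
    """Share of posts per faction (volume-based view)."""
    counts = Counter(faction for _, faction in posts)
    total = sum(counts.values())
    return {f: c / total for f, c in counts.items()}

def user_share(posts):
    """Share of users per faction, each user counted once (user-based view)."""
    per_user = defaultdict(Counter)
    for user, faction in posts:
        per_user[user][faction] += 1
    # Assumption: a user supports the faction of most of their posts.
    votes = Counter(c.most_common(1)[0][0] for c in per_user.values())
    total = sum(votes.values())
    return {f: c / total for f, c in votes.items()}

by_volume = volume_share(classified_posts)
by_user = user_share(classified_posts)
```

In this toy example a single prolific user gives faction A two thirds of the posts, while two of the three users actually support faction B.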
Sentiment- or opinion-based techniques exploit natural language processing (NLP) or text mining algorithms for understanding the opinion of users towards political candidates or parties. Such techniques are more advanced than the volume-based ones, as they analyze the textual content of posts to calculate a score.
The techniques based on natural language processing consider the hierarchical structure of a text to understand its meaning and sentiment. For example, Oikonomou and Tjortjis [3] exploited TextBlob, a Python library for natural language processing, to predict the outcome of USA presidential elections in three states of interest (i.e., Florida, Ohio and North Carolina). Wong et al. [14] combined convex optimization techniques with SentiStrength, a lexicon-based sentiment analysis tool, for modeling the political behaviors of users by analyzing tweets and retweets. Alashri et al. [15] analyzed Facebook posts about the 2016 US presidential election with CoreNLP [16], one of the most popular tools for natural language processing, to calculate a score for each political candidate.
Techniques based on text mining discover the sentiment of a text by considering only the words it contains without analyzing its structure. For example, El Alaoui et al. [4] proposed an adaptive sentiment analysis approach that generates dictionaries from tweets classified as positive/negative for the different factions. Such dictionaries are then used to calculate a score for each faction. Similarly, Marozzo and Bessi [6] calculated a polarization score for each faction by considering only the hashtags in tweets labeled as positive. Chin et al. [5] exploited the emojis contained in a post to determine its sentiment (e.g., positive or negative). Other studies exploited machine learning techniques for discovering the political orientation of users, such as classification models based on the Naïve Bayes algorithm [17] or logistic regression [18].
IOM-NN is a text mining technique that uses bag-of-words and neural networks to classify posts, and consequently users who wrote such posts about a political event. Compared to existing text mining techniques, its iterative approach greatly increases the number of classified posts, while the use of neural networks makes it possible to automatically discover classification rules with a high level of accuracy. With regard to NLP techniques, it is worth noting that their usability and accuracy depend on the specific tool used and the supported languages. In fact, the most popular tools for natural language processing (e.g., CoreNLP) support sentiment analysis only for English texts. Instead, IOM-NN classifies users by analyzing the words/hashtags contained in the posts, which can be written in any language, without using dictionaries or translation systems.
Network-based techniques analyze the network structure of social media users who support or discuss certain candidates or parties, for understanding the dynamics of public opinion. Such analysis can provide useful insights for estimating the standing of political events or identifying the opinion leaders on a social media platform [19]. In fact, some studies have demonstrated a relation between the centrality of political candidates on social networks and their electoral consensus [20], [21]. However, it should be noted that such techniques require the use of specific data that represents the social network structure, which is often visualized through graphs or sociograms. In our study, we collected and analyzed tweets containing specific keywords or hashtags on the political event under analysis, without capturing the structure of the related social network. For this reason, in this paper we cannot make a comparison with network-based techniques.
Some studies highlighted the issues related to the use of social media data for predicting the outcome of political events, namely language barrier, misclassification, data imbalance and reliability [10]. During the design of IOM-NN we addressed such issues by proposing the following solutions:
Language barrier. Our technique classifies users by analyzing the words/hashtags contained in their posts, regardless of the language used to write them.
Misclassification. Starting from a limited set of classification rules, created from a small set of hashtags that are notoriously in favor of specific factions, the methodology iteratively generates new classification rules.
Data imbalance. To avoid the learning process being biased towards majority classes, a random under-sampling approach is used to balance the dataset at each training phase (see Algorithm 1).
Data reliability. The statistical significance of the collected data has been evaluated for assessing the representativeness of users, i.e., understanding whether they can be considered voters in the political event under analysis.
Proposed Methodology (IOM-NN)
As mentioned in Section I, IOM-NN is a methodology for estimating the polarization of public opinion during a political event, which is characterized by the rivalry of different factions. As shown in Figure 1, the proposed methodology consists of three main steps:
Collection of posts: posts are collected by using a set of keywords related to the selected political event (see Section III-A).
Classification of posts: the collected posts are then classified by using an incremental procedure implemented through neural networks (see Section III-B).
Polarization of users: the classified posts are analyzed for determining the polarization of users towards a faction (see Section III-C).
For each step, a formal description and practical examples are provided in the following sections. For the sake of clarity, Table 1 reports the meaning of the main symbols used to describe the different steps.
A. Collection of Posts
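A political event $\mathcal{E}$ is characterized by the rivalry of a set of factions $F$. The posts about $\mathcal{E}$ are collected by using two sets of keywords:
$K_{context}$, which contains generic keywords that can be associated with $\mathcal{E}$ without referring to any specific faction in $F$.
$K^\oplus_{F} = K^\oplus_{f1} \cup \ldots \cup K^\oplus_{fn}$, where $K^\oplus_{fi}$ contains the keywords used for supporting $f_{i} \in F$ (positive faction keywords).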
The keywords in
The collected posts are pre-processed before the analysis. In particular, they are modified and filtered as follows:
The text of posts is normalized by transforming it to lowercase and replacing accented characters with regular ones (e.g., IOVOTOSI or iovotosí $\rightarrow$ iovotosi).
Words are stemmed for allowing matches with declined forms (e.g., vote or votes or voted $\rightarrow$ vot).
Stop words are removed from text by using preset lists.
All the posts written in a language different from the one(s) spoken in the nation(s) hosting the considered political event are filtered out.
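The text-level pre-processing steps listed above can be sketched in Python as follows. This is only an illustration: the stop-word list and the naive suffix-stripping stemmer are placeholders assumed for the example, since the preset lists and the stemmer actually used are not specified here, and the language filter is omitted.

```python
import unicodedata

# Assumptions: a tiny stop-word list and a naive suffix-stripping stemmer stand in
# for the preset lists and the stemmer used in the paper, which are not given here.
STOP_WORDS = {"the", "a", "i", "for", "and", "or"}
SUFFIXES = ("ed", "es", "s", "e")

def normalize(text):
    """Lowercase the text and replace accented characters with plain ones."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(word):
    """Strip one common suffix; hashtags are left untouched."""
    if word.startswith("#"):
        return word
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Normalize, drop stop words, and stem the remaining tokens."""
    tokens = normalize(text).split()
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

For instance, `normalize` maps both IOVOTOSI and iovotosí to iovotosi, and `stem` reduces vote, votes and voted to vot, as in the examples given in the text.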
The output of this step is a collection of posts
Before the analysis, the statistical significance of the collected data has to be evaluated. We studied the age, gender and geographical distribution of social media users who generated such data. The aim is to assess the users’ representativeness by understanding whether they can be considered voters of the political event under analysis (more details in Sections IV-A.1 and IV-B).
B. Classification of Posts
Algorithm 1 shows the pseudo-code used for classifying the posts. The input is composed of: the posts
The algorithm is divided into two parts. The first part (lines 1-9) performs the preliminary iteration (iteration 0). At this iteration, IOM-NN exploits the set of positive faction keywords ($K^\oplus_{F}$).
The algorithm initializes an empty set $C^{0}$ and then processes each post $p$ as follows:
it classifies $p$ using $M^{0}$, which produces a vector $v_{b}$ (line 4);
if $p$ is in favor of a single faction $f$ (lines 5-6), the classified post (i.e., a pair $\langle p, f\rangle$) is added to $C^{0}$ (line 7).
At the end of iteration 0, the set of classified posts
The second part of the algorithm (lines 10-21) performs at most
It initializes an empty set $C^{i}$ for storing the classified posts at the $i$-th iteration (line 11).
It builds a classification model $M^{i}$ by training a neural network using the classified posts at previous iterations $C^{0} \cup \ldots \cup C^{i-1}$ (line 12). The training set is balanced by using a random under-sampling approach to avoid a learning process biased towards majority classes.
For each unclassified post at the previous iteration $N^{i-1}$ (line 13), the algorithm classifies $p$ using $M^{i}$, which produces a vector of probabilities $v_{p}$ (line 14), where $v_{p}[i]$ is the probability that $p$ supports $f_{i}$. If the maximum value of $v_{p}$ is greater than the given threshold $th$, the post is assigned to the most likely faction $f$ (lines 15-16) and added to $C^{i}$ (line 17).
The set of classified posts $C^{i}$ is added to $C$ (line 18), and the unclassified posts $N^{i}$ are obtained as the difference between $N^{i-1}$ and $C^{i}$ (line 19).
If the ratio between the size of $C^{i}$ and the size of $N^{i-1}$ is lower than $eps$ or greater than $1-eps$, then it breaks the loop (lines 20-21).
Finally, the algorithm returns the dictionary
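The iterative procedure described above can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the authors' implementation: a word-frequency scorer (`train_stand_in`) replaces the feed-forward neural network so that the example stays self-contained, posts are lists of tokens, and the names `max_iters`, `seed_classify` and `undersample` are hypothetical. The parameters `th` and `eps` play the roles of the threshold and stopping tolerance described in the text.

```python
import random
from collections import Counter, defaultdict

def seed_classify(tokens, seed_keywords):
    """Iteration 0: return a faction only if exactly one faction's seeds match."""
    hits = [f for f, kws in seed_keywords.items() if kws & set(tokens)]
    return hits[0] if len(hits) == 1 else None

def undersample(labeled, rng):
    """Randomly drop examples so every faction matches the smallest class size."""
    by_faction = defaultdict(list)
    for tokens, faction in labeled:
        by_faction[faction].append((tokens, faction))
    size = min(len(v) for v in by_faction.values())
    return [ex for v in by_faction.values() for ex in rng.sample(v, size)]

def train_stand_in(labeled):
    """Stand-in for the neural network: word counts per faction, scaled to sum to 1."""
    counts = defaultdict(Counter)
    for tokens, faction in labeled:
        counts[faction].update(tokens)

    def predict_proba(tokens):
        raw = {f: 1 + sum(c[t] for t in tokens) for f, c in counts.items()}
        total = sum(raw.values())
        return {f: s / total for f, s in raw.items()}

    return predict_proba

def iterative_classify(posts, seed_keywords, th=0.7, eps=0.01, max_iters=10, seed=0):
    rng = random.Random(seed)
    classified = {}  # post index -> faction
    for i, tokens in enumerate(posts):  # iteration 0: seed keywords only
        faction = seed_classify(tokens, seed_keywords)
        if faction is not None:
            classified[i] = faction
    unclassified = [i for i in range(len(posts)) if i not in classified]

    for _ in range(max_iters):
        if not unclassified or len(set(classified.values())) < 2:
            break
        training = undersample([(posts[i], f) for i, f in classified.items()], rng)
        predict_proba = train_stand_in(training)
        newly = {}
        for i in unclassified:
            probs = predict_proba(posts[i])
            faction, p = max(probs.items(), key=lambda kv: kv[1])
            if p > th:  # keep only confident predictions
                newly[i] = faction
        ratio = len(newly) / len(unclassified)
        classified.update(newly)
        unclassified = [i for i in unclassified if i not in newly]
        if ratio < eps or ratio > 1 - eps:  # stopping rule on newly classified share
            break
    return classified

posts = [
    ["#maga", "wall", "border"],
    ["#imwithher", "equality", "women"],
    ["wall", "border", "now"],
    ["equality", "women", "rights"],
    ["weather", "today"],
]
seeds = {"Clinton": {"#imwithher"}, "Trump": {"#maga"}}
result = iterative_classify(posts, seeds)
```

In this toy run the first two posts are labeled by the seed hashtags, the next two are labeled at the first learning iteration through the words they share with the seeded posts, and the last post remains unclassified.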
Figure 3 shows how the post classification algorithm (Algorithm 1) works starting from a set of posts
Table 2 shows an example of post classification on ten tweets about the 2016 US presidential election. The input of the algorithm is composed of a set of tweets regarding the political event and a set of faction keywords
$K^\oplus_{Clinton}$ = {#voteHillary, #imwithher, #strongertogether, #hillary2016}
$K^\oplus_{Trump}$ = {#voteTrump, #maga, #americafirst, #wakeupamerica}
At iteration 0,
since Donald Trump has been accused of sexual assault by some women, tweets with keywords #sex and #woman are classified in favor of Clinton;
similarly, since Hillary Clinton contravened federal laws by using a personal email account for government business, tweets with keywords email and #hillary are classified in favor of Trump.
At iteration 2, the algorithm learns other classification rules about immigration, a topic on which the two candidates had an opposite opinion. The iterative learning process ends when the algorithm is no longer able to generate new classification rules and therefore to classify new tweets.
C. Polarization of Users
Algorithm 2 shows the pseudo-code of the algorithm used for determining the polarization of users. The input is a collection of classified posts
As first step, the classified posts are aggregated by user to produce a dictionary (
On each pair
It filters out all the pairs that do not match the criteria defined by the filter function (line 5). For example, users who published a number of posts below a given threshold are skipped.
Using the classified posts $P_{u}$, it computes $v^{u}_{s}$, a vector containing the score of user $u$ for each faction (line 6). The score vector is calculated by using the function polarize.
It adds the pair $\langle u, v_{s}\rangle$ to $U$ (line 7).
Then, the algorithm calculates the overall faction score
The filter and polarize functions, used for analyzing the data collected for our case studies (see Section IV), have been configured as follows. Specifically, a user
Figure 4 shows how the user polarization algorithm (Algorithm 2) works on the classified posts shown in Table 2. For each user, the posts in favor of Clinton and Trump are counted. Users who fulfill the criteria of the filter function are considered and added to the set of classified users $U$.
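The user polarization step can be sketched in Python as follows. The settings `min_posts` and `min_share` are hypothetical choices for the filter and polarize functions, introduced only for this illustration; the configuration actually used in the case studies is not reproduced here.

```python
from collections import Counter, defaultdict

def polarize_users(classified_posts, factions, min_posts=2, min_share=2 / 3):
    """Aggregate classified posts by user and score factions by supporting users.

    min_posts and min_share are hypothetical settings for the filter and
    polarize functions, assumed only for this sketch.
    """
    posts_by_user = defaultdict(list)
    for user, faction in classified_posts:
        posts_by_user[user].append(faction)

    user_scores = {}
    for user, user_factions in posts_by_user.items():
        if len(user_factions) < min_posts:  # filter step
            continue
        counts = Counter(user_factions)
        # polarize step: score 1 for a faction holding a clear majority of posts
        scores = [
            1 if counts[f] / len(user_factions) > min_share else 0
            for f in factions
        ]
        if sum(scores) == 1:
            user_scores[user] = scores

    # Overall faction score: share of polarized users supporting each faction.
    totals = [sum(s[k] for s in user_scores.values()) for k in range(len(factions))]
    n = sum(totals)
    overall = {f: (totals[k] / n if n else 0.0) for k, f in enumerate(factions)}
    return user_scores, overall

# Hypothetical classified posts as (user, faction) pairs.
example_posts = [
    ("u1", "Clinton"), ("u1", "Clinton"), ("u1", "Clinton"),
    ("u2", "Trump"), ("u2", "Trump"),
    ("u3", "Trump"),
    ("u4", "Clinton"), ("u4", "Trump"),
]
user_scores, overall = polarize_users(example_posts, ["Clinton", "Trump"])
```

Here user u3 is skipped by the filter (a single post) and user u4 is not polarized (posts split evenly), so each faction ends up with one supporting user.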
Case Studies
In this section we describe and analyze two case studies: the 2018 Italian general election and 2016 US presidential election. In both case studies, for each faction
As described in Section III, IOM-NN exploits only positive faction keywords (
Sentiment analysis with NLP [3], [15]. For each post, we used CoreNLP [16] for calculating a sentiment score that ranges from 0 (very negative) to 4 (very positive). The neutral keywords ($K_{fi}^{\bigcirc\!\!\!{\circ}}$) are then used for grouping posts and calculating an overall score for each faction.
Adaptive sentiment analysis [4]. Starting from the positive and negative keywords of each faction ($K^\oplus_{fi}$ and $K^\circleddash_{fi}$), this technique generates two word-polarity dictionaries, which are built from a set of posts containing such positive and negative keywords. Also in this case, a score for each faction is returned.
Emoji-based polarization [5]. This technique groups the posts of each faction by using keywords ($K_{fi}^{\bigcirc\!\!\!{\circ}}$), then classifies their sentiment by using emojis and returns a score for each faction.
Hashtag-based polarization [6]. The posts are classified as in favor of a given faction based on the positive faction keywords ($K^\oplus_{fi}$). Then the posts are aggregated by users and the polarization of each user is computed.
To allow a direct comparison with the real percentages, the results obtained by the different techniques have been normalized with respect to the sum of the real ones.
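One plausible reading of this normalization is a rescaling of each technique's scores so that they sum to the same total as the real percentages of the considered factions. The Python sketch below follows that reading; the function name and the raw scores are hypothetical, while the real percentages are those reported for the 2018 Italian general election.

```python
def normalize_to_real(raw_scores, real_percentages):
    """Rescale raw faction scores so they sum to the total of the real percentages."""
    target = sum(real_percentages.values())
    total = sum(raw_scores.values())
    return {f: s * target / total for f, s in raw_scores.items()}

# Real percentages reported for the four considered parties.
real = {"M5S": 32.68, "PD": 18.72, "LEGA": 17.37, "FI": 14.01}
# Hypothetical raw shares produced by some technique (not results from the paper).
raw = {"M5S": 0.40, "PD": 0.25, "LEGA": 0.20, "FI": 0.15}

normalized = normalize_to_real(raw, real)
```

After rescaling, the estimated percentages and the real ones share the same total, so per-faction errors can be compared directly in percentage points.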
A. 2018 Italian General Election
Here we discuss the case study carried out to analyze the polarization of a large number of Twitter users during the 2018 Italian general election. Twitter users have been classified using the polarization rules extracted from our methodology and the results have been compared to:
Italians voted to elect 630 deputies and 315 senators of the XVIII legislature: the results decreed the center-right coalition as the most voted, with about 37% of votes, while the most voted list was the Movimento 5 Stelle, which received over 32% of votes. The electorate was composed of 50,782,650 voters for the Chamber of Deputies and 46,663,202 for the Senate, with a turnout of about 73%, the lowest in Italian republican history.
In order to assess the validity of the proposed methodology, the analysis we carried out focused on the four most successful political factions, in decreasing order of consensus: M5S (Movimento 5 Stelle), PD (Partito Democratico), LEGA, FI (Forza Italia). In the following, we show how the classification model has been trained and discuss the main achieved results.
1) Models Training and Iteration-Level Results
IOM-NN has been used to classify 60,782 tweets posted by 21,883 users from February 1, 2018 to March 3, 2018 (the day before the election). By following the approach described in [6], we can state that the collected data are statistically significant for the event under analysis, since:
All the tweets under analysis have been written in Italian, that is, they have the lang field set to it (Italian). With very few exceptions, the Italian language is used only by Italian people who reside in Italy or abroad.
92% of the users who set a location in their profile specified a region in Italy. Moreover, there is a strong correlation between the number of users that can be assigned to a region and the number of people actually living in that region according to official statistics (the Pearson correlation coefficient is equal to 0.8 at a 99% confidence level).
About 98% of the Italian social media users are adults and almost equally divided by gender (51.2% females and 48.8% males).
IOM-NN exploits the following positive faction keywords for analyzing the collected data:
$K^\oplus_{M5S}$ = {#iovotom5s, #m5salgoverno, #dimaiopresidente}
$K^\oplus_{PD}$ = {#sceglipd, #iovotopd, #pdvinci}
$K^\oplus_{LEGA}$ = {#4marzovotolega, #iovotolega, #salvinipremier}
$K^\oplus_{FI}$ = {#berlusconipresidente, #votoforzaitalia, #4marzovotoforzaitalia}
The threshold
2) Polarization of Users and Final Results
The algorithm described in Section III-C has been used for analyzing the users who have written the 23,997 classified tweets so as to determine their polarization degree towards the considered factions. The first three rows of Table 4 show a comparison between the official results, the average of the latest polls and the percentages obtained by IOM-NN. We evaluated the accuracy through different statistical indexes, comparing the obtained results with the latest opinion polls published before the elections. Considering the four most supported parties, our methodology obtained the following approval percentages: M5S 31.64%, PD 19.89%, LEGA 18.45%, and FI 12.80%. These results are extremely close to the real ones (i.e., M5S 32.68%, PD 18.72%, LEGA 17.37%, FI 14.01%), even closer than the average of polls. In addition, the obtained results are characterized by very good values of log accuracy ratio, as well as negligible mean percentage and absolute errors. In particular, our methodology achieved a mean absolute error (MAE) of 1.13 percentage points and a log accuracy ratio (LogAcc) very close to 1. On the other hand, opinion polls achieved an MAE of 3.74 percentage points and a LogAcc of 0.81, which confirms the ability of the proposed methodology to forecast election results. Figure 5 shows an infographic comparing the real percentages, opinion polls and obtained results.
Figure 5. Comparison among real percentages, opinion polls and IOM-NN results (2018 Italian general election).
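The reported MAE can be checked directly from the percentages given above. The short Python sketch below recomputes it; the helper name `mean_absolute_error` is introduced only for this illustration, and the other indexes (LogAcc, MAPE) are not recomputed because their exact definitions are not restated here.

```python
# Percentages reported in the text for the 2018 Italian general election.
real = {"M5S": 32.68, "PD": 18.72, "LEGA": 17.37, "FI": 14.01}
iom_nn = {"M5S": 31.64, "PD": 19.89, "LEGA": 18.45, "FI": 12.80}

def mean_absolute_error(predicted, actual):
    """Average of the absolute differences, in percentage points."""
    return sum(abs(predicted[k] - actual[k]) for k in actual) / len(actual)

mae = mean_absolute_error(iom_nn, real)
print(f"MAE = {mae:.3f} percentage points")
```

The absolute differences are 1.04, 1.17, 1.08 and 1.21 percentage points, whose average is 1.125, consistent with the reported value of 1.13.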
Table 4 also presents the results obtained by the other techniques in the literature (rows 4-7). These techniques have been configured with the positive faction hashtags used by IOM-NN (see
$K^\circleddash_{M5S}$ = {#nom5stelle, #rimborsopolim5s, #maim5s} and $K_{M5S}^{\bigcirc\!\!\!{\circ}}$ = {m5s, movimento5stelle, dimaio}
$K^\circleddash_{PD}$ = {#nonvotopd, #maipd, #bastapd} and $K_{PD}^{\bigcirc\!\!\!{\circ}}$ = {pd, partitodemocratico, renzi}
$K^\circleddash_{LEGA}$ = {#maiconsalvini, #iononvotolega} and $K_{LEGA}^{\bigcirc\!\!\!{\circ}}$ = {lega, salvini, leganord}
$K^\circleddash_{FI}$ = {#maipiuberlusconi, #stopberlusconi} and $K_{FI}^{\bigcirc\!\!\!{\circ}}$ = {forzaitalia, berlusconi}
Compared to such techniques, IOM-NN turned out to be the most accurate in estimating the voting percentages, outperforming the competitors in terms of achieved LogAcc, MAPE and MAE. Compared to the emoji- and hashtag-based techniques, IOM-NN is able to classify a much greater number of tweets and users, which ensures greater statistical representativeness of data and robustness of results.
It should be noted that, differently from other techniques (i.e., sentiment analysis with NLP and adaptive sentiment analysis), IOM-NN is not volume-based. This means that, since it gives the same weight to each user regardless of the number of published posts, the results obtained are not influenced by users who published a large number of posts. Since CoreNLP provides a well-trained model for sentiment analysis only for the English language, all the tweets downloaded in Italian have been translated into English before being processed.
B. 2016 US Presidential Election
After the presentation of the Italian use case, here we discuss the analysis we carried out on the 2016 US presidential election, which was characterized by the rivalry between Hillary Clinton and Donald Trump.
The analysis has been performed on data collected for ten US swing states: Colorado, Florida, Iowa, Michigan, Ohio, New Hampshire, North Carolina, Pennsylvania, Virginia, and Wisconsin. Swing states are those characterized by greater political uncertainty, in which neither major political party holds a lock on the outcome of presidential elections. These states are considered of strategic importance, as their votes have a high probability of being the deciding factor in a presidential election. For each state, data have been collected through the standard Twitter Search API, which allows for collecting tweets published in a given area or place. Overall, about 2.5 million tweets, posted by 521,291 users, have been collected from October 10, 2016 to November 7, 2016 (the day before the election). From such data we filtered out all the tweets posted by users with an undefined location or with a location that does not belong to any of the considered states. Filtered data (818,403 tweets posted by 141,959 users) are statistically significant for the event under analysis, since:
All the tweets under analysis have the lang field set to en (English).
For each state, there is a strong correlation between the number of analyzed users and the number of people actually living in that state according to official statistics (the Pearson correlation coefficient is equal to 0.95 at a 99% confidence level).
About 94% of the social media users in the USA are adults (at least 18 years old) and almost equally divided by gender (42.7% females and 57.3% males).
The following keywords have been used in our experiments:
$K^\oplus_{Clinton}$ = {#voteHillary, #imwithher, #strongertogether, #hillary2016}, $K^\circleddash_{Clinton}$ = {#neverhillary, #lockherup}, $K_{Clinton}^{\bigcirc\!\!\!{\circ}}$ = {clinton, hillary, democrats, dems}
$K^\oplus_{Trump}$ = {#voteTrump, #maga, #americafirst, #wakeupamerica}, $K^\circleddash_{Trump}$ = {#nevertrump, #dumpfortrump}, $K_{Trump}^{\bigcirc\!\!\!{\circ}}$ = {trump, donald, republicans, gop}
Table 5 shows the results obtained using IOM-NN in comparison with the real voting percentages, the main opinion polls, and the other related techniques. For each state in the table, we reported the results obtained by the two candidates, where “
Comparison among the real winning candidate and that identified by IOM-NN and opinion polls. The Democratic Donkey symbolizes the party of Hillary Clinton, while the Republican Elephant that of Donald Trump.
Also in comparison with the other techniques, IOM-NN turned out to be the most accurate in discovering the winning candidate. In fact, emoji- and hashtag-based techniques classified a much smaller number of tweets and correctly identified the winning candidate in 6 and 3 cases respectively. The results of the adaptive sentiment analysis were very poor, since it correctly identified the winning candidate in only one case, while sentiment analysis with NLP produced right predictions in 4 out of 10 cases. Compared to IOM-NN, some techniques (i.e., adaptive sentiment analysis and opinion polls) achieved slightly better results in terms of log accuracy ratio, MAPE, and MAE, because their predictions are quite balanced, assigning almost the same score to the two candidates in the different states.
Conclusion
This paper proposes a methodology, named IOM-NN, for estimating the polarization of public opinion regarding political events characterized by the competition of factions or parties. The designed methodology uses an automatic incremental procedure based on feed-forward neural networks for analyzing the posts published by social media users. IOM-NN can be considered an alternative technique to traditional opinion polls since it is able to capture the opinion of a larger number of people more quickly and at a lower cost. In addition, it can capture the public opinion for topics perceived as embarrassing or offensive, for which people are reluctant to declare their true opinion during the polls.
IOM-NN has been validated through two case studies that analyzed the polarization of a large number of Twitter users during the 2018 Italian general election and the 2016 US presidential election. The achieved results are very close to the real ones and more accurate than the average of the opinion polls, thus revealing the high accuracy and effectiveness of the proposed approach. Moreover, our approach has been compared to the most relevant techniques used in the literature (sentiment analysis with NLP, adaptive sentiment analysis, emoji- and hashtag-based polarization). Results show that IOM-NN achieved the best accuracy in estimating the polarization of social media users.
As future work, IOM-NN can be adapted to process real-time data, also coming from different sources (e.g., e-commerce sites, news sites, and blogs). Furthermore, its effectiveness can be evaluated also in other application domains, such as reputation evaluation of companies and competitive product analysis.