Introduction
Fake news detection has so far been carried out [1] as a two-step process following preliminary pre-processing. The first step has been feature engineering, which has mostly meant generating word embeddings from raw text using vectorization techniques such as Word2Vec [2], FastText, TF-IDF, and GloVe. This has been followed by text classification using models trained on labelled data. For a fixed labelled dataset, different combinations of vectorization procedures and classification models have yielded different accuracies. However, one difficulty has remained constant: labelling an almost infinite bulk of data with ‘true’ or ‘false’ tags, as required for a model to objectively predict the tag of any supplied text, is impossible. The approach so far has been discrete, binary labelling of text, and what the process has missed is an effective use of the distance metric on which the classification is based. We put forward a three-step implementation in which, after the feature engineering [3] and classification phases, the similarity between the vector of the supplied text and its nearest neighbour, already identified during the classification step, is calibrated. Experimentally, we observed that for any supplied text the nearest neighbour, i.e. the one at the least Euclidean distance, also has the maximum similarity score. The solution therefore amounts to picking the maximum similarity score over all pairs formed by the supplied text and each labelled text. Since this maximum score signifies how close the given information is to an already labelled one, the degree of that closeness can safely be read as the amount of belief or disbelief (depending upon the classification) to be placed in that text.
Thus, in addition to a discrete tag of truth or falsity, we quantify how strongly the information leans towards that tag, which largely removes the requirement of labelling every possible datum to reach a deterministic conclusion. We use the TF-IDF vectorizer to create word embeddings and Cosine Similarity to generate belief scores.
Term Frequency $-$ Inverse Document Frequency (tf-idf) Vectorizer
Term frequency (TF) is a measure of how important [4] a term is to a document. The term frequency of the $i$th term in the $j$th document is \begin{equation*}
tf_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}
\tag{1}
\end{equation*}
where $n_{i,j}$ is the number of occurrences of term $t_{i}$ in document $d_{j}$. Inverse document frequency (IDF) measures how rare the term is across the corpus $D$: \begin{equation*}
idf_{i}=\log\frac{\vert D\vert }{\vert d_{j}:t_{i}\in d_{j}\vert }
\tag{2}
\end{equation*}
The TF-IDF score of the term with respect to the document is the product of the two: \begin{equation*}
tfidf_{i,j}=tf_{i,j}\ast idf_{i}.
\tag{3}
\end{equation*}
Thus, for every document in the corpus, each term is assigned a TF-IDF score [5] [6]. A term's TF-IDF score for a document grows with its frequency in that document and shrinks with its frequency across the corpus. The score therefore captures the specific importance [7] [8] of the word for that document, and it is higher the less the word is used anywhere else. Consequently, we consider an array whose size is fixed to that of the vocabulary of the entire corpus. Each slot in the array holds the TF-IDF score of a fixed word with respect to the document at hand. The resulting array is the word embedding, or vector representation, of the document.
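The embedding step above can be sketched with scikit-learn's `TfidfVectorizer` (the paper does not name a library, so this is one possible realization; the toy corpus is illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the labelled documents.
corpus = [
    "masks reduce the spread of the virus",
    "the virus is a hoax spread by the media",
]

# Fitting over the whole corpus fixes the array size to the corpus
# vocabulary: one slot per distinct word, as described above.
vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(corpus)

# Each row is the TF-IDF word embedding of one document.
print(embeddings.shape)  # (number of documents, vocabulary size)
```

Each row of `embeddings` is the vector representation used in the classification step that follows.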
Classification
The state now presents word embeddings paired with ‘truth’ or ‘falsity’ tags. A trend must be learnt from these correspondences so as to predict the tag for a word embedding derived, using the same encoding method, from any foreign text. For this classification [9] [10] task, we used the K-Nearest Neighbor (KNN) classifier with the ‘number of neighbours’ parameter set to 15.
A. KNN Classifier
Prediction of an unknown tag for a feature set, by learning trends from pre-labelled feature sets, can be done by regression or classification methodologies. However, the discrete (here, even binary) labels in this specific case make the prediction purely a classification task: discrete tags create two or more (here, exactly two) classes of labels, and the goal narrows down to finding which class a foreign feature set belongs to. The principle behind the kNN classifier [11] is to find, for a supplied feature set, the ‘k’ least distant feature sets, which here are texts converted to word embeddings, i.e. simply vectors, and to assign the supplied set the label held by the majority of those neighbours.
The ‘k’ is a parameter [12]–[14] of the model. Its value is directly proportional to the smoothness of the decision boundaries separating the classes. Varying ‘k’, i.e. changing the number of neighbours considered while predicting the tag for a foreign feature set, makes the prediction drift away from or towards the label that should have been assigned. The ‘k’ for which this drift is minimum is taken as optimal, and the model is parameterized with that value. On the dataset used in this paper, the optimization yielded ‘k’ = 15.
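The classification step can be sketched with scikit-learn's `KNeighborsClassifier` (an assumed implementation; the random embeddings below stand in for real TF-IDF vectors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy embedding matrix (rows = labelled documents) with binary tags:
# 1 = 'true', 0 = 'false'. Real rows would come from TF-IDF vectorization.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = (X[:, 0] > 0.5).astype(int)

# k = 15 neighbours, as tuned in the paper; scikit-learn's default
# Minkowski metric with p=2 is exactly the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

# Predict the tag for a supplied feature set (here, the first row).
prediction = knn.predict(X[:1])
```

The `n_neighbors` value would in practice be chosen by the drift-minimization procedure described above, not hard-coded.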
Post Classification Cross-Check
As an intermediate step between classification and belief index generation, it becomes important to verify, not assume, that the label predicted for a random text is the same as the label of the pre-labelled text most similar to it. Since KNN in theory uses Euclidean distances [15] to find neighbours, the result must first be tallied against the tags obtained via similarity score calculations, as different similarity analysis techniques rest on different underlying algorithms [16] [17]. The extremum under each distance metric is checked for correspondence with the same labelled text. Three parallel calculations, namely Euclidean Distance, Cosine Similarity Score, and Jaccard Similarity Score, are implemented, and the top-scoring sentences/paragraphs under each are checked for equality. Cosine Similarity and Jaccard Similarity evaluate scores on completely different terms, yet both are found to depend on Euclidean Distance in a similar fashion.
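The cross-check can be sketched as follows: for one unlabelled vector, find the labelled text singled out by each metric and verify that they agree (toy vectors; Jaccard, which works on token sets rather than vectors, is omitted from this vector-level sketch):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    # Cosine of the angle between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Labelled corpus vectors and one unlabelled query vector (toy data).
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])

# Most similar labelled text under each metric:
# minimum distance vs. maximum similarity.
by_euclid = min(range(len(corpus)), key=lambda i: euclidean(query, corpus[i]))
by_cosine = max(range(len(corpus)), key=lambda i: cosine(query, corpus[i]))

# The paper's observation: both metrics pick out the same labelled text.
agree = (by_euclid == by_cosine)
```

With these toy vectors both metrics select the first corpus row, mirroring the correspondence the paper reports experimentally.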
A. Euclidean Distance
Euclidean distance can be computed between two n-dimensional vectors $x_{1}=(x_{11},x_{12},\ldots)$ and $x_{2}=(x_{21},x_{22},\ldots)$: \begin{equation*}
d=\sqrt{(x_{11}-x_{21})^{2}+(x_{12}-x_{22})^{2}+(x_{13}-x_{23})^{2}+\ldots}
\tag{4}
\end{equation*}
B. Jaccard Similarity
In contrast to the distance metrics computed between vectors in the other two methods, Jaccard similarity [21] quantifies the degree of coincidence between two sets. Given two sets $A$ and $B$, \begin{equation*}
J(A,B)=\frac{\vert A\cap B\vert }{\vert A\cup B\vert }
\tag{5}
\end{equation*}
Therefore, to find the level of coincidence between two texts, the texts are tokenized [22] and a set is created from the tokens of each. The similarity score can be expressed as a percentage, and the resultant value stands between 0 and 1, i.e. between 0% and 100%.
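Equation (5) over word-token sets can be sketched as below (a simple whitespace tokenizer is assumed here; the paper's tokenization [22] may differ):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over word-token sets, per Eq. (5)."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not (a | b):
        return 0.0
    # |A ∩ B| / |A ∪ B|: shared tokens over all distinct tokens.
    return len(a & b) / len(a | b)

# 3 shared tokens out of 5 distinct tokens → 0.6.
score = jaccard_similarity("the vaccine is safe", "the vaccine is dangerous")
```

Note that the score depends only on which tokens occur, not on their order, a property revisited in the comparisons section.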
(Figure: Variations in maximum cosine similarity score corresponding to minimum Euclidean distance for each iteration, each iteration representing a new unlabelled text paired with all texts from the corpus of labelled data.)
Both types of similarity score show an inversely proportional relation with the minimum values of Euclidean distance.
(Figure: Variations in maximum Jaccard similarity score corresponding to minimum Euclidean distance for each iteration, each iteration representing a new unlabelled text paired with all texts from the corpus of labelled data.)
However, it is noticed that the Cosine Similarity points align more closely with the line of best fit, while the Jaccard Similarity points deviate from it [23] [24]. This indicates that a better trend can be derived from Cosine Similarity values, as they are consistently correlated. In every experimental case, each distance metric is found to single out one unique text as the one most similar to the unlabelled text, and its label has been found to match the one predicted during classification.
Cosine Similarity and Belief Index Generation
Given the consistency in correlation derived above, the evident choice was a similarity approach that operates on vectors. Therefore, Cosine Similarity is used in quantifying beliefs over random unlabelled text. The classification determines the direction of belief, while the magnitude stays fixed at the resultant similarity score obtained.
A. Cosine Similarity
The degree of match between two vectors can be a function of the angle of inclination [25] of one vector upon the other: the smaller the angle, the better the match. Taking the cosine of the angle instead of the angle itself turns this into a direct proportion, since the cosine grows as the angle shrinks. Cosine similarity can be computed on any pair of n-dimensional vectors $A$ and $B$: \begin{align*}
A\cdot B &=\vert A\vert \vert B\vert \cos\theta
\tag{6}\\
\cos\theta &= \frac{A\cdot B}{\vert A\vert \vert B\vert }=\frac{\sum_{i}^{n}A_{i}B_{i}}{\sqrt{\sum_{i}^{n}A_{i}^{2}}\sqrt{\sum_{i}^{n}B_{i}^{2}}}
\tag{7}
\end{align*}
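Equation (7) can be sketched directly in code (a straightforward NumPy transcription, not the paper's implementation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) between two n-dimensional vectors, per Eq. (7)."""
    # Dot product over the product of the vector magnitudes.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing in the same direction score ≈ 1,
# orthogonal vectors score 0, regardless of magnitude.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because the score is normalized by the vector magnitudes, it stays within [−1, 1], and within [0, 1] for non-negative TF-IDF vectors, which is what makes it usable as a bounded belief magnitude.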
B. Belief Index
In generating Belief scores, the Cosine Similarity technique measures the degree of match between the foreign unlabelled text and every labelled text. The maximum among these measures, which has been seen to correspond with the minimum Euclidean distance, is taken as the absolute belief index. It calibrates the inclination of the unknown text towards an already labelled text, and this calibration quantifies its closeness to truth or falsity depending upon the tag of that pre-labelled text. However, an absolute belief score alone cannot place belief or disbelief in a random textual datum. Therefore, based on the classification, the absolute belief score is multiplied by 1 if the prediction is true and by −1 if it is false. The resultant value is the final Belief Index.
Euclidean distance could not serve as a possible belief index, primarily because it outputs a lower value for a better match and vice versa. This would have allowed higher belief scores for texts classified false, and the other way round for those predicted true. Moreover, since the length between two points cannot be confined to a fixed range, putting a normalized bound on the score would not have been possible.
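The two-part construction above can be sketched as follows (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def belief_index(similarities: np.ndarray, predicted_true: bool) -> float:
    """Signed Belief Index for one unlabelled text.

    similarities: cosine similarity of the text against every labelled
    text; its maximum is the absolute belief index. The sign comes from
    the classifier's prediction: +1 for 'true', -1 for 'false'.
    """
    absolute = float(np.max(similarities))
    return absolute if predicted_true else -absolute

# A text classified 'false' whose best match scores 0.82
# receives a Belief Index of -0.82.
index = belief_index(np.array([0.10, 0.82, 0.40]), predicted_true=False)
```

Because cosine similarity over non-negative TF-IDF vectors lies in [0, 1], the resulting index is naturally confined to [−1, 1], which the unbounded Euclidean distance could not guarantee.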
Results and Discussion
A. Dataset
Facts tokenized into sentences from the corpus of text in the COVID-19 Open Research Dataset have been labelled ‘true’, and fact-checked fake news items from the CoronaVirusFacts/DatosCoronaVirus Alliance Database have been assigned ‘false’ labels. In total, nearly 100K news articles and facts have been used for the analysis.
B. Performance
The accuracy of classification has been reported at 0.9, and 90% of the unlabelled data predicted have been found to have absolute belief indices above 50%. Table 1 showcases these experimental results.
C. Comparisons
Since scores generated using both Cosine and Jaccard Similarity remain confined within 0 to 1, beliefs could apparently be quantized relative to a maximum using either approach. Experiments, however, show that Jaccard Similarity scores are not only more deviant and inconsistently correlated, but also define lower belief indices than their Cosine Similarity counterparts even on the truest of facts. This is mainly because no actual word embeddings are created or matched in calculating a Jaccard Similarity Score, so no intrinsic feature of the two texts is compared. In the Cosine Similarity approach, by contrast, using TF-IDF vectorized [29] [30] word embeddings tallies the general importances of terms in the two texts, while using BERT [31] embeddings assesses their contextual similarity. Jaccard similarity merely registers the presence of identical word or character tokens in the two texts, which implies identical Jaccard Similarity Scores for any permutation of the order in which tokens are placed in a set. Thus, however the meaning, context, or importance changes with the order of words in a sentence, the Jaccard Similarity Score remains constant, making it prone to giving faulty results as a Belief Score.
Accuracies of predictions by the KNN Classifier over word embeddings derived using three different vectorization techniques have been documented in Table III.
The comparisons clearly show that, during feature engineering, encoding texts based on the relative frequencies of words in the documents and in the entire corpus helps improve classification. This implies that using TF-IDF [32] [33] to create word embeddings shortens the Euclidean distances between points of the same class, while the distance between two different classes is expanded.
Conclusion
This paper depicts a method of using Cosine Similarity to quantify the inclination of a random text towards truth or fakeness. It can be applied to news articles, tweets, and social media statements to gauge their veracity. Traditionally, Cosine Similarity has been used only in matching between texts, and its widespread use is therefore found in search engines and small-scale query searches. Unconventionally, it has here been involved in aiding classification. The equivalence between the KNN classification method and Cosine Similarity scores in classifying data hints at the possibility of including Cosine Similarity in supervised learning paradigms. Since clustering, too, takes place based on relative distances between data points, semi-supervised and unsupervised learning can also be examined with this method. Here, the Cosine Similarity Score is a final output of the model, in contrast to query searches where match and mismatch are binary outputs. That an intermediate product can likewise be extended to build a model is evident from the generation of Belief Indices using Cosine Similarity.