Introduction
Fake news detection has so far been carried out [1] as a two-step process following preliminary pre-processing. The first step has been feature engineering, which has mostly meant generating word embeddings from raw text using vectorization techniques such as Word2Vec [2], FastText, TF-IDF, and GloVe. This has been followed by text classification using models trained on labelled data. For a fixed labelled dataset, different combinations of vectorization procedures and classification models have yielded different accuracies. However, one difficulty has remained constant: labelling an almost infinite bulk of data with ‘true’ or ‘false’ tags, as required for a model to objectively predict the tag of any supplied text, is impossible. The approach so far has been discrete, binary labelling of text, and what the process has missed is an effective use of the distance metric on which the classification is based. We put forward a three-step implementation in which, after the feature engineering [3] and classification phases, the similarity between the vector of the supplied text and its nearest neighbour, already identified during the classification step, is calibrated. Experimentally, we observed that for any supplied text the nearest neighbour, i.e. the one at the least Euclidean distance, also has the maximum similarity score. The solution therefore amounts to picking the maximum similarity score over all pairs formed by the supplied text and each labelled text. Since this maximum score signifies how close the given information is to an already labelled one, the degree of that closeness can safely be read as the amount of belief or disbelief (depending upon the classification) to be placed in that text.
Thus, in addition to a discrete tag of truth or falsity, we quantify how strongly the information leans towards that tag, which largely removes the requirement of labelling every possible datum to reach a deterministic conclusion. We use the TF-IDF vectorizer to create word embeddings and Cosine Similarity to generate belief scores.
Term Frequency $-$ Inverse Document Frequency (tf-idf) Vectorizer
Term frequency (TF) is a measure of how important [4] a term is to a document. The term frequency of the $i$th term in the $j$th document is \begin{equation*}
tf_{i,j}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}
\tag{1}
\end{equation*}
where $n_{i,j}$ is the number of occurrences of term $t_{i}$ in document $d_{j}$. Inverse document frequency (IDF) measures how rare the term is across the corpus $D$: \begin{equation*}
idf_{i}=\log\frac{\vert D\vert }{\vert d_{j}:t_{i}\in d_{j}\vert }
\tag{2}
\end{equation*}
The TF-IDF score of the term with respect to the document is the product of the two: \begin{equation*}
tfidf_{i,j}=tf_{i,j}\ast idf_{i}.
\tag{3}
\end{equation*}
Thus, for every document in the corpus, each term is assigned a TF-IDF score [5] [6]. A term's TF-IDF score for a document grows with its frequency in that document and shrinks with its frequency across the corpus. The score therefore captures the specific importance [7] [8] of the word for that document, and it is higher the less the word is used anywhere else. Consequently, we consider an array whose size is fixed to that of the vocabulary of the entire corpus. Each slot in the array holds the TF-IDF score of a fixed word with respect to the document at hand. The resulting array is the word embedding, or vector representation, of the document.
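The embedding step above can be sketched with scikit-learn's `TfidfVectorizer` (the paper does not name a library, so this is one possible realization; the toy corpus is illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the labelled documents.
corpus = [
    "masks reduce the spread of the virus",
    "the virus is a hoax spread by the media",
]

# Fitting over the whole corpus fixes the array size to the corpus
# vocabulary: one slot per distinct word, as described above.
vectorizer = TfidfVectorizer()
embeddings = vectorizer.fit_transform(corpus)

# Each row is the TF-IDF word embedding of one document.
print(embeddings.shape)  # (number of documents, vocabulary size)
```

Each row of `embeddings` is the vector representation used in the classification step that follows.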
Classification
The state now presents word embeddings paired with ‘truth’ or ‘falsity’ tags. A trend must be learnt from these correspondences so as to predict the tag for a word embedding derived, using the same encoding method, from any foreign text. For this classification [9] [10] task, we used the K-Nearest Neighbor (KNN) classifier with the ‘number of neighbours’ parameter set to 15.
A. KNN Classifier
Prediction of an unknown tag for a feature set, by learning trends from pre-labelled feature sets, can be done by regression or classification methodologies. However, the discrete (here, even binary) labels in this specific case make the prediction purely a classification task: discrete tags create two or more (here, exactly two) classes of labels, and the goal narrows down to finding which class a foreign feature set belongs to. The principle behind the kNN classifier [11] is to find, for a supplied feature set, the ‘k’ least distant feature sets, which here are texts converted to word embeddings, i.e. simply vectors, and to assign the supplied set the label held by the majority of those neighbours.
The ‘k’ is a parameter [12]–[14] of the model. Its value is directly proportional to the smoothness of the decision boundaries separating the classes. Varying ‘k’, i.e. changing the number of neighbours considered while predicting the tag for a foreign feature set, makes the prediction drift away from or towards the label that should have been assigned. The ‘k’ for which this drift is minimum is taken as optimal, and the model is parameterized with that value. On the dataset used in this paper, the optimization yielded ‘k’ = 15.
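The classification step can be sketched with scikit-learn's `KNeighborsClassifier` (an assumed implementation; the random embeddings below stand in for real TF-IDF vectors):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy embedding matrix (rows = labelled documents) with binary tags:
# 1 = 'true', 0 = 'false'. Real rows would come from TF-IDF vectorization.
rng = np.random.default_rng(0)
X = rng.random((200, 50))
y = (X[:, 0] > 0.5).astype(int)

# k = 15 neighbours, as tuned in the paper; scikit-learn's default
# Minkowski metric with p=2 is exactly the Euclidean distance.
knn = KNeighborsClassifier(n_neighbors=15)
knn.fit(X, y)

# Predict the tag for a supplied feature set (here, the first row).
prediction = knn.predict(X[:1])
```

The `n_neighbors` value would in practice be chosen by the drift-minimization procedure described above, not hard-coded.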
Post Classification Cross-Check
As an intermediate step between classification and belief index generation, it becomes important to verify, not assume, that the label predicted for a random text is the same as the label of the pre-labelled text most similar to it. Since KNN in theory uses Euclidean distances [15] to find neighbours, the result must first be tallied against the tags obtained via similarity score calculations, as different similarity analysis techniques rest on different underlying algorithms [16] [17]. The extremum under each distance metric is checked for correspondence with the same labelled text. Three parallel calculations, namely Euclidean Distance, Cosine Similarity Score, and Jaccard Similarity Score, are implemented, and the top-scoring sentences/paragraphs under each are checked for equality. Cosine Similarity and Jaccard Similarity evaluate scores on completely different terms, yet both are found to depend on Euclidean Distance in a similar fashion.
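The cross-check can be sketched as follows: for one unlabelled vector, find the labelled text singled out by each metric and verify that they agree (toy vectors; Jaccard, which works on token sets rather than vectors, is omitted from this vector-level sketch):

```python
import numpy as np

def euclidean(a, b):
    # Straight-line distance between two vectors.
    return float(np.linalg.norm(a - b))

def cosine(a, b):
    # Cosine of the angle between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Labelled corpus vectors and one unlabelled query vector (toy data).
corpus = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.05])

# Most similar labelled text under each metric:
# minimum distance vs. maximum similarity.
by_euclid = min(range(len(corpus)), key=lambda i: euclidean(query, corpus[i]))
by_cosine = max(range(len(corpus)), key=lambda i: cosine(query, corpus[i]))

# The paper's observation: both metrics pick out the same labelled text.
agree = (by_euclid == by_cosine)
```

With these toy vectors both metrics select the first corpus row, mirroring the correspondence the paper reports experimentally.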
A. Euclidean Distance
Euclidean distance can be computed between two n-dimensional vectors $x_{1}=(x_{11},x_{12},\ldots)$ and $x_{2}=(x_{21},x_{22},\ldots)$: \begin{equation*}
d=\sqrt{(x_{11}-x_{21})^{2}+(x_{12}-x_{22})^{2}+(x_{13}-x_{23})^{2}+\ldots}
\tag{4}
\end{equation*}
B. Jaccard Similarity
In contrast to the distance metrics computed between vectors in the other two methods, Jaccard similarity [21] quantifies the degree of coincidence between two sets. Given two sets $A$ and $B$, \begin{equation*}
J(A,B)=\frac{\vert A\cap B\vert }{\vert A\cup B\vert }
\tag{5}
\end{equation*}
Therefore, to find the level of coincidence between two texts, the texts are tokenized [22] and a set is created from the tokens of each. The similarity score can be expressed as a percentage, and the resultant value stands between 0 and 1, i.e. between 0% and 100%.
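Equation (5) over word-token sets can be sketched as below (a simple whitespace tokenizer is assumed here; the paper's tokenization [22] may differ):

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Jaccard similarity over word-token sets, per Eq. (5)."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not (a | b):
        return 0.0
    # |A ∩ B| / |A ∪ B|: shared tokens over all distinct tokens.
    return len(a & b) / len(a | b)

# 3 shared tokens out of 5 distinct tokens → 0.6.
score = jaccard_similarity("the vaccine is safe", "the vaccine is dangerous")
```

Note that the score depends only on which tokens occur, not on their order, a property revisited in the comparisons section.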
(Figure: Variations in maximum cosine similarity score corresponding to minimum Euclidean distance for each iteration, each iteration representing a new unlabelled text paired with all texts from the corpus of labelled data.)
Both types of similarity score show an inversely proportional relation with the minimum values of Euclidean distance.
(Figure: Variations in maximum Jaccard similarity score corresponding to minimum Euclidean distance for each iteration, each iteration representing a new unlabelled text paired with all texts from the corpus of labelled data.)
However, it is noticed that the Cosine Similarity points align more closely with the line of best fit, while the Jaccard Similarity points deviate from it [23] [24]. This indicates that a better trend can be derived from Cosine Similarity values, as they are consistently correlated. In every experimental case, each distance metric is found to single out one unique text as the one most similar to the unlabelled text, and its label has been found to match the one predicted during classification.
Cosine Similarity and Belief Index Generation
Given the consistency in correlation derived above, the evident choice was a similarity approach that operates on vectors. Therefore, Cosine Similarity is used in quantifying beliefs over random unlabelled text. The classification determines the direction of belief, while the magnitude stays fixed at the resultant similarity score obtained.
A. Cosine Similarity
The degree of match between two vectors can be a function of the angle of inclination [25] of one vector upon the other: the smaller the angle, the better the match. Taking the cosine of the angle instead of the angle itself turns this into a direct proportion, since the cosine grows as the angle shrinks. Cosine similarity can be computed on any pair of n-dimensional vectors $A$ and $B$: \begin{align*}
A\cdot B &=\vert A\vert \vert B\vert \cos\theta
\tag{6}\\
\cos\theta &= \frac{A\cdot B}{\vert A\vert \vert B\vert }=\frac{\sum_{i}^{n}A_{i}B_{i}}{\sqrt{\sum_{i}^{n}A_{i}^{2}}\sqrt{\sum_{i}^{n}B_{i}^{2}}}
\tag{7}
\end{align*}
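Equation (7) can be sketched directly in code (a straightforward NumPy transcription, not the paper's implementation):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(theta) between two n-dimensional vectors, per Eq. (7)."""
    # Dot product over the product of the vector magnitudes.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing in the same direction score ≈ 1,
# orthogonal vectors score 0, regardless of magnitude.
same_direction = cosine_similarity(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
```

Because the score is normalized by the vector magnitudes, it stays within [−1, 1], and within [0, 1] for non-negative TF-IDF vectors, which is what makes it usable as a bounded belief magnitude.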
B. Belief Index
In generating Belief scores, the Cosine Similarity technique measures the degree of match between the foreign unlabelled text and every labelled text. The maximum among these measures, which has been seen to correspond with the minimum Euclidean distance, is taken as the absolute belief index. It calibrates the inclination of the unknown text towards an already labelled text, and this calibration quantifies its closeness to truth or falsity depending upon the tag of that pre-labelled text. However, an absolute belief score alone cannot place belief or disbelief in a random textual datum. Therefore, based on the classification, the absolute belief score is multiplied by 1 if the prediction is true and by −1 if it is false. The resultant value is the final Belief Index.
Euclidean distance could not serve as a possible belief index, primarily because it outputs a lower value for a better match and vice versa. This would have allowed higher belief scores for texts classified false, and the other way round for those predicted true. Moreover, since the length between two points cannot be confined to a fixed range, putting a normalized bound on the score would not have been possible.
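The two-part construction above can be sketched as follows (the function name and signature are illustrative, not from the paper):

```python
import numpy as np

def belief_index(similarities: np.ndarray, predicted_true: bool) -> float:
    """Signed Belief Index for one unlabelled text.

    similarities: cosine similarity of the text against every labelled
    text; its maximum is the absolute belief index. The sign comes from
    the classifier's prediction: +1 for 'true', -1 for 'false'.
    """
    absolute = float(np.max(similarities))
    return absolute if predicted_true else -absolute

# A text classified 'false' whose best match scores 0.82
# receives a Belief Index of -0.82.
index = belief_index(np.array([0.10, 0.82, 0.40]), predicted_true=False)
```

Because cosine similarity over non-negative TF-IDF vectors lies in [0, 1], the resulting index is naturally confined to [−1, 1], which the unbounded Euclidean distance could not guarantee.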
Results and Discussion
A. Dataset
Facts tokenized into sentences from the corpus of text in the COVID-19 Open Research Dataset have been labelled ‘true’, and fact-checked fake news items from the CoronaVirusFacts/DatosCoronaVirus Alliance Database have been assigned ‘false’ labels. In total, nearly 100K news articles and facts have been used for the analysis.
B. Performance
The accuracy of classification has been reported at 0.9, and 90% of the unlabelled data predicted have been found to have absolute belief indices above 50%. Table 1 showcases these experimental results.
C. Comparisons
Since scores generated using both Cosine and Jaccard Similarity remain confined within 0 to 1, beliefs could apparently be quantized relative to a maximum using either approach. Experiments, however, show that Jaccard Similarity scores are not only more deviant and inconsistently correlated, but also define lower belief indices than their Cosine Similarity counterparts even on the truest of facts. This is mainly because no actual word embeddings are created or matched in calculating a Jaccard Similarity Score, so no intrinsic feature of the two texts is compared. In the Cosine Similarity approach, by contrast, using TF-IDF vectorized [29] [30] word embeddings tallies the general importances of terms in the two texts, while using BERT [31] embeddings assesses their contextual similarity. Jaccard similarity merely registers the presence of identical word or character tokens in the two texts, which implies identical Jaccard Similarity Scores for any permutation of the order in which tokens are placed in a set. Thus, however the meaning, context, or importance changes with the order of words in a sentence, the Jaccard Similarity Score remains constant, making it prone to giving faulty results as a Belief Score.
Accuracies of predictions by the KNN Classifier over word embeddings derived using three different vectorization techniques have been documented in Table III.
The comparisons clearly show that, during feature engineering, encoding texts based on the relative frequencies of words in the documents and in the entire corpus helps improve classification. This implies that using TF-IDF [32] [33] to create word embeddings shortens the Euclidean distances between points of the same class, while the distance between two different classes is expanded.
Conclusion
This paper depicts a method of using Cosine Similarity to quantify the inclination of a random text towards truth or fakeness. It can be applied to news articles, tweets, and social media statements to gauge their veracity. Traditionally, Cosine Similarity has been used only in matching between texts, and its widespread use is therefore found in search engines and small-scale query searches. Unconventionally, it has here been involved in aiding classification. The equivalence between the KNN classification method and Cosine Similarity scores in classifying data hints at the possibility of including Cosine Similarity in supervised learning paradigms. Since clustering, too, takes place based on relative distances between data points, semi-supervised and unsupervised learning can also be examined with this method. Here, the Cosine Similarity Score is a final output of the model, in contrast to query searches where match and mismatch are binary outputs. That an intermediate product can likewise be extended to build a model is evident from the generation of Belief Indices using Cosine Similarity.