I. Introduction
Recent research on deep learning methods that learn a distributed representation vector for each word has gained popularity in Natural Language Processing (NLP). The technique used to learn this distributed representation vector is referred to as word embedding. More precisely, word embedding is a technique for learning continuous low-dimensional vector-space representations of words by leveraging contextual information from large corpora. Word2Vec [2] is a neural network model that learns to predict the center word from its surrounding (context) words in a sentence. It places words with similar meanings close to each other in the continuous vector space. Despite its simplicity, the model has produced high-quality results in NLP tasks such as language modeling, text understanding, and machine translation. Other popular representations range from the simple bag-of-words (BOW) model and its term-frequency-based variants [3] to language-model-based methods [4]-[6], topic models [7], [8], denoising autoencoders and their variants [9], [10], and distributed vector representations [11]-[13]. Another prominent line of work on word embedding includes GloVe [14] and fastText [15].

However, despite the rising popularity of word embeddings in different natural language processing tasks, they often fail to capture the emotional semantics that words convey. To overcome this challenge and incorporate affective information into the word representation, Tang et al. [1] proposed Sentiment-Specific Word Embeddings (SSWE), which encode both positive/negative sentiment and syntactic contextual information in a vector space. Their work demonstrates that incorporating sentiment labels at the word level is more effective for sentiment-related tasks than other word embeddings. Yet, it focuses only on binary labels, which weakens its ability to generalize to other tasks. Yu et al. [16] instead proposed to refine pre-trained word embeddings with a sentiment lexicon, observing improved results. Felbo et al. [17] achieved good results on affect tasks by training a two-layer bidirectional Long Short-Term Memory (bi-LSTM) model, named DeepMoji, to predict the emoji of an input document using a huge dataset of 1.2 billion tweets. However, collecting billions of tweets is expensive and time-consuming for researchers. Emo2vec [18] is a word-level representation that encodes emotional semantics into fixed-sized, real-valued vectors. Labutov and Lipson [19] proposed a method that takes an existing embedding and labeled data as input and produces an embedding in the same space, but with better predictive performance on the supervised task.
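The limitation described above can be seen with a minimal sketch of context-only embedding training. The sketch below uses the gensim library and a toy corpus (both are assumptions for illustration, not the setup of any cited work): a CBOW Word2Vec model is trained, and because sentiment-opposite words such as "good" and "bad" occur in nearly identical contexts, a purely contextual model may place them close together, which is the behavior that motivates sentiment-specific embeddings.

# Minimal illustrative sketch; assumes the gensim library and a toy corpus.
from gensim.models import Word2Vec

# A tiny tokenized corpus; real systems train on large collections such as tweets.
corpus = [
    ["the", "movie", "was", "good", "and", "the", "acting", "was", "great"],
    ["the", "movie", "was", "bad", "and", "the", "acting", "was", "terrible"],
    ["a", "good", "plot", "with", "great", "performances"],
    ["a", "bad", "plot", "with", "terrible", "performances"],
]

# CBOW (sg=0): predict the center word from its surrounding context words.
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=0, epochs=200, seed=1)

# Words that share contexts receive similar vectors; here "good" and "bad"
# appear in almost identical contexts, so their vectors may end up close,
# illustrating why context alone does not capture sentiment polarity.
print(model.wv.most_similar("good", topn=3))
print(model.wv.similarity("good", "bad"))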