Introduction
Mental health problems remain a major challenge to people’s health in modern society. A person’s mental health is closely related to their quality of life: it affects how they think, feel, and act. Depression, one of the most common mental disorders, is the leading cause of self-harm and suicide worldwide and affects millions of people [1]. According to estimates from the World Health Organization [2], more than 300 million people now suffer from depression. Moreover, from 2005 to 2015, the number of people with depression increased by at least 18%. Depression is the most preventable disorder [3]. Several studies have demonstrated that early detection and treatment of depression can reduce the damage caused by the disorder [4]–[6]. However, services for the early detection and treatment of depression and other mental health disorders are extremely limited. Furthermore, many patients are reluctant to seek help from healthcare providers. These problems prevent patients from getting timely treatment, leading to further deterioration of their condition [7] and, worst of all, suicide [8]. Therefore, it is vital to recognize people suffering from depression and provide them with psychological therapy as soon as possible [9].
An increasing number of people suffering from depression turn to online resources (Twitter, websites, Reddit, etc.) to express their psychological issues and seek help [10]–[12]. In particular, online forums where users can choose to remain pseudonymous or anonymous are more popular. Social media data has therefore become an effective resource for the early detection of depression. Meanwhile, the sheer volume of social media data makes it difficult to manually identify users with depression or at risk of suicide, which makes the development of automatic depression detection technology more critical. Early detection of depression on social media is a continuous process of data accumulation, and a certain degree of depression risk can be identified only when there is sufficient evidence. This requires collecting a large number of posts with strong temporal associations and long time spans. However, most existing approaches to depression detection use data from a limited number of health centers, which raises privacy and confidentiality issues [13]. Few methods address depression detection on large-scale social media data. Therefore, it is very important to build representations of users’ emotional states and to identify critical sentiment information from a large number of posts.
Self-expression and social support can help improve the psychological state of depressed people [14]. Moreover, the words people use on social media can reveal real and significant aspects of their social and psychological worlds [15]. Natural language is related to personality, mental state, and situational fluctuation. Therefore, identifying the linguistic style of the individuals involved is particularly important. A great deal of research has focused on depression detection and mental health problems, starting from the analysis of text extracted from social media. For example, Mowery et al. applied a large number of machine learning algorithms to classify depressive symptoms from Twitter data [16]. Choudhury et al. collected data from several hundred Twitter users who had been diagnosed with depressive disorder and used a statistical classifier to estimate the risk of depression by measuring the users’ emotions, language, and linguistic styles [17]. Moreno et al. utilized a large amount of data from Facebook, referring to depression symptoms to identify depressed users [18]. For large-scale data such as the Reddit dataset, Yates et al. presented a general neural network architecture for combining posts into a representation of a user’s features to assess depression and self-harm risk [19]. Qing Cong et al. then proposed a deep learning-based approach for handling the imbalanced RSDD dataset [20]. In the Natural Language Processing field, text-based depression detection can also be considered a sentiment analysis task. In addition to depression detection, early detection techniques can also be used in many other health-related fields. For example, they might be used to identify potential pedophiles and people with suicidal tendencies, and to monitor the evolution of psychological disorders.
In this work, we propose two general hierarchical post representation models for identifying depressed users on large-scale datasets. Each model includes two essential parts: the post-level operation and the user-level operation. We apply gating weights to construct the representation of the users’ posts. The main contributions of our work are as follows:
We introduce two hierarchical neural network models with gated units and convolutional networks for the depression detection task, named Multi-Gated LeakyReLU CNN (MGL-CNN) and Single-Gated LeakyReLU CNN (SGL-CNN). Each user’s data is divided into a certain number of posts, and our models identify the genuinely crucial sentiment features of each user’s posts while suppressing unimportant information as far as possible.
The proposed models can encode the relations between posts in the user representation. Each model consists of two parts: the first is a post-level operation, which learns the representation of each of the user’s posts, and the second is a user-level operation, which obtains the overall representation of the user’s emotional state. The traditional convolutional neural network is weak at identifying crucial depression features. To address this, we add gated units, which substantially improve performance on this task.
Empirical results on the RSDD dataset demonstrate that our models perform better than the state-of-the-art methods. To assess the generality of our models, we also use the Early Detection of Depression dataset from another online forum to estimate the risk of depression. Our methods also perform well on this dataset compared to strong existing methods, indicating that our framework is robust and general.
The rest of this paper is organized as follows. Section II gives an overview of related works. Section III introduces our depression detection models. Section IV describes in detail how we conduct the experiment and discusses the experimental results of our proposed models. Section V is the conclusion of our depression detection work.
Related Work
Many studies analyze mental health-related texts in social media to better identify and understand mental health-related issues. Some of these studies use traditional machine learning methods. For instance, Schwartz et al. built a regression model by using Facebook data to predict multiple-granularity depression in individuals [21]. Thompson et al. used the clinical notes and online social media data to build a model based on Random Forest classifier [22] with bag-of-words features, detecting the risk of suicide in military personnel and veterans [23]. Furthermore, many traditional methods were used in the shared task automatic identification of content in mental health forums by the 2016 Computational Linguistics and Clinical Psychology Workshop. For instance, Malmasi et al. used a Random Forest meta-classification approach on top of a set of base classifiers [24]. Brew used SVM with Radial Basis Function (RBF) kernel [25].
Aside from the machine learning explorations, which have achieved sound results in depression detection, many deep learning methods have had impressive successes in text classification and sentiment analysis. These methods rely only on text and do not depend on any external features. For example, Long Short-Term Memory (LSTM) [26] networks and their variants introduced memory units and a gating mechanism that decides whether to delete or add information from the memory units, so that longer dependency information can be learned. Much research on sentiment analysis is based on LSTM. For instance, Gui et al. [27] proposed a novel cooperative multi-agent model for depression detection on Twitter. The model included a text feature extraction part and an image feature extraction part. The text feature extraction part applied a gated recurrent unit and convolutional neural networks to extract the textual sentiment features. Tang et al. [28] developed two effective target-dependent models for sentiment classification on Twitter by using the bidirectional LSTM. In addition to LSTM, Convolutional Neural Networks (CNNs) are actively exploited for text classification in the medical domain and in other NLP tasks, where they are designed to learn to extract a hierarchy of crucial text elements. For instance, Kim first applied a simple CNN with one layer of convolution for sentence classification [29]. Yates et al. proposed a general neural network architecture for combining posts into a representation of a user’s features to assess depression and self-harm risk [19]. Although CNNs were originally designed for computer vision, they are very successful in NLP tasks and are easily parallelized during training because they have no time dependency.
The Gated Convolutional Neural Network (GCNN) [30] first introduced gated units into CNNs for language modeling, providing a linear path for the gradients while retaining nonlinear capabilities, which mitigates the vanishing gradient problem. This model utilizes one convolutional layer to produce gating weights and identify abstract features. Since the weights and the abstract features are convolved at the same level, the significant features identified by the gating weights are very monotonous. To better discover contextual information in text classification, Yang Liu et al. introduced a new CNN model (AGCNN) for sentence classification, which generated the gating weights with a variety of specialized convolution kernels to integrate the contextual information of a particular context window into the control weights [31]. To achieve better performance in aspect-based sentiment analysis, Wei Xue et al. proposed a model based on gated convolutional neural networks, which can selectively output the sentiment features according to the given aspect or entity [32].
The Proposed Method
We propose two novel hierarchical depression detection models, named MGL-CNN and SGL-CNN, for identifying depressed individuals in online forums. Since a user’s overall data consists of a list of posts, and each post consists of a list of words, the models consist of two parts: a post-level operation and a user-level operation. The first part produces continuous post representations from word representations. Afterward, the post representations are treated as inputs of the second part to obtain the representation of the user’s overall emotional state. These user activity representations are then used as features for depression classification. The architecture of our proposed depression detection models is shown in Fig. 1. The difference between the two models is the number of gated units in their two parts.
Architecture of our proposed depression detection models. Each target user has up to
Current natural language processing methods mostly use long short-term memory and attention mechanisms to predict the sentiment polarity of the concerned targets, which require more training time and computational cost. Our proposed models replace the recurrent connections typically used in recurrent networks with gated temporal convolutions. Meanwhile, special convolution encoders are used to convolve the inputs and obtain gating weights independently, and the hierarchical structure can reuse parameters. Therefore, the computations of our models have no time dependency and can be easily parallelized over the individual words of every post in the user’s document.
In this section, we mainly describe the details of the post-level operation for the two models. The structure of the user-level operation is the same as that of the post-level operation. The input to the models passes through multiple layers of the convolutional neural network with gated units, making full use of limited contextual information to obtain the critical features of the post’s representation.
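The two-level structure described above can be illustrated with a minimal Python sketch. The function `encode_sequence` is a hypothetical stand-in (plain averaging) for the gated convolutional encoder; it is not the proposed model and only shows how the same kind of operation is applied first to the words of each post and then to the resulting post representations.

```python
from typing import List

Vector = List[float]


def encode_sequence(vectors: List[Vector]) -> Vector:
    """Hypothetical stand-in encoder: average a sequence of vectors.

    In the proposed models this role is played by the gated CNN;
    averaging is an assumption made only to keep the sketch short.
    """
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def encode_user(posts: List[List[Vector]]) -> Vector:
    """Apply the encoder at the post level, then again at the user level."""
    post_reprs = [encode_sequence(words) for words in posts]  # one vector per post
    return encode_sequence(post_reprs)                        # one vector per user


# Toy user with two posts; each word is a 2-dimensional embedding.
user = [
    [[1.0, 0.0], [3.0, 2.0]],              # post 1: two words
    [[0.0, 4.0], [2.0, 0.0], [4.0, 2.0]],  # post 2: three words
]
user_repr = encode_user(user)
```

The user-level call receives one vector per post, so its input has the same shape as a post made of word vectors, which is why a single encoder design can serve both levels.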
Fig. 2 and Fig. 3 display the post-level operation in SGL-CNN and MGL-CNN, respectively, both of which consist of two convolutional layers and a global average pooling layer. The difference between them is the number of gating weights generated. In the demo model of the post-level operation in MGL-CNN, the first convolutional layer uses two convolution kernels of different sizes to obtain an abstract feature map (no padding on the input). The second convolutional layer, with the gated unit, then uses convolution kernels to obtain different gating weights (padded where necessary). These gating weights are multiplied element-wise with the feature map generated by the first convolutional layer to obtain a post representation.
The architecture of post-level operation in SGL-CNN, denoting a user’s post with the length
The architecture of post-level operation in MGL-CNN, denoting a user’s post with the length
We have described the process by which \begin{equation*} X_{1:n}=\left [{x_{1},x_{2}, \ldots,x_{n}}\right]\tag{1}\end{equation*}
In the first convolutional layer, we use CNN with multiple convolutional filters of different widths [33] to produce post’s representation. The convolutional filters with different widths can be regarded as extractors to obtain multi-grained local information like N-Grams. Similarly, a convolutional filter with a width of 2 essentially captures the semantics of bigrams in a user’s post. Multiple convolutional filters with different window sizes are applied to obtain multiple feature maps. Let \begin{equation*} a_{i}=f\left ({K * X_{i: i+s-1}+b}\right)\tag{2}\end{equation*}
\begin{equation*} A=\left [{a_{1}, a_{2}, {\dots }, a_{n-s+1}}\right]\tag{3}\end{equation*}
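Equations (2) and (3) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the activation f is taken to be LeakyReLU (suggested by the model names, with an assumed slope of 0.01), a single kernel per width is used, and the names `conv_feature_map` and `leaky_relu` are hypothetical.

```python
from typing import List

Vector = List[float]


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # Assumed activation f; the slope alpha is an illustrative choice.
    return x if x > 0 else alpha * x


def conv_feature_map(X: List[Vector], K: List[Vector], b: float = 0.0) -> List[float]:
    """Eq. (2)-(3): a_i = f(K * X_{i:i+s-1} + b), with no padding.

    X holds n word embeddings; K is one kernel covering s consecutive words.
    The result has n - s + 1 entries.
    """
    s, n = len(K), len(X)
    A = []
    for i in range(n - s + 1):
        window = X[i:i + s]
        z = sum(K[r][c] * window[r][c] for r in range(s) for c in range(len(K[r])))
        A.append(leaky_relu(z + b))
    return A


# Toy post with n = 5 words and 2-dimensional embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -3.0]]

# One all-ones kernel per window size s (bigram and trigram extractors).
feature_maps = {s: conv_feature_map(X, [[1.0, 1.0] for _ in range(s)]) for s in (2, 3)}
print({s: len(A) for s, A in feature_maps.items()})  # {2: 4, 3: 3}
```

Each window size yields a feature map of a different length, which is why the widths act as extractors of n-gram information at different granularities.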
The second convolutional layer consists of a convolutional layer and gated units. This layer is designed to produce different gating weights. We denote a convolution operation involving a kernel \begin{align*} g_{l}= \begin{cases} f\left(F * A_{l-\frac{h}{2}:\,l+\frac{h}{2}-1}+b\right) & (h \text{ is even}) \\[0.3pc] f\left(F * A_{l-\frac{h-1}{2}:\,l+\frac{h-1}{2}}+b\right) & (h \text{ is odd}) \end{cases}\tag{4}\end{align*}
\begin{equation*} G=\left [{g_{1}, g_{2}, \cdots, g_{n-s+1}}\right]\tag{5}\end{equation*}
Afterwards we get the output feature map \begin{equation*} O =A \otimes G\tag{6}\end{equation*}
The output
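The gating step in Eq. (4)-(6), followed by the global average pooling mentioned for the post-level operation, can be sketched as below. The sketch makes several assumptions: a single scalar feature map A, zero padding at the borders, LeakyReLU as the activation f, and hypothetical function names. It illustrates the windowing and the element-wise product and is not a reproduction of the full multi-kernel model.

```python
from typing import List


def leaky_relu(x: float, alpha: float = 0.01) -> float:
    # Assumed activation f; the slope alpha is an illustrative choice.
    return x if x > 0 else alpha * x


def gating_weights(A: List[float], F: List[float], b: float = 0.0) -> List[float]:
    """Eq. (4)-(5): one gate g_l per position l of the feature map A.

    The window starts h/2 positions before l when h is even and (h-1)/2
    positions before l when h is odd; both equal h // 2. Positions outside
    A are treated as zeros (assumed padding scheme).
    """
    h, m = len(F), len(A)
    left = h // 2
    G = []
    for l in range(m):
        window = [A[j] if 0 <= j < m else 0.0 for j in range(l - left, l - left + h)]
        G.append(leaky_relu(sum(f * a for f, a in zip(F, window)) + b))
    return G


def gated_post_representation(A: List[float], F: List[float], b: float = 0.0) -> float:
    """Eq. (6) plus pooling: O = A (x) G element-wise, then the mean of O."""
    G = gating_weights(A, F, b)
    O = [a * g for a, g in zip(A, G)]
    return sum(O) / len(O)


A = [3.0, 5.0, 7.0]                          # toy feature map from the first layer
G_odd = gating_weights(A, [0.5, 0.0, 0.5])   # h = 3
G_even = gating_weights(A, [1.0, 1.0])       # h = 2
post_repr = gated_post_representation(A, [0.5, 0.0, 0.5])
```

Because the gate at each position is computed from a padded window, G has the same length as A, so the element-wise product in Eq. (6) is well defined.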
The obtained post representations are fed to the user-level operation to calculate the user’s activity representation. We use the same method as the post-level operation. The obtained user’s features are then passed to a fully connected softmax layer whose output is the probability distribution over labels. Categorical cross-entropy is used as the model’s loss function. Let \begin{equation*} \text {loss}=-\sum _{i \in T} \sum _{j=1}^{C} p_{j}^{T}(i) \cdot \log \left ({p_{j}(i)}\right)\tag{7}\end{equation*}
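A minimal sketch of the softmax output and the categorical cross-entropy in Eq. (7) follows. It assumes the usual reading of the symbols: T is the set of training users, C the number of classes, the target distribution is one-hot, and the predicted distribution comes from a softmax over the classifier outputs. The function names and the `eps` guard are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw scores into a probability distribution over labels."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(targets: List[List[float]],
                              predictions: List[List[float]],
                              eps: float = 1e-12) -> float:
    """Eq. (7): minus the sum over users i and classes j of p^T_j(i) * log p_j(i)."""
    loss = 0.0
    for p_true, p_pred in zip(targets, predictions):
        loss -= sum(t * math.log(max(p, eps)) for t, p in zip(p_true, p_pred))
    return loss


# Two toy users, two classes (index 0: control, index 1: depressed).
logits = [[0.0, 0.0], [2.0, 0.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
predictions = [softmax(z) for z in logits]
loss = categorical_cross_entropy(targets, predictions)
```

The second toy user is labeled as depressed but scored in favor of the control class, so it contributes the larger share of the loss.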
Experiments and Results
Experiments are conducted based on the Reddit Self-reported Depression Diagnosis (RSDD) dataset and the Early Detection of Depression dataset (eRisk 2017). We evaluate the performance of our proposed models by comparing them with other strong baseline models and analyze the performance of our models. The reported results are on the test set.
A. Experimental Datasets
The large-scale Reddit Self-reported Depression Diagnosis (RSDD) dataset [19] contains over 9,000 users diagnosed with depression, matched with approximately 107,000 control users who have a healthy mental state, so the data is highly imbalanced. On average, there are about 900 posts per user in the dataset, with 148 words per post. The dataset was created from the publicly available online forum Reddit and is used to train and test models for identifying users with depression. The RSDD dataset is an order of magnitude larger and more accurate than the self-reported diagnosis datasets created in prior work. Posts containing false-positive diagnoses, such as hypotheticals and negations, are removed from the diagnosed users, and users who published fewer than 100 posts are discarded. Meanwhile, to prevent the diagnosed users from being easily identified through sensitive terms strongly associated with depression, posts containing depression terms are removed.
The Early Detection of Depression dataset (eRisk 2017) [34] can be used to develop an exploratory task on early risk detection of depression. It is a collection of posts from a set of social media users in two categories: depressed users and mentally healthy (control) users. The two categories are unbalanced (there are more control users than depressed users). For each user, the collection contains a sequence of posts in chronological order. The number of users is not very high (486 in the training set), but each user has a long history of writings (on average, hundreds of messages per user). Furthermore, the mean time span between a user’s first and last submission is quite long (more than 500 days).
B. Baseline Models
We compare our methods with the following baseline methods used on the Reddit Self-reported Depression Diagnosis dataset. The previous state-of-the-art model on the RSDD dataset is User model-CNN [19].
BoW-SVM and BoW-MNB classifiers [35]. Support Vector Machine (SVM) or Multinomial Naive Bayes (MNB) classifiers are combined with sparse bag-of-words features of the posts for the depression detection task.
Feature-rich-SVM and Feature-rich-MNB. These two methods use multiple feature sets: sparse bag-of-words features, external psycholinguistic features captured by LIWC [36], and emotion lexicon features [37].
User model-CNN [19]. The depression detection model consisted of a shared architecture based on a CNN, a merge layer, model-specific loss functions, and an output layer. It was the previous state-of-the-art model on the RSDD dataset.
Besides, we introduce several popular models in natural language processing and compare our models with them.
Long Short-Term Memory is a recurrent neural network with memory cells and three gate mechanisms, which is designed to mitigate the long-term dependency problem [26]. In our depression detection task, it takes all the words of a post as a single sequence to obtain the post’s representation and uses all the posts of a user to obtain the user’s representation for detection.
Bi-directional Long Short-Term Memory consists of two LSTMs, which can capture bidirectional semantic dependency and improve the abilities of memory [38].
The GRU-Attention model consists of word-level and sentence-level attention mechanisms and sequence encoders based on GRU, and was designed for document classification [39]. We also replace GRU with LSTM (LSTM-Attention) and Bidirectional LSTM (Bi-LSTM-Attention) to form additional baselines.
CIFG-LSTM is a variant on Long Short Term Memory, which is designed to couple the input gate and the forget gate as one uniform gate [40]. Instead of individually deciding what to forget and add, the CIFG-LSTM makes those decisions together to simplify the structure of the LSTM.
To consider the structure between words in the user’s posts, we also introduce Tree-LSTM [41], which composes word representations into sentence representations over parse tree structures rather than sequentially. For the depression detection task, we first use Stanford CoreNLP [42] to tokenize and split sentences in the RSDD dataset and generate dependency parses using the Stanford Neural Network Dependency Parser. We then use Tree-LSTM to obtain the post representations and LSTM to obtain the user’s activity representations.
We also introduce BERT [43] for the depression detection task with some modifications. We use BERT to obtain post representations that integrate the context semantics. The representations of all posts published by a user are then fed to an LSTM to obtain the user’s activity representations.
For the eRisk 2017 dataset, we choose the top methods [34] from the early detection of depression task as baselines and compare our methods against these baselines.
C. Experimental Setup
The RSDD dataset consists of training, validation, and testing sets, each of which contains approximately 3,000 diagnosed users and 35,000 control users. We used the validation set to tune the hyperparameters of our models and the baselines. The eRisk 2017 dataset consists of training and testing sets. The training set of eRisk 2017 contains 83 depressed users and 403 control users, and the test set contains 52 depressed users and 349 control users.
The statistical summary of the training datasets after tokenization is shown in Table 1.
The values of the hyperparameters of our models are shown in Table 2. The RSDD validation set is used to select the depression detection model’s hyperparameters, and the test set is used to report the results. We do not initialize the embedding layer with pre-trained embeddings such as the publicly available GloVe or Word2Vec. The input of the depression models is composed of original terms encoded as one-hot vectors. The input layer is then used to learn 50-dimensional and 100-dimensional embeddings of the terms (Embed_size). The learning rate (lr) is set to 0.001. For RSDD and eRisk 2017, we set the mini-batch size to 64 and 128, respectively. We define
For the post-level operation of the MGL-CNN, the window sizes of the first convolutional kernels (s) are set as 2, 3, 4, 5 and 6, with 30 different kernels for each window size. In the second convolutional layer, the window sizes of kernels (
D. Results
The results on RSDD for identifying depressed users with our methods and the baselines are shown in Table 3. The differences between our models and the baselines are statistically significant (McNemar’s test, p < 0.05). We compare our models against several baselines using MNB and SVM classifiers with two sets of features. Although the traditional SVM and MNB methods with rich features can achieve high precision, their recall and F1 are poor compared with the state-of-the-art User model-CNN and other popular NLP models (e.g., CNN-based and LSTM-based methods). For instance, Feature-rich-SVM and BoW-SVM achieve high precision (0.71 and 0.72, respectively) but reach a recall of only 0.31 and 0.29, respectively.
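For reference, the evaluation quantities used in this section can be sketched as follows. Precision, recall, and F1 are computed on the depressed (positive) class. The McNemar function uses the common chi-square form with continuity correction, which is an assumption, since the exact variant used in the experiments is not stated; the function names are hypothetical.

```python
import math
from typing import List, Tuple


def precision_recall_f1(y_true: List[int], y_pred: List[int],
                        positive: int = 1) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (depressed) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def mcnemar_p_value(y_true: List[int], pred_a: List[int], pred_b: List[int]) -> float:
    """McNemar's test (chi-square with continuity correction, 1 degree of freedom)."""
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, o in triples if a == t and o != t)  # only model A correct
    c = sum(1 for t, a, o in triples if a != t and o == t)  # only model B correct
    if b + c == 0:
        return 1.0  # the two models never disagree
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of a chi-square variable with one degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


# Toy labels: 1 = depressed, 0 = control.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

A classifier that predicts the positive class only when it is very sure raises precision at the cost of recall, which is the pattern reported for the SVM baselines; F1 penalizes that imbalance because it is the harmonic mean of the two.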
Our models also achieve competitive results against several popular models in natural language processing. The proposed MGL-CNN model achieved precision close to that of Bi-LSTM-Attention but performed better on recall and F1. Moreover, we can conclude from Table 3 that the seven selected sequence models achieved almost the same performance on the RSDD dataset (aside from Bi-LSTM-Attention). The Bi-LSTM-Attention model achieved the best performance among them. Compared to the User model-CNN, its precision increased by 5.1%, but its recall decreased. The bidirectional architecture can look forward and backward to capture bidirectional semantic dependency and improve the memory capability. Therefore, Bi-LSTM performs better than unidirectional models. The experiments also indicate that the attention mechanism can help LSTM and its variants achieve good results in this task.
Compared with previous work, our proposed SGL-CNN model outperforms the state-of-the-art User model-CNN in terms of Recall and F1 on depressed users (increases of 24.4% and 3.9%, respectively). Our proposed MGL-CNN model outperforms the User model-CNN in terms of Precision, Recall, and F1 on depressed users (increases of 6.8%, 6.7%, and 5.9%, respectively). The overall result of MGL-CNN is slightly better than that of SGL-CNN. Our proposed models obtain an effective improvement over the User model-CNN and perform better than the other strong baseline models. We believe that the Multi-Gated (Single-Gated) LeakyReLU unit helps the CNN make full use of limited contextual information to obtain the critical features of the post’s representation. In our models, the first convolutional layer captures the n-gram features of the text. The gated unit with different kernels then obtains gating weights to effectively identify language associated with negative sentiment across a user’s posts and suppress the impact of unimportant information.
The results on the Early Detection of Depression dataset for our models and the current best-performing methods are shown in Table 4. The absolute values of the baselines’ metrics illustrate that the early detection of depression task is difficult. In terms of F1, performance is low: the highest F1 is 0.64. Some methods, e.g., FHDO-BCSGB, opted to optimize precision but had low recall, while other methods, e.g., UNSLA, opted to optimize recall but had low precision. This might be related to the scale and creation of the dataset. Our proposed models (SGL-CNN and MGL-CNN) achieve performance close to several state-of-the-art methods in terms of Precision, Recall, and F1 on depressed users. Moreover, unlike the baseline models, our models do not aim to improve a single indicator but perform well on all three indicators (Precision, Recall, and F1). The comparison between the results of our models and those of the latest methods suggests that our proposed general neural network architecture can also be applied to the early detection of depression in different online forums. In addition, Fig. 4 compares the training loss of the models over 40 epochs. The models have almost the same convergence speed, and the results of our models are even slightly better than those of the state-of-the-art User model-CNN, indicating that although our models are more complex, the convergence speed does not decrease.
The convergence of models. The X-axis is the number of iterations and the Y-axis is the training losses.
Conclusion
In this work, we proposed two hierarchical post representation models for identifying depressed individuals, which are more accurate and efficient than general early depression detection models. The proposed models can effectively represent a user’s overall emotional state through their posts. We applied our models to the large-scale Reddit Self-reported Depression Diagnosis dataset and found that they substantially outperformed strong existing methods in terms of Precision, Recall, and F1. However, the absolute values of the metrics illustrate that depression detection on large-scale social media datasets is still a challenging task and worthy of further exploration. To show that our models can learn representations of users’ posts from different online forums, we also applied them to the Early Detection of Depression dataset and found that they achieved performance close to strong previously proposed methods.
Our work is significant from several perspectives. First, we provide strong models for identifying depressed users on social media and a method for large-scale public mental health studies about depression, and we study the close connection between social media and mental health in more depth. Second, we demonstrate the possibility of sensitive applications that combine clinical care with users’ online activities, where doctors could be notified and help in time if a user’s activities suggest symptoms of depression. For future work, we will explore the application of MGL-CNN and SGL-CNN to general document-level sentiment analysis.
ACKNOWLEDGMENT
The authors acknowledge the contributions of all authors to this work and thank the anonymous reviewers for their valuable comments.