Multimodal Educational Data Fusion for Students’ Mental Health Detection

This paper proposes a novel framework called CASTLE for students’ mental health detection through fusing multimodal information generated from campus life.

Society Section: IEEE Systems, Man and Cybernetics Society Section

Abstract:

Mental health issues can lead to serious consequences like depression, self-mutilation, and worse, especially for university students who are not physically and mentally mature. Not all students with poor mental health are aware of their situation and actively seek help. Proactive detection of mental problems is a critical step in addressing this issue. However, accurate detection is hard to achieve due to the inherent complexity and heterogeneity of the unstructured multi-modal data generated by campus life. Against this background, we propose CASTLE (educational data fusion for mental health detection), a framework for detecting students’ mental health problems. The framework has three parts. First, we utilize representation learning to fuse data on social life, academic performance, and physical appearance. An algorithm named MOON (multi-view social network embedding) is proposed to represent students’ social life comprehensively by fusing their heterogeneous social relations. Second, the synthetic minority oversampling technique (SMOTE) is applied to address label imbalance. Finally, a deep neural network (DNN) model is used for the final detection. Extensive experimental results demonstrate the promising performance of the proposed methods in comparison with a wide range of state-of-the-art baselines.

Published in: IEEE Access ( Volume: 10)

Page(s): 70370 - 70382

Date of Publication: 30 June 2022

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2022.3187502

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Mental health is an important component of overall health and well-being. Mental health issues can lead to serious consequences, such as self-mutilation and suicide, particularly among university students who are not yet physically and mentally mature [1]–[5]. Currently, youth mental health is deteriorating. According to The State Of Mental Health In America 2021, 9.7 percent of youth in the United States have severe major depression, up from 9.2 percent in the previous year’s data [6]. The situation is even worse in developing countries [2]. Research also demonstrated that young people are more likely than any other age group to experience moderate to severe anxiety and depression during the COVID-19 pandemic [7]–[9]. However, not all students who have mental health issues are aware of their situation and actively seek help. According to research, roughly three-quarters of college students are hesitant to seek help when they have a mental health problem [10]. In this context, a proactive detection system for students with mental health problems is the key to addressing this issue.

However, detecting these students proactively is a tremendous challenge because mental health is influenced by a variety of complex factors. Previous studies have demonstrated that social life [11], [12], academic performance [13], physical appearance [11], and demographic features [4] can all have an impact on students’ mental health, and these features are recorded in unstructured multi-modal data generated by various systems. Social life records, for example, are graph or network data, whereas physical appearance is image data. It is difficult to accurately represent and fuse these features. The most commonly used methods in existing studies are manual scoring and calculating summary statistics (e.g., the mean or variance). For example, researchers use Grade-Point Average (GPA) as a proxy for students’ academic performance, or they manually assign a score to people’s physical appearance [11], [14]. These approaches do sidestep data complexity and heterogeneity. However, they are not only subject to human bias but also risk losing information when the data distribution varies greatly. For example, two students with the same GPA may take completely different courses. Representing and fusing these features effectively is therefore a crucial challenge for detecting students with mental health problems.

The rapid development of technologies such as network science and representation learning [5], [15]–[17] enables us to profile campus life accurately and effectively, which brings us an unprecedented opportunity for the detection of students’ mental health [18]–[21]. Opportunities and challenges, however, coexist. Several significant issues continue to impede researchers in related fields. First, previous studies have utilised the friendship network to represent students’ social life [22], [23]. However, in addition to friendship, social life includes a variety of scenarios such as information dissemination and seeking assistance [12]. Representing students’ social life comprehensively remains a significant challenge. Secondly, GPA does not accurately represent a student’s academic performance, as mentioned in the previous paragraph. Accurate representation of academic performance remains an open question. In addition, because the number of students with mental health issues is much smaller compared to those who are healthy mentally, the related dataset is highly imbalanced, posing another critical challenge for mental health detection.

In this paper, we aim to detect students with mental health problems by overcoming the challenges mentioned above. We collect data through a combination of questionnaires and the learning management system (LMS), and select students’ social life, physical appearance, academic performance, and demographic features to detect their mental health status (shown in Figure 1). As mentioned above, these four features have been demonstrated to be related to the mental state of students [4], [11]–[13]. To profile students’ social life comprehensively, we collect social relationships for multiple scenarios, including friendship, life advice, academic advice, support, cooperation, intelligence, and good/bad news sharing [11], [12]. The experimental process of this research is shown in Figure 2. A framework, named CASTLE (eduCational dAta fuSion for menTaL hEalth detection), is proposed to achieve an accurate and effective detection, which includes three parts. First, representation learning is used here for the effective fusion of students’ multi-modal information, including multi-view social network embedding, physical appearance representation, and academic performance representation. In this part, we propose MOON (Multi-view SOcial NetwOrk EmbeddiNg), a multi-view social network embedding algorithm, to embed students’ heterogeneous social relations effectively. Moreover, a convolutional neural network (CNN)-based auto-encoder is used to embed students’ ID photos to obtain an accurate representation of their physical appearance. In addition, we introduce the method of combining the variant of one-hot encoding and autoencoder to overcome the heterogeneity of students’ academic performance [24]. Second, a synthetic minority oversampling technique (SMOTE) algorithm is used to mitigate the effects of label imbalance. Finally, a deep neural network (DNN) model is used for the final detection.

FIGURE 1.

The illustration of our research problem.

FIGURE 2.

The experimental process of this research.

Our contributions can be summarized as follows:

We propose a detection framework CASTLE for students’ mental health detection through fusing multi-modal information generated from campus life.
We design a multi-view social network embedding algorithm MOON, eliminating the information redundancy across different views through simplifying the strategy of federated embedding.
We conduct comprehensive experiments on a real-world educational dataset, and the extensive results demonstrate the promising performance of the proposed methods in comparison with an extensive range of state-of-the-art baselines.

This paper is organized as follows. In Section II, related work is reviewed. The problem formulation is presented in Section III. In Section IV, the CASTLE detection framework is introduced in detail. In Section V, all the data used in this research and its collection process are introduced. In Section VI, we analyze the results of our experiment. We present the discussion and conclusion of our work in Section VII.

SECTION II.

Related Work

The mental health of students attracts tremendous attention [25]. Scholars explore the laws behind mental health from different dimensions through statistical analysis theory. Rossin-Slater et al. [26] carried out experiments to explore the link between depression and school shootings through youth antidepressant use. They found that exposure to fatal school shootings increases youth antidepressant use. Duckworth and Seligman [27] investigated the self-regulation of eighth-grade students through longitudinal research, thus completing the detection of student achievement. The results indicated that students with poor self-regulation also have poor intelligence. Usher and Curran [28] predicted the influences on the mental health status of Australian university students using a cross-sectional study including an online survey. The results showed that there is a significant positive correlation between mental health and gender, age, health status, physical activity levels, sporting club participation, and social-emotional well-being. Morelli et al. [12] carried out an experiment to explore the association between mental traits and social centrality, and found that different mental traits can have different effects on a student’s social centrality.

Recently, social behavior has received increasing interest. Gong et al. [1] analyzed social anxiety disorders in university students using smartphone sensor data. They explored the relationship between social interaction and location, and the changes in social anxiety disorder with changes in location. The results showed that, depending on social anxiety, different students demonstrate large differences in personal behavior. Wongkoblap et al. [29] also concentrated on social behavior. They proposed a detection model based on social networks that can identify participants with poor mental health. The experimental results showed that social network data can effectively detect mental health problems. Meanwhile, Vanlalawmpuia and Lalhmingliana [30] analyzed data from social networking sites. They utilized data mining techniques to identify depression in Facebook users. Moreover, the study identified the number of depression indicator words, which are significant to related works.

Thanks to the development of big data technology, machine learning algorithms are applied to detect mental health issues [31]–[33]. Unlike previous studies that concentrate on the relationship between a single feature and mental health, machine learning-based research attempts to combine multiple features for prediction or classification. Brathwaite et al. [34] validated an existing model for the prediction of mental health in Nigeria. They selected 11 predictors and collected data through questionnaires, such as biological sex, childhood maltreatment, school failure, social isolation, fights, running away from home, and drug use. Tate et al. [35] designed a model to predict mental health problems in mid-adolescence and extract features from parental reports and register data such as the National Patient Register and the Multi-Generation Register. Walsh et al. [36] aimed to explore the law behind the nonfatal suicide attempts of adolescents and applied the random forest to make predictions. The features they used include diagnostic, demographic, medication, and socioeconomic factors. Ge et al. [37] carried out experiments to predict student mental status during the COVID-19 epidemic. They used early psychometric test results to predict future test results through the Xgboost algorithm and achieved good performance. Rubaiyat et al. [38] analyzed the predictability of major disorders based on a series of psychological tests, such as internet addiction, depression, and low self-esteem. They used machine learning methods to create models for detection. The experimental results showed that different disorders are interconnected.

Generally, these studies are mainly based on structured data such as demographic information. However, some important pieces of information, such as social patterns and appearance, are stored in unstructured data that is difficult to quantify and represent. With the development of representation learning and deep learning, some scholars are trying to introduce the information hidden in these unstructured data into mental health prediction. Oyebode et al. [32] introduced processing techniques for unstructured data to capture more complex features. They made predictions for mental health by performing sentiment analysis on 88125 user reviews. Mathur et al. [39] designed an experiment for suicidal tweet detection using natural language processing. Gaur et al. [40] incorporated domain-specific knowledge and proposed a framework to determine the severity of suicide risk based on data from Reddit. Cai et al. [41] proposed a multi-modal fusion-based depression recognition system based on EEG (electroencephalogram) data, and the results showed that their system provides a highly flexible technique for depression identification. The data used in the relevant studies are summarized in Table 1. It can be seen that some scholars are trying to mine information from unstructured data to improve prediction performance. In short, mental health prediction based on unstructured data is becoming a hot topic in this field.

TABLE 1 Summary of the Data Used in the Current Study

TABLE 2 Summary of Main Notations

SECTION III.

Problem Formulation

In this section, we introduce notations and formally define the research problem of this work. In our research, we use a multi-view network to profile students’ heterogeneous social relations. A multi-view network consists of a set of nodes $U$ and a set of views $V$, where each view $v \in V$ contains a set of edges $E^{(v)}$ between nodes in $U$. Specifically, an edge $e^{(v)}_{ij} \in E^{(v)}$ links nodes $i,j \in U$. The multi-view network is thus denoted as $G = (U,V,\{E^{(v)}:v \in V\})$. Given the multi-view network, the social life of student $i$ is represented as a low-dimensional vector $\mathbf {f}_{i} \in \mathbb {R}^{D}$ obtained by embedding the corresponding node $i$, where $D$ is the dimension of the embedding space. In addition, for student $i$, academic performance, physical appearance, and demographic features are represented by $\mathbf {a}_{i} \in \mathbb {R}^{P}$, $\mathbf {c}_{i} \in \mathbb {R}^{K}$, and $\mathbf {o}_{i}$, respectively, where $P$ and $K$ are the dimensions of the corresponding embedding spaces. Moreover, the mental status of student $i$ is denoted as $y_{i} \in \{0,1\}$: according to the results of the mental status test, we divide students into two groups, mentally healthy students and mentally unhealthy students.

Mental Status Detection: For student $i$, given the features mentioned above, we detect their mental health status $y_{i}$.
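To make the notation concrete, the following is a minimal sketch of how the multi-view network $G = (U,V,\{E^{(v)}:v \in V\})$ could be held in memory. The class name `MultiViewNetwork`, the view names, and the student ids are hypothetical, and treating edges as undirected is an assumption made for brevity rather than something the paper specifies.

```python
from dataclasses import dataclass, field


@dataclass
class MultiViewNetwork:
    """G = (U, V, {E^(v) : v in V}) with one edge set per view."""
    nodes: set = field(default_factory=set)    # U
    edges: dict = field(default_factory=dict)  # view name -> set of (i, j)

    def add_edge(self, view, i, j):
        self.nodes.update((i, j))
        # Assumption: edges are undirected, so store each once in sorted order.
        self.edges.setdefault(view, set()).add((min(i, j), max(i, j)))

    def neighbors(self, view, i):
        """Nodes linked to i in the given view."""
        out = set()
        for a, b in self.edges.get(view, ()):
            if a == i:
                out.add(b)
            elif b == i:
                out.add(a)
        return out


# Toy example with hypothetical student ids and two views.
g = MultiViewNetwork()
g.add_edge("friendship", 1, 2)
g.add_edge("friendship", 2, 3)
g.add_edge("academic_advice", 1, 3)
friends_of_2 = g.neighbors("friendship", 2)
```

Each view keeps its own edge set over the shared node set, which is what allows the same student to receive a separate representation per view.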

SECTION IV.

Design of CASTLE Framework

The CASTLE framework, including educational data fusion, data augmentation, and the detection model, is introduced in this section. First of all, we introduce educational data fusion, consisting of three subparts (shown in Figure 3): multi-view network embedding, physical appearance representation, and academic performance representation. First, the MOON algorithm proposed in this paper is introduced for multi-view social network embedding. Second, we represent students’ physical appearance and academic performance accurately and effectively through a CNN-based auto-encoder and an auto-encoder based algorithm [24]. Moreover, in the data augmentation part, we use the SMOTE algorithm to generate the data of students with mental health problems to balance the dataset. Finally, a DNN with the dropout mechanism is utilized for the final detection.

FIGURE 3.

The illustration of educational data fusion.

A. Educational Data Fusion

1) Social Life Representation

As mentioned before, students’ heterogeneous social relations are represented by a multi-view network in this paper. In other words, eight networks are used to represent the social relationships of experimental participants in eight different scenarios: friendship, life advice, academic advice, support, cooperation, intelligence, good news sharing, and bad news sharing (details are shown in Section V-D). Inspired by [42], we propose a multi-view social network embedding algorithm, named MOON, to embed social information of students in multiple social scenarios into low-dimensional dense vectors. The details are shown as follows.

Ata et al. [42] proposed the concepts of first-order and second-order collaboration, and their experiments demonstrate that the resulting embedding outperforms other state-of-the-art algorithms. We therefore introduce these concepts into our experiments and divide node pairs into three categories:

Intra-view Pairs: Two nodes are linked in the single-view network. The representation can be generated through random walks in each single-view network to retain the diversity of different views.
Cross-view, intra-node Pairs: The same node (i.e., intra-node) across two views (i.e., cross-view) forms pairs. Through the alignment process of node representations in such a pair, we aim to capture the first-order social relations.
Cross-view, cross-node Pairs: A node in one view forms pairs with various nodes (i.e., cross-node) in another different view (i.e., cross-view) based on their associations in each view, to obtain the second-order social relationship.
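The three categories above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the walk length, the window size, and the toy adjacency are all hypothetical, and walks are uniform as in DeepWalk rather than tuned in any way the paper describes.

```python
import random


def random_walk(adj, start, length, rng):
    """Uniform random walk of at most `length` nodes over an adjacency dict."""
    walk = [start]
    while len(walk) < length and adj.get(walk[-1]):
        walk.append(rng.choice(sorted(adj[walk[-1]])))
    return walk


def intra_view_pairs(adj, walk_len=5, window=2, seed=0):
    """(root, context) pairs from one walk started at every node of a view."""
    rng = random.Random(seed)
    pairs = set()
    for start in sorted(adj):
        walk = random_walk(adj, start, walk_len, rng)
        for pos, root in enumerate(walk):
            lo, hi = max(0, pos - window), min(len(walk), pos + window + 1)
            pairs.update((root, walk[k]) for k in range(lo, hi) if k != pos)
    return pairs


def cross_view_intra_node_pairs(pairs_a, view_a, view_b):
    """The same node i seen in two views: (i in view_a, i in view_b)."""
    return {((i, view_a), (i, view_b)) for i, _ in pairs_a}


def cross_view_cross_node_pairs(pairs_a, view_a, view_b):
    """Node i in view_b paired with its context j taken from view_a."""
    return {((i, view_b), (j, view_a)) for i, j in pairs_a}


# Toy view: a path 1 - 2 - 3 (hypothetical student ids).
adj = {1: {2}, 2: {1, 3}, 3: {2}}
pairs = intra_view_pairs(adj)
same_node = cross_view_intra_node_pairs(pairs, "friendship", "support")
cross_node = cross_view_cross_node_pairs(pairs, "friendship", "support")
```

Both cross-view categories reuse the pairs found in one view, so only the intra-view step needs random walks.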

Details of these pairs are shown as follows.

First, each node has a view-specific representation to retain the diversity of each view. The embedding process follows the DeepWalk model [43], i.e., it generates topologically associated node pairs from random walks [44]. For a certain view $v \in V$, the set of intra-view pairs is denoted as $\Omega ^{(v)} \subset U \times U$. According to the mechanism of the skip-gram model, a pair $\left ({i^{(v)},j^{(v)}}\right) \in \Omega ^{(v)}$ consists of a root node $i^{(v)}$ and a context node $j^{(v)}$. The prediction task is to use the root node $i^{(v)}$ to predict the context node $j^{(v)}$, i.e., to maximize $P\left ({j^{(v)} \mid i^{(v)}}\right)$. The loss function of this step with respect to the parameters $\Theta$ is: $\begin{equation*} L_{\mathrm {Div}}(\Theta)=-\sum _{v \in V} \sum _{\left ({i^{(v)}, j^{(v)}}\right) \in \Omega ^{(v)}} \log P\left ({j^{(v)} \mid i^{(v)}; \Theta }\right) \tag{1}\end{equation*}$

$P\left ({j^{(v)} \mid i^{(v)}; \Theta }\right)$ is defined based on a softmax function: $\begin{equation*} P\left ({j^{(v)} \mid i^{(v)}; \Theta }\right)=\frac {\exp \left ({\tilde {\mathbf {f}}_{j}^{(v)} \cdot \mathbf {f}_{i}^{(v)}}\right)}{\sum _{u \in U} \exp \left ({\tilde {\mathbf {f}}_{u}^{(v)} \cdot \mathbf {f}_{i}^{(v)}}\right)} \tag{2}\end{equation*}$ where $\mathbf {f}_{i}^{(v)}, \tilde {\mathbf {f}}_{i}^{(v)} \in \mathbb {R}^{\lfloor D /|V|\rfloor }$ represent the embedding vector of the center node $i$ and its context vector, respectively, for view $v$. $\Theta$ represents the model parameters.
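Eqs. (1) and (2) can be evaluated directly on a toy example. The sketch below computes the full softmax, which is only practical for small node sets; skip-gram implementations usually approximate it (for example with hierarchical softmax or negative sampling). The function names and the dictionary layout of `emb` and `ctx` are illustrative assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_softmax_prob(f_root, ctx, j):
    """log P(j | i) from Eq. (2); ctx maps every node u to its context vector."""
    scores = {u: dot(vec, f_root) for u, vec in ctx.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return scores[j] - log_z


def diversity_loss(pairs_by_view, emb, ctx):
    """L_Div from Eq. (1): negative log-likelihood over all views and pairs."""
    loss = 0.0
    for view, pairs in pairs_by_view.items():
        for i, j in pairs:
            loss -= log_softmax_prob(emb[view][i], ctx[view], j)
    return loss


# With all-zero vectors the softmax is uniform over the three nodes,
# so each pair contributes ln(3).
emb = {"friendship": {1: [0.0, 0.0], 2: [0.0, 0.0], 3: [0.0, 0.0]}}
ctx = {"friendship": {1: [0.0, 0.0], 2: [0.0, 0.0], 3: [0.0, 0.0]}}
pairs = {"friendship": [(1, 2), (2, 3)]}
loss = diversity_loss(pairs, emb, ctx)  # 2 * ln(3)
```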

However, for cross-view joint embedding, the existing strategy is to unite all networks indiscriminately without capturing the essential characteristics of the different networks [42]. This strategy may lead to high computational overhead, especially when many networks are involved (for example, our dataset contains eight networks representing different social scenarios). To achieve a more effective embedding, we design the following embedding strategy for multi-view social networks. We assume that most social relations are built on friendship. Taking life advice as an example, students seek help only from their friends when they are in trouble. To capture this regularity, we divide all views of the social network into two categories:

Source View: Based on the assumption mentioned above, the source view is friendship.
Target View: All social relations other than friendship are target views.

Based on these definitions, we define the First-order social relation and the Second-order social relation.

a: First-Order Social Relation

Cross-view, intra-node Pairs. The intuition here is that the same node across various views represents the same entity, so its view-specific representations should collaborate with one another. In other words, for cross-view intra-node pairs, the two view-specific embeddings of the same node should become similar. For the research question in this paper, each target view is a derivative of the source view. In this case, the source-view representation should help the target-view representation, i.e., friendship should influence the other views, which is achieved by minimizing the following loss: $\begin{align*} L_{\mathrm {S} 1}(\Theta)=&-\sum _{v^{\prime } \in V^{\prime }} \sum _{\left ({i^{(v_{0})}, \cdot }\right) \in \Omega ^{(v_{0})}} \log P\left ({i^{\left ({v^{\prime }}\right)} \mid i^{(v_{0})}; \Theta }\right) \\=&-\sum _{v^{\prime } \in V^{\prime }} \sum _{\left ({i^{(v_{0})}, \cdot }\right) \in \Omega ^{(v_{0})}} \log \frac {\exp \left ({\mathbf {f}_{i}^{\left ({v^{\prime }}\right)} \cdot \mathbf {f}_{i}^{(v_{0})}}\right)}{\sum _{u \in U} \exp \left ({\mathbf {f}_{u}^{\left ({v^{\prime }}\right)} \cdot \mathbf {f}_{i}^{(v_{0})}}\right)} \tag{3}\end{align*}$ where $v_{0}$ is the source view of the social network, $v^{\prime }$ is a target view, and $V^{\prime }$ denotes the set of target views.
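A small sketch of Eq. (3) shows the alignment effect: the loss is lower when a node's target-view embedding points in the same direction as its source-view embedding. The function name, the view names, and the two-dimensional toy vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def first_order_loss(source_pairs, emb, source, targets):
    """L_S1 from Eq. (3): align each root's target-view and source-view vectors."""
    loss = 0.0
    for target in targets:
        for i, _ in source_pairs:  # only the root node of each pair is used
            f_src = emb[source][i]
            scores = {u: dot(f_u, f_src) for u, f_u in emb[target].items()}
            m = max(scores.values())
            log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
            loss -= scores[i] - log_z
    return loss


source_pairs = [(1, 2), (2, 1)]
friendship = {1: [1.0, 0.0], 2: [0.0, 1.0]}
aligned = {"friendship": friendship, "support": {1: [1.0, 0.0], 2: [0.0, 1.0]}}
swapped = {"friendship": friendship, "support": {1: [0.0, 1.0], 2: [1.0, 0.0]}}

loss_aligned = first_order_loss(source_pairs, aligned, "friendship", ["support"])
loss_swapped = first_order_loss(source_pairs, swapped, "friendship", ["support"])
```

Here `loss_aligned` is smaller than `loss_swapped`, since the swapped target view assigns each node the vector that matches the other node's friendship embedding.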

b: Second-Order Social Relation

Cross-view, cross-node Pairs. For the second-order social relation, two nodes without a link in a target view may still have a latent relation when they are linked in the source view. It is not appropriate to directly assume a link between them. Instead, we introduce an implicit mechanism to enable such cross-view relations: for a node in each target view, its context’s context in the source view may contain information that contributes to its embedding. For example, the friends of your friend include yourself.

For the multi-view social network, the context of nodes in the friendship network should be used to guide their embedding in other views. Formally, $\left ({i^{(v_{0})}, j^{(v_{0})}}\right) \in \Omega ^{(v_{0})}$ represents node $i$ and its context $j$ in view $v_{0}$. Given such a relation, we form a cross-view, cross-node pair $\left ({i^{\left ({v^{\prime }}\right)}, j^{(v_{0})}}\right)$ to close the distance between the representations of $i^{(v^{\prime })}$ and $j^{(v_{0})}$. The loss is as follows: $\begin{align*} L_{\mathrm {S} 2}(\Theta)=&-\sum _{v^{\prime } \in V^{\prime }} \sum _{\left ({i^{(v_{0})},j^{(v_{0})}}\right) \in \Omega ^{(v_{0})}} \log P\left ({i^{\left ({v^{\prime }}\right)} \mid j^{(v_{0})}; \Theta }\right) \\=&-\sum _{v^{\prime } \in V^{\prime }} \sum _{\left ({i^{(v_{0})},j^{(v_{0})}}\right) \in \Omega ^{(v_{0})}} \log \frac {\exp \left ({\tilde {\mathbf {f}}_{j}^{(v_{0})} \cdot \mathbf {f}_{i}^{\left ({v^{\prime }}\right)}}\right)}{\sum _{u \in U} \exp \left ({\tilde {\mathbf {f}}_{u}^{(v_{0})} \cdot \mathbf {f}_{i}^{\left ({v^{\prime }}\right)}}\right)} \tag{4}\end{align*}$

The complete loss function of the MOON algorithm is: $\begin{equation*} L=L_{\mathrm {Div}}+\alpha \cdot L_{\mathrm {S} 1}+\beta \cdot L_{\mathrm {S} 2} \tag{5}\end{equation*}$ where $\alpha, \beta \geq 0$ are hyperparameters that control the relative importance of the three components.
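As a rough illustration of Eqs. (4) and (5), the sketch below evaluates the second-order term with the explicit softmax of Eq. (4) and then combines the three terms. The function names are hypothetical, and the default values of `alpha` and `beta` are placeholders, not settings reported in the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def second_order_loss(source_pairs, emb, ctx_source, targets):
    """L_S2 from Eq. (4): pull i's target-view vector toward the context
    vector of j, where (i, j) is a (root, context) pair in the source view."""
    loss = 0.0
    for target in targets:
        for i, j in source_pairs:
            f_tgt = emb[target][i]
            scores = {u: dot(c_u, f_tgt) for u, c_u in ctx_source.items()}
            m = max(scores.values())
            log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
            loss -= scores[j] - log_z
    return loss


def moon_loss(l_div, l_s1, l_s2, alpha=1.0, beta=1.0):
    """Eq. (5); alpha and beta defaults are illustrative assumptions."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    return l_div + alpha * l_s1 + beta * l_s2


# All-zero vectors give a uniform softmax: 2 targets x 2 pairs x ln(3).
zeros = {1: [0.0, 0.0], 2: [0.0, 0.0], 3: [0.0, 0.0]}
emb = {"support": dict(zeros), "cooperation": dict(zeros)}
l_s2 = second_order_loss([(1, 2), (2, 3)], emb, zeros, ["support", "cooperation"])
total = moon_loss(1.0, 2.0, 3.0, alpha=0.5, beta=2.0)  # 1 + 1 + 6
```

Setting `alpha` or `beta` to zero switches off the corresponding cross-view term, which makes it easy to check how much each term contributes.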

2) Physical Appearance Representation

The photos of students are processed by a convolutional auto-encoder in this research to capture their spatial structure. In a classical auto-encoder, each hidden layer applies the following nonlinear transformation: $\begin{align*} \boldsymbol {h}_{(2)}=&f(\boldsymbol {W}_{(2)}\boldsymbol {h}_{(1)} + \boldsymbol {b}_{(2)}) \\ \boldsymbol {h}_{(3)}=&f(\boldsymbol {W}_{(3)}\boldsymbol {h}_{(2)} + \boldsymbol {b}_{(3)}) \\&\ldots \\ \boldsymbol {h}_{(i)}=&f(\boldsymbol {W}_{(i)}\boldsymbol {h}_{(i-1)} + \boldsymbol {b}_{(i)}),\quad i=4,5,\ldots, k \tag{6}\end{align*}$ where $f$ is the activation function and $\boldsymbol {W}_{(i)}$, $\boldsymbol {b}_{(i)}$ are the transformation matrix and the bias vector, respectively. Based on the auto-encoder, we add convolutional layers to extract valid image features. The extraction of convolutional features is defined as follows: $\begin{equation*} \boldsymbol {h}_{(m)} = f(\boldsymbol {\mathcal {W}}_{(m)} * \boldsymbol {h}_{(m-1)} + \boldsymbol {b}_{(m)})\tag{7}\end{equation*}$ where $*$ represents the convolution operator and $\boldsymbol {\mathcal {W}}_{(m)}$ represents the filter weights. The corresponding loss function is as follows: $\begin{equation*} \mathcal {L}\left ({\mathbf {x}, \hat {\mathbf {x}}}\right)=\left \|{\mathbf {x}-\hat {\mathbf {x}}}\right \|^{2} \tag{8}\end{equation*}$ where $\mathbf {x}$ represents the input and $\hat {\mathbf {x}}$ represents the output (i.e., $\hat {\mathbf {x}}=f\left ({\boldsymbol {W} \mathbf {x}+\boldsymbol {b}}\right)$, where $\boldsymbol {W}$ and $\boldsymbol {b}$ represent the transformation matrix and bias vector of the model, respectively).
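The two building blocks of Eqs. (7) and (8) can be written out in a few lines. This is a toy sketch, not the network used in the paper: the kernel is hand-picked, ReLU is assumed for $f$, and, as in most deep learning libraries, the "convolution" is computed as a cross-correlation without flipping the kernel.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """One convolutional feature map as in Eq. (7), valid padding, stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            s = bias
            for dr in range(kh):
                for dc in range(kw):
                    s += kernel[dr][dc] * image[r + dr][c + dc]
            row.append(max(0.0, s))  # ReLU, an assumed choice of f
        out.append(row)
    return out


def reconstruction_loss(x, x_hat):
    """Squared L2 distance from Eq. (8) over flattened input and output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # sums each pixel with its lower-right neighbour
features = conv2d_valid(image, kernel)  # [[6.0, 8.0], [12.0, 14.0]]
loss = reconstruction_loss([1, 2, 3], [1, 2, 5])  # (3 - 5)^2 = 4
```

In a full auto-encoder the encoder stacks several such feature maps, the decoder mirrors them, and training minimizes the reconstruction loss between the photo and its reconstruction.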

3) Academic Performance Representation

The heterogeneity of academic performance, caused by the diversity of the student curriculum, is always a challenge in the education field. For example, one student chooses courses A, B, and C and another student chooses courses C and D. The content and number of dimensions vary when using their exam grades as features. The current popular method in this field is to calculate summary statistics (e.g., the mean or GPA) as a proxy for academic performance. This can effectively overcome the heterogeneity of the curriculum, but the information loss may be significant when the data distribution varies widely. To preserve the complete information of academic performance while overcoming the heterogeneity of grade data, we combine a variant of one-hot encoding with an auto-encoder for homogenization [24]. First, we encode each student’s courses with one-hot encoding, replacing each 1 with the corresponding exam grade. In this way, we create the matrix $\boldsymbol {C} \in \mathbb {R}^{n \times m}$ (shown as follows), where $n$ and $m$ represent the number of students and the number of courses, respectively. $\begin{equation*} \boldsymbol {C} = \begin{bmatrix} c_{11} & c_{12} & \cdots & c_{1m}\\ c_{21} & c_{22} & \cdots & c_{2m}\\ \vdots & \vdots & \ddots & \vdots \\ c_{n1} & c_{n2} & \cdots & c_{nm} \end{bmatrix} \end{equation*}$

However, the number of courses taken by each student is much smaller than the total number of courses offered by the university, which makes $\boldsymbol {C}$ severely sparse. To overcome this problem, we use an auto-encoder (Eq. 6) to reduce the high dimensionality introduced by the previous step and obtain an accurate representation of students’ academic performance.
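The grade-valued one-hot encoding can be sketched with the two-student example from the text (courses A, B, C versus C, D). The grades, student ids, and function names are hypothetical. Using 0 for "course not taken" follows the one-hot convention, which means a genuine grade of 0 would be indistinguishable from a missing course in this simplified form.

```python
def grade_matrix(records, students, courses):
    """Build C (n students x m courses); C[s][c] is the grade, 0.0 if not taken."""
    s_idx = {s: k for k, s in enumerate(students)}
    c_idx = {c: k for k, c in enumerate(courses)}
    C = [[0.0] * len(courses) for _ in students]
    for student, course, grade in records:
        C[s_idx[student]][c_idx[course]] = float(grade)
    return C


def sparsity(C):
    """Fraction of zero entries in C."""
    total = sum(len(row) for row in C)
    zeros = sum(1 for row in C for v in row if v == 0.0)
    return zeros / total


# Hypothetical exam records: (student, course, grade).
records = [
    ("s1", "A", 85), ("s1", "B", 90), ("s1", "C", 78),
    ("s2", "C", 88), ("s2", "D", 72),
]
C = grade_matrix(records, students=["s1", "s2"], courses=["A", "B", "C", "D"])
# C == [[85.0, 90.0, 78.0, 0.0],
#       [0.0,  0.0,  88.0, 72.0]]
zero_fraction = sparsity(C)  # 3 of 8 entries are zero
```

Every student now has a vector of the same length $m$, and with a realistic course catalogue most entries are zero, which is the sparsity the auto-encoder is meant to compress.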

B. Data Augmentation for Label Imbalance

The number of students with mental health problems is generally much smaller than the number of mentally healthy students (see Section V-B), so the label imbalance problem exists in our experiments. Thus, we utilize the SMOTE algorithm [45] to augment the data and improve generalization performance. Plain oversampling simply copies samples of the minority category, so the patterns that learning models capture from the modified data become too specific. By contrast, the basic idea of SMOTE is to analyze the category with fewer samples and to generate new samples by the following equation: $\begin{equation*} x_{new} = x + rand (0,1) \times (\tilde {x} - x)\tag{9}\end{equation*}$ where $x$ and $\tilde {x}$ represent two different samples in the same category. As a result, the learning model can capture more general patterns from the data augmented by SMOTE.
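The interpolation in Eq. (9) is short enough to write out directly. The sketch below is a simplification: the original SMOTE algorithm picks $\tilde {x}$ among the k nearest minority-class neighbours of $x$, whereas here the two samples are drawn at random from the minority class. The function names and the toy feature vectors are assumptions.

```python
import random


def smote_sample(x, x_tilde, rng):
    """One synthetic point on the segment between x and x_tilde (Eq. 9)."""
    lam = rng.random()  # rand(0, 1), shared across all coordinates
    return [a + lam * (b - a) for a, b in zip(x, x_tilde)]


def oversample(minority, n_new, seed=0):
    """Generate n_new synthetic minority samples.

    Simplification: partners are drawn at random rather than from the
    k nearest neighbours used by the original SMOTE algorithm.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x, x_tilde = rng.sample(minority, 2)
        synthetic.append(smote_sample(x, x_tilde, rng))
    return synthetic


# Two hypothetical minority-class feature vectors.
minority = [[0.0, 0.0], [2.0, 4.0]]
new_points = oversample(minority, n_new=5)
```

Because every new point lies on the segment between two real minority samples, the synthetic data fills in the region between them instead of duplicating existing rows.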

C. Detection Model

In this study, students’ mental health detection is treated as a binary classification task. A three-layer DNN model, consisting of the input, hidden, and output layers, is used for the final binary detection (shown in Figure 4). The input and output layers serve as nodes to buffer input and output, respectively, and the function of the hidden layer is to fit the relationship between input and output. Before any data has been run through the network, the weights of the three-layer model are random. Through the back-propagation algorithm, all weights are updated according to the patterns hidden in the data. A dropout mechanism is applied to mitigate the overfitting caused by a relatively small dataset: units of the neural network (along with their connections) are dropped randomly during model training. Meanwhile, we also introduce batch normalization to optimize the training process. (Note that our experiments are carried out in the second semester of the students’ university life, so only first-semester grades are used as features. A model for time-series data, such as a temporal convolutional network (TCN) [46] or a long short-term memory network (LSTM) [47], would be an alternative if grades from more semesters were involved.)
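A forward pass through such a three-layer model with dropout might look like the sketch below. The ReLU hidden activation, the sigmoid output, the dropout rate of 0.5, and the hand-set weights are all assumptions for illustration; batch normalization and back-propagation are omitted to keep the example short.

```python
import math
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(x, weights, bias):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def forward(x, params, rate=0.5, rng=None, training=False):
    """Input -> hidden (ReLU, dropout) -> output (sigmoid probability)."""
    rng = rng or random.Random(0)
    hidden = [max(0.0, h) for h in dense(x, params["W1"], params["b1"])]
    hidden = dropout(hidden, rate, rng, training)
    (logit,) = dense(hidden, params["W2"], params["b2"])
    return 1.0 / (1.0 + math.exp(-logit))


# Hand-set toy weights: two inputs, two hidden units, one output.
params = {
    "W1": [[1.0, 0.0], [0.0, 1.0]], "b1": [0.0, 0.0],
    "W2": [[1.0, -1.0]], "b2": [0.0],
}
prob = forward([2.0, 2.0], params)  # logit is 0, so the probability is 0.5
dropped = dropout([1.0] * 10, 0.5, random.Random(1))
```

Dropout is active only when `training=True`; at inference time all units are kept, and the rescaling during training keeps the expected activation the same in both modes.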

FIGURE 4.

The illustration of three-layer model.

SECTION V.

Dataset

In this section, we detail the data and its collection process. The dataset used in this research covers 509 freshmen from the same school of a Chinese university who have just finished their first-semester exams. Participants are required to be at least 18 years old (aged 18-20, mean = 19.03, SD = 0.21) and to live in several specific, adjacent residential buildings in the same area. After removing erroneous records, 485 students are involved in this experiment. We first introduce the ethical considerations. We then present the data collected through the LMS (academic performance and demographic features) and the questionnaire (physical appearance and social networks).

A. Ethical Considerations

This research has been given ethical approval through the university’s ethical approval process. Participants consent to release all related data for the study. Participants are given the option to freely withdraw at any time during the study or omit any particular answers without providing a reason. Participants’ pictures and information are kept coded and confidential, and the questionnaire data is kept with separate IDs (e.g., letter code for faces, number code for other data). The key relating codes are kept separately in a password-protected file. The study is not anticipated to cause any distress, but if, for any reason, participants are distressed, they are encouraged to contact the student support service.

B. Mental Test

In this research, we use the Symptom Checklist 90 (SCL-90), a widely used self-report psychometric instrument, to assess mental distress and symptoms of psychopathology [48]. The primary symptom dimensions of SCL-90 consist of total scores of psychological health, somatization, obsessive-compulsive, interpersonal sensitivity, depression, anxiety, hostility, phobic anxiety, paranoid ideation, psychoticism, and a category of “additional items” which help clinicians assess other aspects of the clients’ symptoms. According to the university student norm, all students are divided into two categories: students with mental health problems and students with healthy mental status. The ratio of students with healthy mental status to students with mental health problems is 7 to 1.

C. Academic Performance and Demographic Feature

The LMS is part of the university’s infrastructure and records information about students’ learning and daily life. Students’ academic performance is generally recorded as the exam grade of each course stored in the LMS. The academic performance data used in this research includes 13234 records, and the demographic data includes 1455 records covering gender, age, and nationality.

D. Physical Appearance and Social Networks

To collect physical appearance accurately, all participants are assigned to a lab and instructed to sit down and look at the center of the camera lens with neutral expressions, hair pulled back, and no adornments. Students are photographed under consistent lighting conditions with a fixed camera distance. We used a Fujifilm FinePix S5 Pro digital SLR camera (60 mm fixed length lens) and a photo booth painted white with calibrated D65 white lighting. These facial photographs are aligned according to interpupillary distance. We resize and crop the photographs to ensure the display of equal proportions of neck and hair. This process results in a set of 485 images of 485 identities.

After the photos are taken, participants are asked to nominate members of their social networks along several important dimensions: friendship, life advice, academic advice, support, cooperation, intelligence, and good/bad news sharing. They are instructed to write down 5–8 names of freshmen living in their dorm area (the names of the 485 students are provided as a reference list) in response to questions such as: please choose 5–8 freshman names from the list who are your friends; who are intelligent; whom you would go to for academic advice/life advice, sharing good/bad news, support, or cooperation. For clarity of presentation, we select six of these networks for visualization in Figure 5.
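To make the structure of the collected data concrete, the following sketch turns nomination answers into one directed edge set per view. The triple format, the view names, the student IDs, and the rule of discarding self-nominations and off-list names are assumptions for illustration; the paper does not specify its preprocessing.

```python
from collections import defaultdict

def build_views(nominations, roster):
    """Turn questionnaire answers into one directed edge set per view.

    nominations: iterable of (nominator, view, nominee) triples.
    roster: set of valid student IDs (the reference list).
    """
    views = defaultdict(set)
    for source, view, target in nominations:
        # Assumed cleaning rule: keep only listed students, drop self-nominations.
        if source in roster and target in roster and source != target:
            views[view].add((source, target))
    return dict(views)

# Hypothetical toy answers; IDs and view names are placeholders.
roster = {"s01", "s02", "s03"}
nominations = [
    ("s01", "friendship", "s02"),
    ("s02", "friendship", "s01"),
    ("s01", "academic_advice", "s03"),
    ("s03", "academic_advice", "s03"),   # self-nomination, ignored
    ("s02", "support", "s99"),           # not on the roster, ignored
]
views = build_views(nominations, roster)
```

Each key of `views` corresponds to one view of the multi-view network, and each edge points from the nominating student to the nominated one.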

FIGURE 5.

The sketch of social networks.

SECTION VI.

Experiments and Results

In this section, we present the experimental results in detail to not only demonstrate the performance of the proposed approaches, including the CASTLE framework and MOON algorithm, but also to explore the detectability of students with mental health problems. All experiments are implemented in Python 3.6. Packages including Pandas and Scikit-learn are utilized for data analysis and detection. Origin 2018 and Graph are utilized for the visualization of data and experimental results. We first introduce the representation results of academic performance and physical appearance. Then, we introduce the experimental settings of mental status detection and its results.

A. Representation of Academic Performance and Physical Appearance

To deal with the heterogeneity of academic performance data, we apply the representation approach that combines a variant of one-hot encoding with an autoencoder, as mentioned before. Since this experiment was conducted in the second semester, only grades from the first semester are used as a feature for detection. We test different representation dimensions, and the performance is shown in Figure 6. The value of the loss function fluctuates only slightly, indicating that even low-dimensional vectors can accurately represent the academic performance of each student. Thus, we choose 6 as the dimension of representation for computational efficiency. Moreover, we use a CNN-based auto-encoder to process students’ photos, and the representations are also shown in Figure 6. We choose 6 as the dimension of this representation as well.

FIGURE 6.

The results of feature representation for academic performance and physical appearance.

B. Detection Results

1) Results Analysis

We design a series of experiments to explore the detectability of students with mental health problems and to validate the methods proposed in this paper. There are a total of 485 samples in our dataset, and as mentioned before, the ratio of students with healthy mental status to students with mental health problems is 7 to 1. Four types of features are used for detection: social life, appearance, academic performance, and demographic information. The embedding dimension of social life is 8 (details are shown in Section VI-B2). The embedding dimension of both appearance and academic performance is 6. Moreover, three pieces of demographic information are included in this experiment: gender, age, and nationality. The final feature used for prediction is therefore a 23-dimensional vector (8 + 6 + 6 + 3).
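The dimensional bookkeeping above can be sketched as a simple concatenation. The function name, the ordering of the parts, and the numeric encoding of the three demographic fields are assumptions for illustration only.

```python
def fuse_features(social, appearance, academic, demographic):
    """Concatenate per-student representations into one feature vector."""
    expected = {"social": 8, "appearance": 6, "academic": 6, "demographic": 3}
    parts = {"social": social, "appearance": appearance,
             "academic": academic, "demographic": demographic}
    for name, size in expected.items():
        if len(parts[name]) != size:
            raise ValueError(f"{name} must have {size} values, got {len(parts[name])}")
    return list(social) + list(appearance) + list(academic) + list(demographic)

# Made-up values; demographic fields are assumed to be numerically encoded
# as (gender, age, nationality).
vec = fuse_features([0.1] * 8, [0.2] * 6, [0.3] * 6, [1, 19, 0])
```

The resulting `vec` has 8 + 6 + 6 + 3 = 23 entries, matching the input size of the detection model.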

Due to privacy concerns, there is a lack of publicly available datasets in the field of student mental health, which makes it difficult for scholars in the field to test algorithms in the same data environment. We therefore design the following comparative experiments based on algorithms commonly used in related fields. We replace specific components of the proposed framework with popular algorithms, in two parts. In the first part, we replace our network embedding algorithm with the following algorithms:

Deepwalk [43]: Deepwalk is a classic embedding algorithm that is widely used in the field of network representation.
MANE [42]: MANE is a multi-view network algorithm, which inspired us to propose the MOON algorithm.

In the second part, we replace our final detection model with the following popular algorithms:

Support Vector Machine (SVM) [49]: SVM is a classic algorithm and is widely used in the field of data mining.
Random Forest (RF) [50]: RF is a classic ensemble algorithm that achieves good performance in various applications.
XGBoost [51]: XGBoost is a boosting-tree-based method and is widely used in various data mining scenarios with good performance.

We split the data into training and test sets at ratios of 9:1, 8:2, 7:3, 6:4, and 5:5, respectively. For each train/test split, we use SMOTE to alleviate the class imbalance problem, and the data generation process is as follows:

First, raw data is divided into two categories: the training set $a$ and the testing set $b$ by stratified sampling.
Second, we use SMOTE on the training set $a$ to generate samples of the minority class. Then in the new training set $a'$ , the number of students in the two classes is equal.
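The two steps above can be sketched in pure Python as follows. This is a simplified illustration of stratified splitting and SMOTE-style interpolation, not the authors' implementation: the toy data, the neighbour count `k=5`, the random seed, and the helper names are all assumptions. The key point it shows is that synthetic samples are generated from the training set only, so the test set $b$ stays untouched.

```python
import random

def stratified_split(y, test_ratio, rng):
    """Split indices per class so both sets keep the original class ratio."""
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_ratio)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

def smote(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((m for m in minority if m is not base),
                            key=lambda m: dist2(base, m))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

rng = random.Random(42)
# Toy data with a 7:1 class ratio (label 1 = minority); values are made up.
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(70)]
     + [[rng.gauss(3, 1), rng.gauss(3, 1)] for _ in range(10)])
y = [0] * 70 + [1] * 10

# Step 1: stratified split into training set a and testing set b.
train_idx, test_idx = stratified_split(y, test_ratio=0.2, rng=rng)

# Step 2: oversample the minority class of the training set only, giving a'.
X_min = [X[i] for i in train_idx if y[i] == 1]
n_maj = sum(1 for i in train_idx if y[i] == 0)
X_new = smote(X_min, n_new=n_maj - len(X_min), k=5, rng=rng)

X_train = [X[i] for i in train_idx] + X_new
y_train = [y[i] for i in train_idx] + [1] * len(X_new)
```

After oversampling, the two classes in `y_train` are equal in size, while the test indices still reflect the original 7:1 ratio.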

We test the performance of these algorithms from two aspects for a comprehensive evaluation. First, we fit the algorithms on the raw training set $a$ and test them on the testing set $b$. The results are shown in Figure 7. In this paper, we quantify the experimental results by Accuracy, Recall, and F1 Score; Precision is omitted because it is fully determined once Recall and F1 Score are known. Accuracy, Recall, and F1 Score therefore provide a comprehensive and effective evaluation of the experimental results. In particular, we are most concerned with how many students with mental problems are identified, so Recall and F1 Score are the more informative indicators. The detection is not accurate and fluctuates considerably due to the label imbalance of the raw data. Second, to overcome this problem, we fit the algorithms on the balanced training set $a'$ and test them on the testing set $b$. The results are shown in Figure 8. The performance improves substantially.
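The relationship between the three reported metrics and Precision can be checked with a short sketch. Since $F1 = 2PR/(P+R)$, solving for $P$ gives $P = F1 \cdot R / (2R - F1)$. The labels below are made up for illustration, and the helper names are hypothetical; the code assumes at least one actual and one predicted positive so that no division by zero occurs.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def scores(y_true, y_pred):
    """Accuracy, Precision, Recall, and F1 Score for binary labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def precision_from(recall, f1):
    """Invert F1 = 2PR / (P + R) for P."""
    return f1 * recall / (2 * recall - f1)

# Made-up labels: 4 students with problems (1) and 6 healthy students (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
accuracy, precision, recall, f1 = scores(y_true, y_pred)
recovered = precision_from(recall, f1)   # equals `precision` up to rounding
```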

FIGURE 7.

Performance of mental health detection on raw training dataset $a$ .

FIGURE 8.

Performance of mental health detection on balanced training dataset $a'$ .

Note that, judging from the performance of MOON+DNN and MANE+DNN in Figure 8, the proposed MOON algorithm is only slightly better than the MANE algorithm, but it has a clear advantage in computational overhead. In the original MANE model, the complexities for cross-view-intra-node and cross-view-cross-node consistencies are both $O(|E| \cdot D /|V| \cdot K \cdot |V|)$ and the overall complexity is $O(|E| \cdot D \cdot K)$, where $K$ is the number of negative samples. For the MOON proposed in this paper, the complexities for cross-view-intra-node and cross-view-cross-node consistencies are both $O(|E| \cdot D /|V| \cdot K)$ through simplifying collaboration between networks (Eq. 3 and Eq. 4). Consequently, the overall complexity of MOON is $O(|E| \cdot D /|V| \cdot K)$, which is lower than that of MANE by a factor of $|V|$.

Finally, to better understand the performance of our framework, we compare the performance of current popular algorithms trained on the raw training set $a$ (with label imbalance) and the performance of CASTLE trained on the balanced training set $a'$ (processed by SMOTE). Note that in this part, since SMOTE is part of the CASTLE framework, only CASTLE is based on the SMOTE-processed balanced dataset. Other algorithms are based on the raw dataset. The results are shown in Table 3, which demonstrates the performance of the framework.

TABLE 3 Comparison of the CASTLE Framework and Popular Algorithms

2) Input Analysis

First, the detection performance with different embedding dimensions of the MOON algorithm is analyzed based on the CASTLE framework, and the results are shown in Table 4. For computational efficiency, the embedding dimension of the MOON algorithm is set at 8. Second, the contribution of each type of feature is analyzed in this paper. The results are shown in Table 5. All features contribute to the detection, which is consistent with our assumption. Note that the contribution of physical appearance is relatively small because social life and physical appearance may contain redundant information [11].

TABLE 4 Detection Performance With Different Embedding Dimension of the MOON Algorithm

TABLE 5 Performance With Different Input. For Concise Presentation in the Table, We Used Shorthand to Represent Each Part: $F$ (Friendship Network), $S$ (Social Behavior (Multi-View Social Network)), $A$ (Academic Performance), $P$ (Physical Appearance), $D$ (Demographic Features)

Moreover, a questionnaire is used to collect the multi-view social network of students; this method is time-consuming and costly, and is hardly applicable at large scale. We therefore explore the performance achievable without manual data collection methods such as questionnaires. We replace the multi-view network collected through questionnaires with a friendship network generated from students’ canteen co-occurrence frequency [52]. The structure of the friendship network is embedded through Deepwalk and the performance is shown in Table 5 as $F+P+A+D$. Compared with $S+P+A+D$, its performance declines slightly, but the friendship network is built entirely from LMS data, which is more practical for the education management division (more discussion in Section VII).
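One simple way to derive a friendship network from co-occurrence frequency is sketched below. This is not the method of [52]: the record format (a student ID paired with a canteen time slot), the pairwise counting, and the threshold `min_count` are all assumptions chosen to illustrate the general idea.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(visits, min_count):
    """Build undirected friendship edges from canteen records.

    visits: iterable of (student_id, slot) pairs, where `slot` identifies a
    canteen and time window. Two students co-occur when they share a slot.
    An edge is kept when a pair co-occurs at least `min_count` times.
    """
    by_slot = {}
    for student, slot in visits:
        by_slot.setdefault(slot, set()).add(student)
    counts = Counter()
    for students in by_slot.values():
        for pair in combinations(sorted(students), 2):
            counts[pair] += 1
    return {pair for pair, n in counts.items() if n >= min_count}

# Hypothetical toy records; IDs and slot names are placeholders.
visits = [
    ("s01", "mon-lunch"), ("s02", "mon-lunch"), ("s03", "mon-lunch"),
    ("s01", "tue-lunch"), ("s02", "tue-lunch"),
    ("s01", "wed-lunch"), ("s03", "wed-dinner"),
]
edges = cooccurrence_edges(visits, min_count=2)
```

Here only the pair that shares two slots survives the threshold; the resulting edge set could then be passed to a single-view embedding method such as Deepwalk.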

In addition, we design experiments to analyze the parameters $\alpha$ and $\beta$ in Eq. 5 separately. To avoid the influence of other characteristics, this part of the experiment is based on the original data (without data augmentation) and uses only social networks to detect the targeted students. We fix each of $\alpha$ and $\beta$ at 0.5 in turn and vary the other parameter to test its effect on the detection results. The results are shown in Figure 9. Both parameters affect the detection results, to different degrees.

FIGURE 9.

Parameters analysis of $\alpha$ and $\beta$ in Eq. 5.

Finally, as mentioned above, a dropout mechanism is utilized to improve the generalization performance, and we test the sensitivity of the proposed framework to the dropout rate (Figure 10). Changing the dropout proportion affects the performance only slightly, and a rate of 0.3 performs best.

FIGURE 10.

Performance of the proposed framework with different dropout proportions.

SECTION VII.

Conclusion and Discussion

In this paper, we investigate the important problem of detecting students’ mental health. We propose an educational data fusion detection framework, CASTLE, to achieve effective and accurate detection by fusing multi-modal data generated from campus life. We tackle the various challenges that exist in multi-modal data by using representation learning. Specifically, for the representation of students’ social life, we divide social networks into the source view and the target view according to the inherent nature of social behaviors and propose the MOON algorithm for multi-view social network embedding. In essence, the idea provided here is a new embedding strategy for multi-view networks: different networks need to be treated differently depending on the characteristics of the specific application scenario. This strategy could be extended to other fields. Moreover, we use SMOTE to overcome the label imbalance problem. We conduct comprehensive experiments on a real-world educational dataset, and the results demonstrate the performance of the proposed methods.

Although we demonstrate the performance of the proposed algorithm through extensive experiments, there is still room for improvement in this study. First, we consider campus social networks as static, unweighted multi-view networks. However, social networks are dynamic and each link is weighted differently in real life. How to accurately characterize the social network of students is still an open question. In addition to campus social relationships, other social networks also have a significant impact on mental health, such as intimate relationships, family relationships, and teacher-student relationships. Subsequent research needs to consider social networks more comprehensively. Moreover, as mentioned above, a questionnaire is used to collect the multi-view social network of students; this method is time-consuming and costly, and is hardly applicable at large scale. Although we try to automatically capture friendship relationships among students based on cafeteria co-occurrence, such methods are too crude and their accuracy cannot be guaranteed. How to capture students’ social networks in diverse scenarios with a data-driven approach is still an open question. Finally, a deep learning model is adopted in this paper, which means trading the interpretability of the experimental results for better prediction performance. This can easily cause educators to mistrust the prediction.

According to the drawbacks mentioned above, there are multiple directions for future work as follows:

First, we will develop subsequent versions of the CASTLE framework to fuse more features, like students’ Internet access patterns and life orderliness, to achieve better detection performance.
Second, we will attempt to develop data-driven methods for capturing students’ social patterns based on group work records, or discussion records on LMS, to replace the questionnaire-based data collection.
Third, we will try to introduce causal learning techniques to analyze the experimental results.
Last but not least, we also intend to integrate the CASTLE framework into the modern educational management system to assist with educational decision making.

ACKNOWLEDGMENT

The authors thank Dongyu Zhang, Chuanhui Yuan, and Qing Qing for their help with the experiments as well as all students who participated in the experiments.
