Introduction
Important tasks in adaptive learning include accurately predicting a student's performance and capturing changes in the student's ability from the student's prior learning history. In the field of artificial intelligence, knowledge tracing (KT) has been researched actively to predict a student's performance (correct or incorrect responses to an unknown item) and to discover concepts that the student has not mastered by tracing the student's evolving knowledge state [1], [2], [3], [4], [5]. These tasks are important to help students learn effectively by presenting optimal problems and guiding a teacher's support.
Recently, various KT methods have been developed using major approaches: probabilistic approaches, deep-learning-based approaches, and attention-mechanism-based approaches. Bayesian KT (BKT) is a well-known probabilistic approach that employs a hidden Markov model to trace a student's evolving knowledge state [1].
BKT estimates whether the student has mastered a skill according to the student's past response data. It then predicts the student's responses to unknown items. Researchers have proposed several BKT variants to improve interpretability [6], [7], [8], [9]. However, BKT models represent a student's knowledge state using only simple discrete values, so they are inflexible in representing changes in the student's knowledge state. Moreover, they assume a single dimension of ability; therefore, they cannot capture multidimensional ability sufficiently or predict performance precisely.
Recently, item response theory (IRT) has been used for KT to predict a student's correct answer probability to an unknown item [10], [11], [12]. In fact, IRT has been used in the field of test theory, where it has high parameter interpretability by virtue of its capability of estimating the student's latent ability parameter and item characteristic parameters.
Several studies have extended standard IRT models to ascertain student ability changes during learning processes with a hidden Markov process [11], [12], [13], [14], [15]. These are regarded as generalized models of BKT and IRT because they estimate the ability as a continuous hidden variable following a hidden Markov process. Actually, a learning task is associated with multiple skills: students must master the knowledge of multiple skills to solve a task. However, BKT and IRT are restricted to expressing a unidimensional ability. Therefore, they cannot capture multidimensional ability sufficiently or predict performance precisely.
To overcome this shortcoming, Piech et al. [2] developed deep KT (DKT) as the first method among deep-learning-based approaches.
DKT employs long short-term memory (LSTM) to relax the restrictions of skill separation and binary state assumptions [16]. That earlier report describes that DKT can predict a student's performance more precisely than probabilistic models such as BKT can. However, the hidden states include a summary of the past sequence of learning history data in LSTM. Therefore, DKT does not explicitly treat the student's ability of each skill.
To improve DKT performance, a dynamic key-value memory network (DKVMN) was developed to exploit the relation between underlying skills and to trace the respective knowledge states [4]. By employing a memory-augmented neural network, DKVMN can estimate the relations between underlying skills and items addressed by students. In addition, DKVMN has a memory-updating component that allows forgetting and updating of the latent variable memory, which stores the students' knowledge states during the learning process [4]. Furthermore, Deep-IRT has been proposed to improve the explanatory capabilities of the parameters [3]. Deep-IRT can estimate a student's ability and an item's difficulty, just as standard IRT models can, by combining DKVMN with an IRT module. However, its interpretability remains insufficient because the student's ability in Deep-IRT depends on each item's characteristics. Although Deep-IRT implicitly functions on the assumption that items with the same skills are equivalent, that assumption does not hold true when the item difficulties for the same skills differ greatly. Items of the same skill that are not equivalent interfere with the interpretation of the student's ability estimates.
The self-attentive KT (SAKT) method was the first to employ an attention mechanism, based on the Transformer architecture, for KT [18], [19]. To predict student performance, SAKT identifies the relation between skills and an item addressed by a student from past learning data. Most recently, attentive KT (AKT) was developed to improve SAKT performance [5]. To incorporate a forgetting function of past data, AKT employs attention mechanisms and optimizes parameters to weight the past learning data needed to predict student performance. In addition, Ghosh et al. [5] pointed out the error of the assumption in earlier KT methods that items with identical skills are equivalent. To overcome that shortcoming, they employed both items and skills as inputs. In fact, AKT provides state-of-the-art performance for student response prediction. However, the interpretability of its parameters remains inadequate because AKT cannot express a student's ability transition for each skill.
The most challenging aspect of KT is to estimate an interpretable student ability without decreasing predictive accuracy. This study specifically addresses this difficulty. Recent studies of deep learning have clarified that parameter redundancy relative to the training data reduces generalization error, contrary to Occam's razor [20], [21], [22]. Based on those reports, this study proposes a novel Deep-IRT that has two independent redundant networks: 1) a student network and 2) an item network [23]. The proposed method learns the student's ability parameters and the item's characteristic parameters independently. It thereby provides more interpretable ability parameters than the earlier Deep-IRT does.
In addition, a student network employs memory network architecture to reflect dynamic changes in student abilities as DKVMN does. The memory updating component in DKVMN is more effective than the forgetting function of AKT because it updates the current latent variable, which stores the students' skills and abilities using only the immediately preceding values.
However, room for improvement remains in the prediction accuracy of the proposed Deep-IRT. In fact, the forgetting parameters, which control the degree of forgetting of the past latent variable, are optimized from only the current input data: the student's latest response to an item. This might degrade the prediction accuracy of the Deep-IRT because the latent variable only insufficiently reflects the past data. As a result, it might hinder accurate estimation of the ability transition in a long learning process. The forgetting parameters should be optimized using not only the current input data but also the past latent variables.
A simple solution to this problem is to add new weight parameters that balance the current input data and the past latent variables at each time. However, this solution increases the number of weight parameters as the learning process progresses. It often yields too many weight parameters to estimate reliably.
To resolve that difficulty, we combine a novel hypernetwork with the proposed method because it optimizes the degree of forgetting of the past latent variables and thereby avoids greatly increasing the number of parameters.
Recent studies in the field of natural language processing (NLP) have proposed several hypernetworks to optimize the latent variables and the weights of the hidden layers for LSTM [24], [25]. Some hypernetworks scale the latent variables and the columns of all weight matrices to express a context-dependent transition [24], [26]. No report of the relevant literature has described the use of hypernetworks for KT methods. In the proposed method, the hypernetwork balances both the current input data and the past latent variables that store a student's knowledge state in the learning process. Before the model updates the latent variable, the hypernetwork optimizes not only the weights of the forgetting parameters but also the past latent variables.
We conducted experiments to compare the proposed method's performance and those of earlier KT methods. Surprisingly, the results demonstrate that the proposed method improves the prediction accuracy and the interpretability of earlier KT methods, although the parameters of the proposed method are far more numerous than those used for earlier methods.
This study is an extension of our work reported in earlier papers accepted at the International Conference on Educational Data Mining in 2021 and 2022 [23], [27]. The main differences between this article and the earlier papers are the following. Tsutsumi et al. [23] did not propose a new deep-learning technology but combined only existing technologies. Although Tsutsumi et al. [27] proposed a hypernetwork for KT, they described no related details: only the conceptual idea of incorporating a hypernetwork into Deep-IRT [23]. Furthermore, the authors in [23] and [27] improved the parameter interpretability. However, their prediction accuracies did not outperform AKT, which provided the best prediction performance among the earlier methods. In contrast, this study proposes a novel hypernetwork architecture to optimize the balance between the latest input data and the past latent variables. The proposed method provides the highest prediction accuracy and outperforms AKT with high parameter interpretability to a considerable degree.
The main contributions of the work described in this article are presented as follows.
The proposed method can estimate student and item parameters with high interpretability as in IRT by two independent redundant networks. The proposed method provides higher parameter interpretability than other KT methods.
The proposed method with hypernetworks improves the prediction accuracy of earlier KT methods. Especially, it functions more effectively for long learning processes because hypernetworks reflect past learning data.
The rest of this article is organized as follows. In Section II, we review IRT and deep learning methods for KT. In Section III, we describe the proposed method to improve parameter interpretability. In Section IV, we describe the proposed method with a hypernetwork to improve the prediction accuracy. Section V shows experiments using benchmark datasets to compare the performances of the proposed methods against existing methods. Section VI explains experiments that were performed to evaluate the interpretability of the ability parameters of the proposed method. Finally, Section VII concludes this article.
Our code is also available on GitHub.
Related Work
A. Item Response Theory
Many IRT models exist [10], [28], [29]. This section briefly introduces the two-parameter logistic model (2PLM), an extremely popular IRT model. For 2PLM, the response $u_{ij}$ of student $i$ to item $j$ is defined as
\begin{equation*}
u_{ij} = {\begin{cases}1, & (\text{student}{\kern2.84526pt} i{\kern2.84526pt} \text{answers}{\kern2.84526pt} \text{correctly}{\kern2.84526pt} \text{to}{\kern2.84526pt} \text{item}{\kern2.84526pt} j)\\
0, & (\text{otherwise}). \end{cases}}
\end{equation*}
and the probability that student $i$ with ability $\theta _{i}$ answers item $j$ correctly is modeled as
\begin{align*}
P_{j}(\theta _{i})& =P(u_{ij}=1\mid \theta _{i}) \\
&= \frac{1}{1+\exp (-1.7a_{j}(\theta _{i} - b_{j}))} \tag{1}
\end{align*}
where $a_{j}$ and $b_{j}$ respectively denote the discrimination and difficulty parameters of item $j$.
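As a concrete illustration, the following minimal Python sketch (the function and variable names are ours, not from the original) evaluates the 2PLM correct-response probability in (1).

```python
import numpy as np

def irt_2pl_prob(theta, a, b, D=1.7):
    """Probability that a student with ability `theta` answers correctly
    an item with discrimination `a` and difficulty `b` under the 2PLM (1)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

# Example: an average-ability student on a moderately easy item.
print(irt_2pl_prob(theta=0.0, a=1.2, b=-0.5))  # approximately 0.73
```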
Actually, IRT models are known to have high interpretability. However, in standard IRT models, the ability is assumed to be constant throughout the learning process. Therefore, a student's ability changes are not reflected in the models. Recently, several studies have extended standard IRT models to capture a student's ability changes during learning processes with a hidden Markov process [11], [12], [13], [14], [30], [31], [32]. These are regarded as generalized models of BKT and IRT because they estimate the ability as a continuous hidden variable following a hidden Markov process.
For example, temporal IRT (TIRT) is a hidden Markov IRT with a parameter to forget past response data [12]. In TIRT, the probability of a correct answer by student $i$ to item $j$ at time $t$ is given as
\begin{align*}
P_{ij}(x_{ij}=1\mid \theta _{it})&=\frac{1}{1+\exp {(-\tilde{a}_{\Delta _{t}}(\theta _{it}-b_{j}))}}\tag{2}\\
\tilde{a}_{\Delta _{t}}&=\frac{a_{j}}{\sqrt{1+\epsilon a_{j}^{2}\Delta _{t}}} \tag{3}
\end{align*}
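A minimal sketch of (2) and (3) follows; it assumes that $\Delta_{t}$ denotes the elapsed time and $\epsilon$ a forgetting hyperparameter, and the names are ours.

```python
import numpy as np

def tirt_prob(theta_it, a_j, b_j, delta_t, eps):
    """Correct-response probability under TIRT (2)-(3).
    `delta_t` is assumed to be the elapsed time and `eps` a forgetting
    hyperparameter (our reading of (3))."""
    a_tilde = a_j / np.sqrt(1.0 + eps * a_j**2 * delta_t)  # decayed discrimination (3)
    return 1.0 / (1.0 + np.exp(-a_tilde * (theta_it - b_j)))

print(tirt_prob(theta_it=0.5, a_j=1.0, b_j=0.0, delta_t=10.0, eps=0.1))
```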
However, these IRT models incorporate the assumption of a single dimension of ability. In other words, they treat multiple skills as completely independent. Consequently, they are unable to accommodate items that require different skills.
B. Deep KT
DKT [2] was proposed as the first deep-learning-based method. It exploits recurrent neural networks and LSTM [16] to simulate transitions of ability. It can capture complex multidimensional features of both items and students and can relax the limitations of traditional methods such as independence between skills. An earlier study demonstrated that DKT outperformed BKT in terms of predictive accuracy [2]. However, DKT summarizes a student's ability of all skills in one hidden state, which makes it difficult to trace the degree to which a student has mastered a certain skill and pinpoint concepts with which a student is proficient or unfamiliar.
C. Dynamic Key-Value Memory Network
To improve the DKT interpretability, researchers have undertaken great efforts to propose novel methods for use with KT [17]. Specifically, a DKVMN exploits a memory-augmented neural network along with attention mechanisms to trace student abilities in different dimensions [4]. Fig. 1 presents a simple illustration.
Network architecture of Deep-IRT with DKVMN. The underside of the structure describes DKVMN; the whole structure describes Deep-IRT. The blue components represent the process of obtaining the attention weight. The yellow components are associated with the student network and the process of updating the value memory. The green components are associated with the item network.
The salient feature of DKVMN is that it assumes $N$ latent skills underlying the items: a static key memory $\bm M^{k}$ stores the embeddings of the latent skills, and a dynamic value memory $\bm M^{v}_{t}$ stores the student's knowledge state for each latent skill at time $t$.
First, DKVMN calculates the attention weight $w_{jl}$, which indicates how strongly an item $j$ is related to the $l$th latent skill, from the skill embedding vector $\bm s_{j}$ and the key memory slot $\bm M^{k}_{l}$ as
\begin{align*}
w_{jl} &=\text{Softmax}\left(\bm M^{k}_{l} \bm s_{j}\right) \tag{4}
\end{align*}
Next, DKVMN reads the student's knowledge state $\bm \theta _{1}^{(t)}$ from the value memory as the attention-weighted sum
\begin{align*}
\bm \theta _{1}^{(t)} = \sum _{l=1}^{N} w_{jl}\left(\bm M^{v}_{tl}\right)^\top \tag{5}
\end{align*}
Then, the read content is combined with the skill embedding to produce the summary vector $\bm \theta _{2}^{(t)}$, from which the correct-response probability $P_{jt}$ is computed as
\begin{align*}
\bm \theta _{2}^{(t)} &= \tanh \left(\bm W^{(\theta _{2})}\left[\bm \theta _{1}^{(t)},\bm s_{j}\right]+ \bm \tau ^{(\theta _{2})}\right)\tag{6}\\
P_{jt} &=\sigma \left(\bm W^{(P_{jt})}\bm \theta _{2}^{(t)} + \bm \tau ^{(P_{jt})}\right) \tag{7}
\end{align*}
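The read and prediction steps (4)–(7) can be sketched as follows; this is a simplified NumPy illustration with toy dimensions and randomly initialized weights, not the authors' implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dkvmn_predict(s_j, M_k, M_v, W_theta2, tau_theta2, W_p, tau_p):
    """Sketch of the DKVMN read/predict step (4)-(7).
    `M_k` (N x d_k) is the static key memory, `M_v` (N x d_v) the value memory
    at time t, and `s_j` (d_k,) the skill embedding of item j."""
    w = softmax(M_k @ s_j)                       # attention weights (4)
    theta1 = w @ M_v                             # read content (5)
    theta2 = np.tanh(W_theta2 @ np.concatenate([theta1, s_j]) + tau_theta2)  # (6)
    return sigmoid(W_p @ theta2 + tau_p)         # correct-response probability (7)

# Toy dimensions only; all weights would be learned in practice.
rng = np.random.default_rng(0)
N, d_k, d_v, d_h = 5, 8, 8, 16
P = dkvmn_predict(rng.normal(size=d_k), rng.normal(size=(N, d_k)),
                  rng.normal(size=(N, d_v)), rng.normal(size=(d_h, d_v + d_k)),
                  np.zeros(d_h), rng.normal(size=d_h), 0.0)
print(P)
```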
D. Deep-IRT
To improve the DKVMN interpretability, Deep-IRT is implemented by combining DKVMN with an IRT module [3]. Deep-IRT exploits both the strong prediction ability of DKVMN and the interpretable parameters of IRT. Fig. 1 presents a simple illustration.
Deep-IRT adds a hidden layer to DKVMN to obtain an interpretable student ability and item difficulty. Specifically, when a student attempts item $j$ at time $t$, Deep-IRT computes the student's ability $\theta _{3}^{(t,j)}$ and the item difficulty $\beta ^{j}$ as
\begin{align*}
\theta _{3}^{(t,j)}&=\tanh \left(\bm W^{(\theta _{3})}\bm \theta ^{(t)} _{2} + \bm \tau ^{(\theta _{3})}\right)\tag{8}\\
\beta ^{j} & =\tanh \left(\bm W^{(\beta)}\bm s_{j} + \bm \tau ^{(\beta)}\right). \tag{9}
\end{align*}
\begin{align*}
P_{jt} = \sigma \left(3.0*\theta _{3}^{(t,j)}-\beta ^{j} \right). \tag{10}
\end{align*}
Here, ability $\theta _{3}^{(t,j)}$ and difficulty $\beta ^{j}$ play the roles of the IRT ability and difficulty parameters, with the ability scaled by the constant 3.0 in (10). However, the ability estimate depends on each item $j$, which limits its interpretability.
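A compact sketch of the Deep-IRT output module (8)–(10) follows, under the assumption that the weight vectors map their inputs to scalars; the names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def deep_irt_output(theta2, s_j, W_theta3, tau_theta3, W_beta, tau_beta):
    """Sketch of the Deep-IRT IRT module (8)-(10): scalar ability and
    difficulty are squashed to (-1, 1) and combined as in IRT."""
    theta3 = np.tanh(W_theta3 @ theta2 + tau_theta3)   # student ability (8)
    beta_j = np.tanh(W_beta @ s_j + tau_beta)          # item difficulty (9)
    return sigmoid(3.0 * theta3 - beta_j)              # response probability (10)
```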
E. Attentive KT
Ghosh et al. [5] proposed AKT, which combines an attention-based model with the Rasch model, also known as the 1PLM of IRT [33]. It is noteworthy that AKT incorporates a forgetting function for past data into attention-based neural networks. Attention weights in AKT express the relation between a student's latest data and past data, decaying exponentially during the learning process. Specifically, AKT calculates the attention weight $\alpha _{t,\lambda }$ between the current time step $t$ and a past time step $\lambda$ as
\begin{align*}
\alpha _{t,\lambda } &= \frac{\exp {(f_{t,\lambda })}}{\sum _{\lambda ^{\prime }}\exp {(f_{t,\lambda ^{\prime }})}}\tag{11}\\
f_{t,\lambda } &= \frac{\exp (-\eta d(t,\lambda))\cdot \bm q_{t}^\top \bm k_\lambda }{\sqrt{D_{k}}} \tag{12}
\end{align*}
where the context-aware distance measure $d(t,\lambda)$ is
\begin{align*}
d(t,\lambda) = |t-\lambda |\sum _{t^{\prime }=\lambda +1}^{t}\frac{ \frac{\bm {q_{t}^\top } \bm {k}_{t^{\prime }}}{\sqrt{D_{k}}}}{\sum _{1\leq \lambda ^{\prime } \leq t^{\prime }} \frac{\bm {q_{t}^\top } \bm {k}_{\lambda ^{\prime }}}{\sqrt{D_{k}}}} \ \ \forall t^{\prime } \leq t. \tag{13}
\end{align*}
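The decayed attention in (11)–(13) can be sketched as follows; this follows our reading of (13), with the normalized similarities obtained by a softmax over the past steps, so it is an approximation rather than the authors' exact implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def akt_attention(q_t, K, eta, D_k):
    """Sketch of AKT's monotonic attention (11)-(13) at the current step t.
    `K` (t x D_k) holds the key vectors of steps 1..t (current step last)."""
    t = K.shape[0]
    sims = (K @ q_t) / np.sqrt(D_k)                   # scaled dot products q_t . k_lambda
    gamma = softmax(sims)                             # normalized similarities (our reading of (13))
    f = np.empty(t)
    for lam in range(t):                              # lam = 0..t-1 corresponds to steps 1..t
        d = (t - (lam + 1)) * gamma[lam + 1:].sum()   # context-aware distance d(t, lambda)
        f[lam] = np.exp(-eta * d) * sims[lam]         # decayed similarity (12)
    return softmax(f)                                 # attention weights alpha (11)

# Toy usage with five past steps.
rng = np.random.default_rng(0)
alpha = akt_attention(rng.normal(size=8), rng.normal(size=(5, 8)), eta=0.5, D_k=8)
print(alpha, alpha.sum())
```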
In addition, they pointed out that the earlier KT methods assumed that items with the same skills were equivalent. To resolve the difficulty, AKT employs both items and skill inputs. Results show that, among the earlier KT methods, AKT provides the best performance for predicting the students' responses. Nevertheless, the interpretability of its parameters remains inadequate because it cannot express a student's ability transition for each skill.
Deep-IRT With Independent Student and Item Networks
The ability parameter of the Deep-IRT [3] depends on each item because it implicitly assumes that items with the same skills are equivalent. That assumption does not hold when the item difficulties for the same skills differ greatly. Therefore, when the items for the same skills are not equivalent, it is difficult to interpret a student's ability estimate.
To resolve the difficulty, this study proposes a novel Deep-IRT method comprising two independent neural networks: 1) the student network and 2) the item network [23], as presented in Fig. 2. The student network employs a memory network architecture, as DKVMN does, to capture changes in student ability comprehensively. The item network takes inputs of two kinds: 1) the item attempted by a student and 2) the skills necessary to solve the item. Using the outputs of both networks, the probability of a student answering an item correctly is calculated.
Network architecture of Deep-IRT with independent student and item networks. The yellow components are associated with the student network. The green components are associated with the item network. In addition, the right side of the figure presents the memory updating component.
The proposed method can estimate student parameters and item parameters independently such that the prediction accuracy does not decline because the two independent networks are designed to be more redundant than those of earlier methods, based on state-of-the-art reports [20], [21], [22]. The proposed method predicts the probability $P_{jt}$ that a student answers item $j$ correctly at time $t$ from the difference between the ability output by the student network and the difficulties output by the item network.
A. Item Network
In the item network, two difficulty parameters of item $j$ are estimated: the item difficulty $\beta _{\text{item}}^{j}$ and the skill difficulty $\beta _{\text{skill}}^{j}$.
In the proposed method, to express the item difficulty, the item embedding vector $\bm q_{j}$ is passed through $k$ stacked hidden layers ($k^{\prime } = 2, \ldots, k$) as
\begin{align*}
\bm \beta _{1}^{j}&= \tanh \left(\bm W^{ (\beta _{1})} \bm q_{j} + \bm \tau ^{ (\beta _{1})}\right) \tag{14}\\
\bm \beta _{k^{\prime }}^{j}&= \tanh \left(\bm W^{(\beta _{k^{\prime }})} \bm \beta _{{k^{\prime }}-1}^{j} + \bm \tau ^{(\beta _{k^{\prime }})} \right)\tag{15}\\
\beta _{\text{item}}^{j} &= \bm W^{(\beta _{\text{item}})}\bm \beta ^{j} _{k} + \bm \tau ^{(\beta _{\text{item}})}. \tag{16}
\end{align*}
Similarly, to compute the difficulty of skills, the proposed method uses the embedding vector $\bm s_{j}$ of the skills necessary for item $j$ as input:
\begin{align*}
\bm \gamma _{1}^{j}&= \tanh \left(\bm W^{(\gamma _{1})} \bm s_{j} + \bm \tau ^{(\gamma _{1})} \right) \tag{17}\\
\bm \gamma _{k^{\prime }}^{j}&= \tanh \left(\bm W^{(\gamma _{k^{\prime }})} \bm \gamma _{{k^{\prime }}-1}^{j} +\bm \tau ^{(\gamma _{k^{\prime }})} \right) \tag{18}\\
\beta _{\text{skill}}^{j} &= \bm W^{(\beta _{\text{skill}})}\bm \gamma ^{j} _{k} + \bm \tau ^{(\beta _{\text{skill}})} \tag{19}
\end{align*}
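Each branch of the item network, (14)–(16) for the item difficulty and (17)–(19) for the skill difficulty, is a small stack of tanh layers followed by a linear output. A minimal sketch with toy dimensions and names of our own follows.

```python
import numpy as np

def mlp_difficulty(x, hidden_Ws, hidden_taus, W_out, tau_out):
    """Sketch of one branch of the item network (14)-(16) / (17)-(19):
    k stacked tanh layers followed by a linear layer producing a scalar difficulty."""
    h = x
    for W, tau in zip(hidden_Ws, hidden_taus):   # (14)-(15) or (17)-(18)
        h = np.tanh(W @ h + tau)
    return float(W_out @ h + tau_out)            # (16) or (19)

# Toy usage with k = 2 hidden layers; q_j is an item embedding.
rng = np.random.default_rng(1)
d = 16
q_j = rng.normal(size=d)
beta_item = mlp_difficulty(q_j,
                           [rng.normal(size=(d, d)) for _ in range(2)],
                           [np.zeros(d) for _ in range(2)],
                           rng.normal(size=d), 0.0)
print(beta_item)
```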
B. Student Network
In the student network, the proposed method calculates the student's ability $\theta ^{(t,j)}$ for item $j$ at time $t$. It first reads the knowledge state $\bm \theta _{1}^{(t,j)}$ from the value memory $\bm M^{v}_{t}$ using the attention weights $w_{jl}$ and then passes it through $k$ stacked hidden layers:
\begin{align*}
\bm \theta _{1}^{(t,j)} &= \sum _{l=1}^{N} w_{jl}\left(\bm M^{v}_{tl}\right)^\top \tag{20}
\end{align*}
\begin{align*}
\bm \theta _{k^{\prime }}^{(t,j)}&= \tanh \left(\bm W^{(\theta _{k^{\prime }})} \bm \theta _{{k^{\prime }}-1}^{(t,j)} + \bm \tau ^{(\theta _{k^{\prime }})}\right)\tag{21}\\
\theta ^{(t,j)} &= \sum _{l=1}^{N} w_{jl} \theta _{{k^{\prime }}l}^{(t,j)} \tag{22}
\end{align*}
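The student network (20)–(22) can be sketched as follows; we assume here that the last hidden layer has $N$ units so that (22) can take an attention-weighted sum over the latent skills. Names and shapes are illustrative.

```python
import numpy as np

def student_ability(w, M_v, hidden_Ws, hidden_taus):
    """Sketch of the student network (20)-(22).
    `w` (N,) are the attention weights for item j and `M_v` (N x d_v) the value
    memory at time t; the last hidden layer is assumed to have N units."""
    theta = w @ M_v                              # read content (20)
    for W, tau in zip(hidden_Ws, hidden_taus):   # k stacked tanh layers (21)
        theta = np.tanh(W @ theta + tau)
    return float(w @ theta)                      # attention-weighted ability (22)

# Toy usage: two hidden layers, the second with N units.
rng = np.random.default_rng(2)
N, d_v = 5, 16
theta = student_ability(np.full(N, 1.0 / N), rng.normal(size=(N, d_v)),
                        [rng.normal(size=(d_v, d_v)), rng.normal(size=(N, d_v))],
                        [np.zeros(d_v), np.zeros(N)])
print(theta)
```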
C. Prediction of Student Response to an Item
The proposed method predicts a student's response probability to an item using the difference between the student's ability $\theta ^{(t,j)}$ and the item and skill difficulties $\beta _{\text{item}}^{j}$ and $\beta _{\text{skill}}^{j}$, as in IRT:
\begin{align*}
P_{jt} = \sigma \left(3.0* \theta ^{(t,j)}-(\beta _\mathrm{item}^{j}+\beta _\mathrm{skill}^{j})\right). \tag{23}
\end{align*}
After this procedure, the latent value memory $\bm M^{v}_{t}$ is updated using the embedding vector $\bm v_{t}$ of the student's latest response, as in DKVMN:
\begin{align*}
\bm e_{t}&= \sigma (\bm W^{e} \bm v_{t} + \bm \tau ^{e}) \tag{24}\\
\bm a_{t}&= \tanh (\bm W^{a} \bm v_{t} + \bm \tau ^{a}) \tag{25}\\
\tilde{\bm M}_{t+1,l}^{v}&= \bm M_{t,l}^{v} \otimes (1- w_{jl}\bm e_{t})^\top \tag{26}\\
\bm M_{t+1,l}^{v}&= \tilde{\bm M}_{t+1,l}^{v} + w_{jl}\bm a_{t}^\top. \tag{27}
\end{align*}
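A minimal sketch of the prediction (23) and the erase-add memory update (24)–(27) follows; the parameter names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_and_update(theta, beta_item, beta_skill, w, M_v, v_t,
                       W_e, tau_e, W_a, tau_a):
    """Sketch of the response prediction (23) and the DKVMN-style memory
    update (24)-(27) used by the proposed Deep-IRT."""
    P_jt = sigmoid(3.0 * theta - (beta_item + beta_skill))   # (23)
    e_t = sigmoid(W_e @ v_t + tau_e)                         # erase vector (24)
    a_t = np.tanh(W_a @ v_t + tau_a)                         # add vector (25)
    M_new = M_v * (1.0 - np.outer(w, e_t))                   # forget step (26)
    M_new = M_new + np.outer(w, a_t)                         # add step (27)
    return P_jt, M_new
```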
In general, deep-learning-based methods learn their parameters using the back-propagation algorithm by minimizing a loss function. The loss function of the proposed method employs cross-entropy, which reflects classification errors. The cross-entropy between the predicted responses $P_{jt}$ and the actual responses $u_{jt}$ is
\begin{equation*}
\ell (u_{jt},{P}_{jt}) = - \sum _{t} \left(u_{jt} \log {{P}_{jt}} + (1- u_{jt}) \log (1-{{P}_{jt}})\right). \tag{28}
\end{equation*}
Deep-IRT With Hypernetwork
The preceding section described the proposed Deep-IRT method with independent student and item networks [23]. However, room for improvement of the prediction accuracy remains because the parameters that control the degree of forgetting of the past latent value memory $\bm M^{v}_{t}$ are optimized only from the current input data: the student's latest response to an item.
One simple solution is to add new weight parameters that balance the current input data and the past latent variables at each time. However, this increases the number of weight parameters as the learning process progresses and often yields too many parameters to estimate reliably.
Recent reports of studies conducted in the field of NLP have proposed extension components to LSTM [16] in the form of mutual gating of the current input data and earlier hidden variables [24]. These extension components are called hypernetworks. In standard LSTM [16], the hidden variables change with time, but the weights used to update them are fixed values that are not optimized for each time point. To resolve this difficulty, various hypernetworks have been proposed to support the main recurrent neural network by optimizing the nonshared weights for each time point in the hidden layers [24], [26], [36], [37], [38], [39], [40]. Their results demonstrate that LSTM with a hypernetwork works better than the standard LSTM [16]. Furthermore, Melis et al. [26] earlier proposed the “Mogrifier component,” which is a kind of hypernetwork for LSTM in the field of NLP. Mogrifier scales the hidden variables using not only the current inputs but also the output of the hidden variable at the earlier time point. They reported that LSTM with the Mogrifier component outperforms other methods for long input data lengths.
Inspired by the results obtained from those studies, we incorporate a novel hypernetwork into the memory updating component (in Fig. 2), which updates the latent variable $\bm M^{v}_{t}$ that stores the student's knowledge state.
Fig. 3 presents the proposed hypernetwork architecture and the memory updating component of the proposed method. The hypernetwork optimizes the degree of forgetting of past data in the proposed Deep-IRT and improves prediction accuracy while retaining parameter interpretability. Specifically, before the method updates the latent variable $\bm M^{v}_{t}$, the hypernetwork estimates the optimal forgetting parameters by balancing both the current input data and the past latent variables.
Memory updating component of the proposed Deep-IRT with hypernetwork. The proposed hypernetwork is located at the beginning of the memory updating component. It estimates the optimal forgetting parameters by balancing both the current input data and the past latent variable before the model updates the latent variable.
A. Hypernetwork
In the memory updating components of DKVMN and Deep-IRT [3], [4], the forgetting parameters are optimized only from current input data. Therefore, their value memory $\bm M^{v}_{t}$ only insufficiently reflects the past learning history, which might degrade prediction accuracy.
The proposed hypernetwork structure is located at the beginning of the memory updating component (see Fig. 3). The inputs of the hypernetwork are the embedding vector $\bm v_{t}$ of the current input data and the past latent variables $\bm M^{v}_{t}, \bm M^{v}_{t-1}, \ldots, \bm M^{v}_{t-\lambda }$. First, the hypernetwork aggregates the past $\lambda$ latent variables as
\begin{equation*}
\tilde{\bm {M}}^{v}_{t} = {\begin{cases}\bm M^{v}_{t} & (\lambda =0) \\
\sigma (\bm W [\bm M^{v}_{t},\bm M^{v}_{t-1},\ldots,\bm M^{v}_{t-\lambda }]+\bm \tau) & (\text{otherwise}). \end{cases}} \tag{29}
\end{equation*}
Then, the hypernetwork mutually gates the current input embedding $\bm v_{t}$ and the aggregated memory for $r$ rounds ($r^{\prime } = 1, \ldots, r$), where $\delta _{1}$ and $\delta _{2}$ are tuning parameters and $\odot$ denotes elementwise multiplication:
\begin{align*}
\tilde{\bm v}_{t}^{r^{\prime }} =& \delta _{1}*\sigma (\bm W^{v} \tilde{\bm {M}}^{vr^{\prime }-1}_{t})\odot \bm v_{t}^{r^{\prime }-1}\tag{30}
\\
\tilde{\bm {M}}^{vr^{\prime }}_{t} =& \delta _{2}*\sigma (\bm W^{M} \tilde{\bm v}_{t}^{r^{\prime }}) \odot \tilde{\bm {M}}^{vr^{\prime }-1}_{t} \tag{31}
\end{align*}
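The following sketch illustrates our reading of the hypernetwork (29)–(31); it flattens the value memory into a single vector for simplicity, and the names, shapes, and flattening are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hypernetwork(v_t, past_memories, W_lam, tau_lam, W_v, W_M,
                 delta1, delta2, r):
    """Sketch of the proposed hypernetwork (29)-(31), in a simplified form that
    flattens the value memory. `past_memories` is the list
    [M_t, M_{t-1}, ..., M_{t-lam}] of flattened past value memories; `r` is the
    number of mutual-gating rounds; `delta1`, `delta2` are tuning parameters."""
    if len(past_memories) == 1:                       # lam = 0: no aggregation (29)
        M = past_memories[0]
    else:                                             # aggregate lam past memories (29)
        M = sigmoid(W_lam @ np.concatenate(past_memories) + tau_lam)
    v = v_t
    for _ in range(r):                                # r rounds of mutual gating
        v = delta1 * sigmoid(W_v @ M) * v             # scale input by memory (30)
        M = delta2 * sigmoid(W_M @ v) * M             # scale memory by input (31)
    return v, M
```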
B. Memory Updating Component
Next, we estimate the forgetting parameters $\bm e_{t}^{(l)}$, the gate vectors $\bm z_{t}^{(l)}$, and the adding vectors $\bm a_{t}^{(l)}$ from the hypernetwork outputs $\tilde{\bm v}^{r}_{t}$ and $\tilde{\bm M}^{vr}_{t,l}$ as
\begin{align*}
\bm e_{t}^{(l)} =& \sigma (\bm W^{e1}\tilde{\bm v}^{r}_{t}+\bm W^{e2}\tilde{\bm {M}}^{vr}_{t,l}+\bm \tau ^{e})\tag{32}
\\
\bm z_{t}^{(l)} =& \sigma (\bm W^{z1}\tilde{\bm v}^{r}_{t}+\bm W^{z2}\tilde{\bm {M}}^{vr}_{t,l}+\bm \tau ^{z})\tag{33}
\\
\bm a_{t}^{(l)} =& \tanh (\bm W^{a1}\bm z_{t}^{(l)}+\bm W^{a2}\tilde{\bm {M}}^{vr}_{t,l}+\bm \tau ^{a}). \tag{34}
\end{align*}
Finally, the value memory is updated as
\begin{align*}
\bm M_{t+1,l}^{v} = \tilde{\bm {M}}^{vr}_{t,l}\otimes (1- w_{jl}\bm e_{t}^{(l)})^\top + w_{jl}\bm a_{t}^{(l)\top }. \tag{35}
\end{align*}
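A per-slot sketch of the memory updating component (32)–(35) that consumes the hypernetwork outputs is shown below; the dictionary of weights is a naming convenience of ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update_memory_with_hypernet(v_r, M_r, w, params):
    """Sketch of the memory updating component (32)-(35) that follows the
    hypernetwork. `v_r` is the gated input, `M_r` (N x d_v) the gated value
    memory, `w` (N,) the attention weights; `params` holds the weight matrices."""
    M_next = np.empty_like(M_r)
    for l in range(M_r.shape[0]):                 # update each memory slot
        m_l = M_r[l]
        e = sigmoid(params["W_e1"] @ v_r + params["W_e2"] @ m_l + params["tau_e"])  # (32)
        z = sigmoid(params["W_z1"] @ v_r + params["W_z2"] @ m_l + params["tau_z"])  # (33)
        a = np.tanh(params["W_a1"] @ z + params["W_a2"] @ m_l + params["tau_a"])    # (34)
        M_next[l] = m_l * (1.0 - w[l] * e) + w[l] * a                               # (35)
    return M_next
```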
By optimizing the forgetting parameters using both the current input data and the past latent variables, the proposed method reflects the long-term learning history and is expected to improve prediction accuracy.
Predictive Accuracy
A. Datasets
We conduct experiments to compare the performances of the proposed Deep-IRT in Section III (designated as “Proposed-DI”) and the proposed Deep-IRT with a hypernetwork in Section IV (designated as “Proposed-HN”) against existing methods. This section presents a comparison of the prediction accuracies for student performance of the proposed methods with those of earlier methods (DKVMN [4], Yeung's Deep-IRT [3], designated as “Yeung-DI,” and AKT [5]) using six benchmark datasets: ASSISTments2009, ASSISTments2015, ASSISTments2017, Statics2011, Junyi, and Eedi. The ASSISTments datasets, collected from online tutoring systems, have been used as the standard benchmark for KT methods. The Statics2011 dataset was collected from college-level engineering courses on statics. The Junyi dataset was collected by Junyi Academy, a Chinese e-learning website [41]. We use only the students' exercise records in the math curriculum. In addition, we select items that the students attempted for the first time without hints. We also changed the question types into unique skill number tags. The Eedi dataset includes data from the school years of 2018–2020, with student responses to mathematics questions from Eedi, a leading educational platform with which millions of students interact daily around the globe [42]. For Eedi, each item has a list of hierarchical knowledge components. We convert these lists into unique skill number tags.
ASSISTments2009, ASSISTments2017, and Eedi have both item and skill tags, although most methods in the relevant literature adopt only the skill tag as an input. However, methods with skill inputs rely on the assumption that items with the same skill are equivalent [5]. That assumption does not hold when item difficulties within the same skill differ greatly. Therefore, as inputs to AKT and the proposed methods, we employ not only skills but also items [5], [23], [27]. For ASSISTments2015, Statics2011, and Junyi, which have only skill tags, we employ the skill as input data. Table I presents the number of students (No. students), the number of skills (No. skills), the number of items (No. items), the rate of correct responses (Rate correct), and the average length of the items that students addressed (Learning length).
B. Hyperparameter Selection and Evaluation in Deep-IRT
We used standard fivefold cross-validation to evaluate the respective prediction accuracies of the methods. Following Ghosh et al. [5], for each fold, 20% of the learners are used as the test set, 20% as the validation set, and 60% as the training set.
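A minimal sketch of such a learner-level 60/20/20 split per fold is shown below (assuming scikit-learn; this is an illustration, not the authors' exact splitting code).

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

learner_ids = np.arange(1000)                            # toy learner identifiers
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for trainval_idx, test_idx in kf.split(learner_ids):     # 20% of learners per fold as test
    train_idx, val_idx = train_test_split(trainval_idx, test_size=0.25,
                                          random_state=0)
    # 0.25 of the remaining 80% gives 20% validation and 60% training learners.
```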
For all methods, we chose batch sizes from
In addition, for the earlier methods, we used the hyperparameters reported from the earlier studies [3], [4], [5]. Additionally, we set 200 items as the upper limit of the input length according to the earlier studies [3], [4], [5]. When the input length of items is greater than 200, we use the first 200 response data for all methods.
To ascertain the number of layers $k$ in the student and item networks, we compared the AUC scores for ASSISTments2009 while varying the number of layers, as shown in the following figure.
AUC and the number of layers for ASSISTments2009. The vertical axis shows AUC on the left side. The horizontal axis shows the number of layers.
If the predicted correct answer probability for the next item is 0.5 or more, then the student's response to the next item is predicted as correct. Otherwise, the student's response is predicted as incorrect. For this study, we leverage three metrics for prediction accuracy: 1) accuracy (Acc) score, 2) AUC score, and 3) loss score [43], [44]. The first, Acc, represents the concordance rate between the predicted responses and the actual responses. The second, AUC, provides a robust metric for binary prediction evaluation. When the AUC score is 0.5, the prediction performance is equal to that of random guessing. Loss represents the cross-entropy in (28).
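For instance, the three metrics can be computed as follows (a sketch with toy arrays, assuming scikit-learn).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

# `y_true` are observed responses; `p_pred` are predicted correct-response probabilities.
y_true = np.array([1, 0, 1, 1, 0])
p_pred = np.array([0.8, 0.3, 0.6, 0.4, 0.2])

acc = accuracy_score(y_true, (p_pred >= 0.5).astype(int))  # threshold at 0.5
auc = roc_auc_score(y_true, p_pred)                        # AUC score
loss = log_loss(y_true, p_pred)                            # cross-entropy as in (28)
print(acc, auc, loss)
```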
We used a Tesla T4 GPU to train all methods.
C. Hyperparameter Selection in Hypernetwork
1) Optimal Tuning Parameters $\delta _{1}$ and $\delta _{2}$ Estimation
For our experiments, we optimize the tuning parameters $\delta _{1}$ and $\delta _{2}$ in (30) and (31) using the validation data.
2) Optimal Number of Rounds r Estimation
To ascertain the number of rounds $r$ in the hypernetwork, we compared the prediction accuracies obtained while varying $r$.
We select the number of rounds $r$ that provides the best performance on the validation data and use it in the following experiments.
3) Optimal Degree of Past Latent Variables to be Assessed
The input of the hypernetwork includes the past $\lambda$ latent variables in (29). We ascertain the optimal value of $\lambda$ in the same manner.
D. Results
1) Skill Inputs
The respective values of Acc, AUC, and Loss for all benchmark datasets with only skill inputs are presented in Table III, together with the standard deviations across the five test folds. Proposed-DI and Proposed-HN represent the variants of the proposed method without and with the hypernetwork, respectively.
Results show that the averages of AUC, Acc, and Loss obtained using Proposed-DI are better than those using Yeung-DI, which is the earlier Deep-IRT method, although the proposed method separates student and item networks. This result implies that redundant deep student and item networks function effectively for performance prediction. These results are explainable from reports of state-of-the-art methods [20], [21], [22].
Also, Proposed-HN, which optimizes the forgetting parameters in the hypernetwork, provides the best average scores for all metrics. Proposed-HN improves the prediction accuracy of Proposed-DI. In fact, Proposed-HN outperforms AKT, which was reported as having the highest accuracies among earlier methods. For each dataset, results indicate that Proposed-HN provides the best AUC scores for ASSISTments2009, ASSISTments2017, Statics2011, and Junyi. Especially, for ASSISTments2017 with long learning lengths, the performance of the Proposed-HN markedly outperforms that of AKT. By contrast, Proposed-HN tends to have lower prediction accuracies for ASSISTments2015 with a shorter learning length than AKT has. Results suggest that the proposed hypernetwork functions effectively, especially for datasets with long learning lengths.
To investigate the reason for that phenomenon, we analyze the forgetting parameters $\bm e_{t}^{(l)}$ estimated by the proposed hypernetwork.
Findings indicate that AKT provides the best performance for ASSISTments2015. However, the AKT performance results are worse than those of Proposed-HN for ASSISTments2017. Fig. 5 shows the average of the attention weights of all students for the 200 items in ASSISTments2017. The vertical axis shows the average of the attention weights. The horizontal axis shows the number of items the student addressed. Fig. 5 shows that the attention weights assigned to past items decay rapidly as the learning process becomes longer, which suggests that AKT only insufficiently reflects long-past learning data.
2) Item and Skill Inputs
Furthermore, we compared the performances of the proposed methods with those of AKT for ASSISTments2009, ASSISTments2017, and Eedi with item and skill inputs according to the work in [5]. The respective values of Acc, AUC, and Loss are presented in Table V. Results indicate that the Proposed-HN provides the best performance for all metrics: averages of AUC, Acc, and Loss. For each dataset, the Proposed-HN provides the best scores for ASSISTments2009 and for ASSISTments2017. As described earlier, the Proposed-HN greatly outperforms AKT for ASSISTments2017 with a long learning length because the proposed hypernetwork functions effectively. However, for Eedi, AKT provides the best scores for all the metrics. In fact, AKT with item and skill inputs provides higher performance than those achieved using only skill inputs, as shown in [5]. In contrast, the proposed methods with item and skill inputs do not necessarily outperform those with only skill inputs. The reason might be that input item information cannot be used effectively because the latent value memory
Moreover, we experimented with TIRT [11], [12]. It is a hidden Markov IRT with a parameter to forget past response data, as described earlier in Section II-A. IRT-based methods rely on an assumption of local independence among the student item responses. They should not be applied to learning processes that allow a student to respond to the same item repeatedly. Therefore, we employ not skills but items as inputs using ASSISTments2009 and ASSISTments2017. In addition, we decompose these datasets into their respective skill groups and estimate the parameters from skill data independently because TIRT assumes a single-dimension skill of the ability. In other words, TIRT predicts performance using only an ability corresponding to one skill for an item. To estimate the student ability and item parameters of TIRT, we use the expected a posterior estimators using the Markov chain Monte Carlo method [45]. The results indicate that AUC is 80.38, Acc is 76.39, and Loss is 0.49 for ASSISTments2009. For ASSISTments2017, results show that AUC is 75.52, Acc is 84.71, and Loss is 0.46. Surprisingly, TIRT outperforms AKT with skill input for ASSISTments2017. That finding suggests that TIRT might estimate the student ability transition accurately. For the Eedi dataset, TIRT cannot complete the calculations within 24 h because of its data size.
E. Computational Costs
This section presents an investigation of the computational costs associated with each method. Concretely, we calculated the number of trainable parameters and training time for each method. We measured the training time for each partition in fivefold cross-validation. Table VI shows the number of trainable parameters and the average training times. According to Table VI, AKT has the largest number of parameters. It requires the longest computation training time. Proposed-DI and Proposed-HN can be trained more quickly than AKT can. Although Proposed-HN has more parameters than Proposed-DI does, the training times are comparable. In addition, DKVMN and Yeung-DI, which have relatively small numbers of parameters, were trained more quickly than the proposed methods and AKT. The computational times of DKVMN are almost identical to those of Yeung-DI because they have similar network structures.
Parameter Interpretability
A. Estimation Accuracy of Ability Parameters
In the preceding section, we showed that the proposed method has higher prediction accuracy than other methods. As described in this section, to evaluate the interpretability of the ability parameters of the proposed method, we use simulation data to compare the parameter estimates with those of the earlier Deep-IRT [3]. These datasets are generated from TIRT [11], [12]. The prior of
We evaluate Pearson's correlation coefficients, Spearman's rank correlation coefficients, and Kendall rank correlation coefficients between the true ability parameters of the true model (TIRT) and the estimated ability parameters of the Deep-IRTs (Yeung-DI, Proposed-DI, and Proposed-HN) [46], [47]. Spearman's rank correlation is the nonparametric version of Pearson's correlation. The Kendall rank correlation coefficient is known to provide robust estimates for aberrant values [48]. Generally, the estimation accuracy of ability parameters is evaluated using the root-mean-square error (RMSE). However, a student's ability in TIRT does not follow a standard normal distribution because the student ability distribution differs at each time. We are unable to evaluate RMSE in this experiment because TIRT, the earlier Deep-IRT method [3], and the proposed methods are unable to standardize their student abilities.
We calculate each correlation coefficient between the true abilities and the estimated abilities across all students and time points.
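The three agreement measures can be computed, for example, with SciPy (toy arrays for illustration).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# True TIRT abilities versus abilities estimated by a Deep-IRT variant (toy values).
true_theta = np.array([0.2, -1.0, 0.5, 1.3, -0.4])
est_theta = np.array([0.1, -0.8, 0.7, 1.0, -0.6])

print(pearsonr(true_theta, est_theta)[0])    # Pearson's correlation
print(spearmanr(true_theta, est_theta)[0])   # Spearman's rank correlation
print(kendalltau(true_theta, est_theta)[0])  # Kendall rank correlation
```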
To confirm the significance of the differences between the proposed methods and Yeung-DI, we applied the Tukey–Kramer multiple comparison test [49].
Results show that, for all conditions, Proposed-DI and Proposed-HN provide a stronger correlation with the true ability parameters than Yeung-DI does. The results of Spearman's rank correlation coefficients of the proposed method are greater than those of Pearson's correlation coefficients because the student's ability distribution changes constantly over time in TIRT. Especially, the results obtained for Kendall rank correlation coefficients suggest that Proposed-DI and Proposed-HN estimate the abilities robustly, even for aberrant values. The results demonstrate that the two proposed independent networks function effectively to provide appropriate interpretability of the estimated parameters. Moreover, the students' ability parameters are estimated accurately with sufficient information from past learning history data because the hypernetwork optimized the forgetting parameters using both current input data and past data. Furthermore, the proposed methods tend to produce stronger correlations as the number of items increases. These findings suggest that the proposed methods represent the true student's ability transition accurately in long learning processes.
B. Student Ability Transitions
This section shows student ability transitions using the proposed method.
First, we visualized the student ability parameters for the underlying skills in the student network of Proposed-HN. Fig. 6 presents an example of the student ability transition for each underlying skill during the learning process.
Examples of student ability transitions for the underlying skills.
Next, we evaluated the interpretability of the ability parameters of the proposed method by visualizing the ability transition.
Visualizing the ability transition for each skill is helpful for both students and teachers because it can reveal student strengths and weaknesses and can improve the learning method to fill in the learning gaps. Yeung [3] demonstrated a student ability transition for each skill using Yeung-DI. However, their results included some counterintuitive ability estimates. For example, even when the student answered incorrectly, the corresponding student ability estimate increased. Moreover, Yeung-DI cannot identify a relation among multidimensional skills. In some cases, a student's ability for low-level skills decreases even when the student responds correctly to items for high-level skills.
Fig. 7 depicts an example of student ability transitions for each skill estimated using Yeung-DI and Proposed-HN for ASSISTments2009, following earlier studies [3], [27]. The vertical axis shows the student's ability value on the right side. The horizontal axis shows the item number. The student response is shown by a filled circle “•” when the student answers the item correctly and by a hollow circle “◦” when the student answers incorrectly.
Example of a student ability transition from the ASSISTments2009 dataset. The skill inputs are classified, respectively, as ordering fractions (orange), equation solving more than two steps (gray), equation solving two or fewer steps (green), finding percentages (yellow), and finding percentages (orange). The filled and the hollow circles, respectively, represent correct and incorrect responses.
For Yeung-DI, as described in earlier reports [3], some of the ability changes might be inconsistent with the response data. For instance, the ability of the skill “equation solving more than two steps” (gray), which is a higher-level skill, decreases even though the student responds correctly to items 11–17. In another instance, a student's ability for the lower-level skill “equation solving two or fewer steps” (green) decreases even when the student responds correctly to items for higher-level skills. These unstable behaviors of Yeung-DI might engender severe difficulties as a student model and consequently confuse students and teachers.
In contrast, Fig. 7 shows that Proposed-HN can provide estimates that accurately reflect the student responses. Additionally, it can estimate relations among the skills. Therefore, when a student responds to an item, not only the corresponding skill ability but also those of other skills change. In particular, because the skills “equation solving more than two steps” (gray) and “equation solving two or fewer steps” (green) are similar, the ability changes of these skills also show a strong correlation. Consequently, the results demonstrate that the proposed method improves the interpretability of Yeung-DI.
It is noteworthy that the student's responses are not immediately reflected in the estimated ability when the student gives a response that differs from the several immediately preceding identical responses. For example, the ability for “finding percents” (yellow) increases at items 18–19 despite incorrect responses because Proposed-HN estimates the student's ability using the past responses. The estimated ability values then change with a slight delay.
Conclusion
This study proposed a novel Deep-IRT that models a student's response to an item using two independent redundant networks: 1) a student network and 2) an item network. Because of the two independent redundant neural networks, the parameters of the proposed method can be interpreted to a considerable degree while maintaining high prediction accuracy. Furthermore, we improved the prediction accuracy of the proposed method by combining it with a novel hypernetwork. In the earlier memory updating component, the forgetting parameters, which control the degree of forgetting of the past latent value memory, are optimized only from the current input data. That restriction might degrade the prediction accuracy of the Deep-IRT because the value memory only insufficiently reflects the past learning information. The proposed hypernetwork estimates the optimal forgetting parameters by balancing both the current input data and the past latent variables.
Experiments conducted with the benchmark datasets demonstrated that the proposed method improves both the ability parameter interpretability and the prediction accuracies of the earlier KT methods. Especially, results showed that the proposed method with the hypernetwork is effective for tasks with a long-term learning process. Experiments for the simulation dataset demonstrated that the proposed method provides stronger correlations with true parameters of TIRT than the earlier Deep-IRT method. Furthermore, the proposed method estimates the abilities robustly, even with aberrant values.
This study employed slightly redundant deep networks compared with earlier methods. In future work, we intend to investigate the performance of the proposed method with more redundant and deeper networks. In addition, we will try to optimize the hypernetwork to maximize the prediction accuracy for large datasets. Most recently, some studies have shown that items' characteristics differ according to their texts, even when they require the same skill. To address this, those studies proposed KT methods that estimate the relation between an item's text content and the student's performance using NLP techniques or graph neural networks [50], [51], [52], [53], [54], [55], [56]. In future work, we expect to incorporate the item's text content into the proposed method to improve the student performance prediction accuracy. Furthermore, deep-learning approaches for KT have been used for computerized adaptive testing (CAT) [57], [58]. The main purpose of CAT is the measurement of student ability in personalized tests for online education. Therefore, we infer that the proposed method might be effective for CAT because it can estimate students' abilities accurately.