Introduction
Recommender systems [1] are essential for the success of many online applications. Consider online shopping websites as an example: these sites offer an enormous number of goods, and users cannot browse the information about all of them in a short time. In this context, recommender systems, as an effective kind of information filtering tool, not only help users obtain more valuable suggestions by filtering out redundant information but also gradually increase the sales volume of the websites. As a result, recommender systems have already been integrated into some large-scale websites (e.g., Amazon), which continually serve thousands of people.
To date, different kinds of recommendation tasks have been extensively investigated in academia and industry, including rating prediction [11], [13], [25], Top-N recommendation [30], [32], [35], click-through rate prediction [12], [27], [29], etc. These tasks help people obtain useful information from large amounts of varied data, which conforms to the actual usage scenarios of industrial applications. In the past few decades, researchers have spent considerable effort on recommender design and have achieved great results. Collaborative filtering (CF) [1] is one of the great inventions in recommendation research and has been successfully used in industrial applications. In contrast to traditional CF recommenders, which depend on calculating the similarity between users/items with similar preferences, matrix factorization (MF) is a popular CF recommender for rating prediction tasks [11]. MF decomposes the original rating matrix R into two low-rank matrices, which represent the latent feature spaces of users and items. Due to their effectiveness, many variants of the MF method have been proposed [2], [9]. Recently, with the success of deep neural networks, combining recommenders with deep learning methods has become a new breakthrough.
In this context, CF recommenders combined with deep learning methods have attracted much attention from academia and industry. Deep learning methods have been successfully applied and have achieved satisfactory results in many fields, such as image processing, speech recognition and natural language processing [3], [18]. Research results have demonstrated that neural networks have a powerful ability to learn latent features from heterogeneous data and obtain reasonable results [4], [20]. Among them, AutoEncoders (AE) and Multi-Layer Perceptrons (MLP) have been widely used for recommendation recently. However, each has its own advantages and disadvantages when designing a recommendation model. An AE is a common and effective method that reconstructs its input data in the output layer. The core idea of using an AE for recommendation is to predict users’ preferences by compressing an input vector and then make recommendations. However, most existing studies using AEs mainly focus on the feature representation of users and items separately, without considering the interaction between them [11], [25], [26]. An MLP is a feed-forward neural network with multiple hidden layers, which is good at learning hierarchical feature representations effectively (e.g., the interaction between users and items in recommendation) but lacks the ability to extract features from users and items separately [9], [32]. This means that the feature representations of users and items can be affected by each other. Therefore, adopting an AE or MLP alone cannot achieve the goals of latent feature representation and fusion simultaneously.
To address these issues, this paper proposes a deep collaborative conjunctive recommender (DCCR). We focus on the rating prediction task and formulate it as a regression problem. By taking advantage of deep learning and traditional CF methods, we extract latent features from users’ explicit ratings of items without any additional information. We first present related studies of deep learning methods and traditional methods and analyze the rating prediction process. We then describe the feature representation of users and items and explain the proposed model. The techniques that we apply and the working mechanism are detailed for better feature extraction with consideration of the interaction between users and items. Particular experimental settings and procedures are defined to obtain reasonable results. We measure the results with the root mean squared error and the mean absolute error. To obtain the optimal results, experiments are conducted with varying parameter settings. The experimental results are illustrated to understand the impacts of many factors. We also compare the accuracy of our proposed method with other related and recent methods. The main contributions of this paper include:
We present a novel recommender model that extracts deep inner features of both users and items that solely depend on the explicit ratings and extract the interaction features. We describe the details of the structure, input vector, loss function and training techniques, which are indispensable for the experiments.
We investigate the impacts of the parameters of the proposed model and analyze the relations of these parameters to the prediction accuracy. We also provide possible measures for improving the results from different perspectives. An improved activation function for our neural networks is proposed, which can be specified according to the input vectors.
By conducting extensive experiments on two datasets, we show that the proposed model achieves better accuracy for this particular rating prediction task. We also discuss the expandability of our model by analyzing the depth of the neural networks. Several methods are discussed to mitigate the gradient problems of deep neural networks.
Related Studies
Collaborative filtering (CF) has been widely used to provide users with new products and services in many industrial applications. CF recommends products favored by similar users or chooses products similar to users’ favorite products. The matrix factorization model is the most important CF method and has been explored by many researchers. Among the different kinds of MF models, the latent factor model (LFM) is the most popular model for rating prediction tasks. The LFM factorizes the rating matrix R into two low-rank latent factor matrices. However, the manual process of feature extraction consumes manpower and financial resources. Recently, deep learning methods have shown that neural networks have a powerful ability to automatically learn features from heterogeneous data and obtain reasonable results for most tasks [4], [20]. Therefore, to improve the prediction accuracy by learning deep inner user/item features, CF combined with neural network methods has been proposed in many papers.
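For illustration only, the following minimal sketch (ours, not taken from the LFM papers cited above) factorizes a rating matrix into user and item latent factors with stochastic gradient descent over the observed entries; the factor dimension, learning rate and regularization weight are placeholder values.

```python
import numpy as np

def train_lfm(R, k=20, lr=0.01, reg=0.02, epochs=50):
    """Toy latent factor model: approximate R with P @ Q.T on observed entries (0 = unknown)."""
    M, N = R.shape
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((M, k))   # user latent factors
    Q = 0.1 * rng.standard_normal((N, k))   # item latent factors
    users, items = np.nonzero(R)            # indices of known ratings
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q  # predicted rating for (u, i): P[u] @ Q[i]
```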
As one of the most effective deep learning methods in recommendation, autoencoders have been discussed in several papers [12]–[15]. An autoencoder is an unsupervised learning method that can automatically compress the input features to a low dimension and has shown clear advantages in feature extraction compared with traditional methods [22]. Different kinds of autoencoders have been proposed for different scenarios, such as denoising autoencoders (DAE) [4], [5], marginalized autoencoders [6], [12] and contractive autoencoders [26]. Many researchers have successfully applied these models in recommender systems. Reference [12] combines collaborative filtering with marginalized denoising autoencoders for rating prediction and click prediction. Reference [25] employs stacked denoising autoencoders (SDAE) to extract features from side information to predict ratings. Some studies combine autoencoders with traditional methods. In [26], the authors present two new hybrid models, named AutoSVD and AutoSVD++, by integrating contractive autoencoders (CAE) into the matrix factorization models SVD and SVD++ [11]. The authors utilize a CAE to represent item side information with nonlinear features. In [30], the authors build a hybrid collaborative filtering model that combines an SDAE and MF to learn from both the user-item rating matrix and the side information of users and items. Although an autoencoder is an effective method for compressing an input vector to predict users’ preferences and make recommendations, these studies usually focus on the feature representation of users and items separately without considering the interaction between users and items.
To address this issue, the multilayer perceptron, another neural network model, has been applied to many industrial recommender systems. The multilayer perceptron combines the features of users and items, which have been extracted by neural networks, to achieve better recommendation. However, most of these methods focus on content processing, such as reviews. Typically, the reviews of users and items are employed as input data, and a joint deep model is built to merge the features [23]. Some works apply co-attention mechanisms [24] to learn a distributed representation from user and item reviews. Side information, such as categorical information about users and items, is applied in many papers to improve the prediction accuracy and has been applied for multiple tasks, especially top-n prediction. Reference [31] combines the linearity of Factorization Machines (FM), which represent the feature interactions, with the nonlinearity of networks that extract features from high-order interactions such as categorical variables. Some researchers have tried to employ side information in traditional methods; however, most recommendation research addresses deep learning. Reference [27] designs a novel deep interest network that uses embeddings and a multilayer perceptron (MLP) to learn the representation of user interests from historical behaviors. Reference [29] merges the features from a wide network branch and a deep network branch into one model to predict the click rate. Reference [32] replaces the inner product of MF with neural networks and separately fuses the linear and nonlinear features from generalized matrix factorization and an MLP. Reference [9] slightly changes the input vector and loss function and employs cosine similarity to measure users’ preferences. In addition to text and categorical data as side information, MLPs and their derivatives can also extract features from media data, such as text content, for recommendation [5]. Side information can be easily obtained by a commercial company, but some information is highly sensitive. As we can see, these studies have more stringent requirements for the input data and are not appropriate for every dataset, which indicates that a substantial amount of effort must be spent on adjusting the inputs. Besides, these related works usually focus on the interactions but do not extract features from users and items separately, which means the feature representations of users and items can be affected by each other.
Thus, we propose a new model for recommender systems that can separately represent user and item features and merge them to make predictions more accurate. We focus on the feature representation, algorithm and model design to make better recommendations instead of using more information from specific datasets. With the techniques described below, this new model can outperform related methods on different datasets.
Architecture of DCCR Model
In this section, we describe the proposed DCCR model. To take advantage of deep learning in terms of deeper inner feature extraction and fusion, the DCCR is a hybrid architecture that consists of two different kinds of neural network models (i.e., DAE and MLP). The DAEs extract the deeper latent features of users and items from the raw rating data separately, while the main function of the MLP is to merge the user and item features from the outputs of the DAEs at its first layer and to extract higher-level features (i.e., the relationships between users and items) based on the combined user and item features. The architecture of the proposed model for the rating prediction task is shown in Fig. 1.
Fig. 1. Rating prediction of the DCCR model. The input vectors are the feature representations of users and items. The output vector is the predicted ratings.
This model contains two parallel neural networks, which are merged in the fusion part. The first branch of the model represents the latent features from the input vector of the user, while the second branch represents the latent features of the item. The main advantage of this architecture is that it not only separately extracts latent features from users and items, as many other studies do, but also captures the interaction relationships between users and items to improve the prediction performance. The proposed architecture only requires raw rating data for rating prediction tasks.
A. Feature Representation
Raw data collected from industrial applications are noisy with a substantial amount of useless information and cannot be directly applied for recommender tasks. Thus, feature representation is an important step in our study. Based on the theory of collaborative filtering, we define our item and user feature representation as follows:
Given the user id set $\{1,2,\ldots,M\}$ and the item id set $\{1,2,\ldots,N\}$, each known raw rating $\bar{r}_{u,i}$ of user $u$ for item $i$ is first renormalized as \begin{equation*} r_{u,i} =\frac {\bar {r}_{u,i} -\bar {r}_{\min }}{\bar {r}_{\max } -\bar {r}_{\min }} \cdot N_{s} +1, \quad \text{where } \bar {r}_{u,i} \text{ is known}\tag{1}\end{equation*} where $\bar{r}_{\min}$ and $\bar{r}_{\max}$ are the minimum and maximum raw ratings and $N_{s}$ is a scaling factor that determines the upper bound of the rescaled ratings.
Because user $u$ may not have rated item $i$, the unknown ratings are set to zero: \begin{equation*} r_{u,i} =\begin{cases} 0, & \text{if } r_{u,i} \text{ is unknown} \\ r_{u,i}, & \text{otherwise} \end{cases}\tag{2}\end{equation*}
We use the rating data to express the features of items and users. For item $i$, the feature vector is \begin{equation*} I_{i} =[r_{1,i},r_{2,i},\ldots,r_{M,i}]\tag{3}\end{equation*}
For user $u$, the feature vector is \begin{equation*} U_{u} =[r_{u,1},r_{u,2},\ldots,r_{u,N}]\tag{4}\end{equation*}
$M$ usually does not equal $N$; thus, the user and item feature vectors usually have different dimensions.
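To make Eqs. (1)–(4) concrete, here is a small sketch of how the feature vectors could be built with NumPy. It is our own illustration: `n_s` plays the role of $N_{s}$, unknown entries stay 0 as in Eq. (2), and the toy matrix is only for demonstration.

```python
import numpy as np

def renormalize(raw, n_s=4):
    """Eq. (1): map known raw ratings into [1, n_s + 1]; zeros mark unknown ratings (Eq. 2)."""
    known = raw > 0
    r_min, r_max = raw[known].min(), raw[known].max()
    out = np.zeros_like(raw, dtype=np.float32)
    out[known] = (raw[known] - r_min) / (r_max - r_min) * n_s + 1
    return out

raw_ratings = np.array([[5, 0, 3],     # toy M x N rating matrix, 0 = missing
                        [0, 4, 1],
                        [2, 0, 0],
                        [0, 5, 4]])
R = renormalize(raw_ratings)
item_vec = R[:, 0]    # I_1 = [r_{1,1}, ..., r_{M,1}], Eq. (3)
user_vec = R[1, :]    # U_2 = [r_{2,1}, ..., r_{2,N}], Eq. (4)
```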
B. Feature Extracting With Autoencoders
An autoencoder is selected to extract the latent feature from raw rating data. Each branch of the first part of the DCCR is a complete denoising autoencoder with all layers and differs from existing methods that only use the output of hidden layers.
Typically, an autoencoder is an efficient model for compressing a high-dimensional input to a low-dimensional vector. An autoencoder is an unsupervised neural network whose output needs to reconstruct the input. Autoencoders usually have three layers: the input layer, the hidden layer and the output layer. To force the neural network to learn features from the input, the number of output layer units is equal to the number of input layer units, and the number of hidden layer units is smaller than the number of input layer units. We consider the output vector of an autoencoder to be another way to express the implicit features. Unlike the input vector, the output vector is a dense vector without missing values, which makes it more suitable as input data for a subsequent neural network.
However, neural networks can easily overfit due to their large number of parameters. To avoid this issue, a regularized autoencoder (RAE) is used. The RAE adds an L2 regularization term to the loss function, which changes the training process of the network. Taking the user feature vector $U_{u}$ as an example, the loss function is defined as \begin{align*} L\left({U_{u},h\left({U_{u};\theta}\right)}\right)=\sum \limits _{u=1}^{M} \left \|{U_{u} -h\left({U_{u};\theta}\right)}\right \|^{2}+\frac{\lambda}{2}\sum \limits _{l=1}^{L} \left \|{W_{l}}\right \|_{F}^{2}\tag{5}\end{align*} where $h(\cdot;\theta)$ is the reconstruction produced by the autoencoder with parameters $\theta$, $W_{l}$ is the weight matrix of the $l$-th layer and $\lambda$ controls the regularization strength.
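As one possible realization of Eq. (5), the sketch below builds a single-hidden-layer regularized autoencoder in tf.keras; the hidden size, activation choice and λ value are illustrative assumptions rather than the paper's exact settings.

```python
import tensorflow as tf

def build_rae(input_dim, hidden_units=500, lam=0.1):
    """One DCCR branch as a regularized autoencoder: reconstruction loss + (lam/2)*||W||_F^2."""
    reg = tf.keras.regularizers.l2(lam / 2.0)
    inp = tf.keras.Input(shape=(input_dim,))
    hidden = tf.keras.layers.Dense(hidden_units, activation="sigmoid",
                                   kernel_regularizer=reg)(inp)
    out = tf.keras.layers.Dense(input_dim, kernel_regularizer=reg)(hidden)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")   # MSE reconstruction + L2 weight penalty
    return model
```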
Fig. 2. Rating prediction of autoencoders. AutoRec-U and AutoRec-I separately use the user features and item features as input vectors. These two models have different input dimensions, parameter sizes and prediction accuracies.
Classic autoencoders often take dense input; this scenario is an exception. A denoising autoencoder corrupts the input data with some specified noise, modifies the loss function, and requires the network to reconstruct the initial input [5], [22]. This process is termed denoising, and it can render the neural network more robust than the original autoencoder. Let $N(\cdot)$ denote the corruption function; the corrupted user and item vectors are \begin{align*} U'_{u,i}&=N\left({U_{u,i}}\right),\quad \text{for } i \text{ in } \left[{1,N}\right] \tag{6}\\ I'_{u,i}&=N\left({I_{u,i}}\right),\quad \text{for } u \text{ in } \left[{1,M}\right]\tag{7}\end{align*}
Generally, three kinds of noise are extensively employed in many papers: Gaussian noise, masking noise and salt-and-pepper noise. The corruption ratio is usually small and does not destroy the latent features of the input data. In [15], the authors tried different noises with different corruption ratios to improve the model for predicting ratings.
A stacked denoising autoencoder is an efficient way to train a deep neural network [15]. Each layer in a stacked denoising autoencoder is the hidden layer of one autoencoder, with the exception of the input layer and the output layer. The way to train this deep neural network is greedy layer-wise training, as sketched below: first, train each layer one at a time; second, stack all layers into one network and fine-tune the parameters with a backpropagation algorithm. Several recommendation studies use stacked denoising autoencoders [25], [30].
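The greedy layer-wise procedure can be sketched as follows (our simplified illustration; layer sizes, epochs and optimizer are placeholders): each hidden layer is first trained as the hidden layer of its own autoencoder, and the resulting encoders are then stacked and fine-tuned end to end.

```python
import tensorflow as tf

def pretrain_stacked_dae(x, layer_sizes, epochs=20):
    """Greedy layer-wise pretraining: one autoencoder per hidden layer, trained in sequence."""
    encoders, current = [], x
    for units in layer_sizes:
        inp = tf.keras.Input(shape=(current.shape[1],))
        h = tf.keras.layers.Dense(units, activation="sigmoid")(inp)
        out = tf.keras.layers.Dense(current.shape[1])(h)
        ae = tf.keras.Model(inp, out)
        ae.compile(optimizer="adam", loss="mse")
        ae.fit(current, current, epochs=epochs, batch_size=128, verbose=0)
        encoder = tf.keras.Model(inp, h)
        encoders.append(encoder)
        current = encoder.predict(current, verbose=0)   # becomes the next layer's input
    return encoders   # stack these encoders and fine-tune the whole network afterwards
```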
Typically, these studies do not fuse the user features and item features produced by the autoencoders [13]. In our model, the fusion part is as important as the feature extraction part.
C. Feature Fusion With MLP
Directly predicting the unknown ratings from the output of autoencoders is common [13], [25]. However, this approach does not account for any interactions and disregards the relation between a user and an item. Collaborative filtering models these interactions, which affect the prediction results. To address this problem, we use an MLP, adding hidden layers to the model to merge the implicit features from the two autoencoders’ output vectors. In contrast to autoencoders, an MLP (multilayer perceptron) can extract higher-order features, which can be more useful for different tasks. MLPs and their derivatives, such as convolutional neural networks, are widely applied in image processing, where multilayer neural networks capture latent features.
The structure design of an MLP usually follows a tower pattern [16]; the idea is to use fewer hidden units in higher layers. In our experiments, we first tried to use an MLP alone to perform rating prediction by combining the user input vector with the item vector in the first layer. However, this produces inferior predictions, which may explain why few papers predict ratings this way: the MLP extracts features and maps them into other dimensions without considering the range of the ratings, even when the entire output is renormalized. The sparsity of the input vector may be another reason why this approach does not work.
To alleviate this issue, we slightly modify the MLP model. First, we concatenate the two dense output vectors into one vector; second, we feed this vector to the MLP. More precisely, the MLP part of our DCCR is defined as \begin{align*} X'&=f\left({U_{u}',I_{i}'}\right)=\left[{U_{u}',I_{i}'}\right] \\ H_{1}&=f_{1}\left({X'}\right)=a_{1}\left({W_{1}^{T} X'+b_{1}}\right), \\ &\quad\ldots \\ H_{L}&=f_{L}\left({H_{L-1}}\right)=a_{L}\left({W_{L}^{T} H_{L-1} +b_{L}}\right)\tag{8}\end{align*} where $W_{l}$, $b_{l}$ and $a_{l}$ denote the weight matrix, bias vector and activation function of the $l$-th layer, respectively.
Unlike the usual tower pattern, we build the MLP with the same number of units in every layer. Generally, a tower-pattern neural network helps compress the vector to a low-dimensional feature representation. In this scenario, however, compression is not necessary since only the interactions between the users and items need to be described in our proposed model. An MLP usually needs a classifier, such as Softmax, to produce the predicted value [32]. Because rating prediction is a regression task, we regard the output of the last layer as the predicted rating.
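Putting Fig. 1, Eq. (8) and the R-Relu output layer (introduced in the next subsection) together, a minimal tf.keras version of the architecture could look like the sketch below. This is our reading, not the authors' code: layer widths are illustrative, and we assume the fusion MLP outputs a single predicted rating per (user, item) pair.

```python
import tensorflow as tf

def r_relu(x, minimum=1.0, maximum=5.0):
    """R-Relu of Eq. (10): clip the output into the rating range."""
    return tf.clip_by_value(x, minimum, maximum)

def build_dccr(num_users, num_items, hidden=500, mlp_layers=2):
    """Sketch of the DCCR: two DAE branches fused by an equal-width MLP (Eq. 8)."""
    user_in = tf.keras.Input(shape=(num_items,))   # U_u has length N (ratings over items)
    item_in = tf.keras.Input(shape=(num_users,))   # I_i has length M (ratings over users)

    def dae_branch(inp, out_dim):
        h = tf.keras.layers.Dense(hidden, activation="sigmoid")(inp)
        return tf.keras.layers.Dense(out_dim, activation=r_relu)(h)   # dense reconstruction

    u_feat = dae_branch(user_in, num_items)
    i_feat = dae_branch(item_in, num_users)

    x = tf.keras.layers.Concatenate()([u_feat, i_feat])               # X' = [U'_u, I'_i]
    for _ in range(mlp_layers):
        x = tf.keras.layers.Dense(hidden, activation="sigmoid")(x)    # same width per layer
    rating = tf.keras.layers.Dense(1, activation=r_relu)(x)           # predicted r_{u,i}
    return tf.keras.Model([user_in, item_in], rating)
```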
After completing the feature fusion process, we use a gradient descent algorithm to minimize the loss, which is defined as \begin{align*} Loss=\alpha \sum \limits _{(u,i)\in \kappa \left({r}\right)\cap o\left({\tilde{r}}\right)} \left({r_{u,i} -h\left({U_{u},I_{i};\theta}\right)}\right)^{2} +\beta \sum \limits _{(u,i)\notin \kappa \left({r}\right)\cap o\left({\tilde{r}}\right)} \left({r_{u,i} -h\left({U_{u},I_{i};\theta}\right)}\right)^{2} +\frac{\lambda}{2}\sum \limits _{l=1}^{L} \left \|{W_{l}}\right \|_{F}^{2}\tag{9}\end{align*}
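One possible reading of Eq. (9), applicable when the model reconstructs full rating vectors rather than a single value, is a weighted squared error with weight α on the observed entries and β elsewhere, with the L2 term handled separately (e.g., through kernel regularizers as in the RAE sketch above). The snippet below is only our interpretation of that weighting, not the authors' implementation.

```python
import tensorflow as tf

def dccr_loss(r_true, r_pred, observed_mask, alpha=1.0, beta=0.1):
    """Weighted squared error: alpha on observed entries, beta elsewhere (our reading of Eq. 9)."""
    sq_err = tf.square(r_true - r_pred)
    weights = alpha * observed_mask + beta * (1.0 - observed_mask)
    return tf.reduce_sum(weights * sq_err)
```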
D. R-Relu Activation Function
As a vital part of a neural network model, the activation function provides the nonlinear mapping of the features. The commonly employed functions include the sigmoid, tanh and rectified linear unit (Relu) functions. Among them, Relu [34] has been proven to be nonsaturating and more suitable for sparse data. Relu alleviates the problem of vanishing gradients, and several variants of Relu have been proposed [27].
For classification tasks, a classifier is required as the last layer of a neural network. In our rating prediction task, however, the output of the last layer without a classifier can be regarded as the predicted values. In this case, the activation function of the output layer must be chosen properly. We need to guarantee that the output values are limited to a certain interval, which means that the minimum and maximum of the input rating data should be reflected in the activation function. The sigmoid function restricts each value to (0, 1), and the tanh function restricts each value to (−1, 1). Relu is a better choice but does not have a maximum. In this paper, we design the R-Relu activation function, which naturally limits the output value to [1, 5] by modifying the upper and lower boundary conditions of the original Relu function: \begin{equation*} f\left({x}\right)=\begin{cases} minimum, & \text{if } x < minimum \\ x, & \text{if } minimum \le x \le maximum \\ maximum, & \text{if } x > maximum \end{cases}\tag{10}\end{equation*}
Due to the renormalization, the minimum and maximum in our experiments are 1 and 5, respectively. R-Relu is a variant of Relu: we constrain the range of the output values while absorbing the benefits of both sigmoid and Relu. This function can be easily understood and implemented because no hyperparameters or needless variables need to be adjusted. Fig. 3 plots the R-Relu function.
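In code, R-Relu as defined in Eq. (10) amounts to clipping the output into the rating range; the following is a minimal sketch of how it could be written in TensorFlow.

```python
import tensorflow as tf

def r_relu(x, minimum=1.0, maximum=5.0):
    """R-Relu of Eq. (10): identity inside [minimum, maximum], clipped outside."""
    return tf.clip_by_value(x, minimum, maximum)

print(r_relu(tf.constant([-2.0, 0.5, 3.7, 9.0])).numpy())   # [1.  1.  3.7 5. ]
```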
Fig. 3. Schematic of R-Relu. It is a piecewise function similar to the original Relu with minimal change. In our datasets, the minimum is 1 and the maximum is 5.
R-Relu is employed in two locations in our model: the last layer of the DAE part and the last layer of the MLP part. For the activation function of the remaining layers, we apply the sigmoid function, which has been shown to perform better than Relu in rating prediction tasks [13].
Experiments and Results
In this section, we present our experiments in detail. Experiments on several conditions show the key factors in our proposed model. For all the experiments, we run 5-fold cross validation and take the average to report the results.
A. Data and Metrics
We employ two datasets to evaluate the performance of our DCCR model. The datasets are public and can be easily obtained from the internet. Details of the dataset are described as follows:
MovieLens 1M. This dataset contains 1,000,209 ratings of 3,900 movies made by 6,040 MovieLens users.
MovieLens 10M. This dataset contains 10,000,054 ratings of 10,681 movies made by 71,567 users of the online movie recommender service MovieLens.
In the experiment, each dataset is divided into two datasets: training dataset (90%) and testing dataset (10%). We implement our experiments on TensorFlow [21], which is a popular and powerful framework for deep learning model implementation.
We employ the widely used root mean squared error (RMSE) and mean absolute error (MAE) as the evaluation metrics for measuring the prediction accuracy. The RMSE and MAE are defined as \begin{align*} RMSE &=\sqrt{\frac{\sum \limits _{(u,i)\in \kappa \left({r}\right)} \left({r_{u,i} -\tilde{r}_{u,i}}\right)^{2}}{\left|{T}\right|}}\tag{11} \\ MAE &=\frac{\sum \limits _{(u,i)\in \kappa \left({r}\right)} \left|{r_{u,i} -\tilde{r}_{u,i}}\right|}{\left|{T}\right|}\tag{12}\end{align*} where $\tilde{r}_{u,i}$ is the predicted rating and $\left|{T}\right|$ is the number of ratings in the test set.
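In code, the two metrics reduce to the usual NumPy one-liners (our sketch; `r_true` and `r_pred` are the known and predicted ratings over the test set):

```python
import numpy as np

def rmse(r_true, r_pred):
    return np.sqrt(np.mean((r_true - r_pred) ** 2))   # Eq. (11)

def mae(r_true, r_pred):
    return np.mean(np.abs(r_true - r_pred))           # Eq. (12)
```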
B. Effect of Different Noises
The DAE is the base model of our DCCR and plays an important role in the prediction accuracy. Different noises can have a large impact on our model. The idea of denoising is to corrupt the input representation vector with noise, modify the loss function and adjust the parameters of our model by gradient descent to render the DAE more robust. We consider three common noises and test them with a three-layer model.
1) Mask Noise
This noise forces part of the input vector entries to be zero. Taking the user vector $U_{u}$ as an example, let $S_{u}$ be a randomly selected subset of its indices; the corruption is \begin{equation*} N_{M} \left({U_{u,i}}\right)=\begin{cases} 0, & \text{if } i \in S_{u} \\ U_{u,i}, & \text{otherwise} \end{cases}\tag{13}\end{equation*}
2) Salt and Pepper Noise
This noise forces part of the input vector entries to be the maximum or minimum of the ratings. Taking the user vector $U_{u}$ as an example, let $P_{u}$ and $Q_{u}$ be randomly selected subsets of its indices; the corruption is \begin{equation*} N_{S} \left({U_{u,i}}\right)= \begin{cases} minimum, & \text{if } i \in P_{u} \\ maximum, & \text{if } i \in Q_{u} \\ U_{u,i}, & \text{otherwise} \end{cases}\tag{14}\end{equation*}
3) Gaussian Noise
This noise adds to each entry a random variable drawn from a Gaussian distribution with a mean of 0 and a standard deviation chosen from (0, 0.1): \begin{equation*} N_{G} \left({x}\right)=x+g, \quad g\sim \mathcal{N}\left({0,\delta^{2}}\right)\tag{15}\end{equation*}
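A simple NumPy sketch of the three corruption functions in Eqs. (13)–(15) is shown below; the corruption ratio and standard deviation are illustrative values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_noise(x, ratio=0.1):
    """Eq. (13): randomly set a fraction of positions to zero."""
    corrupted = np.array(x, dtype=float)
    idx = rng.random(corrupted.shape) < ratio
    corrupted[idx] = 0.0
    return corrupted

def salt_pepper_noise(x, ratio=0.1, minimum=1.0, maximum=5.0):
    """Eq. (14): randomly set a fraction of positions to the minimum or maximum rating."""
    corrupted = np.array(x, dtype=float)
    idx = rng.random(corrupted.shape) < ratio
    corrupted[idx] = rng.choice([minimum, maximum], size=int(idx.sum()))
    return corrupted

def gaussian_noise(x, sigma=0.05):
    """Eq. (15): add zero-mean Gaussian noise with a small standard deviation."""
    return np.array(x, dtype=float) + rng.normal(0.0, sigma, size=np.shape(x))
```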
The RMSE results after adopting different kinds of noise in the DCCR are shown in Fig. 4, and the MAE results are in Fig. 5. Note that only one kind of noise is added to the DCCR for each experiment.
In both Fig. 4 and Fig. 5, I-DAE and U-DAE are the two branches of our model that separately extract features from the item input and the user input. I-DAE and U-DAE are two important parts of our prediction that ensure that the neural networks learn effectively. The performance of I-DAE is usually better than that of U-DAE. As indicated by the experimental results shown in Fig. 4 and 5, the same noises have similar effects in both I-DAE and U-DAE. The input with Gaussian noise achieves the best result, while mask noise performs better than the input without noise. Salt-and-pepper noise produces inferior predictions. For I-DAE, mask noise and Gaussian noise have similar results most of the time; however, Gaussian noise obtains slightly better results. With an increase in epochs, overfitting occurs. Even though the results with salt-and-pepper noise do not deteriorate and have converged, they are substantially worse than those of the other inputs. For U-DAE, Gaussian noise has a distinct advantage in the results. The best accuracy is attained after 80 epochs.
C. Effects of Regularization and Pre-Train
The impact of different regularization. The regularization term of the loss function can prevent overfitting of the model and affects the performance of the final results. The popular regularization terms are L1 and L2. Dropout is another regularization method specific to deep learning [7], [10]. Dropout randomly prevents some units in the hidden layers from working during the whole training process, which has proven effective in many deep learning models. The loss function usually employs one kind of regularization term at a time. To find the most effective regularization method, in this set of experiments, we compare the performance of the DCCR model with different types of regularization and without regularization. The results are shown in Fig. 6.
Fig. 6. Impact of different regularization terms. The red line demonstrates that the L2 regularization term is better than the others.
From the results in Fig. 6, we find that L2 regularization outperforms L1 regularization. Without a regularization term, the training process has a serious overfitting problem, which gives the model the worst generalization ability. We also evaluate dropout regularization. In shallow neural networks, however, dropout is not appropriate for model training because of the small number of hidden units. The results with dropout are not better than those with L2 regularization but are better than those with no regularization or L1 regularization.
Pretraining technique. The pretraining method can affect the performance of the final results. The objective of the pretraining technique is to initialize the weights and biases of the neural networks by training them in advance, possibly on a different dataset. For instance, most convolutional neural networks for image processing tasks are pretrained on the ImageNet dataset [8]. Similar to the greedy layer-wise training of stacked autoencoders, we pretrain the two branches of the DCCR model separately using the same training dataset. After the pretraining process achieves the optimal results, we train the whole model from the start, which simplifies the adjustment of the weights of the neural network for the ultimate goal. Our experiments in Table 1 show that this approach helps to improve the accuracy of the rating prediction. Without pretraining, we believe that the two branches would affect each other during training, which means that the branch for users or items would not attain the best representation for rating prediction. First, we pretrain these two branches to ensure that the model extracts the latent features well; second, we train the DCCR model starting from the pretrained weights, which improves the model.
D. Effects of Different Activation Functions and Hyperparameters
Each layer of a neural network has an activation function to filter out irrelevant features from the input and focus only on the valuable features. In this paper, we propose a new activation function named R-Relu, which is employed in two locations in our model: the last layer of the DAE part and the last layer of the MLP part. To prove the effectiveness of R-Relu, we compare the results with those of other activation functions, including sigmoid, tanh and Relu, as shown in Fig. 7. The experiments are conducted without any other techniques to isolate the effects of the different activation functions. From the results, we can see that R-Relu outperforms the other activation functions.
The DCCR model contains some hyperparameters, which are vital for the performance of ratings prediction. We manually adjust the hyperparameters in the specified range to show the effects of different parameters.
We focus on the symmetrical DAE in our DCCR model, which is designed with a particular purpose. In this symmetrical DAE, the sizes of the input vector and output vector are fixed; however, the sizes of the hidden layers are not. In many deep learning studies, the number of units in each layer of a model is specified by the researchers and repeatedly validated. In this case, we use a DAE with one hidden layer to test the impacts of different numbers of units. The adjustable range of this value is [100, 250, 500, 750, 1000]. For our experiments, to show the different effects of feature extraction between users and items, we separately test both I-DAE and U-DAE. The results are shown in Fig. 8.
The results of Fig. 8 reveal that the number of hidden layer units can significantly affect the results. The hidden layer with 500 units can achieve the best results compared with the other settings. Either fewer than 500 or more than 500 will yield inferior results.
Second, the loss function contains the coefficient of the regularization term. We evaluate the impact of this parameter through several contrast experiments. The adjustable range of this value is [0.01, 0.1, 1, 10, 100, 1000]. The results are shown in Fig. 9. The coefficient significantly impacts the performance, and the optimal coefficient is sensitive to the dataset. For example, the best coefficient is 10 for MovieLens 1M, while the best coefficient is 100 for MovieLens 10M.
Fig. 9. Impacts of different coefficients of regularization. (A) shows the accuracy results for MovieLens 1M, and (B) shows the results for MovieLens 10M.
A backpropagation algorithm and mini-batch gradient descent are used for our model training. The mini-batch gradient descent is a compromise between the batch gradient descent and stochastic gradient descent. We also apply adaptive moment estimation (Adam) to automatically adjust the learning rate during the training process [33]. We do not discuss other hyperparameters, such as the ratio of corrupt values, due to the limited scope of this paper.
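Tying the pieces together, a minimal training setup could look like the following sketch, reusing the `build_dccr` function from our earlier architecture sketch; the synthetic arrays, batch size and epoch count are stand-ins, not the experimental configuration of the paper.

```python
import numpy as np
import tensorflow as tf

num_users, num_items = 200, 100                     # toy sizes; MovieLens is far larger
user_vecs = np.random.uniform(0, 5, (1024, num_items)).astype("float32")
item_vecs = np.random.uniform(0, 5, (1024, num_users)).astype("float32")
ratings = np.random.uniform(1, 5, (1024, 1)).astype("float32")

model = build_dccr(num_users, num_items)            # defined in the earlier DCCR sketch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
model.fit([user_vecs, item_vecs], ratings,
          batch_size=256, epochs=5, validation_split=0.1)   # mini-batch training with Adam
```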
E. Results From Model Comparison
We compare the performance of traditional methods and several effective methods based on neural networks and show that our model outperforms the other methods. The results are shown in Table 1. The advantages of our model are indicated by the accuracy of the results. Please note that we did not compare the DCCR with knowledge-based and content-based recommenders, since we mainly focus on using the raw ratings alone for rating prediction, while knowledge-based and content-based methods are not suitable for recommendation when only the raw ratings are available. Furthermore, both knowledge-based and content-based methods are mainly used for Top-N recommendation instead of rating prediction.
As shown in Table 1, the first two methods are proposed in [11]. We set the dimension of the latent factor model to 20, the learning rate to 0.01, and the coefficient of the regularization term to 0.02. The remaining methods are based on neural networks. V-CFN and V-CFN++ use stacked denoising autoencoders for rating prediction [25]. AutoSVD and AutoSVD++ combine SVD and SVD++, respectively, with contractive autoencoders, which renders the models more interpretable [26]. In contrast, the next two methods only use the rating data to predict ratings without any other information [13], which is more consistent with our research. The AutoRec-I model achieves better results than the AutoRec-U model when the number of users is larger than the number of items. This finding suggests that more input features can produce more accurate results. For each method, we keep the settings described in the original papers. The remaining four methods are the ones discussed in this paper.
According to the experimental results, our neural-network-based method attains the best accuracy, which demonstrates the effectiveness of the feature extraction and fusion parts of our model. In the feature extraction part, two parallel branches of networks are designed and work independently. In recommendation methods combined with neural networks, autoencoders can predict the ratings from a user vector or an item vector, which produces two kinds of models, for users and for items, respectively. Based on the results, the model with the item vector as input achieves better results than the model with the user vector as input. Typically, the output values of autoencoders are the predicted values; in our study, the dense output vectors are instead treated as the extracted features of users and items. The idea of fusing users’ features and items’ features, which can represent the interactions between users and items, is discussed in many studies. The fusion part is required in top-n recommendation and click-through rate prediction. In this study, we also show the effectiveness of the fusion part in the rating prediction task; to our knowledge, we may be among the first to demonstrate this effectiveness with neural networks.
The previously mentioned steps all contribute to the final results. The accuracy improvement with pretraining is obvious. We pretrain the two branches of the DCCR until the model converges and then use the weights from pretraining as the initial values to adjust all parameters of the DCCR. The last two methods in Table 1 achieve slightly different results because of the pretraining technique.
F. Effects of the Depth of Neural Network
Some researchers have shown that deep neural networks can truly help improve the accuracy of many artificial intelligence tasks, such as image processing. One of the most successful neural networks is ResNet [18], which stacks more than 100 layers to achieve the best results in the ImageNet contests. In our model, we design the basic structure with three hidden layers for the feature extraction part and one layer for the fusion part. A five-layer autoencoder can be designed as shown in Fig. 10 (left) for the DCCR. However, deeper neural networks do not necessarily achieve better results than a single-hidden-layer network [28]. The vanishing gradient is a serious problem in the training process of multilayer neural networks. Vanishing gradients are usually ascribed to the activation functions, such as the sigmoid function [16], uninitialized parameters and inappropriate learning rates. This situation causes network degradation, which means that deeper neural networks learn even less than shallow networks. Without exception, when we simply stack the layers to predict the ratings, we obtain inferior accuracy, as shown in Fig. 10 (right). To alleviate this problem, several approaches exist, such as stacked autoencoders [4], [22], batch normalization [17] and residual learning [18]. Changing the activation function can enable deep neural network training to proceed more smoothly [34]; in our case, however, changing the activation function worsens the prediction. We defer residual learning to future research.
Fig. 10. Multilayer autoencoders for rating prediction (left). Results of multilayer autoencoders (right).
Stacked autoencoders helped to train deep neural networks in the early stages [4] and employ greedy layer-wise training in two steps: first, treat each layer as the hidden layer of an autoencoder and adjust its weights separately; second, use a backpropagation algorithm to optimize the parameters of all layers simultaneously. The first step can be very time-consuming for each layer, and the process is not end-to-end training. This method has largely been replaced by other methods because of its complexity and inefficiency.
Batch normalization has proven to be an effective measure to accelerate the training of deep networks and improve the accuracy without many deep learning techniques, such as lower learning rates and careful parameter initialization [17]. Batch normalization takes a normalization step that fixes the mean and variance of the layer inputs for each mini-batch and has a beneficial impact on the gradient propagation in the neural networks, which can help networks learn better from the data. Fig. 10 (right) shows the results of a 5-layer DAE with BN, which are substantially better than those of the model without BN. We also test a 7-layer model; the results are shown in Fig. 11.
Our experiments show that deep networks can assist rating prediction when the vanishing gradient problem has been alleviated. With a gradual increase in network depth, the training process becomes slower than that of shallow networks, requires more time and has higher hardware requirements. This is a trade-off between accuracy and efficiency, which we defer to future research.
Conclusions
Collaborative filtering has been shown to be effective in commercial recommender systems. By combining with neural networks, CF can represent the latent features of users and items without manual feature engineering. However, most related studies use a single model with a common activation function to perform the rating prediction task without considering the traits of the features and ratings. In this paper, we propose a hybrid neural network model for rating prediction named the deep collaborative conjunctive recommender (DCCR). This model integrates the spirit of several neural networks to separately capture the latent features of users and items and to describe the interactions between these features. Using only the explicit ratings from the data, we design this end-to-end model to improve the accuracy of rating prediction.
Numerous factors affect the prediction performance. Thus, to achieve the optimal model, we evaluate the DCCR with varying factor settings through extensive contrast experiments. The results show that our DCCR model outperforms other state-of-the-art methods on two real-world datasets. We also show that adding layers to the DCCR can have a positive effect on accuracy.