Introduction
Recommender systems [1] are essential for the success of many online applications. Consider online shopping websites as an example: these sites offer an enormous number of goods, and users cannot browse the information about all of them in a short time. In this context, recommender systems, as an effective kind of information filtering tool, not only help users obtain more valuable suggestions by filtering out redundant information but also gradually increase the sales volume of the websites. As a result, recommender systems have already been integrated into some large-scale websites (e.g., Amazon), which continually serve thousands of people.
To date, different kinds of recommendation tasks have been extensively investigated in academia and industry, including rating prediction [11], [13], [25], Top-N recommendation [30], [32], [35], click-through rate prediction [12], [27], [29], etc. These tasks help people obtain useful information from large amounts of varied data, which conforms to the actual usage scenarios of industrial applications. In the past few decades, researchers have spent considerable effort on recommender design and have achieved great results. Collaborative filtering (CF) [1] is one of the great inventions in recommendation research and has been successfully used in industrial applications. In contrast to traditional CF recommenders, which depend on calculating the similarity between users/items with similar preferences, matrix factorization (MF) is a popular CF recommender for rating prediction tasks [11]. MF decomposes the original rating matrix R into two low-rank matrices, which represent the latent feature spaces of users and items. Due to their effectiveness, many variants of the MF method have been proposed [2], [9]. Recently, with the success of deep neural networks, combining recommenders with deep learning methods has become a new breakthrough.
In this context, CF recommenders combined with deep learning methods have attracted much attention from academia and industry. Deep learning methods have been successfully applied and have achieved satisfactory results in many fields, such as image processing, speech recognition and natural language processing [3], [18]. Research results have demonstrated that neural networks have a powerful ability to learn latent features from heterogeneous data and obtain reasonable results [4], [20]. Among them, AutoEncoders (AE) and Multi-Layer Perceptrons (MLP) have been widely used for recommendation recently. However, each has its own advantages and disadvantages when designing a recommendation model. An AE is a common and effective method that reconstructs its input data in the output layer. The core idea of using an AE for recommendation is to predict users’ preferences by compressing an input vector and then make recommendations. However, most existing studies using AEs mainly focus on the feature representation of users and items separately, without considering the interaction between them [11], [25], [26]. An MLP is a feed-forward neural network with multiple hidden layers, which is good at learning hierarchical feature representations effectively (e.g., the interaction between users and items in recommendation) but lacks the ability to extract features from users and items separately [9], [32]. This means that the feature representations of users and items can be affected by each other. Therefore, adopting an AE or MLP alone cannot achieve the goals of latent feature representation and fusion simultaneously.
To address these issues, this paper proposes a deep collaborative conjunctive recommender (DCCR). We focus on the rating prediction task and formulate it as a regression problem. By taking advantage of deep learning and traditional CF methods, we extract latent features from users’ explicit ratings of items without any additional information. We first present related studies of deep learning methods and traditional methods and analyze the rating prediction process. We then describe the feature representation of users and items and explain the proposed model. The techniques that we apply and the working mechanism are detailed for better feature extraction with consideration of the interaction between users and items. Particular experimental settings and procedures are defined to obtain reasonable results. We measure the results with the root mean squared error and the mean absolute error. To obtain the optimal results, experiments are conducted with varying parameter settings. The experimental results are illustrated to understand the impacts of many factors. We also compare the accuracy of our proposed method with other related and recent methods. The main contributions of this paper include:
We present a novel recommender model that extracts deep inner features of both users and items that solely depend on the explicit ratings and extract the interaction features. We describe the details of the structure, input vector, loss function and training techniques, which are indispensable for the experiments.
We investigate the impacts of the parameters of the proposed model and analyze the relations of these parameters to the prediction accuracy. We also provide possible measures for improving the results from different perspectives. An improved activation function for our neural networks is proposed, which can be specified according to the input vectors.
By conducting extensive experiments on two datasets, we show that the proposed model achieves better accuracy for this particular rating prediction task. We also discuss the expandability of our model by analyzing the depth of the neural networks. Several methods are discussed to mitigate the gradient problems of deep neural networks.
Related Studies
Collaborative filtering (CF) has been widely used to provide users with new products and services in many industrial applications. CF recommends products favored by similar users or chooses products similar to users’ favorite products. The matrix factorization model is the most important CF method and has been explored by many researchers. Among the different kinds of MF models, the latent factor model (LFM) is the most popular model for rating prediction tasks. The LFM factorizes the rating matrix R into two low-rank latent factor matrices. However, the manual process of feature extraction consumes manpower and financial resources. Recently, deep learning methods have shown that neural networks have a powerful ability to automatically learn features from heterogeneous data and obtain reasonable results for most tasks [4], [20]. Therefore, to improve the prediction accuracy by learning deep inner user/item features, CF combined with neural network methods has been proposed in many papers.
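For illustration only, the following minimal sketch (ours, not taken from the LFM papers cited above) factorizes a rating matrix into user and item latent factors with stochastic gradient descent over the observed entries; the factor dimension, learning rate and regularization weight are placeholder values.

```python
import numpy as np

def train_lfm(R, k=20, lr=0.01, reg=0.02, epochs=50):
    """Toy latent factor model: approximate R with P @ Q.T on observed entries (0 = unknown)."""
    M, N = R.shape
    rng = np.random.default_rng(0)
    P = 0.1 * rng.standard_normal((M, k))   # user latent factors
    Q = 0.1 * rng.standard_normal((N, k))   # item latent factors
    users, items = np.nonzero(R)            # indices of known ratings
    for _ in range(epochs):
        for u, i in zip(users, items):
            err = R[u, i] - P[u] @ Q[i]
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * P[u] - reg * Q[i])
    return P, Q  # predicted rating for (u, i): P[u] @ Q[i]
```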
As one of the most effective deep learning methods in recommendation, autoencoders have been discussed in several papers [12]–[15]. An autoencoder is an unsupervised learning method that can automatically compress the input features to a low dimension and has shown clear advantages in feature extraction compared with traditional methods [22]. Different kinds of autoencoders have been proposed for different scenarios, such as denoising autoencoders (DAE) [4], [5], marginalized autoencoders [6], [12] and contractive autoencoders [26]. Many researchers have successfully applied these models in recommender systems. Reference [12] combines collaborative filtering with marginalized denoising autoencoders for rating prediction and click prediction. Reference [25] employs stacked denoising autoencoders (SDAE) to extract features from side information to predict ratings. Some studies combine autoencoders with traditional methods. In [26], the authors present two new hybrid models, named AutoSVD and AutoSVD++, by integrating contractive autoencoders (CAE) into the matrix factorization models SVD and SVD++ [11]. The authors utilize a CAE to represent item side information with nonlinear features. In [30], the authors build a hybrid collaborative filtering model that combines an SDAE and MF to learn from both the user-item rating matrix and the side information of users and items. Although an autoencoder is an effective method for compressing an input vector to predict users’ preferences and make recommendations, these studies usually focus on the feature representation of users and items separately without considering the interaction between users and items.
To address this issue, the multilayer perceptron, another neural network model, has been applied to many industrial recommender systems. The multilayer perceptron combines the features of users and items, which have been extracted by neural networks, to achieve better recommendation. However, most of these methods focus on content processing, such as reviews. Typically, the reviews of users and items are employed as input data, and a joint deep model is built to merge the features [23]. Some works apply co-attention mechanisms [24] to learn a distributed representation from user and item reviews. Side information, such as categorical information about users and items, is applied in many papers to improve the prediction accuracy and has been applied for multiple tasks, especially top-n prediction. Reference [31] combines the linearity of Factorization Machines (FM), which represent the feature interactions, with the nonlinearity of networks that extract features from high-order interactions such as categorical variables. Some researchers have tried to employ side information in traditional methods; however, most recommendation research addresses deep learning. Reference [27] designs a novel deep interest network that uses embeddings and a multilayer perceptron (MLP) to learn the representation of user interests from historical behaviors. Reference [29] merges the features from a wide network branch and a deep network branch into one model to predict the click rate. Reference [32] replaces the inner product of MF with neural networks and separately fuses the linear and nonlinear features from generalized matrix factorization and an MLP. Reference [9] slightly changes the input vector and loss function and employs cosine similarity to measure users’ preferences. In addition to text and categorical data as side information, MLPs and their derivatives can also extract features from media data, such as text content, for recommendation [5]. Side information can be easily obtained by a commercial company, but some information is highly sensitive. As we can see, these studies have more stringent requirements for the input data and are not appropriate for every dataset, which indicates that a substantial amount of effort must be spent on adjusting the inputs. Besides, these related works usually focus on the interactions but do not extract features from users and items separately, which means the feature representations of users and items can be affected by each other.
Thus, we propose a new model for recommender systems that can separately represent user and item features and merge them to make predictions more accurate. We focus on the feature representation, algorithm and model design to make better recommendations instead of using more information from specific datasets. With the techniques described below, this new model can outperform related methods on different datasets.
Architecture of DCCR Model
In this section, we describe the proposed DCCR model. To take advantage of deep learning in terms of deeper inner feature extraction and fusion, the DCCR is a hybrid architecture that consists of two different kinds of neural network models (i.e., DAE and MLP). The DAEs extract the deeper latent features of users and items from the raw rating data separately, while the main function of the MLP is to merge the user and item features from the outputs of the DAEs at its first layer and to extract higher-level features (i.e., the relationships between users and items) based on the combined user and item features. The architecture of the proposed model for the rating prediction task is shown in Fig. 1.
Fig. 1. Rating prediction of the DCCR model. The input vectors are the feature representations of users and items. The output vector is the predicted ratings.
This model contains two parallel neural networks, which are merged in the fusion part. The first branch of the model represents the latent features from the input vector of the user, while the second branch represents the latent features of the item. The main advantage of this architecture is that it not only separately extracts latent features from users and items, as many other studies do, but also captures the interaction relationships between users and items to improve the prediction performance. The proposed architecture only requires raw rating data for rating prediction tasks.
A. Feature Representation
Raw data collected from industrial applications are noisy with a substantial amount of useless information and cannot be directly applied for recommender tasks. Thus, feature representation is an important step in our study. Based on the theory of collaborative filtering, we define our item and user feature representation as follows:
Given the user id set $\{1,2,\ldots,M\}$ and the item id set $\{1,2,\ldots,N\}$, each known raw rating $\bar{r}_{u,i}$ of user $u$ for item $i$ is first renormalized as \begin{equation*} r_{u,i} =\frac {\bar {r}_{u,i} -\bar {r}_{\min }}{\bar {r}_{\max } -\bar {r}_{\min }} \cdot N_{s} +1, \quad \text{where } \bar {r}_{u,i} \text{ is known}\tag{1}\end{equation*} where $\bar{r}_{\min}$ and $\bar{r}_{\max}$ are the minimum and maximum raw ratings and $N_{s}$ is a scaling factor that determines the upper bound of the rescaled ratings.
Because user $u$ may not have rated item $i$, the unknown ratings are set to zero: \begin{equation*} r_{u,i} =\begin{cases} 0, & \text{if } r_{u,i} \text{ is unknown} \\ r_{u,i}, & \text{otherwise} \end{cases}\tag{2}\end{equation*}
We use the rating data to express the features of items and users. For item $i$, the feature vector is \begin{equation*} I_{i} =[r_{1,i},r_{2,i},\ldots,r_{M,i}]\tag{3}\end{equation*}
For user $u$, the feature vector is \begin{equation*} U_{u} =[r_{u,1},r_{u,2},\ldots,r_{u,N}]\tag{4}\end{equation*}
$M$ usually does not equal $N$; thus, the user and item feature vectors usually have different dimensions.
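To make Eqs. (1)–(4) concrete, here is a small sketch of how the feature vectors could be built with NumPy. It is our own illustration: `n_s` plays the role of $N_{s}$, unknown entries stay 0 as in Eq. (2), and the toy matrix is only for demonstration.

```python
import numpy as np

def renormalize(raw, n_s=4):
    """Eq. (1): map known raw ratings into [1, n_s + 1]; zeros mark unknown ratings (Eq. 2)."""
    known = raw > 0
    r_min, r_max = raw[known].min(), raw[known].max()
    out = np.zeros_like(raw, dtype=np.float32)
    out[known] = (raw[known] - r_min) / (r_max - r_min) * n_s + 1
    return out

raw_ratings = np.array([[5, 0, 3],     # toy M x N rating matrix, 0 = missing
                        [0, 4, 1],
                        [2, 0, 0],
                        [0, 5, 4]])
R = renormalize(raw_ratings)
item_vec = R[:, 0]    # I_1 = [r_{1,1}, ..., r_{M,1}], Eq. (3)
user_vec = R[1, :]    # U_2 = [r_{2,1}, ..., r_{2,N}], Eq. (4)
```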
B. Feature Extracting With Autoencoders
An autoencoder is selected to extract the latent feature from raw rating data. Each branch of the first part of the DCCR is a complete denoising autoencoder with all layers and differs from existing methods that only use the output of hidden layers.
Typically, an autoencoder is an efficient model for compressing a high-dimensional input to a low-dimensional vector. An autoencoder is an unsupervised neural network whose output needs to reconstruct the input. Autoencoders usually have three layers: the input layer, the hidden layer and the output layer. To force the neural network to learn features from the input, the number of output layer units is equal to the number of input layer units, and the number of hidden layer units is smaller than the number of input layer units. We consider the output vector of an autoencoder to be another way to express the implicit features. Unlike the input vector, the output vector is a dense vector without missing values, which makes it more suitable as input data for a subsequent neural network.
However, neural networks can easily overfit due to their large number of parameters. To avoid this issue, a regularized autoencoder (RAE) is used. The RAE adds an L2 regularization term to the loss function, which changes the training process of the network. Taking the user feature vector $U_{u}$ as an example, the loss function is defined as \begin{align*} L\left({U_{u},h\left({U_{u};\theta}\right)}\right)=\sum \limits _{u=1}^{M} \left \|{U_{u} -h\left({U_{u};\theta}\right)}\right \|^{2}+\frac{\lambda}{2}\sum \limits _{l=1}^{L} \left \|{W_{l}}\right \|_{F}^{2}\tag{5}\end{align*} where $h(\cdot;\theta)$ is the reconstruction produced by the autoencoder with parameters $\theta$, $W_{l}$ is the weight matrix of the $l$-th layer and $\lambda$ controls the regularization strength.
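As one possible realization of Eq. (5), the sketch below builds a single-hidden-layer regularized autoencoder in tf.keras; the hidden size, activation choice and λ value are illustrative assumptions rather than the paper's exact settings.

```python
import tensorflow as tf

def build_rae(input_dim, hidden_units=500, lam=0.1):
    """One DCCR branch as a regularized autoencoder: reconstruction loss + (lam/2)*||W||_F^2."""
    reg = tf.keras.regularizers.l2(lam / 2.0)
    inp = tf.keras.Input(shape=(input_dim,))
    hidden = tf.keras.layers.Dense(hidden_units, activation="sigmoid",
                                   kernel_regularizer=reg)(inp)
    out = tf.keras.layers.Dense(input_dim, kernel_regularizer=reg)(hidden)
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")   # MSE reconstruction + L2 weight penalty
    return model
```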
Fig. 2. Rating prediction of autoencoders. AutoRec-U and AutoRec-I separately use the user features and item features as input vectors. These two models have different input dimensions, parameter sizes and prediction accuracies.
Classic autoencoders often take dense input; this scenario is an exception. A denoising autoencoder corrupts the input data with some specified noise, modifies the loss function, and requires the network to reconstruct the initial input [5], [22]. This process is termed denoising, and it can render the neural network more robust than the original autoencoder. Let $N(\cdot)$ denote the corruption function; the corrupted user and item vectors are \begin{align*} U'_{u,i}&=N\left({U_{u,i}}\right),\quad \text{for } i \text{ in } \left[{1,N}\right] \tag{6}\\ I'_{u,i}&=N\left({I_{u,i}}\right),\quad \text{for } u \text{ in } \left[{1,M}\right]\tag{7}\end{align*}
Generally, three kinds of noise are extensively employed in many papers: Gaussian noise, masking noise and salt-and-pepper noise. The corruption ratio is usually small and does not destroy the latent features of the input data. In [15], the authors tried different noises with different corruption ratios to improve the model for predicting ratings.
A stacked denoising autoencoder is an efficient way to train a deep neural network [15]. Each layer in a stacked denoising autoencoder is the hidden layer of one autoencoder, with the exception of the input layer and the output layer. The way to train this deep neural network is greedy layer-wise training, as sketched below: first, train each layer one at a time; second, stack all layers into one network and fine-tune the parameters with a backpropagation algorithm. Several recommendation studies use stacked denoising autoencoders [25], [30].
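The greedy layer-wise procedure can be sketched as follows (our simplified illustration; layer sizes, epochs and optimizer are placeholders): each hidden layer is first trained as the hidden layer of its own autoencoder, and the resulting encoders are then stacked and fine-tuned end to end.

```python
import tensorflow as tf

def pretrain_stacked_dae(x, layer_sizes, epochs=20):
    """Greedy layer-wise pretraining: one autoencoder per hidden layer, trained in sequence."""
    encoders, current = [], x
    for units in layer_sizes:
        inp = tf.keras.Input(shape=(current.shape[1],))
        h = tf.keras.layers.Dense(units, activation="sigmoid")(inp)
        out = tf.keras.layers.Dense(current.shape[1])(h)
        ae = tf.keras.Model(inp, out)
        ae.compile(optimizer="adam", loss="mse")
        ae.fit(current, current, epochs=epochs, batch_size=128, verbose=0)
        encoder = tf.keras.Model(inp, h)
        encoders.append(encoder)
        current = encoder.predict(current, verbose=0)   # becomes the next layer's input
    return encoders   # stack these encoders and fine-tune the whole network afterwards
```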
Typically, these studies do not fuse the user features and item features produced by the autoencoders [13]. In our model, the fusion part is as important as the feature extraction part.
C. Feature Fusion With MLP
Directly predicting the unknown ratings from the output of autoencoders is common [13], [25]. However, this approach does not account for any interactions and disregards the relation between a user and an item. Collaborative filtering models these interactions, which affect the prediction results. To address this problem, we use an MLP, adding hidden layers to the model to merge the implicit features from the two autoencoders’ output vectors. In contrast to autoencoders, an MLP (multilayer perceptron) can extract higher-order features, which can be more useful for different tasks. MLPs and their derivatives, such as convolutional neural networks, are widely applied in image processing, where multilayer neural networks capture latent features.
The structure design of an MLP usually follows a tower pattern [16]; the idea is to use fewer hidden units in higher layers. In our experiments, we first tried to use an MLP alone to perform rating prediction by combining the user input vector with the item vector in the first layer. However, this produces inferior predictions, which may explain why few papers predict ratings this way: the MLP extracts features and maps them into other dimensions without considering the range of the ratings, even when the entire output is renormalized. The sparsity of the input vector may be another reason why this approach does not work.
To alleviate this issue, we slightly modify the MLP model. First, we concatenate the two dense output vectors into one vector; second, we feed this vector to the MLP. More precisely, the MLP part of our DCCR is defined as \begin{align*} X'&=f\left({U_{u}',I_{i}'}\right)=\left[{U_{u}',I_{i}'}\right] \\ H_{1}&=f_{1}\left({X'}\right)=a_{1}\left({W_{1}^{T} X'+b_{1}}\right), \\ &\quad\ldots \\ H_{L}&=f_{L}\left({H_{L-1}}\right)=a_{L}\left({W_{L}^{T} H_{L-1} +b_{L}}\right)\tag{8}\end{align*} where $W_{l}$, $b_{l}$ and $a_{l}$ denote the weight matrix, bias vector and activation function of the $l$-th layer, respectively.
Unlike the usual tower pattern, we build the MLP with the same number of units in every layer. Generally, a tower-pattern neural network helps compress the vector to a low-dimensional feature representation. In this scenario, however, compression is not necessary since only the interactions between the users and items need to be described in our proposed model. An MLP usually needs a classifier, such as Softmax, to produce the predicted value [32]. Because rating prediction is a regression task, we regard the output of the last layer as the predicted rating.
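Putting Fig. 1, Eq. (8) and the R-Relu output layer (introduced in the next subsection) together, a minimal tf.keras version of the architecture could look like the sketch below. This is our reading, not the authors' code: layer widths are illustrative, and we assume the fusion MLP outputs a single predicted rating per (user, item) pair.

```python
import tensorflow as tf

def r_relu(x, minimum=1.0, maximum=5.0):
    """R-Relu of Eq. (10): clip the output into the rating range."""
    return tf.clip_by_value(x, minimum, maximum)

def build_dccr(num_users, num_items, hidden=500, mlp_layers=2):
    """Sketch of the DCCR: two DAE branches fused by an equal-width MLP (Eq. 8)."""
    user_in = tf.keras.Input(shape=(num_items,))   # U_u has length N (ratings over items)
    item_in = tf.keras.Input(shape=(num_users,))   # I_i has length M (ratings over users)

    def dae_branch(inp, out_dim):
        h = tf.keras.layers.Dense(hidden, activation="sigmoid")(inp)
        return tf.keras.layers.Dense(out_dim, activation=r_relu)(h)   # dense reconstruction

    u_feat = dae_branch(user_in, num_items)
    i_feat = dae_branch(item_in, num_users)

    x = tf.keras.layers.Concatenate()([u_feat, i_feat])               # X' = [U'_u, I'_i]
    for _ in range(mlp_layers):
        x = tf.keras.layers.Dense(hidden, activation="sigmoid")(x)    # same width per layer
    rating = tf.keras.layers.Dense(1, activation=r_relu)(x)           # predicted r_{u,i}
    return tf.keras.Model([user_in, item_in], rating)
```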
After completing the feature fusion process, we use a gradient descent algorithm to minimize the loss, which is defined as \begin{align*} Loss=\alpha \sum \limits _{(u,i)\in \kappa \left({r}\right)\cap o\left({\tilde{r}}\right)} \left({r_{u,i} -h\left({U_{u},I_{i};\theta}\right)}\right)^{2} +\beta \sum \limits _{(u,i)\notin \kappa \left({r}\right)\cap o\left({\tilde{r}}\right)} \left({r_{u,i} -h\left({U_{u},I_{i};\theta}\right)}\right)^{2} +\frac{\lambda}{2}\sum \limits _{l=1}^{L} \left \|{W_{l}}\right \|_{F}^{2}\tag{9}\end{align*}
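One possible reading of Eq. (9), applicable when the model reconstructs full rating vectors rather than a single value, is a weighted squared error with weight α on the observed entries and β elsewhere, with the L2 term handled separately (e.g., through kernel regularizers as in the RAE sketch above). The snippet below is only our interpretation of that weighting, not the authors' implementation.

```python
import tensorflow as tf

def dccr_loss(r_true, r_pred, observed_mask, alpha=1.0, beta=0.1):
    """Weighted squared error: alpha on observed entries, beta elsewhere (our reading of Eq. 9)."""
    sq_err = tf.square(r_true - r_pred)
    weights = alpha * observed_mask + beta * (1.0 - observed_mask)
    return tf.reduce_sum(weights * sq_err)
```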
D. R-Relu Activation Function
As a vital part of a neural network model, the activation function provides the nonlinear mapping of the features. The commonly employed functions include the sigmoid, tanh and rectified linear unit (Relu) functions. Among them, Relu [34] has been proven to be nonsaturating and more suitable for sparse data. Relu alleviates the problem of vanishing gradients, and several variants of Relu have been proposed [27].
For classification tasks, a classifier is required as the last layer of a neural network. In our rating prediction task, however, the output of the last layer without a classifier can be regarded as the predicted values. In this case, the activation function of the output layer must be chosen properly. We need to guarantee that the output values are limited to a certain interval, which means that the minimum and maximum of the input rating data should be reflected in the activation function. The sigmoid function restricts each value to (0, 1), and the tanh function restricts each value to (−1, 1). Relu is a better choice but does not have a maximum. In this paper, we design the R-Relu activation function, which naturally limits the output value to [1, 5] by modifying the upper and lower boundary conditions of the original Relu function: \begin{equation*} f\left({x}\right)=\begin{cases} minimum, & \text{if } x < minimum \\ x, & \text{if } minimum \le x \le maximum \\ maximum, & \text{if } x > maximum \end{cases}\tag{10}\end{equation*}
Due to the renormalization, the minimum and maximum in our experiments are 1 and 5, respectively. R-Relu is a variant of Relu: we constrain the range of the output values while absorbing the benefits of both sigmoid and Relu. This function can be easily understood and implemented because no hyperparameters or needless variables need to be adjusted. Fig. 3 plots the R-Relu function.
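In code, R-Relu as defined in Eq. (10) amounts to clipping the output into the rating range; the following is a minimal sketch of how it could be written in TensorFlow.

```python
import tensorflow as tf

def r_relu(x, minimum=1.0, maximum=5.0):
    """R-Relu of Eq. (10): identity inside [minimum, maximum], clipped outside."""
    return tf.clip_by_value(x, minimum, maximum)

print(r_relu(tf.constant([-2.0, 0.5, 3.7, 9.0])).numpy())   # [1.  1.  3.7 5. ]
```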
Fig. 3. Schematic of R-Relu. It is a piecewise function similar to the original Relu with minimal change. In our datasets, the minimum is 1 and the maximum is 5.
R-Relu is employed in two locations in our model: the last layer of the DAE part and the last layer of the MLP part. For the activation function of the remaining layers, we apply the sigmoid function, which has been shown to perform better than Relu in rating prediction tasks [13].
Experiments and Results
In this section, we present our experiments in detail. Experiments on several conditions show the key factors in our proposed model. For all the experiments, we run 5-fold cross validation and take the average to report the results.
A. Data and Metrics
We employ two datasets to evaluate the performance of our DCCR model. The datasets are public and can be easily obtained from the internet. Details of the dataset are described as follows:
MovieLens 1M. This dataset contains 1,000,209 ratings of 3,900 movies made by 6,040 MovieLens users.
MovieLens 10M. This dataset contains 10,000,054 ratings of 10,681 movies made by 71,567 users of the online movie recommender service MovieLens.
In the experiment, each dataset is divided into two datasets: training dataset (90%) and testing dataset (10%). We implement our experiments on TensorFlow [21], which is a popular and powerful framework for deep learning model implementation.
We employ the widely used root mean squared error (RMSE) and mean absolute error (MAE) as the evaluation metrics for measuring the prediction accuracy. The RMSE and MAE are defined as \begin{align*} RMSE &=\sqrt{\frac{\sum \limits _{(u,i)\in \kappa \left({r}\right)} \left({r_{u,i} -\tilde{r}_{u,i}}\right)^{2}}{\left|{T}\right|}}\tag{11} \\ MAE &=\frac{\sum \limits _{(u,i)\in \kappa \left({r}\right)} \left|{r_{u,i} -\tilde{r}_{u,i}}\right|}{\left|{T}\right|}\tag{12}\end{align*} where $\tilde{r}_{u,i}$ is the predicted rating and $\left|{T}\right|$ is the number of ratings in the test set.
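In code, the two metrics reduce to the usual NumPy one-liners (our sketch; `r_true` and `r_pred` are the known and predicted ratings over the test set):

```python
import numpy as np

def rmse(r_true, r_pred):
    return np.sqrt(np.mean((r_true - r_pred) ** 2))   # Eq. (11)

def mae(r_true, r_pred):
    return np.mean(np.abs(r_true - r_pred))           # Eq. (12)
```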
B. Effect of Different Noises
The DAE is the base model of our DCCR and plays an important role in the prediction accuracy. Different noises can have a large impact on our model. The idea of denoising is to corrupt the input representation vector with noise, modify the loss function and adjust the parameters of our model by gradient descent to render the DAE more robust. We consider three common noises and test them with a three-layer model.
1) Mask Noise
This noise forces part of the input vector entries to be zero. Taking the user vector $U_{u}$ as an example, let $S_{u}$ be a randomly selected subset of its indices; the corruption is \begin{equation*} N_{M} \left({U_{u,i}}\right)=\begin{cases} 0, & \text{if } i \in S_{u} \\ U_{u,i}, & \text{otherwise} \end{cases}\tag{13}\end{equation*}
2) Salt and Pepper Noise
This noise forces part of the input vector entries to be the maximum or minimum of the ratings. Taking the user vector $U_{u}$ as an example, let $P_{u}$ and $Q_{u}$ be randomly selected subsets of its indices; the corruption is \begin{equation*} N_{S} \left({U_{u,i}}\right)= \begin{cases} minimum, & \text{if } i \in P_{u} \\ maximum, & \text{if } i \in Q_{u} \\ U_{u,i}, & \text{otherwise} \end{cases}\tag{14}\end{equation*}
3) Gaussian Noise
This noise adds to each entry a random variable drawn from a Gaussian distribution with a mean of 0 and a standard deviation chosen from (0, 0.1): \begin{equation*} N_{G} \left({x}\right)=x+g, \quad g\sim \mathcal{N}\left({0,\delta^{2}}\right)\tag{15}\end{equation*}
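A simple NumPy sketch of the three corruption functions in Eqs. (13)–(15) is shown below; the corruption ratio and standard deviation are illustrative values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_noise(x, ratio=0.1):
    """Eq. (13): randomly set a fraction of positions to zero."""
    corrupted = np.array(x, dtype=float)
    idx = rng.random(corrupted.shape) < ratio
    corrupted[idx] = 0.0
    return corrupted

def salt_pepper_noise(x, ratio=0.1, minimum=1.0, maximum=5.0):
    """Eq. (14): randomly set a fraction of positions to the minimum or maximum rating."""
    corrupted = np.array(x, dtype=float)
    idx = rng.random(corrupted.shape) < ratio
    corrupted[idx] = rng.choice([minimum, maximum], size=int(idx.sum()))
    return corrupted

def gaussian_noise(x, sigma=0.05):
    """Eq. (15): add zero-mean Gaussian noise with a small standard deviation."""
    return np.array(x, dtype=float) + rng.normal(0.0, sigma, size=np.shape(x))
```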
The RMSE results after adopting different kinds of noise in the DCCR are shown in Fig. 4, and the MAE results are in Fig. 5. Note that only one kind of noise is added to the DCCR for each experiment.
In both Fig. 4 and Fig. 5, I-DAE and U-DAE are the two branches of our model that separately extract features from the item input and the user input. I-DAE and U-DAE are two important parts of our prediction that ensure that the neural networks learn effectively. The performance of I-DAE is usually better than that of U-DAE. As indicated by the experimental results shown in Fig. 4 and 5, the same noises have similar effects in both I-DAE and U-DAE. The input with Gaussian noise achieves the best result, while mask noise performs better than the input without noise. Salt-and-pepper noise produces inferior predictions. For I-DAE, mask noise and Gaussian noise have similar results most of the time; however, Gaussian noise obtains slightly better results. With an increase in epochs, overfitting occurs. Even though the results with salt-and-pepper noise do not deteriorate and have converged, they are substantially worse than those of the other inputs. For U-DAE, Gaussian noise has a distinct advantage in the results. The best accuracy is attained after 80 epochs.
C. Effects of Regularization and Pre-Train
The impact of different regularization. The regularization term of the loss function can prevent overfitting of the model and affects the performance of the final results. The popular regularization terms are L1 and L2. Dropout is another regularization method specific to deep learning [7], [10]. Dropout randomly prevents some units in the hidden layers from working during the whole training process, which has proven effective in many deep learning models. The loss function usually employs one kind of regularization term at a time. To find the most effective regularization method, in this set of experiments, we compare the performance of the DCCR model with different types of regularization and without regularization. The results are shown in Fig. 6.
Fig. 6. Impact of different regularization terms. The red line demonstrates that the L2 regularization term is better than the others.
From the results in Fig. 6, we find that L2 regularization outperforms L1 regularization. Without a regularization term, the training process has a serious overfitting problem, which gives the model the worst generalization ability. We also evaluate dropout regularization. In shallow neural networks, however, dropout is not appropriate for model training because of the small number of hidden units. The results with dropout are not better than those with L2 regularization but are better than those with no regularization or L1 regularization.
Pretraining technique. The pretraining method can affect the performance of the final results. The objective of the pretraining technique is to initialize the weights and biases of the neural networks by training them in advance, possibly on a different dataset. For instance, most convolutional neural networks for image processing tasks are pretrained on the ImageNet dataset [8]. Similar to the greedy layer-wise training of stacked autoencoders, we pretrain the two branches of the DCCR model separately using the same training dataset. After the pretraining process achieves the optimal results, we train the whole model from the start, which simplifies the adjustment of the weights of the neural network for the ultimate goal. Our experiments in Table 1 show that this approach helps to improve the accuracy of the rating prediction. Without pretraining, we believe that the two branches would affect each other during training, which means that the branch for users or items would not attain the best representation for rating prediction. First, we pretrain these two branches to ensure that the model extracts the latent features well; second, we train the DCCR model starting from the pretrained weights, which improves the model.
D. Effects of Different Activation Functions and Hyperparameters
Each layer of a neural network has an activation function to filter out irrelevant features from the input and focus only on the valuable features. In this paper, we propose a new activation function named R-Relu, which is employed in two locations in our model: the last layer of the DAE part and the last layer of the MLP part. To prove the effectiveness of R-Relu, we compare the results with those of other activation functions, including sigmoid, tanh and Relu, as shown in Fig. 7. The experiments are conducted without any other techniques to isolate the effects of the different activation functions. From the results, we can see that R-Relu outperforms the other activation functions.
The DCCR model contains some hyperparameters, which are vital for the performance of ratings prediction. We manually adjust the hyperparameters in the specified range to show the effects of different parameters.
We focus on the symmetrical DAE in our DCCR model, which is designed with a particular purpose. In this symmetrical DAE, the sizes of the input vector and output vector are fixed; however, the sizes of the hidden layers are not. In many deep learning studies, the number of units in each layer of a model is specified by the researchers and repeatedly validated. In this case, we use a DAE with one hidden layer to test the impacts of different numbers of units. The adjustable range of this value is [100, 250, 500, 750, 1000]. For our experiments, to show the different effects of feature extraction between users and items, we separately test both I-DAE and U-DAE. The results are shown in Fig. 8.
The results of Fig. 8 reveal that the number of hidden layer units can significantly affect the results. The hidden layer with 500 units can achieve the best results compared with the other settings. Either fewer than 500 or more than 500 will yield inferior results.
Second, the loss function contains the coefficient of the regularization term. We evaluate the impact of this parameter through several contrast experiments. The adjustable range of this value is [0.01, 0.1, 1, 10, 100, 1000]. The results are shown in Fig. 9. The coefficient significantly impacts the performance, and the optimal coefficient is sensitive to the dataset. For example, the best coefficient is 10 for MovieLens 1M, while the best coefficient is 100 for MovieLens 10M.
Fig. 9. Impacts of different coefficients of regularization. (A) shows the accuracy results for MovieLens 1M, and (B) shows the results for MovieLens 10M.
A backpropagation algorithm and mini-batch gradient descent are used for our model training. The mini-batch gradient descent is a compromise between the batch gradient descent and stochastic gradient descent. We also apply adaptive moment estimation (Adam) to automatically adjust the learning rate during the training process [33]. We do not discuss other hyperparameters, such as the ratio of corrupt values, due to the limited scope of this paper.
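Tying the pieces together, a minimal training setup could look like the following sketch, reusing the `build_dccr` function from our earlier architecture sketch; the synthetic arrays, batch size and epoch count are stand-ins, not the experimental configuration of the paper.

```python
import numpy as np
import tensorflow as tf

num_users, num_items = 200, 100                     # toy sizes; MovieLens is far larger
user_vecs = np.random.uniform(0, 5, (1024, num_items)).astype("float32")
item_vecs = np.random.uniform(0, 5, (1024, num_users)).astype("float32")
ratings = np.random.uniform(1, 5, (1024, 1)).astype("float32")

model = build_dccr(num_users, num_items)            # defined in the earlier DCCR sketch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
model.fit([user_vecs, item_vecs], ratings,
          batch_size=256, epochs=5, validation_split=0.1)   # mini-batch training with Adam
```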
E. Results From Model Comparison
We compare the performance of traditional methods and several effective methods based on neural networks and show that our model outperforms the other methods. The results are shown in Table 1. The advantages of our model are indicated by the accuracy of the results. Please note that we did not compare the DCCR with knowledge-based and content-based recommenders, since we mainly focus on using the raw ratings alone for rating prediction, while knowledge-based and content-based methods are not suitable for recommendation when only the raw ratings are available. Furthermore, both knowledge-based and content-based methods are mainly used for Top-N recommendation instead of rating prediction.
As shown in Table 1, the first two methods are proposed in [11]. We set the dimension of the latent factor model to 20, the learning rate to 0.01, and the coefficient of the regularization term to 0.02. The remaining methods are based on neural networks. V-CFN and V-CFN++ use stacked denoising autoencoders for rating prediction [25]. AutoSVD and AutoSVD++ combine SVD and SVD++, respectively, with contractive autoencoders, which renders the models more interpretable [26]. In contrast, the next two methods only use the rating data to predict ratings without any other information [13], which is more consistent with our research. The AutoRec-I model achieves better results than the AutoRec-U model when the number of users is larger than the number of items. This finding suggests that more input features can produce more accurate results. For each method, we keep the settings described in the original papers. The remaining four methods are the ones discussed in this paper.
According to the experimental results, our neural-network-based method attains the best accuracy, which demonstrates the effectiveness of the feature extraction and fusion parts of our model. In the feature extraction part, two parallel branches of networks are designed and work independently. In recommendation methods combined with neural networks, autoencoders can predict the ratings from a user vector or an item vector, which produces two kinds of models, for users and for items, respectively. Based on the results, the model with the item vector as input achieves better results than the model with the user vector as input. Typically, the output values of autoencoders are the predicted values; in our study, the dense output vectors are instead treated as the extracted features of users and items. The idea of fusing users’ features and items’ features, which can represent the interactions between users and items, is discussed in many studies. The fusion part is required in top-n recommendation and click-through rate prediction. In this study, we also show the effectiveness of the fusion part in the rating prediction task; to our knowledge, we may be among the first to demonstrate this effectiveness with neural networks.
The previously mentioned steps all contribute to the final results. The accuracy improvement with pretraining is obvious. We pretrain the two branches of the DCCR until the model converges and then use the weights from pretraining as the initial values to adjust all parameters of the DCCR. The last two methods in Table 1 achieve slightly different results because of the pretraining technique.
F. Effects of the Depth of Neural Network
Some researchers have shown that deep neural networks can truly help improve the accuracy of many artificial intelligence tasks, such as image processing. One of the most successful neural networks is ResNet [18], which stacks more than 100 layers to achieve the best results in the ImageNet contests. In our model, we design the basic structure with three hidden layers for the feature extraction part and one layer for the fusion part. A five-layer autoencoder can be designed as shown in Fig. 10 (left) for the DCCR. However, deeper neural networks do not necessarily achieve better results than a single-hidden-layer network [28]. The vanishing gradient is a serious problem in the training process of multilayer neural networks. Vanishing gradients are usually ascribed to the activation functions, such as the sigmoid function [16], uninitialized parameters and inappropriate learning rates. This situation causes network degradation, which means that deeper neural networks learn even less than shallow networks. Without exception, when we simply stack the layers to predict the ratings, we obtain inferior accuracy, as shown in Fig. 10 (right). To alleviate this problem, several approaches exist, such as stacked autoencoders [4], [22], batch normalization [17] and residual learning [18]. Changing the activation function can enable deep neural network training to proceed more smoothly [34]; in our case, however, changing the activation function worsens the prediction. We defer residual learning to future research.
Fig. 10. Multilayer autoencoders for rating prediction (left). Results of multilayer autoencoders (right).
Stacked autoencoders helped to train deep neural networks in the early stages [4] and employ greedy layer-wise training in two steps: first, treat each layer as the hidden layer of an autoencoder and adjust its weights separately; second, use a backpropagation algorithm to optimize the parameters of all layers simultaneously. The first step can be very time-consuming for each layer, and the process is not end-to-end training. This method has largely been replaced by other methods because of its complexity and inefficiency.
Batch normalization has proven to be an effective measure to accelerate the training of deep networks and improve the accuracy without many deep learning techniques, such as lower learning rates and careful parameter initialization [17]. Batch normalization takes a normalization step that fixes the mean and variance of the layer inputs for each mini-batch and has a beneficial impact on the gradient propagation in the neural networks, which can help networks learn better from the data. Fig. 10 (right) shows the results of a 5-layer DAE with BN, which are substantially better than those of the model without BN. We also test a 7-layer model; the results are shown in Fig. 11.
Our experiments show that deep networks can assist rating prediction when the vanishing gradient problem has been alleviated. With a gradual increase in network depth, the training process becomes slower than that of shallow networks, requires more time and has higher hardware requirements. This is a trade-off between accuracy and efficiency, which we defer to future research.
Conclusions
Collaborative filtering has been shown to be effective in commercial recommender systems. By combining with neural networks, CF can represent the latent features of users and items without manual feature engineering. However, most related studies use a single model with a common activation function to perform the rating prediction task without considering the traits of the features and ratings. In this paper, we propose a hybrid neural network model for rating prediction named the deep collaborative conjunctive recommender (DCCR). This model integrates the spirit of several neural networks to separately capture the latent features of users and items and to describe the interactions between these features. Using only the explicit ratings from the data, we design this end-to-end model to improve the accuracy of rating prediction.
Numerous factors affect the prediction performance. Thus, to achieve the optimal model, we evaluate the DCCR with varying factor settings through extensive contrast experiments. The results show that our DCCR model outperforms other state-of-the-art methods on two real-world datasets. We also show that adding layers to the DCCR can have a positive effect on accuracy.