1 Introduction
Graph convolutional networks (GCNs) have become a promising technique in various applications [1], such as recommender systems [2], [3], [4], user profiling [5], [6], and text mining [7]. The main idea of graph convolution is to relate the representations of nodes according to the graph structure, such that connected nodes have similar representations; this can be seen as enforcing a smoothness constraint in the representation space. For example, the standard GCN [8] relates representations layer-wise as \begin{equation*} {\mathbf H}^{(l+1)} = \sigma (\widetilde{{\mathbf A}} {\mathbf H}^{(l)} {\mathbf W}^{(l)}), \tag{1} \end{equation*}
where {\mathbf H}^{(l)} is the node representation matrix at the lth layer, \widetilde{{\mathbf A}} is the normalized graph adjacency matrix, {\mathbf W}^{(l)} is the weight matrix of the lth layer (i.e., the trainable model parameters of the GCN), and \sigma is a nonlinear activation function (e.g., ReLU). The matrix {\mathbf H}^{(0)} stores the input features of the nodes, e.g., the word frequencies of a document node [8]. We term {\mathbf H}^{(0)} {\mathbf W}^{(0)} the initial node representation: it applies a linear transformation to the input features of each node, producing a representation for the follow-up graph convolution operations.
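To make the propagation rule in (1) concrete, the following is a minimal sketch of one graph convolution layer in Python with NumPy. The symmetric normalization with self-loops, \widetilde{{\mathbf A}} = {\mathbf D}^{-1/2}({\mathbf A}+{\mathbf I}){\mathbf D}^{-1/2}, follows the standard GCN [8]; the toy graph, feature dimensions, and the choice of ReLU for \sigma are illustrative assumptions rather than details fixed by this paper.

import numpy as np

def normalize_adjacency(A):
    # Symmetric normalization with self-loops: D^{-1/2} (A + I) D^{-1/2}
    A_hat = A + np.eye(A.shape[0])           # add self-loops
    d = A_hat.sum(axis=1)                    # node degrees of A + I
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2}
    return D_inv_sqrt @ A_hat @ D_inv_sqrt   # normalized adjacency \tilde{A}

def gcn_layer(A_norm, H, W):
    # One propagation step of Eq. (1): H^{(l+1)} = sigma(\tilde{A} H^{(l)} W^{(l)})
    return np.maximum(A_norm @ H @ W, 0.0)   # sigma chosen as ReLU here

# Toy undirected graph with 4 nodes, 3-dimensional input features,
# and a 2-dimensional output representation (all sizes are arbitrary).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
H0 = np.random.randn(4, 3)   # input feature matrix H^{(0)}
W0 = np.random.randn(3, 2)   # trainable weight matrix W^{(0)}

A_norm = normalize_adjacency(A)
H1 = gcn_layer(A_norm, H0, W0)   # node representations after one layer

In this sketch, H0 @ W0 corresponds to the initial node representation {\mathbf H}^{(0)} {\mathbf W}^{(0)} discussed above, and each application of gcn_layer smooths representations over the graph by mixing every node's representation with those of its neighbors.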