Introduction
Tensors, also referred to as multiway arrays, are higher-order generalizations of vectors and matrices and have been adopted in diverse branches of data analysis to represent a wide range of real-world data. Examples of tensor data include grayscale and color video sequences [1]–[4], gene expression data [5], genome-scale signals [6], and magnetic resonance images [7], to cite just a few.
Modeling and classification of such data are important problems in several applications, such as human action and gesture recognition [8], tumor classification [9], spatio-temporal analysis in climatology, geology and sociology [10], neuroimaging data analysis [11], big data representation [12], and completion of big data [13]. To address these problems, most previous works represent a tensor as a vector in a high-dimensional space and apply ordinary learning methods for vectorial data. Representative techniques in this context include feature extraction and selection [14], [15], linear discriminant analysis (LDA) [16], and support vector machines (SVM) [17]. Unfortunately, arranging tensor data into long vectors causes two main problems: i) the loss of the structural information of tensors, and ii) vectors of very high dimensionality. To face these problems, specific tensor learning algorithms that retain the original structure of tensor data have recently been developed [18]–[26]. However, in all these methods the problem of the high dimensionality of data, also known as the curse of dimensionality, remains. To deal with this issue, the classical dimensionality reduction method known as principal component analysis (PCA) [27], [28] was generalized to the second-order case (2-DPCA) [29], to low-rank matrices (GLRAM) [30], and to higher-order cases (MPCA) [31], [32]. Likewise, linear discriminant analysis (LDA) [33] was extended to the 2D case (2-DLDA) [34], [35] and to multilinear discriminant analysis (MDA) [36]. To overcome the limitations of these classical approaches, the decomposition of tensors into low-rank components, using two popular models, namely the Tucker decomposition (TD) [37] and the CANDECOMP/PARAFAC (CP) decomposition [38], has been one of the main tools used to reduce the dimensionality of tensor data [39], [40]. In all the above methods, data are treated as points in a multidimensional space, thus exploiting only the global structure of the dataset. However, many studies have shown that several classes of real-world high-dimensional data lie on a low-dimensional manifold (a parametrized surface), thus exhibiting a local geometric structure.
Manifold learning (ML) is a nonlinear dimensionality reduction (NLDR) technique that, assuming the existence of an intrinsic structure, the manifold, has proven very effective in modeling data with reduced dimensionality [41], [42]. ML, also classified as an embedding method, is based on the assumption that high-dimensional data are embedded in a nonlinear manifold of lower dimension [43]–[50]. In this context several algorithms have been proposed, such as locally linear embedding (LLE) [51], local tangent space alignment (LTSA) [52], local multidimensional scaling (LMDS) [53], and ISOMAP [54].
To deal with high-order tensor data, some of these methods were extended by using multiway data analysis [55] and, in particular, higher-order tensor decompositions [56], [57]. For example, Lai et al. [58] proposed a robust tensor learning method called sparse tensor alignment (STA) for unsupervised tensor feature extraction. Ju et al. [59] introduced a new tensor dimension reduction model based on Bayesian theory; the method assumes that each observation can be represented as a linear combination of some tensor bases, and the CP decomposition and a variational EM algorithm are used to solve the model. He et al. [60] proposed tensor subspace analysis (TSA) for second-order learning. In the method suggested by Jiang et al. [61], given image tensor data, a graph-Laplacian regularization is combined with the Tucker decomposition to account for the manifold structure of the data.
The central objective in ML algorithms is to determine an effective parametrization of data. This is a key issue in order to accurately capture the local geometry of the low-dimensional manifold and the following aspects are relevant to this end: i) nonlinearity, ii) explicit modeling, iii) intrinsic dimension (ID) estimation.
With regard to nonlinearity, data generally have a nonlinear geometric structure; thus using the tangent space at each data point to locally describe its neighborhood, as assumed in some of the previous techniques, is a strong limitation.
With reference to the second aspect, a main drawback of most ML methods is that no explicit mapping representing the local manifold parametrization can be obtained after the training process, as they learn high-dimensional data implicitly.
Regarding ID estimation, the intrinsic dimension may be interpreted as the minimum number of parameters required to describe the data [65]; thus, to derive a low-dimensional model, the ID of the dataset has to be discovered first.
At present, none of the methods suggested so far are able to take into account all the key requirements of nonlinearity, explicit modeling, and ID estimation for low-dimensional tensor modeling.
The aim of this article is to develop a manifold learning-based approach, named principal tensor embedding (PTE), for unsupervised tensor learning that addresses the first two of the aforementioned key points, while adopting the most relevant state-of-the-art methods for ID estimation. This represents an advancement with respect to the state of the art, as standard tensor learning techniques are not able to combine all the relevant aspects previously mentioned. The method is derived by considering a tensor as an element of the finite-dimensional linear space of tensors, in which the inner product defines a metric. Once a basis is computed using the Gram-Schmidt procedure, the coefficient vector in this basis establishes an isomorphism between the space of tensors and a finite-dimensional vector space. In this way the problem of dimensionality reduction in tensor space reduces to dimensionality reduction in vector space. To this end, an effective manifold learning algorithm recently proposed can be used for the parametrization of the data, to accurately capture the local geometry of the low-dimensional manifold that represents them. In such a way a nonlinear model of the data with reduced dimensionality is obtained.
The model establishes an explicit one-to-one correspondence between a tensor, i.e. a point on the manifold embedded in the high-dimensional space, and a vector, i.e. a point in the low-dimensional Euclidean space. Additionally, a relationship is given for the geodesic distance between pairs of points on the manifold, as a nonlinear function of the Euclidean distance between the corresponding points in the low-dimensional space.
The rest of the paper is organized as follows. Section II reviews related work on tensor learning. Section III summarizes our method highlighting the most relevant aspects. Section IV introduces some general concepts on finite dimensional linear space of tensors. Section V gives a representation of a tensor in terms of a basis derived by the tensor version of the Gram-Schmidt procedure. In Section VI a nonlinear dimensionality reduction approach, named principal tensor embedding (PTE), is developed, and the estimation of nonlinearity in such a model is treated in Section VII using a nonparametric kernel regression (NPKR) technique. Experimental results are presented in Section VIII, in which the proposed tensor learning approach is used and compared with some other techniques for data reconstruction and classification problems.
Related Work
Tensor learning techniques have mostly been applied to second-order and third-order tensors. In the following we summarize the most relevant approaches in these two fields.
A. Tensor Learning of 2-Order Tensors
Principal component analysis (PCA) is one of the most common techniques for unsupervised subspace learning; however, when applied to tensor objects it requires reshaping them into high-dimensional vectors (vectorization). As a result, it implies high processing costs in terms of computation and memory.
Multilinear principal component analysis (MPCA) [31] is the multilinear extension of classical PCA and has been used for both second-order and third-order tensors. Recently, a method called graph-Laplacian PCA (gLPCA), which combines a manifold learning tool, i.e. the graph Laplacian, with PCA, has been proposed [66].
The Tucker decomposition (TD) [37] reduces a given tensor to a low-rank core tensor multiplied by a factor matrix along each mode. It is obtained by solving an optimization problem that minimizes the Frobenius distance between the given tensor and its low-rank approximation.
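As an illustration, the following numpy-only sketch computes a Tucker approximation by truncated HOSVD. This is one standard way to obtain a low-multilinear-rank approximation; it is not necessarily the iterative optimization scheme used in the methods cited above, and the tensor sizes and ranks are arbitrary choices.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front, then flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold` for a tensor of the given target shape."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def mode_dot(T, U, mode):
    """Multiply tensor T by matrix U along the given mode."""
    return fold(U @ unfold(T, mode), mode,
                T.shape[:mode] + (U.shape[0],) + T.shape[mode + 1:])

def truncated_hosvd(X, ranks):
    """Tucker approximation via truncated HOSVD (not the iterative HOOI)."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = X
    for mode, U in enumerate(factors):
        core = mode_dot(core, U.T, mode)          # project onto the factors
    return core, factors

# Example: compress a random 32x32x3 tensor to multilinear ranks (8, 8, 3).
X = np.random.rand(32, 32, 3)
core, factors = truncated_hosvd(X, (8, 8, 3))
X_hat = core
for mode, U in enumerate(factors):
    X_hat = mode_dot(X_hat, U, mode)              # map back to the full space
print("relative error:", np.linalg.norm(X - X_hat) / np.linalg.norm(X))
```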
To account for the geometrical (manifold) structure of image tensor data, a technique called graph-Laplacian Tucker tensor decomposition (GLTD), that combines graph-Laplacian with a regularized version of TD, has recently been proposed in [61].
B. Tensor Learning of 3-Order Tensors
Tensor neighborhood preserving embedding (TNPE) and tensor locality preserving projection (TLPP) [67] are tensor embedding techniques. They extend neighborhood preserving embedding (NPE) and locality preserving projection (LPP), which can only work with vectorized representations, to more general tensor representations. More specifically, given a set of tensor data points, these methods learn multilinear projections that preserve the local neighborhood relationships among the data in the embedded space.
Orthogonal tensor neighborhood preserving embedding (OTNPE) [68] is a generalized tensor subspace model similar to TNPE. However, while TNPE cannot ensure that the obtained transformation matrices have orthogonal column vectors, OTNPE derives orthonormal basis tensors.
Sparse tensor alignment (STA) [58] incorporates sparse representation into the tensor alignment (TA) framework, a technique that unifies tensor learning methods.
Unfortunately, none of the aforementioned techniques satisfies the key requirements for accurately capturing the local geometry of tensors embedded in a manifold, i.e. nonlinearity, explicit modeling and ID estimation.
Our Method
In this article we address the problem of unsupervised learning of tensors. In this case one has a set of $N$ unlabeled tensor observations $\Omega = \{ \mathcal {X}^{(1) }, \mathcal {X}^{(2) }, \ldots, \mathcal {X}^{(N)} \}$ and the goal is to derive a model of the data with reduced dimensionality.
Mathematically, this problem is equivalent to determining a nonlinear mapping $\gamma$ such that \begin{equation*} \mathcal {X} = \gamma (\beta ''), \; \beta '' \in U \subset \mathbb {R}^{d}, \; \mathcal {X} \in \mathbb {R}^{I_{1} \times I_{2} \times \ldots \times I_{M}}\tag{1}\end{equation*} where $\beta ''$ is a vector of latent variables and $d$ is much smaller than the dimension of the tensor space.
The main steps of our approach are:
Given the data set $\Omega = \{ \mathcal {X}^{(1) }, \mathcal {X}^{(2) }, \ldots, \mathcal {X}^{(N)} \}$, the Gram-Schmidt procedure is applied to $L$ observations $\{ \hat {\mathcal {X}}^{(1) }, \hat {\mathcal {X}}^{(2) }, \ldots, \hat {\mathcal {X}}^{(L)} \}$ extracted from it, so that an orthonormal basis $\{ \mathcal {U}^{(1) }, \mathcal {U}^{(2) }, \ldots, \mathcal {U}^{(L)} \}$ is obtained.
The generic tensor $\mathcal {X}$ of the set $\Omega$ can then be represented as the summation \begin{equation*} \mathcal {X} = \sum _{i=1}^{L} \alpha _{i} \mathcal {U}^{(i)}, \; L = \prod _{j=1}^{M} I_{j}\tag{2}\end{equation*} where $\alpha = (\alpha _{1}, \alpha _{2}, \ldots, \alpha _{L})^{T}$ is the coefficient vector. Thus, to the dataset $\Omega = \{ \mathcal {X}^{(1) }, \mathcal {X}^{(2) }, \ldots, \mathcal {X}^{(N)} \}$ corresponds a set of $N$ vectors $\{ \alpha ^{(1) }, \alpha ^{(2) }, \ldots, \alpha ^{(N)} \}$ in $\mathbb {R}^{L}$.
We assume these vectors are points sampled from a manifold $\mathcal {M}$ of dimension $d$ embedded in the $L$-dimensional observation space, so that a local parametrization \begin{equation*} \alpha = F(\theta), \; \theta \in \mathbb {R}^{d}, \; \alpha \in \mathbb {R}^{L}\tag{3}\end{equation*} exists, with $d < L$, where $\theta$ is the vector of latent variables and $d$ is the so-called intrinsic dimension.
Since $\theta$ is hidden, i.e. not known, it will be shown that a local parametrization $\gamma$ of the manifold, through a partition $\alpha = \varphi (\alpha '') = \left ({\alpha '', G(\alpha '') }\right)^{T}$ of the vector $\alpha$, can be derived. This is an explicit modeling of the manifold depending on the known variables $\alpha '' \in \mathbb {R}^{d}$, with $d$ = ID. The model is completely defined once $d$ and $G(\cdot)$ have been estimated.
of\sigma _{\alpha } = (\sigma _{\alpha _{1}}, \ldots, \sigma _{\alpha _{L}})^{T} , their elements are put in decreasing order and the corresponding terms\alpha are ordered accordingly. In this way a new representation of the tensor\{ \hat {\mathcal {U}}^{(1) }, \hat {\mathcal {U}}^{(2) }, \ldots, \hat {\mathcal {U}}^{(L)} \} is obtained\mathcal {X} in terms of the new vector\begin{equation*} \mathcal {X} = \sum _{i=1}^{L} \beta _{i} \hat {\mathcal {U}}^{(i)} = \sum _{i=1}^{d} \beta ''_{i} \hat {\mathcal {U}}^{(i)} + \sum _{i>d} G_{i}(\beta '') \hat {\mathcal {U}}^{(i)}\tag{4}\end{equation*} View Source\begin{equation*} \mathcal {X} = \sum _{i=1}^{L} \beta _{i} \hat {\mathcal {U}}^{(i)} = \sum _{i=1}^{d} \beta ''_{i} \hat {\mathcal {U}}^{(i)} + \sum _{i>d} G_{i}(\beta '') \hat {\mathcal {U}}^{(i)}\tag{4}\end{equation*}
such that\beta = (\beta _{1}, \beta _{2}, \ldots, \beta _{L})^{T} . It is worth to notice that as the values\sigma _{\beta _{1}} \geq \sigma _{\beta _{2}} \geq \ldots \geq \sigma _{\beta _{L}} are in decreasing order, then the higher the index\sigma _{\beta _{i}} , the lower the importance of the corresponding component. An overview diagram that explains the main steps of the approach is reported in Fig. 1.i On the basis of this result, a low-dimensional representation of tensor
On the basis of this result, a low-dimensional representation of the tensor $\mathcal {X}$ can simply be obtained by truncating the summation in (4) to the first $r$ terms \begin{equation*} \mathcal {X} \cong \sum _{i=1}^{r} \beta _{i} \, \hat {\mathcal {U}}^{(i)},\tag{5}\end{equation*} thus obtaining a truncation error \begin{equation*} \mathcal {E}_{r} = \sum _{i>r} \sigma _{\beta _{i}}\tag{6}\end{equation*} that decreases as $r$ increases. Following this property the proposed technique has been called principal tensor embedding (PTE).
Assuming the ID has been determined with one of the methods known in the literature, to estimate the function $G(\cdot)$ an effective method for nonparametric input-output nonlinear function regression in tensor space, called nonparametric kernel regression (NPKR), will be used.
It is worth noticing that the proposed approach satisfies nonlinearity, explicit modeling and ID estimation, i.e. all the key aspects of ML, thus representing a real advancement over previous techniques.
The Finite Dimensional Linear Space of Tensors
Let us refer to tensors, regarded as multidimensional arrays and denoted by Euler script calligraphic letters, e.g. $\mathcal {X} \in \mathbb {R}^{I_{1} \times I_{2} \times \ldots \times I_{M}}$, whose generic element is \begin{equation*} x_{i_{1}, i_{2}, \ldots, i_{M}}, \;\; i_{l} = 1,2, \ldots, I_{l}, \;\; l = 1,2, \ldots, M.\tag{7}\end{equation*}
The inner product of two tensors $\mathcal {X}$, $\mathcal {Y}$ of the same size is defined as \begin{equation*} \left \langle{ \mathcal {X}, \mathcal {Y} }\right \rangle = \sum _{i_{1} = 1}^{I_{1}} \ldots \sum _{i_{M} = 1}^{I_{M}} x_{i_{1} i_{2} \ldots i_{M}} y_{i_{1} i_{2} \ldots i_{M}}.\tag{8}\end{equation*}
The corresponding norm is \begin{equation*} \lVert \mathcal {X} \rVert = \sqrt {\left \langle{ \mathcal {X}, \mathcal {X} }\right \rangle }.\tag{9}\end{equation*}
The outer product of the tensors $\mathcal {X} \in \mathbb {R}^{I_{1} \times \ldots \times I_{M}}$ and $\mathcal {Y} \in \mathbb {R}^{J_{1} \times \ldots \times J_{L}}$ is the tensor \begin{equation*} \mathcal {Z} = \mathcal {X} \circ \mathcal {Y}\tag{10}\end{equation*} whose elements are \begin{equation*} z_{i_{1} i_{2} \ldots i_{M} j_{1} j_{2} \ldots j_{L}} = x_{i_{1} i_{2} \ldots i_{M}} \cdot y_{j_{1} j_{2} \ldots j_{L}}.\tag{11}\end{equation*}
In particular, for two vectors $x$ and $y$ the outer product reduces to the matrix with elements \begin{equation*} z_{ij} = x_{i} \, y_{j}.\tag{12}\end{equation*}
With reference to the canonical basis $\{ e_{i_{1}} e_{i_{2}} \ldots e_{i_{M}} \}$, any tensor can be written as \begin{equation*} \mathcal {X} = \sum _{i_{1} = 1}^{I_{1}} \ldots \sum _{i_{M} = 1}^{I_{M}} x_{i_{1} i_{2} \ldots i_{M}} \; e_{i_{1}} e_{i_{2}} \ldots e_{i_{M}}\tag{13}\end{equation*} where, e.g., for $3 \times 3$ matrices the basis elements are \begin{align*} e_{1}~e_{1}=&\left ({\begin{array}{ccc} 1 &\quad 0 & \quad 0 \\ 0 &\quad 0 &\quad 0 \\ 0 &\quad 0 &\quad 0 \end{array} }\right), \; \\ e_{1}~e_{2}=&\left ({\begin{array}{ccc} 0 &\quad 1 &\quad 0 \\ 0 &\quad 0 &\quad 0 \\ 0 &\quad 0 &\quad 0 \end{array} }\right), \; \\ e_{1}~e_{3}=&\left ({\begin{array}{ccc} 0 &\quad 0 &\quad 1 \\ 0 &\quad 0 &\quad 0 \\ 0 &\quad 0 &\quad 0 \end{array} }\right), \; \ldots\tag{14}\end{align*}
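The definitions (8)–(12) translate directly into array operations; the following minimal numpy sketch illustrates them, with arbitrary tensor sizes.

```python
import numpy as np

X = np.random.rand(3, 4, 2)
Y = np.random.rand(3, 4, 2)

# Inner product (8): elementwise product summed over all indices.
inner = np.sum(X * Y)

# Norm (9): square root of the inner product of a tensor with itself
# (the Frobenius norm of the multiway array).
norm_X = np.sqrt(np.sum(X * X))

# Outer product (10)-(11): the result carries the concatenated indices of the
# two operands; for these sizes Z is a 6th-order tensor of shape (3,4,2,3,4,2).
Z = np.multiply.outer(X, Y)

# For two vectors the outer product reduces to (12): z_ij = x_i * y_j.
x, y = np.random.rand(3), np.random.rand(4)
z = np.outer(x, y)
print(inner, norm_X, Z.shape, z.shape)
```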
For ease of reference, Table 1 reports some notation that will be frequently used in the following.
A Basis from Data
One of the main problems in representing elements of a linear space is to find a basis. For vectors $v \in \mathbb {R}^{L}$ a basis can be derived from the correlation matrix \begin{equation*} R_{vv} = E \{ v v^{T} \}\tag{15}\end{equation*} which can be estimated from a data matrix $V = [v^{(1) }, \ldots, v^{(N)}]$ as \begin{equation*} R_{vv} \cong \frac {1}{N} V V^{T}\tag{16}\end{equation*} and whose eigenvectors, given by the decomposition \begin{equation*} R_{vv} = U \Sigma U^{T}\tag{17}\end{equation*} form an orthonormal basis (the PCA basis). Extending this approach to tensors would require the correlation tensor \begin{equation*} R_{\chi \chi } = E \{ \mathcal {X} \circ \mathcal {X} \}\tag{18}\end{equation*} whose decomposition is not as straightforward as in the matrix case.
A more effective approach for this purpose can be derived using the well-known Gram-Schmidt procedure. Extracting from a given dataset $\Omega$ a subset of $L$ linearly independent tensors $\{ \hat {\mathcal {X}}^{(1) }, \ldots, \hat {\mathcal {X}}^{(L)} \}$, the procedure computes \begin{align*} 1.\quad \mathcal {Y}^{(1) }=&\hat {\mathcal {X}}^{(1) }, \;\; \mathcal {U}^{(1) } = \frac {\mathcal {Y}^{(1) }}{\lVert \mathcal {Y}^{(1) } \rVert } \\ 2.\quad \mathcal {Y}^{(2) }=&\hat {\mathcal {X}}^{(2) } - \left \langle{ \hat {\mathcal {X}}^{(2) }, \mathcal {U}^{(1) } }\right \rangle \mathcal {U}^{(1) }, \;\; \mathcal {U}^{(2) } = \frac {\mathcal {Y}^{(2) }}{\lVert \mathcal {Y}^{(2) } \rVert } \\ k.\quad \mathcal {Y}^{(k)}=&\hat {\mathcal {X}}^{(k)} - \sum _{i=1}^{k-1} \left \langle{ \hat {\mathcal {X}}^{(k)}, \mathcal {U}^{(i)} }\right \rangle \mathcal {U}^{(i)}, \;\; \mathcal {U}^{(k)} = \frac {\mathcal {Y}^{(k)}}{\lVert \mathcal {Y}^{(k)} \rVert }. \\\tag{19}\end{align*}
The resulting tensors are orthonormal, i.e. \begin{align*} \left \langle{ \mathcal {U}^{(i)}, \mathcal {U}^{(j)} }\right \rangle = \begin{cases} 1, \, i = j \\ 0, \, i \neq j \\ \end{cases}\tag{20}\end{align*} and form a basis of the tensor space, so that any tensor can be represented as \begin{equation*} \mathcal {X} = \sum _{i=1}^{L} \alpha _{i} \mathcal {U}^{(i)}, \;\; L = I_{1} \cdot I_{2} \cdot \ldots \cdot I_{M}.\tag{21}\end{equation*}
The coefficient vector $\alpha$ thus establishes a one-to-one correspondence (isomorphism) \begin{equation*} \xi: \; \alpha \in \mathbb {R}^{L} \Longleftrightarrow \mathcal {X} = \sum _{i=1}^{L} \boldsymbol {\alpha _{i}} \mathcal {U}^{(i)}\tag{22}\end{equation*} or, in compact form, \begin{equation*} \xi: \; \alpha \in \mathbb {R}^{L} \Longleftrightarrow \boldsymbol {\alpha } \in \boldsymbol {H}\tag{23}\end{equation*} where $\boldsymbol {H}$ denotes the linear space of tensors.
In particular, to each tensor of the dataset $\Omega$ corresponds a coefficient vector \begin{equation*} \alpha ^{(k)} = (\alpha ^{(k)}_{1}, \ldots, \alpha ^{(k)}_{L})^{T}, \; k = 1, \ldots, N\tag{24}\end{equation*} with components \begin{equation*} \alpha ^{(k)}_{i} = \left \langle{ \mathcal {U}^{(i)}, \mathcal {X}^{(k)} }\right \rangle, \; i = 1, \ldots, L, \; k = 1, \ldots, N.\tag{25}\end{equation*}
Collecting these vectors in the data matrix $A$ gives the correspondence \begin{equation*} A = [\alpha ^{(1) }, \ldots, \alpha ^{(N)}] \Longleftrightarrow \Omega = \{ \mathcal {X}^{(1) }, \ldots, \mathcal {X}^{(N)} \}\tag{26}\end{equation*} i.e. the set of pairs \begin{equation*} \{ A | \Omega \} = \{ \alpha ^{(k)} | \, \mathcal {X}^{(k)}, \; k = 1, \ldots, N\}.\tag{27}\end{equation*}
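As a minimal illustration of (19) and (25)–(26), the following numpy sketch builds an orthonormal basis from a toy dataset of random tensors and computes the coefficient matrix $A$; the data, the tensor size and the tolerance are illustrative choices.

```python
import numpy as np

def inner(X, Y):
    """Tensor inner product (8)."""
    return np.sum(X * Y)

def gram_schmidt(tensors, tol=1e-10):
    """Orthonormal basis (19) from a list of (linearly independent) tensors."""
    basis = []
    for X in tensors:
        Y = X - sum(inner(X, U) * U for U in basis)   # remove projections
        n = np.sqrt(inner(Y, Y))
        if n > tol:                                    # skip dependent tensors
            basis.append(Y / n)
    return basis

# Toy dataset of N random 4x4x2 tensors; L = 4*4*2 = 32 gives a complete basis,
# built here from the first L observations.
N, shape = 100, (4, 4, 2)
L = int(np.prod(shape))
Omega = [np.random.rand(*shape) for _ in range(N)]
U_basis = gram_schmidt(Omega[:L])

# Coefficient vectors (25): alpha_i^(k) = <U^(i), X^(k)>; data matrix A (26).
A = np.array([[inner(U, X) for X in Omega] for U in U_basis])   # L x N
print(A.shape)                                                   # (32, 100)
```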
Local Parametrization of Data
On the basis of the previous results, a generic tensor $\mathcal {X}$ of the dataset $\Omega$ can be represented as \begin{equation*} \mathcal {X} = \sum _{i=1}^{L} \alpha _{i} \, \mathcal {U}^{(i)}\tag{28}\end{equation*} where the coefficients are given by (25).
To reduce the dimensionality of the above representation, a Manifold Learning (ML) approach will be used. In this context the problem can be formalized as follows.
To the dataset $\Omega$ corresponds, through (26), a set of coefficient vectors that are assumed to be samples of a random vector $\boldsymbol {\alpha }$ lying on a manifold $\mathcal {M}$ of dimension $d$, so that a local parametrization \begin{equation*} \boldsymbol {\alpha } = F(\boldsymbol {\theta }), \; \boldsymbol {\theta } \in \mathbb {R}^{d}, \; \boldsymbol {\alpha } \in \mathbb {R}^{L}\tag{29}\end{equation*} exists, where $\boldsymbol {\theta }$ is the vector of (hidden) latent variables.
From differential geometry [69], [70] it can be shown (see the Appendix) that, assuming the values of $\boldsymbol {\alpha }$ lie on such a manifold, a local parametrization depending only on a partition of $\boldsymbol {\alpha }$ can be derived \begin{align*}&\hspace {-.5pc} \boldsymbol {\alpha } = \varphi (\boldsymbol {\alpha ''}) = (\boldsymbol {\alpha ''}, G(\boldsymbol {\alpha ''}))^{T} = (\boldsymbol {\alpha ''}, \boldsymbol {\alpha '})^{T}, \\&\boldsymbol {\alpha ''} \in \mathbb {R}^{d}, \; \boldsymbol {\alpha '} \in \mathbb {R}^{m}, \; m = L-d\tag{30}\end{align*} where $G(\cdot)$ is an unknown nonlinear function.
Comparing (29) and (30), it clearly results that the known vector $\boldsymbol {\alpha ''}$ plays the role of the latent variables. Accordingly, the data matrix $A$ can be partitioned as \begin{align*} A = \left [{ \begin{array}{c} A'' \\ A' \\ \end{array} }\right], \; A'' \in \mathbb {R}^{d \times N}, \; A' \in \mathbb {R}^{m \times N}\tag{31}\end{align*}
A. Principal Tensor Embedding
Since $\boldsymbol {\alpha }$ is a random vector, its mean can be estimated from the data as \begin{equation*} \mu _{\alpha } \triangleq E \{ \boldsymbol {\alpha } \} = \frac {1}{N} \sum _{k=1}^{N} \alpha ^{(k)} \in \mathbb {R}^{L \times 1}\tag{32}\end{equation*} and its variance vector as \begin{align*} \sigma _{\alpha }\triangleq&\textit {diag} \, E \{ (\boldsymbol {\alpha } - \mu _{\alpha }) (\boldsymbol {\alpha } - \mu _{\alpha })^{T} \} \\=&\textit {diag} \, \left [{ \frac {1}{N} \sum _{k=1}^{N} (\alpha ^{(k)} - \mu _{\alpha }) (\alpha ^{(k)} - \mu _{\alpha })^{T} }\right] \in \mathbb {R}^{L \times 1}.\tag{33}\end{align*}
Each variance $\sigma _{\alpha _{i}}$ is associated with a basis tensor, i.e. \begin{equation*} \{ \sigma _{\alpha } | \, \mathcal {U} \} = \{ \sigma _{\alpha _{i}} | \, \mathcal {U}^{(i)}, \; i = 1, \ldots, L\}.\tag{34}\end{equation*}
By sorting the variances in decreasing order and reordering the basis accordingly, a new set of pairs \begin{equation*} \{ \sigma _{\beta } | \, \hat {\mathcal {U}} \}\tag{35}\end{equation*} is obtained, with $\sigma _{\beta _{1}} \geq \sigma _{\beta _{2}} \geq \ldots \geq \sigma _{\beta _{L}}$.
In the reordered basis, each tensor of the dataset can be written as \begin{equation*} \mathcal {X}^{(k)} = \sum _{i=1}^{L} \beta ^{(k)}_{i} \, \hat {\mathcal {U}}^{(i)}, \; k = 1, \ldots, N\tag{36}\end{equation*} with coefficients \begin{equation*} \beta ^{(k)}_{i} = \left \langle{ \hat {\mathcal {U}}^{(i)}, \mathcal {X}^{(k)} }\right \rangle, \quad i = 1, \ldots,L, \; k = 1, \ldots, N\tag{37}\end{equation*} giving the pairs \begin{equation*} \{ \beta ^{(k)} | \, \hat {\mathcal {U}} \}, \quad k = 1, \ldots, N\tag{38}\end{equation*} and, collecting the vectors $\beta^{(k)}$ in the data matrix $B = [\beta ^{(1) }, \ldots, \beta ^{(N)}]$, the correspondence \begin{equation*} \{ B | \Omega \} = \{ \beta ^{(k)} | \, \mathcal {X}^{(k)}, \; k = 1, \ldots, N \}.\tag{39}\end{equation*}
As in (30), the vector $\boldsymbol {\beta }$ can be partitioned as \begin{equation*} \boldsymbol {\beta } = (\boldsymbol {\beta ''}, \boldsymbol {\beta '})^{T}\tag{40}\end{equation*} with $\boldsymbol {\beta ''} \in \mathbb {R}^{d}$ and $\boldsymbol {\beta '} \in \mathbb {R}^{m}$.
To motivate the reordering, consider the truncation of the representation (28) to the first $r$ terms \begin{equation*} \mathcal {X} = \sum _{i=1}^{r} \alpha _{i} \, \mathcal {U}^{(i)} + \sum _{i>r} \alpha _{i} \, \mathcal {U}^{(i)} = \mathcal {X}_{r} + \mathcal {N}\tag{41}\end{equation*} where $\mathcal {N}$ is the neglected part. The corresponding mean square truncation error is \begin{align*} \mathcal {E}_{r}=&E \{ \langle \mathcal {N}, \mathcal {N} \rangle \} = E \{ \langle \mathcal {X} - \mathcal {X}_{r}, \mathcal {X} - \mathcal {X}_{r} \rangle \} \\=&E \left\{{ \sum _{i>r} \boldsymbol {\alpha ^{2}_{i}} }\right\} = \sum _{i>r} \sigma _{\alpha _{i}}\tag{42}\end{align*} which, for a given $r$, is minimized when the terms with the largest variances are retained.
With reference to the new partition (40), the local parametrization of the corresponding data becomes \begin{equation*} \boldsymbol {\beta } = \varphi (\boldsymbol {\beta ''}) = (\boldsymbol {\beta ''}, G(\boldsymbol {\beta ''}))^{T} = (\boldsymbol {\beta ''}, \boldsymbol {\beta '})^{T}, \; \boldsymbol {\beta ''} \in \mathbb {R}^{d}\tag{43}\end{equation*} so that each tensor of the dataset can be written as \begin{align*} \mathcal {X}^{(k)}=&\sum _{i=1}^{d} \beta ''^{(k)}_{i} \hat {\mathcal {U}}^{(i)} + \sum _{i>d} G_{i}(\beta ''^{(k)}) \hat {\mathcal {U}}^{(i)}, \; k = 1, \ldots, N \\ \beta ''^{(k)}_{i}=&\langle \hat {\mathcal {U}}^{(i)}, \mathcal {X}^{(k)} \rangle, \; i = 1, \ldots, d, \; k = 1, \ldots, N.\tag{44}\end{align*}
Accordingly, the data matrix $B$ is partitioned as \begin{align*} B = \left [{ \begin{array}{c} B'' \\ B' \\ \end{array} }\right], \; B'' \in \mathbb {R}^{d \times N}, \; B' \in \mathbb {R}^{m \times N}\tag{45}\end{align*} with the correspondence \begin{equation*} B'' = [\beta ''^{(1) }, \ldots, \beta ''^{(N)}] \Longleftrightarrow \Omega = \{ \mathcal {X}^{(1) }, \ldots, \mathcal {X}^{(N)} \}\tag{46}\end{equation*} i.e. \begin{equation*} \{ B'' | \, \Omega \} = \{\beta ''^{(k)} | \, \mathcal {X}^{(k)}, \; k = 1, \ldots, N \}.\tag{47}\end{equation*}
In conclusion, the generic tensor $\mathcal {X}$ is explicitly modeled as a nonlinear function of the low-dimensional vector $\boldsymbol {\beta ''}$ \begin{align*} \mathcal {X} = \gamma (\boldsymbol {\beta ''}) = \sum _{i=1}^{d} \boldsymbol {\beta ''_{i}} \hat {\mathcal {U}}^{(i)} + \sum _{i>d} \boldsymbol {\beta '_{i}}(\boldsymbol {\beta ''}) \hat {\mathcal {U}}^{(i)}, \; \boldsymbol {\beta '} = G(\boldsymbol {\beta ''}). \\\tag{48}\end{align*}
On the basis of the previous results, a low-dimensional representation of the tensor $\mathcal {X}$ is obtained by truncating (48) to the first $r$ terms \begin{equation*} \mathcal {X} \cong \sum _{i=1}^{r} \beta _{i} \, \hat {\mathcal {U}}^{(i)}\tag{49}\end{equation*} with a truncation error \begin{equation*} \mathcal {E}_{r} \cong \sum _{i>r} \sigma _{\beta _{i}}.\tag{50}\end{equation*}
As the terms in (49) correspond to the most important components in the representation of the tensor, the proposed technique has been called principal tensor embedding (PTE).
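A sketch of the reordering (33)–(35) and of the truncation (49)–(50) is given below. The function assumes the coefficient matrix, the basis and the dataset produced, e.g., by the Gram-Schmidt sketch of Section V, and returns the truncated tensors together with the empirical RMSE and the neglected variance.

```python
import numpy as np

def pte_truncate(A, U_basis, Omega, r):
    """Reorder the basis by decreasing coefficient variance ((33)-(35)) and
    return the r-term truncation (49), its RMSE, and the neglected variance,
    which approximates the error (50).
    A: L x N coefficient matrix, U_basis: list of L basis tensors,
    Omega: list of N data tensors."""
    var_alpha = A.var(axis=1)                    # one variance per coefficient
    order = np.argsort(var_alpha)[::-1]          # decreasing variance
    B = A[order, :]                              # reordered data matrix
    U_hat = [U_basis[i] for i in order]          # reordered basis
    X_r = [sum(B[i, k] * U_hat[i] for i in range(r)) for k in range(len(Omega))]
    rmse = np.sqrt(np.mean([np.sum((X - Xr) ** 2) for X, Xr in zip(Omega, X_r)]))
    return X_r, rmse, var_alpha[order][r:].sum()
```

Called with the coefficient matrix, basis and dataset from the previous sketch, the function lets one check how the actual reconstruction error tracks the neglected variance as $r$ grows.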
B. The Metric of the Manifold \mathcal{M}
A relationship between the metric on the manifold $\mathcal {M}$ and the Euclidean metric in the low-dimensional space can now be derived. Since the model (48) establishes a one-to-one correspondence between a tensor and its low-dimensional coordinates, a distance between two tensors can be defined as \begin{align*} d(\mathcal {X}^{(i)}, \mathcal {X}^{(j)}) \!= \!\lVert \beta ''^{(i)} - \beta ''^{(j)} \rVert \!=\! \lVert \gamma ^{-1} (\mathcal {X}^{(i)}) - \gamma ^{-1} (\mathcal {X}^{(j)}) \rVert \\\tag{51}\end{align*} while the distance in the tensor space is \begin{equation*} \lVert \mathcal {X}^{(i)} - \mathcal {X}^{(j)} \rVert = \lVert \gamma (\beta ''^{(i)}) - \gamma (\beta ''^{(j)}) \rVert.\tag{52}\end{equation*}
Expanding $\gamma$ in a Taylor series around a point $\beta ''_{0}$ \begin{align*} \mathcal {X} = \gamma (\beta '') = \gamma (\beta ''_{0}) + J(\gamma) (\beta '' - \beta ''_{0}) + o(\lVert \beta '' - \beta _{0}'' \rVert) \\\tag{53}\end{align*} where $J(\gamma)$ is the Jacobian of $\gamma$, and setting $\beta '' = \beta ''^{(i)}$, $\beta ''_{0} = \beta ''^{(j)}$, the distance (52) becomes \begin{equation*} \lVert \mathcal {X}^{(i)} - \mathcal {X}^{(j)} \rVert \!=\! \lVert J(\gamma) (\beta '' - \beta ''_{0}) \!+\! o(\lVert \beta '' - \beta ''_{0} \rVert) \rVert\tag{54}\end{equation*} which shows that the distance between tensors on the manifold is a nonlinear function of the Euclidean distance between the corresponding points in the low-dimensional space.
Nonparametric Kernel Regression (NPKR)
Assuming the ID of the dataset, i.e. the dimension $d$, has been estimated with one of the methods available in the literature, the remaining step for the complete definition of the model (48) is the approximation of the nonlinear function $G(\cdot)$.
In this context an effective method for nonparametric input-output nonlinear function regression in tensor space has been recently proposed [76]. The method can be summarized as follows. Given any continuous and bounded function $f(x)$, $x \in I \subset \mathbb {R}^{n}$, consider its convolution with a kernel $k_{m}(\cdot)$ \begin{equation*} f_{m}(x) = k_{m} * f(x) = \int _{I} f(t) k_{m}(x-t) dt\tag{55}\end{equation*} where $k_{m}(\cdot)$ can be, for example, the polynomial kernel defined as \begin{align*} k_{m}(x) = \begin{cases} \dfrac {(1 - \lVert x \rVert ^{2})^{m}}{C_{m}}, & \lVert x \rVert \leq 1 \\ 0, & \lVert x \rVert > 1 \end{cases}\tag{56}\end{align*} where $\lVert x \rVert = (x^{T} x)^{1/2}$ is the norm of $x$ and $C_{m}$ is a normalization factor given by \begin{equation*} C_{m} = \int _{I} (1 - \lVert t \rVert ^{2})^{m} dt\tag{57}\end{equation*} or the Gaussian kernel defined as \begin{equation*} k_{m}(x) = \frac {m^{n}}{(2 \pi)^{n/2}} \exp \left ({- \frac {1}{2} m^{2} \lVert x \rVert ^{2} }\right).\tag{58}\end{equation*}
As a consequence of property (55) we have \begin{equation*} f(x) \cong f_{m}(x), \; \text {for} \;\; m \gg 1\tag{59}\end{equation*} which can be considered as a universal approximating relationship. The main issue of (59) is that it requires calculating the integral on the right-hand side of (55). To overcome this problem, suppose we want to compute the mean value of the $n$-dimensional function $g(t)$, $t=(t^{1},\ldots,t^{n})$, in the interval $I$ \begin{equation*} E(g(t)) = \int _{I} g(t) p(t) dt\tag{60}\end{equation*} where $t$ is a realization of the random variable (r.v.) $\boldsymbol {t} = (t^{1}, \ldots, t^{n})$ with probability density function (pdf) $p(t)$, and $E(\cdot)$ denotes the expected value. To numerically solve the integral in (60) a Monte Carlo integration technique [77], [78] can be used as follows. Selecting at random $N$ points $t_{1}, \ldots, t_{N}$ sampled from the pdf $p(t)$, the Monte Carlo approximation of (60) is \begin{equation*} E(g(t)) \cong \frac {1}{N} \sum _{i=1}^{N} g(t_{i}).\tag{61}\end{equation*}
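A quick numerical check of (60)–(61), with an arbitrary illustrative choice of $g$ and $p$: the mean of $g(t) = t^{2}$ under a standard normal density is 1, and the sample average over draws from that density approaches this value.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.standard_normal(100_000)        # N samples from the density p(t)
print(np.mean(t ** 2))                  # Monte Carlo estimate of E[g(t)], ~1.0
```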
By applying this approach to the convolution $g_{m}(x) = k_{m} * g(x)$, where $g(x) = f(x) p(x)$, we have \begin{align*} g_{m}(x)=&k_{m} * g(x) = E(f(t) k_{m}(x-t)) \\\cong&\frac {1}{N} \sum _{i=1}^{N} f(t_{i}) k_{m}(x - t_{i})\tag{62}\end{align*} and, in particular, for $f(x) = 1(x) = \{ 1 | x \in I\}$, \begin{align*} 1_{m}(x) = k_{m} * p(x) = E(k_{m}(x-t)) \cong \frac {1}{N} \sum _{i=1}^{N} k_{m} (x-t_{i}). \\\tag{63}\end{align*}
Combining (62) and (63) we finally get \begin{equation*} f(x) \cong f_{m}(x) = \frac {\sum _{i=1}^{N} f(t_{i}) k_{m}(x - t_{i})}{\sum _{i=1}^{N} k_{m}(x - t_{i})}.\tag{64}\end{equation*}
The approximation (64) is a nonparametric model of the function $f(x)$, as it only depends on the observations $f(t_{i})$ and not on parameters to be estimated. The method described above, named nonparametric kernel regression (NPKR), can be used for the approximation of the function $G(\cdot)$ in (48), thus giving the relationship \begin{equation*} G(\beta '') \cong G_{m}(\beta '') = \frac {\sum _{k=1}^{N} \beta ^{'(k)} \, k_{m} (\beta '' - \beta ''^{(k)})}{\sum _{k=1}^{N} k_{m} (\beta '' - \beta ''^{(k)})}\tag{65}\end{equation*} where $k_{m}(\cdot)$ is a given kernel function, $\beta ''^{(k)}$ and $\beta '^{(k)}$ are the input and output training points respectively, and $\beta ''$ is a testing point, chosen from the data matrix $B'' = [\beta ''^{(1) }, \ldots, \beta ''^{(N)}]$. It is worth noticing that (65) has the same form as the well-known Nadaraya-Watson nonparametric kernel estimator [79]–[81], which was proposed for the estimation of the regression function of data $(x_{1}, y_{1}), \ldots, (x_{n}, y_{n})$ sampled from a population having density $f(x,y)$, thus giving a link between the two theories.
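A minimal sketch of the estimator (64)–(65) with the Gaussian kernel (58); the training pairs, the kernel parameter m and the test point are illustrative choices, and the toy target reproduces the nonlinearity (67) used later in the experiments.

```python
import numpy as np

def gaussian_kernel(x, m):
    """Gaussian kernel (58) up to the constant factor, which cancels in (64)."""
    return np.exp(-0.5 * (m ** 2) * np.sum(x ** 2, axis=-1))

def npkr(beta2_train, beta1_train, beta2_test, m=5.0):
    """Nadaraya-Watson form of (65): weighted average of the training outputs.
    beta2_train: N x d inputs, beta1_train: N x (L-d) outputs, beta2_test: d."""
    w = gaussian_kernel(beta2_test - beta2_train, m)        # one weight per sample
    return (w[:, None] * beta1_train).sum(axis=0) / w.sum()

# Toy example: learn G(u) = u1^3 - 3*u1*u2^2 from samples (cf. (67)).
rng = np.random.default_rng(1)
U = rng.uniform(-1, 1, size=(500, 2))                       # beta'' training points
Y = (U[:, 0] ** 3 - 3 * U[:, 0] * U[:, 1] ** 2)[:, None]    # beta' = G(beta'')
print(npkr(U, Y, np.array([0.3, 0.2])), 0.3**3 - 3*0.3*0.2**2)
```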
Experiments
The experiments for the validation of the proposed tensor learning approach address two different problems, namely data reconstruction and classification.
Data reconstruction aims at reconstructing the original dataset $\Omega$ from its low-dimensional representation, so as to assess the modeling capability of the proposed approach.
A pseudo-code of the algorithm used for the estimation of the model from the dataset $\Omega$ is reported in Algorithm 1.
Algorithm 1 PTE
Input: dataset $\Omega = \{ \mathcal {X}^{(1) }, \ldots, \mathcal {X}^{(N)} \}$
1. Extract $L$ tensors $\{ \hat {\mathcal {X}}^{(1) }, \ldots, \hat {\mathcal {X}}^{(L)} \}$ from $\Omega$
2. Compute a basis $\{ \mathcal {U}^{(i)} \}$ by the Gram-Schmidt procedure (19)
3. Compute the data matrix $A$ by (25)–(26)
4. Determine the reordered basis $\{ \hat {\mathcal {U}}^{(i)} \}$ and the data matrix $B$ from the variance vector (33)
5. Estimate the intrinsic dimension $d$ and partition $B$ as in (45)
6. Approximate $G(\cdot)$ by the NPKR relationship (65)
Output: the model (48) of dataset $\Omega$
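A compact end-to-end sketch of Algorithm 1 on a toy dataset, combining the steps illustrated in the previous sections (Gram-Schmidt basis, variance reordering, NPKR approximation of $G$). The random data, the intrinsic dimension d and the kernel parameter m are illustrative choices, and the ID estimation step is omitted.

```python
import numpy as np

def inner(X, Y): return np.sum(X * Y)

def pte_fit(Omega, d, m=5.0):
    """Sketch of Algorithm 1 (PTE): basis, reordering, explicit model (48).
    Omega: list of N tensors of equal shape; d: intrinsic dimension;
    m: Gaussian-kernel parameter for the NPKR approximation of G."""
    L = int(np.prod(Omega[0].shape))
    # Steps 1-2: Gram-Schmidt basis (19) from the first L observations.
    basis = []
    for X in Omega[:L]:
        Y = X - sum(inner(X, U) * U for U in basis)
        n = np.sqrt(inner(Y, Y))
        if n > 1e-10:
            basis.append(Y / n)
    # Step 3: coefficient matrix A (25)-(26).
    A = np.array([[inner(U, X) for X in Omega] for U in basis])
    # Step 4: reorder by decreasing variance (33)-(35).
    order = np.argsort(A.var(axis=1))[::-1]
    B, U_hat = A[order, :], [basis[i] for i in order]
    B2, B1 = B[:d, :], B[d:, :]                   # partition (45): B'' and B'
    def gamma(beta2):                             # explicit model (48), NPKR (65)
        w = np.exp(-0.5 * m**2 * np.sum((beta2[:, None] - B2) ** 2, axis=0))
        beta1 = (B1 * w).sum(axis=1) / w.sum()    # G(beta'') estimated from data
        coeffs = np.concatenate([beta2, beta1])
        return sum(c * U for c, U in zip(coeffs, U_hat))
    return U_hat, B2, gamma

# Toy usage: reconstruct one observation from its d leading coefficients.
rng = np.random.default_rng(0)
Omega = [rng.random((4, 4, 2)) for _ in range(200)]
U_hat, B2, gamma = pte_fit(Omega, d=5)
X_rec = gamma(B2[:, 0])
print("reconstruction error:", np.sqrt(np.sum((Omega[0] - X_rec) ** 2)))
```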
As far as the extraction of the $L$ tensors used in the Gram-Schmidt procedure is concerned, any subset of $L$ linearly independent observations of $\Omega$ can be used.
In order to visually assess the quality of Algorithm 1, a simple experiment on a synthetic dataset was preliminarily performed. To this end, data $\mathcal{X} \in \mathbb{R}^{3}$ were generated according to the model \begin{equation*} \mathcal {X} = \alpha ''_{1} \phi ^{(1) } + \alpha ''_{2} \phi ^{(2) } + G(\alpha ''_{1}, \alpha ''_{2}) \phi ^{(3) }\tag{66}\end{equation*} where the nonlinearity is \begin{equation*} \alpha ' = G(\alpha ''_{1}, \alpha ''_{2}) = (\alpha ''_{1})^{3} - 3 \alpha ''_{1} (\alpha ''_{2})^{2}\tag{67}\end{equation*} and \begin{align*} \phi ^{(1) }=&\left({\begin{array}{lll} 1 &\quad 0 &\quad 0 \end{array} }\right)^{T}, \\ \phi ^{(2) }=&\left({\begin{array}{lll} 0 &\quad 1 &\quad 0 \end{array} }\right)^{T}, \, \phi ^{(3) } = \left({\begin{array}{lll} 0 &\quad 0 &\quad 1 \end{array} }\right)^{T}\tag{68}\end{align*} are the canonical basis vectors of $\mathbb {R}^{3}$.
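A sketch of the synthetic data generation (66)–(68); the number of samples and the sampling range of the latent variables $\alpha''$ are illustrative choices not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
a1, a2 = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)   # latent variables alpha''
a3 = a1 ** 3 - 3 * a1 * a2 ** 2                          # nonlinearity G (67)
# With the canonical basis (68), the model (66) simply stacks the three
# coordinates: each sample lies on a 2-D surface embedded in R^3.
X = np.stack([a1, a2, a3], axis=1)                       # N x 3 dataset
print(X.shape)
```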
In all the experiments a Gaussian kernel was used for the regression of the function $G(\cdot)$.
Classification aims at assigning the data of a given dataset to one of a set of predefined classes, using the low-dimensional features extracted by the model.
A. Data Reconstruction
The capability of the proposed method to model data with reduced dimensionality has been validated by three experiments conducted on different datasets, namely the CIFAR-10, RGB-D Object and AT&T Faces datasets.
1) Experiment on CIFAR-10 (3D-Tensor)
The CIFAR-10 dataset [82], [83] consists of 60000 color images of size $32 \times 32$, divided into 10 classes with 6000 images per class; 50000 images form the training set and 10000 the test set.
In this experiment, we use the 50000 RGB images of the training set for image reconstruction by the model (48)–(65). For this purpose, the data have been organized in a 3D-tensor dataset $\Omega$ of tensors $\mathcal {X}^{(k)} \in \mathbb {R}^{32 \times 32 \times 3}$.
Vectors $\beta^{(k)}$ were then computed from the data according to (36)–(37).
To estimate the intrinsic dimension $d$ of the dataset, several ID estimators known in the literature were applied.
Fig. 5 shows a set of original images and the corresponding images reconstructed with the model (48)–(65), demonstrating the validity of the approach.
Comparison of the original images extracted from the CIFAR-10 dataset and the corresponding reconstructed images with the proposed approach.
In order to study how robust the NPKR method is with respect to hyperparameters, the sensitivity of RMSE and PSNR (peak signal-to-noise ratio) to the dimension $d$ has been analyzed, where \begin{equation*} PSNR = 20 \log _{10} \left ({\frac {max(f)}{\sqrt {MSE}} }\right)\tag{69}\end{equation*} with $max(f)$ the maximum pixel value and MSE the mean square reconstruction error.
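For reference, a minimal computation of (69) on an illustrative 8-bit image pair (the arrays are random stand-ins for an original and a reconstructed image).

```python
import numpy as np

original = np.random.randint(0, 256, (32, 32, 3)).astype(float)
reconstructed = original + np.random.normal(0, 2.0, original.shape)

mse = np.mean((original - reconstructed) ** 2)
psnr = 20 * np.log10(original.max() / np.sqrt(mse))     # definition (69)
print(f"PSNR = {psnr:.1f} dB")
```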
Sensitivity of RMSE to the dimension $d$.
Sensitivity of PSNR to the dimension $d$.
2) Experiment on RGB-D Object Dataset (4D-Tensor)
The RGB-D Object Dataset [89] contains 300 objects divided into 51 categories. For each object the dataset provides a number of images ranging from a minimum of 506 to a maximum of 852, for a total of 207920 frames. In this experiment we used the subset Cropped RGB and depth images with object segmentation masks [90], which contains the cropped RGB-D frames that tightly include the object as it is spun around on a turntable. We used all the 207920 images, resized to a common size.
Vectors $\beta^{(k)}$ were then computed from the data according to (36)–(37).
Table 4 reports the results obtained with the same ID estimators used in Experiment VIII-A1. Choosing an intrinsic dimension consistent with these estimates, the model (48)–(65) was estimated and used for reconstruction.
Figs. 9–13 show the 5 frames that composed the original videos and the corresponding reconstructed frames obtained with the proposed approach, demonstrating the validity of this approach.
Comparison of the 5 frames (extracted from the RGB-D Object Dataset) that composed the original video V1 and the corresponding reconstructed frames obtained with the proposed approach.
Comparison of the 5 frames (extracted from the RGB-D Object Dataset) that composed the original video V2 and the corresponding reconstructed frames obtained with the proposed approach.
Comparison of the 5 frames (extracted from the RGB-D Object Dataset) that composed the original video V3 and the corresponding reconstructed frames obtained with the proposed approach.
Comparison of the 5 frames (extracted from the RGB-D Object Dataset) that composed the original video V4 and the corresponding reconstructed frames obtained with the proposed approach.
Comparison of the 5 frames (extracted from the RGB-D Object Dataset) that composed the original video V5 and the corresponding reconstructed frames obtained with the proposed approach.
Table 5 reports the RMSE for each processed 4D-tensor, obtained using the NPKR method with a Gaussian kernel.
In this case as well, to assess the robustness of the PTE method with respect to hyperparameters, the sensitivity of RMSE and PSNR to the dimension $d$ has been analyzed.
Sensitivity of RMSE to the dimension $d$.
Sensitivity of PSNR to the dimension $d$.
3) Experiment on AT&T Faces Dataset (2D-Tensor)
The AT&T Faces Dataset [91], [92] contains 10 different images for each of 40 distinct subjects. The size of each image is $92 \times 112$ pixels, with 256 grey levels per pixel.
In this experiment, we applied the proposed approach to the original and occluded images obtained from the AT&T Faces Dataset, with the same procedure described in [61]. Here, 20 distinct persons are selected and each face image is resized as in [61].
Vectors $\beta^{(k)}$ were then computed from the data according to (36)–(37).
On the basis of the estimated ID values reported in Table 6, we select an optimal intrinsic dimension $d$ for the model.
Also in this case, to assess the robustness of the PTE method with respect to hyperparameters, the sensitivity of RMSE and PSNR to the dimension $d$ has been analyzed.
Sensitivity of RMSE to the dimension $d$.
Sensitivity of PSNR to the dimension $d$.
Fig. 19 and Fig. 20 show a set of original images and the corresponding images reconstructed with the model (48)–(65), demonstrating the validity of this approach. Furthermore, Fig. 21 and Fig. 22 report a set of original partially occluded images and the corresponding reconstructed images, showing the good noise tolerance of the method. Table 7 reports the reconstruction residuals for different reconstruction methods. The methods used to evaluate the PTE method are PCA-based methods (PCA [27], gLPCA [66]) and tensor decomposition methods (TD [37], GLTD [61]). In this table the average residual for the reconstructed tensors is defined as \begin{equation*} \text {Res}(\mathcal {X}) = \sqrt { \frac {1}{N} \sum _{k=1}^{N} \lVert \mathcal {X}_{0}^{(k)} - \mathcal {X}^{(k)} \rVert ^{2} },\tag{70}\end{equation*} where $\mathcal {X}_{0}^{(k)}$ denotes the original tensor and $\mathcal {X}^{(k)}$ the reconstructed one.
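The average residual (70) used in Table 7 can be computed as follows; the original and reconstructed tensors are random stand-ins.

```python
import numpy as np

def average_residual(originals, reconstructions):
    """Average reconstruction residual (70) over the dataset."""
    sq = [np.sum((X0 - X) ** 2) for X0, X in zip(originals, reconstructions)]
    return np.sqrt(np.mean(sq))

# Example with random stand-ins for the original and reconstructed tensors.
originals = [np.random.rand(32, 32) for _ in range(10)]
reconstructions = [X + 0.01 * np.random.randn(*X.shape) for X in originals]
print(average_residual(originals, reconstructions))
```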
Comparison of the original images extracted from the AT&T dataset and the corresponding reconstructed images with the proposed approach, on subject 8.
Comparison of the original images extracted from the AT&T dataset and the corresponding reconstructed images with the proposed approach, on subject 12.
Comparison of the partially occluded original images extracted from the AT&T dataset and the corresponding reconstructed images with the proposed approach, on subject 8.
Comparison of the partially occluded original images extracted from the AT&T dataset and the corresponding reconstructed images with the proposed approach, on subject 12.
B. Classification
1) Experiment on Classification of 2-Order Tensors
This experiment addressed the classification of data collected from the following datasets.
AT&T Faces dataset: as described in the experiment of Section VIII-A3.
MNIST dataset: consists of 8-bit gray-scale images of the digits from "0" to "9". There are about 6000 examples for each class [93]. Each image is centered on a $28 \times 28$ grid. In our experiments, we randomly chose 50 images for each class.
COIL-20 dataset: contains 20 objects [94]. Each object has 72 images. The size of each image is $32 \times 32$ pixels, with 256 gray levels per pixel. We used the first 32 images for each object in our experiments.
We perform semisupervised learning on the different datasets by training the classifier on the labeled data (20% of the dataset) and using the rest (80% of the dataset) as unlabeled data. In particular, 20% of the data points of each class were randomly selected as labeled data, and the rest were used as unlabeled data. The classifier was trained on the labeled data and the class labels were predicted on the unlabeled data.
We performed classification using the k-nearest neighbor (kNN) algorithm [28], and compared the results obtained with the features extracted by our model (the matrix $B''$) with those obtained by the other methods.
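A sketch of this classification protocol with scikit-learn (assumed available): the features are the columns of the low-dimensional matrix $B''$, 20% of the points of each class are used as labeled training data and the rest as unlabeled test data. The data here are random stand-ins and the choice k = 1 is illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Random stand-ins: B'' is d x N (one column per sample), y holds class labels.
rng = np.random.default_rng(0)
d, N, n_classes = 10, 500, 5
B2 = rng.random((d, N))
y = rng.integers(0, n_classes, N)

# 20% labeled / 80% unlabeled split, stratified per class as in the experiment.
X_train, X_test, y_train, y_test = train_test_split(
    B2.T, y, train_size=0.2, stratify=y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
print("accuracy on the unlabeled part:", clf.score(X_test, y_test))
```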
Table 8 reports the results for the classification experiment as achieved by the methods used in [61] and by the PTE method. Also in this case the intrinsic dimension $d$ was estimated as described in the previous experiments.
2) Experiment on Classification of 3-Order Tensors
In order to compare the proposed method with some other tensor-based learning methods, a last experiment was performed on the following datasets.
Weizmann Action Database: a high-order dataset commonly used for human action recognition. The database includes 90 videos coming from 10 categories of actions - a) bending (bend), b) jacking (jack), c) jumping (jump), d) jumping in place (pjump), e) running (run), f) galloping sideways (side), g) skipping (skip), h) walking (walk), i) single-hand waving (wave1), j) both-hands waving (wave2) - performed by nine subjects [95], [96]. A tensor sample of size $32 \times 24 \times 10$, represented in a spatio-temporal domain, is formed by 10 successive frames of each action, each of which was normalized to the size $32 \times 24$ pixels.
Cambridge Hand Gesture Database: consists of 900 image sequences of nine gesture classes, which are defined by three primitive hand shapes and three primitive motions [97], [98]. Each class contains 100 image sequences (5 different illuminations $\times$ 10 arbitrary motions $\times$ 2 subjects). The procedure to format the data is the same as in the Weizmann Action Database.
In these experiments, we randomly selected six action tensors of each category for training and the remaining tensors were used for testing. The experiments were independently performed 10 times using the kNN algorithm with Euclidean distance for classification, following the procedure in [58].
The methods used to evaluate the proposed PTE are: the baseline method (nearest neighbor classifier on the original data), multilinear PCA (MPCA) [31], tensor locality preserving projection (TLPP) and tensor neighborhood preserving embedding (TNPE) [67], orthogonal tensor neighborhood preserving embedding (OTNPE) [68], and sparse tensor alignment (STA) [58].
Table 9 reports the results for the classification experiment as achieved by the aforementioned methods and by the PTE method. Also in this case the intrinsic dimension $d$ was estimated as described in the previous experiments.
Conclusion
In this article a nonlinear, explicit model of tensor data that depends on a reduced set of latent variables is derived. The main steps required for the estimation of the model from data are:
compute a basis by a Gram-Schmidt procedure;
reorder the basis in such a way that the variances of the coefficients are in decreasing order;
estimate the intrinsic dimension $d$ of the data;
define a data parametrization of dimension $d$;
approximate the nonlinearity in the parametrization by a regression model.
Appendix
Local Parametrization of Data Embedded in a Manifold
Assume all the values of the random vector $\boldsymbol {\alpha }$ lie on a manifold $\mathcal {M}$ of dimension $d$ embedded in $\mathbb {R}^{L}$, so that a local parametrization exists \begin{equation*} \boldsymbol {\alpha } = F(\boldsymbol {\theta }), \;\; \boldsymbol {\theta } \in U \subset \mathbb {R}^{d}, \, \boldsymbol {\alpha } \in \mathbb {R}^{L}, \, d < L.\tag{71}\end{equation*}
Define the function \begin{align*} f(\boldsymbol {x}) = f(\boldsymbol {\theta }, \boldsymbol {t}) = F(\boldsymbol {\theta }) + \left ({\begin{array}{c} 0 \\ \boldsymbol {t} \\ \end{array} }\right), \; f: U \times \mathbb {R}^{m} \rightarrow \mathbb {R}^{L} \\\tag{72}\end{align*} with $\boldsymbol {t} \in \mathbb {R}^{m}$, $m = L - d$. Partitioning $F$ as $F = (F'', F')^{T}$, with $F'' \in \mathbb {R}^{d}$ and $F' \in \mathbb {R}^{m}$, the function $f$ can be written as \begin{align*} f(\boldsymbol {x}) = \left ({\begin{array}{c} F''(\boldsymbol {\theta }) \\ F'(\boldsymbol {\theta }) + \boldsymbol {t} \end{array} }\right), \; F'' \in \mathbb {R}^{d}.\tag{73}\end{align*}
The Jacobian of $f$ is \begin{align*} J(f) = \left ({J_{ \boldsymbol {\theta }}(f), J_{ \boldsymbol {t}}(f) }\right) = \left ({\begin{array}{cc} J(F'') &\quad 0 \\ J(F') &\quad I_{mm} \\ \end{array} }\right)\tag{74}\end{align*} which is nonsingular provided $J(F'')$ is nonsingular. In this case, by the inverse function theorem, $F''$ is locally invertible, so that from $\boldsymbol {\alpha ''} = F''(\boldsymbol {\theta })$ we get \begin{equation*} \boldsymbol {\theta } = F^{''-1} (\boldsymbol {\alpha ''}).\tag{75}\end{equation*}
Substituting (75) into (71) finally gives \begin{align*} \boldsymbol {\alpha } = \left ({\begin{array}{c} \boldsymbol {\alpha ''} \\ F'\left ({F^{''-1} (\boldsymbol {\alpha ''}) }\right) \end{array} }\right) = \left ({\begin{array}{c} \boldsymbol {\alpha ''} \\ G(\boldsymbol {\alpha ''}) \end{array} }\right)\tag{76}\end{align*} which is the local parametrization (30) with $G(\cdot) = F'(F^{''-1}(\cdot))$.