Introduction
3D human pose estimation (HPE) is a crucial research area in computer vision. It refers to estimating the 3D positions of key body joints, such as elbows, knees, wrists, and ankles, from an input image or video sequence. This task has numerous applications in fields including augmented reality, virtual reality, human-computer interaction, sports analysis, and healthcare [1], [2]. The main challenges in 3D HPE are frequent occlusions and the inherent depth ambiguity of 2D images [3].
The most advanced techniques in 3D HPE can be broadly segmented into two categories: one-stage and two-stage approaches. One-stage approaches [4]–[7] estimate 3D joint coordinates directly from 2D images by analyzing features from the vast number of pixels within each image. Although one-stage approaches benefit from the rich information provided by images, they are prone to disturbances from environmental noise inherent in the images and are limited by the scarcity of 3D labeled data for training. In contrast, two-stage approaches formulate 3D HPE as 2D keypoint detection from images followed by 2D-to-3D lifting [3], [8]–[13]. Two-stage approaches leverage 2D skeleton data extensively for 3D HPE, exhibiting robustness to environmental noise and benefiting from the availability of large 3D motion capture (MoCap) datasets for supervision. Graph convolutional networks (GCNs), renowned for their proficiency in representing skeletal data, have emerged as a prevalent architecture for modeling joint information when lifting 2D poses to 3D [14].
Owing to its formidable capability in capturing the spatial and temporal characteristics of skeletal data, the spatial-temporal graph convolutional network (ST-GCN) has emerged as a leading benchmark for 3D HPE. Although previous studies have demonstrated that effectively capturing spatial-temporal graph features is critical for reducing occlusion and depth ambiguity in 3D HPE [15], how to comprehensively capture coherent spatial-temporal information about human joints remains an open issue. The traditional way to aggregate spatial-temporal joint features is to alternately stack spatial graph convolution (S-GC) layers and temporal graph convolution (T-GC) layers. Many studies have revealed inherent limitations in these methods. First, both S-GC and T-GC employ a shared adjacency matrix across all GCN layers, where a binary matrix encodes the connection relationships of neighbors, so the connection strength between joints is not considered [16]. Second, S-GC and T-GC only update joint features from 1-hop neighbors at a single scale and therefore cannot model long-range relationships among joints, since the receptive field is fixed to 1. The size of the convolutional receptive field is critical for modeling relationships between skeleton joints: a small receptive field limits the ability to model global information over long-range nodes, whereas a large receptive field includes much irrelevant joint information when calculating joint relationships, which not only weakens the feature representation but also makes the network excessively large. Third, the traditional factorization into S-GC and T-GC hinders the cross-spacetime information flow needed to learn spatial-temporal joint dependencies [17]. It also cannot effectively handle the imbalance between spatial and temporal information, since it treats them equally.
To address these issues, researchers have observed that aggregating node features from multiple spatial scales enables capturing long-range interdependencies across human body joints. Zou et al. [18] designed a higher-order GCN for 3D HPE that incorporates k-hop neighborhood information when updating node features. In [17], a higher-order adjacency matrix is devised to capture relationships between skeletal nodes and non-neighboring nodes. However, a core challenge for existing higher-order GCN models lies in effectively fusing the feature representations of these multi-hop neighboring nodes.
In this work, we propose mix-hop spatial-temporal attention graph convolutional layers to effectively aggregate neighboring feature representations with learnable weights. Specifically, a mix-hop spatial attention graph convolution (MSA-GC) layer and a mix-hop dilated temporal attention graph convolution (MTA-GC) layer are proposed to encode both the connection relationships and the connection strengths of neighbors with large receptive fields in the spatial and temporal dimensions. The attention-weighted neighborhood matrix is calculated at each layer, and the resulting feature representation is propagated to the subsequent layer, which is useful for exploring the differing correlations among nodes in the graph. Moreover, exploiting cross-domain joint correlations is important for skeleton-based visual tasks, including pose estimation, action recognition, and motion estimation. However, most existing methods [10], [16], [18]–[22] deploy interleaved spatial graph convolution network (S-GCN)-only and temporal graph convolution network (T-GCN)-only modules, which hinders the direct information flow across spacetime needed to capture complex spatial-temporal joint dependencies. Therefore, we devise a cross-domain spatial-temporal residual connection (CSTR) module, which fuses multi-scale spatial-temporal convolution features through residual connections and explicitly models cross-domain joint interdependencies across the spatial and temporal dimensions. We also notice that traditional GCNs follow a single-pass feed-forward framework, which prevents low-level layers from accessing the semantic skeleton connectivity features residing in high-level layers [16]. Inspired by feedback mechanisms that enable networks to use high-level information to refine and correct preceding layers, we introduce a forward dense connection module to propagate multi-scale spatial-temporal feature representations across different layers of the proposed FMR-GNet (forward multi-scale residual graph convolutional network), overcoming the drawback of simply stacking ST-GCN layers.
The main contributions of this work are summarized as follows:
An FMR-GNet model is designed for 3D HPE, in which a forward dense connection block (FDCB) is devised to facilitate propagating multi-scale spatial-temporal feature representations across different network layers.
An MSA-GC layer and an MTA-GC layer are designed to encode the connection relationships and strengths of neighbors with large receptive fields in both the spatial and temporal dimensions.
A CSTR block is designed to fuse multi-scale spatial-temporal convolution features through residual connections, enabling effective modeling of cross-spacetime joint dependencies.
Related Works
1. 3D HPE
Currently, one-stage direct regression methods and two-stage pipeline methods are the two mainstream frameworks for 3D HPE. The former directly estimates 3D joint coordinates from input RGB images. Pavlakos et al. [4] estimated the 3D joint coordinates by calculating per-joint voxel likelihoods. Reference [6] proposed an image-to-lixel network (I2L-MeshNet) that estimates 3D human pose by predicting per-lixel likelihoods on 1D heatmaps. Ma et al. [23] exploited pictorial structures and a GNN to reduce ambiguity in 3D HPE. While benefiting from image information, these one-stage methods often suffer from sensitivity to environmental noise present in images and are constrained by the scarcity of 3D labeled data.
The two-stage approaches adopt a two-step decomposition strategy for 3D HPE: first perform 2D keypoint detection, and then lift the detected keypoints into 3D space. Recent works following this paradigm have demonstrated promising results, and our approach belongs to this category. A multilayer end-to-end network is devised in [24] for 3D HPE. An ST-GCN is specially designed in [10] to fully exploit the spatial-temporal interdependencies of 2D keypoints and improve feature representation. Pavllo et al. [25] designed a semi-supervised method for improving 3D HPE, where both 2D and 3D labels are utilized to resolve depth ambiguities. With their strong capability for modeling skeleton data, GCNs have emerged as a prevalent architecture for modeling joint information in the 2D-to-3D lifting task.
2. Spatial-Temporal Graph Convolution Networks
Since GCNs demonstrate an efficient representation capability for skeleton data, the vanilla GCN introduced in [26] has become a prevalent framework for 3D HPE. Owing to its powerful capability to model spatial and temporal features defined on graph vertices, the ST-GCN is popular for tackling skeleton graph-structured tasks such as 3D HPE, action recognition, and motion estimation. The ST-GCN contains an S-GC layer and a T-GC layer to handle the input sequences of skeleton joints, aiming to model joint relationships both within and across frames. In [19], an effective graph attention spatial-temporal convolution network was proposed for 3D HPE, comprising stacked temporal convolutional and graph attention modules. Cai et al. [10] incorporated spatial dependencies and temporal consistencies to handle occlusion issues in 3D HPE. In [21], an ST-GCN-based action recognition method is proposed that automatically learns different convolution kernels for skeleton modeling. To better capture spatio-temporal features, a spatio-temporal transformer module is designed in [27] to capture the temporal dynamics of individual joints and learn inter-joint spatial dependencies. Li et al. [20] exploited the inherent topology of the human skeleton to formulate a pose-oriented attention network for 3D HPE. In [28], a masked pose model is devised to effectively capture spatial-temporal skeleton features for 3D HPE. Observing that the graph topology remains fixed across all GCN layers, Shi et al. [29] devised a two-stream adaptive GCN. Motivated by graph-based neighbor assignment techniques, Wang et al. [30] proposed a sparsity locality preserving projection method to learn a human pose distance metric, and an unsupervised visible hybrid model is proposed in [31] to facilitate accurate and efficient pose tracking.
Revisiting Spatial-Temporal GCN
1. Spatial-Temporal Graph Convolution
With the skeleton joints in each frame represented as the vertices $v_{i}\in V$ of a graph $G=(V, E)$, where $E$ denotes the bone connections encoded by an adjacency matrix $\boldsymbol{A}\in\mathbb{R}^{N\times N}$ over the $N$ joints and $\boldsymbol{H}^{(l)}$ denotes the joint features at layer $l$, the spatial-temporal graph convolution is formulated as follows.
2. General Formulation for Spatial-Graph Convolution Layer
The S-GC is performed within a single frame at a time. The output of the S-GC layer for the $i$-th vertex is \begin{gather*}\boldsymbol{H}_{v_{i}}^{(l+1)}= \sigma\left(\sum\limits_{v_{j}\in N(v_{i})} \boldsymbol{WH}_{v_{j}}^{(l)} \boldsymbol{A}_{ij}\right),\\ i\in\{1,2, \ldots, N\},\ l\in\{1,2, \ldots, L\}\tag{1}\end{gather*}
where $\sigma(\cdot)$ is an activation function, $\boldsymbol{W}$ is a learnable weight matrix, and the neighborhood of $v_{i}$ is defined as
\begin{equation*}N(v_{i})=\{v_{j}\ \vert\ d(v_{i}, v_{j})\leq d^{\prime}\}\tag{2}\end{equation*}
with $d(v_{i}, v_{j})$ denoting the graph distance between two joints and $d^{\prime}$ the maximum neighborhood distance.
Equation (1) indicates that the general formulation for joint feature extraction in S-GC consists of three steps, as shown in Figure 1. First, features of the joint of interest are collected from its contextual joints; then, a summation function aggregates the collected features; finally, the features of the joint of interest are updated by transforming the aggregated features.
When implementing (1) for all joints within a frame, equation (1) can be rewritten in matrix form as
\begin{equation*}\boldsymbol{H}_{\mathrm{S}}^{(l+1)}=\sigma\left(\boldsymbol{W}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S}}\right)\tag{3}\end{equation*}
3. General Formulation for Temporal-Graph Convolution Layer
After implementing the S-GC in the spatial direction as described in (3), the widely used T-GC is performed on the same joint across frames as \begin{gather*}\boldsymbol{H}_{v_{ti}}^{(l+1)}= \sigma\left(\sum\limits_{v_{qi}\in N^{T}(v_{ti})} \boldsymbol{WH}_{v_{qi}}^{(l)} \boldsymbol{A}_{ti,qi}\right),\\ i\in\{1,2, \ldots, N\},\ l\in\{1,2, \ldots, L\}\tag{4}\end{gather*}
\begin{equation*}N^{T}(v_{ti})= \left\{v_{qi}\ \middle\vert\ \vert q-t\vert \leq\left\lfloor\frac{\varGamma}{2}\right\rfloor\right\}\tag{5}\end{equation*}
\begin{equation*}\boldsymbol{H}_{T}^{(l+1)}=\sigma\left(\boldsymbol{WH}^{(l)}\hat{\boldsymbol{A}}_{T}\right)\tag{6}\end{equation*}
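To make the factorized baseline concrete, the following is a minimal PyTorch sketch of the matrix-form S-GC in (3) and the per-joint temporal convolution in (6); the tensor layout (batch, channels, frames, joints), the placeholder adjacency, and the hyper-parameters are illustrative assumptions rather than the exact baseline implementation.

```python
import torch
import torch.nn as nn

class VanillaSGC(nn.Module):
    """Matrix-form spatial graph convolution of Eq. (3): H_S = sigma(W H A_S)."""
    def __init__(self, in_ch, out_ch, adj_s):
        super().__init__()
        self.register_buffer("adj_s", adj_s)               # (N, N) normalized spatial adjacency
        self.W = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # shared feature transform W
        self.act = nn.ReLU()

    def forward(self, x):                                  # x: (B, C, T, N)
        x = self.W(x)                                      # transform joint features
        x = torch.einsum("bctn,nm->bctm", x, self.adj_s)   # aggregate 1-hop neighbors
        return self.act(x)

class VanillaTGC(nn.Module):
    """Temporal graph convolution of Eq. (6): a 1D convolution over time for each joint."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=(kernel_size, 1),
                              padding=(kernel_size // 2, 0))
        self.act = nn.ReLU()

    def forward(self, x):                                  # x: (B, C, T, N)
        return self.act(self.conv(x))

# toy usage: 17 joints, 27 frames, 2-D input coordinates (shapes are illustrative)
adj = torch.eye(17)                                        # placeholder adjacency
sgc, tgc = VanillaSGC(2, 64, adj), VanillaTGC(64)
out = tgc(sgc(torch.randn(8, 2, 27, 17)))                  # -> (8, 64, 27, 17)
```

Because the two modules act independently on the joint and time axes, each layer only enlarges the receptive field by one hop or a few frames, which is exactly the limitation discussed in the next section.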
The Proposed FMR-GNet for 3D HPE
1. Overview of the Network Architecture
As demonstrated in (3) and (6), both S-GC and T-GC share the same adjacency matrix across all GCN layers. This uniform treatment is insufficient for exploring the differing correlations among nodes in the graph, and the connection strength between nodes is not considered. In addition, S-GC and T-GC only aggregate joint features from 1-hop neighbors at a single scale and cannot model long-range relationships among joints, since the receptive field is fixed to 1. Furthermore, the traditional factorization into S-GC and T-GC is ineffective for modeling complex spatial-temporal joint dependencies, and it cannot handle the imbalance between spatial and temporal information since it treats them equally. To address these limitations, an FMR-GNet model is proposed to effectively improve the neighbor feature modeling capability of GCNs across both the spatial and temporal dimensions. The overall framework of our approach is illustrated in Figure 1. The designed FMR-GNet architecture comprises three key components: 1) an MSA-GC layer and an MTA-GC layer; 2) a CSTR block with CSTR-GConv as its basic unit; and 3) an FDCB to propagate multi-scale spatial-temporal feature representations across different layers of FMR-GNet.
2. Forward Mix-Hop Spatial-Temporal Residual Graph Network (FMR-GNet)
1) The mix-hop spatial attention graph convolution (MSA-GC)
As shown in Figure 2(a), the conventional GCN operation involves three key steps: it first collects neighbor features, then aggregates node features from the neighbors, and finally updates the node features. However, as shown in Figure 2(a), most existing GCN approaches only aggregate features from the one-hop neighborhood of each node and hence fail to gather long-range dependencies among joints, since the receptive field is fixed to 1. Additionally, the S-GC layers share the same adjacency matrix for feature aggregation across all GCN layers, where a binary matrix encodes the connection relationships of neighbors, so the connection strength between joints is not considered. To improve the neighbor feature representation ability of GCNs, an MSA-GC layer is introduced to revise the S-GC layer in (3), aggregating features of the node of interest from multi-hop neighborhoods by introducing edge weights into the S-GC, as illustrated in Figure 2(b).
According to the hierarchy of the human skeleton in motion, the neighbor joints can be categorized into three types based on their geometric distance from the node of interest within a frame, as depicted in Figure 2(b), which explicitly encodes the 1-hop, 2-hop, and 3-hop neighbor connections of the joint of interest. The spatial domain is then partitioned into three sub-graphs corresponding to the three neighbor types, and the sub-spatial graph convolution operation is formulated as follows:
\begin{gather*}\boldsymbol{H}_{\mathrm{S},k}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}\right),\\ l\in\{1,2, \ldots, L\},\ k\in\{1,2, \ldots, K\}\tag{7}\end{gather*}
(a) A vanilla S-GC operates on 1-hop neighbors, versus (b) the mix-hop neighbors for MSA-GC. Each sub-spatial graph is constructed based on the k-hop neighbors of the joint of interest, with the circled numbers denoting the hop distance from node 0.
As described in [16], the adjacency matrix only encodes binary connection relationships and ignores connection strength. To account for the connection strength between joints, the edge features are normalized over the neighborhood of each node as \begin{equation*}\bar{e}_{ij}=\frac{e_{ij}}{\sum\nolimits_{k\in N(v_{i})}e_{ik}}\tag{8}\end{equation*}
The edge attention function is computed from the transformed features of the two joints connected by each edge as
\begin{equation*}f_{\text{edge}}\left(\boldsymbol{H}_{v_{i}}^{l}, \boldsymbol{H}_{v_{j}}^{l}\right)=\exp\left(\sigma\left(\left[\boldsymbol{H}_{v_{i}}^{l}\boldsymbol{W}^{l}\Vert \boldsymbol{H}_{v_{j}}^{l}\boldsymbol{W}^{l}\right]\right)\right)\tag{9}\end{equation*}
With the attention function being implemented on each edge, the attention-weighted edge features can be represented as follows:
\begin{equation*}\alpha_{ij}=f_{\text{edge}}\left(\boldsymbol{H}_{v_{i}}^{l}, \boldsymbol{H}_{v_{j}}^{l}\right)\bar{e}_{ij}\tag{10}\end{equation*}
\begin{equation*}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}=\alpha_{ij}\cdot \sum\limits_{v_{j}\in N(v_{i})} (\hat{\boldsymbol{A}}_{ij})_{\mathrm{S},k}\tag{11}\end{equation*}
By aggregating information from node features and edge features, Equation (7) for sub-spatial attention graph convolution (SA-GC) can be rewritten as follows:
\begin{equation*}\boldsymbol{H}_{\mathrm{S},k}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)\tag{12}\end{equation*}
Then the MSA-GC layer aggregates the features from the different-hop neighbor joints by concatenating the outputs of all $K$ sub-graphs:
\begin{equation*}\boldsymbol{H}_{\mathrm{S}}^{(l+1)}=\Vert_{k=1}^{K}\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)\tag{13}\end{equation*}
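The following is a hedged PyTorch sketch of one possible MSA-GC layer following (8)-(13). The hop-mask construction from powers of the base adjacency, the GAT-style scoring vector used to reduce the concatenated features in (9) to a scalar, and the per-hop channel split are our assumptions, since these implementation details are left implicit in the formulation above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def hop_masks(adj, hops):
    """Binary masks for joints at exactly k hops (k = 1..hops) from each joint."""
    n = adj.size(0)
    reach_prev = torch.eye(n)                               # hop-0 reachability (the joint itself)
    power = torch.eye(n)
    masks = []
    for _ in range(hops):
        power = ((power @ adj) > 0).float()                 # reachable within one more step
        masks.append(((power - reach_prev) > 0).float())    # newly reached -> exactly k hops
        reach_prev = ((reach_prev + power) > 0).float()
    return masks

class MixHopSpatialAttnGC(nn.Module):
    """Sketch of an MSA-GC layer, Eqs. (8)-(13): per-hop attention-weighted adjacency,
    hop-wise graph convolution, and concatenation over hops."""
    def __init__(self, in_ch, out_ch, adj, hops=3, score_dim=8):
        super().__init__()
        assert out_ch % hops == 0
        self.masks = hop_masks(adj, hops)
        self.W = nn.ModuleList([nn.Conv2d(in_ch, out_ch // hops, 1) for _ in range(hops)])
        self.proj = nn.ModuleList([nn.Linear(in_ch, score_dim) for _ in range(hops)])  # W^l in Eq. (9)
        self.a = nn.ParameterList([nn.Parameter(torch.randn(2 * score_dim) * 0.01) for _ in range(hops)])
        self.act = nn.ReLU()
        self.score_dim = score_dim

    def forward(self, x):                                   # x: (B, C, T, N)
        h = x.permute(0, 2, 3, 1)                           # (B, T, N, C) for edge scoring
        outs = []
        for k, mask in enumerate(self.masks):
            z = self.proj[k](h)                             # (B, T, N, D)
            d = self.score_dim
            # GAT-style decomposition of the score for edge (i, j), cf. Eqs. (9)-(10)
            score = torch.einsum("btnd,d->btn", z, self.a[k][:d]).unsqueeze(-1) \
                  + torch.einsum("btmd,d->btm", z, self.a[k][d:]).unsqueeze(-2)
            alpha = torch.exp(F.leaky_relu(score)) * mask.to(x.device)
            alpha = alpha / alpha.sum(-1, keepdim=True).clamp(min=1e-6)   # normalize, Eq. (8)
            y = self.W[k](x)                                 # (B, C', T, N)
            y = torch.einsum("bctm,btnm->bctn", y, alpha)    # aggregate hop-k neighbors
            outs.append(y)
        return self.act(torch.cat(outs, dim=1))              # concatenate over hops, Eq. (13)
```

In this sketch each hop contributes an equal share of the output channels, mirroring the concatenation in (13); alternative fusion schemes (summation or learned gating) would be equally compatible with the formulation.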
2) The mix-hop dilated temporal attention graph convolution (MTA-GC)
As described in Section III, most existing methods, such as those in [15], [19], and [25], simply perform T-GC on the same joint across frames and cannot capture information from neighboring joints across time steps. However, human body motion is inherently characterized by a sequential progression of limb movements over time, so spatial information along the temporal direction is critical for pose estimation. An MTA-GC layer is therefore designed by extending the single connection of the same joint across frames to multiple neighboring joints in the temporal dimension, preserving spatial features across time steps. The distinction between the proposed MTA-GC layer and the vanilla T-GC layer is clearly exhibited in Figure 4.
(a) A vanilla S-GC layer shares the same adjacency matrix across all layers.
(a) A vanilla T-GC operates only on the same joint across frames, versus (b) the proposed MTA-GC utilizes a sliding window over neighboring joints in the temporal dimension.
In our MTA-GC layer, with multiple neighboring joints connected in the temporal dimension, the original 1D convolution defined in (6) is replaced by a 2D convolution, since the input is a 3D tensor spanning joints, frames, and feature channels.
The sub-temporal graph convolution can be represented as follows:
\begin{gather*}\boldsymbol{H}_{\mathrm{T},k^{\prime}}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k^{\prime}}^{(l)} \boldsymbol{H}_{\mathrm{S}}^{(l+1)}\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}\right),\\ l\in\{1,2, \ldots, L\},\ k^{\prime}\in\{1,2, \ldots, K^{\prime}\}\tag{14}\end{gather*}
By introducing edge features into the feature aggregation in the temporal dimension, the MTA-GC can be rewritten as
\begin{equation*}\boldsymbol{H}_{\mathrm{T},k^{\prime}}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k^{\prime}}^{(l)}\boldsymbol{H}_{\mathrm{S}}^{(l+1)}\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{15}\end{equation*}
To flexibly control the receptive field in the temporal dimension, a dilated window is introduced into MTA-GC, aiming to capture rich temporal contexts with various receptive fields. This is critical for reducing the redundant information gathered from a large spatial-temporal receptive field. Let $\tau$ denote the length of the dilated temporal window; the sub-temporal graph convolution over a window is then \begin{equation*}\left[\boldsymbol{H}_{\mathrm{T},k^{\prime}}^{(l+1)}\right]_{\tau}=\sigma\left(\boldsymbol{W}_{\tau,k^{\prime}}^{(l)} \left[\boldsymbol{H}_{\tau,\mathrm{S}}^{(l+1)}\right]\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{16}\end{equation*}
\begin{equation*}\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}=\begin{bmatrix}\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} & \ldots & \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}\\ \vdots & \ddots & \vdots\\ \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} & \ldots & \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}\end{bmatrix}\in \mathbb{R}^{\tau N\times\tau N}\tag{17}\end{equation*}
With a 2D convolution performed over the dilated window in the temporal dimension, the output of the MTA-GC layer concatenates the $K^{\prime}$ sub-temporal graphs as \begin{equation*}\boldsymbol{H}^{(l+1)}=\Vert_{k^{\prime}=1}^{K^{\prime}}\sigma\left(\boldsymbol{W}_{\tau,k^{\prime}}^{(l)} [\boldsymbol{H}_{\tau,d}^{(l)}]\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{18}\end{equation*}
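A minimal sketch of one MTA-GC branch in this spirit is given below: features are first mixed through an attention-weighted adjacency (standing in for $\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}$) so that neighboring joints contribute across frames, and a 2D convolution with a dilated temporal kernel then realizes the dilated window of (16)-(18). The window size, dilation values, and precomputed adjacency are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixHopDilatedTemporalAttnGC(nn.Module):
    """Sketch of one MTA-GC branch, Eqs. (14)-(18): an attention-weighted adjacency
    first mixes neighboring joints, then a 2D convolution with a dilated temporal
    kernel aggregates them over a sliding window of frames."""
    def __init__(self, channels, adj_att, window=3, dilation=1):
        super().__init__()
        self.register_buffer("adj_att", adj_att)             # (N, N), stands in for A^att_{T,k'}
        self.conv = nn.Conv2d(channels, channels, kernel_size=(window, 1),
                              padding=(dilation * (window - 1) // 2, 0),
                              dilation=(dilation, 1))
        self.act = nn.ReLU()

    def forward(self, x):                                     # x: (B, C, T, N)
        x = torch.einsum("bctn,nm->bctm", x, self.adj_att)    # bring in neighboring joints
        return self.act(self.conv(x))                         # dilated window over time

# stacking branches with increasing dilation enlarges the temporal receptive field
branches = nn.ModuleList(
    [MixHopDilatedTemporalAttnGC(64, torch.eye(17), window=3, dilation=d) for d in (1, 2, 4)])
```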
3) Cross-domain spatial-temporal residual (CSTR) connection
Though spatial and temporal information has been widely used in 3D HPE to reduce occlusion and depth ambiguity, comprehensively modeling coherent spatial-temporal dependencies among skeleton joints remains an open problem. The traditional way is to factorize them into an S-GC-only and a T-GC-only operation. However, as features are transmitted across spacetime through interleaved S-GC and T-GC, they are weakened by the redundant information gathered from the increasingly large spatial-temporal receptive field [22], and the cross-spacetime information flow needed to learn spatial-temporal joint dependencies is hindered. Furthermore, this factorization treats S-GC and T-GC equally and cannot effectively handle the imbalance between spatial and temporal features. To solve these problems, we design a CSTR block to effectively model coherent spatial-temporal graph information, with CSTR-GConv as its basic unit. The proposed CSTR-GConv consists of two pathways that simultaneously capture and fuse spatial-temporal features, rather than decomposing the computation into independent MSA-GC and MTA-GC operations. As shown in Figure 5, the first pathway is the spatial convolution branch and the second pathway is the spatial-temporal convolution branch.
The first branch transmits the spatial convolution features produced by MSA-GC, which can be regarded as static features within each skeleton graph. In the second branch, the spatial-temporal features are extracted by cascading the MSA-GC and MTA-GC. Let $\mathrm{S}(\cdot)$ and $\mathrm{T}(\cdot)$ denote the MSA-GC and MTA-GC operations, respectively; the CSTR-GConv is then formulated as \begin{align*}& \boldsymbol{H}^{(l+1)}=\mathrm{T}(\mathrm{S})+\mathrm{S}\left(\boldsymbol{H}^{(l)}\right)\\ &\mathrm{S}\left(\boldsymbol{H}^{(l)}\right)= \boldsymbol{H}_{\mathrm{S},k}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)\\ & \mathrm{T}(\mathrm{S})=\sigma\left(\boldsymbol{W}_{\tau,k^{\prime}}^{(l)} \left[\boldsymbol{H}_{\mathrm{S},k}^{(l+1)}\right]\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{19}\end{align*}
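A compact sketch of the CSTR-GConv unit in (19) is shown below, treating the MSA-GC and MTA-GC modules as interchangeable sub-layers (e.g., the sketches above); the wiring is the two-pathway residual fusion described in the text, not the exact released implementation.

```python
import torch.nn as nn

class CSTRGConv(nn.Module):
    """Sketch of the CSTR-GConv unit, Eq. (19): the spatial branch S(H) and the
    spatial-temporal branch T(S(H)) are fused by a residual connection."""
    def __init__(self, spatial_gc, temporal_gc):
        super().__init__()
        self.spatial_gc = spatial_gc      # e.g. an MSA-GC layer (static intra-frame features)
        self.temporal_gc = temporal_gc    # e.g. an MTA-GC layer applied on top of S

    def forward(self, x):                 # x: (B, C, T, N)
        s = self.spatial_gc(x)            # S(H^(l))
        return self.temporal_gc(s) + s    # T(S) + S(H^(l))
```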
4) Forward dense connection block (FDCB) for spatial-temporal features
Traditional GCN-based HPE usually simply stacks GCN layers in a feed-forward manner, which cannot pass semantic connectivity information among different layers of the network [33]. Hence, we construct an FDCB and insert it into each CSTR block to effectively transmit features from the previous layers to subsequent layers. As illustrated in Figure 6, the FDCB connects each CSTR-GConv layer to all subsequent layers. Each layer therefore has two inputs: high-level feature maps from the previous layers and the local features from the current layer. In each CSTR block, we have \begin{equation*}\boldsymbol{H}^{(l)}= \text{Cat} ([\boldsymbol{H}^{(1)}, \boldsymbol{H}^{(2)}, \ldots, \boldsymbol{H}^{(l-1)}]),\ l=2,3, \ldots, L\tag{20}\end{equation*}
Forward dense connection module for propagating multi-scale spatial-temporal feature representations across layers.
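Below is a brief sketch of how the dense connectivity in (20) can be realized, with each CSTR-GConv layer consuming the concatenation of all earlier outputs; the `make_layer` factory and the growth-rate parameterization are assumed conventions borrowed from DenseNet-style blocks rather than details specified above.

```python
import torch
import torch.nn as nn

class ForwardDenseBlock(nn.Module):
    """Sketch of the FDCB, Eq. (20): every layer receives the channel-wise concatenation
    of the block input and all preceding layers' outputs."""
    def __init__(self, make_layer, in_ch, growth, num_layers):
        super().__init__()
        # make_layer(in_channels) is an assumed factory returning a CSTR-GConv that
        # outputs `growth` channels; it is not part of the paper's notation.
        self.layers = nn.ModuleList(
            [make_layer(in_ch + i * growth) for i in range(num_layers)])

    def forward(self, x):                          # x: (B, C, T, N)
        feats = [x]
        for layer in self.layers:
            out = layer(torch.cat(feats, dim=1))   # Cat([H^(1), ..., H^(l-1)])
            feats.append(out)
        return torch.cat(feats, dim=1)
```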
3. Network Instantiation
As illustrated in Figure 1, the input to FMR-GNet is the 2D human joint coordinates spanning a fixed temporal window of consecutive frames (27 frames in our experiments).
Experiments
1. Experimental Setup
Dataset In this section, we evaluate the proposed FMR-GNet model on two widely adopted 3D HPE benchmarks: Human3.6M and MPI-INF-3DHP. Human3.6M [34] comprises recordings of 11 subjects performing 15 distinct action classes, totaling 3.6 million images. Following previous works [3], [12], [13], [18], [19], [25], [29], [35], five subjects (S1, S5, S6, S7, S8) are used for training, while subjects S9 and S11 are held out for testing. The MPI-INF-3DHP [36] dataset consists of 1.3 million images recording 8 actors performing 8 actions.
Evaluation protocols For Human3.6M, we report results using two common metrics: 1) mean per joint position error (MPJPE), which computes the mean Euclidean distance (in mm) between the predicted and ground-truth 3D joint locations, and 2) Procrustes-aligned MPJPE (P-MPJPE), where the prediction is rigidly aligned to the ground truth before computing MPJPE. For MPI-INF-3DHP, we follow previous literature [3], [9], [19], [25], [35], [37] and use the 3D percentage of correct keypoints (3D-PCK), the area under the curve (AUC), and MPJPE as evaluation metrics.
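For reference, the two Human3.6M metrics can be computed as in the following NumPy sketch; the single-frame (J, 3) input convention is an assumption, and the Procrustes alignment follows the standard similarity-transform solution.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean per-joint position error (mm): mean Euclidean distance over joints."""
    return np.linalg.norm(pred - gt, axis=-1).mean()

def p_mpjpe(pred, gt):
    """Procrustes-aligned MPJPE for one frame: rigidly align pred to gt
    (similarity transform: scale, rotation, translation) before computing MPJPE.
    pred, gt: (J, 3) arrays of joint coordinates."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    u, s, vt = np.linalg.svd(p.T @ g)                 # SVD of the cross-covariance
    d = np.sign(np.linalg.det(vt.T @ u.T))            # avoid reflections
    D = np.diag([1.0, 1.0, d])
    rot = vt.T @ D @ u.T                              # optimal rotation mapping p -> g
    scale = (s * np.diag(D)).sum() / (p ** 2).sum()   # optimal isotropic scale
    aligned = scale * p @ rot.T + mu_g
    return mpjpe(aligned, gt)
```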
Implementation details Our model is implemented in PyTorch and trained on an NVIDIA RTX 2080Ti GPU using the Adam [38] optimizer for 120 epochs, with an initial learning rate of 0.001 decayed by a factor of 0.95 after each epoch. The batch size is set to 256. Following the convention of prior SOTA methods [3], [12], [13], [18], [19], [25], [29], [35], we utilize the cascaded pyramid network (CPN) [39] for 2D pose estimation on Human3.6M, while ground-truth 2D joints are used for the MPI-INF-3DHP dataset, as in [3], [9], [12], and [37].
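A minimal training-loop sketch matching the stated optimizer settings is given below; the loss function, data-loader names, and tensor shapes are assumptions for illustration only.

```python
import torch

# `model` and `train_loader` are assumed to exist; the MSE loss is a common choice for
# coordinate regression and is an assumption, not a detail taken from the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)
criterion = torch.nn.MSELoss()

for epoch in range(120):
    for pose_2d, pose_3d in train_loader:        # pose_2d: (B, 2, 27, 17), pose_3d: (B, 17, 3)
        optimizer.zero_grad()
        loss = criterion(model(pose_2d), pose_3d)
        loss.backward()
        optimizer.step()
    scheduler.step()                             # multiply the learning rate by 0.95 each epoch
```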
2. Ablation Study
To evaluate the individual contribution of each component in the FMR-GNet model, we perform ablation experiments on the Human3.6M test set using 2D ground truth. The input representation is (17, 2, 27), encoding 17 body joints with 2D coordinates over a temporal window of 27 frames.
Effect of MSA-GC We first validate how the MSA-GC affects 3D HPE performance, using ST-GCN [21] as the baseline. We then add each module one by one, including the mix-hop spatial graph convolution and the attention-weighted edge features in the adjacency matrix. The performance steadily improves with the incremental addition of each module, corroborating their complementary benefits. As shown in Table 1, by adding the mix-hop S-GC to the baseline, the MS-GCN module outperforms the baseline by 10.9 mm. This improvement arises because, with mix-hop neighbor joint features introduced into feature aggregation, the MS-GCN module can combine joint features from longer distances, surpassing the constraint of the baseline, which solely aggregates features from 1-hop neighbors at a single scale in both the spatial and temporal dimensions. By further introducing the attention-weighted edge features into the adjacency matrix, the MSA-GCN module improves over both the baseline (39.6 mm vs. 55.4 mm) and MS-GCN (39.6 mm vs. 44.5 mm), validating that edge features in a GCN can boost the feature representation of skeleton-structured data, since they carry rich information such as the connection strengths and types of nodes. Hence, the attention-weighted neighbor matrix enables the MSA-GC layer to better capture the differing correlations among joints.
Effect of MTA-GC We use the TCN proposed in ST-GCN [21] as the baseline network architecture to validate how the mix-hop dilated temporal attention GC layer affects 3D HPE performance. We add the mix-hop TCN, the attention-weighted temporal edge features, and the dilated convolution to the baseline network one by one. As shown in Table 2, by fully utilizing the temporal information, the error of MTA-GCN is clearly lower than that of the baseline. Since the temporal features embed mix-hop neighborhood joint information, they increase the network's capability to process information from neighboring joints, rather than only the same joint, across time steps, which helps preserve spatial information in the temporal direction. Hence, model 1 with the added mix-hop TCN achieves an error 4.5 mm smaller than the baseline. With the attention-weighted neighbor matrix and the dilated window added, the error is further reduced.
Effects of the CSTR and FDCB To investigate the effects of the CSTR and FDCB on 3D HPE, we take ST-GCN [21] as the baseline network architecture. Compared with the original ST-GCN structure, model 1 with the MSA-GC and MTA-GC layers achieves 38.2 mm MPJPE, which is 17.2 mm lower than the baseline. We further introduce a spatial residual connection into model 1; the resulting model 2 with CSTR exhibits an obvious improvement over the baseline (36.0 mm vs. 55.4 mm). This is because the CSTR can capture the spatial features and the spatial-temporal features at the same time, which is beneficial for accurately modeling coherent spatial-temporal graph features. As shown in Table 3, by further adding the FDCB into each CSTR layer, the MPJPE of model 3 is further reduced. This verifies that the devised FDCB boosts model performance and reflects that capturing semantic skeleton connectivity information from different layers of the model strengthens the stability of the network.
3. Comparison with State-of-the-Art Methods
1) Results on Human3.6M
Table 4 ([40]–[47]) and Table 5 ([48], [49]) report the quantitative results comparing the proposed FMR-GNet against state-of-the-art (SOTA) methods on the Human3.6M dataset under Protocol #1 and Protocol #2, respectively. Our method achieves an average error of 46.2 mm MPJPE (Protocol #1) and 35.7 mm P-MPJPE (Protocol #2), outperforming most existing SOTA methods under the same input temporal window of 27 frames.
Effectively capturing long-range temporal correlations is vital for handling challenging actions involving occlusions and rapid movements. To evaluate the proposed method's performance on such cases, we compare it with several SOTA methods that incorporate temporal convolutions for graph-based action representation. Specifically, we analyze the challenging "photo" and "sitting down" actions from the Human3.6M S11 sequence, as shown in Figure 7. Our approach achieves smaller errors than the methods of Pavllo et al. [25] and Chen et al. [35], with clear improvements at some joints (e.g., the MPJPE for the right elbow, left wrist, and right wrist is 89.1 mm, 98.3 mm, and 109.4 mm, respectively). These results further validate that our model can effectively capture correlations over long-distance frames through the designed MTA-GC layer, enabling the network to leverage global information for tackling challenging actions.
2) Results on MPI-INF-3DHP
To evaluate the generalization capability, we conduct a comparative analysis with SOTA approaches on the MPI-INF-3DHP dataset. Following existing methods [25], [35], [36], [50], the three evaluation metrics are reported with an input temporal window of 27 frames.
4. Computational Complexity Analysis
To further evaluate the proposed model's performance, we report the computational complexity in terms of total parameter count, floating-point operations (FLOPs), and MPJPE on Human3.6M using a temporal window of 27 frames.
5. Visualization Results
For better observation, Figure 8 presents a visual comparison of the 3D pose estimates generated by our model against several SOTA methods and the ground-truth annotations. Figure 8 shows some challenging actions from Human3.6M with fast motion and occlusion; the blue circles mark differences between our model and the others. Our approach yields visually good estimates on actions such as "Photo", "Posing", and "SittingDown" that involve ambiguous body parts and self-occlusion.
Conclusion
This paper presents FMR-GNet for 3D HPE from monocular videos. FMR-GNet first devises a mix-hop spatial attention GCN layer and a mix-hop dilated temporal attention GCN layer to effectively aggregate neighbor feature representations with learnable weights over a large spatial-temporal receptive field, where mix-hop neighbor joint features and attention-weighted edge features are introduced into the graph representation to explore the correlations among nodes. Secondly, the CSTR block is employed to effectively model coherent spatial-temporal graph information through residual connections, thereby enabling effective cross-domain modeling of joint interdependencies across space and time. Finally, the FDCB is inserted into each CSTR-GConv layer to effectively transmit features from the previous layers to subsequent layers, enabling the model to fully utilize the global information of the network. Experiments on two widely adopted 3D HPE benchmarks show that our FMR-GNet achieves good performance compared to existing SOTA methods. Although our approach focuses on root-relative 3D HPE, an important future research direction is to develop a unified framework that efficiently integrates FMR-GNet with human detection and 3D root localization modules to enable 3D multi-person pose estimation in complex scenarios.
ACKNOWLEDGEMENTS
This work was supported by the National Natural Science Foundation of China (Grant Nos. 61907028, 62107027, and 11872036), the Young Science and Technology Stars in Shaanxi Province (Grant No. 2021KJXX-91), and the Central Universities (Grant Nos. 2023YBGY158, K2021011004, 2022TD-26, and GK202205020).