
FMR-GNet: Forward Mix-Hop Spatial-Temporal Residual Graph Network for 3D Pose Estimation


Abstract:

Graph convolutional networks that leverage spatial-temporal information from skeletal data have emerged as a popular approach for 3D human pose estimation. However, comprehensively modeling consistent spatial-temporal dependencies among the body joints remains a challenging task. Current approaches are limited by performing graph convolutions solely on immediate neighbors, deploying separate spatial or temporal modules, and utilizing single-pass feedforward architectures. To solve these limitations, we propose a forward multi-scale residual graph convolutional network (FMR-GNet) for 3D pose estimation from monocular video. First, we introduce a mix-hop spatial-temporal attention graph convolution layer that effectively aggregates neighboring features with learnable weights over large receptive fields. The attention mechanism enables dynamically computing edge weights at each layer. Second, we devise a cross-domain spatial-temporal residual module to fuse multi-scale spatial-temporal convolutional features through residual connections, explicitly modeling interdependencies across spatial and temporal domains. Third, we integrate a forward dense connection block to propagate spatial-temporal representations across network layers, enabling high-level semantic skeleton information to enrich lower-level features. Comprehensive experiments conducted on two challenging 3D human pose estimation benchmarks, namely Human3.6M and MPI-INF-3DHP, demonstrate that the proposed FMR-GNet achieves superior performance, surpassing most state-of-the-art methods.
Published in: Chinese Journal of Electronics ( Volume: 33, Issue: 6, November 2024)
Page(s): 1346 - 1359
Date of Publication: 11 November 2024


SECTION I.

Introduction

3D human pose estimation (HPE) is a crucial research area in computer vision. It refers to estimating the 3D positions of key body joints, such as elbows, knees, wrists, and ankles, from an input image or video sequence. This task has numerous applications in various fields, including augmented reality, virtual reality, human-computer interaction, sports analysis, and healthcare [1], [2]. The challenge in 3D HPE lies in frequent occlusions and the inherent depth ambiguity present in 2D images [3].

The most advanced techniques in 3D HPE can be broadly segmented into two categories: one-stage and two-stage approaches. One-stage approaches [4]–[7] directly estimate 3D joint coordinates from 2D images by analyzing features from vast numbers of pixels within each image. Although one-stage approaches benefit from the information provided by images, they are prone to disturbances from environmental noise inherent in the images and face limitations due to the scarcity of 3D labeled data for training. In contrast, two-stage approaches formulate 3D HPE as 2D keypoint detection from images followed by 2D-to-3D lifting [3], [8]–[13]. Two-stage approaches leverage 2D skeleton data extensively for 3D HPE, exhibiting robustness to environmental noise and benefiting from the availability of large 3D motion capture (MoCap) datasets for supervision. Graph convolutional networks (GCNs), renowned for their proficiency in creating representations of skeletal data, have emerged as a prevalent architecture for modeling joint information in the 2D-to-3D pose lifting process [14].

Owing to its formidable capability in capturing spatial and temporal feature characteristics of skeletal data, the ST-GCN (spatial-temporal graph convolution network) has emerged as a leading benchmark for 3D HPE. Although previous studies have demonstrated that effectively capturing spatial-temporal graph features is critical for reducing occlusion and depth ambiguity in 3D HPE [15], how to comprehensively capture coherent spatial-temporal information about human joints remains an open issue. The traditional way to aggregate the spatial-temporal features of a joint is to alternately stack spatial graph convolution (S-GC) layers and temporal graph convolution (T-GC) layers. Many studies have revealed inherent limitations in these methods. First, both S-GC and T-GC employ a shared adjacency matrix across all GCN layers, where a binary matrix encodes the connection relationships of neighbors and the connection strength between joints is not considered [16]. Second, S-GC and T-GC only update joint features from 1-hop neighbors at a single scale, which cannot model long-range relationships among joints well since the receptive field is fixed to 1. The size of the convolution receptive field is critical for modeling the relationships between skeleton joints: a small receptive field limits the modeling of global information among long-range nodes, whereas a large receptive field incorporates much irrelevant joint information when computing joint relationships, which not only weakens feature representation capability but also inflates the network size. Third, the traditional factorization into S-GC and T-GC hinders the cross-spacetime information flow needed to learn spatial-temporal joint dependencies [17]. It also cannot effectively deal with the imbalance between spatial and temporal information since it treats them equally.

To address these issues, researchers have noticed that aggregating node features from multiple spatial scales helps capture long-range interdependencies across human body joints. Zou et al. [18] designed a higher-order GCN for 3D HPE that exploits k-hop neighborhood information when updating node features. In [17], a higher-order adjacency matrix is devised to capture relationships between skeletal nodes and non-neighboring nodes. However, a core challenge faced by existing higher-order GCN models lies in effectively fusing the feature representations of these multi-hop neighboring nodes.

In this work, we propose a mix-hop spatial-temporal attention graph convolutional layer to effectively aggregate neighboring feature representations with learnable weights. Specifically, a mix-hop spatial attention graph convolution (MSA-GC) layer and a mix-hop dilated temporal attention graph convolution (MTA-GC) layer are proposed to encode the connection relationships and strengths of neighbors with large receptive fields in both the spatial and temporal dimensions. The attention-weighted neighborhood matrix is computed at each layer, and the resulting feature representation is propagated to the subsequent layer, which is useful for exploring the distinct correlations among nodes in the graph. Moreover, exploiting cross-domain joint correlations is important for skeleton-based visual tasks including pose estimation, action recognition, and motion estimation. However, most existing methods [10], [16], [18]–[22] deploy interleaved spatial graph convolution network (S-GCN)-only and temporal graph convolution network (T-GCN)-only modules, which hinders the direct information flow across spacetime needed to capture complex spatial-temporal joint dependencies. Therefore, we devise a cross-domain spatial-temporal residual connection (CSTR) module. The proposed CSTR module fuses multi-scale spatial-temporal convolution features through residual connections, explicitly modeling cross-domain joint interdependencies across the spatial and temporal dimensions. We also notice that traditional GCNs follow a single-pass feed-forward framework, which prevents low-level layers from accessing the semantic skeleton connectivity features residing in high-level layers [16]. Inspired by feedback mechanisms, which enable networks to utilize high-level information to refine and correct preceding layers, we introduce a forward dense connection module to facilitate the propagation of multi-scale spatial-temporal feature representations across different layers of the proposed FMR-GNet (forward multi-scale residual graph convolutional network), overcoming the drawback of simply stacking ST-GCN layers.

The main contributions of this work are summarized as follows:

  1. An FMR-GNet model is designed for 3D HPE, in which a forward dense connection block (FDCB) is devised to facilitate propagating multi-scale spatial-temporal feature representations across different network layers.

  2. An MSA-GC layer and an MTA-GC layer are designed to encode the connection relationships and strengths of neighbors with large receptive fields in both the spatial and temporal dimensions.

  3. A CSTR block is designed to fuse multi-scale spatial-temporal convolution features through residual connections, enabling effective modeling of cross-spacetime joint dependencies.

SECTION II.

Related Works

1. 3D HPE

Currently, one-stage direct regression methods and two-stage pipeline methods are the two mainstream frameworks for 3D HPE. The former directly estimates 3D joint coordinates from input RGB images. Pavlakos et al. [4] estimated 3D joint coordinates by calculating per-joint voxel likelihoods. Reference [6] proposed an image-to-lixel prediction network (I2L-MeshNet) that estimates 3D pose by predicting per-lixel likelihoods on 1D heatmaps. Ma et al. [23] exploited pictorial structures and GNNs to reduce ambiguity in 3D HPE. While benefiting from image information, these one-stage methods often suffer from sensitivity to environmental noise present in images and are constrained by the scarcity of 3D labeled data.

The two-stage approaches adopt a two-step decomposition strategy for 3D HPE: 2D keypoints are first detected, and then these keypoints are projected into 3D space. Recent works have demonstrated promising results following this paradigm, and our approach belongs to this category. A multilayer end-to-end network is devised in [24] for 3D HPE. An ST-GCN is specially designed in [10] to fully exploit the spatial-temporal interdependencies of 2D keypoints for improving feature representation. Pavllo et al. [25] designed a semi-supervised approach for improving 3D HPE performance, where both 2D and 3D labels were utilized to resolve depth ambiguities. Boasting a robust capability in modeling skeleton data, GCNs have emerged as a prevalent architecture for modeling joint information in the 2D-to-3D lifting task.

2. Spatial-Temporal Graph Convolution Networks

Since the GCN demonstrates an efficient representation capability for skeleton data, the vanilla GCN introduced in [26] has become a prevalent framework for 3D HPE. Owing to their powerful capability to model spatial and temporal features defined on graph vertices, ST-GCNs are popular for skeleton graph-structured tasks such as 3D HPE, action recognition, and motion estimation. The ST-GCN contains an S-GC layer and a T-GC layer to handle input sequences of skeleton joints, aiming to model joint relationships both intra-frame and inter-frame. In [19], an effective graph attention spatial-temporal convolution network was proposed for 3D HPE, comprising stacked temporal convolutional and graph attention modules. Cai et al. [10] incorporated spatial dependencies and temporal consistencies to handle occlusion issues in 3D HPE. In [21], an ST-GCN-based action recognition method is proposed to automatically learn different convolution kernels for skeleton modeling. To capture better spatio-temporal features, a spatio-temporal transformer module is designed in [27] to capture the temporal dynamics of individual joints and learn inter-joint spatial dependencies. Li et al. [20] exploited the inherent topology of the human skeleton to formulate a pose-oriented attention network for 3D HPE. In [28], a masked pose model is devised to effectively capture spatial-temporal skeleton features for 3D HPE. Shi et al. [29] observed that the graph topology remained fixed across all GCN layers and devised a two-stream adaptive GCN. Motivated by graph-based neighbor assignment techniques, Wang et al. [30] proposed a sparsity locality preserving projection method to learn a human pose distance metric, and an unsupervised visible hybrid model is proposed in [31] to facilitate accurate and efficient pose tracking.

SECTION III.

Revisiting Spatial-Temporal GCN

1. Spatial-Temporal Graph Convolution

With the skeleton joints in T frames \boldsymbol{X}\in \mathbb{R}^{N\times D\times T}, the spatial-temporal graph convolution (ST-GC) is used to capture the joint relationships intra-frame and inter-frame. The human body joints are treated as nodes of a spatial-temporal graph \boldsymbol{G}=(\boldsymbol{V}, \boldsymbol{E}, \boldsymbol{A}), where \boldsymbol{V}= \{v_{ti}\vert t=1,2, \ldots, T;i=1,2, \ldots, N\} denotes the vertex set consisting of N skeleton joints per frame across T consecutive frames; the edge set \boldsymbol{E}=\{[e_{ij}, e_{ti}]\vert t=1,2, \ldots, T;i, j= 1, 2,\ldots, N\} contains the spatial edges that connect different joints within a frame and the temporal edges that connect the same joint across frames; \boldsymbol{A}=(A_{ij})_{M\times M} with M=NT is the adjacency matrix, where A_{ij}=1 if an edge exists between nodes i and j, and A_{ij}=0 otherwise; A_{ii}=1 represents a self-connection.
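
To make the graph construction concrete, the following sketch builds the M×M spatial-temporal adjacency described above. It assumes a 17-joint skeleton (matching the Human3.6M setting used in Section V); the edge list and joint indexing are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative 1-hop skeleton edges for a 17-joint layout (assumption: the
# exact joint indexing follows the Human3.6M convention of Section V).
SKELETON_EDGES = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 5), (5, 6),
                  (0, 7), (7, 8), (8, 9), (9, 10), (8, 11), (11, 12),
                  (12, 13), (8, 14), (14, 15), (15, 16)]

def build_st_adjacency(num_joints: int = 17, num_frames: int = 27) -> np.ndarray:
    """Build the M x M (M = N*T) spatial-temporal adjacency A with spatial
    edges inside each frame, temporal edges linking the same joint in
    consecutive frames, and self-connections (A_ii = 1)."""
    m = num_joints * num_frames
    A = np.eye(m)
    for t in range(num_frames):
        base = t * num_joints
        for i, j in SKELETON_EDGES:            # intra-frame spatial edges
            A[base + i, base + j] = A[base + j, base + i] = 1
        if t + 1 < num_frames:                 # inter-frame temporal edges
            for i in range(num_joints):
                A[base + i, base + num_joints + i] = 1
                A[base + num_joints + i, base + i] = 1
    return A
```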

2. General Formulation for Spatial-Graph Convolution Layer

The S-GC is performed within a single frame at a time. The output of the S-GC layer for the i-th vertex v_{i} is expressed as follows:
\begin{gather*}\boldsymbol{H}_{v_{i}}^{(l+1)}= \sigma\left(\sum\limits_{v_{j}\in N(v_{i})} \boldsymbol{W}\boldsymbol{H}_{v_{j}}^{(l)} \boldsymbol{A}_{ij}\right),\\ i\in\{1,2, \ldots, N\},\ l\in\{1,2, \ldots, L\}\tag{1}\end{gather*}
where \boldsymbol{H}_{v_{j}}^{(l)} denotes the feature vector of joint v_{j} in the l-th layer, \boldsymbol{H}_{v_{i}}^{(l+1)} represents the updated feature, \boldsymbol{W} denotes the input weight matrix, and N(v_{i}) is the set of sampled neighboring nodes of v_{i}, including itself, which is represented as
\begin{equation*}N(v_{i})=\{v_{j}\ \vert\ d(v_{i}, v_{j})\leq d^{\prime}\}\tag{2}\end{equation*}
where d(v_{i}, v_{j}) represents the shortest path distance between joints v_{i} and v_{j}, and d^{\prime} is the predefined maximum distance for sampling intra-frame neighbors; d^{\prime}=1 in traditional S-GC.

Equation (1) indicates that the general formulation for joint feature extraction in S-GC consists of three steps, as shown in Figure 1. First, features of the joint of interest are collected from its contextual joints; then, a summation function aggregates the collected features; finally, the features of the joint of interest are updated by transforming the aggregated features.

When implementing (1) for all joints within a frame, equation (1) is rewritten in matrix form as
\begin{equation*}\boldsymbol{H}_{\mathrm{S}}^{(l+1)}=\sigma\left(\boldsymbol{W}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S}}\right)\tag{3}\end{equation*}
where \boldsymbol{H}^{(l)}\in \mathbb{R}^{D_{l}\times N} and \boldsymbol{H}^{(l+1)}\in \mathbb{R}^{D_{l+1}\times N} denote the input and updated features of the S-GC layer, respectively. \boldsymbol{W}^{(l)}\in \mathbb{R}^{D_{l+1}\times D_{l}} represents a trainable matrix, \sigma(\cdot) denotes the ReLU function, \hat{\boldsymbol{A}}_{\mathrm{S}}=\boldsymbol{D}^{-\frac{1}{2}}(\boldsymbol{A}+\boldsymbol{I})\boldsymbol{D}^{-\frac{1}{2}}, \boldsymbol{I} is an identity matrix encoding self-connections within each frame, and \boldsymbol{D} is the degree matrix with diagonal elements D_{ii}=1+\sum\nolimits_{j\neq i}A_{ij}. \boldsymbol{H}^{(0)}= \boldsymbol{X} is initialized by the node features \boldsymbol{X} of \boldsymbol{G}, where \boldsymbol{X}\in \mathbb{R}^{N\times D\times T} and each row of \boldsymbol{X} is a feature vector corresponding to a node in \boldsymbol{V}.
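
As a minimal sketch of (3), assuming batched joint features of shape (B, T, N, D) and a symmetric binary adjacency, one S-GC layer could be implemented as follows; the paper's exact layer may differ in details such as initialization and bias terms.

```python
import torch
import torch.nn as nn

class SGraphConv(nn.Module):
    """One S-GC layer per (3): H_S = ReLU(W H A_hat), applied frame-wise."""
    def __init__(self, in_dim: int, out_dim: int, adj: torch.Tensor):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        A = adj + torch.eye(adj.size(0))           # add self-connections
        d = A.sum(dim=1)                           # degree of A + I
        D_inv_sqrt = torch.diag(d.pow(-0.5))
        self.register_buffer("A_hat", D_inv_sqrt @ A @ D_inv_sqrt)

    def forward(self, H):                          # H: (B, T, N, D_l)
        # Aggregate over nodes with the normalized adjacency, then transform.
        agg = torch.einsum("btnd,nm->btmd", H, self.A_hat)
        return torch.relu(self.W(agg))
```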

Figure 1. The overall architecture of the proposed FMR-GNet.

3. General Formulation for Temporal-Graph Convolution Layer

After implementing S-GC in the spatial direction as described in (3), the widely used T-GC simply performs a \varGamma\times 1 convolution on the same joint over a period of time. The T-GC in the time direction can be represented as
\begin{gather*}\boldsymbol{H}_{v_{ti}}^{(l+1)}= \sigma\left(\sum\limits_{v_{qi}\in N^{T}(v_{ti})} \boldsymbol{W}\boldsymbol{H}_{v_{qi}}^{(l)} \boldsymbol{A}_{ti,qi}\right),\\ i\in\{1,2, \ldots, N\},\ l\in\{1,2, \ldots, L\}\tag{4}\end{gather*}
where the sampled neighbors N^{T}(v_{ti}) across frames for T-GC are represented as
\begin{equation*}N^{T}(v_{ti})= \left\{v_{qi}\ \Big\vert\ \vert q-t\vert \leq\left\lfloor\frac{\varGamma}{2}\right\rfloor\right\}\tag{5}\end{equation*}
where \varGamma is the kernel size of the T-GC. When implementing T-GC for all joints in the time direction, equation (4) is rewritten in matrix form as
\begin{equation*}\boldsymbol{H}_{\mathrm{T}}^{(l+1)}=\sigma\left(\boldsymbol{W}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{T}}\right)\tag{6}\end{equation*}
where \hat{\boldsymbol{A}}_{\mathrm{T}} encodes the connection relationships of the same joint across frames.
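
A corresponding sketch of the vanilla T-GC in (6), assuming channels-first tensors of shape (B, C, T, N): the Γ×1 convolution below acts along the time axis of each joint independently.

```python
import torch
import torch.nn as nn

class TGraphConv(nn.Module):
    """Vanilla T-GC per (6): a Gamma x 1 convolution over the time axis of
    each joint independently. A sketch of the standard formulation."""
    def __init__(self, channels: int, kernel_size: int = 3):  # kernel_size = Gamma
        super().__init__()
        pad = kernel_size // 2                     # keep T unchanged
        self.conv = nn.Conv2d(channels, channels,
                              kernel_size=(kernel_size, 1),
                              padding=(pad, 0))

    def forward(self, H):                          # H: (B, C, T, N)
        return torch.relu(self.conv(H))
```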

SECTION IV.

The Proposed FMR-GNet for 3D HPE

1. Overview of the Network Architecture

As demonstrated in (3) and (6), both S-GC and T-GC share the same adjacency matrix across all GCN layers. This uniform treatment is insufficient to explore the distinct correlations among nodes in the graph, and the connection strength between nodes is not considered. In addition, S-GC and T-GC only aggregate joint features from 1-hop neighbors at a single scale, and cannot model long-range relationships among joints well since the receptive field is fixed to 1. Furthermore, the traditional factorization into S-GC and T-GC is ineffective for modeling complex spatial-temporal joint dependencies. It also cannot effectively deal with the imbalance between spatial and temporal information since it treats them equally. To address the limitations of existing methods, an FMR-GNet model is proposed to effectively improve the neighbor feature modeling capability of GCNs across both the spatial and temporal dimensions. The overall framework of our approach is illustrated in Figure 1. The designed FMR-GNet architecture comprises three key components: 1) an MSA-GC layer and an MTA-GC layer; 2) a CSTR block with CSTR-GConv as its basic unit; 3) an FDCB to propagate multi-scale spatial-temporal feature representations across different layers of FMR-GNet.

2. Forward Mix-Hop Spatial-Temporal Residual Graph Network (FMR-GNet)

1) The mix-hop spatial attention graph convolution (MSA-GC)

As shown in Figure 2(a), the conventional GCN operation involves three key steps: it first collects neighbor features, then aggregates node features from neighbors, and finally updates the node features. However, as shown in Figure 2(a), most existing GCN approaches only aggregate features from the one-hop neighborhood of each node, hence failing to effectively gather long-range dependencies among joints since the receptive field is fixed to 1. Additionally, the S-GC layers share the same adjacency matrix across all GCN layers in feature aggregation, where a binary matrix encodes the connection relationships of neighbors and the connection strength between joints is not considered. To improve the neighbor feature representation ability of GCNs, an MSA-GC is introduced to revise the S-GC layer in (3), aggregating the features of the node of interest from multi-hop neighbors and introducing edge weights into the S-GC, as illustrated in Figure 2(b).

According to the hierarchy of the human skeleton in motion, the neighbor joints can be categorized into three types based on their geometric distance from the node of interest within a frame, as depicted in Figure 2(b), which explicitly encodes 1-hop, 2-hop, and 3-hop neighbor connections for the joint of interest. The spatial domain is then partitioned into three sub-graphs, and the sub-spatial graph convolution over these neighbor types is formulated as follows:
\begin{gather*}\boldsymbol{H}_{\mathrm{S},k}^{(l)}=\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}\right),\\ l\in\{1,2, \ldots, L\},\ k\in\{1,2, \ldots, K\}\tag{7}\end{gather*}
where \hat{\boldsymbol{A}}_{\mathrm{S},k} represents the adjacency matrix for the k-th type of neighbors (k\in \{1, 2, \ldots, K\},\ K=3 in our work) in a graph on the spatial dimension, and \boldsymbol{W}_{k}^{(l)} is the learned matrix. Specifically, k=0 indicates self-connection, while k=1,2,3 express 1-hop, 2-hop, and 3-hop neighbor connections as shown in Figure 2(b), respectively.
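
One plausible way to obtain these hop-wise sub-graph adjacency matrices is from shortest-path distances computed via matrix powers of the 1-hop adjacency; the construction below is an assumption, since the paper does not spell it out.

```python
import numpy as np

def khop_adjacency(A: np.ndarray, max_hop: int = 3):
    """Split a 1-hop skeleton adjacency A (N x N, no self-loops) into
    sub-graph matrices for k = 0..max_hop, where entry (i, j) is 1 iff the
    shortest-path distance between joints i and j equals exactly k."""
    n = A.shape[0]
    # reach[k]: nodes reachable within k hops (via powers of A + I)
    reach = [np.linalg.matrix_power(A + np.eye(n), k) > 0
             for k in range(max_hop + 1)]
    dist = np.full((n, n), np.inf)
    for k in range(max_hop, -1, -1):
        dist[reach[k]] = k                         # overwrite with smaller hop
    return [(dist == k).astype(np.float32) for k in range(max_hop + 1)]
```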

Figure 2. (a) A vanilla S-GC operates on 1-hop neighbors, versus (b) the mix-hop neighbors for MSA-GC. Each sub-spatial graph is constructed based on the k-hop neighbors of the joint of interest, with the circled numbers denoting the hop distance from node 0.

As described in [16], the adjacency matrix \hat{\boldsymbol{A}}_{\mathrm{S},k} in (7) is a binary matrix, which only encodes the connection relationships between neighborhoods. Since each edge also describes the relationship strength between nodes, it is more effective to enhance the GCN by fusing information from nodes and edges. Hence, the binary matrix \hat{\boldsymbol{A}}_{\mathrm{S},k} is replaced by introducing edge weights into the GCN. For node v_{i} in layer l, the node feature is \boldsymbol{H}_{v_{i}}^{(l)}, and the corresponding edge feature between v_{i} and v_{j} is e_{ij}; e_{ij}=0 indicates that no edge connects v_{i} and v_{j}. To prevent the feature scale from growing when multiplying the edge features with the node features, the edge features are normalized as
\begin{equation*}\bar{e}_{ij}=\frac{e_{ij}}{\sum\nolimits_{v_{k}\in N(v_{i})}e_{ik}}\tag{8}\end{equation*}
where N(v_{i}) is the set of neighbor nodes of node v_{i}. With the normalized edge features in (8), an edge attention function f_{\text{edge}}(\cdot) is introduced to measure the significance of the edge feature e_{ij} connecting v_{i} and v_{j}:
\begin{equation*}f_{\text{edge}}\left(\boldsymbol{H}_{v_{i}}^{l}, \boldsymbol{H}_{v_{j}}^{l}\right)=\exp\left(\sigma\left(\left[\boldsymbol{H}_{v_{i}}^{l}\boldsymbol{W}^{l}\Vert \boldsymbol{H}_{v_{j}}^{l}\boldsymbol{W}^{l}\right]\right)\right)\tag{9}\end{equation*}
where \sigma(\cdot) is the ReLU, and \boldsymbol{W}^{l} denotes the same weight matrix as in (7).

With the attention function implemented on each edge, the attention-weighted edge features can be represented as follows:
\begin{equation*}\alpha_{ij}=f_{\text{edge}}\left(\boldsymbol{H}_{v_{i}}^{l}, \boldsymbol{H}_{v_{j}}^{l}\right)\bar{e}_{ij}\tag{10}\end{equation*}
where \alpha_{ij} is the attention coefficient, which depicts the connection strength between v_{i} and v_{j}. By performing (10) for all nodes in each sub-graph, the adjacency matrix in (7) is updated element-wise as follows:
\begin{equation*}\left(\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)_{ij}=\alpha_{ij}\left(\hat{\boldsymbol{A}}_{\mathrm{S},k}\right)_{ij},\quad v_{j}\in N(v_{i})\tag{11}\end{equation*}

By aggregating information from node features and edge features, equation (7) for the sub-spatial attention graph convolution (SA-GC) can be rewritten as follows:
\begin{equation*}\boldsymbol{H}_{\mathrm{S},k}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)\tag{12}\end{equation*}

Then the MSA-GC layer aggregates features from the different neighbor types as
\begin{equation*}\boldsymbol{H}_{\mathrm{S}}^{(l+1)}=\Vert_{k=1}^{K}\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)\tag{13}\end{equation*}
where \Vert is column-wise concatenation. Figure 3 highlights the fundamental distinction between the novel MSA-GC layer and the vanilla GCN layer.
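
The sketch below combines (8)–(13) into one MSA-GC layer. Because (9) yields a concatenated feature vector per edge, the code reduces it to a scalar with a learnable vector `a`, in the spirit of graph attention networks; this reduction, and the row normalization standing in for (8), are assumptions where the paper is implicit.

```python
import torch
import torch.nn as nn

class MSAGraphConv(nn.Module):
    """Sketch of one MSA-GC layer, (8)-(13): per-hop edge attention weights
    a hop adjacency mask, and the K hop-wise outputs are concatenated."""
    def __init__(self, in_dim, out_dim, hop_adjs):  # hop_adjs: list of (N, N)
        super().__init__()
        self.hops = nn.ParameterList(
            [nn.Parameter(a, requires_grad=False) for a in hop_adjs])
        self.W = nn.ModuleList(
            [nn.Linear(in_dim, out_dim, bias=False) for _ in hop_adjs])
        # Assumed scalar-reduction vectors for the concatenated features in (9).
        self.a = nn.ParameterList(
            [nn.Parameter(torch.randn(2 * out_dim) * 0.01) for _ in hop_adjs])

    def forward(self, H):                           # H: (B, T, N, D)
        outs = []
        for A, W, a in zip(self.hops, self.W, self.a):
            Hw = W(H)                               # (B, T, N, D')
            # [H_i W || H_j W] reduced by `a` -> pairwise scores (B, T, N, N)
            si = (Hw * a[: Hw.size(-1)]).sum(-1, keepdim=True)
            sj = (Hw * a[Hw.size(-1):]).sum(-1, keepdim=True)
            score = torch.exp(torch.relu(si + sj.transpose(-2, -1)))  # (9)
            att = score * A                         # keep k-hop edges, (10)-(11)
            att = att / att.sum(-1, keepdim=True).clamp(min=1e-6)  # cf. (8)
            outs.append(torch.relu(
                torch.einsum("btij,btjd->btid", att, Hw)))          # (12)
        return torch.cat(outs, dim=-1)              # column-wise concat, (13)
```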

2) The mix-hop dilated temporal attention graph convolution (MTA-GC)

As described in Section III, most existing methods, such as those in [15], [19], and [25], simply perform T-GC on the same joint across frames and cannot capture information from neighboring joints across time steps. However, human body motion is inherently characterized by a sequential progression of limb movements along the time dimension, so spatial information in the temporal direction is critical for pose estimation. An MTA-GC layer is designed by extending the single connections between the same joint to multiple neighboring joints in the temporal dimension, preserving spatial features across time steps. The distinction between the proposed MTA-GC layer and the vanilla T-GC layer is clearly exhibited in Figure 4.

Figure 3. (a) A vanilla S-GCN layer shares the same adjacency matrix \hat{\boldsymbol{A}}_{\mathrm{S}} across all layers, versus (b) the proposed MSA-GC layer, which dynamically computes a k-hop attention-weighted neighborhood matrix \hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}} at each layer before propagating to the next layer.

Figure 4. (a) A vanilla T-GC operates only on the same joint across frames, versus (b) the proposed MTA-GC, which utilizes a sliding window \tau and dilation rate d; each sub-temporal graph is built based on the k^{\prime}-hop neighborhood of the node of interest.

In our MTA-GC layer, with multiple neighboring joints connected in the temporal dimension, the original 1D convolution defined in (6) is replaced by 2D convolutions, since the input is a 3D tensor (T, N, C) with multiple neighbors, where T, N, and C represent the receptive field, the number of joints within a single frame, and the dimension of the joint coordinates (x, y), respectively. Following the same idea as the MSA-GC in Section IV.2.1, the neighbor joints are divided into three types across frames. The joint partition rule follows [21], where the self-connection and the 1-hop and 2-hop neighbor connections are explicitly encoded across time steps. Then, the graph convolution in (6) for the temporal dimension is partitioned into three sub-graphs according to the connection relationships among adjacent joints.

The sub-temporal graph convolution can be represented as follows:
\begin{gather*}\boldsymbol{H}_{\mathrm{T},k^{\prime}}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k^{\prime}}^{(l)} \boldsymbol{H}_{\mathrm{S}}^{(l+1)}\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}\right),\\ l\in\{1,2, \ldots, L\},\ k^{\prime}\in\{1,2, \ldots, K^{\prime}\}\tag{14}\end{gather*}
where \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}} represents the adjacency matrix for the k^{\prime}-th type of neighbors (k^{\prime}\in\{1,2, \ldots, K^{\prime}\},\ K^{\prime}=2 in our work) in a graph on the temporal dimension, and \boldsymbol{W}_{k^{\prime}}^{(l)} represents the learned matrix. In particular, k^{\prime}=0 indicates self-connections in the forward and backward frames, while k^{\prime}=1,2 express 1-hop and 2-hop neighbor connections across time steps, respectively.

By introducing edge features into the feature aggregation in the temporal dimension, the MTA-GC can be rewritten as
\begin{equation*}\boldsymbol{H}_{\mathrm{T},k^{\prime}}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k^{\prime}}^{(l)}\boldsymbol{H}_{\mathrm{S}}^{(l+1)}\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{15}\end{equation*}
where the attention adjacency matrix \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} in the temporal dimension is calculated in the same way as \hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}} in (9)–(11).

To flexibly control the receptive field in the temporal dimension, a dilated window is introduced into the MTA-GC, aiming to capture powerful temporal contexts with various receptive fields. This is critical for reducing the redundant information gathered from a large spatial-temporal receptive field. Let \boldsymbol{G}_{(\tau)}=(\boldsymbol{V}_{(\tau)}, \boldsymbol{E}_{(\tau)}) denote a spatial-temporal sub-graph in a temporal sliding window of size \tau; we extrapolate the frame-wise temporal joint connections into \tau frames with the temporal edge connections shown in Figure 4(b). We then obtain \boldsymbol{H}_{\tau}^{(l)}\in \mathbb{R}^{T\times\tau N\times C} by sliding the window of size \tau over \boldsymbol{H}_{\mathrm{S}}^{(l+1)} with zero padding to form T frame windows. Thus, the MTA-GC defined in (15), performed in the \tau-frame temporal window, can be rewritten as follows:
\begin{equation*}\left[\boldsymbol{H}_{\mathrm{T},k^{\prime}}^{(l+1)}\right]_{\tau}=\sigma\left(\boldsymbol{W}_{\tau,k^{\prime}}^{(l)} \left[\boldsymbol{H}_{\tau,\mathrm{S}}^{(l+1)}\right]\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{16}\end{equation*}
where the adjacency matrix \hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}} is formed by tiling \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} into a block matrix as follows:
\begin{equation*}\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}=\begin{bmatrix}\hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} & \ldots & \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}\\ \vdots & \ddots & \vdots\\ \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} & \ldots & \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}}\end{bmatrix}\in \mathbb{R}^{\tau N\times\tau N}\tag{17}\end{equation*}

With a 2D convolution performed on a dilated window of \tau frames, one frame is sampled every d frames, controlled by the dilation rate d. Then, in (16), \boldsymbol{H}_{\tau,\mathrm{S}}^{(l+1)}\in \mathbb{R}^{T\times\tau N\times C} is changed to \boldsymbol{H}_{\tau,d}^{(l)}\in \mathbb{R}^{T\times\tau N\times C}. As described in [32], the dilated window allows a broader temporal receptive field without increasing the size of \hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}, leaving the complexity unchanged. The MTA-GC layer then aggregates features as follows:
\begin{equation*}\boldsymbol{H}^{(l+1)}=\Vert_{k^{\prime}=1}^{K^{\prime}}\sigma\left(\boldsymbol{W}_{\tau,k^{\prime}}^{(l)} [\boldsymbol{H}_{\tau,d}^{(l)}]\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{18}\end{equation*}
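
Two helpers sketching the windowing machinery of (16)–(18): tiling the per-frame attention adjacency into the block matrix of (17), and gathering dilated, zero-padded temporal windows. The tensor layout (B, T, N, C) and padding conventions are assumptions.

```python
import torch

def tile_adjacency(A_att: torch.Tensor, tau: int) -> torch.Tensor:
    """Tile the per-frame attention adjacency (N x N) into the
    tau*N x tau*N block matrix of (17)."""
    return A_att.repeat(tau, tau)

def dilated_windows(H: torch.Tensor, tau: int, d: int) -> torch.Tensor:
    """For each of the T frames, gather a centered window of tau frames
    sampled every d frames (dilation), zero-padded at the borders,
    turning (B, T, N, C) into (B, T, tau*N, C) as used in (16)."""
    B, T, N, C = H.shape
    pad = (tau // 2) * d
    Hp = torch.nn.functional.pad(H, (0, 0, 0, 0, pad, pad))  # pad time axis
    # Padded index of frame t's k-th window slot is t + k*d (k = 0..tau-1).
    idx = torch.arange(T).unsqueeze(1) + torch.arange(tau).unsqueeze(0) * d
    win = Hp[:, idx]                               # (B, T, tau, N, C)
    return win.reshape(B, T, tau * N, C)
```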

3) Cross-domain spatial-temporal residual (CSTR) connection

Though spatial and temporal information has been widely used in 3D HPE to reduce occlusion and depth ambiguity, comprehensively modeling coherent spatial-temporal dependencies among skeleton joints remains an open problem. The traditional way is to factorize them into an S-GC-only and a T-GC-only module. However, as features are transmitted across spacetime through interleaved S-GC and T-GC, they are weakened by the redundant information gathered from an increasingly large spatial-temporal receptive field [22], and the cross-spacetime information flow needed to learn spatial-temporal joint dependencies is hindered. Furthermore, this factorization treats S-GC and T-GC equally and cannot effectively deal with the imbalance between spatial and temporal features. To address these problems, we design a CSTR block to effectively model coherent spatial-temporal graph information, with CSTR-GConv constructed as its basic unit. The proposed CSTR-GConv consists of two pathways that simultaneously capture and fuse spatial-temporal features, rather than decomposing them into independent MSA-GC and MTA-GC operations. As shown in Figure 5, the first pathway is the spatial convolution branch and the second is the spatial-temporal convolution branch.

Figure 5. The cross-domain spatial-temporal residual graph connection module (CSTR-GConv).

The first branch transmits the spatial convolution features produced by the MSA-GC; these can be regarded as static features of each skeleton graph. In the second branch, the spatial-temporal features are extracted by interleaving the MSA-GC and MTA-GC. Let \boldsymbol{H}^{(l)} denote the input of the l-th layer; the output feature of the CSTR-GConv can be represented as follows:
\begin{align*}& \boldsymbol{H}^{(l+1)}=\mathrm{T}\left(\mathrm{S}\left(\boldsymbol{H}^{(l)}\right)\right)+\mathrm{S}\left(\boldsymbol{H}^{(l)}\right)\\ &\mathrm{S}\left(\boldsymbol{H}^{(l)}\right)= \boldsymbol{H}_{\mathrm{S},k}^{(l+1)}=\sigma\left(\boldsymbol{W}_{k}^{(l)}\boldsymbol{H}^{(l)}\hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}\right)\\ & \mathrm{T}\left(\mathrm{S}\left(\boldsymbol{H}^{(l)}\right)\right)=\sigma\left(\boldsymbol{W}_{\tau,k^{\prime}}^{(l)} \left[\boldsymbol{H}_{\mathrm{S},k}^{(l+1)}\right]\hat{\boldsymbol{A}}_{\tau,\mathrm{T},k^{\prime}}^{\text{att}}\right)\tag{19}\end{align*}
where \mathrm{S}(\cdot) and \mathrm{T}(\cdot) denote the MSA-GC and MTA-GC operations defined in Section IV.2, respectively, and the remaining symbols in (19) are as defined above. As illustrated in (19), a 2D cross-domain residual structure is introduced between the input and output of the MSA-GC and MTA-GC layers, which simultaneously and effectively learns the spatial features and the spatial-temporal features. \boldsymbol{x}_{i}\in \mathbb{R}^{D} is the feature vector of node v_{i}, and \boldsymbol{H}^{(0)}=\boldsymbol{X} is the feature matrix of all nodes in the T frames, with \boldsymbol{X}\in \mathbb{R}^{N\times D\times T}.
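
A compact sketch of the CSTR-GConv unit in (19); `msa_gc` and `mta_gc` stand in for the MSA-GC and MTA-GC layers above and are assumed to preserve the feature shape so that the residual sum is well-defined.

```python
import torch.nn as nn

class CSTRGConv(nn.Module):
    """Sketch of the CSTR-GConv unit in (19): a spatial branch S(.) and a
    spatial-temporal branch T(S(.)) fused by a residual sum."""
    def __init__(self, msa_gc: nn.Module, mta_gc: nn.Module):
        super().__init__()
        self.msa_gc = msa_gc                       # S(.)
        self.mta_gc = mta_gc                       # T(.)

    def forward(self, H):
        S = self.msa_gc(H)                         # spatial features
        return self.mta_gc(S) + S                  # T(S(H)) + S(H), per (19)
```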

4) Forward dense connection block (FDCB) for spatial-temporal features

Traditional GCN-based HPE usually simply superimposes GCN layers in a feedforward manner, which cannot pass semantic connectivity information between different layers of the network [33]. Hence, we construct an FDCB and insert it into each CSTR block to effectively transmit features from previous layers to the next layer. As illustrated in Figure 6, the FDCB connects each CSTR-GConv layer to all subsequent layers. Each layer has two inputs: high-level feature maps from previous layers and the local features from the current layer. Each CSTR block contains L CSTR-GConv layers. In particular, the output produced by the initial CSTR-GConv layer serves as the input feature \boldsymbol{H}^{(1)} at the first stage. For the l-th layer, the concatenated features from all its preceding layers \boldsymbol{H}^{(1)}, \boldsymbol{H}^{(2)}, \ldots, \boldsymbol{H}^{(l-1)} are taken as input, which is denoted as
\begin{equation*}\boldsymbol{H}^{(l)}= \text{Cat}\left(\left[\boldsymbol{H}^{(1)}, \boldsymbol{H}^{(2)}, \ldots, \boldsymbol{H}^{(l-1)}\right]\right),\ l=2,3, \ldots, L\tag{20}\end{equation*}
where \text{Cat}(\cdot) denotes feature concatenation. Through the FDCB, the features extracted in previous layers are reused in subsequent layers. This strategy helps the proposed method fully utilize the global information of the network by repeatedly using layer-by-layer features. As the dimension of the concatenated features in (20) grows, a fully-connected 1 \times 1 convolution is used to reduce the feature dimension.
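
A sketch of the FDCB logic in (20), assuming every CSTR-GConv layer maps a fixed channel width `dim` to itself; a per-layer Linear plays the role of the 1×1 convolution that reduces the growing concatenated dimension.

```python
import torch
import torch.nn as nn

class FDCB(nn.Module):
    """Sketch of the forward dense connection block, (20): each layer takes
    the concatenation of all preceding outputs, reduced back to `dim`."""
    def __init__(self, layers: nn.ModuleList, dim: int):
        super().__init__()
        self.layers = layers                       # L CSTR-GConv layers
        self.reduce = nn.ModuleList(               # 1x1-conv equivalent
            [nn.Linear(dim * (i + 1), dim) for i in range(len(layers))])

    def forward(self, H1):                         # H1: first-layer output
        feats = [H1]
        for layer, reduce in zip(self.layers, self.reduce):
            H = reduce(torch.cat(feats, dim=-1))   # Cat([H^(1),...,H^(l-1)])
            feats.append(layer(H))                 # reuse earlier features
        return feats[-1]
```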

Figure 6. Forward dense connection module for propagating multi-scale spatial-temporal feature representations across layers.

3. Network Instantiation

As illustrated in Figure 1, the input to FMR-GNet is the 2D human joint coordinates spanning T frames, which are mapped by a pre-processing batch normalization layer and an MSA-GC as in [15]. The proposed network is then built by stacking r CSTR blocks (r=2 in our work), with an FDCB connecting the layers. In each CSTR block, four CSTR-GConv layers are adopted to capture coherent spatial-temporal convolution features, which not only improves the neighbor feature representation capability of the GCN in both the spatial and temporal dimensions but also reduces the redundant information gathered from a large spatial-temporal receptive field. All MSA-GC and MTA-GC layers in CSTR-GConv are followed by BN and ReLU layers except for the last one. An FDCB is inserted into each CSTR block to effectively transmit features from previous layers to the next layer. Finally, a 1 \times 1 fully-connected convolution is employed for 3D pose regression. During training, an \ell_{2}-norm loss measures the error between the predicted 3D poses and the ground-truth annotations.

SECTION V.

Experiments

1. Experimental Setup

Dataset In this section, we evaluate the proposed FMR-GNet model on two widely-adopted 3D HPE benchmarks: Human3.6M and MPI-INF-3DHP. Human3.6M [34] comprises recordings of 11 subjects performing 15 distinct action classes, totaling 3.6 million images. Following previous works [3], [12], [13], [18], [19], [25], [29], [35], five sequences (S1, S5, S6, S7, S8) are used for training, while sequences S9 and S11 are held out for testing. The MPI-INF-3DHP [36] dataset consists of 1.3 million images recording 8 actors performing 8 actions.

Evaluation protocols For Human3.6M, we report results using two common metrics: 1) mean per joint position error (MPJPE), which computes the mean Euclidean distance (mm) between predicted and ground-truth 3D joint locations, and 2) Procrustes-aligned MPJPE (P-MPJPE), where the prediction is rigidly aligned to the ground truth prior to computing MPJPE. For MPI-INF-3DHP, we follow previous literature [3], [9], [19], [25], [35], [37] and use the 3D percentage of correct keypoints (3D-PCK), the area under the curve (AUC), and MPJPE as evaluation metrics.
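
For reference, the two Human3.6M metrics can be computed as follows; the Procrustes alignment uses the standard SVD-based similarity transform. This is a conventional implementation, not code from the paper.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per joint position error (mm); pred, gt: (F, N, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def p_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after rigidly aligning each frame's prediction to the ground
    truth with an SVD-based similarity (Procrustes) transform."""
    errs = []
    for P, G in zip(pred, gt):
        muP, muG = P.mean(0), G.mean(0)
        Pc, Gc = P - muP, G - muG                  # center both point sets
        U, s, Vt = np.linalg.svd(Pc.T @ Gc)
        R = (U @ Vt).T
        if np.linalg.det(R) < 0:                   # avoid reflections
            Vt[-1] *= -1
            s[-1] *= -1
            R = (U @ Vt).T
        scale = s.sum() / (Pc ** 2).sum()
        aligned = scale * Pc @ R.T + muG
        errs.append(np.linalg.norm(aligned - G, axis=-1).mean())
    return float(np.mean(errs))
```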

Implementation details Our model is implemented in PyTorch and trained on an NVIDIA RTX 2080Ti using the Adam [38] optimizer for 120 epochs, with an initial learning rate of 0.001 decayed by 0.95 after each epoch. The batch size is set to 256. Following the convention of prior SOTA methods [3], [12], [13], [18], [19], [25], [29], [35], we utilize the cascaded pyramid network (CPN) [39] for 2D pose estimation on Human3.6M, while ground-truth 2D joints are used for the MPI-INF-3DHP dataset, as in [3], [9], [12], and [37].
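
A training-loop sketch matching this stated setup (Adam, initial learning rate 0.001 decayed by 0.95 per epoch, batch size 256, 120 epochs). `FMRGNet` and `train_loader` are hypothetical placeholders for the model and the 2D/3D pose pairs, and MSELoss stands in for the ℓ2-norm loss.

```python
import torch
from torch import nn, optim

model = FMRGNet()                                  # hypothetical model class
opt = optim.Adam(model.parameters(), lr=1e-3)
sched = optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)  # 0.95 per epoch
criterion = nn.MSELoss()                           # stand-in for l2-norm loss

for epoch in range(120):
    for pose_2d, pose_3d in train_loader:          # placeholder DataLoader,
        opt.zero_grad()                            # batch size 256
        loss = criterion(model(pose_2d), pose_3d)  # (B,T,17,2) -> (B,17,3)
        loss.backward()
        opt.step()
    sched.step()                                   # decay lr each epoch
```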

2. Ablation Study

To evaluate the individual contributions of each component of the FMR-GNet model, we perform ablation experiments on the Human3.6M test set using 2D ground truth. The input representation is (17, 2, 27), encoding 17 body joints with 2D coordinates over a temporal window of T=27 frames, and the model outputs the predicted 3D joint coordinates as a (17, 3) tensor. Within the model, the output channels of the CSTR-GConv layers in the CSTR blocks are 64, 128, 256, and 512, respectively. The extracted features are finally passed through a 1 \times 1 projection for 3D pose regression.

Effect of MSA-GC We first validate how the MSA-GC affects 3D HPE performance, using ST-GCN [21] as the baseline. We then add each component one by one, including the mix-hop spatial graph convolution and the attention-weighted edge features in the adjacency matrix. The performance steadily improves with the incremental addition of each component, corroborating their complementary benefits. As shown in Table 1, by adding the mix-hop S-GC to the baseline, the MS-GCN module outperforms the baseline by 10.9 mm. This gain arises because, with mix-hop neighbor joint features introduced into the feature aggregation, the MS-GCN module can combine joint features from longer distances, surpassing the constraint of the baseline, which solely aggregates features from 1-hop neighbors at a single scale in both the spatial and temporal dimensions. By further introducing the attention-weighted edge features into the adjacency matrix, the MSA-GCN module improves over the baseline (39.6 mm vs. 55.4 mm) and over MS-GCN (39.6 mm vs. 44.5 mm), validating that edge features can boost the feature representation of skeleton-structured data, since they encode rich information such as the connection strengths and types of nodes. Hence, with the attention-weighted neighbor matrix \hat{\boldsymbol{A}}_{\mathrm{S},k}^{\text{att}}, the edge features are adaptively embedded in each layer before being transmitted to the subsequent layer. Since node and edge features are fused in the feature aggregation, this yields better performance than sharing a single binary adjacency matrix \hat{\boldsymbol{A}}_{\mathrm{S}} across all GCN layers. This further validates that uniform treatment of neighboring nodes is insufficient to explore the distinct correlations among nodes in the graph.

Table 1. The effect of MSA-GC in feature aggregation.

Effect of MTA-GC We use the TCN proposed in ST-GCN [21] as the baseline network architecture to validate how the mix-hop dilated temporal attention GCN layer affects 3D HPE performance. We add the mix-hop TCN, the attention-weighted temporal edge features, and the dilated convolution to the baseline network one by one. As shown in Table 2, by fully utilizing the temporal information in the MTA-GCN, the model error is clearly lower than the baseline. Since the temporal features embed mix-hop neighborhood joint information, which increases the network's capability to process information from neighboring joints rather than only the same joint across time steps, they help preserve spatial information along the temporal direction. Hence, model 1, which adds the mix-hop TCN, achieves an error 4.5 mm smaller than the baseline. With the attention-weighted neighbor matrix \hat{\boldsymbol{A}}_{\mathrm{T},k^{\prime}}^{\text{att}} introduced into the mix-hop temporal convolution, model 2 achieves a smaller error than model 1 (47.3 mm vs. 50.9 mm). By further adding the dilated convolution to model 2, model 3 achieves an error of 46.2 mm, which is 1.1 mm smaller than model 2 (46.2 mm vs. 47.3 mm). This further validates that dilated convolutions can boost the network's capability to capture long-term dependencies in the input.

Table 2. The influence of MTA-GC in feature aggregation.

Effects of the CSTR and FDCB To investigate the effect of the CSTR and FDCB on 3D HPE, we take ST-GCN [21] as the baseline network architecture. Compared with the original ST-GCN structure, model 1, with MSA-GC and MTA-GC layers, achieves 38.2 mm MPJPE, which is 17.2 mm lower than the baseline. We further introduce a spatial residual connection into model 1; the resulting model 2, with CSTR, exhibits an obvious improvement over the baseline (36.0 mm vs. 55.4 mm). This is because the CSTR can effectively capture the spatial features and the spatial-temporal features at the same time, which is beneficial for accurately modeling coherent spatial-temporal graph features. As shown in Table 3, by further adding the FDCB to each CSTR layer, the MPJPE of model 3 is further reduced. This verifies that the devised FDCB can boost model performance and reflects that capturing semantic skeleton connectivity information from different layers of the model can strengthen the stability of the network.

Table 3. The effect of CSTR and FDCB for 3D HPE.

3. Comparison with State-of-the-Art Methods

1) Results on Human3.6M

Table 4 ([40]–[47]) and Table 5 ([48], [49]) report the quantitative results comparing the proposed FMR-GNet against state-of-the-art (SOTA) methods on the Human3.6M dataset under Protocol #1 and Protocol #2, respectively. Our method achieves an average error of 46.2 mm MPJPE (Protocol #1) and 35.7 mm P-MPJPE (Protocol #2), outperforming most existing SOTA methods under the same input temporal window (T=27). The model's competitive performance is attributed to its proficiency in capturing long-range temporal dependencies, which is crucial for handling challenging actions involving occlusions and rapid movements. Compared to the spatial-temporal graph convolution architectures of [10] and [13], which utilize vanilla GConv and SemGConv, our model yields smaller joint errors: 46.2 mm vs. 48.8 mm in [10] and 46.2 mm vs. 60.8 mm in [13]. Though the models in [10] and [13] incorporate spatial-temporal convolutions into the GCN framework, the proposed FMR-GNet more effectively leverages multi-scale information from different layers through the forward dense connections and CSTR modules. It not only introduces the MSA-GC and MTA-GC layers to effectively gather neighbor features in a weighted way from a large spatial-temporal receptive field, but also designs the CSTR and FDCB blocks to model coherent spatial-temporal graph information in a residual manner and to reuse features extracted from previous layers. These strategies enable the network to fully utilize the global information from different layers of FMR-GNet and to model long-range temporal correlations across frames, which plays an important role in handling challenging scenarios involving occlusion and fast motion. Compared with the models in [18] and [40], which design a mix-hop strategy for feature aggregation, our method exhibits better results (46.2 mm vs. 55.6 mm in [18] and 46.2 mm vs. 57.3 mm in [40]). Although our method and the methods in [18] and [40] all adopt a mix-hop strategy for feature aggregation, the proposed method introduces weight-sharing strategies for the graph convolution, and the mix-hop neighbor joint features and attention-weighted edge features are designed for the graph representation. This enables the proposed FMR-GNet to dynamically embed edge connection strengths in each layer before propagating to subsequent layers, sufficiently exploring the distinct correlations among nodes in the graph and thereby boosting the feature representation of skeleton-structured data. Moreover, in comparison to recent GCN-based 3D HPE approaches such as HGN [41] and GraphSH [9], our model demonstrates smaller errors. While MHFormer [3] exhibits lower error than our method by generating multiple 3D pose hypotheses to handle depth ambiguities and self-occlusions, its promising results motivate future exploration of incorporating multi-hypothesis reasoning within the proposed FMR-GNet to further enhance spatial-temporal modeling of joint dependencies.

Table 4. Quantitative comparison results on Human3.6M under Protocol #1 (unit: mm; the unit of T: frames).
Table 5. Quantitative comparison results on Human3.6M under Protocol #2 (unit: mm; the unit of T: frames).

Effectively capturing long-range temporal correlations is vital for handling challenging actions involving occlusions and rapid movements. To evaluate the proposed method's performance in such cases, we compare it with several SOTA methods that incorporate temporal convolutions for graph-based action representation. Specifically, we analyze the challenging “photo” and “sitting down” actions from the Human3.6M S11 sequence, as shown in Figure 7. Our approach has smaller errors than the methods proposed by Pavllo et al. [25] and Chen et al. [35], and achieves clear improvements on some joints (e.g., the MPJPE for the right elbow, left wrist, and right wrist are 89.1 mm, 98.3 mm, and 109.4 mm, respectively). These results further validate that our model can effectively capture long-range inter-frame correlations through the designed MTA-GC layer, enabling the network to leverage global information for tackling challenging actions.

2) Results on MPI-INF-3DHP

To evaluate the generalization capability, we conduct a comparative analysis with SOTA approaches on the MPI-INF-3DHP dataset. Following existing methods [25], [35], [36], [50], three evaluation metrics with an input temporal window of 27 frames (T=27) are used for comparison. As reported in Table 6, our method achieves 86.8 PCK, 56.2 AUC, and 76.4 mm MPJPE. Despite not performing any dataset-specific retraining or fine-tuning, our approach still demonstrates competitive results, indicating the effectiveness of our method.

Figure 7. The average joint error of the Photo action on S11.

Table 6. Comparison results on MPI-INF-3DHP (↑ denotes the larger the better and ↓ denotes the smaller the better).

4. Computational Complexity Analysis

To further evaluate the proposed model's performance, we report the computational complexity in terms of total parameter count, floating-point operations (FLOPs), and MPJPE on Human3.6M using a temporal window of 27 frames (T=27). The results, compared against several SOTA methods under the same setting, are summarized in Table 7. Our model achieves 46.2 mm MPJPE while maintaining a modest parameter count of 8.65M (M denotes ×10^6) and 26.19M FLOPs, indicating an effective trade-off between accuracy and computational cost.

Table 7. Computational complexity analysis in terms of number of parameters, FLOPs, and MPJPE on Human3.6M.

5. Visualization Results

For better observation, Figure 8 presents a visual comparison of the 3D pose estimates generated by our model against several SOTA methods and the ground-truth annotations. Figure 8 shows some challenging actions on Human3.6M with fast motion and occlusion; the blue circles highlight differences between our model and the others. Our approach yields visually accurate estimates on actions such as “Photo”, “Posing”, and “SittingDown”, which involve ambiguous body parts and self-occlusion.

SECTION VI.

Conclusion

This paper presents FMR-GNet for 3D HPE from monocular videos. FMR-GNet first devises a mix-hop spatial attention GCN layer and a mix-hop dilated temporal attention GCN layer to effectively aggregate neighbor feature representations with learnable weights over a large spatial-temporal receptive field, where mix-hop neighbor joint features and attention-weighted edge features are introduced into the graph representation to explore the correlations among nodes. Second, the CSTR block is employed to effectively model coherent spatial-temporal graph information through residual connections, enabling effective cross-domain modeling of joint interdependencies across space and time. Finally, the FDCB is inserted into each CSTR block to effectively transmit features from previous layers to subsequent layers, enabling the model to fully utilize the global information of the network. Experiments on two widely-adopted 3D HPE benchmarks show that our FMR-GNet achieves good performance compared to existing SOTA methods. Though our approach mainly focuses on root-relative 3D HPE, an important future research direction is to develop a unified framework that efficiently integrates FMR-GNet with human detection and 3D root localization modules to enable 3D multi-person pose estimation in complex scenarios.

Figure 8. Visualization results on Human3.6M.

ACKNOWLEDGEMENTS

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61907028, 62107027, and 11872036), the Young Science and Technology Stars in Shaanxi Province (Grant No. 2021KJXX-91), and the Central Universities (Grant Nos. 2023YBGY158, K2021011004, 2022TD-26, and GK202205020).
