
SST: Simplified Space-Time Transformer Based on Time-Assisted Spatial MSA for 3D Human Pose Estimation



Abstract:

Depth ambiguity in 2D human joint estimation is a persistent issue for 2D-3D human pose estimation networks. To cope with this challenge, existing models adopt the temporal dimension; however, none of them fully utilizes the information embedded in the input data. In this paper, we present a Time-assisted Spatial (TaS) MSA and a Simplified Space-Time Transformer (SST) to better capture spatial-temporal relationships. First, we design the new TaS MSA to comprehensively model spatial-temporal relationships. Second, we combine the TaS MSA and a Temporal MSA in parallel to enhance modeling capability and to build the SST model. Third, we find an optimal pipeline for SST by contrasting the impact of the number of parallel blocks and the intermediate feature dimensions on the model's performance. Experimental results show that our model achieves the highest accuracy on the Human3.6M dataset, with a 0.4 mm gain over current methods and a 9% improvement in difficult positions.
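The exact layer design of the TaS MSA is not given on this page, so the following is only a minimal sketch of the parallel structure the abstract describes: one MSA branch attending across joints within each frame and one attending across frames for each joint, fused in parallel inside a single block. The tensor layout, the residual-sum fusion, and the names SSTBlock / spatial_msa / temporal_msa are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class SSTBlock(nn.Module):
    """Illustrative parallel spatial/temporal MSA block (assumed structure)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 2), nn.GELU(), nn.Linear(dim * 2, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape

        # Spatial branch: attention over joints within each frame (plain
        # spatial MSA stands in for the paper's time-assisted variant).
        xs = x.reshape(b * t, j, d)
        xs, _ = self.spatial_msa(xs, xs, xs)
        xs = xs.reshape(b, t, j, d)

        # Temporal branch: attention over frames for each joint.
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)
        xt, _ = self.temporal_msa(xt, xt, xt)
        xt = xt.reshape(b, j, t, d).permute(0, 2, 1, 3)

        # Parallel fusion, assumed here to be a residual sum, followed by an MLP.
        x = x + xs + xt
        return x + self.mlp(self.norm(x))

In the full SST model, several such blocks would be stacked; per the abstract, the number of parallel blocks and the intermediate feature dimension are the two factors contrasted to find the optimal pipeline.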
Date of Conference: 22-25 October 2024
Date Added to IEEE Xplore: 16 January 2025
Conference Location: Zhuhai, China

I. Introduction

3D human pose estimation is widely used in human-machine interaction, autonomous driving, assistive medical care, etc. Traditional monocular 3D human pose estimation methods normally use convolutional and fully connected layers to predict 3D human joints. To better exploit 2D human pose estimation and to improve accuracy, a typical 3D human pose estimation pipeline has two stages: first, 2D human joint positions are obtained with a 2D pose estimator; second, these positions are mapped to the corresponding 3D joint positions, e.g., SimpleBaseline3D [1] and VideoPose3D [2]. With the introduction of PoseFormer [3], the Transformer [4] has become a promising foundational architecture with strong performance [1], [5], [6], [7]. However, the persistent challenges of location uncertainty and depth ambiguity remain unsolved due to the absence of depth information.
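As a concrete illustration of the second (lifting) stage, the sketch below maps detected 2D joints to 3D joints with a small fully connected network in the spirit of SimpleBaseline3D [1]; the 17-joint layout, hidden width, and the name Lifter2Dto3D are assumptions chosen for illustration, not the configuration used in this paper.

import torch
import torch.nn as nn

class Lifter2Dto3D(nn.Module):
    """Illustrative single-frame 2D-to-3D lifting network (assumed sizes)."""

    def __init__(self, num_joints: int = 17, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_joints * 2, hidden),   # flattened (x, y) inputs
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_joints * 3),   # flattened (x, y, z) outputs
        )

    def forward(self, joints_2d: torch.Tensor) -> torch.Tensor:
        # joints_2d: (batch, num_joints, 2) from an off-the-shelf 2D detector
        b = joints_2d.shape[0]
        return self.net(joints_2d.reshape(b, -1)).reshape(b, -1, 3)

# Usage: lift one batch of detected 2D poses to 3D.
poses_2d = torch.randn(4, 17, 2)       # stand-in for 2D detector output
poses_3d = Lifter2Dto3D()(poses_2d)    # (4, 17, 3)

Temporal models such as VideoPose3D [2] and the Transformer-based methods discussed in this paper replace this per-frame mapping with one that consumes a sequence of 2D poses.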

References

1. Julieta Martinez, Rayat Hossain, Javier Romero and James J. Little, "A simple yet effective baseline for 3D human pose estimation", Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2640-2649, 2017.
2. Dario Pavllo, Christoph Feichtenhofer, David Grangier and Michael Auli, "3D human pose estimation in video with temporal convolutions and semi-supervised training", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7753-7762, 2019.
3. Ce Zheng, Sijie Zhu, Matias Mendieta, Taojiannan Yang, Chen Chen and Zhengming Ding, "3D Human Pose Estimation with Spatial and Temporal Transformers", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 11656-11665, 2021.
4. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, et al., "Attention is All you Need", Neural Information Processing Systems, pp. 5998-6008, 2017.
5. Tianlang Chen, Chen Fang, Xiaohui Shen, Yiheng Zhu, Zhili Chen and Jiebo Luo, "Anatomy-Aware 3D Human Pose Estimation With Bone-Based Pose Decomposition", IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 1, pp. 198-209, 2022.
6. Kehong Gong, Jianfeng Zhang and Jiashi Feng, "PoseAug: A Differentiable Pose Augmentation Framework for 3D Human Pose Estimation", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8575-8584, 2021.
7. Dario Pavllo, Christoph Feichtenhofer, David Grangier and Michael Auli, "3D Human Pose Estimation in Video with Temporal Convolutions and Semi-Supervised Training", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7753-7762, 2019.
8. Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Zhao Wang, Kai Han, Shanshe Wang, et al., "Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation", Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14761-14771, 2023.
9. Jinlu Zhang, Zhigang Tu, Jianyu Yang, Yujin Chen and Junsong Yuan, "MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13232-13242, 2022.
10. Wenkang Shan, Zhenhua Liu, Xinfeng Zhang, Shanshe Wang, Siwei Ma and Wen Gao, "P-STMO: Pre-Trained Spatial Temporal Many-to-One Model for 3D Human Pose Estimation", Computer Vision - ECCV 2022, vol. 13665, pp. 461-478, 2022.
11. Wentao Zhu, Xiaoxuan Ma, Zhaoyang Liu, Libin Liu, Wayne Wu and Yizhou Wang, "MotionBERT: Unified Pretraining for Human Motion Analysis", 2022.
12. Alejandro Newell, Kaiyu Yang and Jia Deng, "Stacked Hourglass Networks for Human Pose Estimation", Computer Vision - ECCV 2016, vol. 9912, pp. 483-499, 2016.
13. Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis and Kostas Daniilidis, "Coarse-to-fine volumetric prediction for single-image 3D human pose", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7025-7034, 2017.
14. Zitian Wang, Xuecheng Nie, Xiaochao Qu, Yunpeng Chen and Si Liu, "Distribution-Aware Single-Stage Models for Multi-Person 3D Pose Estimation", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13096-13105, 2022.
15. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", 2020.
16. Catalin Ionescu, Dragos Papava, Vlad Olaru and Cristian Sminchisescu, "Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 7, pp. 1325-1339, 2014.
17. Wenhao Li, Hong Liu, Hao Tang, Pichao Wang and Luc Van Gool, "MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation", Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13147-13156, 2022.
18. Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu and Li Zhang, "SeaFormer: Squeeze-Enhanced Axial Transformer for Mobile Semantic Segmentation", 2023.
