
Driver Head Pose Estimation with Multimodal Temporal Fusion of Color and Depth Modeling Networks


Abstract:

For in-vehicle systems, head pose estimation (HPE) is a primitive task for many safety indicators, including driver attention modeling, visual awareness estimation, behavior detection, and gaze detection. The driver’s head pose information is also used to augment human-vehicle interfaces for infotainment and navigation. HPE is challenging, especially in the context of driving, due to sudden variations in illumination, extreme poses, and occlusions. Because of these challenges, driver HPE based only on 2D color data is unreliable. These challenges can be addressed, to an extent, with 3D depth data. We observe that features from 2D and 3D data complement each other: 2D data provides detailed localized features but is sensitive to illumination variations, whereas 3D data provides topological geometrical features and is robust to lighting conditions. Motivated by these observations, we propose a robust HPE model which fuses data obtained from color and depth cameras (i.e., 2D and 3D). The depth feature representation is obtained with a model based on PointNet++. The color images are processed with the ResNet-50 model. In addition, we add temporal modeling to our framework to exploit the time-continuous nature of head pose trajectories. We train and evaluate our proposed model on the multimodal driver monitoring (MDM) corpus, which is a naturalistic driving database. We present a detailed ablation study with unimodal and multimodal implementations, showing improvements in head pose estimation. We compare our results with baseline HPE models using regular cameras, including OpenFace 2.0 and HopeNet. Our fusion model achieves the best performance, obtaining an average root mean square error (RMSE) equal to 4.38 degrees.
Date of Conference: 02-05 June 2024
Date Added to IEEE Xplore: 15 July 2024
Conference Location: Jeju Island, Korea, Republic of

I. INTRODUCTION

In the field of advanced driver-assistance systems (ADAS), head pose estimation (HPE) of the driver is a primitive task for determining several safety metrics for in-vehicle systems. For example, HPE is a key technology for driver attention modeling [27], [28]. HPE can also be instrumental for other tasks, such as predicting the driver’s gaze [16], [17], [20], [21], [26] or estimating the drowsiness level of a driver [43]. Beyond safety systems, HPE also plays a key role in improving driver-vehicle interfaces for navigation and infotainment purposes [1], [31]. As we transition to autonomous vehicles, it is also important to identify the visual awareness of the driver for take-over tasks [37]. These applications highlight the need for robust in-vehicle solutions for HPE.
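The performance of the approach is reported as an average root mean square error (RMSE) over the predicted head pose angles. As a minimal, self-contained sketch of how such a score is typically computed (per-axis RMSE over yaw, pitch, and roll, then averaged), consider the snippet below; all numbers are synthetic placeholders, not values from the paper:

```python
import math

def rmse(pred, true):
    """Root mean square error over a sequence of angle predictions (degrees)."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# Hypothetical per-frame predictions vs. ground truth for each rotation axis.
yaw_pred,   yaw_true   = [10.0, -5.0, 30.0], [12.0, -4.0, 27.0]
pitch_pred, pitch_true = [0.0, 2.0, -1.0],   [1.0, 1.0, -3.0]
roll_pred,  roll_true  = [5.0, 5.0, 5.0],    [4.0, 6.0, 5.0]

# Average the per-axis errors into a single summary score.
per_axis = [rmse(yaw_pred, yaw_true),
            rmse(pitch_pred, pitch_true),
            rmse(roll_pred, roll_true)]
avg_rmse = sum(per_axis) / len(per_axis)
```

Averaging per-axis RMSE values is one common convention for summarizing head pose error as a single number; the exact aggregation used in the paper may differ.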

REFERENCES
[1] A. Aftab, M. von der Beeck and M. Feld, "You have a point there: Object selection inside an automobile using gaze, head pose and finger pointing", ACM International Conference on Multimodal Interaction (ICMI 2020), pp. 595-603, October 2020.
[2] T. Baltrušaitis, A. Zadeh, Y. C. Lim and L. Morency, "OpenFace 2.0: Facial behavior analysis toolkit", IEEE Conference on Automatic Face and Gesture Recognition (FG 2018), pp. 59-66, May 2018.
[3] T. Bär, J. Reuter and J. Zöllner, "Driver head pose and gaze estimation based on multi-template ICP 3-D point cloud alignment", International IEEE Conference on Intelligent Transportation Systems (ITSC 2012), pp. 1797-1802, September 2012.
[4] F. J. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia and G. Medioni, "FacePoseNet: Making a case for landmark-free face alignment", IEEE International Conference on Computer Vision Workshops (ICCVW 2017), pp. 1599-1608, October 2017.
[5] H. Chen, S. Liu, W. Chen, H. Li and R. Hill, "Equivariant point network for 3D point cloud analysis", IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), pp. 14509-14518, June 2021.
[6] B. Czupryński and A. Strupczewski, "High accuracy head pose tracking survey", International Conference on Active Media Technology (AMT 2014), vol. 8610, pp. 407-420, August 2014.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2009), pp. 248-255, June 2009.
[8] G. Fanelli, J. Gall and L. Van Gool, "Real time head pose estimation with random regression forests", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2011), pp. 617-624, June 2011.
[9] S. Foix, G. Alenya and C. Torras, "Lock-in time-of-flight (ToF) cameras: A survey", IEEE Sensors Journal, vol. 11, no. 9, pp. 1917-1926, September 2011.
[10] L. Goncalves and C. Busso, "AuxFormer: Robust approach to audiovisual emotion recognition", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2022), pp. 7357-7361, May 2022.
[11] L. Goncalves and C. Busso, "Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features", IEEE Transactions on Affective Computing, vol. 13, no. 4, pp. 2156-2170, October-December 2022.
[12] L. Goncalves, S.-G. Leem, W.-C. Lin, B. Sisman and C. Busso, "Versatile audiovisual learning for handling single and multi modalities in emotion regression and classification tasks", ArXiv e-prints (arXiv:2305.07216), pp. 1-14, May 2023.
[13] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition", IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2016), pp. 770-778, June-July 2016.
[14] T. Hu, S. Jha and C. Busso, "Robust driver head pose estimation in naturalistic conditions from point-cloud data", IEEE Intelligent Vehicles Symposium (IV 2020), pp. 1176-1182, October-November 2020.
[15] T. Hu, S. Jha and C. Busso, "Temporal head pose estimation from point cloud in naturalistic driving conditions", IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 8063-8076, July 2022.
[16] S. Jha, N. Al-Dhahir and C. Busso, "Driver visual attention estimation using head pose and eye appearance information", IEEE Open Journal of Intelligent Transportation Systems, vol. 4, pp. 216-231, March 2023.
[17] S. Jha and C. Busso, "Analyzing the relationship between head pose and gaze to model driver visual attention", IEEE International Conference on Intelligent Transportation Systems (ITSC 2016), pp. 2157-2162, November 2016.
[18] S. Jha and C. Busso, "Challenges in head pose estimation of drivers in naturalistic recordings using existing tools", IEEE International Conference on Intelligent Transportation Systems (ITSC 2017), pp. 1624-1629, October 2017.
[19] S. Jha and C. Busso, "Fi-Cap: Robust framework to benchmark head pose estimation in challenging environments", IEEE International Conference on Multimedia and Expo (ICME 2018), pp. 1-6, July 2018.
[20] S. Jha and C. Busso, "Probabilistic estimation of the gaze region of the driver using dense classification", IEEE International Conference on Intelligent Transportation Systems (ITSC 2018), pp. 697-702, November 2018.
[21] S. Jha and C. Busso, "Estimation of driver’s gaze region from head position and orientation using probabilistic confidence regions", IEEE Transactions on Intelligent Vehicles, vol. 8, no. 1, pp. 59-72, January 2023.
[22] S. Jha, M. Marzban, T. Hu, M. Mahmoud, N. Al-Dhahir and C. Busso, "The multimodal driver monitoring database: A naturalistic corpus to study driver attention", IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 8, pp. 10736-10752, August 2022.
[23] W. Kabsch, "A discussion of the solution for the best rotation to relate two sets of vectors", Acta Crystallographica Section A, vol. A34, no. 5, pp. 827-828, September 1978.
[24] D. Kingma and J. Ba, "Adam: A method for stochastic optimization", International Conference on Learning Representations (ICLR 2015), pp. 1-13, May 2015.
[25] A. Kumar, A. Alavi and R. Chellappa, "KEPLER: Keypoint and pose estimation of unconstrained faces by learning efficient H-CNN regressors", IEEE International Conference on Automatic Face and Gesture Recognition (FG 2017), pp. 258-265, May-June 2017.
[26] S. J. Lee, J. Jo, H. G. Jung, K. R. Park and J. Kim, "Real-time gaze estimator based on driver’s head orientation for forward collision warning system", IEEE Transactions on Intelligent Transportation Systems, vol. 12, no. 1, pp. 254-267, March 2011.
[27] N. Li and C. Busso, "Analysis of facial features of drivers under cognitive and visual distractions", IEEE International Conference on Multimedia and Expo (ICME 2013), pp. 1-6, July 2013.
[28] N. Li, J. Jain and C. Busso, "Modeling of driver behavior in real world scenarios using multiple noninvasive sensors", IEEE Transactions on Multimedia, vol. 15, no. 5, pp. 1213-1225, August 2013.
[29] Y. K. Li, Y. Z. Yu, Y. L. Liu and C. Gou, "MS-GCN: Multi-stream graph convolution network for driver head pose estimation", IEEE International Conference on Intelligent Transportation Systems (ITSC 2022), pp. 3819-3824, October 2022.
[30] G. P. Meyer, S. Gupta, I. Frosio, D. Reddy and J. Kautz, "Robust model-based 3D head pose estimation", IEEE International Conference on Computer Vision (ICCV 2015), pp. 3649-3657, December 2015.
