Introduction
Current driving systems place the driver at the center of the controls. The driver takes responsibility for ensuring that the entire system runs smoothly. If there is a sudden change in the environment, such as a pedestrian crossing the road, or a vehicle suddenly stopping in front of the ego car, it is up to the driver to take necessary actions to avoid potential accidents. While advances in in-vehicle technologies have helped drivers with safety features, such as collision avoidance, lane assist, and adaptive cruise control, the responsibility is still on the driver. Hence, it is only natural to design systems that can monitor drivers, inferring if they are aware of the driving environment [1]. The transition from manually controlled vehicles to fully autonomous systems is going to be gradual, given the complexity of mixed-autonomy traffic [2]. A semi-autonomous system needs to have synergy between the autonomy of the car and the human driver. When the driving system is unable to make important maneuvering decisions, it must transfer control to the driver. A distracted driver takes longer to resume control of the vehicle [3]. Hence, the system should be able to understand the actions, intents, and behaviors of the driver before transferring control of the car.
Various studies have addressed different aspects of driver attention, such as vehicle state information [4], physiological signals [5], [6], [7], cognitive distractions [8], [9], [10] and changes in emotional states [11]. The 100-car naturalistic driving study by the National Highway Traffic Safety Administration (NHTSA) concluded that in 80% of crash events and 65% of near-crash events, the driver was looking away from the forward roadway just before the event [12]. A driver relies on vision to gather information from the environment, including road signs [13] and pedestrians [14]. Driving tasks such as mirror-checking actions [15], [16] and lane changes [17], [18] also require the driver’s visual attention. Knowledge of the driver’s visual attention can be helpful in understanding their intent and their knowledge of the environment [19]. Correlating driver visual attention with the visual saliency of the scene can provide insights about what the driver is attending to [20], [21]. This information can be useful in cloning behavioral models onto autonomous systems to replicate safe driving patterns [22]. These examples illustrate the importance of estimating the driver’s visual attention and its potential applications. As the technology evolves towards autonomous vehicles, from level 1 to level 4, the relevance of this task also increases. In non-autonomous vehicles (L1 and L2), the drivers are involved in controlling the operations of the car, so the knowledge of their attention helps in maintaining safe driving conditions. In level 3, it is crucial to have synergy between the driver and the car, as most activities will be shared between them. While the driver might not need to be always attentive, at crucial times during handovers, the system needs to check that the driver is paying attention before transferring control of the car. With full autonomy (L4), the knowledge of a user’s visual attention can be helpful for infotainment and navigation systems (e.g., displaying information about buildings visually attended by the driver, and resolving ambiguities in the driver’s road-related commands).
Different studies have tried to use the driver’s gaze to estimate distractions. Some studies have binarized the problem by considering the duration of gaze-off-the-road events [23], [10]. Other studies have associated head pose and gaze estimation directly with driving activities [4], [17], [24]. An alternative approach is to divide the driver’s visual attention into gaze zones [25], [26], [27]. While these methods provide a good coarse estimate of the driver’s visual attention, some applications may require a finer estimation of where the driver is directing her/his visual attention. Commercial head-mounted eye trackers such as Tobii and faceLAB can provide an accurate estimate of the driver’s gaze and are useful for research, but their invasive nature limits their use in real-world applications. These methods do not provide information about the driver’s awareness of particular objects on the road, such as other vehicles and pedestrians. Acknowledging the fact that fine details about the driver’s gaze may not always be available, we design a model that uses both head pose and eye appearance information to estimate the gaze in real naturalistic driving recordings. We define eye appearance as the image around the eyes of a person.
Visual attention is a broader concept in cognitive science that deals with the study of a person’s awareness of the visual world. However, this study restricts the definition of visual attention to the heat map representing the distribution of the driver’s gaze (i.e., the area that a driver is looking at while operating the car). We quantify visual attention with a probability map that can be easily projected onto the road, mapping the driving environment with the driver’s attention. This formulation also allows us to assign an intensity to the direction of the driver’s vision. For example, a driver fixating on a location will have a map with high intensity in a small region. When a driver is exploring the driving environment, the saccade movements will create a map with low intensity over a large area.
In Jha and Busso [28], [29], we proposed that providing probabilistic maps is a practical and effective way to estimate visual attention. Our earlier efforts relied only on the head pose, but driver visual attention is a function of both head pose and eye movement [30]. In this work, we build a fusion model with two branches, each of which takes the head pose and the eye image as inputs and learns a combined representation of the visual attention of the driver. The proposed model contains three parts: a head pose encoder, an eye encoder, and a decoder. The head pose encoder takes the six parameters describing the head pose (position and orientation) as the inputs, which are passed through fully connected networks, followed by reshaping to obtain a low-dimensional map. The eye encoder takes the eye patch image as the input and sends it through a neural network to extract discriminative information. The decoder of our model concatenates the outputs of the encoders, creating a unified feature map that is sent through an upsampling network to obtain the probabilistic map at different resolutions. We classify the driver’s visual attention on a 2D grid which is learned by using upsampling with convolutional neural networks (CNNs). This representation is purely learned from the data and, hence, is non-parametric. Therefore, it does not require imposing parametric distributions as in our previous model [28], increasing the flexibility of our formulation.
We use the multimodal driver monitoring (MDM) dataset [31] to train and evaluate our model. The recordings used for training are a combination of continuous data, where the driver follows a target marker that is moved around in front of a parked car, and discrete data points, where the driver briefly glances at target markers inside the car, while driving the vehicle and while the car is parked. These datasets provide us with a diverse set of data in terms of how the subject approaches a gaze target. We first design simple models that use only one input modality (head pose only or eye appearance only) and compare them with respective baseline models. We observe that our models show superior performance to the baseline models. Our fusion model that takes both head pose and eye appearance as inputs shows better accuracy when compared to a simple model based solely on the head pose or eye appearance. For the eye appearance, we use the face camera and for the head pose we use Fi-Cap labels. We demonstrate potential applications of the model by projecting the probability map onto the road. For example, the visual map that includes 75% of the probability overlaps with the true target point 92.54% of the time.
This paper is organized as follows. Section II discusses related studies in the field of monitoring driver visual attention. Section III describes the portion of the MDM dataset that we use for our proposed model. Section IV describes in detail the proposed model architecture. Section V discusses the experimental settings, including baselines and implementation details. Section VI evaluates our proposed model by comparing its performance in different settings with various baselines. It also discusses projections of the probabilistic gaze map on the road, evaluating the estimation for cases when the subjects were looking at landmarks on the road. Section VII provides the concluding remarks and future research directions.
Related Work
Driving is a challenging task requiring a high level of vigilance from the driver. Therefore, many researchers have studied the factors associated with driver behaviors and their effects on vehicle operation [32]. In their review of human behavior in an intelligent vehicle environment, Ohn-Bar and Trivedi [33] noted that multiple studies have analyzed drivers in terms of their intentions, behaviors, and actions for maneuvering control. These studies mostly include the gaze, eye appearance, hands, and head movements of the driver, and their interactions with objects inside the car. This Section reviews relevant studies on driver visual attention, emphasizing the relation between head pose and eye appearance for gaze estimation.
A. Driver Visual Attention estimation
Given the importance of visual attention in studying driver vigilance, many studies have proposed methods for using head pose and/or gaze to estimate the driver’s visual attention. Dong et al. [34] presented a review of various monitoring systems for driver distraction. They discussed methods based on subjective reports [35], [36], [37], physiological methods [38], [39], [40], [41], physical methods [42], [43], [44], and driving performance measures [45], [46], [47]. Dong et al. [34] report many studies that have used the driver’s eye and head movements to estimate distraction [42], [48] and fatigue [43], [44]. Unlike other physiological signals, such as electroencephalogram (EEG) and electrocardiogram (ECG), tracking eye movement does not require invasive sensors. They concluded that using driving performance measures in conjunction with eye and head movements is the most reliable solution for monitoring driver distractions.
1) Relationship between Head Pose and Gaze
The eyes and head move when we glance at a given target location. The relationship between head pose and gaze is non-trivial, depending on several factors including the cognitive load. Muñoz et al. [49] analyzed the head rotations of a driver during forward glances and glances to the center console of the vehicle. They recognized these two glance patterns using a temporal model based on a hidden Markov model (HMM). They observed that head pose is a strong indicator of gaze location. They also observed that there are differences in the patterns observed across individual drivers. Talamonti et al. [50] studied head pose and eye gaze dynamics by asking various subjects to look at different locations in the car in a simulated driving environment. They observed that there is very little head movement when looking at locations such as the odometer and rear-view mirror. However, looking at locations such as the center console and the left mirror requires more head movement from the driver. Jha and Busso [30] noted that while there is a strong relationship between head pose and gaze location, the relationship is not one-to-one. They observed that the variability in the gaze changes based on the direction in which the driver is directing her/his visual attention. They also observed a significant difference in gaze patterns when driving a vehicle, where glancing is one of several driving tasks, and when the car is parked, where glancing is the only task and the driver has more time.
2) Using Head Pose to Estimate Driver Visual Attention
Given the strong relation between head pose and gaze, many studies have used head pose to estimate the driver visual attention. Fridman et al. [25] proposed a method to estimate gaze zones with only head pose. They extracted facial landmark features that provide strong cues for head pose. They used these features as inputs to a random forest (RF) classifier to categorize the driver’s visual attention into seven gaze zones. They trained and evaluated the model on a dataset collected from 50 subjects with two different vehicles. Yuen et al. [51] used a dataset collected in a naturalistic driving setting to perform face localization, landmark detection, and head pose estimation using deep neural network (DNN) architectures. They observed higher performance compared to models trained using public datasets collected in indoor settings. Jha and Busso [28] proposed an alternative probabilistic method to model visual attention of a driver. Instead of dividing the visual area into gaze zones, they use heat maps to represent a probability distribution of the driver’s gaze conditioned on the head pose. They use Gaussian process regression (GPR), which provides a Gaussian distribution of the gaze as a function of the driver’s head pose.
3) Incorporating Eye Appearance Information
Head pose provides important information about the visual attention of the driver. However, the estimates are coarse. To provide finer details, many studies have incorporated information from the eyes. Tawari et al. [52] used both head and eye cues to classify the driver’s visual attention into preset gaze zones. They extracted high-level geometrical eye features from the eye location and used them as inputs, along with the head pose, into an RF classifier. They observed that adding eye cues increases the performance of their model, reaching 94.9% accuracy compared to a head pose only model with an accuracy of 79.8%. Fridman et al. [53] studied the gain in performance of a model when using head pose alone versus when using information about the head pose and eye appearance. They observed that while there is an improvement in performance, this improvement is user specific. They defined a metric called owlness, which quantifies how much the driver depends on head movement alone to complete a glance. The study concluded that if a driver’s owlness score is low, there is a larger improvement in the model when adding the eye appearance information. Most of these approaches classify the driver’s gaze into discrete zones. Bergasa et al. [43] proposed a model-based approach for this problem. They estimate factors related to eye closure, head pose, and gaze to determine the driver’s level of vigilance. Near-infrared (IR) illuminators were used, which create bright reflections on the eye, making it easier to detect the pupil. They used these factors to estimate the driver’s fatigue level and inattentiveness using preset rules in a fuzzy logic design. Vora et al. [26] proposed a generalized framework for estimating driver gaze zones using CNNs. They performed an analysis using different models and images with different parts of the face, concluding that a SqueezeNet model that takes the top half of the face as input provided the best results with an accuracy of 95.18%. Ewaisha et al. [27] proposed a CNN-based multitask learning approach that simultaneously performs gaze estimation and head pose estimation. The model performs regression on both tasks, followed by a clustering step where the data samples are assigned to different gaze zones. Performing head pose estimation as an auxiliary task increased the robustness of the model against head pose variations. They achieved 78.2% accuracy on cross-subject evaluations.
B. Application of Visual Attention
Ahlstrom et al. [54] studied the effect of a visual distraction warning system on the driver’s behavior. They used a gaze zone classification system to generate a warning when the driver looked away for more than a certain amount of time. They observed that using this warning system reduced the amount of time the driver spent looking off-the-road, even when involved in a secondary distracting activity. Liang et al. [55] analyzed the crash risk associated with driver gaze variables (i.e., glance history, glance duration, and glance location). They analyzed 24 different algorithms that estimate risk and concluded that the off-the-road glance duration was the most important factor for estimating crash risk. Li and Busso [10] studied various multimodal features from the CAN-Bus data, road scene, and driver’s face to estimate the perceived visual and cognitive distraction during driving. The distraction space formed by visual and cognitive distractions was divided into four clusters and a classifier was trained to distinguish between these classes using multimodal features. They noted that the model classified the data as distracted when the drivers were performing secondary tasks characterized by high cognitive and/or visual load.
Sodhi et al. [56] analyzed the eye positions and pupil diameter using an eye tracking device to study the effect of a secondary task on the driver’s visual attention. They observed an increase in the amount of off-the-road fixation because of competing visual demands on tasks. They observed that the impact varied based on the task. For example, tuning the radio required more off-the-road fixation compared to checking the odometer. They also noted that cognitive tasks decreased the amount of eye movement, since the standard deviation for the fixation displacement reduced when performing mental calculations. Similarly, Reimer et al. [9] studied the effect of cognitive demand on visual attention. They used gaze concentration as a metric, which is a measure of reduction in gaze variability. As the amount of cognitive demand increased, they found that the amount of gaze concentration also increased, implying that drivers tend to fixate at a point when cognitively distracted. They observed that while the amount of gaze concentration increased when moving from low to moderate difficulty tasks, the difference was less defined than the difference from moderate to high difficulty tasks. Martin and Trivedi [17] used the gaze dynamics of the driver to estimate lane based maneuvers in a real world driving scenario. They designed models that could estimate lane change 600 milliseconds before the event with an accuracy of 85%.
There has also been interest in estimating visual saliency based on the road scene to estimate where the driver is expected to look. Palazzi et al. [20] collected the Dr(eye)ve dataset that uses an eye-tracker to track the driver’s fixation on the road view. They used this data to train a temporal model that estimates the visual attention of the driver based on the road scene image. They trained a three branch network that takes the original RGB image, the optical flow image, and a semantic segmentation from the scene view to estimate the focus of attention. Lv et al. [21] improved this model by using reinforced attention. This method uses deep reinforcement learning as a regulatory mechanism that increases the density of the estimation.
C. Relation to Prior Work
Our model is inspired by the model proposed in our previous conference papers [29], [57]. In Jha and Busso [29], we proposed a method that uses CNN with upsampling to create a 2D grid representation of the visual attention learned from the head pose. This model has important limitations, since we exclusively rely on head pose without considering eye information. As a result, the model created large maps of visual attention with low certainty. Additionally, the model was only trained on discrete gaze points which made the model overfit to those specific locations. In Jha and Busso [57], we proposed a method to estimate gaze maps from eye patch images using CNN with maxpooling and upsampling. This model was trained with the MSP-Gaze corpus [58], which was collected in an indoor laboratory setting. Therefore, the model did not address the challenges observed in naturalistic driving conditions [59].
In this study, we propose a model that incorporates both head pose and eye appearance with a novel encoder-decoder approach that relies on downsampling and upsampling with CNNs. We also use data collected with continuous gaze sequences and with discrete gaze locations, making the data more representative of gaze behavior in a vehicle. This approach provides a better representation, leading to a model that can estimate a small gaze region with high accuracy. The contributions of this study are:
We propose a principled gaze detection approach to fuse head and eye movements. The approach uses two encoders that take inputs from the head pose and eye appearance and fuses the representations with a single decoder that generates the visual map at different resolutions.
We propose a loss function that uses marginal distribution in the horizontal and vertical directions with Gaussian filtering so that estimations which are further away from the ground truth get higher penalties.
We create a loss term for each resolution and optimize them in parallel to make sure that the correct representation is learned at each resolution.
We conduct an exhaustive evaluation using natural driving recordings, demonstrating the performance of the proposed fusion approach, which achieves better performance than alternative baseline methods.
The approach is trained such that it works even when one modality is missing, making this approach more practical for real world applications.
We demonstrate the potential of the proposed approach by projecting the estimated probabilistic gaze map onto the road to identify gaze targets outside the vehicle.
Database
We use the multimodal driver monitoring (MDM) dataset [31] to train and evaluate our model. The MDM dataset is a real-world driving dataset collected with 59 subjects, where the drivers are asked to perform various actions in a parked car as well as while driving. The objective of the dataset is to collect naturalistic data with reliable ground truths for the head pose and gaze of the driver. Each subject wore the Fi-Cap helmet [60] during the data collection, which is a cap-like structure that contains 23 AprilTags that can be easily tracked in an image (Fig. 1). By tracking these AprilTags, we can establish reliable information about the driver’s head pose. Figure 1 shows the sensors used in the data collection. The data is recorded using four GoPro RGB cameras: frontal face, profile face (near the rearview mirror), back, and road. The data also includes a PMD picoflexx depth camera, which relies on time-of-flight (TOF) technology and is not used in this study. This dataset provides annotations of the driver’s gaze in 3D space in a variety of situations. Since the data is collected in both driving and parked settings, it provides a diverse set of recordings to train robust driver-independent models. Another reason for using the MDM corpus is the availability of reliable gaze and head pose labels.
Sensors included in the MDM corpus. This study relies on the face camera. It also uses the back camera to estimate the head pose of the drivers.
A. Protocol
While the data collection protocol involved multiple primary and secondary tasks, we limit our discussion in this paper to the sections that are relevant to the current study. The readers are referred to Jha et al. [31] for more details about the corpus.
1) Discrete Gaze Markers
A goal of this corpus is to provide data to train gaze-based models. There are 21 numbered markers placed at different locations inside the car (Fig. 2(b)). The numbered markers are placed at the following locations: 1 to 13 on the windshield, 14 to 16 on the mirrors, 17 and 18 on the left and right windows, respectively, 19 on the speedometer panel, 20 on the dashboard, and 21 near the gear of the car. The subjects are asked to look at each of the numbered markers in a random order multiple times. For example, Figure 2(a) shows an example of a subject looking at marker number 13. This step is repeated for two conditions: when the car is parked, and when the participant is driving on a straight road. The location of each marker with respect to the back camera is known, which is used as the ground truth gaze location for the frames when the subject is looking at these target markers.
Protocol to collect ground truth for gaze data. (a) Discrete gaze markers, where drivers are asked to look at markers inside the car, (b) continuous gaze target, where drivers are asked to follow a board held by a researcher when the car is parked, and (c) gaze target landmark, where the driver is asked to look at landmarks outside the car.
2) Continuous Gaze Target
The corpus also includes continuous gaze data collected while the vehicle was parked. In this step, a researcher conducting the experiment holds a large board with an AprilTag [61] printed on it. The researcher walks in front of the vehicle (Fig. 2(d)). The AprilTag is used so that its 3D location can be tracked from the road camera image. The researcher moves this target and the subject is asked to follow the target with her/his gaze. The data is collected in 3 to 5 sessions of about one minute each. This part of the recordings provides us with continuous data with ground truth gaze annotated for each subject. This protocol cannot be implemented when the driver is operating the vehicle, resulting in limited diversity in terms of appearance and changes in illumination. However, it provides very rich data to be augmented with the marker data.
3) Gaze Target Landmark
In this part of the protocol, the subjects are asked to look at various landmarks on the road and answer questions about them (Fig. 2(f)). For example, we ask them to identify stores on the side of the road. This data is collected when the subject is driving. The landmark location is captured on the road camera. Since the target information is limited to the 2D projections on the road, we use this data to validate our model in real world scenarios by projecting our model estimation onto the road (Section VI-F).
B. Data Preparation
The proposed model takes as input the head pose of the driver and an image of her/his eyes to generate a 2D grid that represents the horizontal and vertical gaze directions.
The head pose is obtained using the Fi-Cap reading. While the model can work with head pose obtained from automatic computer vision algorithms, we use the Fi-Cap for this purpose because of its reliability in providing head pose in all six degrees of freedom for almost all the frames (position and orientation). The Fi-Cap is in the back of the head so it does not occlude important facial features. Hence, an accurate head pose estimation algorithm that relies on facial images will also work on the MDM dataset. Using the Fi-Cap, the head orientation angles are obtained using the multiple local reference frame calibration framework explained in Hu et al. [62]. These angles are with respect to the depth camera that is placed alongside the face RGB camera. Local reference frames are picked from each of the videos with near-frontal face orientation. The relative rotations of these frames from a global reference frame are calculated using the iterative closest point (ICP) algorithm. The angles for each frame are calculated based on the rotation of the Fi-Cap with respect to these local reference frames. The position of the head is approximated with the position of the center of the Fi-Cap. The back camera (Fig. 1) is used as the reference location to ensure that the gaze targets and the face lie on the same side of the coordinate system.
We obtain the eye patch by running the face alignment network (FAN) algorithm [63] on images captured by the face camera. The landmarks surrounding the eye region are used to create a bounding box for the eyes. We add an extra margin by including 10 to 40 pixels around the eyes, picking the actual number at random. The image is then resized to a fixed resolution before being passed to the eye encoder.
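To make this step concrete, the snippet below sketches one possible implementation of the cropping procedure, assuming the standard 68-point landmark layout returned by FAN (indices 36-47 cover the two eyes); the output resolution `out_size` is a hypothetical placeholder, since the exact value is not given in this excerpt.

```python
import cv2
import numpy as np

def crop_eye_patch(frame, landmarks, out_size=(128, 64), rng=np.random):
    """Crop an eye patch from FAN landmarks with a random margin of 10-40 pixels.
    landmarks: (68, 2) array; indices 36-47 cover both eyes in the 68-point scheme.
    out_size is a placeholder; the paper's exact input resolution is not stated here."""
    eyes = landmarks[36:48]
    x0, y0 = eyes.min(axis=0).astype(int)
    x1, y1 = eyes.max(axis=0).astype(int)
    margin = rng.randint(10, 41)                      # extra margin picked at random
    h, w = frame.shape[:2]
    x0, y0 = max(x0 - margin, 0), max(y0 - margin, 0)
    x1, y1 = min(x1 + margin, w), min(y1 + margin, h)
    patch = frame[y0:y1, x0:x1]
    return cv2.resize(patch, out_size)                # (width, height) for cv2.resize
```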
The ground truth gaze location during discrete gaze is obtained using the absolute location of the markers, which is estimated before the recordings. The ground truth location during continuous gaze is obtained using the location of the AprilTag tracked with the road camera. The locations are all transformed into the back camera coordinate system. The gaze vector is computed from the gaze target position $\mathbf{t_{gaze}}$ and the head position $\mathbf{h_{pos}}$, and converted into horizontal and vertical gaze angles: \begin{align*} \mathbf {g}=&\mathbf {t_{gaze}} - \mathbf {h_{pos}} \tag{1}\\ \hat {\mathbf {g}}=&\frac {\mathbf {g}}{\|{\mathbf {g}}\|} \tag{2}\\ \theta _{gaze}=&\arctan \left ({\frac {\hat {g}_{x}}{-\hat {g}_{z}}}\right) \tag{3}\\ \phi _{gaze}=&\arctan \left ({\frac {\hat {g}_{y}}{\sqrt {{\hat {g}_{x}}^{2}+{\hat {g}_{z}}^{2}}}}\right) \tag{4}\end{align*}
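As a reference, a minimal NumPy sketch of Equations (1)-(4) is shown below, with both points expressed in the back camera coordinate system; `arctan2` replaces `arctan` only for numerical robustness.

```python
import numpy as np

def gaze_angles(t_gaze, h_pos):
    """Compute horizontal and vertical gaze angles (radians) from the 3D gaze
    target t_gaze and the head position h_pos, following Eqs. (1)-(4)."""
    g = np.asarray(t_gaze, float) - np.asarray(h_pos, float)      # Eq. (1)
    g_hat = g / np.linalg.norm(g)                                  # Eq. (2)
    theta = np.arctan2(g_hat[0], -g_hat[2])                        # Eq. (3), horizontal
    phi = np.arctan2(g_hat[1], np.hypot(g_hat[0], g_hat[2]))       # Eq. (4), vertical
    return theta, phi
```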
The gaze angle values are truncated within the range of
Example of the process to convert the ground truth gaze angles into a grid. The figure shows the maps as the resolution increases across the upsampling stages.
Proposed Model
Our proposed approach uses the driver’s head pose and eye information to obtain a map representing the visual attention of the driver. Figure 4 shows the entire model, which consists of three blocks: the head pose encoder, the eye encoder, and the visual attention decoder. The motivation behind this design is to obtain a common representation of the visual attention from both head pose and eye information at low resolution, and then gradually upsample it to refine the information. The head pose encoder and the eye encoder take the head pose and eye patch information as inputs, respectively, and generate multiple 2D maps that are concatenated. The visual attention decoder uses upsampling to create high-resolution 2D maps that represent the direction of the driver’s visual attention. This Section presents these blocks in detail.
Our proposed architecture to fuse head pose and eye information to estimate visual attention. The model contains three parts: a head pose encoder that takes the 6D head pose information, an eye encoder that takes an eye patch image from the face camera, and a visual attention decoder that fuses both representations to generate the probabilistic gaze maps.
A. Head Pose Encoder
The goal of the head pose encoder is to transform the head pose information into a tensor, which is used to upsample the visual attention representation. We have observed good performance by using CNNs to upsample the representation obtained from the head pose [29]. For this study, the CNN-based upsampling is conducted by the visual attention decoder (Section IV-C). The head pose information, represented by a six dimensional vector, is passed through two fully connected (FC) layers. The first FC layer has 512 nodes and the second FC layer has 672 nodes. The output of the FC layers is a 672D feature representation that is reshaped into a low-resolution 2D feature map.
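A sketch of this encoder under the stated configuration is shown below; the reshape target `(6, 7, 16)` is a hypothetical choice whose product equals 672, since the exact map dimensions are not specified in this excerpt.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_head_pose_encoder(grid_shape=(6, 7, 16)):
    """Map the 6D head pose (position + orientation) to a low-resolution 2D
    feature map. grid_shape is a placeholder whose product must equal 672."""
    head_pose = layers.Input(shape=(6,), name='head_pose')
    x = layers.Dense(512, activation='relu')(head_pose)   # first FC layer
    x = layers.Dense(672, activation='relu')(x)            # second FC layer
    feat = layers.Reshape(grid_shape)(x)                   # 672D vector -> 2D map
    return tf.keras.Model(head_pose, feat, name='head_pose_encoder')
```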
While predicting a high-resolution 2D map from just six numbers may appear as an under-defined problem, we can effectively represent this gaze distribution by training the method per stage, presenting the target ground truth gaze area at different resolutions, as demonstrated by the experiments in Section VI-A.
B. Eye Encoder
The appearance of the eyes, including the relative position of the iris, gives valuable information about the target gaze direction [57], [58], [64], [65]. Our goal is to incorporate this information into the model. The eye encoder takes the eye image and generates a tensor which serves as a representation to estimate the gaze area. We can use several networks to obtain a discriminative feature representation from the eye patch provided by the FAN algorithm. This study relies on the MobileNet network [66], without the FC layers. This network relies on depth-wise separable convolutions, which provide competitive performance with a reasonable size. The architecture has been found useful in various computer vision tasks such as image classification, detection, and segmentation. The output of the eye encoder is a low-resolution feature tensor.
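The following sketch shows one way to instantiate the eye encoder with a MobileNet backbone stripped of its FC layers; the eye patch resolution is a placeholder.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_eye_encoder(input_shape=(96, 160, 3)):
    """Extract a spatial feature tensor from the eye patch with MobileNet
    (no FC layers). input_shape is a placeholder for the actual eye-patch size."""
    eye = layers.Input(shape=input_shape, name='eye_patch')
    backbone = tf.keras.applications.MobileNet(include_top=False, weights=None,
                                               input_shape=input_shape)
    feat = backbone(eye)            # depth-wise separable convolutions only
    return tf.keras.Model(eye, feat, name='eye_encoder')
```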
C. Visual Attention Decoder
The visual attention decoder concatenates the feature tensors produced by the head pose encoder and the eye encoder, and passes the combined representation through a series of upsampling blocks implemented with CNNs. An output layer is connected after every upsampling stage, so the decoder produces probabilistic gaze maps at progressively higher resolutions.
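The sketch below illustrates this decoding scheme; the encoder output shapes, channel counts, and number of upsampling stages are assumptions, and both encoder tensors are assumed to share the same spatial size so that they can be concatenated along the channel axis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_visual_attention_decoder(head_shape=(6, 7, 16), eye_shape=(6, 7, 64),
                                   num_stages=3):
    """Fuse the encoder outputs and upsample them into probabilistic gaze maps,
    producing one normalized map per resolution. All shapes are placeholders."""
    head_feat = layers.Input(shape=head_shape, name='head_feat')
    eye_feat = layers.Input(shape=eye_shape, name='eye_feat')
    x = layers.Concatenate(axis=-1)([head_feat, eye_feat])
    outputs, (h, w) = [], head_shape[:2]
    for _ in range(num_stages):
        x = layers.UpSampling2D(size=2)(x)
        h, w = 2 * h, 2 * w
        x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
        logits = layers.Conv2D(1, 1)(x)                          # one map per stage
        prob = layers.Softmax(name=f'map_{h}x{w}')(layers.Flatten()(logits))
        outputs.append(layers.Reshape((h, w, 1))(prob))          # probability map over the grid
    return tf.keras.Model([head_feat, eye_feat], outputs,
                          name='visual_attention_decoder')
```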
D. Loss
Figure 6 illustrates the approach that we use to estimate the loss. The loss is separately calculated in the horizontal and vertical directions. This approach reduces the number of operations by allowing us to obtain losses in two separate 1D spaces (horizontal and vertical directions) instead of looking at each point in a common 2D space. The visual map output and the ground truth are first filtered with a Gaussian mask to create a neighborhood of influence around each point. This approach starts rewarding the model when the estimations get spatially closer to the ground truth value, facilitating the learning process (i.e., overlap between the estimated and ground truth Gaussian masks). Then, the marginal distributions in both the horizontal and vertical directions are calculated by adding each column and row, respectively. The final loss is the sum of the cross entropy and the mean absolute error loss between the estimated and the ground truth vectors in each direction, \begin{align*} L_{h}=&\sum _{l=0}^{5} cc\left ({p_{true(hr)},p_{pred(hr)}}\right) + mae\left ({p_{true(hr)},p_{pred(hr)}}\right) \tag{5}\\ L_{v}=&\sum _{l=0}^{5} cc\left ({p_{true(vr)},p_{pred(vr)}}\right) + mae\left ({p_{true(vr)},p_{pred(vr)}}\right) \tag{6}\\ L=&L_{h} + L_{v} \tag{7}\end{align*}
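A sketch of this loss at a single resolution is given below, with the Gaussian filtering implemented as a fixed convolution; the kernel size and standard deviation are assumptions, and in practice the per-resolution terms are summed over the six output stages as in Equations (5)-(7).

```python
import tensorflow as tf

def gaussian_kernel(size=5, sigma=1.0):
    # Fixed 2D Gaussian kernel used to create a neighborhood of influence.
    ax = tf.range(-(size // 2), size // 2 + 1, dtype=tf.float32)
    xx, yy = tf.meshgrid(ax, ax)
    k = tf.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return tf.reshape(k / tf.reduce_sum(k), [size, size, 1, 1])

def marginal_loss(y_true, y_pred, kernel, eps=1e-8):
    """Loss at one resolution: Gaussian-filter both maps, take horizontal and
    vertical marginals, and sum cross entropy and MAE in each direction."""
    y_true = tf.nn.conv2d(y_true, kernel, strides=[1, 1, 1, 1], padding='SAME')
    y_pred = tf.nn.conv2d(y_pred, kernel, strides=[1, 1, 1, 1], padding='SAME')

    def marginal(x, axis):                      # sum rows (axis=1) or columns (axis=2)
        m = tf.squeeze(tf.reduce_sum(x, axis=axis), -1)
        return m / (tf.reduce_sum(m, axis=-1, keepdims=True) + eps)

    def cc_mae(p, q):                           # cross entropy + mean absolute error
        ce = -tf.reduce_sum(p * tf.math.log(q + eps), axis=-1)
        mae = tf.reduce_mean(tf.abs(p - q), axis=-1)
        return tf.reduce_mean(ce + mae)

    loss_h = cc_mae(marginal(y_true, 1), marginal(y_pred, 1))   # horizontal direction
    loss_v = cc_mae(marginal(y_true, 2), marginal(y_pred, 2))   # vertical direction
    return loss_h + loss_v
```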
Post processing of the model output and ground truth to estimate loss function. The model separately estimates the losses for the horizontal and vertical axes.
Experimental Settings
A. Baseline Models
We compare our approach with simplified implementations that rely on either head pose or eye appearance alone. We also compare against competitive alternative approaches implemented with either head pose or eye appearance information.
1) Upsampled Neural Networks Using Head Pose
This model follows the same architecture as our main model without the eye encoder. The model estimates the visual attention of the driver with only the head pose. Figure 7(a) shows the architecture, which follows a similar structure as the one presented in our preliminary work [29]. We refer to this method as HP upsample NN.
2) GPR Model with Head Pose Input
To compare our models with regression-based methods, we train a Gaussian process regression (GPR) model as a reference for the HP upsample NN. The model takes the head pose as the input, and provides Gaussian distributions of the horizontal and vertical gaze angles as the output [28]. The model uses a linear basis function with a squared exponential kernel function with automatic relevance determination. Equation (8) shows the expression for the kernel function, where $\sigma_{f}$ is the signal standard deviation and $l_{i}$ is the characteristic length scale associated with the $i$th dimension of the input $\mathbf{x}$: \begin{equation*} k\left ({\mathbf {x},\mathbf {x}^{\prime }}\right) = \sigma ^{2}_{f} \exp \left ({-\frac {1}{2}\sum \limits _{i=0}^{d} \frac {\|x_{i}-x_{i}^{\prime }\|^{2}}{l_{i}^{2}} }\right) \tag{8}\end{equation*}
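A sketch of this baseline using scikit-learn is shown below, where an anisotropic RBF kernel provides the per-dimension length scales of the ARD formulation in Equation (8); the linear basis function of the original model is omitted and the remaining hyperparameters are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# ARD squared-exponential kernel: one length scale per head pose dimension (Eq. 8)
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(6)) + WhiteKernel(1e-3)

# One GPR per output angle; head_pose is an (N, 6) array, theta/phi are (N,) gaze angles
gpr_horizontal = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr_vertical = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gpr_horizontal.fit(head_pose, theta)
# mean, std = gpr_horizontal.predict(head_pose_test, return_std=True)  # Gaussian gaze estimate
```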
3) Upsampled Neural Networks Using Eye Appearance
We train an eye only model without the head pose encoder. Figure 7(b) shows the architecture of this baseline model, which is similar to the one presented in our preliminary work [57]. We refer to this method as eye upsample NN.
4) Regression Model With Eye Image Input
We train a regression model as a reference for the eye upsample NN. For this purpose, we design a model that uses the same architecture as the eye encoder. The output is then connected to a global average pooling (GAP) layer followed by a fully connected layer that gives the horizontal and vertical gaze angles as the outputs. The mean squared error on the development set is taken as the variance of the model to construct a Gaussian distribution of the gaze around the mean value estimated by the model.
B. Implementation Details
All the CNN-based models are trained using TensorFlow. We use a subject-independent partition for training, development and testing. Out of the 59 subjects, data from 39 drivers are used for the train set, data from 10 drivers for the development set, and data from 10 drivers for the test set. Each partition is balanced in terms of gender and whether or not the subjects are wearing glasses. The training data consists of both the continuous gaze set (Section III-A2) and the discrete gaze set (Section III-A1). Since there are more samples in the continuous part, we oversample the discrete gaze set by a factor of five. We obtain the final training set by combining both sets. We use the ADAM optimizer with a learning rate set to
Experimental Results
This Section presents the extensive evaluations conducted to study the proposed model, which we refer to as fusion upsample NN. Since our model estimates a confidence region (CR) of visual attention as opposed to a single gaze location, we study the accuracy of our model as a function of the area within which the accuracy is obtained. A model can have good accuracy, but poor spatial resolution (e.g., a confidence region that includes the entire windshield). It can also have good spatial resolution, but poor accuracy (e.g., a small confidence region that is incorrectly located). The ideal case is when a model achieves high accuracy within a small area. We present the performance with curves describing the tradeoff between accuracy and spatial resolution. The accuracy is measured by estimating the proportion of the gaze targets in the test set that are inside the estimated confidence region. Since the estimations are in the horizontal and vertical angles, the spatial resolution is represented as a fraction of the sphere around the driver’s head.
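As an illustration of this metric, the sketch below computes the solid-angle fraction of the sphere covered by a confidence region defined on an angular grid; the grid parameterization is an assumption.

```python
import numpy as np

def sphere_fraction(cr_mask, theta_edges, phi_edges):
    """Fraction of the sphere around the driver's head covered by a confidence
    region. cr_mask is a boolean (H, W) grid; theta_edges (W+1,) and phi_edges
    (H+1,) are cell boundaries in radians (azimuth and elevation)."""
    dtheta = np.diff(theta_edges)                              # (W,)
    dphi = np.diff(phi_edges)                                  # (H,)
    phi_centers = 0.5 * (phi_edges[:-1] + phi_edges[1:])       # (H,)
    cell_solid_angle = np.outer(np.cos(phi_centers) * dphi, dtheta)   # dOmega = cos(phi) dphi dtheta
    return np.sum(cell_solid_angle[cr_mask]) / (4.0 * np.pi)
```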
Comparison of the head pose only model using the upsampled neural network (Fig. 7(a)) and the GPR model. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the eye only model using the upsampled neural network (Fig. 7(b)) and the regression models using eye appearance. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the proposed gaze model (Fig. 4) that uses head pose and eye appearance information and the upsampled neural networks using either head pose (Fig. 7(a)) or eye appearance (Fig. 7(b)). While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the performance of the proposed gaze model (Fig. 4) at different resolutions. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the proposed gaze model (Fig. 4) when the eye information is missing (i.e., missing a modality). While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Analysis of the performance of the proposed approach for data collected from subjects wearing glasses and subjects without wearing glasses. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Analysis of the performance of the proposed approach trained with either the discrete or continuous gaze data. The performance drops for mismatched train and test conditions, indicating the need for using discrete and continuous gaze data to train the model.
The continuous data and the discrete marker data pose different sets of challenges for the estimation. The continuous gaze targets cover a smaller range in terms of angles, because the recordings are limited to the front of the vehicle where the road camera can see the marker. In the vertical direction, the range is further limited by the extent to which the researcher can move the marker. The discrete markers cover a wider range, as we have placed the target markers on the side windows, mirrors, and the gear of the car. However, the points are grouped only around the markers, providing a sparser coverage of possible gaze directions. For this reason, we separately report our test results for the continuous and the discrete gaze markers. The training, however, is done on a combination of both datasets, with the exception of the results in Section VI-E.
A. Performance of the Proposed Approach
Before we present the results of our proposed gaze model, we present the models implemented with only head pose or eye appearance data. We compare these models with alternative baselines.
We start our analysis with the performance of the head pose only model. For this purpose, we compare the HP upsample NN with the GPR model. Figures 8(a) and 8(b) show the performance for the continuous gaze targets and the discrete gaze markers. In the continuous gaze targets, we observe that the HP upsample NN shows an improvement in performance compared to the GPR model. Looking at the discrete gaze markers, while the performance of the two models is very close, the GPR model has a slightly better performance. The GPR model is a regression function that learns a representation of the gaze distribution, while the HP upsample NN is a classification function that learns the target distribution purely from data. Therefore, the HP upsample NN learns a rich representation in the continuous space, where the gaze distribution is denser but has a limited range. The HP upsample NN achieves a non-parametric map that does not make assumptions about the distribution of gaze. This architecture also helps us in obtaining a representation that can be fused with the eye patch information to obtain a single model. In contrast, the GPR model learns better when the data is limited and the learning depends on the extrapolation of the available data. The parametric nature of the GPR model is suitable for this case.
We also analyze the model using only eye appearance information. We compare the eye upsample NN with the regression model with eye image input. Both models have an identical encoder architecture. Figure 9 shows the result of the evaluation. We observe a clear performance gain in continuous gaze targets (Fig. 9(a)) when the area of the CR is under 2%. The regression model performs better for the test data associated with the discrete gaze markers (Fig. 9(b)). The eye upsample NN model shows poor performance for discrete gaze data because of the spatial sparsity of the data points, where the data is centered around a few markers.
After evaluating the models trained with either head pose or eye appearance, we consider our proposed approach that combines both types of information (fusion upsample NN). Figure 10 shows the performance, which demonstrates the clear advantages of combining head pose and eye appearance information. We observe clear improvements for continuous gaze data and discrete gaze data. The head pose can only provide limited information about the gaze. The information provided by the appearance of the eye is also needed, which is clearly observed in the figure. Likewise, head position also provides complementary information about the gaze. The models can better interpret gaze information inferred from the eye appearance by having the location and orientation of the head. Hence, combining both sets of information helps our proposed model provide a better estimation of the driver visual attention.
We compare all five models in Table 1. We obtain the accuracy of each model by fixing the size of the confidence region. We use three sizes corresponding to 1%, 2% and 4% of the sphere around the driver’s head. The table separately reports the results for the continuous and discrete data. For the discrete gaze data, the fusion model shows the best performance for all the confidence regions, with the exception of the 4% CR condition, where the HP upsample NN performs marginally better. For the continuous gaze data, our proposed approach shows high accuracy for small CRs.
B. Fusion Model Performance at Different Resolutions
This Section discusses the performance of our model at different resolutions. The output can be obtained at any desired resolution, since an output layer is connected after every upsampling stage. The resolution of the output at layer 1 is
C. Performance of Proposed Model Without Eye Information
The fusion model takes both head pose information and an image of the eyes. The eye patch detection might not work in cases when the illumination is not ideal or the head pose is extreme, which are common problems in naturalistic driving conditions [59]. In these cases, our model takes a blank image as input and only uses the head pose to provide information about the visual attention. In this situation, we expect the model to perform with a similar accuracy as a model trained with only head pose information. Figure 12 compares the performance of the proposed model when only head pose is available. For comparison, we include the performance of the model trained with only head pose (Fig. 7(a)), and the proposed approach evaluated with both inputs (head pose and eye appearance information). When we black out the eye input, the model shows performance similar to the model trained with only head pose. For the discrete gaze data, Figure 12(b) shows that the proposed approach with missing eye information can achieve even better performance than the model trained exclusively with head pose information. Figure 12(a) shows that there is a small difference between both conditions for the continuous gaze data.
D. Effect of Wearing Glasses on the Performance
Jha and Busso [59] discussed that the use of glasses is another important challenge in naturalistic conditions. The presence of glasses can affect the model, as our model depends on the appearance of the eyes. Five out of the ten subjects included in the test set wore glasses. This Section analyzes the differences in performance observed when the subjects wore or did not wear glasses. Figure 13 shows that the model works slightly better with subjects who did not wear glasses. This difference is minimal because our training data also contains a mix of subjects with and without glasses. Therefore, our models learn the representations well. Methods that compensate for challenges caused by glasses [67] can potentially improve the performance of our model.
E. Training with Matched and Mismatched Datasets
Our model is trained with data collected using two different data sets: the continuous gaze data and the discrete gaze data. For all the previous results presented in this paper, the training set includes data collected from both sets. This Section evaluates the performance of the models when training only with the continuous or discrete gaze data. Figure 14 shows the performance of these different models. We observe that the model trained using the discrete gaze data shows low performance on the continuous gaze data on the test set. Similarly, the model trained using the continuous gaze data shows low performance on the discrete gaze data on the test set. These results show that a model trained on a single type of data tends to overfit on similar data. For example, the model trained only on the discrete data tends to estimate only gaze targets around the numbered markers. Similarly, since the continuous data is limited to the frontal region of the windshield, the model fails when the test samples contain data outside the range of its distribution (e.g., looking at the side windows). The performance of the model trained with both datasets comes very close to the model trained and tested with matched datasets. This evaluation demonstrates the need to train our model with both continuous and discrete gaze data. This is not the case in many related studies, which use broad gaze zones or limited target markers [26], [52].
F. Projecting Probabilistic Maps onto the Road
We illustrate the benefit of our proposed model by projecting the distribution estimated by our model onto the road to correlate the visual attention with targets on the scene. The model estimates the visual attention in terms of angles with respect to the reference frame of the driver’s head. Since we aim to project this region onto the camera view, we make some approximations. First, we define multiple planes at different distances from the camera, ranging from 2 meters to 20 meters in one meter intervals. Figures 15(a), 15(b) and 15(c) show examples for projections at 2, 10, and 20 meters. For each of these planes, we calculate the angle subtended at the head for each pixel, which allows us to evaluate the gaze map given by the model at that pixel. The distributions obtained for each map at different distances are added and normalized to obtain the final map. Figure 15(d) shows the final probability distribution map after combining the results at different planes.
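The sketch below outlines this projection procedure; the function and parameter names (`gaze_pdf`, intrinsic matrix `K`, `head_pos`) are illustrative placeholders, and a single common coordinate frame between the head and the road camera is assumed for simplicity.

```python
import numpy as np

def project_attention_to_road(gaze_pdf, K, head_pos, img_shape, distances=range(2, 21)):
    """Accumulate the estimated gaze distribution over fronto-parallel planes at
    2-20 m and express it in the road-camera view. gaze_pdf(theta, phi) returns
    the probability for gaze angles in radians; K is a 3x3 pinhole intrinsic
    matrix and head_pos the driver's head position (placeholders)."""
    height, width = img_shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    acc = np.zeros((height, width))
    for d in distances:                                    # one plane every meter
        x = (u - K[0, 2]) / K[0, 0] * d                    # back-project pixels to the plane z = d
        y = (v - K[1, 2]) / K[1, 1] * d
        g = np.stack([x, y, np.full_like(x, float(d))], axis=-1) - head_pos
        theta = np.arctan2(g[..., 0], -g[..., 2])          # angle subtended at the head, Eq. (3)
        phi = np.arctan2(g[..., 1], np.hypot(g[..., 0], g[..., 2]))   # Eq. (4)
        acc += gaze_pdf(theta, phi)
    return acc / acc.sum()                                 # final combined, normalized map
```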
Illustration of the estimated confidence regions projected on the road. The figure shows the projections at 2, 10 and 20 meters, which are combined to create the final projection. The figure also shows example of 50%, 75% and 95% probability regions estimated from the estimated map. The figure is best viewed in color.
We obtain continuous estimations for data when the subject looks at landmarks on the road, as described in Section III-A3, for all the test subjects. The output of the model is a 2D probability distribution. Starting from the mean value, we can define a confidence region by increasing the area until reaching a target probability. For example, we can define a 50% probability map, where it is equally likely that the gaze is inside or outside this region. Using this approach, we define 50%, 75% and 95% gaze regions for this analysis. Figure 15(e) shows an example of these regions. The 95% gaze region is naturally larger than the 50% gaze region. The estimation region is evaluated against the ground truth landmark that the subject was asked to look at during the recordings. We consider success if there is an overlap between the ground truth target and the visual attention map. We evaluate a total of 272 examples from the 10 subjects in the test set.
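The sketch below shows how such a region can be extracted from the 2D probability map; for simplicity, it accumulates the most probable cells first (a highest-density approximation of the growing procedure described above).

```python
import numpy as np

def confidence_region(prob_map, coverage=0.75):
    """Boolean mask of the cells whose cumulative probability reaches the
    requested coverage (e.g., 0.50, 0.75, 0.95)."""
    p = prob_map / prob_map.sum()
    order = np.argsort(p, axis=None)[::-1]                 # cells sorted by probability
    cumulative = np.cumsum(p.ravel()[order])
    keep = order[:np.searchsorted(cumulative, coverage) + 1]
    mask = np.zeros(p.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(p.shape)

# Success in Section VI-F means the region overlaps the ground truth landmark:
# hit = confidence_region(road_map, 0.75)[target_row, target_col]
```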
Table 2 shows the accuracy at different regions. We observe that the model could correctly estimate the gaze within the 50% region for most examples.
Examples of road projection of the visual attention estimation where the ground truth overlaps with the model estimation. The red circles indicate the target gaze direction. The figure is best viewed in color.
Examples of road projection of the visual attention estimation where the ground truth does not overlap with the model estimation. The red circles indicate the target gaze direction. The figure is best viewed in color.
Conclusion
This paper proposed a novel architecture to estimate the driver visual attention using probabilistic maps from head pose and eye appearance information. The approach relies on the use of upsampling based blocks implemented with CNNs. We obtained a driver independent representation of the gaze that can be directly obtained from the eye appearance and the head pose of the driver without any calibration. The maps are non-parametric and, hence, purely learned from the data. The approach provides an efficient way to estimate visual attention, and helps in designing an efficient fusion mechanism. The variance of the maps is non-parametric and depends on the gaze direction that is learned directly from the data. Therefore, the size of the confidence regions will vary according to the underlying uncertainty. Our evaluation results showed that models implemented with either head pose or eye appearance information outperformed their respective regression baselines in estimating continuous gaze targets. The fusion model was found better than the models based on single modalities in both continuous and discrete gaze data, showing the complementary information provided by eye images and head pose values. One crucial advantage of the proposed model is that it can provide a coarse estimation of the visual attention when only the head pose is available, adding finer details when the eye appearance information is added. This feature is important when accurate eye patches cannot be reliably obtained due to challenging naturalistic driving conditions. Therefore, this architecture can have many potential applications in active safety systems.
Every vision-based solution for in-vehicle applications needs to be robust against car movements. Vibrations can lead to imaging problems that may interfere with the instant appearance of the face for a given frame (e.g., blurred images, and head movements that are not associated with visual attention). While our approach is sensitive to these issues, we expect that the results reported in this study are representative given that the MDM corpus was collected in a real car capturing common driving conditions. We predict the gaze direction relative to the driver’s head location, instead of predicting the absolute gaze location. Therefore, our model is reasonably robust to upper body movement. The robustness shown against glasses provides confidence that our proposed model can deal with some of these challenges.
In this paper, we defined visual attention as a heat map representing the distribution of the driver’s gaze. This definition does not capture the broad concept of visual attention (e.g., driving context, saccade movements, time spent looking at a place or object). First, the approach focuses mostly on eye fixations. Saccadic motions are more challenging to track because of the inherent randomness associated with them. Additionally, since the attention is not focused on a specific place, it is challenging to describe the ground truth gaze. Second, the definition does not consider cases of inattention due to cognitive distractions (e.g., seeing but not noticing the driving environment). This contextual information can be added to the attention map provided by the proposed model to obtain a more comprehensive safety system.
There are areas of potential improvement in the algorithm. We observed sub-optimal results for a few subjects. Factors that affect the performance include errors in the calibration between cameras, appearance variations with subjects wearing highly reflective glasses, and differences in gaze behaviors across subjects. Various methods can be used to improve our algorithm. For example, we can implement adaptation methods to personalize the system, addressing the variability problem across subjects. Semi-supervised methods can also be used that leverage natural data where the ground truth gaze is not available. Another limitation of our work is that it does not consider temporal information. The discrete gaze marker section of the MDM corpus only considers gaze information when the drivers were looking at the target markers and objects. Since the data only has gaze information for those moments, we do not have continuous gaze information describing the trajectories that led a driver to focus on a particular position. Currently, this option can only be explored with the continuous gaze data, since the MDM corpus provides gaze labels for only some frames in the discrete gaze data. Lastly, we rely on the Fi-Cap data to obtain the head pose for our method. This limitation can be easily addressed by using automatic head pose estimation algorithms. While current RGB-based algorithms do not reliably provide the head pose in all six degrees of freedom, the MDM database also has point cloud images collected with a time-of-flight (TOF) camera. This modality was used by Hu et al. [62] to estimate the orientation of the driver’s head. The method can be easily enhanced to also estimate the head position of the driver, providing the information that our model requires.