Introduction
Current driving systems place the driver at the center of the controls. The driver takes responsibility for ensuring that the entire system runs smoothly. If there is a sudden change in the environment, such as a pedestrian crossing the road, or a vehicle suddenly stopping in front of the ego car, it is up to the driver to take necessary actions to avoid potential accidents. While advances in in-vehicle technologies have helped drivers with safety features, such as collision avoidance, lane assist, and adaptive cruise control, the responsibility is still on the driver. Hence, it is only natural to design systems that can monitor drivers, inferring if they are aware of the driving environment [1]. The transition from manually controlled vehicles to fully autonomous systems is going to be gradual, given the complexity of mixed-autonomy traffic [2]. A semi-autonomous system needs to have synergy between the autonomy of the car and the human driver. When the driving system is unable to make important maneuvering decisions, it must transfer control to the driver. A distracted driver takes longer to resume control of the vehicle [3]. Hence, the system should be able to understand the actions, intents, and behaviors of the driver before transferring control of the car.
Various studies have addressed different aspects of driver attention, such as vehicle state information [4], physiological signals [5], [6], [7], cognitive distractions [8], [9], [10] and changes in emotional states [11]. The 100-car naturalistic driving study by the National Highway Traffic Safety Administration (NHTSA) concluded that in 80% of crash events and 65% of near-crash events, the driver was looking away from the forward roadway just before the event [12]. A driver relies on vision to gather information from the environment, including road signs [13] and pedestrians [14]. Driving tasks such as mirror-checking actions [15], [16] and lane changes [17], [18] also require the driver’s visual attention. Knowledge of the driver’s visual attention can be helpful in understanding their intent and their knowledge of the environment [19]. Correlating driver visual attention with the visual saliency of the scene can provide insights about what the driver is attending to [20], [21]. This information can be useful in cloning behavioral models onto autonomous systems to replicate safe driving patterns [22]. These examples illustrate the importance of estimating the driver’s visual attention and its potential applications. As the technology evolves towards autonomous vehicles, from level 1 to level 4, the relevance of this task also increases. In non-autonomous vehicles (L1 and L2), the drivers are involved in controlling the operations of the car, so the knowledge of their attention helps in maintaining safe driving conditions. In level 3, it is crucial to have synergy between the driver and the car, as most activities will be shared between them. While the driver might not need to be always attentive, at crucial times during handovers, the system needs to check that the driver is paying attention before transferring control of the car. With full autonomy (L4), the knowledge of a user’s visual attention can be helpful for infotainment and navigation systems (e.g., displaying information about buildings visually attended by the driver, and resolving ambiguities in the driver’s road-related commands).
Different studies have tried to use the driver’s gaze to estimate distractions. Some studies have binarized the problem by considering the duration of gaze-off-the-road events [23], [10]. Other studies have associated head pose and gaze estimation directly with driving activities [4], [17], [24]. An alternative approach is to divide the driver’s visual attention into gaze zones [25], [26], [27]. While these methods provide a good coarse estimate of the driver’s visual attention, some applications may require a finer estimation of where the driver is directing her/his visual attention. Commercial head-mounted eye trackers such as Tobii and faceLAB can provide an accurate estimate of the driver’s gaze and are useful for research, but their invasive nature limits their use in real-world applications. These methods do not provide information about the driver’s awareness of particular objects on the road, such as other vehicles and pedestrians. Acknowledging the fact that fine details about the driver’s gaze may not always be available, we design a model that uses both head pose and eye appearance information to estimate the gaze in real naturalistic driving recordings. We define eye appearance as the image around the eyes of a person.
Visual attention is a broader concept in cognitive science that deals with the study of a person’s awareness of the visual world. However, this study restricts the definition of visual attention to the heat map representing the distribution of the driver’s gaze (i.e., the area that a driver is looking at while operating the car). We quantify visual attention with a probability map that can be easily projected onto the road, mapping the driving environment with the driver’s attention. This formulation also allows us to assign an intensity to the direction of the driver’s vision. For example, a driver fixating on a location will have a map with high intensity in a small region. When a driver is exploring the driving environment, the saccade movements will create a map with low intensity over a large area.
In Jha and Busso [28], [29], we proposed that providing probabilistic maps is a practical and effective way to estimate visual attention. Our earlier efforts relied only on the head pose, but driver visual attention is a function of both head pose and eye movement [30]. In this work, we build a fusion model with two branches, each of which takes the head pose and the eye image as inputs and learns a combined representation of the visual attention of the driver. The proposed model contains three parts: a head pose encoder, an eye encoder, and a decoder. The head pose encoder takes the six parameters describing the head pose (position and orientation) as the inputs, which are passed through fully connected networks, followed by reshaping to obtain a low-dimensional map. The eye encoder takes the eye patch image as the input and sends it through a neural network to extract discriminative information. The decoder of our model concatenates the outputs of the encoders, creating a unified feature map that is sent through an upsampling network to obtain the probabilistic map at different resolutions. We classify the driver’s visual attention on a 2D grid which is learned by using upsampling with convolutional neural networks (CNNs). This representation is purely learned from the data and, hence, is non-parametric. Therefore, it does not require imposing parametric distributions as in our previous model [28], increasing the flexibility of our formulation.
We use the multimodal driver monitoring (MDM) dataset [31] to train and evaluate our model. The recordings used for training are a combination of continuous data, where the driver follows a target marker that is moved around in front of a parked car, and discrete data points, where the driver briefly glances at target markers inside the car, while driving the vehicle and while the car is parked. These datasets provide us with a diverse set of data in terms of how the subject approaches a gaze target. We first design simple models that use only one input modality (head pose only or eye appearance only) and compare them with respective baseline models. We observe that our models show superior performance to the baseline models. Our fusion model that takes both head pose and eye appearance as inputs shows better accuracy when compared to a simple model based solely on the head pose or eye appearance. For the eye appearance, we use the face camera and for the head pose we use Fi-Cap labels. We demonstrate potential applications of the model by projecting the probability map onto the road. For example, the visual map that includes 75% of the probability overlaps with the true target point 92.54% of the time.
This paper is organized as follows. Section II discusses related studies in the field of monitoring driver visual attention. Section III describes the portion of the MDM dataset that we use for our proposed model. Section IV describes in detail the proposed model architecture. Section V discusses the experimental settings, including baselines and implementation details. Section VI evaluates our proposed model by comparing its performance in different settings with various baselines. It also discusses projections of the probabilistic gaze map on the road, evaluating the estimation for cases when the subjects were looking at landmarks on the road. Section VII provides the concluding remarks and future research directions.
Related Work
Driving is a challenging task requiring a high level of vigilance from the driver. Therefore, many researchers have studied the factors associated with driver behaviors and their effects on vehicle operation [32]. In their review of human behavior in an intelligent vehicle environment, Ohn-Bar and Trivedi [33] noted that multiple studies have analyzed drivers in terms of their intentions, behaviors, and actions for maneuvering control. These studies mostly include the gaze, eye appearance, hands, and head movements of the driver, and their interactions with objects inside the car. This Section reviews relevant studies on driver visual attention, emphasizing the relation between head pose and eye appearance for gaze estimation.
A. Driver Visual Attention estimation
Given the importance of visual attention in studying driver vigilance, many studies have proposed methods for using head pose and/or gaze to estimate the driver’s visual attention. Dong et al. [34] presented a review of various monitoring systems for driver distraction. They discussed methods based on subjective reports [35], [36], [37], physiological methods [38], [39], [40], [41], physical methods [42], [43], [44], and driving performance measures [45], [46], [47]. Dong et al. [34] report many studies that have used the driver’s eye and head movements to estimate distraction [42], [48] and fatigue [43], [44]. Unlike other physiological signals, such as electroencephalogram (EEG) and electrocardiogram (ECG), tracking eye movement does not require invasive sensors. They concluded that using driving performance measures in conjunction with eye and head movements is the most reliable solution for monitoring driver distractions.
1) Relationship between Head Pose and Gaze
The eyes and head move when we glance at a given target location. The relationship between head pose and gaze is non-trivial, depending on several factors including the cognitive load. Muñoz et al. [49] analyzed the head rotations of a driver during forward glances and glances to the center console of the vehicle. They recognized these two glance patterns using a temporal model based on a hidden Markov model (HMM). They observed that head pose is a strong indicator of gaze location. They also observed that there are differences in the patterns observed across individual drivers. Talamonti et al. [50] studied head pose and eye gaze dynamics by asking various subjects to look at different locations in the car in a simulated driving environment. They observed that there is very little head movement when looking at locations such as the odometer and rear-view mirror. However, looking at locations such as the center console and the left mirror requires more head movement from the driver. Jha and Busso [30] noted that while there is a strong relationship between head pose and gaze location, the relationship is not one-to-one. They observed that the variability in the gaze changes based on the direction in which the driver is directing her/his visual attention. They also observed a significant difference in gaze patterns when driving a vehicle, where glancing is one of several driving tasks, and when the car is parked, where glancing is the only task and the driver has more time.
2) Using Head Pose to Estimate Driver Visual Attention
Given the strong relation between head pose and gaze, many studies have used head pose to estimate the driver visual attention. Fridman et al. [25] proposed a method to estimate gaze zones with only head pose. They extracted facial landmark features that provide strong cues for head pose. They used these features as inputs to a random forest (RF) classifier to categorize the driver’s visual attention into seven gaze zones. They trained and evaluated the model on a dataset collected from 50 subjects with two different vehicles. Yuen et al. [51] used a dataset collected in a naturalistic driving setting to perform face localization, landmark detection, and head pose estimation using deep neural network (DNN) architectures. They observed higher performance compared to models trained using public datasets collected in indoor settings. Jha and Busso [28] proposed an alternative probabilistic method to model visual attention of a driver. Instead of dividing the visual area into gaze zones, they use heat maps to represent a probability distribution of the driver’s gaze conditioned on the head pose. They use Gaussian process regression (GPR), which provides a Gaussian distribution of the gaze as a function of the driver’s head pose.
3) Incorporating Eye Appearance Information
Head pose provides important information about the visual attention of the driver. However, the estimates are coarse. To provide finer details, many studies have incorporated information from the eyes. Tawari et al. [52] used both head and eye cues to classify the driver’s visual attention into preset gaze zones. They extracted high-level geometrical eye features from the eye location and used them as inputs, along with the head pose, into an RF classifier. They observed that adding eye cues increases the performance of their model, reaching 94.9% accuracy compared to a head pose only model with an accuracy of 79.8%. Fridman et al. [53] studied the gain in performance of a model when using head pose alone versus when using information about the head pose and eye appearance. They observed that while there is an improvement in performance, this improvement is user specific. They defined a metric called owlness, which quantifies how much the driver depends on head movement alone to complete a glance. The study concluded that if a driver’s owlness score is low, there is a larger improvement in the model when adding the eye appearance information. Most of these approaches classify the driver’s gaze into discrete zones. Bergasa et al. [43] proposed a model-based approach for this problem. They estimate factors related to eye closure, head pose, and gaze to determine the driver’s level of vigilance. Near-infrared (IR) illuminators were used, which create bright reflections on the eye, making it easier to detect the pupil. They used these factors to estimate the driver’s fatigue level and inattentiveness using preset rules in a fuzzy logic design. Vora et al. [26] proposed a generalized framework for estimating driver gaze zones using CNNs. They performed an analysis using different models and images with different parts of the face, concluding that a SqueezeNet model that takes the top half of the face as input provided the best results with an accuracy of 95.18%. Ewaisha et al. [27] proposed a CNN-based multitask learning approach that simultaneously performs gaze estimation and head pose estimation. The model performs regression on both tasks, followed by a clustering step where the data samples are assigned to different gaze zones. Performing head pose estimation as an auxiliary task increased the robustness of the model against head pose variations. They achieved 78.2% accuracy on cross-subject evaluations.
B. Application of Visual Attention
Ahlstrom et al. [54] studied the effect of a visual distraction warning system on the driver’s behavior. They used a gaze zone classification system to generate a warning when the driver looked away for more than a certain amount of time. They observed that using this warning system reduced the amount of time the driver spent looking off-the-road, even when involved in a secondary distracting activity. Liang et al. [55] analyzed the crash risk associated with driver gaze variables (i.e., glance history, glance duration, and glance location). They analyzed 24 different algorithms that estimate risk and concluded that the off-the-road glance duration was the most important factor for estimating crash risk. Li and Busso [10] studied various multimodal features from the CAN-Bus data, road scene, and driver’s face to estimate the perceived visual and cognitive distraction during driving. The distraction space formed by visual and cognitive distractions was divided into four clusters and a classifier was trained to distinguish between these classes using multimodal features. They noted that the model classified the data as distracted when the drivers were performing secondary tasks characterized by high cognitive and/or visual load.
Sodhi et al. [56] analyzed the eye positions and pupil diameter using an eye tracking device to study the effect of a secondary task on the driver’s visual attention. They observed an increase in the amount of off-the-road fixation because of competing visual demands on tasks. They observed that the impact varied based on the task. For example, tuning the radio required more off-the-road fixation compared to checking the odometer. They also noted that cognitive tasks decreased the amount of eye movement, since the standard deviation for the fixation displacement reduced when performing mental calculations. Similarly, Reimer et al. [9] studied the effect of cognitive demand on visual attention. They used gaze concentration as a metric, which is a measure of reduction in gaze variability. As the amount of cognitive demand increased, they found that the amount of gaze concentration also increased, implying that drivers tend to fixate at a point when cognitively distracted. They observed that while the amount of gaze concentration increased when moving from low to moderate difficulty tasks, the difference was less defined than the difference from moderate to high difficulty tasks. Martin and Trivedi [17] used the gaze dynamics of the driver to estimate lane based maneuvers in a real world driving scenario. They designed models that could estimate lane change 600 milliseconds before the event with an accuracy of 85%.
There has also been interest in estimating visual saliency based on the road scene to estimate where the driver is expected to look. Palazzi et al. [20] collected the Dr(eye)ve dataset that uses an eye-tracker to track the driver’s fixation on the road view. They used this data to train a temporal model that estimates the visual attention of the driver based on the road scene image. They trained a three branch network that takes the original RGB image, the optical flow image, and a semantic segmentation from the scene view to estimate the focus of attention. Lv et al. [21] improved this model by using reinforced attention. This method uses deep reinforcement learning as a regulatory mechanism that increases the density of the estimation.
C. Relation to Prior Work
Our model is inspired by the model proposed in our previous conference papers [29], [57]. In Jha and Busso [29], we proposed a method that uses CNN with upsampling to create a 2D grid representation of the visual attention learned from the head pose. This model has important limitations, since we exclusively rely on head pose without considering eye information. As a result, the model created large maps of visual attention with low certainty. Additionally, the model was only trained on discrete gaze points which made the model overfit to those specific locations. In Jha and Busso [57], we proposed a method to estimate gaze maps from eye patch images using CNN with maxpooling and upsampling. This model was trained with the MSP-Gaze corpus [58], which was collected in an indoor laboratory setting. Therefore, the model did not address the challenges observed in naturalistic driving conditions [59].
In this study, we propose a model that incorporates both head pose and eye appearance with a novel encoder-decoder approach that relies on downsampling and upsampling with CNNs. We also use data collected with continuous gaze sequences and with discrete gaze locations, making the data more representative of gaze behavior in a vehicle. This approach provides a better representation, leading to a model that can estimate a small gaze region with high accuracy. The contributions of this study are:
We propose a principled gaze detection approach to fuse head and eye movements. The approach uses two encoders that take inputs from the head pose and eye appearance and fuses the representations with a single decoder that generates the visual map at different resolutions.
We propose a loss function that uses marginal distribution in the horizontal and vertical directions with Gaussian filtering so that estimations which are further away from the ground truth get higher penalties.
We create a loss term for each resolution and optimize them in parallel to make sure that the correct representation is learned at each resolution.
We conduct an exhaustive evaluation using natural driving recordings, demonstrating the performance of the proposed fusion approach, which achieves better performance than alternative baseline methods.
The approach is trained such that it works even when one modality is missing, making this approach more practical for real world applications.
We demonstrate the potential of the proposed approach by projecting the estimated probabilistic gaze map onto the road to identify gaze targets outside the vehicle.
Database
We use the multimodal driver monitoring (MDM) dataset [31] to train and evaluate our model. The MDM dataset is a real-world driving dataset collected with 59 subjects, where the drivers are asked to perform various actions in a parked car as well as while driving. The objective of the dataset is to collect naturalistic data with reliable ground truths for the head pose and gaze of the driver. Each subject wore the Fi-Cap helmet [60] during the data collection, which is a cap-like structure that contains 23 AprilTags that can be easily tracked in an image (Fig. 1). By tracking these AprilTags, we can establish reliable information about the driver’s head pose. Figure 1 shows the sensors used in the data collection. The data is recorded using four GoPro RGB cameras: frontal face, profile face (near the rearview mirror), back, and road. The data also includes a PMD picoflexx depth camera, which relies on time-of-flight (TOF) technology and is not used in this study. This dataset provides annotations of the driver’s gaze in 3D space in a variety of situations. Since the data is collected in both driving and parked settings, it provides a diverse set of recordings to train robust driver-independent models. Another reason for using the MDM corpus is the availability of reliable gaze and head pose labels.
Sensors included in the MDM corpus. This study relies on the face camera. It also uses the back camera to estimate the head pose of the drivers.
A. Protocol
While the data collection protocol involved multiple primary and secondary tasks, we limit our discussion in this paper to the sections that are relevant to the current study. The readers are referred to Jha et al. [31] for more details about the corpus.
1) Discrete Gaze Markers
A goal of this corpus is to provide data to train gaze-based models. There are 21 numbered markers placed at different locations inside the car (Fig. 2(b)). The numbered markers are placed at the following locations: 1 to 13 on the windshield, 14 to 16 on the mirrors, 17 and 18 on the left and right windows, respectively, 19 on the speedometer panel, 20 on the dashboard, and 21 near the gear of the car. The subjects are asked to look at each of the numbered markers in a random order multiple times. For example, Figure 2(a) shows an example of a subject looking at marker number 13. This step is repeated for two conditions: when the car is parked, and when the participant is driving on a straight road. The location of each marker with respect to the back camera is known, which is used as the ground truth gaze location for the frames when the subject is looking at these target markers.
Protocol to collect ground truth for gaze data. (a) Discrete gaze markers, where drivers are asked to look at markers inside the car, (b) continuous gaze target, where drivers are asked to follow a board held by a researcher when the car is parked, and (c) gaze target landmark, where the driver is asked to look at landmarks outside the car.
2) Continuous Gaze Target
The corpus also includes continuous gaze data collected while the vehicle was parked. In this step, a researcher conducting the experiment holds a large board with an AprilTag [61] printed on it. The researcher walks in front of the vehicle (Fig. 2(d)). The AprilTag is used so that its 3D location can be tracked from the road camera image. The researcher moves this target and the subject is asked to follow the target with her/his gaze. The data is collected in 3 to 5 sessions of about one minute each. This part of the recordings provides us with continuous data with ground truth gaze annotated for each subject. This protocol cannot be implemented when the driver is operating the vehicle, resulting in limited diversity in terms of appearance and changes in illumination. However, it provides very rich data to be augmented with the marker data.
3) Gaze Target Landmark
In this part of the protocol, the subjects are asked to look at various landmarks on the road and answer questions about them (Fig. 2(f)). For example, we ask them to identify stores on the side of the road. This data is collected when the subject is driving. The landmark location is captured on the road camera. Since the target information is limited to the 2D projections on the road, we use this data to validate our model in real world scenarios by projecting our model estimation onto the road (Section VI-F).
B. Data Preparation
The proposed model takes as input the head pose of the driver and an image of her/his eyes to generate a 2D grid that represents the horizontal and vertical gaze directions.
The head pose is obtained using the Fi-Cap reading. While the model can work with head pose obtained from automatic computer vision algorithms, we use the Fi-Cap for this purpose because of its reliability in providing head pose in all six degrees of freedom for almost all the frames (position and orientation). The Fi-Cap is in the back of the head so it does not occlude important facial features. Hence, an accurate head pose estimation algorithm that relies on facial images will also work on the MDM dataset. Using the Fi-Cap, the head orientation angles are obtained using the multiple local reference frame calibration framework explained in Hu et al. [62]. These angles are with respect to the depth camera that is placed alongside the face RGB camera. Local reference frames are picked from each of the videos with near-frontal face orientation. The relative rotations of these frames from a global reference frame are calculated using the iterative closest point (ICP) algorithm. The angles for each frame are calculated based on the rotation of the Fi-Cap with respect to these local reference frames. The position of the head is approximated with the position of the center of the Fi-Cap. The back camera (Fig. 1) is used as the reference location to ensure that the gaze targets and the face lie on the same side of the coordinate system.
We obtain the eye patch by running the face alignment network (FAN) algorithm [63] on images captured by the face camera. The landmarks surrounding the eye region are used to create a bounding box for the eyes. We add an extra margin by including 10 to 40 pixels around the eyes, picking the actual number at random. The image is then resized to a fixed resolution before being passed to the eye encoder.
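To make this step concrete, the snippet below sketches one possible implementation of the cropping procedure, assuming the standard 68-point landmark layout returned by FAN (indices 36-47 cover the two eyes); the output resolution `out_size` is a hypothetical placeholder, since the exact value is not given in this excerpt.

```python
import cv2
import numpy as np

def crop_eye_patch(frame, landmarks, out_size=(128, 64), rng=np.random):
    """Crop an eye patch from FAN landmarks with a random margin of 10-40 pixels.
    landmarks: (68, 2) array; indices 36-47 cover both eyes in the 68-point scheme.
    out_size is a placeholder; the paper's exact input resolution is not stated here."""
    eyes = landmarks[36:48]
    x0, y0 = eyes.min(axis=0).astype(int)
    x1, y1 = eyes.max(axis=0).astype(int)
    margin = rng.randint(10, 41)                      # extra margin picked at random
    h, w = frame.shape[:2]
    x0, y0 = max(x0 - margin, 0), max(y0 - margin, 0)
    x1, y1 = min(x1 + margin, w), min(y1 + margin, h)
    patch = frame[y0:y1, x0:x1]
    return cv2.resize(patch, out_size)                # (width, height) for cv2.resize
```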
The ground truth gaze location during discrete gaze is obtained using the absolute location of the markers, which is estimated before the recordings. The ground truth location during continuous gaze is obtained using the location of the AprilTag tracked with the road camera. The locations are all transformed into the back camera coordinate system. The gaze vector is computed from the gaze target position $\mathbf{t_{gaze}}$ and the head position $\mathbf{h_{pos}}$, and converted into horizontal and vertical gaze angles: \begin{align*} \mathbf {g}=&\mathbf {t_{gaze}} - \mathbf {h_{pos}} \tag{1}\\ \hat {\mathbf {g}}=&\frac {\mathbf {g}}{\|{\mathbf {g}}\|} \tag{2}\\ \theta _{gaze}=&\arctan \left ({\frac {\hat {g}_{x}}{-\hat {g}_{z}}}\right) \tag{3}\\ \phi _{gaze}=&\arctan \left ({\frac {\hat {g}_{y}}{\sqrt {{\hat {g}_{x}}^{2}+{\hat {g}_{z}}^{2}}}}\right) \tag{4}\end{align*}
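As a reference, a minimal NumPy sketch of Equations (1)-(4) is shown below, with both points expressed in the back camera coordinate system; `arctan2` replaces `arctan` only for numerical robustness.

```python
import numpy as np

def gaze_angles(t_gaze, h_pos):
    """Compute horizontal and vertical gaze angles (radians) from the 3D gaze
    target t_gaze and the head position h_pos, following Eqs. (1)-(4)."""
    g = np.asarray(t_gaze, float) - np.asarray(h_pos, float)      # Eq. (1)
    g_hat = g / np.linalg.norm(g)                                  # Eq. (2)
    theta = np.arctan2(g_hat[0], -g_hat[2])                        # Eq. (3), horizontal
    phi = np.arctan2(g_hat[1], np.hypot(g_hat[0], g_hat[2]))       # Eq. (4), vertical
    return theta, phi
```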
The gaze angle values are truncated within the range of
Example of the process to convert the ground truth gaze angles into a grid. The figure shows the maps as the resolution increases across the upsampling stages.
Proposed Model
Our proposed approach uses the driver’s head pose and eye information to obtain a map representing the visual attention of the driver. Figure 4 shows the entire model, which consists of three blocks: the head pose encoder, the eye encoder, and the visual attention decoder. The motivation behind this design is to obtain a common representation of the visual attention from both head pose and eye information at low resolution, and then gradually upsample it to refine the information. The head pose encoder and the eye encoder take the head pose and eye patch information as inputs, respectively, and generate multiple 2D maps that are concatenated. The visual attention decoder uses upsampling to create high-resolution 2D maps that represent the direction of the driver’s visual attention. This Section presents these blocks in detail.
Our proposed architecture to fuse head pose and eye information to estimate visual attention. The model contains three parts: a head pose encoder that takes the 6D head pose information, an eye encoder that takes an eye patch image from the face camera, and a visual attention decoder that fuses both representations to generate the probabilistic gaze maps.
A. Head Pose Encoder
The goal of the head pose encoder is to transform the head pose information into a tensor, which is used to upsample the visual attention representation. We have observed good performance by using CNNs to upsample the representation obtained from the head pose [29]. For this study, the CNN-based upsampling is conducted by the visual attention decoder (Section IV-C). The head pose information, represented by a six dimensional vector, is passed through two fully connected (FC) layers. The first FC layer has 512 nodes and the second FC layer has 672 nodes. The output of the FC layers is a 672D feature representation that is reshaped into a low-resolution 2D feature map.
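A sketch of this encoder under the stated configuration is shown below; the reshape target `(6, 7, 16)` is a hypothetical choice whose product equals 672, since the exact map dimensions are not specified in this excerpt.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_head_pose_encoder(grid_shape=(6, 7, 16)):
    """Map the 6D head pose (position + orientation) to a low-resolution 2D
    feature map. grid_shape is a placeholder whose product must equal 672."""
    head_pose = layers.Input(shape=(6,), name='head_pose')
    x = layers.Dense(512, activation='relu')(head_pose)   # first FC layer
    x = layers.Dense(672, activation='relu')(x)            # second FC layer
    feat = layers.Reshape(grid_shape)(x)                   # 672D vector -> 2D map
    return tf.keras.Model(head_pose, feat, name='head_pose_encoder')
```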
While predicting a high-resolution 2D map from just six numbers may appear as an under-defined problem, we can effectively represent this gaze distribution by training the method per stage, presenting the target ground truth gaze area at different resolutions, as demonstrated by the experiments in Section VI-A.
B. Eye Encoder
The appearance of the eyes, including the relative position of the iris, gives valuable information about the target gaze direction [57], [58], [64], [65]. Our goal is to incorporate this information into the model. The eye encoder takes the eye image and generates a tensor which serves as a representation to estimate the gaze area. We can use several networks to obtain a discriminative feature representation from the eye patch provided by the FAN algorithm. This study relies on the MobileNet network [66], without the FC layers. This network relies on depth-wise separable convolutions, which provide competitive performance with a reasonable size. The architecture has been found useful in various computer vision tasks such as image classification, detection, and segmentation. The output of the eye encoder is a low-resolution feature tensor.
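The following sketch shows one way to instantiate the eye encoder with a MobileNet backbone stripped of its FC layers; the eye patch resolution is a placeholder.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_eye_encoder(input_shape=(96, 160, 3)):
    """Extract a spatial feature tensor from the eye patch with MobileNet
    (no FC layers). input_shape is a placeholder for the actual eye-patch size."""
    eye = layers.Input(shape=input_shape, name='eye_patch')
    backbone = tf.keras.applications.MobileNet(include_top=False, weights=None,
                                               input_shape=input_shape)
    feat = backbone(eye)            # depth-wise separable convolutions only
    return tf.keras.Model(eye, feat, name='eye_encoder')
```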
C. Visual Attention Decoder
The visual attention decoder concatenates the feature tensors produced by the head pose encoder and the eye encoder, and passes the combined representation through a series of upsampling blocks implemented with CNNs. An output layer is connected after every upsampling stage, so the decoder produces probabilistic gaze maps at progressively higher resolutions.
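The sketch below illustrates this decoding scheme; the encoder output shapes, channel counts, and number of upsampling stages are assumptions, and both encoder tensors are assumed to share the same spatial size so that they can be concatenated along the channel axis.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_visual_attention_decoder(head_shape=(6, 7, 16), eye_shape=(6, 7, 64),
                                   num_stages=3):
    """Fuse the encoder outputs and upsample them into probabilistic gaze maps,
    producing one normalized map per resolution. All shapes are placeholders."""
    head_feat = layers.Input(shape=head_shape, name='head_feat')
    eye_feat = layers.Input(shape=eye_shape, name='eye_feat')
    x = layers.Concatenate(axis=-1)([head_feat, eye_feat])
    outputs, (h, w) = [], head_shape[:2]
    for _ in range(num_stages):
        x = layers.UpSampling2D(size=2)(x)
        h, w = 2 * h, 2 * w
        x = layers.Conv2D(32, 3, padding='same', activation='relu')(x)
        logits = layers.Conv2D(1, 1)(x)                          # one map per stage
        prob = layers.Softmax(name=f'map_{h}x{w}')(layers.Flatten()(logits))
        outputs.append(layers.Reshape((h, w, 1))(prob))          # probability map over the grid
    return tf.keras.Model([head_feat, eye_feat], outputs,
                          name='visual_attention_decoder')
```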
D. Loss
Figure 6 illustrates the approach that we use to estimate the loss. The loss is separately calculated in the horizontal and vertical directions. This approach reduces the number of operations by allowing us to obtain losses in two separate 1D spaces (horizontal and vertical directions) instead of looking at each point in a common 2D space. The visual map output and the ground truth are first filtered with a Gaussian mask to create a neighborhood of influence around each point. This approach starts rewarding the model when the estimations get spatially closer to the ground truth value, facilitating the learning process (i.e., overlap between the estimated and ground truth Gaussian masks). Then, the marginal distributions in both the horizontal and vertical directions are calculated by adding each column and row, respectively. The final loss is the sum of the cross entropy and the mean absolute error loss between the estimated and the ground truth vectors in each direction, \begin{align*} L_{h}=&\sum _{l=0}^{5} cc\left ({p_{true(hr)},p_{pred(hr)}}\right) + mae\left ({p_{true(hr)},p_{pred(hr)}}\right) \tag{5}\\ L_{v}=&\sum _{l=0}^{5} cc\left ({p_{true(vr)},p_{pred(vr)}}\right) + mae\left ({p_{true(vr)},p_{pred(vr)}}\right) \tag{6}\\ L=&L_{h} + L_{v} \tag{7}\end{align*}
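A sketch of this loss at a single resolution is given below, with the Gaussian filtering implemented as a fixed convolution; the kernel size and standard deviation are assumptions, and in practice the per-resolution terms are summed over the six output stages as in Equations (5)-(7).

```python
import tensorflow as tf

def gaussian_kernel(size=5, sigma=1.0):
    # Fixed 2D Gaussian kernel used to create a neighborhood of influence.
    ax = tf.range(-(size // 2), size // 2 + 1, dtype=tf.float32)
    xx, yy = tf.meshgrid(ax, ax)
    k = tf.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return tf.reshape(k / tf.reduce_sum(k), [size, size, 1, 1])

def marginal_loss(y_true, y_pred, kernel, eps=1e-8):
    """Loss at one resolution: Gaussian-filter both maps, take horizontal and
    vertical marginals, and sum cross entropy and MAE in each direction."""
    y_true = tf.nn.conv2d(y_true, kernel, strides=[1, 1, 1, 1], padding='SAME')
    y_pred = tf.nn.conv2d(y_pred, kernel, strides=[1, 1, 1, 1], padding='SAME')

    def marginal(x, axis):                      # sum rows (axis=1) or columns (axis=2)
        m = tf.squeeze(tf.reduce_sum(x, axis=axis), -1)
        return m / (tf.reduce_sum(m, axis=-1, keepdims=True) + eps)

    def cc_mae(p, q):                           # cross entropy + mean absolute error
        ce = -tf.reduce_sum(p * tf.math.log(q + eps), axis=-1)
        mae = tf.reduce_mean(tf.abs(p - q), axis=-1)
        return tf.reduce_mean(ce + mae)

    loss_h = cc_mae(marginal(y_true, 1), marginal(y_pred, 1))   # horizontal direction
    loss_v = cc_mae(marginal(y_true, 2), marginal(y_pred, 2))   # vertical direction
    return loss_h + loss_v
```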
Post processing of the model output and ground truth to estimate loss function. The model separately estimates the losses for the horizontal and vertical axes.
Experimental Settings
A. Baseline Models
We compare our approach with simplified implementations that rely on either head pose or eye appearance alone. We also compare against competitive alternative approaches implemented with either head pose or eye appearance information.
1) Upsampled Neural Networks Using Head Pose
This model follows the same architecture as our main model without the eye encoder. The model estimates the visual attention of the driver with only the head pose. Figure 7(a) shows the architecture, which follows a similar structure as the one presented in our preliminary work [29]. We refer to this method as HP upsample NN.
2) GPR Model with Head Pose Input
To compare our models with regression-based methods, we train a Gaussian process regression (GPR) model as a reference for the HP upsample NN. The model takes the head pose as the input, and provides Gaussian distributions of the horizontal and vertical gaze angles as the output [28]. The model uses a linear basis function with a squared exponential kernel function with automatic relevance determination. Equation (8) shows the expression for the kernel function, where $\sigma_{f}$ is the signal standard deviation and $l_{i}$ is the characteristic length scale associated with the $i$th dimension of the input $\mathbf{x}$: \begin{equation*} k\left ({\mathbf {x},\mathbf {x}^{\prime }}\right) = \sigma ^{2}_{f} \exp \left ({-\frac {1}{2}\sum \limits _{i=0}^{d} \frac {\|x_{i}-x_{i}^{\prime }\|^{2}}{l_{i}^{2}} }\right) \tag{8}\end{equation*}
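A sketch of this baseline using scikit-learn is shown below, where an anisotropic RBF kernel provides the per-dimension length scales of the ARD formulation in Equation (8); the linear basis function of the original model is omitted and the remaining hyperparameters are assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# ARD squared-exponential kernel: one length scale per head pose dimension (Eq. 8)
kernel = ConstantKernel(1.0) * RBF(length_scale=np.ones(6)) + WhiteKernel(1e-3)

# One GPR per output angle; head_pose is an (N, 6) array, theta/phi are (N,) gaze angles
gpr_horizontal = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr_vertical = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gpr_horizontal.fit(head_pose, theta)
# mean, std = gpr_horizontal.predict(head_pose_test, return_std=True)  # Gaussian gaze estimate
```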
3) Upsampled Neural Networks Using Eye Appearance
We train an eye only model without the head pose encoder. Figure 7(b) shows the architecture of this baseline model, which is similar to the one presented in our preliminary work [57]. We refer to this method as eye upsample NN.
4) Regression Model With Eye Image Input
We train a regression model as a reference for the eye upsample NN. For this purpose, we design a model that uses the same architecture as the eye encoder. The output is then connected to a global average pooling (GAP) layer followed by a fully connected layer that gives the horizontal and vertical gaze angles as the outputs. The mean squared error on the development set is taken as the variance of the model to construct a Gaussian distribution of the gaze around the mean value estimated by the model.
B. Implementation Details
All the CNN-based models are trained using TensorFlow. We use a subject-independent partition for training, development and testing. Out of the 59 subjects, data from 39 drivers are used for the train set, data from 10 drivers for the development set, and data from 10 drivers for the test set. Each partition is balanced in terms of gender and whether or not the subjects are wearing glasses. The training data consists of both the continuous gaze set (Section III-A2) and the discrete gaze set (Section III-A1). Since there are more samples in the continuous part, we oversample the discrete gaze set by a factor of five. We obtain the final training set by combining both sets. We use the ADAM optimizer with a learning rate set to
Experimental Results
This Section presents the extensive evaluations conducted to study the proposed model, which we refer to as fusion upsample NN. Since our model estimates a confidence region (CR) of visual attention as opposed to a single gaze location, we study the accuracy of our model as a function of the area within which the accuracy is obtained. A model can have good accuracy, but poor spatial resolution (e.g., a confidence region that includes the entire windshield). It can also have good spatial resolution, but poor accuracy (e.g., a small confidence region that is incorrectly located). The ideal case is when a model achieves high accuracy within a small area. We present the performance with curves describing the tradeoff between accuracy and spatial resolution. The accuracy is measured by estimating the proportion of the gaze targets in the test set that are inside the estimated confidence region. Since the estimations are in the horizontal and vertical angles, the spatial resolution is represented as a fraction of the sphere around the driver’s head.
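As an illustration of this metric, the sketch below computes the solid-angle fraction of the sphere covered by a confidence region defined on an angular grid; the grid parameterization is an assumption.

```python
import numpy as np

def sphere_fraction(cr_mask, theta_edges, phi_edges):
    """Fraction of the sphere around the driver's head covered by a confidence
    region. cr_mask is a boolean (H, W) grid; theta_edges (W+1,) and phi_edges
    (H+1,) are cell boundaries in radians (azimuth and elevation)."""
    dtheta = np.diff(theta_edges)                              # (W,)
    dphi = np.diff(phi_edges)                                  # (H,)
    phi_centers = 0.5 * (phi_edges[:-1] + phi_edges[1:])       # (H,)
    cell_solid_angle = np.outer(np.cos(phi_centers) * dphi, dtheta)   # dOmega = cos(phi) dphi dtheta
    return np.sum(cell_solid_angle[cr_mask]) / (4.0 * np.pi)
```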
Comparison of the head pose only model using the upsampled neural network (Fig. 7(a)) and the GPR model. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the eye only model using the upsampled neural network (Fig. 7(b)) and the regression models using eye appearance. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the proposed gaze model (Fig. 4) that uses head pose and eye appearance information and the upsampled neural networks using either head pose (Fig. 7(a)) or eye appearance (Fig. 7(b)). While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the performance of the proposed gaze model (Fig. 4) at different resolutions. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Comparison of the proposed gaze model (Fig. 4) when the eye information is missing (i.e., missing a modality). While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Analysis of the performance of the proposed approach for data collected from subjects wearing glasses and subjects without wearing glasses. While the models are trained with discrete and continuous gaze data, we separately report the performance of the models in the test set for both sets.
Analysis of the performance of the proposed approach trained with either the discrete or continuous gaze data. The performance drops for mismatched train and test conditions, indicating the need for using discrete and continuous gaze data to train the model.
The continuous data and the discrete marker data pose different sets of challenges for the estimation. The continuous gaze targets cover a smaller range in terms of angles, because the recordings are limited to the front of the vehicle where the road camera can see the marker. In the vertical direction, the range is further limited by the extent to which the researcher can move the marker. The discrete markers cover a wider range, as we have placed the target markers on the side windows, mirrors, and the gear of the car. However, the points are grouped only around the markers, providing a sparser coverage of possible gaze directions. For this reason, we separately report our test results for the continuous and the discrete gaze markers. The training, however, is done on a combination of both datasets, with the exception of the results in Section VI-E.
A. Performance of the Proposed Approach
Before we present the results of our proposed gaze model, we present the models implemented with only head pose or eye appearance data. We compare these models with alternative baselines.
We start our analysis with the performance of the head pose only model. For this purpose, we compare the HP upsample NN with the GPR model. Figures 8(a) and 8(b) show the performance for the continuous gaze targets and the discrete gaze markers. In the continuous gaze targets, we observe that the HP upsample NN shows an improvement in performance compared to the GPR model. Looking at the discrete gaze markers, while the performance of the two models is very close, the GPR model has a slightly better performance. The GPR model is a regression function that learns a representation of the gaze distribution, while the HP upsample NN is a classification function that learns the target distribution purely from data. Therefore, the HP upsample NN learns a rich representation in the continuous space, where the gaze distribution is denser but has a limited range. The HP upsample NN achieves a non-parametric map that does not make assumptions about the distribution of gaze. This architecture also helps us in obtaining a representation that can be fused with the eye patch information to obtain a single model. In contrast, the GPR model learns better when the data is limited and the learning depends on the extrapolation of the available data. The parametric nature of the GPR model is suitable for this case.
We also analyze the model using only eye appearance information. We compare the eye upsample NN with the regression model with eye image input. Both models have an identical encoder architecture. Figure 9 shows the result of the evaluation. We observe a clear performance gain in continuous gaze targets (Fig. 9(a)) when the area of the CR is under 2%. The regression model performs better for the test data associated with the discrete gaze markers (Fig. 9(b)). The eye upsample NN model shows poor performance for discrete gaze data because of the spatial sparsity of the data points, where the data is centered around a few markers.
After evaluating the models trained with either head pose or eye appearance, we consider our proposed approach that combines both types of information (fusion upsample NN). Figure 10 shows the performance, which demonstrates the clear advantages of combining head pose and eye appearance information. We observe clear improvements for continuous gaze data and discrete gaze data. The head pose can only provide limited information about the gaze. The information provided by the appearance of the eye is also needed, which is clearly observed in the figure. Likewise, head position also provides complementary information about the gaze. The models can better interpret gaze information inferred from the eye appearance by having the location and orientation of the head. Hence, combining both sets of information helps our proposed model provide a better estimation of the driver visual attention.
We compare all five models in Table 1. We obtain the accuracy of each model by fixing the size of the confidence region. We use three sizes corresponding to 1%, 2% and 4% of the sphere around the driver’s head. The table separately reports the results for the continuous and discrete data. For the discrete gaze data, the fusion model shows the best performance for all the confidence regions, with the exception of the 4% CR condition, where the HP upsample NN performs marginally better. For the continuous gaze data, our proposed approach shows high accuracy for small CRs.
B. Fusion Model Performance at Different Resolutions
This Section discusses the performance of our model at different resolutions. The output can be obtained at any desired resolution, since an output layer is connected after every upsampling stage. The resolution of the output at layer 1 is
C. Performance of Proposed Model Without Eye Information
The fusion model takes both head pose information and an image of the eyes. The eye patch detection might not work in cases when the illumination is not ideal or the head pose is extreme, which are common problems in naturalistic driving conditions [59]. In these cases, our model takes a blank image as input and only uses the head pose to provide information about the visual attention. In this situation, we expect the model to perform with a similar accuracy as a model trained with only head pose information. Figure 12 compares the performance of the proposed model when only head pose is available. For comparison, we include the performance of the model trained with only head pose (Fig. 7(a)), and the proposed approach evaluated with both inputs (head pose and eye appearance information). When we black out the eye input, the model shows performance similar to the model trained with only head pose. For the discrete gaze data, Figure 12(b) shows that the proposed approach with missing eye information can achieve even better performance than the model trained exclusively with head pose information. Figure 12(a) shows that there is a small difference between both conditions for the continuous gaze data.
D. Effect of Wearing Glasses on the Performance
Jha and Busso [59] discussed that the use of glasses is another important challenge in naturalistic conditions. The presence of glasses can affect the model, as our model depends on the appearance of the eyes. Five out of the ten subjects included in the test set wore glasses. This Section analyzes the differences in performance observed when the subjects wore or did not wear glasses. Figure 13 shows that the model works slightly better with subjects who did not wear glasses. This difference is minimal because our training data also contains a mix of subjects with and without glasses. Therefore, our models learn the representations well. Methods that compensate for challenges caused by glasses [67] can potentially improve the performance of our model.
E. Training with Matched and Mismatched Datasets
Our model is trained with data collected using two different data sets: the continuous gaze data and the discrete gaze data. For all the previous results presented in this paper, the training set includes data collected from both sets. This Section evaluates the performance of the models when training only with the continuous or discrete gaze data. Figure 14 shows the performance of these different models. We observe that the model trained using the discrete gaze data shows low performance on the continuous gaze data on the test set. Similarly, the model trained using the continuous gaze data shows low performance on the discrete gaze data on the test set. These results show that a model trained on a single type of data tends to overfit on similar data. For example, the model trained only on the discrete data tends to estimate only gaze targets around the numbered markers. Similarly, since the continuous data is limited to the frontal region of the windshield, the model fails when the test samples contain data outside the range of its distribution (e.g., looking at the side windows). The performance of the model trained with both datasets comes very close to the model trained and tested with matched datasets. This evaluation demonstrates the need to train our model with both continuous and discrete gaze data. This is not the case in many related studies, which use broad gaze zones or limited target markers [26], [52].
F. Projecting Probabilistic Maps onto the Road
We illustrate the benefit of our proposed model by projecting the distribution estimated by our model onto the road to correlate the visual attention with targets on the scene. The model estimates the visual attention in terms of angles with respect to the reference frame of the driver’s head. Since we aim to project this region onto the camera view, we make some approximations. First, we define multiple planes at different distances from the camera, ranging from 2 meters to 20 meters in one meter intervals. Figures 15(a), 15(b) and 15(c) show examples for projections at 2, 10, and 20 meters. For each of these planes, we calculate the angle subtended at the head for each pixel, which allows us to evaluate the gaze map given by the model at that pixel. The distributions obtained for each map at different distances are added and normalized to obtain the final map. Figure 15(d) shows the final probability distribution map after combining the results at different planes.
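The sketch below outlines this projection procedure; the function and parameter names (`gaze_pdf`, intrinsic matrix `K`, `head_pos`) are illustrative placeholders, and a single common coordinate frame between the head and the road camera is assumed for simplicity.

```python
import numpy as np

def project_attention_to_road(gaze_pdf, K, head_pos, img_shape, distances=range(2, 21)):
    """Accumulate the estimated gaze distribution over fronto-parallel planes at
    2-20 m and express it in the road-camera view. gaze_pdf(theta, phi) returns
    the probability for gaze angles in radians; K is a 3x3 pinhole intrinsic
    matrix and head_pos the driver's head position (placeholders)."""
    height, width = img_shape
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    acc = np.zeros((height, width))
    for d in distances:                                    # one plane every meter
        x = (u - K[0, 2]) / K[0, 0] * d                    # back-project pixels to the plane z = d
        y = (v - K[1, 2]) / K[1, 1] * d
        g = np.stack([x, y, np.full_like(x, float(d))], axis=-1) - head_pos
        theta = np.arctan2(g[..., 0], -g[..., 2])          # angle subtended at the head, Eq. (3)
        phi = np.arctan2(g[..., 1], np.hypot(g[..., 0], g[..., 2]))   # Eq. (4)
        acc += gaze_pdf(theta, phi)
    return acc / acc.sum()                                 # final combined, normalized map
```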
Illustration of the estimated confidence regions projected on the road. The figure shows the projections at 2, 10 and 20 meters, which are combined to create the final projection. The figure also shows example of 50%, 75% and 95% probability regions estimated from the estimated map. The figure is best viewed in color.
We obtain continuous estimations for data when the subject looks at landmarks on the road, as described in Section III-A3, for all the test subjects. The output of the model is a 2D probability distribution. Starting from the mean value, we can define a confidence region by increasing the area until reaching a target probability. For example, we can define a 50% probability map, where it is equally likely that the gaze is inside or outside this region. Using this approach, we define 50%, 75% and 95% gaze regions for this analysis. Figure 15(e) shows an example of these regions. The 95% gaze region is naturally larger than the 50% gaze region. The estimation region is evaluated against the ground truth landmark that the subject was asked to look at during the recordings. We consider success if there is an overlap between the ground truth target and the visual attention map. We evaluate a total of 272 examples from the 10 subjects in the test set.
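The sketch below shows how such a region can be extracted from the 2D probability map; for simplicity, it accumulates the most probable cells first (a highest-density approximation of the growing procedure described above).

```python
import numpy as np

def confidence_region(prob_map, coverage=0.75):
    """Boolean mask of the cells whose cumulative probability reaches the
    requested coverage (e.g., 0.50, 0.75, 0.95)."""
    p = prob_map / prob_map.sum()
    order = np.argsort(p, axis=None)[::-1]                 # cells sorted by probability
    cumulative = np.cumsum(p.ravel()[order])
    keep = order[:np.searchsorted(cumulative, coverage) + 1]
    mask = np.zeros(p.size, dtype=bool)
    mask[keep] = True
    return mask.reshape(p.shape)

# Success in Section VI-F means the region overlaps the ground truth landmark:
# hit = confidence_region(road_map, 0.75)[target_row, target_col]
```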
Table 2 shows the accuracy at different regions. We observe that the model could correctly estimate the gaze within the 50% region for most examples.
Examples of road projection of the visual attention estimation where the ground truth overlaps with the model estimation. The red circles indicate the target gaze direction. The figure is best viewed in color.
Examples of road projection of the visual attention estimation where the ground truth does not overlap with the model estimation. The red circles indicate the target gaze direction. The figure is best viewed in color.
Conclusion
This paper proposed a novel architecture to estimate the driver visual attention using probabilistic maps from head pose and eye appearance information. The approach relies on the use of upsampling based blocks implemented with CNNs. We obtained a driver independent representation of the gaze that can be directly obtained from the eye appearance and the head pose of the driver without any calibration. The maps are non-parametric and, hence, purely learned from the data. The approach provides an efficient way to estimate visual attention, and helps in designing an efficient fusion mechanism. The variance of the maps is non-parametric and depends on the gaze direction that is learned directly from the data. Therefore, the size of the confidence regions will vary according to the underlying uncertainty. Our evaluation results showed that models implemented with either head pose or eye appearance information outperformed their respective regression baselines in estimating continuous gaze targets. The fusion model was found better than the models based on single modalities in both continuous and discrete gaze data, showing the complementary information provided by eye images and head pose values. One crucial advantage of the proposed model is that it can provide a coarse estimation of the visual attention when only the head pose is available, adding finer details when the eye appearance information is added. This feature is important when accurate eye patches cannot be reliably obtained due to challenging naturalistic driving conditions. Therefore, this architecture can have many potential applications in active safety systems.
Every vision-based solution for in-vehicle applications needs to be robust against car movements. Vibrations can lead to imaging problems that may interfere with the instant appearance of the face for a given frame (e.g., blurred images, and head movements that are not associated with visual attention). While our approach is sensitive to these issues, we expect that the results reported in this study are representative given that the MDM corpus was collected in a real car capturing common driving conditions. We predict the gaze direction relative to the driver’s head location, instead of predicting the absolute gaze location. Therefore, our model is reasonably robust to upper body movement. The robustness shown against glasses provides confidence that our proposed model can deal with some of these challenges.
In this paper, we defined visual attention as a heat map representing the distribution of the driver’s gaze. This definition does not capture the broad concept of visual attention (e.g., driving context, saccade movements, time spent looking at a place or object). First, the approach focuses mostly on eye fixations. Saccadic motions are more challenging to track because of the inherent randomness associated with them. Additionally, since the attention is not focused on a specific place, it is challenging to describe the ground truth gaze. Second, the definition does not consider cases of inattention due to cognitive distractions (e.g., seeing but not noticing the driving environment). This contextual information can be added to the attention map provided by the proposed model to obtain a more comprehensive safety system.
There are areas of potential improvement in the algorithm. We observed sub-optimal results for a few subjects. Factors that affect the performance include errors in the calibration between cameras, appearance variations with subjects wearing highly reflective glasses, and differences in gaze behaviors across subjects. Various methods can be used to improve our algorithm. For example, we can implement adaptation methods to personalize the system, addressing the variability problem across subjects. Semi-supervised methods can also be used that leverage natural data where the ground truth gaze is not available. Another limitation of our work is that it does not consider temporal information. The discrete gaze marker section of the MDM corpus only considers gaze information when the drivers were looking at the target markers and objects. Since the data only has gaze information for those moments, we do not have continuous gaze information describing the trajectories that led a driver to focus on a particular position. Currently, this option can only be explored with the continuous gaze data, since the MDM corpus provides gaze labels for only some frames in the discrete gaze data. Lastly, we rely on the Fi-Cap data to obtain the head pose for our method. This limitation can be easily addressed by using automatic head pose estimation algorithms. While current RGB-based algorithms do not reliably provide the head pose in all six degrees of freedom, the MDM database also has point cloud images collected with a time-of-flight (TOF) camera. This modality was used by Hu et al. [62] to estimate the orientation of the driver’s head. The method can be easily enhanced to also estimate the head position of the driver, providing the information that our model requires.