
Improving Gaze Tracking in Large Screens With Symmetric Gaze Angle Amplification and Optimization Technique



Abstract:

Many gaze tracking applications focus on use in personal devices such as mobile phones and PCs. However, gaze tracking on large screens poses challenges because, as the screen size increases, gaze tracking accuracy in the edge region decreases owing to the restricted range of human eye movement. In addition, as large screens are often exposed to the public, anyone can use the gaze tracking module. This makes it difficult to apply personalized calibration as in personal devices. To acquire an accurate gaze in the edge region, we propose a novel approach, the symmetric angle amplifying function, which amplifies the gaze angle when a user is looking at the edge area of the large screen. Our function is designed particularly for the case where the screen is divided into grid-shaped regions. Furthermore, for a better user experience, we optimize the neural networks using a network-optimization framework and also propose a center gravity function that pulls the gaze coordinates presented on the screen to the predefined center of the region to compensate for the person-wise difference in the movement of human eyes. Experimental results revealed the superiority of the proposed methods over the baseline and different types of fitting functions. The gaze tracking module serves as a part of an aggregated system and is implemented for use in autonomous vehicles.
Published in: IEEE Access (Volume 11)
Page(s): 85799 - 85811
Date of Publication: 02 June 2023
Electronic ISSN: 2169-3536


SECTION I.

Introduction

Gaze tracking is the task of measuring where a person is looking, referred to as the point of gaze. It is one of the most relevant tasks in human–computer interaction applications, as much of the interaction with a device at a remote distance can be realized using gaze. Gaze tracking has been applied to games [1], visual attention analysis [2], and virtual reality simulators [3].

Previously, achieving accurate gaze tracking required a cumbersome calibration procedure using expensive hardware. However, as almost every device now has a built-in camera, it has become easy for consumers to experience gaze tracking applications in their everyday lives owing to the development of gaze estimation methods. Gaze estimation is the task underlying gaze tracking: it detects and obtains gaze direction vectors from an input image. Gaze estimation methods can be categorized into two types: model-based and appearance-based methods. A model-based method creates a geometric eye model and determines gaze from the features of the constructed geometric model. However, model-based methods suffer from low gaze estimation accuracy owing to their dependency on accurate eye feature detection. To detect accurate eye features, high-resolution images obtained from sensors such as near-infrared (NIR) cameras are required. In contrast, an appearance-based method depends only on the appearance, taking images of the eyes as input and learning the mapping between the appearance and the gaze. Therefore, it can handle low-resolution images, which makes it robust to the distance between the camera and the user.

With the recent advent of deep learning architectures for computer vision, gaze estimation has adopted deep learning-based approaches, as have other computer vision tasks. In this context, appearance-based methods have become prevalent in the domain of gaze estimation. Deep learning-based methods benefit from large amounts of real and synthetic training data, facilitating unconstrained and person-independent gaze estimation. This means that gaze can be estimated in everyday environments without any assumptions regarding personal facial features or properties of the environment such as illumination conditions and camera angles.

Several studies and commercial products that utilize gaze tracking have focused on personal devices such as mobile phones or PCs [4], [5], [6], which typically have small screens designed for individual use. Thus, most gaze tracking applications on personal devices require calibration to achieve accurate gaze tracking performance. In this work, by contrast, we implement a gaze tracking system for large screens in a public environment. Furthermore, the gaze tracking system implemented in this work is not independent, from two perspectives. First, to achieve an accurate point of regard, other measures such as face landmarks and the global coordinates of the user are required. Second, the gaze tracking system is part of an integrated system with various functionalities such as pose estimation, voice assistance, and hand gesture recognition.

The characteristics of the proposed gaze estimation system give rise to two main challenges in its implementation. First, in contrast to devices with small screens, where the entire screen can be covered with relatively small eye movement, the eyes must be moved as far as possible to stare at the edge of a large screen. The appearance of the eyes captured by the camera is similar when staring at a point in the outermost edge region of the large screen and at a nearby point slightly further inward. This makes it difficult for an appearance-based model to distinguish between the two points. To resolve this issue, the small difference in eye movement when a user is gazing at the edge of the screen needs to be amplified. Our observation indicates that the absolute value of the angle, which is the output of the gaze estimation network, increases as the gaze moves from the center to the edge of the screen. Therefore, we propose a novel function called the symmetric angle amplifying function (SAAF), which amplifies the angle as its value increases, enabling accurate gaze tracking in the edge region.

The second problem we address is the necessity of a short inference time and a lightweight deep learning network. As discussed earlier, in many cases, the gaze tracking module is not a standalone component; it is often used with other modules and functions. For example, in our case, the gaze tracking module is used with multimodality modules, which we explain in the implementation section. In addition, gaze tracking is often implemented and embedded in edge devices rather than high-performance computing devices. Such considerations impose constraints of light weight and real-time performance on the designed system. The nature of gaze tracking itself coincides with these constraints because delayed inference would be uninformative for the gaze tracking task. Hence, we address this issue with a simple ResNet-based [7] gaze estimation model and optimize the network with a network-optimization framework. The contributions of our paper can be summarized as follows:

  1. We propose a novel function, SAAF, for accurate gaze tracking in the edge region of a large screen. Using our SAAF, the limitation of the appearance-based gaze estimation method, which relies solely on input from a monocular RGB camera, can be overcome.

  2. We present a framework suitable for a public environment. In this work, we aim to utilize the proposed framework in a public environment, aggregated with other application modules such as pose estimation, voice assistance, and hand gesture recognition. In addition, owing to the potential use under multi-person conditions, we design a person-independent system that does not require personalized calibration.

  3. We analyze and verify the proposed method and framework with extensive experiments. Additionally, we compare the proposed method with the baseline model and other functions such as the polynomial function, piecewise function, and Bezier function. The experiments revealed the effectiveness of the proposed method.

  4. We show the implementation of the framework for use in the real world. Finally, we implemented our module on two different vehicles, which demonstrates the practical use of the aggregated system with various user interface contents.

SECTION II.

Related Work

A. System Type and Gaze Tracking Accuracy

Generally, gaze tracking systems are categorized into head-mounted systems and remote systems. A head-mounted tracking system is worn by the user and therefore moves with the user's head. The pupil and the glints on the corneal surface can be obtained from the high-resolution near-eye camera of the system, thus yielding an accurate gaze. For example, See Glasses uses one camera and 8 infrared sources for gaze tracking, achieving an accuracy of less than 0.5° based on corneal reflection. Pupil Invisible is the first deep learning-based gaze tracking eyeglasses. Besides a scene camera, it is equipped with two IR near-eye cameras, one for each side, and also utilizes IR LEDs to illuminate the respective eye regions. On the other hand, remote systems can generally be operated at a certain distance, and several studies have examined them. Li et al. [8] proposed a gaze estimation algorithm for long-distance cameras based on deep learning using convolutional neural networks (CNNs). Among commercial products, the Tobii Pro Fusion operates at distances of 50–80 cm; under optimal conditions, using pupil corneal reflection, the accuracy of the device is 0.3°. In addition, the number of cameras installed in the Smart Eye Pro can be flexibly adjusted for different situations to determine the operating distance and tracking range, reaching an accuracy of 0.5°.

In summary, the proposed method was implemented as a remote system, since a head-mounted system requires additional measuring equipment, which is expensive and not suitable for a public environment.

B. Deep Network Algorithms

Deep learning-based gaze tracking methods use multi-layer neural networks to learn a model that maps between appearance and gaze. In general, the input image is expected to be the image of the entire face or eye.

Most recent solutions have adopted CNN-based architectures [9], [10], [11], [12], [13], [14], which aim to learn end-to-end spatial representations. Some studies [9], [10] have proposed datasets and corresponding architectures. Generally, most gaze estimation networks use modified versions of popular CNN architectures from computer vision downstream tasks (e.g., AlexNet [15], VGG [16], ResNet [7], and Stacked Hourglasses [17]). Usually, the differences between the networks come from the input, intermediate features, or output. Regarding the input, networks are divided into those using a single RGB image stream (e.g., the face or a left or right eye patch) [9], [13], multiple RGB image streams (e.g., face and eye patches) [10], [18], or prior knowledge [11] based on eye anatomy or geometric constraints. From the perspective of intermediate features, GazeNet [9] concatenates the head pose to the output of the CNN encoder, and Pictorial Gaze [11] first regresses the gaze map and uses it as an intermediate feature to obtain the final gaze direction output. Finally, the output differs depending on the dataset. Most networks output a gaze direction vector, i.e., a 2D vector representing the yaw and pitch of the gaze. In [10], horizontal and vertical distances from the camera were directly regressed in centimeters.

Usually, the gaze moves continuously and dynamically when a person looks around the surrounding environment. In this context, a continuous sequence is generated, and a specific time image frame has a high correlation with a previous time image frame. On the basis of this logic, several studies [19], [20], [21], [22], [23], [24] have utilized time information and ocular kinematics to improve gaze estimation performance over single image-based methods. Given a series of frames, the goal of the task is to estimate the gaze direction to match the ground-truth direction as much as possible. To model this task, a popular recurrent neural network (RNN) structure has been explored.

In our system, we adopted a CNN-based model instead of an RNN-based model for the following reasons. First, CNN-based models are easy to implement because they have already been implemented in many frameworks, including optimization frameworks such as TensorRT. Second, CNN-based models generally have a faster inference speed than RNN-based models. Third, CNN layers are more suitable for acquiring spatial features of images.

C. Gaze Analysis Datasets

With the rapid progress of gaze estimation, several datasets on gaze estimation have been suggested. Datasets can be divided according to the environment in which they were collected: constrained indoor [25]; unconstrained indoor [12], [13], [26], [27]; and outdoor settings [22]. Compared with early environments [25], [28], recently released datasets [22], [29] are less biased and have improved complexity with a large scale, which makes them suitable for training and evaluation. In this section, we introduce some important datasets.

MPIIGaze [12] is the first in-the-wild dataset, comprising 213,659 images collected from 15 subjects during natural daily events. This dataset was generated by showing random points to subjects. It provides not only binocular images but also eye landmarks, 2D and 3D gaze, 3D head pose, and annotations of the 3D eye center. Zhang et al. [13] suggested MPIIFaceGaze, derived from the observation that considering the entire face makes gaze estimation more accurate, and appended additional landmark annotations of faces. However, it has a limitation in that most of the head poses covered by MPIIFaceGaze are front views and the camera–subject distance is small, which makes it inappropriate for remote gaze estimation. Gaze360 [22] is a large-scale dataset collected from 238 subjects with a wide range of head poses and gaze directions. It was collected in unconstrained environments, both indoor and outdoor, covering the entire horizontal range of 360°. ETH-XGaze [30] is another large-scale, high-resolution dataset collected in a constrained indoor environment. It was collected from 110 subjects using 18 DSLR cameras and adjustable illumination.

In this work, we conducted various experiments using ETH-XGaze, which includes more diverse head poses and a wider gaze range than other datasets. Owing to the robustness of the model trained on this dataset, we could acquire various gaze direction vectors for multiple subjects.

D. Function Fitting Methods

Function fitting, or curve fitting, which fits a function (or curve) to a set of data points without an exact equation and allows interpolation and extrapolation, has been used in many previous studies. In particular, when there is a desired shape for the fitted curve, curve fitting methods often become the solution. One of the most representative methods is Bezier curve fitting. When data points are given without an equation, Bezier curve fitting finds the fitted curve and allows interpolation and extrapolation by specifying the degree and optimizing the control points of the curve. Recent works on trajectory planning [31], [32] adopted the Bezier curve fitting method to generate and plan trajectory and velocity, and Ueda et al. [33] used Bezier curve segments to approximate the boundaries of a point cloud. Another method, based on the basic polynomial method, divides the data into different intervals and fits a polynomial function to each interval, or segment; the key idea is how to segment the intervals, and [34] obtained the segmentation through optimization based on l_{0}-penalized least-squares regression. In this paper, we compare our proposed fitting function (SAAF) to the two aforementioned methods to prove its effectiveness.

SECTION III.

Proposed Method

In this section, we explain the proposed method in a top–down manner, starting from the integrated system outline and gaze estimation, and then presenting our SAAF, which makes accurate gaze tracking possible under the given conditions. In addition, for a smooth interaction between a user and the edge device, we apply an inference optimizer to the deep neural networks included in our system. This guarantees the inference speed to some extent, enabling undisturbed interaction between the gaze module and the user. To enhance the interaction experience, we propose the center gravity function, which pulls gaze coordinates to the center of the predefined regions.

A. Integrated System

Our integrated gaze tracking system comprises three modules: a pose estimation module, an object detection module, and a gaze estimation module, which have been optimized for performance as shown in Fig. 1. For the object detection module, we adopted the popular model, YOLOv3 [35], to track multiple users, allowing us to give priority to particular users. To ensure precise pose estimation, we employed HRNet [36], the widely used top–down pose estimation network. Then, the outputs of the pose estimation network were used directly for gaze estimation, generating a 3D face model of the subject. In particular, we used two keypoints corresponding to the ears of a person to find the bounding box containing the face of the user. For the gaze estimation itself, we employed ResNet [7], a simple deep learning network.
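As a rough illustration of how these modules are chained per frame, the following sketch passes each detected person through pose estimation and gaze estimation; the detect_people, estimate_keypoints, and estimate_gaze wrappers are hypothetical stand-ins for the optimized YOLOv3, HRNet, and ResNet models, and the ear-based face crop is a simplified version of the procedure described above.

```python
# Hypothetical sketch of the per-frame flow: detection -> pose -> gaze.
# detect_people, estimate_keypoints, and estimate_gaze stand in for the
# TensorRT-optimized YOLOv3, HRNet, and ResNet models described above.
import numpy as np

def process_frame(frame, detect_people, estimate_keypoints, estimate_gaze):
    """Return one 3D gaze vector per detected person in the frame."""
    gazes = []
    for person_box in detect_people(frame):                 # YOLOv3: person bounding boxes
        keypoints = estimate_keypoints(frame, person_box)   # HRNet: 2D body keypoints
        left_ear = np.asarray(keypoints["left_ear"])
        right_ear = np.asarray(keypoints["right_ear"])
        # Build a square face crop around the two ear keypoints (simplified heuristic).
        center = (left_ear + right_ear) / 2.0
        half = np.linalg.norm(left_ear - right_ear)
        x0, y0 = (center - half).astype(int)
        x1, y1 = (center + half).astype(int)
        face_patch = frame[max(y0, 0):y1, max(x0, 0):x1]
        gazes.append(estimate_gaze(face_patch))             # ResNet: 3D gaze vector
    return gazes
```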

FIGURE 1. Overall framework of the proposed integrated gaze tracking system. For gaze estimation, we propose the symmetric angle amplifying function (SAAF) through the data-acquisition step. For pose estimation and object detection, we employed widely used off-the-shelf models. All neural networks included in the integrated system were optimized through the TensorRT framework. The processed gaze vector output is converted to the screen coordinate and projected to the screen.

B. Gaze Estimation

In practice, the pre-processing and post-processing modules have a considerable impact on gaze estimation performance, which leads to a substantial difference in user experience. Pre-processing in gaze estimation refers to face normalization [37], which is the process of making the appearance of the human face pose-independent. To define the rotation of the head, we need to model a 3D head pose coordinate system.

The coordinate system is depicted in Fig. 2. It is defined using three parts of the human face: the two eyes and the mouth. The x-axis is aligned with the line connecting the midpoints of the eyes, which is depicted by a blue circle on the left side of Fig. 2. The y-axis lies in the face plane, perpendicular to the x-axis, pointing from the eyes toward the mouth. The z-axis is perpendicular to the face plane (blue triangle in Fig. 2), pointing backward from the face. After defining the model, the face can be normalized according to two coordinate systems: the head pose coordinate system and the camera coordinate system. First, we find the rotation matrix \boldsymbol{R} that rotates the camera to look at the eye reference point e_{r} along its z-axis. The scaling matrix S_{z} then locates the camera at a distance d_{n} from the eye.
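A minimal sketch of this normalization step, assuming the eye reference point e_r is given in camera coordinates and head_R is the estimated head rotation matrix of the 3D face model, could look as follows; the exact conventions of [37] may differ slightly.

```python
# Minimal sketch of the normalization matrices for Fig. 2, assuming e_r is the
# eye reference point in camera coordinates and head_R is the head rotation
# matrix of the 3D face model (its first column being the head x-axis).
import numpy as np

def normalization_matrices(e_r, head_R, d_n):
    """Return rotation R and scaling S_z placing a virtual camera that looks at e_r."""
    z = e_r / np.linalg.norm(e_r)                  # new z-axis: camera toward the eye
    y = np.cross(z, head_R[:, 0])                  # new y-axis: orthogonal to head x-axis
    y /= np.linalg.norm(y)
    x = np.cross(y, z)                             # new x-axis completes the right-handed frame
    R = np.stack([x, y, z])                        # rows are the normalized camera axes
    S_z = np.diag([1.0, 1.0, d_n / np.linalg.norm(e_r)])  # place camera at distance d_n
    return R, S_z
```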

FIGURE 2. 3D face model and normalization procedure. (Left) Definition of the human 3D face model and finding the rotation matrix (R) for normalization (in green). (Right) Scaling matrix (S_{z}) in the direction of the z-axis and the final normalized camera coordinate system.

Through pre-processing, a head pose-independent image is obtained. After normalization, the gaze estimation network provides the gaze output in 3D vector format. The overall procedure is depicted in Fig. 3. When the output gaze vector is used without post-processing, gaze tracking can become unstable because human eyes move rapidly and restlessly. To mitigate this instability, we collected gaze vectors over N frames and used their mean value for gaze tracking.

FIGURE 3. Overall pipeline of gaze estimation. Given an input image, we normalize the image using the 3D face and camera model. The gaze network provides the output gaze vector depicted in red.

C. Data Acquisition

Before designing a fitting function, it is necessary to obtain the screen coordinate vector and the gaze data (i.e., pitch and yaw) of each subject. First, using an RGB camera, the screen coordinate vector can be directly acquired. The screen coordinate system is depicted in Fig. 4. The head position vector of the k-th subject, (x_{k}^{sc}, y_{k}^{sc}, z_{k}^{sc}), is the coordinate of the middle of the forehead, i.e., between the two pupils, relative to the upper left corner of the screen. To obtain the gaze data of each subject, various coordinates on the screen must be obtained. Therefore, as shown in Fig. 5, we evenly distributed 10 targets on the screen and measured the x- and y-axis coordinates of the center points of these targets, (x_{1}^{tar}, y_{1}^{tar}), (x_{2}^{tar}, y_{2}^{tar}), \ldots, (x_{10}^{tar}, y_{10}^{tar}), which are marked with red letters. Then, using the screen coordinate vectors (x_{k}^{sc}, y_{k}^{sc}, z_{k}^{sc}) obtained above, we can obtain the ground truths for the pitch and yaw of the 10 targets for each subject. The equations for calculating the ground truths are as follows:
\begin{align*} \alpha_{k}^{GT} &= \arctan\left(\frac{y_{i}^{tar}-y_{k}^{sc}}{z_{k}^{sc}}\right)+\rho, \tag{1}\\ \gamma_{k}^{GT} &= \arctan\left(\frac{x_{i}^{tar}-x_{k}^{sc}}{z_{k}^{sc}}\right), \tag{2}\end{align*}
where arctan returns a value in the range of -\pi/2 to \pi/2, i=1,\ldots,10 is the index of each target, and \rho is the angle between the camera and the screen in radians. The geometry-based interpretations used in (1) and (2) are from the work of Gudi et al. [5], and we briefly describe the core concepts in Fig. 6, which depicts a case where a subject is looking at a target on the screen and the pitch and yaw are calculated using simple trigonometry.
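For concreteness, the ground-truth computation in (1)–(2) amounts to the following short routine; the variable names are ours, and the target and head coordinates are assumed to be expressed in the screen coordinate system of Fig. 4.

```python
# Sketch of Eqs. (1)-(2): ground-truth pitch and yaw of subject k looking at target i.
# target_xy is the target center on the screen, head_sc the subject's head position
# (x_sc, y_sc, z_sc) in the screen coordinate system, rho the camera-screen angle.
import math

def ground_truth_angles(target_xy, head_sc, rho):
    x_tar, y_tar = target_xy
    x_sc, y_sc, z_sc = head_sc
    pitch_gt = math.atan((y_tar - y_sc) / z_sc) + rho   # Eq. (1): alpha_k^GT
    yaw_gt = math.atan((x_tar - x_sc) / z_sc)           # Eq. (2): gamma_k^GT
    return pitch_gt, yaw_gt
```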

FIGURE 4. Definition of the screen coordinate system. We define the top-left corner of the screen as the origin of the screen coordinates.

FIGURE 5. Ten targets on the screen with the coordinates. Ten targets were devised to cover the entire screen.

FIGURE 6. Geometric modeling structure of gaze in our work. The derivation of yaw can be easily represented in the top view, and for the geometric meaning of the pitch, we depict the side view.

Meanwhile, the predictions for pitch and yaw can be obtained in the process of making the subjects look at the targets. In particular, standing a certain distance away from the screen, subjects stare at the 10 targets on the screen until a certain number of frames has been stacked. The face in each frame is recognized through an off-the-shelf pose-estimation network, and the gaze vector of the k-th subject, (\alpha_{k}^{pred}, \gamma_{k}^{pred}), is obtained using ResNet [7] pre-trained on ETH-XGaze [30]. Then, after obtaining the average of all gaze vectors over 100 frames, which is defined as g = (x_{k}^{gaze}, y_{k}^{gaze}, z_{k}^{gaze})^{T}, we can calculate the following:
\begin{align*} \alpha_{k}^{pred} &= \arctan2(y_{k}^{gaze}, -z_{k}^{gaze}), \tag{3}\\ \gamma_{k}^{pred} &= \arctan2(-x_{k}^{gaze}, -z_{k}^{gaze}), \tag{4}\end{align*}
where arctan2 receives the relative coordinates between two points and returns a value in the range of -\pi to \pi. The geometric derivations of (3) and (4) are shown in Fig. 8. We assume a right-handed camera coordinate system; therefore, the sign of each element of the gaze vector depends on the orientation of the coordinate system.
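The prediction side, (3)–(4) together with the frame averaging described above, can be sketched as follows; the right-handed camera coordinate convention is taken from the text, and the function name is ours.

```python
# Sketch of Eqs. (3)-(4) with the frame averaging described above: the predicted pitch
# and yaw are computed from the mean 3D gaze vector of the stacked frames, assuming a
# right-handed camera coordinate system.
import numpy as np

def predicted_angles(gaze_vectors):
    gx, gy, gz = np.mean(np.asarray(gaze_vectors, dtype=float), axis=0)
    pitch_pred = np.arctan2(gy, -gz)   # Eq. (3): alpha_k^pred
    yaw_pred = np.arctan2(-gx, -gz)    # Eq. (4): gamma_k^pred
    return pitch_pred, yaw_pred
```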

FIGURE 7. Proposed SAAF. The data points indicated by blue circles represent the yaw angle of the subjects in radians. The two red arrows show the effect of SAAF on the large angles (edge area of the screen).

FIGURE 8. Geometric derivation of pitch and yaw. Note that we use a right-handed coordinate system for the camera coordinate system.

This process was performed for all subjects. Accordingly, through the abovementioned steps, we acquired the screen coordinate vector and the ground truth and prediction of the gaze vector corresponding to each target for every subject.

D. Symmetric Angle Amplifying Function

Since unspecified people use the system in real-world situations, it is necessary to devise a function for gaze calibration using data obtained from a small number of people. This calibration function should make the gaze of any user hit the target. Hence, we designed the calibration function SAAF on the basis of the previously collected data, namely the screen coordinates and the ground truth and prediction of pitch and yaw. Fig. 7 shows the data points (in blue) from 10 people using the targets defined in the previous section. The data points can be grouped into five clusters, which is reasonable because the targets divide the entire screen into five parts in the horizontal direction. From the data, we observe that as the absolute value of the yaw angle increases, the points become more scattered. This result corresponds with our intuition that as the movement of the eyeball increases, the effect of person-dependent eye movement also increases. In addition, we found that when a subject is staring at the edge part of the screen, the appearance perceived by the camera is almost the same as that for the neighboring region. To circumvent these issues, we propose a novel fitting function, SAAF, as shown in Fig. 7. The function has the greatest influence at the extremes, where the red arrows are located. Because the screen is bounded, we assumed that the farthest regions can be reached by moving the gaze coordinates as far as possible. Therefore, this function amplifies the yaw angle as it increases. By using the proposed fitting function, users are able to move their gaze coordinates to the leftmost and rightmost regions of the screen. For the pitch, we used the same approach to find the function that best fits the pitch data. The proposed function F can be formulated as follows:
\begin{equation*} (x,y)=F(\boldsymbol{c}, \boldsymbol{g}, \rho, \boldsymbol{R_{cam}}, \boldsymbol{T}), \tag{5}\end{equation*}
where \boldsymbol{c} indicates the center coordinate of a recognized user's face in camera coordinates, \boldsymbol{g} indicates the 3D gaze vector obtained from the pretrained gaze estimation model, \rho indicates the vertical pitch angle between the camera and the screen in radians, and \boldsymbol{R_{cam}} and \boldsymbol{T} indicate the rotation matrix and the translation vector of the camera, respectively. After obtaining v=(\alpha, \gamma) from \boldsymbol{g}, we can fit v to v_{fit}=(\alpha_{fit}, \gamma_{fit}) as follows:
\begin{align*} \alpha_{fit} &= C \times \alpha^{2}, \tag{6}\\ \gamma_{fit} &= \arcsin(D \times (\gamma - E)), \tag{7}\end{align*}
where C, D, and E are constants that can be modified to fit the collected data. In addition, to calculate sc^{pred}=(\hat{x}_{k}^{sc}, \hat{y}_{k}^{sc}, \hat{z}_{k}^{sc}), the 3D screen coordinate of the user, the following equation is required:
\begin{equation*} sc^{pred}=\boldsymbol{R_{cam}} \boldsymbol{c}^{T} + \boldsymbol{T}. \tag{8}\end{equation*}

Therefore, with v_{fit} and the predicted screen coordinates sc^{pred}, we can obtain the gaze coordinate (x,y) as follows:
\begin{align*} x &= \tan(\gamma_{fit}) \times \hat{z}_{k}^{sc} + \hat{x}_{k}^{sc}, \tag{9}\\ y &= \tan(\alpha_{fit} - \rho) \times \hat{z}_{k}^{sc} + \hat{y}_{k}^{sc}. \tag{10}\end{align*}
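Putting (6)–(10) together, the mapping from a raw pitch/yaw pair to a screen coordinate can be sketched as below; the constants C, D, and E are placeholders rather than the values fitted in the paper.

```python
# Sketch of the SAAF fitting (Eqs. (6)-(7)) followed by the projection onto the screen
# (Eqs. (8)-(10)). C, D, and E are placeholders for the constants fitted to the data;
# c is the face center in camera coordinates, R_cam and T the camera extrinsics.
import numpy as np

def saaf_to_screen(alpha, gamma, c, rho, R_cam, T, C=1.0, D=1.0, E=0.0):
    alpha_fit = C * alpha ** 2                 # Eq. (6): amplified pitch
    gamma_fit = np.arcsin(D * (gamma - E))     # Eq. (7): amplified yaw toward the edges
    x_sc, y_sc, z_sc = R_cam @ c + T           # Eq. (8): user's 3D screen coordinate
    x = np.tan(gamma_fit) * z_sc + x_sc        # Eq. (9)
    y = np.tan(alpha_fit - rho) * z_sc + y_sc  # Eq. (10)
    return x, y
```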

E. User Experience Optimization

This section discusses two different types of optimization. The first is deep learning optimization, which is performed using a well-known optimization framework. The designed system contains three deep learning networks: object detection, pose estimation, and gaze estimation. As mentioned at the beginning of this section, an immediate reaction is critical for the user experience, particularly for gaze tracking. Since the deep learning networks are chained in series, all of them should run as fast as possible. To achieve this goal, we used TensorRT to make the networks lighter and faster.
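As an illustration only, one common route to a TensorRT engine is to export the PyTorch model to ONNX and build the engine with the trtexec tool; the paper does not specify the exact conversion path or precision settings, so the model, input size, and flags below are assumptions.

```python
# Illustration only: export a placeholder ResNet gaze regressor to ONNX and build a
# TensorRT engine with trtexec. The model, input resolution, and precision flags are
# assumptions; the paper does not specify the exact conversion settings.
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=2).eval()  # placeholder (pitch, yaw) regressor
dummy = torch.randn(1, 3, 224, 224)                         # assumed input resolution
torch.onnx.export(model, dummy, "gaze_resnet.onnx",
                  input_names=["face"], output_names=["gaze"], opset_version=13)
# Then, on the Jetson device:
#   trtexec --onnx=gaze_resnet.onnx --saveEngine=gaze_resnet.trt --fp16
```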

The second optimization that we propose is the center gravity function. In many cases, the screen is divided into a fixed number of regions in the form of a grid. Since the appearance and mechanics of human eyes differ across individuals, the gaze vector inferred from the gaze estimation model can also differ from person to person; this is the motivation for personal calibration. However, as our system is intended for public use, time-consuming personalized calibration is impractical. Omitting it can result in misguided feedback to users and a feeling that the gaze is off the target. To solve this problem, we propose the center gravity function, which pulls the gaze coordinates projected on the screen toward the predefined center of the grid cell when the user fixes their gaze for some amount of time. The procedure is described in Algorithm 1: lines 3–4 move the gaze coordinate (x, y) toward the center of the current region as the time the subject has spent looking at that region accumulates, which is implemented as an increase in the length of a stack. The denominators in the equations can be interpreted as an annealing term that smooths the coordinates on the screen. The effect of this algorithm is described in the experiment section.

Algorithm 1 Center Gravity Function

1: function CENTERGRAVITYFUNCTION(x, y, cx, cy, l)
2:   \triangleright x, y: current gaze, cx, cy: center of the current region, l: length of the stack
3:   x \leftarrow x + \frac{cx - x}{10\exp\left(\frac{-l}{10}\right) + 1}
4:   y \leftarrow y + \frac{cy - y}{10\exp\left(\frac{-l}{10}\right) + 1}
5:   return x, y
6: end function
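A direct Python transcription of Algorithm 1 could look as follows; here l is assumed to be the number of stacked frames for which the gaze has stayed inside the current region.

```python
# Direct transcription of Algorithm 1; l is assumed to be the number of stacked frames
# for which the gaze has stayed inside the current region.
import math

def center_gravity(x, y, cx, cy, l):
    """Pull the gaze point (x, y) toward the region center (cx, cy) as dwell time grows."""
    weight = 10.0 * math.exp(-l / 10.0) + 1.0   # annealing denominator from Algorithm 1
    x += (cx - x) / weight
    y += (cy - y) / weight
    return x, y
```

As l grows, the denominator approaches 1 and the displayed point is pulled almost entirely to the region center, whereas for small l the pull is weak and the raw gaze dominates.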

SECTION IV.

Experiments

This section validates the methods presented in the previous section. First, we describe the experimental settings. Second, in the following subsections, we present two quantitative results using the F1-score and click accuracy, which show the effect of the proposed function. Finally, we analyze the qualitative results to show the effectiveness of SAAF and the center gravity function.

A. Experimental Settings

The experimental setup is presented in Fig. 9. A large, 55-inch screen is the main difference from other studies and applications, which have considered small screens on mobile devices or desktop monitors. We captured subjects with the RGB sensor of a RealSense D435 camera [38], which was located at the center of the screen. The camera was connected to an NVIDIA Jetson AGX Xavier Developer Kit [39], an edge-computing device. To simulate various illumination conditions, we placed one light source at the top of the screen, and no additional control of the ambient light was implemented. For a precise comparison between the models, we changed the composition of the subjects from that of the data-acquisition phase while maintaining the gender ratio. We employed 10 people for the experiments: seven males and three females. Data collection and experimentation for this study were conducted after obtaining consent from the participants. For the experiments, two screen settings that divided the screen into six and eight regions were used, as shown in Fig. 11.

FIGURE 9. Experimental setup for the proposed method. With a large screen, an RGB camera, and an edge-computing device, the subject looks at the screen under indoor illumination.

FIGURE 10. Comparison of the proposed function and other functions. We visualized the results of applying each function to both pitch (left) and yaw (right). For the end regions, where eye tracking is difficult, SAAF amplifies the gaze value more than the other functions.

FIGURE 11. Two screen settings used in the experiment. The experiment was conducted with six (left) and eight (right) regions.

B. Quantitative Results

We compared five models for post-processing the gaze to verify the performance of the proposed method. First, the term "naive model" refers to the baseline gaze tracking model without any post-processing. The second model is the polynomial model, a polynomial function fitted to the data from our acquisition process; we chose the best degree for the polynomial using the R-squared value, which indicates the explanatory power of the fitted function. The third model is the piecewise polynomial model from previous work [34], which finds the best-fitting curve by segmenting the data into intervals and fitting each interval with a p-th order polynomial; since that work provides source code, we used it for the experiment. The fourth model is the Bezier curve, which is widely used in computer graphics and in the design of smooth curves and shapes; recently, it has been used for path and velocity planning [31], [32] and for fitting a boundary point cloud [33]. We chose the Bezier curve as a reference for comparison because its ability to align its shape to given points is similar to the motivation behind our proposed model. Finally, the fifth model is the proposed model, SAAF.

We also tested our model under two different conditions: a static condition, where the subject is staring at targets on the screen, and a real-time condition, where the interaction between the subject and the gaze tracking module is active. Therefore, two groups of metrics were used. For the first condition, we compared the region classification performance among the five models. Since we measured the classification ability of the models on grid-shaped regions, we used multi-label classification metrics: precision, recall, and F1-score. For the second condition, we propose a new metric called click accuracy, which measures the number of hits (clicks in the correct region) out of the total clicks; thus, click accuracy represents the performance of the model in real time. The definitions of all the metrics used in this work are listed below:
\begin{align*} \text{Precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}}, \\ \text{Recall} &= \frac{\text{TP}}{\text{TP} + \text{FN}}, \\ \text{F1-Score} &= \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}, \\ \text{Click Accuracy} &= \frac{\#\text{ of hits}}{\#\text{ of total clicks}}.\end{align*}

In our experiment, a true positive (TP) is the case where the gaze is located in the target region when it should be, a false positive (FP) is the case where the gaze is located in the target region although it should not be, and a false negative (FN) is the case where the gaze should appear in the target region but does not. With these terms, precision can be interpreted as the fraction of positive identifications that are actually correct, and recall as the proportion of actual positives that are correctly identified. The F1-score, expressed as the harmonic mean of precision and recall, combines the two into a single metric, since there is a trade-off between them. A model obtains a high F1-score only if both precision and recall are high.
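For reference, the metrics above reduce to the following straightforward computations from per-region TP/FP/FN counts and the click log; this is a sketch of the evaluation, not code from the paper.

```python
# Sketch of the evaluation metrics computed from per-region TP/FP/FN counts and the click log.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

def click_accuracy(hits, total_clicks):
    return hits / total_clicks if total_clicks else 0.0
```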

1) Region Classification

We tested the region classification of the five models in two settings: one with six regions and the other with eight regions. The results are presented in Tables 1 and 2.

TABLE 1. Classification Performance of Each Method (R = 6)
TABLE 2. Classification Performance of Each Method (R = 8)

As revealed by Table 1, since the classification task with six regions is relatively easy, the performance difference from the other methods was not prominent; nevertheless, our approach achieved the best performance in terms of F1-score. In the case of eight regions, we obtained consistent results in Table 2, as the proposed model provided the best results for five out of eight regions. The Bezier curve outperformed our model in the upper regions (regions 1, 2, and 4), which are close to the camera installed at the top middle of the screen. However, for all the lower regions (regions 5–8) and region 3, our model provided the best F1-scores. This result corresponds with the motivation for our proposed method, namely compensation for restricted eye movement.

2) Real-Time Click Accuracy

The region classification presented in the previous subsection measured the gaze accuracy under the static condition without direct feedback. However, to evaluate the performance under a practical condition, we designed an evaluation method that considers immediate feedback. The evaluation procedure is as follows. First, a random number indicating the target region is generated each time. Then, when the subject looks at the target and the gaze coordinate stays within a region for more than a predefined duration, a click is triggered; each click in the correct region is recorded as a hit. The results are presented in Table 3, which shows that the proposed model outperforms the other models under practical conditions. The performance of the Bezier method was close to that of the proposed SAAF, which is consistent with the results in Tables 1 and 2.

TABLE 3. Click Accuracy Test Results
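A hedged sketch of this real-time click test is given below: a click fires when the gaze stays in one region longer than a dwell threshold, and a hit is counted when that region matches the randomly chosen target. The get_gaze_region callback, the dwell threshold, and the number of trials are illustrative assumptions.

```python
# Hedged sketch of the real-time click test: a click fires once the gaze stays in one
# region longer than dwell_sec, and a hit is counted when that region equals the target.
import random
import time

def run_click_test(get_gaze_region, num_regions, trials=20, dwell_sec=1.0):
    hits = 0
    for _ in range(trials):
        target = random.randrange(num_regions)        # random target region shown on screen
        current, entered = None, time.time()
        while True:
            region = get_gaze_region()                # region index of the current gaze point
            if region != current:                     # gaze moved to a new region: reset timer
                current, entered = region, time.time()
            elif time.time() - entered >= dwell_sec:  # dwelled long enough: trigger a click
                hits += (region == target)
                break
    return hits / trials                              # click accuracy
```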

C. Qualitative Results

We compared the naive model and the proposed model to demonstrate the superiority of the proposed method qualitatively. The polynomial model was not visualized due to its poor performance, so we compared only two models.

The subjects were asked to look at the center of each region of the screen depicted in Fig. 11, and the actual gaze is represented by colored dots in Figs. 12, 13, and 14. Details can be found in the legends of the figures. In addition, the transparency of the points was adjusted to visualize the movement of the gaze: transparent points correspond to the early stages of gaze tracking, and the points become more opaque over time.

FIGURE 12. Qualitative results of collected gaze when R = 6. We visualized the performance of the naive model (top) and the proposed model (bottom). The transparent points are the early stages of gaze tracking, and the colors of the dots get darker over time.

FIGURE 13. Qualitative results of collected gaze when R = 8. We visualized the performance of the naive model (top) and the proposed model (bottom). The transparent points are the early stages of gaze tracking, and the colors of the dots get darker over time.

FIGURE 14. Qualitative results of the proposed center gravity function before (top) and after (bottom) its application. The transparent points are the early stages of gaze tracking, and the colors of the dots get darker over time.

Fig. 12 shows the results of applying the naive model (top) and the proposed model (bottom) in the case of R = 6. With the naive model, points of various colors are scattered within each region, whereas with the proposed model each region is occupied mostly by points of a single color. This shows that the gaze tracking performance of the proposed model in the case of R = 6 is clearly superior to that of the naive model.

Similarly, Fig. 13 shows the results of applying the naive model (top) and the proposed model (bottom) in the case of R = 8. Again, the naive model scatters points of diverse colors within each region, while with the proposed model a single color occupies most of each region. This shows that the gaze tracking performance of the proposed model remains excellent even in the case of R = 8, which is more challenging than the case of R = 6.

In addition, a visualization demonstrating the effectiveness of the center gravity function is shown in Fig. 14. The description of the color and brightness of the points is consistent with that of Fig. 13. The experiment was conducted in the case of R = 8 on a specific subject. In Fig. 14, the upper result is obtained before applying the center gravity function, and the lower one after applying it; SAAF is applied in both cases. The comparison shows that the center gravity function improves the ability to reach the center of a specific region.

SECTION V.

Implementation

The gaze tracking module presented in the previous sections is part of an integrated system. What we refer to as the aggregated system is the set of hardware systems installed in an autonomous driving vehicle for tourists, and the integrated system refers to the software system composed of gaze estimation, pose estimation, and object detection combined with the optimization process. In particular, the aggregated system is installed in the Navya [40] and Robo [41] shuttle buses, which is why its use in a public environment is considered. Figs. 15 and 16 show the OLED displays installed in a perpendicular arrangement, which enables two users to each use their own display.

FIGURE 15. Installation of the screen inside the Navya shuttle. (Left) The blueprint of the screen installation. (Right) Actual installation of the two screens in a perpendicular position.

FIGURE 16. Installation of the screen inside the Robo shuttle. (Left) The outside view of the Robo. (Right) Actual installation of the screen inside the Robo shuttle.

To show further details of the testing environment inside the vehicle, the Navya is represented in Fig. 17. We show only the bird's-eye view of the Navya because the two vehicles are almost the same size, with only a slight difference in the width-to-height ratio. The distance that yielded the best results in our experiments was between 0.8 and 1.0 m from the screen. If the subject is positioned too far from the screen, the gaze tracking performance decreases because recognizing the human eyes becomes more difficult. Conversely, if the subject is too close to the screen, using gaze to control the UI becomes redundant, as touch can be used instead. Users should be positioned in line with the center of the screen, since the subjects of the public gaze dataset were located at the center of the screen. The position where a person is expected to stand is depicted as the orange box, and the position of the displays is indicated by the blue regions in Fig. 17.

FIGURE 17. Testing environment in a bird's-eye view inside the Navya. Two screens are installed in a perpendicular position and operate independently. At most two people can use the gaze tracking module at a distance of 0.8–1.0 m from the screen.

FIGURE 18. Example pictures of the user interface. (Top) Manual for the hand gesture recognition. (Bottom) Local map for the tourists.

Inside the vehicle, users are provided with various types of information about the vehicle they are riding, as well as local guides, through the user interface (UI); for example, vehicle operation information, a safety manual, an internet web service, a tour guide, and even entertainment content are provided. The UI is controlled using multimodal inputs from the users. To receive these inputs, several sensors such as a touch display, a voice sensor, a gesture sensor, and an RGB camera for gaze tracking are included in the system. Because the entire system comprises many different modules and sensors receiving inputs simultaneously, optimizing the networks is essential, as explained in the proposed method section.

SECTION VI.

Conclusion

In this paper, we proposed a function named SAAF, which is designed to solve the gaze-to-screen mapping problem that occurs at the edge of a large screen. In addition, for a better user experience, we optimized our system from two aspects: inference speed and feedback. We implemented the network optimization using TensorRT to achieve low latency, and we proposed a center gravity function that compensates for the person-dependent movement of each user's eyes. Our gaze tracking system was implemented as part of an aggregated system along with other modules, such as voice assistance, gesture recognition, and touch screens, and installed in an autonomous vehicle that serves as a tour shuttle. Although we achieved accurate gaze tracking performance on a large screen, some limitations remain. For example, because the gaze dataset was collected by capturing subjects located at the center of the screen, the space in which a user can utilize the system is constrained. For future research, we expect a dataset containing the positions of subjects and the corresponding gaze to resolve the abovementioned issue, providing a larger usable space to the users.

ACKNOWLEDGMENT

(Joseph Kihoon Kim, Junho Park, Yeon-Kug Moon, and Suk-Ju Kang contributed equally to this work.)
