Introduction
The webpage is a prominent platform for communicating information on the Internet. With the prevalence of the Internet, webpages have come to play an important role in our daily lives. According to statistics from Internet Live Stats, there are over 1.5 billion websites on the world wide web. Moreover, the total number of Internet users reached 4.5 billion in June 2019. Given such ubiquitous webpage reading, it is necessary to investigate how humans deploy visual resources to acquire information when free-viewing webpages.
For the human visual system (HVS), there is an imbalance between input and computing resources. Faced with overwhelming amounts of visual input, the HVS can still effectively process visual information to form an accurate understanding of the external world [1]. The main reason for this is selective visual attention. It serves as a mediating mechanism that selects the most critical regions for detailed processing while limiting the influence of the remaining areas [2]. Given its important role in visual processing, modeling visual attention for webpages can contribute to revealing the internal working mechanisms of the HVS in the free viewing of webpages. On the other hand, predicting visual attention is beneficial to webpage design. A good webpage not only needs to meet functional needs but also needs to have a reasonable layout. Visual attention prediction can help improve the layout of webpages and thus provide a better reading experience for Internet users.
Due to the significance of attention modeling for webpages, researchers have begun to predict where humans pay attention on webpages. Shen and Zhao [3] constructed an eye-tracking data set of webpages and used multiple kernel learning to learn saliency maps from the recorded data. Zheng et al. [4] proposed an end-to-end learning framework to estimate task-driven webpage saliency. Despite these multiple computational frameworks, existing work usually aims at estimating two-dimensional saliency maps for webpages, which encode only the static saliency of distinct areas [3]–[6]. However, actual attention is a dynamic visual search process [7]. Even two regions with the same saliency value may be attended in different orders. As shown in Fig. 1, a saliency map cannot explain dynamic behaviors in attention, such as the temporal relationship between salient regions and the generation of saccadic scanpaths. To explore the dynamic attributes of human visual behaviors when free-viewing webpages, we propose a saccadic model for webpages, which can predict saccadic scanpaths according to the content of webpages.
Comparison between webpage saliency estimation and saccade prediction. (a) Webpages. (b) Saliency maps (fixation heat maps). (c) Saccadic scanpaths of different subjects. (d) Predicted saccadic scanpaths by the proposed model.
The proposed model consists of two main stages. In the first stage, we calculate a saliency map to estimate the initial distribution of fixations for scanpath generation. Concretely, we first extract multilevel saliency features to represent each area from distinct perspectives. Then, we utilize the support vector machine (SVM) to learn the relationship between the extracted features and the initial probability of being fixated on webpages. In the second stage, we combine the mechanisms of spatial bias and inhibition of return (IoR) with saliency estimation to predict human saccadic scanpath on each webpage image. In summary, the main contributions are three-fold.
We investigate the dynamic properties of human attention in the free viewing of webpages. Combined with existing research on webpage saliency, this work helps establish a complete visual attention model for webpage viewing.
We construct a more effective feature description for saliency estimation. We employ deep networks to extract deeper and more abstract representations from the input. Then, by learning from eye-tracking data, we can more accurately model the relationship between multilevel features and saliency.
We model the spatial influence on scanpath generation, which yields scanpaths more consistent with human eye movements. Experimental results show that spatial bias is beneficial to saccade prediction for webpages. By combining the influence of saliency, spatial bias, and IoR, the final model outperforms the state-of-the-art saccadic methods.
The remainder of the paper is organized as follows: Section II reviews related work on visual attention modeling. Section III describes the proposed saccadic model for webpages. The performances of models are evaluated in Section IV, and the conclusions are drawn in Section V.
Related Work
A. Visual Attention Models on Natural Images
Most existing visual attention models have originated from Treisman and Gelade’s Feature Integration Theory (FIT) [8]. It suggests that different visual features are combined in parallel to affect the human attention process. Based on their study, Koch and Ullman [9] fused the influence of different features into a two-dimensional topographic map called “saliency map” to represent the conspicuity of each region in an image. Furthermore, Itti et al. [2] completely implemented the framework in [9] and built the classic bottom-up computational approach of attention. They proposed the center-surround (C-S) assumption and estimated saliency according to the feature contrast between the central pixel and the average of neighborhood pixels. Based on this milestone, predicting the saliency maps consistent with the eye movements of humans has become the primary task in attention modeling.
To obtain a more accurate saliency map, in the past two decades, a large number of models have emerged to improve Itti et al.’s model from distinct perspectives [2]. Firstly, the pixel-wise comparison has been replaced by patch-based feature difference which takes context information into account. For instance, Bruce and Tsotsos [10] determined saliency by comparing the independent components of central and surrounding patches. Borji [11] used the space-weighted feature dissimilarity between the center patch and other surrounding patches to represent local saliency. Han et al. [12] adopted the patches from image boundary to model background and calculated saliency by the reconstruction residual of a background-based network.
Secondly, the area for C-S comparison has expanded from local neighborhoods to nonlocal or global regions. As stated in previous research [13], the local comparison is insufficient when a region has low C-S contrast but the entire local region is globally rare. Therefore, models have begun to integrate more nonlocal information into C-S comparison. For instance, Xia et al. [14] described the C-S contrast based on an autoencoder network learned from each scene globally. Wang et al. [15] extended the range of context to a corpus of similar images to stress the regions deviating from traditional notions.
Thirdly, learning has been introduced in the calculation. On the one hand, methods have used learning as methodologies for building features in C-S comparison to enhance the generalization ability of models. For instance, Borji and Itti [13] learned V1-like features from natural scenes to compute local and global rarity. Vig et al. [16] searched for the optimal blend of features from a hierarchical model family. On the other hand, models have learned the relationship between features and saliency values directly. For instance, Shen and Zhao [17] adopted a linear classifier to learn the inference from extracted features to saliency. Wang and Shen [18] applied an end-to-end CNN to predict multilevel saliency from the input of an image.
Besides the research on saliency estimation for predicting human fixations, another branch of saliency calculation has emerged to detect object-level salient areas, which has achieved a wide range of applications. Cheng et al. [19] first segmented each scene into regions and computed saliency based on the comparison between regions. To use information beyond the current image for saliency estimation, Wang et al. [20] calculated the saliency of each image by warping the annotations of similar scenes. For video salient object detection, Wang et al. [21] computed static and dynamic saliency by utilizing CNNs to learn the inference from input to labeled salient areas. In [22], Wang et al. calculated object-aware saliency to generate a spatiotemporal saliency prior for video object segmentation. In [23], Wang et al. also applied a CNN to predict the attention bounding box for each image. Then, they integrated an aesthetics-assessment-based network to select from the attention-based candidate windows for photo cropping. In [24], Wang et al. estimated stereoscopic saliency based on disparity and edge cues, and used saliency to guide stereoscopic thumbnail generation.
In the development of visual attention, saliency estimation has always been a research topic of great interest. With a large amount of research effort, the last decade has witnessed significant improvements in the prediction of saliency maps. However, a saliency map does not contain any dynamic information. Therefore, it cannot wholly account for actual human eye movements. To investigate the dynamic properties in visual attention, saccadic models have emerged to predict human saccadic scanpaths. The earliest research can be traced back to Itti et al.’s model [2], which employed winner-take-all (WTA) and IoR on each saliency map to generate a sequence of successive fixations. Besides, Lee and Yu [25] interpreted saccadic eye movements under the framework of information maximization. They iteratively selected the locations of the maximum complexity in responses for scanpath generation. Similarly, Wang et al. [26] directed the next fixation to the maximum on a residual perceptual information map measured by Site Entropy Rate.
In recent years, another tendency in saccade prediction is to regard saccade as a Markov process, with the next fixation determined by the maximum transition probabilities calculated based on the last fixation. Liu et al. [27] calculated the transition probabilities based on low-level saliency and the semantic content modeled by a Hidden Markov Model. Le Meur and Liu [28] combined saliency, oculomotor biases, and memory effect for estimating the transition probabilities. To further model the effect of stimuli and spatial location, Le Meur and Coutrot [29] extended the previous work [28] by training the model with distinct image categories and spatial locations. Also inspired by [28], Wu and Chen [30] introduced visual memory and combined it with oculomotor bias and IoR in the calculation of transition probabilities for gaze shifts.
B. Visual Attention Models on Webpages
With the popularity of the Internet and the rapid development of big data, webpages have become one of the most important channels for humans to acquire information from the external environment. Due to the ubiquitous webpage data, it is necessary to understand how humans deploy their attention on webpages in free-viewing tasks. To this end, multiple methods have been proposed to investigate the attention process on webpages.
Shen and Zhao [3] pioneered this direction and proposed an early saliency model for webpages. Firstly, they analyzed the features and mechanisms underlying webpage saliency. Then, they constructed the first eye-tracking data set of webpages by collecting the eye movements of 11 observers on 149 webpages. Finally, they learned from the data set via multiple kernel learning to obtain a model based on distinct features and positional bias. In [31], they extended the study of [3] by generating high-level representations from CNNs. Besides, Li et al. [5] improved Shen and Zhao's work [3] from two perspectives. For one thing, they introduced subband features to complement the features calculated in the spatial domain. For another, they detected object blobs in webpages to further enhance the performance of the model.
Besides webpage saliency under free-viewing conditions, in recent years, researchers have begun to model task-driven attention on webpages to investigate the effect of targets on human eye movements. For instance, Zheng et al. [4] proposed an end-to-end learning framework to estimate task-driven webpage saliency. They first constructed an eye-tracking data set with stimuli from six categories. Then, they separately calculated task-specific and task-free fixation maps by learning from the data set. Finally, they combined the effects of the two components additively to derive the final saliency map.
In summary, previous research on attention modeling for webpages has mainly focused on the estimation of webpage saliency [32]. However, these studies cannot interpret the temporal sequence of fixations, which is valuable for understanding actual human attention during visual exploration. For this reason, we present a saccadic model for webpages in this study to predict the shifts between fixations and to generate saccadic scanpaths on distinct webpage images.
Saccadic Model for Webpages
In this section, we describe the framework of the proposed saccadic model for webpages. The overall flowchart of the model is presented in Fig. 2. As is shown in the figure, the model consists of two main stages. In the first stage, we calculate a saliency map based on multilevel features to estimate the initial distribution of fixations. In the second stage, we combine the mechanisms of spatial bias and IoR with saliency estimation to iteratively predict fixations for saccadic scanpath generation. Besides, we have shown the overall algorithm in Algorithm 1 to provide more algorithmic details.
Algorithm 1 The Proposed Model of Predicting the Sequence of Saccadic Points
Input: Eye-tracking data
Output: A set of saccadic points
Stage 1: Train the SVM model
1. Select positive and negative samples according to the recorded data.
2. Extract multilevel features for each sample according to III-A.
3. Train the SVM to obtain its parameters.
Stage 2: Predict the scanpath on a given image
4. Extract multilevel features.
5. Compute the saliency value S(x).
6. while fewer than ten fixations have been generated do
7.    Compute the top-left bias map μ(x).
8.    Compute the oculomotor bias map ρ_t(x).
9.    Generate the spatial bias W_t(x).
10.   Compute the IoR map I_t(x).
11.   Estimate the integrated map F(x).
12.   Select the location of the maximum as the next fixation q_t.
13. end while
Framework of the proposed saccadic model for webpages. Given a webpage image, we predict the saccadic scanpath consisting of ten fixations according to the image content.
A. Saliency Estimation With Feature Fusion
Results from perceptual research [28], [33] have indicated that saliency is an influential factor in guiding saccadic eye navigation. The salient regions with rare visual information usually present a high probability of being fixated. Therefore, we first compute the saliency map of each webpage image to estimate the initial probability of the gaze shift.
In saliency estimation, the core of our model is to learn the relationship between visual representation and saliency value from human eye-tracking data. We first construct the description of each region by extracting multilevel features. Then, we choose positive and negative samples separately from the most salient and nonsalient areas to train the parameters of mapping. Finally, based on the learned model, we calculate the saliency values of distinct locations to generate the saliency map of each test image.
To make a complete description of scenes, we use the features from different levels to represent each pixel. As is shown in Fig. 3, we first extract six low-level features to generate a physiologically plausible representation.
Block diagram of saliency estimation for webpages. We first extract a set of visual features from training images. Then, we choose positive and negative samples separately from the most salient (top 20% of the human heatmap) and nonsalient areas (bottom 30% of the human heatmap) to train the parameters of SVM. Finally, we use the learned model to predict the saliency map of a test image.
1) Subbands of the Steerable Pyramid
The pyramid subbands in four orientations and three scales.
2) 3-Channel Color
The red, green, and blue color of each pixel.
3) Probability of Color
The probability of the pixel’s value in the corresponding color channel.
4) ITTI Model Features
The conspicuity under intensity, color, and orientation in Itti et al.’s model [2], which is computed by across-scale C-S contrast.
Concretely, we resize the image to the resolution of 200 pixels and calculate the across-scale C-S contrasts for intensity, color, and orientation as:
\begin{align*} \mathcal {I}(c, s)=&|I(c)\ominus I(s)|, \tag{1}\\ \mathcal {RG}(c, s)=&|(R(c) - G(c)) \ominus (G(s) - R(s))|, \\ \mathcal {BY}(c, s)=&|(B(c) - Y(c)) \ominus (Y(s) - B(s))|, \tag{2}\\ \mathcal {O}(c, s, \theta)=&|O(c, \theta) \ominus O(s, \theta)|,\tag{3}\end{align*}
where $c$ and $s$ denote the center and surround scales, $\ominus$ is the across-scale difference operator, $I$ is the intensity channel, $R$, $G$, $B$, and $Y$ are the broadly tuned color channels, and $O(\cdot, \theta)$ is the orientation channel at angle $\theta$.
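For illustration, a minimal sketch of the intensity contrast in (1) is given below; the pyramid depth, the specific (c, s) scale pair, and the use of OpenCV are our assumptions rather than the exact configuration of [2], which combines several scale pairs per channel.

```python
import cv2
import numpy as np

def intensity_cs_contrast(image, center=2, surround=5):
    """Across-scale center-surround contrast |I(c) - I(s)| of (1).

    The pyramid levels and bilinear upsampling are illustrative
    assumptions; Itti et al. [2] combine several (c, s) pairs.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    # Build a Gaussian pyramid up to the surround scale.
    pyramid = [gray]
    for _ in range(surround):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    # Upsample the surround level to the center level's size and subtract.
    h, w = pyramid[center].shape
    surround_up = cv2.resize(pyramid[surround], (w, h),
                             interpolation=cv2.INTER_LINEAR)
    return np.abs(pyramid[center] - surround_up)
```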
5) 3D Color Histograms of Filtered Image
The probability of each color according to 3D color histograms of the image filtered with a median filter under five scales.
6) Torralba Saliency
The saliency value calculated by [34], which defines saliency as the difference between the target velocity and the average of distractors.
7) Distance to Center
For mid-level features, we compute the distance to the center of images because important figures tend to be placed in the center of webpages [35].
8) Semantic Hashing Code
Semantic hashing, an effective means of finding a compressed representation of high-dimensional input data, has been successfully applied in multiple fields. In essence, semantic hashing converts raw image data into short binary codes. These codes organize the data in a memory space where nearby addresses store pointers to semantically similar objects [36]. To obtain the binary code, an autoencoder network as in Fig. 4 is used for feature learning. As can be seen from the figure, the network is a reconstruction network with a symmetrical encoder and decoder. The input first passes through the encoder, whose number of units gradually decreases, to obtain a short binary code. Then, the decoder generates the output based on the binary code in the central layer.
Autoencoder for extracting semantic hashing code. The network consists of symmetrical encoder and decoder.
For high-level features, we combine the methodologies of semantic hashing, multiresolution CNN, and object detection for feature extraction.
Concretely, as is shown in Fig. 4, given an input vector $\boldsymbol{y}$, the first encoder layer computes \begin{equation*} \boldsymbol {e}_{1} = \mathrm {sigmoid}(\boldsymbol {W}_{1} \boldsymbol {y}+ \boldsymbol {b}_{1}),\tag{4}\end{equation*}
where $\boldsymbol{W}_{1}$ and $\boldsymbol{b}_{1}$ are the weights and bias of the layer. The subsequent encoder layers are computed in the same manner until the central layer outputs the binary code $\boldsymbol{c}$. Symmetrically, the first decoder layer computes \begin{equation*} \boldsymbol {d}_{1} = \mathrm {sigmoid}(\bar { \boldsymbol {W}}_{1} \boldsymbol {c}+\bar { \boldsymbol {b}}_{1}),\tag{5}\end{equation*}
and the remaining decoder layers reconstruct the output $\bar{\boldsymbol{y}}$. The reconstruction error of each sample is measured by the cross-entropy loss \begin{equation*} L(\boldsymbol {y},\bar { \boldsymbol {y}})=- \sum \nolimits _{i} \big (y_{i} \log \bar {y}_{i} + \left ({1 - y_{i} }\right)\log \left ({1 - \bar {y}_{i} }\right) \big),\tag{6}\end{equation*}
and the network parameters $\boldsymbol{\theta}$ are optimized by minimizing the total loss over the training set $T$: \begin{equation*} L_{AE}(\boldsymbol {\theta }) = \sum _{ \boldsymbol {y}\in T} L(\boldsymbol {y},\bar { \boldsymbol {y}}).\tag{7}\end{equation*}
To train the network, we sample patches from the webpage images to form the training set $T$; the learned encoder then produces the semantic hashing code of each patch as a high-level feature.
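As a concrete reference, a minimal PyTorch sketch of such a symmetric autoencoder might look as follows; the layer widths, the code length, and the 0.5 binarization threshold are illustrative assumptions, since the text only specifies the structure of Fig. 4.

```python
import torch
import torch.nn as nn

class SemanticHashingAE(nn.Module):
    """Symmetric autoencoder for semantic hashing (cf. Fig. 4).

    Layer widths and code length are illustrative assumptions.
    """
    def __init__(self, in_dim=1024, code_dim=128):
        super().__init__()
        # Encoder with a gradually decreasing number of units.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.Sigmoid(),   # e1 = sigmoid(W1 y + b1), as in (4)
            nn.Linear(512, 256), nn.Sigmoid(),
            nn.Linear(256, code_dim), nn.Sigmoid(),  # central layer -> code c
        )
        # Mirror-image decoder reconstructing the input, as in (5).
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.Sigmoid(),
            nn.Linear(256, 512), nn.Sigmoid(),
            nn.Linear(512, in_dim), nn.Sigmoid(),
        )

    def forward(self, y):
        c = self.encoder(y)
        return self.decoder(c), c

model = SemanticHashingAE()
criterion = nn.BCELoss(reduction='sum')  # cross-entropy loss of (6)-(7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch):
    # batch: (N, in_dim) tensor with values in [0, 1].
    recon, _ = model(batch)
    loss = criterion(recon, batch)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def hash_code(y, threshold=0.5):
    # Binarize the central-layer activations to obtain the hashing code.
    with torch.no_grad():
        return (model.encoder(y) > threshold).float()
```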
9) Multiresolution CNN Features
Deep learning has played an important part in the progress of saliency research. CNNs, which are inspired by the function of visual cells, can capture semantic features of data in a hierarchical way [18], [38]. Therefore, in this study, we also extract CNN features to complement the hierarchically semantic information of images. Concretely, we adopt the trained network in [38] for feature learning. Given a pixel, we feed its surrounding image regions at multiple resolutions into the network and take the resulting responses as its multiresolution CNN features.
10) Object Detection
The analyses of eye movements on webpages [3] have shown that humans tend to pay more attention to regions containing faces and persons. Therefore, we generate binary maps of objects to obtain a robust description of objects under different scenarios. Firstly, we derive the maps of cars and persons with the car and person detectors implemented in Felzenszwalb et al.'s model [39]. Then, we obtain the maps of faces with Viola and Jones's face detector [40].
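As one concrete possibility, the face map can be produced with OpenCV's Haar-cascade implementation of the Viola-Jones detector; the cascade file and the rectangle-filling construction below are our assumptions for illustration.

```python
import cv2
import numpy as np

def face_binary_map(image):
    """Binary map marking detected face regions (Viola-Jones via OpenCV)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Standard frontal-face Haar cascade shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    face_map = np.zeros(gray.shape, dtype=np.float32)
    for (x, y, w, h) in faces:
        face_map[y:y + h, x:x + w] = 1.0  # mark each detected face region
    return face_map
```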
Based on feature extraction, the next step is to construct the relationship between multilevel features and saliency values for estimating the saliency of test images. To solve this problem, we use an SVM to learn from human eye-tracking data on webpages. Firstly, we randomly choose half of the data set for training, with the rest as test images. Then, for each training image, we sample ten positive points from the top 20% of the fixation density map and ten negative points from the bottom 30%. As a result, for each training sample, we can extract the pair of multilevel features and the corresponding label (1 for positive samples and 0 for negative samples). Finally, we use all pairs of sampled data to train the SVM. In testing, we adopt a method similar to regression to estimate the continuously changing saliency value $\boldsymbol{S}(\boldsymbol{x})$ of each location, which forms the saliency map of the test image.
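This sampling-and-learning step could be sketched as follows; here training_images is an assumed iterable of precomputed per-pixel features and fixation density maps, and using scikit-learn's SVC decision values as the continuous saliency score is our illustrative choice rather than the paper's exact solver.

```python
import numpy as np
from sklearn.svm import SVC

def sample_points(density_map, low, high, n=10):
    """Sample n pixel indices whose density values fall in the [low, high] percentile band."""
    lo, hi = np.percentile(density_map, [low, high])
    candidates = np.flatnonzero((density_map >= lo) & (density_map <= hi))
    return np.random.choice(candidates, size=n, replace=False)

X, y = [], []
for features, density in training_images:  # features: (H*W, D), density: flat (H*W,)
    pos = sample_points(density, 80, 100)  # top 20% of the fixation density map
    neg = sample_points(density, 0, 30)    # bottom 30%
    X.append(features[np.concatenate([pos, neg])])
    y.append(np.concatenate([np.ones(len(pos)), np.zeros(len(neg))]))

svm = SVC(kernel='rbf').fit(np.vstack(X), np.concatenate(y))

def saliency_map(features, shape):
    # Signed distance to the decision boundary as a continuous saliency value.
    s = svm.decision_function(features).reshape(shape)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)  # normalize to [0, 1]
```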
B. Saccadic Scanpath Generation
Saliency estimation can provide an initial probability of being fixated for each pixel to guide the selection of fixation at each moment. However, besides saliency, there are other factors influencing saccade. One critical factor is spatial bias. It means that human eye movements exhibit certain spatial biases on webpages because of the unique properties of webpage images. Therefore, in this subsection, we first combine the spatial influence from distinct perspectives to model the spatial bias for the determination of saccade.
The first is the top-left bias. As is shown in Fig. 5, there is a tendency to make saccades towards the top-left regions. Furthermore, previous research has demonstrated that the top-left bias mainly appears in the initial fixations of scanpaths [17]. The main reason is that the upper-left corner of images usually contains important information about the scene, such as headlines. Observers tend to pay more attention to the top-left regions at the initial moment, when they try to obtain a general understanding of the scene. To model the spatial influence of the top-left bias at the initial stages, we generate a top-left bias map that introduces a pull towards top-left regions in the saccade prediction of the first four fixations, which is calculated as: \begin{equation*} \mu (\boldsymbol {x}) \propto 1-\bar {d}(\boldsymbol {x}, \boldsymbol {x}_{tp}),\tag{8}\end{equation*}
where $\boldsymbol{x}_{tp}$ denotes the top-left position of the image and $\bar{d}(\cdot, \cdot)$ is the normalized spatial distance between two positions.
Top-left bias of fixations on webpages. (a) Fixation density map generated by convolving all fixations in the data set [17] with a 2D Gaussian function. (b) Map of top-left bias.
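A direct numpy rendering of (8) could look as follows; normalizing the distance by the image diagonal is our assumption for making the distance term dimensionless.

```python
import numpy as np

def top_left_bias(h, w):
    """Top-left bias map of (8): closer to the top-left corner -> larger weight."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.sqrt(xs**2 + ys**2) / np.sqrt(h**2 + w**2)  # normalized distance to (0, 0)
    mu = 1.0 - d
    return mu / mu.sum()  # normalized as in (10)
```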
The second is the oculomotor bias, represented by saccade amplitude and saccade orientation. To investigate the oculomotor bias in free-viewing webpages, we analyze all the eye movement data of the eye-tracking database [3]. As is shown in Fig. 6, we plot the two-dimensional histogram of saccade amplitude and saccade orientation. Concretely, for saccade amplitude, we calculate the spatial distance between every two successive fixations. For saccade orientation, we compute the absolute value of the angle between the X-axis and each gaze shift. Thus, a value close to 0° indicates that the saccade tends to be horizontal, whereas a value close to 90° corresponds to a vertical saccade.
Oculomotor bias of saccade on webpages. Saccade amplitude is the spatial distance between two adjacent fixations. Saccade orientation is the absolute value of the angle with the X-axis. 0° is a horizontal saccade while 90° is a vertical saccade.
As can be seen from Fig. 6, observers tend to generate short and horizontal saccades when free-viewing webpages. From another perspective, the oculomotor bias demonstrates that the determination of the current fixation is influenced by the position of the last fixation [28]. Therefore, we define an oculomotor bias map that depends on the last fixation to enhance the probability of short and horizontal saccades. To be specific, we generate the oculomotor bias map with a two-dimensional Gaussian function: \begin{equation*} \rho _{t}(\boldsymbol {x}) \propto \exp \left[{-\left ({\frac {(x-x_{t-1})^{2}}{2\sigma _{x}^{2}}+\frac {(y-y_{t-1})^{2}}{2\sigma _{y}^{2}}}\right)}\right],\tag{9}\end{equation*}
where $(x_{t-1}, y_{t-1})$ is the position of the last fixation, and $\sigma_{x}$ and $\sigma_{y}$ control the horizontal and vertical spread of the bias; choosing $\sigma_{x}$ larger than $\sigma_{y}$ favors short and horizontal saccades.
Finally, we integrate the top-left bias and the oculomotor bias into the spatial bias map $\boldsymbol{W}_{t}(\boldsymbol{x})$: \begin{equation*} \boldsymbol{W}_{t}(\boldsymbol{x}) = \begin{cases} \dfrac{\mu(\boldsymbol{x})}{\sum_{\boldsymbol{x}} \mu(\boldsymbol{x})} \dfrac{\rho_{t}(\boldsymbol{x})}{\sum_{\boldsymbol{x}} \rho_{t}(\boldsymbol{x})} & \text{if } t \leqslant 4 \\ \dfrac{\rho_{t}(\boldsymbol{x})}{\sum_{\boldsymbol{x}} \rho_{t}(\boldsymbol{x})} & \text{otherwise,} \end{cases}\tag{10}\end{equation*}
so that the top-left bias only acts on the first four fixations, as discussed above.
Besides spatial bias, the IoR mechanism is another important factor in dynamic attention [2], [28]. Its essence is to prevent subsequent shifts from returning to previously attended regions for a period of time. Previous research has shown that each saccade takes 30-70 ms, while the inhibition of each local region lasts approximately 500-900 ms [2]. This means that during the generation of ten fixations, every previously fixated region remains inhibited. To model the IoR mechanism on saccade, we calculate the IoR map $\boldsymbol{I}_{t}(\boldsymbol{x})$ as: \begin{equation*} \boldsymbol{I}_{t}(\boldsymbol{x}) = \begin{cases} 0 & \text{if } d\left(\boldsymbol{x}, \boldsymbol{q}_{t-1}\right) \leqslant r \\ \boldsymbol{I}_{t-1}(\boldsymbol{x}) & \text{otherwise,} \end{cases}\tag{11}\end{equation*}
where $\boldsymbol{q}_{t-1}$ is the last fixation, $d(\cdot, \cdot)$ is the spatial distance, and $r$ is the radius of the inhibited region; $\boldsymbol{I}_{0}(\boldsymbol{x})$ is initialized to one everywhere.
After modeling the multiple factors, we integrate the influence of saliency, spatial bias, and IoR into an integrated map $\boldsymbol{F}(\boldsymbol{x})$: \begin{equation*} \boldsymbol {F}(\boldsymbol {x}) = \boldsymbol {S}(\boldsymbol {x}) \boldsymbol {W}_{t}(\boldsymbol {x}) \boldsymbol {I}_{t}(\boldsymbol {x}).\tag{12}\end{equation*}
Then, the fixation at time $t$ is selected as the location of the maximum on the integrated map: \begin{equation*} \boldsymbol {q}_{t} = \arg \max _{ \boldsymbol {x}} \boldsymbol {F}(\boldsymbol {x}).\tag{13}\end{equation*}
By iterating this selection with the updated spatial bias and IoR maps, we generate the complete saccadic scanpath.
Generation of a sequence of fixations by integrating the influence of saliency, spatial bias, and IoR.
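Putting the pieces together, the following sketch iterates (9)-(13) to produce a ten-fixation scanpath; the Gaussian widths, the inhibition radius, and the centered starting point are illustrative assumptions, and top_left_bias refers to the earlier sketch of (8).

```python
import numpy as np

def predict_scanpath(S, n_fix=10, sigma_x=None, sigma_y=None, r=30):
    """Generate a scanpath from a saliency map S by iterating (9)-(13)."""
    h, w = S.shape
    sigma_x = sigma_x or w / 6.0   # wider horizontally: favors horizontal saccades
    sigma_y = sigma_y or h / 12.0
    ys, xs = np.mgrid[0:h, 0:w]
    mu = top_left_bias(h, w)                 # top-left bias map of (8)
    I = np.ones((h, w))                      # IoR map, initially uninhibited
    fixations = [(h // 2, w // 2)]           # assumed start: image center
    for t in range(1, n_fix + 1):
        y0, x0 = fixations[-1]
        # Oculomotor bias of (9): short saccades, horizontal preferred.
        rho = np.exp(-((xs - x0) ** 2 / (2 * sigma_x ** 2) +
                       (ys - y0) ** 2 / (2 * sigma_y ** 2)))
        rho /= rho.sum()
        # Spatial bias of (10): include the top-left pull only for t <= 4.
        W = mu * rho if t <= 4 else rho
        # IoR update of (11): inhibit a disc of radius r around the last fixation.
        I[(xs - x0) ** 2 + (ys - y0) ** 2 <= r ** 2] = 0.0
        # Integrated map (12) and maximum selection (13).
        F = S * W * I
        fixations.append(np.unravel_index(np.argmax(F), F.shape))
    return fixations[1:]
```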
Experimental Results
In this section, firstly, we describe the comparison models and implementation details in IV-A. Then, we introduce three benchmark evaluation metrics for saccadic scanpath prediction in IV-B. Finally, we evaluate different combinations of saccadic factors and compare our predicted scanpaths on webpages with the state-of-the-art models in IV-C.
A. Experimental Settings
1) Data Set
In this study, we adopted the eye-tracking data set of webpages constructed by Shen and Zhao [3] for the evaluation of saccadic models. It includes 149 webpage images of three categories, namely pictorial, text, and mixed. Based on these images, they recorded the eye movements of 11 subjects aged from 21 to 25. Subjects were positioned at a distance of 0.6 m from the screen during the recording.
2) Computational Saccadic Models
To verify the validity of the proposed saccadic model, we compared its predictions with the results of the state-of-the-art saccadic models. Firstly, we compared the proposed model with Itti et al.'s attention model [2], which applies WTA and IoR on saliency maps for scanpath generation. Secondly, we compared it with Boccignone and Ferraro's model [41] of Constrained Levy Exploration (CLE). Thirdly, we compared it with Le Meur and Liu's saccadic model [28], which combines saliency estimation and oculomotor bias for scanpath prediction. Fourthly, we compared it with Xia et al.'s iterative representation learning (IRL) model [7], which outputs scanpaths based on a self-taught learning framework.
Moreover, human performance should serve as one of the benchmarks for saccadic models [7]. Therefore, we also evaluated the inter-observer performance. Concretely, given an evaluation metric, we first evaluated each human scanpath by taking the other human scanpaths on the same image as the ground truth. Then, we averaged the results across all human scanpaths and all images to obtain the inter-observer performance, denoted "IO".
3) Implementation Details
To obtain the semantic hashing code, we first extracted a large set of patches from the training webpages and used them to train the autoencoder described in III-A.
B. Evaluation Metrics
Unlike saliency estimation, there has been a lack of consensus about the evaluation metrics for scanpath prediction [7]. Therefore, to fairly compare models, we adopted multiple saccadic metrics to evaluate the performance of models from different perspectives.
1) Time-Delay Embedding (TDE)
TDE is a piece-based metric. It was first introduced to the evaluation of saccadic scanpath prediction by Wang et al. in a work on scanpath prediction for natural images [26]. Its calculation consists of three main steps. Firstly, we divided the predicted scanpath and all human scanpaths into saccadic pieces of length $k$.
Then, for each saccadic piece $\boldsymbol{C}^{k}_{p}(t)$ on the predicted scanpath, we calculated its minimal distance to the human saccadic pieces: \begin{equation*} d_{k}(t) = \min _{i,\tau } \left \|{ \boldsymbol {C}^{k}_{p}(t)- \boldsymbol {C}^{k}_{h_{i}}(\tau) }\right \|_{2}/k, \quad \boldsymbol {C}^{k}_{h_{i}}(\tau)\in \boldsymbol {S}^{k}_{h},\tag{14}\end{equation*}
where $\boldsymbol{S}^{k}_{h}$ denotes the set of all saccadic pieces of length $k$ on the human scanpaths.
Finally, we adopted Hausdorff distance (HD) and mean minimal distance (MMD) to measure the total similarity between the pair of predicted and actual human scanpaths. Concretely, HD was defined as the maximum among the distances of all the pieces on the predicted scanpath: \begin{equation*} D_{k}^{1} = \max _{t} d_{k}(t).\tag{15}\end{equation*}
MMD was defined as the average distance across all pieces on the predicted scanpath: \begin{equation*} D_{k}^{2} = \frac {1}{n_{k}}\sum _{t=1}^{n_{k}} d_{k}(t),\tag{16}\end{equation*}
where $n_{k}$ is the number of saccadic pieces on the predicted scanpath.
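A compact implementation of (14)-(16) might read as follows; representing scanpaths as (N, 2) arrays of fixation coordinates and extracting pieces with a stride of one fixation are our assumptions.

```python
import numpy as np

def pieces(path, k):
    """All length-k saccadic pieces of a scanpath, flattened to 2k-dim vectors."""
    return np.stack([path[i:i + k].ravel() for i in range(len(path) - k + 1)])

def tde(pred, humans, k):
    """HD (15) and MMD (16) between a predicted scanpath and human scanpaths."""
    human_pieces = np.vstack([pieces(h, k) for h in humans])  # the set S_h^k
    d = []
    for piece in pieces(pred, k):
        # Minimal distance of this predicted piece to all human pieces, as in (14).
        d.append(np.linalg.norm(human_pieces - piece, axis=1).min() / k)
    d = np.array(d)
    return d.max(), d.mean()  # Hausdorff distance, mean minimal distance
```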
2) Sequence Score (SS)
SS is a string-based metric. It was proposed by Borji et al. [42] to convert scanpaths into strings for comparison. Its calculation consists of three main steps. Firstly, we utilized mean-shift to cluster the fixations, which integrates global information of the scene into the evaluation better than classifying fixations according to grids. Then, we assigned a character to each cluster and converted the scanpaths into strings by determining the cluster of each fixation. Finally, we adopted the Needleman-Wunsch algorithm to calculate the similarity between each pair of predicted and human scanpaths, as sketched below. Based on the pair-wise similarity calculation, we can derive the total measurement of SS by averaging the results across all human scanpaths on each image and all images in the data set. Besides, similar to TDE [26], we varied the length of the scanpaths for comparison from 1 to 6 by truncating subsequent fixations, to evaluate the model at different stages of scanpath generation.
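For reference, a minimal Needleman-Wunsch scorer over two cluster strings could be written as below; the match/mismatch/gap scores and the normalization by the shorter string's length are illustrative assumptions rather than the exact settings of [42].

```python
import numpy as np

def sequence_score(a, b, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch alignment score between two cluster strings, normalized."""
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1))
    H[:, 0] = np.arange(n + 1) * gap
    H[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i, j] = max(diag, H[i - 1, j] + gap, H[i, j - 1] + gap)
    return H[n, m] / min(n, m)  # fraction of aligned matching fixations
```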
3) Multimatch (MM)
MM is a vector-based metric. It was proposed by Jarodzka and Holmqvist [43] and has been widely used in recent challenges of saccadic scanpath prediction [44]. Its calculation also includes three main steps. Firstly, we encoded the scanpaths into geometrical vectors. Each saccade represents the shift between two fixations. By taking fixations as points in a two-dimensional coordinate system, saccades can be regarded as vectors for dissimilarity comparison. Concretely, for two scanpaths, we aligned their saccade vectors and measured the dissimilarity along dimensions including position, length, and direction, then averaged the results across all pairs of predicted and human scanpaths.
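As an illustration of the vector-based comparison, the sketch below computes the length and direction dissimilarity of two aligned saccade vectors; the normalization choices are assumptions, and the full MultiMatch measure additionally aligns scanpaths and compares shape and position, which we omit for brevity.

```python
import numpy as np

def saccade_vectors(path):
    """Saccade vectors between successive fixations of a scanpath (N, 2)."""
    return np.diff(np.asarray(path, dtype=float), axis=0)

def length_direction_dissimilarity(u, v, diag):
    """Length and direction dissimilarity of two aligned saccade vectors.

    diag: image diagonal used to normalize lengths to [0, 1].
    """
    len_diff = abs(np.linalg.norm(u) - np.linalg.norm(v)) / diag
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    ang_diff = np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi  # normalized angle
    return len_diff, ang_diff
```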
C. Comparison of Scanpath Prediction
In this subsection, we first evaluated different combinations of saccadic factors based on the metrics in IV-B. We constructed the model "w/o Saliency", which predicts saccadic scanpaths from random initial positions using only the modeling of spatial bias and IoR. Then, we generated the model "w/o Spatial Bias", which iteratively outputs fixations based on saliency estimation and IoR. Finally, we compared these models with the proposed model ("Proposed"), which fuses the factors of saliency, spatial bias, and IoR to generate scanpaths on webpages.
The comparison under TDE and SS is shown in Fig. 8, and the results under MM are shown in Table 1. Firstly, the comprehensive comparison across different combinations can show the importance of integrating the influence of saliency and spatial bias in scanpath prediction. Secondly, the comparison under “Position” in Table 1 indicates that saliency estimation helps to provide a more accurate prediction for the position of the fixations. Similarly, the significant improvement made by the proposed model at initial stages (e.g., fixation stage 1 under SS in Fig. 8c) can also demonstrate the effectiveness of estimating the initial distribution of fixations by saliency estimation. In other words, static saliency and dynamic saccade are not independent, but coherent mechanisms which interpret the visual behaviors. Thirdly, by comparing the evaluation under “w/o Spatial Bias” and “Proposed”, we can conclude that modeling spatial bias is an effective means to obtain scanpaths more consistent with human data.
Comparison on different combinations of factors and features under TDE (HD and MMD) and SS. HD and MMD are divergence measures (should be minimized), while SS is a similarity measure (should be maximized).
On the other hand, we conducted an ablation analysis to investigate the influence of the different levels of features. The results under distinct combinations of features are shown in Fig. 8 and Table 1. In the comparison, "L", "M", and "H" refer to the models using low-level, mid-level, and high-level features, respectively. As can be observed from the results, the model "H" outperforms the models "L" and "M", which illustrates the contribution of high-level features. Also, the comparison between "L+M" and the proposed model shows the important role of high-level cues in scanpath prediction. The final model "L+M+H", which integrates multiple levels of features, can comprehensively describe the input and thus outperforms the other combinations.
In the second part, we compared the performance of the state-of-the-art models on the webpage data set [3]. The evaluation under TDE is shown in Figs. 9a and 9b. As can be seen from the results of HD and MMD, the rankings of the models are consistent under different values of $k$.
Comparison under TDE (HD and MMD) and SS on the eye-tracking data set of webpage [3]. “IO” refers to the inter-observer performance.
The evaluation under SS is shown in Fig. 9c. By observing the figure, we can draw the following conclusions. Firstly, the proposed model, built by integrating learning-based saliency and saccadic factors, outperforms the other algorithms at different stages of scanpath generation. Secondly, in the first three stages, the proposed model achieves a better prediction of scanpaths than the "IO" model, which indicates the importance of modeling the top-left bias. In contrast, as the length of the scanpaths increases, the consistency among human data gradually strengthens. Therefore, the predictive models still have room for improvement through better exploration of the mechanisms in long-term eye movements. Thirdly, the advantages of the proposed model over the state-of-the-art ones for natural scenes (e.g., "IRL" [7] and "Le Meur" [28]) further demonstrate that it is necessary to specifically model human saccadic behaviors on webpages. Fourthly, an overall comparison with the SS scores on natural images [7] shows that human eye movements on webpages are less consistent than those on natural images. This is because webpage images usually include multiple figures or objects, so there are more possible saccade orders for extracting information.
The evaluation under MM is shown in Table 2. It can be observed from the table that the proposed model has advantages over the other models. On the one hand, the proposed model achieves a more accurate prediction of the fixated positions by adopting the feature-fusing, learning-based saliency to estimate the initial probability of being fixated. On the other hand, by integrating the mechanisms of spatial bias and IoR, the proposed model also surpasses the others in the properties of "Length" and "Direction". As a result, the predicted scanpaths of the proposed model are more consistent with human scanpaths.
In addition, we also compared the results on each subset: "Pictorial", "Text", and "Mixed". Example images of the three subsets can be seen in Fig. 1. "Pictorial" includes 50 images, each of which has a dominant picture with little text. "Text" includes 50 images with informative text. "Mixed" includes 49 images, each containing both thumbnail pictures and text. The results in Fig. 10 and Table 2 show that the proposed model performs best on the "Pictorial" subset but worst on "Text". This implies that modeling text representation is important, even though text regions can pop out through low-level features such as intensity and orientation. Therefore, determining how to extract text-based features to improve saccade prediction on text images will be a future research direction.
The evaluation under TDE (HD and MMD) and SS on the subsets “Pictorial”, “Text”, and “Mixed” in data set [3].
Besides the quantitative comparison, we also provided a visual comparison of the generated scanpaths in Fig. 11. In the figure, the second column, "Heatmap", is the ground truth of saliency estimation, which shows the distribution of human fixations on each scene. The third column, "Human Scanpaths", displays all human scanpaths on each image in different colors. As can be observed from the figure, the proposed scanpaths are more similar to the human data than those of the other saccadic algorithms. In addition, the comparison between the models with (the proposed method and "Le Meur") and without ("Itti" and "CLE") the modeling of oculomotor bias demonstrates the importance of learning the biases of saccade amplitude and saccade orientation in scanpath generation.
Visual comparison of the scanpaths from distinct models. “Heatmap” refers to the distribution of all human fixations. “Human Scanpaths” shows all human scanpaths on each webpage image. For both human and predicted scanpaths, pentagrams are the starting points.
Conclusion
Despite a large amount of research effort in the last two decades, there are still some limitations in visual attention modeling. For one thing, previous research has usually focused on the saliency estimation of natural scenes; the calculation of webpage saliency remains a challenge. For another, the study of dynamic attention mechanisms is still limited [7].
To address these problems, we have proposed a saccadic model for webpages to investigate human dynamic eye movements in free-viewing webpages. In the first stage, we have estimated the initial distribution of fixations based on multilevel saliency learning. In the second stage, we have combined the factors of saliency, top-left bias, oculomotor bias, and IoR to predict fixations for scanpath generation iteratively. Qualitative and quantitative experimental results have demonstrated the advantages of the proposed model over other state-of-the-art methods.
For future work, we will extend this study from the following perspectives. Firstly, we will explore task-driven dynamic visual attention on webpages. Building on attention modeling under free-viewing conditions, we will investigate the influence of tasks on the generation of saccadic scanpaths. Secondly, besides the dissimilarity of scanpaths induced by different targets, we will also focus on the differences in scanpaths across distinct groups of subjects to address classification problems in multiple applications, such as disease identification and age recognition [45]. Thirdly, we will integrate the methodologies of deep learning into the estimation of saliency and the modeling of saccadic properties. On the one hand, we will build a large-scale eye-tracking data set of webpages to remedy the lack of large-scale webpage-based data sets. On the other hand, we will take advantage of the learning ability of deep models to reveal deeper dynamic attention mechanisms from human eye movements.