Introduction
The webpage is a prominent platform for communicating information on the Internet. With the prevalence of the Internet, webpages have come to play an important role in our daily lives. According to statistics from Internet Live Stats, there are over 1.5 billion websites on the world wide web. Moreover, the total number of Internet users reached 4.5 billion in June 2019. Given such ubiquitous webpage reading, it is necessary to investigate how humans deploy visual resources to acquire information when free-viewing webpages.
For the human visual system (HVS), there is an imbalance between input and computing resources. Faced with overwhelming amounts of visual input, the HVS can still effectively process visual information to form an accurate understanding of the external world [1]. The main reason for this is selective visual attention. It serves as a mediating mechanism that selects the most critical regions for detailed processing while limiting the influence of the remaining areas [2]. Given its important role in visual processing, modeling visual attention for webpages can contribute to revealing the internal working mechanisms of the HVS in the free viewing of webpages. On the other hand, predicting visual attention is beneficial to webpage design. A good webpage not only needs to meet functional needs but also needs to have a reasonable layout. Visual attention prediction can help improve the layout of webpages and thus provide a better reading experience for Internet users.
Due to the significance of attention modeling for webpages, researchers have begun to predict where humans pay attention on webpages. Shen and Zhao [3] constructed an eye-tracking data set of webpages and used multiple kernel learning to learn saliency maps from the recorded data. Zheng et al. [4] proposed an end-to-end learning framework to estimate task-driven webpage saliency. Despite these multiple computational frameworks, existing work usually aims at estimating two-dimensional saliency maps for webpages, which encode only the static saliency of distinct areas [3]–[6]. However, actual attention is a dynamic visual search process [7]. Even two regions with the same saliency value may be attended in different orders. As shown in Fig. 1, a saliency map cannot explain dynamic behaviors in attention, such as the temporal relationship between salient regions and the generation of saccadic scanpaths. To explore the dynamic attributes of human visual behaviors when free-viewing webpages, we propose a saccadic model for webpages, which can predict saccadic scanpaths according to the content of webpages.
Comparison between webpage saliency estimation and saccade prediction. (a) Webpages. (b) Saliency maps (fixation heat maps). (c) Saccadic scanpaths of different subjects. (d) Predicted saccadic scanpaths by the proposed model.
The proposed model consists of two main stages. In the first stage, we calculate a saliency map to estimate the initial distribution of fixations for scanpath generation. Concretely, we first extract multilevel saliency features to represent each area from distinct perspectives. Then, we utilize the support vector machine (SVM) to learn the relationship between the extracted features and the initial probability of being fixated on webpages. In the second stage, we combine the mechanisms of spatial bias and inhibition of return (IoR) with saliency estimation to predict human saccadic scanpath on each webpage image. In summary, the main contributions are three-fold.
We investigate the dynamic properties of human attention in the free viewing of webpages. Combined with existing research on webpage saliency, this work helps establish a complete visual attention model for webpage viewing.
We construct a more effective feature description for saliency estimation. We employ deep networks to extract deeper and more abstract representations from the input. Then, by learning from eye-tracking data, we can more accurately model the relationship between multilevel features and saliency.
We model the spatial influence on scanpath generation, which yields scanpaths more consistent with human eye movements. Experimental results show that spatial bias is beneficial to saccade prediction for webpages. By combining the influence of saliency, spatial bias, and IoR, the final model outperforms the state-of-the-art saccadic methods.
The remainder of the paper is organized as follows: Section II reviews related work on visual attention modeling. Section III describes the proposed saccadic model for webpages. The performances of models are evaluated in Section IV, and the conclusions are drawn in Section V.
Related Work
A. Visual Attention Models on Natural Images
Most existing visual attention models have originated from Treisman and Gelade’s Feature Integration Theory (FIT) [8]. It suggests that different visual features are combined in parallel to affect the human attention process. Based on their study, Koch and Ullman [9] fused the influence of different features into a two-dimensional topographic map called “saliency map” to represent the conspicuity of each region in an image. Furthermore, Itti et al. [2] completely implemented the framework in [9] and built the classic bottom-up computational approach of attention. They proposed the center-surround (C-S) assumption and estimated saliency according to the feature contrast between the central pixel and the average of neighborhood pixels. Based on this milestone, predicting the saliency maps consistent with the eye movements of humans has become the primary task in attention modeling.
To obtain a more accurate saliency map, in the past two decades, a large number of models have emerged to improve Itti et al.’s model from distinct perspectives [2]. Firstly, the pixel-wise comparison has been replaced by patch-based feature difference which takes context information into account. For instance, Bruce and Tsotsos [10] determined saliency by comparing the independent components of central and surrounding patches. Borji [11] used the space-weighted feature dissimilarity between the center patch and other surrounding patches to represent local saliency. Han et al. [12] adopted the patches from image boundary to model background and calculated saliency by the reconstruction residual of a background-based network.
Secondly, the area for C-S comparison has expanded from local neighborhoods to nonlocal or global regions. As stated in previous research [13], the local comparison is insufficient when a region has low C-S contrast but the entire local region is globally rare. Therefore, models have begun to integrate more nonlocal information into C-S comparison. For instance, Xia et al. [14] described the C-S contrast based on an autoencoder network learned from each scene globally. Wang et al. [15] extended the range of context to a corpus of similar images to stress the regions deviating from traditional notions.
Thirdly, learning has been introduced in the calculation. On the one hand, methods have used learning as methodologies for building features in C-S comparison to enhance the generalization ability of models. For instance, Borji and Itti [13] learned V1-like features from natural scenes to compute local and global rarity. Vig et al. [16] searched for the optimal blend of features from a hierarchical model family. On the other hand, models have learned the relationship between features and saliency values directly. For instance, Shen and Zhao [17] adopted a linear classifier to learn the inference from extracted features to saliency. Wang and Shen [18] applied an end-to-end CNN to predict multilevel saliency from the input of an image.
Besides the research on saliency estimation for predicting human fixations, another branch of saliency calculation has emerged to detect object-level salient areas, which has achieved a wide range of applications. Cheng et al. [19] first segmented each scene into regions and computed saliency based on the comparison between regions. To use information beyond the current image for saliency estimation, Wang et al. [20] calculated the saliency of each image by warping the annotations of similar scenes. For video salient object detection, Wang et al. [21] computed static and dynamic saliency by utilizing CNNs to learn the inference from input to labeled salient areas. In [22], Wang et al. calculated object-aware saliency to generate a spatiotemporal saliency prior for video object segmentation. In [23], Wang et al. also applied a CNN to predict the attention bounding box for each image. Then, they integrated an aesthetics-assessment-based network to select from the attention-based candidate windows for photo cropping. In [24], Wang et al. estimated stereoscopic saliency based on disparity and edge cues, and used saliency to guide stereoscopic thumbnail generation.
In the development of visual attention, saliency estimation has always been a research topic of great interest. With a large amount of research effort, the last decade has witnessed significant improvements in the prediction of saliency maps. However, a saliency map does not contain any dynamic information. Therefore, it cannot wholly account for actual human eye movements. To investigate the dynamic properties in visual attention, saccadic models have emerged to predict human saccadic scanpaths. The earliest research can be traced back to Itti et al.’s model [2], which employed winner-take-all (WTA) and IoR on each saliency map to generate a sequence of successive fixations. Besides, Lee and Yu [25] interpreted saccadic eye movements under the framework of information maximization. They iteratively selected the locations of the maximum complexity in responses for scanpath generation. Similarly, Wang et al. [26] directed the next fixation to the maximum on a residual perceptual information map measured by Site Entropy Rate.
In recent years, another tendency in saccade prediction is to regard saccade as a Markov process, with the next fixation determined by the maximum transition probabilities calculated based on the last fixation. Liu et al. [27] calculated the transition probabilities based on low-level saliency and the semantic content modeled by a Hidden Markov Model. Le Meur and Liu [28] combined saliency, oculomotor biases, and memory effect for estimating the transition probabilities. To further model the effect of stimuli and spatial location, Le Meur and Coutrot [29] extended the previous work [28] by training the model with distinct image categories and spatial locations. Also inspired by [28], Wu and Chen [30] introduced visual memory and combined it with oculomotor bias and IoR in the calculation of transition probabilities for gaze shifts.
B. Visual Attention Models on Webpages
With the popularity of the Internet and the rapid development of big data, webpages have become one of the most important channels for humans to acquire information from the external environment. Due to the ubiquitous webpage data, it is necessary to understand how humans deploy their attention on webpages in free-viewing tasks. To this end, multiple methods have been proposed to investigate the attention process on webpages.
Shen and Zhao [3] pioneered this direction and proposed an early saliency model for webpages. Firstly, they analyzed the features and mechanisms underlying webpage saliency. Then, they constructed the first eye-tracking data set of webpages by collecting the eye movements of 11 observers on 149 webpages. Finally, they learned from the data set via multiple kernel learning to obtain a model based on distinct features and positional bias. In [31], they extended the study of [3] by generating high-level representations from CNNs. Besides, Li et al. [5] improved Shen and Zhao's work [3] from two perspectives. For one thing, they introduced subband features to complement the features calculated in the spatial domain. For another, they detected object blobs in webpages to further enhance the performance of the model.
Besides webpage saliency under free-viewing conditions, in recent years, researchers have begun to model task-driven attention on webpages to investigate the effect of targets on human eye movements. For instance, Zheng et al. [4] proposed an end-to-end learning framework to estimate task-driven webpage saliency. They first constructed an eye-tracking data set with stimuli from six categories. Then, they separately calculated task-specific and task-free fixation maps by learning from the data set. Finally, they combined the effects of the two components additively to derive the final saliency map.
In summary, previous research on attention modeling for webpages has mainly focused on the estimation of webpage saliency [32]. However, these studies cannot interpret the temporal sequence of fixations, which is valuable for understanding actual human attention during visual exploration. For this reason, we present a saccadic model for webpages in this study to predict the shifts between fixations and to generate saccadic scanpaths on distinct webpage images.
Saccadic Model for Webpages
In this section, we describe the framework of the proposed saccadic model for webpages. The overall flowchart of the model is presented in Fig. 2. As is shown in the figure, the model consists of two main stages. In the first stage, we calculate a saliency map based on multilevel features to estimate the initial distribution of fixations. In the second stage, we combine the mechanisms of spatial bias and IoR with saliency estimation to iteratively predict fixations for saccadic scanpath generation. Besides, we have shown the overall algorithm in Algorithm 1 to provide more algorithmic details.
Algorithm 1 The Proposed Model of Predicting the Sequence of Saccadic Points
Input: Eye-tracking data
Output: A set of saccadic points
Stage 1: Train the SVM model
1. Select positive and negative samples according to the recorded data.
2. Extract multilevel features for each sample according to III-A.
3. Train the SVM to obtain its parameters.
Stage 2: Predict the scanpath on a given image
4. Extract multilevel features.
5. Compute the saliency value S(x).
6. while fewer than ten fixations have been generated do
7.    Compute the top-left bias map μ(x).
8.    Compute the oculomotor bias map ρ_t(x).
9.    Generate the spatial bias W_t(x).
10.   Compute the IoR map I_t(x).
11.   Estimate the integrated map F(x).
12.   Select the location of the maximum as the next fixation q_t.
13. end while
Framework of the proposed saccadic model for webpages. Given a webpage image, we predict the saccadic scanpath consisting of ten fixations according to the image content.
A. Saliency Estimation With Feature Fusion
Results from perceptual research [28], [33] have indicated that saliency is an influential factor in guiding saccadic eye navigation. The salient regions with rare visual information usually present a high probability of being fixated. Therefore, we first compute the saliency map of each webpage image to estimate the initial probability of the gaze shift.
In saliency estimation, the core of our model is to learn the relationship between visual representation and saliency value from human eye-tracking data. We first construct the description of each region by extracting multilevel features. Then, we choose positive and negative samples separately from the most salient and nonsalient areas to train the parameters of mapping. Finally, based on the learned model, we calculate the saliency values of distinct locations to generate the saliency map of each test image.
To make a complete description of scenes, we use the features from different levels to represent each pixel. As is shown in Fig. 3, we first extract six low-level features to generate a physiologically plausible representation.
Block diagram of saliency estimation for webpages. We first extract a set of visual features from training images. Then, we choose positive and negative samples separately from the most salient (top 20% of the human heatmap) and nonsalient areas (bottom 30% of the human heatmap) to train the parameters of SVM. Finally, we use the learned model to predict the saliency map of a test image.
1) Subbands of the Steerable Pyramid
The pyramid subbands in four orientations and three scales.
2) 3-Channel Color
The red, green, and blue color of each pixel.
3) Probability of Color
The probability of the pixel’s value in the corresponding color channel.
4) ITTI Model Features
The conspicuity under intensity, color, and orientation in Itti et al.’s model [2], which is computed by across-scale C-S contrast.
Concretely, we resize the image to the resolution of 200 pixels and calculate the across-scale C-S contrasts for intensity, color, and orientation as:
\begin{align*} \mathcal {I}(c, s)=&|I(c)\ominus I(s)|, \tag{1}\\ \mathcal {RG}(c, s)=&|(R(c) - G(c)) \ominus (G(s) - R(s))|, \\ \mathcal {BY}(c, s)=&|(B(c) - Y(c)) \ominus (Y(s) - B(s))|, \tag{2}\\ \mathcal {O}(c, s, \theta)=&|O(c, \theta) \ominus O(s, \theta)|,\tag{3}\end{align*}
where $c$ and $s$ denote the center and surround scales, $\ominus$ is the across-scale difference operator, $I$ is the intensity channel, $R$, $G$, $B$, and $Y$ are the broadly tuned color channels, and $O(\cdot, \theta)$ is the orientation channel at angle $\theta$.
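For illustration, a minimal sketch of the intensity contrast in (1) is given below; the pyramid depth, the specific (c, s) scale pair, and the use of OpenCV are our assumptions rather than the exact configuration of [2], which combines several scale pairs per channel.

```python
import cv2
import numpy as np

def intensity_cs_contrast(image, center=2, surround=5):
    """Across-scale center-surround contrast |I(c) - I(s)| of (1).

    The pyramid levels and bilinear upsampling are illustrative
    assumptions; Itti et al. [2] combine several (c, s) pairs.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    # Build a Gaussian pyramid up to the surround scale.
    pyramid = [gray]
    for _ in range(surround):
        pyramid.append(cv2.pyrDown(pyramid[-1]))
    # Upsample the surround level to the center level's size and subtract.
    h, w = pyramid[center].shape
    surround_up = cv2.resize(pyramid[surround], (w, h),
                             interpolation=cv2.INTER_LINEAR)
    return np.abs(pyramid[center] - surround_up)
```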
5) 3D Color Histograms of Filtered Image
The probability of each color according to 3D color histograms of the image filtered with a median filter under five scales.
6) Torralba Saliency
The saliency value calculated by [34], which defines saliency as the difference between the target velocity and the average of distractors.
7) Distance to Center
For mid-level features, we compute the distance to the center of images because important figures tend to be placed in the center of webpages [35].
8) Semantic Hashing Code
Semantic hashing, an effective means of finding a compressed representation of high-dimensional input data, has been successfully applied in multiple fields. In essence, semantic hashing converts raw image data into short binary codes. These codes organize the data in a memory space where nearby addresses store pointers to semantically similar objects [36]. To obtain the binary code, an autoencoder network as in Fig. 4 is used for feature learning. As can be seen from the figure, the network is a reconstruction network with a symmetrical encoder and decoder. The input first passes through the encoder, whose number of units gradually decreases, to obtain a short binary code. Then, the decoder generates the output based on the binary code in the central layer.
Autoencoder for extracting semantic hashing code. The network consists of symmetrical encoder and decoder.
For high-level features, we combine the methodologies of semantic hashing, multiresolution CNN, and object detection for feature extraction.
Concretely, as is shown in Fig. 4, given an input vector $\boldsymbol{y}$, the first encoder layer computes \begin{equation*} \boldsymbol {e}_{1} = \mathrm {sigmoid}(\boldsymbol {W}_{1} \boldsymbol {y}+ \boldsymbol {b}_{1}),\tag{4}\end{equation*}
where $\boldsymbol{W}_{1}$ and $\boldsymbol{b}_{1}$ are the weights and bias of the layer. The subsequent encoder layers are computed in the same manner until the central layer outputs the binary code $\boldsymbol{c}$. Symmetrically, the first decoder layer computes \begin{equation*} \boldsymbol {d}_{1} = \mathrm {sigmoid}(\bar { \boldsymbol {W}}_{1} \boldsymbol {c}+\bar { \boldsymbol {b}}_{1}),\tag{5}\end{equation*}
and the remaining decoder layers reconstruct the output $\bar{\boldsymbol{y}}$. The reconstruction error of each sample is measured by the cross-entropy loss \begin{equation*} L(\boldsymbol {y},\bar { \boldsymbol {y}})=- \sum \nolimits _{i} \big (y_{i} \log \bar {y}_{i} + \left ({1 - y_{i} }\right)\log \left ({1 - \bar {y}_{i} }\right) \big),\tag{6}\end{equation*}
and the network parameters $\boldsymbol{\theta}$ are optimized by minimizing the total loss over the training set $T$: \begin{equation*} L_{AE}(\boldsymbol {\theta }) = \sum _{ \boldsymbol {y}\in T} L(\boldsymbol {y},\bar { \boldsymbol {y}}).\tag{7}\end{equation*}
To train the network, we sample patches from the webpage images to form the training set $T$; the learned encoder then produces the semantic hashing code of each patch as a high-level feature.
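As a concrete reference, a minimal PyTorch sketch of such a symmetric autoencoder might look as follows; the layer widths, the code length, and the 0.5 binarization threshold are illustrative assumptions, since the text only specifies the structure of Fig. 4.

```python
import torch
import torch.nn as nn

class SemanticHashingAE(nn.Module):
    """Symmetric autoencoder for semantic hashing (cf. Fig. 4).

    Layer widths and code length are illustrative assumptions.
    """
    def __init__(self, in_dim=1024, code_dim=128):
        super().__init__()
        # Encoder with a gradually decreasing number of units.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 512), nn.Sigmoid(),   # e1 = sigmoid(W1 y + b1), as in (4)
            nn.Linear(512, 256), nn.Sigmoid(),
            nn.Linear(256, code_dim), nn.Sigmoid(),  # central layer -> code c
        )
        # Mirror-image decoder reconstructing the input, as in (5).
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 256), nn.Sigmoid(),
            nn.Linear(256, 512), nn.Sigmoid(),
            nn.Linear(512, in_dim), nn.Sigmoid(),
        )

    def forward(self, y):
        c = self.encoder(y)
        return self.decoder(c), c

model = SemanticHashingAE()
criterion = nn.BCELoss(reduction='sum')  # cross-entropy loss of (6)-(7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(batch):
    # batch: (N, in_dim) tensor with values in [0, 1].
    recon, _ = model(batch)
    loss = criterion(recon, batch)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

def hash_code(y, threshold=0.5):
    # Binarize the central-layer activations to obtain the hashing code.
    with torch.no_grad():
        return (model.encoder(y) > threshold).float()
```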
9) Multiresolution CNN Features
Deep learning has played an important part in the progress of saliency research. CNNs, which are inspired by the function of visual cells, can capture semantic features of data in a hierarchical way [18], [38]. Therefore, in this study, we also extract CNN features to complement the hierarchically semantic information of images. Concretely, we adopt the trained network in [38] for feature learning. Given a pixel, we feed its surrounding image regions at multiple resolutions into the network and take the resulting responses as its multiresolution CNN features.
10) Object Detection
The analyses of eye movements on webpages [3] have shown that humans tend to pay more attention to regions containing faces and persons. Therefore, we generate binary maps of objects to obtain a robust description of objects under different scenarios. Firstly, we derive the maps of cars and persons with the car and person detectors implemented in Felzenszwalb et al.'s model [39]. Then, we obtain the maps of faces with Viola and Jones's face detector [40].
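As one concrete possibility, the face map can be produced with OpenCV's Haar-cascade implementation of the Viola-Jones detector; the cascade file and the rectangle-filling construction below are our assumptions for illustration.

```python
import cv2
import numpy as np

def face_binary_map(image):
    """Binary map marking detected face regions (Viola-Jones via OpenCV)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Standard frontal-face Haar cascade shipped with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    face_map = np.zeros(gray.shape, dtype=np.float32)
    for (x, y, w, h) in faces:
        face_map[y:y + h, x:x + w] = 1.0  # mark each detected face region
    return face_map
```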
Based on feature extraction, the next step is to construct the relationship between multilevel features and saliency values for estimating the saliency of test images. To solve this problem, we use an SVM to learn from human eye-tracking data on webpages. Firstly, we randomly choose half of the data set for training, with the rest as test images. Then, for each training image, we sample ten positive points from the top 20% of the fixation density map and ten negative points from the bottom 30%. As a result, for each training sample, we can extract the pair of multilevel features and the corresponding label (1 for positive samples and 0 for negative samples). Finally, we use all pairs of sampled data to train the SVM. In testing, we adopt a method similar to regression to estimate the continuously changing saliency value $\boldsymbol{S}(\boldsymbol{x})$ of each location, which forms the saliency map of the test image.
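This sampling-and-learning step could be sketched as follows; here training_images is an assumed iterable of precomputed per-pixel features and fixation density maps, and using scikit-learn's SVC decision values as the continuous saliency score is our illustrative choice rather than the paper's exact solver.

```python
import numpy as np
from sklearn.svm import SVC

def sample_points(density_map, low, high, n=10):
    """Sample n pixel indices whose density values fall in the [low, high] percentile band."""
    lo, hi = np.percentile(density_map, [low, high])
    candidates = np.flatnonzero((density_map >= lo) & (density_map <= hi))
    return np.random.choice(candidates, size=n, replace=False)

X, y = [], []
for features, density in training_images:  # features: (H*W, D), density: flat (H*W,)
    pos = sample_points(density, 80, 100)  # top 20% of the fixation density map
    neg = sample_points(density, 0, 30)    # bottom 30%
    X.append(features[np.concatenate([pos, neg])])
    y.append(np.concatenate([np.ones(len(pos)), np.zeros(len(neg))]))

svm = SVC(kernel='rbf').fit(np.vstack(X), np.concatenate(y))

def saliency_map(features, shape):
    # Signed distance to the decision boundary as a continuous saliency value.
    s = svm.decision_function(features).reshape(shape)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)  # normalize to [0, 1]
```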
B. Saccadic Scanpath Generation
Saliency estimation can provide an initial probability of being fixated for each pixel to guide the selection of fixation at each moment. However, besides saliency, there are other factors influencing saccade. One critical factor is spatial bias. It means that human eye movements exhibit certain spatial biases on webpages because of the unique properties of webpage images. Therefore, in this subsection, we first combine the spatial influence from distinct perspectives to model the spatial bias for the determination of saccade.
The first is the top-left bias. As is shown in Fig. 5, there is a tendency to make saccades towards the top-left regions. Furthermore, previous research has demonstrated that the top-left bias mainly appears in the initial fixations of scanpaths [17]. The main reason is that the upper-left corner of images usually contains important information about the scene, such as headlines. Observers tend to pay more attention to the top-left regions at the initial moment, when they try to obtain a general understanding of the scene. To model the spatial influence of the top-left bias at the initial stages, we generate a top-left bias map that introduces a pull towards top-left regions in the saccade prediction of the first four fixations, which is calculated as: \begin{equation*} \mu (\boldsymbol {x}) \propto 1-\bar {d}(\boldsymbol {x}, \boldsymbol {x}_{tp}),\tag{8}\end{equation*}
where $\boldsymbol{x}_{tp}$ denotes the top-left position of the image and $\bar{d}(\cdot, \cdot)$ is the normalized spatial distance between two positions.
Top-left bias of fixations on webpages. (a) Fixation density map generated by convolving all fixations in the data set [17] with a 2D Gaussian function. (b) Map of top-left bias.
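A direct numpy rendering of (8) could look as follows; normalizing the distance by the image diagonal is our assumption for making the distance term dimensionless.

```python
import numpy as np

def top_left_bias(h, w):
    """Top-left bias map of (8): closer to the top-left corner -> larger weight."""
    ys, xs = np.mgrid[0:h, 0:w]
    d = np.sqrt(xs**2 + ys**2) / np.sqrt(h**2 + w**2)  # normalized distance to (0, 0)
    mu = 1.0 - d
    return mu / mu.sum()  # normalized as in (10)
```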
The second is the oculomotor bias, represented by saccade amplitude and saccade orientation. To investigate the oculomotor bias in free-viewing webpages, we analyze all the eye movement data of the eye-tracking database [3]. As is shown in Fig. 6, we plot the two-dimensional histogram of saccade amplitude and saccade orientation. Concretely, for saccade amplitude, we calculate the spatial distance between every two successive fixations. For saccade orientation, we compute the absolute value of the angle between the X-axis and each gaze shift. Thus, a value close to 0° indicates that the saccade tends to be horizontal, whereas a value close to 90° corresponds to a vertical saccade.
Oculomotor bias of saccade on webpages. Saccade amplitude is the spatial distance between two adjacent fixations. Saccade orientation is the absolute value of the angle with the X-axis. 0° is a horizontal saccade while 90° is a vertical saccade.
As can be seen from Fig. 6, observers tend to generate short and horizontal saccades when free-viewing webpages. From another perspective, the oculomotor bias demonstrates that the determination of the current fixation is influenced by the position of the last fixation [28]. Therefore, we define an oculomotor bias map that depends on the last fixation to enhance the probability of short and horizontal saccades. To be specific, we generate the oculomotor bias map with a two-dimensional Gaussian function: \begin{equation*} \rho _{t}(\boldsymbol {x}) \propto \exp \left[{-\left ({\frac {(x-x_{t-1})^{2}}{2\sigma _{x}^{2}}+\frac {(y-y_{t-1})^{2}}{2\sigma _{y}^{2}}}\right)}\right],\tag{9}\end{equation*}
where $(x_{t-1}, y_{t-1})$ is the position of the last fixation, and $\sigma_{x}$ and $\sigma_{y}$ control the horizontal and vertical spread of the bias; choosing $\sigma_{x}$ larger than $\sigma_{y}$ favors short and horizontal saccades.
Finally, we integrate the top-left bias and the oculomotor bias into the spatial bias map $\boldsymbol{W}_{t}(\boldsymbol{x})$: \begin{equation*} \boldsymbol{W}_{t}(\boldsymbol{x}) = \begin{cases} \dfrac{\mu(\boldsymbol{x})}{\sum_{\boldsymbol{x}} \mu(\boldsymbol{x})} \dfrac{\rho_{t}(\boldsymbol{x})}{\sum_{\boldsymbol{x}} \rho_{t}(\boldsymbol{x})} & \text{if } t \leqslant 4 \\ \dfrac{\rho_{t}(\boldsymbol{x})}{\sum_{\boldsymbol{x}} \rho_{t}(\boldsymbol{x})} & \text{otherwise,} \end{cases}\tag{10}\end{equation*}
so that the top-left bias only acts on the first four fixations, as discussed above.
Besides spatial bias, the IoR mechanism is another important factor in dynamic attention [2], [28]. Its essence is to prevent subsequent shifts from returning to previously attended regions for a period of time. Previous research has shown that each saccade takes 30-70 ms, while the inhibition of each local region lasts approximately 500-900 ms [2]. This means that during the generation of ten fixations, every previously fixated region remains inhibited. To model the IoR mechanism on saccade, we calculate the IoR map $\boldsymbol{I}_{t}(\boldsymbol{x})$ as: \begin{equation*} \boldsymbol{I}_{t}(\boldsymbol{x}) = \begin{cases} 0 & \text{if } d\left(\boldsymbol{x}, \boldsymbol{q}_{t-1}\right) \leqslant r \\ \boldsymbol{I}_{t-1}(\boldsymbol{x}) & \text{otherwise,} \end{cases}\tag{11}\end{equation*}
where $\boldsymbol{q}_{t-1}$ is the last fixation, $d(\cdot, \cdot)$ is the spatial distance, and $r$ is the radius of the inhibited region; $\boldsymbol{I}_{0}(\boldsymbol{x})$ is initialized to one everywhere.
After modeling the multiple factors, we integrate the influence of saliency, spatial bias, and IoR into an integrated map $\boldsymbol{F}(\boldsymbol{x})$: \begin{equation*} \boldsymbol {F}(\boldsymbol {x}) = \boldsymbol {S}(\boldsymbol {x}) \boldsymbol {W}_{t}(\boldsymbol {x}) \boldsymbol {I}_{t}(\boldsymbol {x}).\tag{12}\end{equation*}
Then, the fixation at time $t$ is selected as the location of the maximum on the integrated map: \begin{equation*} \boldsymbol {q}_{t} = \arg \max _{ \boldsymbol {x}} \boldsymbol {F}(\boldsymbol {x}).\tag{13}\end{equation*}
By iterating this selection with the updated spatial bias and IoR maps, we generate the complete saccadic scanpath.
Generation of a sequence of fixations by integrating the influence of saliency, spatial bias, and IoR.
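Putting the pieces together, the following sketch iterates (9)-(13) to produce a ten-fixation scanpath; the Gaussian widths, the inhibition radius, and the centered starting point are illustrative assumptions, and top_left_bias refers to the earlier sketch of (8).

```python
import numpy as np

def predict_scanpath(S, n_fix=10, sigma_x=None, sigma_y=None, r=30):
    """Generate a scanpath from a saliency map S by iterating (9)-(13)."""
    h, w = S.shape
    sigma_x = sigma_x or w / 6.0   # wider horizontally: favors horizontal saccades
    sigma_y = sigma_y or h / 12.0
    ys, xs = np.mgrid[0:h, 0:w]
    mu = top_left_bias(h, w)                 # top-left bias map of (8)
    I = np.ones((h, w))                      # IoR map, initially uninhibited
    fixations = [(h // 2, w // 2)]           # assumed start: image center
    for t in range(1, n_fix + 1):
        y0, x0 = fixations[-1]
        # Oculomotor bias of (9): short saccades, horizontal preferred.
        rho = np.exp(-((xs - x0) ** 2 / (2 * sigma_x ** 2) +
                       (ys - y0) ** 2 / (2 * sigma_y ** 2)))
        rho /= rho.sum()
        # Spatial bias of (10): include the top-left pull only for t <= 4.
        W = mu * rho if t <= 4 else rho
        # IoR update of (11): inhibit a disc of radius r around the last fixation.
        I[(xs - x0) ** 2 + (ys - y0) ** 2 <= r ** 2] = 0.0
        # Integrated map (12) and maximum selection (13).
        F = S * W * I
        fixations.append(np.unravel_index(np.argmax(F), F.shape))
    return fixations[1:]
```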
Experimental Results
In this section, firstly, we describe the comparison models and implementation details in IV-A. Then, we introduce three benchmark evaluation metrics for saccadic scanpath prediction in IV-B. Finally, we evaluate different combinations of saccadic factors and compare our predicted scanpaths on webpages with the state-of-the-art models in IV-C.
A. Experimental Settings
1) Data Set
In this study, we adopted the eye-tracking data set of webpages constructed by Shen and Zhao [3] for the evaluation of saccadic models. It includes 149 webpage images of three categories, namely pictorial, text, and mixed. Based on these images, they recorded the eye movements of 11 subjects aged from 21 to 25. Subjects were positioned at a distance of 0.6 m from the screen during the recording.
2) Computational Saccadic Models
To verify the validity of the proposed saccadic model, we compared its predictions with the results of the state-of-the-art saccadic models. Firstly, we compared the proposed model with Itti et al.'s attention model [2], which applies WTA and IoR on saliency maps for scanpath generation. Secondly, we compared it with Boccignone and Ferraro's model [41] of Constrained Levy Exploration (CLE). Thirdly, we compared it with Le Meur and Liu's saccadic model [28], which combines saliency estimation and oculomotor bias for scanpath prediction. Fourthly, we compared it with Xia et al.'s iterative representation learning (IRL) model [7], which outputs scanpaths based on a self-taught learning framework.
Moreover, human performance should serve as one of the benchmarks for saccadic models [7]. Therefore, we also evaluated the inter-observer performance. Concretely, given an evaluation metric, we first evaluated each human scanpath by taking the other human scanpaths on the same image as the ground truth. Then, we averaged the results across all human scanpaths and all images to obtain the inter-observer performance, denoted "IO".
3) Implementation Details
To obtain the semantic hashing code, we first extracted a large set of patches from the training webpages and used them to train the autoencoder described in III-A.
B. Evaluation Metrics
Unlike saliency estimation, there has been a lack of consensus about the evaluation metrics for scanpath prediction [7]. Therefore, to fairly compare models, we adopted multiple saccadic metrics to evaluate the performance of models from different perspectives.
1) Time-Delay Embedding (TDE)
TDE is a piece-based metric. It was first introduced to the evaluation of saccadic scanpath prediction by Wang et al. in a work on scanpath prediction for natural images [26]. Its calculation consists of three main steps. Firstly, we divided the predicted scanpath and all human scanpaths into saccadic pieces of length $k$.
Then, for each saccadic piece $\boldsymbol{C}^{k}_{p}(t)$ on the predicted scanpath, we calculated its minimal distance to the human saccadic pieces: \begin{equation*} d_{k}(t) = \min _{i,\tau } \left \|{ \boldsymbol {C}^{k}_{p}(t)- \boldsymbol {C}^{k}_{h_{i}}(\tau) }\right \|_{2}/k, \quad \boldsymbol {C}^{k}_{h_{i}}(\tau)\in \boldsymbol {S}^{k}_{h},\tag{14}\end{equation*}
where $\boldsymbol{S}^{k}_{h}$ denotes the set of all saccadic pieces of length $k$ on the human scanpaths.
Finally, we adopted Hausdorff distance (HD) and mean minimal distance (MMD) to measure the total similarity between the pair of predicted and actual human scanpaths. Concretely, HD was defined as the maximum among the distances of all the pieces on the predicted scanpath: \begin{equation*} D_{k}^{1} = \max _{t} d_{k}(t).\tag{15}\end{equation*}
MMD was defined as the average distance across all pieces on the predicted scanpath: \begin{equation*} D_{k}^{2} = \frac {1}{n_{k}}\sum _{t=1}^{n_{k}} d_{k}(t),\tag{16}\end{equation*}
where $n_{k}$ is the number of saccadic pieces on the predicted scanpath.
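A compact implementation of (14)-(16) might read as follows; representing scanpaths as (N, 2) arrays of fixation coordinates and extracting pieces with a stride of one fixation are our assumptions.

```python
import numpy as np

def pieces(path, k):
    """All length-k saccadic pieces of a scanpath, flattened to 2k-dim vectors."""
    return np.stack([path[i:i + k].ravel() for i in range(len(path) - k + 1)])

def tde(pred, humans, k):
    """HD (15) and MMD (16) between a predicted scanpath and human scanpaths."""
    human_pieces = np.vstack([pieces(h, k) for h in humans])  # the set S_h^k
    d = []
    for piece in pieces(pred, k):
        # Minimal distance of this predicted piece to all human pieces, as in (14).
        d.append(np.linalg.norm(human_pieces - piece, axis=1).min() / k)
    d = np.array(d)
    return d.max(), d.mean()  # Hausdorff distance, mean minimal distance
```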
2) Sequence Score (SS)
SS is a string-based metric. It was proposed by Borji et al. [42] to convert scanpaths into strings for comparison. Its calculation consists of three main steps. Firstly, we utilized mean-shift to cluster the fixations, which integrates global information of the scene into the evaluation better than classifying fixations according to grids. Then, we assigned a character to each cluster and converted the scanpaths into strings by determining the cluster of each fixation. Finally, we adopted the Needleman-Wunsch algorithm to calculate the similarity between each pair of predicted and human scanpaths, as sketched below. Based on the pair-wise similarity calculation, we can derive the total measurement of SS by averaging the results across all human scanpaths on each image and all images in the data set. Besides, similar to TDE [26], we varied the length of the scanpaths for comparison from 1 to 6 by truncating subsequent fixations, to evaluate the model at different stages of scanpath generation.
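For reference, a minimal Needleman-Wunsch scorer over two cluster strings could be written as below; the match/mismatch/gap scores and the normalization by the shorter string's length are illustrative assumptions rather than the exact settings of [42].

```python
import numpy as np

def sequence_score(a, b, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch alignment score between two cluster strings, normalized."""
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1))
    H[:, 0] = np.arange(n + 1) * gap
    H[0, :] = np.arange(m + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i, j] = max(diag, H[i - 1, j] + gap, H[i, j - 1] + gap)
    return H[n, m] / min(n, m)  # fraction of aligned matching fixations
```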
3) Multimatch (MM)
MM is a vector-based metric. It was proposed by Jarodzka and Holmqvist [43] and has been widely used in recent challenges of saccadic scanpath prediction [44]. Its calculation also includes three main steps. Firstly, we encoded the scanpaths into geometrical vectors. Each saccade represents the shift between two fixations. By taking fixations as points in a two-dimensional coordinate system, saccades can be regarded as vectors for dissimilarity comparison. Concretely, for two scanpaths, we aligned their saccade vectors and measured the dissimilarity along dimensions including position, length, and direction, then averaged the results across all pairs of predicted and human scanpaths.
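As an illustration of the vector-based comparison, the sketch below computes the length and direction dissimilarity of two aligned saccade vectors; the normalization choices are assumptions, and the full MultiMatch measure additionally aligns scanpaths and compares shape and position, which we omit for brevity.

```python
import numpy as np

def saccade_vectors(path):
    """Saccade vectors between successive fixations of a scanpath (N, 2)."""
    return np.diff(np.asarray(path, dtype=float), axis=0)

def length_direction_dissimilarity(u, v, diag):
    """Length and direction dissimilarity of two aligned saccade vectors.

    diag: image diagonal used to normalize lengths to [0, 1].
    """
    len_diff = abs(np.linalg.norm(u) - np.linalg.norm(v)) / diag
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    ang_diff = np.arccos(np.clip(cos, -1.0, 1.0)) / np.pi  # normalized angle
    return len_diff, ang_diff
```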
C. Comparison of Scanpath Prediction
In this subsection, we first evaluated different combinations of saccadic factors based on the metrics in IV-B. We constructed the model "w/o Saliency", which predicts saccadic scanpaths from random initial positions using only the modeling of spatial bias and IoR. Then, we generated the model "w/o Spatial Bias", which iteratively outputs fixations based on saliency estimation and IoR. Finally, we compared these models with the proposed model ("Proposed"), which fuses the factors of saliency, spatial bias, and IoR to generate scanpaths on webpages.
The comparison under TDE and SS is shown in Fig. 8, and the results under MM are shown in Table 1. Firstly, the comprehensive comparison across different combinations can show the importance of integrating the influence of saliency and spatial bias in scanpath prediction. Secondly, the comparison under “Position” in Table 1 indicates that saliency estimation helps to provide a more accurate prediction for the position of the fixations. Similarly, the significant improvement made by the proposed model at initial stages (e.g., fixation stage 1 under SS in Fig. 8c) can also demonstrate the effectiveness of estimating the initial distribution of fixations by saliency estimation. In other words, static saliency and dynamic saccade are not independent, but coherent mechanisms which interpret the visual behaviors. Thirdly, by comparing the evaluation under “w/o Spatial Bias” and “Proposed”, we can conclude that modeling spatial bias is an effective means to obtain scanpaths more consistent with human data.
Comparison on different combinations of factors and features under TDE (HD and MMD) and SS. HD and MMD are divergence measures (should be minimized), while SS is a similarity measure (should be maximized).
On the other hand, we conducted an ablation analysis to investigate the influence of the different levels of features. The results under distinct combinations of features are shown in Fig. 8 and Table 1. In the comparison, "L", "M", and "H" refer to the models using low-level, mid-level, and high-level features, respectively. As can be observed from the results, the model "H" outperforms the models "L" and "M", which illustrates the contribution of high-level features. Also, the comparison between "L+M" and the proposed model shows the important role of high-level cues in scanpath prediction. The final model "L+M+H", which integrates multiple levels of features, can comprehensively describe the input and thus outperforms the other combinations.
In the second part, we compared the performance of the state-of-the-art models on the webpage data set [3]. The evaluation under TDE is shown in Figs. 9a and 9b. As can be seen from the results of HD and MMD, the rankings of the models are consistent under different values of $k$.
Comparison under TDE (HD and MMD) and SS on the eye-tracking data set of webpage [3]. “IO” refers to the inter-observer performance.
The evaluation under SS is shown in Fig. 9c. By observing the figure, we can draw the following conclusions. Firstly, the proposed model, built by integrating learning-based saliency and saccadic factors, outperforms the other algorithms at different stages of scanpath generation. Secondly, in the first three stages, the proposed model achieves a better prediction of scanpaths than the "IO" model, which indicates the importance of modeling the top-left bias. In contrast, as the length of the scanpaths increases, the consistency among human data gradually strengthens. Therefore, the predictive models still have room for improvement through better exploration of the mechanisms in long-term eye movements. Thirdly, the advantages of the proposed model over the state-of-the-art ones for natural scenes (e.g., "IRL" [7] and "Le Meur" [28]) further demonstrate that it is necessary to specifically model human saccadic behaviors on webpages. Fourthly, an overall comparison with the SS scores on natural images [7] shows that human eye movements on webpages are less consistent than those on natural images. This is because webpage images usually include multiple figures or objects, so there are more possible saccade orders for extracting information.
The evaluation under MM is shown in Table 2. It can be observed from the table that the proposed model has advantages over the other models. On the one hand, the proposed model achieves a more accurate prediction of the fixated positions by adopting the feature-fusing, learning-based saliency to estimate the initial probability of being fixated. On the other hand, by integrating the mechanisms of spatial bias and IoR, the proposed model also surpasses the others in the properties of "Length" and "Direction". As a result, the predicted scanpaths of the proposed model are more consistent with human scanpaths.
In addition, we also compared the results on each subset: "Pictorial", "Text", and "Mixed". Example images of the three subsets can be seen in Fig. 1. "Pictorial" includes 50 images, each of which has a dominant picture with little text. "Text" includes 50 images with informative text. "Mixed" includes 49 images, each containing both thumbnail pictures and text. The results in Fig. 10 and Table 2 show that the proposed model performs best on the "Pictorial" subset but worst on "Text". This implies that modeling text representation is important, even though text regions can pop out through low-level features such as intensity and orientation. Therefore, determining how to extract text-based features to improve saccade prediction on text images will be a future research direction.
The evaluation under TDE (HD and MMD) and SS on the subsets “Pictorial”, “Text”, and “Mixed” in data set [3].
Besides the quantitative comparison, we also provided a visual comparison of the generated scanpaths in Fig. 11. In the figure, the second column, "Heatmap", is the ground truth of saliency estimation, which shows the distribution of human fixations on each scene. The third column, "Human Scanpaths", displays all human scanpaths on each image in different colors. As can be observed from the figure, the proposed scanpaths are more similar to the human data than those of the other saccadic algorithms. In addition, the comparison between the models with (the proposed method and "Le Meur") and without ("Itti" and "CLE") the modeling of oculomotor bias demonstrates the importance of learning the biases of saccade amplitude and saccade orientation in scanpath generation.
Visual comparison of the scanpaths from distinct models. “Heatmap” refers to the distribution of all human fixations. “Human Scanpaths” shows all human scanpaths on each webpage image. For both human and predicted scanpaths, pentagrams are the starting points.
Conclusion
Despite a large amount of research effort in the last two decades, there are still some limitations in visual attention modeling. For one thing, previous research has usually focused on the saliency estimation of natural scenes; the calculation of webpage saliency remains a challenge. For another, the study of dynamic attention mechanisms is still limited [7].
To address these problems, we have proposed a saccadic model for webpages to investigate human dynamic eye movements in free-viewing webpages. In the first stage, we have estimated the initial distribution of fixations based on multilevel saliency learning. In the second stage, we have combined the factors of saliency, top-left bias, oculomotor bias, and IoR to predict fixations for scanpath generation iteratively. Qualitative and quantitative experimental results have demonstrated the advantages of the proposed model over other state-of-the-art methods.
For future work, we will extend this study from the following perspectives. Firstly, we will explore task-driven dynamic visual attention on webpages. Building on attention modeling under free-viewing conditions, we will investigate the influence of tasks on the generation of saccadic scanpaths. Secondly, besides the dissimilarity of scanpaths induced by different targets, we will also focus on the differences in scanpaths across distinct groups of subjects to address classification problems in multiple applications, such as disease identification and age recognition [45]. Thirdly, we will integrate the methodologies of deep learning into the estimation of saliency and the modeling of saccadic properties. On the one hand, we will build a large-scale eye-tracking data set of webpages to remedy the lack of large-scale webpage-based data sets. On the other hand, we will take advantage of the learning ability of deep models to reveal deeper dynamic attention mechanisms from human eye movements.