I. Introduction
Predicting the human gaze from both the visual and audio modalities of 360-degree videos has recently become increasingly important for streaming [1], [2] and rendering [3] 360-degree videos in many practical applications, such as video summarization [4], [5] and virtual reality (VR)/augmented reality (AR) [6], [7]. Most existing saliency prediction models, however, exploit only the visual information and neglect the audio modality. Research in neuroscience reveals that human gaze behavior is controlled by the superior colliculus, which receives both visual and auditory stimuli [8], [9].
On the one hand, the visual modality of a 360-degree video encodes the omnidirectional scene in a spherical signal domain. While watching a 360-degree video through a head-mounted display (HMD), viewers can freely turn their heads to explore any viewing direction on the sphere, thus acquiring an immersive experience. The audio modality, on the other hand, is usually encoded as ambisonics, which approximates the sound pressure field at a single point in space with a spherical harmonic representation (see the expansion below) and can thus indicate the location of a sound source. This spatial audio format also enables viewers to perceive sound from all directions [10]. Whenever viewers move their heads, the sound reaching their ears is adjusted according to the head orientation, which helps them locate the direction of the sound and thus enhances the immersive multimedia experience.
Moreover, spatial audio can alert viewers to new sounding targets outside their current field of view [11], [12], [13], and can therefore yield a different viewpoint distribution even under the same visual stimulus, as shown in Fig. 1. This example suggests that both the visual and audio modalities are important for saliency prediction in 360-degree videos, since viewers tend to concentrate their gaze around the sounding targets. Such behavior, however, cannot be accurately predicted without the audio information.
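For concreteness, one standard way to write the ambisonic encoding is the truncated spherical harmonic expansion below; the symbols $p$, $B_{nm}$, $Y_{nm}$, and the order $N$ are illustrative notation rather than definitions taken from this paper:
\begin{equation}
p(\theta, \phi, t) \approx \sum_{n=0}^{N} \sum_{m=-n}^{n} B_{nm}(t)\, Y_{nm}(\theta, \phi),
\end{equation}
where $Y_{nm}(\theta, \phi)$ is the real spherical harmonic of order $n$ and degree $m$, and $B_{nm}(t)$ is the corresponding ambisonic channel signal. First-order ambisonics ($N = 1$) reduces to the four B-format channels ($W$, $X$, $Y$, $Z$), whose relative amplitudes encode the direction of arrival of each sound source.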