
Audio Matters in Video Super-Resolution by Implicit Semantic Guidance



Abstract:

Video super-resolution (VSR) aims to use multiple consecutive low-resolution frames to recover the corresponding high-resolution frames. However, existing VSR methods treat videos only as image sequences, ignoring another essential source of temporal information: audio. In fact, there is a semantic link between audio and vision, and extensive studies have shown that audio can provide supervisory information for visual networks. Meanwhile, adding semantic priors has proven effective in super-resolution (SR) tasks, but a pretrained segmentation network is required to obtain the semantic segmentation maps. By contrast, audio is contained in the video itself and can be used directly. Therefore, in this study, we propose a novel and pluggable multiscale audiovisual fusion (MS-AVF) module that enhances VSR performance by exploiting the relevant audio information, which can be regarded as implicit semantic guidance in contrast to explicit segmentation priors. Specifically, we first fuse audiovisual features on the semantic feature maps of the target frames at different granularities, and then, through a top-down multiscale fusion approach, feed the high-level semantics back to the underlying global visual features layer by layer, thereby providing effective implicit audio semantic guidance for VSR. Experimental results show that audio can further improve VSR performance. Moreover, by visualizing the learned attention masks, we show that the proposed end-to-end model automatically learns latent audiovisual semantic links, in particular improving the accuracy and effectiveness of SR for sound sources and their surrounding regions.
Published in: IEEE Transactions on Multimedia (Volume: 24)
Page(s): 4128 - 4142
Date of Publication: 23 February 2022


I. Introduction

As a classical problem in computer vision and image processing, video super-resolution (VSR), which aims to reconstruct high-resolution (HR) frames from multiple consecutive low-resolution (LR) frames, has been widely employed in applications such as video surveillance, ultra-high-definition television, and video coding. With the development of display devices and the video industry, VSR algorithms have received considerable attention.
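To make the fusion pattern described in the abstract concrete, the following is a minimal PyTorch sketch of the general mechanism: a global audio embedding modulates visual features at several scales through a learned spatial attention mask, and the fused high-level (coarse, semantic) features are propagated top-down into finer scales. The module names, the sigmoid attention mask, the equal channel width across scales, and the 128-D audio embedding are all illustrative assumptions for this sketch, not the authors' actual MS-AVF implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualFusion(nn.Module):
    """Fuse a global audio embedding into one visual feature scale.

    The audio vector is projected to the visual channel width and used,
    together with the visual features, to predict a spatial attention
    mask (a hypothetical reading of the paper's attention-mask idea).
    """

    def __init__(self, channels: int, audio_dim: int = 128):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, channels)
        self.mask_conv = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, vis: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # vis: (B, C, H, W), audio: (B, audio_dim)
        a = self.audio_proj(audio)[:, :, None, None].expand_as(vis)
        mask = torch.sigmoid(self.mask_conv(torch.cat([vis, a], dim=1)))
        # Residual modulation: emphasize regions the audio attends to.
        return vis + vis * mask


class MultiScaleAVF(nn.Module):
    """Top-down multiscale fusion sketch: audio is fused at every scale,
    and fused coarse features are upsampled and added into the next finer
    scale, feeding high-level semantics back layer by layer."""

    def __init__(self, channels: int, num_scales: int = 3, audio_dim: int = 128):
        super().__init__()
        # One fusion block per scale; equal channel width is assumed here
        # for simplicity, which the real model need not share.
        self.fusers = nn.ModuleList(
            AudioVisualFusion(channels, audio_dim) for _ in range(num_scales)
        )

    def forward(self, feats: list, audio: torch.Tensor) -> torch.Tensor:
        # feats: visual feature maps ordered coarse -> fine, each (B, C, h, w).
        top_down = None
        for feat, fuser in zip(feats, self.fusers):
            x = fuser(feat, audio)
            if top_down is not None:
                x = x + F.interpolate(
                    top_down, size=x.shape[-2:],
                    mode="bilinear", align_corners=False,
                )
            top_down = x
        return top_down  # finest-scale features, now audio-guided


# Example usage with three feature scales and a 128-D audio embedding.
msavf = MultiScaleAVF(channels=64, num_scales=3)
feats = [torch.randn(2, 64, s, s) for s in (16, 32, 64)]  # coarse -> fine
audio = torch.randn(2, 128)
out = msavf(feats, audio)
print(out.shape)  # torch.Size([2, 64, 64, 64])

Because the fusion blocks only modulate and merge existing feature maps, a module of this shape can be dropped into an existing VSR backbone between its feature extractor and its upsampling stage, which is consistent with the paper's claim that MS-AVF is pluggable.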
