
A High-Performance Accelerator for Super-Resolution Processing on Embedded GPU



Abstract:

Over the past few years, super-resolution (SR) processing has achieved astonishing progress along with the development of deep learning. Nevertheless, the rigorous requirement of real-time inference, especially for video tasks, poses a harsh challenge to both model architecture design and hardware-level implementation. In this article, we propose a full-stack SR deployment framework with hardware-aware acceleration on embedded GPU devices. The most critical stage of the SR flow, in which dictionary learning is applied, is analyzed in detail and optimized with a tailored dictionary slimming strategy. Moreover, we delve into the programming architecture of the hardware while analyzing the model structure, optimizing the computation kernels to reduce inference latency and maximize throughput under restricted computing power. In addition, we further accelerate the model with 8-bit integer inference by quantizing the weights of the compressed model; an adaptive 8-bit quantization flow for the SR task enables the quantized model to achieve results comparable to the full-precision baselines. With these approaches, the computation and communication bottlenecks in deep dictionary learning-based SR models can be overcome effectively. Experiments on both the embedded edge device NVIDIA NX and an NVIDIA 2080Ti GPU show that our framework significantly outperforms the state-of-the-art NVIDIA TensorRT and achieves real-time performance.
Page(s): 3210 - 3223
Date of Publication: 08 February 2023
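
To make the 8-bit integer inference mentioned in the abstract concrete, below is a minimal Python sketch of post-training symmetric per-tensor weight quantization. The function names and the max-absolute-value scale rule are illustrative assumptions, not the paper's adaptive quantization flow.

import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization (illustrative sketch).

    Returns the int8 weights and the scale needed to dequantize:
    w is approximately w_q * scale.
    """
    # Scale chosen so the largest-magnitude weight maps to 127.
    scale = np.abs(w).max() / 127.0
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    # Recover a float approximation of the original weights.
    return w_q.astype(np.float32) * scale

# Example: quantize a conv-shaped weight tensor and measure the error.
w = np.random.randn(64, 3, 3, 3).astype(np.float32)
w_q, s = quantize_weights_int8(w)
print("max abs error:", np.abs(w - dequantize(w_q, s)).max())

An adaptive flow, as the abstract describes, would tune such scales per layer so the quantized SR model stays close to the full-precision baseline.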


I. Introduction

Super-resolution (SR) is an important class of graphical processing techniques in the digital image era. The SR task aims at generating or recovering high-resolution (HR) video frames from low-resolution (LR) frames. Among existing approaches, the naive solution is to interpolate the LR image, computing each output RGB value bilinearly or bicubically from spatially invariant nearest-neighbor pixels, as sketched below. The advancement of deep learning in computer vision has stimulated a group of powerful SR approaches with impressive performance. From conventional convolutional neural networks [2] to novel generative adversarial networks [3], [4], various methods have appeared over the last decade. Recently, by introducing dictionary learning methods with pixel-level local feature fusion operations [5], [6], the quality of generated HR images and videos has been further improved, with richer color and texture details recovered. As the algorithms grow more performant, the efficient, optimized deployment of such deep learning-based SR methods on hardware has gradually become the new focus of attention.
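
As a concrete reference point for the naive interpolation baseline above, the following Python sketch upscales an LR frame with bicubic interpolation. It assumes OpenCV (cv2) is available; the file names are placeholders.

import cv2
import numpy as np

def upscale_bicubic(lr_path: str, scale: int = 4) -> np.ndarray:
    """Upscale an LR image by a fixed factor with bicubic interpolation,
    the spatially invariant baseline that learned SR methods improve on."""
    lr = cv2.imread(lr_path)  # HWC, BGR, uint8
    h, w = lr.shape[:2]
    # Each output pixel is a fixed cubic combination of a 4x4
    # neighborhood of input pixels, regardless of image content.
    return cv2.resize(lr, (w * scale, h * scale),
                      interpolation=cv2.INTER_CUBIC)

# Example usage (placeholder file names):
# hr = upscale_bicubic("frame_lr.png", scale=4)
# cv2.imwrite("frame_sr_bicubic.png", hr)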

References

[1] W. Zhao et al., "A high-performance accelerator for super-resolution processing on embedded GPU," Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), pp. 1-9, 2021.
[2] C. Dong, C. C. Loy, K. He and X. Tang, "Image super-resolution using deep convolutional networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295-307, Feb. 2016.
[3] C. Ledig et al., "Photo-realistic single image super-resolution using a generative adversarial network," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4681-4690, 2017.
[4] S. Y. Kim, J. Oh and M. Kim, "JSI-GAN: GAN-based joint super-resolution and inverse tone-mapping with pixel-wise task-specific filters for UHD HDR video," Proc. AAAI Conf. Artif. Intell., pp. 11287-11295, 2020.
[5] W. Li, X. Tao, T. Guo, L. Qi, J. Lu and J. Jia, "MuCAN: Multi-correspondence aggregation network for video super-resolution," Proc. Eur. Conf. Comput. Vis. (ECCV), pp. 335-351, 2020.
[6] W. Li, K. Zhou, L. Qi, N. Jiang, J. Lu and J. Jia, "LAPAR: Linearly-assembled pixel-adaptive regression network for single image super-resolution and beyond," Proc. Conf. Neural Inf. Process. Syst. (NIPS), pp. 20343-20355, 2020.
[7] C. Hao et al., "FPGA/DNN co-design: An efficient design methodology for IoT intelligence on the edge," Proc. ACM/IEEE Design Autom. Conf. (DAC), pp. 1-6, 2019.
[8] C. Guo et al., "Balancing efficiency and flexibility for DNN acceleration via temporal GPU-systolic array integration," Proc. ACM/IEEE Design Autom. Conf. (DAC), pp. 1-6, 2020.
[9] H. Li, M. Bhargav, P. N. Whatmough and H.-S. P. Wong, "On-chip memory technology design space explorations for mobile deep neural network accelerators," Proc. ACM/IEEE Design Autom. Conf. (DAC), p. 131, 2019.
[10] Q. Sun, C. Bai, H. Geng and B. Yu, "Deep neural network hardware deployment optimization via advanced active learning," Proc. Design Autom. Test Europe Conf. Exhibit. (DATE), pp. 1510-1515, 2021.
[11] X. Wei et al., "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," Proc. ACM/IEEE Design Autom. Conf. (DAC), p. 29, 2017.
[12] R. Pinkham, S. Zeng and Z. Zhang, "QuickNN: Memory and performance optimization of k-d tree based nearest neighbor search for 3D point clouds," Proc. IEEE Int. Symp. High Perform. Comput. Architect. (HPCA), pp. 180-192, 2020.
[13] Y. Bai and W. Wang, "ACPNet: Anchor-center based person network for human pose estimation and instance segmentation," Proc. IEEE Int. Conf. Multimedia Expo (ICME), pp. 1072-1077, 2019.
[14] S. Cao et al., "Efficient and effective sparse LSTM on FPGA with bank-balanced sparsity," Proc. ACM Int. Symp. Field-Program. Gate Arrays (FPGA), pp. 63-72, 2019.
[15] NVIDIA TensorRT, Mar. 2021. [Online]. Available: https://docs.nvidia.com/deeplearning/tensorrt/index.html
[16] Intel MKL-DNN, Mar. 2021. [Online]. Available: https://github.com/oneapi-src/oneDNN
[17] Y. Jung, Y. Choi, J. Sim and L.-S. Kim, "eSRCNN: A framework for optimizing super-resolution tasks on diverse embedded CNN accelerators," Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), pp. 1-8, 2019.
[18] T. Dai, J. Cai, Y. Zhang, S.-T. Xia and L. Zhang, "Second-order attention network for single image super-resolution," Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 11065-11074, 2019.
[19] Z. Luo, Y. Huang, S. Li, L. Wang and T. Tan, "Unfolding the alternating optimization for blind super resolution," Proc. Conf. Neural Inf. Process. Syst. (NIPS), vol. 33, 2020.
[20] J. Yang, Z. Wang, Z. Lin, S. Cohen and T. Huang, "Coupled dictionary training for image super-resolution," IEEE Trans. Image Process., vol. 21, no. 8, pp. 3467-3478, Aug. 2012.
[21] Y. Romano, J. Isidoro and P. Milanfar, "RAISR: Rapid and accurate image super resolution," IEEE Trans. Comput. Imag., vol. 3, no. 1, pp. 110-125, Mar. 2017.
[22] P. Getreuer, I. Garcia-Dorado, J. Isidoro, S. Choi, F. Ong and P. Milanfar, "BLADE: Filter learning for general purpose computational photography," Proc. ICCP, pp. 1-11, 2018.
[23] C. Wang, Z. Li and J. Shi, "Lightweight image super-resolution with adaptive weighted learning network," arXiv:1904.02358, 2019.
[24] J. Cheng, M. Grossman and T. McKercher, Professional CUDA C Programming. New York, NY, USA: Wiley, 2014.
[25] H. Wu, P. Judd, X. Zhang, M. Isaev and P. Micikevicius, "Integer quantization for deep learning inference: Principles and empirical evaluation," arXiv:2004.09602, 2020.
[26] M. Nagel, M. V. Baalen, T. Blankevoort and M. Welling, "Data-free quantization through weight equalization and bias correction," Proc. IEEE/CVF Int. Conf. Comput. Vis., pp. 1325-1334, 2019.
[27] M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos and T. Blankevoort, "Up or down? Adaptive rounding for post-training quantization," Proc. Int. Conf. Mach. Learn., pp. 7197-7206, 2020.
[28] R. Banner, Y. Nahshan and D. Soudry, "Post training 4-bit quantization of convolutional networks for rapid-deployment," Proc. Adv. Neural Inf. Process. Syst., vol. 32, pp. 7948-7956, 2019.
[29] M. Courbariaux, Y. Bengio and J.-P. David, "Training deep neural networks with low precision multiplications," arXiv:1412.7024, 2014.
[30] S. Han, H. Mao and W. J. Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding," arXiv:1510.00149, 2015.