Introduction
In the era of information explosion, video sharing has grown dramatically on social networks. As Cisco forecasts [1], by 2022 there will be approximately 400 exabytes of IP traffic per month, of which 82% will be video traffic. However, videos are inevitably distorted during compression, processing, and transmission, thereby degrading the human visual experience (HVE) [2]. Consequently, to provide a better end-user experience, an accurate VQA approach is in high demand to preserve the quality of service.
Considering the limited time and labor involved, although subjective VQA methods estimate perceived video quality most accurately, they are generally used only to construct benchmark video quality databases. In contrast, objective VQA allows automatic video quality evaluation without enormous resources. The ultimate goal of objective VQA is to evaluate perceptual quality in a way that is highly correlated with subjective studies, so it has recently become an attractive and challenging topic for researchers. There are three types of objective VQA methods according to their use of the reference video [3]: full-reference (FR) VQA [4], [5], [6], [7] requires complete information from the reference video; reduced-reference (RR) VQA [8], [9] only takes part of the information from the reference video; no-reference (NR) VQA [10], [11], [12], [13] does not require any information from the reference video. Since the reference video is not always available in real VQA applications, the NR-VQA approach is preferable for evaluating video quality [14].
In the early stage, traditional NR-VQA methods [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21] were developed by exploring different spatial and temporal features. For example, TLVQM [10] extracts 75 spatial and temporal features from frames and predicts the final video score using support vector regression (SVR). However, such hand-crafted features target specific distortions only, limiting the performance and generalization of visual quality prediction.
Recently, many deep neural network (DNN) models have been proposed to learn data representations, hidden features, and abstract features automatically. However, directly applying DNNs to the VQA task faces two main challenges. The first challenge is the requirement of high computational power and a vast amount of memory. A raw video usually has a high spatial resolution and frame rate, and due to the limited memory of a graphics processing unit (GPU), it is hard to process a whole VQA database and train an end-to-end DNN model for the VQA task. Therefore, some existing NR-VQA methods use spatial downscaling and/or temporal downsampling strategies to reduce the computational requirements and achieve end-to-end training. For example, the spatial resolution of the video in [22] is downsized to a lower resolution before being fed into the network.
To alleviate this issue, most deep learning-based NR-VQA models separate the spatial and temporal learning processes to avoid both excessive computational demand at once and information loss. However, in the databases used for subjective VQA studies, each video contains only one mean opinion score (MOS) as ground truth to represent overall video quality. There is no human-annotated MOS label for each frame, i.e., for the frame-level quality. Since a video can contain thousands of frames or more, it is impractical to annotate the frame-level quality for spatial feature learning, which poses another challenge for DNN training in the VQA task, as it relies on large-scale data with reliable labels. Therefore, to ease the labeling burden of training a DNN from scratch, pre-trained CNN models, such as ResNet [26] pre-trained on ImageNet [27], are used by state-of-the-art NR-VQA methods such as VSFA [12] and CNN-TLVQM [13]. These NR-VQA methods transfer the features learned on the image classification task to the VQA target domain via transfer learning [28], [29]. However, the features learned from image classification can only provide a sub-optimal feature representation because a domain gap exists between the source image classification task and the target VQA domain.
Besides, some NR-VQA methods use a DNN model pre-trained on the IQA domain for spatial feature learning, assuming that the spatial quality features of frames are close to the IQA domain. For example, the PHIQNet model in [30] is pre-trained on IQA to extract the perceptual quality features of video frames, which are then fed into a long short-term convolutional transformer (LSCT) model for temporal pooling. However, unlike still images, consecutive video frames carry motion information, and human visual attention is attracted more to regions with motion events than to structural details or the background [31]. Therefore, using only an IQA pre-trained model to extract frame perceptual quality while ignoring motion information can only provide a sub-optimal frame quality feature representation.
Inspired by SSL, this paper proposes a multi-channel CNN model that uses non-human-annotated supervision signals for frame-level quality feature learning, together with a GRU model that takes HVP characteristics into account for NR-VQA. First, the multi-channel CNN with a channel attention mechanism is pre-trained on the IQA domain with distorted images and their corresponding structure-aware maps and saliency maps to learn the image quality feature representation guided by non-human-annotated supervision signals, motivated by SSL using pretext tasks [32], [33], [34], [35], [36], [37], [38], [39]. For example, RotNet [32] predicts image rotation as a pretext task to learn the image representation before fine-tuning for image classification. Other pretext tasks include image or video colorization [33], [34], [35], jigsaw puzzles [36], relative position [37], pixel generation (iGPT) [38], and visual token reconstruction (BEiT) [39]. In addition, since human visual attention is attracted more by regions with motion events than by structural details, we perform semi-supervised learning to fine-tune the pre-trained CNN and further reduce the domain gap. To incorporate motion-aware information into the video frame, the unlabeled distorted frame and its corresponding structure-aware map and motion-aware map are fed into the pre-trained CNN to predict a pseudo label, which is treated as the frame quality label and thereby addresses the lack of human-annotated label data for video frames. Then, the IQA data and the VQA data are combined for fine-tuning to transfer the feature learning from the IQA to the VQA domain. This achieves a better frame-level quality feature representation while considering motion-aware information of a video frame. Besides, temporal and color-aware features that are highly related to HVP [40], [41], such as motion intensity, video smoothing, and color descriptions in the HSV color space, are also extracted and combined with the frame-level quality feature representation as the input of the GRU model to obtain a precise final predicted video quality. The contributions of this work are summarized as follows:
To compensate for the shortage of human-annotated labels on video frames for the VQA task, we are the first to adopt a self-supervised learning (SSL)-based NR-VQA framework built on non-human-annotated supervision signals for frame-level quality feature learning. The details of this contribution are presented in Section III-A.
On top of the SSL-based NR-VQA framework, we devise a semi-supervised learning scheme to fine-tune the pre-trained CNN, as described in Section III-A.3. Our objective is to reduce the domain gap by taking motion-aware information into consideration, thereby providing an optimized frame-level quality feature representation for the VQA task.
We also extract HVP-related features to assist perceived video quality prediction. All features are then fed into the GRU model with pre-padding and masking strategies to comprehensively evaluate the perceived quality of the whole video. This contribution is described in Section III-B.
By evaluating our model on three UGC VQA databases and two traditional-distortion VQA databases, we verify that it provides a better frame-level quality feature representation for various distortions and contents and predicts video quality in close agreement with HVP compared with other state-of-the-art transfer learning/pre-trained model-based VQA methods.
The rest of this paper is organized as follows. In Section II, we present the relevant research work. In Section III, the details of our proposed model are described. Then, the experimental results and related analysis are presented in Section IV. Finally, Section V concludes the paper.
Related Work in NR-VQA
A. Traditional Methods
The general NR-VQA pipeline contains two key components: discriminative feature extraction and accurate quality prediction. Since spatial information is a vital aspect of HVE, some successful and efficient IQA methods [42], [43], [44] were exploited to develop the spatial feature extraction algorithms of several NR-VQA approaches. For example, some NR-VQA methods [15], [16], [17], [18] use an NR-IQA method built on the natural scene statistics (NSS) model to estimate frame quality from the statistical properties of the spatial information, and then aggregate the frame-by-frame scores of the distorted videos using average pooling or regression. However, videos, with their 3D information, differ from images: their characteristics contain not only spatial information but also temporal information. Therefore, several NR-VQA methods take temporal features into account. Manasa and Channappayya [19] proposed an optical flow-based NR-VQA algorithm that measures irregularities at the patch and frame levels. The video intrinsic integrity and distortion evaluation oracle (VIIDEO) [20] observes the intrinsic statistical regularities of natural videos and uses them to quantify disturbances introduced by distortions. Saad et al. [21] proposed a blind VQA method, V-BLIINDS, that assesses frame quality using a spatiotemporal NSS model in the discrete cosine transform (DCT) domain and quantifies motion coherency to predict the video quality.
B. Deep Learning-Based Methods
It is well known that neural networks can automatically learn data representations, hidden features, and abstract features. The CNN is a typical type of DNN that can extract discriminative, semantic, and comprehensive features of images and videos. Therefore, many deep learning-based methods have been adopted for NR-VQA. For instance, in [45], the 3D-DCT is used to represent the spatiotemporal features of video blocks, and the deformation of AC coefficients is formed to capture the temporal features. A CNN model and a frequency histogram mapping function are then employed to explore the spatiotemporal regularities and obtain the final video quality score. SACONVA [46] uses the 3D shearlet transform to extract primary spatiotemporal features, which can also capture the NSS properties of video blocks. Afterward, a CNN and regression are applied to further expand these features and predict the video quality. DeepBVQA [47] uses a CNN model to extract various spatial features, handcrafts the sharpness variation as the temporal feature, and finally aggregates and regresses the features to obtain the quality score.
Moreover, Tran et al. [48] proposed a 3D CNN model to extract spatiotemporal features, addressing the problem that a 2D CNN cannot directly extract temporal information from videos. You and Korhonen [49] proposed an NR-VQA model based on a 3D CNN and a long short-term memory (LSTM) [50] model to extract spatiotemporal features from video blocks and handle the time-series processing of those blocks. Wu et al. [51] also proposed an NR-VQA model based on a 3D CNN and an LSTM model; it constructs the spatial attention map of video blocks and combines it with the corresponding predicted similarity map to further extract spatial quality information via average pooling and standard deviation pooling. These features are then fed into the LSTM model to predict the overall video quality. Besides, Yi et al. [22] proposed an end-to-end training model for the VQA task: first, a VGG16 model extracts the spatial features, with an attention module added to calculate the dependency between local spatial features, and then a GRU and memory function are used to obtain the final video quality score.
C. Transfer Learning and Pre-Training Based Methods
To compensate for the lack of sufficient training samples to train a robust deep CNN model, some NR-VQA methods learn features in other domains and then transfer them to the VQA target domain via transfer learning. VSFA [12] extracts content-aware features from a CNN model pre-trained on an image classification task and then predicts the video quality using a GRU temporal-memory model; the authors further improved this method by training on mixed datasets in [52]. CNN-TLVQM [13] combines the handcrafted human visual system (HVS) features extracted by TLVQM [10] with the spatial features obtained from a pre-trained CNN via transfer learning, and then uses an SVR model to compute the predicted quality score. Chu et al. [53] also use a CNN pre-trained on an image classification task to extract spatial features as well as horizontal and vertical spatiotemporal slice features of frames; these features are learned by a multi-layer perceptron (MLP) to predict frame-level quality, and an SVR fuses the MLP scores into a final score. LSCT-PHIQNet [30] pre-trains PHIQNet on an IQA task and then feeds the extracted features into an LSCT model as a temporal regression model to predict the final video quality. PVQ [54] uses a pre-trained IQA model to extract spatial features of frames and a model pre-trained on a video classification task to extract spatiotemporal features of 3D clips; the final video quality is then predicted after spatiotemporal pooling and time-series regression with an inception-time model. HEKE [24] creates a large-scale video dataset with weak labels to pre-train a feature encoder that extracts the spatiotemporal representation of a video and then uses the pre-trained encoder with hierarchical feature regression to predict the video quality. RIRNet [25] extracts spatial quality features from a model pre-trained on an image classification task and then predicts the video quality through motion effect modeling. However, since the features learned from other tasks are not closely related to the VQA target domain, we believe there is still room for improvement in transfer learning/pre-trained model-based NR-VQA approaches by reducing the feature gap between the source domain and the VQA domain.
Proposed Method
In this section, we introduce a novel NR-VQA method that adopts a new multi-channel CNN model with a GRU, incorporating motion-aware information and HVP characteristics. The framework of our proposed model is shown in Fig. 1. First, the multi-channel CNN is pre-trained on an IQA database to predict image quality features focusing on structure-aware features and salient regions, which can be regarded as an SSL-based method using a pretext task. Then, with semi-supervised learning and fine-tuning strategies, the features learned by the pre-trained CNN are refined to predict a frame-level quality feature representation focusing on structure-aware features and motion-aware regions, transferring the feature learning from the IQA to the VQA domain and reducing the domain gap for a better feature representation. In the meantime, HVP-related temporal and color-aware features are also extracted. Lastly, all features are fed into the GRU model to explore the spatiotemporal features and the gradient of temporal features and thereby comprehensively evaluate the quality of the whole video. We detail each part in the following subsections.
A. SSL-Based Multi-Channel CNN Model for VQA
The HVS is known to be sensitive to moving objects [31]. Hence, visual attention is attracted more by motion event regions than by the structural details of the video, and distortions occurring on moving objects affect perceptual quality more than those occurring in the background or spatial structures. However, most existing transfer learning/pre-trained CNN models used in NR-VQA extract the spatial or content-aware features of the whole frame to represent frame quality without considering motion information and motion-aware regions, which widens the domain gap between the source domain and the target VQA domain and can only provide a sub-optimal feature representation for the VQA task. Besides, human-annotated labels for frame quality are not available in the VQA databases, and the subjective quality score of a video cannot represent frame quality because the distortion varies over time and frames. In other words, there is no human-annotated MOS label that represents frame-level quality with motion. To address this issue, we introduce SSL, a form of unsupervised learning that lets a network learn critical features from unlabeled data through non-human-annotated supervision signals, into the proposed VQA framework to guide and pre-train the multi-channel CNN model on both IQA and VQA databases so that it learns the frame-level quality feature representation from the non-human-annotated supervision signal.
Based on the concept of the region of interest (ROI), we hypothesize that the SM can guide image quality prediction by focusing on important still regions, while the motion-aware region map can guide frame quality prediction by focusing on motion-aware regions. Therefore, we apply semi-supervised learning on top of our SSL-based multi-channel CNN model and combine the IQA data and the VQA data to fine-tune the model, as shown in Fig. 2, processing the distorted frame, the structure-aware map, and the motion-aware region map to estimate an optimized frame-level quality feature representation that considers both spatial and motion-aware information at the frame level.
Fig. 2. The network architecture of our proposed multi-channel CNN model. (a) Residual Block (ch); (b) Residual Block 2 (ch); (c) SE-Residual Block (ch); (d) SE-Residual Block 2 (ch). Conv(ch, kn, st, pd) denotes a 2D convolution, where ch is the output channel and kn, st, and pd are the kernel size, stride, and padding, respectively.
1) Pre-Processing Stage
Before training the multi-channel CNN model, we first compute the gradient magnitude map (GMM) as the structure-aware map, because the GMM of an image is responsive to image distortions such as compression, blur, and noise, and effectively captures local image structures, to which the HVS is highly sensitive. As demonstrated in the image processing literature [55], the GMM therefore reflects the structural information of images. As shown in Fig. 3(b), the GMM reveals the rich structural information of Fig. 3(a). The GMM of the input distorted image $I^{d}$ is computed from its horizontal and vertical gradients as \begin{equation*} {\text {GMM}}^{d}=\sqrt {{(I^{d}\ast g_{h})}^{2}+{(I^{d}\ast g_{v})}^{2}} \tag{1}\end{equation*}
where $\ast$ denotes convolution and $g_{h}$ and $g_{v}$ are the horizontal and vertical Prewitt filters \begin{align*} g_{h}=\left [{ {\begin{array}{ccc} 1/3 & 0 & -1/3\\ 1/3 & 0 & -1/3\\ 1/3 & 0 & -1/3\\ \end{array}} }\right],\quad g_{v}=\left [{ {\begin{array}{ccc} 1/3 & 1/3 & 1/3\\ 0 & 0 & 0\\ -1/3 & -1/3 & -1/3\\ \end{array}} }\right] \tag{2}\end{align*}
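For illustration, a minimal NumPy/SciPy sketch of the GMM computation in (1)–(2) is given below; the function name and the border-handling mode are our own choices and are not specified in the paper.

```python
# Minimal sketch of the GMM computation in (1)-(2).
import numpy as np
from scipy.ndimage import convolve

def gradient_magnitude_map(frame_gray: np.ndarray) -> np.ndarray:
    """Structure-aware map (GMM) of a grayscale image."""
    # 1/3-weighted Prewitt kernels, as defined in (2)
    g_h = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]], dtype=np.float32) / 3.0
    g_v = g_h.T  # vertical Prewitt kernel
    gx = convolve(frame_gray.astype(np.float32), g_h, mode='nearest')
    gy = convolve(frame_gray.astype(np.float32), g_v, mode='nearest')
    return np.sqrt(gx ** 2 + gy ** 2)  # eq. (1)
```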
Fig. 3. Results from the pre-processing stage. (a) Original frame; (b) Gradient magnitude map; (c) Saliency map; (d) Optical flow map.
Moreover, to compute the SM of the image, we first apply the method in [56] to determine the saliency residual, because the log-spectrum is sensitive to the NSS that indicate the salient region of the image. The saliency residual $\mathcal {R}(f)$ is computed as the difference between the log-spectrum $\mathcal {L}(f)$ and its locally averaged version $\mathcal {A}(f)$, \begin{equation*} \mathcal {R}\left ({f }\right)=\mathcal {L}(f)-\mathcal {A}(f) \tag{3}\end{equation*}
in the spectral domain, and a preliminary saliency map, PSM, is obtained by transforming the residual back to the spatial domain via the inverse Fourier transform $\mathcal {F}^{-1}$, \begin{equation*} \text {PSM}(f)=\mathcal {F}^{-1}\left ({\mathcal {R}\left ({f }\right) }\right) \tag{4}\end{equation*}
Finally, the SM of the distorted image $I^{d}$ is obtained by applying the smoothing filter $\text {VSF}(\cdot)$ to its PSM, \begin{equation*} {\text {SM}}^{d}=\text {VSF}\left ({\text {PSM}\left ({I^{d} }\right) }\right) \tag{5}\end{equation*}
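For reference, a minimal NumPy/SciPy sketch of the spectral-residual saliency computation outlined in (3)–(5) is shown below. Following [56], the local average $\mathcal{A}(f)$ is approximated by a box filter on the log-amplitude spectrum, the phase is retained when transforming back to the spatial domain, and a Gaussian filter stands in for the smoothing step VSF; the filter sizes are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def saliency_map(frame_gray: np.ndarray, avg_size: int = 3, sigma: float = 2.5) -> np.ndarray:
    """Spectral-residual saliency map, roughly following (3)-(5) and [56]."""
    spectrum = np.fft.fft2(frame_gray.astype(np.float32))
    log_amp = np.log(np.abs(spectrum) + 1e-8)              # L(f)
    phase = np.angle(spectrum)
    avg_log_amp = uniform_filter(log_amp, size=avg_size)   # A(f), box-filtered log-spectrum
    residual = log_amp - avg_log_amp                       # R(f), eq. (3)
    # Back to the spatial domain with the original phase, cf. eq. (4)
    psm = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sm = gaussian_filter(psm, sigma=sigma)                 # smoothing step (VSF), eq. (5)
    return sm / (sm.max() + 1e-8)
```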
For the VQA data, the motion-aware map of frame $t$ is its optical flow map (OFM), computed between two consecutive distorted frames as \begin{equation*} {\text {OFM}}_{t}=\text {ME}\left ({\text {PET}\left ({I_{t-1}^{d} }\right),\text {PET}\left ({I_{t}^{d} }\right) }\right) \tag{6}\end{equation*}
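Since the operators ME(·) and PET(·) in (6) are not fully specified in this excerpt, the sketch below assumes dense Farnebäck optical flow (OpenCV) computed on grayscale-converted frames, with the per-pixel flow magnitude used as the OFM.

```python
import cv2
import numpy as np

def optical_flow_map(prev_frame_bgr: np.ndarray, cur_frame_bgr: np.ndarray) -> np.ndarray:
    """Motion-aware map (OFM) between two consecutive distorted frames, cf. eq. (6)."""
    prev_gray = cv2.cvtColor(prev_frame_bgr, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Per-pixel motion magnitude as the optical flow map
    return np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
```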
After the pre-processing stage, the GMM and SM of images are computed for the IQA dataset, and the GMM and OFM of frames are extracted for the VQA dataset for the subsequent training process.
2) Self-Supervised Learning-Based CNN Model Pre-Training
To train our multi-channel CNN model, inspired by [59] and [60], which use distortion intensity as the self-supervised signal for a regression task in SSL, we use the gradient magnitude similarity deviation (GMSD) [55] as the non-human-annotated supervision signal, the so-called pseudo label (PL), denoted as ${\text {PL}}_{I}^{d}$ for the distorted image $I^{d}$.
To compute the structure-based quality features, we extract the global spatial features $F_{\text {GSF}}$ by summing the features extracted from the image channel, $F_{\text {IC}}$, and the structure channel, $F_{\text {SC}}$, \begin{equation*} F_{\text {GSF}}=F_{\text {IC}}(I^{d})\oplus F_{\text {SC}}({\text {GMM}}^{d}) \tag{7}\end{equation*} where $\oplus$ denotes element-wise summation.
Also, we incorporate the squeeze-and-excitation block [61] into the residual block, which squeezes the features into one-dimensional global information and then reinforces critical features while weakening inconsequential features through channel-wise multiplication. By applying this channel attention mechanism ${\text {SEN}}_{S}$ to the global spatial features, the spatial-aware features $F_{\text {SAF}}$ are obtained as \begin{equation*} F_{\text {SAF}}={\text {SEN}}_{S}\left ({F_{\text {GSF}} }\right) \tag{8}\end{equation*}
In the meantime, the features extracted from the SM channel, $F_{\text {RC}}({\text {SM}}^{d})$, are concatenated with the spatial-aware features $F_{\text {SAF}}$ and passed through another channel attention block ${\text {SEN}}_{R}$ to obtain the fused region-aware features $F_{\text {RASAF}}$, \begin{equation*} F_{\text {RASAF}}={\text {SEN}}_{R}\left ({F_{\text {SAF}}\otimes F_{\text {RC}}\left ({{\text {SM}}^{d} }\right) }\right) \tag{9}\end{equation*} where $\otimes$ denotes concatenation.
Finally, two fully connected layers regress $F_{\text {RASAF}}$ into the predicted pseudo-label score $\hat {p}^{L}$, \begin{equation*} \hat {p}^{L}={\mathrm {MultiCNN}}_{\mathrm {IQA}}(U_{I}^{d})={\mathrm {FC}}_{2}\left ({{\mathrm {FC}}_{1}\left ({F_{\mathrm {RASAF}} }\right) }\right) \tag{10}\end{equation*} where $U_{I}^{d}$ denotes the input of the distorted image together with its GMM and SM.
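The following PyTorch sketch illustrates the fusion path of (7)–(10): element-wise summation of the image- and structure-channel features, SE channel attention, concatenation with the region-channel features, a second SE block, and two fully connected layers. The backbone layout, channel widths, and the global average pooling before the FC layers are simplifications of ours and do not reproduce the exact architecture in Fig. 2.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention [61]."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x).unsqueeze(-1).unsqueeze(-1)
        return x * w  # channel-wise re-weighting

def conv_backbone(in_ch: int, out_ch: int = 64) -> nn.Sequential:
    """Placeholder per-channel feature extractor (stands in for the residual blocks)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))

class MultiChannelCNN(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.f_ic = conv_backbone(3, ch)   # image channel, F_IC
        self.f_sc = conv_backbone(1, ch)   # structure channel (GMM), F_SC
        self.f_rc = conv_backbone(1, ch)   # region channel (SM or OFM), F_RC
        self.sen_s = SEBlock(ch)           # SEN_S in (8)
        self.sen_r = SEBlock(2 * ch)       # SEN_R in (9)
        self.fc1 = nn.Linear(2 * ch, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, img, gmm, region_map):
        f_gsf = self.f_ic(img) + self.f_sc(gmm)                                  # eq. (7)
        f_saf = self.sen_s(f_gsf)                                                # eq. (8)
        f_rasaf = self.sen_r(torch.cat([f_saf, self.f_rc(region_map)], dim=1))   # eq. (9)
        pooled = f_rasaf.mean(dim=(2, 3))   # global average pooling (our assumption)
        return self.fc2(self.fc1(pooled)).squeeze(-1)                            # eq. (10)
```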
The multi-channel CNN is pre-trained by minimizing the mean squared error between the predicted scores and the GMSD pseudo labels over the $M$ training samples, \begin{equation*} \mathcal {L}_{\text {self}}(\hat {p}^{L},{\text {PL}}_{I}^{d})=\frac {1}{M}\sum \nolimits _{i=0}^{M-1} {(\hat {p}_{i}^{L}-{\text {PL}}_{I,i}^{d})}^{2} \tag{11}\end{equation*}
3) Semi-Supervised Learning for Fine-Tuning
As mentioned before, frame quality feature extraction that ignores motion information and motion-aware regions creates a domain gap between the source IQA and target VQA tasks. To transfer the feature representation of our multi-channel CNN model from the IQA to the VQA domain, we incorporate motion-aware information into the model to compute the frame-level quality feature representation, using semi-supervised learning and fine-tuning strategies built on our pre-trained multi-channel CNN model.
Following the same ROI concept, we assume that the OFM can guide video frame quality prediction by focusing on the motion-aware region, just as the SM guides image quality prediction by focusing on salient still-structure regions. Therefore, we replace the SM with the OFM as the region-aware map for the VQA data.
Besides, as mentioned above, there is no human-annotated label for each video frame. Therefore, for each video frame of the training VQA dataset, the pseudo label ${\text {PL}}_{V}^{d}$ is generated by the pre-trained multi-channel CNN, denoted ${\text {MultiCNN}}_{\text {trans}}$, \begin{equation*} {\text {PL}}_{V}^{d}={\text {MultiCNN}}_{\text {trans}}(U_{V}^{d}) \tag{12}\end{equation*} where $U_{V}^{d}$ denotes the input of the distorted frame together with its GMM and OFM.
After preparing the pseudo-labeled VQA data, the IQA and VQA data are combined and the model is fine-tuned with the semi-supervised loss \begin{equation*} \mathcal {L}_{\text {semi}}=\mathcal {L}_{\text {self}}(\hat {p}^{L},{\text {PL}}_{I}^{d})+a(k)\mathcal {L}_{\text {self}} (\hat {u}^{L},{\text {PL}}_{V}^{d}) \tag{13}\end{equation*} where $\hat {u}^{L}$ is the prediction for a VQA frame and $a(k)$ is an epoch-dependent weighting function,
\begin{align*} a\left ({k }\right)=\begin{cases} 0&k< K_{1}\\ \left [{ (k-K_{1})/(K_{2}-K_{1}) }\right]a_{f}&K_{1}\le k< K_{2}\\ a_{f}&k\ge K_{2}\\ \end{cases} \tag{14}\end{align*} where $k$ is the training epoch index, $K_{1}$ and $K_{2}$ define the ramp-up period, and $a_{f}$ is the final weighting factor.
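A minimal PyTorch-style sketch of the ramp-up weight in (14) and the combined loss in (13) is shown below; the default values of K1, K2, and a_f are placeholders rather than the settings used in the paper.

```python
import torch.nn.functional as F

def rampup_weight(k: int, k1: int, k2: int, a_f: float) -> float:
    """Ramp-up weighting a(k) in (14): zero before K1, linear until K2, then a_f."""
    if k < k1:
        return 0.0
    if k < k2:
        return (k - k1) / (k2 - k1) * a_f
    return a_f

def semi_supervised_loss(pred_iqa, pl_iqa, pred_vqa, pl_vqa, k, k1=10, k2=30, a_f=1.0):
    """Combined loss of (13): supervised IQA term plus weighted pseudo-labeled VQA term."""
    loss_iqa = F.mse_loss(pred_iqa, pl_iqa)   # L_self on IQA data, eq. (11)
    loss_vqa = F.mse_loss(pred_vqa, pl_vqa)   # L_self on pseudo-labeled VQA frames
    return loss_iqa + rampup_weight(k, k1, k2, a_f) * loss_vqa
```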
Training Process of Semi-Supervised Learning Performed on Our SSL-Based Multi-Channel CNN Model
As can be seen in (13) and (14), only the supervised IQA term contributes to the loss during the first $K_{1}$ epochs; the pseudo-labeled VQA term is then ramped up linearly until its weight reaches $a_{f}$, so that the model gradually shifts from IQA supervision toward the VQA domain.
4) Frame-Level Quality Features Extraction
Specifically, the frame-level quality features, FQFR, are extracted at the output of the first fully connected layer, ${\mathrm {FC}}_{1}$, of the fine-tuned multi-channel CNN. Since each frame is split into $B$ patches, the frame-level quality feature representation of frame $t$ is obtained by pooling the mean and standard deviation of the patch-level features, \begin{equation*} {\text {FQFR}}_{t}=\left \{{\mu {\{{\text {FQFR}}_{b}\}}_{b=1}^{b=B},\sigma {\{{\text {FQFR}}_{b}\}}_{b=1}^{b=B} }\right \} \tag{15}\end{equation*} where $\mu \{\cdot \}$ and $\sigma \{\cdot \}$ denote the mean and standard deviation over the $B$ patches, respectively.
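As a sketch of (15), assuming the fine-tuned model exposes a hypothetical `extract_fc1` method that returns the first fully connected layer's output for a batch of patches, the frame-level representation can be pooled as follows.

```python
import torch

@torch.no_grad()
def frame_quality_features(model, patches, gmm_patches, ofm_patches):
    """Frame-level quality feature representation of one frame, cf. eq. (15).

    The inputs are tensors of shape (B, C, H, W) holding the B patches of a frame;
    `model.extract_fc1` is an assumed helper returning the FC1 output of shape (B, D).
    """
    feats = model.extract_fc1(patches, gmm_patches, ofm_patches)   # (B, D)
    return torch.cat([feats.mean(dim=0), feats.std(dim=0)])        # mean/std pooling over patches
```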
B. HVP-Related Features Extraction
Although the proposed multi-channel CNN considers the inter-frame motion-aware information to extract an optimized frame-level quality feature representation, certain motion events, such as sudden scene changes, frame freezes, and new object appearances, as well as large color variations, can also have a significant impact on HVP [40], [41]. Following the works in [10], [13], and [23], which considered human perception and HVS features, we also account for HVP characteristics, which reflect the comprehensive perceived quality under the HVS, by extracting additional temporal and color-aware features of frames and feeding them, together with the frame-level quality features, into the GRU model to assist in precisely predicting video quality that is close to human perception.
For the temporal features, as mentioned, optical flow is used to determine the inter-frame motion variation that reflects temporal attention. Therefore, based on the OFM in (6), we further calculate the global motion intensity (GMI), the size of the motion event region (MER), and the mean (${\text {MV}}_{\mu}$) and standard deviation (${\text {MV}}_{\sigma}$) of the non-zero motion magnitudes of frame $t$ as follows:\begin{align*} {\text {GMI}}_{t}&=\frac {1}{W\times H}\sum \nolimits _{x=0}^{W-1} \sum \nolimits _{y=0}^{H-1} {{\text {OFM}}_{t}\left ({x,y }\right)} \\ {\text {MER}}_{t} &=\frac {1}{W\times H}\sum \nolimits _{x=0}^{W-1} \sum \nolimits _{y=0}^{H-1} {{\text {MR}}_{t}(x,y)} \\ {\text {MV}}_{\mu }^{t}&=\mu (\left \{{{{\text {OFM}}_{t}\left ({x,y }\right)}\thinspace \vert \thinspace {{\text {OFM}}_{t}\left ({x,y }\right)>0}}\right \}) \\ {\text {MV}}_{\sigma }^{t}&=\sigma (\left \{{{{\text {OFM}}_{t}\left ({x,y }\right)}\thinspace \vert \thinspace {{\text {OFM}}_{t}\left ({x,y }\right)>0}}\right \}) \tag{16}\end{align*} where $W$ and $H$ are the frame width and height,
and ${\mathrm {MR}}_{t}(x,y)$ indicates whether pixel $(x,y)$ belongs to a motion region, \begin{align*} {\mathrm {MR}}_{t}\left ({x,y }\right)=\begin{cases} 1&{\mathrm {OFM}}_{t}\left ({x,y }\right)>0\\ 0&{\mathrm {OFM}}_{t}\left ({x,y }\right)=0\\ \end{cases} \tag{17}\end{align*}
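A minimal NumPy sketch of the temporal features in (16)–(17), computed from the optical flow map of a single frame, is given below.

```python
import numpy as np

def motion_features(ofm: np.ndarray) -> dict:
    """Temporal HVP features of frame t from its optical flow map, eqs. (16)-(17)."""
    h, w = ofm.shape
    moving = ofm > 0                          # motion-region indicator MR_t, eq. (17)
    gmi = ofm.sum() / (w * h)                 # global motion intensity
    mer = moving.sum() / (w * h)              # relative size of the motion event region
    nonzero = ofm[moving]
    mv_mu = float(nonzero.mean()) if nonzero.size else 0.0
    mv_sigma = float(nonzero.std()) if nonzero.size else 0.0
    return {"GMI": gmi, "MER": mer, "MV_mu": mv_mu, "MV_sigma": mv_sigma}
```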
Also, since human eyes are sensitive to sudden scene changes, frame freeze, and new object appearance in videos, we also use the Structural Similarity Index Measure (SSIM) [63] to measure the structural similarity between frames to indicate their structural difference and represent the video smoothing (VS) by:\begin{equation*} {\text {VS}}_{t}=\text {SSIM}(I_{t-1}^{d},I_{t}^{d}) \tag{18}\end{equation*}
To extract the color-aware features, we first convert the frame into the HSV color space, since it is a color description model that is more consistent with human perception. In addition to human attention being drawn to color variation, spatial distortions may also be reflected in the color domain. Thus, we compute the standard deviation within the frame and the mean-square error (MSE) between frames in the hue (H) and saturation (S) channels to represent the color spatial distortion (CSD) and the level of color variation (CV) as follows:\begin{align*} {\text {CSD}}_{t}&=\left \{{\sigma \left ({H\left ({I_{t}^{d} }\right) }\right),\sigma \left ({S\left ({I_{t}^{d} }\right) }\right)}\right \} \\ {\text {CV}}_{t}&=\left \{{\text {MSE}\left ({H\left ({I_{t-1}^{d} }\right),H\left ({I_{t}^{d} }\right) }\right),}\right. \\ &\quad \left.{\text {MSE}\left ({S\left ({I_{t-1}^{d} }\right),S\left ({I_{t}^{d} }\right) }\right) }\right \} \tag{19}\end{align*}
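The sketch below computes the video smoothing feature of (18) and the color-aware features of (19) for a pair of consecutive frames; evaluating SSIM on grayscale frames and using OpenCV's 8-bit HSV conversion are our assumptions where the paper does not specify the details.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def hvp_color_temporal_features(prev_bgr: np.ndarray, cur_bgr: np.ndarray) -> dict:
    """Video smoothing (18) and color-aware features (19) between consecutive frames."""
    vs = structural_similarity(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY),
                               cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2GRAY))   # eq. (18)
    prev_hsv = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    cur_hsv = cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h_prev, s_prev = prev_hsv[..., 0], prev_hsv[..., 1]
    h_cur, s_cur = cur_hsv[..., 0], cur_hsv[..., 1]
    csd = (h_cur.std(), s_cur.std())                       # color spatial distortion, eq. (19)
    cv_feat = (np.mean((h_cur - h_prev) ** 2),             # color variation as inter-frame MSE
               np.mean((s_cur - s_prev) ** 2))
    return {"VS": vs, "CSD": csd, "CV": cv_feat}
```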
C. Video Quality Prediction Via GRU Model
The GRU is a well-known recurrent neural network that processes input sequences iteratively: the node output at the current timestamp is fed back as input to the node at the next timestamp, which makes the GRU efficient at extracting temporal features. Consequently, the GRU can make predictions from time-series data and explore the spatiotemporal regularities of distorted videos for our VQA task. We therefore take advantage of the GRU model to learn the temporal variation of the frame-level quality feature representation and the additional HVP-related features along the time series, representing the spatiotemporal characteristics of the video. By analyzing the entire temporal data sequence, the GRU model can also reveal the gradient of the temporal features, which comprehensively reflects the quality of the whole video.
First, we concatenate the frame-level quality feature representation with the temporal and color-aware features to form a per-frame feature vector, which is fed into the GRU model to predict the video quality score $\hat {v}^{L}$. The GRU model is trained by minimizing the mean squared error between the predicted scores and the ground-truth MOS $v^{L}$ over $N$ training videos, \begin{equation*} \mathcal {L}_{\text {GRU}}=\frac {1}{N}\sum \nolimits _{i=0}^{N-1} {(\hat {v}_{i}^{L}-v_{i}^{L})}^{2} \tag{20}\end{equation*}
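A schematic PyTorch sketch of the GRU regressor is given below; packed sequences are used here as a stand-in for the paper's pre-padding and masking scheme, and the hidden size is an illustrative choice.

```python
import torch
import torch.nn as nn

class GRUQualityPredictor(nn.Module):
    """Temporal regression of per-frame feature vectors into one video quality score."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats, lengths):
        """feats: (N, T_max, feat_dim) padded sequences; lengths: true frame counts."""
        packed = nn.utils.rnn.pack_padded_sequence(
            feats, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, h_n = self.gru(packed)               # padded steps are skipped (masking)
        return self.head(h_n[-1]).squeeze(-1)   # predicted video quality scores
```

Training then amounts to minimizing `torch.nn.functional.mse_loss(pred, mos)` over a batch of videos, matching (20).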
Hence, during inference, we first extract the features in (15), (16)–(18), and (19) from all frames of a video to obtain the per-frame feature vectors, which are then fed into the trained GRU model to predict the final video quality score.
Experimental Results
A. Video Quality Databases and Evaluations
To demonstrate the validity and robustness of our proposed model, three UGC VQA databases (KoNViD-1k [64], LIVE-Qualcomm [65], and LIVE-VQC [66]) and two traditional-distortion VQA databases (LIVE [2] and CSIQ [67]) were used to test the model. A summary of these video databases is given in Table 1.
KoNViD-1k [64] is an extensive database containing 1200 real-world video sequences with frame rates of 24, 25, and 30 fps. Its large number of sequences represents a wide variety of content and covers almost all kinds of distortions. The MOS values range from 1.22 to 4.64.
LIVE-Qualcomm [65] contains 208 distorted videos exhibiting six common in-capture distortions: artifacts (noise and blocking effect), color, exposure, focus, blurriness, and camera shaking. All videos have a duration of 15 seconds at a frame rate of 30 fps, with MOS ranging from 16.56 to 73.64.
LIVE-VQC [66] contains 585 distorted videos, with the MOS ranging from 6.22 to 94.29. All videos have a duration of 10s with frame rates of 19–30 fps (one is 120fps). These videos contain 18 types of resolutions from 240P to 1080P, unique contents, and different combinations of distortions.
LIVE [2] includes 150 distorted videos generated from each reference video with four distortion types: wireless distortions, IP distortions, H.264 compression, and MPEG-2 compression. Each distorted video is 8.68 to 10 seconds long with a frame rate of 25 fps or 50 fps, and the average differential MOS ranges from 30.94 to 81.16.
CSIQ [67] has 216 distorted videos. All videos have a duration of 10 seconds and span frame rates from 24 to 60 fps. Each reference video produces 18 distorted videos, covering six distortion types at three levels each: Motion JPEG compression, H.264 compression, HEVC compression, wavelet compression using the SNOW codec [68], packet loss, and additive white Gaussian noise (AWGN). The average differential MOS ranges from 14.48 to 82.80.
To evaluate the performance of our proposed model, we used the Pearson Linear Correlation Coefficient (PLCC) and root MSE (RMSE) to measure the accuracy between the objective prediction and subjective assessment, and the Spearman Rank Order Correlation Coefficient (SROCC) to measure the monotonic consistency between the objective prediction and subjective assessment. The closer the value of the correlation coefficient is to 1, the higher the performance of the VQA model. Also, a nonlinear regression process is performed to map the prediction result to the subjective scores with different value domains according to the video quality experts group (VQEG) [69] as follows:\begin{equation*} \hat {Q}=\beta _{1}\left [{ \frac {1}{2}-\frac {1}{1+\text {exp}[\beta _{2}(Q-\beta _{3})] } }\right]+\beta _{4}Q+\beta _{5} \tag{21}\end{equation*}
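For reference, the nonlinear mapping in (21) can be fitted with SciPy as sketched below; the initial parameter guesses are heuristics of ours, not values prescribed by VQEG or the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def vqeg_logistic(q, b1, b2, b3, b4, b5):
    """Logistic-plus-linear mapping of eq. (21)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (q - b3)))) + b4 * q + b5

def fit_and_map(pred, mos):
    """Fit (21) on predicted scores and return the nonlinearly mapped predictions."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    p0 = [mos.max() - mos.min(), 0.1, pred.mean(), 0.0, mos.mean()]  # heuristic initialization
    params, _ = curve_fit(vqeg_logistic, pred, mos, p0=p0, maxfev=10000)
    return vqeg_logistic(pred, *params)
```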
B. Implementation Details
In our experiments, each video database was divided into five non-overlapping subsets. Four subsets of the distorted videos were used for training and validation, while the remaining subset was used for testing, and five-fold cross-validation was conducted. Within each fold, 80% of the training-and-validation videos were used for training and 20% for validation. For the multi-channel CNN model, the CSIQ IQA database [70] was used for pre-training with (11), since it contains many distorted images with common, general, and diverse distortions and is therefore suitable as the baseline for image quality feature representation learning. All images were split into patches.
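A minimal sketch of this splitting protocol, assuming random video-level splits with scikit-learn (the paper's exact split procedure and seeds may differ), is given below.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def five_fold_splits(num_videos: int, seed: int = 0):
    """Yield (train, val, test) index splits: 5 folds, with an 80/20 train/val split inside each."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_val_idx, test_idx in kfold.split(np.arange(num_videos)):
        train_idx, val_idx = train_test_split(train_val_idx, test_size=0.2, random_state=seed)
        yield train_idx, val_idx, test_idx
```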
The mean and standard deviation of the frame-level quality feature representation are then extracted using (15) and concatenated with the temporal and color-aware features in (16)–(18) and (19) to form the per-frame feature vector fed into the GRU model.
C. Performance Evaluation on UGC VQA Databases
To comprehensively evaluate the efficiency of our proposed method, we trained and tested our model on the three UGC databases individually and compared its performance with other state-of-the-art NR-VQA approaches. Twelve NR-VQA methods were included: BRISQUE [17], NIQE [15], V-BLIINDS [21], VIIDEO [20], TLVQM [10], VSFA [12], CNN-TLVQM [13], HEKE [24], RAPIQUE [23], VIDEVAL [71], MDTVSFA [52], and LSCT-PHIQNet [30]. In particular, VSFA, CNN-TLVQM, RAPIQUE, and MDTVSFA use a CNN model pre-trained on the ImageNet classification task to extract content-aware features via transfer learning, and LSCT-PHIQNet uses a model pre-trained on an IQA task to extract quality-aware features. The means and standard deviations of the PLCC, SROCC, and RMSE results for the mentioned competitors and the proposed model on the KoNViD-1k, LIVE-Qualcomm, and LIVE-VQC video databases are given in Table 2. Following the method in [12], Table 2 also includes a weighted average that weights the results by the number of videos in each database to represent the overall performance. In Table 2, the best and second-best PLCC, SROCC, and RMSE performances are highlighted in bold and underlined, respectively.
As shown in Table 2, our proposed model achieves either the best or second-best performance in terms of PLCC, SROCC, and RMSE on these three UGC databases, with small standard deviations, indicating that our model is more robust. Although our model ranks second in PLCC on the KoNViD-1k database, it is on par with LSCT-PHIQNet, with only a slight difference of 0.003, and it outperforms LSCT-PHIQNet on the LIVE-Qualcomm and LIVE-VQC databases by about 0.02 to 0.03 in both PLCC and SROCC. Furthermore, on the LIVE-Qualcomm database, the correlation and accuracy (PLCC, SROCC, and RMSE) of our proposed model are superior to those of the other NR-VQA methods. Ours also achieves the best PLCC and SROCC on the LIVE-VQC database, outperforming the second-best method, CNN-TLVQM, by about 0.015 and 0.019 in PLCC and SROCC, respectively. In terms of the weighted average results, our proposed model obtains the best overall performance in PLCC and SROCC, with improvements of 0.007 and 0.014 over the second-best method, respectively.
The results in Table 2 make it evident that our proposed model outperforms the other NR-VQA methods and exhibits better effectiveness and generalization on all three video databases. This demonstrates that our proposed NR-VQA method is more robust and effective than other transfer learning/pre-trained model-based methods.
D. Performance Evaluation on Traditional VQA Databases
Unlike the UGC databases, which focus on in-capture distortions and videos in the wild, the traditional VQA databases focus on distortions introduced during compression and transmission, called post-capture distortions. Most existing NR-VQA methods are designed for either UGC or traditional databases. Therefore, we also tested our proposed model on the two traditional VQA databases, the LIVE and CSIQ video databases, and compared its performance with other state-of-the-art NR-VQA approaches, including V-BLIINDS [21], VIIDEO [20], SACONVA [46], TLVQM [10], VSFA [12], CNN-TLVQM [13], Wang's [72], RIRNet [25], and HEKE [24], to further demonstrate its effectiveness and generalization. Note that the results of SACONVA [46], Wang's [72], and RIRNet [25] are taken from their respective papers. Again, the best and second-best PLCC and SROCC performances are highlighted in bold and underlined, respectively.
The PLCC and SROCC results on the LIVE and CSIQ video databases are shown in Table 3, where the standard deviations are also provided in brackets. As we can see, our model outperforms all other NR-VQA methods. It is worth noting that although Wang's [72] adopts the saliency map and frame-difference information to train a CNN model for video quality assessment, it ignores the motion information that we believe is crucial to frame-level quality prediction, which affects the accuracy of perceptual quality prediction. In contrast, our proposed method focuses on HVP and motion information, incorporating the motion-aware region map into the CNN model to learn motion-aware and spatial-aware fusion features simultaneously via the semi-supervised learning and fine-tuning strategies. As a result, Table 3 shows that our proposed model yields significantly higher PLCC and SROCC than Wang's [72]. Besides, the proposed model achieves significant improvements in both PLCC and SROCC over the second-best method, HEKE. On the LIVE video database, our method outperforms HEKE [24] by about 0.053 and 0.051 in PLCC and SROCC, and improvements of 0.02 and 0.014 are achieved on the CSIQ video database. Therefore, our proposed model is shown to be universal and to achieve a strong correlation with human perception on both UGC and traditional databases.
E. Performance Evaluation on Cross Databases
To further verify the generalization capability of our proposed model on diverse contents and distortions, this section reports the performance on LIVE, KoNViD-1k, and LIVE-Qualcomm in a cross-database scenario, where LIVE is a traditional database focusing on post-capture distortions, KoNViD-1k is a UGC database focusing on videos in the wild with various contents, and LIVE-Qualcomm is a UGC database focusing on in-capture distortions. We trained our proposed model on one database and tested it on the other two. Then, under the same experimental scheme, the SROCC performance was compared with TLVQM [10], VSFA [12], CNN-TLVQM [13], and LSCT-PHIQNet [30]. Again, the best and second-best SROCC performances are highlighted in bold and underlined, respectively. Table 4 clearly shows that, in terms of SROCC, the generalization ability of our proposed model trained on each of the three databases surpasses that of the others.
Besides, we also randomly selected samples from all three video databases for training and used the remaining samples for evaluation. When trained on this combined database, as shown in Table 4, our model significantly outperforms the second-best methods, CNN-TLVQM and LSCT-PHIQNet, improving SROCC by 0.041, 0.035, and 0.04 on the LIVE, KoNViD-1k, and LIVE-Qualcomm video databases, respectively. Since our proposed model achieves satisfactory results in both the cross-database and combined-database scenarios, in which the three databases contain different and diverse video contents and distortions, its generalization capability is demonstrated, and we conclude that our model is applicable to different video contents and all types of distortions.
F. Ablation Study of Motion-Aware Region Map, Semi-Supervised Learning, and Fine-Tuning Strategies
As mentioned in Section III, based on the ROI concept, the SM in (5) can guide image quality feature learning by focusing on salient regions, just as the OFM in (6) can guide frame-level quality feature representation learning based on motion-aware regions, and the domain gap can then be reduced by the semi-supervised learning and fine-tuning strategies in (13). Therefore, to demonstrate the effects of the SM, the OFM, and these techniques on frame-level quality feature extraction, an ablation study was performed on the multi-channel CNN model using various training settings. It includes three combinations: VQA using the SM on the pre-trained multi-channel CNN model without the semi-supervised learning and fine-tuning strategies (Multi-CNN
As we can see in Table 5, PLCC and SROCC of Multi-CNN
G. Ablation Study of Channel Attention Mechanism
Table 6 shows an ablation study in which the SENet blocks (channel attention mechanism) of the multi-channel CNN model are replaced with regular residual blocks. The results show an improvement when the SENet blocks are used, demonstrating that the channel attention mechanism can reinforce crucial features and weaken inconsequential ones, resulting in better quality feature representation learning.
H. Ablation Study of Features Learning
This ablation study performs feature selection to analyze the performance gain from features, which also demonstrates the effectiveness of the feature learning ability of our proposed model. There are four groups in the experiment: prediction by our proposed model with the frame-level quality feature representation only (FQFR), prediction by our proposed model with the frame-level quality feature representation and temporal features (FQFR+TF), prediction by our proposed model with the frame-level quality feature representation and color-aware features (FQFR+CF), and prediction by our proposed model with all features (FQFR+TF+CF). The experimental results are shown in Table 7.
Although the frame-level quality feature representation with the GRU model already achieves a satisfactory result, incorporating the temporal or color-aware features is clearly helpful for predicting precise video quality scores with HVP characteristics. When only the frame-level quality feature representation is used, the PLCC results are around 0.898 and 0.861 on the LIVE and KoNViD-1k video databases. When the temporal features or the color-aware features are added, the PLCC on LIVE improves from 0.898 to 0.917 and 0.915, respectively. Furthermore, after combining the frame-level quality feature representation, temporal features, and color-aware features, our model achieves the best results by learning the gradient of temporal features and the temporal variation of spatial features along the time series as spatiotemporal features, thereby enhancing video quality prediction with the help of HVP characteristics.
I. Computational Complexity
Computational complexity is another major concern for applying VQA methods in practical applications. Therefore, we evaluate the computational complexity of our proposed model and six competing NR-VQA models for benchmarking. For a fair comparison, all methods were tested on the same device running Windows 10 with an Intel i9-10900K CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
The comparison of computational complexity is shown in Table 8. Two videos of different resolutions (720p and 1080p) from LIVE-VQC were used to measure the runtime of each NR-VQA method. Although our proposed model requires higher computational complexity than HEKE [24] and RAPIQUE [23], which perform temporal downsampling, it delivers better prediction accuracy, as shown in Fig. 4. Meanwhile, our model can also perform temporal downsampling by a factor of two to form a lighter model (Proposed_Light), i.e., extracting features every two frames instead of every frame, which further reduces the computational complexity at the cost of a slight drop in accuracy. Proposed_Light then has a complexity close to that of HEKE while still achieving highly accurate performance, as shown in Fig. 4. Apart from these two methods, our proposed model is faster than the other NR-VQA methods and has better PLCC performance. Specifically, compared with LSCT-PHIQNet [30] and CNN-TLVQM [13], our proposed model reduces the computational complexity by about 20% and 50%, respectively. Overall, Fig. 4 demonstrates that our proposed model offers the best trade-off between accuracy and computational complexity.
Conclusion
In this paper, using SSL, we developed an NR-VQA model that learns quality features through a multi-channel CNN with non-human-annotated labels and a GRU that takes HVP characteristics into account. First, we address the lack of human-annotated label data for the VQA task with an SSL-based multi-channel CNN built on an image quality feature learning method in the IQA domain. Second, we bridge the domain gap between the IQA and VQA tasks by adopting semi-supervised learning and fine-tuning strategies for the pre-trained CNN model, which take motion-aware information into consideration to optimize frame-level quality feature representation learning. Finally, a GRU model explores the spatiotemporal features and the gradient of temporal features of the video to estimate its quality by incorporating the frame-level quality feature representation and the HVP-related temporal and color-aware features. Experimental results demonstrate the robustness and generalization of our proposed model, which is practical for real applications and strongly correlated with human perception. In the future, one direction is to replace the GRU with a transformer to exploit long-range dependencies for further improvement. A lightweight NR-VQA model is also essential for the further development of real-time applications.