Introduction
In the era of information explosion, video sharing has grown dramatically on social networks. As Cisco forecasts [1], by 2022 there will be approximately 400 exabytes of IP traffic per month, of which 82% will be video traffic. However, videos are inevitably distorted during compression, processing, and transmission, thereby degrading the human visual experience (HVE) [2]. Consequently, to provide a better end-user experience, an accurate VQA approach is in high demand to preserve the quality of service.
Considering the limited time and labor involved, although subjective VQA methods estimate perceived video quality most accurately, they are generally used only to construct benchmark video quality databases. In contrast, objective VQA allows automatic video quality evaluation without enormous resources. The ultimate goal of objective VQA is to evaluate perceptual quality in a way that is highly correlated with subjective studies, so it has recently become an attractive and challenging topic for researchers. There are three types of objective VQA methods according to their use of the reference video [3]: full-reference (FR) VQA [4], [5], [6], [7] requires complete information from the reference video; reduced-reference (RR) VQA [8], [9] only takes part of the information from the reference video; no-reference (NR) VQA [10], [11], [12], [13] does not require any information from the reference video. Since the reference video is not always available in real VQA applications, the NR-VQA approach is preferable for evaluating video quality [14].
In the early stage, traditional NR-VQA methods [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21] were developed by exploring different spatial and temporal features. For example, TLVQM [10] extracts 75 spatial and temporal features from frames and predicts the final video score using support vector regression (SVR). However, such hand-crafted features target specific distortions only, limiting the performance and generalization of visual quality prediction.
Recently, many deep neural network (DNN) models have been proposed to learn data representations, hidden features, and abstract features automatically. However, directly applying DNNs to the VQA task faces two main challenges. The first challenge is the requirement of high computational power and a vast amount of memory. A raw video usually has a high spatial resolution and frame rate, and due to the limited memory of a graphics processing unit (GPU), it is hard to process a whole VQA database and train an end-to-end DNN model for the VQA task. Therefore, some existing NR-VQA methods use spatial downscaling and/or temporal downsampling strategies to reduce the computational requirements and achieve end-to-end training. For example, the spatial resolution of the video in [22] is downsized to a lower resolution before being fed into the network.
To alleviate this issue, most deep learning-based NR-VQA models separate the spatial and temporal learning processes to avoid both excessive computational demand at once and information loss. However, in the databases used for subjective VQA studies, each video contains only one mean opinion score (MOS) as ground truth to represent overall video quality. There is no human-annotated MOS label for each frame, i.e., for the frame-level quality. Since a video can contain thousands of frames or more, it is impractical to annotate the frame-level quality for spatial feature learning, which poses another challenge for DNN training in the VQA task, as it relies on large-scale data with reliable labels. Therefore, to ease the labeling burden of training a DNN from scratch, pre-trained CNN models, such as ResNet [26] pre-trained on ImageNet [27], are used by state-of-the-art NR-VQA methods such as VSFA [12] and CNN-TLVQM [13]. These NR-VQA methods transfer the features learned on the image classification task to the VQA target domain via transfer learning [28], [29]. However, the features learned from image classification can only provide a sub-optimal feature representation because a domain gap exists between the source image classification task and the target VQA domain.
Besides, some NR-VQA methods use a DNN model pre-trained on the IQA domain for spatial feature learning, assuming that the spatial quality features of frames are close to the IQA domain. For example, the PHIQNet model in [30] is pre-trained on IQA to extract the perceptual quality features of video frames, which are then fed into a long short-term convolutional transformer (LSCT) model for temporal pooling. However, unlike still images, consecutive video frames carry motion information, and human visual attention is attracted more to regions with motion events than to structural details or the background [31]. Therefore, using only an IQA pre-trained model to extract frame perceptual quality while ignoring motion information can only provide a sub-optimal frame quality feature representation.
Inspired by SSL, this paper proposes a multi-channel CNN model that uses non-human-annotated supervision signals for frame-level quality feature learning, together with a GRU model that takes HVP characteristics into account for NR-VQA. First, the multi-channel CNN with a channel attention mechanism is pre-trained on the IQA domain with distorted images and their corresponding structure-aware maps and saliency maps to learn the image quality feature representation guided by non-human-annotated supervision signals, motivated by SSL using pretext tasks [32], [33], [34], [35], [36], [37], [38], [39]. For example, RotNet [32] predicts image rotation as a pretext task to learn the image representation before fine-tuning for image classification. Other pretext tasks include image or video colorization [33], [34], [35], jigsaw puzzles [36], relative position [37], pixel generation (iGPT) [38], and visual token reconstruction (BEiT) [39]. In addition, since human visual attention is attracted more by regions with motion events than by structural details, we perform semi-supervised learning to fine-tune the pre-trained CNN and further reduce the domain gap. To incorporate motion-aware information into the video frame, the unlabeled distorted frame and its corresponding structure-aware map and motion-aware map are fed into the pre-trained CNN to predict a pseudo label, which is treated as the frame quality label and thereby addresses the lack of human-annotated label data for video frames. Then, the IQA data and the VQA data are combined for fine-tuning to transfer the feature learning from the IQA to the VQA domain. This achieves a better frame-level quality feature representation while considering motion-aware information of a video frame. Besides, temporal and color-aware features that are highly related to HVP [40], [41], such as motion intensity, video smoothing, and color descriptions in the HSV color space, are also extracted and combined with the frame-level quality feature representation as the input of the GRU model to obtain a precise final predicted video quality. The contributions of this work are summarized as follows:
To compensate for the shortage of human-annotated labels on video frames for the VQA task, we are the first to adopt a self-supervised learning (SSL)-based NR-VQA framework built on non-human-annotated supervision signals for frame-level quality feature learning. The details of this contribution are presented in Section III-A.
On top of the SSL-based NR-VQA framework, we devise a semi-supervised learning scheme to fine-tune the pre-trained CNN, as described in Section III-A.3. Our objective is to reduce the domain gap by taking motion-aware information into consideration, thereby providing an optimized frame-level quality feature representation for the VQA task.
We also extract HVP-related features to assist perceived video quality prediction. All features are then fed into the GRU model with pre-padding and masking strategies to comprehensively evaluate the perceived quality of the whole video. This contribution is described in Section III-B.
By evaluating our model on three UGC VQA databases and two traditional-distortion VQA databases, we verify that it provides a better frame-level quality feature representation for various distortions and contents and predicts video quality in close agreement with HVP compared with other state-of-the-art transfer learning/pre-trained model-based VQA methods.
The rest of this paper is organized as follows. In Section II, we present the relevant research work. In Section III, the details of our proposed model are described. Then, the experimental results and related analysis are presented in Section IV. Finally, Section V concludes the paper.
Related Work in NR-VQA
A. Traditional Methods
The general NR-VQA pipeline contains two key components: discriminative feature extraction and accurate quality prediction. Since spatial information is a vital aspect of HVE, some successful and efficient IQA methods [42], [43], [44] were exploited to develop the spatial feature extraction algorithms of several NR-VQA approaches. For example, some NR-VQA methods [15], [16], [17], [18] use an NR-IQA method built on the natural scene statistics (NSS) model to estimate frame quality from the statistical properties of the spatial information, and then aggregate the frame-by-frame scores of the distorted videos using average pooling or regression. However, videos, with their 3D information, differ from images: their characteristics contain not only spatial information but also temporal information. Therefore, several NR-VQA methods take temporal features into account. Manasa and Channappayya [19] proposed an optical flow-based NR-VQA algorithm that measures irregularities at the patch and frame levels. The video intrinsic integrity and distortion evaluation oracle (VIIDEO) [20] observes the intrinsic statistical regularities of natural videos and uses them to quantify disturbances introduced by distortions. Saad et al. [21] proposed a blind VQA method, V-BLIINDS, that assesses frame quality using a spatiotemporal NSS model in the discrete cosine transform (DCT) domain and quantifies motion coherency to predict the video quality.
B. Deep Learning-Based Methods
It is well known that neural networks can automatically learn data representations, hidden features, and abstract features. The CNN is a typical type of DNN that can extract discriminative, semantic, and comprehensive features of images and videos. Therefore, many deep learning-based methods have been adopted for NR-VQA. For instance, in [45], the 3D-DCT is used to represent the spatiotemporal features of video blocks, and the deformation of AC coefficients is formed to capture the temporal features. A CNN model and a frequency histogram mapping function are then employed to explore the spatiotemporal regularities and obtain the final video quality score. SACONVA [46] uses the 3D shearlet transform to extract primary spatiotemporal features, which can also capture the NSS properties of video blocks. Afterward, a CNN and regression are applied to further expand these features and predict the video quality. DeepBVQA [47] uses a CNN model to extract various spatial features, handcrafts the sharpness variation as the temporal feature, and finally aggregates and regresses the features to obtain the quality score.
Moreover, Tran et al. [48] proposed a 3D CNN model to extract spatiotemporal features, addressing the problem that a 2D CNN cannot directly extract temporal information from videos. You and Korhonen [49] proposed an NR-VQA model based on a 3D CNN and a long short-term memory (LSTM) [50] model to extract spatiotemporal features from video blocks and handle the time-series processing of those blocks. Wu et al. [51] also proposed an NR-VQA model based on a 3D CNN and an LSTM model; it constructs the spatial attention map of video blocks and combines it with the corresponding predicted similarity map to further extract spatial quality information via average pooling and standard deviation pooling. These features are then fed into the LSTM model to predict the overall video quality. Besides, Yi et al. [22] proposed an end-to-end training model for the VQA task: first, a VGG16 model extracts the spatial features, with an attention module added to calculate the dependency between local spatial features, and then a GRU and memory function are used to obtain the final video quality score.
C. Transfer Learning and Pre-Training Based Methods
To compensate for the lack of sufficient training samples to train a robust deep CNN model, some NR-VQA methods learn features in other domains and then transfer them to the VQA target domain via transfer learning. VSFA [12] extracts content-aware features from a CNN model pre-trained on an image classification task and then predicts the video quality using a GRU temporal-memory model; the authors further improved this method by training on mixed datasets in [52]. CNN-TLVQM [13] combines the handcrafted human visual system (HVS) features extracted by TLVQM [10] with the spatial features obtained from a pre-trained CNN via transfer learning, and then uses an SVR model to compute the predicted quality score. Chu et al. [53] also use a CNN pre-trained on an image classification task to extract spatial features as well as horizontal and vertical spatiotemporal slice features of frames; these features are learned by a multi-layer perceptron (MLP) to predict frame-level quality, and an SVR fuses the MLP scores into a final score. LSCT-PHIQNet [30] pre-trains PHIQNet on an IQA task and then feeds the extracted features into an LSCT model as a temporal regression model to predict the final video quality. PVQ [54] uses a pre-trained IQA model to extract spatial features of frames and a model pre-trained on a video classification task to extract spatiotemporal features of 3D clips; the final video quality is then predicted after spatiotemporal pooling and time-series regression with an inception-time model. HEKE [24] creates a large-scale video dataset with weak labels to pre-train a feature encoder that extracts the spatiotemporal representation of a video and then uses the pre-trained encoder with hierarchical feature regression to predict the video quality. RIRNet [25] extracts spatial quality features from a model pre-trained on an image classification task and then predicts the video quality through motion effect modeling. However, since the features learned from other tasks are not closely related to the VQA target domain, we believe there is still room for improvement in transfer learning/pre-trained model-based NR-VQA approaches by reducing the feature gap between the source domain and the VQA domain.
Proposed Method
In this section, we introduce a novel NR-VQA method that adopts a new multi-channel CNN model with a GRU, incorporating motion-aware information and HVP characteristics. The framework of our proposed model is shown in Fig. 1. First, the multi-channel CNN is pre-trained on an IQA database to predict image quality features focusing on structure-aware features and salient regions, which can be regarded as an SSL-based method using a pretext task. Then, with semi-supervised learning and fine-tuning strategies, the features learned by the pre-trained CNN are refined to predict a frame-level quality feature representation focusing on structure-aware features and motion-aware regions, transferring the feature learning from the IQA to the VQA domain and reducing the domain gap for a better feature representation. In the meantime, HVP-related temporal and color-aware features are also extracted. Lastly, all features are fed into the GRU model to explore the spatiotemporal features and the gradient of temporal features and thereby comprehensively evaluate the quality of the whole video. We detail each part in the following subsections.
A. SSL-Based Multi-Channel CNN Model for VQA
The HVS is known to be sensitive to moving objects [31]. Hence, visual attention is attracted more by motion event regions than by the structural details of the video, and distortions occurring on moving objects affect perceptual quality more than those occurring in the background or spatial structures. However, most existing transfer learning/pre-trained CNN models used in NR-VQA extract the spatial or content-aware features of the whole frame to represent frame quality without considering motion information and motion-aware regions, which widens the domain gap between the source domain and the target VQA domain and can only provide a sub-optimal feature representation for the VQA task. Besides, human-annotated labels for frame quality are not available in the VQA databases, and the subjective quality score of a video cannot represent frame quality because the distortion varies over time and frames. In other words, there is no human-annotated MOS label that represents frame-level quality with motion. To address this issue, we introduce SSL, a form of unsupervised learning that lets a network learn critical features from unlabeled data through non-human-annotated supervision signals, into the proposed VQA framework to guide and pre-train the multi-channel CNN model on both IQA and VQA databases so that it learns the frame-level quality feature representation from the non-human-annotated supervision signal.
Based on the concept of the region of interest (ROI), we hypothesize that the SM can guide image quality prediction by focusing on important still regions, while the motion-aware region map can guide frame quality prediction by focusing on motion-aware regions. Therefore, we apply semi-supervised learning on top of our SSL-based multi-channel CNN model and combine the IQA data and the VQA data to fine-tune the model, as shown in Fig. 2, processing the distorted frame, the structure-aware map, and the motion-aware region map to estimate an optimized frame-level quality feature representation that considers both spatial and motion-aware information at the frame level.
Fig. 2. The network architecture of our proposed multi-channel CNN model. (a) Residual Block (ch); (b) Residual Block 2 (ch); (c) SE-Residual Block (ch); (d) SE-Residual Block 2 (ch). Conv(ch, kn, st, pd) denotes a 2D convolution, where ch is the output channel and kn, st, and pd are the kernel size, stride, and padding, respectively.
1) Pre-Processing Stage
Before training the multi-channel CNN model, we first compute the gradient magnitude map (GMM) as the structure-aware map, because the GMM of an image is responsive to image distortions such as compression, blur, and noise, and effectively captures local image structures, to which the HVS is highly sensitive. As demonstrated in the image processing literature [55], the GMM therefore reflects the structural information of images. As shown in Fig. 3(b), the GMM reveals the rich structural information of Fig. 3(a). The GMM of the input distorted image $I^{d}$ is computed from its horizontal and vertical gradients as \begin{equation*} {\text {GMM}}^{d}=\sqrt {{(I^{d}\ast g_{h})}^{2}+{(I^{d}\ast g_{v})}^{2}} \tag{1}\end{equation*}
where $\ast$ denotes convolution and $g_{h}$ and $g_{v}$ are the horizontal and vertical Prewitt filters \begin{align*} g_{h}=\left [{ {\begin{array}{ccc} 1/3 & 0 & -1/3\\ 1/3 & 0 & -1/3\\ 1/3 & 0 & -1/3\\ \end{array}} }\right],\quad g_{v}=\left [{ {\begin{array}{ccc} 1/3 & 1/3 & 1/3\\ 0 & 0 & 0\\ -1/3 & -1/3 & -1/3\\ \end{array}} }\right] \tag{2}\end{align*}
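For illustration, a minimal NumPy/SciPy sketch of the GMM computation in (1)–(2) is given below; the function name and the border-handling mode are our own choices and are not specified in the paper.

```python
# Minimal sketch of the GMM computation in (1)-(2).
import numpy as np
from scipy.ndimage import convolve

def gradient_magnitude_map(frame_gray: np.ndarray) -> np.ndarray:
    """Structure-aware map (GMM) of a grayscale image."""
    # 1/3-weighted Prewitt kernels, as defined in (2)
    g_h = np.array([[1, 0, -1],
                    [1, 0, -1],
                    [1, 0, -1]], dtype=np.float32) / 3.0
    g_v = g_h.T  # vertical Prewitt kernel
    gx = convolve(frame_gray.astype(np.float32), g_h, mode='nearest')
    gy = convolve(frame_gray.astype(np.float32), g_v, mode='nearest')
    return np.sqrt(gx ** 2 + gy ** 2)  # eq. (1)
```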
Fig. 3. Results from the pre-processing stage. (a) Original frame; (b) Gradient magnitude map; (c) Saliency map; (d) Optical flow map.
Moreover, to compute the SM of the image, we first apply the method in [56] to determine the saliency residual, because the log-spectrum is sensitive to the NSS that indicate the salient region of the image. The saliency residual $\mathcal {R}(f)$ is computed as the difference between the log-spectrum $\mathcal {L}(f)$ and its locally averaged version $\mathcal {A}(f)$, \begin{equation*} \mathcal {R}\left ({f }\right)=\mathcal {L}(f)-\mathcal {A}(f) \tag{3}\end{equation*}
in the spectral domain, and a preliminary saliency map, PSM, is obtained by transforming the residual back to the spatial domain via the inverse Fourier transform $\mathcal {F}^{-1}$, \begin{equation*} \text {PSM}(f)=\mathcal {F}^{-1}\left ({\mathcal {R}\left ({f }\right) }\right) \tag{4}\end{equation*}
Finally, the SM of the distorted image $I^{d}$ is obtained by applying the smoothing filter $\text {VSF}(\cdot)$ to its PSM, \begin{equation*} {\text {SM}}^{d}=\text {VSF}\left ({\text {PSM}\left ({I^{d} }\right) }\right) \tag{5}\end{equation*}
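For reference, a minimal NumPy/SciPy sketch of the spectral-residual saliency computation outlined in (3)–(5) is shown below. Following [56], the local average $\mathcal{A}(f)$ is approximated by a box filter on the log-amplitude spectrum, the phase is retained when transforming back to the spatial domain, and a Gaussian filter stands in for the smoothing step VSF; the filter sizes are illustrative choices, not values from the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, gaussian_filter

def saliency_map(frame_gray: np.ndarray, avg_size: int = 3, sigma: float = 2.5) -> np.ndarray:
    """Spectral-residual saliency map, roughly following (3)-(5) and [56]."""
    spectrum = np.fft.fft2(frame_gray.astype(np.float32))
    log_amp = np.log(np.abs(spectrum) + 1e-8)              # L(f)
    phase = np.angle(spectrum)
    avg_log_amp = uniform_filter(log_amp, size=avg_size)   # A(f), box-filtered log-spectrum
    residual = log_amp - avg_log_amp                       # R(f), eq. (3)
    # Back to the spatial domain with the original phase, cf. eq. (4)
    psm = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    sm = gaussian_filter(psm, sigma=sigma)                 # smoothing step (VSF), eq. (5)
    return sm / (sm.max() + 1e-8)
```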
For the VQA data, the motion-aware map of frame $t$ is its optical flow map (OFM), computed between two consecutive distorted frames as \begin{equation*} {\text {OFM}}_{t}=\text {ME}\left ({\text {PET}\left ({I_{t-1}^{d} }\right),\text {PET}\left ({I_{t}^{d} }\right) }\right) \tag{6}\end{equation*}
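Since the operators ME(·) and PET(·) in (6) are not fully specified in this excerpt, the sketch below assumes dense Farnebäck optical flow (OpenCV) computed on grayscale-converted frames, with the per-pixel flow magnitude used as the OFM.

```python
import cv2
import numpy as np

def optical_flow_map(prev_frame_bgr: np.ndarray, cur_frame_bgr: np.ndarray) -> np.ndarray:
    """Motion-aware map (OFM) between two consecutive distorted frames, cf. eq. (6)."""
    prev_gray = cv2.cvtColor(prev_frame_bgr, cv2.COLOR_BGR2GRAY)
    cur_gray = cv2.cvtColor(cur_frame_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, cur_gray, None,
                                        pyr_scale=0.5, levels=3, winsize=15,
                                        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    # Per-pixel motion magnitude as the optical flow map
    return np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)
```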
After the pre-processing stage, the GMM and SM of images are computed for the IQA dataset, and the GMM and OFM of frames are extracted for the VQA dataset for the subsequent training process.
2) Self-Supervised Learning-Based CNN Model Pre-Training
To train our multi-channel CNN model, inspired by [59] and [60], which use distortion intensity as the self-supervised signal for a regression task in SSL, we use the gradient magnitude similarity deviation (GMSD) [55] as the non-human-annotated supervision signal, the so-called pseudo label (PL), denoted as ${\text {PL}}_{I}^{d}$ for the distorted image $I^{d}$.
To compute the structure-based quality features, we extract the global spatial features $F_{\text {GSF}}$ by summing the features extracted from the image channel, $F_{\text {IC}}$, and the structure channel, $F_{\text {SC}}$, \begin{equation*} F_{\text {GSF}}=F_{\text {IC}}(I^{d})\oplus F_{\text {SC}}({\text {GMM}}^{d}) \tag{7}\end{equation*} where $\oplus$ denotes element-wise summation.
Also, we incorporate the squeeze-and-excitation block [61] into the residual block, which squeezes the features into one-dimensional global information and then reinforces critical features while weakening inconsequential features through channel-wise multiplication. By applying this channel attention mechanism ${\text {SEN}}_{S}$ to the global spatial features, the spatial-aware features $F_{\text {SAF}}$ are obtained as \begin{equation*} F_{\text {SAF}}={\text {SEN}}_{S}\left ({F_{\text {GSF}} }\right) \tag{8}\end{equation*}
In the meantime, the features extracted from the SM channel, $F_{\text {RC}}({\text {SM}}^{d})$, are concatenated with the spatial-aware features $F_{\text {SAF}}$ and passed through another channel attention block ${\text {SEN}}_{R}$ to obtain the fused region-aware features $F_{\text {RASAF}}$, \begin{equation*} F_{\text {RASAF}}={\text {SEN}}_{R}\left ({F_{\text {SAF}}\otimes F_{\text {RC}}\left ({{\text {SM}}^{d} }\right) }\right) \tag{9}\end{equation*} where $\otimes$ denotes concatenation.
Finally, two fully connected layers regress $F_{\text {RASAF}}$ into the predicted pseudo-label score $\hat {p}^{L}$, \begin{equation*} \hat {p}^{L}={\mathrm {MultiCNN}}_{\mathrm {IQA}}(U_{I}^{d})={\mathrm {FC}}_{2}\left ({{\mathrm {FC}}_{1}\left ({F_{\mathrm {RASAF}} }\right) }\right) \tag{10}\end{equation*} where $U_{I}^{d}$ denotes the input of the distorted image together with its GMM and SM.
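The following PyTorch sketch illustrates the fusion path of (7)–(10): element-wise summation of the image- and structure-channel features, SE channel attention, concatenation with the region-channel features, a second SE block, and two fully connected layers. The backbone layout, channel widths, and the global average pooling before the FC layers are simplifications of ours and do not reproduce the exact architecture in Fig. 2.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel attention [61]."""
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid())

    def forward(self, x):
        w = self.fc(x).unsqueeze(-1).unsqueeze(-1)
        return x * w  # channel-wise re-weighting

def conv_backbone(in_ch: int, out_ch: int = 64) -> nn.Sequential:
    """Placeholder per-channel feature extractor (stands in for the residual blocks)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))

class MultiChannelCNN(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        self.f_ic = conv_backbone(3, ch)   # image channel, F_IC
        self.f_sc = conv_backbone(1, ch)   # structure channel (GMM), F_SC
        self.f_rc = conv_backbone(1, ch)   # region channel (SM or OFM), F_RC
        self.sen_s = SEBlock(ch)           # SEN_S in (8)
        self.sen_r = SEBlock(2 * ch)       # SEN_R in (9)
        self.fc1 = nn.Linear(2 * ch, 128)
        self.fc2 = nn.Linear(128, 1)

    def forward(self, img, gmm, region_map):
        f_gsf = self.f_ic(img) + self.f_sc(gmm)                                  # eq. (7)
        f_saf = self.sen_s(f_gsf)                                                # eq. (8)
        f_rasaf = self.sen_r(torch.cat([f_saf, self.f_rc(region_map)], dim=1))   # eq. (9)
        pooled = f_rasaf.mean(dim=(2, 3))   # global average pooling (our assumption)
        return self.fc2(self.fc1(pooled)).squeeze(-1)                            # eq. (10)
```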
The multi-channel CNN is pre-trained by minimizing the mean squared error between the predicted scores and the GMSD pseudo labels over the $M$ training samples, \begin{equation*} \mathcal {L}_{\text {self}}(\hat {p}^{L},{\text {PL}}_{I}^{d})=\frac {1}{M}\sum \nolimits _{i=0}^{M-1} {(\hat {p}_{i}^{L}-{\text {PL}}_{I,i}^{d})}^{2} \tag{11}\end{equation*}
3) Semi-Supervised Learning for Fine-Tuning
As mentioned before, frame quality feature extraction that ignores motion information and motion-aware regions creates a domain gap between the source IQA and target VQA tasks. To transfer the feature representation of our multi-channel CNN model from the IQA to the VQA domain, we incorporate motion-aware information into the model to compute the frame-level quality feature representation, using semi-supervised learning and fine-tuning strategies built on our pre-trained multi-channel CNN model.
Following the same ROI concept, we assume that the OFM can guide video frame quality prediction by focusing on the motion-aware region, just as the SM guides image quality prediction by focusing on salient still-structure regions. Therefore, we replace the SM with the OFM as the region-aware map for the VQA data.
Besides, as mentioned above, there is no human-annotated label for each video frame. Therefore, for each video frame of the training VQA dataset, the pseudo label ${\text {PL}}_{V}^{d}$ is generated by the pre-trained multi-channel CNN, denoted ${\text {MultiCNN}}_{\text {trans}}$, \begin{equation*} {\text {PL}}_{V}^{d}={\text {MultiCNN}}_{\text {trans}}(U_{V}^{d}) \tag{12}\end{equation*} where $U_{V}^{d}$ denotes the input of the distorted frame together with its GMM and OFM.
After preparing the pseudo-labeled VQA data, the IQA and VQA data are combined and the model is fine-tuned with the semi-supervised loss \begin{equation*} \mathcal {L}_{\text {semi}}=\mathcal {L}_{\text {self}}(\hat {p}^{L},{\text {PL}}_{I}^{d})+a(k)\mathcal {L}_{\text {self}} (\hat {u}^{L},{\text {PL}}_{V}^{d}) \tag{13}\end{equation*} where $\hat {u}^{L}$ is the prediction for a VQA frame and $a(k)$ is an epoch-dependent weighting function,
\begin{align*} a\left ({k }\right)=\begin{cases} 0&k< K_{1}\\ \left [{ (k-K_{1})/(K_{2}-K_{1}) }\right]a_{f}&K_{1}\le k< K_{2}\\ a_{f}&k\ge K_{2}\\ \end{cases} \tag{14}\end{align*} where $k$ is the training epoch index, $K_{1}$ and $K_{2}$ define the ramp-up period, and $a_{f}$ is the final weighting factor.
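A minimal PyTorch-style sketch of the ramp-up weight in (14) and the combined loss in (13) is shown below; the default values of K1, K2, and a_f are placeholders rather than the settings used in the paper.

```python
import torch.nn.functional as F

def rampup_weight(k: int, k1: int, k2: int, a_f: float) -> float:
    """Ramp-up weighting a(k) in (14): zero before K1, linear until K2, then a_f."""
    if k < k1:
        return 0.0
    if k < k2:
        return (k - k1) / (k2 - k1) * a_f
    return a_f

def semi_supervised_loss(pred_iqa, pl_iqa, pred_vqa, pl_vqa, k, k1=10, k2=30, a_f=1.0):
    """Combined loss of (13): supervised IQA term plus weighted pseudo-labeled VQA term."""
    loss_iqa = F.mse_loss(pred_iqa, pl_iqa)   # L_self on IQA data, eq. (11)
    loss_vqa = F.mse_loss(pred_vqa, pl_vqa)   # L_self on pseudo-labeled VQA frames
    return loss_iqa + rampup_weight(k, k1, k2, a_f) * loss_vqa
```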
Training Process of Semi-Supervised Learning Performed on Our SSL-Based Multi-Channel CNN Model
As can be seen in (13) and (14), only the supervised IQA term contributes to the loss during the first $K_{1}$ epochs; the pseudo-labeled VQA term is then ramped up linearly until its weight reaches $a_{f}$, so that the model gradually shifts from IQA supervision toward the VQA domain.
4) Frame-Level Quality Features Extraction
Specifically, the frame-level quality features, FQFR, are extracted at the output of the first fully connected layer, ${\mathrm {FC}}_{1}$, of the fine-tuned multi-channel CNN. Since each frame is split into $B$ patches, the frame-level quality feature representation of frame $t$ is obtained by pooling the mean and standard deviation of the patch-level features, \begin{equation*} {\text {FQFR}}_{t}=\left \{{\mu {\{{\text {FQFR}}_{b}\}}_{b=1}^{b=B},\sigma {\{{\text {FQFR}}_{b}\}}_{b=1}^{b=B} }\right \} \tag{15}\end{equation*} where $\mu \{\cdot \}$ and $\sigma \{\cdot \}$ denote the mean and standard deviation over the $B$ patches, respectively.
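As a sketch of (15), assuming the fine-tuned model exposes a hypothetical `extract_fc1` method that returns the first fully connected layer's output for a batch of patches, the frame-level representation can be pooled as follows.

```python
import torch

@torch.no_grad()
def frame_quality_features(model, patches, gmm_patches, ofm_patches):
    """Frame-level quality feature representation of one frame, cf. eq. (15).

    The inputs are tensors of shape (B, C, H, W) holding the B patches of a frame;
    `model.extract_fc1` is an assumed helper returning the FC1 output of shape (B, D).
    """
    feats = model.extract_fc1(patches, gmm_patches, ofm_patches)   # (B, D)
    return torch.cat([feats.mean(dim=0), feats.std(dim=0)])        # mean/std pooling over patches
```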
B. HVP-Related Features Extraction
Although the proposed multi-channel CNN considers the inter-frame motion-aware information to extract an optimized frame-level quality feature representation, certain motion events, such as sudden scene changes, frame freezes, and new object appearances, as well as large color variations, can also have a significant impact on HVP [40], [41]. Following the works in [10], [13], and [23], which considered human perception and HVS features, we also account for HVP characteristics, which reflect the comprehensive perceived quality under the HVS, by extracting additional temporal and color-aware features of frames and feeding them, together with the frame-level quality features, into the GRU model to assist in precisely predicting video quality that is close to human perception.
For the temporal features, as mentioned, optical flow is used to determine the inter-frame motion variation that reflects temporal attention. Therefore, based on the OFM in (6), we further calculate the global motion intensity (GMI), the size of the motion event region (MER), and the mean (${\text {MV}}_{\mu}$) and standard deviation (${\text {MV}}_{\sigma}$) of the non-zero motion magnitudes of frame $t$ as follows:\begin{align*} {\text {GMI}}_{t}&=\frac {1}{W\times H}\sum \nolimits _{x=0}^{W-1} \sum \nolimits _{y=0}^{H-1} {{\text {OFM}}_{t}\left ({x,y }\right)} \\ {\text {MER}}_{t} &=\frac {1}{W\times H}\sum \nolimits _{x=0}^{W-1} \sum \nolimits _{y=0}^{H-1} {{\text {MR}}_{t}(x,y)} \\ {\text {MV}}_{\mu }^{t}&=\mu (\left \{{{{\text {OFM}}_{t}\left ({x,y }\right)}\thinspace \vert \thinspace {{\text {OFM}}_{t}\left ({x,y }\right)>0}}\right \}) \\ {\text {MV}}_{\sigma }^{t}&=\sigma (\left \{{{{\text {OFM}}_{t}\left ({x,y }\right)}\thinspace \vert \thinspace {{\text {OFM}}_{t}\left ({x,y }\right)>0}}\right \}) \tag{16}\end{align*} where $W$ and $H$ are the frame width and height,
and ${\mathrm {MR}}_{t}(x,y)$ indicates whether pixel $(x,y)$ belongs to a motion region, \begin{align*} {\mathrm {MR}}_{t}\left ({x,y }\right)=\begin{cases} 1&{\mathrm {OFM}}_{t}\left ({x,y }\right)>0\\ 0&{\mathrm {OFM}}_{t}\left ({x,y }\right)=0\\ \end{cases} \tag{17}\end{align*}
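A minimal NumPy sketch of the temporal features in (16)–(17), computed from the optical flow map of a single frame, is given below.

```python
import numpy as np

def motion_features(ofm: np.ndarray) -> dict:
    """Temporal HVP features of frame t from its optical flow map, eqs. (16)-(17)."""
    h, w = ofm.shape
    moving = ofm > 0                          # motion-region indicator MR_t, eq. (17)
    gmi = ofm.sum() / (w * h)                 # global motion intensity
    mer = moving.sum() / (w * h)              # relative size of the motion event region
    nonzero = ofm[moving]
    mv_mu = float(nonzero.mean()) if nonzero.size else 0.0
    mv_sigma = float(nonzero.std()) if nonzero.size else 0.0
    return {"GMI": gmi, "MER": mer, "MV_mu": mv_mu, "MV_sigma": mv_sigma}
```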
Also, since human eyes are sensitive to sudden scene changes, frame freeze, and new object appearance in videos, we also use the Structural Similarity Index Measure (SSIM) [63] to measure the structural similarity between frames to indicate their structural difference and represent the video smoothing (VS) by:\begin{equation*} {\text {VS}}_{t}=\text {SSIM}(I_{t-1}^{d},I_{t}^{d}) \tag{18}\end{equation*}
To extract the color-aware features, we first convert the frame into the HSV color space, since it is a color description model that is more consistent with human perception. In addition to human attention being drawn to color variation, spatial distortions may also be reflected in the color domain. Thus, we compute the standard deviation within the frame and the mean-square error (MSE) between frames in the hue (H) and saturation (S) channels to represent the color spatial distortion (CSD) and the level of color variation (CV) as follows:\begin{align*} {\text {CSD}}_{t}&=\left \{{\sigma \left ({H\left ({I_{t}^{d} }\right) }\right),\sigma \left ({S\left ({I_{t}^{d} }\right) }\right)}\right \} \\ {\text {CV}}_{t}&=\left \{{\text {MSE}\left ({H\left ({I_{t-1}^{d} }\right),H\left ({I_{t}^{d} }\right) }\right),}\right. \\ &\quad \left.{\text {MSE}\left ({S\left ({I_{t-1}^{d} }\right),S\left ({I_{t}^{d} }\right) }\right) }\right \} \tag{19}\end{align*}
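The sketch below computes the video smoothing feature of (18) and the color-aware features of (19) for a pair of consecutive frames; evaluating SSIM on grayscale frames and using OpenCV's 8-bit HSV conversion are our assumptions where the paper does not specify the details.

```python
import cv2
import numpy as np
from skimage.metrics import structural_similarity

def hvp_color_temporal_features(prev_bgr: np.ndarray, cur_bgr: np.ndarray) -> dict:
    """Video smoothing (18) and color-aware features (19) between consecutive frames."""
    vs = structural_similarity(cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY),
                               cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2GRAY))   # eq. (18)
    prev_hsv = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    cur_hsv = cv2.cvtColor(cur_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    h_prev, s_prev = prev_hsv[..., 0], prev_hsv[..., 1]
    h_cur, s_cur = cur_hsv[..., 0], cur_hsv[..., 1]
    csd = (h_cur.std(), s_cur.std())                       # color spatial distortion, eq. (19)
    cv_feat = (np.mean((h_cur - h_prev) ** 2),             # color variation as inter-frame MSE
               np.mean((s_cur - s_prev) ** 2))
    return {"VS": vs, "CSD": csd, "CV": cv_feat}
```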
C. Video Quality Prediction Via GRU Model
The GRU is a well-known recurrent neural network that processes input sequences iteratively: the node output at the current timestamp is fed back as input to the node at the next timestamp, which makes the GRU efficient at extracting temporal features. Consequently, the GRU can make predictions from time-series data and explore the spatiotemporal regularities of distorted videos for our VQA task. We therefore take advantage of the GRU model to learn the temporal variation of the frame-level quality feature representation and the additional HVP-related features along the time series, representing the spatiotemporal characteristics of the video. By analyzing the entire temporal data sequence, the GRU model can also reveal the gradient of the temporal features, which comprehensively reflects the quality of the whole video.
First, we concatenate the frame-level quality feature representation with the temporal and color-aware features to form a per-frame feature vector, which is fed into the GRU model to predict the video quality score $\hat {v}^{L}$. The GRU model is trained by minimizing the mean squared error between the predicted scores and the ground-truth MOS $v^{L}$ over $N$ training videos, \begin{equation*} \mathcal {L}_{\text {GRU}}=\frac {1}{N}\sum \nolimits _{i=0}^{N-1} {(\hat {v}_{i}^{L}-v_{i}^{L})}^{2} \tag{20}\end{equation*}
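A schematic PyTorch sketch of the GRU regressor is given below; packed sequences are used here as a stand-in for the paper's pre-padding and masking scheme, and the hidden size is an illustrative choice.

```python
import torch
import torch.nn as nn

class GRUQualityPredictor(nn.Module):
    """Temporal regression of per-frame feature vectors into one video quality score."""
    def __init__(self, feat_dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats, lengths):
        """feats: (N, T_max, feat_dim) padded sequences; lengths: true frame counts."""
        packed = nn.utils.rnn.pack_padded_sequence(
            feats, lengths.cpu(), batch_first=True, enforce_sorted=False)
        _, h_n = self.gru(packed)               # padded steps are skipped (masking)
        return self.head(h_n[-1]).squeeze(-1)   # predicted video quality scores
```

Training then amounts to minimizing `torch.nn.functional.mse_loss(pred, mos)` over a batch of videos, matching (20).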
Hence, during inference, we first extract the features in (15), (16)–(18), and (19) from all frames of a video to obtain the per-frame feature vectors, which are then fed into the trained GRU model to predict the final video quality score.
Experimental Results
A. Video Quality Databases and Evaluations
To demonstrate the validity and robustness of our proposed model, three UGC VQA databases (KoNViD-1k [64], LIVE-Qualcomm [65], and LIVE-VQC [66]) and two traditional-distortion VQA databases (LIVE [2] and CSIQ [67]) were used to test the model. A summary of these video databases is given in Table 1.
KoNViD-1k [64] is an extensive database containing 1200 real-world video sequences with frame rates of 24, 25, and 30 fps. Its large number of sequences represents a wide variety of content and covers almost all kinds of distortions. The MOS values range from 1.22 to 4.64.
LIVE-Qualcomm [65] contains 208 distorted videos exhibiting six common in-capture distortions: artifacts (noise and blocking effect), color, exposure, focus, blurriness, and camera shaking. All videos have a duration of 15 seconds at a frame rate of 30 fps, with MOS ranging from 16.56 to 73.64.
LIVE-VQC [66] contains 585 distorted videos, with the MOS ranging from 6.22 to 94.29. All videos have a duration of 10s with frame rates of 19–30 fps (one is 120fps). These videos contain 18 types of resolutions from 240P to 1080P, unique contents, and different combinations of distortions.
LIVE [2] includes 150 distorted videos generated from each reference video with four distortion types: wireless distortions, IP distortions, H.264 compression, and MPEG-2 compression. Each distorted video is 8.68 to 10 seconds long with a frame rate of 25 fps or 50 fps, and the average differential MOS ranges from 30.94 to 81.16.
CSIQ [67] has 216 distorted videos. All videos have a duration of 10 seconds and span frame rates from 24 to 60 fps. Each reference video produces 18 distorted videos, covering six distortion types at three levels each: Motion JPEG compression, H.264 compression, HEVC compression, wavelet compression using the SNOW codec [68], packet loss, and additive white Gaussian noise (AWGN). The average differential MOS ranges from 14.48 to 82.80.
To evaluate the performance of our proposed model, we used the Pearson Linear Correlation Coefficient (PLCC) and root MSE (RMSE) to measure the accuracy between the objective prediction and subjective assessment, and the Spearman Rank Order Correlation Coefficient (SROCC) to measure the monotonic consistency between the objective prediction and subjective assessment. The closer the value of the correlation coefficient is to 1, the higher the performance of the VQA model. Also, a nonlinear regression process is performed to map the prediction result to the subjective scores with different value domains according to the video quality experts group (VQEG) [69] as follows:\begin{equation*} \hat {Q}=\beta _{1}\left [{ \frac {1}{2}-\frac {1}{1+\text {exp}[\beta _{2}(Q-\beta _{3})] } }\right]+\beta _{4}Q+\beta _{5} \tag{21}\end{equation*}
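For reference, the nonlinear mapping in (21) can be fitted with SciPy as sketched below; the initial parameter guesses are heuristics of ours, not values prescribed by VQEG or the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def vqeg_logistic(q, b1, b2, b3, b4, b5):
    """Logistic-plus-linear mapping of eq. (21)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (q - b3)))) + b4 * q + b5

def fit_and_map(pred, mos):
    """Fit (21) on predicted scores and return the nonlinearly mapped predictions."""
    pred, mos = np.asarray(pred, float), np.asarray(mos, float)
    p0 = [mos.max() - mos.min(), 0.1, pred.mean(), 0.0, mos.mean()]  # heuristic initialization
    params, _ = curve_fit(vqeg_logistic, pred, mos, p0=p0, maxfev=10000)
    return vqeg_logistic(pred, *params)
```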
B. Implementation Details
In our experiments, each video database was divided into five non-overlapping subsets. Four subsets of the distorted videos were used for training and validation, while the remaining subset was used for testing, and five-fold cross-validation was conducted. Within each fold, 80% of the training-and-validation videos were used for training and 20% for validation. For the multi-channel CNN model, the CSIQ IQA database [70] was used for pre-training with (11), since it contains many distorted images with common, general, and diverse distortions and is therefore suitable as the baseline for image quality feature representation learning. All images were split into patches.
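A minimal sketch of this splitting protocol, assuming random video-level splits with scikit-learn (the paper's exact split procedure and seeds may differ), is given below.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def five_fold_splits(num_videos: int, seed: int = 0):
    """Yield (train, val, test) index splits: 5 folds, with an 80/20 train/val split inside each."""
    kfold = KFold(n_splits=5, shuffle=True, random_state=seed)
    for train_val_idx, test_idx in kfold.split(np.arange(num_videos)):
        train_idx, val_idx = train_test_split(train_val_idx, test_size=0.2, random_state=seed)
        yield train_idx, val_idx, test_idx
```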
The mean and standard deviation of the frame-level quality feature representation are then extracted using (15) and concatenated with the temporal and color-aware features in (16)–(18) and (19) to form the per-frame feature vector fed into the GRU model.
C. Performance Evaluation on UGC VQA Databases
To comprehensively evaluate the efficiency of our proposed method, we trained and tested our model on the three UGC databases individually and compared its performance with other state-of-the-art NR-VQA approaches. Twelve NR-VQA methods were included: BRISQUE [17], NIQE [15], V-BLIINDS [21], VIIDEO [20], TLVQM [10], VSFA [12], CNN-TLVQM [13], HEKE [24], RAPIQUE [23], VIDEVAL [71], MDTVSFA [52], and LSCT-PHIQNet [30]. In particular, VSFA, CNN-TLVQM, RAPIQUE, and MDTVSFA use a CNN model pre-trained on the ImageNet classification task to extract content-aware features via transfer learning, and LSCT-PHIQNet uses a model pre-trained on an IQA task to extract quality-aware features. The means and standard deviations of the PLCC, SROCC, and RMSE results for the mentioned competitors and the proposed model on the KoNViD-1k, LIVE-Qualcomm, and LIVE-VQC video databases are given in Table 2. Following the method in [12], Table 2 also includes a weighted average that weights the results by the number of videos in each database to represent the overall performance. In Table 2, the best and second-best PLCC, SROCC, and RMSE performances are highlighted in bold and underlined, respectively.
As shown in Table 2, our proposed model achieves either the best or second-best performance in terms of PLCC, SROCC, and RMSE on these three UGC databases, with small standard deviations, indicating that our model is more robust. Although our model ranks second in PLCC on the KoNViD-1k database, it is on par with LSCT-PHIQNet, with only a slight difference of 0.003, and it outperforms LSCT-PHIQNet on the LIVE-Qualcomm and LIVE-VQC databases by about 0.02 to 0.03 in both PLCC and SROCC. Furthermore, on the LIVE-Qualcomm database, the correlation and accuracy (PLCC, SROCC, and RMSE) of our proposed model are superior to those of the other NR-VQA methods. Ours also achieves the best PLCC and SROCC on the LIVE-VQC database, outperforming the second-best method, CNN-TLVQM, by about 0.015 and 0.019 in PLCC and SROCC, respectively. In terms of the weighted average results, our proposed model obtains the best overall performance in PLCC and SROCC, with improvements of 0.007 and 0.014 over the second-best method, respectively.
The results in Table 2 make it evident that our proposed model outperforms the other NR-VQA methods and exhibits better effectiveness and generalization on all three video databases. This demonstrates that our proposed NR-VQA method is more robust and effective than other transfer learning/pre-trained model-based methods.
D. Performance Evaluation on Traditional VQA Databases
Unlike the UGC databases, which focus on in-capture distortions and videos in the wild, the traditional VQA databases focus on distortions introduced during compression and transmission, called post-capture distortions. Most existing NR-VQA methods are designed for either UGC or traditional databases. Therefore, we also tested our proposed model on the two traditional VQA databases, the LIVE and CSIQ video databases, and compared its performance with other state-of-the-art NR-VQA approaches, including V-BLIINDS [21], VIIDEO [20], SACONVA [46], TLVQM [10], VSFA [12], CNN-TLVQM [13], Wang's [72], RIRNet [25], and HEKE [24], to further demonstrate its effectiveness and generalization. Note that the results of SACONVA [46], Wang's [72], and RIRNet [25] are taken from their respective papers. Again, the best and second-best PLCC and SROCC performances are highlighted in bold and underlined, respectively.
The PLCC and SROCC results on the LIVE and CSIQ video databases are shown in Table 3, where the standard deviations are also provided in brackets. As we can see, our model outperforms all other NR-VQA methods. It is worth noting that although Wang's [72] adopts the saliency map and frame-difference information to train a CNN model for video quality assessment, it ignores the motion information that we believe is crucial to frame-level quality prediction, which affects the accuracy of perceptual quality prediction. In contrast, our proposed method focuses on HVP and motion information, incorporating the motion-aware region map into the CNN model to learn motion-aware and spatial-aware fusion features simultaneously via the semi-supervised learning and fine-tuning strategies. As a result, Table 3 shows that our proposed model yields significantly higher PLCC and SROCC than Wang's [72]. Besides, the proposed model achieves significant improvements in both PLCC and SROCC over the second-best method, HEKE. On the LIVE video database, our method outperforms HEKE [24] by about 0.053 and 0.051 in PLCC and SROCC, and improvements of 0.02 and 0.014 are achieved on the CSIQ video database. Therefore, our proposed model is shown to be universal and to achieve a strong correlation with human perception on both UGC and traditional databases.
E. Performance Evaluation on Cross Databases
To further verify the generalization capability of our proposed model on diverse contents and distortions, this section reports the performance on LIVE, KoNViD-1k, and LIVE-Qualcomm in a cross-database scenario, where LIVE is a traditional database focusing on post-capture distortions, KoNViD-1k is a UGC database focusing on videos in the wild with various contents, and LIVE-Qualcomm is a UGC database focusing on in-capture distortions. We trained our proposed model on one database and tested it on the other two. Then, under the same experimental scheme, the SROCC performance was compared with TLVQM [10], VSFA [12], CNN-TLVQM [13], and LSCT-PHIQNet [30]. Again, the best and second-best SROCC performances are highlighted in bold and underlined, respectively. Table 4 clearly shows that, in terms of SROCC, the generalization ability of our proposed model trained on each of the three databases surpasses that of the others.
Besides, we also randomly selected samples from all three video databases for training and used the remaining samples for evaluation. When trained on this combined database, as shown in Table 4, our model significantly outperforms the second-best methods, CNN-TLVQM and LSCT-PHIQNet, improving SROCC by 0.041, 0.035, and 0.04 on the LIVE, KoNViD-1k, and LIVE-Qualcomm video databases, respectively. Since our proposed model achieves satisfactory results in both the cross-database and combined-database scenarios, in which the three databases contain different and diverse video contents and distortions, its generalization capability is demonstrated, and we conclude that our model is applicable to different video contents and all types of distortions.
F. Ablation Study of Motion-Aware Region Map, Semi-Supervised Learning, and Fine-Tuning Strategies
As mentioned in Section III, based on the ROI concept, the SM in (5) can guide image quality feature learning by focusing on salient regions, just as the OFM in (6) can guide frame-level quality feature representation learning based on motion-aware regions, and the domain gap can then be reduced by the semi-supervised learning and fine-tuning strategies in (13). Therefore, to demonstrate the effects of the SM, the OFM, and these techniques on frame-level quality feature extraction, an ablation study was performed on the multi-channel CNN model using various training settings. It includes three combinations: VQA using the SM on the pre-trained multi-channel CNN model without the semi-supervised learning and fine-tuning strategies (Multi-CNN
As we can see in Table 5, PLCC and SROCC of Multi-CNN
G. Ablation Study of Channel Attention Mechanism
Table 6 shows an ablation study in which the SENet blocks (channel attention mechanism) of the multi-channel CNN model are replaced with regular residual blocks. The results show an improvement when the SENet blocks are used, demonstrating that the channel attention mechanism can reinforce crucial features and weaken inconsequential ones, resulting in better quality feature representation learning.
H. Ablation Study of Features Learning
This ablation study performs feature selection to analyze the performance gain from features, which also demonstrates the effectiveness of the feature learning ability of our proposed model. There are four groups in the experiment: prediction by our proposed model with the frame-level quality feature representation only (FQFR), prediction by our proposed model with the frame-level quality feature representation and temporal features (FQFR+TF), prediction by our proposed model with the frame-level quality feature representation and color-aware features (FQFR+CF), and prediction by our proposed model with all features (FQFR+TF+CF). The experimental results are shown in Table 7.
Although the frame-level quality feature representation with the GRU model already achieves a satisfactory result, incorporating the temporal or color-aware features is clearly helpful for predicting precise video quality scores with HVP characteristics. When only the frame-level quality feature representation is used, the PLCC results are around 0.898 and 0.861 on the LIVE and KoNViD-1k video databases. When the temporal features or the color-aware features are added, the PLCC on LIVE improves from 0.898 to 0.917 and 0.915, respectively. Furthermore, after combining the frame-level quality feature representation, temporal features, and color-aware features, our model achieves the best results by learning the gradient of temporal features and the temporal variation of spatial features along the time series as spatiotemporal features, thereby enhancing video quality prediction with the help of HVP characteristics.
I. Computational Complexity
Computational complexity is another major concern for applying VQA methods in practical applications. Therefore, we evaluate the computational complexity of our proposed model and six competing NR-VQA models for benchmarking. For a fair comparison, all methods were tested on the same device running Windows 10 with an Intel i9-10900K CPU, 64 GB of RAM, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
The comparison of computational complexity is shown in Table 8. Two videos of different resolutions (720p and 1080p) from LIVE-VQC were used to measure the runtime of each NR-VQA method. Although our proposed model requires higher computational complexity than HEKE [24] and RAPIQUE [23], which perform temporal downsampling, it delivers better prediction accuracy, as shown in Fig. 4. Meanwhile, our model can also perform temporal downsampling by a factor of two to form a lighter model (Proposed_Light), i.e., extracting features every two frames instead of every frame, which further reduces the computational complexity at the cost of a slight drop in accuracy. Proposed_Light then has a complexity close to that of HEKE while still achieving highly accurate performance, as shown in Fig. 4. Apart from these two methods, our proposed model is faster than the other NR-VQA methods and has better PLCC performance. Specifically, compared with LSCT-PHIQNet [30] and CNN-TLVQM [13], our proposed model reduces the computational complexity by about 20% and 50%, respectively. Overall, Fig. 4 demonstrates that our proposed model offers the best trade-off between accuracy and computational complexity.
Conclusion
In this paper, using SSL, we developed an NR-VQA model that learns quality features through a multi-channel CNN with non-human-annotated labels and a GRU that takes HVP characteristics into account. First, we address the lack of human-annotated label data for the VQA task with an SSL-based multi-channel CNN built on an image quality feature learning method in the IQA domain. Second, we bridge the domain gap between the IQA and VQA tasks by adopting semi-supervised learning and fine-tuning strategies for the pre-trained CNN model, which take motion-aware information into consideration to optimize frame-level quality feature representation learning. Finally, a GRU model explores the spatiotemporal features and the gradient of temporal features of the video to estimate its quality by incorporating the frame-level quality feature representation and the HVP-related temporal and color-aware features. Experimental results demonstrate the robustness and generalization of our proposed model, which is practical for real applications and strongly correlated with human perception. In the future, one direction is to replace the GRU with a transformer to exploit long-range dependencies for further improvement. A lightweight NR-VQA model is also essential for the further development of real-time applications.