
From Pixels to Rich-Nodes: A Cognition-Inspired Framework for Blind Image Quality Assessment



Abstract:

Blind image quality assessment (BIQA) is a subjective perception-driven task, which necessitates assessment results consistent with human cognition. The human cognitive system inherently involves both separation and integration mechanisms. Recent works have witnessed the success of deep learning methods in separating distortion features. Nonetheless, traditional deep-learning-based BIQA methods predominantly depend on a fixed topology to mimic the information integration in the brain, which gives rise to scale sensitivity and low flexibility. To handle this challenge, we delve into the dynamic interactions among neurons and propose a cognition-inspired BIQA model. Drawing insights from the rich club structure in network neuroscience, a graph-inspired feature integrator is devised to reconstruct the network topology. Specifically, we argue that the activity of individual neurons (pixels) tends to exhibit random fluctuation with ambiguous meaning, while clear and coherent cognition arises from neurons with high connectivity (rich-nodes). Therefore, a self-attention mechanism is employed to establish strong semantic associations between pixels and rich-nodes. Subsequently, we design intra- and inter-layer graph structures to promote feature interaction across the spatial and scale dimensions. Such dynamic circuits endow the BIQA method with efficient, flexible, and robust information processing capabilities, so as to achieve assessment results that better match human subjectivity. Moreover, since the limited samples in existing IQA datasets make models prone to overfitting, we devise two prior hypotheses: a frequency prior and a ranking prior. The former stepwise augments high-frequency components that reflect the distortion degree during the multilevel feature extraction, while the latter seeks to motivate the model's in-depth comprehension of differences in sample quality. Extensive experiments on five public datasets reveal that the proposed algorithm achieves competitive results.
Published in: IEEE Transactions on Broadcasting (Volume: 71, Issue: 1, March 2025)
Page(s): 229 - 239
Date of Publication: 07 October 2024


SECTION I.

Introduction

In recent years, the rapid development of video streaming platforms has led to exponential growth in user engagement. This trend raises concerns about detail loss and color distortion during multimedia data acquisition, compression, transmission, and storage. Consequently, Image Quality Assessment (IQA) emerges as a pivotal technology for evaluating user visual perception and augmenting overall viewing experiences [1], [2], [3].

In terms of their dependency on a reference image, IQA methods are classified into three categories: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), and no-reference IQA (NR-IQA), also known as blind IQA (BIQA) [4]. Both FR-IQA and RR-IQA methods are grounded on discrepancy priors between reference and distorted images to capture the crucial features that affect image quality. Nevertheless, given the difficulty of acquiring reference images in real scenarios, BIQA methods have gained more attention than FR-IQA and RR-IQA methods for their flexibility. BIQA methods forgo direct comparison with reference images, instead employing the intrinsic features of distorted images for quality assessment.

Traditional BIQA methods [5], [6], [7] are founded on the principles of the human visual system and image statistical properties. Such methods depend on manually designed feature extractors, posing challenges in handling IQA datasets with diverse distortions. Recently, the rapid advancement of deep learning technology has revitalized the field of BIQA, empowering models to autonomously acquire feature representations in a data-driven manner. Despite notable progress in assessment performance, these methods still suffer from the following limitations: 1) current deep learning-based BIQA methods exhibit inadequate capabilities in integrating multi-level features, which poses risks of scale sensitivity and low flexibility; 2) BIQA datasets generally have insufficient sample sizes and are prone to overfitting problems.

BIQA is a task rooted in subjective perception, which requires deep learning networks to mimic human cognition. Separation and integration are key components of the human cognitive system [8]. The separation mechanism breaks complex information into independent components, each focusing on specific perceptual features. Meanwhile, the integration mechanism combines this information for higher-order cognition. Deep learning has been empirically validated for its outstanding performance in segregating multidimensional features. However, early integration research concentrated on fixed topologies, exemplified by feature pyramid structures [9], [10], [11]. Such structures have a low tolerance to scale inconsistency, and their scale adjustment strategies pose potential hazards of feature loss and improper padding. Moreover, progressive feature aggregation focuses on adjacent feature maps, weakening interactions between non-adjacent ones [12]. Recently, biologically inspired integration methods have gained attention [13]. Zhao et al. [14] employed a graph framework to capture dynamic feature interactions. Beyond resolving the concern of scale consistency, the graph framework facilitates parallel feature transmission across diverse scales. Regrettably, although the grouped subsets derived from the pre-trained Convolutional Oriented Boundaries (COB) method reduce the network's computational cost, the non-end-to-end design may inadvertently learn features irrelevant to the IQA task. This potential risk could increase training difficulty and impede the synergistic optimization among disparate network components.

To tackle the above-mentioned issues, we rethink the multidimensional feature integration process and find that it can resonate with the rich club in network neuroscience. Rich club [15] holds that the separated modules reflect the informational specificity of different encephalon regions. Meanwhile, the few nodes exhibiting extensive connectivity are designated as rich connectors to facilitate inter-module communication, as depicted in Fig. 1(a). Therefore, we put forward a cognitive shift from pixel-level to rich-node-level. Individual pixels are highly sensitive to perturbations, making it difficult to provide reliable outputs. By autonomously learning similarity matrices within feature maps, we can identify rich-nodes with explicit feature representations from pixels, as depicted in Fig. 1(b). Furthermore, intra-layer and inter-layer graph structures are designed to update multidimensional features synchronously, facilitating dynamic integration across spatial and scale dimensions.

Fig. 1. The core idea of GraspIQA. This work is motivated by the rich club concept, as depicted in (a), where the brain is a complex network composed of multiple interconnected modules. Each module exhibits distinct information processing capabilities. Moreover, a small subset of highly connected nodes serves as critical intermediaries for information integration across encephalon regions. Analogously, as illustrated in (b), the rectangular boxes represent specific features extracted from the input data. Subsequently, rich-nodes (represented by red nodes within circles) serve as the basis for constructing intra- and inter-layer graph structures that emulate the dynamic neural circuits.

Moreover, the sample scarcity in existing IQA datasets poses a risk of model overfitting. When confronted with unknown matters, humans rely on prior cognition to swiftly form a preliminary understanding of new information, thereby fostering higher perceptiveness under variable conditions. Analogously, the prior knowledge in the deep learning network essentially refers to the assumption, rule, and domain-specific knowledge. In this paper, we propose two prior hypotheses. The first one is the frequency prior. We argue that high-frequency components typically indicate rapid changes in pixel values, such as image edges and textures, which play a crucial role in quantifying the extent of image distortion. The second one is the ranking prior. As described in [16], traditional quality regression operators ($l_{1}$ and $l_{2}$ norms) solely focus on the value accuracy, neglecting the ranking relationships among samples. However, humans tend to establish comparative relationships to rapidly distinguish between different options and make more rational choices. Building upon the aforementioned hypotheses, we devise a novel loss function based on self-cognitive priors, aiming to guide the model in understanding the input samples more deeply during the training process.

In summary, we propose a blind image quality assessment method based on Graph learning and self-cognitive priors, termed GraspIQA. The contributions of our work are summarized as follows.

  • We propose a cognition-inspired BIQA framework, aiming to simulate the integration process of the human cognitive system, thereby striving for assessment results that better align with human subjective perceptions.

  • We devise a graph-inspired feature integrator (GIFI) to establish strong semantic correlations between pixels and rich-nodes. Meanwhile, two graph structures are constructed for intra-layer contextual propagation and inter-layer semantic interaction. This dynamic topology demonstrates remarkable flexibility in facilitating parallel information transfer across diverse scale features.

  • We present a self-cognitive prior loss, which contains two prior hypotheses: frequency prior and ranking prior. The former aims to restore high-frequency information progressively, guiding the network to focus on crucial distorted features. Meanwhile, the latter provides an in-depth understanding of quality differences through sample ranking relationships.

The remainder of the paper is organized as follows. We discuss the related works in Section II. The proposed GraspIQA network is introduced in Section III, and Section IV describes the experimental results. Finally, some conclusions are given in Section V.

SECTION II.

Related Works

A. Blind Image Quality Assessment

Owing to its generalizability, BIQA has emerged as an important research field. Nevertheless, in the absence of reference images, its assessment performance often falls short of FR-IQA methods. Several works [17], [18], [19] attempted to improve assessment performance by introducing learnable priors.

Early works [20], [21], [22] generated pseudo-reference images via image restoration methods to establish discrepancy prior-assisted networks. However, image restoration methods often fabricate plausible details, making it challenging for the network to distinguish between generated textures and true noise [23]. To address this, Yang et al. [24] proposed a saliency-assisted prediction branch to learn attention masks, which participated in locally weighted estimation. Yao et al. [25] simulated the human scanning path by selecting a series of peak responses in the saliency map to acquire cropped patches. The aforementioned saliency-based BIQA methods attempt to mimic human attention preferences but neglect the interdependencies between salient and non-salient regions, potentially resulting in assessment bias. Frequency domain analysis is a fundamental technique in image processing, as it captures micro-structural variations in images. Zhang et al. [26] devised a dual-stream deep network to extract high- and low-frequency image features, respectively, for quantifying the quality of super-resolution images. Zhou et al. [27] utilized the discrete Haar wavelet transform to obtain wavelet sub-bands and calculate entropy intensity. Moreover, the ranking prior is also a crucial strategy for improving assessment performance. Golestaneh et al. [28] proposed a margin triplet loss function to learn the relative ranking between images with the highest and lowest quality scores. Ou et al. [29] generated many unlabeled ranking samples and set ranking upper and lower bounds to construct a controllable list-wise ranking loss.

In this work, we present a self-cognitive prior loss, which contains two prior hypotheses: a frequency prior and a ranking prior. Unlike the above-mentioned methods, the designed frequency prior does not require building multiple stream branches. Instead, it directly participates in IQA feature extraction and corrects multi-level features stepwise. Additionally, the ranking prior underscores the global ranking relationship among samples. Each sample within a batch contributes to the margin ranking loss, obviating the necessity to create pairs of ranked samples and mitigating the overfitting problem that arises from training solely on local extreme instances.

B. Graph Neural Networks

Graph neural networks [30], [31] are derived from the intricate interconnections among neurons. In the human brain, neurons interact through synaptic connections, consistent with the node and edge connectivity manner in graph neural networks. Furthermore, graph neural networks have the property of parallel computing, which is more conducive to learning feature interactions from large-scale graph data. Therefore, the graph neural network can be viewed as a straightforward simulation of the information transmission mechanism in the human brain, which broadens the domain of applicability of neural networks and enhances the effectiveness of handling unstructured data [32].

Shen et al. [33] proved that transforming input images into graph-based representations is a versatile and effective option for capturing visual perceptual features. Xu et al. [34] employed multiple viewpoints of omnidirectional images as nodes to construct a spatial viewport graph. Shan et al. [35] introduced a novel graph convolutional approach, wherein point clouds are conceptualized as graphs, facilitating the exploration of structural and textural perturbations through the interactions among local points. Sun et al. [36] investigated the intrinsic correlation between distortion types, levels, and quality scores, and proposed a general BIQA framework for distortion representation learning. Wang et al. [37] introduced an adaptive graph attention module, which adeptly refines post-transformer features into an adaptive graph structure, thereby facilitating local information enhancement. However, the challenge of treating images as graph nodes lies in the fact that images contain vast numbers of pixels, leading to a marked surge in computational complexity. Moreover, owing to the sensitivity of individual pixels to interference, pixel-level features often lack clear definitions. Therefore, we propose a more elegant cognitive representation of multidimensional features. The self-attention mechanism summarizes pixels with similar visual characteristics into a rich-node, and the dense interconnections among these rich-nodes facilitate robust information transmission to generate clear cognition.

SECTION III.

Methodology

In this section, we elaborate on the GraspIQA method, as illustrated in Fig. 2, which consists of two components: a graph-inspired feature integrator for the spatial and scale interactions, and a self-cognitive prior loss for the in-depth comprehension of input samples.

Fig. 2. The network architecture of the GraspIQA model. GraspIQA is constructed upon the ResNet50 network, where rich-nodes are extracted from multi-level features to construct inter-layer and intra-layer graph structures. Furthermore, we employ self-distillation to guide the current feature maps in learning the high-frequency information, and the relative ordering relationship between samples is emphasized by the ground-truth ranking priors. In the intra-layer graph learning, the orange, green, and blue lines represent feature learning within the $4 \times 4$, $8 \times 8$, and $16 \times 16$ feature maps, respectively. In the inter-layer graph learning, the purple lines indicate the information interaction among these three feature maps.

A. Graph-Inspired Feature Integrator

Motivated by the hierarchical cognition mechanism in the human visual system [13], several IQA methods [22], [38] have shifted their focus to exploring multi-scale feature integration in a more biologically plausible manner. For this purpose, feature pyramid structures have been adopted into the BIQA task. Despite their successes, traditional fixed-topology integration networks struggle to integrate diverse cross-scale features due to their reliance on progressive fusion strategies. Therefore, we propose a graph-inspired feature integrator that forms a cognitive shift from the pixel level to the rich-node level, thereby fostering contextual propagation in the spatial dimension and semantic interaction in the scale dimension, as shown in Fig. 3.

Fig. 3. Comparison of the traditional feature pyramid structures and our structure. Here, the circles in (a) and (b) denote different feature maps (pixel-level), and the circles in (c) indicate rich-nodes.

1) Rich-Nodes Mapping:

Given a feature layer $I_{i} \in \mathbb{R}^{C \times H \times W}$, our goal is to obtain explicit feature representations, transitioning the feature map $I_{i}$ from the pixel-level space to the rich-node space. Initially, we employ adaptive average pooling to sample an initial rich-node map $\widetilde{S}_{i} \in \mathbb{R}^{C \times \frac{H}{r} \times \frac{W}{r}}$, which aggregates neighboring pixels to capture higher-order semantic information, where $r$ denotes the pixel integration factor. Nevertheless, the initial rich-node map attends solely to the geometric structure of the feature map without modeling content relevance. Therefore, we further introduce the $I$-$\widetilde{S}$ attention mechanism [39], aiming to establish deeper semantic correlations between the pixel level and the rich-node level, as shown in Eq. (1).\begin{equation*} Att = \mathrm{Softmax}\left(\frac{I \cdot \widetilde{S}^{\mathrm{T}}}{\sqrt{C}}\right). \tag{1}\end{equation*}

Based on the attention map $Att$, the rich-nodes are updated as\begin{equation*} S = \overline{Att}^{\mathrm{T}} \cdot I, \tag{2}\end{equation*} where $\overline{Att}$ represents the column-normalized $Att$. Moreover, to accelerate the sampling rate, we constrain the association calculation of each pixel to its 9 adjacent rich-nodes.
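For concreteness, the following is a minimal PyTorch sketch of the rich-node mapping in Eqs. (1)-(2). It is an illustration under two assumptions: attention is computed densely over all rich-nodes (the 9-neighbor restriction above is omitted for brevity), and the function name `rich_node_mapping` and its default integration factor are ours, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def rich_node_mapping(I: torch.Tensor, r: int = 4) -> torch.Tensor:
    """Map a pixel-level feature map I (B, C, H, W) to rich-nodes (B, C, H/r, W/r)."""
    B, C, H, W = I.shape
    # Initial rich-node map via adaptive average pooling (geometric sampling).
    S_tilde = F.adaptive_avg_pool2d(I, (H // r, W // r))

    pixels = I.flatten(2).transpose(1, 2)       # (B, HW, C)
    nodes = S_tilde.flatten(2).transpose(1, 2)  # (B, K, C), K = HW / r^2

    # Eq. (1): pixel-to-node attention, scaled by sqrt(C).
    att = torch.softmax(pixels @ nodes.transpose(1, 2) / C ** 0.5, dim=-1)

    # Eq. (2): column-normalize Att over the pixel dimension,
    # then aggregate pixel features into rich-nodes.
    att_bar = att / (att.sum(dim=1, keepdim=True) + 1e-6)
    S = att_bar.transpose(1, 2) @ pixels        # (B, K, C)
    return S.transpose(1, 2).reshape(B, C, H // r, W // r)
```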

2) Graph Construction:

Building upon [14], we introduce a dual-edge framework encompassing intra-layer and inter-layer graph learning. Regarding intra-layer edges, inspired by the observations presented in [40], we argue that neural networks tend to glean crucial features from central cross positions (known as skeletons) within the feature map rather than from peripheral corners. Thus, we establish ancestral-descendant relationships for each node based on its skeleton position to facilitate the propagation of contextual information. For inter-layer edges, given the computational cost of connecting each rich-node in one layer to all rich-nodes in another, we start from the feature map with fewer rich-nodes and evenly distribute the rich-nodes of the lower-level feature map over it. Subsequently, we establish inter-layer graph relationships to bridge semantic disparities across different feature maps. Note that all edges are bidirectional; the connections of intra-layer and inter-layer edges are depicted in Fig. 4.
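The exact skeleton rule is defined by Fig. 4; as an illustration only, the sketch below builds bidirectional intra-layer edges along the central cross (4-neighborhood) of each grid node and inter-layer edges that evenly distribute each coarse rich-node over the block of fine rich-nodes it covers. Both simplifications are our assumptions about the grouping, not the authors' exact construction.

```python
import torch

def intra_layer_edges(h: int, w: int) -> torch.Tensor:
    """Bidirectional edges between each node on an h x w grid and its
    cross-position (up/down/left/right) neighbors."""
    edges = []
    for i in range(h):
        for j in range(w):
            u = i * w + j
            if j + 1 < w:
                edges += [(u, u + 1), (u + 1, u)]   # horizontal link
            if i + 1 < h:
                edges += [(u, u + w), (u + w, u)]   # vertical link
    return torch.tensor(edges, dtype=torch.long).t()  # shape (2, E)

def inter_layer_edges(h_c: int, w_c: int, scale: int, offset: int) -> torch.Tensor:
    """Bidirectional edges from each coarse node to the scale x scale block of
    fine nodes it covers; `offset` shifts fine indices in the joint node list."""
    edges, w_f = [], w_c * scale
    for i in range(h_c):
        for j in range(w_c):
            c = i * w_c + j
            for di in range(scale):
                for dj in range(scale):
                    f = offset + (i * scale + di) * w_f + (j * scale + dj)
                    edges += [(c, f), (f, c)]
    return torch.tensor(edges, dtype=torch.long).t()
```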

Fig. 4. The connections of intra-layer and inter-layer edges. (a) Intra-layer nodes construct ancestor-descendant edges based on their neighboring skeleton nodes; (b) the first sub-graph from left to right represents the inter-layer node relationships between adjacent layers, and the second sub-graph represents the inter-layer node relationships between non-adjacent layers, where $S_{1}, S_{2}, S_{3}$ denote the triple-layer cross-scale rich-node maps.

3) Graph Learning:

The graph-inspired feature integrator consists of inter-graph learning and intra-graph learning, which share a common set of nodes to extract different structural features. Specifically, both graph structures employ the Graph Attention Network (GAT) [30] to integrate information from neighboring nodes via a self-attention mechanism, thereby obtaining updated representations of the current nodes. This adaptive attention mechanism allows the model to dynamically adjust attentional preferences based on different inter-node relationships. Importantly, each layer of graph learning has its own learnable parameters, which are not shared with any other layer. For node $N_{i}$, the updated node is represented as shown in Eq. (3).\begin{equation*} \vec{N}_{i}^{\prime} = \mathcal{M}\left(\vec{N}_{i}, \{\vec{N}_{j}\}_{j \in \mathcal{C}_{i}}\right), \tag{3}\end{equation*} where $\{\vec{N}_{j}\}_{j \in \mathcal{C}_{i}}$ denotes the set of neighboring nodes of $N_{i}$, and $\mathcal{M}(\cdot)$ represents the self-attention operation.
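A compact realization of Eq. (3) is sketched below with the `GATConv` operator from the torch_geometric library. The residual fusion of the intra- and inter-layer updates is our assumption, and the module name `GraphIntegrator` is illustrative.

```python
import torch
from torch_geometric.nn import GATConv

class GraphIntegrator(torch.nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # Separate (non-shared) parameters for intra- and inter-layer learning;
        # dim must be divisible by heads so concatenated heads recover dim.
        self.intra_gat = GATConv(dim, dim // heads, heads=heads)
        self.inter_gat = GATConv(dim, dim // heads, heads=heads)

    def forward(self, nodes, intra_edges, inter_edges):
        # nodes: (num_nodes, dim); *_edges: (2, E) edge-index tensors.
        n_intra = self.intra_gat(nodes, intra_edges)   # contextual propagation
        n_inter = self.inter_gat(nodes, inter_edges)   # cross-scale interaction
        return nodes + n_intra + n_inter               # residual fusion (assumed)
```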

B. Self-Cognitive Prior Loss

The limited samples of current IQA datasets contribute to the risk of model overfitting. We aim to induce common rules from limited data to establish two self-cognitive priors: a frequency prior and a ranking prior.

1) Frequency Prior:

High-frequency information contains features such as texture and details, which better reflect the extent of image distortion than the low-frequency components, as depicted in Fig. 5. However, during the feature extraction process, as the depth of the network increases, the feature representation is gradually converted from its raw form into abstract structures. By selectively filtering out high-frequency information, the network focuses on unraveling the global structure of the input data, i.e., the low-frequency information. Obviously, this behavior runs contrary to the IQA task's primary emphasis on high-frequency information. Therefore, we propose a frequency prior loss for the layer-by-layer correction of multilevel features.

Fig. 5. Comparison of frequency maps of different images. (a) Comparison images, where the first row is the original image and rows 2-4 are distorted images of various degradation types; (b) low-frequency maps; (c) high-frequency maps (since the high-frequency features are faint, we uniformly double the high-frequency maps of all images to strengthen their expression); (d) histograms of the low-frequency and high-frequency maps (left: low-frequency results; right: high-frequency results). Here, we employ the Fast Fourier Transform with a filter radius of 30 to extract the high-frequency and low-frequency maps of diverse images. The low-frequency maps predominantly retain the intrinsic information and only manifest distortion under the "CONTRAST" condition. Further histogram analysis indicates a high degree of similarity in pixel distribution among the low-frequency maps of the original, AWGN, and BLUR images, rendering them visually indistinguishable. In contrast, the high-frequency maps emphasize the extent of image degradation across diverse scenarios, exemplified by heavy noise in the AWGN image, fuzzy edges in the BLUR image, and low pixel values in the CONTRAST image. Meanwhile, the histogram results also demonstrate distinct differences in pixel distribution among the high-frequency maps of different images.

The ResNet50 network is structured into four stages, with multiple residual blocks in each stage employed to extract the feature map at a 1/2 downsampling rate. This design implies that the antecedent stage retains valuable high-frequency features that are omitted in the succeeding stage. Therefore, for $I_{i} \in \mathbb{R}^{c \times h \times w}$, the frequency prior can be obtained in the frequency domain, as shown in Eq. (4).\begin{equation*} p_{i} = \mathcal{F}^{-1}\left(\sigma_{\gamma}\left(\mathcal{F}\left(I_{i-1}\right)\right)\right), \tag{4}\end{equation*} where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ represent the Fast Fourier Transform and the inverse Fast Fourier Transform, $\sigma_{\gamma}$ denotes a high-pass filter with a filter radius of $\gamma \cdot h$, and $h$ indicates the height of the image. We leverage the frequency prior $p_{i}$ derived from the antecedent feature map $I_{i-1}$ to rectify the high-frequency components $f_{i}^{high}$ of the succeeding feature map $I_{i}$. Subsequently, the rectified information serves as the frequency prior for the next-level feature map, i.e., $p_{i+1} = f_{i}^{high}$. We leverage $[p_{1}, p_{2}, p_{3}]$ as the teacher feature set $P^{T}$ and $[p_{2}, p_{3}, p_{4}]$ as the student feature set $P^{S}$. According to the activation-based attention knowledge transfer method [41], the model is progressively guided to reinforce high-frequency components throughout the multi-level feature extraction. The loss function for the frequency prior is defined as follows.\begin{equation*} L_{fp} = \sum_{j \in J}\left\|\frac{V^{S}_{j}}{\|V^{S}_{j}\|_{2}} - \frac{V^{T}_{j}}{\|V^{T}_{j}\|_{2}}\right\|_{2}, \tag{5}\end{equation*} where $J$ represents the number of training stages in the student and teacher feature sets, $\|\cdot\|_{2}$ denotes the $l_{2}$ norm, and $V^{S}$ and $V^{T}$ denote the vectorized forms of the attention maps, which can be obtained as follows.\begin{align*} V^{S} &= \text{vec}\left(\sum_{j=1}^{C} \left|P_{j}^{S}\right|^{2}\right), \tag{6}\\ V^{T} &= \text{vec}\left(\sum_{j=1}^{C} \left|P_{j}^{T}\right|^{2}\right). \tag{7}\end{align*}
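A hedged sketch of Eqs. (4)-(7) follows. The circular high-pass mask is one plausible realization of $\sigma_{\gamma}$, and spatial alignment between stages (teacher and student maps differ in resolution due to the 1/2 downsampling) is left to the caller; both points are assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def high_pass(x: torch.Tensor, gamma: float = 1 / 12) -> torch.Tensor:
    """Eq. (4): keep frequencies outside radius gamma*h of the spectrum center."""
    B, C, h, w = x.shape
    spec = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt().to(x.device)
    mask = (dist > gamma * h).float()                       # assumed circular filter
    return torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1))).real

def attention_vector(p: torch.Tensor) -> torch.Tensor:
    """Eqs. (6)-(7): channel-wise sum of squared activations, vectorized."""
    return p.pow(2).sum(dim=1).flatten(1)                   # (B, h*w)

def frequency_prior_loss(teacher_feats, student_feats) -> torch.Tensor:
    """Eq. (5): l2 distance between l2-normalized attention vectors per stage."""
    loss = 0.0
    for p_t, p_s in zip(teacher_feats, student_feats):
        v_t = F.normalize(attention_vector(p_t), dim=1)
        v_s = F.normalize(attention_vector(p_s), dim=1)
        loss = loss + (v_s - v_t).norm(p=2, dim=1).mean()
    return loss
```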

2) Ranking Prior:

Although traditional quality regression operators, such as the $l_{1}$ and $l_{2}$ norms, have shown their efficacy, they are limited by their sole reliance on minimizing the error between the predicted values and the ground truths when updating network parameters. We propose a ranking prior loss that motivates the model to develop an in-depth understanding of the quality differences among samples by learning the ranking relations of the ground-truth values.

Firstly, the ground-truth values of distorted images $(g_{1}, g_{2}, \ldots, g_{n})$ within the same batch are sorted in descending order, and the image indexes are recorded to reorganize the prediction sequence. For the reorganized prediction sequence $P = \{p_{1}, p_{2}, \ldots, p_{n}\}$, we employ a margin ranking loss to quantify the relative ordering relationship. However, it is costly to acquire the complete relative ordering information for all samples. Therefore, the reorganized prediction data is partitioned into $K$ groups, and the average of each group is computed to yield the average sequence $M = \{m_{1}, m_{2}, \ldots, m_{K}\}$. Extracting the first $K-1$ numbers from $M$ to construct $X$ and the last $K-1$ numbers to form $Y$, a ranking prior loss $L_{rp}$ is devised to enforce the model to strictly comply with $x_{i} > y_{i}$, as illustrated in Eq. (8).\begin{equation*} L_{rp}\left(X, Y, \theta\right) = \frac{1}{K-1} \sum_{i \in K-1} \max\left\{0, -\theta \times \left(x_{i} - y_{i}\right) + mar\right\}, \tag{8}\end{equation*} where $\theta$ denotes the expected sorting rule ($\theta = 1$ in this paper) and $mar$ is the minimum margin value. As $L_{rp}$ is solely concerned with ordering relationships, $mar$ is set to 0.
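The following minimal sketch implements Eq. (8) with $\theta = 1$ and $mar = 0$; consecutive grouping after the descending sort, and a batch size divisible by $K$, are assumptions.

```python
import torch

def ranking_prior_loss(pred: torch.Tensor, gt: torch.Tensor,
                       K: int = 10, mar: float = 0.0) -> torch.Tensor:
    # Reorganize predictions by the descending ground-truth order.
    order = torch.argsort(gt, descending=True)
    p = pred[order]
    # Partition into K consecutive groups and average each one.
    m = p.reshape(K, -1).mean(dim=1)            # (K,)
    x, y = m[:-1], m[1:]                        # first / last K-1 group means
    # Eq. (8) with theta = 1: hinge on the expected ordering x_i > y_i.
    return torch.clamp(-(x - y) + mar, min=0).mean()
```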

Ultimately, our model undergoes end-to-end training while concurrently minimizing the aforementioned losses. The total loss of our model is defined as:\begin{equation*} L_{total} = \lambda_{1} L_{quality} + \lambda_{2} L_{fp} + \lambda_{3} L_{rp}, \tag{9}\end{equation*} where $L_{quality}$ denotes the quality regression term, $L_{quality} = \sum_{n \in N}\|p_{n} - g_{n}\|_{1}$, and $\lambda_{1}, \lambda_{2}, \lambda_{3}$ are balancing coefficients.
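As a usage note, the three terms combine as in Eq. (9); this sketch plugs in the balancing coefficients $\lambda_{1} = \lambda_{2} = 1$ and $\lambda_{3} = 10$ reported later in the implementation details.

```python
import torch

def total_loss(pred: torch.Tensor, gt: torch.Tensor,
               l_fp: torch.Tensor, l_rp: torch.Tensor,
               lambdas=(1.0, 1.0, 10.0)) -> torch.Tensor:
    """Eq. (9): weighted sum of the l1 quality term and the two prior losses."""
    l_quality = (pred - gt).abs().sum()   # L_quality over the batch
    return lambdas[0] * l_quality + lambdas[1] * l_fp + lambdas[2] * l_rp
```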

SECTION IV.

Experiments

A. Experimental Protocol

1) Datasets and Evaluation Criteria:

We conducted extensive experiments on five image quality assessment datasets to evaluate the performance of our proposed method. These datasets cover many distortion scenarios, including three synthetic distortion datasets and two authentic distortion datasets.

  • Synthetic distortion datasets

    • LIVE [42]: applies five distortion types to 29 reference images, generating 779 distorted images, mainly at a resolution of $768 \times 512$. This dataset provides differential mean opinion scores ranging from 0 to 100.

    • CSIQ [43]: applies six distortion types to 30 reference images, generating 866 distorted images at a resolution of $512 \times 512$. This dataset provides differential mean opinion scores ranging from 0 to 1.

    • TID2013 [44]: applies 24 distortion types to 25 reference images, generating 3000 distorted images at a $512 \times 384$ resolution. This dataset provides mean opinion scores ranging from 0 to 9.

  • Authentic distortion datasets

    • CLIVE [45]: contains 1162 authentic distorted images with a resolution of $500 \times 500$ . This dataset provides image mean opinion scores ranging from 0 to 100.

    • KonIQ-10k [46]: contains 10073 authentic distorted images with a resolution of $1024 \times 768$. This dataset provides mean opinion scores ranging from 0 to 5.

Moreover, we quantified the assessment performance of our proposed method in terms of two widely used metrics: Spearman's Rank-Order Correlation Coefficient (SROCC) and Pearson's Linear Correlation Coefficient (PLCC). Both metrics indicate better model performance with higher values. For $N$ images, the SROCC and PLCC metrics can be computed as follows.\begin{align*} \mathrm{SROCC} &= 1 - \frac{6 \sum_{i=1}^{N}\left(\hat{y}_{i} - y_{i}\right)^{2}}{N\left(N^{2} - 1\right)}, \tag{10}\\ \mathrm{PLCC} &= \frac{\sum_{i=1}^{N}\left(y_{i} - \bar{y}\right)\left(\hat{y}_{i} - \hat{\bar{y}}\right)}{\sqrt{\sum_{i=1}^{N}\left(y_{i} - \bar{y}\right)^{2}} \sqrt{\sum_{i=1}^{N}\left(\hat{y}_{i} - \hat{\bar{y}}\right)^{2}}}, \tag{11}\end{align*} where $y_{i}$ and $\hat{y}_{i}$ denote the ground-truth and prediction scores of the $i$-th image, respectively, and $\bar{y}$ and $\hat{\bar{y}}$ are the mean values of the ground-truth and prediction scores over all images, respectively.
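In practice, both criteria can be computed with scipy, whose `spearmanr` additionally handles tied ranks that the closed form in Eq. (10) ignores; this is a minimal sketch, not the authors' evaluation script.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(gt: np.ndarray, pred: np.ndarray) -> tuple[float, float]:
    """Return (SROCC, PLCC) between ground-truth and predicted scores."""
    srocc, _ = spearmanr(gt, pred)
    plcc, _ = pearsonr(gt, pred)
    return srocc, plcc
```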

2) Implementation Details:

Following the standard training strategy for IQA methods, the pre-trained ResNet50 served as the backbone network for the GraspIQA model. From each image, a set of 25 patches was randomly sampled, and these patches underwent horizontal flipping with a 50% probability to enhance the dataset's diversity. It should be noted that the patch size is $128 \times 128$ for the synthetic datasets and $256 \times 256$ for the authentic datasets. Moreover, we set $\gamma = \frac{1}{12}$ in the frequency prior and $K = 10$ in the ranking prior. The hyperparameters $\lambda_{1}, \lambda_{2}, \lambda_{3}$ are empirically set to 1, 1, and 10, respectively.

Initially, we established ten random seeds to produce ten sets of training, validation, and testing splits with a 6:2:2 ratio. Specifically, the authentic datasets were split randomly, while the synthetic datasets were split by reference image to avoid content overlap. An Adam optimizer with a weight decay of $5 \times 10^{-4}$ and an initial learning rate of $2 \times 10^{-5}$ was used to minimize the loss on the training set. We employed the optimal weights from the validation set to assess the test samples and obtain the final model accuracy. During training, patches inherited the quality scores of their original images; in the validation and testing phases, the quality score of each image was derived by averaging the scores of all its patches. Notably, to keep the discrepancies among the three types of losses within a controllable range, we proportionally scaled the quality scores of all datasets to the $[0, 1]$ range. After ten experiments on the baseline network, the random seed yielding the median SROCC and PLCC values was selected for the ablation experiments to ensure the consistency of data samples.
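A brief sketch of the patch protocol described above, assuming tensor images and a generic `model` callable: 25 random crops per image, 50% horizontal flips during training, and score averaging over patches at test time.

```python
import torch
import torchvision.transforms as T

def sample_patches(img: torch.Tensor, patch_size: int,
                   n: int = 25, train: bool = True) -> torch.Tensor:
    crop = T.RandomCrop(patch_size)
    flip = T.RandomHorizontalFlip(p=0.5)
    patches = [crop(img) for _ in range(n)]      # 25 random crops per image
    if train:
        patches = [flip(p) for p in patches]     # 50% horizontal flip (training only)
    return torch.stack(patches)                  # (n, C, patch, patch)

@torch.no_grad()
def predict_image_score(model, img: torch.Tensor, patch_size: int) -> float:
    patches = sample_patches(img, patch_size, train=False)
    return model(patches).mean().item()          # average the per-patch scores
```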

B. Comparison Experiments

We selected 15 state-of-the-art methods to underscore the superiority of our proposed network. These comprise five traditional BIQA methods, namely a DCT-domain IQA method (DIIVINE) [47], a spatial-domain IQA method (BRISQUE) [6], a codebook-based IQA method (CORNIA) [48], a feature-enriched IQA method (ILNIQE) [7], and a statistics aggregation-based IQA method (HOSA) [49]. Furthermore, ten deep learning-based BIQA methods are chosen, which include five dual-stream networks: DIQaM-NR [50], TS-CNN [16], DBCNN [51], HyperIQA [52], and MMMNet [53]; two meta-learning networks: MetaIQA [54] and Lang et al. [55]; two networks that enhance sample diversity: CLRIQA [29] and CONTRIQUE [56]; and a large model-based network, CLIP-IQA [57]. All experimental results are taken from the original papers or reproduced from the released source code.

Table I presents the overall SROCC and PLCC results across the five IQA datasets. Several observations can be made. 1) Traditional BIQA methods perform poorly. 2) Among the five dual-stream deep learning networks, the average assessment accuracy of DIQaM-NR and TS-CNN is similar; both perform well on the LIVE dataset but poorly on the others. The DBCNN method achieves the best SROCC and PLCC values on the CSIQ dataset but performs poorly on the large KonIQ-10k dataset. The HyperIQA and MMMNet methods can accurately predict image quality on real-world datasets, yet they still fall short of the best values on the CSIQ and TID2013 datasets. 3) Among the two meta-learning-based networks, the MetaIQA method performs better: it achieves the best SROCC value on the TID2013 dataset but underperforms on the other datasets. 4) Among the two networks focusing on sample diversity, CLRIQA achieves the best SROCC and PLCC values on the LIVE dataset; however, its SROCC value is 7.5% lower than the best on the KonIQ-10k dataset. CONTRIQUE ranks third in average assessment accuracy and performs well on the CSIQ and TID2013 datasets. 5) For the large model-based network, the performance of CLIP-IQA is below standard on all datasets except KonIQ-10k. We attribute this mainly to the small sample sizes of IQA datasets, which are insufficient for training large models.

TABLE I. Comparison of GraspIQA vs. State-of-the-Art BIQA Algorithms on Synthetic and Authentic Distortion Datasets. In Each Column, the Best, Second-Best, and Third-Best Results Are Highlighted in Red, Blue, and Bold, Respectively. The Average Calculates the Mean Value Across Five Datasets. $^{\dagger}$ Denotes That the Code Is Not Publicly Available.

In summary, the 15 comparison methods mentioned above perform well only on some datasets, lacking universal applicability. To this end, we propose a cognitive framework from pixels to rich-nodes that mines highly coupled features from low-cohesion pixels, thereby fostering information interaction across the scale and spatial dimensions. Experimental results show that our GraspIQA method performs exceptionally well: in ten comparisons, it attains two best values, four second-best values, and four third-best values, demonstrating its competitiveness on both synthetic and authentic datasets. Ultimately, based on the assessment results on the five datasets, GraspIQA exhibits the highest average SROCC and PLCC values, further validating its superiority.

C. Ablation Experiments

To evaluate the effectiveness of the design modules of our GraspIQA method, we conducted ablation experiments on the LIVE and TID2013 datasets, as shown in Table II.

TABLE II. Ablation Experimental Results (SROCC / PLCC). GIFI Refers to the Graph-Inspired Feature Integrator, and the Best Results Are Highlighted in Bold.

In Table II, “Baseline” refers to the ResNet50 model with two fully connected layers. Adding individual modules improves performance, with the GIFI module showing the most significant gain. SROCC and PLCC increase by (1.0%, 0.6%) and (2.7%, 1.9%) on the LIVE and TID2013 datasets, respectively. This improvement results from the shift from pixels to rich nodes, enabling better feature extraction and multi-dimensional interaction. We also examine the impact of self-cognitive prior loss. The $L_{fp}$ function improves SROCC by 1.5% on TID2013. The $L_{rp}$ function enhances understanding of quality differences by leveraging ranking relationships, achieving an SROCC of 0.809 on TID2013. Ultimately, GraspIQA achieves the highest performance on both LIVE and TID2013, with metrics of 0.973/0.976 and 0.842/0.872, respectively.

D. Effectiveness of Frequency Prior

Figs. 6 and 7 depict schematic diagrams of feature mappings under different distortion conditions. The ResNet50 network is divided into four stages. Since shallow-level features contain more detailed information, we focus on analyzing the effectiveness of the first frequency prior (Stage 1 $\rightarrow$ Stage 2).

Fig. 6. Schematic of feature maps under the "JPEG2000" distortion type. In columns 2-3, rows 1 and 2 represent the 10th-channel feature mapping obtained by ResNet50 after Stage 1 and Stage 2, respectively.

Fig. 7. Schematic of feature maps under the "AWGN" distortion type. In columns 2-3, rows 1 and 2 represent the 10th-channel feature mapping obtained by ResNet50 after Stage 1 and Stage 2, respectively.

Experimental results show that as network depth increases, convolutional neural networks tend to focus more on global structures. However, this tendency introduces some issues. As shown in Fig. 6 (w/o $L_{fp}$), texture features diminish after Stage 1, and the compression loss of the trees in the background becomes less noticeable. Similarly, Fig. 7 (w/o $L_{fp}$) illustrates a loss of facial details and clothing texture of the foreground person during Stage 2. Excessive information loss is problematic for the IQA task, making it difficult for BIQA networks to accurately assess image quality based on deep features. To address this, we introduce the frequency prior to perform a step-wise correction of multi-level features, helping the network learn and recover lost high-frequency features. Notably, compared to the w/o $L_{fp}$ setting, the prediction errors of w/ $L_{fp}$ are significantly reduced, decreasing by 0.2144 and 0.2742, respectively.

E. Cross-Dataset Experiments

An excellent quality assessment algorithm should not only exhibit outstanding performance on an individual dataset but also generalize accurately across datasets. Therefore, to verify the robustness of the proposed method, we conducted extensive cross-dataset experiments on the synthetic datasets LIVE, CSIQ, and TID2013, and the authentic dataset CLIVE. We compared GraspIQA with three traditional methods: BRISQUE [6], CORNIA [48], and FRIQUEE [58], and five deep learning methods: WaDIQaM [50], CNNIQA [59], NIMA [60], DBCNN [51], and HyperIQA [52]. Notably, the cross-dataset experiments were rigorously designed: training on one full dataset and testing on another full dataset.

Table III depicts the results of the cross-dataset experiments. We conducted two sets of generalization experiments for each synthetic dataset and one set for the authentic dataset. Based on the experimental results, we arrive at the following conclusions. 1) The generalization performance of traditional BIQA methods is generally lower than that of deep learning methods. 2) All methods perform well in cross-dataset experiments from large to small datasets, but performance is limited when weights from smaller datasets are applied to larger ones. For example, in TID2013 (3000) $\rightarrow$ CSIQ (866) and TID2013 (3000) $\rightarrow$ LIVE (779), the best SROCC and PLCC values exceed 0.8; in contrast, for CSIQ (866) $\rightarrow$ TID2013 (3000) and LIVE (779) $\rightarrow$ TID2013 (3000), the SROCC values fall below 0.6. Fortunately, when our GraspIQA method is applied to authentic datasets, the situation improves significantly. Authentic images captured in varied outdoor environments provide a broader range of scenarios, allowing GraspIQA to effectively extend knowledge from small to large datasets: in CLIVE (1162) $\rightarrow$ KonIQ (10073), the PLCC value reaches 0.806. 3) In cross-dataset experiments with TID2013, CNNIQA shows relatively stable generalization, but its SROCC and PLCC values are below 0.7. The other seven methods perform well on only one dataset each. For example, NIMA excels on CSIQ but not on LIVE, while the traditional methods and WaDIQaM, DBCNN, and HyperIQA perform better with LIVE data; their accuracy for TID2013 (3000) $\rightarrow$ CSIQ (866) is notably lower than for TID2013 (3000) $\rightarrow$ LIVE (779). In contrast, the GraspIQA network achieves over 0.8 accuracy in both generalization experiments on the TID2013 dataset, demonstrating superior generalization. Our method stands out with six optimal and six suboptimal values across the seven cross-dataset experiments, achieving the highest average SROCC and PLCC values.

TABLE III. Cross-Dataset Experimental Results (SROCC / PLCC). Values in Parentheses Indicate the Number of Samples in Each Dataset. "Average" Refers to the Mean Value Across All Generalization Experiments. In Each Column, the Best and Second-Best Results Are Highlighted in Red and Blue.

SECTION V.

Conclusion

We propose a cognition-inspired BIQA approach that emulates the rich-club mechanism in the human brain. By reconstructing the topology of conventional neural networks, we form a shift from pixels to rich-nodes. Such learnable rich-nodes serve as vital connectors to mine the semantic interaction between multi-scale feature maps. Additionally, we devise a self-cognitive prior loss to assist the network in extracting consistent information from limited samples, thereby improving the generalization performance of our model. Extensive experimental results show that our GraspIQA method is competitive on the five benchmark datasets and exhibits excellent accuracy in cross-dataset experiments. Moving forward, we aim to further enhance the scalability of our method, particularly in terms of its generalization capability from small to large datasets.
