Introduction
In recent years, the rapid development of video streaming platforms has led to exponential growth in user engagement. This trend raises concerns about detail loss and color distortion during multimedia data acquisition, compression, transmission, and storage. Consequently, Image Quality Assessment (IQA) emerges as a pivotal technology for evaluating user visual perception and augmenting overall viewing experiences [1], [2], [3].
In terms of dependency on the reference image, IQA methods are classified into three categories: full-reference IQA (FR-IQA), reduced-reference IQA (RR-IQA), and no-reference IQA (NR-IQA), also known as blind IQA (BIQA) [4]. Both FR-IQA and RR-IQA methods are grounded in discrepancy priors between reference and distorted images to capture the crucial features that affect image quality. However, given the difficulty of acquiring reference images in real scenarios, BIQA methods have attracted more attention for their flexibility than FR-IQA and RR-IQA methods. BIQA methods forgo direct comparison with reference images, instead employing intrinsic features of distorted images for quality assessment.
Traditional BIQA methods [5], [6], [7] are founded on the principles of the human visual system and image statistical properties. Such methods depend on manually designed feature extractors, posing challenges in handling IQA datasets with various distortions. Recently, the rapid advancement of deep learning technology has revitalized the field of BIQA, empowering models to autonomously acquire feature representations in a data-driven manner. Despite notable progress in assessment performance, these methods still suffer from the following limitations: 1) current deep learning-based BIQA methods exhibit inadequate capabilities in integrating multi-level features, which poses risks of scale sensitivity and low flexibility; 2) BIQA datasets generally have insufficient sample sizes, making models prone to overfitting.
BIQA is a task rooted in subjective perception, which requires deep learning networks to mimic human cognition. Separation and integration are key components of the human cognitive system [8]. The separation mechanism breaks complex information into independent components, each focusing on specific perceptual features. Meanwhile, the integration mechanism combines this information for higher-order cognition. Deep learning has been empirically validated for its outstanding performance in segregating multidimensional features. However, early integration research concentrated on fixed topologies, exemplified by feature pyramid structures [9], [10], [11]. Such structures have a low tolerance for scale inconsistency, and their scale adjustment strategies pose potential hazards of feature loss and improper padding. Moreover, progressive feature aggregation focuses on adjacent feature maps, weakening interactions between non-adjacent ones [12]. Recently, biologically inspired integration methods have gained attention [13]. Zhao et al. [14] employed a graph framework to capture dynamic feature interactions. Beyond resolving the concern of scale consistency, the graph framework facilitates parallel feature transmission across diverse scales. Regrettably, although the grouped subsets derived from the pre-trained Convolutional Oriented Boundaries (COB) method decrease the network's computational cost, the non-end-to-end design may inadvertently learn features irrelevant to the IQA task. This potential risk could increase training difficulty and impede synergistic optimization among disparate network components.
To tackle the above-mentioned issues, we rethink the multidimensional feature integration process and find that it resonates with the rich club in network neuroscience. The rich-club theory [15] holds that separated modules reflect the informational specificity of different encephalon regions, while the few nodes exhibiting extensive connectivity are designated as rich connectors that facilitate inter-module communication, as depicted in Fig. 1(a). Therefore, we put forward a cognitive shift from the pixel level to the rich-node level. Individual pixels are highly sensitive to perturbations, making it difficult for them to provide reliable outputs. By autonomously learning similarity matrices within feature maps, we can identify rich-nodes with explicit feature representations from pixels, as depicted in Fig. 1(b). Furthermore, intra-layer and inter-layer graph structures are designed to update multidimensional features synchronously, facilitating dynamic integration across the spatial and scale dimensions.
Fig. 1. The core idea of GraspIQA. This work is motivated by the rich-club concept, as depicted in (a), where the brain is a complex network composed of multiple interconnected modules, each exhibiting distinct information processing capabilities. Moreover, a small subset of nodes featuring high connectivity is recognized as critical intermediaries for information integration across encephalon regions. Analogously, as illustrated in (b), the rectangular boxes represent specific features extracted from the input data. Subsequently, rich-nodes (represented by red nodes within circles) serve as the basis for constructing intra- and inter-layer graph structures that emulate dynamic neural circuits.
Moreover, the sample scarcity in existing IQA datasets poses a risk of model overfitting. When confronted with unknown matters, humans rely on prior cognition to swiftly form a preliminary understanding of new information, thereby fostering higher perceptiveness under variable conditions. Analogously, prior knowledge in a deep learning network essentially refers to assumptions, rules, and domain-specific knowledge. In this paper, we propose two prior hypotheses. The first one is the frequency prior. We argue that high-frequency components typically indicate rapid changes in pixel values, such as image edges and textures, which play a crucial role in quantifying the extent of image distortion. The second one is the ranking prior. As described in [16], traditional quality regression operators (such as the $\ell_1$ and $\ell_2$ losses) fit only absolute quality scores and neglect the relative ranking among samples; the ranking prior instead exploits these ranking relationships to deepen the network's understanding of quality differences.
In summary, we propose a blind image quality assessment method based on Graph learning and self-cognitive priors, termed GraspIQA. The contributions of our work are summarized as follows.
We propose a cognition-inspired BIQA framework, aiming to simulate the integration process of the human cognitive system, thereby striving for assessment results that better align with human subjective perceptions.
We devise a graph-inspired feature integrator (GIFI) to establish strong semantic correlations between pixels and rich-nodes. Meanwhile, two graph structures are constructed for intra-layer contextual propagation and inter-layer semantic interaction. This dynamic topology demonstrates remarkable flexibility in facilitating parallel information transfer across diverse scale features.
We present a self-cognitive prior loss, which contains two prior hypotheses: frequency prior and ranking prior. The former aims to restore high-frequency information progressively, guiding the network to focus on crucial distorted features. Meanwhile, the latter provides an in-depth understanding of quality differences through sample ranking relationships.
The remainder of the paper is organized as follows. We discuss the related works in Section II. The proposed GraspIQA network is introduced in Section III, and Section IV describes the experimental results. Finally, some conclusions are given in Section V.
Related Works
A. Blind Image Quality Assessment
Due to its generalizability, the BIQA method has emerged as an important research field. Nevertheless, due to the absence of reference images, its assessment performance often falls short compared to FR-IQA methods. Several works [17], [18], [19] attempted to improve assessment performance by introducing learnable priors.
Early works [20], [21], [22] generated pseudo-reference images via image restoration methods to establish discrepancy prior-assisted networks. However, image restoration methods often fabricate plausible details, making it challenging for the network to distinguish between generated textures and true noise [23]. To address this, Yang et al. [24] proposed a saliency-assisted prediction branch to learn attention masks, which participate in locally weighted estimation. Yao et al. [25] simulated the human scanning path by selecting a series of peak responses in the saliency map to acquire cropped patches. The aforementioned saliency-based BIQA methods attempt to mimic human attention preferences but neglect the interdependencies between salient and non-salient regions, potentially resulting in assessment bias. Frequency domain analysis is a fundamental technique in image processing that captures micro-structural variations in images. Zhang et al. [26] devised a dual-stream deep network to separately extract high-frequency and low-frequency image features for quantifying the quality of super-resolution images. Zhou et al. [27] utilized the discrete Haar wavelet transform to obtain wavelet sub-bands and calculate entropy intensities. Moreover, the ranking prior is also a crucial strategy for improving assessment performance. Golestaneh et al. [28] proposed a margin triplet loss function to learn the relative ranking between images with the highest and lowest quality scores. Ou et al. [29] generated many unlabeled ranking samples and set upper and lower ranking bounds to construct a controllable list-wise ranking loss.
In this work, we present a self-cognitive prior loss, which contains two prior hypotheses: the frequency prior and the ranking prior. Unlike the above-mentioned methods, the designed frequency prior does not require building multiple stream branches. Instead, it directly participates in IQA feature extraction and incrementally corrects multi-level features. Additionally, the ranking prior underscores the global ranking relationship among samples. Each sample within a batch contributes to the margin ranking loss, obviating the necessity to create pairs of ranked samples and mitigating the overfitting problem that arises from training solely on local extreme instances.
B. Graph Neural Networks
Graph neural networks [30], [31] draw inspiration from the intricate interconnections among neurons. In the human brain, neurons interact through synaptic connections, mirroring the node-and-edge connectivity of graph neural networks. Furthermore, graph neural networks support parallel computation, which is conducive to learning feature interactions from large-scale graph data. Therefore, a graph neural network can be viewed as a straightforward simulation of the information transmission mechanism in the human brain, which broadens the domain of applicability of neural networks and enhances the effectiveness of handling unstructured data [32].
Shen et al. [33] proved that transforming input images into graph-based representations is a versatile and effective option for capturing visual perceptual features. Xu et al. [34] employed multiple viewpoints of omnidirectional images as nodes to construct a spatial viewport graph. Shan et al. [35] introduced a novel graph convolutional approach, wherein point clouds are conceptualized as graphs, facilitating the exploration of structural and textural perturbations through the interactions among local points. Sun et al. [36] investigated the intrinsic correlation between distortion types, levels, and quality scores, and proposed a general BIQA framework for distortion representation learning. Wang et al. [37] introduced an adaptive graph attention module, which adeptly refines post-transformer features into an adaptive graph structure, thereby facilitating local information enhancement. However, treating images as graphs is challenging because images contain vast numbers of pixels, leading to a marked surge in computational complexity. Moreover, owing to the sensitivity of individual pixels to interference, pixel-level features often lack clear definitions. Therefore, we propose a more elegant cognitive representation of multidimensional features: a self-attention mechanism summarizes pixels with similar visual characteristics into a rich-node, and the dense interconnections among these rich-nodes facilitate robust information transmission to form clear cognition.
Methodology
In this section, we elaborate on the GraspIQA method, as illustrated in Fig. 2, which consists of two components: a graph-inspired feature integrator for the spatial and scale interactions, and a self-cognitive prior loss for the in-depth comprehension of input samples.
Fig. 2. The network architecture of the GraspIQA model. GraspIQA is built upon the ResNet50 network, where rich-nodes are extracted from multi-level features to construct inter-layer and intra-layer graph structures. Furthermore, we employ self-distillation to guide the current feature maps in learning high-frequency information, and the relative ordering relationship between samples is emphasized by the ground-truth ranking priors. In the intra-layer graph learning, the orange, green, and blue lines represent feature learning within each respective feature layer.
A. Graph-Inspired Feature Integrator
Motivated by the hierarchical cognition mechanism in the human visual system [13], several IQA methods [22], [38] have shifted their focus to exploring multi-scale feature integration in a more biologically plausible manner. For this purpose, feature pyramid structures have been adopted for the BIQA task. Despite their successes, traditional fixed-topology integration networks face challenges when integrating diverse cross-scale features due to their reliance on progressive fusion strategies. Therefore, we propose a graph-inspired feature integrator to form a cognitive shift from the pixel level to the rich-node level, thereby fostering contextual propagation in the spatial dimension and semantic interaction in the scale dimension, as shown in Fig. 3.
Fig. 3. Comparison of traditional feature pyramid structures and our structure. Here, the circles in (a) and (b) denote different feature maps (pixel-level), and the circles in (c) indicate rich-nodes.
1) Rich-Nodes Mapping:
Given a feature layer, we flatten it into a pixel matrix $I \in \mathbb{R}^{HW \times C}$, where $H$, $W$, and $C$ denote its height, width, and number of channels, and introduce a set of learnable rich-node embeddings $\widetilde{S} \in \mathbb{R}^{M \times C}$. The attention map between pixels and rich-nodes is computed as:\begin{equation*} Att = \mathrm{Softmax}\left({\frac{I \cdot \widetilde{S}^{\mathrm{T}}}{\sqrt{C}}}\right). \tag{1}\end{equation*}
Based on the attention map $Att$, the rich-nodes are updated as:\begin{equation*} S = \overline{Att}^{\mathrm{T}} \cdot I, \tag{2}\end{equation*} where $\overline{Att}$ denotes the normalized attention map, so that each rich-node aggregates a weighted combination of pixel features with similar visual characteristics.
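To make the mapping concrete, the following PyTorch sketch implements Eqs. (1) and (2) under the reading above; the tensor shapes, the module name `RichNodeMapping`, and the column-wise normalization of $\overline{Att}$ are our assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class RichNodeMapping(nn.Module):
    """Maps pixel-level features to M rich-nodes via attention,
    following Eqs. (1)-(2); a minimal sketch, not the exact model."""

    def __init__(self, channels: int, num_nodes: int):
        super().__init__()
        # Learnable rich-node seeds S~ (one C-dim embedding per node).
        self.seeds = nn.Parameter(torch.randn(num_nodes, channels))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) -> pixel matrix I of shape (B, HW, C).
        B, C, H, W = feat.shape
        I = feat.flatten(2).transpose(1, 2)                          # (B, HW, C)
        # Eq. (1): scaled dot-product similarity between pixels and seeds.
        att = torch.softmax(I @ self.seeds.t() / C ** 0.5, dim=-1)   # (B, HW, M)
        # Column-normalize so each rich-node's pixel weights sum to 1
        # (our reading of the overline in Eq. (2)).
        att_bar = att / att.sum(dim=1, keepdim=True).clamp_min(1e-8)
        # Eq. (2): rich-nodes as weighted sums of pixel features.
        return att_bar.transpose(1, 2) @ I                           # (B, M, C)
```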
2) Graph Construction:
Building upon [14], we introduce a dual-edge framework encompassing intra-layer and inter-layer graph learning. Regarding intra-layer edges, inspired by the observations in [40], we argue that neural networks tend to glean crucial features from central cross positions (known as skeletons) within the feature map rather than from peripheral corners. Thus, we establish ancestral-descendant relationships for each node based on its skeleton position to facilitate the propagation of contextual information. For inter-layer edges, since connecting each rich-node in one layer to all rich-nodes in another layer is computationally expensive, we start from the layer with fewer rich-nodes and evenly distribute the rich-nodes of the lower-level feature map among them. Subsequently, we establish inter-layer graph relationships to bridge semantic disparities across different feature maps. Note that all edges are bidirectional; the connections of intra-layer and inter-layer edges are illustrated in Fig. 4.
Fig. 4. The connections of intra-layer and inter-layer edges. (a) Intra-layer nodes construct ancestor-descendant edges based on their neighboring skeleton nodes; (b) The first sub-graph from left to right represents the inter-layer node relationships between adjacent layers, and the second sub-graph represents the inter-layer node relationships between non-adjacent layers.
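As a minimal sketch of the inter-layer wiring described above, the function below connects each rich-node of the coarser layer to an even, contiguous share of the finer layer's rich-nodes and mirrors every edge; the contiguous grouping and the `edge_index` format are assumptions, since the paper only states that nodes are evenly distributed and that all edges are bidirectional.

```python
import torch

def inter_layer_edges(num_small: int, num_large: int) -> torch.Tensor:
    """Builds bidirectional inter-layer edges anchored on the layer
    with fewer rich-nodes. Small-layer nodes are indexed [0, num_small);
    large-layer nodes are offset by num_small. Returns (2, E)."""
    group = num_large / num_small
    src, dst = [], []
    for i in range(num_small):
        lo, hi = round(i * group), round((i + 1) * group)
        for j in range(lo, hi):
            src.append(i)                 # coarse-layer rich-node
            dst.append(num_small + j)     # fine-layer rich-node (offset)
    edges = torch.tensor([src, dst])
    # Mirror every edge so the graph is bidirectional.
    return torch.cat([edges, edges.flip(0)], dim=1)

# Example: 4 rich-nodes in the coarse layer, 16 in the fine layer.
print(inter_layer_edges(4, 16).shape)  # torch.Size([2, 32])
```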
3) Graph Learning:
The graph-inspired feature integrator consists of inter-layer and intra-layer graph learning, which share a common set of nodes to extract different structural features. Specifically, both graph structures employ the Graph Attention Network (GAT) [30] to integrate information from neighboring nodes via a self-attention mechanism, thereby obtaining updated representations of the current nodes. This adaptive attention mechanism allows the model to dynamically adjust attentional preferences based on different inter-node relationships. Importantly, each layer of graph learning has its own learnable parameters, which are not shared with any other layer. For node $\vec{N}_{i}$, the update process can be formulated as:\begin{equation*} \vec{N}_{i}^{\prime}=\mathcal{M}\left({\vec{N}_{i},\left\{{\vec{N}_{j}}\right\}_{j \in \mathcal{C}_{i}}}\right), \tag{3}\end{equation*} where $\mathcal{M}(\cdot)$ denotes the attention-based aggregation function and $\mathcal{C}_{i}$ is the set of neighbors of node $i$.
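The node update of Eq. (3) can be sketched with a minimal single-head attention layer in the spirit of GAT [30]; the actual model may use the standard multi-head formulation, so the layer below is illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGAT(nn.Module):
    """Single-head graph attention realizing Eq. (3): each node is
    updated from itself and its neighbors C_i via learned weights.
    A simplified sketch of GAT, not the paper's exact layer."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim, bias=False)
        self.attn = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, x: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; edge_index: (2, E), self-loops included.
        h = self.proj(x)
        src, dst = edge_index
        # Attention logit for each edge (src -> dst).
        e = F.leaky_relu(self.attn(torch.cat([h[dst], h[src]], dim=-1)).squeeze(-1))
        w = torch.exp(e - e.max())  # stabilized exponentials
        # Softmax over each destination node's incoming edges.
        denom = torch.zeros(x.size(0), device=x.device).scatter_add_(0, dst, w)
        alpha = w / denom.clamp_min(1e-8)[dst]
        # Weighted aggregation of neighbor messages.
        out = torch.zeros_like(h).index_add_(0, dst, alpha.unsqueeze(-1) * h[src])
        return F.elu(out)
```

Consistent with the paper, a separate instance of such a layer (with its own unshared parameters) would be created for every intra-layer and inter-layer graph learning step.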
B. Self-Cognitive Prior Loss
The limited sample sizes of current IQA datasets contribute to the risk of model overfitting. We aim to induce common rules from limited data to establish two self-cognitive priors: the frequency prior and the ranking prior.
1) Frequency Prior:
High-frequency information contains features such as textures and details, which better reflect the extent of image distortion than low-frequency components, as depicted in Fig. 5. However, during feature extraction, as the depth of the network increases, the feature representation is gradually converted from its raw form into abstract structures. By selectively filtering out high-frequency information, the network focuses on unraveling the global structure of the input data, i.e., the low-frequency information. This behavior runs contrary to the IQA task's primary need to attend to high-frequency information. Therefore, we propose a frequency prior loss for the layer-by-layer correction of multi-level features.
Fig. 5. Comparison of frequency maps of different images. (a) Comparison images, wherein the first row is the original image and rows 2–4 are distorted images for various degradation types; (b) Low-frequency maps; (c) High-frequency maps; it is noteworthy that, due to the faintness of the high-frequency features, we uniformly double the high-frequency maps of all images to strengthen their feature expression; (d) Histograms of the low-frequency and high-frequency maps (left: low-frequency results; right: high-frequency results). Here, we employ the Fast Fourier Transform with a filter radius of 30 to extract the high-frequency and low-frequency maps of diverse images. The low-frequency maps predominantly retain the intrinsic information and only manifest distortion under the "CONTRAST" condition. Further histogram analysis indicates a high degree of similarity in pixel distribution among the low-frequency maps of the original, AWGN, and BLUR images, rendering them visually indistinguishable. In contrast, the high-frequency maps emphasize the extent of image degradation across diverse scenarios, exemplified by prominent noise in the AWGN image, fuzzy edges in the BLUR image, and low pixel values in the CONTRAST image. Meanwhile, the histogram results also demonstrate distinct differences in pixel distribution among the high-frequency maps of different images.
The ResNet50 network is structured into four stages, with multiple residual blocks in each stage employed to extract feature maps at a 1/2 downsampling rate. This design implies that the antecedent stage retains valuable high-frequency features that are omitted in the succeeding stage. Therefore, for the $i$-th stage, we extract the high-frequency prior $p_{i}$ from the preceding stage's feature map $I_{i-1}$ as:\begin{equation*} p_{i} = \mathcal{F}^{-1}\left({\sigma_{\gamma}\left({\mathcal{F}\left({I_{i-1}}\right)}\right)}\right), \tag{4}\end{equation*} where $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote the Fourier transform and its inverse, and $\sigma_{\gamma}(\cdot)$ is a high-pass filter with radius $\gamma$.
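A possible realization of Eq. (4) with an ideal high-pass filter is sketched below; the filter radius of 30 is borrowed from the Fig. 5 visualization and may differ from the value actually used in training.

```python
import torch

def high_freq_prior(feat: torch.Tensor, radius: int = 30) -> torch.Tensor:
    """Eq. (4): p_i = F^{-1}(sigma_gamma(F(I_{i-1}))). Extracts the
    high-frequency component of a feature map via an ideal high-pass
    filter; a sketch under the stated radius assumption."""
    _, _, H, W = feat.shape
    # 2D FFT with the zero frequency shifted to the spatial center.
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    # Ideal high-pass mask: zero out a centered disc of the given radius.
    yy, xx = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    dist = torch.sqrt((yy - H / 2) ** 2 + (xx - W / 2) ** 2)
    mask = (dist > radius).float()
    # Inverse FFT of the filtered spectrum; keep the real part.
    high = torch.fft.ifft2(torch.fft.ifftshift(spec * mask, dim=(-2, -1)))
    return high.real
```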
The frequency prior loss then performs self-distillation between the current stage's features (student) and the high-frequency prior (teacher):\begin{equation*} L_{fp}=\sum_{j \in J}\left\|{\frac{V^{S}_{j}}{\left\|{V^{S}_{j}}\right\|_{2}}-\frac{V^{T}_{j}}{\left\|{V^{T}_{j}}\right\|_{2}}}\right\|_{2}, \tag{5}\end{equation*} where $J$ denotes the set of supervised stages, and the spectral power vectors $V^{S}$ and $V^{T}$ are computed from the Fourier spectra $P^{S}$ and $P^{T}$ of the student and teacher features as:\begin{align*} V^{S}=&\ \text{vec}\left({\sum_{j=1}^{C}\left|{P_{j}^{S}}\right|^{2}}\right), \tag{6}\\ V^{T}=&\ \text{vec}\left({\sum_{j=1}^{C}\left|{P_{j}^{T}}\right|^{2}}\right), \tag{7}\end{align*} where $\text{vec}(\cdot)$ denotes vectorization and $C$ is the channel number.
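Eqs. (5)-(7) can then be assembled as follows; treating the current stage's features as the student, the high-frequency prior from Eq. (4) as the teacher, and pooling the teacher to the student's spatial size are our assumptions about how the two are aligned.

```python
import torch
import torch.nn.functional as F

def freq_prior_loss(student_feats, teacher_feats):
    """Eqs. (5)-(7): compare L2-normalized spectral power vectors of
    student and teacher features at each supervised stage; averaging
    over the batch is an added convention of this sketch."""
    loss = 0.0
    for fs, ft in zip(student_feats, teacher_feats):
        # Spatially align teacher and student (alignment is assumed).
        if ft.shape[-2:] != fs.shape[-2:]:
            ft = F.adaptive_avg_pool2d(ft, fs.shape[-2:])
        # Eqs. (6)-(7): channel-wise sum of squared spectral magnitudes,
        # vectorized over the spatial frequencies.
        vs = torch.fft.fft2(fs).abs().pow(2).sum(dim=1).flatten(1)  # (B, HW)
        vt = torch.fft.fft2(ft).abs().pow(2).sum(dim=1).flatten(1)
        # Eq. (5): distance between L2-normalized power vectors.
        loss = loss + (
            F.normalize(vs, dim=-1) - F.normalize(vt, dim=-1)
        ).norm(dim=-1).mean()
    return loss
```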
2) Ranking Prior:
Although traditional quality regression operators, such as the $\ell_1$ and $\ell_2$ losses, can fit absolute quality scores, they overlook the relative ranking relationships among samples, limiting the network's understanding of fine-grained quality differences. We therefore introduce a ranking prior that explicitly supervises the relative ordering of samples within a batch.
Firstly, the ground-truth scores of the distorted images within a batch are sorted, and the margin ranking loss is computed over adjacent samples in this ordering:\begin{align*} L_{rp}\left({X,Y,\theta}\right) = \frac{1}{K-1} \sum_{i=1}^{K-1} \max\left\{{0,-\theta \times \left({x_{i}-y_{i}}\right)+mar}\right\}, \tag{8}\end{align*} where $K$ is the batch size, $x_{i} \in X$ and $y_{i} \in Y$ denote the predicted scores of the $i$-th adjacent sample pair, $\theta$ indicates their ground-truth ordering, and $mar$ is the ranking margin.
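Under the adjacent-pair reading above, Eq. (8) reduces to a few lines; the margin value and the descending sort convention below are placeholders.

```python
import torch

def ranking_prior_loss(pred: torch.Tensor, gt: torch.Tensor, mar: float = 0.5):
    """Eq. (8): margin ranking hinge over the K-1 adjacent pairs of a
    batch sorted by ground-truth quality; a hedged sketch."""
    order = torch.argsort(gt, descending=True)  # ground-truth ranking
    p = pred[order]
    x, y = p[:-1], p[1:]   # K-1 adjacent pairs; x should outrank y
    theta = 1.0            # ordering indicator under the descending sort
    return torch.clamp(-theta * (x - y) + mar, min=0).mean()
```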
Ultimately, our model undergoes end-to-end training while concurrently minimizing the aforementioned losses. The total loss of our model is defined as:\begin{equation*} L_{total} = \lambda_{1} L_{quality} + \lambda_{2} L_{fp} + \lambda_{3} L_{rp}, \tag{9}\end{equation*} where $L_{quality}$ denotes the quality regression loss, and $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are weighting factors that balance the three terms.
Experiments
A. Experimental Protocol
1) Datasets and Evaluation Criteria:
We conducted extensive experiments on five image quality assessment datasets to evaluate the performance of our proposed method. These datasets cover many distortion scenarios, including three synthetic distortion datasets and two authentic distortion datasets.
Synthetic distortion datasets
LIVE [42]: applies five different distortion procedures to 29 reference images, generating 779 distorted images, mainly at a resolution of $768 \times 512$. This dataset provides image differential mean opinion scores ranging from 0 to 100.
CSIQ [43]: applies six different distortion procedures to 30 reference images, generating 866 distorted images at a resolution of $512 \times 512$. This dataset provides image differential mean opinion scores ranging from 0 to 1.
TID2013 [44]: applies 24 different distortion procedures to 25 reference images, generating 3000 distorted images at a resolution of $512 \times 384$. This dataset provides image mean opinion scores ranging from 0 to 9.
Authentic distortion datasets
CLIVE [45]: contains 1162 authentically distorted images at a resolution of $500 \times 500$. This dataset provides image mean opinion scores ranging from 0 to 100.
KonIQ-10k [46]: contains 10073 authentically distorted images at a resolution of $1280 \times 768$. This dataset provides image mean opinion scores ranging from 0 to 5.
Moreover, we quantified the assessment performance of our proposed method in terms of two widely used metrics: Spearman's rank-order correlation coefficient (SROCC) and Pearson's linear correlation coefficient (PLCC). Both metrics indicate better model performance with higher values. For $N$ images, the SROCC and PLCC metrics are computed as follows:\begin{align*} \mathrm{SROCC}=&\ 1-\frac{6 \sum_{i=1}^{N}\left({\hat{y}_{i}-y_{i}}\right)^{2}}{N\left({N^{2}-1}\right)}, \tag{10}\\ \mathrm{PLCC}=&\ \frac{\sum_{i=1}^{N}\left({y_{i}-\bar{y}}\right)\left({\hat{y}_{i}-\hat{\bar{y}}}\right)}{\sqrt{\sum_{i=1}^{N}\left({y_{i}-\bar{y}}\right)^{2}} \sqrt{\sum_{i=1}^{N}\left({\hat{y}_{i}-\hat{\bar{y}}}\right)^{2}}}, \tag{11}\end{align*} where, in (10), $y_{i}$ and $\hat{y}_{i}$ denote the ranks of the ground-truth and predicted scores of the $i$-th image, whereas in (11) they denote the scores themselves, with $\bar{y}$ and $\hat{\bar{y}}$ the corresponding mean values.
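In practice, both metrics are standard one-liners with SciPy; note that some IQA works fit a logistic mapping before computing PLCC, and since that step is not detailed here, raw predictions are used in this sketch.

```python
import numpy as np
from scipy import stats

def evaluate(pred: np.ndarray, mos: np.ndarray):
    """Eqs. (10)-(11) via SciPy: SROCC on the score ranks,
    PLCC on the raw scores."""
    srocc = stats.spearmanr(pred, mos)[0]
    plcc = stats.pearsonr(pred, mos)[0]
    return srocc, plcc
```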
2) Implementation Details:
Following the standard training strategy for IQA methods, the pre-trained ResNet50 served as the backbone network for the GraspIQA model. From each image, a set of 25 patches was randomly sampled, and these patches underwent horizontal flipping with a 50% probability to enhance the dataset’s diversity. It should be noted that the patch size of the synthetic dataset is
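A torchvision sketch of this augmentation follows; the $224 \times 224$ patch size is a placeholder, as the exact patch sizes for the synthetic and authentic datasets are not reproduced in this text.

```python
import torchvision.transforms as T

# 25 randomly cropped patches per image, each flipped horizontally with
# probability 0.5; the 224x224 crop size is an assumed placeholder.
patch_transform = T.Compose([
    T.RandomCrop(224),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

def sample_patches(pil_image, n_patches: int = 25):
    return [patch_transform(pil_image) for _ in range(n_patches)]
```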
Initially, we established ten random seeds to produce ten sets of training, validation, and testing splits with a 6:2:2 ratio. Specifically, the authentic datasets were split randomly at the image level, while the synthetic datasets were split by reference image to avoid content overlap between splits. An Adam optimizer with the weight decay
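The content-aware split for the synthetic datasets can be sketched as follows; the `ref_ids` interface (one reference-image identifier per distorted image) is an assumed input format.

```python
import random

def split_synthetic(ref_ids, seed: int, ratios=(0.6, 0.2, 0.2)):
    """6:2:2 split performed at the reference-image level so that
    distorted versions of the same content never appear in both the
    training and testing sets; a sketch of the described protocol."""
    refs = sorted(set(ref_ids))
    random.Random(seed).shuffle(refs)
    n_tr = int(ratios[0] * len(refs))
    n_va = int(ratios[1] * len(refs))
    train_refs = set(refs[:n_tr])
    val_refs = set(refs[n_tr:n_tr + n_va])
    train = [i for i, r in enumerate(ref_ids) if r in train_refs]
    val = [i for i, r in enumerate(ref_ids) if r in val_refs]
    test = [i for i, r in enumerate(ref_ids) if r not in train_refs | val_refs]
    return train, val, test
```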
B. Comparison Experiments
We selected 15 state-of-the-art methods to underscore the superiority of our proposed network. These methods comprise five traditional BIQA methods, namely a DCT-domain IQA method (DIIVINE) [47], a spatial-domain IQA method (BRISQUE) [6], a codebook-based IQA method (CORNIA) [48], a feature-enriched IQA method (ILNIQE) [7], and a statistics aggregation-based IQA method (HOSA) [49]. Furthermore, 11 deep learning-based BIQA methods are chosen, including five dual-stream networks: DIQaM-NR [50], TS-CNN [16], DBCNN [51], HyperIQA [52], and MMMNet [53]; two meta-learning networks: MetaIQA [54] and Lang et al. [55]; two networks that enhance sample diversity: CLRIQA [29] and CONTRIQUE [56]; and a large model-based network, CLIP-IQA [57]. All experimental results are derived from the original papers or replicated from source code.
Table I presents the overall SROCC and PLCC results across the five IQA datasets. Several observations can be made. 1) Traditional BIQA methods perform poorly. 2) Among the five dual-stream deep learning networks, the average assessment accuracy of DIQaM-NR and TS-CNN is similar: both perform well on the LIVE dataset but poorly on the other datasets. The DBCNN method achieves the best SROCC and PLCC values on the CSIQ dataset but performs poorly on the large KonIQ-10k dataset. The HyperIQA and MMMNet methods accurately predict image quality on real-world datasets, yet they still fall short of the best values on the CSIQ and TID2013 datasets. 3) Among the two meta-learning-based networks, MetaIQA performs better; it achieves the best SROCC value on the TID2013 dataset but underperforms on the other datasets. 4) Among the two networks focusing on sample diversity, CLRIQA achieves the best SROCC and PLCC values on the LIVE dataset, although its SROCC value is 7.5% lower than the best on the KonIQ-10k dataset. CONTRIQUE ranks third in average assessment accuracy and performs well on the CSIQ and TID2013 datasets. 5) For the large model-based network, the performance of CLIP-IQA falls below par on all datasets except KonIQ-10k. We attribute this mainly to the small sample sizes of IQA datasets, which are insufficient for training large models.
In summary, the 15 comparison methods mentioned above perform well only on some datasets, lacking universal applicability. To this end, we propose a cognitive framework from pixels to rich-nodes that mines highly coupled features from weakly cohesive pixels, thereby fostering information interaction across the scale and spatial dimensions. Experimental results show that our GraspIQA method performs exceptionally well: in ten comparisons, it attains two best values, four second-best values, and four third-best values, demonstrating its competitiveness on both synthetic and authentic datasets. Ultimately, based on the assessment results on the five datasets, GraspIQA exhibits the highest average SROCC and PLCC values, further validating its superiority.
C. Ablation Experiments
To evaluate the effectiveness of the design modules of our GraspIQA method, we conducted ablation experiments on the LIVE and TID2013 datasets, as shown in Table II.
In Table II, "Baseline" refers to the ResNet50 model with two fully connected layers. Adding individual modules improves performance, with the GIFI module showing the most significant gain: SROCC and PLCC increase by (1.0%, 0.6%) and (2.7%, 1.9%) on the LIVE and TID2013 datasets, respectively. This improvement results from the shift from pixels to rich-nodes, which enables better feature extraction and multi-dimensional interaction. We also examine the impact of the self-cognitive prior loss. The
D. Effectiveness of Frequency Prior
Figs. 6 and 7 depict schematic diagrams of feature mappings under different distortion conditions. The ResNet50 network is divided into four stages. Since shallow-level features contain more detailed information, we focus on analyzing the effectiveness of the first frequency prior (Stage
Fig. 6. Schematic of feature maps under the "JPEG2000" distortion type. In columns 2-3, rows 1 and 2 represent the 10th channel feature maps obtained by ResNet50 after Stage 1 and Stage 2, respectively.
Fig. 7. Schematic of feature maps under the "AWGN" distortion type. In columns 2-3, rows 1 and 2 represent the 10th channel feature maps obtained by ResNet50 after Stage 1 and Stage 2, respectively.
Experimental results show that as network depth increases, convolutional neural networks tend to focus more on global structures. However, this approach introduces some issues. As shown in Fig. 6 w/o
E. Cross-Dataset Experiments
An excellent quality assessment algorithm should not only exhibit outstanding performance on an individual dataset but also generalize accurately across different datasets. Therefore, to verify the robustness of the proposed method, we conducted extensive cross-dataset experiments on the synthetic datasets LIVE, CSIQ, and TID2013 and the authentic dataset CLIVE. We compared GraspIQA with three traditional methods: BRISQUE [6], CORNIA [48], and FRIQUEE [58], and five deep learning methods: WaDIQaM [50], CNNIQA [59], NIMA [60], DBCNN [51], and HyperIQA [52]. Notably, the cross-dataset experiments were rigorously designed to ensure training on one full dataset and subsequent testing on another full dataset.
Table III depicts the results of the cross-dataset experiments. We conducted two sets of generalization experiments for each synthetic dataset, while one set was carried out for the authentic dataset. Based on the experimental results, we arrive at the following conclusions. 1) The generalization performance of traditional BIQA methods is generally lower than that of deep learning methods. 2) All methods perform well in cross-dataset experiments from large to small datasets. However, performance is limited when weights from smaller datasets are used on larger ones. For example, in TID2013 (3000)
Conclusion
We propose a cognition-inspired BIQA approach that emulates the rich-club mechanism in the human brain. By reconstructing the topology of conventional neural networks, we form a shift from pixels to rich-nodes. Such learnable rich-nodes serve as vital connectors to mine the semantic interaction between multi-scale feature maps. Additionally, we devise a self-cognitive prior loss to assist the network in extracting consistent information from limited samples, thereby improving the generalization performance of our model. Extensive experimental results show that our GraspIQA method is competitive on the five benchmark datasets and exhibits excellent accuracy in cross-dataset experiments. Moving forward, we aim to further enhance the scalability of our method, particularly in terms of its generalization capability from small to large datasets.