Perception- and Fidelity-aware Reduced-Reference Super-Resolution Image Quality Assessment

Xinying Lin, Xuyang Liu, Hong Yang, Xiaohai He, Member, IEEE, and Honggang Chen, Member, IEEE

This work was supported in part by the National Natural Science Foundation of China under Grant 62001316, in part by Sichuan Science and Technology Program under Grant 24QYCX0399, in part by the Open Foundation of Yunnan Key Laboratory of Software Engineering under Grant 2023SE206, in part by the Research Fund of Guangxi Key Lab of Multi-source Information Mining & Security under Grant MIMS22-14, and in part by the Fundamental Research Funds for the Central Universities under Grant SCU2023D062 and under Grant 2022CDSN-15-SCU. (Corresponding author: Honggang Chen.)

Xinying Lin is with the College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China, and also with Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin 541004, China (e-mail: linxinying@stu.scu.edu.cn). Xuyang Liu, Hong Yang, and Xiaohai He are with the College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China (e-mail: liuxuyang@stu.scu.edu.cn; yhscu@scu.edu.cn; hxh@scu.edu.cn). Honggang Chen is with the College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China, and also with the Yunnan Key Laboratory of Software Engineering, Yunnan University, Kunming 650600, China (e-mail: honggang_chen@scu.edu.cn).

Abstract

With the advent of image super-resolution (SR) algorithms, how to evaluate the quality of generated SR images has become an urgent task. Although full-reference methods perform well in SR image quality assessment (SR-IQA), their reliance on high-resolution (HR) images limits their practical applicability. Leveraging available reconstruction information as much as possible for SR-IQA, such as low-resolution (LR) images and the scale factors, is a promising way to enhance assessment performance for SR-IQA without HR for reference. In this letter, we attempt to evaluate the perceptual quality and reconstruction fidelity of SR images considering LR images and scale factors. Specifically, we propose a novel dual-branch reduced-reference SR-IQA network, i.e., Perception- and Fidelity-aware SR-IQA (PFIQA). The perception-aware branch evaluates the perceptual quality of SR images by leveraging the merits of global modeling of Vision Transformer (ViT) and local relation of ResNet, and incorporating the scale factor to enable comprehensive visual perception. Meanwhile, the fidelity-aware branch assesses the reconstruction fidelity between LR and SR images through their visual perception. The combination of the two branches substantially aligns with the human visual system, enabling a comprehensive SR image evaluation. Experimental results indicate that our PFIQA outperforms current state-of-the-art models across three widely-used SR-IQA benchmarks. Notably, PFIQA excels in assessing the quality of real-world SR images.

Index Terms:
Super-Resolution Image Quality Assessment, Reduced-Reference, Perceptual Quality, Reconstruction Fidelity.

I Introduction

Image super-resolution (SR) technology aims to produce more detailed high-resolution (HR) images from the given low-resolution (LR) images. It has been widely used in various fields, such as security and surveillance[1], medical imaging[2], and remote sensing imaging[3, 4]. Given the remarkable advancements in recent research on SR algorithms, including blind SR[5, 6], lightweight SR models[7, 8], arbitrary scale SR[9], and multimodal SR[10, 11], it has become imperative to evaluate the quality of the generated SR images. This evaluation is crucial for facilitating comparative analysis of reconstruction performance across various SR models and guiding the development of SR algorithms.

Numerous methods have been proposed for image quality assessment (IQA) [12, 13, 14], which can be categorized as subjective or objective. While subjective ones are more reliable, they are impractical due to high costs and external factors. Hence, objective IQA methods that align with subjective evaluations are currently a focus of research. IQT[15], MANIQA[14], AHIQ[16], and MAMIQA[17] are recently proposed IQA methods that achieve excellent subjective consistency. While generic IQA methods have shown satisfactory results, they often neglect the specific characteristics of SR, making them unsuitable to directly apply to SR images. SR algorithms aim to recover detailed information from LR images, so SR-IQA needs to not only focus on the visual quality of the image, but also consider the consistency with the LR image, a factor that generic IQA methods overlook. At the same time, SR images often exhibit mixed degradations, such as blurring, ringing, and aliasing artifacts, which are not effectively addressed by current generic IQA methods.

Figure 1: Overview of proposed PFIQA method. This framework consists of Perception-aware and Fidelity-aware Assessment Branches for SR-IQA.

Recently, there has been an increase in the development of SR-IQA methods. Depending on the availability of lossless reference images, SR-IQA methods can be categorized into full-reference[18], reduced-reference[19, 11], and no-reference methods[20, 21]. For all three SR-IQA paradigms, evaluating the perceptual quality is of paramount importance, as it directly relates to human perceptual judgments of the given SR images[22, 18]. Moreover, reconstruction fidelity[23, 24] is also highly important since it reflects how faithfully the SR image represents the details and content of the reference image. Only reference-based SR-IQA methods can effectively consider reconstruction fidelity, because they compare the SR image against the reference image. However, in practical scenarios where HR images are unavailable, applying full-reference SR-IQA methods becomes challenging. Fortunately, the SR task inherently provides reference or auxiliary information, such as the paired LR image and the scale factor. These cues have significant reference value for evaluating SR images[18], but are not directly utilized by most SR-IQA methods[21, 20]. Recently, a few reduced-reference SR-IQA methods [19, 18, 25] have adopted CNN-based networks to extract feature maps of LR and SR images, computed the similarity between the feature maps, and used a multi-layer perceptron to regress the quality scores. The entire process can be seen as a pixel-by-pixel comparison between the LR and SR images, which neglects the spatial coherence between them and results in an insufficient emphasis on reconstruction fidelity.

To address these issues, we present a novel reduced-reference Perception- and Fidelity-aware SR-IQA (PFIQA), which integrates LR images and scale factors as prior knowledge to assist in SR-IQA. Specifically, PFIQA consists of two assessment branches: the Perception-aware Assessment Branch (PA Branch) for evaluating perceptual quality and the Fidelity-aware Assessment Branch (FA Branch) for assessing reconstruction fidelity. Given that each patch holds unique visual details, we design a patch scoring module for each branch and a patch weighting module that assigns a different degree of attention to each patch, thus achieving fine-grained patch-wise prediction for SR-IQA. The outputs of the two branches are summed to provide a comprehensive evaluation of the SR images. With this design, PFIQA can enhance the consistency between the network's assessment results and the human visual system (HVS).

Our main contributions can be summarized as follows:

  • We introduce PFIQA, a novel dual-branch reduced-reference SR-IQA network that comprehensively assesses the perceptual quality and reconstruction fidelity of SR images without requiring any reference HR images.

  • Our proposed PFIQA takes pairs of SR and LR images as input, leveraging the merits of global modeling of ViT and local relation of ResNet to enable comprehensive visual perception, and incorporates the scale factor to effectively align with HVS.

  • Extensive experiments on three widely-used SR-IQA benchmarks demonstrate that PFIQA shows superior performance compared with other SR-IQA methods.

II Methodology

As shown in Fig. 1, our proposed PFIQA is composed of two assessment branches, each undergoing three key phases: (1) feature extraction to acquire global and local visual features; (2) adaptive fusion of visual features and the scale factor to obtain fine-grained representations conducive to SR-IQA; (3) patch-weighted quality regression to produce a perception-aware score and a fidelity-aware score, which are summed to obtain the final quality score.

II-A Feature Extraction

The first phase of PFIQA is feature extraction, which consists of pre-trained Vision Transformer (ViT)[26] and ResNet [27]. Understanding the broader context of an image is crucial for the IQA task as it provides essential information about the overall structure and content. ViT excels in this aspect by primarily focusing on extracting global visual features. Its self-attention mechanism effectively captures long-range dependencies and encodes images into comprehensive global feature representations, which are valuable for SR-IQA. In addition to global features, paying attention to fine-grained details is equally important, especially in SR-IQA where humans tend to focus on intricate elements. Incorporating local visual information alongside global features can significantly enhance the accuracy and robustness of SR-IQA methods. Inspired by previous work [28, 29], we augment ViT with ResNet to better capture local visual features, thereby enriching the model’s ability to comprehensively represent images.

Specifically, a pair of LR and SR images is fed into ViT and ResNet, and we take the ViT feature maps from the early blocks and the ResNet feature maps at different scales. Because the feature extraction process is the same for SR and LR images, we describe it using the SR image as an example. For ViT, we use the outputs of five intermediate blocks $\{0,1,2,3,4\}$. The output feature of each block, $f_{sr}\in\mathbb{R}^{p\times p\times c}$ with $c=768$ and $p=28$, is concatenated into $f_{sr}^{G}\in\mathbb{R}^{p\times p\times 5c}$ and then reshaped into $f_{sr}^{G'}\in\mathbb{R}^{5c\times p\times p}$. For ResNet, we extract multi-scale features from four different stages and interpolate them to the same spatial dimensions. After that, we concatenate them along the channel dimension and reshape them to obtain $f_{sr}^{L'}\in\mathbb{R}^{3840\times p\times p}$. We apply a convolutional dimensionality reduction to get the final global features $f_{sr}^{G}\in\mathbb{R}^{256\times p\times p}$ and local features $f_{sr}^{L}\in\mathbb{R}^{256\times p\times p}$. Similarly, the paired LR image undergoes the same process, yielding global features $f_{lr}^{G}$ and local features $f_{lr}^{L}$.
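
The channel counts in this phase can be checked with simple arithmetic. The sketch below is only shape bookkeeping, not the authors' code: it assumes the standard ViT-B hidden width of 768 (the value of $c$ given above) and the standard ResNet50 stage output widths of 256, 512, 1024, and 2048. The constant and function names are hypothetical.

```python
# Hypothetical shape bookkeeping for the feature extraction phase.
VIT_BLOCKS = [0, 1, 2, 3, 4]                    # intermediate ViT blocks named in the text
VIT_WIDTH = 768                                 # c in the text (ViT-B hidden size)
RESNET50_STAGE_WIDTHS = [256, 512, 1024, 2048]  # assumed: standard ResNet50 stage outputs
P = 28                                          # spatial size p of each feature map
REDUCED_CHANNELS = 256                          # channels after the convolutional reduction


def concat_shape(channel_list, p):
    """Shape (C, p, p) after channel-wise concatenation of maps resized to p x p."""
    return (sum(channel_list), p, p)


# Five ViT blocks of width c give 5c channels.
vit_shape = concat_shape([VIT_WIDTH] * len(VIT_BLOCKS), P)
# Four ResNet50 stages, interpolated to p x p and concatenated.
resnet_shape = concat_shape(RESNET50_STAGE_WIDTHS, P)
# Both streams are reduced to the same channel count before fusion.
reduced_shape = (REDUCED_CHANNELS, P, P)

print(vit_shape)      # (3840, 28, 28)
print(resnet_shape)   # (3840, 28, 28)
print(reduced_shape)  # (256, 28, 28)
```

Under these assumptions both streams carry 3840 channels before reduction, which matches the $3840\times p\times p$ shape stated for the ResNet features and the $5c$ channels stated for the ViT features.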

For the PA branch, the target is to evaluate the perceptual quality of the SR image itself. Therefore, we use the extracted SR features $f_{sr}^{G}$ and $f_{sr}^{L}$ as its input. In contrast, the FA branch measures the fidelity between the SR image and its paired LR image, taking into account that LR and SR images differ in the feature space. After obtaining the global and local visual features extracted by ViT and ResNet, respectively, we compute $f_{diff}^{L}=f_{sr}^{L}-f_{lr}^{L}$ and $f_{diff}^{G}=f_{sr}^{G}-f_{lr}^{G}$ as the input to the FA branch. These difference features, computed separately for the global and local streams, capture the distinction between the SR image and its paired LR image.
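
The routing of features into the two branches can be sketched as follows. This is a minimal illustration on nested Python lists rather than tensors; the helper names `subtract` and `branch_inputs` are hypothetical and do not come from the paper.

```python
def subtract(a, b):
    """Element-wise a - b for two equally shaped nested lists of numbers."""
    if isinstance(a, list):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return [subtract(x, y) for x, y in zip(a, b)]
    return a - b


def branch_inputs(f_sr_g, f_sr_l, f_lr_g, f_lr_l):
    """Return the (global, local) feature pairs fed to the PA and FA branches."""
    pa_input = (f_sr_g, f_sr_l)                 # SR features only
    fa_input = (subtract(f_sr_g, f_lr_g),       # f_diff^G = f_sr^G - f_lr^G
                subtract(f_sr_l, f_lr_l))       # f_diff^L = f_sr^L - f_lr^L
    return pa_input, fa_input


# Toy 1 x 2 x 2 feature maps (channels x height x width).
f_sr_g = [[[1.0, 2.0], [3.0, 4.0]]]
f_lr_g = [[[0.5, 2.0], [1.0, 5.0]]]
f_sr_l = [[[2.0, 2.0], [2.0, 2.0]]]
f_lr_l = [[[1.0, 1.0], [1.0, 1.0]]]

pa_input, fa_input = branch_inputs(f_sr_g, f_sr_l, f_lr_g, f_lr_l)
print(fa_input[0])  # [[[0.5, 0.0], [2.0, -1.0]]]
```

The subtraction is only well defined because the LR and SR features share the same shape after extraction.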

II-B Adaptive Fusion

To effectively combine the extracted visual features for the two branches, as well as auxiliary information from the scale factor for SR-IQA, we employ Adaptive Fusion Modules (AFMs). For each branch, the global and local features along with the scale factor, are utilized as inputs to the AFM. Since the processing by AFM in two branches is similar, with the only difference lying in their respective inputs, we provide a detailed description of the AFM in PA branch as an example. As shown in Fig. 1 (right), we first adaptively fuse global and local visual features, and then incorporate the scale factor information.

Global and Local Visual Features. The AFM is capable of learning the importance of features from global and local features and adaptively assigning appropriate weights to them, thereby obtaining comprehensive visual features for SR-IQA.

For the PA branch, global features ($f_{sr}^{G}$) and local features ($f_{sr}^{L}$) are first fed into the AFM and then concatenated along a new dimension. Then, to effectively learn the weights of global and local features and fuse the two adaptively, we employ a fusion operation implemented by a fully-connected layer followed by a ReLU activation function, which yields the features $f_{sr}$. Similarly, for the FA branch, we fuse $f_{diff}^{G}$ and $f_{diff}^{L}$ to obtain $f_{diff}$.
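
One way to read this fusion step is sketched below. It assumes that the fully-connected layer acts on the new stacking dimension of size two and maps it to a single value, so that the fusion reduces to a learned weighted sum followed by ReLU at every position. This interpretation, the function name `fuse`, and the example weights are assumptions for illustration; the paper does not specify the layer dimensions.

```python
def fuse(global_feat, local_feat, w_global, w_local, bias=0.0):
    """ReLU(w_global * g + w_local * l + bias), applied element-wise.

    global_feat and local_feat are flat lists of equal length; in a real model
    the two weights and the bias would be learned parameters.
    """
    if len(global_feat) != len(local_feat):
        raise ValueError("global and local features must have the same length")
    return [max(0.0, w_global * g + w_local * l + bias)
            for g, l in zip(global_feat, local_feat)]


# Toy example with hypothetical weights that treat both streams equally.
fused = fuse([1.0, -2.0], [3.0, 1.0], w_global=0.5, w_local=0.5)
print(fused)  # [2.0, 0.0]
```

With unequal weights the same function favors one stream over the other, which is the behavior the adaptive fusion is meant to learn.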

Scale Factor. The scale factor has a statistically significant influence on the subjective quality scores of SR images [20], suggesting its potential as a valuable indicator for assessing the quality of SR images and guiding SR-IQA. Therefore, we input the vector representing the scale factor $S$ into the AFM to generate SR-IQA-related features. First, the scale factor is passed through two fully-connected layers and then reshaped to obtain the scale factor feature $f_{S}$.

Subsequently, for the PA branch, the scale factor feature $f_{S}$ is concatenated along the channel dimension with the features $f_{sr}$. The concatenated features then go through a $3\times 3$ convolution, a ReLU activation, and another $3\times 3$ convolution to obtain the perceptual features $f_{sr}^{S}$. Similarly, for the FA branch, the differential features $f_{diff}^{S}$ are obtained.

II-C Patch-Weighted Quality Regression

The final phase of PFIQA is patch-weighted quality regression, which comprises two patch scoring modules and a patch weighting module to generate the score $S_{PF}$ of the input SR image. Since each pixel in the deep feature map corresponds to a distinct patch of the input image and encapsulates abundant information, the spatial dimension's information is crucial. To capture the relationships between image patches, we predict scores for each pixel in the feature maps of the two branches and calculate the attention maps for each corresponding score. We obtain the scores for the two branches separately by performing a weighted sum of the individual scores. The weighted summation is employed to capture the significance of each region, simulating the behavior of the HVS. Finally, we add the scores from these two complementary branches to obtain the final predicted score.

Specifically, the patch scoring module employs perceptual features $f_{sr}^{S}$ and differential features $f_{diff}^{S}$ to evaluate the perception and fidelity of the SR image in a patch-wise manner. Taking the PA branch as an example, the visual features $f_{sr}^{S}$ pass through a $3\times 3$ convolution, followed by a ReLU activation, and another $3\times 3$ convolution to produce the perception-aware score map $s_{p}$. Similarly, for the FA branch, the process yields the fidelity-aware score map $s_{f}$.

In parallel, the perceptual features $f_{sr}^{S}$ and differential features $f_{diff}^{S}$ are concatenated and fed into a patch weighting module, which computes the perception-aware score weight map $w_{p}$ and the fidelity-aware score weight map $w_{f}$ for each image patch. This process can be represented as:

$$w=\mathrm{Sigmoid}(\mathrm{Conv1}(\mathrm{ReLU}(\mathrm{Conv3}(\mathrm{Concat}(f_{sr}^{S},f_{diff}^{S}))))), \tag{1}$$

where $\mathrm{Conv1}$ and $\mathrm{Conv3}$ represent $1\times 1$ and $3\times 3$ convolutions, respectively, and the score weight map $w$ is a two-channel tensor, with the first channel representing the perception-aware score weight map $w_{p}$ and the second channel representing the fidelity-aware score weight map $w_{f}$, i.e., $w=[w_{p},w_{f}]$.

Finally, we use the two score weight maps to perform a weighted summation over the perception-aware score map and the fidelity-aware score map, and then add the two results to obtain the final predicted quality score $S_{PF}$. This can be represented as:

$$S_{PF}=\frac{\sum (s_{p}*w_{p})}{\sum w_{p}}+\frac{\sum (s_{f}*w_{f})}{\sum w_{f}}, \tag{2}$$

where $*$ denotes the Hadamard product and each sum runs over all spatial positions of the corresponding map.
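
The regression in Eq. (2) can be sketched in a few lines, reading each term as a weight-normalized sum over all spatial positions of a score map. The function names and the toy $2\times 2$ maps below are hypothetical; in the model the score maps come from the patch scoring modules and the weights from the Sigmoid output in Eq. (1).

```python
def weighted_score(score_map, weight_map):
    """Weight-normalized score of one branch: sum(s * w) / sum(w)."""
    numerator = 0.0
    denominator = 0.0
    for score_row, weight_row in zip(score_map, weight_map):
        for s, w in zip(score_row, weight_row):
            numerator += s * w      # Hadamard product, accumulated over positions
            denominator += w
    if denominator == 0.0:
        raise ValueError("weights sum to zero")
    return numerator / denominator


def final_score(s_p, w_p, s_f, w_f):
    """S_PF: perception-aware term plus fidelity-aware term."""
    return weighted_score(s_p, w_p) + weighted_score(s_f, w_f)


# Toy maps: uniform perception weights, fidelity weights focused on one patch.
s_p = [[0.8, 0.4], [0.6, 0.2]]
w_p = [[0.5, 0.5], [0.5, 0.5]]
s_f = [[0.3, 0.1], [0.1, 0.1]]
w_f = [[0.9, 0.1], [0.1, 0.1]]

score = final_score(s_p, w_p, s_f, w_f)
print(round(score, 4))
```

In this toy case the perception term is the plain mean 0.5 and the fidelity term is 0.3 / 1.2 = 0.25, so the final score is 0.75 up to floating-point rounding. Normalizing by the sum of weights keeps each branch score on the scale of its score map regardless of how much total attention the weight map assigns.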

TABLE I: Performance comparison on three benchmark datasets. We highlight the best and the second results. Algorithms marked with asterisks (*) have been reproduced, while unmarked ones use results from the original paper.

| Type | Reference | Methods | QADS[30] PLCC | QADS[30] SRCC | WIND[31] PLCC | WIND[31] SRCC | RealSRQ[32] PLCC | RealSRQ[32] SRCC | Average PLCC | Average SRCC |
|---|---|---|---|---|---|---|---|---|---|---|
| Generic IQA | × | NIQE*[13] | 0.4055 | 0.3916 | 0.6286 | 0.5520 | 0.1452 | 0.0983 | 0.3931 | 0.3473 |
| | × | BRISQUE*[12] | 0.5568 | 0.5851 | 0.5397 | 0.5085 | 0.0515 | 0.0075 | 0.3827 | 0.3670 |
| | × | CNNIQA*[33] | 0.8791 | 0.8742 | 0.9328 | 0.8902 | 0.6711 | 0.6671 | 0.8277 | 0.8105 |
| | × | MANIQA*[14] | 0.9804 | 0.9771 | 0.9771 | 0.9321 | 0.8631 | 0.7812 | 0.9402 | 0.8634 |
| | HR | SIS [30] | 0.9137 | 0.9132 | 0.8913 | 0.8777 | - | - | - | - |
| SR-IQA | × | NRQM*[34] | 0.6867 | 0.6898 | 0.6869 | 0.6201 | 0.1451 | 0.0042 | 0.5062 | 0.4380 |
| | × | C2MT [21] | 0.9719 | 0.9690 | - | - | 0.7184 | 0.7043 | - | - |
| | × | SFSN [35] | 0.9174 | 0.9163 | 0.9525 | 0.9157 | - | - | - | - |
| | × | SGH*[20] | 0.9362 | 0.9475 | 0.9733 | 0.9563 | 0.6847 | 0.7069 | 0.8647 | 0.8702 |
| | LR | DISQ*[19] | 0.9127 | 0.9110 | 0.9785 | 0.9333 | 0.7885 | 0.7175 | 0.8932 | 0.8539 |
| | LR | PSCT[18] | 0.9620 | 0.9600 | 0.9596 | 0.9290 | - | - | - | - |
| | LR | PFIQA (Ours) | 0.9830 | 0.9815 | 0.9827 | 0.9637 | 0.9269 | 0.8597 | 0.9642 | 0.9350 |

III Experiments

III-A Experimental Settings

Datasets and Evaluation Metrics. The evaluations are conducted on three datasets commonly used in SR-IQA research: QADS[30], WIND[31], and RealSRQ[32]. We randomly split the datasets into training and test sets at an 8:2 ratio, repeat the partitioning and evaluation process 5 times for fair comparison, and report the average results as the final performance. We use two of the most common metrics: the Pearson Linear Correlation Coefficient (PLCC) and the Spearman Rank-order Correlation Coefficient (SRCC).
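
For readers who want to reproduce the evaluation protocol, the two metrics and the random 8:2 split can be sketched with the Python standard library alone. This is a minimal illustration, not the authors' evaluation code: the function names and the seed handling are assumptions, and PLCC is computed directly on the raw predictions with no nonlinear mapping applied first.

```python
import math
import random


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation (SRCC): Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def split_indices(n, train_ratio=0.8, seed=0):
    """Random train/test index split; one seed per repetition."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(n * train_ratio)
    return indices[:cut], indices[cut:]


# Five repetitions of the 8:2 split, as in the protocol described above.
splits = [split_indices(10, seed=s) for s in range(5)]
print([len(test) for _, test in splits])  # [2, 2, 2, 2, 2]
```

In a full run, the model would be trained and tested once per split, and the PLCC and SRCC values of the five test sets would be averaged.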

Implementation Details. We use ViT-B [26] and ResNet50 [27] models pre-trained on ImageNet [36], which are not updated during training. We bilinearly interpolate the LR image to the same resolution as the SR image. For model training, we normalize all input images and randomly crop them to a size of $224\times 224$. Additionally, a random horizontal flip is applied to augment the training data. To optimize the model, we use the MSE loss between the predicted score and the ground-truth score. During the test phase, we crop the four corners and the center of the original image, and the final score is the average of the scores of these five crops. We train PFIQA on a single NVIDIA GeForce RTX 3090 GPU using the AdamW optimizer with a minibatch size of 4.
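
The test-time procedure of scoring four corner crops and one center crop can be sketched as follows. The image is represented as a nested list of pixel values, and `score_fn` is a hypothetical stand-in for the trained model; the names and the toy scoring function are assumptions for illustration only.

```python
def five_crop_boxes(height, width, size=224):
    """(top, left, bottom, right) boxes for the four corners and the center."""
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top_c = (height - size) // 2
    left_c = (width - size) // 2
    return [
        (0, 0, size, size),                                # top-left
        (0, width - size, size, width),                    # top-right
        (height - size, 0, height, size),                  # bottom-left
        (height - size, width - size, height, width),      # bottom-right
        (top_c, left_c, top_c + size, left_c + size),      # center
    ]


def crop(image, box):
    """Cut a box out of an image stored as a list of rows."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def five_crop_score(image, score_fn, size=224):
    """Average of score_fn over the five crops of the image."""
    boxes = five_crop_boxes(len(image), len(image[0]), size)
    scores = [score_fn(crop(image, box)) for box in boxes]
    return sum(scores) / len(scores)


def mean_pixel(patch):
    """Toy scoring function: mean pixel value of a patch."""
    values = [v for row in patch for v in row]
    return sum(values) / len(values)


# Toy 4 x 4 image with 2 x 2 crops; the real setting uses 224 x 224 crops.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(five_crop_score(image, mean_pixel, size=2))  # 7.5
```

Averaging over fixed crops makes the test score deterministic, in contrast to the random crops used during training.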

III-B Experimental Results and Analysis

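TABLE II: Comparison of leveraging available information of SR algorithms on RealSRQ[32].

| # | Assessment Branch: Perception | Assessment Branch: Fidelity | Scale Factor | PLCC | SRCC |
|---|---|---|---|---|---|
| (a) | ✓ | - | - | 0.9007 | 0.8070 |
| (b) | - | ✓ | - | 0.9233 | 0.8564 |
| (c) | ✓ | ✓ | - | 0.9242 | 0.8573 |
| (d) | ✓ | ✓ | ✓ | 0.9269 | 0.8597 |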

Quantitative Comparison. The primary experimental results are reported in Table I, from which we can observe that: (1) On the whole, SR-IQA methods outperform most generic IQA methods because they are specifically designed for SR, including the model architecture and training data. (2) The proposed PFIQA model surpasses the compared SR-IQA methods on three benchmarks. This can be attributed to the fact that other SR-IQA methods do not consider the integration between perception and fidelity awareness when evaluating the quality of SR images. (3) When evaluated on the real-world SR-IQA dataset RealSRQ, PFIQA shows a substantial improvement over other methods. The PLCC and SRCC metrics show relative improvements of approximately 7.39% and 10.05% over the second-best MANIQA[14], which illustrates the superiority of PFIQA in real-world scenarios.
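
The two percentages quoted for RealSRQ are relative gains over MANIQA and can be reproduced from the Table I entries. The helper name `relative_gain` is hypothetical; the four input values are taken from the RealSRQ columns of Table I.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# RealSRQ results from Table I: PFIQA versus the second-best MANIQA.
plcc_gain = relative_gain(0.9269, 0.8631)
srcc_gain = relative_gain(0.8597, 0.7812)

print(f"PLCC: {plcc_gain:.2f}%  SRCC: {srcc_gain:.2f}%")  # PLCC: 7.39%  SRCC: 10.05%
```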

Effects of Reference Information. To further investigate the impact of reference information on SR-IQA, we compare the performance of PFIQA with different combinations of reference information. In Table II, we can see that: (1) Using only the PA Branch (Table II (a)) can be viewed as a no-reference SR-IQA method, focusing solely on the perceptual quality of the SR image itself, hence leading to sub-optimal performance. (2) Considering the LR image as reference and using only the FA Branch (Table II (b)) can be viewed as a plain reduced-reference SR-IQA method, which accounts for the consistency between the SR and LR images, thereby ensuring reconstruction fidelity to a certain extent. Compared to Table II (a), this brings a significant performance improvement. (3) Combining the PA Branch and FA Branch (Table II (c)) integrates both perceptual quality and reconstruction fidelity, thus obtaining better performance than Table II (a,b). (4) By considering the LR image as the reference and additionally taking into account the scale factor (Table II (d)), PFIQA makes full use of the available information from SR algorithms, thereby achieving the best SR-IQA performance.

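TABLE III: Comparison of different visual feature extraction and fusion strategies on RealSRQ[32].

| # | Feature (ResNet / ViT) | Fusion Method | PLCC | SRCC |
|---|---|---|---|---|
| (a) | single extractor | - | 0.9070 | 0.8302 |
| (b) | single extractor | - | 0.9010 | 0.8105 |
| (c) | ResNet + ViT | Concatenation | 0.9198 | 0.8460 |
| (d) | ResNet + ViT | Adaptive Fusion | 0.9269 | 0.8597 |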

Effects of Different Visual Feature Extraction Methods and Fusion Strategies. Since we use ViT and ResNet to extract global and local visual features from images, it is necessary to investigate how they impact the performance of SR-IQA. In Table III, we can find that: (1) Using the visual features extracted by either ViT or ResNet alone (Table III (a,b)) already achieves satisfactory SR-IQA results compared to other SOTA methods in Table I. (2) Utilizing both global and local visual features (Table III (c)) leads to better performance than Table III (a,b). This demonstrates the complementary nature of global and local features for comprehensive SR-IQA. (3) Replacing plain concatenation with the adaptive fusion of global and local features (Table III (d)) yields the best performance. This demonstrates the efficacy of the AFMs in learning the relative importance of global and local features.

IV Conclusion

In this letter, we propose a novel reduced-reference Perception- and Fidelity-aware SR-IQA (PFIQA) network, which integrates LR images and the scale factors as prior knowledge to comprehensively assess the perceptual quality and reconstruction fidelity of SR images. We leverage the merits of global modeling of ViT and local relation of CNN to enable comprehensive visual feature extraction, and incorporate the scale factors to obtain features that are more relevant to SR-IQA. The extensive results from three benchmark datasets showcase the efficacy of PFIQA.

References

  • [1] H. Wu, J. Chen, T. Wang, X. Lai, and J. Cao, “Ship license plate super-resolution in the wild,” IEEE Signal Processing Letters, vol. 30, pp. 394–398, 2023.
  • [2] M.-I. Georgescu, R. T. Ionescu, A.-I. Miron, O. Savencu, N.-C. Ristea, N. Verga, and F. S. Khan, “Multimodal multi-head convolutional attention with various kernel sizes for medical image super-resolution,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 2195–2205.
  • [3] Y. Xiao, Q. Yuan, K. Jiang, J. He, X. Jin, and L. Zhang, “EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–14, 2024.
  • [4] J. Shin, Y.-H. Jo, B.-K. Khim, and S. M. Kim, “U-net super-resolution model of goci to goci-ii image conversion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–12, 2024.
  • [5] R. Neshatavar, M. Yavartanoo, S. Son, and K. M. Lee, “ICF-SRSR: Invertible scale-conditional function for self-supervised real-world single image super-resolution,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1557–1567.
  • [6] H. Chen, L. Dong, H. Yang, X. He, and C. Zhu, “Unsupervised real-world image super-resolution via dual synthetic-to-realistic and realistic-to-synthetic translations,” IEEE Signal Processing Letters, vol. 29, pp. 1282–1286, 2022.
  • [7] J. Wu, Y. Wang, and X. Zhang, “Lightweight asymmetric convolutional distillation network for single image super-resolution,” IEEE Signal Processing Letters, vol. 30, pp. 733–737, 2023.
  • [8] Y. Wang and T. Zhang, “OSFFNet: Omni-stage feature fusion network for lightweight image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 6, 2024, pp. 5660–5668.
  • [9] Y. Zhao, Q. Teng, H. Chen, S. Zhang, X. He, Y. Li, and R. E. Sheriff, “Activating more information in arbitrary-scale image super-resolution,” IEEE Transactions on Multimedia, vol. 26, pp. 7946–7961, 2024.
  • [10] Y. Zhou, L. Gao, Z. Tang, and B. Wei, “Recognition-guided diffusion model for scene text image super-resolution,” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2024, pp. 2940–2944.
  • [11] C. Noguchi, S. Fukuda, and M. Yamanaka, “Scene text image super-resolution based on text-conditional diffusion models,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1485–1495.
  • [12] A. Mittal, A. K. Moorthy, and A. C. Bovik, “No-reference image quality assessment in the spatial domain,” IEEE Transactions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012.
  • [13] A. Mittal, R. Soundararajan, and A. C. Bovik, “Making a “completely blind” image quality analyzer,” IEEE Signal Processing Letters, vol. 20, no. 3, pp. 209–212, 2012.
  • [14] S. Yang, T. Wu, S. Shi, S. Lao, Y. Gong, M. Cao, J. Wang, and Y. Yang, “MANIQA: Multi-dimension attention network for no-reference image quality assessment,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2022, pp. 1190–1199.
  • [15] M. Cheon, S.-J. Yoon, B. Kang, and J. Lee, “Perceptual image quality assessment with transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 433–442.
  • [16] S. Lao, Y. Gong, S. Shi, S. Yang, T. Wu, J. Wang, W. Xia, and Y. Yang, “Attentions help cnns see better: Attention-based hybrid image quality assessment network,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1140–1149.
  • [17] L. Yu, J. Li, F. Pakdaman, M. Ling, and M. Gabbouj, “MAMIQA: No-reference image quality assessment based on multiscale attention mechanism with natural scene statistics,” IEEE Signal Processing Letters, vol. 30, pp. 588–592, 2023.
  • [18] K. Zhang, T. Zhao, W. Chen, Y. Niu, J. Hu, and W. Lin, “Perception-driven similarity-clarity tradeoff for image super-resolution quality assessment,” IEEE Transactions on Circuits and Systems for Video Technology, 2023.
  • [19] T. Zhao, Y. Lin, Y. Xu, W. Chen, and Z. Wang, “Learning-based quality assessment for image super-resolution,” IEEE Transactions on Multimedia, vol. 24, pp. 3570–3581, 2021.
  • [20] J. Fu, “Scale guided hypernetwork for blind super-resolution image quality assessment,” arXiv preprint arXiv:2306.02398, 2023.
  • [21] H. Li, K. Zhang, Z. Niu, and H. Shi, “C2MT: A credible and class-aware multi-task transformer for sr-iqa,” IEEE Signal Processing Letters, vol. 29, pp. 2662–2666, 2022.
  • [22] K. Zhang, T. Zhao, W. Chen, Y. Niu, and J. Hu, “SPQE: Structure-and-perception-based quality evaluation for image super-resolution,” arXiv preprint arXiv:2205.03584, 2022.
  • [23] X. Huang, W. Li, J. Hu, H. Chen, and Y. Wang, “RefSR-NeRF: Towards high fidelity and super resolution view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8244–8253.
  • [24] X. Luo, Y. Xie, Y. Qu, and Y. Fu, “SkipDiff: Adaptive skip diffusion model for high-fidelity perceptual image super-resolution,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 5, 2024, pp. 4017–4025.
  • [25] F. Zhou, W. Sheng, Z. Lu, B. Kang, M. Chen, and G. Qiu, “Super-resolution image visual quality assessment based on structure–texture features,” Signal Processing: Image Communication, vol. 117, p. 117025, 2023.
  • [26] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [28] A. Saha, S. Mishra, and A. C. Bovik, “Re-IQA: Unsupervised learning for image quality assessment in the wild,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5846–5855.
  • [29] C. Chen, J. Mo, J. Hou, H. Wu, L. Liao, W. Sun, Q. Yan, and W. Lin, “TOPIQ: A top-down approach from semantics to distortions for image quality assessment,” IEEE Transactions on Image Processing, 2024.
  • [30] F. Zhou, R. Yao, B. Liu, and G. Qiu, “Visual quality assessment for super-resolved images: Database and method,” IEEE Transactions on Image Processing, vol. 28, no. 7, pp. 3528–3541, 2019.
  • [31] H. Yeganeh, M. Rostami, and Z. Wang, “Objective quality assessment of interpolated natural images,” IEEE Transactions on Image Processing, vol. 24, no. 11, pp. 4651–4663, 2015.
  • [32] Q. Jiang, Z. Liu, K. Gu, F. Shao, X. Zhang, H. Liu, and W. Lin, “Single image super-resolution quality assessment: a real-world dataset, subjective studies, and an objective metric,” IEEE Transactions on Image Processing, vol. 31, pp. 2279–2294, 2022.
  • [33] L. Kang, P. Ye, Y. Li, and D. Doermann, “Convolutional neural networks for no-reference image quality assessment,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1733–1740.
  • [34] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, “Learning a no-reference quality metric for single-image super-resolution,” Computer Vision and Image Understanding, vol. 158, pp. 1–16, 2017.
  • [35] W. Zhou and Z. Wang, “Quality assessment of image super-resolution: Balancing deterministic and statistical fidelity,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 934–942.
  • [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015.