¹¹institutetext: College of Computer Science, Sichuan University, China ²²institutetext: Department of Computer and Information Science, University of Macao, Macao SAR ³³institutetext: Engineering Research Center of Machine Learning and Industry Intelligence, MOE, China

${M}^{3}$ :Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask

Xinyu Yang 11 Xiaochen Ma 11 Xuekang Zhu 11 Bo Du 11 Lei Su 11 Bingkui Tong 11 Zeyu Lei 1122 Jizhe Zhou Corresponding author.1133

Abstract

In the field of image manipulation localization (IML), the small quantity and poor quality of existing datasets have always been major issues. A dataset containing various types of manipulations will greatly help improve the accuracy of IML models. Images on the internet (such as those on Baidu Tieba’s PS Bar) are manipulated using various techniques, and creating a dataset from these images will significantly enrich the types of manipulations in our data. However, images on the internet suffer from resolution and clarity issues, and the masks obtained by simply subtracting the manipulated image from the original contain various noises. These noises are difficult to remove, rendering the masks unusable for IML models. Inspired by the field of change detection, we treat the original and manipulated images as changes over time for the same image and view the data generation task as a change detection task. However, due to clarity issues between images, conventional change detection models perform poorly. Therefore, we introduced a super-resolution module and proposed the Manipulation Mask Manufacturer (MMM) framework. It enhances the resolution of both the original and tampered images, thereby improving image details for better comparison. Simultaneously, the framework converts the original and tampered images into feature embeddings and concatenates them, effectively modeling the context. Additionally, we created the Manipulation Mask Manufacturer Dataset (MMMD), a dataset that covers a wide range of manipulation techniques. We aim to contribute to the fields of image forensics and manipulation detection by providing more realistic manipulation data through MMM and MMMD. Detailed information about MMMD and the download link can be found at: the code and datasets will be made available.

Keywords:

image manipulation localization(IML) Dataset Generation Arbitrary-Scale Super-Resolution.

1 INTRODUCTION

Advances in digital image processing have made software like Photoshop and GIMP more powerful, facilitating widespread image manipulation. This proliferation of false information and manipulated images threatens public knowledge, trust, and safety. Thus, image manipulation localization has emerged, and in some literature, it is also referred to as "forgery detection" or "tamper detection." Its purpose is to discern whether an input image is manipulated or authentic and to depict the exact manipulated parts of an image through a mask. These parts are semantically different from the original content (the original image before manipulation). It does not include purely generated images (e.g., images generated from pure text) or the introduction of noise or other non-semantic changes through image processing techniques that do not alter the underlying meaning of the image. Standard tampered images and their masks are shown in Fig. 1.

Refer to caption — Figure 1: Tampered images and their corresponding masks for the image manipulation localization task.

However, image manipulation localization lacks large-scale, well-annotated datasets. Most existing datasets are manually created and annotated by researchers, with limited tampering types and techniques, and the volume of datasets is also restricted. This leads to models with poor generalization and robustness. Therefore, we thought of creating datasets by sourcing a large number of original and manipulated images from the internet. But we found that images from the internet suffer from compression and clarity issues, and simply subtracting the original and manipulated images results in noisy images, as shown in the third row of Fig. 2, that traditional methods struggle to clean up.

Change detection involves identifying differences at the same location over different times, which is similar to our task of detecting differences between the original and manipulated images. Inspired by this, we treat the original and manipulated images as changes over time for the same picture, viewing the dataset generation task as a change detection task. However, due to the clarity disparity[16] between the two images in our task, directly using change detection models is not ideal. Therefore, we introduced a super-resolution processing module to enhance the details of the two images before generating the mask. This is the main idea behind our proposed Manipulation Mask Manufacturer (MMM) framework. Some of the masks we generated and their corresponding original images and tampered images are shown in Fig.2.

Table 1: The primary datasets in the field of image manipulation localization. Most of these datasets suffer from issues such as limited quantity, single type of tampering, and inability to grow further, and they are all tampered with by the issuers of the datasets themselves.

Dataset	Tampered Images	Image Sources		Dataset Growth		Type
Dataset	Tampered Images	Self	Others	Scalable	Non-scalable	Traditional	AI-manipulated
Columbia	180	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$-$
CASIAv1	920	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$-$
CASIAv2	5,063	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$-$
Coverage	100	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$-$
NIST16	564	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$-$
DEFACTO	149,587	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$-$
IMD20	2,010	$\checkmark$	$-$	$-$	$\checkmark$	$\checkmark$	$\checkmark$
MMMD(Ours)	11,069	$-$	$\checkmark$	$\checkmark$	$-$	$\checkmark$	$\checkmark$

The framework pipeline inputs the original and tampered images into a Fully Convolutional Network (FCN)[6] to extract high-level features. These features are aligned using Maximum Mean Discrepancies (MMD)[7][8] and concatenated. They are then processed through a Cross-scale Local Attention Block (CSLAB)[9] and a Local Frequency Encoding Block (LFEB)[9] to enhance resolution and detail. Finally, the features are split, decoded, subtracted, and a mask is obtained.

Our framework has achieved excellent annotation results on the IMD2020[14], NIST16[15], and CASIAv2[10] datasets. Additionally, we created the Manipulation Mask Manufacturer Dataset (MMMD), containing 11,069 original images, tampered images, and masks, with potential for continuous growth. The dataset includes various resolutions and manipulation types, such as copy, paste, move, transform, Deepfake, Image Inpainting, Image Morphing, Reconstruction, and Image Style Transfer. It features diverse images like cartoons, portraits, landscapes, interiors, food, and accessories. The main parameters of our dataset compared to existing datasets are shown in Table 1. We used MMMD to train and test TruFor[25], MVSS-Net[4], and IML-ViT[1]. Other 10 models pre-trained on CASIAv2[10] struggled to achieve high metrics on our dataset, while models trained on our dataset demonstrated better generalization. This highlights the limitations of existing datasets.

In summary, our contributions are as follows:

•

We propose a Manipulation Mask Manufacturer (MMM) framework that can accurately annotate the differences between original and tampered images even when there is a significant disparity in their clarity.
•

We generated a large and diverse Manipulation Mask Manufacturer Dataset (MMMD) to address the lack of datasets in the field of image manipulation detection.
•

Models pre-trained with our MMMD achieved higher F1 scores and demonstrated better generalization. Other pre-trained models struggled to perform well on our dataset, highlighting the limitations of existing tampering detection datasets. Our dataset better reflects real-world tampering scenarios.

2 RELATED WORK

2.1 Existing dataset generation methods

Current methods for generating image manipulation detection datasets include manual manipulation, which involves editing images by hand using tools like Photoshop, a time-consuming and expertise-demanding process[17]. Automatic manipulation employs software tools and scripts to rapidly produce large volumes of data, though these images may look unnatural or have obvious manipulation traces[13]. Image synthesis combines elements from different images to create new visual scenes, enhancing dataset diversity but requiring complex techniques and substantial processing time[12]. The improved Total Variation Denoising Method[5] automatically subtracts a tampered image from the original to obtain a noisy mask, which is then denoised, but this method often fails to convert many types of tampered images due to diverse noise and low generalization capability of traditional methods.

2.2 Existing change detection methods

In recent years, change detection methods based on deep learning have rapidly developed. Hao Chen et al. proposed the Bitemporal Image Transformer (BIT)[20], which converts input images into a small number of semantic tokens, uses a transformer encoder to model contextual information within the compact token space-time domain, and then projects the context-rich tokens back into the pixel space through a decoder to enhance the original features. The final change detection results are generated through feature difference images. ChangeFormer[27] uses a transformer-based Siamese network architecture for change detection from a pair of co-registered remote sensing images. This method combines a hierarchical transformer encoder and a multi-layer perceptron (MLP) decoder, effectively extracting multi-scale long-range feature differences. In SNUNet-CD[28], ChangeFormer extracts multi-scale features of bitemporal images through a hierarchical transformer encoder and generates change detection maps by fusing these feature differences using a lightweight MLP decoder. However, since the change detection considers the same image at different times with consistent clarity, it does not need to account for image degradation. Therefore, existing change detection models are not effective for our data generation tasks.

2.3 Existing tampered datasets

The development of datasets in the field of image manipulation detection has been relatively slow. Currently, widely recognized and used datasets are still those from four or five years ago, or even from over a decade ago.

Datasets for Traditional Tampering Techniques Almost all datasets include traditional tampering methods like splicing, copy-move, removal, and various image enhancements to produce “fake” or “forged” images. Columbia[11] uses cropping and splicing, embedding parts from other images into a single image. CASIAv1[10] employs Adobe Photoshop for cutting and pasting, including geometric transformations like scaling and rotation. CASIAv2[10] adds more post-processing and has a richer variety of images, divided into eight categories: scenes, animals, buildings, people, plants, objects, nature, and textures. COVERAGE[12] consists of real images taken with an iPhone 6 front camera, processed with Photoshop CS4 using methods like translation, scaling, rotation, free transformation, lighting changes, and combinations thereof. NIST[15] uses local pixel modification, compression, noise addition, blurring, and geometric transformations. DEFACTO[13], based on the MSCOCO[18] database, aims to produce semantically meaningful forged images, including splicing, copy-move, object removal, and warping.

Deep Learning-Based Tampering Datasets Modern image tampering techniques have achieved unprecedented realism through artificial intelligence and deep learning, particularly with Generative Adversarial Networks (GANs). These techniques include deepfakes, which can perform facial replacement, expression synthesis, and generate images of non-existent people, making image and video tampering very realistic[19]. Deep learning also excels in image restoration and enhancement by denoising, filling in missing parts, and improving resolution, thus making damaged images look new and low-resolution images appear clear and detailed. Tools and frameworks such as TensorFlow, PyTorch, Keras, and OpenCV have greatly simplified the implementation and application of these techniques. In IMD2020[14], GANs were used to generate tampered regions of images, and inpainting techniques were employed to fill in missing or damaged parts of images, making them appear natural and coherent. This also presents greater challenges for image manipulation detection. Current models are increasingly in need of diverse data that better reflects real-world scenarios.

3 PROPOSED METHOD

3.1 The Pipeline of the Entire Framework

The entire MMM framework is divided into three modules: feature extraction and concatenation, super-resolution processing, and feature separation mask generation. We obtain the original images and a large number of tampered images from the network, keeping only the images of the same size as our original data. After obtaining the original and tampered images, we input them into our Manipulation Mask Manufacturer (MMM) framework. The MMM framework extracts high-level features from the original and tampered images using a Fully Convolutional Network (FCN)[6]. These features are aligned with Maximum Mean Discrepancies (MMD)[7][8] and concatenated. The idea of concatenating high-level features is inspired by the Bitemporal Image Transformer (BIT)[20]. During super-resolution, the concatenated features are processed by the Cross-scale Local Attention Block (CSLAB)[9] and the Local Frequency Encoding Block (LFEB)[9] to enhance the resolution and detail representation of the images. The framework then separates these embeddings, uses decoders to generate residual images, and combines them with the original images. The final mask is produced by computing the absolute difference between the high-resolution features of the original and tampered images. The entire MMM structure is shown in Fig.3.

3.2 Specific Processing Algorithm

Extraction and Concatenation of Image Features First, we use a Fully Convolutional Network (FCN)[6] to extract high-level features from the original image and the tampered image, respectively. Then, these features are input into encoder $E_{\theta 1}$ θしーた 1 end_POSTSUBSCRIPT and encoder $E_{\theta 2}$ θしーた 2 end_POSTSUBSCRIPT, resulting in the feature embeddings $Z_{1}\in\mathbb{R}^{H\times W\times C}$ and $Z_{2}\in\mathbb{R}^{H\times W\times C}$ . $Z_{1}$ and $Z_{2}$ are subjected to Maximum Mean Discrepancies (MMD)[7][8] calculation to eliminate the differences in data distribution, allowing the model to focus more on the differences in content. Simultaneously, $Z_{1}$ and $Z_{2}$ are concatenated into $Z_{3}\in\mathbb{R}^{H\times W\times 2C}$ .

Arbitrary-Scale Super-Resolution $Z_{3}$ will be projected by four separate convolutional layers to obtain four latent embeddings, corresponding to query $q$ , key $k$ , value $v$ , and frequency $f$ . Since the sizes of the two images are the same, we use the coordinates and cell of the original image. The original image and the tampered image will generate 2D high-resolution coordinates based on an arbitrary upsampling scale $\mathbf{r}=\left\{r_{h},r_{w}\right\}$ in 2D low-resolution coordinates. Next, the 2D coordinates, along with $q$ , $k$ , and $v$ , will be input into the Cross-Scale Local Attention Block (CSLAB)[9] to estimate a local latent embedding $Z_{3new}\in\mathbb{R}^{G_{h}G_{w}\times 2C}$ . $f$ and the 2D coordinates will also be input into the Local Frequency Encoding Block (LFEB)[9] to estimate a local frequency embedding ${f_{new}}\in\mathbb{R}^{G_{h}G_{w}\times 2C}$ . Specifically, $G_{h}$ and $G_{w}$ represent the height and width of the local grids used for performing local coordinate sampling. CSLAB[9] and LFEB[9] estimate $Z_{3new}$ and ${f_{new}}$ as follows:

Z_{3new}=CSLAB(\delta x,q,k,v)

δでるた italic_x , italic_q , italic_k , italic_v )

(1)

f_{new}=\operatorname{LFEB}(\delta x,f)

δでるた italic_x , italic_f )

(2)

\delta\mathbf{x}=\left\{x_{q}-x^{(i,j)}\right\}_{i\in\left\{1,2,\ldots,G_{h}% \right\},j\in\left\{1,2,\ldots,G_{w}\right\}}

δでるた bold_x = { italic_x start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT - italic_x start_POSTSUPERSCRIPT ( italic_i , italic_j ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i ∈ { 1 , 2 , … , italic_G start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT } , italic_j ∈ { 1 , 2 , … , italic_G start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT } end_POSTSUBSCRIPT

(3)

Separate image features and generate a mask $Z_{1new}$ and $Z_{2new}$ are derived from splitting $Z_{3new}$ . Decoder $\mathrm{D}\phi_{1}$ and Decoder $\mathrm{D}\phi_{2}$ respectively utilize these embeddings along with the provided cell to generate residual images. Then, the original and tampered images are upsampled and added element-wise to their corresponding residual images. This results in the high-resolution high-level features of the original and tampered images. We subtract these two high-resolution high-level features and take the absolute value to obtain the final mask.

Loss Function The loss function of the model is defined as follows:

L=\frac{1}{H_{p}\times W_{p}}\sum_{h=1}^{H}\sum_{w=1}^{W}l\left(P_{hw},M_{hw}\right)

(4)

Here, $H_{p}\times W_{p}$ is the total number of pixels in the image, $P_{hw}$ is the probability distribution of the model at $(h,w)$ , and $M_{hw}$ is the mask label at position $(h,w)$ . $l(P_{hw},M_{hw})$ is the cross-entropy, which is calculated as:

l(P_{hw},M_{hw})=-\sum_{c}M_{hw}^{(c)}\log(P_{hw}^{(c)})

(5)

where $c$ indexes the classes, $M_{hw}^{(c)}$ is a binary indicator (0 or 1) if class label $c$ is the correct classification for position $(h,w)$ , and $P_{hw}^{(c)}$ is the predicted probability of class $c$ at position $(h,w)$ . The cross-entropy measures the dissimilarity between the true label distribution and the predicted probability distribution, and it is used to optimize the model parameters by minimizing this dissimilarity.

Post-processing of masks Since the first image from the PS forum is assumed to be original and subsequent ones tampered, but this isn’t always true, it leads to noisy, mostly white masks. Special tampering techniques also cause noisy masks. Therefore, masks with more than 70% or less than 1% white area are deemed invalid and removed.

4 EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Accuracy of the Model on Existing Datasets

Our innovative use of deep learning for annotating image manipulation detection datasets has no existing comparable methods. Thus, we train and validate our model on the NIST16[15], IMD2020[14], and CASIAV2.0[10] datasets to demonstrate its effectiveness in distinguishing between original and tampered images. The experimental results are shown in Table 2. The model performs exceptionally well on all three datasets, with high levels across all metrics (F1, Precision, Recall, IoU, and Accuracy). It demonstrates strong generalization capability on image manipulation detection datasets, particularly on datasets involving traditional tampering methods. The performance is slightly lower on the IMD2020[14], which uses deep learning techniques such as GANs and inpainting, but overall, the model exhibits good adaptability and performance across various datasets.

Table 2: Performance of Our Model on Different Datasets.

Dataset	F1	Precision	Recall	IoU	Accuracy
IMD2020	0.88	0.90	0.87	0.81	0.97
NIST16	0.94	0.95	0.92	0.89	0.98
CASIAv2	0.95	0.96	0.95	0.91	0.98

Implementation Details Our models are implemented on PyTorch. Our training is set with a learning rate of 0.01 and a maximum of 100 epochs. The learning rate decay iterations are set to 100. Validation is performed after each training epoch, and the model that performs best on the validation set is used for evaluation on the test set.

Evaluation Metrics We use the F1-score to evaluate the performance of our model. It balances between tamper detection (recall) and avoiding false positives (precision), preventing the bias that comes from solely pursuing high recall or high precision. The formula for F1-score is as follows:

\text{ F1-score }=2\times\left(\frac{\text{ Precision }\times\text{ Receall }}% {\text{ Precision }+\text{ Recall }}\right)

(6)

Precision and Recall are expressed using the four metrics: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). Additionally, we use IoU (Intersection over Union) to represent the model’s localization accuracy and Accuracy to evaluate the overall performance of the model. The formula of IoU is $\mathrm{IoU}=\mathrm{TP}/(\mathrm{TP}+\mathrm{FN}+\mathrm{FP})$ and The formula of Accuracy is $\mathrm{Accuracy}=(\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{% FN}+\mathrm{FP})$ .

4.2 Effect of the Generated Dataset on Existing Models

We used our MMMD to train three models, TruFor[25], MVSS-Net[4], and IML-ViT[1], and validated their generalization. The results are shown in Table 3.

Table 3: The F1 scores of various models pre-trained on MMMD across different datasets. Left: Pre-trained with CASIAv2, Right: Pre-trained with MMMD (ours).

	COVERAGE	Columbia	NIST16	IMD2020
TruFor(CASIAv2/MMMD)	0.42/0.28	0.86/0.78	0.32/0.26	0.32/0.31
MVSS-Net	0.26/0.34	0.39/0.49	0.25/0.28	0.28/0.32
IML-ViT	0.43/0.29	0.78/0.81	0.33/0.34	0.33/0.37

As shown in the table, MVSS-Net and IML-ViT trained using MMMD achieved higher metrics across various datasets compared to models pre-trained on CASIAv2. They exhibit higher generalization and robustness on MMMD. However, the TruFor model pre-trained using MMMD only achieved lower metrics.

We used MMMD to validate image manipulation detection models Mantra-Net[21], MVSS-Net[4], CAT-Net[23], ObjectFormer[2], NCL-IML[24], TruFor[25], IML-ViT[1] and PSCC-Net[26], pre-trained on CASIAv2[10] and discovered the shortcomings of existing datasets compared to our MMMD. The results are shown in Table 4.

Table 4: The F1 scores of various models pre-trained on CASIAv2 across different datasets.

	COVERAGE	Columbia	CASIAv1	MMMD(Ours)
Mantra-Net	0.08	0.46	0.12	0.09
MVSS-Net	0.26	0.39	0.53	0.26
CAT-Net	0.30	0.58	0.58	0.30
ObjectFormer	0.29	0.34	0.43	0.32
NCL-IML	0.22	0.45	0.50	0.26
TruFor	0.42	0.86	0.72	0.30
IML-ViT	0.43	0.78	0.72	0.24
PSCC-Net	0.23	0.60	0.38	0.32

As shown in the table, various models pre-trained on CASIAv2[10] struggled to achieve high metrics on MMMD. This reflects that existing datasets have fewer types of tampering and image varieties compared to MMMD, making it difficult for current models to learn more comprehensively and broadly.

Evaluation Metrics We also use the F1-score to measure the accuracy of the model’s detection, with the calculation formula given in Equation 6.

5 CONCLUSION

In this paper, we creatively propose a Manipulation Mask Manufacturer (MMM) framework for generating image manipulation detection datasets. It addresses the issues of small dataset size, poor quality, and limited types of tampering detection in the field of image manipulation detection. It concatenates image feature embeddings, performs context modeling, and captures long-range relationships between pixels. It uses MMD to eliminate the differences in data distribution. Extensive experiments have validated the effectiveness of our method. We demonstrated the strong performance of the MMM framework on existing datasets. The MMMD dataset we proposed better aligns with the real-world tampering scenarios that manipulation detection models must face. This will help these models improve their generalization and robustness in practical applications. We also demonstrated that MMMD outperforms other datasets in training effectiveness.

References

[1] X. Ma, B. Du, X. Liu, A. Y. A. Hammadi, and J. Zhou, “Iml-vit: Image manipulation localization by vision transformer,” arXiv preprint arXiv:2307.14863, 2023.
[2] J. Wang, Z. Wu, J. Chen, X. Han, A. Shrivastava, S.-N. Lim, and Y.-G. Jiang, “Objectformer for image manipulation detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2364–2373.
[3] J. Hao, Z. Zhang, S. Yang, D. Xie, and S. Pu, “Transforensics: image forgery localization with dense self-attention,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 055–15 064.
[4] C. Dong, X. Chen, R. Hu, J. Cao, and X. Li, “Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3539–3553, 2022.
[5] X. Yang and J. Zhou, “Manipulation mask generator: High-quality image manipulation mask generation method based on modified total variation noise reduction,” in 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML). IEEE, 2023, pp. 218–223.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
[7] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, “Equivalence of distance-based and rkhs-based statistics in hypothesis testing,” The annals of statistics, pp. 2263–2291, 2013.
[8] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning. PMLR, 2015, pp. 97–105.
[9] H.-W. Chen, Y.-S. Xu, M.-F. Hong, Y.-M. Tsai, H.-K. Kuo, and C.-Y. Lee, “Cascaded local implicit transformer for arbitrary-scale super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 257–18 267.
[10] J. Dong, W. Wang, and T. Tan, “Casia image tampering detection evaluation database,” in 2013 IEEE China summit and international conference on signal and information processing. IEEE, 2013, pp. 422–426.
[11] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
[12] B. Wen, Y. Zhu, R. Subramanian, T.-T. Ng, X. Shen, and S. Winkler, “Coverage—a novel database for copy-move forgery detection,” in 2016 IEEE international conference on image processing (ICIP). IEEE, 2016, pp. 161–165.
[13] G. Mahfoudi, B. Tajini, F. Retraint, F. Morain-Nicolier, J. L. Dugelay, and P. Marc, “Defacto: Image and face manipulation dataset,” in 2019 27Th european signal processing conference (EUSIPCO). IEEE, 2019, pp. 1–5.
[14] A. Novozamsky, B. Mahdian, and S. Saic, “Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, 2020, pp. 71–80.
[15] H. Guan, M. Kozak, E. Robertson, Y. Lee, A. N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, and J. Fiscus, “Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation,” in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW). IEEE, 2019, pp. 63–72.
[16] Y. Tan, H. Zheng, Y. Zhu, X. Yuan, X. Lin, D. Brady, and L. Fang, “Crossnet++: Cross-scale large-parallax warping for reference-based super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4291–4305, 2020.
[17] R. Thakur and R. Rohilla, “Recent advances in digital image manipulation detection techniques: A brief review,” Forensic science international, vol. 312, p. 110311, 2020.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer, 2014, pp. 740–755.
[19] P. Zhuang, H. Li, S. Tan, B. Li, and J. Huang, “Image tampering localization using a dense fully convolutional network,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2986–2999, 2021.
[20] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
[21] Y. Wu, W. AbdAlmageed, and P. Natarajan, “Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9543–9552.
[22] X. Hu, Z. Zhang, Z. Jiang, S. Chaudhuri, Z. Yang, and R. Nevatia, “Span: Spatial pyramid attention network for image manipulation localization,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer, 2020, pp. 312–328.
[23] M.-J. Kwon, S.-H. Nam, I.-J. Yu, H.-K. Lee, and C. Kim, “Learning jpeg compression artifacts for image manipulation detection and localization,” International Journal of Computer Vision, vol. 130, no. 8, pp. 1875–1895, 2022.
[24] J. Zhou, X. Ma, X. Du, A. Y. Alhammadi, and W. Feng, “Pre-training-free image manipulation localization through non-mutually exclusive contrastive learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 346–22 356.
[25] F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, “Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 606–20 615.
[26] X. Liu, Y. Liu, J. Chen, and X. Liu, “Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7505–7517, 2022.
[27] W. G. C. Bandara and V. M. Patel, “A transformer-based siamese network for change detection,” in IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, 2022, pp. 207–210.
[28] S. Fang, K. Li, J. Shao, and Z. Li, “Snunet-cd: A densely connected siamese network for change detection of vhr images,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.

M3superscript𝑀3{M}^{3}italic_M start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT:Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask