(Translated by https://www.hiragana.jp/)
𝑀³:Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask
11institutetext: College of Computer Science, Sichuan University, China 22institutetext: Department of Computer and Information Science, University of Macao, Macao SAR 33institutetext: Engineering Research Center of Machine Learning and Industry Intelligence, MOE, China

M3superscript𝑀3{M}^{3}italic_M start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT:Manipulation Mask Manufacturer for Arbitrary-Scale Super-Resolution Mask

Xinyu Yang 11    Xiaochen Ma 11    Xuekang Zhu 11    Bo Du 11    Lei Su 11    Bingkui Tong 11    Zeyu Lei 1122    Jizhe Zhou Corresponding author.1133
Abstract

In the field of image manipulation localization (IML), the small quantity and poor quality of existing datasets have always been major issues. A dataset containing various types of manipulations will greatly help improve the accuracy of IML models. Images on the internet (such as those on Baidu Tieba’s PS Bar) are manipulated using various techniques, and creating a dataset from these images will significantly enrich the types of manipulations in our data. However, images on the internet suffer from resolution and clarity issues, and the masks obtained by simply subtracting the manipulated image from the original contain various noises. These noises are difficult to remove, rendering the masks unusable for IML models. Inspired by the field of change detection, we treat the original and manipulated images as changes over time for the same image and view the data generation task as a change detection task. However, due to clarity issues between images, conventional change detection models perform poorly. Therefore, we introduced a super-resolution module and proposed the Manipulation Mask Manufacturer (MMM) framework. It enhances the resolution of both the original and tampered images, thereby improving image details for better comparison. Simultaneously, the framework converts the original and tampered images into feature embeddings and concatenates them, effectively modeling the context. Additionally, we created the Manipulation Mask Manufacturer Dataset (MMMD), a dataset that covers a wide range of manipulation techniques. We aim to contribute to the fields of image forensics and manipulation detection by providing more realistic manipulation data through MMM and MMMD. Detailed information about MMMD and the download link can be found at: the code and datasets will be made available.

Keywords:
image manipulation localization(IML) Dataset Generation Arbitrary-Scale Super-Resolution.

1 INTRODUCTION

Advances in digital image processing have made software like Photoshop and GIMP more powerful, facilitating widespread image manipulation. This proliferation of false information and manipulated images threatens public knowledge, trust, and safety. Thus, image manipulation localization has emerged, and in some literature, it is also referred to as "forgery detection" or "tamper detection." Its purpose is to discern whether an input image is manipulated or authentic and to depict the exact manipulated parts of an image through a mask. These parts are semantically different from the original content (the original image before manipulation). It does not include purely generated images (e.g., images generated from pure text) or the introduction of noise or other non-semantic changes through image processing techniques that do not alter the underlying meaning of the image. Standard tampered images and their masks are shown in Fig. 1.

Refer to caption
Figure 1: Tampered images and their corresponding masks for the image manipulation localization task.

However, image manipulation localization lacks large-scale, well-annotated datasets. Most existing datasets are manually created and annotated by researchers, with limited tampering types and techniques, and the volume of datasets is also restricted. This leads to models with poor generalization and robustness. Therefore, we thought of creating datasets by sourcing a large number of original and manipulated images from the internet. But we found that images from the internet suffer from compression and clarity issues, and simply subtracting the original and manipulated images results in noisy images, as shown in the third row of Fig. 2, that traditional methods struggle to clean up.

Change detection involves identifying differences at the same location over different times, which is similar to our task of detecting differences between the original and manipulated images. Inspired by this, we treat the original and manipulated images as changes over time for the same picture, viewing the dataset generation task as a change detection task. However, due to the clarity disparity[16] between the two images in our task, directly using change detection models is not ideal. Therefore, we introduced a super-resolution processing module to enhance the details of the two images before generating the mask. This is the main idea behind our proposed Manipulation Mask Manufacturer (MMM) framework. Some of the masks we generated and their corresponding original images and tampered images are shown in Fig.2.

Refer to caption
Figure 2: MMM framework generated result images. From highest to lowest, the sequence is as follows: original image, tampered image, image obtained by directly subtracting the two images and binarizing with a threshold of 30, and MMM predicted image.
Table 1: The primary datasets in the field of image manipulation localization. Most of these datasets suffer from issues such as limited quantity, single type of tampering, and inability to grow further, and they are all tampered with by the issuers of the datasets themselves.
Dataset Tampered Images Image Sources Dataset Growth Type
Self Others Scalable Non-scalable Traditional AI-manipulated
Columbia 180 \checkmark -- -- \checkmark \checkmark --
CASIAv1 920 \checkmark -- -- \checkmark \checkmark --
CASIAv2 5,063 \checkmark -- -- \checkmark \checkmark --
Coverage 100 \checkmark -- -- \checkmark \checkmark --
NIST16 564 \checkmark -- -- \checkmark \checkmark --
DEFACTO 149,587 \checkmark -- -- \checkmark \checkmark --
IMD20 2,010 \checkmark -- -- \checkmark \checkmark \checkmark
MMMD(Ours) 11,069 -- \checkmark \checkmark -- \checkmark \checkmark

The framework pipeline inputs the original and tampered images into a Fully Convolutional Network (FCN)[6] to extract high-level features. These features are aligned using Maximum Mean Discrepancies (MMD)[7][8] and concatenated. They are then processed through a Cross-scale Local Attention Block (CSLAB)[9] and a Local Frequency Encoding Block (LFEB)[9] to enhance resolution and detail. Finally, the features are split, decoded, subtracted, and a mask is obtained.

Our framework has achieved excellent annotation results on the IMD2020[14], NIST16[15], and CASIAv2[10] datasets. Additionally, we created the Manipulation Mask Manufacturer Dataset (MMMD), containing 11,069 original images, tampered images, and masks, with potential for continuous growth. The dataset includes various resolutions and manipulation types, such as copy, paste, move, transform, Deepfake, Image Inpainting, Image Morphing, Reconstruction, and Image Style Transfer. It features diverse images like cartoons, portraits, landscapes, interiors, food, and accessories. The main parameters of our dataset compared to existing datasets are shown in Table 1. We used MMMD to train and test TruFor[25], MVSS-Net[4], and IML-ViT[1]. Other 10 models pre-trained on CASIAv2[10] struggled to achieve high metrics on our dataset, while models trained on our dataset demonstrated better generalization. This highlights the limitations of existing datasets.

In summary, our contributions are as follows:

  • We propose a Manipulation Mask Manufacturer (MMM) framework that can accurately annotate the differences between original and tampered images even when there is a significant disparity in their clarity.

  • We generated a large and diverse Manipulation Mask Manufacturer Dataset (MMMD) to address the lack of datasets in the field of image manipulation detection.

  • Models pre-trained with our MMMD achieved higher F1 scores and demonstrated better generalization. Other pre-trained models struggled to perform well on our dataset, highlighting the limitations of existing tampering detection datasets. Our dataset better reflects real-world tampering scenarios.

2 RELATED WORK

2.1 Existing dataset generation methods

Current methods for generating image manipulation detection datasets include manual manipulation, which involves editing images by hand using tools like Photoshop, a time-consuming and expertise-demanding process[17]. Automatic manipulation employs software tools and scripts to rapidly produce large volumes of data, though these images may look unnatural or have obvious manipulation traces[13]. Image synthesis combines elements from different images to create new visual scenes, enhancing dataset diversity but requiring complex techniques and substantial processing time[12]. The improved Total Variation Denoising Method[5] automatically subtracts a tampered image from the original to obtain a noisy mask, which is then denoised, but this method often fails to convert many types of tampered images due to diverse noise and low generalization capability of traditional methods.

2.2 Existing change detection methods

In recent years, change detection methods based on deep learning have rapidly developed. Hao Chen et al. proposed the Bitemporal Image Transformer (BIT)[20], which converts input images into a small number of semantic tokens, uses a transformer encoder to model contextual information within the compact token space-time domain, and then projects the context-rich tokens back into the pixel space through a decoder to enhance the original features. The final change detection results are generated through feature difference images. ChangeFormer[27] uses a transformer-based Siamese network architecture for change detection from a pair of co-registered remote sensing images. This method combines a hierarchical transformer encoder and a multi-layer perceptron (MLP) decoder, effectively extracting multi-scale long-range feature differences. In SNUNet-CD[28], ChangeFormer extracts multi-scale features of bitemporal images through a hierarchical transformer encoder and generates change detection maps by fusing these feature differences using a lightweight MLP decoder. However, since the change detection considers the same image at different times with consistent clarity, it does not need to account for image degradation. Therefore, existing change detection models are not effective for our data generation tasks.

2.3 Existing tampered datasets

The development of datasets in the field of image manipulation detection has been relatively slow. Currently, widely recognized and used datasets are still those from four or five years ago, or even from over a decade ago.

Datasets for Traditional Tampering Techniques Almost all datasets include traditional tampering methods like splicing, copy-move, removal, and various image enhancements to produce “fake” or “forged” images. Columbia[11] uses cropping and splicing, embedding parts from other images into a single image. CASIAv1[10] employs Adobe Photoshop for cutting and pasting, including geometric transformations like scaling and rotation. CASIAv2[10] adds more post-processing and has a richer variety of images, divided into eight categories: scenes, animals, buildings, people, plants, objects, nature, and textures. COVERAGE[12] consists of real images taken with an iPhone 6 front camera, processed with Photoshop CS4 using methods like translation, scaling, rotation, free transformation, lighting changes, and combinations thereof. NIST[15] uses local pixel modification, compression, noise addition, blurring, and geometric transformations. DEFACTO[13], based on the MSCOCO[18] database, aims to produce semantically meaningful forged images, including splicing, copy-move, object removal, and warping.

Deep Learning-Based Tampering Datasets Modern image tampering techniques have achieved unprecedented realism through artificial intelligence and deep learning, particularly with Generative Adversarial Networks (GANs). These techniques include deepfakes, which can perform facial replacement, expression synthesis, and generate images of non-existent people, making image and video tampering very realistic[19]. Deep learning also excels in image restoration and enhancement by denoising, filling in missing parts, and improving resolution, thus making damaged images look new and low-resolution images appear clear and detailed. Tools and frameworks such as TensorFlow, PyTorch, Keras, and OpenCV have greatly simplified the implementation and application of these techniques. In IMD2020[14], GANs were used to generate tampered regions of images, and inpainting techniques were employed to fill in missing or damaged parts of images, making them appear natural and coherent. This also presents greater challenges for image manipulation detection. Current models are increasingly in need of diverse data that better reflects real-world scenarios.

3 PROPOSED METHOD

3.1 The Pipeline of the Entire Framework

The entire MMM framework is divided into three modules: feature extraction and concatenation, super-resolution processing, and feature separation mask generation. We obtain the original images and a large number of tampered images from the network, keeping only the images of the same size as our original data. After obtaining the original and tampered images, we input them into our Manipulation Mask Manufacturer (MMM) framework. The MMM framework extracts high-level features from the original and tampered images using a Fully Convolutional Network (FCN)[6]. These features are aligned with Maximum Mean Discrepancies (MMD)[7][8] and concatenated. The idea of concatenating high-level features is inspired by the Bitemporal Image Transformer (BIT)[20]. During super-resolution, the concatenated features are processed by the Cross-scale Local Attention Block (CSLAB)[9] and the Local Frequency Encoding Block (LFEB)[9] to enhance the resolution and detail representation of the images. The framework then separates these embeddings, uses decoders to generate residual images, and combines them with the original images. The final mask is produced by computing the absolute difference between the high-resolution features of the original and tampered images. The entire MMM structure is shown in Fig.3.

Refer to caption
Figure 3: The proposed MMM framework. The local sampling operation samples input embeddings based on a grid of coordinates.

3.2 Specific Processing Algorithm

Extraction and Concatenation of Image Features First, we use a Fully Convolutional Network (FCN)[6] to extract high-level features from the original image and the tampered image, respectively. Then, these features are input into encoder Eθしーた1subscript𝐸𝜃1E_{\theta 1}italic_E start_POSTSUBSCRIPT italic_θしーた 1 end_POSTSUBSCRIPT and encoder Eθしーた2subscript𝐸𝜃2E_{\theta 2}italic_E start_POSTSUBSCRIPT italic_θしーた 2 end_POSTSUBSCRIPT, resulting in the feature embeddings Z1H×W×Csubscript𝑍1superscript𝐻𝑊𝐶Z_{1}\in\mathbb{R}^{H\times W\times C}italic_Z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × italic_C end_POSTSUPERSCRIPT and Z2H×W×Csubscript𝑍2superscript𝐻𝑊𝐶Z_{2}\in\mathbb{R}^{H\times W\times C}italic_Z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × italic_C end_POSTSUPERSCRIPT. Z1subscript𝑍1Z_{1}italic_Z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and Z2subscript𝑍2Z_{2}italic_Z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are subjected to Maximum Mean Discrepancies (MMD)[7][8] calculation to eliminate the differences in data distribution, allowing the model to focus more on the differences in content. Simultaneously, Z1subscript𝑍1Z_{1}italic_Z start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and Z2subscript𝑍2Z_{2}italic_Z start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT are concatenated into Z3H×W×2Csubscript𝑍3superscript𝐻𝑊2𝐶Z_{3}\in\mathbb{R}^{H\times W\times 2C}italic_Z start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_H × italic_W × 2 italic_C end_POSTSUPERSCRIPT.

Arbitrary-Scale Super-Resolution Z3subscript𝑍3Z_{3}italic_Z start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT will be projected by four separate convolutional layers to obtain four latent embeddings, corresponding to query q𝑞qitalic_q, key k𝑘kitalic_k, value v𝑣vitalic_v, and frequency f𝑓fitalic_f. Since the sizes of the two images are the same, we use the coordinates and cell of the original image. The original image and the tampered image will generate 2D high-resolution coordinates based on an arbitrary upsampling scale 𝐫={rh,rw}𝐫subscript𝑟subscript𝑟𝑤\mathbf{r}=\left\{r_{h},r_{w}\right\}bold_r = { italic_r start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT } in 2D low-resolution coordinates. Next, the 2D coordinates, along with q𝑞qitalic_q, k𝑘kitalic_k, and v𝑣vitalic_v, will be input into the Cross-Scale Local Attention Block (CSLAB)[9] to estimate a local latent embedding Z3newGhGw×2Csubscript𝑍3𝑛𝑒𝑤superscriptsubscript𝐺subscript𝐺𝑤2𝐶Z_{3new}\in\mathbb{R}^{G_{h}G_{w}\times 2C}italic_Z start_POSTSUBSCRIPT 3 italic_n italic_e italic_w end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT × 2 italic_C end_POSTSUPERSCRIPT. f𝑓fitalic_f and the 2D coordinates will also be input into the Local Frequency Encoding Block (LFEB)[9] to estimate a local frequency embedding fnewGhGw×2Csubscript𝑓𝑛𝑒𝑤superscriptsubscript𝐺subscript𝐺𝑤2𝐶{f_{new}}\in\mathbb{R}^{G_{h}G_{w}\times 2C}italic_f start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT ∈ blackboard_R start_POSTSUPERSCRIPT italic_G start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT italic_G start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT × 2 italic_C end_POSTSUPERSCRIPT. Specifically, Ghsubscript𝐺G_{h}italic_G start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT and Gwsubscript𝐺𝑤G_{w}italic_G start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT represent the height and width of the local grids used for performing local coordinate sampling. CSLAB[9] and LFEB[9] estimate Z3newsubscript𝑍3𝑛𝑒𝑤Z_{3new}italic_Z start_POSTSUBSCRIPT 3 italic_n italic_e italic_w end_POSTSUBSCRIPT and fnewsubscript𝑓𝑛𝑒𝑤{f_{new}}italic_f start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT as follows:

Z3new=CSLAB(δでるたx,q,k,v)subscript𝑍3𝑛𝑒𝑤𝐶𝑆𝐿𝐴𝐵𝛿𝑥𝑞𝑘𝑣Z_{3new}=CSLAB(\delta x,q,k,v)italic_Z start_POSTSUBSCRIPT 3 italic_n italic_e italic_w end_POSTSUBSCRIPT = italic_C italic_S italic_L italic_A italic_B ( italic_δでるた italic_x , italic_q , italic_k , italic_v ) (1)
fnew=LFEB(δでるたx,f)subscript𝑓𝑛𝑒𝑤LFEB𝛿𝑥𝑓f_{new}=\operatorname{LFEB}(\delta x,f)italic_f start_POSTSUBSCRIPT italic_n italic_e italic_w end_POSTSUBSCRIPT = roman_LFEB ( italic_δでるた italic_x , italic_f ) (2)
δでるた𝐱={xqx(i,j)}i{1,2,,Gh},j{1,2,,Gw}𝛿𝐱subscriptsubscript𝑥𝑞superscript𝑥𝑖𝑗formulae-sequence𝑖12subscript𝐺𝑗12subscript𝐺𝑤\delta\mathbf{x}=\left\{x_{q}-x^{(i,j)}\right\}_{i\in\left\{1,2,\ldots,G_{h}% \right\},j\in\left\{1,2,\ldots,G_{w}\right\}}italic_δでるた bold_x = { italic_x start_POSTSUBSCRIPT italic_q end_POSTSUBSCRIPT - italic_x start_POSTSUPERSCRIPT ( italic_i , italic_j ) end_POSTSUPERSCRIPT } start_POSTSUBSCRIPT italic_i ∈ { 1 , 2 , … , italic_G start_POSTSUBSCRIPT italic_h end_POSTSUBSCRIPT } , italic_j ∈ { 1 , 2 , … , italic_G start_POSTSUBSCRIPT italic_w end_POSTSUBSCRIPT } end_POSTSUBSCRIPT (3)

Separate image features and generate a mask Z1newsubscript𝑍1𝑛𝑒𝑤Z_{1new}italic_Z start_POSTSUBSCRIPT 1 italic_n italic_e italic_w end_POSTSUBSCRIPT and Z2newsubscript𝑍2𝑛𝑒𝑤Z_{2new}italic_Z start_POSTSUBSCRIPT 2 italic_n italic_e italic_w end_POSTSUBSCRIPT are derived from splitting Z3newsubscript𝑍3𝑛𝑒𝑤Z_{3new}italic_Z start_POSTSUBSCRIPT 3 italic_n italic_e italic_w end_POSTSUBSCRIPT. Decoder Dϕ1Dsubscriptitalic-ϕ1\mathrm{D}\phi_{1}roman_D italic_ϕ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and Decoder Dϕ2Dsubscriptitalic-ϕ2\mathrm{D}\phi_{2}roman_D italic_ϕ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT respectively utilize these embeddings along with the provided cell to generate residual images. Then, the original and tampered images are upsampled and added element-wise to their corresponding residual images. This results in the high-resolution high-level features of the original and tampered images. We subtract these two high-resolution high-level features and take the absolute value to obtain the final mask.

Loss Function The loss function of the model is defined as follows:

L=1Hp×Wph=1Hw=1Wl(Phw,Mhw)𝐿1subscript𝐻𝑝subscript𝑊𝑝superscriptsubscript1𝐻superscriptsubscript𝑤1𝑊𝑙subscript𝑃𝑤subscript𝑀𝑤L=\frac{1}{H_{p}\times W_{p}}\sum_{h=1}^{H}\sum_{w=1}^{W}l\left(P_{hw},M_{hw}\right)italic_L = divide start_ARG 1 end_ARG start_ARG italic_H start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × italic_W start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT end_ARG ∑ start_POSTSUBSCRIPT italic_h = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_H end_POSTSUPERSCRIPT ∑ start_POSTSUBSCRIPT italic_w = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_W end_POSTSUPERSCRIPT italic_l ( italic_P start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT , italic_M start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT ) (4)

Here, Hp×Wpsubscript𝐻𝑝subscript𝑊𝑝H_{p}\times W_{p}italic_H start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT × italic_W start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT is the total number of pixels in the image, Phwsubscript𝑃𝑤P_{hw}italic_P start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT is the probability distribution of the model at (h,w)𝑤(h,w)( italic_h , italic_w ), and Mhwsubscript𝑀𝑤M_{hw}italic_M start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT is the mask label at position (h,w)𝑤(h,w)( italic_h , italic_w ). l(Phw,Mhw)𝑙subscript𝑃𝑤subscript𝑀𝑤l(P_{hw},M_{hw})italic_l ( italic_P start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT , italic_M start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT ) is the cross-entropy, which is calculated as:

l(Phw,Mhw)=cMhw(c)log(Phw(c))𝑙subscript𝑃𝑤subscript𝑀𝑤subscript𝑐superscriptsubscript𝑀𝑤𝑐superscriptsubscript𝑃𝑤𝑐l(P_{hw},M_{hw})=-\sum_{c}M_{hw}^{(c)}\log(P_{hw}^{(c)})italic_l ( italic_P start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT , italic_M start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT ) = - ∑ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT italic_M start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_c ) end_POSTSUPERSCRIPT roman_log ( italic_P start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_c ) end_POSTSUPERSCRIPT ) (5)

where c𝑐citalic_c indexes the classes, Mhw(c)superscriptsubscript𝑀𝑤𝑐M_{hw}^{(c)}italic_M start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_c ) end_POSTSUPERSCRIPT is a binary indicator (0 or 1) if class label c𝑐citalic_c is the correct classification for position (h,w)𝑤(h,w)( italic_h , italic_w ), and Phw(c)superscriptsubscript𝑃𝑤𝑐P_{hw}^{(c)}italic_P start_POSTSUBSCRIPT italic_h italic_w end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_c ) end_POSTSUPERSCRIPT is the predicted probability of class c𝑐citalic_c at position (h,w)𝑤(h,w)( italic_h , italic_w ). The cross-entropy measures the dissimilarity between the true label distribution and the predicted probability distribution, and it is used to optimize the model parameters by minimizing this dissimilarity.

Post-processing of masks Since the first image from the PS forum is assumed to be original and subsequent ones tampered, but this isn’t always true, it leads to noisy, mostly white masks. Special tampering techniques also cause noisy masks. Therefore, masks with more than 70% or less than 1% white area are deemed invalid and removed.

4 EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Accuracy of the Model on Existing Datasets

Our innovative use of deep learning for annotating image manipulation detection datasets has no existing comparable methods. Thus, we train and validate our model on the NIST16[15], IMD2020[14], and CASIAV2.0[10] datasets to demonstrate its effectiveness in distinguishing between original and tampered images. The experimental results are shown in Table 2. The model performs exceptionally well on all three datasets, with high levels across all metrics (F1, Precision, Recall, IoU, and Accuracy). It demonstrates strong generalization capability on image manipulation detection datasets, particularly on datasets involving traditional tampering methods. The performance is slightly lower on the IMD2020[14], which uses deep learning techniques such as GANs and inpainting, but overall, the model exhibits good adaptability and performance across various datasets.

Table 2: Performance of Our Model on Different Datasets.
Dataset F1 Precision Recall IoU Accuracy
IMD2020 0.88 0.90 0.87 0.81 0.97
NIST16 0.94 0.95 0.92 0.89 0.98
CASIAv2 0.95 0.96 0.95 0.91 0.98

Implementation Details Our models are implemented on PyTorch. Our training is set with a learning rate of 0.01 and a maximum of 100 epochs. The learning rate decay iterations are set to 100. Validation is performed after each training epoch, and the model that performs best on the validation set is used for evaluation on the test set.

Evaluation Metrics We use the F1-score to evaluate the performance of our model. It balances between tamper detection (recall) and avoiding false positives (precision), preventing the bias that comes from solely pursuing high recall or high precision. The formula for F1-score is as follows:

 F1-score =2×( Precision × Receall  Precision + Recall ) F1-score 2 Precision  Receall  Precision  Recall \text{ F1-score }=2\times\left(\frac{\text{ Precision }\times\text{ Receall }}% {\text{ Precision }+\text{ Recall }}\right)F1-score = 2 × ( divide start_ARG Precision × Receall end_ARG start_ARG Precision + Recall end_ARG ) (6)

Precision and Recall are expressed using the four metrics: True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). Additionally, we use IoU (Intersection over Union) to represent the model’s localization accuracy and Accuracy to evaluate the overall performance of the model. The formula of IoU is IoU=TP/(TP+FN+FP)IoUTPTPFNFP\mathrm{IoU}=\mathrm{TP}/(\mathrm{TP}+\mathrm{FN}+\mathrm{FP})roman_IoU = roman_TP / ( roman_TP + roman_FN + roman_FP ) and The formula of Accuracy is Accuracy=(TP+TN)/(TP+TN+FN+FP)AccuracyTPTNTPTNFNFP\mathrm{Accuracy}=(\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{TN}+\mathrm{% FN}+\mathrm{FP})roman_Accuracy = ( roman_TP + roman_TN ) / ( roman_TP + roman_TN + roman_FN + roman_FP ).

4.2 Effect of the Generated Dataset on Existing Models

We used our MMMD to train three models, TruFor[25], MVSS-Net[4], and IML-ViT[1], and validated their generalization. The results are shown in Table 3.

Table 3: The F1 scores of various models pre-trained on MMMD across different datasets. Left: Pre-trained with CASIAv2, Right: Pre-trained with MMMD (ours).
Model Dataset COVERAGE Columbia NIST16 IMD2020
TruFor(CASIAv2/MMMD) 0.42/0.28 0.86/0.78 0.32/0.26 0.32/0.31
MVSS-Net 0.26/0.34 0.39/0.49 0.25/0.28 0.28/0.32
IML-ViT 0.43/0.29 0.78/0.81 0.33/0.34 0.33/0.37

As shown in the table, MVSS-Net and IML-ViT trained using MMMD achieved higher metrics across various datasets compared to models pre-trained on CASIAv2. They exhibit higher generalization and robustness on MMMD. However, the TruFor model pre-trained using MMMD only achieved lower metrics.

We used MMMD to validate image manipulation detection models Mantra-Net[21], MVSS-Net[4], CAT-Net[23], ObjectFormer[2], NCL-IML[24], TruFor[25], IML-ViT[1] and PSCC-Net[26], pre-trained on CASIAv2[10] and discovered the shortcomings of existing datasets compared to our MMMD. The results are shown in Table 4.

Table 4: The F1 scores of various models pre-trained on CASIAv2 across different datasets.
Model Dataset COVERAGE Columbia CASIAv1 MMMD(Ours)
Mantra-Net 0.08 0.46 0.12 0.09
MVSS-Net 0.26 0.39 0.53 0.26
CAT-Net 0.30 0.58 0.58 0.30
ObjectFormer 0.29 0.34 0.43 0.32
NCL-IML 0.22 0.45 0.50 0.26
TruFor 0.42 0.86 0.72 0.30
IML-ViT 0.43 0.78 0.72 0.24
PSCC-Net 0.23 0.60 0.38 0.32

As shown in the table, various models pre-trained on CASIAv2[10] struggled to achieve high metrics on MMMD. This reflects that existing datasets have fewer types of tampering and image varieties compared to MMMD, making it difficult for current models to learn more comprehensively and broadly.

Evaluation Metrics We also use the F1-score to measure the accuracy of the model’s detection, with the calculation formula given in Equation 6.

5 CONCLUSION

In this paper, we creatively propose a Manipulation Mask Manufacturer (MMM) framework for generating image manipulation detection datasets. It addresses the issues of small dataset size, poor quality, and limited types of tampering detection in the field of image manipulation detection. It concatenates image feature embeddings, performs context modeling, and captures long-range relationships between pixels. It uses MMD to eliminate the differences in data distribution. Extensive experiments have validated the effectiveness of our method. We demonstrated the strong performance of the MMM framework on existing datasets. The MMMD dataset we proposed better aligns with the real-world tampering scenarios that manipulation detection models must face. This will help these models improve their generalization and robustness in practical applications. We also demonstrated that MMMD outperforms other datasets in training effectiveness.

References

  • [1] X. Ma, B. Du, X. Liu, A. Y. A. Hammadi, and J. Zhou, “Iml-vit: Image manipulation localization by vision transformer,” arXiv preprint arXiv:2307.14863, 2023.
  • [2] J. Wang, Z. Wu, J. Chen, X. Han, A. Shrivastava, S.-N. Lim, and Y.-G. Jiang, “Objectformer for image manipulation detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 2364–2373.
  • [3] J. Hao, Z. Zhang, S. Yang, D. Xie, and S. Pu, “Transforensics: image forgery localization with dense self-attention,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 055–15 064.
  • [4] C. Dong, X. Chen, R. Hu, J. Cao, and X. Li, “Mvss-net: Multi-view multi-scale supervised networks for image manipulation detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 3, pp. 3539–3553, 2022.
  • [5] X. Yang and J. Zhou, “Manipulation mask generator: High-quality image manipulation mask generation method based on modified total variation noise reduction,” in 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML).   IEEE, 2023, pp. 218–223.
  • [6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
  • [7] D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu, “Equivalence of distance-based and rkhs-based statistics in hypothesis testing,” The annals of statistics, pp. 2263–2291, 2013.
  • [8] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International conference on machine learning.   PMLR, 2015, pp. 97–105.
  • [9] H.-W. Chen, Y.-S. Xu, M.-F. Hong, Y.-M. Tsai, H.-K. Kuo, and C.-Y. Lee, “Cascaded local implicit transformer for arbitrary-scale super-resolution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18 257–18 267.
  • [10] J. Dong, W. Wang, and T. Tan, “Casia image tampering detection evaluation database,” in 2013 IEEE China summit and international conference on signal and information processing.   IEEE, 2013, pp. 422–426.
  • [11] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
  • [12] B. Wen, Y. Zhu, R. Subramanian, T.-T. Ng, X. Shen, and S. Winkler, “Coverage—a novel database for copy-move forgery detection,” in 2016 IEEE international conference on image processing (ICIP).   IEEE, 2016, pp. 161–165.
  • [13] G. Mahfoudi, B. Tajini, F. Retraint, F. Morain-Nicolier, J. L. Dugelay, and P. Marc, “Defacto: Image and face manipulation dataset,” in 2019 27Th european signal processing conference (EUSIPCO).   IEEE, 2019, pp. 1–5.
  • [14] A. Novozamsky, B. Mahdian, and S. Saic, “Imd2020: A large-scale annotated dataset tailored for detecting manipulated images,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, 2020, pp. 71–80.
  • [15] H. Guan, M. Kozak, E. Robertson, Y. Lee, A. N. Yates, A. Delgado, D. Zhou, T. Kheyrkhah, J. Smith, and J. Fiscus, “Mfc datasets: Large-scale benchmark datasets for media forensic challenge evaluation,” in 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW).   IEEE, 2019, pp. 63–72.
  • [16] Y. Tan, H. Zheng, Y. Zhu, X. Yuan, X. Lin, D. Brady, and L. Fang, “Crossnet++: Cross-scale large-parallax warping for reference-based super-resolution,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 12, pp. 4291–4305, 2020.
  • [17] R. Thakur and R. Rohilla, “Recent advances in digital image manipulation detection techniques: A brief review,” Forensic science international, vol. 312, p. 110311, 2020.
  • [18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13.   Springer, 2014, pp. 740–755.
  • [19] P. Zhuang, H. Li, S. Tan, B. Li, and J. Huang, “Image tampering localization using a dense fully convolutional network,” IEEE Transactions on Information Forensics and Security, vol. 16, pp. 2986–2999, 2021.
  • [20] H. Chen, Z. Qi, and Z. Shi, “Remote sensing image change detection with transformers,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2021.
  • [21] Y. Wu, W. AbdAlmageed, and P. Natarajan, “Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 9543–9552.
  • [22] X. Hu, Z. Zhang, Z. Jiang, S. Chaudhuri, Z. Yang, and R. Nevatia, “Span: Spatial pyramid attention network for image manipulation localization,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16.   Springer, 2020, pp. 312–328.
  • [23] M.-J. Kwon, S.-H. Nam, I.-J. Yu, H.-K. Lee, and C. Kim, “Learning jpeg compression artifacts for image manipulation detection and localization,” International Journal of Computer Vision, vol. 130, no. 8, pp. 1875–1895, 2022.
  • [24] J. Zhou, X. Ma, X. Du, A. Y. Alhammadi, and W. Feng, “Pre-training-free image manipulation localization through non-mutually exclusive contrastive learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 346–22 356.
  • [25] F. Guillaro, D. Cozzolino, A. Sud, N. Dufour, and L. Verdoliva, “Trufor: Leveraging all-round clues for trustworthy image forgery detection and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 606–20 615.
  • [26] X. Liu, Y. Liu, J. Chen, and X. Liu, “Pscc-net: Progressive spatio-channel correlation network for image manipulation detection and localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 11, pp. 7505–7517, 2022.
  • [27] W. G. C. Bandara and V. M. Patel, “A transformer-based siamese network for change detection,” in IGARSS 2022 - 2022 IEEE International Geoscience and Remote Sensing Symposium, 2022, pp. 207–210.
  • [28] S. Fang, K. Li, J. Shao, and Z. Li, “Snunet-cd: A densely connected siamese network for change detection of vhr images,” IEEE Geoscience and Remote Sensing Letters, vol. 19, pp. 1–5, 2022.