Towards Generalizable Tumor Synthesis

Qi Chen¹ Xiaoxi Chen² Haorui Song³ Zhiwei Xiong¹
Alan Yuille³ Chen Wei³,* Zongwei Zhou³,*
¹University of Science and Technology of China
²Shanghai Jiao Tong University
³Johns Hopkins University
Code and Visual Turing Test: https://github.com/MrGiovanni/DiffTumor
Correspondence to Chen Wei (weichen3012@gmail.com) and Zongwei Zhou (zzhou82@jh.edu)

Abstract

Tumor synthesis enables the creation of artificial tumors in medical images, facilitating the training of AI models for tumor detection and segmentation. However, success in tumor synthesis hinges on two requirements: the synthetic tumors must be visually realistic and generalizable across multiple organs, and the resulting AI models must be capable of detecting real tumors in images sourced from different domains (e.g., hospitals). This paper makes a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors ($<$ 2 cm) tend to have similar imaging characteristics in computed tomography (CT), whether they originate in the liver, pancreas, or kidneys. We have ascertained that generative AI models, e.g., Diffusion Models, can create realistic tumors that generalize to a range of organs even when trained on a limited number of tumor examples from only one organ. Moreover, we have shown that AI models trained on these synthetic tumors generalize to detecting and segmenting real tumors in CT volumes that span a broad spectrum of patient demographics, imaging protocols, and healthcare facilities.

1 Introduction

Figure 1: Generalizable tumor synthesis across organs. Early-stage tumors present similar imaging characteristics in computed tomography (CT), whether they are located in the liver, pancreas, or kidneys. Leveraging this observation, we develop a generative AI model on a few examples of annotated tumors in a specific organ, e.g., the liver (in purple). This AI model (in purple), trained exclusively on liver tumors, can directly create synthetic tumors in those organs where CT volumes of annotated tumors are relatively scarce, e.g., the pancreas (in cyan) and kidneys (in blue and green). By integrating synthetic tumors into extensive CT volumes of healthy organs, which are routinely collected in clinical settings, we can substantially augment the training set for tumor segmentation. This enhancement can also significantly improve the AI generalizability across CT volumes sourced from diverse hospitals and patient demographics.

Tumor synthesis enables the creation of artificial tumor examples in medical images [jordon2018pate, yoon2019time, chen2021synthetic]. It is particularly valuable when there is a dearth or complete absence of per-voxel annotated real tumors (e.g., early-stage tumors) for effective AI training. Typically, to train AI models for tumor detection in multiple ($N$) organs, annotated real tumor examples from each of these organs are necessary, ideally in substantial numbers [li2024well, kang2023label, liu2023clip, chou2024acquiring, zhu2022assembling]. Furthermore, AI models often fail to generalize across images from different hospitals, which differ in imaging protocols, patient demographics, and scanner manufacturers [orbes2019multi, zhou2022interpreting, zhou2021towards]. The challenge is amplified by the need for extensive manual annotations, a task that could demand up to 25 human years for annotating just one tumor type [xia2022felix, chen2023cancerunit, abi2023automatic]. Collecting and annotating a comprehensive dataset encompassing tumors from multiple organs ($N$) and images from numerous hospitals ($M$) is daunting, considering both annotation cost and complexity ($N\times M$). We hypothesize that tumor synthesis could address this challenge by creating various tumor types across non-tumor images from multiple hospitals, even when only one tumor type is available, thereby reducing the complexity from $N\times M$ to $1\times M$.

Success in tumor synthesis hinges on two requirements: the synthetic tumors must be visually realistic and generalizable across multiple organs, and the resulting AI models must generalize to detecting real tumors in images sourced from different hospitals. Previous studies have introduced generative models to create synthetic medical data (not limited to tumors) for tasks such as polyp detection from colonoscopy videos [shin2018abnormal], COVID-19 detection from chest X-rays [yao2021label, lyu2022pseudo, gao2023synthetic], and diabetic lesion detection from retinal images [wang2022anomaly]; refer to §5 for a comprehensive review. However, these studies have primarily focused on enhancing the detection and segmentation of specific lesions without fully exploring the wider generalizability of these models across different organs and patient demographics.

This paper makes a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors ($<$ 2 cm) tend to have similar imaging characteristics in computed tomography (CT).¹ Early-stage tumors typically present small, round, or oval shapes with minimal deformation and exhibit relatively simple and uniform textures in CT volumes [choi2014ct]. Hence, early tumors in parenchymal organs (e.g., liver, spleen, pancreas, adrenal glands, and kidneys) should appear similar, as shown in Figure 1. The major difference lies in the contrast between the tumors and the background organs or other anatomical structures rather than in the tumors themselves. Using four public datasets and our proprietary datasets, §2 verifies the similarity of early-stage tumors across various organs.

¹ Note that, owing to the public dataset constraints, we have only verified the similarity across early hepatocellular carcinoma and intrahepatic cholangiocarcinoma from the liver, pancreatic ductal adenocarcinoma from the pancreas, and renal cell carcinoma from the kidneys.

Leveraging this observation, we introduce a novel framework, termed DiffTumor, that learns the common imaging characteristics of tumors across various organs; the synthetic tumors it generates are useful for training AI models to detect and segment real tumors in CT volumes of varying patient demographics. The development of DiffTumor is composed of three stages. ① Training an Autoencoder Model, consisting of an encoder and a decoder, on 9,262 unlabeled three-dimensional CT volumes. The use of large, diverse datasets can enhance the model’s ability to generalize across CT volumes of different patient demographics and reduce the need for annotated tumor volumes when training Diffusion Models in the subsequent stages. The proxy task is image reconstruction, which helps the model learn comprehensive latent features. ② Training a Diffusion Model, a specific type of generative model, using latent features and tumor masks as conditions. Once trained, this model can generate the latent features necessary for reconstructing CT volumes with tumors based on arbitrary masks. ③ Training a Segmentation Model using synthetic tumors, which are reconstructed by the decoder, and their corresponding masks. With a large repository of healthy CT volumes, our DiffTumor framework can produce a vast array of synthetic tumors, varying in location, size, shape, texture, and intensity, thereby fostering high-performing AI models for tumor detection and segmentation.

The key contributions of this paper are two-fold. Firstly, we have verified with feature analysis, reader studies, and clinical knowledge that early-stage tumors ( $<$ 2cm) manifest with similar imaging characteristics across various organs in CT volumes, establishing the foundation for the development of generalizable tumor synthesis. Secondly, we have developed a three-stage tumor synthesis framework, DiffTumor, that trains generative models with minimal annotations (Figure 5; one annotated CT volume), creates synthetic tumors in real-time (Figure 6; 100 ms/tumor), and improves early-stage tumor detection (Figure 7; improved sensitivity up to +28.6%). In summary, compared with training AI on extensively annotated CT volumes of real tumors, our DiffTumor is generalizable from two critical perspectives.

1. DiffTumor can create visually realistic tumors generalizable to a range of organs even when the diffusion model was trained on a limited number of tumor examples from a specific organ (§4.2; +10.7% DSC).

2. DiffTumor can develop an AI model to detect and segment real tumors generalizable to a variety of CT volumes of varied patient demographics, imaging protocols, and healthcare facilities (§4.3; +9.1% DSC).

2 Preliminary

We observe that early-stage tumors ($<$ 2 cm)² often share similar imaging characteristics in CT volumes, whether they originate in the liver, pancreas, or kidneys. This observation, when confirmed, could have profound implications for generative AI in medical imaging. It suggests that generative AI might be trained on one tumor type, for which data and annotations are more easily obtained, and then applied to create various tumor types in different organs, where acquiring sufficient data can be challenging. Using synthetic tumors can substantially improve AI performance in tumor detection and segmentation in practice. In light of this, we have pursued validation of the observation through the following four approaches.

² Based on the TNM system, the most widely used staging system for classifying malignant tumors, we recognize a primary malignant tumor with a diameter of less than 2 cm and no evidence of nearby lymph node involvement or metastasis as an early-stage tumor [burke2004outcome, ficarra2007tnm, rindi2012tnm].

(1) Radiologist reader study. The objective of the reader study is to assess the ability of radiologists to recognize the organ of origin of early cancer. We uniformly crop 360 CT images of the tumor region from three abdominal organs, as per the annotations. To exclude the influence of surrounding organ textures on recognition, we retain only a small amount of organ texture in the tumor boundary region. Examples of CT crops used for the reader study are provided in Figure 2, and more examples are in Appendix A. Three expert radiologists, qualified under the Quality Standards Act, participate in the reader study. The recognition results are shown in Figure 2(a). The near-chance precision and recall scores indicate that the appearance of early-stage tumors is so similar that even experienced radiologists have difficulty distinguishing the organ types of these tumors.

(2) Radiomics feature analysis. We now analyze the similarity in Radiomics features³ of early-stage tumors. Quantitatively, we train three types of learning-based classifiers, namely support vector machine (SVM), Random Forest, and AdaBoost, to identify the organ types of early-stage tumors. To draw a general conclusion, we conducted ten repeated experiments and calculated the precision and recall scores of these classifiers on both the training and test sets. The final results for SVM show that the precision and recall on the training set are close to 1, indicating that the SVM is well-trained and capable of learning a decent decision boundary for the training set. However, the precision scores on the test set are nearly equivalent to random probability, as shown in Figure 2(a). Similarly, Random Forest achieves a precision of 50.3% and a recall of 54.9%, while AdaBoost achieves a precision of 35.4% and a recall of 49.5%. This suggests that even a well-trained classifier struggles to recognize the organ types of unseen early-stage tumors. Qualitatively, Figure 2(b) visualizes the feature mapping in a two-dimensional space using t-SNE. The appearance features of early-stage tumors are distributed in a joint feature space, and there is no separation between organ types.

³ We utilize the official Radiomics feature repository [van2017computational, wang2017comparison] to extract the appearance features, which include 3D shape-based features, gray level co-occurrence matrix, gray level run length matrix, gray level size zone matrix, neighboring gray-tone difference matrix, and gray level dependence matrix. More details can be found in Appendix B.
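As a reference point for reading the precision and recall scores above, the following minimal sketch computes macro-averaged precision and recall from a confusion matrix. The function name and both confusion matrices are hypothetical illustrations, not data from the study; with three balanced organ classes, a classifier that guesses uniformly at random sits at about 33% on both metrics.

```python
def macro_precision_recall(confusion):
    """Macro-averaged precision and recall.

    confusion[i][j] is the number of samples of true class i
    that were predicted as class j.
    """
    n = len(confusion)
    precisions, recalls = [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precisions.append(tp / predicted_k if predicted_k else 0.0)
        recalls.append(tp / actual_k if actual_k else 0.0)
    return sum(precisions) / n, sum(recalls) / n


# Hypothetical test set: 30 tumors per organ (liver, pancreas, kidney).
uniform_guessing = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
perfect = [[30, 0, 0], [0, 30, 0], [0, 0, 30]]

chance_p, chance_r = macro_precision_recall(uniform_guessing)
perfect_p, perfect_r = macro_precision_recall(perfect)
```

Scores far above one third would indicate that organ of origin is recoverable from the features; scores near it indicate the opposite.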

(3) Deep feature analysis. We investigate the similarity of early tumors across different organs using deep features extracted by ResNet and DenseNet. These two networks are trained to classify the types of organs affected by early-stage tumors. ResNet achieves a precision of 59.7% and a recall of 55.6%; DenseNet achieves a precision of 44.3% and a recall of 61.1%. Whether hand-crafted features or deep features are used, the results lead to a consistent observation: none of the classifiers can reliably distinguish early tumors among the three organs.

(4) Clinical evidence and justification. Tumorigenesis is a gradual, multi-step process involving cellular and histological changes that culminate in successively malignant lesions [choi2014ct]. Histologically, early-stage tumors often exhibit well-to-moderately differentiated neoplastic cells with mild atypia, limited hemorrhage, and necrosis [chu2017diagnosis, dunnick2016renal, ayuso2018diagnosis]. This cellular similarity leads to shared imaging features across various parenchymal organs (e.g., liver, pancreas, spleen, adrenal gland, kidney). Early-stage tumors typically appear as relatively homogeneous nodules with indistinct margins and small diameters in CT images [skarin2015atlas]. These consistent characteristics, observed across populations, ages, and genders, suggest that tumor synthesis models could learn and generalize shared imaging features across organs.

• Liver tumors: Hepatocellular carcinoma (HCC) in the early stage presents as a small, well-differentiated nodule with minimal metastatic potential [fowler2021pathologic]. Active neoangiogenesis leads to reduced portal triads and an isoattenuating or hypoattenuating appearance (wash-out) in the venous phase compared to surrounding parenchyma [ayuso2018diagnosis].

• Pancreatic tumors: Multiphase CT with intravenous contrast is preferred for diagnosing suspected pancreatic lesions [laeseke2015combining]. Most pancreatic ductal adenocarcinomas (PDACs) exhibit hypoenhancement relative to surrounding tissue due to their dense fibroblastic stroma [elbanna2020imaging].

• Kidney tumors: CT is the gold standard for evaluating renal cell carcinoma (RCC) [leveridge2010imaging]. Clear cell RCC, the most common subtype, typically presents as a small, hypoattenuating renal lesion with surrounding homogenous enhancement in the nephrographic phase [pmid29668296].

3 DiffTumor

3.1 Autoencoder Model

Diffusion models directly applied to three-dimensional CT volumes incur significant computational costs. To address this, Latent Diffusion Models (LDMs) [rombach2022high] operate within a compressed, lower-dimensional latent space. Following this approach, we construct our diffusion model within the latent space of 3D CT volumes. Our first step involves training a 3D autoencoder to learn meaningful, compressed latent representations. We adapt the Vector Quantized Generative Adversarial Networks (VQGAN) [esser2021taming] architecture, replacing 2D convolutions with their 3D counterparts.

Formally, we denote a CT sub-volume as $\bm{x}\in\mathbb{R}^{H\times W\times D}$, where $H$ denotes the height, $W$ the width, and $D$ the depth. The CT sub-volume $\bm{x}$ is first converted to latent features $\bm{z}$ by an encoder $f$ and a quantization operation $\mathbf{q}$, i.e., $\bm{z}=\mathbf{q}\left(f(\bm{x})\right)\in\mathbb{Z}^{h\times w\times d}$, where $h$ denotes the feature height, $w$ the feature width, and $d$ the feature depth. In the vector quantization step, the latent features $\bm{z}$ are quantized into $\bm{c}_{z}\in\mathbb{R}^{h\times w\times d\times c}$ by replacing each one with its closest codebook vector in the learned codebook $\mathcal{C}=\left\{\bm{c}_{i}\right\}_{i=1}^{K}$, where $K$ is the codebook size and $c$ is the dimension of each codebook vector. Finally, a decoder $g$ reconstructs the CT sub-volume from $\bm{c}_{z}$ as $\hat{\bm{x}}=g\left(\bm{c}_{z}\right)$. The loss is a summation of three terms:

$$\mathcal{L}_{\text{recon}}+\mathcal{L}_{\text{codebook}}+\alpha\mathcal{L}_{\text{commit}}, \tag{1}$$

where $\mathcal{L}_{\text{recon}}=\|\bm{x}-\hat{\bm{x}}\|_{1}$, $\mathcal{L}_{\text{codebook}}=\left\|\operatorname{sg}\left[f(\bm{x})\right]-\bm{c}_{z}\right\|_{2}^{2}$, and $\mathcal{L}_{\text{commit}}=\left\|\operatorname{sg}\left[\bm{c}_{z}\right]-f(\bm{x})\right\|_{2}^{2}$. $\operatorname{sg}[\cdot]$ denotes the stop-gradient operation and $\alpha$ is the weighting coefficient of the commitment term.
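The vector quantization step and the two codebook-related terms can be illustrated with a small pure-Python sketch. This is an illustration under stated assumptions rather than the DiffTumor implementation: the function names are hypothetical, the encoder output is flattened to a list of feature vectors, and $\alpha=0.25$ is an assumed placeholder value. Plain Python has no autodiff, so the stop-gradient is not modeled; the sketch only shows that the codebook and commitment terms share the same forward value and differ in which side would receive gradients.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(encoder_out, codebook):
    """Map each encoder vector to the index of its nearest codebook vector."""
    indices = [
        min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
        for v in encoder_out
    ]
    quantized = [codebook[k] for k in indices]
    return indices, quantized


def vq_loss_values(encoder_out, quantized, alpha=0.25):
    """Forward values of the codebook term and the alpha-weighted commitment term.

    Both terms use the same squared distance; in an autodiff framework they
    differ only in which argument the stop-gradient freezes.
    """
    dist = sum(sq_dist(v, c) for v, c in zip(encoder_out, quantized))
    return dist, alpha * dist


# Toy example: two 2-dimensional encoder outputs and a codebook with K = 2.
encoder_out = [[0.9, 0.1], [0.2, 0.8]]
codebook = [[1.0, 0.0], [0.0, 1.0]]
indices, quantized = quantize(encoder_out, codebook)
codebook_term, commit_term = vq_loss_values(encoder_out, quantized)
```

In this toy case the first vector snaps to codebook entry 0 and the second to entry 1.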

In addition to these three loss terms, we also adopt a perceptual loss and discriminators to improve the reconstruction quality. For 3D CT reconstruction, we adopt a 3D volume discriminator $D_{v}$ to penalize implausible artifacts in the 3D reconstruction $\hat{\bm{x}}$, and a 2D slice discriminator $D_{s}$ to encourage per-slice quality. To stabilize the GAN training, we add the feature matching loss $\mathcal{L}_{\text{match}}$. Moreover, because the CT volumes are preprocessed to be isotropic, we constrain the high-frequency texture in all three planes of $\hat{\bm{x}}$ by applying the perceptual loss to the projected reconstruction slices $\hat{\bm{x}}_{HW}\in\mathbb{R}^{H\times W}$, $\hat{\bm{x}}_{HD}\in\mathbb{R}^{H\times D}$, and $\hat{\bm{x}}_{WD}\in\mathbb{R}^{W\times D}$. The overall objective of the Autoencoder is:

$$\min_{f,g,\mathcal{C}}\left(\mathcal{L}_{\text{recon}}+\mathcal{L}_{\text{codebook}}+\alpha\mathcal{L}_{\text{commit}}+\mathcal{L}_{\text{match}}+\mathcal{L}_{\text{perceptual}}\right)+\min_{f,g,\mathcal{C}}\left(\max_{D_{s},D_{v}}\left(\mathcal{L}_{\text{disc}}\right)\right), \tag{2}$$

where

$$\mathcal{L}_{\text{disc}}=\log D_{s/v}(\bm{x})+\log\left(1-D_{s/v}(\hat{\bm{x}})\right), \tag{3}$$

$$\mathcal{L}_{\text{match}}=\sum_{i}\left\|D_{s/v}^{(i)}(\hat{\bm{x}})-D_{s/v}^{(i)}(\bm{x})\right\|_{1}. \tag{4}$$

$D_{s/v}^{(i)}$ denotes the $i$-th layer of the discriminators.
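The discriminator and feature matching terms can be evaluated numerically as in the following sketch. It assumes the discriminator output is a probability in (0, 1) and that the per-layer activations are already flattened into lists; the function names and the small `eps` guard against `log(0)` are illustrative additions, not part of the paper.

```python
import math


def disc_loss(d_real, d_fake, eps=1e-12):
    """log D(x) + log(1 - D(x_hat)) for scalar discriminator outputs in (0, 1)."""
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)


def feature_matching_loss(feats_fake, feats_real):
    """Sum over discriminator layers of the L1 distance between activations.

    feats_fake[i] and feats_real[i] are the flattened activations of layer i
    for the reconstruction and for the real sub-volume, respectively.
    """
    total = 0.0
    for layer_fake, layer_real in zip(feats_fake, feats_real):
        total += sum(abs(a - b) for a, b in zip(layer_fake, layer_real))
    return total


# An undecided discriminator (0.5 for both inputs) gives 2 * log(0.5).
undecided = disc_loss(0.5, 0.5)

# Two hypothetical layers: |1-0| + |2-2| + |3-5| = 3.
match = feature_matching_loss([[1.0, 2.0], [3.0]], [[0.0, 2.0], [5.0]])
```

The discriminators are trained to increase the first quantity, while the autoencoder is trained to decrease it together with the feature matching term.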

3.2 Diffusion Model

We aim to synthesize realistic and diverse CT volumes with tumors to facilitate the training of the tumor segmentation model. Given the fact that healthy CT volumes are much more accessible than CT volumes with tumors, we focus only on tumor synthesis, and we do not intend to model organ textures outside of the tumors, which can be easily obtained from healthy CT volumes. To be specific, our diffusion model is conditioned on both a tumor mask that indicates the shape and location of tumors in the latent feature and the healthy region of CT volumes.

Formally, given a tumor-present CT volume $\bm{x}_{0}$ with latent feature $\bm{z}_{0}$ and the mask of its tumor region $\bm{m}$, the diffusion model is conditioned on both the tumor mask $\bm{m}$ and the healthy region $\bm{z}_{0}^{\textrm{healthy}}\coloneqq(1-\bm{m})\odot\bm{z}_{0}$. The diffusion model approximates the distribution of the latent features of tumor-present CT volumes. In the forward process, the latent feature $\bm{z}_{0}$ is gradually converted to white Gaussian noise $\bm{z}_{T}\sim\mathcal{N}(0,1)$ by recursively adding a small amount of Gaussian noise $T$ times following the Markov process below:

$$p\left(\bm{z}_{t}\mid\bm{z}_{t-1}\right)=\mathcal{N}\left(\bm{z}_{t};\sqrt{1-\beta_{t}}\bm{z}_{t-1},\beta_{t}\mathbf{I}\right), \tag{5}$$

where $t\in\{1,2,\ldots,T\}$ denotes the timestep and $\beta_{1:T}$ is the variance schedule of noise.
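The conditioning signal and the forward process can be sketched in a few lines. The sketch relies on the standard closed form for this Markov chain, $\bm{z}_{t}=\sqrt{\bar{\alpha}_{t}}\,\bm{z}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$ with $\bar{\alpha}_{t}=\prod_{s\leq t}(1-\beta_{s})$ [ho2020denoising]. The linear schedule and its endpoints, the flattened list representation of the latent, and all function names are assumptions made for illustration.

```python
import math
import random


def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear variance schedule beta_1..beta_T (illustrative endpoints)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def healthy_region(z0, mask):
    """z0_healthy = (1 - m) * z0, element-wise on a flattened latent."""
    return [(1 - m) * z for z, m in zip(z0, mask)]


def q_sample(z0, t, abar, rng):
    """Draw z_t from z_0 in one shot; t is 1-indexed."""
    a = abar[t - 1]
    return [math.sqrt(a) * z + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for z in z0]


rng = random.Random(0)
z0 = [1.0, 2.0, 3.0]   # toy flattened latent
mask = [0, 1, 0]       # 1 marks the tumor region
z0_healthy = healthy_region(z0, mask)
abar = alpha_bars(linear_betas(1000))
z_noisy = q_sample(z0, 500, abar, rng)
```

Because $\bar{\alpha}_{t}$ shrinks toward zero as $t$ grows, $\bm{z}_{T}$ carries almost no information about $\bm{z}_{0}$, which is why sampling can start from pure noise.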

At inference, we synthesize the latent feature of CT volumes by sampling from $p\left(\bm{z}_{0}\mid\bm{z}_{0}^{\textrm{healthy}},\bm{m}\right)$, which is approximated by recursively sampling from $p\left(\bm{z}_{t-1}\mid\bm{z}_{t},\bm{z}_{0}^{\textrm{healthy}},\bm{m}\right)$. The training objective of our diffusion model [ho2020denoising] is as follows:

$$\mathbb{E}_{\bm{z}_{0},\epsilon\sim\mathcal{N}(0,1),t}\left[\left\|\epsilon-\epsilon_{\theta}\left(\bm{z}_{t},\bm{z}_{0}^{\textrm{healthy}},\bm{m},t\right)\right\|_{2}^{2}\right], \tag{6}$$

where $\epsilon_{\theta}(\cdot,t)$ is a 3D U-Net with interleaved self-attention layers and convolutional layers [ho2020denoising, nichol2021improved] that predicts the noise given the input. To reduce the heavy computational cost for 3D CT volumes, we factorize the self-attention over the entire 3D data so that it first attends within each 2D slice and then attends along the depth dimension, inspired by 3D video Transformers [bertasius2021space, arnab2021vivit]. This design substantially reduces the computational cost of the self-attention layers in the 3D U-Net.
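A rough way to see the saving from factorized attention is to count pairwise attention scores. Full 3D attention over an $h\times w\times d$ feature map compares every position with every other one, giving $(hwd)^{2}$ scores; the factorized variant computes $d\cdot(hw)^{2}$ scores within slices plus $hw\cdot d^{2}$ scores along depth. The sketch below uses a hypothetical $24\times24\times24$ latent, which is not a size stated in the paper.

```python
def full_attention_pairs(h, w, d):
    """Number of query-key pairs for joint attention over all h*w*d positions."""
    n = h * w * d
    return n * n


def factorized_attention_pairs(h, w, d):
    """Pairs for in-slice attention followed by attention along depth."""
    in_slice = d * (h * w) ** 2       # each of d slices attends over h*w positions
    along_depth = (h * w) * d ** 2    # each in-plane position attends over d slices
    return in_slice + along_depth


h = w = d = 24  # hypothetical latent size, for illustration only
full = full_attention_pairs(h, w, d)              # 191,102,976
factorized = factorized_attention_pairs(h, w, d)  # 8,294,400
ratio = full / factorized                         # about 23x fewer pairs
```

The ratio grows with the volume size, so the saving is larger for bigger latents.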

3.3 Segmentation Model

We construct a large-scale database of healthy CT volumes as a basis for our tumor synthesis method. This database includes 1,246 volumes with healthy livers, 1,901 with healthy pancreases, and 1,005 with healthy kidneys, ensuring diversity across ages, genders, nationalities, and acquisition protocols. Following Hu et al. [hu2023label], we generate realistic tumor-like shapes using ellipsoids and refine them with expert radiologist feedback for clinical plausibility (implementation details in Appendix E). By combining these generated tumor masks with the healthy CT volumes (Figure 3–Stage 3), we synthesize tumors across various domains, promoting generalizability in our model.
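To make the ellipsoid-based mask generation concrete, here is a minimal sketch that rasterizes a single axis-aligned ellipsoid into a binary 3D mask. It covers only the basic shape: the refinement steps described in Hu et al. [hu2023label] and Appendix E are not reproduced, and the function name, grid size, center, and radii are hypothetical.

```python
def ellipsoid_mask(shape, center, radii):
    """Binary mask[x][y][z] that is 1 inside an axis-aligned ellipsoid.

    shape  : (H, W, D) grid size in voxels
    center : (cx, cy, cz) ellipsoid center in voxel coordinates
    radii  : (rx, ry, rz) semi-axes in voxels
    """
    H, W, D = shape
    cx, cy, cz = center
    rx, ry, rz = radii
    return [
        [
            [
                1 if ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 + ((z - cz) / rz) ** 2 <= 1.0 else 0
                for z in range(D)
            ]
            for y in range(W)
        ]
        for x in range(H)
    ]


# Hypothetical small tumor: semi-axes of 4, 3, and 2 voxels in a 16^3 grid.
mask = ellipsoid_mask((16, 16, 16), (8, 8, 8), (4, 3, 2))
n_voxels = sum(v for plane in mask for row in plane for v in row)
```

Sampling the center inside a healthy organ and drawing the radii from a size range for early-stage tumors would yield masks that vary in location, size, and shape.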

4 Experiments & Results

Real-tumor datasets. LiTS [bilic2019liver], MSD-Pancreas [antonelli2021medical], and KiTS [heller2020international] were used for training and testing Segmentation Models on the liver, pancreas, and kidneys, respectively. We performed 5-fold cross-validation on 118 tumor CT volumes for LiTS and 120 tumor CT volumes for MSD-Pancreas and KiTS.

Healthy CT datasets. We collect a large repository of healthy CT volumes. Due to the computational cost and memory limitation for training, we only randomly selected 120 healthy CT volumes for the kidney and pancreas, respectively. For liver, we adopt the same healthy CT volumes as Hu et al. [hu2023label]. More details about dataset and implementation can be found in Appendix E.

4.1 Visual Turing Test

We conduct the Visual Turing Test on 240 CT volumes for each of the three organs, where 120 volumes contain real tumors and the remaining 120 contain tumors synthesized by our method. Four radiologists, with experience levels ranging from junior to senior, are involved in this test. In total, the Visual Turing Test took 144 hours (2,880 CT readings). Following Hu et al. [hu2023label], each sample is inspected in a 3D view, which allows a continuous slice sequence to be observed, and is classified as either real or synthetic. The testing results are shown in Table 1. All radiologists identify real tumors with a high sensitivity (above 95%), which indicates their familiarity with the characteristics of real tumors. However, the low specificity scores of radiologists R1 and R3 on the three types of tumors (22.5% to 40.8%) suggest that the synthetic data strongly resembles real tumors, leading to most synthetic tumors being misidentified as real ones. The specificity scores of R2 and R4, who have more experience, are higher (38.8% to 55.0%) and lie around 50%, which means that roughly half of the synthetic samples are still incorrectly identified as real tumors. These results support the efficacy of DiffTumor in generating visually realistic tumors.

		R1	R2	R3	R4
liver	sensitivity (%)	98.3	99.2	100	100
	specificity (%)	31.7	53.3	39.2	45.8
	accuracy (%)	65.0	76.3	69.6	72.9
pancreas	sensitivity (%)	96.7	100	100	98.3
	specificity (%)	22.5	44.2	34.2	38.8
	accuracy (%)	59.6	72.1	67.1	68.3
kidney	sensitivity (%)	95.8	98.3	99.2	97.5
	specificity (%)	36.7	55.0	40.8	51.7
	accuracy (%)	66.3	76.7	70.0	74.6

• positives: real tumors ($N$ = 120); negatives: synthetic tumors ($N$ = 120).

Table 1: Visual Turing Test over three organs has been conducted with four radiologists (R1–R4). Each radiologist was provided with 240 three-dimensional CT volumes of each organ, including 120 scans with real tumors and the remaining 120 with synthetic ones. Radiologists were tasked to label each CT volume as real or synthetic. A lower specificity score indicates a higher number of synthetic tumors being identified as real.
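The three metrics in Table 1 follow directly from the counts of correct calls. The sketch below treats real tumors as positives and synthetic tumors as negatives, as stated in the table note. The function name is hypothetical, and the example counts (118 real tumors called real, 38 synthetic tumors called synthetic) are back-computed from the percentages reported for R1 on the liver, not taken from raw data.

```python
def turing_test_metrics(real_called_real, synth_called_synth, n_real=120, n_synth=120):
    """Sensitivity, specificity, and accuracy in percent.

    Positives are real tumors and negatives are synthetic tumors, so a low
    specificity means many synthetic tumors were mistaken for real ones.
    """
    sensitivity = 100.0 * real_called_real / n_real
    specificity = 100.0 * synth_called_synth / n_synth
    accuracy = 100.0 * (real_called_real + synth_called_synth) / (n_real + n_synth)
    return sensitivity, specificity, accuracy


# Counts inferred from the R1 liver row of Table 1 (98.3 / 31.7 / 65.0).
sens, spec, acc = turing_test_metrics(118, 38)
```

With equally sized positive and negative sets, accuracy is simply the mean of sensitivity and specificity.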

4.2 Generalizable to Different Organs

DiffTumor can generate visually realistic tumors that generalize to a range of organs, even though the Diffusion Model is trained only on tumors from a single organ. To verify the generalization capacity of our method across organs, we conducted comparative experiments on three abdominal organs. This involved training all Segmentation Models on tumor data from a single organ and then evaluating them on the other two organs. For our method, we train DiffTumor on the source organ data and then use healthy CT volumes to synthesize tumors in the target organ, which are used for further training of Segmentation Models. To showcase the broad applicability of our synthetic data, we compare the early-stage tumor detection capabilities across three commonly used backbones. The generalization results, shown in Table 2, suggest that it is difficult for Segmentation Models trained on real data to generalize across different organs, leading to poor performance in early-stage tumor detection. Hu et al. [hu2023label] introduce a modeling-based method, which maintains similar sensitivity scores for the same target regardless of the source domain. The cross-organ generalization of DiffTumor matches or surpasses that of the compared methods in every setting except when generalizing tumors from kidney to liver. Moreover, we demonstrate the strength of DiffTumor as an augmentation method for real tumors in the same organ, as shown in Table 2. In particular, there is a notable improvement of 10.7% in the Dice Similarity Coefficient (DSC) for kidney tumors when using the nnU-Net backbone. Furthermore, the standard deviations of the DSC scores decrease in most settings, which suggests that the Segmentation Models become more stable. The consistent improvement in DSC scores across all three organs indicates that DiffTumor is an effective data augmentation method for enhancing the performance of the Segmentation Model.

Early-stage tumor detection performance (tumor-wise Sensitivity %).
source $\backslash$ target		liver	pancreas	kidneys
liver	real tumors	75.6	0	2.4
	Hu et al. [hu2023label]	77.8	56.3	52.4
	DiffTumor	82.2	56.3	76.2
pancreas	real tumors	0.7	64.3	0
	Hu et al. [hu2023label]	74.1	67.0	52.4
	DiffTumor	75.3	71.4	71.4
kidney	real tumors	0.1	0	50.0
	Hu et al. [hu2023label]	74.1	56.3	66.7
	DiffTumor	68.8	61.6	78.6

All-stage tumor segmentation performance (DSC %).
backbone	method	liver	pancreas	kidneys
U-Net	real tumors	62.3 $\pm$ 28.3	56.0 $\pm$ 24.8	75.1 $\pm$ 27.2
U-Net	DiffTumor	70.9 $\pm$ 21.1	64.8 $\pm$ 24.5	84.2 $\pm$ 9.5
nnU-Net	real tumors	64.3 $\pm$ 26.5	59.9 $\pm$ 23.8	73.8 $\pm$ 20.9
nnU-Net	DiffTumor	73.6 $\pm$ 18.1	63.6 $\pm$ 27.7	84.5 $\pm$ 11.5
SwinUNETR	real tumors	65.1 $\pm$ 23.5	52.2 $\pm$ 31.2	80.6 $\pm$ 19.6
SwinUNETR	DiffTumor	71.4 $\pm$ 19.1	62.2 $\pm$ 26.1	85.1 $\pm$ 8.7

Table 2: Generalizable to different organs: comparison of generalization for early-stage tumor detection under different source organs. The scores in bold represent the best performance in each domain. DiffTumor achieves the best performance in almost all domains. Furthermore, DiffTumor serves as an effective data augmentation method for real tumors in three abdominal organs, yielding substantial improvements in all-stage tumor segmentation. Additional results for different segmentation backbones with more metrics can be found in Appendix C.
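For reference, the two metrics reported in Table 2 can be computed as in the sketch below. The Dice Similarity Coefficient is $2|A\cap B|/(|A|+|B|)$ for a predicted mask $A$ and a ground-truth mask $B$. Returning 1.0 when both masks are empty is a convention chosen here, and representing tumor-wise detection as a list of per-tumor boolean flags is a simplifying assumption; the paper's exact detection criterion is not restated in this section.

```python
def dice(pred, truth):
    """Dice Similarity Coefficient between two flattened binary masks."""
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / total


def tumor_wise_sensitivity(detected_flags):
    """Percentage of ground-truth tumors that were detected."""
    if not detected_flags:
        return 0.0
    return 100.0 * sum(1 for flag in detected_flags if flag) / len(detected_flags)


# Toy masks: 1 overlapping voxel, 2 predicted, 1 true -> DSC = 2*1 / (2+1).
dsc = dice([1, 1, 0, 0], [1, 0, 0, 0])

# Toy detection outcome: 3 of 4 tumors found.
sensitivity = tumor_wise_sensitivity([True, True, False, True])
```

DSC rewards voxel-level overlap, whereas tumor-wise sensitivity only asks whether each tumor was found, which is why the two halves of Table 2 are reported separately.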

4.3 Generalizable to Different Demographics

The ability of the Segmentation Model to generalize to different demographics is critically important, because it indicates that the model can effectively process CT scans from a diverse population, including various ages, genders, and ethnicities. To assess how much DiffTumor helps the Segmentation Model detect and segment real tumors across different individuals, we evaluate the generalization ability of the Segmentation Model using a proprietary dataset at Hopkins [xia2022felix]. This dataset includes various real pancreatic tumors (PDAC and Cyst) from diverse patient demographics. We use DiffTumor with the Diffusion Model trained on MSD-Pancreas to enhance the Segmentation Model. More details about the dataset and experiment setting can be found in Appendix D. Figure 4 shows that our synthetic data yields an average improvement of 6.9% in DSC and 16.4% in sensitivity with the U-Net backbone. In particular, the improvement for people aged 50–60 is significant, with an enhancement of 18.9% in sensitivity and 9.1% in DSC. For both males and females, there are noticeable performance improvements in tumor detection and segmentation. These results suggest that our synthetic data can provide valuable assistance in clinical tumor analysis for individuals across various age groups and genders.

4.4 Advantages of DiffTumor

(1) Reduced annotations for Diffusion Model. The quality of synthetic data produced by a generative model typically relies heavily on the quantity and diversity of the paired training data used during the training phase [chlap2021review, jaipuria2020deflating, ramesh2022hierarchical]. We study the relationship between the number of annotated real tumors needed for the Diffusion Model and the performance of the Segmentation Model. We find that the relationship between the amount of paired training data and the quality of synthetic data is not always linear, as shown in Figure 5. In particular, DiffTumor requires only one annotated tumor to train the Diffusion Model and generate synthetic tumors for the subsequent training of the Segmentation Model. This contrasts with the typical experience in computer vision [ramesh2022hierarchical], where large-scale data is generally required for training. The results indicate that, for training the Diffusion Model, particularly for early tumors, we can rely on a smaller number of real tumors. This finding could have important implications for the efficiency and cost-effectiveness of training DiffTumor.

(2) Accelerated tumor synthesis. The speed of tumor synthesis plays a crucial role in the practical application of synthetic data. Real-time synthesis can significantly speed up the training process of the Segmentation Model. The speed of generating synthetic tumors with the Diffusion Model is strongly influenced by the number of timesteps $T$. We examine the impact of the number of timesteps on the performance of the Segmentation Model. The synthetic quality using DDPM sampling [ho2020denoising] with different numbers of timesteps is illustrated in Figure 6. As can be seen, when $T=1$, the model collapses and fails to synthesize realistic textures for both the organ and the tumor. Consequently, using these synthetic data to train the Segmentation Model results in poor performance. However, when $T$ is greater than 1, the corresponding textures are generated well, leading to good performance of the Segmentation Model. Considering the trade-off between performance and efficiency, we default to $T=4$ timesteps for early tumor synthesis. This balance allows for the generation of high-quality synthetic data while maintaining reasonable efficiency.
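The link between $T$ and synthesis speed can be seen in a generic DDPM ancestral sampling loop [ho2020denoising], sketched below: each timestep costs exactly one evaluation of the noise-prediction network, so generation time scales roughly linearly with $T$. This is a sketch under stated assumptions and not the DiffTumor sampler. The variance schedule is hypothetical, the noise predictor is a zero-returning stand-in for the trained 3D U-Net, and conditioning on the mask and healthy region is assumed to happen inside `eps_model`.

```python
import math
import random


def ddpm_sample(eps_model, betas, dim, rng):
    """Generic DDPM ancestral sampling over a flattened latent of length dim.

    eps_model(z, t) returns the predicted noise for latent z at timestep t
    (1-indexed) and is called exactly len(betas) times.
    """
    alphas = [1.0 - b for b in betas]
    abar, prod = [], 1.0
    for a in alphas:
        prod *= a
        abar.append(prod)

    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from pure noise z_T
    for t in range(len(betas), 0, -1):
        beta, alpha, ab = betas[t - 1], alphas[t - 1], abar[t - 1]
        eps = eps_model(z, t)
        mean = [
            (zi - beta / math.sqrt(1.0 - ab) * ei) / math.sqrt(alpha)
            for zi, ei in zip(z, eps)
        ]
        if t > 1:
            sigma = math.sqrt(beta)  # sigma_t^2 = beta_t variance choice
            z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            z = mean                 # no noise is added at the final step
    return z


calls = []

def toy_eps_model(z, t):
    """Placeholder predictor that records its calls and predicts zero noise."""
    calls.append(t)
    return [0.0] * len(z)


betas_T4 = [0.1, 0.3, 0.5, 0.7]  # hypothetical 4-step schedule
sample = ddpm_sample(toy_eps_model, betas_T4, dim=8, rng=random.Random(0))
```

With $T=4$ the loop performs four network evaluations per synthetic latent, in timestep order 4, 3, 2, 1.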

(3) Improved early tumor detection. Detecting tumors in their early stages can greatly increase the chances of successful treatment and survival. However, obtaining early-stage cancer data is challenging in practice, and such cases remain scarce in real datasets. This limits the AI model’s ability to detect early tumors. As shown in Figure 7, there are several failure cases for the Segmentation Model trained on real data. However, with the incorporation of our synthetic data, the Segmentation Model’s capability to detect early-stage tumors improves significantly. This is one of the primary reasons why DiffTumor achieves the best performance in Table 2. This demonstrates the value and efficacy of synthetic data in enhancing early tumor detection.

5 Related Work

Generative models such as Energy-Based Models [lecun2006tutorial, zhao2016energy, du2019implicit], Variational Autoencoders (VAE) [kingma2013auto, kingma2014semi, kingma2019introduction], Generative Adversarial Networks (GAN) [goodfellow2014generative, goodfellow2016deep, goodfellow2020generative, creswell2018generative, jordon2018pate, yoon2019time, chen2022mask], and normalizing flows [papamakarios2021normalizing, kobyzev2020normalizing, yu2021fastflow] have shown significant potential in creating realistic images. Among these, Diffusion Models [sohl2015deep, ho2020denoising, song2020score] and their variants [vahdat2021score, kim2021maximum, rombach2022high] have recently emerged as particularly advanced in image generation. In the medical field, generative models have been effectively utilized for tasks like image-to-image translation [lyu2022conversion, meng2022novel, ozbey2023unsupervised], reconstruction [song2021solving, xie2022measurement], segmentation [fernandez2022can, kim2022diffusion, wolleb2022diffusion], image denoising [gong2023pet], and anomaly detection [wyatt2022anoddpm, siddiquee2019learning, xiang2023squid]. In this work, we focus on generating tumors in abdominal organs based on the textures of the surrounding organs, which significantly reduces the annotated data required for training. Refer to Appendix F for a more comprehensive comparison with our DiffTumor.

Tumor synthesis that is widely effective for a variety of organs is an attractive topic. Successful works on tumor synthesis across medical modalities include colon polyp synthesis in colonoscopy videos [shin2018abnormal], tumor cell synthesis in fluorescence microscopy images [horvath2022metgan], brain tumor synthesis [billot2023synthseg, zhang2023self] and myocardial pathology synthesis [zhang2024lefusion] in MRI, lung nodule synthesis in CT images [han2019synthesizing, yang2019class, jin2021free], and lesion synthesis in dermatoscopic images [du2023boosting]. Additionally, there are many works on synthesizing non-cancerous lesions, such as COVID-19 lesion synthesis in chest CT [lyu2022pseudo, yao2021label] and diabetic lesion synthesis in retinal images [wang2022anomaly]. Recent studies have improved the realism of synthetic tumors in the liver [zhang2023unsupervised, lyu2022learning, hu2022synthetic, hu2023synthetic] and pancreas [wei2022pancreatic, li2023early]. AI models trained on these synthetic tumors perform comparably to those trained on real tumors. However, these methods need to be redesigned for tumors in other organs, which severely limits their generalization capabilities. In this work, we learn the tumor distribution with generative models, i.e., Diffusion Models, to realize generalizable tumor synthesis.

6 Conclusion

This work introduces DiffTumor for generalizable tumor synthesis. We leverage the observation that early-stage tumors share similar imaging characteristics in CT scans across different organs (e.g., liver, pancreas, kidneys). As a result, DiffTumor trained solely on annotated liver tumors can directly synthesize tumors in other organs with limited annotated data (e.g., pancreas, kidney). By augmenting large-scale datasets of healthy organs (readily available in clinical settings) with these synthetic tumors, we substantially expand training data for tumor segmentation models. This augmentation significantly improves AI generalizability across diverse hospital systems and patient populations.

Acknowledgments. This work was supported by the Lustgarten Foundation for Pancreatic Cancer Research and the Patrick J. McGovern Foundation Award. We thank Yuxiang Lai, Qian Yu, and Wenxuan Li for their constructive suggestions at several stages of the project.

References

Abi Nader et al. [2023] Clément Abi Nader, Rebeca Vetil, Laura Kate Wood, Marc-Michel Rohe, Alexandre Bône, Hedvig Karteszi, and Marie-Pierre Vullierme. Automatic detection of pancreatic lesions and main pancreatic duct dilatation on portal venous ct scans using deep learning. Investigative Radiology, 2023.
Antonelli et al. [2021] Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, Bram van Ginneken, et al. The medical segmentation decathlon. arXiv preprint arXiv:2106.05735, 2021.
Arnab et al. [2021] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6836–6846, 2021.
Ayuso et al. [2018] Carmen Ayuso, Jordi Rimola, Ramón Vilana, Marta Burrel, Anna Darnell, Ángeles García-Criado, Luis Bianchi, Ernest Belmonte, Carla Caparroz, Marta Barrufet, et al. Diagnosis and staging of hepatocellular carcinoma (hcc): current guidelines. European journal of radiology, 101:72–81, 2018.
Bertasius et al. [2021] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? In International Conference on Machine Learning, pages 813–824. PMLR, 2021.
Bilic et al. [2019] Patrick Bilic, Patrick Ferdinand Christ, Eugene Vorontsov, Grzegorz Chlebus, Hao Chen, Qi Dou, Chi-Wing Fu, Xiao Han, Pheng-Ann Heng, Jürgen Hesser, et al. The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056, 2019.
Billot et al. [2023] Benjamin Billot, Douglas N Greve, Oula Puonti, Axel Thielscher, Koen Van Leemput, Bruce Fischl, Adrian V Dalca, Juan Eugenio Iglesias, et al. Synthseg: Segmentation of brain mri scans of any contrast and resolution without retraining. Medical image analysis, 86:102789, 2023.
Burke [2004] Harry B Burke. Outcome prediction and the future of the tnm staging system, 2004.
Chen et al. [2023] Jieneng Chen, Yingda Xia, Jiawen Yao, Ke Yan, Jianpeng Zhang, Le Lu, Fakai Wang, Bo Zhou, Mingyan Qiu, Qihang Yu, et al. Cancerunit: Towards a single unified model for effective detection, segmentation, and diagnosis of eight major cancers using a large collection of ct scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21327–21338, 2023.
Chen et al. [2022] Qi Chen, Mingxing Li, Jiacheng Li, Bo Hu, and Zhiwei Xiong. Mask rearranging data augmentation for 3d mitochondria segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 36–46. Springer, 2022.
Chen et al. [2021] Richard J Chen, Ming Y Lu, Tiffany Y Chen, Drew FK Williamson, and Faisal Mahmood. Synthetic data in machine learning for medicine and healthcare. Nature Biomedical Engineering, 5(6):493–497, 2021.
Chlap et al. [2021] Phillip Chlap, Hang Min, Nym Vandenberg, Jason Dowling, Lois Holloway, and Annette Haworth. A review of medical image data augmentation techniques for deep learning applications. Journal of Medical Imaging and Radiation Oncology, 65(5):545–563, 2021.
Choi et al. [2014] Jin-Young Choi, Jeong-Min Lee, and Claude B Sirlin. Ct and mr imaging diagnosis and staging of hepatocellular carcinoma: part i. development, growth, and spread: key pathologic and imaging aspects. Radiology, 272(3):635–654, 2014.
Chou et al. [2024] Yu-Cheng Chou, Bowen Li, Deng-Ping Fan, Alan Yuille, and Zongwei Zhou. Acquiring weak annotations for tumor localization in temporal and volumetric data. Machine Intelligence Research, pages 1–13, 2024.
Chu et al. [2017] Linda C Chu, Michael G Goggins, and Elliot K Fishman. Diagnosis and detection of pancreatic cancer. The Cancer Journal, 23(6):333–342, 2017.
Chu et al. [2019] Linda C Chu, Seyoun Park, Satomi Kawamoto, Daniel F Fouladi, Shahab Shayesteh, Eva S Zinreich, Jefferson S Graves, Karen M Horton, Ralph H Hruban, Alan L Yuille, et al. Utility of ct radiomics features in differentiation of pancreatic ductal adenocarcinoma from normal pancreatic tissue. American Journal of Roentgenology, 213(2):349–357, 2019.
Creswell et al. [2018] Antonia Creswell, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A Bharath. Generative adversarial networks: An overview. IEEE signal processing magazine, 35(1):53–65, 2018.
Du et al. [2023] Shiyi Du, Xiaosong Wang, Yongyi Lu, Yuyin Zhou, Shaoting Zhang, Alan Yuille, Kang Li, and Zongwei Zhou. Boosting dermatoscopic lesion segmentation via diffusion models with visual and textual prompts. arXiv preprint arXiv:2310.02906, 2023.
Du and Mordatch [2019] Yilun Du and Igor Mordatch. Implicit generation and modeling with energy based models. Advances in Neural Information Processing Systems, 32, 2019.
Dunnick [2016] N Reed Dunnick. Renal cell carcinoma: staging and surveillance. Abdominal Radiology, 41:1079–1085, 2016.
Elbanna et al. [2020] Khaled Y Elbanna, Hyun-Jung Jang, and Tae Kyoung Kim. Imaging diagnosis and staging of pancreatic ductal adenocarcinoma: a comprehensive review. Insights into imaging, 11(1):1–13, 2020.
Esser et al. [2021] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
Fernandez et al. [2022] Virginia Fernandez, Walter Hugo Lopez Pinaya, Pedro Borges, Petru-Daniel Tudosiu, Mark S Graham, Tom Vercauteren, and M Jorge Cardoso. Can segmentation models be trained with fully synthetically generated data? In International Workshop on Simulation and Synthesis in Medical Imaging, pages 79–90. Springer, 2022.
Ficarra et al. [2007] Vincenzo Ficarra, Antonio Galfano, Mariangela Mancini, Guido Martignoni, and Walter Artibani. Tnm staging system for renal-cell carcinoma: current status and future perspectives. The lancet oncology, 8(6):554–558, 2007.
Fowler et al. [2021] Kathryn J Fowler, Adam Burgoyne, Tyler J Fraum, Mojgan Hosseini, Shintaro Ichikawa, Sooah Kim, Azusa Kitao, Jeong Min Lee, Valérie Paradis, Bachir Taouli, et al. Pathologic, molecular, and prognostic radiologic features of hepatocellular carcinoma. Radiographics, 41(6):1611–1631, 2021.
Gao et al. [2023] Cong Gao, Benjamin D Killeen, Yicheng Hu, Robert B Grupp, Russell H Taylor, Mehran Armand, and Mathias Unberath. Synthetic data accelerates the development of generalizable learning-based algorithms for x-ray image analysis. Nature Machine Intelligence, 5(3):294–308, 2023.
Gong et al. [2023] Kuang Gong, Keith Johnson, Georges El Fakhri, Quanzheng Li, and Tinsu Pan. Pet image denoising based on denoising diffusion probabilistic model. European Journal of Nuclear Medicine and Molecular Imaging, pages 1–11, 2023.
Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
Goodfellow et al. [2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, Cambridge, 2016.
Goodfellow et al. [2020] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
Han et al. [2019] Changhee Han, Yoshiro Kitamura, Akira Kudo, Akimichi Ichinose, Leonardo Rundo, Yujiro Furukawa, Kazuki Umemoto, Yuanzhong Li, and Hideki Nakayama. Synthesizing diverse lung nodules wherever massively: 3d multi-conditional gan-based ct image augmentation for object detection. In 2019 International Conference on 3D Vision (3DV), pages 729–737. IEEE, 2019.
Hatamizadeh et al. [2021] Ali Hatamizadeh, Vishwesh Nath, Yucheng Tang, Dong Yang, Holger R Roth, and Daguang Xu. Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In International MICCAI Brainlesion Workshop, pages 272–284. Springer, 2021.
Heller et al. [2020] Nicholas Heller, Sean McSweeney, Matthew Thomas Peterson, Sarah Peterson, Jack Rickman, Bethany Stai, Resha Tejpaul, Makinna Oestreich, Paul Blake, Joel Rosenberg, et al. An international challenge to use artificial intelligence to define the state-of-the-art in kidney and kidney tumor segmentation in ct imaging, 2020.
Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
Horvath et al. [2022] Izabela Horvath, Johannes Paetzold, Oliver Schoppe, Rami Al-Maskari, Ivan Ezhov, Suprosanna Shit, Hongwei Li, Ali Ertürk, and Bjoern Menze. Metgan: Generative tumour inpainting and modality synthesis in light sheet microscopy. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 227–237, 2022.
Hu et al. [2022] Qixin Hu, Junfei Xiao, Yixiong Chen, Shuwen Sun, Jie-Neng Chen, Alan Yuille, and Zongwei Zhou. Synthetic tumors make ai segment tumors better. NeurIPS Workshop on Medical Imaging meets NeurIPS, 2022.
Hu et al. [2023a] Qixin Hu, Yixiong Chen, Junfei Xiao, Shuwen Sun, Jieneng Chen, Alan L Yuille, and Zongwei Zhou. Label-free liver tumor segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7422–7432, 2023a.
Hu et al. [2023b] Qixin Hu, Alan Yuille, and Zongwei Zhou. Synthetic data as validation. arXiv preprint arXiv:2310.16052, 2023b.
Isensee et al. [2021] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
Jaipuria et al. [2020] Nikita Jaipuria, Xianling Zhang, Rohan Bhasin, Mayar Arafa, Punarjay Chakravarty, Shubham Shrivastava, Sagar Manglani, and Vidya N Murali. Deflating dataset bias using synthetic data augmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 772–773, 2020.
Jin et al. [2021] Qiangguo Jin, Hui Cui, Changming Sun, Zhaopeng Meng, and Ran Su. Free-form tumor synthesis in computed tomography images via richer generative adversarial network. Knowledge-Based Systems, 218:106753, 2021.
Jordon et al. [2018] James Jordon, Jinsung Yoon, and Mihaela Van Der Schaar. Pate-gan: Generating synthetic data with differential privacy guarantees. In International conference on learning representations, 2018.
Kang et al. [2023] Mintong Kang, Bowen Li, Zengle Zhu, Yongyi Lu, Elliot K Fishman, Alan Yuille, and Zongwei Zhou. Label-assemble: Leveraging multiple datasets with partial labels. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2023.
Kim et al. [2022] Boah Kim, Yujin Oh, and Jong Chul Ye. Diffusion adversarial representation learning for self-supervised vessel segmentation. arXiv preprint arXiv:2209.14566, 2022.
Kim et al. [2021] Dongjun Kim, Byeonghu Na, Se Jung Kwon, Dongsoo Lee, Wanmo Kang, and Il-chul Moon. Maximum likelihood training of parametrized diffusion model. 2021.
Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
Kingma et al. [2014] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. Advances in neural information processing systems, 27, 2014.
Kingma et al. [2019] Diederik P Kingma, Max Welling, et al. An introduction to variational autoencoders. Foundations and Trends® in Machine Learning, 12(4):307–392, 2019.
Kobyzev et al. [2020] Ivan Kobyzev, Simon JD Prince, and Marcus A Brubaker. Normalizing flows: An introduction and review of current methods. IEEE transactions on pattern analysis and machine intelligence, 43(11):3964–3979, 2020.
Laeseke et al. [2015] Paul F Laeseke, Ru Chen, R Brooke Jeffrey, Teresa A Brentnall, and Jürgen K Willmann. Combining in vitro diagnostics with in vivo imaging for earlier detection of pancreatic ductal adenocarcinoma: challenges and solutions. Radiology, 277(3):644–661, 2015.
LeCun et al. [2006] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and Fujie Huang. A tutorial on energy-based learning. Predicting structured data, 1(0), 2006.
Leveridge et al. [2010] Michael J Leveridge, Peter J Bostrom, George Koulouris, Antonio Finelli, and Nathan Lawrentschuk. Imaging renal cell carcinoma with ultrasonography, ct and mri. Nature Reviews Urology, 7(6):311–325, 2010.
Li et al. [2023] Bowen Li, Yu-Cheng Chou, Shuwen Sun, Hualin Qiao, Alan Yuille, and Zongwei Zhou. Early detection and localization of pancreatic cancer by label-free tumor synthesis. MICCAI Workshop on Big Task Small Data, 1001-AI, 2023.
Li et al. [2024] Wenxuan Li, Alan Yuille, and Zongwei Zhou. How well do supervised models transfer to 3d image segmentation? In The Twelfth International Conference on Learning Representations, 2024.
Liu et al. [2023] Jie Liu, Yixiao Zhang, Jie-Neng Chen, Junfei Xiao, Yongyi Lu, Bennett A Landman, Yixuan Yuan, Alan Yuille, Yucheng Tang, and Zongwei Zhou. Clip-driven universal model for organ segmentation and tumor detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21152–21164, 2023.
Lyu et al. [2022a] Fei Lyu, Mang Ye, Jonathan Frederik Carlsen, Kenny Erleben, Sune Darkner, and Pong C Yuen. Pseudo-label guided image synthesis for semi-supervised covid-19 pneumonia infection segmentation. IEEE Transactions on Medical Imaging, 2022a.
Lyu et al. [2022b] Fei Lyu, Mang Ye, Andy J Ma, Terry Cheuk-Fung Yip, Grace Lai-Hung Wong, and Pong C Yuen. Learning from synthetic ct images via test-time training for liver tumor segmentation. IEEE transactions on medical imaging, 41(9):2510–2520, 2022b.
Lyu and Wang [2022] Qing Lyu and Ge Wang. Conversion between ct and mri images using diffusion and score-matching models. arXiv preprint arXiv:2209.12104, 2022.
Meng et al. [2022] Xiangxi Meng, Yuning Gu, Yongsheng Pan, Nizhuan Wang, Peng Xue, Mengkang Lu, Xuming He, Yiqiang Zhan, and Dinggang Shen. A novel unified conditional score-based generative framework for multi-modal medical image completion. arXiv preprint arXiv:2207.03430, 2022.
Nichol and Dhariwal [2021] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
Orbes-Arteaga et al. [2019] Mauricio Orbes-Arteaga, Thomas Varsavsky, Carole H Sudre, Zach Eaton-Rosen, Lewis J Haddow, Lauge Sørensen, Mads Nielsen, Akshay Pai, Sébastien Ourselin, Marc Modat, et al. Multi-domain adaptation in brain mri through paired consistency and adversarial learning. In Domain Adaptation and Representation Transfer and Medical Image Learning with Less Labels and Imperfect Data, pages 54–62. Springer, 2019.
Özbey et al. [2023] Muzaffer Özbey, Onat Dalmaz, Salman UH Dar, Hasan A Bedel, Şaban Özturk, Alper Güngör, and Tolga Çukur. Unsupervised medical image translation with adversarial diffusion models. IEEE Transactions on Medical Imaging, 2023.
Papamakarios et al. [2021] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. The Journal of Machine Learning Research, 22(1):2617–2680, 2021.
Qu et al. [2023] Chongyu Qu, Tiezheng Zhang, Hualin Qiao, Jie Liu, Yucheng Tang, Alan Yuille, and Zongwei Zhou. Abdomenatlas-8k: Annotating 8,000 abdominal ct volumes for multi-organ segmentation in three weeks. Conference on Neural Information Processing Systems, 2023.
Ramesh et al. [2022] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
Rindi et al. [2012] Guido Rindi, Massimo Falconi, Catherine Klersy, L Albarello, L Boninsegna, MW Buchler, C Capella, Martyn Caplin, Anne Couvelard, Claudio Doglioni, et al. Tnm staging of neoplasms of the endocrine pancreas: results from a large international cohort study. Journal of the National Cancer Institute, 104(10):764–777, 2012.
Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
Saharia et al. [2022] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, pages 1–10, 2022.
Shin et al. [2018] Younghak Shin, Hemin Ali Qadir, and Ilangko Balasingham. Abnormal colon polyp image synthesis using conditional adversarial networks for improved detection performance. IEEE Access, 6:56007–56017, 2018.
Siddiquee et al. [2019] Md Mahfuzur Rahman Siddiquee, Zongwei Zhou, Nima Tajbakhsh, Ruibin Feng, Michael B Gotway, Yoshua Bengio, and Jianming Liang. Learning fixed points in generative adversarial networks: From image-to-image translation to disease detection and localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 191–200, 2019.
Skarin [2015] Arthur T Skarin. Atlas of Diagnostic Oncology E-Book. Elsevier Health Sciences, 2015.
Sohl-Dickstein et al. [2015] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International conference on machine learning, pages 2256–2265. PMLR, 2015.
Song et al. [2020] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
Song et al. [2021] Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. arXiv preprint arXiv:2111.08005, 2021.
Vahdat et al. [2021] Arash Vahdat, Karsten Kreis, and Jan Kautz. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34:11287–11302, 2021.
Van Griethuysen et al. [2017] Joost JM Van Griethuysen, Andriy Fedorov, Chintan Parmar, Ahmed Hosny, Nicole Aucoin, Vivek Narayan, Regina GH Beets-Tan, Jean-Christophe Fillion-Robin, Steve Pieper, and Hugo JWL Aerts. Computational radiomics system to decode the radiographic phenotype. Cancer research, 77(21):e104–e107, 2017.
Wang et al. [2017] Hongkai Wang, Zongwei Zhou, Yingci Li, Zhonghua Chen, Peiou Lu, Wenzhi Wang, Wanyu Liu, and Lijuan Yu. Comparison of machine learning methods for classifying mediastinal lymph node metastasis of non-small cell lung cancer from 18f-fdg pet/ct images. EJNMMI research, 7(1):1–11, 2017.
Wang et al. [2022] Hualin Wang, Yuhong Zhou, Jiong Zhang, Jianqin Lei, Dongke Sun, Feng Xu, and Xiayu Xu. Anomaly segmentation in retinal images with poisson-blending data augmentation. Medical Image Analysis, page 102534, 2022.
Wang et al. [2018] Z. J. Wang, A. C. Westphalen, and R. J. Zagoria. CT and MRI of small renal masses. Br J Radiol, 91(1087):20180131, 2018.
Wei et al. [2022] Zihan Wei, Yizhou Chen, Qiu Guan, Haigen Hu, Qianwei Zhou, Zhicheng Li, Xinli Xu, Alejandro Frangi, and Feng Chen. Pancreatic image augmentation based on local region texture synthesis for tumor segmentation. In International Conference on Artificial Neural Networks, pages 419–431. Springer, 2022.
Wolleb et al. [2022] Julia Wolleb, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C Cattin. Diffusion models for implicit image segmentation ensembles. In International Conference on Medical Imaging with Deep Learning, pages 1336–1348. PMLR, 2022.
Wyatt et al. [2022] Julian Wyatt, Adam Leach, Sebastian M Schmon, and Chris G Willcocks. Anoddpm: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 650–656, 2022.
Xia et al. [2022] Yingda Xia, Qihang Yu, Linda Chu, Satomi Kawamoto, Seyoun Park, Fengze Liu, Jieneng Chen, Zhuotun Zhu, Bowen Li, Zongwei Zhou, et al. The felix project: Deep networks to detect pancreatic neoplasms. medRxiv, 2022.
Xiang et al. [2023] Tiange Xiang, Yixiao Zhang, Yongyi Lu, Alan L Yuille, Chaoyi Zhang, Weidong Cai, and Zongwei Zhou. Squid: Deep feature in-painting for unsupervised anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23890–23901, 2023.
Xie and Li [2022] Yutong Xie and Quanzheng Li. Measurement-conditioned denoising diffusion probabilistic model for under-sampled medical image reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 655–664. Springer, 2022.
Yang et al. [2019] Jie Yang, Siqi Liu, Sasa Grbic, Arnaud Arindra Adiyoso Setio, Zhoubing Xu, Eli Gibson, Guillaume Chabin, Bogdan Georgescu, Andrew F Laine, and Dorin Comaniciu. Class-aware adversarial lung nodule synthesis in ct images. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), pages 1348–1352. IEEE, 2019.
Yao et al. [2021] Qingsong Yao, Li Xiao, Peihang Liu, and S Kevin Zhou. Label-free segmentation of covid-19 lesions in lung ct. IEEE Transactions on Medical Imaging, 2021.
Yoon et al. [2019] Jinsung Yoon, Daniel Jarrett, and Mihaela Van der Schaar. Time-series generative adversarial networks. Advances in neural information processing systems, 32, 2019.
Yu et al. [2021] Jiawei Yu, Ye Zheng, Xiang Wang, Wei Li, Yushuang Wu, Rui Zhao, and Liwei Wu. Fastflow: Unsupervised anomaly detection and localization via 2d normalizing flows. arXiv preprint arXiv:2111.07677, 2021.
Zhang et al. [2024] Hantao Zhang, Jiancheng Yang, Shouhong Wan, and Pascal Fua. Lefusion: Synthesizing myocardial pathology on cardiac mri via lesion-focus diffusion models. arXiv preprint arXiv:2403.14066, 2024.
Zhang et al. [2023a] Xiaoman Zhang, Weidi Xie, Chaoqin Huang, Ya Zhang, Xin Chen, Qi Tian, and Yanfeng Wang. Self-supervised tumor segmentation with sim2real adaptation. IEEE Journal of Biomedical and Health Informatics, 2023a.
Zhang et al. [2023b] Yixiao Zhang, Xinyi Li, Huimiao Chen, Alan L Yuille, Yaoyao Liu, and Zongwei Zhou. Continual learning for abdominal multi-organ and tumor segmentation. In International conference on medical image computing and computer-assisted intervention, pages 35–45. Springer, 2023b.
Zhang et al. [2023c] Zhaoxiang Zhang, Hanqiu Deng, and Xingyu Li. Unsupervised liver tumor segmentation with pseudo anomaly synthesis. In International Workshop on Simulation and Synthesis in Medical Imaging, pages 86–96. Springer, 2023c.
Zhao et al. [2016] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. arXiv preprint arXiv:1609.03126, 2016.
Zhou [2021] Zongwei Zhou. Towards Annotation-Efficient Deep Learning for Computer-Aided Diagnosis. PhD thesis, Arizona State University, 2021.
Zhou et al. [2022] Zongwei Zhou, Michael B Gotway, and Jianming Liang. Interpreting medical images. In Intelligent Systems in Medicine and Health, pages 343–371. Springer, 2022.
Zhu et al. [2022] Zengle Zhu, Mintong Kang, Alan Yuille, and Zongwei Zhou. Assembling and exploiting large-scale existing labels of common thorax diseases for improved covid-19 classification using chest radiographs. In Radiological Society of North America (RSNA), 2022.

This appendix is organized as follows: §A provides visual examples for the reader study and Visual Turing Test. §B describes the Radiomics Features. §C provides additional results on generalization to multiple organs. §D provides additional results on generalization to different patient demographics. §E details the datasets used and the implementation of DiffTumor and the Segmentation Model. §F discusses the comparison with related works, unrealistic generation, and challenging cases.

Appendix A Visual Examples

Appendix B Description of Radiomics Features

Radiomics Features [van2017computational] consist of a comprehensive set of quantitative, high-dimensional imaging attributes derived from radiographic images, such as Computed Tomography (CT) scans, Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) scans. These attributes capture a broad spectrum of image characteristics, including shape, intensity, texture, and wavelet, among others. The scope of Radiomics Features applications is expansive, ranging from predicting disease prognosis to formulating treatment strategies and evaluating treatment response. The effectiveness of these features has been validated across various medical fields, notably in oncology, neurology, and cardiology.

The key characteristics of Radiomics Features include their high-throughput capacity and reproducibility, which facilitate a detailed characterization of tumor phenotypes. These features are multivariate, incorporating first-order statistics, shape and size-based features, textural features, and filter-based features.

In this paper, we utilize the official Radiomics feature repository (https://github.com/AIM-Harvard/pyradiomics/) to extract the appearance features, which include 3D shape-based features (16 dimensions), gray level co-occurrence matrix (24 dimensions), gray level run length matrix (16 dimensions), gray level size zone matrix (16 dimensions), neighboring gray-tone difference matrix (5 dimensions), and gray level dependence matrix (14 dimensions). The shape descriptors are independent of the gray value and are extracted from the tumor mask. The definitions of these features can be referenced at https://pyradiomics.readthedocs.io/en/latest/features.html. Based on the tumor mask annotations, we are able to extract only the appearance features of tumors. Consequently, for each early-stage tumor, a 91-dimensional vector can be obtained. Ultimately, these features from all early-stage tumors are aggregated for Radiomics feature analysis in Figure 2.
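As a quick check of the dimension count above, the sketch below concatenates per-class feature lists into one vector and confirms that the six classes sum to 91 dimensions. The per-class sizes are taken from the paragraph above; the dictionary keys, the function name `concat_features`, and the placeholder zero values are illustrative assumptions and do not call the pyradiomics library.

```python
# Per-class feature counts as listed in the text above.
FEATURE_DIMS = {
    "shape3d": 16,  # 3D shape-based features
    "glcm": 24,     # gray level co-occurrence matrix
    "glrlm": 16,    # gray level run length matrix
    "glszm": 16,    # gray level size zone matrix
    "ngtdm": 5,     # neighboring gray-tone difference matrix
    "gldm": 14,     # gray level dependence matrix
}


def concat_features(per_class):
    """Concatenate per-class feature lists in a fixed class order.

    Raises ValueError if a class is missing or has an unexpected length.
    """
    vector = []
    for name, dim in FEATURE_DIMS.items():
        values = per_class.get(name)
        if values is None or len(values) != dim:
            raise ValueError(f"class {name!r} must provide {dim} values")
        vector.extend(values)
    return vector


if __name__ == "__main__":
    # Placeholder values standing in for features extracted from one tumor.
    dummy = {name: [0.0] * dim for name, dim in FEATURE_DIMS.items()}
    print(len(concat_features(dummy)))
```

Stacking one such vector per early-stage tumor gives the matrix on which an analysis like the one in Figure 2 would operate.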

Appendix C Generalizable to Multiple Organs

nnU-Net [isensee2021nnu]
source $\backslash$ target		liver	pancreas	kidneys
liver	real tumors	77.4	1.8	2.4
	Hu et al. [hu2023label]	78.0	56.3	61.9
	DiffTumor	80.9	60.7	71.4
pancreas	real tumors	1.5	67.0	2.4
	Hu et al. [hu2023label]	76.2	68.8	61.9
	DiffTumor	72.8	75.0	81.0
kidney	real tumors	0.9	0.9	59.5
	Hu et al. [hu2023label]	76.2	56.3	69.0
	DiffTumor	77.4	63.4	76.2

Swin UNETR [hatamizadeh2021swin]
source $\backslash$ target		liver	pancreas	kidneys
liver	real tumors	76.2	2.7	0
	Hu et al. [hu2023label]	79.2	63.4	71.4
	DiffTumor	83.1	69.6	73.8
pancreas	real tumors	1.6	70.5	4.8
	Hu et al. [hu2023label]	66.4	73.2	71.4
	DiffTumor	76.2	79.5	90.5
kidney	real tumors	0.6	0	69.0
	Hu et al. [hu2023label]	66.4	63.4	76.2
	DiffTumor	76.2	61.6	81.0

Table 3: Generalizable across various organs. We show the comparison of generalization for early-stage tumor detection (measured in tumor-wise Sensitivity %) using additional backbones. The scores highlighted in bold denote the superior performance in each respective domain. Consistent with the results on U-Net presented in Table 2, DiffTumor demonstrates superior performance across nearly all domains on nnU-Net and Swin UNETR.
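For readers unfamiliar with the metric in Table 3, the following is a minimal sketch of how tumor-wise Sensitivity could be computed. The detection criterion is an assumption: a ground-truth tumor is counted as detected when the predicted mask overlaps it by at least `min_overlap` voxels. The exact criterion used in the paper is not stated in this section, and the function name and voxel-set representation are illustrative.

```python
def tumor_wise_sensitivity(tumors, predicted_voxels, min_overlap=1):
    """Percentage of ground-truth tumors hit by the prediction.

    tumors: list of sets of (z, y, x) voxel coordinates, one set per tumor.
    predicted_voxels: set of (z, y, x) coordinates predicted as tumor.
    min_overlap: assumed minimum number of shared voxels for a detection.
    """
    if not tumors:
        return float("nan")  # sensitivity is undefined without any tumor
    detected = sum(
        1 for tumor in tumors if len(tumor & predicted_voxels) >= min_overlap
    )
    return 100.0 * detected / len(tumors)


if __name__ == "__main__":
    ground_truth = [
        {(0, 0, 0), (0, 0, 1)},  # overlapped by the prediction
        {(5, 5, 5)},             # missed
        {(9, 9, 9)},             # overlapped by the prediction
    ]
    prediction = {(0, 0, 1), (9, 9, 9), (3, 3, 3)}
    print(f"{tumor_wise_sensitivity(ground_truth, prediction):.1f}%")
```

Because the metric counts tumors rather than voxels, a small early-stage tumor weighs as much as a large one, which suits the early-detection setting studied here.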

U-Net	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	62.3	72.8	63.8	54.5	59.0	62.5
real tumors	NSD (%)	63.4	74.6	63.3	55.5	61.6	63.7
DiffTumor	DSC (%)	70.9	74.0	67.9	59.1	60.6	66.5
DiffTumor	NSD (%)	71.2	73.9	70.4	61.0	63.3	68.0
nnU-Net	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	64.3	70.2	64.3	56.3	59.6	62.9
real tumors	NSD (%)	65.7	72.7	63.1	59.3	62.6	64.7
DiffTumor	DSC (%)	73.6	73.9	67.6	64.9	63.8	68.8
DiffTumor	NSD (%)	75.3	73.9	67.9	69.0	66.5	70.5
Swin UNETR	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	65.1	69.4	57.4	59.0	58.2	61.8
real tumors	NSD (%)	65.9	71.7	53.1	62.1	61.9	62.8
DiffTumor	DSC (%)	71.4	71.7	71.6	62.2	62.4	67.9
DiffTumor	NSD (%)	73.5	72.4	74.5	66.5	66.0	70.6
real tumors denotes Segmentation Model trained on 95 CT scans containing real tumors.
DiffTumor denotes Segmentation Model trained on 95 CT scans containing real tumors and 116 healthy CT scans.

Table 4: Liver tumor segmentation performance on 5-fold cross-validation. We conduct a comparative analysis of the Segmentation Model (U-Net, nnU-Net, Swin UNETR) trained on both synthetic and real tumors against the model trained exclusively on real tumors, employing 5-fold cross-validation. The evaluation metrics employed include the Dice Similarity Coefficient (DSC) and the Normalized Surface Distance (NSD). DiffTumor consistently enhances liver tumor segmentation performance across these three backbones.

U-Net	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	56.0	51.9	45.5	59.4	43.2	51.2
real tumors	NSD (%)	51.0	49.9	43.6	57.7	40.2	48.5
DiffTumor	DSC (%)	64.8	58.0	57.7	67.9	51.8	60.0
DiffTumor	NSD (%)	60.5	55.3	58.3	67.5	47.9	57.9
nnU-Net	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	59.9	50.0	44.8	63.5	50.4	53.7
real tumors	NSD (%)	55.7	47.0	47.0	62.3	48.0	52.0
DiffTumor	DSC (%)	63.6	60.5	62.5	67.8	55.3	61.9
DiffTumor	NSD (%)	61.1	59.1	63.4	67.7	54.9	61.2
Swin UNETR	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	52.2	49.5	50.9	60.0	52.1	52.9
real tumors	NSD (%)	49.1	49.7	50.9	58.9	47.4	51.2
DiffTumor	DSC (%)	62.2	60.2	59.0	69.7	53.8	61.0
DiffTumor	NSD (%)	58.7	58.4	62.8	67.2	51.2	59.7
real tumors denotes Segmentation Model trained on 96 CT scans containing real tumors.
DiffTumor denotes Segmentation Model trained on 96 CT scans containing real tumors and 120 healthy CT scans.

Table 5: Pancreatic tumor segmentation performance on 5-fold cross-validation. We compare the Segmentation Model (U-Net, nnU-Net, Swin UNETR) trained on both synthetic and real tumors against the model trained exclusively on real tumors, using 5-fold cross-validation. The evaluation metrics are the Dice Similarity Coefficient (DSC) and the Normalized Surface Distance (NSD). DiffTumor consistently yields a significant improvement in pancreatic tumor segmentation across these three backbones. Pancreatic tumor segmentation is regarded as the most challenging task among the three abdominal organs in this study, and the improvement observed for it is also the most substantial of the three.

U-Net	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	75.1	68.0	69.0	78.1	70.6	72.0
real tumors	NSD (%)	68.4	59.0	57.7	68.3	62.4	63.2
DiffTumor	DSC (%)	84.2	76.7	79.4	80.6	74.1	79.0
DiffTumor	NSD (%)	76.6	64.5	70.7	71.7	65.8	69.9
nnU-Net	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	73.8	76.8	80.0	80.5	73.4	76.9
real tumors	NSD (%)	62.7	70.2	71.2	70.8	67.5	68.5
DiffTumor	DSC (%)	84.5	83.4	81.6	83.9	77.3	82.1
DiffTumor	NSD (%)	78.3	74.4	74.1	76.9	72.3	75.2
Swin UNETR	metrics	fold0	fold1	fold2	fold3	fold4	average
real tumors	DSC (%)	80.6	64.9	79.1	76.0	72.2	74.6
real tumors	NSD (%)	74.2	55.5	68.2	67.4	64.8	66.0
DiffTumor	DSC (%)	85.1	77.2	81.2	85.7	79.9	81.8
DiffTumor	NSD (%)	79.2	70.8	74.1	78.0	74.1	75.2
real tumors denotes Segmentation Model trained on 96 CT scans containing real tumors.
DiffTumor denotes Segmentation Model trained on 96 CT scans containing real tumors and 120 healthy CT scans.

Table 6: Kidney tumor segmentation performance on 5-fold cross-validation. We perform a comparative analysis of the Segmentation Model (U-Net, nnU-Net, Swin UNETR) trained on both synthetic and real tumors versus the model trained exclusively on real tumors, employing 5-fold cross-validation. Our evaluation metrics include the Dice Similarity Coefficient (DSC) and the Normalized Surface Distance (NSD). Similar to the other two tumor segmentation tasks, DiffTumor can deliver substantial improvements in kidney tumor segmentation across these three prevalent backbones.
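As a reminder of what the DSC columns in Tables 4 to 6 measure, the sketch below computes the Dice Similarity Coefficient between two voxel sets and the per-fold average reported in the last column. It is a minimal illustration, not the evaluation code used for the tables; returning 100 when both masks are empty is an assumed convention, and NSD is omitted because it also depends on a surface tolerance that this appendix does not specify.

```python
def dice_percent(pred, truth):
    """Dice Similarity Coefficient in percent: 2|P ∩ T| / (|P| + |T|).

    pred, truth: iterables of (z, y, x) voxel coordinates labeled as tumor.
    """
    pred, truth = set(pred), set(truth)
    denom = len(pred) + len(truth)
    if denom == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 100.0
    return 100.0 * 2 * len(pred & truth) / denom


def fold_average(scores):
    """Mean over cross-validation folds, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


if __name__ == "__main__":
    # U-Net, real tumors, liver DSC per fold (first row of Table 4).
    print(fold_average([62.3, 72.8, 63.8, 54.5, 59.0]))
```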

Appendix D Generalizable to Different Patient Demographics

Appendix E Dataset & Implementation Details

E.1 Dataset Details

Real-tumor datasets. LiTS [bilic2019liver] comprises 131 and 70 contrast-enhanced 3-D abdominal CT scans for training and testing, respectively. This dataset was compiled utilizing various scanners and protocols from six unique clinical sites, which resulted in a significant variation in in-plane resolution (ranging from 0.55 to 1.0 mm) and slice spacing (ranging from 0.45 to 6.0 mm). The MSD-Pancreas [antonelli2021medical] dataset comprises 420 portal-venous phase CT scans from patients who underwent pancreatic mass resection. It includes 281 CT scans designated for training and 139 CT scans for testing. The annotations provided correspond to the pancreatic parenchyma and pancreatic mass. KiTS [heller2020international] includes 210 CT scans for training and 90 CT scans for testing. Each CT scan features one or more kidney tumors. The University of Minnesota Medical Center provides the annotations.

Healthy-organ datasets. AbdomenAtlas-8K [qu2023annotating] is currently the most extensive multi-organ dataset, with annotations for the spleen, liver, kidneys, stomach, gallbladder, pancreas, aorta, and IVC in 8,448 CT volumes, which equates to 3.2 million slices. AbdomenAtlas-8K consolidates datasets from 26 distinct hospitals worldwide. In this study, we utilize the CLIP-Driven Universal Model [liu2023clip, zhang2023continual] to select CT scans that feature the corresponding healthy abdominal organs. This model, ranking first in the Medical Segmentation Decathlon (MSD) competition, has demonstrated high sensitivity and specificity in tumor detection. Consequently, we employ the pre-trained weights (https://github.com/ljwztc/CLIP-Driven-Universal-Model) using the Swin UNETR backbone to identify CT scans where the prediction includes the organs but does not include the corresponding tumors. Through this process, we have obtained 1,246 CT volumes with healthy livers, 1,901 CT volumes with healthy pancreas, and 1,005 CT volumes with healthy kidneys.

Proprietary dataset. It comprises 5,038 CT scans with 21 annotated organs, with each case having been scanned by contrast-enhanced CT in both venous and arterial phases, utilizing Siemens MDCT scanners. In this study, we utilize 532 CT scans containing 690 PDAC to assess the generalizability of the Segmentation Model across varied patient demographics. The test set we used includes 243 CT scans of males and 289 CT scans of females. Additionally, the age range of these patients spans from 20 to 146, essentially covering all stages of a person’s life.

E.2 Implementation Details

Autoencoder Model. In this study, we train the Autoencoder Model on a total of 9262 CT scans from the AbdomenAtlas-8K dataset and a private dataset. The purpose is to learn a general low-dimensional latent representation of CT scans. The model encodes the CT volume into a latent feature whose height, width, and depth are each 1/4 of those of the original input volume. We set the codebook size and dimensionality to 16384 and 8, respectively. CT scan orientation is adjusted to specific axcodes, and each scan is resampled with isotropic spacing to a uniform voxel size of $1.0\times 1.0\times 1.0\,\text{mm}^{3}$. Additionally, the intensity in each scan is truncated to the range $\left[-175,250\right]$ and then linearly normalized to $\left[-1,1\right]$. During training, we crop random fixed-size $96\times 96\times 96$ regions. We employ the Adam optimizer with $\beta_{1}$ and $\beta_{2}$ set to 0.9 and 0.999, respectively, a learning rate of 0.0003, and a batch size of 4 per GPU. Training takes about a week on a node with four A100 GPUs, completing 200k iterations.
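The intensity preprocessing and the latent size described above can be made concrete with a small sketch. It assumes that the truncation range is in Hounsfield units and that the spatial reduction means each side shrinks to one quarter of its input length; the function names `normalize_intensity` and `latent_shape` are illustrative and are not taken from the released code.

```python
def normalize_intensity(value, lo=-175.0, hi=250.0, out_lo=-1.0, out_hi=1.0):
    """Clip an intensity to [lo, hi], then map it linearly to [out_lo, out_hi]."""
    clipped = min(max(value, lo), hi)
    return out_lo + (clipped - lo) * (out_hi - out_lo) / (hi - lo)


def latent_shape(shape, factor=4):
    """Spatial shape of the latent feature, assuming each side shrinks by `factor`."""
    return tuple(side // factor for side in shape)


if __name__ == "__main__":
    for value in (-1000.0, -175.0, 37.5, 250.0, 3000.0):
        print(value, normalize_intensity(value))
    print(latent_shape((96, 96, 96)))
```

Under these assumptions a $96\times 96\times 96$ training crop corresponds to a $24\times 24\times 24$ latent grid. Passing `out_lo=0.0` gives the $\left[0,1\right]$ normalization used for the Segmentation Model.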

Diffusion Model. In this study, we train the corresponding Diffusion Model specifically for tumors of three different abdominal organs. The data preprocessing carried out during the training phase is identical to the approach used for training the Autoencoder Model. We use the Adam optimizer with $\beta_{1}$ and $\beta_{2}$ set to 0.9 and 0.999, respectively, a learning rate of 0.0001, and a batch size of 10 per GPU. Training takes about a day on a node with an A100 GPU for 60k iterations.

Segmentation Model. The code for the Segmentation Model is implemented in Python using MONAI (https://monai.io/). In this study, we implement Swin UNETR based on the Swin UNETR Base variant. The orientation of CT scans is adjusted to specific axcodes. Each scan is resampled with isotropic spacing to a uniform voxel size of $1.0\times 1.0\times 1.0\,\text{mm}^{3}$. The intensity in each scan is truncated to the range $\left[-175,250\right]$ and then linearly normalized to $\left[0,1\right]$. During training, we crop random fixed-size $96\times 96\times 96$ regions whose center is a foreground or background voxel according to a predefined ratio. Additionally, the input patch is randomly rotated by 90 degrees and the intensity is shifted with a 0.1 offset, with probabilities of 0.1 and 0.2, respectively. To avoid confusion between the organs on the right and left sides, mirroring augmentation is not employed. All models on real tumors are trained for 3,000 epochs and models on synthetic and real tumors are trained for 2,000 epochs. The base learning rate is set to 0.0002, and the batch size is set to two. We adopt the linear warmup strategy and the cosine annealing learning rate schedule. For details on the tumor synthesis process during the training of the Segmentation Model, please refer to the provided code. For inference, we use the sliding window strategy with the overlapping area ratio set to 0.75. To rule out tumor mask predictions that do not belong to the respective organs, we use the pseudo labels of organs obtained through [liu2023clip] to post-process the predictions of the Segmentation Models.
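To give a feel for what an overlap ratio of 0.75 implies at inference time, the sketch below lists the window start positions along one axis. It is a simplified stand-in written for illustration, not MONAI's implementation; it assumes a window of 96 voxels (the training crop size) and a stride equal to the window size times one minus the overlap, and the function name `window_starts` is hypothetical.

```python
import math


def window_starts(length, window=96, overlap=0.75):
    """Start indices of sliding windows along one axis.

    The stride is window * (1 - overlap); the last window is shifted back
    so that it ends exactly at the volume boundary.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    if length <= window:
        # A single (padded) window covers the whole axis.
        return [0]
    stride = max(int(window * (1 - overlap)), 1)
    count = math.ceil((length - window) / stride) + 1
    return [min(i * stride, length - window) for i in range(count)]


if __name__ == "__main__":
    print(window_starts(192))
```

With these settings the stride is 24 voxels, so interior voxels are covered by several overlapping windows along each axis. This trades extra inference time for smoother predictions at window borders.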

Appendix F Discussion

F.1 Comparison with Related Works

In recent works, Hu et al. [hu2023label] have synthesized tumors in the liver using a model-based approach. This approach, guided by radiologists, involves several image-processing operations such as ellipse generation, elastic deformation, salt-noise generation, Gaussian filtering, scaling, and clipping. The synthetic tumors are realistic in comparison to real liver tumors. Notably, the AI trained with synthetic tumors achieves segmentation/detection performance that is comparable to the performance of the AI trained with real tumors.

However, the approach of Hu et al. [hu2023label] requires significant effort and expertise to identify the proper imaging characteristics of tumors. In other words, the synthetic tumors must be explicitly specified by radiologists, are tailored to specific tumor types, and must be redesigned for tumors in other organs. To demonstrate the superiority of DiffTumor for enhancing tumor segmentation in the three abdominal organs, we compare it with the representative tumor synthesis strategy of Hu et al. [hu2023label]. The results can be found in Table 7.

Methods	liver		pancreas		kidneys
Methods	DSC (%)	NSD (%)	DSC (%)	NSD (%)	DSC (%)	NSD (%)
real tumors	62.3	63.4	56.0	51.0	75.1	68.4
Hu et al. [hu2023label]	69.7	70.9	55.9	49.9	80.8	71.0
DiffTumor	70.9	71.2	64.8	60.5	84.2	76.6

Table 7: Comparison for tumor segmentation enhancement. The comparison for all-stage tumor segmentation is conducted on the U-Net backbone. Although the method of Hu et al. [hu2023label] is designed for liver tumor synthesis, DiffTumor brings a larger improvement in liver tumor segmentation. The tumors synthesized by Hu et al. [hu2023label] also boost the DSC and NSD scores for kidney tumor segmentation, but DiffTumor yields better results. Additionally, when the tumors synthesized by Hu et al. [hu2023label] are added for pancreatic tumor segmentation, the DSC and NSD scores drop slightly compared with training solely on real tumors. This suggests that their synthesis strategy may not be suitable for pancreatic tumors. In contrast, DiffTumor significantly improves pancreatic tumor segmentation. These results underline the superiority of DiffTumor in enhancing tumor segmentation across various abdominal organs.

F.2 Unrealistic Generation

Although DiffTumor is capable of generating highly realistic tumors, it also occasionally produces some that are less convincing. Consequently, about 50% of the tumors are identified as inauthentic by the more experienced radiologist in the Visual Turing test. The tumors deemed inauthentic fall short in several aspects, such as shape, attenuation and noise distribution. Some synthetic tumors have inaccurate shapes, resembling flat, strip-like, or sickle-shaped lesions. In contrast, early-stage tumors originating from parenchymal organs typically exhibit a round or oval shape. Larger tumors fail to display a mass effect, characterized by the displacement of normal structures due to the tumor’s inherent volume. Furthermore, some synthetic tumors inaccurately display a lower density, which is similar to that of fat or fluid. Finally, the noise distribution in some synthetic tumors does not match that in the CT background. We show several unrealistic generation cases in Figure 15.

F.3 Challenging Cases Analysis

Instances of low performance are observed with all Segmentation Models (U-Net, nnU-Net, Swin UNETR) trained on both synthetic and real tumors. These instances often involve tumors with uncommon imaging features, as identified by experienced radiologists. Examples of these tumors from the liver, kidney, and pancreas are provided in Figure 16.