Deep Dive into Probabilistic Delta Debugging: Insights and Simplifications

Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian¹, Xinru Cheng, Chengnian Sun
University of Waterloo, Waterloo, Canada
{m492zhan, zhenyang.xu, x59cheng, cnsun}@uwaterloo.ca
¹The Hong Kong University of Science and Technology, Hong Kong, China
yqtian@ust.hk

Abstract

Given a list $L$ of elements and a property $\psi$ that $L$ exhibits, ddmin is a well-known test input minimization algorithm designed to automatically eliminate $\psi$-irrelevant elements from $L$. This algorithm is extensively adopted in test input minimization and software debloating. Recently, ProbDD, an advanced variant of ddmin, has been proposed and achieved state-of-the-art performance. Employing Bayesian optimization, ProbDD predicts the likelihood of each element in $L$ being pertinent to $\psi$, and statistically decides which elements and how many should be removed each time. Despite its impressive results, the theoretical probabilistic model of ProbDD is complex, and the specific factors driving its superior performance have not been thoroughly investigated.

In this paper, we conduct the first in-depth theoretical analysis of ProbDD, clarifying the trends in probability and subset size changes and simplifying the probability model. This analysis is complemented by empirical experiments, including success rate analysis, ablation studies, and examinations of trade-offs and limitations, to further understand and demystify this state-of-the-art algorithm. Our success rate analysis reveals how ProbDD effectively addresses bottlenecks that slow down ddmin by skipping inefficient queries that attempt to delete complements of subsets and previously tried subsets. The ablation study illustrates that randomness in ProbDD has no significant impact on efficiency.

Based on these findings, we propose CDD, a simplified version of ProbDD, which reduces the complexity in both theory and implementation. CDD assists in validating the correctness of our key findings, such as the role of probabilities in ProbDD serving as monotonically increasing counters for each element, and in identifying the main factors that contribute to ProbDD’s superior performance. Comprehensive evaluations across $76$ benchmarks in test input minimization and software debloating demonstrate that CDD can achieve the same performance as ProbDD, despite its simplification. These insights provide valuable guidance for future research and applications of test input minimization algorithms.

Index Terms: Program Reduction, Delta Debugging, Software Debloating, Test Input Minimization

I Introduction

Delta Debugging [1] is a seminal family of algorithms designed for software debugging, among which ddmin stands out as a classic test input minimization (a.k.a. test input reduction) algorithm. Given a list $L$ of elements (modeling the test input) and a property $\psi$ that $L$ exhibits, ddmin aims to remove elements in $L$ that are irrelevant to $\psi$, such that the resulting list is smaller than $L$ yet still satisfies $\psi$. The ddmin algorithm plays a crucial role in software testing, debugging and maintenance [2, 3, 4, 5, 6], since compact, informative bug-triggering inputs make it easier for developers to identify root causes than large bug-triggering inputs cluttered with bug-irrelevant information [7].

To minimize a test input $I$ that satisfies $\psi$, ddmin has been used in two primary manners. In the first manner, $I$ is initially segmented into a list, denoted as $L$, which could be segmented based on characters, tokens, lines, etc. Subsequently, ddmin is directly applied to $L$ [1, 8]. Alternatively, ddmin serves as a pivotal component within advanced, structure-aware test input minimization algorithms, including Perses [9], HDD [10], C-Reduce [12], and Chisel [13]. These algorithms leverage the inherent structures of $I$ to expedite the minimization process or further reduce its size. Generally, these algorithms start by parsing $I$ into a tree structure, such as a parse tree. They then iteratively extract a list $L$ of tree nodes from the tree using heuristics and apply ddmin to $L$ to gradually condense the tree. Both manners underscore the fundamental role of ddmin as the cornerstone of test input minimization.

In the past years, different variants of ddmin have been proposed to improve its performance [13, 14, 15, 16], among which Probabilistic Delta Debugging (ProbDD) [14] is the state of the art, with notable superiority to other algorithms [1, 13]. When reducing $L$, ProbDD utilizes a theoretical probabilistic model based on Bayesian optimization to predict how likely every element in $L$ is essential to preserve the property $\psi$, by assigning a probability to each element. ProbDD prioritizes deleting elements with lower probabilities, as such elements generally have a lower possibility of being $\psi$-relevant. Before each deletion attempt, an optimal subset of elements is determined by maximizing the Expected Reduction Gain. (In each attempt, the Expected Reduction Gain is defined as the expected number of elements removed. A higher Expected Reduction Gain is preferred, as it indicates an expectation to delete more elements through this attempt.) If the deletion of this subset fails to preserve $\psi$, the probabilistic model increases the probability assigned to each element in the subset. As reported [17], aided by such a probabilistic model, ProbDD significantly outperforms ddmin by reducing the execution time and the query number. (A query is a run of the property test $\psi$.)

However, this probabilistic model in ProbDD is rather intricate, and the underlying mechanisms for its superior performance have not been adequately studied. The original paper of ProbDD merely showed its performance numbers without deep ablation analysis on such achievements. Specifically, the following questions are important to the research field of test input minimization, but have not been answered yet.

1. What is the fundamental role of probabilities in ProbDD, and can they be simplified without impacting performance?
2. What specific bottlenecks does ProbDD overcome to achieve improvement compared to ddmin?
3. How does randomness in ProbDD contribute to the performance improvement?
4. What are the potential limitations of ProbDD?

Gaining a deeper understanding of the state of the art, i.e., ProbDD, is highly valuable for test input minimization tasks. By clarifying the intrinsic reasons behind its superiority, we can help researchers understand the essence of the probabilistic model, as well as its strengths and limitations. Such demystification, in our view, paves the way for future research and guides users to apply ddmin and its variants more effectively for test input minimization.

To this end, we conduct the first in-depth analysis of ProbDD, starting by theoretically simplifying its probabilistic model. In the original ProbDD, probabilities are used to calculate the Expected Reduction Gain, which is subsequently used to determine the next subset size. However, this process necessitates iterative calculations, impeding the simplification and comprehension of ProbDD. In our study, we initially establish the analytical correlation between the probability and subset size, allowing for probabilities and subset sizes to be explicitly calculated through formulas, thus eliminating the need for iterative updates. Further, through mathematical derivation, we discover that the probability and subset size can be considered nearly independent, each varying at an approximate ratio on their own. By theoretical prediction, the probability increases approximately by a factor of $\frac{1}{1-e^{-1}}$ ( $\approx$ 1.582), while the subset size decreases by a factor of $1-e^{-1}$ ( $\approx$ 0.632), thus providing the potential for simplifying ProbDD.

Building upon our theoretical analysis, we conducted extensive evaluations of ddmin, ProbDD, and CDD across $76$ diverse benchmarks. The experimental results confirm the correctness of our theoretical analysis, demonstrate how ProbDD addresses bottlenecks in ddmin by skipping inefficient queries, reveal the impact of randomness on results, and highlight the limitations of ProbDD. These findings provide valuable guidance for future research and the development of test input minimization algorithms.

Based on the aforementioned analysis, we propose Counter-Based Delta Debugging (CDD), a simplified version of ProbDD, to explain ProbDD’s high performance. By replacing probabilities with counters, CDD eliminates the probability computations required by ProbDD, thus reducing theoretical and implementation complexity. Our experiments demonstrate that CDD aligns with ProbDD in both effectiveness and efficiency, which validates our previous analysis and findings.

Key Findings. Through both theoretical analysis and empirical experiments, our key findings are:

1. Through theoretical derivation, the probabilities in ProbDD essentially serve as monotonically increasing counters, and can be simplified. This suggests that the probability mechanism itself may not be a critical factor in ProbDD’s superior performance.
2. The performance bottlenecks addressed by ProbDD are inefficient deletion attempts on complements of subsets and previously tried subsets, which should be considered to enhance efficiency.
3. Randomness in ProbDD has no significant impact on the performance. Test input minimization is an NP-complete problem; randomness in ProbDD does not enhance the likelihood of finding optimal solutions.
4. ProbDD is faster than ddmin, but at the cost of not guaranteeing 1-minimality. (A list is considered to have 1-minimality if removing any single element from it results in the loss of its property.) The trade-off between effectiveness and efficiency is inevitable, and should be leveraged accordingly in different scenarios.

Contributions. We make the following major contributions.

• We perform the first in-depth theoretical analysis for ProbDD, the state-of-the-art algorithm in test input minimization tasks, and identify the latent correlation between the subset size and the probability of elements.
• We propose CDD, a much simplified version of ProbDD.
• We evaluate ddmin, ProbDD and CDD on $76$ benchmarks, validating the correctness of our theoretical analysis. Additional experiments and statistical analysis on ProbDD further explain its superior performance, reveal the effectiveness of randomness, and the limitations of ProbDD.

Paper Organization. The remainder of the paper is structured as follows: Section II introduces the symbols used in this study and the detailed workflow of ddmin and ProbDD. Section III and Section IV present our in-depth analysis of ProbDD, simplifying the model of probability and subset size. Section V describes empirical experiments and their results, from which additional findings are derived. Section VI introduces CDD, which simplifies ProbDD based on our earlier findings while maintaining equivalent performance. Section VII discusses related work and Section VIII concludes this study.

II Preliminaries

To facilitate comprehension, Table I lists all the important symbols used in this paper. Next, this section introduces ddmin and ProbDD, with the running example shown in Fig. 1(a).

TABLE I: The symbols used in this paper.

Symbol	Description	Symbol	Description
$L$	the list to minimize	$s$	the size of $S$
$\psi$	the property to preserve	$E(s)$	Expected Reduction Gain with the first $s$ elements
${l}_{i}$	the $i$ -th element of $L$	$e$	Euler’s number
${l}_{i}.p$	the probability of ${l}_{i}$	$r$	the round number
$v_{i}$	a variant of $L$	$s_{r}$	the subset size in round $r$
$S$	a subset of $L$	$p_{r}$	the probability of each element in round $r$

Element	Code	(a) Original	(b) By ddmin	(c) By ProbDD
${l}_{1}$	import math, sys	kept	kept	removed
${l}_{2}$	input = sys.argv[1]	kept	kept	removed
${l}_{3}$	a = int(input)	kept	kept	removed
${l}_{4}$	b = math.e	kept	kept	removed
${l}_{5}$	c = 3	kept	removed	kept
${l}_{6}$	d = pow(b, a) + c	kept	kept	removed
${l}_{7}$	c = math.log(d, b)	kept	kept	removed
${l}_{8}$	crash(c)	kept	kept	kept

Figure 1: A running example in Python. Fig. 1(a) shows the original program, represented as a list of 8 elements (${l}_{1}$, ${l}_{2}$, $\cdots$, ${l}_{8}$), in which ${l}_{8}$ (i.e., crash(c)) triggers the crash. Fig. 1(b) and Fig. 1(c) show the minimized results by ddmin and ProbDD, with removed elements masked in gray in the original figure and marked as "removed" here. Both minimized programs still trigger the crash. Note that ProbDD cannot consistently guarantee the result in Fig. 1(c) and might produce larger results, due to its inherent randomness.

TABLE II: Step-by-step outcomes from ddmin on the running example. In each column, a variant is generated and tested against the property $\psi$. These variants are sequentially generated from left to right. The first row displays the variant identifier, and the second row displays the round number $r$ and subset size $s$. In the following rows, the symbol “✓” denotes that an element is included in a certain variant, while gray cells signify that the element has been removed. For the last row, T indicates that the variant still preserves the property $\psi$, whereas F indicates that it does not.

[Table body not recoverable from the extracted text. Columns: the initial list (all of ${l}_{1}$ to ${l}_{8}$ marked “✓”) followed by variants v₁ to v₃₀, grouped into rounds $r=1$ ($s$=4), $r=2$ ($s$=2) and $r=3$ ($s$=1). Rows: elements ${l}_{1}$ to ${l}_{8}$ and the outcome of $\psi$.]

II-A The ddmin Algorithm

The ddmin algorithm [1] is the first algorithm to systematically minimize a bug-triggering input to its essence. It takes the following two inputs:

• $L$: a list of elements representing a bug-triggering input. For example, $L$ can be a list of bytes, characters, lines, tokens, or parse tree nodes extracted from the bug-triggering input.
• $\psi$: a property that $L$ has. Formally, $\psi$ can be defined as a predicate that returns T if a list of elements preserves the property, F otherwise.

and returns a minimal subset of $L$ that still preserves $\psi$, from which excluding any single element will make the minimal subset lose $\psi$. This algorithm has been widely used in practice to facilitate developers in debugging [12, 9, 8, 24]. It generally consists of the following three steps.

Initialize. Start by setting the initial subset size $s$ to half of the input list $L$ , i.e., $s$ = $|L|/2$ .

Step 1: Minimize to Subset. Divide the input list $L$ into subsets, each with the current subset size. For each subset $S$, check whether $S$ alone satisfies $\psi$. If yes, keep only $S$ and restart from Step 1 with $L=S$ and the subset size as half of the new $L$; otherwise go to Step 2.

Step 2: Minimize to Complement. Test whether the complement of each subset $S$ (i.e., $L\setminus S=\{e\mid e\in L\wedge e\notin S\}$) satisfies $\psi$. If yes, keep the complement of $S$ and restart from Step 2 with $L=L\setminus S$. Otherwise, go to Step 3.

Step 3: Subdivide. If any of the remaining subsets has at least two elements and thus can be further divided, halve the subset size, i.e., $s=s/2$ and go back to Step 1. If no subset can be further divided (i.e., the subset size is 1), ddmin terminates and returns the remaining elements as the result.

Round Number $r$. Note that we introduce a round number $r$ in the second row of Table II. Within each round, the list $L$ is divided into subsets of a fixed size, on which Step 1 and Step 2 are applied. A new round begins when no further progress can be made with the current subset size. This round number is not explicitly present in the original ddmin algorithm but exists implicitly. In subsequent sections, we will also use this concept to introduce and simplify the ProbDD algorithm.

Table II illustrates the step-by-step minimization process of ddmin with the running example in Fig. 1(a). Initially, the input $L$ is [ ${l}_{1}$ , ${l}_{2}$ , $\cdots$ , ${l}_{8}$ ]. The ddmin algorithm iteratively generates variants by gradually decreasing the size of subsets from 4, 2 to 1.

1. Round 1 ($s$=4). At the beginning, ddmin splits $L$ into two subsets and generates two variants v₁ and v₂. However, neither of them preserves $\psi$.
2. Round 2 ($s$=2). Next, ddmin continues to subdivide these two subsets into smaller ones, and generates eight variants (i.e., v₃, v₄, $\cdots$, v₁₀) by using these subsets and their complements. Specifically, the first four variants (v₃, v₄, v₅, v₆) are the subsets, and the next four variants (v₇, v₈, v₉, v₁₀) are the complements of these subsets. Again, none of these eight variants preserves $\psi$.
3. Round 3 ($s$=1). Finally, ddmin decreases subset size $s$ from 2 to 1, and generates more variants. This time, v₂₃, which is the complement of the subset {${l}_{5}$}, preserves $\psi$. Hence, the subset {${l}_{5}$} is permanently removed from $L$. Then for each of the remaining subsets {${l}_{1}$}, {${l}_{2}$}, $\cdots$, {${l}_{8}$}, ddmin restarts testing the complement of each subset, i.e., from v₂₄ to v₃₀. However, none of these variants preserves $\psi$, and no subset can be further divided, so ddmin terminates with the variant v₂₃ as the final result.
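The three steps above can be sketched in Python as follows. This is a minimal sketch of the classic formulation of ddmin, which tracks the number of subsets `n` instead of the subset size $s$ (so $s$ is roughly $|L|/n$); the function name `ddmin`, the list-based representation, and the toy property `toy_psi` are illustrative assumptions and are not taken from the paper or from any existing tool.

```python
import math


def ddmin(items, psi):
    """Minimize `items` while `psi` keeps returning True.

    Assumes psi(items) is True on entry. `n` is the number of subsets,
    so the subset size is roughly len(items) / n.
    """
    n = 2
    while len(items) >= 2:
        size = math.ceil(len(items) / n)
        subsets = [items[i:i + size] for i in range(0, len(items), size)]
        reduced = False

        # Step 1: try to reduce to a single subset.
        for subset in subsets:
            if psi(subset):
                items, n, reduced = subset, 2, True
                break

        # Step 2: try to reduce to a complement. With two subsets the
        # complements are the subsets themselves, so they are skipped.
        if not reduced and len(subsets) > 2:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if psi(complement):
                    items, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: subdivide, or stop at single-element granularity.
        if not reduced:
            if n >= len(items):
                break
            n = min(2 * n, len(items))
    return items


# Toy property (hypothetical): holds iff elements 5 and 8 are both present.
def toy_psi(xs):
    return {5, 8} <= set(xs)


print(ddmin(list(range(1, 9)), toy_psi))  # [5, 8]
```

The toy property is deliberately simpler than the crash property of the running example, so the sequence of variants differs from Table II; the sketch only illustrates the control flow of the three steps.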

II-B Probabilistic Delta Debugging (ProbDD)

Wang et al. [14] proposed the state-of-the-art algorithm ProbDD, significantly surpassing ddmin in minimizing bug-triggering programs on C compilers and benchmarks in software debloating. ProbDD employs Bayesian optimization [18] to model the minimization problem. ProbDD assigns a probability to each element in $L$, representing its likelihood of being essential for preserving the property $\psi$. At each step during the minimization process, ProbDD selects a subset of elements expected to yield the highest Expected Reduction Gain, and targets the elements in this subset for deletion. In this section, we outline ProbDD’s workflow in Algorithm 1, paving the way for a deeper understanding and analysis of ProbDD.

Algorithm 1: ProbDD($L,\psi$)

Input: $L$: a list to be minimized.
Input: $\psi:\mathbb{L}\rightarrow\mathbb{B}$: the property to be preserved by $L$.
Input: $p_{0}$: the initial probability given by the user.
Output: the minimized list that still exhibits the property $\psi$.

// Initialize the probability of each element with $p_{0}$.
foreach ${l}\in L$ do
    ${l}.p\leftarrow p_{0}$
// The round number $r$, initially 0. $r$ is not explicitly used in the original ProbDD algorithm. It is displayed for demonstrating ProbDD’s implicit principles.
$r\leftarrow 0$
while $\exists{l}\in L:{l}.p<1$ do
    // Select elements from $L$ for deletion attempt.
    $S\leftarrow\textnormal{SelectSubset}(L)$
    // Check if removing the subset preserves the property.
    $\texttt{temp}\leftarrow L\setminus S$
    if $\psi(\texttt{temp})$ = T then
        $L\leftarrow\texttt{temp}$
    else
        // Calculate the factor to update probabilities.
        $\texttt{factor}\leftarrow\frac{1}{1-\prod_{{l}\in S}(1-{l}.p)}$
        // Update the probabilities of elements in the subset.
        foreach ${l}\in S$ do
            ${l}.p\leftarrow\texttt{factor}\times{l}.p$
    if the probabilities of all elements have been updated then
        /* Move to the next round. */
        $r\leftarrow r+1$
return $L$

Function SelectSubset($L$):
    Input: $L$: a list of elements to be reduced.
    Output: The subset of elements that maximizes the Expected Reduction Gain.
    /* Sort $L$ by ascending probability, with elements having the same probability in random order. */
    $\texttt{sortedL}\leftarrow\textnormal{RandomizeThenSort}(L)$
    $S\leftarrow\emptyset$
    $\texttt{currentMaxGain}\leftarrow 0$
    foreach ${l}\in\texttt{sortedL}$ do
        $\texttt{tempSubset}\leftarrow S\cup\{{l}\}$
        $\texttt{gain}\leftarrow|\texttt{tempSubset}|\times\prod_{{l}\in\texttt{tempSubset}}(1-{l}.p)$
        if $\texttt{gain}>\texttt{currentMaxGain}$ then
            $\texttt{currentMaxGain}\leftarrow\texttt{gain}$
            $S\leftarrow\texttt{tempSubset}$
        else break
    return $S$

Initialize (the first loop of Algorithm 1). ProbDD assigns each element in $L$ an initial probability $p_{0}$, representing the prior likelihood that the element cannot be removed.

Step 1: Select elements (function SelectSubset in Algorithm 1). First, ProbDD sorts the elements in $L$ by probability in ascending order, and the order of elements with the same probability is determined randomly. Then, it calculates the subset to be removed in the next attempt via the proposed Expected Reduction Gain $E(s)$, as shown in Equation 1, where $E(s)$ denotes the expected gain obtained by removing the first $s$ elements of the sorted $L$, and ${l}_{i}.p$ denotes the current probability of the $i$-th element in $L$.

$\displaystyle E(s)=s\times\prod_{i=1}^{s}(1-{l}_{i}.p)$		(1)

Note that ProbDD has an invariant that the subset $S$ chosen for deletion attempt is always the first $s$ elements in $L$ . Every time, the first $s^{*}$ elements are selected as the optimal subset $S$ , where $s^{*}$ maximizes the Expected Reduction Gain $E(s)$ , elaborated as Equation 2.

$\displaystyle s^{*}=\operatorname*{arg\,max}_{s}E(s)$		(2)

Step 2: Delete the Subset (the if-else branch in the main loop of Algorithm 1). If $\psi$ is still preserved after the removal of $S$, ProbDD removes subset $S$, i.e., keeps only the complement of $S$, and proceeds to Step 1. If $\psi$ cannot be preserved after the removal, ProbDD updates the probability of each element in the subset $S$ via Equation 3, and resumes at Step 1. It is important to note that if the attempt to delete an element ${l}_{i}$ alone fails, its probability ${l}_{i}.p$ will be set to 1, indicating that this element cannot be removed and will no longer be considered for deletion.

$\displaystyle{l}_{i}.p\leftarrow\frac{{l}_{i}.p}{1-\prod_{{l}\in S}(1-{l}.p)}$		(3)

Step 3: Check Termination (the loop condition of Algorithm 1). If every element either has been deleted, or possesses a probability of 1, ProbDD terminates. If not, it returns to Step 1.

Round Number $r$. Similar to the concept of rounds in ddmin (see Table II), ProbDD also has an implicit round number $r$, as introduced in Algorithm 1 and the second row of Table III. During a round, the subset size is the same and every subset in $L$ is attempted for deletion. Once the probabilities of all elements have been updated, the next round begins (i.e., $r\leftarrow r+1$ in Algorithm 1).

TABLE III: Step-by-step outcomes from ProbDD on the running example. Similar to Table II, the round number, subset size and the details of each variant are presented. For each variant, the probability of each element is noted alongside: “✓” marks an element included in the variant, “×” marks an element excluded from it (i.e., attempted for deletion), and an empty cell means the element has been removed.

Element	Initial	v₁	v₂	v₃	v₄	v₅	v₆	v₇	v₈
Round		$r=1$ ($s$=4)	$r=1$ ($s$=4)	$r=2$ ($s$=2)	$r=2$ ($s$=2)	$r=3$ ($s$=1)	$r=3$ ($s$=1)	$r=3$ ($s$=1)	$r=3$ ($s$=1)
${l}_{1}$	0.25	× 0.37	✓ 0.37	× 0.61	✓ 0.61	✓ 0.61	✓ 0.61		
${l}_{2}$	0.25	✓ 0.25							
${l}_{3}$	0.25	✓ 0.25							
${l}_{4}$	0.25	× 0.37	✓ 0.37	✓ 0.37	× 0.61				
${l}_{5}$	0.25	× 0.37	✓ 0.37	× 0.61	✓ 0.61	✓ 0.61	× 1	✓ 1	✓ 1
${l}_{6}$	0.25	✓ 0.25							
${l}_{7}$	0.25	✓ 0.25							
${l}_{8}$	0.25	× 0.37	✓ 0.37	✓ 0.37	× 0.61	✓ 0.61	✓ 0.61	✓ 0.61	× 1
$\psi$		F	T	F	F	T	F	T	F

Table III illustrates the step-by-step results of ProbDD. Following the study of ProbDD [14], the initial probability $p_{0}$ is set to 0.25, resulting in subsets with a size of 4 as per Equation 2.

1. Round 1 ($s$=4). Similar to the example in the original paper of ProbDD [14], we assume ProbDD selects (${l}_{1}$, ${l}_{4}$, ${l}_{5}$, ${l}_{8}$) to delete due to the randomness, thus resulting in the variant v₁. However, v₁ fails to exhibit $\psi$, leading to the probability of these selected elements being updated from 0.25 to $\frac{0.25}{1-(1-0.25)^{4}}\approx 0.37$, based on Equation 3. Next, the remaining elements with lower probability, i.e., (${l}_{2}$, ${l}_{3}$, ${l}_{6}$, ${l}_{7}$), are prioritized and selected for deletion, resulting in v₂. This time, the property test passes and these elements are removed.
2. Round 2 ($s$=2). Given that the probabilities of all remaining elements become 0.37, the next subset size becomes 2. Subsequently, ProbDD attempts to remove the subset (${l}_{1}$, ${l}_{5}$) in v₃ and then the subset (${l}_{4}$, ${l}_{8}$) in v₄, but neither subset can be successfully removed. After these two attempts, all probabilities are updated to $\frac{0.37}{1-(1-0.37)^{2}}\approx 0.61$.
3. Round 3 ($s$=1). Finally, the subset size becomes 1, so each individual element is selected for removal on its own. The elements ${l}_{4}$ and ${l}_{1}$ are removed in v₅ and v₇, respectively, while ${l}_{5}$ and ${l}_{8}$ are verified as non-removable and are thus returned as the final result.
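The workflow of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the reference implementation of ProbDD: the names `probdd` and `select_subset`, the dictionary `prob` holding one probability per element, the `seed` parameter, and the requirement that elements be hashable and distinct are all choices made for illustration. Following the description of Step 2, a failed single-element deletion sets the probability to 1 explicitly.

```python
import random


def select_subset(items, prob, rng):
    """Pick the prefix (by ascending probability) that maximizes Eq. 1."""
    candidates = [x for x in items if prob[x] < 1.0]
    rng.shuffle(candidates)                  # random order among ties
    candidates.sort(key=lambda x: prob[x])   # stable sort keeps that order
    subset, best_gain, keep = [], 0.0, 1.0
    for x in candidates:
        keep *= 1.0 - prob[x]                # running product of (1 - p)
        gain = (len(subset) + 1) * keep      # E(s) for the extended prefix
        if gain <= best_gain:
            break
        subset.append(x)
        best_gain = gain
    return subset


def probdd(items, psi, p0=0.25, seed=0):
    """Sketch of Algorithm 1; elements must be hashable and distinct."""
    rng = random.Random(seed)
    items = list(items)
    prob = {x: p0 for x in items}
    while any(prob[x] < 1.0 for x in items):
        subset = select_subset(items, prob, rng)
        chosen = set(subset)
        remaining = [x for x in items if x not in chosen]
        if psi(remaining):
            items = remaining                # deletion succeeded
        elif len(subset) == 1:
            prob[subset[0]] = 1.0            # failed alone: cannot be removed
        else:
            keep = 1.0
            for x in subset:
                keep *= 1.0 - prob[x]
            factor = 1.0 / (1.0 - keep)      # Eq. 3
            for x in subset:
                prob[x] = min(1.0, factor * prob[x])
    return items
```

With equal probabilities of 0.37 this sketch selects two elements, and with 0.61 it selects one, matching Rounds 2 and 3 above. For $p_{0}=0.25$ the gains for $s=3$ and $s=4$ tie exactly (both equal $81/64$), so the size chosen in the first round depends on how ties are broken; the sketch uses the strict comparison written in Algorithm 1.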

III On the Probabilities in ProbDD

Beginning with this section, we will systematically present our findings. Each finding will be introduced by first stating the result, followed by a comprehensive explanation. Regarding theoretical proofs, we only demonstrate lemmas and theorems, with the corresponding proofs presented in the supplementary materials. In this section, we theoretically analyze the trend of probability changes across rounds.

An Illustrative Example. The running example illustrated in Table III leads to this finding. Observation reveals that after each element has been attempted for deletion once, i.e., completing one round, the probabilities of all remaining elements are updated. The initial probability is 0.25; after v₂, it changes to 0.37; following v₄, it increases to 0.61; and by the end of v₈, it reaches 1. Consequently, we hypothesize that with each deletion attempt, the probability approximately increases in a predictable manner. Through appropriate simplification, we can theoretically model this trend, and thereby model the entire progression of probability changes.
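The progression 0.25, 0.37, 0.61, 1 can be reproduced with a few lines of Python. The helper name `updated_probability` is an assumption made for this sketch; it applies Equation 3 to the special case in which all $s$ elements of the failed subset share the same probability $p$.

```python
def updated_probability(p, s):
    """Eq. 3 when all s elements of the failed subset share probability p."""
    return p / (1.0 - (1.0 - p) ** s)


p1 = updated_probability(0.25, 4)  # after a failed attempt with s = 4
p2 = updated_probability(p1, 2)    # after a failed attempt with s = 2
p3 = updated_probability(p2, 1)    # after a failed single-element attempt
print(round(p1, 2), round(p2, 2), round(p3, 2))  # 0.37 0.61 1.0
```

The sketch carries the unrounded value of `p1` into the second step, whereas the walkthrough of Table III starts from the rounded value 0.37; both give 0.61 after rounding to two decimals.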

III-A Assumption for Theoretical Analysis

Besides the above observation from a concrete example, theoretical analysis is necessary. To refine the mathematical model of ProbDD for easier representation, analysis and derivation, we assume that the number of elements in $L$ is always divisible by the subset size. With this assumption, the probability of each element will be updated in the same manner; as a result, before and after each round, the probabilities of all elements are always the same, as stated in Lemma III.1. This assumption is often applicable in practice. For instance, in the running example in Table III, before each round, the probabilities associated with each remaining element are identical, ensuring that all subsets are of identical size. Furthermore, the probabilities of elements are updated to the same next value after the round.

Lemma III.1.

If the number of elements in $L$ is always divisible by the subset size, the probabilities of all elements are always the same.

While it is not always possible for the number of elements to be divisible by the subset size, the elements will still be partitioned as evenly as possible. However, such indivisibilities make the theoretical simplification of ProbDD nearly impossible. Based on our observation when running ProbDD, being slightly uneven during partitioning does not significantly affect probability updates. Moreover, we will demonstrate that the simplified algorithm derived from this assumption has no significant difference from ProbDD in section VI, via thorough experimental evaluation.

III-B Probability vs. Subset Size Correlation

In the second step, we derive the correlation between probability and subset size. Based on the assumption in the previous step, the probability of each element is identical and represented as $p_{r}$ in round $r$ , thus the formula of Expected Reduction Gain from Equation 1 can be simplified to

$\displaystyle E(s)=s\times(1-p_{r})^{s}$		(4)

Given the probability of elements $p_{r}$ in round $r$, $s_{r}$ can be derived by treating $s$ as a continuous variable and setting the derivative of $E(s)$ to zero, i.e., $E^{\prime}(s_{r})=0$. Therefore, the optimal size $s_{r}$ to maximize $E(s)$ is $-\frac{1}{\ln(1-p_{r})}$. Subsequently, we can also deduce the next probability to be $p_{r+1}=\frac{p_{r}}{1-(1-p_{r})^{s_{r}}}$. In summary, the correlation between probability and subset size can be simplified as Equation 5 and Equation 6, in which subset size $s_{r}$ is determined by probability $p_{r}$, and probability $p_{r+1}$ in the next round is determined by both $p_{r}$ and $s_{r}$.

	$\displaystyle s_{r}$	$\displaystyle=-\frac{1}{\ln(1-p_{r})}$		(5)
	$\displaystyle p_{r+1}$	$\displaystyle=\frac{p_{r}}{1-(1-p_{r})^{s_{r}}}$		(6)
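For completeness, the first-order condition behind Equation 5 can be written out as a short worked derivation. It is a sketch under the assumptions that $s$ is treated as a real number and that $0<p_{r}<1$:

```latex
\begin{align*}
E(s) &= s\,(1-p_r)^{s},\\
E'(s) &= (1-p_r)^{s} + s\,(1-p_r)^{s}\ln(1-p_r)
       = (1-p_r)^{s}\bigl(1 + s\ln(1-p_r)\bigr),\\
E'(s_r) = 0 &\iff 1 + s_r\ln(1-p_r) = 0
       \iff s_r = -\frac{1}{\ln(1-p_r)}.
\end{align*}
```

Since $(1-p_r)^{s}>0$ and $\ln(1-p_r)<0$, the derivative is positive for $s<s_r$ and negative for $s>s_r$, so this stationary point is the maximum of $E(s)$.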

III-C Trend of Probability Changes

Through Equation 6, $p_{r+1}>p_{r}$ always holds, indicating a monotonic increase of the probability of elements. However, there is still room for simplification, as $s_{r}$ can be represented by $p_{r}$ , implying that $p_{r+1}$ can be represented solely by $p_{r}$ .

Lemma III.2.

$p$ is increased by a factor $\frac{1}{1-e^{-1}}$, i.e.,

$\displaystyle p_{r}=\frac{p_{r-1}}{1-e^{-1}}=\frac{p_{0}}{(1-e^{-1})^{r}}$		(7)

Therefore, through empirical observations on the running example, coupled with theoretical derivation and simplification, we have identified the pattern of probability changes w.r.t. the round number $r$ , i.e., $p_{r}=\frac{p_{0}}{(1-e^{-1})^{r}}\approx 1.582^{r}\times p_{0}$ .
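The constant factor can be checked numerically. Substituting Equation 5 into Equation 6 gives $(1-p_{r})^{s_{r}}=e^{s_{r}\ln(1-p_{r})}=e^{-1}$, so each round multiplies the probability by $\frac{1}{1-e^{-1}}$. The sketch below iterates the two equations with a real-valued subset size; the names `GROWTH` and `next_probability` are assumptions made for this illustration.

```python
import math

GROWTH = 1.0 / (1.0 - math.exp(-1.0))   # about 1.582


def next_probability(p):
    """One round of Eq. 5 followed by Eq. 6, with a real-valued s."""
    s = -1.0 / math.log(1.0 - p)         # Eq. 5
    return p / (1.0 - (1.0 - p) ** s)    # Eq. 6


p = 0.25
ratios = []
for _ in range(3):                       # p stays below 1 for three rounds
    q = next_probability(p)
    ratios.append(q / p)
    p = q

print([round(r, 3) for r in ratios])     # [1.582, 1.582, 1.582]
```

Because the subset size is real-valued here, the first update yields about 0.40 rather than the 0.37 of the running example, where the subset size is the integer 4.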

IV On the Size of Subsets in ProbDD

In this section, we theoretically analyze the trend of subset size changes across rounds.

IV-A Demystifying How Subset Size Changes

Our previous finding shows that the probability can be approximately estimated from the current round number via a constant factor. Consequently, we observe a similar pattern in the changes of the subset size in each round.

Lemma IV.1.

$s_{r+1}$ can be expressed solely by $s_{r}$:

$\displaystyle s_{r+1}=\frac{1}{\ln\left(\frac{1-e^{-1}}{e^{-\frac{1}{s_{r}}}-e^{-1}}\right)}$		(8)

Despite deriving that $s_{r+1}$ depends solely on $s_{r}$, the trend of subset size is still implicit and obscure. For a clearer approximation, we propose linear bounds of $s_{r+1}$ in terms of $s_{r}$.

Lemma IV.2.

The lower bound of $s_{r+1}$ w.r.t. $s_{r}$ is

$\displaystyle s_{r+1}\geq(1-e^{-1})s_{r}-1$		(9)

Lemma IV.3.

The upper bound of $s_{r+1}$ w.r.t. $s_{r}$ is

$\displaystyle s_{r+1}\leq(1-e^{-1})s_{r}$		(10)

Theorem IV.4.

Subset size $s$ is initialized by Equation 5, updated by Equation 8, and constrained by the two linear bounds in Equation 9 and Equation 10:

$\displaystyle\left\{\begin{array}{ll}s_{0}&=-\frac{1}{\ln(1-p_{0})}\\ s_{r+1}&=\frac{1}{\ln\left(\frac{1-e^{-1}}{e^{-\frac{1}{s_{r}}}-e^{-1}}\right)}\\ s_{r+1}&\leq(1-e^{-1})s_{r}\\ s_{r+1}&\geq(1-e^{-1})s_{r}-1\end{array}\right.$		(11)

Aided by these two bounds, we obtain a complete representation Equation 11 to model subset size $s$ . It is worth noting that the size decreases approximately by a factor $1-e^{-1}$ $\approx$ 0.632, until reaching 1. Alternatively speaking, the subset size $s_{r}$ after round $r$ is roughly $s_{0}\times 0.632^{r}$ , allowing the subset size to be analytically pre-determined, and thus providing the potential for simplification of ProbDD and leading to the proposal of CDD (see details in section VI).
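The recurrence and its two linear bounds can be explored numerically. The sketch below iterates Equation 8 from the initial size given by Equation 5 and stops once the size drops to 1 or below, since Equation 8 is only defined for $s_{r}>1$. The names `C`, `initial_size`, `next_size` and `size_sequence`, as well as the choice $p_{0}=0.01$, are assumptions made for this illustration.

```python
import math

C = 1.0 - math.exp(-1.0)   # about 0.632


def initial_size(p0):
    """Eq. 5 for round 0."""
    return -1.0 / math.log(1.0 - p0)


def next_size(s):
    """Eq. 8; only defined for s > 1."""
    return 1.0 / math.log(C / (math.exp(-1.0 / s) - math.exp(-1.0)))


def size_sequence(p0):
    """Real-valued subset sizes until the size drops to 1 or below."""
    sizes = [initial_size(p0)]
    while sizes[-1] > 1.0:
        sizes.append(next_size(sizes[-1]))
    return sizes


sizes = size_sequence(0.01)
for s_r, s_next in zip(sizes, sizes[1:]):
    # Eq. 9 and Eq. 10, with a small tolerance for rounding.
    assert C * s_r - 1.0 - 1e-9 <= s_next <= C * s_r + 1e-9
```

For $p_{0}=0.01$ the sizes start near 99.5 and 62.7, i.e., each step shrinks the size by slightly more than the factor 0.632, as the upper bound in Equation 10 predicts.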

V Empirical Experiments

In addition to the theoretical derivation above, we conduct an extensive experimental evaluation on ddmin and ProbDD to gain deeper insights and achieve further discoveries. Specifically, we reproduce the experiments on ddmin and ProbDD by Wang et al. [14], and then delve deeper into ProbDD, analyzing its randomness, the bottlenecks it overcomes, and its 1-minimality. Furthermore, we evaluate our proposed CDD (which will be presented in section VI), validating our previous theoretical analysis. Due to limited space, we present the results of both ProbDD and CDD together within this section, but this section primarily focuses on discussing ProbDD, while the next section will focus on CDD.

V-A Benchmarks

To extensively evaluate ddmin, ProbDD and CDD, we use the following three benchmark suites ( $76$ benchmarks in total), covering various use scenarios of minimization algorithms.

• Benchmark-C ($\text{BM}_{\text{C}}$): 20 large bug-triggering programs in C language, each of which triggers a real-world compiler bug in either LLVM or GCC. The original size of benchmarks ranges from 4,397 tokens to 212,259 tokens. This benchmark suite has been used to evaluate test input minimization work [9, 14, 19].

• Benchmark-Debloat ($\text{BM}_{\text{DBT}}$): source programs of 10 command-line utilities. The original size of benchmarks ranges from 34,801 tokens to 163,296 tokens. This benchmark suite was collected by Heo et al. [13] and used to evaluate software debloating techniques [13, 20, 21].

• Benchmark-XML ($\text{BM}_{\text{XML}}$): 46 XML inputs triggering 8 unique bugs in Basex, a widely-used XML processing tool. The original size of benchmarks ranges from 19,290 tokens to 20,750 tokens. This benchmark suite is generated via Xpress [22] and collected by the authors of this study, as the original XML dataset used in ProbDD paper is not publicly available.

V-B Evaluation Metrics

We measure the following aspects as metrics.

Final Size. This metric assesses the effectiveness of reduction. When reducing a list $L$ with a certain property $\psi$, a smaller final list is preferred, indicating that more irrelevant elements have been successfully eliminated. In all benchmark suites, the metric is measured by the number of tokens.

Execution Time. The execution time of a minimization algorithm reflects its efficiency. A minimization algorithm taking less time is more desirable, and execution time is measured in seconds.

Query Number. This metric further evaluates the algorithm’s efficiency. During the reduction process, each time a smaller variant is produced, the algorithm verifies whether this variant still preserves the property $\psi$, referred to as a query. Since queries consume time, a lower query number is favorable.

P-value. We calculate the p-value via a paired t-test between every two algorithms, to investigate whether the performance differences are significant. A p-value below 0.05 denotes a significant distinction between the two groups of data. Otherwise, the observed difference lacks statistical significance.
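To make the paired comparison concrete, the following minimal sketch computes the paired t statistic using only the Python standard library. The helper name `paired_t_statistic` is hypothetical, and the sample data are the execution times of the nine $\text{BM}_{\text{C}}$ benchmarks finished by all algorithms in Table IV. This subset is used purely for illustration, so the result is not expected to reproduce the p-values reported in this paper, which cover all completed benchmarks.

```python
import math
import statistics


def paired_t_statistic(xs, ys):
    """Return (t, degrees_of_freedom) for a paired t-test on equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(xs, ys)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    n = len(diffs)
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1


# Execution times (s) of ddmin and ProbDD on the nine BM_C benchmarks
# completed by all algorithms, copied from Table IV.
DDMIN_TIME = [1450, 2787, 9010, 6972, 7752, 10729, 4993, 4334, 111]
PROBDD_TIME = [975, 1235, 3501, 2653, 3469, 6220, 2780, 3128, 104]

if __name__ == "__main__":
    t, dof = paired_t_statistic(DDMIN_TIME, PROBDD_TIME)
    print(f"t = {t:.2f} with {dof} degrees of freedom")
```

The p-value then follows from the Student's t distribution with $n-1$ degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the two-sided p-value directly.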

V-C The Wrapping Frameworks

The ddmin algorithm and its variants usually serve as the fundamental algorithm. To apply them to a concrete scenario, an outer wrapping framework is generally needed to handle the structure of the input. In our evaluation, we choose the same wrapping frameworks as those used by the ProbDD paper. For tree-structured bug-triggering inputs, i.e., $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{XML}}$, we use Picireny 21.8 [23], an implementation of HDD [10]. Picireny parses such inputs into trees, and then invokes Picire 21.8 [24], an open-source Delta Debugging library with ddmin, ProbDD and CDD implemented, to reduce each level of the trees. For software debloating on $\text{BM}_{\text{DBT}}$, Chisel [13] is employed, in which ddmin, ProbDD and CDD are integrated.

All experiments are conducted on a server running Ubuntu 22.04.3 LTS, equipped with Intel Xeon Gold 6348 CPUs @ 2.60GHz, providing a total of 120 threads, and 4 TB of RAM. To ensure reproducibility, we employ docker images to release the source code and the configuration. Each benchmark is reduced using a single thread. Following the ProbDD paper, we run each algorithm on each benchmark 5 times and calculate the geometric average of the results.
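The aggregation over repeated runs can be sketched as follows. This is a minimal illustration using `statistics.geometric_mean` from the Python standard library (available since Python 3.8). The helper name `aggregate_runs` and the five sample values are hypothetical and are not measurements from this study.

```python
import statistics


def aggregate_runs(values):
    """Geometric average of one metric over repeated runs of the same benchmark."""
    return statistics.geometric_mean(values)


if __name__ == "__main__":
    # Hypothetical execution times (s) of five runs on one benchmark.
    runs = [980.0, 1010.0, 955.0, 1120.0, 990.0]
    print(f"geometric average: {aggregate_runs(runs):.1f} s")
```

Compared with the arithmetic mean, the geometric average is less sensitive to a single unusually slow run.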

V-D Reproduction Study of ProbDD

TABLE IV: The final size, execution time and query number of ddmin, ProbDD and CDD on all benchmark suites. The "-" indicates a timeout, and the results of unfinished benchmarks under each algorithm are highlighted in gray. Only benchmarks finished by all algorithms are considered to compute mean values, and we position them prominently at the top rows of each suite. Considering the limited space and the extensive number of $\text{BM}_{\text{XML}}$ benchmarks, all of which have been finished, we only present the average results.

			Final size (#)			Execution time (s)			Query number
	Benchmark	Original size (#)	ddmin	ProbDD	CDD	ddmin	ProbDD	CDD	ddmin	ProbDD	CDD
	LLVM-22382	9,987	350	367	350	1,450	975	1,004	11,388	5,461	4,540
	LLVM-23353	30,196	321	324	324	2,787	1,235	1,390	11,719	4,199	3,839
	LLVM-25900	78,960	941	921	945	9,010	3,501	3,104	35,740	15,026	10,685
	LLVM-27747	173,840	431	509	431	6,972	2,653	3,757	20,000	5,862	6,976
	GCC-59903	57,581	1,185	1,206	753	7,752	3,469	3,727	47,698	15,396	11,648
	GCC-61383	32,449	959	955	978	10,729	6,220	6,027	43,716	13,712	11,933
	GCC-61917	85,359	882	923	902	4,993	2,780	3,472	31,414	12,037	12,485
	GCC-65383	43,942	706	700	709	4,334	3,128	3,528	25,051	9,309	8,022
	GCC-71626	4,397	184	184	184	111	104	124	1,608	1,196	1,119
	LLVM-22704	184,444	95,930	788	790	-	10,762	9,218	14,312	15,973	12,230
	LLVM-26760	209,577	15,123	498	498	-	5,010	4,995	17,749	9,530	8,256
	LLVM-31259	48,799	1,033	1,035	1,051	-	8,350	6,915	28,192	13,210	11,331
	GCC-64990	148,931	39,192	709	741	-	10,625	9,402	24,212	11,965	11,361
	GCC-66186	47,481	1,012	1,008	1,013	-	8,494	9,920	37,682	13,025	15,768
	LLVM-23309	33,310	1,532	1,270	1,286	-	-	9,376	34,173	17,038	14,840
	LLVM-27137	174,538	119,115	56,098	47,195	-	-	-	5,845	4,410	4,455
	GCC-60116	75,224	17,598	2,797	1,894	-	-	-	19,061	10,391	9,007
	GCC-66375	65,488	1,381	1,242	1,220	-	-	-	27,831	7,662	8,511
	GCC-70127	154,816	26,613	1,119	1,068	-	-	-	18,213	7,659	6,692
	GCC-70586	212,259	36,692	1,820	1,606	-	-	-	17,732	13,707	15,991
$\text{BM}_{\text{C}}$	Mean	64,599	566	582	541	3,333	1,819	2,018	18,483	7,276	6,483
	mkdir-5.2.1	34,801	8,625	8,407	8,497	3,771	1,692	1,432	11,969	2,469	1,909
	chown-8.2	43,869	9,765	9,178	9,190	-	6,057	5,321	25,446	7,108	5,448
	rm-8.4	44,459	11,293	8,411	8,463	-	4,241	3,758	22,744	4,862	4,262
	bzip2-1.0.5	70,530	37,941	37,506	37,510	-	-	-	5,959	1,349	1,032
	date-8.21	53,442	38,696	20,768	21,109	-	-	-	46,241	9,446	8,538
	grep-2.19	127,681	127,024	75,228	82,847	-	-	-	56,235	4,750	4,195
	gzip-1.2.4	45,929	32,575	26,011	24,290	-	-	-	20,900	3,697	3,487
	sort-8.16	88,068	81,544	44,171	48,353	-	-	-	43,189	1,972	1,864
	tar-1.14	163,296	157,806	97,670	93,365	-	-	-	35,369	3,117	2,930
	uniq-8.16	63,861	18,041	19,071	17,379	-	-	-	13,411	4,150	1,953
$\text{BM}_{\text{DBT}}$	Mean	65,151	8,625	8,407	8,497	3,771	1,692	1,432	11,969	2,469	1,909
$\text{BM}_{\text{XML}}$	Mean	20,190	56	57	55	840	639	660	453	293	281
All	Mean	31,989	89	90	88	1,076	769	801	872	510	481

To comprehensively reproduce the results of ProbDD [14], we evaluate ddmin and ProbDD using three benchmark suites, containing a total of $76$ benchmarks. Following the settings of ProbDD [14], we set the empirically estimated remaining rate as the initialization probability $p_{0}$, specifically, 0.1 for $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{DBT}}$, and 2.5e-3 for $\text{BM}_{\text{XML}}$. As in the ProbDD paper, we employ three hours (10,800 seconds) as the timeout threshold. If the algorithm does not complete a benchmark within the time limit, the smallest result achieved and the corresponding query numbers are still recorded. Due to the significant differences between completed and uncompleted results, we only consider benchmarks completed by all algorithms when calculating averages and p-values. The detailed results are shown in Table IV.

Efficiency and Effectiveness. Through our reproduction study, we find that the performance of ProbDD aligns with the results reported in the original paper, showing that ProbDD is significantly more efficient than ddmin. On benchmarks that can be completed by all algorithms, ProbDD requires $28.53$ % less time and $41.51$ % fewer queries, with p-value being 3.50e-05 and 2.82e-03, respectively. Moreover, we assess the effectiveness by measuring the sizes of the final minimized results. The effectiveness of ddmin and ProbDD varies across each benchmark, but neither algorithm consistently outperforms the other, as substantiated by a p-value of 0.71, which is much higher than 0.05.
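The reported savings can be recomputed from the "All / Mean" row of Table IV. The sketch below does this arithmetic; the helper name `relative_saving` is hypothetical, and the CDD values are included only because they appear in the same row of the table.

```python
def relative_saving(baseline, other):
    """Percentage by which `other` is smaller than `baseline`."""
    return 100.0 * (baseline - other) / baseline


# "All / Mean" row of Table IV: mean execution time (s) and mean query number.
MEAN_TIME = {"ddmin": 1076, "ProbDD": 769, "CDD": 801}
MEAN_QUERIES = {"ddmin": 872, "ProbDD": 510, "CDD": 481}

if __name__ == "__main__":
    for algo in ("ProbDD", "CDD"):
        time_saving = relative_saving(MEAN_TIME["ddmin"], MEAN_TIME[algo])
        query_saving = relative_saving(MEAN_QUERIES["ddmin"], MEAN_QUERIES[algo])
        print(f"{algo}: {time_saving:.2f}% less time, {query_saving:.2f}% fewer queries")
```

For ProbDD this gives $(1076-769)/1076\approx 28.53\%$ less time and $(872-510)/872\approx 41.51\%$ fewer queries than ddmin.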

V-E Impact of Randomness in ProbDD

In ProbDD, elements with different probabilities are sorted accordingly, while elements with the same probability are randomly shuffled. However, randomness alone intuitively does not ensure a higher probability of escaping local optima and the effect of this randomness on performance has not been thoroughly investigated.

To this end, we conduct an ablation study by removing such randomness, creating a variant called ProbDD-no-random. We evaluate this variant across all benchmarks. The results indicate that the randomness does not significantly impact performance. Specifically, in terms of final size, execution time, and query number, ProbDD-no-random achieves 90, 770, and 559 compared to 90, 769, and 510 of ProbDD, respectively. The p-values of 0.67, 0.95, and 0.75 indicate that the differences are not significant.
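The difference between the two variants can be sketched as an ordering step with and without random tie-breaking. This is only an illustration under stated assumptions: the function name `order_elements` is hypothetical, and sorting in ascending order of probability is an assumption made here, not a detail confirmed by the ProbDD implementation.

```python
import random


def order_elements(probabilities, rng=None):
    """Return element indices sorted by probability (ascending, assumed).

    With rng=None, ties keep their original order (the no-random variant).
    With an rng, ties are shuffled before the stable sort.
    """
    indices = list(range(len(probabilities)))
    if rng is not None:
        rng.shuffle(indices)  # randomize the relative order of tied elements
    # sorted() is stable, so tied elements keep the (possibly shuffled) order.
    return sorted(indices, key=lambda i: probabilities[i])


if __name__ == "__main__":
    probs = [0.3, 0.1, 0.1, 0.3]
    print(order_elements(probs))                    # deterministic tie order
    print(order_elements(probs, random.Random(0)))  # tie order depends on the seed
```

In both variants, elements with different probabilities end up in the same relative order; only the order inside each group of equal probabilities can change.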

V-F Bottleneck ProbDD Overcomes

In the study of ProbDD, the authors demonstrate that ProbDD is more efficient than the baseline approach (ddmin) in tree-based reduction scenarios, where the inputs are parsed into tree representations before reduction. Therefore, to uncover the root cause of this superiority, we follow the same application scenario and analyze the behavior of ProbDD in reducing the tree-structured inputs.

[Figure panel (a): On $\text{BM}_{\text{C}}$]

To further understand why ProbDD is more efficient than ddmin, we conduct in-depth statistical analysis on the query number (number of deletion attempts). Intuitively, performance bottlenecks lie in those queries with low success rates, impairing ddmin’s efficiency. Existing studies [16, 15] also demonstrate the presence of queries with low success rates. Therefore, to qualitatively and quantitatively identify the exact bottlenecks impairing ddmin, we statistically analyze all the queries in ddmin and categorize them into three types:

1. Complement: Queries attempting to remove the complement of a subset. According to ddmin algorithm, given a subset (smaller than half of the list $L$), it attempts to remove either the subset or its complement. However, evidence [16] shows that keeping a small subset and removing its complement is not likely to succeed, especially on structured inputs like programs.

2. Revisit: Queries attempting to remove the previously tried subset. After removing a subset, ddmin restarts the process from the first subset, leading to repeated deletion attempts on earlier subsets. Although the removal of one subset may allow another subset to be removable, such repetitions rarely succeed and thus offer limited improvement for the reduction [15].

3. Other: All other queries.

In addition to categorizing queries in ddmin into the above types, we also calculate the success rate of each type, aiming to reveal the bottlenecks of ddmin. Fig. 2 illustrates the distribution of queries for all types within ddmin, as well as the query number for ProbDD. We only consider completed benchmarks on $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{XML}}$ , as they reflect the distribution throughout the entire minimization process. However, only one benchmark is completed across all algorithms in $\text{BM}_{\text{DBT}}$ . Therefore, for this benchmark suite, we include all benchmarks, including those unfinished ones, to ensure the results are statistically meaningful.

On both $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{XML}}$, ddmin performs almost the same number of successful queries as ProbDD. Specifically, on $\text{BM}_{\text{C}}$, ddmin performs $74+3+3,614=3,691$ successful queries, close to the 3,633 successful queries of ProbDD. Similarly, on $\text{BM}_{\text{XML}}$, the number of successful queries of ddmin is $2+830+1,485=2,317$, demonstrating only minimal differences compared to ProbDD’s 2,315 successful queries. On $\text{BM}_{\text{DBT}}$, however, the number of successful queries of ddmin is not close to that of ProbDD, as most benchmarks are not completed. Besides, ddmin always performs significantly more failed queries, resulting in a larger total query number and thus a longer execution time, as previously discussed in section V-D.

On all benchmark suites, a large portion of ddmin’s queries is categorized as Complement and Revisit; however, they both have a very low success rate. For instance, on $\text{BM}_{\text{C}}$ , out of a total of 220,563 queries, Complement and Revisit account for 119,363 (54.12%) and 36,652 (16.62%), respectively. Within such queries in Complement and Revisit, merely 3 (<0.01%) and 74 (0.20%) queries succeed, i.e., only a tiny portion of attempts successfully reduce elements. These success rates are far less than those of queries within Other (5.60%), as well as those of ProbDD (4.85%). On the other benchmark suites, a similar phenomenon is observed.
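The percentages for $\text{BM}_{\text{C}}$ follow directly from the reported counts, as the short sketch below shows. The number of Other queries is not stated explicitly; it is derived here as the total minus the Complement and Revisit counts, and the helper names `share` and `success_rate` are hypothetical.

```python
# Query counts reported for ddmin on BM_C.
TOTAL = 220_563
QUERIES = {"Complement": 119_363, "Revisit": 36_652}
# Derived, not reported: everything that is neither Complement nor Revisit.
QUERIES["Other"] = TOTAL - sum(QUERIES.values())
SUCCESSES = {"Complement": 3, "Revisit": 74, "Other": 3_614}


def share(category):
    """Percentage of all ddmin queries that fall into `category`."""
    return 100.0 * QUERIES[category] / TOTAL


def success_rate(category):
    """Percentage of queries in `category` that succeed."""
    return 100.0 * SUCCESSES[category] / QUERIES[category]


if __name__ == "__main__":
    for category in QUERIES:
        print(f"{category}: share {share(category):.2f}%, "
              f"success rate {success_rate(category):.4f}%")
```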

Queries within Complement and Revisit categories constitute a large portion yet prove to be largely inefficient, wasting a significant amount of time and resources. On the contrary, those in Other achieve a much higher success rate, on par with that of ProbDD, and are responsible for most of the successful deletions. Therefore, we believe that these two categories, where queries are inefficient, are the main bottlenecks behind ddmin’s low efficiency. However, these bottlenecks are absent in ProbDD, as it does not consider complements of subsets and previously tried subsets for deletion.

V-G 1-Minimality of ProbDD?

Although ProbDD avoids Revisit queries to enhance efficiency, some reduction potentials may be missed, as the deletion of a certain subset may enable a previously tried subset to become removable. Therefore, a limitation of ProbDD lies in that it increases efficiency by sacrificing 1-minimality. To substantiate this limitation, we examine how frequently ProbDD generates a list that is not 1-minimal, i.e., can be further reduced by removing a single element. For instance, statistical analysis on $\text{BM}_{\text{C}}$ reveals that among 6,871 invocations of ProbDD, 76 of them fail to generate a 1-minimal result, accounting for $1.11$ %. For these failed invocations, an average of 1.49 elements (tree nodes) can be further removed via single-element deletion.
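Checking whether a result is 1-minimal only requires one query per remaining element. The following is a minimal sketch of such a check; the function names `removable_singletons` and `is_one_minimal`, as well as the toy property, are hypothetical and do not come from the tooling used in this study.

```python
def removable_singletons(items, psi):
    """Indices whose individual removal still preserves the property psi."""
    return [i for i in range(len(items)) if psi(items[:i] + items[i + 1:])]


def is_one_minimal(items, psi):
    """A list is 1-minimal if no single element can be removed."""
    return not removable_singletons(items, psi)


def contains_three(items):
    """Toy property: the list must still contain the element 3."""
    return 3 in items


if __name__ == "__main__":
    print(removable_singletons([1, 3, 5], contains_three))
    print(is_one_minimal([3], contains_three))
```

The length of the list returned by `removable_singletons` corresponds to the number of elements that can be further removed via single-element deletion.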

However, this limitation is not apparent across all benchmark suites, as the results from ProbDD are not consistently larger than those from ddmin. Our further investigation reveals that these benchmarks are reduced within the wrapper frameworks Picireny and Chisel. Both frameworks employ iterative loops to reach a fixpoint, effectively removing some elements missed in the first iteration.
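The effect of such an outer loop can be sketched on a toy example in which one element becomes removable only after a later element has been deleted. Everything here is hypothetical: `one_pass` stands in for a reducer that never revisits earlier elements, `reduce_to_fixpoint` stands in for the iteration of a wrapper framework, and neither reflects the actual code of Picireny or Chisel.

```python
def one_pass(items, psi):
    """Try to delete each element once, left to right, never revisiting."""
    kept = list(items)
    i = 0
    while i < len(kept):
        candidate = kept[:i] + kept[i + 1:]
        if psi(candidate):
            kept = candidate  # deletion succeeded; the next element moves into slot i
        else:
            i += 1
    return kept


def reduce_to_fixpoint(items, psi, reducer=one_pass):
    """Re-run the reducer until an iteration removes nothing."""
    current = list(items)
    while True:
        reduced = reducer(current, psi)
        if len(reduced) == len(current):
            return reduced
        current = reduced


def toy_property(items):
    """'c' must be kept, and 'b' may only be present together with 'a'."""
    return "c" in items and ("a" in items or "b" not in items)


if __name__ == "__main__":
    print(one_pass(["a", "b", "c"], toy_property))
    print(reduce_to_fixpoint(["a", "b", "c"], toy_property))
```

In this example, a single pass cannot delete "a" because "b" is still present when "a" is tried, whereas the second iteration of the outer loop removes it.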

VI Implications: a counter-based model

Building on the aforementioned demystification of ProbDD, we discover that probability can be optimized away, and subset size can be pre-computed. Hence, we propose Counter-Based Delta Debugging (CDD), to reduce the complexity of both the theory and implementation of ProbDD, and validate the correctness of our prior theoretical proofs.

Algorithm 2: CDD($L,\psi$)

Input: $L$: a list of elements to be reduced.
Input: $\psi:\mathbb{L}\rightarrow\mathbb{B}$: the property to be preserved by $L$.
Input: $p_{0}$: the initial probability given by the user.
Output: the minimized list that still exhibits the property $\psi$.

$r\leftarrow 0$ // The round number, initially 0.
do
    /* Compute subset size by round number */
    $s\leftarrow$ ComputeSize($r$, $p_{0}$)
    /* Partition $L$ into subsets with $s$ elements. If it does not divide evenly, leave a smaller remainder as the final subset. */
    $\texttt{subsets}\leftarrow$ Partition($L$, $s$)
    foreach $\texttt{subset}\in\texttt{subsets}$ do
        $\texttt{temp}\leftarrow L\setminus\texttt{subset}$
        /* Remove subset if it is removable */
        if $\psi(\texttt{temp})$ is true then
            $L\leftarrow\texttt{temp}$
    /* Update $r$ and move to next round. */
    $r\leftarrow r+1$
while $s>1$
return $L$

Function ComputeSize($r,p_{0}$):
    Input: $r$: the current round number.
    Input: $p_{0}$: the initial probability given by the user.
    Output: The size of the subset to be used in the current round.
    /* Calculate the initial subset size from $p_{0}$ */
    $s_{0}\leftarrow\lfloor-\frac{1}{\ln(1-p_{0})}\rfloor$
    /* Calculate the subset size for the current round */
    $s\leftarrow\lfloor s_{0}\times 0.632^{r}\rfloor$
    return $s$
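For readers who prefer executable code, Algorithm 2 can be transcribed into a short Python sketch. This is an illustration rather than the implementation evaluated in this paper: the names `compute_size`, `partition` and `cdd` are hypothetical, elements are tracked by position so that duplicate values are handled, $0<p_{0}<1$ is assumed, and the `max(1, ...)` guard is an addition made here so that the subset size never reaches 0 for large $p_{0}$.

```python
import math


def compute_size(r, p0):
    """Subset size for round r, following ComputeSize in Algorithm 2."""
    s0 = math.floor(-1 / math.log(1 - p0))
    # max(1, ...) is a guard added in this sketch; it is not part of Algorithm 2.
    return max(1, math.floor(s0 * 0.632 ** r))


def partition(indices, s):
    """Split indices into chunks of s elements; the last chunk may be smaller."""
    return [indices[i:i + s] for i in range(0, len(indices), s)]


def cdd(items, psi, p0):
    """Counter-based reduction of items while psi keeps holding."""
    kept = list(range(len(items)))  # positions of the elements still present
    r = 0
    while True:
        s = compute_size(r, p0)
        # The partition is fixed at the start of the round.
        for subset in partition(kept, s):
            removed = set(subset)
            temp = [i for i in kept if i not in removed]
            if psi([items[i] for i in temp]):
                kept = temp  # the subset is removable
        r += 1
        if s <= 1:
            break
    return [items[i] for i in kept]


if __name__ == "__main__":
    needs_3_and_17 = lambda lst: 3 in lst and 17 in lst
    print(cdd(list(range(20)), needs_3_and_17, p0=0.1))
```

With $p_{0}=0.1$, the subset sizes used in successive rounds are 9, 5, 3, 2 and 1, after which the loop terminates.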

Subset size pre-calculation. Based on Equation 11 in section III, the size for each round can be pre-calculated. Therefore, as shown in the ComputeSize function of Algorithm 2, we utilize the current round $r$ and the initial probability $p_{0}$ to determine the subset size $s$. The size of the selected subset decreases as the round counter increases. This is intuitively reasonable since, after a sufficient number of attempts on a large size have been made, it becomes more advantageous to gradually reduce the subset size for future trials. Furthermore, this trend aligns well with that of ProbDD, in which probabilities of elements gradually increase, resulting in a smaller subset size.

Main workflow. The simplified ProbDD is illustrated in the main loop of Algorithm 2. Before each round, CDD pre-calculates the subset size and then partitions $L$ using this size. Then, similar to ddmin, it attempts to remove each subset in turn. The subset size continuously decreases until it reaches 1, at which point each remaining element is individually tried for removal once.

Revisiting the running example. Returning to Table III, under the same conditions, CDD achieves the same results as ProbDD but without the need for probability calculations. This is because both the probability and subset size $s$ can be directly determined from the round number $r$ .

Evaluation. In Table IV, CDD completes the most benchmarks, totaling 69, followed by ProbDD with 68 benchmarks completed, and ddmin with 61 benchmarks completed. CDD outperforms ddmin w.r.t. efficiency, with $25.56$% less time and $44.84$% fewer queries. Meanwhile, CDD performs on par with ProbDD w.r.t. final size, execution time and query number, with p-values of 0.38, 0.13 and 0.06, respectively, indicating no significant difference between these two algorithms. CDD is expected to perform on par with ProbDD since it is designed to provide further insight and unravel the complexities of ProbDD, rather than to surpass its capabilities. Furthermore, its comparable performance to ProbDD further validates the non-necessity of randomness and our assumption in III.1.

Bottleneck and 1-minimality. Returning to the bottlenecks presented in Fig. 2, CDD possesses a query number and success rate close to those of ProbDD, indicating that CDD also overcomes the bottlenecks of ddmin. Additionally, similar to ProbDD, 1-minimality is absent in CDD, although iterations help mitigate this issue.

VII Related Work

In this section, we discuss related work of test input minimization around three aspects: effectiveness, efficiency, and the utilization of domain knowledge.

Effectiveness. Test input minimization is an NP-complete problem, in which achieving the global minimum is usually infeasible. Therefore, existing approaches to improving effectiveness mainly aim to escape local minima by performing more exhaustive searches. Since enumerating all possible subsets is infeasible, Vulcan [11] and C-Reduce [12] enumerate all combinations of elements within a small sliding window, and exhaustively attempt to delete each combination, resulting in smaller final program sizes. In contrast, ProbDD and CDD do not exhibit clear actions targeted at breaking through local optima, suggesting they cannot achieve better effectiveness than ddmin, as aligned with our evaluation in section V.

Efficiency. If parallelism is not considered, the core of boosting efficiency is the enhanced capability to avoid relatively inefficient queries. For example, Hodovan and Kiss [16] proposed disregarding attempts to remove the complement of subsets, the success rate of which is unacceptably low in some scenarios. Besides, Gharachorlu and Sumner [15] proposed One Pass Delta Debugging (OPDD), which continues with the subset next to the deleted one, rather than starting over from the first subset. This optimization also avoids some redundant queries in ddmin, reducing runtime by 65%. As revealed by our analysis, these two optimizations are implicitly incorporated within ProbDD and CDD, thereby contributing to their higher efficiency than ddmin.

Utilization of domain knowledge. There is an inherent trade-off between effectiveness and efficiency in test input minimization. For the same algorithm, achieving a better result, i.e., a smaller local optimum, requires more queries to be spent on trial and error. However, employing domain knowledge [12, 25, 26, 27] can still improve the overall performance. For instance, J-Reduce is both more effective and efficient than HDD on reducing Java programs, as it escapes more local optima by program transformations while simultaneously avoiding more inefficient queries via semantic constraints, leveraging the semantics of Java. Our analysis on ProbDD indicates that the probabilities primarily function as counters and do not utilize or effectively learn the domain knowledge of an input. Besides, the evaluation on CDD, a simplified algorithm without utilizing probability, demonstrates that prioritizing elements via such probabilities does not yield significant benefits, thus validating our analysis.

VIII Conclusion

This paper conducts the first in-depth analysis of ProbDD, which is the state-of-the-art variant of ddmin, to further comprehend and demystify its superior performance. With theoretical analysis of the probabilistic model in ProbDD, we reveal that probabilities essentially serve as monotonically increasing counters, and propose CDD for simplification. Evaluations on $76$ benchmarks from test input minimization and software debloating confirm that CDD performs on par with ProbDD, substantiating our theoretical analysis. Furthermore, our examination on query success rate and randomness uncovers that ProbDD’s superiority stems from skipping inefficient queries. Finally, we discuss trade-offs in ddmin and ProbDD, providing insights for future research and applications of test input minimization algorithms.

References

[1] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,” IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.
[2] GCC. (2020) A guide to testcase reduction. Accessed: 2023-04-30. [Online]. Available: https://gcc.gnu.org/wiki/A_guide_to_testcase_reduction
[3] LLVM. (2022) How to submit an llvm bug report. Accessed: 2023-04-30. [Online]. Available: https://llvm.org/docs/HowToSubmitABug.html
[4] WebKit. (2001) Webkit: Test case reduction. Accessed: 2023-04-30. [Online]. Available: https://webkit.org/test-case-reduction/
[5] ASF Bugzilla. (2001) ASF bugzilla: Bug writing guidelines. Accessed: 2023-04-30. [Online]. Available: https://bz.apache.org/bugzilla/page.cgi?id=bug-writing.html
[6] Bugzilla. (2001) Bugzilla: Reporting a new bug. Accessed: 2023-04-30. [Online]. Available: https://bugzilla.readthedocs.io/en/5.2/using/filing.html#reporting-a-new-bug
[7] A. Donaldson and D. MacIver. (2021, May) Test Case Reduction: Beyond Bugs. [Online]. Available: https://blog.sigplan.org/2021/05/25/test-case-reduction-beyond-bugs
[8] A. F. Donaldson, P. Thomson, V. Teliman, S. Milizia, A. P. Maselco, and A. Karpiński, “Test-case reduction and deduplication almost for free with transformation-based compiler testing,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1017–1032.
[9] C. Sun, Y. Li, Q. Zhang, T. Gu, and Z. Su, “Perses: Syntax-guided program reduction,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 361–371.
[10] G. Misherghi and Z. Su, “Hdd: hierarchical delta debugging,” in Proceedings of the 28th International Conference on Software Engineering, 2006, pp. 142–151.
[11] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, “Pushing the limit of 1-minimality of language-agnostic program reduction,” Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 636–664, 2023.
[12] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, “Test-case reduction for c compiler bugs,” in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012, pp. 335–346.
[13] K. Heo, W. Lee, P. Pashakhanloo, and M. Naik, “Effective program debloating via reinforcement learning,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 380–394.
[14] G. Wang, R. Shen, J. Chen, Y. Xiong, and L. Zhang, “Probabilistic delta debugging,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 881–892.
[15] G. Gharachorlu and N. Sumner, “Avoiding the familiar to speed up test case reduction,” in 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2018, pp. 426–437.
[16] R. Hodován and Á. Kiss, “Practical improvements to the minimizing delta debugging algorithm.” in ICSOFT-EA, 2016, pp. 241–248.
[17] G. Wang. (2021) Probdd. Accessed: 2023-04-30. [Online]. Available: https://github.com/Amocy-Wang/ProbDD
[18] M. Pelikan, D. E. Goldberg, E. Cantú-Paz et al., “Boa: The bayesian optimization algorithm,” in Proceedings of the genetic and evolutionary computation conference GECCO-99, vol. 1. Citeseer, 1999, pp. 525–532.
[19] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, “Pushing the limit of 1-minimality of language-agnostic program reduction,” Proc. ACM Program. Lang., vol. 7, no. OOPSLA1, apr 2023. [Online]. Available: https://doi.org/10.1145/3586049
[20] C. Qian, H. Hu, M. Alharthi, S. P. H. Chung, T. Kim, and W. Lee, “Razor: A framework for post-deployment software debloating.” in USENIX Security Symposium, 2019, pp. 1733–1750.
[21] M. Alhanahnah, R. Jain, V. Rastogi, S. Jha, and T. Reps, “Lightweight, multi-stage, compiler-assisted application specialization,” in 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P). IEEE, 2022, pp. 251–269.
[22] S. Li and M. Rigger, “Finding xpath bugs in xml document processors via differential testing,” arXiv preprint arXiv:2401.05112, 2024.
[23] A. Kiss, R. Hodován, and D. Vince. (2016) Picireny. Accessed: 2023-04-30. [Online]. Available: https://github.com/renatahodovan/picireny
[24] ——. (2016) Picire. Accessed: 2023-04-30. [Online]. Available: https://github.com/renatahodovan/picire
[25] C. G. Kalhauge and J. Palsberg, “Binary reduction of dependency graphs,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 556–566.
[26] ——, “Logical bytecode reduction,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1003–1016.
[27] M. Zhang, Y. Tian, Z. Xu, Y. Dong, S. H. Tan, and C. Sun, “Lpr: Large language models-aided program reduction,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: ACM, 2024, p. 13.