Deep Dive into Probabilistic Delta Debugging: Insights and Simplifications
Abstract
Given a list $L$ of elements and a property $\psi$ that $L$ exhibits, ddmin is a well-known test input minimization algorithm designed to automatically eliminate $\psi$-irrelevant elements from $L$. This algorithm is extensively adopted in test input minimization and software debloating. Recently, ProbDD, an advanced variant of ddmin, has been proposed and achieved state-of-the-art performance. Employing Bayesian optimization, ProbDD predicts the likelihood of each element in $L$ being pertinent to $\psi$, and statistically decides which elements and how many should be removed each time. Despite its impressive results, the theoretical probabilistic model of ProbDD is complex, and the specific factors driving its superior performance have not been thoroughly investigated.
In this paper, we conduct the first in-depth theoretical analysis of ProbDD, clarifying the trends in probability and subset size changes and simplifying the probability model. This analysis is complemented by empirical experiments, including success rate analysis, ablation studies, and examinations of trade-offs and limitations, to further understand and demystify this state-of-the-art algorithm. Our success rate analysis reveals how ProbDD effectively addresses bottlenecks that slow down ddmin by skipping inefficient queries that attempt to delete complements of subsets and previously tried subsets. The ablation study illustrates that randomness in ProbDD has no significant impact on efficiency.
Based on these findings, we propose CDD, a simplified version of ProbDD, which reduces the complexity in both theory and implementation. CDD assists in validating the correctness of our key findings, such as the role of probabilities in ProbDD serving as monotonically increasing counters for each element, and in identifying the main factors that contribute to ProbDD’s superior performance. Comprehensive evaluations across benchmarks in test input minimization and software debloating demonstrate that CDD can achieve the same performance as ProbDD, despite its simplification. These insights provide valuable guidance for future research and applications of test input minimization algorithms.
Index Terms:
Program Reduction, Delta Debugging, Software Debloating, Test Input Minimization
I Introduction
Delta Debugging [1] is a seminal family of algorithms designed for software debugging, among which ddmin stands out as a classic test input minimization (a.k.a. test input reduction) algorithm. Given a list $L$ of elements (modeling the test input) and a property $\psi$ that $L$ exhibits, ddmin aims to remove elements in $L$ that are irrelevant to $\psi$, such that the resulting list is smaller than $L$ yet still satisfies $\psi$. The ddmin algorithm plays a crucial role in software testing, debugging and maintenance [2, 3, 4, 5, 6], since compact, informative bug-triggering inputs make it easier for developers to identify root causes than large inputs cluttered with bug-irrelevant information [7].
To minimize a test input that satisfies $\psi$, ddmin has been used in two primary manners. In the first manner, the input is initially segmented into a list $L$, based on characters, tokens, lines, etc. Subsequently, ddmin is directly applied to $L$ [1, 8]. Alternatively, ddmin serves as a pivotal component within advanced, structure-aware test input minimization algorithms, including Perses [9], HDD [10], C-Reduce [12], and Chisel [13]. These algorithms leverage the inherent structures of the input to expedite the minimization process or further reduce its size. Generally, these algorithms start by parsing the input into a tree structure, such as a parse tree. They then iteratively extract a list $L$ of tree nodes from the tree using heuristics and apply ddmin to $L$ to gradually condense the tree. Both manners underscore the fundamental role of ddmin as the cornerstone of test input minimization.
In the past years, different variants of ddmin have been proposed to improve its performance [13, 14, 15, 16], among which Probabilistic Delta Debugging (ProbDD) [14] is the state of the art, with notable superiority to other algorithms [1, 13]. When reducing $L$, ProbDD utilizes a theoretical probabilistic model based on Bayesian optimization to predict how likely every element in $L$ is to be essential to preserve the property $\psi$, by assigning a probability to each element. ProbDD prioritizes deleting elements with lower probabilities, as such elements generally have a lower possibility of being $\psi$-relevant. Before each deletion attempt, an optimal subset of elements is determined by maximizing the Expected Reduction Gain, defined as the expected number of elements removed in that attempt. A higher Expected Reduction Gain is preferred, as it indicates an expectation to delete more elements through this attempt. If the deletion of this subset fails to preserve $\psi$, the probabilistic model increases the probability assigned to each element in the subset. As reported [17], aided by such a probabilistic model, ProbDD significantly outperforms ddmin by reducing the execution time and the query number (a query is a run of the property test $\psi$).
However, this probabilistic model in ProbDD is rather intricate, and the underlying mechanisms for its superior performance have not been adequately studied. The original paper of ProbDD merely showed its performance numbers without deep ablation analysis on such achievements. Specifically, the following questions are important to the research field of test input minimization, but have not been answered yet.
1. What is the fundamental role of probabilities in ProbDD, and can they be simplified without impacting performance?
2. What specific bottlenecks does ProbDD overcome to achieve improvement compared to ddmin?
3. How does randomness in ProbDD contribute to the performance improvement?
4. What are the potential limitations of ProbDD?
Gaining a deeper understanding of the state of the art, i.e., ProbDD, is highly valuable for test input minimization tasks. By clarifying the intrinsic reasons behind its superiority, we can help researchers understand the essence of the probabilistic model, as well as its strengths and limitations. Such demystification, in our view, paves the way for future research and guides users to apply ddmin and its variants more effectively for test input minimization.
To this end, we conduct the first in-depth analysis of ProbDD, starting by theoretically simplifying its probabilistic model. In the original ProbDD, probabilities are used to calculate the Expected Reduction Gain, which is subsequently used to determine the next subset size. However, this process necessitates iterative calculations, impeding the simplification and comprehension of ProbDD. In our study, we first establish the analytical correlation between the probability and subset size, allowing for probabilities and subset sizes to be explicitly calculated through formulas, thus eliminating the need for iterative updates. Further, through mathematical derivation, we discover that the probability and subset size can be considered nearly independent, each varying at an approximately constant ratio on its own. By theoretical prediction, the probability increases approximately by a factor of $\frac{1}{1-e^{-1}}$ ($\approx$ 1.582) per round, while the subset size decreases by a factor of $1-e^{-1}$ ($\approx$ 0.632), thus providing the potential for simplifying ProbDD.
Building upon our theoretical analysis, we conduct extensive evaluations of ddmin, ProbDD, and CDD across diverse benchmarks. The experimental results confirm the correctness of our theoretical analysis, demonstrate how ProbDD addresses bottlenecks in ddmin by skipping inefficient queries, reveal the impact of randomness on the results, and highlight the limitations of ProbDD. These findings provide valuable guidance for future research and the development of test input minimization algorithms.
Based on the aforementioned analysis, we propose Counter-Based Delta Debugging (CDD), a simplified version of ProbDD, to explain ProbDD’s high performance. By replacing probabilities with counters, CDD eliminates the probability computations required by ProbDD, thus reducing theoretical and implementation complexity. Our experiments demonstrate that CDD aligns with ProbDD in both effectiveness and efficiency, which validates our previous analysis and findings.
Key Findings. Through both theoretical analysis and empirical experiments, our key findings are:
1. According to our theoretical derivation, the probabilities in ProbDD essentially serve as monotonically increasing counters, and can be simplified. This suggests that the probability mechanism itself may not be a critical factor in ProbDD’s superior performance.
2. The performance bottlenecks addressed by ProbDD are inefficient deletion attempts on complements of subsets and on previously tried subsets, which should be taken into account to enhance efficiency.
3. Randomness in ProbDD has no significant impact on the performance. Test input minimization is an NP-complete problem; randomness in ProbDD does not enhance the likelihood of finding optimal solutions.
4. ProbDD is faster than ddmin, but at the cost of not guaranteeing 1-minimality (a list is 1-minimal if removing any single element from it results in the loss of its property). The trade-off between effectiveness and efficiency is inevitable, and should be leveraged accordingly in different scenarios.
Contributions. We make the following major contributions.
• We perform the first in-depth theoretical analysis of ProbDD, the state-of-the-art algorithm in test input minimization tasks, and identify the latent correlation between the subset size and the probability of elements.
• We propose CDD, a much simplified version of ProbDD.
• We evaluate ddmin, ProbDD and CDD on benchmarks, validating the correctness of our theoretical analysis. Additional experiments and statistical analysis on ProbDD further explain its superior performance, and reveal the effect of randomness and the limitations of ProbDD.
Paper Organization. The remainder of the paper is structured as follows: Section II introduces the symbols used in this study and the detailed workflow of ddmin and ProbDD. Section III and Section IV present our in-depth analysis of ProbDD, simplifying the model of probability and subset size. Section V describes empirical experiments and their results, from which additional findings are derived. Section VI introduces CDD, which simplifies ProbDD based on our earlier findings while maintaining equivalent performance. Section VII discusses related work and Section VIII concludes this study.
II Preliminaries
To facilitate comprehension, Table I lists all the important symbols used in this paper. Next, this section introduces ddmin and ProbDD, with the running example shown in Fig. 1(a).
Table I. Symbols used in this paper.
Symbol | Description | Symbol | Description |
$L$ | the list to minimize | $|L|$ | the size of $L$ |
$\psi$ | the property to preserve | $E(s)$ | Expected Reduction Gain with the first $s$ elements |
$l_i$ | the $i$-th element of $L$ | $e$ | Euler’s number |
$p_i$ | the probability of $l_i$ | $r$ | the round number |
$v$ | a variant of $L$ | $s_r$ | the subset size in round $r$ |
$S$ | a subset of $L$ | $p_r$ | the probability of each element in round $r$ |
Table II. Step-by-step minimization by ddmin on the running example. A ✓ means the element is kept in the variant; the last row is the result of the property test. Round 1 ($s$=4) covers v1 to v2, Round 2 ($s$=2) covers v3 to v10, and Round 3 ($s$=1) covers v11 to v30.
Element | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | v9 | v10 | v11 | v12 | v13 | v14 | v15 | v16 | v17 | v18 | v19 | v20 | v21 | v22 | v23 | v24 | v25 | v26 | v27 | v28 | v29 | v30 |
$l_1$ | ✓ |  | ✓ |  |  |  |  | ✓ | ✓ | ✓ | ✓ |  |  |  |  |  |  |  |  | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
$l_2$ | ✓ |  | ✓ |  |  |  |  | ✓ | ✓ | ✓ |  | ✓ |  |  |  |  |  |  | ✓ |  | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ |
$l_3$ | ✓ |  |  | ✓ |  |  | ✓ |  | ✓ | ✓ |  |  | ✓ |  |  |  |  |  | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ |
$l_4$ | ✓ |  |  | ✓ |  |  | ✓ |  | ✓ | ✓ |  |  |  | ✓ |  |  |  |  | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ | ✓ |
$l_5$ |  | ✓ |  |  | ✓ |  | ✓ | ✓ |  | ✓ |  |  |  |  | ✓ |  |  |  | ✓ | ✓ | ✓ | ✓ |  |  |  |  |  |  |  |  |
$l_6$ |  | ✓ |  |  | ✓ |  | ✓ | ✓ |  | ✓ |  |  |  |  |  | ✓ |  |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ | ✓ |
$l_7$ |  | ✓ |  |  |  | ✓ | ✓ | ✓ | ✓ |  |  |  |  |  |  |  | ✓ |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  | ✓ |
$l_8$ |  | ✓ |  |  |  | ✓ | ✓ | ✓ | ✓ |  |  |  |  |  |  |  |  | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |  |
$\psi$ | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | F | T | F | F | F | F | F | F | F |
II-A The ddmin Algorithm
The ddmin algorithm [1] is the first algorithm to systematically minimize a bug-triggering input to its essence. It takes the following two inputs:
• $L$: a list of elements representing a bug-triggering input. For example, $L$ can be a list of bytes, characters, lines, tokens, or parse tree nodes extracted from the bug-triggering input.
• $\psi$: a property that $L$ has. Formally, $\psi$ can be defined as a predicate that returns T if a list of elements preserves the property, and F otherwise.
It returns a minimal subset of $L$ that still preserves $\psi$, from which excluding any single element will make the subset lose $\psi$. This algorithm has been widely used in practice to help developers with debugging [12, 9, 8, 24]. It generally consists of the following three steps.
Initialize. Start by setting the initial subset size $s$ to half of the size of the input list $L$, i.e., $s = |L|/2$.
Step 1: Minimize to Subset. Divide the input list $L$ into subsets, each with the current subset size $s$. For each subset $S$, check whether $S$ alone satisfies $\psi$. If yes, keep only $S$ and restart from Step 1 with $L = S$ and the subset size set to half of the new $|L|$; otherwise go to Step 2.
Step 2: Minimize to Complement. Test whether the complement of each subset $S$ (i.e., $L \setminus S$) satisfies $\psi$. If yes, keep the complement of $S$ and restart from Step 2 with $L = L \setminus S$. Otherwise, go to Step 3.
Step 3: Subdivide. If any of the remaining subsets has at least two elements and thus can be further divided, halve the subset size, i.e., $s = s/2$, and go back to Step 1. If no subset can be further divided (i.e., the subset size is 1), ddmin terminates and returns the remaining elements as the result.
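The three steps above can be sketched in Python as follows. This is a minimal illustration written for this description, not the implementation evaluated in this paper: the names `split`, `ddmin` and `prop` are hypothetical, the property test is assumed to be a deterministic function over lists, and no query caching is performed.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def split(items: Sequence[T], size: int) -> List[List[T]]:
    """Cut items into consecutive chunks of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def ddmin(items: Sequence[T], prop: Callable[[List[T]], bool]) -> List[T]:
    """Sketch of ddmin as described above (hypothetical helper, no caching)."""
    current = list(items)
    size = max(len(current) // 2, 1)          # Initialize: s = |L| / 2
    while True:
        # Step 1: try to keep a single subset.
        reduced = False
        for subset in split(current, size):
            if len(subset) < len(current) and prop(subset):
                current = subset
                size = max(len(current) // 2, 1)
                reduced = True
                break
        if reduced:
            continue                          # restart from Step 1

        # Step 2: try to keep the complement of a subset, restarting Step 2
        # after every successful removal.
        progress = True
        while progress:
            progress = False
            subsets = split(current, size)
            if len(subsets) < 2:              # the complement would be empty
                break
            for k in range(len(subsets)):
                complement = [x for j, sub in enumerate(subsets)
                              if j != k for x in sub]
                if prop(complement):
                    current = complement
                    progress = True
                    break

        # Step 3: subdivide, or stop once the subset size is 1.
        if size == 1:
            return current
        size //= 2
```

For instance, with the toy property `lambda xs: 3 in xs and 6 in xs` (an assumption for illustration, unrelated to the running example), `ddmin(list(range(1, 9)), ...)` returns `[3, 6]`.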
Round Number $r$. Note that we introduce a round number $r$ in Table II. Within each round, the list is divided into subsets of a fixed size, on which Step 1 and Step 2 are applied. A new round begins when no further progress can be made with the current subset size. This round number is not explicitly present in the original ddmin algorithm but exists implicitly. In subsequent sections, we will also use this concept to introduce and simplify the ProbDD algorithm.
Table II illustrates the step-by-step minimization process of ddmin with the running example in Fig. 1(a). Initially, the input $L$ is [$l_1$, $l_2$, …, $l_8$]. The ddmin algorithm iteratively generates variants by gradually decreasing the size of subsets from 4 to 2 to 1.
1. Round 1 ($s$=4). At the beginning, ddmin splits $L$ into two subsets and generates two variants v1 and v2. However, neither of them preserves $\psi$.
2. Round 2 ($s$=2). Next, ddmin continues to subdivide these two subsets into smaller ones, and generates eight variants (i.e., v3, v4, …, v10) by using these subsets and their complements. Specifically, the first four variants (v3, v4, v5, v6) are the subsets, and the next four variants (v7, v8, v9, v10) are the complements of these subsets. Again, none of these eight variants preserves $\psi$.
3. Round 3 ($s$=1). Finally, ddmin decreases the subset size from 2 to 1, and generates more variants. This time, v23, which is the complement of the subset {$l_5$}, preserves $\psi$. Hence, the subset {$l_5$} is permanently removed from $L$. Then for each of the remaining subsets {$l_1$}, {$l_2$}, …, {$l_8$}, ddmin restarts testing the complement of each subset, i.e., from v24 to v30. However, none of these variants preserves $\psi$, and no subset can be further divided, so ddmin terminates with the variant v23 as the final result.
II-B Probabilistic Delta Debugging (ProbDD)
Wang et al. [14] proposed the state-of-the-art algorithm ProbDD, significantly surpassing ddmin in minimizing bug-triggering programs on C compilers and benchmarks in software debloating. ProbDD employs Bayesian optimization [18] to model the minimization problem. ProbDD assigns a probability to each element in $L$, representing its likelihood of being essential for preserving the property $\psi$. At each step during the minimization process, ProbDD selects a subset of elements expected to yield the highest Expected Reduction Gain, and targets the elements in the subset for deletion. In this section, we outline ProbDD’s workflow in Algorithm 1, paving the way for a deeper understanding and analysis of ProbDD.
Initialize. ProbDD assigns each element in $L$ an initial probability $p_0$, representing the prior likelihood that the element cannot be removed.
Step 1: Select elements. First, ProbDD sorts the elements in $L$ by probability in ascending order, and the order of elements with the same probability is determined randomly. Then, it calculates the subset to be removed in the next attempt via the proposed Expected Reduction Gain, as shown in Equation 1, with $E(s)$ denoting the expected gain obtained via removing the first $s$ elements in $L$, and $p_i$ denoting the current probability of the $i$-th element in $L$.
$$E(s) = s \times \prod_{i=1}^{s} (1 - p_i) \tag{1}$$
Note that ProbDD has an invariant that the subset chosen for a deletion attempt is always the first $s$ elements in $L$. Every time, the first $s^{*}$ elements are selected as the optimal subset $S$, where $s^{*}$ maximizes the Expected Reduction Gain $E(s)$, as given in Equation 2.
$$s^{*} = \operatorname*{arg\,max}_{s} E(s) \tag{2}$$
Step 2: Delete the Subset. If $\psi$ is still preserved after the removal of $S$, ProbDD removes the subset $S$, i.e., keeps only the complement of $S$, and proceeds to Step 1. If $\psi$ cannot be preserved after the removal, ProbDD updates the probability of each element in the subset via Equation 3, and resumes at Step 1. It is important to note that if the deletion of an individual element has been attempted and failed, its probability $p_i$ is set to 1, indicating that this element cannot be removed and will no longer be considered for deletion.
$$p_i \leftarrow \frac{p_i}{1 - \prod_{l_j \in S} (1 - p_j)}, \quad l_i \in S \tag{3}$$
Step 3: Check Termination. If every element either has been deleted or possesses a probability of 1, ProbDD terminates. If not, it returns to Step 1.
Round Number $r$. Similar to the concept of rounds in ddmin (see Table II), ProbDD also has an implicit round number $r$, as introduced in Algorithm 1 and shown in Table III. During a round, the subset size is the same and every subset in $L$ is attempted for deletion. Once the probabilities of all elements have been updated, the next round begins.
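The workflow above can be sketched in Python as follows. This is a minimal illustration of Steps 1 to 3 under stated assumptions, not the authors' implementation or Algorithm 1 itself: the names `probdd`, `prop`, `p0` and `seed` are hypothetical, ties in the Expected Reduction Gain are broken in favor of the larger subset, and the probability of a singly-tested element is set to 1 explicitly to avoid floating-point drift.

```python
import random
from math import prod
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def probdd(items: Sequence[T], prop: Callable[[List[T]], bool],
           p0: float = 0.25, seed: int = 0) -> List[T]:
    """Sketch of the ProbDD loop (hypothetical helper, not Algorithm 1 verbatim)."""
    rng = random.Random(seed)
    kept = list(range(len(items)))    # indices of elements still present
    prob = [p0] * len(items)          # prob[i]: belief that items[i] is essential

    # Step 3: stop once every remaining element has probability 1.
    while any(prob[i] < 1.0 for i in kept):
        # Step 1: sort by probability with random tie-breaking, then choose the
        # prefix length s maximizing E(s) = s * prod(1 - p_i).
        order = sorted(kept, key=lambda i: (prob[i], rng.random()))
        best_s, best_gain, survive = 0, 0.0, 1.0
        for s, i in enumerate(order, start=1):
            survive *= 1.0 - prob[i]
            gain = s * survive
            if gain > 0.0 and gain >= best_gain:   # assumption: larger s wins ties
                best_s, best_gain = s, gain
        subset = order[:best_s]
        chosen = set(subset)

        # Step 2: query the variant without the subset.
        candidate = [i for i in kept if i not in chosen]
        if prop([items[i] for i in candidate]):
            kept = candidate                       # deletion succeeded
        elif len(subset) == 1:
            prob[subset[0]] = 1.0                  # single element cannot be removed
        else:
            # Probability update after a failed attempt on the whole subset.
            denom = 1.0 - prod(1.0 - prob[i] for i in subset)
            for i in subset:
                prob[i] = min(1.0, prob[i] / denom)
    return [items[i] for i in kept]
```

With the toy property `lambda xs: 3 in xs and 6 in xs` (an illustrative assumption), the sketch keeps exactly the two required elements for any seed, because only an element whose individual deletion fails can reach probability 1.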
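Table III. Step-by-step results of ProbDD on the running example. Each cell shows whether the element is kept in the variant (✓) and its probability after the query; a probability without ✓ marks a failed deletion attempt on that element; an empty cell means the element has been removed. Round 1 ($s$=4) covers v1 to v2, Round 2 ($s$=2) covers v3 to v4, and Round 3 ($s$=1) covers v5 to v8.
Element | Initial prob | v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 |
$l_1$ | 0.25 | 0.37 | ✓ 0.37 | 0.61 | ✓ 0.61 | ✓ 0.61 | ✓ 0.61 |  |  |
$l_2$ | 0.25 | ✓ 0.25 |  |  |  |  |  |  |  |
$l_3$ | 0.25 | ✓ 0.25 |  |  |  |  |  |  |  |
$l_4$ | 0.25 | 0.37 | ✓ 0.37 | ✓ 0.37 | 0.61 |  |  |  |  |
$l_5$ | 0.25 | 0.37 | ✓ 0.37 | 0.61 | ✓ 0.61 | ✓ 0.61 | 1 | ✓ 1 | ✓ 1 |
$l_6$ | 0.25 | ✓ 0.25 |  |  |  |  |  |  |  |
$l_7$ | 0.25 | ✓ 0.25 |  |  |  |  |  |  |  |
$l_8$ | 0.25 | 0.37 | ✓ 0.37 | ✓ 0.37 | 0.61 | ✓ 0.61 | ✓ 0.61 | ✓ 0.61 | 1 |
$\psi$ |  | F | T | F | F | T | F | T | F |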
Table III illustrates the step-by-step results of ProbDD. Following the study of ProbDD [14], the initial probability $p_0$ is set to 0.25, resulting in subsets with a size of 4 as per Equation 2.
1. Round 1 ($s$=4). Similar to the example in the original paper of ProbDD [14], we assume ProbDD selects ($l_1$, $l_4$, $l_5$, $l_8$) to delete due to the randomness, thus resulting in the variant v1. However, v1 fails to exhibit $\psi$, leading to the probability of these selected elements being updated from 0.25 to $\frac{0.25}{1-(1-0.25)^4} \approx 0.37$, based on Equation 3. Next, the remaining elements with lower probability, i.e., ($l_2$, $l_3$, $l_6$, $l_7$), are prioritized and selected for deletion, resulting in v2. This time, the property test passes and these elements are removed.
2. Round 2 ($s$=2). Given that the probabilities of all remaining elements become 0.37, the next subset size becomes 2. Subsequently, ProbDD attempts to remove the subset ($l_1$, $l_5$) in v3 and then the subset ($l_4$, $l_8$) in v4, but neither subset can be removed. After these two attempts, all probabilities are updated to $\frac{0.37}{1-(1-0.37)^2} \approx 0.61$.
3. Round 3 ($s$=1). Finally, the subset size becomes 1, so each individual element is selected for removal alone. The elements $l_4$ and $l_1$ are removed in v5 and v7, respectively, while $l_5$ and $l_8$ are verified as non-removable, thus being returned as the final result.
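The subset sizes and probabilities of this running example can be re-derived with a few lines of Python. The sketch below is only a numerical check of Equations 2 and 3 for equal probabilities; the helper names `best_subset_size` and `update` are hypothetical, and ties in the Expected Reduction Gain are assumed to go to the larger subset, which matches the subset size of 4 in Round 1.

```python
from math import prod


def best_subset_size(probs):
    """Prefix length s that maximizes E(s) = s * prod(1 - p_i) (Equation 2)."""
    ordered = sorted(probs)
    gains = [s * prod(1.0 - p for p in ordered[:s])
             for s in range(1, len(ordered) + 1)]
    best = max(gains)
    # Assumption: when several sizes tie, take the largest one.
    return max(s for s, g in enumerate(gains, start=1) if g == best)


def update(p, subset_probs):
    """New probability of an element with probability p after its subset
    failed to be deleted (Equation 3)."""
    return p / (1.0 - prod(1.0 - q for q in subset_probs))


p0 = 0.25
s1 = best_subset_size([p0] * 8)     # Round 1: eight elements at 0.25
p1 = update(p0, [p0] * s1)
s2 = best_subset_size([p1] * 4)     # Round 2: four remaining elements
p2 = update(p1, [p1] * s2)
s3 = best_subset_size([p2] * 4)     # Round 3
p3 = update(p2, [p2] * s3)

print(s1, round(p1, 2), s2, round(p2, 2), s3, round(p3, 2))
```

Note that for $p_0 = 0.25$ the sizes 3 and 4 yield exactly the same gain ($3 \times 0.75^3 = 4 \times 0.75^4$), so the subset size of 4 depends on the tie-breaking choice.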
III On the Probabilities in ProbDD
Beginning with this section, we systematically present our findings. Each finding is introduced by first stating the result, followed by a comprehensive explanation. Regarding theory, we only state the lemmas and theorems, with the corresponding proofs presented in the supplementary materials. In this section, we theoretically analyze the trend of probability changes across rounds.
An Illustrative Example. The running example illustrated in Table III leads to this finding. Observation reveals that after each element has been attempted for deletion once, i.e., completing one round, the probabilities of all remaining elements are updated. The initial probability is 0.25; after v2, it changes to 0.37; following v4, it increases to 0.61; and by the end of v8, it reaches 1. Consequently, we hypothesize that with each deletion attempt, the probability approximately increases in a predictable manner. Through appropriate simplification, we can theoretically model this trend, and thereby model the entire progression of probability changes.
III-A Assumption for Theoretical Analysis
Besides the above observation from a concrete example, theoretical analysis is necessary. To refine the mathematical model of ProbDD for easier representation, analysis and derivation, we assume that the number of elements in $L$ is always divisible by the subset size. With this assumption, the probability of each element will be updated in the same manner; as a result, before and after each round, the probabilities of all elements are always the same, as stated in Lemma III.1. This assumption is often applicable in practice. For instance, in the running example in Table III, before each round, the probabilities associated with each remaining element are identical, ensuring that all subsets are of identical size. Furthermore, the probabilities of elements are updated to the same next value after the round.
Lemma III.1.
If the number of elements in $L$ is always divisible by the subset size, the probabilities of all elements are always the same.
While it is not always possible for the number of elements to be divisible by the subset size, the elements will still be partitioned as evenly as possible. However, such indivisibility makes the theoretical simplification of ProbDD nearly impossible. Based on our observation when running ProbDD, slightly uneven partitioning does not significantly affect probability updates. Moreover, in Section VI we will demonstrate, via thorough experimental evaluation, that the simplified algorithm derived from this assumption has no significant difference from ProbDD.
III-B Probability vs. Subset Size Correlation
In the second step, we derive the correlation between probability and subset size. Based on the assumption in the previous step, the probability of each element is identical and represented as $p_r$ in round $r$, thus the formula of Expected Reduction Gain from Equation 1 can be simplified to
$$E(s) = s \times (1 - p_r)^{s} \tag{4}$$
Given the probability $p_r$ of elements in round $r$, the optimal subset size can be derived through gradient-based optimization, i.e., by solving $\frac{\partial E(s)}{\partial s} = 0$. Therefore, the optimal size to maximize $E(s)$ is $s_r = -\frac{1}{\ln(1-p_r)}$. Subsequently, from Equation 3 we can also deduce the next probability to be $p_{r+1} = \frac{p_r}{1-(1-p_r)^{s_r}}$. In summary, the correlation between probability and subset size can be simplified as Equation 5 and Equation 6, in which the subset size $s_r$ is determined by the probability $p_r$, and the probability $p_{r+1}$ in the next round is determined by both $p_r$ and $s_r$.
$$s_r = \operatorname*{arg\,max}_{s} E(s) = -\frac{1}{\ln(1 - p_r)} \tag{5}$$
$$p_{r+1} = \frac{p_r}{1 - (1 - p_r)^{s_r}} \tag{6}$$
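The optimization step behind the subset size can be spelled out as a short worked derivation. It treats $s$ as a real-valued variable, which is a relaxation introduced here for illustration, since an actual subset size is an integer.

```latex
\begin{align*}
E(s) &= s\,(1-p_r)^{s} \\
\frac{\partial E(s)}{\partial s}
  &= (1-p_r)^{s} + s\,(1-p_r)^{s}\ln(1-p_r)
   = (1-p_r)^{s}\bigl(1 + s\ln(1-p_r)\bigr) \\
\frac{\partial E(s)}{\partial s} = 0
  &\;\Longrightarrow\; 1 + s\ln(1-p_r) = 0
  \;\Longrightarrow\; s_r = -\frac{1}{\ln(1-p_r)}
\end{align*}
```

Since $\ln(1-p_r) < 0$ for $0 < p_r < 1$, the derivative is positive for $s < s_r$ and negative for $s > s_r$, so $s_r$ is indeed a maximum. As a sanity check, $p_r = 0.25$ gives $s_r = -1/\ln(0.75) \approx 3.48$, which lies between the two integer sizes 3 and 4 that yield the same gain in the running example.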
III-C Trend of Probability Changes
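Through Equation 6, $p_{r+1} > p_r$ always holds, indicating a monotonic increase of the probability of elements. However, there is still room for simplification, as $s_r$ can be represented by $p_r$ (Equation 5), implying that $p_{r+1}$ can be represented solely by $p_r$.
Lemma III.2.
$p_{r+1}$ is increased from $p_r$ by a factor of $\frac{1}{1-e^{-1}}$ ($\approx$ 1.582), i.e.,
$$p_{r+1} \approx \frac{1}{1 - e^{-1}}\, p_r \approx 1.582\, p_r \tag{7}$$
Therefore, through empirical observations on the running example, coupled with theoretical derivation and simplification, we have identified the pattern of probability changes w.r.t. the round number $r$, i.e., $p_r \approx 1.582^{r} \times p_0$, where $p_0$ is the initial probability.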
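The growth factor can be checked numerically. The sketch below is an illustration only: `next_probability` is a hypothetical helper that plugs the real-valued optimal subset size into the probability update for equal probabilities, and the cap at 1 is added because a real-valued subset size may otherwise push the result slightly above 1 in the last round.

```python
import math

# Growth factor 1 / (1 - 1/e), approximately 1.582.
FACTOR = 1.0 / (1.0 - math.exp(-1.0))


def next_probability(p: float) -> float:
    """Next-round probability from p, using the real-valued optimal subset size."""
    s = -1.0 / math.log(1.0 - p)               # optimal size for probability p
    return min(1.0, p / (1.0 - (1.0 - p) ** s))


# Trajectory starting from an initial probability of 0.1.
p = 0.1
trajectory = [p]
while p < 1.0:
    p = next_probability(p)
    trajectory.append(p)

ratios = [b / a for a, b in zip(trajectory, trajectory[1:-1])]
print([round(x, 3) for x in trajectory])
print([round(x, 3) for x in ratios])
```

With a real-valued subset size, $(1-p_r)^{s_r}$ equals $e^{-1}$ exactly, so every ratio before the final capped step matches the factor; with integer subset sizes the ratio is only approximately 1.582.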
IV On the Size of Subsets in ProbDD
In this section, we theoretically analyze the trend of subset size changes across rounds.
IV-A Demystifying How Subset Size Changes
Building on our previous finding that the probability $p_r$ can be approximately estimated from the current round number $r$ via a constant factor, we observe a similar pattern in the changes of the subset size $s_r$ in each round.
Lemma IV.1.
$s_{r+1}$ can be expressed solely by $s_r$:
$$s_{r+1} = -\frac{1}{\ln\left(1 - \dfrac{1 - e^{-1/s_r}}{1 - e^{-1}}\right)} \tag{8}$$
Despite deriving that $s_{r+1}$ depends solely on $s_r$, the trend of subset size is still implicit and obscure. For a clearer approximation, we propose linear bounds on $s_{r+1}$ in terms of $s_r$.
Lemma IV.2.
The lower bound of $s_{r+1}$ w.r.t. $s_r$ is
(9) |
Lemma IV.3.
The upper bound of $s_{r+1}$ w.r.t. $s_r$ is
(10) |
Theorem IV.4.
Subset size $s_r$ is initialized as in Equation 5, updated by Equation 8, and constrained by the two linear bounds in Equation 9 and Equation 10:
(11) |
Aided by these two bounds, we obtain a complete representation, Equation 11, to model the subset size $s_r$. It is worth noting that the size decreases approximately by a factor of 0.632 per round, until reaching 1. In other words, the subset size after round $r$ is roughly $0.632^{r}$ times the initial subset size, allowing the subset size to be analytically pre-determined, and thus providing the potential for simplification of ProbDD and leading to the proposal of CDD (see details in Section VI).
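The shrinking trend of the subset size can also be observed numerically. The following sketch composes the closed forms for the optimal subset size and the probability update with a real-valued subset size; `next_size` is a hypothetical helper written for this illustration, and it does not reproduce the linear bounds of the lemmas above.

```python
import math

# Per-round shrink factor 1 - 1/e, approximately 0.632.
SHRINK = 1.0 - math.exp(-1.0)


def next_size(s: float) -> float:
    """Subset size of the next round from the current real-valued size s > 1."""
    if s <= 1.0:
        raise ValueError("the recurrence is only defined for s > 1")
    p = 1.0 - math.exp(-1.0 / s)            # probability that makes s optimal
    p_next = p / SHRINK                     # update, using (1 - p)^s = 1/e
    return -1.0 / math.log(1.0 - p_next)    # optimal size for p_next


# Sizes per round, starting from an initial size of 100.
sizes = [100.0]
while sizes[-1] > 1.0:
    sizes.append(next_size(sizes[-1]))

print([round(x, 2) for x in sizes])
print([round(b / a, 3) for a, b in zip(sizes, sizes[1:])])
```

For large sizes the per-round ratio stays close to 0.632, and it drifts below that value as the size approaches 1.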
V Empirical Experiments
In addition to the theoretical derivation above, we conduct an extensive experimental evaluation on ddmin and ProbDD to gain deeper insights and achieve further discoveries. Specifically, we reproduce the experiments on ddmin and ProbDD by Wang et al. [14], and then delve deeper into ProbDD, analyzing its randomness, the bottlenecks it overcomes, and its 1-minimality. Furthermore, we evaluate our proposed CDD (which will be presented in section VI), validating our previous theoretical analysis. Due to limited space, we present the results of both ProbDD and CDD together within this section, but this section primarily focuses on discussing ProbDD, while the next section will focus on CDD.
V-A Benchmarks
To extensively evaluate ddmin, ProbDD and CDD, we use the following three benchmark suites (76 benchmarks in total), covering various use scenarios of minimization algorithms.
• Benchmark-C: 20 large bug-triggering programs in the C language, each of which triggers a real-world compiler bug in either LLVM or GCC. The original size of the benchmarks ranges from 4,397 tokens to 212,259 tokens. This benchmark suite has been used to evaluate prior test input minimization work [9, 14, 19].
• The software debloating suite: 10 programs (mkdir-5.2.1, chown-8.2, rm-8.4, bzip2-1.0.5, date-8.21, grep-2.19, gzip-1.2.4, sort-8.16, tar-1.14 and uniq-8.16; see Table IV). The original size of the benchmarks ranges from 34,801 tokens to 163,296 tokens.
• Benchmark-XML: 46 XML inputs triggering 8 unique bugs in BaseX, a widely-used XML processing tool. The original size of the benchmarks ranges from 19,290 tokens to 20,750 tokens. This benchmark suite is generated via Xpress [22] and collected by the authors of this study, as the original XML dataset used in the ProbDD paper is not publicly available.
V-B Evaluation Metrics
We measure the following aspects as metrics.
Final Size. This metric assesses the effectiveness of reduction. When reducing a list $L$ with a certain property $\psi$, a smaller final list is preferred, indicating that more irrelevant elements have been successfully eliminated. In all benchmark suites, the metric is measured by the number of tokens.
Execution Time. The execution time of a minimization algorithm reflects its efficiency. A minimization algorithm taking less time is more desirable, and execution time is measured in seconds.
Query Number. This metric further evaluates the algorithm’s efficiency. During the reduction process, each time a smaller variant is produced, the algorithm verifies whether this variant still preserves the property $\psi$, which is referred to as a query. Since queries consume time, a lower query number is favorable.
P-value. We calculate the p-value via a paired t-test between every two algorithms, to investigate whether the performance differences are significant. A p-value below 0.05 denotes a significant distinction between the two groups of data. Otherwise, the observed difference lacks statistical significance.
V-C The Wrapping Frameworks
The ddmin algorithm and its variants usually serve as the fundamental algorithm. To apply them to a concrete scenario, an outer wrapping framework is generally needed to handle the structure of the input. In our evaluation, we choose the same wrapping frameworks as those used by the ProbDD paper. For tree-structured bug-triggering inputs, i.e., Benchmark-C and Benchmark-XML, we use Picireny 21.8 [23], an implementation of HDD [10]. Picireny parses such inputs into trees, and then invokes Picire 21.8 [24], an open-source Delta Debugging library with ddmin, ProbDD and CDD implemented, to reduce each level of the trees. For software debloating, Chisel [13] is employed, in which ddmin, ProbDD and CDD are integrated.
All experiments are conducted on a server running Ubuntu 22.04.3 LTS, equipped with Intel Xeon Gold 6348 CPUs @ 2.60GHz, providing a total of 120 threads, and 4 TB of RAM. To ensure the reproducibility, we employ docker images to release the source code and the configuration. Each benchmark is reduced using a single thread. Following the ProbDD paper, we run each algorithm on each benchmark 5 times and calculate the geometric average results.
V-D Reproduction Study of ProbDD
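Table IV. Final size, execution time and query number of ddmin, ProbDD and CDD. A "-" in the execution time columns marks a run that did not finish within the timeout.
Benchmark | Original size (#) | Final size (#): ddmin | Final size (#): ProbDD | Final size (#): CDD | Execution time (s): ddmin | Execution time (s): ProbDD | Execution time (s): CDD | Query number: ddmin | Query number: ProbDD | Query number: CDD |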
LLVM-22382 | 9,987 | 350 | 367 | 350 | 1,450 | 975 | 1,004 | 11,388 | 5,461 | 4,540 | |
LLVM-23353 | 30,196 | 321 | 324 | 324 | 2,787 | 1,235 | 1,390 | 11,719 | 4,199 | 3,839 | |
LLVM-25900 | 78,960 | 941 | 921 | 945 | 9,010 | 3,501 | 3,104 | 35,740 | 15,026 | 10,685 | |
LLVM-27747 | 173,840 | 431 | 509 | 431 | 6,972 | 2,653 | 3,757 | 20,000 | 5,862 | 6,976 | |
GCC-59903 | 57,581 | 1,185 | 1,206 | 753 | 7,752 | 3,469 | 3,727 | 47,698 | 15,396 | 11,648 | |
GCC-61383 | 32,449 | 959 | 955 | 978 | 10,729 | 6,220 | 6,027 | 43,716 | 13,712 | 11,933 | |
GCC-61917 | 85,359 | 882 | 923 | 902 | 4,993 | 2,780 | 3,472 | 31,414 | 12,037 | 12,485 | |
GCC-65383 | 43,942 | 706 | 700 | 709 | 4,334 | 3,128 | 3,528 | 25,051 | 9,309 | 8,022 | |
GCC-71626 | 4,397 | 184 | 184 | 184 | 111 | 104 | 124 | 1,608 | 1,196 | 1,119 | |
LLVM-22704 | 184,444 | 95,930 | 788 | 790 | - | 10,762 | 9,218 | 14,312 | 15,973 | 12,230 | |
LLVM-26760 | 209,577 | 15,123 | 498 | 498 | - | 5,010 | 4,995 | 17,749 | 9,530 | 8,256 | |
LLVM-31259 | 48,799 | 1,033 | 1,035 | 1,051 | - | 8,350 | 6,915 | 28,192 | 13,210 | 11,331 | |
GCC-64990 | 148,931 | 39,192 | 709 | 741 | - | 10,625 | 9,402 | 24,212 | 11,965 | 11,361 | |
GCC-66186 | 47,481 | 1,012 | 1,008 | 1,013 | - | 8,494 | 9,920 | 37,682 | 13,025 | 15,768 | |
LLVM-23309 | 33,310 | 1,532 | 1,270 | 1,286 | - | - | 9,376 | 34,173 | 17,038 | 14,840 | |
LLVM-27137 | 174,538 | 119,115 | 56,098 | 47,195 | - | - | - | 5,845 | 4,410 | 4,455 | |
GCC-60116 | 75,224 | 17,598 | 2,797 | 1,894 | - | - | - | 19,061 | 10,391 | 9,007 | |
GCC-66375 | 65,488 | 1,381 | 1,242 | 1,220 | - | - | - | 27,831 | 7,662 | 8,511 | |
GCC-70127 | 154,816 | 26,613 | 1,119 | 1,068 | - | - | - | 18,213 | 7,659 | 6,692 | |
GCC-70586 | 212,259 | 36,692 | 1,820 | 1,606 | - | - | - | 17,732 | 13,707 | 15,991 | |
Mean (Benchmark-C) | 64,599 | 566 | 582 | 541 | 3,333 | 1,819 | 2,018 | 18,483 | 7,276 | 6,483 |
mkdir-5.2.1 | 34,801 | 8,625 | 8,407 | 8,497 | 3,771 | 1,692 | 1,432 | 11,969 | 2,469 | 1,909 | |
chown-8.2 | 43,869 | 9,765 | 9,178 | 9,190 | - | 6,057 | 5,321 | 25,446 | 7,108 | 5,448 | |
rm-8.4 | 44,459 | 11,293 | 8,411 | 8,463 | - | 4,241 | 3,758 | 22,744 | 4,862 | 4,262 | |
bzip2-1.0.5 | 70,530 | 37,941 | 37,506 | 37,510 | - | - | - | 5,959 | 1,349 | 1,032 | |
date-8.21 | 53,442 | 38,696 | 20,768 | 21,109 | - | - | - | 46,241 | 9,446 | 8,538 | |
grep-2.19 | 127,681 | 127,024 | 75,228 | 82,847 | - | - | - | 56,235 | 4,750 | 4,195 | |
gzip-1.2.4 | 45,929 | 32,575 | 26,011 | 24,290 | - | - | - | 20,900 | 3,697 | 3,487 | |
sort-8.16 | 88,068 | 81,544 | 44,171 | 48,353 | - | - | - | 43,189 | 1,972 | 1,864 | |
tar-1.14 | 163,296 | 157,806 | 97,670 | 93,365 | - | - | - | 35,369 | 3,117 | 2,930 | |
uniq-8.16 | 63,861 | 18,041 | 19,071 | 17,379 | - | - | - | 13,411 | 4,150 | 1,953 | |
Mean | 65,151 | 8,625 | 8,407 | 8,497 | 3,771 | 1,692 | 1,432 | 11,969 | 2,469 | 1,909 | |
Mean | 20,190 | 56 | 57 | 55 | 840 | 639 | 660 | 453 | 293 | 281 | |
All | Mean | 31,989 | 89 | 90 | 88 | 1,076 | 769 | 801 | 872 | 510 | 481 |
To comprehensively reproduce the results of ProbDD [14], we evaluate ddmin and ProbDD using three benchmark suites, containing a total of benchmarks. Following the settings of ProbDD [14], we set the empirically estimated remaining rate as the initialization probability , specifically, 0.1 for and , and 2.5e-3 for . As in the ProbDD paper, we use three hours (10,800 seconds) as the timeout threshold. If an algorithm does not complete a benchmark within the time limit, the smallest result achieved and the corresponding query count are still recorded. Because completed and uncompleted results differ significantly, we only consider benchmarks completed by all algorithms when calculating averages and p-values. The detailed results are shown in Table IV.
Efficiency and Effectiveness. Our reproduction study finds that the performance of ProbDD aligns with the results reported in the original paper: ProbDD is significantly more efficient than ddmin. On benchmarks that can be completed by all algorithms, ProbDD requires % less time and % fewer queries, with p-values of 3.50e-05 and 2.82e-03, respectively. We also assess effectiveness by measuring the sizes of the final minimized results. The effectiveness of ddmin and ProbDD varies across benchmarks, and neither algorithm consistently outperforms the other, as substantiated by a p-value of 0.71, which is much higher than 0.05.
V-E Impact of Randomness in ProbDD
In ProbDD, elements with different probabilities are sorted accordingly, while elements with the same probability are randomly shuffled. However, randomness alone intuitively does not ensure a higher probability of escaping local optima and the effect of this randomness on performance has not been thoroughly investigated.
To this end, we conduct an ablation study by removing such randomness, creating a variant called ProbDD-no-random. We evaluate this variant across all benchmarks. The results indicate that the randomness does not significantly impact performance. Specifically, in terms of final size, execution time, and query number, ProbDD-no-random achieves 90, 770, and 559 compared to 90, 769, and 510 of ProbDD, respectively. The p-values of 0.67, 0.95, and 0.75 indicate that the differences are not significant.
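The difference between the two variants can be illustrated with a minimal sketch. It assumes that elements are ordered by ascending probability (the sort direction is an assumption, not taken from the text) and that the only randomness is how ties are broken; the function name `order_indices` is hypothetical and not part of the ProbDD implementation.

```python
import random
from typing import List, Optional, Sequence


def order_indices(probs: Sequence[float],
                  rng: Optional[random.Random] = None) -> List[int]:
    """Return element indices ordered by probability.

    Assumption: ascending order. With `rng`, ties are broken randomly
    (ProbDD-like); without it, ties keep their original order
    (ProbDD-no-random-like).
    """
    indices = list(range(len(probs)))
    if rng is not None:
        # Shuffling first and then applying a stable sort randomizes
        # only the relative order of elements with equal probability.
        rng.shuffle(indices)
    indices.sort(key=lambda i: probs[i])  # list.sort is stable
    return indices


if __name__ == "__main__":
    probabilities = [0.3, 0.1, 0.1, 0.2]
    print(order_indices(probabilities))                    # deterministic ties
    print(order_indices(probabilities, random.Random(0)))  # random ties
```

In both modes the sequence of probabilities visited is identical; only the order among equally ranked elements can change, which is consistent with the small differences observed in the ablation.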
V-F Bottleneck ProbDD Overcomes
In the study of ProbDD, the authors demonstrate that ProbDD is more efficient than the baseline approach (ddmin) in tree-based reduction scenarios, where the inputs are parsed into tree representations before reduction. Therefore, to uncover the root cause of this superiority, we follow the same application scenario and analyze the behavior of ProbDD in reducing the tree-structured inputs.
To further understand why ProbDD is more efficient than ddmin, we conduct in-depth statistical analysis on the query number (number of deletion attempts). Intuitively, performance bottlenecks lie in those queries with low success rates, impairing ddmin’s efficiency. Existing studies [16, 15] also demonstrate the presence of queries with low success rates. Therefore, to qualitatively and quantitatively identify the exact bottlenecks impairing ddmin, we statistically analyze all the queries in ddmin and categorize them into three types:
1. Complement: Queries attempting to remove the complement of a subset. According to the ddmin algorithm, given a subset (smaller than half of the list ), it attempts to remove either the subset or its complement. However, evidence [16] shows that keeping a small subset and removing its complement is unlikely to succeed, especially on structured inputs like programs.
2. Revisit: Queries attempting to remove a previously tried subset. After removing a subset, ddmin restarts the process from the first subset, leading to repeated deletion attempts on earlier subsets. Although the removal of one subset may allow another subset to become removable, such repetitions rarely succeed and thus offer limited improvement for the reduction [15].
3. Other: All other queries.
In addition to categorizing queries in ddmin into the above types, we also calculate the success rate of each type, aiming to reveal the bottlenecks of ddmin. Fig. 2 illustrates the distribution of queries for all types within ddmin, as well as the query number for ProbDD. We only consider completed benchmarks on and , as they reflect the distribution throughout the entire minimization process. However, only one benchmark is completed across all algorithms in . Therefore, for this benchmark suite, we include all benchmarks, including those unfinished ones, to ensure the results are statistically meaningful.
On both and , ddmin performs almost the same number of successful queries, compared to those of ProbDD. Specifically, on , ddmin performs successful queries, close to 3,633 queries from ProbDD. Similarly, on , the success query number of ddmin is , demonstrating only minimal differences compared to ProbDD’s 2,315 successful queries. On , however, the number of successful queries of ddmin is not close to those of ProbDD, as most benchmarks are not completed. Besides, ddmin always performs significantly more failed queries, resulting in a larger total query number and thus a longer execution time, as previously discussed in section V-D.
On all benchmark suites, a large portion of ddmin’s queries is categorized as Complement and Revisit; however, they both have a very low success rate. For instance, on , out of a total of 220,563 queries, Complement and Revisit account for 119,363 (54.12%) and 36,652 (16.62%), respectively. Within such queries in Complement and Revisit, merely 3 (<0.01%) and 74 (0.20%) queries succeed, i.e., only a tiny portion of attempts successfully reduce elements. These success rates are far less than those of queries within Other (5.60%), as well as those of ProbDD (4.85%). On the other benchmark suites, a similar phenomenon is observed.
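The shares and success rates quoted above follow directly from the raw counts. The short sketch below recomputes them; the counts are taken from the paragraph, while the variable and function names are illustrative only.

```python
# Query counts for ddmin on the benchmark suite discussed above.
total_queries = 220_563
categories = {
    # name: (number of queries, number of successful queries)
    "Complement": (119_363, 3),
    "Revisit": (36_652, 74),
}


def percent(part: int, whole: int) -> float:
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole


if __name__ == "__main__":
    for name, (count, successes) in categories.items():
        share = percent(count, total_queries)
        success_rate = percent(successes, count)
        print(f"{name}: {share:.2f}% of all queries, "
              f"{success_rate:.4f}% of them succeed")
```

Together, the two categories cover roughly 70% of ddmin's queries on this suite while contributing only 77 successful deletions.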
Queries within Complement and Revisit categories constitute a large portion yet prove to be largely inefficient, wasting a significant amount of time and resources. On the contrary, those in Other achieve a much higher success rate, on par with that of ProbDD, and are responsible for most of the successful deletions. Therefore, we believe that these two categories, where queries are inefficient, are the main bottlenecks behind ddmin’s low efficiency. However, these bottlenecks are absent in ProbDD, as it does not consider complements of subsets and previously tried subsets for deletion.
V-G 1-Minimality of ProbDD?
Although ProbDD avoids Revisit queries to enhance efficiency, some reduction opportunities may be missed, as the deletion of a certain subset may enable a previously tried subset to become removable. Therefore, a limitation of ProbDD is that it gains efficiency by sacrificing 1-minimality. To substantiate this limitation, we examine how frequently ProbDD generates a list that is not 1-minimal, i.e., one that can be further reduced by removing a single element. For instance, statistical analysis on reveals that among 6,871 invocations of ProbDD, 76 fail to generate a 1-minimal result, accounting for about 1.1% (76/6,871). For these failed invocations, an average of 1.49 elements (tree nodes) can be further removed via single-element deletion.
However, this limitation is not apparent across all benchmark suites, as the results from ProbDD are not consistently larger than those from ddmin. Our further investigation reveals that these benchmarks are reduced with the wrapper frameworks Picireny and Chisel. Both frameworks iterate until a fixpoint is reached, which removes some elements missed in the first iteration.
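The loss of 1-minimality and its mitigation by iteration can be reproduced on a toy input. The sketch below is not ProbDD, Picireny, or Chisel: it uses a hypothetical single-pass reducer that tries each element once without revisiting, a 1-minimality check, and a fixpoint loop. The property `still_fails` is an invented example in which `use` can only be kept together with `decl`.

```python
from typing import Callable, List, Sequence

Property = Callable[[Sequence[str]], bool]


def still_fails(items: Sequence[str]) -> bool:
    """Toy property: the crash must remain, and `use` requires `decl`."""
    return "crash" in items and ("use" not in items or "decl" in items)


def single_pass(items: Sequence[str], test: Property) -> List[str]:
    """Try to delete each element exactly once, never revisiting."""
    current = list(items)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if test(candidate):
            current = candidate  # deletion succeeded; same index is a new element
        else:
            i += 1               # deletion failed; move on and do not come back
    return current


def removable_singletons(items: Sequence[str], test: Property) -> List[int]:
    """Indices whose individual removal still satisfies the property."""
    return [i for i in range(len(items))
            if test(list(items[:i]) + list(items[i + 1:]))]


def reduce_to_fixpoint(items: Sequence[str], test: Property) -> List[str]:
    """Repeat the single pass until no further element is removed."""
    current = list(items)
    while True:
        reduced = single_pass(current, test)
        if len(reduced) == len(current):
            return reduced
        current = reduced


if __name__ == "__main__":
    program = ["decl", "use", "crash"]
    once = single_pass(program, still_fails)
    print(once, removable_singletons(once, still_fails))
    print(reduce_to_fixpoint(program, still_fails))
```

Here `decl` cannot be deleted while `use` is present, so the first pass keeps it; once `use` is gone, `decl` becomes removable, and only a second pass picks this up.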
VI Implications: a counter-based model
Building on the aforementioned demystification of ProbDD, we discover that probability can be optimized away, and subset size can be pre-computed. Hence, we propose Counter-Based Delta Debugging (CDD), to reduce the complexity of both the theory and implementation of ProbDD, and validate the correctness of our prior theoretical proofs.
Subset size pre-calculation. Based on Equation 11 in section III, the size for each round can be pre-calculated. Therefore, as shown in Algorithm 2, we use the current round number and the initial probability to determine the subset size. The size of the selected subset decreases as the round counter increases. This is intuitively reasonable: after a sufficient number of attempts at a large size have been made, it becomes more advantageous to gradually reduce the subset size for future trials. Furthermore, this trend aligns well with that of ProbDD, in which the probabilities of elements gradually increase, resulting in a smaller subset size.
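As a rough illustration of such a pre-computed schedule, the sketch below derives subset sizes from the initial probability alone. Equation 11 is not reproduced in this section, so the formulas here are assumptions: all elements share one probability p, the subset size is approximated by the floor of -1/ln(1-p) (close to the maximizer of the expected gain s(1-p)^s), and after a round of failed deletions p is updated to p / (1 - (1-p)^s). The function names are hypothetical.

```python
import math
from typing import List


def subset_size(p: float) -> int:
    """Assumed size rule: floor(-1 / ln(1 - p)), but never below 1."""
    if p >= 1.0:
        return 1
    return max(1, int(-1.0 / math.log(1.0 - p)))


def next_probability(p: float, size: int) -> float:
    """Assumed update after a round in which deletions of `size` failed."""
    return min(1.0, p / (1.0 - (1.0 - p) ** size))


def size_schedule(p0: float) -> List[int]:
    """Subset size per round, starting from initial probability `p0`.

    Requires 0 < p0 < 1. The schedule ends at the first round of size 1.
    """
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    sizes: List[int] = []
    p = p0
    while True:
        size = subset_size(p)
        sizes.append(size)
        if size == 1:
            return sizes
        p = next_probability(p, size)


if __name__ == "__main__":
    print(size_schedule(0.1))
```

Under these assumptions, an initial probability of 0.1 gives the schedule 9, 5, 3, 1. The sizes depend only on the round index and the initial probability, so no per-element probability needs to be stored.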
Main workflow. The simplified ProbDD, i.e., CDD, is illustrated in Algorithm 2. Before each round, CDD pre-calculates the subset size and then partitions the list using this size. Then, similar to ddmin, it attempts to remove each subset in turn. The subset size continuously decreases until it reaches 1, meaning each element will be individually tried for removal once.
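The workflow can be sketched in a few lines. This is a simplified illustration rather than a transcription of Algorithm 2: the size schedule is passed in as a plain list (here the hypothetical schedule 4, 2, 1), and the names `counter_based_reduce` and `needs_3_and_7` are invented for the example.

```python
from typing import Callable, List, Sequence

Property = Callable[[Sequence[int]], bool]


def counter_based_reduce(items: Sequence[int],
                         test: Property,
                         sizes: Sequence[int]) -> List[int]:
    """Run one round per subset size, trying to delete each chunk once."""
    current = list(items)
    for size in sizes:  # sizes are expected to decrease and end with 1
        start = 0
        while start < len(current):
            # Candidate list with the chunk [start, start + size) removed.
            candidate = current[:start] + current[start + size:]
            if test(candidate):
                current = candidate  # chunk deleted; next chunk slides into place
            else:
                start += size        # chunk kept; continue with the next one
    return current


def needs_3_and_7(items: Sequence[int]) -> bool:
    """Toy property: the failure needs both element 3 and element 7."""
    return 3 in items and 7 in items


if __name__ == "__main__":
    print(counter_based_reduce(range(20), needs_3_and_7, [4, 2, 1]))
```

Each chunk is attempted exactly once per round, and no complement is ever tested, which mirrors the absence of Complement and Revisit queries discussed in section V-F.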
Revisiting the running example. Returning to Table III, under the same conditions, CDD achieves the same results as ProbDD but without the need for probability calculations. This is because both the probability and the subset size can be directly determined from the round number.
Evaluation. In Table IV, CDD completes the most benchmarks (69), followed by ProbDD (68) and ddmin (61). CDD outperforms ddmin w.r.t. efficiency, with % less time and % fewer queries. Meanwhile, CDD performs on par with ProbDD w.r.t. final size, execution time, and query number, with p-values of 0.38, 0.13, and 0.06, respectively, indicating no significant difference between the two algorithms. This parity is expected, since CDD is designed to provide further insight into ProbDD and unravel its complexities rather than to surpass it. Furthermore, its comparable performance further validates the non-necessity of randomness and our assumption in III.1.
Bottleneck and 1-minimality. Returning to the bottlenecks presented in Fig. 2, CDD possesses a query number and success rate close to those of ProbDD, indicating that CDD also overcomes the bottlenecks of ddmin. Additionally, similar to ProbDD, 1-minimality is absent in CDD, although iterations help mitigate this issue.
VII Related Work
In this section, we discuss related work of test input minimization around three aspects: effectiveness, efficiency, and the utilization of domain knowledge.
Effectiveness. Test input minimization is an NP-complete problem, in which achieving the global minimum is usually infeasible. Therefore, existing approaches to improving effectiveness mainly aim to escape local minima by performing more exhaustive searches. Since enumerating all possible subsets is infeasible, Vulcan [11] and C-Reduce [12] enumerate all combinations of elements within a small sliding window, and exhaustively attempt to delete each combination, resulting in smaller final program sizes. In contrast, ProbDD and CDD do not exhibit clear actions targeted at breaking through local optima, suggesting they cannot achieve better effectiveness than ddmin, as aligned with our evaluation in section V.
Efficiency. If parallelism is not considered, the core of boosting efficiency is the ability to avoid relatively inefficient queries. For example, Hodován and Kiss [16] proposed disregarding attempts to remove the complement of subsets, the success rate of which is unacceptably low in some scenarios. Besides, Gharachorlu and Sumner [15] proposed One Pass Delta Debugging (OPDD), which continues with the subset next to the deleted one, rather than starting over from the first subset. This optimization also avoids some redundant queries in ddmin, reducing runtime by 65%. As revealed by our analysis, these two optimizations are implicitly incorporated within ProbDD and CDD, thereby contributing to their higher efficiency than ddmin.
Utilization of domain knowledge. There is an inherent trade-off between effectiveness and efficiency in test input minimization. For the same algorithm, achieving a better result, i.e., a smaller local optimum, requires more queries to be spent on trial and error. However, employing domain knowledge [12, 25, 26, 27] can still improve the overall performance. For instance, J-Reduce is both more effective and more efficient than HDD at reducing Java programs: by leveraging the semantics of Java, it escapes more local optima via program transformations while simultaneously avoiding more inefficient queries via semantic constraints. Our analysis of ProbDD indicates that the probabilities primarily function as counters and do not utilize or effectively learn the domain knowledge of an input. Besides, the evaluation of CDD, a simplified algorithm that does not use probabilities, demonstrates that prioritizing elements via such probabilities does not yield significant benefits, thus validating our analysis.
VIII Conclusion
This paper conducts the first in-depth analysis of ProbDD, the state-of-the-art variant of ddmin, to further comprehend and demystify its superior performance. Through theoretical analysis of the probabilistic model in ProbDD, we reveal that probabilities essentially serve as monotonically increasing counters, and we propose CDD as a simplification. Evaluations on benchmarks from test input minimization and software debloating confirm that CDD performs on par with ProbDD, substantiating our theoretical analysis. Furthermore, our examination of query success rates and randomness shows that ProbDD's superiority stems from skipping inefficient queries. Finally, we discuss trade-offs in ddmin and ProbDD, providing insights for future research and applications of test input minimization algorithms.
References
- [1] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,” IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.
- [2] GCC. (2020) A guide to testcase reduction. Accessed: 2023-04-30. [Online]. Available: https://gcc.gnu.org/wiki/A_guide_to_testcase_reduction
- [3] LLVM. (2022) How to submit an llvm bug report. Accessed: 2023-04-30. [Online]. Available: https://llvm.org/docs/HowToSubmitABug.html
- [4] WebKit. (2001) Webkit: Test case reduction. Accessed: 2023-04-30. [Online]. Available: https://webkit.org/test-case-reduction/
- [5] ASF Bugzilla. (2001) ASF bugzilla: Bug writing guidelines. Accessed: 2023-04-30. [Online]. Available: https://bz.apache.org/bugzilla/page.cgi?id=bug-writing.html
- [6] Bugzilla. (2001) Bugzilla: Reporting a new bug. Accessed: 2023-04-30. [Online]. Available: https://bugzilla.readthedocs.io/en/5.2/using/filing.html#reporting-a-new-bug
- [7] A. Donaldson and D. MacIver. (2021, May) Test Case Reduction: Beyond Bugs. [Online]. Available: https://blog.sigplan.org/2021/05/25/test-case-reduction-beyond-bugs
- [8] A. F. Donaldson, P. Thomson, V. Teliman, S. Milizia, A. P. Maselco, and A. Karpiński, “Test-case reduction and deduplication almost for free with transformation-based compiler testing,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1017–1032.
- [9] C. Sun, Y. Li, Q. Zhang, T. Gu, and Z. Su, “Perses: Syntax-guided program reduction,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 361–371.
- [10] G. Misherghi and Z. Su, “HDD: Hierarchical delta debugging,” in Proceedings of the 28th International Conference on Software Engineering, 2006, pp. 142–151.
- [11] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, “Pushing the limit of 1-minimality of language-agnostic program reduction,” Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 636–664, 2023.
- [12] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, “Test-case reduction for C compiler bugs,” in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012, pp. 335–346.
- [13] K. Heo, W. Lee, P. Pashakhanloo, and M. Naik, “Effective program debloating via reinforcement learning,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 380–394.
- [14] G. Wang, R. Shen, J. Chen, Y. Xiong, and L. Zhang, “Probabilistic delta debugging,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 881–892.
- [15] G. Gharachorlu and N. Sumner, “Avoiding the familiar to speed up test case reduction,” in 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS). IEEE, 2018, pp. 426–437.
- [16] R. Hodován and Á. Kiss, “Practical improvements to the minimizing delta debugging algorithm,” in ICSOFT-EA, 2016, pp. 241–248.
- [17] G. Wang. (2021) Probdd. Accessed: 2023-04-30. [Online]. Available: https://github.com/Amocy-Wang/ProbDD
- [18] M. Pelikan, D. E. Goldberg, E. Cantú-Paz et al., “BOA: The Bayesian optimization algorithm,” in Proceedings of the Genetic and Evolutionary Computation Conference GECCO-99, vol. 1. Citeseer, 1999, pp. 525–532.
- [19] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, “Pushing the limit of 1-minimality of language-agnostic program reduction,” Proc. ACM Program. Lang., vol. 7, no. OOPSLA1, apr 2023. [Online]. Available: https://doi.org/10.1145/3586049
- [20] C. Qian, H. Hu, M. Alharthi, S. P. H. Chung, T. Kim, and W. Lee, “Razor: A framework for post-deployment software debloating,” in USENIX Security Symposium, 2019, pp. 1733–1750.
- [21] M. Alhanahnah, R. Jain, V. Rastogi, S. Jha, and T. Reps, “Lightweight, multi-stage, compiler-assisted application specialization,” in 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P). IEEE, 2022, pp. 251–269.
- [22] S. Li and M. Rigger, “Finding XPath bugs in XML document processors via differential testing,” arXiv preprint arXiv:2401.05112, 2024.
- [23] A. Kiss, R. Hodován, and D. Vince. (2016) Picireny. Accessed: 2023-04-30. [Online]. Available: https://github.com/renatahodovan/picireny
- [24] ——. (2016) Picire. Accessed: 2023-04-30. [Online]. Available: https://github.com/renatahodovan/picire
- [25] C. G. Kalhauge and J. Palsberg, “Binary reduction of dependency graphs,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 556–566.
- [26] ——, “Logical bytecode reduction,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1003–1016.
- [27] M. Zhang, Y. Tian, Z. Xu, Y. Dong, S. H. Tan, and C. Sun, “Lpr: Large language models-aided program reduction,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis. New York, NY, USA: ACM, 2024, p. 13.