Deep Dive into Probabilistic Delta Debugging: Insights and Simplifications


Mengxiao Zhang, Zhenyang Xu, Yongqiang Tian¹, Xinru Cheng, Chengnian Sun
University of Waterloo, Waterloo, Canada
{m492zhan, zhenyang.xu, x59cheng, cnsun}@uwaterloo.ca
¹The Hong Kong University of Science and Technology, Hong Kong, China
yqtian@ust.hk
Abstract

Given a list $L$ of elements and a property $\psi$ that $L$ exhibits, ddmin is a well-known test input minimization algorithm designed to automatically eliminate $\psi$-irrelevant elements from $L$. This algorithm is extensively adopted in test input minimization and software debloating. Recently, ProbDD, an advanced variant of ddmin, has been proposed and achieved state-of-the-art performance. Employing Bayesian optimization, ProbDD predicts the likelihood of each element in $L$ being pertinent to $\psi$, and statistically decides which elements and how many should be removed each time. Despite its impressive results, the theoretical probabilistic model of ProbDD is complex, and the specific factors driving its superior performance have not been thoroughly investigated.

In this paper, we conduct the first in-depth theoretical analysis of ProbDD, clarifying the trends in probability and subset size changes and simplifying the probability model. This analysis is complemented by empirical experiments, including success rate analysis, ablation studies, and examinations of trade-offs and limitations, to further understand and demystify this state-of-the-art algorithm. Our success rate analysis reveals how ProbDD effectively addresses bottlenecks that slow down ddmin by skipping inefficient queries that attempt to delete complements of subsets and previously tried subsets. The ablation study illustrates that randomness in ProbDD has no significant impact on efficiency.

Based on these findings, we propose CDD, a simplified version of ProbDD, which reduces the complexity in both theory and implementation. CDD assists in validating the correctness of our key findings, such as the role of probabilities in ProbDD serving as monotonically increasing counters for each element, and in identifying the main factors that contribute to ProbDD’s superior performance. Comprehensive evaluations across 76 benchmarks in test input minimization and software debloating demonstrate that CDD can achieve the same performance as ProbDD, despite its simplification. These insights provide valuable guidance for future research and applications of test input minimization algorithms.

Index Terms:
Program Reduction, Delta Debugging, Software Debloating, Test Input Minimization

I Introduction

Delta Debugging [1] is a seminal family of algorithms designed for software debugging, among which ddmin stands out as a classic test input minimization (a.k.a. test input reduction) algorithm. Given a list $L$ of elements (modeling the test input) and a property $\psi$ that $L$ exhibits, ddmin aims to remove elements in $L$ that are irrelevant to $\psi$, such that the resulting list is smaller than $L$ yet still satisfies $\psi$. The ddmin algorithm plays a crucial role in software testing, debugging and maintenance [2, 3, 4, 5, 6], since compact, informative bug-triggering inputs make it easier for developers to identify root causes than large bug-triggering inputs cluttered with bug-irrelevant information [7].

To minimize a test input $I$ that satisfies $\psi$, ddmin has been used in two primary ways. In the first, $I$ is initially segmented into a list, denoted as $L$, based on characters, tokens, lines, etc. Subsequently, ddmin is directly applied to $L$ [1, 8]. Alternatively, ddmin serves as a pivotal component within advanced, structure-aware test input minimization algorithms, including Perses [9], HDD [10], C-Reduce [12], and Chisel [13]. These algorithms leverage the inherent structures of $I$ to expedite the minimization process or further reduce its size. Generally, these algorithms begin by parsing $I$ into a tree structure, such as a parse tree. They then iteratively extract a list $L$ of tree nodes from the tree using heuristics and apply ddmin to $L$ to gradually condense the tree. Both ways underscore the fundamental role of ddmin as the cornerstone of test input minimization.

In the past years, different variants of ddmin have been proposed to improve its performance [13, 14, 15, 16], among which Probabilistic Delta Debugging (ProbDD) [14] is the state of the art, with notable superiority to other algorithms [1, 13]. When reducing $L$, ProbDD utilizes a theoretical probabilistic model based on Bayesian optimization to predict how likely every element in $L$ is essential to preserve the property $\psi$, by assigning a probability to each element. ProbDD prioritizes deleting elements with lower probabilities, as such elements generally have a lower possibility of being $\psi$-relevant. Before each deletion attempt, an optimal subset of elements is determined by maximizing the Expected Reduction Gain.¹ If the deletion of this subset fails to preserve $\psi$, the probabilistic model increases the probability assigned to each element in the subset. As reported [17], aided by such a probabilistic model, ProbDD significantly outperforms ddmin by reducing the execution time and the query number.²

¹ In each attempt, the Expected Reduction Gain is defined as the expected number of elements removed. Higher Expected Reduction Gain is preferred, as it indicates an expectation to delete more elements through this attempt.
² A query is a run of the property test $\psi$.

However, this probabilistic model in ProbDD is rather intricate, and the underlying mechanisms for its superior performance have not been adequately studied. The original paper of ProbDD merely showed its performance numbers without deep ablation analysis on such achievements. Specifically, the following questions are important to the research field of test input minimization, but have not been answered yet.

  1. What is the fundamental role of probabilities in ProbDD, and can they be simplified without impacting performance?

  2. What specific bottlenecks does ProbDD overcome to achieve improvement compared to ddmin?

  3. How does randomness in ProbDD contribute to the performance improvement?

  4. What are the potential limitations of ProbDD?

Gaining a deeper understanding of the state of the art, i.e., ProbDD, is highly valuable for test input minimization tasks. By clarifying the intrinsic reasons behind its superiority, we can help researchers understand the essence of the probabilistic model, as well as its strengths and limitations. Such demystification, in our view, paves the way for future research and guides users to apply ddmin and its variants more effectively for test input minimization.

To this end, we conduct the first in-depth analysis of ProbDD, starting by theoretically simplifying its probabilistic model. In the original ProbDD, probabilities are used to calculate the Expected Reduction Gain, which is subsequently used to determine the next subset size. However, this process necessitates iterative calculations, impeding the simplification and comprehension of ProbDD. In our study, we initially establish the analytical correlation between the probability and subset size, allowing for probabilities and subset sizes to be explicitly calculated through formulas, thus eliminating the need for iterative updates. Further, through mathematical derivation, we discover that the probability and subset size can be considered nearly independent, each varying at an approximate ratio on their own. By theoretical prediction, the probability increases approximately by a factor of $\frac{1}{1-e^{-1}}$ ($\approx 1.582$), while the subset size decreases by a factor of $1-e^{-1}$ ($\approx 0.632$), thus providing the potential for simplifying ProbDD.

Building upon our theoretical analysis, we conducted extensive evaluations of ddmin, ProbDD, and CDD across 76 diverse benchmarks. The experimental results confirm the correctness of our theoretical analysis, demonstrate how ProbDD addresses bottlenecks in ddmin by skipping inefficient queries, reveal the impact of randomness on results, and highlight the limitations of ProbDD. These findings provide valuable guidance for future research and the development of test input minimization algorithms.

Based on the aforementioned analysis, we propose Counter-Based Delta Debugging (CDD), a simplified version of ProbDD, to explain ProbDD’s high performance. By replacing probabilities with counters, CDD eliminates the probability computations required by ProbDD, thus reducing theoretical and implementation complexity. Our experiments demonstrate that CDD aligns with ProbDD in both effectiveness and efficiency, which validates our previous analysis and findings.

Key Findings.  Through both theoretical analysis and empirical experiments, our key findings are:

  1. Theoretical derivation shows that the probabilities in ProbDD essentially serve as monotonically increasing counters and can be simplified. This suggests that the probability mechanism itself may not be a critical factor in ProbDD’s superior performance.

  2. The performance bottlenecks addressed by ProbDD are inefficient deletion attempts on complements of subsets and on previously tried subsets, which should be considered to enhance efficiency.

  3. Randomness in ProbDD has no significant impact on the performance. Test input minimization is an NP-complete problem, and randomness in ProbDD does not enhance the likelihood of finding optimal solutions.

  4. ProbDD is faster than ddmin, but at the cost of not guaranteeing 1-minimality.³ The trade-off between effectiveness and efficiency is inevitable, and should be leveraged accordingly in different scenarios.

³ A list is considered to have 1-minimality if removing any single element from it results in the loss of its property.

Contributions.  We make the following major contributions.

  • We perform the first in-depth theoretical analysis for ProbDD, the state-of-the-art algorithm in test input minimization tasks, and identify the latent correlation between the subset size and the probability of elements.

  • We propose CDD, a much simplified version of ProbDD.

  • We evaluate ddmin, ProbDD and CDD on 76 benchmarks, validating the correctness of our theoretical analysis. Additional experiments and statistical analysis on ProbDD further explain its superior performance, and reveal the effect of randomness and the limitations of ProbDD.

Paper Organization.  The remainder of the paper is structured as follows: Section II introduces the symbols used in this study and the detailed workflow of ddmin and ProbDD. Section III and Section IV present our in-depth analysis on ProbDD, simplifying the model of probability and subset size. Section V describes empirical experiments and their results, from which additional findings are derived. Section VI introduces CDD, which simplifies ProbDD based on our earlier findings while maintaining equivalent performance. Section VII discusses related work and Section VIII concludes this study.

II Preliminaries

To facilitate comprehension, Table I lists all the important symbols used in this paper. Next, this section introduces ddmin and ProbDD, with the running example shown in Fig. 1(a).

TABLE I: The symbols used in this paper.
Symbol     Description                  Symbol    Description
$L$        the list to minimize         $s$       the size of $S$
$\psi$     the property to preserve     $E(s)$    Expected Reduction Gain with the first $s$ elements
$l_i$      the $i$-th element of $L$    $e$       Euler’s number
$l_i.p$    the probability of $l_i$     $r$       the round number
$v_i$      a variant of $L$             $s_r$     the subset size in round $r$
$S$        a subset of $L$              $p_r$     the probability of each element in round $r$
$l_1$: import math, sys
$l_2$: input = sys.argv[1]
$l_3$: a = int(input)
$l_4$: b = math.e
$l_5$: c = 3
$l_6$: d = pow(b, a) + c
$l_7$: c = math.log(d, b)
$l_8$: crash(c)
(a) Original.

$l_1$: import math, sys
$l_2$: input = sys.argv[1]
$l_3$: a = int(input)
$l_4$: b = math.e
$l_5$: c = 3                  [removed]
$l_6$: d = pow(b, a) + c
$l_7$: c = math.log(d, b)
$l_8$: crash(c)
(b) By ddmin.

$l_1$: import math, sys       [removed]
$l_2$: input = sys.argv[1]    [removed]
$l_3$: a = int(input)         [removed]
$l_4$: b = math.e             [removed]
$l_5$: c = 3
$l_6$: d = pow(b, a) + c      [removed]
$l_7$: c = math.log(d, b)     [removed]
$l_8$: crash(c)
(c) By ProbDD.

Figure 1: A running example in Python. Fig. 1(a) shows the original program, represented as a list of 8 elements ($l_1$, $l_2$, $\cdots$, $l_8$), in which $l_8$ (i.e., crash(c)) triggers the crash. Fig. 1(b) and Fig. 1(c) show the minimized results by ddmin and ProbDD, with removed elements marked [removed] (masked in gray in the original figure). Both minimized programs still trigger the crash. Note that ProbDD cannot consistently guarantee the result in Fig. 1(c) and might produce larger results, due to its inherent randomness.

          

TABLE II: Step-by-step outcomes from ddmin on the running example. In each column, a variant is generated and tested against the property $\psi$. These variants are sequentially generated from left to right. The first row displays the variant identifier, and the round number $r$ and subset size $s$ head each group of variants. In the following rows, the symbol “✓” denotes that an element is included in a variant, while “-” (a gray cell in the original table) signifies that the element has been removed. For the last row, T indicates that the variant still preserves the property $\psi$, whereas F indicates that it does not.

Round $r=1$ ($s$=4)
Element  v1  v2
$l_1$    ✓   -
$l_2$    ✓   -
$l_3$    ✓   -
$l_4$    ✓   -
$l_5$    -   ✓
$l_6$    -   ✓
$l_7$    -   ✓
$l_8$    -   ✓
$\psi$   F   F

Round $r=2$ ($s$=2)
Element  v3  v4  v5  v6  v7  v8  v9  v10
$l_1$    ✓   -   -   -   -   ✓   ✓   ✓
$l_2$    ✓   -   -   -   -   ✓   ✓   ✓
$l_3$    -   ✓   -   -   ✓   -   ✓   ✓
$l_4$    -   ✓   -   -   ✓   -   ✓   ✓
$l_5$    -   -   ✓   -   ✓   ✓   -   ✓
$l_6$    -   -   ✓   -   ✓   ✓   -   ✓
$l_7$    -   -   -   ✓   ✓   ✓   ✓   -
$l_8$    -   -   -   ✓   ✓   ✓   ✓   -
$\psi$   F   F   F   F   F   F   F   F

Round $r=3$ ($s$=1)
Element  v11 v12 v13 v14 v15 v16 v17 v18 v19 v20 v21 v22 v23 v24 v25 v26 v27 v28 v29 v30
$l_1$    ✓   -   -   -   -   -   -   -   -   ✓   ✓   ✓   ✓   -   ✓   ✓   ✓   ✓   ✓   ✓
$l_2$    -   ✓   -   -   -   -   -   -   ✓   -   ✓   ✓   ✓   ✓   -   ✓   ✓   ✓   ✓   ✓
$l_3$    -   -   ✓   -   -   -   -   -   ✓   ✓   -   ✓   ✓   ✓   ✓   -   ✓   ✓   ✓   ✓
$l_4$    -   -   -   ✓   -   -   -   -   ✓   ✓   ✓   -   ✓   ✓   ✓   ✓   -   ✓   ✓   ✓
$l_5$    -   -   -   -   ✓   -   -   -   ✓   ✓   ✓   ✓   -   -   -   -   -   -   -   -
$l_6$    -   -   -   -   -   ✓   -   -   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   -   ✓   ✓
$l_7$    -   -   -   -   -   -   ✓   -   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   -   ✓
$l_8$    -   -   -   -   -   -   -   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   ✓   -
$\psi$   F   F   F   F   F   F   F   F   F   F   F   F   T   F   F   F   F   F   F   F

II-A The ddmin Algorithm

The ddmin algorithm [1] is the first algorithm to systematically minimize a bug-triggering input to its essence. It takes the following two inputs:

  • $L$: a list of elements representing a bug-triggering input. For example, $L$ can be a list of bytes, characters, lines, tokens, or parse tree nodes extracted from the bug-triggering input.

  • $\psi$: a property that $L$ has. Formally, $\psi$ can be defined as a predicate that returns T if a list of elements preserves the property, F otherwise.

It returns a minimal subset of $L$ that still preserves $\psi$, i.e., a subset from which excluding any single element loses $\psi$. This algorithm has been widely used in practice to help developers debug [12, 9, 8, 24]. It generally consists of the following three steps.

Initialize.  Start by setting the initial subset size $s$ to half the length of the input list $L$, i.e., $s = |L|/2$.

Step 1: Minimize to Subset.  Divide the input list $L$ into subsets, each with the current subset size. For each subset $S$, check whether $S$ alone satisfies $\psi$. If yes, keep only $S$ and restart from Step 1 with $L = S$ and the subset size as half of the new $L$; otherwise go to Step 2.

Step 2: Minimize to Complement.  Test whether the complement of each subset $S$ (i.e., $L/S = \{e \mid e \in L \wedge e \notin S\}$) satisfies $\psi$. If yes, keep the complement of $S$ and restart from Step 2 with $L = L/S$. Otherwise, go to Step 3.

Step 3: Subdivide.  If any of the remaining subsets has at least two elements and thus can be further divided, halve the subset size, i.e., $s = s/2$, and go back to Step 1. If no subset can be further divided (i.e., the subset size is 1), ddmin terminates and returns the remaining elements as the result.

Round Number $r$.  Note that we introduce a round number $r$ at the second column of Table II. Within each round, the list $L$ is divided into subsets of a fixed size, on which Step 1 and Step 2 are applied. A new round begins when no further progress can be made with the current subset size. This round number is not explicitly present in the original ddmin algorithm but exists implicitly. In subsequent sections, we will also use this concept to introduce and simplify the ProbDD algorithm.

Table II illustrates the step-by-step minimization process of ddmin with the running example in Fig. 1(a). Initially, the input $L$ is [$l_1$, $l_2$, $\cdots$, $l_8$]. The ddmin algorithm iteratively generates variants by gradually decreasing the size of subsets from 4 to 2 to 1.

  1. Round 1 ($s$=4). At the beginning, ddmin splits $L$ into two subsets and generates two variants v1 and v2. However, neither of them preserves $\psi$.

  2. Round 2 ($s$=2). Next, ddmin continues to subdivide these two subsets into smaller ones, and generates eight variants (i.e., v3, v4, $\cdots$, v10) by using these subsets and their complements. Specifically, the first four variants (v3, v4, v5, v6) are the subsets, and the next four variants (v7, v8, v9, v10) are the complements of these subsets. Again, none of these eight variants preserves $\psi$.

  3. Round 3 ($s$=1). Finally, ddmin decreases subset size $s$ from 2 to 1, and generates more variants. This time, v23, which is the complement of the subset {$l_5$}, preserves $\psi$. Hence, the subset {$l_5$} is permanently removed from $L$. Then for each of the remaining subsets {$l_1$}, {$l_2$}, $\cdots$, {$l_8$}, ddmin restarts testing the complement of each subset, i.e., from v24 to v30. However, none of these variants preserves $\psi$, and no subset can be further divided, so ddmin terminates with the variant v23 as the final result.
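The three steps of ddmin described in this subsection can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the reference implementation of ddmin: elements are assumed to be distinct, the names `ddmin`, `psi`, `required`, and `outcomes` are chosen here for illustration, and the predicate is a hypothetical stand-in that merely reproduces the T/F row of Table II (only the variant without $l_5$ passes). Following the description above, a successful complement test restarts Step 2 only, and complements are skipped when there are just two subsets, because each complement then equals the other subset.

```python
def ddmin(items, psi):
    """Sketch of ddmin: shrink `items` while psi(items) stays True."""
    items = list(items)
    size = len(items) // 2                      # Initialize: s = |L| / 2
    while size >= 1:
        chunks = [items[i:i + size] for i in range(0, len(items), size)]

        # Step 1: minimize to subset (stop at the first subset that passes).
        hit = next((c for c in chunks if psi(c)), None)
        if hit is not None:
            items, size = hit, len(hit) // 2
            continue

        # Step 2: minimize to complement; restart Step 2 after each success.
        progressed = True
        while progressed and len(chunks) > 2:
            progressed = False
            for chunk in chunks:
                rest = [x for x in items if x not in chunk]
                if psi(rest):
                    items = rest
                    chunks = [c for c in chunks if c is not chunk]
                    progressed = True
                    break

        # Step 3: subdivide; the loop ends once the subset size was 1.
        size //= 2
    return items


# Hypothetical predicate mirroring Table II: every element except l5 is needed.
required = {1, 2, 3, 4, 6, 7, 8}
outcomes = []                                   # one entry per query


def psi(variant):
    ok = required <= set(variant)
    outcomes.append(ok)
    return ok


result = ddmin(range(1, 9), psi)                # elements l1..l8 as 1..8
```

On the running example, this sketch issues 30 queries, of which only the 23rd succeeds, which matches v1 to v30 in Table II.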

II-B Probabilistic Delta Debugging (ProbDD)

Wang et al. [14] proposed the state-of-the-art algorithm ProbDD, significantly surpassing ddmin in minimizing bug-triggering programs on C compilers and benchmarks in software debloating. ProbDD employs Bayesian optimization [18] to model the minimization problem. ProbDD assigns a probability to each element in $L$, representing its likelihood of being essential for preserving the property $\psi$. At each step during the minimization process, ProbDD selects a subset of elements expected to yield the highest Expected Reduction Gain, and targets these elements in the subset for deletion. In this section, we outline ProbDD’s workflow in Algorithm 1, paving the way for a deeper understanding and analysis of ProbDD.

Algorithm 1: ProbDD($L$, $\psi$)

Input: $L$: a list to be minimized.
Input: $\psi: \mathbb{L} \rightarrow \mathbb{B}$: the property to be preserved by $L$.
Input: $p_0$: the initial probability given by the user.
Output: the minimized list that still exhibits the property $\psi$.

// Initialize the probability of each element with $p_0$
foreach $l \in L$ do $l.p \leftarrow p_0$
// The round number $r$, initially 0. $r$ is not explicitly used in the original ProbDD algorithm. It is displayed for demonstrating ProbDD’s implicit principles.
$r \leftarrow 0$
while $\exists\, l \in L : l.p < 1$ do
    // Select elements from $L$ for deletion attempt.
    $S \leftarrow$ SelectSubset($L$)
    // Check if removing the subset preserves the property
    temp $\leftarrow L \setminus S$
    if $\psi$(temp) = T then $L \leftarrow$ temp
    else
        // Calculate the factor to update probabilities
        factor $\leftarrow \frac{1}{1 - \prod_{l \in S} (1 - l.p)}$
        // Update the probabilities of elements in the subset
        foreach $l \in S$ do $l.p \leftarrow$ factor $\times\, l.p$
    if all elements’ probabilities have been updated then
        /* Move to the next round. */
        $r \leftarrow r + 1$
return $L$

Function SelectSubset($L$):
    Input: $L$: a list of elements to be reduced.
    Output: The subset of elements that maximizes the Expected Reduction Gain.
    /* Sort $L$ by ascending probability, with elements having the same probability in random order. */
    sortedL $\leftarrow$ RandomizeThenSort($L$)
    $S \leftarrow \emptyset$
    currentMaxGain $\leftarrow 0$
    foreach $l \in$ sortedL do
        tempSubset $\leftarrow S \cup \{l\}$
        gain $\leftarrow$ |tempSubset| $\times \prod_{l \in \text{tempSubset}} (1 - l.p)$
        if gain > currentMaxGain then
            currentMaxGain $\leftarrow$ gain
            $S \leftarrow$ tempSubset
        else break
    return $S$

Initialize (Algorithm 1).  ProbDD assigns each element in $L$ an initial probability $p_0$, representing the prior likelihood that each element cannot be removed.

Step 1: Select elements (function SelectSubset in Algorithm 1).  First, ProbDD sorts the elements in $L$ by probability in ascending order, and the order of elements with the same probability is determined randomly. Then, it calculates the subset to be removed in the next attempt via the proposed Expected Reduction Gain $E(s)$, as shown in Equation 1, with $E(s)$ denoting the expected gain obtained via removing the first $s$ elements in $L$ selected for deletion, and $l_i.p$ denoting the current probability of the $i$-th element in $L$.

$$E(s) = s \times \prod_{i=1}^{s} (1 - l_i.p) \tag{1}$$

Note that ProbDD has an invariant that the subset $S$ chosen for deletion attempt is always the first $s$ elements in $L$. Every time, the first $s^*$ elements are selected as the optimal subset $S$, where $s^*$ maximizes the Expected Reduction Gain $E(s)$, as stated in Equation 2.

$$s^* = \operatorname*{arg\,max}_{s} E(s) \tag{2}$$

Step 2: Delete the Subset (Algorithm 1).  If $\psi$ is still preserved after the removal of $S$, ProbDD removes subset $S$, i.e., keeps only the complement of $S$, and proceeds to Step 1. If $\psi$ cannot be preserved after the removal, ProbDD updates the probability of each element in the subset $S$ via Equation 3, and resumes at Step 1. It is important to note that if an element $l_i$ has been individually deleted but failed, its probability $l_i.p$ will be set to 1, indicating that this element cannot be removed and will no longer be considered for deletion.

$$l_i.p \leftarrow \frac{l_i.p}{1 - \prod_{l \in S} (1 - l.p)} \tag{3}$$

Step 3: Check Termination (Algorithm 1).  If every element either has been deleted or possesses a probability of 1, ProbDD terminates. If not, it returns to Step 1.

Round Number $r$.  Similar to the concept of rounds in ddmin (see Table II), ProbDD also has an implicit round number $r$, as introduced in Algorithm 1 and the second row of Table III. During a round, the subset size is the same and every subset in $L$ is attempted for deletion. Once the probabilities of all elements have been updated, the next round begins (i.e., $r \leftarrow r + 1$ in Algorithm 1).
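The workflow of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation, and it rests on several assumptions: elements are distinct and hashable; the names `select_subset`, `probdd`, `seed`, and `keeps_l5_and_l8` are chosen here; a failed single-element deletion sets the probability to exactly 1, as the text of Step 2 states; and ties in the gain are broken toward the larger subset. The last assumption departs from the strict `>` in Algorithm 1: with $p_0 = 0.25$, $E(3) = 3 \times 0.75^3$ and $E(4) = 4 \times 0.75^4$ are exactly equal, so a strict comparison would stop at 3 elements, whereas the running example uses subsets of size 4.

```python
import math
import random


def select_subset(items, prob, rng):
    """Prefix of the probability-sorted list that maximizes E(s) (Equations 1-2)."""
    order = list(items)
    rng.shuffle(order)                       # random order among equal probabilities
    order.sort(key=lambda x: prob[x])        # stable sort keeps the shuffled tie order
    best, best_gain, keep = [], 0.0, 1.0
    for k, x in enumerate(order, start=1):
        keep *= 1.0 - prob[x]                # product of (1 - l.p) over the prefix
        gain = k * keep
        if gain > 0 and gain >= best_gain:   # assumption: ties go to the larger subset
            best, best_gain = order[:k], gain
        else:
            break
    return best


def probdd(items, psi, p0=0.25, seed=0):
    """Sketch of ProbDD: delete low-probability subsets until every p equals 1."""
    rng = random.Random(seed)
    items = list(items)
    prob = {x: p0 for x in items}
    while any(prob[x] < 1.0 for x in items):
        subset = select_subset(items, prob, rng)
        rest = [x for x in items if x not in subset]
        if psi(rest):
            items = rest                     # deletion succeeded
        elif len(subset) == 1:
            prob[subset[0]] = 1.0            # single element proven non-removable
        else:
            factor = 1.0 / (1.0 - math.prod(1.0 - prob[x] for x in subset))
            for x in subset:                 # Equation 3, clamped for float safety
                prob[x] = min(1.0, factor * prob[x])
    return items


# Hypothetical predicate: only l5 and l8 are needed to preserve the property.
def keeps_l5_and_l8(variant):
    return {5, 8} <= set(variant)


result = probdd(range(1, 9), keeps_l5_and_l8, p0=0.25, seed=0)
```

With this monotone stand-in predicate the sketch ends with the elements 5 and 8 for any seed, although the number of queries depends on the random tie order. For a non-monotone property the result may differ between seeds.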

TABLE III: Step-by-step outcomes from ProbDD on the running example. Similar to Table II, the round number, subset size and the details of each variant are presented. For each variant, the probability of each element is noted alongside. A blank cell means that the element has already been removed.

           Initial   $r=1$ ($s$=4)   $r=2$ ($s$=2)   $r=3$ ($s$=1)
Element    Prob      v1     v2       v3     v4       v5     v6     v7     v8
$l_1$      0.25      0.37   0.37     0.61   0.61     0.61   0.61
$l_2$      0.25      0.25
$l_3$      0.25      0.25
$l_4$      0.25      0.37   0.37     0.37   0.61
$l_5$      0.25      0.37   0.37     0.61   0.61     0.61   1      1      1
$l_6$      0.25      0.25
$l_7$      0.25      0.25
$l_8$      0.25      0.37   0.37     0.37   0.61     0.61   0.61   0.61   1
$\psi$               F      T        F      F        T      F      T      F

Table III illustrates the step-by-step results of ProbDD. Following the study of ProbDD [14], the initial probability p0subscript𝑝0p_{0}italic_p start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is set to 0.25, resulting in subsets with a size of 4 as per Equation 2.

  1. 1.

    Round 1 (s𝑠sitalic_s=4). Similar to the example in the original paper of ProbDD [14], we assume ProbDD selects (l1subscript𝑙1{l}_{1}italic_l start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, l4subscript𝑙4{l}_{4}italic_l start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT, l5subscript𝑙5{l}_{5}italic_l start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT, l8subscript𝑙8{l}_{8}italic_l start_POSTSUBSCRIPT 8 end_POSTSUBSCRIPT) to delete due to the randomness, thus resulting the variant v1. However, v1 fails to exhibit ψぷさい𝜓\psiitalic_ψぷさい, leading to the probability of these selected elements being updated from 0.25 to 0.251(10.25)40.370.251superscript10.2540.37\frac{0.25}{1-(1-0.25)^{4}}\approx 0.37divide start_ARG 0.25 end_ARG start_ARG 1 - ( 1 - 0.25 ) start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT end_ARG ≈ 0.37, based on Equation 3. Next, the remaining elements with lower probability, i.e., (l2subscript𝑙2{l}_{2}italic_l start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, l3subscript𝑙3{l}_{3}italic_l start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT, l6subscript𝑙6{l}_{6}italic_l start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT, l7subscript𝑙7{l}_{7}italic_l start_POSTSUBSCRIPT 7 end_POSTSUBSCRIPT), are prioritized and selected for deletion, resulting in v2. This time, the property test passes and these elements are removed.

  2. Round 2 ($s=2$). Given that the probabilities of all remaining elements are now 0.37, the next subset size becomes 2. ProbDD then attempts to remove the subset ($l_1$, $l_5$) in v3 and the subset ($l_4$, $l_8$) in v4, but neither subset can be removed. After these two failed attempts, all probabilities are updated to $\frac{0.37}{1-(1-0.37)^{2}} \approx 0.61$.

  3. Round 3 ($s=1$). Finally, the subset size becomes 1, so each element is selected for removal individually. The elements $l_4$ and $l_1$ are removed in v5 and v7, respectively, while $l_5$ and $l_8$ are verified as non-removable and are thus returned as the final result.
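The probability updates in the three rounds above can be retraced with a few lines of Python. This is a minimal sketch, assuming that all elements in the failed subset share the same probability, so that the update of Equation 3 reduces to $p / (1 - (1-p)^{s})$; the function name `update_probability` is chosen here for illustration only.

```python
def update_probability(p: float, subset_size: int) -> float:
    """Probability of an element after a failed attempt to delete a subset
    of `subset_size` elements that all share the same probability `p`."""
    return p / (1.0 - (1.0 - p) ** subset_size)


p0 = 0.25
p1 = update_probability(p0, 4)  # after the failed variant v1 (s = 4)
p2 = update_probability(p1, 2)  # after the failed variants v3 and v4 (s = 2)
p3 = update_probability(p2, 1)  # after a failed single-element deletion (s = 1)

print(round(p1, 2), round(p2, 2), round(p3, 2))
```

The rounded values match the entries 0.37, 0.61 and 1 in Table III. Once a single-element deletion fails, the probability becomes 1, i.e., the element is treated as non-removable.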

III On the Probabilities in ProbDD

Beginning with this section, we systematically present our findings. Each finding is introduced by first stating the result, followed by a comprehensive explanation. Regarding the theory, we only state the lemmas and theorems here; the corresponding proofs are presented in the supplementary materials. In this section, we theoretically analyze the trend of probability changes across rounds.

Finding 1: The probability assigned to each element increases monotonically with the round number $r$, by a factor of approximately 1.582. Essentially, the probability for each element can be expressed as a function of $r$ and $p_0$, i.e., $p_r \approx 1.582^{r} \times p_0$.

An Illustrative Example.  The running example illustrated in Table III leads to this finding. Observation reveals that after each element has been attempted for deletion once, i.e., completing one round, the probabilities of all remaining elements are updated. The initial probability is 0.25; after v2, it changes to 0.37; following v4, it increases to 0.61; and by the end of v8, it reaches 1. Consequently, we hypothesize that with each deletion attempt, the probability approximately increases in a predictable manner. Through appropriate simplification, we can theoretically model this trend, and thereby model the entire progression of probability changes.

III-A Assumption for Theoretical Analysis

Besides the above observation from a concrete example, theoretical analysis is necessary. To refine the mathematical model of ProbDD for easier representation, analysis and derivation, we assume that the number of elements in $L$ is always divisible by the subset size. With this assumption, the probability of each element is updated in the same manner; as a result, before and after each round, the probabilities of all elements are always the same, as stated in Lemma III.1. This assumption is often applicable in practice. For instance, in the running example in Table III, before each round, the probabilities associated with the remaining elements are identical, ensuring that all subsets are of identical size. Furthermore, the probabilities of all elements are updated to the same next value after the round.

Lemma III.1.

If the number of elements in $L$ is always divisible by the subset size, the probabilities of all elements are always the same.

While it is not always possible for the number of elements to be divisible by the subset size, the elements will still be partitioned as evenly as possible. However, such indivisibilities make the theoretical simplification of ProbDD nearly impossible. Based on our observation when running ProbDD, being slightly uneven during partitioning does not significantly affect probability updates. Moreover, we will demonstrate that the simplified algorithm derived from this assumption has no significant difference from ProbDD in section VI, via thorough experimental evaluation.

III-B Probability vs. Subset Size Correlation

In the second step, we derive the correlation between probability and subset size. Based on the assumption in the previous step, the probability of each element is identical and represented as $p_r$ in round $r$; thus the formula of Expected Reduction Gain from Equation 1 can be simplified to

$$E(s) = s \times (1-p_r)^{s} \tag{4}$$

Given the probability $p_r$ of elements in round $r$, $s_r$ can be derived through gradient-based optimization, i.e., by solving $E'(s_r)=0$. Therefore, the optimal size $s_r$ that maximizes $E(s)$ is $-\frac{1}{\ln(1-p_r)}$. Subsequently, we can also deduce the next probability to be $p_{r+1}=\frac{p_r}{1-(1-p_r)^{s_r}}$. In summary, the correlation between probability and subset size can be simplified as Equation 5 and Equation 6, in which the subset size $s_r$ is determined by the probability $p_r$, and the probability $p_{r+1}$ in the next round is determined by both $p_r$ and $s_r$.

$$s_r = -\frac{1}{\ln(1-p_r)} \tag{5}$$
$$p_{r+1} = \frac{p_r}{1-(1-p_r)^{s_r}} \tag{6}$$
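For readers who want the intermediate step, the first-order condition behind Equation 5 can be written out explicitly. The following is a short worked derivation from Equation 4, under the assumption (made only for this sketch) that $s$ is treated as a continuous variable and $0 < p_r < 1$:

```latex
\begin{align*}
E(s)  &= s\,(1-p_r)^{s}, \\
E'(s) &= (1-p_r)^{s} + s\,(1-p_r)^{s}\ln(1-p_r)
       = (1-p_r)^{s}\bigl(1 + s\ln(1-p_r)\bigr), \\
E'(s_r) = 0
      &\;\Longrightarrow\; 1 + s_r\ln(1-p_r) = 0
       \;\Longrightarrow\; s_r = -\frac{1}{\ln(1-p_r)}.
\end{align*}
```

Since $(1-p_r)^{s} > 0$ and $\ln(1-p_r) < 0$, $E'(s)$ is positive for $s < s_r$ and negative for $s > s_r$, so this stationary point is indeed a maximum.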

III-C Trend of Probability Changes

Through Equation 6, $p_{r+1} > p_r$ always holds, indicating a monotonic increase of the probability of elements. However, there is still room for simplification, as $s_r$ can be represented by $p_r$, implying that $p_{r+1}$ can be represented solely by $p_r$.

Lemma III.2.

In each round, $p$ is increased by a factor of $\frac{1}{1-e^{-1}}$, i.e.,

$$p_r = \frac{p_{r-1}}{1-e^{-1}} = \frac{p_0}{(1-e^{-1})^{r}} \tag{7}$$

Therefore, through empirical observations on the running example, coupled with theoretical derivation and simplification, we have identified the pattern of probability changes w.r.t. the round number $r$, i.e., $p_r = \frac{p_0}{(1-e^{-1})^{r}} \approx 1.582^{r} \times p_0$.
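The constant growth factor can also be checked numerically by iterating Equation 5 and Equation 6 directly. The sketch below assumes a real-valued subset size (no rounding to integers) and an arbitrary starting probability of 0.1; the helper names are illustrative and not taken from any ProbDD implementation.

```python
import math
from typing import List

GROWTH = 1.0 / (1.0 - math.exp(-1.0))  # approximately 1.582


def optimal_size(p: float) -> float:
    """Equation 5: real-valued subset size that maximizes s * (1 - p)**s."""
    return -1.0 / math.log(1.0 - p)


def next_probability(p: float) -> float:
    """Equation 6, using the real-valued subset size from Equation 5."""
    s = optimal_size(p)
    return p / (1.0 - (1.0 - p) ** s)


def probability_trace(p0: float) -> List[float]:
    """Probabilities p_0, p_1, ... for as long as they stay below 1."""
    trace = [p0]
    while True:
        nxt = next_probability(trace[-1])
        if nxt >= 1.0:
            break
        trace.append(nxt)
    return trace


trace = probability_trace(0.1)
for r, p in enumerate(trace):
    # Left: iterated value; right: closed form p_0 * 1.582^r from Equation 7.
    print(r, round(p, 4), round(0.1 * GROWTH ** r, 4))
```

Because $(1-p_r)^{s_r} = e^{s_r \ln(1-p_r)} = e^{-1}$ for the real-valued $s_r$, the two printed columns agree up to floating-point error. In the running example the subset sizes are integers, so the observed ratios only approximate this factor.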

IV On the Size of Subsets in ProbDD

In this section, we theoretically analyze the trend of subset size changes across rounds.

Finding 2: The size of subsets in each round can be analytically pre-determined given only the initial probability $p_0$, i.e., $s_0 \approx -\frac{1}{\ln(1-p_0)}$. The size for each round decreases monotonically by a factor of $1-e^{-1}$ ($\approx 0.632$), i.e., the subset size $s_r$ after $r$ rounds is $s_r \approx 0.632^{r} \times s_0 = -\frac{1}{\ln(1-p_0)} \times 0.632^{r}$.

IV-A Demystifying How Subset Size Changes

Our previous finding shows that the probability can be approximately estimated from the current round number via a constant factor. Since the subset size is determined by the probability, we observe a similar pattern in the changes of the subset size across rounds.

Lemma IV.1.

$s_{r+1}$ can be expressed solely in terms of $s_r$:

$$s_{r+1} = \frac{1}{\ln\left(\frac{1-e^{-1}}{e^{-\frac{1}{s_r}}-e^{-1}}\right)} \tag{8}$$

Although $s_{r+1}$ depends solely on $s_r$, the trend of the subset size is still implicit and obscure. For a clearer approximation, we propose linear bounds on $s_{r+1}$ in terms of $s_r$.

Lemma IV.2.

The lower bound of $s_{r+1}$ w.r.t. $s_r$ is

$$s_{r+1} \geq (1-e^{-1})\,s_r - 1 \tag{9}$$

Lemma IV.3.

The upper bound of $s_{r+1}$ w.r.t. $s_r$ is

$$s_{r+1} \leq (1-e^{-1})\,s_r \tag{10}$$

Theorem IV.4.

The subset size $s$ is initialized by Equation 5, updated by Equation 8, and constrained by the two linear bounds in Equation 9 and Equation 10:

$$\left\{\begin{array}{ll} s_0 &= -\frac{1}{\ln(1-p_0)} \\ s_{r+1} &= \frac{1}{\ln\left(\frac{1-e^{-1}}{e^{-\frac{1}{s_r}}-e^{-1}}\right)} \\ s_{r+1} &\leq (1-e^{-1})\,s_r \\ s_{r+1} &\geq (1-e^{-1})\,s_r - 1 \end{array}\right. \tag{11}$$

Aided by these two bounds, we obtain a complete representation, Equation 11, to model the subset size $s$. It is worth noting that the size decreases approximately by a factor of $1-e^{-1} \approx 0.632$ per round, until reaching 1. In other words, the subset size $s_r$ after round $r$ is roughly $s_0 \times 0.632^{r}$, allowing the subset size to be analytically pre-determined, and thus providing the potential for simplifying ProbDD and leading to the proposal of CDD (see details in section VI).
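To make the schedule concrete, the sketch below iterates Equation 8 from the initial size in Equation 5 and compares each step against the linear bounds of Equation 9 and Equation 10, as well as against the approximation $s_0 \times 0.632^{r}$. It assumes real-valued sizes and an initial probability of 0.1; the function names are illustrative, and the iteration stops once the size drops to 1 or below, since Equation 8 is only defined for $s_r > 1$.

```python
import math
from typing import List

SHRINK = 1.0 - math.exp(-1.0)  # approximately 0.632


def initial_size(p0: float) -> float:
    """Equation 5 applied to the initial probability."""
    return -1.0 / math.log(1.0 - p0)


def next_size(s: float) -> float:
    """Equation 8; only defined for s > 1."""
    return 1.0 / math.log(SHRINK / (math.exp(-1.0 / s) - math.exp(-1.0)))


def size_schedule(p0: float) -> List[float]:
    """Real-valued subset sizes s_0, s_1, ... until the size is at most 1."""
    sizes = [initial_size(p0)]
    while sizes[-1] > 1.0:
        sizes.append(next_size(sizes[-1]))
    return sizes


sizes = size_schedule(0.1)
for r, s in enumerate(sizes):
    # Left: exact recurrence; right: the approximation s_0 * 0.632^r.
    print(r, round(s, 3), round(sizes[0] * SHRINK ** r, 3))

for prev, cur in zip(sizes, sizes[1:]):
    # Equation 9 (lower bound) and Equation 10 (upper bound).
    assert SHRINK * prev - 1.0 <= cur <= SHRINK * prev
```

The exact recurrence always stays within one element of the geometric approximation per step, which is why a pre-computed schedule is a reasonable substitute for tracking probabilities.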

V Empirical Experiments

In addition to the theoretical derivation above, we conduct an extensive experimental evaluation on ddmin and ProbDD to gain deeper insights and achieve further discoveries. Specifically, we reproduce the experiments on ddmin and ProbDD by Wang et al. [14], and then delve deeper into ProbDD, analyzing its randomness, the bottlenecks it overcomes, and its 1-minimality. Furthermore, we evaluate our proposed CDD (which will be presented in section VI), validating our previous theoretical analysis. Due to limited space, we present the results of both ProbDD and CDD together within this section, but this section primarily focuses on discussing ProbDD, while the next section will focus on CDD.

V-A Benchmarks

To extensively evaluate ddmin, ProbDD and CDD, we use the following three benchmark suites (76 benchmarks in total), covering various use scenarios of minimization algorithms.

  • Benchmark-C ($\text{BM}_{\text{C}}$): 20 large bug-triggering programs in the C language, each of which triggers a real-world compiler bug in either LLVM or GCC. The original size of the benchmarks ranges from 4,397 tokens to 212,259 tokens. This benchmark suite has been used to evaluate test input minimization work [9, 14, 19].

  • Benchmark-Debloat ($\text{BM}_{\text{DBT}}$): source programs of 10 command-line utilities. The original size of the benchmarks ranges from 34,801 tokens to 163,296 tokens. This benchmark suite was collected by Heo et al. [13] and used to evaluate software debloating techniques [13, 20, 21].

  • Benchmark-XML ($\text{BM}_{\text{XML}}$): 46 XML inputs triggering 8 unique bugs in Basex, a widely-used XML processing tool. The original size of the benchmarks ranges from 19,290 tokens to 20,750 tokens. This benchmark suite is generated via Xpress [22] and collected by the authors of this study, as the original XML dataset used in the ProbDD paper is not publicly available.

V-B Evaluation Metrics

We measure the following aspects as metrics.

Final Size.  This metric assesses the effectiveness of reduction. When reducing a list $L$ with a certain property $\psi$, a smaller final list is preferred, indicating that more irrelevant elements have been successfully eliminated. In all benchmark suites, the metric is measured by the number of tokens.

Execution Time.  The execution time of a minimization algorithm reflects its efficiency. A minimization algorithm taking less time is more desirable, and execution time is measured in seconds.

Query Number.  This metric further evaluates the algorithm’s efficiency. During the reduction process, each time a smaller variant is produced, the algorithm verifies whether this variant still preserves the property $\psi$; each such check is referred to as a query. Since queries consume time, a lower query number is favorable.

P-value.  We calculate the p-value via a paired t-test between every two algorithms, to investigate whether the performance differences are significant. A p-value below 0.05 denotes a significant distinction between the two groups of data. Otherwise, the observed difference lacks statistical significance.

V-C The Wrapping Frameworks

The ddmin algorithm and its variants usually serve as the fundamental algorithm. To apply them to a concrete scenario, an outer wrapping framework is generally needed to handle the structure of the input. In our evaluation, we choose the same wrapping frameworks as those used by the ProbDD paper. For tree-structured bug-triggering inputs, i.e., $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{XML}}$, we use Picireny 21.8 [23], an implementation of HDD [10]. Picireny parses such inputs into trees, and then invokes Picire 21.8 [24], an open-source Delta Debugging library with ddmin, ProbDD and CDD implemented, to reduce each level of the trees. For software debloating on $\text{BM}_{\text{DBT}}$, Chisel [13] is employed, in which ddmin, ProbDD and CDD are integrated.

All experiments are conducted on a server running Ubuntu 22.04.3 LTS, equipped with Intel Xeon Gold 6348 CPUs @ 2.60GHz, providing a total of 120 threads, and 4 TB of RAM. To ensure reproducibility, we release the source code and the configuration as Docker images. Each benchmark is reduced using a single thread. Following the ProbDD paper, we run each algorithm on each benchmark 5 times and report the geometric mean of the results.

V-D Reproduction Study of ProbDD

TABLE IV: The final size, execution time and query number of ddmin, ProbDD and CDD on all benchmark suites. The "-" indicates a timeout, and the results of unfinished benchmarks under each algorithm are highlighted in gray. Only benchmarks finished by all algorithms are considered to compute mean values, and we position them prominently at the top rows of each suite. Considering the limited space and the extensive number of $\text{BM}_{\text{XML}}$ benchmarks, all of which have been finished, we only present the average results.
Final size (#) Execution time (s) Query number
Benchmark Original size (#) ddmin ProbDD CDD ddmin ProbDD CDD ddmin ProbDD CDD
LLVM-22382 9,987 350 367 350 1,450 975 1,004 11,388 5,461 4,540
LLVM-23353 30,196 321 324 324 2,787 1,235 1,390 11,719 4,199 3,839
LLVM-25900 78,960 941 921 945 9,010 3,501 3,104 35,740 15,026 10,685
LLVM-27747 173,840 431 509 431 6,972 2,653 3,757 20,000 5,862 6,976
GCC-59903 57,581 1,185 1,206 753 7,752 3,469 3,727 47,698 15,396 11,648
GCC-61383 32,449 959 955 978 10,729 6,220 6,027 43,716 13,712 11,933
GCC-61917 85,359 882 923 902 4,993 2,780 3,472 31,414 12,037 12,485
GCC-65383 43,942 706 700 709 4,334 3,128 3,528 25,051 9,309 8,022
GCC-71626 4,397 184 184 184 111 104 124 1,608 1,196 1,119
LLVM-22704 184,444 95,930 788 790 - 10,762 9,218 14,312 15,973 12,230
LLVM-26760 209,577 15,123 498 498 - 5,010 4,995 17,749 9,530 8,256
LLVM-31259 48,799 1,033 1,035 1,051 - 8,350 6,915 28,192 13,210 11,331
GCC-64990 148,931 39,192 709 741 - 10,625 9,402 24,212 11,965 11,361
GCC-66186 47,481 1,012 1,008 1,013 - 8,494 9,920 37,682 13,025 15,768
LLVM-23309 33,310 1,532 1,270 1,286 - - 9,376 34,173 17,038 14,840
LLVM-27137 174,538 119,115 56,098 47,195 - - - 5,845 4,410 4,455
GCC-60116 75,224 17,598 2,797 1,894 - - - 19,061 10,391 9,007
GCC-66375 65,488 1,381 1,242 1,220 - - - 27,831 7,662 8,511
GCC-70127 154,816 26,613 1,119 1,068 - - - 18,213 7,659 6,692
GCC-70586 212,259 36,692 1,820 1,606 - - - 17,732 13,707 15,991
$\text{BM}_{\text{C}}$ Mean 64,599 566 582 541 3,333 1,819 2,018 18,483 7,276 6,483
mkdir-5.2.1 34,801 8,625 8,407 8,497 3,771 1,692 1,432 11,969 2,469 1,909
chown-8.2 43,869 9,765 9,178 9,190 - 6,057 5,321 25,446 7,108 5,448
rm-8.4 44,459 11,293 8,411 8,463 - 4,241 3,758 22,744 4,862 4,262
bzip2-1.0.5 70,530 37,941 37,506 37,510 - - - 5,959 1,349 1,032
date-8.21 53,442 38,696 20,768 21,109 - - - 46,241 9,446 8,538
grep-2.19 127,681 127,024 75,228 82,847 - - - 56,235 4,750 4,195
gzip-1.2.4 45,929 32,575 26,011 24,290 - - - 20,900 3,697 3,487
sort-8.16 88,068 81,544 44,171 48,353 - - - 43,189 1,972 1,864
tar-1.14 163,296 157,806 97,670 93,365 - - - 35,369 3,117 2,930
uniq-8.16 63,861 18,041 19,071 17,379 - - - 13,411 4,150 1,953
$\text{BM}_{\text{DBT}}$ Mean 65,151 8,625 8,407 8,497 3,771 1,692 1,432 11,969 2,469 1,909
$\text{BM}_{\text{XML}}$ Mean 20,190 56 57 55 840 639 660 453 293 281
All Mean 31,989 89 90 88 1,076 769 801 872 510 481

To comprehensively reproduce the results of ProbDD [14], we evaluate ddmin and ProbDD using three benchmark suites, containing a total of 76 benchmarks. Following the settings of ProbDD [14], we set the empirically estimated remaining rate as the initial probability $p_0$, specifically, 0.1 for $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{DBT}}$, and 2.5e-3 for $\text{BM}_{\text{XML}}$. As in the ProbDD paper, we employ three hours (10,800 seconds) as the timeout threshold. If the algorithm does not complete a benchmark within the time limit, the smallest result achieved and the corresponding query numbers are still recorded. Due to the significant differences between completed and uncompleted results, we only consider benchmarks completed by all algorithms when calculating averages and p-values. The detailed results are shown in Table IV.

Efficiency and Effectiveness.  Through our reproduction study, we find that the performance of ProbDD aligns with the results reported in the original paper, showing that ProbDD is significantly more efficient than ddmin. On benchmarks that can be completed by all algorithms, ProbDD requires 28.53% less time and 41.51% fewer queries, with p-values of 3.50e-05 and 2.82e-03, respectively. Moreover, we assess the effectiveness by measuring the sizes of the final minimized results. The effectiveness of ddmin and ProbDD varies across benchmarks, but neither algorithm consistently outperforms the other, as substantiated by a p-value of 0.71, which is much higher than 0.05.
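The reported savings can be related back to Table IV with a small calculation. The sketch below uses the rounded "All Mean" values for ddmin and ProbDD from that table (execution time 1,076 s vs. 769 s, query number 872 vs. 510); the helper `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


# "All Mean" row of Table IV: ddmin (baseline) vs. ProbDD.
time_saving = relative_reduction(1076, 769)
query_saving = relative_reduction(872, 510)

print(f"{time_saving:.2f}% less time, {query_saving:.2f}% fewer queries")
```

Even with the rounded means, the result agrees with the 28.53% and 41.51% stated above.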

V-E Impact of Randomness in ProbDD

Finding 3: Randomness has no significant impact on the performance of ProbDD.

In ProbDD, elements with different probabilities are sorted accordingly, while elements with the same probability are randomly shuffled. However, randomness alone intuitively does not ensure a higher probability of escaping local optima and the effect of this randomness on performance has not been thoroughly investigated.

To this end, we conduct an ablation study by removing such randomness, creating a variant called ProbDD-no-random. We evaluate this variant across all benchmarks. The results indicate that the randomness does not significantly impact performance. Specifically, in terms of final size, execution time, and query number, ProbDD-no-random achieves 90, 770, and 559 compared to 90, 769, and 510 of ProbDD, respectively. The p-values of 0.67, 0.95, and 0.75 indicate that the differences are not significant.

V-F Bottleneck ProbDD Overcomes

Finding 4: On tree-structured inputs, inefficient deletion attempts on complements and repeated attempts account for the bottlenecks of ddmin, which are overcome by ProbDD.

In the study of ProbDD, the authors demonstrate that ProbDD is more efficient than the baseline approach (ddmin) in tree-based reduction scenarios, where the inputs are parsed into tree representations before reduction. Therefore, to uncover the root cause of this superiority, we follow the same application scenario and analyze the behavior of ProbDD in reducing the tree-structured inputs.

Figure 2: Visualization of queries within ddmin, ProbDD and CDD, with panels (a) on $\text{BM}_{\text{C}}$, (b) on $\text{BM}_{\text{DBT}}$, and (c) on $\text{BM}_{\text{XML}}$. In ddmin, three types of queries are displayed via stacked bars, the height of which denotes the query number. Within each bar, the number of successful queries, total queries and the corresponding success rate are annotated.

To further understand why ProbDD is more efficient than ddmin, we conduct in-depth statistical analysis on the query number (number of deletion attempts). Intuitively, performance bottlenecks lie in those queries with low success rates, impairing ddmin’s efficiency. Existing studies [16, 15] also demonstrate the presence of queries with low success rates. Therefore, to qualitatively and quantitatively identify the exact bottlenecks impairing ddmin, we statistically analyze all the queries in ddmin and categorize them into three types:

  1. Complement: Queries attempting to remove the complement of a subset. According to the ddmin algorithm, given a subset (smaller than half of the list $L$), it attempts to remove either the subset or its complement. However, evidence [16] shows that keeping a small subset and removing its complement is not likely to succeed, especially on structured inputs like programs.

  2. Revisit: Queries attempting to remove a previously tried subset. After removing a subset, ddmin restarts the process from the first subset, leading to repeated deletion attempts on earlier subsets. Although the removal of one subset may allow another subset to become removable, such repetitions rarely succeed and thus offer limited improvement for the reduction [15].

  3. Other: All other queries.

In addition to categorizing queries in ddmin into the above types, we also calculate the success rate of each type, aiming to reveal the bottlenecks of ddmin. Fig. 2 illustrates the distribution of queries for all types within ddmin, as well as the query number for ProbDD. We only consider completed benchmarks on $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{XML}}$, as they reflect the distribution throughout the entire minimization process. However, only one benchmark is completed across all algorithms in $\text{BM}_{\text{DBT}}$. Therefore, for this benchmark suite, we include all benchmarks, including the unfinished ones, to ensure the results are statistically meaningful.

On both $\text{BM}_{\text{C}}$ and $\text{BM}_{\text{XML}}$, ddmin performs almost the same number of successful queries as ProbDD. Specifically, on $\text{BM}_{\text{C}}$, ddmin performs $74 + 3 + 3{,}614 = 3{,}691$ successful queries, close to the 3,633 successful queries of ProbDD. Similarly, on $\text{BM}_{\text{XML}}$, the number of successful queries of ddmin is $2 + 830 + 1{,}485 = 2{,}317$, differing only minimally from ProbDD’s 2,315 successful queries. On $\text{BM}_{\text{DBT}}$, however, the number of successful queries of ddmin is not close to that of ProbDD, as most benchmarks are not completed. Besides, ddmin always performs significantly more failed queries, resulting in a larger total query number and thus a longer execution time, as previously discussed in section V-D.

On all benchmark suites, a large portion of ddmin's queries is categorized as Complement and Revisit; however, they both have a very low success rate. For instance, on $\text{BM}_{\text{C}}$, out of a total of 220,563 queries, Complement and Revisit account for 119,363 (54.12%) and 36,652 (16.62%), respectively. Within such queries in Complement and Revisit, merely 3 (<0.01%) and 74 (0.20%) queries succeed, i.e., only a tiny portion of attempts successfully reduce elements. These success rates are far less than those of queries within Other (5.60%), as well as those of ProbDD (4.85%). On the other benchmark suites, a similar phenomenon is observed.
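The percentages quoted above can be re-derived from the raw counts. The following is a minimal sketch, assuming that every ddmin query falls into exactly one of the three categories and that the 3,614 term in the earlier sum of successful queries belongs to Other; the variable and function names are ours and purely illustrative.

```python
# Query counts for ddmin on BM_C, copied from the text above.
total = 220_563
complement_total, complement_ok = 119_363, 3
revisit_total, revisit_ok = 36_652, 74

# Assumption: 3,614 (the third term of 74 + 3 + 3,614) counts the
# successful queries in Other, and the categories do not overlap.
other_ok = 3_614
other_total = total - complement_total - revisit_total


def pct(part: int, whole: int) -> float:
    """Return part / whole as a percentage."""
    return 100.0 * part / whole


share = {
    "Complement": pct(complement_total, total),
    "Revisit": pct(revisit_total, total),
}
success_rate = {
    "Complement": pct(complement_ok, complement_total),
    "Revisit": pct(revisit_ok, revisit_total),
    "Other": pct(other_ok, other_total),
}

for name, value in share.items():
    print(f"{name:<10} share of queries: {value:6.2f}%")
for name, value in success_rate.items():
    print(f"{name:<10} success rate:     {value:6.4f}%")
```

Under these assumptions, the derived Other success rate agrees with the reported 5.60%, which suggests the three categories indeed partition the 220,563 queries.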

Queries within Complement and Revisit categories constitute a large portion yet prove to be largely inefficient, wasting a significant amount of time and resources. On the contrary, those in Other achieve a much higher success rate, on par with that of ProbDD, and are responsible for most of the successful deletions. Therefore, we believe that these two categories, where queries are inefficient, are the main bottlenecks behind ddmin’s low efficiency. However, these bottlenecks are absent in ProbDD, as it does not consider complements of subsets and previously tried subsets for deletion.

V-G 1-Minimality of ProbDD?

Finding 5: Improving efficiency by avoiding ineffective attempts comes at the cost of not ensuring 1-minimality, although this limitation can be mitigated by iteratively running the reduction algorithm until a fixpoint is reached.

Although ProbDD avoids Revisit queries to enhance efficiency, some reduction opportunities may be missed, as the deletion of a certain subset may enable a previously tried subset to become removable. Therefore, a limitation of ProbDD is that it increases efficiency by sacrificing 1-minimality. To substantiate this limitation, we examine how frequently ProbDD generates a list that is not 1-minimal, i.e., a list that can be further reduced by removing a single element. For instance, statistical analysis on $\text{BM}_{\text{C}}$ reveals that among 6,871 invocations of ProbDD, 76 fail to generate a 1-minimal result, accounting for 1.11%. For these failed invocations, an average of 1.49 elements (tree nodes) can be further removed via single-element deletion.
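The 1-minimality check described above can be sketched in a few lines. This is an illustrative sketch, not the tooling used in the evaluation; the function names and the toy property `psi` are hypothetical. The toy property models a dependency: "use" may only be kept if "def" is kept, so "def" becomes removable only after "use" has been deleted.

```python
from typing import Callable, List, Sequence


def removable_singletons(items: Sequence[str],
                         psi: Callable[[List[str]], bool]) -> List[int]:
    """Indices whose single-element deletion still preserves psi."""
    return [
        i for i in range(len(items))
        if psi(list(items[:i]) + list(items[i + 1:]))
    ]


def is_one_minimal(items: Sequence[str],
                   psi: Callable[[List[str]], bool]) -> bool:
    """A list is 1-minimal if no single element can be removed."""
    return not removable_singletons(items, psi)


def psi(candidate: List[str]) -> bool:
    # Hypothetical property: the "bug" must remain, and "use" requires "def".
    return "bug" in candidate and ("use" not in candidate or "def" in candidate)


# A result that a reducer could return if it tried "def" before "use"
# and never revisited "def" afterwards.
result = ["def", "bug"]
print(is_one_minimal(result, psi))        # "def" alone is still removable
print(removable_singletons(result, psi))

# Share of non-1-minimal invocations reported above for BM_C.
print(f"{100.0 * 76 / 6871:.2f}%")
```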

However, this limitation is not apparent across all benchmark suites, as the results from ProbDD are not consistently larger than those from ddmin. Our further investigation reveals that these benchmarks are reduced via the wrapper frameworks Picireny and Chisel. Both frameworks employ iterative loops to reach a fixpoint, effectively removing some elements missed in the first iteration.
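The effect of such a fixpoint loop can be illustrated with a small sketch. It does not reproduce the actual loops in Picireny or Chisel; `single_sweep`, `reduce_to_fixpoint`, and the toy property `psi` are hypothetical names introduced only for this illustration. The inner reducer tries each element once and never revisits, so it can miss a deletion that only becomes possible later.

```python
from typing import Callable, List

Property = Callable[[List[str]], bool]


def single_sweep(items: List[str], psi: Property) -> List[str]:
    """Try to delete each element once, left to right, without revisiting."""
    current = list(items)
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        if psi(candidate):
            current = candidate   # element removed; index i now holds the next one
        else:
            i += 1
    return current


def reduce_to_fixpoint(reducer: Callable[[List[str], Property], List[str]],
                       items: List[str], psi: Property) -> List[str]:
    """Re-run the reducer until an iteration removes nothing."""
    current = list(items)
    while True:
        reduced = reducer(current, psi)
        if len(reduced) == len(current):
            return reduced
        current = reduced


def psi(candidate: List[str]) -> bool:
    # Hypothetical property: "bug" must remain, and "use" requires "def".
    return "bug" in candidate and ("use" not in candidate or "def" in candidate)


program = ["def", "use", "bug"]
print(single_sweep(program, psi))                      # one iteration only
print(reduce_to_fixpoint(single_sweep, program, psi))  # iterated to a fixpoint
```

In this toy case, the first iteration keeps "def" because it is tried while "use" is still present; the second iteration removes it.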

VI Implications: a counter-based model

Building on the aforementioned demystification of ProbDD, we discover that probability can be optimized away, and subset size can be pre-computed. Hence, we propose Counter-Based Delta Debugging (CDD), to reduce the complexity of both the theory and implementation of ProbDD, and validate the correctness of our prior theoretical proofs.

Input: $L$: a list of elements to be reduced.
Input: $\psi : \mathbb{L} \rightarrow \mathbb{B}$: the property to be preserved by $L$.
Input: $p_0$: the initial probability given by the user.
Output: the minimized list that still exhibits the property $\psi$.

$r \leftarrow 0$   // The round number, initially 0.
do
    /* Compute subset size by round number */
    $s \leftarrow$ ComputeSize($r$, $p_0$)
    /* Partition $L$ into subsets with $s$ elements. If it does not divide evenly, leave a smaller remainder as the final subset. */
    subsets $\leftarrow$ Partition($L$, $s$)
    foreach subset $\in$ subsets do
        temp $\leftarrow L \setminus$ subset
        /* Remove subset if it is removable */
        if $\psi(\text{temp})$ is true then
            $L \leftarrow$ temp
    /* Update $r$ and move to the next round. */
    $r \leftarrow r + 1$
while $s > 1$
return $L$

Function ComputeSize($r$, $p_0$):
    Input: $r$: the current round number.
    Input: $p_0$: the initial probability given by the user.
    Output: the size of the subset to be used in the current round.
    /* Calculate the initial subset size from $p_0$ */
    $s_0 \leftarrow \lfloor -\frac{1}{\ln(1 - p_0)} \rfloor$
    /* Calculate the subset size for the current round */
    $s \leftarrow \lfloor s_0 \times 0.632^{r} \rfloor$
    return $s$

Algorithm 2: CDD($L$, $\psi$)
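For readers who prefer executable code, the pseudocode of Algorithm 2 can be transcribed roughly as follows. This is a sketch under stated assumptions rather than the reference implementation: the names `compute_size` and `cdd` are ours, elements are tracked by index so that duplicate elements are handled, and the subset size is clamped to at least 1 (an addition of ours) so that a large $p_0$ cannot yield a subset size of 0.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def compute_size(r: int, p0: float) -> int:
    """Subset size for round r, following ComputeSize in Algorithm 2."""
    s0 = math.floor(-1.0 / math.log(1.0 - p0))
    # Assumption (not in Algorithm 2): never go below one element.
    return max(1, math.floor(s0 * 0.632 ** r))


def cdd(items: Sequence[T], psi: Callable[[List[T]], bool],
        p0: float = 0.1) -> List[T]:
    """Counter-based reduction of `items` while preserving `psi`."""
    kept = list(range(len(items)))   # indices of the surviving elements
    r = 0
    while True:
        s = compute_size(r, p0)
        # Partition the surviving indices into chunks of size s;
        # the last chunk may be smaller.
        subsets = [kept[i:i + s] for i in range(0, len(kept), s)]
        for subset in subsets:
            drop = set(subset)
            temp = [i for i in kept if i not in drop]
            # Remove the subset if the property still holds without it.
            if psi([items[i] for i in temp]):
                kept = temp
        r += 1
        if s <= 1:                   # do ... while s > 1
            break
    return [items[i] for i in kept]


if __name__ == "__main__":
    # Toy property: the input is "interesting" while it contains 3 and 17.
    data = list(range(20))
    print(cdd(data, lambda c: 3 in c and 17 in c))
```

The value of `p0` used here is an arbitrary choice for the demonstration; the last round always uses size 1, so every surviving element is tested individually once.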

Subset size pre-calculation.  Based on Equation 11 in section III, the size for each round can be pre-calculated. Therefore, as shown in the function ComputeSize of Algorithm 2, we utilize the current round $r$ and the initial probability $p_0$ to determine the subset size $s$. The size of the selected subset decreases as the round counter increases. This is intuitively reasonable since, after a sufficient number of attempts with a large size have been made, it becomes more advantageous to gradually reduce the subset size for future trials. Furthermore, this trend aligns well with that of ProbDD, in which the probabilities of elements gradually increase, resulting in a smaller subset size.
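To make the decreasing trend concrete, the sketch below enumerates the full sequence of subset sizes for a given $p_0$, using the formulas of ComputeSize. The helper names `size_schedule` and `max_queries` are hypothetical, the clamp to 1 is our own safeguard, and the query bound is our own observation under the assumption that each round issues one query per subset of the list as it stands at the start of that round.

```python
import math
from typing import List


def size_schedule(p0: float, ratio: float = 0.632) -> List[int]:
    """Subset sizes for rounds 0, 1, 2, ... until the size reaches 1."""
    s0 = math.floor(-1.0 / math.log(1.0 - p0))
    sizes = []
    r = 0
    while True:
        s = max(1, math.floor(s0 * ratio ** r))  # clamp is our assumption
        sizes.append(s)
        if s <= 1:
            return sizes
        r += 1


def max_queries(n: int, sizes: List[int]) -> int:
    """Upper bound on queries for a list of n elements: no deletion succeeds."""
    return sum(math.ceil(n / s) for s in sizes)


if __name__ == "__main__":
    schedule = size_schedule(0.1)
    print(schedule)
    print(max_queries(100, schedule))
```

Since the schedule depends only on $p_0$, it can be computed once before reduction starts; the constant 0.632 is taken from Algorithm 2 and is numerically close to $1 - 1/e$.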

Main workflow.  The simplified ProbDD is illustrated in Algorithm 2. Before each round, CDD pre-calculates the subset size via ComputeSize and then partitions $L$ using this size via Partition. Then, similar to ddmin, it attempts to remove each subset in the foreach loop. The subset size continuously decreases until it reaches 1, meaning that each element is individually tested for removal once.

Revisiting the running example.  Returning to Table III, under the same conditions, CDD achieves the same results as ProbDD but without the need for probability calculations. This is because both the probability and subset size $s$ can be directly determined from the round number $r$.

Evaluation.  In Table IV, CDD completes the most benchmarks, totaling 69, followed by ProbDD with 68 benchmarks completed, and ddmin with 61 benchmarks completed. CDD outperforms ddmin w.r.t. efficiency, with 25.56% less time and 44.84% fewer queries. Meanwhile, CDD performs on par with ProbDD w.r.t. final size, execution time, and query number, with p-values of 0.38, 0.13, and 0.06, respectively, indicating no statistically significant difference between the two algorithms. CDD is expected to perform on par with ProbDD since it is designed to provide further insight and unravel the complexities of ProbDD, rather than to surpass its capabilities. Furthermore, its comparable performance to ProbDD further supports that randomness is unnecessary and validates our assumption in III.1.

Bottleneck and 1-minimality.  Returning to the bottlenecks presented in Fig. 2, CDD possesses a query number and success rate close to those of ProbDD, indicating that CDD also overcomes the bottlenecks of ddmin. Additionally, similar to ProbDD, 1-minimality is absent in CDD, although iterations help mitigate this issue.

Finding 6: CDD always achieves comparable performance to ProbDD, which further supports our previous findings, including the theoretical simplifications regarding size and probability, analysis of randomness, bottlenecks, and 1-minimality.

VII Related Work

In this section, we discuss related work on test input minimization from three aspects: effectiveness, efficiency, and the utilization of domain knowledge.

Effectiveness.  Test input minimization is an NP-complete problem, in which achieving the global minimum is usually infeasible. Therefore, existing approaches to improving effectiveness mainly aim to escape local minima by performing more exhaustive searches. Since enumerating all possible subsets is infeasible, Vulcan [11] and C-Reduce [12] enumerate all combinations of elements within a small sliding window, and exhaustively attempt to delete each combination, resulting in smaller final program sizes. In contrast, ProbDD and CDD do not exhibit clear actions targeted at breaking through local optima, suggesting they cannot achieve better effectiveness than ddmin, as aligned with our evaluation in section V.

Efficiency.  If parallelism is not considered, the core of boosting efficiency is the enhanced capability to avoid relatively inefficient queries. For example, Hodován and Kiss [16] proposed disregarding attempts to remove the complement of subsets, the success rate of which is unacceptably low in some scenarios. Besides, Gharachorlu and Sumner [15] proposed One Pass Delta Debugging (OPDD), which continues with the subset next to the deleted one, rather than starting over from the first subset. This optimization also avoids some redundant queries in ddmin, reducing runtime by 65%. As revealed by our analysis, these two optimizations are implicitly incorporated within ProbDD and CDD, thereby contributing to their higher efficiency than ddmin.
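The difference between starting over and continuing after a successful deletion can be illustrated with a toy comparison at a fixed subset size of 1. This sketch only conveys the intuition; it is neither ddmin nor OPDD, and the function names and the toy property are hypothetical.

```python
from typing import Callable, List, Tuple

Property = Callable[[List[int]], bool]


def sweep(items: List[int], psi: Property,
          restart: bool) -> Tuple[List[int], int]:
    """Single-element deletion; returns the reduced list and the query count.

    With restart=True, the scan goes back to the first element after every
    successful deletion; otherwise it continues from the current position.
    """
    current = list(items)
    queries = 0
    i = 0
    while i < len(current):
        candidate = current[:i] + current[i + 1:]
        queries += 1
        if psi(candidate):
            current = candidate
            if restart:
                i = 0          # start over: earlier elements get re-queried
        else:
            i += 1
    return current, queries


if __name__ == "__main__":
    # Toy property: only element 0 is needed.
    keeps_zero = lambda c: 0 in c
    data = list(range(8))
    print(sweep(data, keeps_zero, restart=True))
    print(sweep(data, keeps_zero, restart=False))
```

Both variants return the same list in this toy case, but the restarting variant repeatedly re-queries the element that was already found to be non-removable.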

Utilization of domain knowledge.  There is an inherent trade-off between effectiveness and efficiency in test input minimization. For the same algorithm, achieving a better result, i.e., a smaller local optimum, requires more queries to be spent on trial and error. However, employing domain knowledge [12, 25, 26, 27] can still improve the overall performance. For instance, J-Reduce is both more effective and efficient than HDD on reducing Java programs, as it escapes more local optima by program transformations while simultaneously avoiding more inefficient queries via semantic constraints, leveraging the semantics of Java. Our analysis on ProbDD indicates that the probabilities primarily function as counters and do not utilize or effectively learn the domain knowledge of an input. Besides, the evaluation on CDD, a simplified algorithm without utilizing probability, demonstrates that prioritizing elements via such probabilities does not yield significant benefits, thus validating our analysis.

VIII Conclusion

This paper conducts the first in-depth analysis of ProbDD, the state-of-the-art variant of ddmin, to further comprehend and demystify its superior performance. Through theoretical analysis of the probabilistic model in ProbDD, we reveal that probabilities essentially serve as monotonically increasing counters, and propose CDD as a simplification. Evaluations on 76 benchmarks from test input minimization and software debloating confirm that CDD performs on par with ProbDD, substantiating our theoretical analysis. Furthermore, our examination of query success rate and randomness uncovers that ProbDD's superiority stems from skipping inefficient queries. Finally, we discuss trade-offs in ddmin and ProbDD, providing insights for future research and applications of test input minimization algorithms.

References

  • [1] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,” IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.
  • [2] GCC. (2020) A guide to testcase reduction. Accessed: 2023-04-30. [Online]. Available: https://gcc.gnu.org/wiki/A_guide_to_testcase_reduction
  • [3] LLVM. (2022) How to submit an LLVM bug report. Accessed: 2023-04-30. [Online]. Available: https://llvm.org/docs/HowToSubmitABug.html
  • [4] WebKit. (2001) Webkit: Test case reduction. Accessed: 2023-04-30. [Online]. Available: https://webkit.org/test-case-reduction/
  • [5] ASF Bugzilla. (2001) ASF bugzilla: Bug writing guidelines. Accessed: 2023-04-30. [Online]. Available: https://bz.apache.org/bugzilla/page.cgi?id=bug-writing.html
  • [6] Bugzilla. (2001) Bugzilla: Reporting a new bug. Accessed: 2023-04-30. [Online]. Available: https://bugzilla.readthedocs.io/en/5.2/using/filing.html#reporting-a-new-bug
  • [7] A. Donaldson and D. MacIver. (2021, May) Test Case Reduction: Beyond Bugs. [Online]. Available: https://blog.sigplan.org/2021/05/25/test-case-reduction-beyond-bugs
  • [8] A. F. Donaldson, P. Thomson, V. Teliman, S. Milizia, A. P. Maselco, and A. Karpiński, “Test-case reduction and deduplication almost for free with transformation-based compiler testing,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1017–1032.
  • [9] C. Sun, Y. Li, Q. Zhang, T. Gu, and Z. Su, “Perses: Syntax-guided program reduction,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 361–371.
  • [10] G. Misherghi and Z. Su, “HDD: Hierarchical delta debugging,” in Proceedings of the 28th International Conference on Software Engineering, 2006, pp. 142–151.
  • [11] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, “Pushing the limit of 1-minimality of language-agnostic program reduction,” Proceedings of the ACM on Programming Languages, vol. 7, no. OOPSLA1, pp. 636–664, 2023.
  • [12] J. Regehr, Y. Chen, P. Cuoq, E. Eide, C. Ellison, and X. Yang, “Test-case reduction for C compiler bugs,” in Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012, pp. 335–346.
  • [13] K. Heo, W. Lee, P. Pashakhanloo, and M. Naik, “Effective program debloating via reinforcement learning,” in Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, 2018, pp. 380–394.
  • [14] G. Wang, R. Shen, J. Chen, Y. Xiong, and L. Zhang, “Probabilistic delta debugging,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 881–892.
  • [15] G. Gharachorlu and N. Sumner, “Avoiding the familiar to speed up test case reduction,” in 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS).   IEEE, 2018, pp. 426–437.
  • [16] R. Hodován and Á. Kiss, “Practical improvements to the minimizing delta debugging algorithm.” in ICSOFT-EA, 2016, pp. 241–248.
  • [17] G. Wang. (2021) Probdd. Accessed: 2023-04-30. [Online]. Available: https://github.com/Amocy-Wang/ProbDD
  • [18] M. Pelikan, D. E. Goldberg, E. Cantú-Paz et al., “Boa: The bayesian optimization algorithm,” in Proceedings of the genetic and evolutionary computation conference GECCO-99, vol. 1.   Citeseer, 1999, pp. 525–532.
  • [19] Z. Xu, Y. Tian, M. Zhang, G. Zhao, Y. Jiang, and C. Sun, “Pushing the limit of 1-minimality of language-agnostic program reduction,” Proc. ACM Program. Lang., vol. 7, no. OOPSLA1, apr 2023. [Online]. Available: https://doi.org/10.1145/3586049
  • [20] C. Qian, H. Hu, M. Alharthi, S. P. H. Chung, T. Kim, and W. Lee, “Razor: A framework for post-deployment software debloating.” in USENIX Security Symposium, 2019, pp. 1733–1750.
  • [21] M. Alhanahnah, R. Jain, V. Rastogi, S. Jha, and T. Reps, “Lightweight, multi-stage, compiler-assisted application specialization,” in 2022 IEEE 7th European Symposium on Security and Privacy (EuroS&P).   IEEE, 2022, pp. 251–269.
  • [22] S. Li and M. Rigger, “Finding xpath bugs in xml document processors via differential testing,” arXiv preprint arXiv:2401.05112, 2024.
  • [23] A. Kiss, R. Hodován, and D. Vince. (2016) Picireny. Accessed: 2023-04-30. [Online]. Available: https://github.com/renatahodovan/picireny
  • [24] ——. (2016) Picire. Accessed: 2023-04-30. [Online]. Available: https://github.com/renatahodovan/picire
  • [25] C. G. Kalhauge and J. Palsberg, “Binary reduction of dependency graphs,” in Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019, pp. 556–566.
  • [26] ——, “Logical bytecode reduction,” in Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2021, pp. 1003–1016.
  • [27] M. Zhang, Y. Tian, Z. Xu, Y. Dong, S. H. Tan, and C. Sun, “Lpr: Large language models-aided program reduction,” in Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis.   New York, NY, USA: ACM, 2024, p. 13.