Data Structures and Algorithms
- [1] arXiv:2408.04118 [pdf, html, other]
Title: Reducing Matroid Optimization to Basis Search
Comments: 43 pages, 7 figures, 3 algorithms
Subjects: Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC); Discrete Mathematics (cs.DM); Optimization and Control (math.OC)
In combinatorial optimization, matroids provide one of the most elegant structures for algorithm design. This is perhaps best identified by the Edmonds-Rado theorem relating the success of the simple greedy algorithm to the anatomy of the optimal basis of a matroid [Edm71; Rad57]. As a response, much energy has been devoted to understanding a matroid's favorable computational properties. Yet surprisingly, not much is understood where parallel algorithm design is concerned. Specifically, while prior work has investigated the task of finding an arbitrary basis in parallel computing settings [KUW88], the more complex task of finding the optimal basis remains unexplored. We initiate this study by reexamining Borůvka's minimum weight spanning tree algorithm in the language of matroid theory, identifying a new characterization of the optimal basis by way of a matroid's cocircuits as a result. Furthermore, we then combine such insights with special properties of binary matroids to reduce optimization in a binary matroid to the simpler task of search for an arbitrary basis, with only logarithmic asymptotic overhead. Consequentially, we are able to compose our reduction with a known basis search method of [KUW88] to obtain a novel algorithm for finding the optimal basis of a binary matroid with only sublinearly many adaptive rounds of queries to an independence oracle. To the authors' knowledge, this is the first parallel algorithm for matroid optimization to outperform the greedy algorithm in terms of adaptive complexity, for any class of matroid not represented by a graph.
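For orientation, the sequential greedy baseline that the abstract compares against can be sketched as follows. This is not the paper's parallel reduction; it is only the classical greedy algorithm driven by an independence oracle, shown on a graphic matroid. The names `greedy_min_basis` and `forest_oracle` and the `(u, v, weight)` edge encoding are assumptions made for this illustration.

```python
def forest_oracle(edges):
    """Independence oracle for a graphic matroid (hypothetical helper).

    A set of (u, v, weight) edges is independent iff it contains no cycle,
    which is checked here with a small union-find.
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v, _ in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge would close a cycle
        parent[ru] = rv
    return True


def greedy_min_basis(elements, weight, is_independent):
    """Classical greedy: scan by increasing weight, keep what stays independent.

    Returns the basis and the number of oracle queries. Each query depends on
    the answers to the previous ones, so this form uses one adaptive round
    per element.
    """
    basis, queries = [], 0
    for e in sorted(elements, key=weight):
        queries += 1
        if is_independent(basis + [e]):
            basis.append(e)
    return basis, queries
```

On the four-edge graph `[("a","b",1), ("b","c",2), ("a","c",3), ("c","d",4)]` with `weight=lambda e: e[2]`, the greedy scan keeps three edges and spends four sequential oracle queries, which is the adaptivity cost the abstract aims to reduce for binary matroids.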
- [2] arXiv:2408.04253 [pdf, html, other]
Title: Simple Linear-time Repetition Factorization
Comments: Accepted for SPIRE 2024
Subjects: Data Structures and Algorithms (cs.DS)
A factorization $f_1, \ldots, f_m$ of a string $w$ of length $n$ is called a repetition factorization of $w$ if $f_i$ is a repetition, i.e., $f_i$ is a form of $x^kx'$, where $x$ is a non-empty string, $x'$ is a (possibly-empty) proper prefix of $x$, and $k \geq 2$. Dumitran et al. [SPIRE 2015] presented an $O(n)$-time and space algorithm for computing an arbitrary repetition factorization of a given string of length $n$. Their algorithm heavily relies on the Union-Find data structure on trees proposed by Gabow and Tarjan [JCSS 1985] that works in linear time on the word RAM model, and an interval stabbing data structure of Schmidt [ISAAC 2009]. In this paper, we explore more combinatorial insights into the problem, and present a simple algorithm to compute an arbitrary repetition factorization of a given string of length $n$ in $O(n)$ time, without relying on data structures for Union-Find and interval stabbing. Our algorithm follows the approach by Inoue et al. [ToCS 2022] that computes the smallest/largest repetition factorization in $O(n \log n)$ time.
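To make the definition concrete, here is a naive quadratic-time sketch, not the linear-time algorithm of the paper. It relies on the observation that a string has the form $x^kx'$ with $k \geq 2$ exactly when its smallest period $p$ satisfies $2p \leq$ its length, and it finds one factorization by dynamic programming over suffixes. The function names are assumptions chosen for this example.

```python
def prefix_function(s):
    """Standard KMP prefix function: pi[i] = longest proper border of s[:i+1]."""
    pi = [0] * len(s)
    for i in range(1, len(s)):
        k = pi[i - 1]
        while k and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def repetition_factorization(w):
    """Return some repetition factorization of w as a list, or None if none exists.

    ok[i] records whether the suffix w[i:] can be factorized; nxt[i] stores the
    end of the first factor used. Total time is O(n^2).
    """
    n = len(w)
    ok = [False] * (n + 1)
    nxt = [None] * (n + 1)
    ok[n] = True  # the empty suffix has the empty factorization
    for i in range(n - 1, -1, -1):
        pi = prefix_function(w[i:])
        for length in range(2, n - i + 1):
            period = length - pi[length - 1]  # smallest period of w[i:i+length]
            if 2 * period <= length and ok[i + length]:
                ok[i], nxt[i] = True, i + length
                break
    if not ok[0]:
        return None
    factors, i = [], 0
    while i < n:
        factors.append(w[i:nxt[i]])
        i = nxt[i]
    return factors
```

For instance, `repetition_factorization("aaabab")` yields `["aa", "abab"]`, while `"aabab"` has no repetition factorization because `"bab"` is shorter than twice its period.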
- [3] arXiv:2408.04517 [pdf, html, other]
Title: Approximating $\delta$-Covering
Comments: 22 pages, 6 figures, extended abstract accepted at WAOA24
Subjects: Data Structures and Algorithms (cs.DS)
$\delta$-Covering, for some covering range $\delta>0$, is a continuous facility location problem on undirected graphs where all edges have unit length. The facilities may be positioned on the vertices as well as on the interior of the edges. The goal is to position as few facilities as possible such that every point on every edge has distance at most $\delta$ to one of these facilities. For large $\delta$, the problem is similar to dominating set, which is hard to approximate, while for small $\delta$, say close to $1$, the problem is similar to vertex cover. In fact, as shown by Hartmann et al. [Math. Program. 22], $\delta$-Covering for all unit-fractions $\delta$ is polynomial time solvable, while for all other values of $\delta$ the problem is NP-hard.
We study the approximability of $\delta$-Covering for every covering range $\delta>0$. For $\delta \geq 3/2$, the problem is log-APX-hard, and allows an $\mathcal O(\log n)$ approximation. For every $\delta < 3/2$, there is a constant factor approximation of a minimum $\delta$-cover (and the problem is APX-hard when $\delta$ is not a unit-fraction). We further study the dependency of the approximation ratio on the covering range $\delta < 3/2$. By providing several polynomial time approximation algorithms and lower bounds under the Unique Games Conjecture, we narrow the possible approximation ratio, especially for $\delta$ close to the polynomial time solvable cases.
- [4] arXiv:2408.04537 [pdf, html, other]
Title: Movelet Trees
Subjects: Data Structures and Algorithms (cs.DS)
We combine Nishimoto and Tabei's move structure with a wavelet tree to show how, if $T [1..n]$ is over a constant-sized alphabet and its Burrows-Wheeler Transform (BWT) consists of $r$ runs, then we can store $T$ in $O \left( r \log \frac{n}{r} \right)$ bits such that when given a pattern $P [1..m]$, we can find the BWT interval for $P$ in $O (m)$ time.
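The parameter $r$ in the bound above is the number of maximal runs of equal characters in the BWT. A minimal sketch of how $r$ could be measured on a small input is given below; it builds the BWT naively by sorting rotations, which is only suitable for illustration, and the `$` sentinel and function names are assumptions of this example rather than anything specified in the abstract.

```python
def bwt(text, sentinel="$"):
    """Naive Burrows-Wheeler Transform via sorted rotations (quadratic space).

    The sentinel is assumed to be absent from the text and to sort before
    every other character.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s):
    """Number of maximal runs of equal consecutive characters in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])
```

For `"banana"` the transform is `"annb$aa"`, which has 5 runs.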
- [5] arXiv:2408.04613 [pdf, other]
Title: Core-Sparse Monge Matrix Multiplication: Improved Algorithm and Applications
Comments: Abstract shortened for arXiv
Subjects: Data Structures and Algorithms (cs.DS)
The task of min-plus matrix multiplication often arises in the context of distances in graphs and is known to be fine-grained equivalent to the All-Pairs Shortest Path problem. The non-crossing property of shortest paths in planar graphs gives rise to Monge matrices; the min-plus product of $n\times n$ Monge matrices can be computed in $O(n^2)$ time. Grid graphs arising in sequence alignment problems, such as longest common subsequence or longest increasing subsequence, are even more structured. Tiskin [SODA'10] modeled their behavior using simple unit-Monge matrices and showed that the min-plus product of such matrices can be computed in $O(n\log n)$ time. Russo [SPIRE'11] showed that the min-plus product of arbitrary Monge matrices can be computed in time $O((n+\delta)\log^3 n)$ parameterized by the core size $\delta$, which is $O(n)$ for unit-Monge matrices.
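As a reference point for the terms used above, the following sketch shows the cubic-time definition of the min-plus product together with a check of the Monge condition $M_{i,j} + M_{i+1,j+1} \leq M_{i,j+1} + M_{i+1,j}$. It is a definitional baseline only, not any of the faster algorithms cited; the function names are assumptions.

```python
def min_plus(A, B):
    """Min-plus product C[i][j] = min_k (A[i][k] + B[k][j]), computed naively."""
    inner, cols = len(B), len(B[0])
    return [
        [min(row[k] + B[k][j] for k in range(inner)) for j in range(cols)]
        for row in A
    ]


def is_monge(M):
    """Check the Monge condition on every pair of adjacent rows and columns."""
    return all(
        M[i][j] + M[i + 1][j + 1] <= M[i][j + 1] + M[i + 1][j]
        for i in range(len(M) - 1)
        for j in range(len(M[0]) - 1)
    )
```

With `A[i][j] = (i - j) ** 2` on a $5 \times 5$ grid, both `A` and `min_plus(A, A)` pass `is_monge`, in line with the known fact that the min-plus product of Monge matrices is again Monge.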
In this work, we provide a linear bound on the core size of the product matrix in terms of the core sizes of the input matrices and show how to solve the core-sparse Monge matrix multiplication problem in $O((n+\delta)\log n)$ time, matching the result of Tiskin for simple unit-Monge matrices. Our algorithm also allows $O(\log \delta)$-time witness recovery for any given entry of the output matrix. As an application of this functionality, we show that an array of size $n$ can be preprocessed in $O(n\log^3 n)$ time so that the longest increasing subsequence of any sub-array can be reconstructed in $O(l)$ time, where $l$ is the length of the reported subsequence; in comparison, Karthik C. S. and Rahul [arXiv'24] recently achieved $O(l+n^{1/2}\log^3 n)$-time reporting after $O(n^{3/2}\log^3 n)$-time preprocessing. Our faster core-sparse Monge matrix multiplication also enabled reducing two logarithmic factors in the running times of the recent algorithms for edit distance with integer weights [Gorbachev & Kociumaka, arXiv'24].
- [6] arXiv:2408.04615 [pdf, html, other]
Title: SSD Set System, Graph Decomposition and Hamiltonian Cycle
Comments: 29 pages, 4 figures
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM); Combinatorics (math.CO)
In this paper, we first study what we call a Superset-Subset-Disjoint (SSD) set system. Based on properties of SSD set systems, we derive the following (I) to (IV):
(I) For a nonnegative integer $k$ and a graph $G=(V,E)$ with $|V|\ge2$, let $X_1,X_2,\dots,X_q\subsetneq V$ denote all maximal proper subsets of $V$ that induce $k$-edge-connected subgraphs. Then at least one of (a) and (b) holds: (a) $\{X_1,X_2,\dots,X_q\}$ is a partition of $V$; and (b) $V\setminus X_1, V\setminus X_2,\dots,V\setminus X_q$ are pairwise disjoint.
(II) For a strongly-connected (i.e., $k=1$) digraph $G$, we show that whether $V$ is in (a) and/or (b) can be decided in $O(n+m)$ time and that we can generate all such $X_1,X_2,\dots,X_q$ in $O(n+m+|X_1|+|X_2|+\dots+|X_q|)$ time, where $n=|V|$ and $m=|E|$.
(III) For a digraph $G$, we can enumerate in linear delay all vertex subsets of $V$ that induce strongly-connected subgraphs.
(IV) A digraph is Hamiltonian if there is a spanning subgraph that is strongly-connected and in case (a).
- [7] arXiv:2408.04620 [pdf, html, other]
Title: Regularized Unconstrained Weakly Submodular Maximization
Comments: CIKM'24. Full paper including omitted proofs
Subjects: Data Structures and Algorithms (cs.DS)
Submodular optimization finds applications in machine learning and data mining. In this paper, we study the problem of maximizing functions of the form $h = f-c$, where $f$ is a monotone, non-negative, weakly submodular set function and $c$ is a modular function. We design a deterministic approximation algorithm that runs with ${O}(\frac{n}{\epsilon}\log \frac{n}{\gamma \epsilon})$ oracle calls to function $h$, and outputs a set ${S}$ such that $h({S}) \geq \gamma(1-\epsilon)f(OPT)-c(OPT)-\frac{c(OPT)}{\gamma(1-\epsilon)}\log\frac{f(OPT)}{c(OPT)}$, where $\gamma$ is the submodularity ratio of $f$. Existing algorithms for this problem either admit a worse approximation ratio or have quadratic runtime. We also present an approximation ratio of our algorithm for this problem with an approximate oracle of $f$. We validate our theoretical results through extensive empirical evaluations on real-world applications, including vertex cover and influence diffusion problems for submodular utility function $f$, and Bayesian A-Optimal design for weakly submodular $f$. Our experimental results demonstrate that our algorithms efficiently achieve high-quality solutions.
New submissions for Friday, 9 August 2024 (showing 7 of 7 entries)
- [8] arXiv:2408.04122 (cross-list from cs.LG) [pdf, html, other]
Title: Overcoming Brittleness in Pareto-Optimal Learning-Augmented Algorithms
Subjects: Machine Learning (cs.LG); Data Structures and Algorithms (cs.DS)
The study of online algorithms with machine-learned predictions has gained considerable prominence in recent years. One of the common objectives in the design and analysis of such algorithms is to attain (Pareto) optimal tradeoffs between the consistency of the algorithm, i.e., its performance assuming perfect predictions, and its robustness, i.e., the performance of the algorithm under adversarial predictions. In this work, we demonstrate that this optimization criterion can be extremely brittle, in that the performance of Pareto-optimal algorithms may degrade dramatically even in the presence of imperceptible prediction error. To remedy this drawback, we propose a new framework in which the smoothness in the performance of the algorithm is enforced by means of a user-specified profile. This allows us to regulate the performance of the algorithm as a function of the prediction error, while simultaneously maintaining the analytical notion of consistency/robustness tradeoffs, adapted to the profile setting. We apply this new approach to a well-studied online problem, namely the one-way trading problem. For this problem, we further address another limitation of the state-of-the-art Pareto-optimal algorithms, namely the fact that they are tailored to worst-case, and extremely pessimistic inputs. We propose a new Pareto-optimal algorithm that leverages any deviation from the worst-case input to its benefit, and introduce a new metric that allows us to compare any two Pareto-optimal algorithms via a dominance relation.
- [9] arXiv:2408.04445 (cross-list from cs.DM) [pdf, html, other]
Title: On some randomized algorithms and their evaluation
Comments: 14 pages. arXiv admin note: substantial text overlap with arXiv:1312.0192
Journal-ref: Mathematics and Informatics, 63 (2) 2020, 202-217, ISSN 1310-2230
Subjects: Discrete Mathematics (cs.DM); Data Structures and Algorithms (cs.DS); Combinatorics (math.CO)
The paper considers implementations of some randomized algorithms in connection with obtaining a random $n^2 \times n^2$ Sudoku matrix in the programming language C++. For this purpose we describe the set $\Pi_n$ of all $(2n) \times n$ matrices, consisting of elements of the set $\mathbb{Z}_n =\{ 1,2,\ldots ,n\}$, such that every row is a permutation. We emphasize the relationship between these matrices and the $n^2 \times n^2$ Sudoku matrices. An algorithm to obtain random $\Pi_n$ matrices is presented in this paper. Several auxiliary algorithms that are related to the underlying problem are also described. We evaluate all algorithms according to two criteria: probability evaluation, and the time for generating random objects and checking membership in a specific set. These evaluations are interesting from both theoretical and practical points of view because they are particularly useful in the analysis of computer programs.
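A minimal sketch of the object described above, assuming only the stated definition of $\Pi_n$ (a $(2n) \times n$ matrix over $\{1,\ldots,n\}$ whose rows are permutations). It is written in Python rather than the paper's C++, it is not the paper's algorithm, and the names `random_pi_matrix` and `in_pi` are hypothetical.

```python
import random


def random_pi_matrix(n, rng=None):
    """Draw a (2n) x n matrix whose rows are independent random permutations of 1..n."""
    rng = rng or random.Random()
    rows = []
    for _ in range(2 * n):
        row = list(range(1, n + 1))
        rng.shuffle(row)  # each row is shuffled independently
        rows.append(row)
    return rows


def in_pi(matrix, n):
    """Membership test: exactly 2n rows, each a permutation of 1..n."""
    target = list(range(1, n + 1))
    return len(matrix) == 2 * n and all(sorted(row) == target for row in matrix)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is convenient when timing generation and membership checks.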
Cross submissions for Friday, 9 August 2024 (showing 2 of 2 entries)
- [10] arXiv:2401.04509 (replaced) [pdf, html, other]
Title: Linear-size Suffix Tries and Linear-size CDAWGs Simplified and Improved
Subjects: Data Structures and Algorithms (cs.DS); Formal Languages and Automata Theory (cs.FL)
The linear-size suffix tries (LSTries) [Crochemore et al., TCS 2016] are a version of suffix trees in which the edge labels are single characters, yet are able to perform pattern matching queries in optimal time. Instead of explicitly storing the input text, LSTries have some extra non-branching internal nodes called type-2 nodes. The extended techniques are then used in the linear-size compact directed acyclic word graphs (LCDAWGs) [Takagi et al., SPIRE 2017], which can be stored with $O(el(T)+er(T))$ space (i.e. without the text), where $el(T)$ and $er(T)$ are the numbers of left- and right-extensions of the maximal repeats in the input text string $T$, respectively. In this paper, we present simpler alternatives to the aforementioned indexing structures, called the simplified LSTries (simLSTries) and the simplified LCDAWGs (simLCDAWGs), in which most of the type-2 nodes are removed. In particular, our simLCDAWGs require only $O(er(T))$ space and work on a weaker model of computation (i.e. the pointer machine model). This contrasts with the $O(er(T))$-space CDAWG representation of [Belazzougui and Cunial, SPIRE 2017], which works on the word RAM model.
- [11] arXiv:2404.06401 (replaced) [pdf, other]
Title: Bounded Edit Distance: Optimal Static and Dynamic Algorithms for Small Integer Weights
Comments: Abstract shortened for arXiv
Subjects: Data Structures and Algorithms (cs.DS)
The edit distance of two strings is the minimum number of insertions, deletions, and substitutions needed to transform one string into the other. The textbook algorithm determines the edit distance of length-$n$ strings in $O(n^2)$ time, which is optimal up to subpolynomial factors under the Orthogonal Vectors Hypothesis (OVH). In the bounded version of the problem, parameterized by the edit distance $k$, the algorithm of Landau and Vishkin [JCSS'88] achieves $O(n+k^2)$ time, which is optimal as a function of $n$ and $k$.
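The quadratic-time textbook algorithm mentioned above can be sketched as the usual dynamic program over prefixes, kept here to two rows of the table. This covers only the unweighted baseline, not the bounded or dynamic algorithms of the paper; the function name is an assumption.

```python
def edit_distance(a, b):
    """Unit-cost edit distance via the classical O(|a| * |b|) dynamic program."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]
```

As a quick check, `edit_distance("kitten", "sitting")` evaluates to 3.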
The dynamic version of the problem asks to maintain the edit distance of two strings that change dynamically, with each update modeled as an edit. A folklore approach supports updates in $\tilde O(k^2)$ time, where $\tilde O(\cdot)$ hides polylogarithmic factors. Recently, Charalampopoulos, Kociumaka, and Mozes [CPM'20] showed an algorithm with update time $\tilde O(n)$, which is optimal under OVH in terms of $n$. The update time of $\tilde O(\min\{n,k^2\})$ raised an exciting open question of whether $\tilde O(k)$ is possible; we answer it affirmatively.
Our solution relies on tools originating from weighted edit distance, where the weight of each edit depends on the edit type and the characters involved. The textbook algorithm supports weights, but the Landau-Vishkin approach does not, and a simple $O(nk)$-time procedure long remained the fastest for bounded weighted edit distance. Only recently, Das et al. [STOC'23] provided an $O(n+k^5)$-time algorithm, whereas Cassis, Kociumaka, and Wellnitz [FOCS'23] presented an $\tilde O(n+\sqrt{nk^3})$-time solution and a matching conditional lower bound. In this paper, we show that, for integer edit weights between $0$ and $W$, weighted edit distance can be computed in $\tilde O(n+Wk^2)$ time and maintained dynamically in $\tilde O(W^2k)$ time per update. Our static algorithm can also be implemented in $\tilde O(n+k^{2.5})$ time.
- [12] arXiv:2407.00573 (replaced) [pdf, html, other]
Title: A Simple Representation of Tree Covering Utilizing Balanced Parentheses and Efficient Implementation of Average-Case Optimal RMQs
Authors: Kou Hamada, Sankardeep Chakraborty, Seungbum Jo, Takuto Koriyama, Kunihiko Sadakane, Srinivasa Rao Satti
Comments: To appear in ESA 2024
Subjects: Data Structures and Algorithms (cs.DS); Databases (cs.DB)
Tree covering is a technique for decomposing a tree into smaller-sized trees with desirable properties, and has been employed in various succinct data structures. However, significant hurdles stand in the way of a practical implementation of tree covering: a lot of pointers are used to maintain the tree-covering hierarchy and many indices for tree navigational queries consume theoretically negligible yet practically vast space. To tackle these problems, we propose a simple representation of tree covering using a balanced parenthesis representation. The key to the proposal is the observation that every micro tree splits into at most two intervals on the BP representation. Utilizing the representation, we propose several data structures that represent a tree and its tree cover, which consequently allow micro tree compression with arbitrary coding and efficient tree navigational queries. We also applied our data structure to average-case optimal RMQ by Munro et al. [ESA 2021] and implemented the RMQ data structure. Our RMQ data structures spend less than $2n$ bits and process queries in a practical time on several settings of the performance evaluation, reducing the gap between theoretical space complexity and actual space consumption. We also implement tree navigational operations while using the same amount of space as the RMQ data structures. We believe the representation can be widely utilized for designing practically memory-efficient data structures based on tree covering.
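For readers unfamiliar with the balanced parenthesis (BP) representation that the abstract builds on, a minimal sketch follows. It only encodes an ordered tree as a parenthesis string and answers one navigational query by linear scanning; it does not implement tree covering or the paper's data structures, and the adjacency-dict input format and function names are assumptions of this example.

```python
def to_bp(children, root):
    """Encode an ordered tree as balanced parentheses via an iterative preorder DFS.

    `children` maps a node to the ordered list of its children (hypothetical format).
    Each node contributes one "(" on entry and one ")" on exit.
    """
    out = []
    stack = [(root, False)]
    while stack:
        node, closing = stack.pop()
        if closing:
            out.append(")")
            continue
        out.append("(")
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return "".join(out)


def subtree_size(bp, open_pos):
    """Number of nodes in the subtree whose opening parenthesis is at open_pos.

    Uses a plain linear scan for the matching close; succinct structures answer
    this with auxiliary indices instead.
    """
    if bp[open_pos] != "(":
        raise ValueError("open_pos must point at an opening parenthesis")
    depth = 0
    for i in range(open_pos, len(bp)):
        depth += 1 if bp[i] == "(" else -1
        if depth == 0:
            return (i - open_pos + 1) // 2
    raise ValueError("unbalanced parenthesis sequence")
```

For the five-node tree `{0: [1, 4], 1: [2, 3]}` rooted at `0`, the encoding is `"((()())())"`, using two symbols per node.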
- [13] arXiv:2407.19905 (replaced) [pdf, html, other]
Title: The Bidirected Cut Relaxation for Steiner Tree has Integrality Gap Smaller than 2
Comments: updated one figure
Subjects: Data Structures and Algorithms (cs.DS); Discrete Mathematics (cs.DM)
The Steiner tree problem is one of the most prominent problems in network design. Given an edge-weighted undirected graph and a subset of the vertices, called terminals, the task is to compute a minimum-weight tree containing all terminals (and possibly further vertices). The best-known approximation algorithms for Steiner tree involve enumeration of a (polynomial but) very large number of candidate components and are therefore slow in practice.
A promising ingredient for the design of fast and accurate approximation algorithms for Steiner tree is the bidirected cut relaxation (BCR): bidirect all edges, choose an arbitrary terminal as a root, and enforce that each cut containing some terminal but not the root has one unit of fractional edges leaving it. BCR is known to be integral in the spanning tree case [Edmonds'67], i.e., when all the vertices are terminals. For general instances, however, it was not even known whether the integrality gap of BCR is better than the integrality gap of the natural undirected relaxation, which is exactly 2. We resolve this question by proving an upper bound of 1.9988 on the integrality gap of BCR.
- [14] arXiv:2402.12062 (replaced) [pdf, html, other]
Title: Causal Equal Protection as Algorithmic Fairness
Comments: 16 pages, 5 figures
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
By combining the philosophical literature on statistical evidence and the interdisciplinary literature on algorithmic fairness, we revisit recent objections against classification parity in light of causal analyses of algorithmic fairness and the distinction between predictive and diagnostic evidence. We focus on trial proceedings as a black-box classification algorithm in which defendants are sorted into two groups by convicting or acquitting them. We defend a novel principle, causal equal protection, that combines classification parity with the causal approach. In the do-calculus, causal equal protection requires that individuals should not be subject to uneven risks of classification error because of their protected or socially salient characteristics. The explicit use of protected characteristics, however, may be required if it equalizes these risks.
- [15] arXiv:2404.12953 (replaced) [pdf, html, other]
Title: Low-Depth Spatial Tree Algorithms
Journal-ref: IEEE International Parallel and Distributed Processing Symposium, IPDPS 2024, San Francisco, CA, USA, May 27-31 (2024) 180-192
Subjects: Distributed, Parallel, and Cluster Computing (cs.DC); Data Structures and Algorithms (cs.DS)
Contemporary accelerator designs exhibit a high degree of spatial localization, wherein two-dimensional physical distance determines communication costs between processing elements. This situation presents considerable algorithmic challenges, particularly when managing sparse data, a pivotal component in progressing data science. The spatial computer model quantifies communication locality by weighting processor communication costs by distance, introducing a term named energy. Moreover, it integrates depth, a widely-utilized metric, to promote high parallelism. We propose and analyze a framework for efficient spatial tree algorithms within the spatial computer model. Our primary method constructs a spatial tree layout that optimizes the locality of the neighbors in the compute grid. This approach thereby enables locality-optimized messaging within the tree. Our layout achieves a polynomial factor improvement in energy compared to utilizing a PRAM approach. Using this layout, we develop energy-efficient treefix sum and lowest common ancestor algorithms, which are both fundamental building blocks for other graph algorithms. With high probability, our algorithms exhibit near-linear energy and poly-logarithmic depth. Our contributions augment a growing body of work demonstrating that computations can have both high spatial locality and low depth. Moreover, our work constitutes an advancement in the spatial layout of irregular and sparse computations.
- [16] arXiv:2406.04868 (replaced) [pdf, html, other]
Title: Perturb-and-Project: Differentially Private Similarities and Marginals
Comments: 21 pages, ICML 2024
Subjects: Machine Learning (cs.LG); Cryptography and Security (cs.CR); Data Structures and Algorithms (cs.DS)
We revisit the input perturbations framework for differential privacy where noise is added to the input $A\in \mathcal{S}$ and the result is then projected back to the space of admissible datasets $\mathcal{S}$. Through this framework, we first design novel efficient algorithms to privately release pair-wise cosine similarities. Second, we derive a novel algorithm to compute $k$-way marginal queries over $n$ features. Prior work could achieve comparable guarantees only for $k$ even. Furthermore, we extend our results to $t$-sparse datasets, where our efficient algorithms yield novel, stronger guarantees whenever $t\le n^{5/6}/\log n$. Finally, we provide a theoretical perspective on why fast input perturbation algorithms work well in practice. The key technical ingredients behind our results are tight sum-of-squares certificates upper bounding the Gaussian complexity of sets of solutions.