Search | arXiv e-print repository

Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means

Authors: Max Dupré la Tour, David Saulpic

Abstract: Clustering is one of the staples of data analysis and unsupervised learning. As such, clustering algorithms are often used on massive data sets, and they need to be extremely fast. We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering. For these, the go-to algorithm is $k$-means++, which yields an $O(\log k)$-approximation in time… ▽ More Clustering is one of the staples of data analysis and unsupervised learning. As such, clustering algorithms are often used on massive data sets, and they need to be extremely fast. We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering. For these, the go-to algorithm is $k$-means++, which yields an $O(\log k)$-approximation in time $\tilde O(nkd)$. While it is possible to improve either the approximation factor [Lattanzi and Sohler, ICML19] or the running time [Cohen-Addad et al., NeurIPS 20], it is unknown how precise a linear-time algorithm can be. In this paper, we almost answer this question by presenting an almost linear-time algorithm to compute a constant-factor approximation. △ Less

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2406.19926 [pdf, other]

Fully Dynamic k-Means Coreset in Near-Optimal Update Time

Authors: Max Dupré la Tour, Monika Henzinger, David Saulpic

Abstract: We study in this paper the problem of maintaining a solution to $k$-median and $k$-means clustering in a fully dynamic setting. To do so, we present an algorithm to efficiently maintain a coreset, a compressed version of the dataset, that allows easy computation of a clustering solution at query time. Our coreset algorithm has near-optimal update time of $\tilde O(k)$ in general metric spaces, whi… ▽ More We study in this paper the problem of maintaining a solution to $k$-median and $k$-means clustering in a fully dynamic setting. To do so, we present an algorithm to efficiently maintain a coreset, a compressed version of the dataset, that allows easy computation of a clustering solution at query time. Our coreset algorithm has near-optimal update time of $\tilde O(k)$ in general metric spaces, which reduces to $\tilde O(d)$ in the Euclidean space $\mathbb{R}^d$. The query time is $O(k^2)$ in general metrics, and $O(kd)$ in $\mathbb{R}^d$. To maintain a constant-factor approximation for $k$-median and $k$-means clustering in Euclidean space, this directly leads to an algorithm update time $\tilde O(d)$, and query time $\tilde O(kd + k^2)$. To maintain a $O(polylog~k)$-approximation, the query time is reduced to $\tilde O(kd)$. △ Less

Submitted 28 June, 2024; originally announced June 2024.

Comments: To appear at ESA 2024

arXiv:2406.11649 [pdf, other]

Making Old Things New: A Unified Algorithm for Differentially Private Clustering

Authors: Max Dupré la Tour, Monika Henzinger, David Saulpic

Abstract: As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy is the first of them, and the problem has also been studied for the local and the shuffle variation. In each case, the goal is to design an algorithm that computes privately a clustering, with the smallest possible error. T… ▽ More As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy is the first of them, and the problem has also been studied for the local and the shuffle variation. In each case, the goal is to design an algorithm that computes privately a clustering, with the smallest possible error. The study of each variation gave rise to new algorithms: the landscape of private clustering algorithms is therefore quite intricate. In this paper, we show that a 20-year-old algorithm can be slightly modified to work for any of these models. This provides a unified picture: while matching almost all previously known results, it allows us to improve some of them and extend it to a new privacy model, the continual observation setting, where the input is changing over time and the algorithm must output a new solution at each time step. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: Oral presentation at ICML 2024

arXiv:2405.01339 [pdf, other]

Sensitivity Sampling for $k$-Means: Worst Case and Stability Optimal Coreset Bounds

Authors: Nikhil Bansal, Vincent Cohen-Addad, Milind Prabhu, David Saulpic, Chris Schwiegelshohn

Abstract: Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $Ωおめが$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A ver… ▽ More Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $Ωおめが$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A very popular method for coreset construction, both in theory and practice, is Sensitivity Sampling, where points are sampled in proportion to their importance. We show that Sensitivity Sampling yields optimal coresets of size $\tilde{O}(k/\varepsilon^2\min(\sqrt{k},\varepsilon^{-2}))$ for worst-case instances. Uniquely among all known coreset algorithms, for well-clusterable data sets with $Ωおめが(1)$ cost stability, Sensitivity Sampling gives coresets of size $\tilde{O}(k/\varepsilon^2)$, improving over the worst-case lower bound. Notably, Sensitivity Sampling does not have to know the cost stability in order to exploit it: It is appropriately sensitive to the clusterability of the data set while being oblivious to it. We also show that any coreset for stable instances consisting of only input points must have size $Ωおめが(k/\varepsilon^2)$. Our results for Sensitivity Sampling also extend to the $k$-median problem, and more general metric spaces. △ Less

Submitted 2 May, 2024; originally announced May 2024.

Comments: 57 pages

arXiv:2404.01936 [pdf, other]

Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

Authors: Andrew Draganov, David Saulpic, Chris Schwiegelshohn

Abstract: We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of poi… ▽ More We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points - while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former does not enforce accuracy while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size. We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time - within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available and has scripts to recreate the experiments. △ Less

Submitted 2 April, 2024; originally announced April 2024.

arXiv:2402.17327 [pdf, other]

Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably a… ▽ More We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably allows selecting a set of ``typical'' $k + 1/\varepsilon^2$ elements whose average loss corresponds to the average loss of the whole dataset, up to a multiplicative $(1\pm\varepsilon)$ factor and an additive $\varepsilon λらむだΦふぁい_k$, where $Φふぁい_k$ represents the $k$-means cost for the input embeddings and $λらむだ$ is the Hölder constant. We furthermore demonstrate the performance and scalability of our approach on fine-tuning foundation models and show that it outperforms state-of-the-art methods. We also show how it can be applied on linear regression, leading to a new sampling strategy that surprisingly matches the performances of leverage score sampling, while being conceptually simpler and more scalable. △ Less

Submitted 27 February, 2024; originally announced February 2024.

arXiv:2310.18034 [pdf, other]

Experimental Evaluation of Fully Dynamic k-Means via Coresets

Authors: Monika Henzinger, David Saulpic, Leonhard Sidl

Abstract: For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problems consists of finding $k$ centers such that the sum of distances squared from each data point to its closest center is minimized. Coresets are one the main tools developed recently to solve this problem in a big data context. They allow to compress the initial dataset while preserving its structure: running any algorithm on the… ▽ More For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problems consists of finding $k$ centers such that the sum of distances squared from each data point to its closest center is minimized. Coresets are one the main tools developed recently to solve this problem in a big data context. They allow to compress the initial dataset while preserving its structure: running any algorithm on the coreset provides a guarantee almost equivalent to running it on the full data. In this work, we study coresets in a fully-dynamic setting: points are added and deleted with the goal to efficiently maintain a coreset with which a k-means solution can be computed. Based on an algorithm from Henzinger and Kale [ESA'20], we present an efficient and practical implementation of a fully dynamic coreset algorithm, that improves the running time by up to a factor of 20 compared to our non-optimized implementation of the algorithm by Henzinger and Kale, without sacrificing more than 7% on the quality of the k-means solution. △ Less

Submitted 27 October, 2023; originally announced October 2023.

Comments: Accepted at ALENEX 24

arXiv:2310.04076 [pdf, other]

Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

Abstract: In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the preci… ▽ More In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there are no known deterministic dimensionality reduction procedure or coreset construction that avoid an exponential dependency on the input dimension $d$, the precision parameter $\varepsilon^{-1}$ or $k$. Furthermore, there is no coreset construction that succeeds with probability $1-1/n$ and whose size does not depend on the number of input points, $n$. This has led researchers in the area to ask what is the power of randomness for clustering sketches [Feldman, WIREs Data Mining Knowl. Discov'20]. Similarly, the best approximation ratio achievable deterministically without a complexity exponential in the dimension are $Ωおめが(1)$ for both $k$-median and $k$-means, even when allowing a complexity FPT in the number of clusters $k$. This stands in sharp contrast with the $(1+\varepsilon)$-approximation achievable in that case, when allowing randomization. In this paper, we provide deterministic sketches constructions for clustering, whose size bounds are close to the best-known randomized ones. We also construct a deterministic algorithm for computing $(1+\varepsilon)$-approximation to $k$-median and $k$-means in high dimensional Euclidean spaces in time $2^{k^2/\varepsilon^{O(1)}} poly(nd)$, close to the best randomized complexity. Furthermore, our new insights on sketches also yield a randomized coreset construction that uses uniform sampling, that immediately improves over the recent results of [Braverman et al. FOCS '22] by a factor $k$. △ Less

Submitted 6 October, 2023; originally announced October 2023.

Comments: FOCS 2023. Abstract reduced for arxiv requirements

arXiv:2307.03430 [pdf, ps, other]

Differential Privacy for Clustering Under Continual Observation

Authors: Max Dupré la Tour, Monika Henzinger, David Saulpic

Abstract: We consider the problem of clustering privately a dataset in $\mathbb{R}^d$ that undergoes both insertion and deletion of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically in the number… ▽ More We consider the problem of clustering privately a dataset in $\mathbb{R}^d$ that undergoes both insertion and deletion of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically in the number $T$ of updates. The multiplicative error is almost the same as non privately. To do so we show how to perform dimension reduction under continual observation and combine it with a differentially private greedy approximation algorithm for $k$-means. We also partially extend our results to the $k$-median problem. △ Less

Submitted 27 July, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

arXiv:2211.08184 [pdf, other]

Improved Coresets for Euclidean $k$-Means

Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, Omar Ali Sheikh-Omar

Abstract: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weigh… ▽ More Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weighted subset known as a coreset and then run any algorithm on this subset. The guarantee of the coreset is that for any candidate solution, the ratio between coreset cost and the cost of the original instance is less than a $(1\pm \varepsilon)$ factor. The current state of the art coreset size is $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for Euclidean $k$-means and $\tilde O(\min(k^{2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for Euclidean $k$-median. The best known lower bound for both problems is $Ωおめが(k \varepsilon^{-2})$. In this paper, we improve the upper bounds $\tilde O(\min(k^{3/2} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-4}))$ for $k$-means and $\tilde O(\min(k^{4/3} \cdot \varepsilon^{-2},k\cdot \varepsilon^{-3}))$ for $k$-median. In particular, ours is the first provable bound that breaks through the $k^2$ barrier while retaining an optimal dependency on $\varepsilon$. △ Less

Submitted 16 November, 2022; v1 submitted 15 November, 2022; originally announced November 2022.

arXiv:2206.08646 [pdf, other]

Scalable Differentially Private Clustering via Hierarchically Separated Trees

Authors: Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Andres Munoz, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii

Abstract: We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / εいぷしろん^2)$, where $εいぷしろん$ is the priv… ▽ More We study the private $k$-median and $k$-means clustering problem in $d$ dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy to implement algorithm, that is empirically competitive with state of the art non private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / εいぷしろん^2)$, where $εいぷしろん$ is the privacy guarantee. (The dimension term, $d$, can be replaced with $O(\log k)$ using standard dimension reduction techniques.) Although the worst-case guarantee is worse than that of state of the art private clustering methods, the algorithm we propose is practical, runs in near-linear, $\tilde{O}(nkd)$, time and scales to tens of millions of points. We also show that our method is amenable to parallelization in large-scale distributed computing environments. In particular we show that our private algorithms can be implemented in logarithmic number of MPC rounds in the sublinear memory regime. Finally, we complement our theoretical analysis with an empirical evaluation demonstrating the algorithm's efficiency and accuracy in comparison to other privacy clustering baselines. △ Less

Submitted 17 June, 2022; originally announced June 2022.

Comments: To appear at KDD'22

arXiv:2202.12793 [pdf, other]

Towards Optimal Lower Bounds for k-median and k-means Coresets

Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn

Abstract: Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern da… ▽ More Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous k-median problem ($z = 1$) and k-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern data analysis and massive data applications have given raise to the notion of coreset: a small (weighted) subset of the input point set preserving the cost of any solution to the problem up to a multiplicative $(1 \pm \varepsilon)$ factor, hence reducing from large to small scale the input to the problem. In this paper, we present improved lower bounds for coresets in various metric spaces. In finite metrics consisting of $n$ points and doubling metrics with doubling constant $D$, we show that any coreset for $(k,z)$ clustering must consist of at least $Ωおめが(k \varepsilon^{-2} \log n)$ and $Ωおめが(k \varepsilon^{-2} D)$ points, respectively. Both bounds match previous upper bounds up to polylog factors. In Euclidean spaces, we show that any coreset for $(k,z)$ clustering must consists of at least $Ωおめが(k\varepsilon^{-2})$ points. We complement these lower bounds with a coreset construction consisting of at most $\tilde{O}(k\varepsilon^{-2}\cdot \min(\varepsilon^{-z},k))$ points. △ Less

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2111.04589 [pdf, other]

An Improved Local Search Algorithm for k-Median

Authors: Vincent Cohen-Addad, Anupam Gupta, Lunjia Hu, Hoon Oh, David Saulpic

Abstract: We present a new local-search algorithm for the $k$-median clustering problem. We show that local optima for this algorithm give a $(2.836+εいぷしろん)$-approximation; our result improves upon the $(3+εいぷしろん)$-approximate local-search algorithm of Arya et al. [STOC 01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximat… ▽ More We present a new local-search algorithm for the $k$-median clustering problem. We show that local optima for this algorithm give a $(2.836+εいぷしろん)$-approximation; our result improves upon the $(3+εいぷしろん)$-approximate local-search algorithm of Arya et al. [STOC 01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximation guarantee for the problem. The new ingredient in our algorithm is the use of a potential function based on both the closest and second-closest facilities to each client. Specifically, the potential is the sum over all clients, of the distance of the client to its closest facility, plus (a small constant times) the truncated distance to its second-closest facility. We move from one solution to another only if the latter can be obtained by swapping a constant number of facilities, and has a smaller potential than the former. This refined potential allows us to avoid the bad local optima given by Arya et al. for the local-search algorithm based only on the cost of the solution. △ Less

Submitted 8 November, 2021; originally announced November 2021.

Comments: To appear at SODA 22

ACM Class: F.2.2

arXiv:2104.06133 [pdf, other]

doi 10.1145/3406325.3451022

A New Coreset Framework for Clustering

Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

Abstract: Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known a… ▽ More Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of the of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserves the cost of the solutions, also known as \emph{coresets}, has been an important research direction over the last 15 years. In this paper, we present a new, simple coreset framework that simultaneously improves upon the best known bounds for a large variety of settings, ranging from Euclidean space, doubling metric, minor-free metric, and the general metric cases. △ Less

Submitted 29 July, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

Comments: Improved presentation. Adds a simpler suboptimal proof for interesting points, and an improved analysis for planar graphs. Corrects errors in the construction of centroid sets

arXiv:2006.12897 [pdf, other]

Polynomial Time Approximation Schemes for Clustering in Low Highway Dimension Graphs

Authors: Andreas Emil Feldmann, David Saulpic

Abstract: We study clustering problems such as k-Median, k-Means, and Facility Location in graphs of low highway dimension, which is a graph parameter modeling transportation networks. It was previously shown that approximation schemes for these problems exist, which either run in quasi-polynomial time (assuming constant highway dimension) [Feldmann et al. SICOMP 2018] or run in FPT time (parameterized by t… ▽ More We study clustering problems such as k-Median, k-Means, and Facility Location in graphs of low highway dimension, which is a graph parameter modeling transportation networks. It was previously shown that approximation schemes for these problems exist, which either run in quasi-polynomial time (assuming constant highway dimension) [Feldmann et al. SICOMP 2018] or run in FPT time (parameterized by the number of clusters $k$, the highway dimension, and the approximation factor) [Becker et al. ESA~2018, Braverman et al. 2020]. In this paper we show that a polynomial-time approximation scheme (PTAS) exists (assuming constant highway dimension). We also show that the considered problems are NP-hard on graphs of highway dimension 1. △ Less

Submitted 31 May, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

ACM Class: F.2.2

arXiv:1901.09877 [pdf, other]

Dominating Sets and Connected Dominating Sets in Dynamic Graphs

Authors: Niklas Hjuler, Giuseppe F. Italiano, Nikos Parotsidis, David Saulpic

Abstract: In this paper we study the dynamic versions of two basic graph problems: Minimum Dominating Set and its variant Minimum Connected Dominating Set. For those two problems, we present algorithms that maintain a solution under edge insertions and edge deletions in time $O(Δでるた\cdot \text{polylog}~n)$ per update, where $Δでるた$ is the maximum vertex degree in the graph. In both cases, we achieve an approximati… ▽ More In this paper we study the dynamic versions of two basic graph problems: Minimum Dominating Set and its variant Minimum Connected Dominating Set. For those two problems, we present algorithms that maintain a solution under edge insertions and edge deletions in time $O(Δでるた\cdot \text{polylog}~n)$ per update, where $Δでるた$ is the maximum vertex degree in the graph. In both cases, we achieve an approximation ratio of $O(\log n)$, which is optimal up to a constant factor (under the assumption that $P \ne NP$). Although those two problems have been widely studied in the static and in the distributed settings, to the best of our knowledge we are the first to present efficient algorithms in the dynamic setting. As a further application of our approach, we also present an algorithm that maintains a Minimal Dominating Set in $O(min(Δでるた, \sqrt{m}))$ per update. △ Less

Submitted 28 January, 2019; originally announced January 2019.

arXiv:1812.08664 [pdf, other]

Near-Linear Time Approximation Schemes for Clustering in Doubling Metrics

Authors: Vincent Cohen-Addad, Andreas Emil Feldmann, David Saulpic

Abstract: We consider the classic Facility Location, $k$-Median, and $k$-Means problems in metric spaces of doubling dimension $d$. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is $2^{(\log(1/\eps)/\eps)^{O(d^2)}} n \log^4 n + 2^{O(d)} n \log^9 n$, making a significant improvement over the state-of-the-art algorithms which run in time… ▽ More We consider the classic Facility Location, $k$-Median, and $k$-Means problems in metric spaces of doubling dimension $d$. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is $2^{(\log(1/\eps)/\eps)^{O(d^2)}} n \log^4 n + 2^{O(d)} n \log^9 n$, making a significant improvement over the state-of-the-art algorithms which run in time $n^{(d/\eps)^{O(d)}}$. Moreover, we show how to extend the techniques used to get the first efficient approximation schemes for the problems of prize-collecting $k$-Medians and $k$-Means, and efficient bicriteria approximation schemes for $k$-Medians with outliers, $k$-Means with outliers and $k$-Center. △ Less

Submitted 20 May, 2020; v1 submitted 20 December, 2018; originally announced December 2018.

arXiv:1709.08357 [pdf, ps, other]

Generating Functionally Equivalent Programs Having Non-Isomorphic Control-Flow Graphs

Authors: Rémi Géraud, Mirko Koscina, Paul Lenczner, David Naccache, David Saulpic

Abstract: One of the big challenges in program obfuscation consists in modifying not only the program's straight-line code (SLC) but also the program's control flow graph (CFG). Indeed, if only SLC is modified, the program's CFG can be extracted and analyzed. Usually, the CFG leaks a considerable amount of information on the program's structure. In this work we propose a method allowing to re-write a code… ▽ More One of the big challenges in program obfuscation consists in modifying not only the program's straight-line code (SLC) but also the program's control flow graph (CFG). Indeed, if only SLC is modified, the program's CFG can be extracted and analyzed. Usually, the CFG leaks a considerable amount of information on the program's structure. In this work we propose a method allowing to re-write a code P into a functionally equivalent code P' such that CFG{P} and CFG{P'} are radically different. △ Less

Submitted 25 September, 2017; originally announced September 2017.

Comments: 16 pages paper, published in NordSec 2017 (conference), Proceedings of the Nordic Conference on Secure IT Systems (Nordic 2017)

arXiv:1707.08270 [pdf, other]

Polynomial-Time Approximation Schemes for k-Center and Bounded-Capacity Vehicle Routing in Graphs with Bounded Highway Dimension

Authors: Amariah Becker, Philip N. Klein, David Saulpic

Abstract: The concept of bounded highway dimension was developed to capture observed properties of the metrics of road networks. We show that a graph with bounded highway dimension, for any vertex, can be embedded into a a graph of bounded treewidth in such a way that the distance between $u$ and $v$ is preserved up to an additive error of $εいぷしろん$ times the distance from $u$ or $v$ to the selected vertex. We sh… ▽ More The concept of bounded highway dimension was developed to capture observed properties of the metrics of road networks. We show that a graph with bounded highway dimension, for any vertex, can be embedded into a a graph of bounded treewidth in such a way that the distance between $u$ and $v$ is preserved up to an additive error of $εいぷしろん$ times the distance from $u$ or $v$ to the selected vertex. We show that this theorem yields a PTAS for Bounded-Capacity Vehicle Routing in graphs of bounded highway dimension. In this problem, the input specifies a depot and a set of clients, each with a location and demand; the output is a set of depot-to-depot tours, where each client is visited by some tour and each tour covers at most $Q$ units of client demand. Our PTAS can be extended to handle penalties for unvisited clients. We extend this embedding result to handle a set $S$ of distinguished vertices. The treewidth depends on $|S|$, and the distance between $u$ and $v$ is preserved up to an additive error of $εいぷしろん$ times the distance from $u$ and $v$ to $S$. This embedding result implies a PTAS for Multiple Depot Bounded-Capacity Vehicle Routing: the tours can go from one depot to another. The embedding result also implies that, for fixed $k$, there is a PTAS for $k$-Center in graphs of bounded highway dimension. In this problem, the goal is to minimize $d$ such that there exist $k$ vertices (the centers) such that every vertex is within distance $d$ of some center. Similarly, for fixed $k$, there is a PTAS for $k$-Median in graphs of bounded highway dimension. In this problem, the goal is to minimize the sum of distances to the $k$ centers. △ Less

Submitted 13 November, 2017; v1 submitted 25 July, 2017; originally announced July 2017.

Showing 1–19 of 19 results for author: Saulpic, D