Search | arXiv e-print repository
Showing 1–19 of 19 results for author: Saulpic, D

  1. arXiv:2407.11217  [pdf, ps, other]

    cs.DS cs.AI

    Almost-linear Time Approximation Algorithm to Euclidean $k$-median and $k$-means

    Authors: Max Dupré la Tour, David Saulpic

    Abstract: Clustering is one of the staples of data analysis and unsupervised learning. As such, clustering algorithms are often used on massive data sets, and they need to be extremely fast. We focus on the Euclidean $k$-median and $k$-means problems, two of the standard ways to model the task of clustering. For these, the go-to algorithm is $k$-means++, which yields an $O(\log k)$-approximation in time…

    Submitted 15 July, 2024; originally announced July 2024.

  2. arXiv:2406.19926  [pdf, other]

    cs.DS

    Fully Dynamic k-Means Coreset in Near-Optimal Update Time

    Authors: Max Dupré la Tour, Monika Henzinger, David Saulpic

    Abstract: We study in this paper the problem of maintaining a solution to $k$-median and $k$-means clustering in a fully dynamic setting. To do so, we present an algorithm to efficiently maintain a coreset, a compressed version of the dataset, that allows easy computation of a clustering solution at query time. Our coreset algorithm has near-optimal update time of $\tilde O(k)$ in general metric spaces, whi…

    Submitted 28 June, 2024; originally announced June 2024.

    Comments: To appear at ESA 2024

  3. arXiv:2406.11649  [pdf, other]

    cs.DS cs.CR cs.LG

    Making Old Things New: A Unified Algorithm for Differentially Private Clustering

    Authors: Max Dupré la Tour, Monika Henzinger, David Saulpic

    Abstract: As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy is the first of them, and the problem has also been studied for the local and the shuffle variations. In each case, the goal is to design an algorithm that privately computes a clustering with the smallest possible error. T…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: Oral presentation at ICML 2024

  4. arXiv:2405.01339  [pdf, other]

    cs.DS

    Sensitivity Sampling for $k$-Means: Worst Case and Stability Optimal Coreset Bounds

    Authors: Nikhil Bansal, Vincent Cohen-Addad, Milind Prabhu, David Saulpic, Chris Schwiegelshohn

    Abstract: Coresets are arguably the most popular compression paradigm for center-based clustering objectives such as $k$-means. Given a point set $P$, a coreset $Ω$ is a small, weighted summary that preserves the cost of all candidate solutions $S$ up to a $(1\pm \varepsilon)$ factor. For $k$-means in $d$-dimensional Euclidean space the cost for solution $S$ is $\sum_{p\in P}\min_{s\in S}\|p-s\|^2$. A ver…

    Submitted 2 May, 2024; originally announced May 2024.

    Comments: 57 pages

  5. arXiv:2404.01936  [pdf, other]

    cs.LG cs.DS

    Settling Time vs. Accuracy Tradeoffs for Clustering Big Data

    Authors: Andrew Draganov, David Saulpic, Chris Schwiegelshohn

    Abstract: We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods are slower than the time it takes to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of poi…

    Submitted 2 April, 2024; originally announced April 2024.

  6. arXiv:2402.17327  [pdf, other]

    cs.LG cs.DS

    Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond

    Authors: Kyriakos Axiotis, Vincent Cohen-Addad, Monika Henzinger, Sammy Jerome, Vahab Mirrokni, David Saulpic, David Woodruff, Michael Wunder

    Abstract: We study the data selection problem, whose aim is to select a small representative subset of data that can be used to efficiently train a machine learning model. We present a new data selection approach based on $k$-means clustering and sensitivity sampling. Assuming access to an embedding representation of the data with respect to which the model loss is Hölder continuous, our approach provably a…

    Submitted 27 February, 2024; originally announced February 2024.

  7. arXiv:2310.18034  [pdf, other]

    cs.DS

    Experimental Evaluation of Fully Dynamic k-Means via Coresets

    Authors: Monika Henzinger, David Saulpic, Leonhard Sidl

    Abstract: For a set of points in $\mathbb{R}^d$, the Euclidean $k$-means problem consists of finding $k$ centers such that the sum of squared distances from each data point to its closest center is minimized. Coresets are one of the main tools developed recently to solve this problem in a big data context. They allow compressing the initial dataset while preserving its structure: running any algorithm on the…

    Submitted 27 October, 2023; originally announced October 2023.

    Comments: Accepted at ALENEX 24

  8. arXiv:2310.04076  [pdf, other]

    cs.DS

    Deterministic Clustering in High Dimensional Spaces: Sketches and Approximation

    Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

    Abstract: In all state-of-the-art sketching and coreset techniques for clustering, as well as in the best known fixed-parameter tractable approximation algorithms, randomness plays a key role. For the classic $k$-median and $k$-means problems, there is no known deterministic dimensionality reduction procedure or coreset construction that avoids an exponential dependency on the input dimension $d$, the preci…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: FOCS 2023. Abstract reduced for arxiv requirements

  9. arXiv:2307.03430  [pdf, ps, other]

    cs.DS cs.CR cs.LG

    Differential Privacy for Clustering Under Continual Observation

    Authors: Max Dupré la Tour, Monika Henzinger, David Saulpic

    Abstract: We consider the problem of privately clustering a dataset in $\mathbb{R}^d$ that undergoes both insertion and deletion of points. Specifically, we give an $\varepsilon$-differentially private clustering mechanism for the $k$-means objective under continual observation. This is the first approximation algorithm for that problem with an additive error that depends only logarithmically on the number…

    Submitted 27 July, 2023; v1 submitted 7 July, 2023; originally announced July 2023.

  10. arXiv:2211.08184  [pdf, other]

    cs.CG cs.LG

    Improved Coresets for Euclidean $k$-Means

    Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn, Omar Ali Sheikh-Omar

    Abstract: Given a set of $n$ points in $d$ dimensions, the Euclidean $k$-means problem (resp. the Euclidean $k$-median problem) consists of finding $k$ centers such that the sum of squared distances (resp. sum of distances) from every point to its closest center is minimized. The arguably most popular way of dealing with this problem in the big data setting is to first compress the data by computing a weigh…

    Submitted 16 November, 2022; v1 submitted 15 November, 2022; originally announced November 2022.

  11. arXiv:2206.08646  [pdf, other]

    cs.DS cs.CR cs.LG

    Scalable Differentially Private Clustering via Hierarchically Separated Trees

    Authors: Vincent Cohen-Addad, Alessandro Epasto, Silvio Lattanzi, Vahab Mirrokni, Andres Munoz, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii

    Abstract: We study the private $k$-median and $k$-means clustering problem in $d$-dimensional Euclidean space. By leveraging tree embeddings, we give an efficient and easy-to-implement algorithm that is empirically competitive with state-of-the-art non-private methods. We prove that our method computes a solution with cost at most $O(d^{3/2}\log n)\cdot OPT + O(k d^2 \log^2 n / ε^2)$, where $ε$ is the priv…

    Submitted 17 June, 2022; originally announced June 2022.

    Comments: To appear at KDD'22

  12. arXiv:2202.12793  [pdf, other]

    cs.DS cs.CG cs.LG

    Towards Optimal Lower Bounds for k-median and k-means Coresets

    Authors: Vincent Cohen-Addad, Kasper Green Larsen, David Saulpic, Chris Schwiegelshohn

    Abstract: Given a set of points in a metric space, the $(k,z)$-clustering problem consists of finding a set of $k$ points called centers, such that the sum of distances raised to the power of $z$ of every data point to its closest center is minimized. Special cases include the famous $k$-median problem ($z = 1$) and $k$-means problem ($z = 2$). The $k$-median and $k$-means problems are at the heart of modern da…

    Submitted 25 February, 2022; originally announced February 2022.

  13. arXiv:2111.04589  [pdf, other]

    cs.DS

    An Improved Local Search Algorithm for k-Median

    Authors: Vincent Cohen-Addad, Anupam Gupta, Lunjia Hu, Hoon Oh, David Saulpic

    Abstract: We present a new local-search algorithm for the $k$-median clustering problem. We show that local optima for this algorithm give a $(2.836+ε)$-approximation; our result improves upon the $(3+ε)$-approximate local-search algorithm of Arya et al. [STOC 01]. Moreover, a computer-aided analysis of a natural extension suggests that this approach may lead to an improvement over the best-known approximat…

    Submitted 8 November, 2021; originally announced November 2021.

    Comments: To appear at SODA 22

    ACM Class: F.2.2

  14. A New Coreset Framework for Clustering

    Authors: Vincent Cohen-Addad, David Saulpic, Chris Schwiegelshohn

    Abstract: Given a metric space, the $(k,z)$-clustering problem consists of finding $k$ centers such that the sum of distances raised to the power $z$ of every point to its closest center is minimized. This encapsulates the famous $k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing small-space sketches of the data that approximately preserve the cost of the solutions, also known a…

    Submitted 29 July, 2022; v1 submitted 13 April, 2021; originally announced April 2021.

    Comments: Improved presentation. Adds a simpler suboptimal proof for interesting points, and an improved analysis for planar graphs. Corrects errors in the construction of centroid sets

  15. arXiv:2006.12897  [pdf, other]

    cs.DS

    Polynomial Time Approximation Schemes for Clustering in Low Highway Dimension Graphs

    Authors: Andreas Emil Feldmann, David Saulpic

    Abstract: We study clustering problems such as k-Median, k-Means, and Facility Location in graphs of low highway dimension, which is a graph parameter modeling transportation networks. It was previously shown that approximation schemes for these problems exist, which either run in quasi-polynomial time (assuming constant highway dimension) [Feldmann et al. SICOMP 2018] or run in FPT time (parameterized by t…

    Submitted 31 May, 2021; v1 submitted 23 June, 2020; originally announced June 2020.

    ACM Class: F.2.2

  16. arXiv:1901.09877  [pdf, other]

    cs.DS

    Dominating Sets and Connected Dominating Sets in Dynamic Graphs

    Authors: Niklas Hjuler, Giuseppe F. Italiano, Nikos Parotsidis, David Saulpic

    Abstract: In this paper we study the dynamic versions of two basic graph problems: Minimum Dominating Set and its variant Minimum Connected Dominating Set. For those two problems, we present algorithms that maintain a solution under edge insertions and edge deletions in time $O(Δ\cdot \text{polylog}~n)$ per update, where $Δ$ is the maximum vertex degree in the graph. In both cases, we achieve an approximati…

    Submitted 28 January, 2019; originally announced January 2019.

  17. arXiv:1812.08664  [pdf, other]

    cs.DS cs.CG

    Near-Linear Time Approximation Schemes for Clustering in Doubling Metrics

    Authors: Vincent Cohen-Addad, Andreas Emil Feldmann, David Saulpic

    Abstract: We consider the classic Facility Location, $k$-Median, and $k$-Means problems in metric spaces of doubling dimension $d$. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is $2^{(\log(1/\varepsilon)/\varepsilon)^{O(d^2)}} n \log^4 n + 2^{O(d)} n \log^9 n$, making a significant improvement over the state-of-the-art algorithms which run in time…

    Submitted 20 May, 2020; v1 submitted 20 December, 2018; originally announced December 2018.

  18. arXiv:1709.08357  [pdf, ps, other]

    cs.CR

    Generating Functionally Equivalent Programs Having Non-Isomorphic Control-Flow Graphs

    Authors: Rémi Géraud, Mirko Koscina, Paul Lenczner, David Naccache, David Saulpic

    Abstract: One of the big challenges in program obfuscation consists in modifying not only the program's straight-line code (SLC) but also the program's control flow graph (CFG). Indeed, if only SLC is modified, the program's CFG can be extracted and analyzed. Usually, the CFG leaks a considerable amount of information on the program's structure. In this work we propose a method that allows re-writing a code…

    Submitted 25 September, 2017; originally announced September 2017.

    Comments: 16 pages paper, published in NordSec 2017 (conference), Proceedings of the Nordic Conference on Secure IT Systems (Nordic 2017)

  19. arXiv:1707.08270  [pdf, other]

    cs.DS

    Polynomial-Time Approximation Schemes for k-Center and Bounded-Capacity Vehicle Routing in Graphs with Bounded Highway Dimension

    Authors: Amariah Becker, Philip N. Klein, David Saulpic

    Abstract: The concept of bounded highway dimension was developed to capture observed properties of the metrics of road networks. We show that a graph with bounded highway dimension, for any vertex, can be embedded into a graph of bounded treewidth in such a way that the distance between $u$ and $v$ is preserved up to an additive error of $ε$ times the distance from $u$ or $v$ to the selected vertex. We sh…

    Submitted 13 November, 2017; v1 submitted 25 July, 2017; originally announced July 2017.