Search | arXiv e-print repository

Showing 1–50 of 110 results for author: Alistarh, D

  1. arXiv:2406.12572  [pdf, other]

    cs.CL cs.AI cs.LG

    Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models

    Authors: Eldar Kurtic, Amir Moeini, Dan Alistarh

    Abstract: We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning on large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across l…

    Submitted 19 June, 2024; v1 submitted 18 June, 2024; originally announced June 2024.

    ACM Class: I.2.7

  2. arXiv:2405.15756  [pdf, other]

    cs.LG cs.AI

    Sparse Expansion and Neuronal Disentanglement

    Authors: Shashata Sawmya, Linghao Kong, Ilia Markov, Dan Alistarh, Nir Shavit

    Abstract: We show how to improve the inference efficiency of an LLM by expanding it into a mixture of sparse experts, where each expert is a copy of the original weights, one-shot pruned for a specific cluster of input values. We call this approach $\textit{Sparse Expansion}$. We show that, for models such as Llama 2 70B, as we increase the number of sparse experts, Sparse Expansion outperforms all other on…

    Submitted 24 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

    Comments: 10 pages, 8 figures

  3. arXiv:2405.15593  [pdf, other]

    cs.LG math.NA

    MicroAdam: Accurate Adaptive Optimization with Low Space Overhead and Provable Convergence

    Authors: Ionut-Vlad Modoranu, Mher Safaryan, Grigory Malinovsky, Eldar Kurtic, Thomas Robert, Peter Richtarik, Dan Alistarh

    Abstract: We propose a new variant of the Adam optimizer [Kingma and Ba, 2014] called MICROADAM that specifically minimizes memory overheads, while maintaining theoretical convergence guarantees. We achieve this by compressing the gradient information before it is fed into the optimizer state, thereby reducing its memory footprint significantly. We control the resulting compression error via a novel instanc…

    Submitted 24 May, 2024; originally announced May 2024.

  4. arXiv:2405.14852  [pdf, other]

    cs.LG

    PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

    Authors: Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, Peter Richtarik

    Abstract: There has been significant interest in "extreme" compression of large language models (LLMs), i.e., to 1-2 bits per parameter, which allows such models to be executed efficiently on resource-constrained devices. Existing work focused on improved one-shot quantization techniques and weight representations; yet, purely post-training approaches are reaching diminishing returns in terms of the accurac…

    Submitted 30 May, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

    Comments: Preprint

  5. arXiv:2405.03594  [pdf, other]

    cs.CL cs.AI

    Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

    Authors: Abhinav Agarwalla, Abhay Gupta, Alexandre Marques, Shubhra Pandit, Michael Goin, Eldar Kurtic, Kevin Leong, Tuan Nguyen, Mahmoud Salem, Dan Alistarh, Sean Lie, Mark Kurtz

    Abstract: Large language models (LLMs) have revolutionized Natural Language Processing (NLP), but their size creates computational bottlenecks. We introduce a novel approach to create accurate, sparse foundational versions of performant LLMs that achieve full accuracy recovery for fine-tuning tasks at up to 70% sparsity. We achieve this for the LLaMA-2 7B model by combining the SparseGPT one-shot pruning me…

    Submitted 6 May, 2024; originally announced May 2024.

  6. arXiv:2404.03605  [pdf, other]

    cs.LG cs.CL

    Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization

    Authors: Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Rameswar Panda, Yoon Kim

    Abstract: We consider the problem of accurate quantization for language models, where both the weights and activations are uniformly quantized to 4 bits per parameter, the lowest bitwidth format natively supported by GPU hardware. In this context, the key challenge is activation quantization: it is known that language models contain outlier channels whose values on average are orders of magnitude higher tha…

    Submitted 4 April, 2024; originally announced April 2024.

  7. arXiv:2404.00456  [pdf, other]

    cs.LG

    QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

    Authors: Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, James Hensman

    Abstract: We introduce QuaRot, a new Quantization scheme based on Rotations, which is able to quantize LLMs end-to-end, including all weights, activations, and KV cache in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to th…

    Submitted 30 March, 2024; originally announced April 2024.

    Comments: 19 pages, 6 figures

  8. arXiv:2401.06118  [pdf, other]

    cs.LG cs.CL

    Extreme Compression of Large Language Models via Additive Quantization

    Authors: Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh

    Abstract: The emergence of accurate open large language models (LLMs) has led to a race towards performant quantization techniques which can enable their execution on end-user devices. In this paper, we revisit the problem of "extreme" LLM compression -- defined as targeting extremely low bit counts, such as 2 to 3 bits per parameter -- from the point of view of classic methods in Multi-Codebook Quantizat…

    Submitted 8 June, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: ICML, 2024

  9. arXiv:2401.04679  [pdf, other]

    cs.CL cs.AI cs.LG

    RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation

    Authors: Mahdi Nikdan, Soroush Tabesh, Elvir Crnčević, Dan Alistarh

    Abstract: We investigate parameter-efficient fine-tuning (PEFT) methods that can provide good accuracy under limited computational and memory budgets in the context of large language models (LLMs). We present a new PEFT method called Robust Adaptation (RoSA) inspired by robust principal component analysis that jointly trains $\textit{low-rank}$ and $\textit{highly-sparse}$ components on top of a set of fixe…

    Submitted 3 June, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

  10. arXiv:2312.13547  [pdf, other]

    cs.CL

    How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

    Authors: Eldar Kurtic, Torsten Hoefler, Dan Alistarh

    Abstract: Pruning large language models (LLMs) from the BERT family has emerged as a standard compression benchmark, and several pruning methods have been proposed for this task. The recent "Sparsity May Cry" (SMC) benchmark put into question the validity of all existing methods, exhibiting a more complex setup where many known pruning methods appear to fail. We revisit the question of accurate BERT-pruni…

    Submitted 20 December, 2023; originally announced December 2023.

    Comments: Accepted as oral to CPAL 2024

  11. arXiv:2312.06872  [pdf, other]

    cs.LG

    ELSA: Partial Weight Freezing for Overhead-Free Sparse Network Deployment

    Authors: Paniz Halvachi, Alexandra Peste, Dan Alistarh, Christoph H. Lampert

    Abstract: We present ELSA, a practical solution for creating deep networks that can easily be deployed at different levels of sparsity. The core idea is to embed one or more sparse networks within a single dense network as a proper subset of the weights. At prediction time, any sparse model can be extracted effortlessly simply by zeroing out weights according to a predefined mask. ELSA is simple, powerful a…

    Submitted 17 December, 2023; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: updated to reflect PackNet prior work

  12. arXiv:2310.20452  [pdf, other]

    cs.LG cs.AI math.OC stat.ML

    AsGrad: A Sharp Unified Analysis of Asynchronous-SGD Algorithms

    Authors: Rustem Islamov, Mher Safaryan, Dan Alistarh

    Abstract: We analyze asynchronous-type algorithms for distributed SGD in the heterogeneous setting, where each worker has its own computation and communication speeds, as well as data distribution. In these algorithms, workers compute possibly stale and stochastic gradients associated with their local data at some iteration back in history and then return those gradients to the server without synchronizing…

    Submitted 31 October, 2023; originally announced October 2023.

  13. arXiv:2310.16795  [pdf, other]

    cs.LG

    QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

    Authors: Elias Frantar, Dan Alistarh

    Abstract: Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challe…

    Submitted 25 October, 2023; originally announced October 2023.

  14. arXiv:2310.09259  [pdf, other]

    cs.LG

    QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models

    Authors: Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh

    Abstract: Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference costs to allow for efficient local computation. Yet, the vast majority of existing work focuses on weight-only quantization, which can reduce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address them in compute-bound scenarios,…

    Submitted 2 November, 2023; v1 submitted 13 October, 2023; originally announced October 2023.

    Comments: 16 pages

  15. arXiv:2310.06927  [pdf, other]

    cs.CL cs.AI

    Sparse Fine-tuning for Inference Acceleration of Large Language Models

    Authors: Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh

    Abstract: We consider the problem of accurate sparse fine-tuning of large language models (LLMs), that is, fine-tuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based fine-tuning may fail to recover accuracy, especially at high sparsities. To address this, we perform a detailed study of distillation-type losses, determ…

    Submitted 13 October, 2023; v1 submitted 10 October, 2023; originally announced October 2023.

  16. arXiv:2310.05298  [pdf, other]

    cs.DS

    Efficient Self-Adjusting Search Trees via Lazy Updates

    Authors: Alexander Slastin, Dan Alistarh, Vitaly Aksenov

    Abstract: Self-adjusting data structures are a classic approach to adapting the complexity of operations to the data access distribution. While several self-adjusting variants are known for both binary search trees and B-Trees, existing constructions come with limitations. For instance, existing works on self-adjusting B-Trees do not provide static-optimality and tend to be complex and inefficient to implem…

    Submitted 8 October, 2023; originally announced October 2023.

  17. arXiv:2310.05293  [pdf, other]

    cs.DB cs.DC cs.DS

    Wait-free Trees with Asymptotically-Efficient Range Queries

    Authors: Ilya Kokorin, Dan Alistarh, Vitaly Aksenov

    Abstract: Tree data structures, such as red-black trees, quad trees, treaps, or tries, are fundamental tools in computer science. A classical problem in concurrency is to obtain expressive, efficient, and scalable versions of practical tree data structures. We are interested in concurrent trees supporting range queries, i.e., queries that involve multiple consecutive data items. Existing implementations wit…

    Submitted 8 October, 2023; originally announced October 2023.

  18. arXiv:2310.04519  [pdf, other]

    cs.LG

    SPADE: Sparsity-Guided Debugging for Deep Neural Networks

    Authors: Arshia Soltani Moakhar, Eugenia Iofinova, Dan Alistarh

    Abstract: Interpretability, broadly defined as mechanisms for understanding why and how machine learning models reach their decisions, is one of the key open goals at the intersection of deep learning theory and practice. Towards this goal, multiple tools have been proposed to aid a human examiner in reasoning about a network's behavior in general or on a set of instances. However, the outputs of these tool…

    Submitted 6 October, 2023; originally announced October 2023.

    Comments: preprint. 28 pages

  19. arXiv:2309.08520  [pdf, other]

    cs.LG

    Scaling Laws for Sparsely-Connected Foundation Models

    Authors: Elias Frantar, Carlos Riquelme, Neil Houlsby, Dan Alistarh, Utku Evci

    Abstract: We explore the impact of parameter sparsity on the scaling behavior of Transformers trained on massive datasets (i.e., "foundation models"), in both vision and language domains. In this setting, we identify the first scaling law describing the relationship between weight sparsity, number of non-zero parameters, and amount of training data, which we validate empirically across model and data scales…

    Submitted 15 September, 2023; originally announced September 2023.

  20. arXiv:2308.02060  [pdf, other]

    cs.LG cs.AI

    Accurate Neural Network Pruning Requires Rethinking Sparse Optimization

    Authors: Denis Kuznedelev, Eldar Kurtic, Eugenia Iofinova, Elias Frantar, Alexandra Peste, Dan Alistarh

    Abstract: Obtaining versions of deep neural networks that are both highly-accurate and highly-sparse is one of the main challenges in the area of model compression, and several high-performance pruning techniques have been investigated by the community. Yet, much less is known about the interaction between sparsity and the standard stochastic optimization techniques used for training sparse networks, and mo…

    Submitted 8 September, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

  21. arXiv:2307.07297  [pdf, other]

    cs.DC cs.GT

    Game Dynamics and Equilibrium Computation in the Population Protocol Model

    Authors: Dan Alistarh, Krishnendu Chatterjee, Mehrdad Karrabi, John Lazarsfeld

    Abstract: We initiate the study of game dynamics in the population protocol model: $n$ agents each maintain a current local strategy and interact in pairs uniformly at random. Upon each interaction, the agents play a two-person game and receive a payoff from an underlying utility function, and they can subsequently update their strategies according to a fixed local algorithm. In this setting, we ask how the…

    Submitted 19 May, 2024; v1 submitted 14 July, 2023; originally announced July 2023.

    Comments: To appear in PODC 2024

  22. arXiv:2307.03738  [pdf, other]

    cs.LG cs.CL cs.PF

    QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models

    Authors: Tommaso Pegolotti, Elias Frantar, Dan Alistarh, Markus Püschel

    Abstract: We present ongoing work on a new automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Our approach is informed by the target architecture and a performance model, including both hardware characteristics and method-specific accuracy constraints. Results on CPU-based inference for LLaMA models show that our approach can…

    Submitted 7 July, 2023; originally announced July 2023.

  23. arXiv:2306.08670  [pdf, other]

    cs.LG cs.DC cs.DS

    Simple Opinion Dynamics for No-Regret Learning

    Authors: John Lazarsfeld, Dan Alistarh

    Abstract: We study a cooperative multi-agent bandit setting in the distributed GOSSIP model: in every round, each of $n$ agents chooses an action from a common set, observes the action's corresponding reward, and subsequently exchanges information with a single randomly chosen neighbor, which may inform its choice in the next round. We introduce and analyze families of memoryless and time-independent protoc…

    Submitted 8 July, 2024; v1 submitted 14 June, 2023; originally announced June 2023.

  24. arXiv:2306.06098  [pdf, other]

    cs.LG math.NA math.OC

    Error Feedback Can Accurately Compress Preconditioners

    Authors: Ionut-Vlad Modoranu, Aleksei Kalinov, Eldar Kurtic, Elias Frantar, Dan Alistarh

    Abstract: Leveraging second-order information about the loss at the scale of deep networks is one of the main lines of approach for improving the performance of current optimizers for deep learning. Yet, existing approaches for accurate full-matrix preconditioning, such as Full-Matrix Adagrad (GGT) or Matrix-Free Approximate Curvature (M-FAC) suffer from massive storage costs when applied even to small-scal…

    Submitted 5 June, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

  25. arXiv:2306.03078  [pdf, other]

    cs.CL cs.LG

    SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

    Authors: Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh

    Abstract: Recent advances in large language model (LLM) pretraining have led to high-quality LLMs with impressive abilities. By compressing such LLMs via quantization to 3-4 bits per parameter, they can fit into memory-limited devices such as laptops and mobile phones, enabling personalized use. However, quantization down to 3-4 bits per parameter usually leads to moderate-to-high accuracy losses, especiall…

    Submitted 5 June, 2023; originally announced June 2023.

    Comments: Extended preprint

  26. arXiv:2305.17581  [pdf, other]

    cs.LG math.OC

    Knowledge Distillation Performs Partial Variance Reduction

    Authors: Mher Safaryan, Alexandra Peste, Dan Alistarh

    Abstract: Knowledge distillation is a popular approach for enhancing the performance of "student" models, with lower representational capacity, by taking advantage of more powerful "teacher" models. Despite its apparent simplicity and widespread use, the underlying mechanics behind knowledge distillation (KD) are still not fully understood. In this work, we shed new light on the inner workings of this m…

    Submitted 8 December, 2023; v1 submitted 27 May, 2023; originally announced May 2023.

    Comments: 15+22 pages, NeurIPS 2023

  27. arXiv:2304.12622  [pdf, other]

    cs.CV cs.LG

    Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures

    Authors: Eugenia Iofinova, Alexandra Peste, Dan Alistarh

    Abstract: Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression. Yet, several recent works have raised the issue that pruning may induce or exacerbate bias in the output of the compressed model. Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is no…

    Submitted 25 April, 2023; originally announced April 2023.

    Comments: 8 Pages / 49 with references and appendix. Accepted to CVPR 2023

  28. Provably-Efficient and Internally-Deterministic Parallel Union-Find

    Authors: Alexander Fedorov, Diba Hashemi, Giorgi Nadiradze, Dan Alistarh

    Abstract: Determining the degree of inherent parallelism in classical sequential algorithms and leveraging it for fast parallel execution is a key topic in parallel computing, and detailed analyses are known for a wide range of classical algorithms. In this paper, we perform the first such analysis for the fundamental Union-Find problem, in which we are given a graph as a sequence of edges, and must maintai…

    Submitted 18 April, 2023; originally announced April 2023.

  29. arXiv:2303.14409  [pdf, other]

    cs.CV

    Vision Models Can Be Efficiently Specialized via Few-Shot Task-Aware Compression

    Authors: Denis Kuznedelev, Soroush Tabesh, Kimia Noorbakhsh, Elias Frantar, Sara Beery, Eldar Kurtic, Dan Alistarh

    Abstract: Recent vision architectures and self-supervised training methods enable vision models that are extremely accurate and general, but come with massive parameter and computational costs. In practical settings, such as camera traps, users have limited resources, and may fine-tune a pretrained model on (often limited) data from a small set of specific categories of interest. These users may wish to mak…

    Submitted 25 March, 2023; originally announced March 2023.

    MSC Class: 68T07 ACM Class: I.m

  30. arXiv:2302.04852  [pdf, other]

    cs.LG

    SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks

    Authors: Mahdi Nikdan, Tommaso Pegolotti, Eugenia Iofinova, Eldar Kurtic, Dan Alistarh

    Abstract: We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse. Our algorithm is general, as it applies to arbitrary (unstructured) sparsity and common layer types (e.g., convolutional or linear). We provide a fast vectorized implementation on commodity CPUs, and show that it can yield speedups in end-to…

    Submitted 9 February, 2023; originally announced February 2023.

  31. arXiv:2302.04089  [pdf, other]

    cs.LG cs.CL

    ZipLM: Inference-Aware Structured Pruning of Language Models

    Authors: Eldar Kurtic, Elias Frantar, Dan Alistarh

    Abstract: The breakthrough performance of large language models (LLMs) comes with major computational footprints and high deployment costs. In this paper, we progress towards resolving this problem by proposing a novel structured compression approach for LLMs, called ZipLM. ZipLM achieves state-of-the-art accuracy-vs-speedup, while matching a set of desired target runtime speedups in any given inference env…

    Submitted 26 October, 2023; v1 submitted 7 February, 2023; originally announced February 2023.

    Comments: Accepted to NeurIPS 2023

  32. arXiv:2302.02390  [pdf, other]

    cs.LG

    Quantized Distributed Training of Large Models with Convergence Guarantees

    Authors: Ilia Markov, Adrian Vladu, Qi Guo, Dan Alistarh

    Abstract: Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is…

    Submitted 5 February, 2023; originally announced February 2023.

  33. arXiv:2301.00774  [pdf, other]

    cs.LG

    SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

    Authors: Elias Frantar, Dan Alistarh

    Abstract: We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. We can execute SparseGPT on the largest available…

    Submitted 22 March, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

  34. PathCAS: An Efficient Middle Ground for Concurrent Search Data Structures

    Authors: Trevor Brown, William Sigouin, Dan Alistarh

    Abstract: To maximize the performance of concurrent data structures, researchers have often turned to highly complex fine-grained techniques, resulting in efficient and elegant algorithms, which can however be often difficult to understand and prove correct. While simpler techniques exist, such as transactional memory, they can have limited performance or portability relative to their fine-grained counterpa…

    Submitted 19 December, 2022; originally announced December 2022.

    Comments: Extended version of the conference paper, which appeared at PPoPP'22. This work won the PPoPP'22 best artifact award

  35. arXiv:2211.04986  [pdf, other]

    cs.DS cs.DC

    Fast and Scalable Channels in Kotlin Coroutines

    Authors: Nikita Koval, Dan Alistarh, Roman Elizarov

    Abstract: Asynchronous programming has gained significant popularity over the last decade: support for this programming pattern is available in many popular languages via libraries and native language implementations, typically in the form of coroutines or the async/await construct. Instead of programming via shared memory, this concept assumes implicit synchronization through message passing. The key data…

    Submitted 9 November, 2022; originally announced November 2022.

  36. arXiv:2210.17357  [pdf, other]

    cs.LG cs.DC

    L-GreCo: Layerwise-Adaptive Gradient Compression for Efficient and Accurate Deep Learning

    Authors: Mohammadreza Alimohammadi, Ilia Markov, Elias Frantar, Dan Alistarh

    Abstract: Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks. To address this issue, entire families of compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all k…

    Submitted 9 June, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

  37. arXiv:2210.17323  [pdf, other]

    cs.LG

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh

    Abstract: Generative Pre-trained Transformer models, known as GPT or OPT, set themselves apart through breakthrough performance across complex language modelling tasks, but also by their extremely high computational and storage costs. Specifically, due to their massive size, even inference for large, highly-accurate GPT models may require multiple performant GPUs, which limits the usability of such models.…

    Submitted 22 March, 2023; v1 submitted 31 October, 2022; originally announced October 2022.

    Comments: ICLR 2023

  38. arXiv:2210.09223  [pdf, other]

    cs.CV cs.LG

    CAP: Correlation-Aware Pruning for Highly-Accurate Sparse Vision Models

    Authors: Denis Kuznedelev, Eldar Kurtic, Elias Frantar, Dan Alistarh

    Abstract: Driven by significant improvements in architectural design and training pipelines, computer vision has recently experienced dramatic progress in terms of accuracy on classic benchmarks such as ImageNet. These highly-accurate models are challenging to deploy, as they appear harder to compress using standard techniques such as pruning. We address this issue by introducing the Correlation Aware Prune…

    Submitted 31 May, 2023; v1 submitted 14 October, 2022; originally announced October 2022.

    MSC Class: 68T07 ACM Class: I.m

  39. arXiv:2210.07703  [pdf, other]

    cs.LG cs.DC

    Hybrid Decentralized Optimization: First- and Zeroth-Order Optimizers Can Be Jointly Leveraged For Faster Convergence

    Authors: Shayan Talaei, Giorgi Nadiradze, Dan Alistarh

    Abstract: Distributed optimization has become one of the standard ways of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. I…

    Submitted 14 October, 2022; originally announced October 2022.

  40. arXiv:2210.06384  [pdf, other]

    cs.CL

    GMP*: Well-Tuned Gradual Magnitude Pruning Can Outperform Most BERT-Pruning Methods

    Authors: Eldar Kurtic, Dan Alistarh

    Abstract: We revisit the performance of the classic gradual magnitude pruning (GMP) baseline for large language models, focusing on the classic BERT benchmark on various popular tasks. Despite existing evidence in the literature that GMP performs poorly, we show that a simple and general variant, which we call GMP*, can match and sometimes outperform more complex state-of-the-art methods. Our results provid…

    Submitted 8 December, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

  41. arXiv:2208.11580  [pdf, other]

    cs.LG

    Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning

    Authors: Elias Frantar, Sidak Pal Singh, Dan Alistarh

    Abstract: We consider the problem of model compression for deep neural networks (DNNs) in the challenging one-shot/post-training setting, in which we are given an accurate trained model, and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of the emerging software and hardware support for executing models compressed via…

    Submitted 8 January, 2023; v1 submitted 24 August, 2022; originally announced August 2022.

    Comments: Published at NeurIPS 2022

  42. arXiv:2207.14200  [pdf, other]

    cs.LG

    CrAM: A Compression-Aware Minimizer

    Authors: Alexandra Peste, Adrian Vladu, Eldar Kurtic, Christoph H. Lampert, Dan Alistarh

    Abstract: Deep neural networks (DNNs) often have to be compressed, via pruning and/or quantization, before they can be deployed in practical settings. In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning. Thus, dense models trai…

    Submitted 4 May, 2023; v1 submitted 28 July, 2022; originally announced July 2022.

    Comments: Accepted to ICLR 2023

  43. arXiv:2206.10032  [pdf, other]

    cs.LG

    Communication-Efficient Federated Learning With Data and Client Heterogeneity

    Authors: Hossein Zakerinia, Shayan Talaei, Giorgi Nadiradze, Dan Alistarh

    Abstract: Federated Learning (FL) enables large-scale distributed training of machine learning models, while still allowing individual nodes to maintain data locally. However, executing FL at scale comes with inherent practical challenges: 1) heterogeneity of the local node data distributions, 2) heterogeneity of node computational speeds (asynchrony), but also 3) constraints in the amount of commun…

    Submitted 3 June, 2023; v1 submitted 20 June, 2022; originally announced June 2022.

  44. arXiv:2205.12597  [pdf, other]

    cs.DC

    Near-Optimal Leader Election in Population Protocols on Graphs

    Authors: Dan Alistarh, Joel Rybicki, Sasha Voitovych

    Abstract: In the stochastic population protocol model, we are given a connected graph with $n$ nodes, and in every time step, a scheduler samples an edge of the graph uniformly at random and the nodes connected by this edge interact. A fundamental task in this model is stable leader election, in which all nodes start in an identical state and the aim is to reach a configuration in which (1) exactly one node…

    Submitted 21 December, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

    Comments: 55 pages, 2 figures, revised version

  45. arXiv:2203.07259  [pdf, other]

    cs.CL cs.LG

    The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

    Authors: Eldar Kurtic, Daniel Campos, Tuan Nguyen, Elias Frantar, Mark Kurtz, Benjamin Fineran, Michael Goin, Dan Alistarh

    Abstract: Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to decrease model size and increase inference speed, wi…

    Submitted 17 October, 2022; v1 submitted 14 March, 2022; originally announced March 2022.

    Comments: Accepted to EMNLP 2022

  46. arXiv:2203.06638  [pdf, other]

    cs.LG

    Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

    Authors: Bapi Chatterjee, Vyacheslav Kungurtsev, Dan Alistarh

    Abstract: Powered by the simplicity of lock-free asynchrony, Hogwild! is a go-to approach to parallelize SGD over a shared-memory setting. Despite its popularity and concomitant extensions, such as PASSM+ wherein concurrent processes update a shared model with partitioned gradients, scaling it to decentralized workers has surprisingly been relatively unexplored. To our knowledge, there is no convergence…

    Submitted 13 March, 2022; originally announced March 2022.

  47. arXiv:2201.13096  [pdf, other]

    cs.LG

    SPDY: Accurate Pruning with Speedup Guarantees

    Authors: Elias Frantar, Dan Alistarh

    Abstract: The recent focus on the efficiency of deep neural networks (DNNs) has led to significant work on model compression approaches, of which weight pruning is one of the most popular. At the same time, there is rapidly-growing computational support for efficiently executing the unstructured-sparse models obtained via pruning. Yet, most existing pruning methods minimize just the number of remaining weig…

    Submitted 24 August, 2022; v1 submitted 31 January, 2022; originally announced January 2022.

    Comments: ICML 2022

  48. arXiv:2112.05830  [pdf, ps, other]

    cs.DC math.PR

    Collecting Coupons is Faster with Friends

    Authors: Dan Alistarh, Peter Davies

    Abstract: In this note, we introduce a distributed twist on the classic coupon collector problem: a set of $m$ collectors wish to each obtain a set of $n$ coupons; for this, they can each sample coupons uniformly at random, but can also meet in pairwise interactions, during which they can exchange coupons. By doing so, they hope to reduce the number of coupons that must be sampled by each collector in order…

    Submitted 10 December, 2021; originally announced December 2021.

    Comments: 9 pages, appeared as an invited paper at SIROCCO 2021

  49. arXiv:2111.13445  [pdf, other]

    cs.CV cs.AI cs.LG

    How Well Do Sparse Imagenet Models Transfer?

    Authors: Eugenia Iofinova, Alexandra Peste, Mark Kurtz, Dan Alistarh

    Abstract: Transfer learning is a classic paradigm by which models pretrained on large "upstream" datasets are adapted to yield good results on "downstream" specialized datasets. Generally, more accurate models on the "upstream" dataset tend to provide better transfer accuracy "downstream". In this work, we perform an in-depth investigation of this phenomenon in the context of convolutional neural networks (…

    Submitted 21 April, 2022; v1 submitted 26 November, 2021; originally announced November 2021.

    Comments: Accepted to CVPR'22. This version: 25 pages, 9 figures (including appendix). **Includes extended upstream training results, which are not present in the CVPR version.**

  50. arXiv:2111.12682  [pdf, other]

    cs.PL cs.DS

    CQS: A Formally-Verified Framework for Fair and Abortable Synchronization

    Authors: Nikita Koval, Dmitry Khalanskiy, Dan Alistarh

    Abstract: Writing concurrent code that is both correct and efficient is notoriously difficult. Thus, programmers often prefer to use synchronization abstractions, which render code simpler and easier to reason about. Despite a wealth of work on this topic, there is still a gap between the rich semantics provided by synchronization abstractions in modern programming languages -- specifically, \emph{fair} FIF…

    Submitted 20 May, 2023; v1 submitted 22 November, 2021; originally announced November 2021.

    Comments: PLDI 2023