Search | arXiv e-print repository

arXiv:2406.11929 [pdf, other]

Long-time asymptotics of noisy SVGD outside the population limit

Authors: Victor Priser, Pascal Bianchi, Adil Salim

Abstract: Stein Variational Gradient Descent (SVGD) is a widely used sampling algorithm that has been successfully applied in several areas of Machine Learning. SVGD operates by iteratively moving a set of interacting particles (which represent the samples) to approximate the target distribution. Despite recent studies on the complexity of SVGD and its variants, their long-time asymptotic behavior (i.e., after numerous iterations) is still not understood in the finite number of particles regime. We study the long-time asymptotic behavior of a noisy variant of SVGD. First, we establish that the limit set of noisy SVGD for a large number of iterations is well-defined. We then characterize this limit set, showing that it approaches the target distribution as the number of particles increases. In particular, noisy SVGD provably avoids the variance collapse observed for SVGD. Our approach involves demonstrating that the trajectories of noisy SVGD closely resemble those described by a McKean-Vlasov process.

Submitted 21 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

arXiv:2404.14219 [pdf, other]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Authors: Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Qin Cai, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Yen-Chun Chen, Yi-Ling Chen, Parul Chopra, et al. (90 additional authors not shown)

Abstract: We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens, whose overall performance, as measured by both academic benchmarks and internal testing, rivals that of models such as Mixtral 8x7B and GPT-3.5 (e.g., phi-3-mini achieves 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in our dataset for training, a scaled-up version of the one used for phi-2, composed of heavily filtered publicly available web data and synthetic data. The model is also further aligned for robustness, safety, and chat format. We also provide some initial parameter-scaling results with 7B and 14B models trained for 4.8T tokens, called phi-3-small and phi-3-medium, both significantly more capable than phi-3-mini (e.g., respectively 75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench). Moreover, we also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.

Submitted 23 May, 2024; v1 submitted 22 April, 2024; originally announced April 2024.

Comments: 19 pages

arXiv:2311.12825 [pdf, ps, other]

A PSO Based Method to Generate Actionable Counterfactuals for High Dimensional Data

Authors: Shashank Shekhar, Asif Salim, Adesh Bansode, Vivaswan Jinturkar, Anirudha Nayak

Abstract: Counterfactual explanations (CFE) are methods that explain a machine learning model by giving an alternate class prediction of a data point with some minimal changes in its features. They help users identify the data attributes that caused an undesirable prediction, such as a loan or credit card rejection. We describe an efficient and actionable counterfactual (CF) generation method based on particle swarm optimization (PSO). We propose a simple objective function for the optimization of the instance-centric CF generation problem. The PSO brings in a lot of flexibility in terms of carrying out multi-objective optimization in large dimensions, capability for multiple CF generation, and setting box constraints or immutability of data attributes. An algorithm is proposed that incorporates these features, and it enables greater control over the proximity and sparsity properties of the generated CFs. The proposed algorithm is evaluated with a set of actionability metrics on real-world datasets, and the results were superior to those of state-of-the-art methods.

Submitted 30 November, 2023; v1 submitted 30 September, 2023; originally announced November 2023.

Comments: Accepted in IEEE CSDE 2023

arXiv:2311.11291 [pdf, ps, other]

doi 10.1142/S0217732323750032

A comment on singular and non-singular black holes using the Gaussian distribution

Authors: D. Batic, M. Nowakowski, S. A. Salim

Abstract: In this work, we join the controversial discussion on singular and non-singular black holes using the Gaussian distribution. Our result which uses correct boundary conditions shifts the debate in favour of regular black holes at the centre. The present findings add new insights into the ongoing discussions surrounding singularities in black hole solutions of the Einstein equations.

Submitted 19 November, 2023; originally announced November 2023.

Comments: 6 pages

arXiv:2306.16308 [pdf, other]

Gaussian random field approximation via Stein's method with applications to wide random neural networks

Authors: Krishnakumar Balasubramanian, Larry Goldstein, Nathan Ross, Adil Salim

Abstract: We derive upper bounds on the Wasserstein distance ($W_1$), with respect to $\sup$-norm, between any continuous $\mathbb{R}^d$ valued random field indexed by the $n$-sphere and the Gaussian, based on Stein's method. We develop a novel Gaussian smoothing technique that allows us to transfer a bound in a smoother metric to the $W_1$ distance. The smoothing is based on covariance functions constructed using powers of Laplacian operators, designed so that the associated Gaussian process has a tractable Cameron-Martin or Reproducing Kernel Hilbert Space. This feature enables us to move beyond one dimensional interval-based index sets that were previously considered in the literature. Specializing our general result, we obtain the first bounds on the Gaussian random field approximation of wide random neural networks of any depth and Lipschitz activation functions at the random field level. Our bounds are explicitly expressed in terms of the widths of the network and moments of the random weights. We also obtain tighter bounds when the activation function has three bounded derivatives.

Submitted 30 April, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

Comments: To appear in Applied and Computational Harmonic Analysis

arXiv:2306.11644 [pdf, other]

Textbooks Are All You Need

Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, Yuanzhi Li

Abstract: We introduce phi-1, a new large language model for code, with significantly smaller size than competing models: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook quality" data from the web (6B tokens) and synthetically generated textbooks and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays surprising emergent properties compared to phi-1-base, our model before our finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller model with 350M parameters trained with the same pipeline as phi-1 that still achieves 45% on HumanEval.

Submitted 2 October, 2023; v1 submitted 20 June, 2023; originally announced June 2023.

Comments: 26 pages; changed color scheme of plot. fixed minor typos and added couple clarifications

arXiv:2305.11798 [pdf, ps, other]

The probability flow ODE is provably fast

Authors: Sitan Chen, Sinho Chewi, Holden Lee, Yuanzhi Li, Jianfeng Lu, Adil Salim

Abstract: We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt{d})$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.

Submitted 19 May, 2023; originally announced May 2023.

Comments: 23 pages, 2 figures

arXiv:2304.05398 [pdf, other]

Forward-backward Gaussian variational inference via JKO in the Bures-Wasserstein Space

Authors: Michael Diao, Krishnakumar Balasubramanian, Sinho Chewi, Adil Salim

Abstract: Variational inference (VI) seeks to approximate a target distribution $\pi$ by an element of a tractable family of distributions. Of key interest in statistics and machine learning is Gaussian VI, which approximates $\pi$ by minimizing the Kullback-Leibler (KL) divergence to $\pi$ over the space of Gaussians. In this work, we develop the (Stochastic) Forward-Backward Gaussian Variational Inference (FB-GVI) algorithm to solve Gaussian VI. Our approach exploits the composite structure of the KL divergence, which can be written as the sum of a smooth term (the potential) and a non-smooth term (the entropy) over the Bures-Wasserstein (BW) space of Gaussians endowed with the Wasserstein distance. For our proposed algorithm, we obtain state-of-the-art convergence guarantees when $\pi$ is log-smooth and log-concave, as well as the first convergence guarantees to first-order stationary solutions when $\pi$ is only log-smooth.

Submitted 10 April, 2023; originally announced April 2023.

arXiv:2302.09487 [pdf]

Understanding how the use of AI decision support tools affect critical thinking and over-reliance on technology by drug dispensers in Tanzania

Authors: Ally Salim Jr, Megan Allen, Kelvin Mariki, Kevin James Masoy, Jafary Liana

Abstract: The use of AI in healthcare is designed to improve care delivery and augment the decisions of providers to enhance patient outcomes. When deployed in clinical settings, the interaction between providers and AI is a critical component for measuring and understanding the effectiveness of these digital tools on broader health outcomes. Even in cases where AI algorithms have high diagnostic accuracy, healthcare providers often still rely on their experience and sometimes gut feeling to make a final decision. Other times, providers rely unquestioningly on the outputs of the AI models, which leads to a concern about over-reliance on the technology. The purpose of this research was to understand how reliant drug shop dispensers were on AI-powered technologies when determining a differential diagnosis for a presented clinical case vignette. We explored how the drug dispensers responded to technology that is framed as always correct in an attempt to measure whether they begin to rely on it without any critical thought of their own. We found that dispensers relied on the decision made by the AI 25 percent of the time, even when the AI provided no explanation for its decision.

Submitted 22 February, 2023; v1 submitted 19 February, 2023; originally announced February 2023.

arXiv:2209.11215 [pdf, ps, other]

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

Authors: Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, Anru R. Zhang

Abstract: We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an $L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does not reduce the complexity of SGMs.

Submitted 15 April, 2023; v1 submitted 22 September, 2022; originally announced September 2022.

Comments: 29 pages

arXiv:2209.07513 [pdf, other]

On the complexity of finding stationary points of smooth functions in one dimension

Authors: Sinho Chewi, Sébastien Bubeck, Adil Salim

Abstract: We characterize the query complexity of finding stationary points of one-dimensional non-convex but smooth functions. We consider four settings, based on whether the algorithms under consideration are deterministic or randomized, and whether the oracle outputs $1^{\rm st}$-order or both $0^{\rm th}$- and $1^{\rm st}$-order information. Our results show that algorithms for this task provably benefit by incorporating either randomness or $0^{\rm th}$-order information. Our results also show that, for every dimension $d \geq 1$, gradient descent is optimal among deterministic algorithms using $1^{\rm st}$-order queries only.

Submitted 18 March, 2023; v1 submitted 15 September, 2022; originally announced September 2022.

Comments: 17 pages, 3 figures

arXiv:2206.00920 [pdf, ps, other]

Federated Learning with a Sampling Algorithm under Isoperimetry

Authors: Lukang Sun, Adil Salim, Peter Richtárik

Abstract: Federated learning uses a set of techniques to efficiently distribute the training of a machine learning algorithm across several devices, which own the training data. These techniques critically rely on reducing the communication cost -- the main bottleneck -- between the devices and a central server. Federated learning algorithms usually take an optimization approach: they are algorithms for minimizing the training loss subject to communication (and other) constraints. In this work, we instead take a Bayesian approach for the training task, and propose a communication-efficient variant of the Langevin algorithm to sample a posteriori. The latter approach is more robust and provides more knowledge of the a posteriori distribution than its optimization counterpart. We analyze our algorithm without assuming that the target distribution is strongly log-concave. Instead, we assume the weaker log Sobolev inequality, which allows for nonconvexity.

Submitted 7 June, 2022; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2203.12859 [pdf, other]

Making SMART decisions in prophylaxis and treatment studies

Authors: Robert K. Mahar, Katherine J. Lee, Bibhas Chakraborty, Agus Salim, Julie A. Simpson

Abstract: The optimal prophylaxis, and treatment if the prophylaxis fails, for a disease may be best evaluated using a sequential multiple assignment randomised trial (SMART). A SMART is a multi-stage study that randomises a participant to an initial treatment, observes some response to that treatment and then, depending on their observed response, randomises the same participant to an alternative treatment. Response adaptive randomisation may, in some settings, improve the trial participants' outcomes and expedite trial conclusions, compared to fixed randomisation. But 'myopic' response adaptive randomisation strategies, blind to multistage dynamics, may also result in suboptimal treatment assignments. We propose a 'dynamic' response adaptive randomisation strategy based on Q-learning, an approximate dynamic programming algorithm. Q-learning uses stage-wise statistical models and backward induction to incorporate late-stage 'payoffs' (i.e. clinical outcomes) into early-stage 'actions' (i.e. treatments). Our real-world example consists of a COVID-19 prophylaxis and treatment SMART with qualitatively different binary endpoints at each stage. Standard Q-learning does not work with such data because it cannot be used for sequences of binary endpoints. Sequences of qualitatively distinct endpoints may also require different weightings to ensure that the design guides participants to regimens with the highest utility. We describe how a simple decision-theoretic extension to Q-learning can be used to handle sequential binary endpoints with distinct utilities. Using simulation we show that, under a set of binary utilities, the 'dynamic' approach increases expected participant utility compared to the fixed approach, sometimes markedly, for all model parameters, whereas the 'myopic' approach can actually decrease utility.

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2202.06386 [pdf, ps, other]

Improved analysis for a proximal algorithm for sampling

Authors: Yongxin Chen, Sinho Chewi, Adil Salim, Andre Wibisono

Abstract: We study the proximal sampler of Lee, Shen, and Tian (2021) and obtain new convergence guarantees under weaker assumptions than strong log-concavity: namely, our results hold for (1) weakly log-concave targets, and (2) targets satisfying isoperimetric assumptions which allow for non-log-concavity. We demonstrate our results by obtaining new state-of-the-art sampling guarantees for several classes of target distributions. We also strengthen the connection between the proximal sampler and the proximal method in optimization by interpreting the proximal sampler as an entropically regularized Wasserstein proximal method, and the proximal point method as the limit of the proximal sampler with vanishing noise.

Submitted 13 February, 2022; originally announced February 2022.

Comments: 34 pages

arXiv:2202.05214 [pdf, other]

Towards a Theory of Non-Log-Concave Sampling: First-Order Stationarity Guarantees for Langevin Monte Carlo

Authors: Krishnakumar Balasubramanian, Sinho Chewi, Murat A. Erdogdu, Adil Salim, Matthew Zhang

Abstract: For the task of sampling from a density $\pi \propto \exp(-V)$ on $\mathbb{R}^d$, where $V$ is possibly non-convex but $L$-gradient Lipschitz, we prove that averaged Langevin Monte Carlo outputs a sample with $\varepsilon$-relative Fisher information after $O( L^2 d^2/\varepsilon^2)$ iterations. This is the sampling analogue of complexity bounds for finding an $\varepsilon$-approximate first-order stationary point in non-convex optimization and therefore constitutes a first step towards the general theory of non-log-concave sampling. We discuss numerous extensions and applications of our result; in particular, it yields a new state-of-the-art guarantee for sampling from distributions which satisfy a Poincaré inequality.

Submitted 10 February, 2022; originally announced February 2022.

arXiv:2201.08901 [pdf]

An Ensemble Model for Face Liveness Detection

Authors: Shashank Shekhar, Avinash Patel, Mrinal Haloi, Asif Salim

Abstract: In this paper, we present a passive method to detect face presentation attacks, a.k.a. face liveness detection, using an ensemble deep learning technique. Face liveness detection is one of the key steps involved in user identity verification of customers during the online onboarding/transaction processes. During identity verification, an unauthenticated user tries to bypass the verification system by several means: for example, they can capture a user photo from social media and mount an imposter attack using printouts of users' faces or a digital photo from a mobile device, or even create a more sophisticated attack such as a video replay attack. We have tried to understand the different methods of attack and created an in-house large-scale dataset covering all the kinds of attacks to train a robust deep learning model. We propose an ensemble method where multiple features of the face and background regions are learned to predict whether the user is bona fide or an attacker.

Submitted 19 January, 2022; originally announced January 2022.

Comments: Accepted and presented at MLDM 2022. To be published in Lattice journal

arXiv:2201.06433 [pdf, other]

A Comparative study of Hyper-Parameter Optimization Tools

Authors: Shashank Shekhar, Adesh Bansode, Asif Salim

Abstract: Most of the machine learning models have associated hyper-parameters along with their parameters. While the algorithm gives the solution for parameters, its utility for model performance is highly dependent on the choice of hyperparameters. For a robust performance of a model, it is necessary to find out the right hyper-parameter combination. Hyper-parameter optimization (HPO) is a systematic process that helps in finding the right values for them. The conventional methods for this purpose are grid search and random search and both methods create issues in industrial-scale applications. Hence a set of strategies have been recently proposed based on Bayesian optimization and evolutionary algorithm principles that help in runtime issues in a production environment and robust performance. In this paper, we compare the performance of four python libraries, namely Optuna, Hyper-opt, Optunity, and sequential model-based algorithm configuration (SMAC) that have been proposed for hyper-parameter optimization. The performance of these tools is tested using two benchmarks. The first one is to solve a combined algorithm selection and hyper-parameter optimization (CASH) problem. The second one is the NeurIPS black-box optimization challenge in which a multilayer perceptron (MLP) architecture has to be chosen from a set of related architecture constraints and hyper-parameters. The benchmarking is done with six real-world datasets. From the experiments, we found that Optuna has better performance for CASH problem and HyperOpt for MLP problem.

Submitted 17 January, 2022; originally announced January 2022.

Comments: Selected and presented at IEEE CSDE 2021. To be published in Proceedings of IEEE CSDE 2021

arXiv:2106.03076 [pdf, ps, other]

A Convergence Theory for SVGD in the Population Limit under Talagrand's Inequality T1

Authors: Adil Salim, Lukang Sun, Peter Richtárik

Abstract: Stein Variational Gradient Descent (SVGD) is an algorithm for sampling from a target density which is known up to a multiplicative constant. Although SVGD is a popular algorithm in practice, its theoretical study is limited to a few recent works. We study the convergence of SVGD in the population limit (i.e., with an infinite number of particles) to sample from a non-logconcave target distribution satisfying Talagrand's inequality T1. We first establish the convergence of the algorithm. Then, we establish a dimension-dependent complexity bound in terms of the Kernelized Stein Discrepancy (KSD). Unlike existing works, we do not assume that the KSD is bounded along the trajectory of the algorithm. Our approach relies on interpreting SVGD as a gradient descent over a space of probability measures.

Submitted 16 June, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

arXiv:2104.14123 [pdf, other]

An efficient scheme based on graph centrality to select nodes for training for effective learning

Authors: CR Sandeep, Asif Salim, R Sethunadh, S Sumitra

Abstract: The process of selecting points for training a machine learning model is often a challenging task. Many times, we will have a lot of data, but for training, we require the labels and labeling is often costly. So we need to select the points for training in an efficient manner so that the model trained on the points selected will be better than the ones trained on any other training set. We propose a novel method to select the nodes in graph datasets using the concept of graph centrality. Two methods are proposed - one using a smart selection strategy, where the model is required to be trained only once and another using active learning method. We have tested this idea on three popular graph datasets - Cora, Citeseer and Pubmed - and the results are found to be encouraging.

Submitted 19 May, 2021; v1 submitted 29 April, 2021; originally announced April 2021.

arXiv:2102.11079 [pdf, ps, other]

An Optimal Algorithm for Strongly Convex Minimization under Affine Constraints

Authors: Adil Salim, Laurent Condat, Dmitry Kovalev, Peter Richtárik

Abstract: Optimization problems under affine constraints appear in various areas of machine learning. We consider the task of minimizing a smooth strongly convex function F(x) under the affine constraint Kx=b, with an oracle providing evaluations of the gradient of F and multiplications by K and its transpose. We provide lower bounds on the number of gradient computations and matrix multiplications to achieve a given accuracy. Then we propose an accelerated primal-dual algorithm achieving these lower bounds. Our algorithm is the first optimal algorithm for this class of problems.

Submitted 10 April, 2022; v1 submitted 22 February, 2021; originally announced February 2021.

arXiv:2012.02896 [pdf, other]

Experimental Implementation of an Adaptive Digital Autopilot

Authors: Ankit Goel, Juan Augusto Paredes, Harshil Dadhaniya, Syed Aseem Ul Islam, Abdulazeez Mohammed Salim, Sai Ravela, Dennis Bernstein

Abstract: This paper develops an adaptive digital autopilot for quadcopters and presents experimental results. The adaptive digital autopilot is constructed by augmenting the PX4 autopilot control system architecture with adaptive digital control laws based on retrospective cost adaptive control (RCAC). In order to investigate the performance of the adaptive digital autopilot, the default gains of the fixed-gain autopilot are scaled by a small factor, which severely degrades its performance. This scenario thus provides a venue for determining the ability of the adaptive digital autopilot to compensate for the detuned fixed-gain autopilot. The adaptive digital autopilot is tested in simulation and physical flight tests, and the resulting performance improvements are examined.

Submitted 4 December, 2020; originally announced December 2020.

Comments: Submitted to ACC 2021

arXiv:2010.06261 [pdf, other]

doi 10.1109/TPAMI.2022.3143806

Neighborhood Preserving Kernels for Attributed Graphs

Authors: Asif Salim, Shiju. S. S, Sumitra. S

Abstract: We describe the design of a reproducing kernel suitable for attributed graphs, in which the similarity between the two graphs is defined based on the neighborhood information of the graph nodes with the aid of a product graph formulation. We represent the proposed kernel as the weighted sum of two other kernels of which one is an R-convolution kernel that processes the attribute information of the graph and the other is an optimal assignment kernel that processes label information. They are formulated in such a way that the edges processed as part of the kernel computation have the same neighborhood properties and hence the kernel proposed makes a well-defined correspondence between regions processed in graphs. These concepts are also extended to the case of the shortest paths. We identified the state-of-the-art kernels that can be mapped to such a neighborhood preserving framework. We found that the kernel value of the argument graphs in each iteration of the Weisfeiler-Lehman color refinement algorithm can be obtained recursively from the product graph formulated in our method. By incorporating the proposed kernel on support vector machines we analyzed the real-world data sets and it has shown superior performance in comparison with that of the other state-of-the-art graph kernels.

Submitted 13 October, 2020; originally announced October 2020.

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

arXiv:2009.13801 [pdf, other]

Framework for Designing Filters of Spectral Graph Convolutional Neural Networks in the Context of Regularization Theory

Authors: Asif Salim, Sumitra S

Abstract: Graph convolutional neural networks (GCNNs) have been widely used in graph learning. It has been observed that the smoothness functional on graphs can be defined in terms of the graph Laplacian. This fact points in the direction of using the Laplacian in deriving regularization operators on graphs and its consequent use with spectral GCNN filter designs. In this work, we explore the regularization properties of the graph Laplacian and propose a generalized framework for regularized filter designs in spectral GCNNs. We found that the filters used in many state-of-the-art GCNNs can be derived as a special case of the framework we developed. We designed new filters that are associated with well-defined regularization behavior and tested their performance on semi-supervised node classification tasks. Their performance was found to be superior to that of the other state-of-the-art techniques.

Submitted 29 September, 2020; originally announced September 2020.

arXiv:2006.11773 [pdf, other]

Optimal and Practical Algorithms for Smooth and Strongly Convex Decentralized Optimization

Authors: Dmitry Kovalev, Adil Salim, Peter Richtárik

Abstract: We consider the task of decentralized minimization of the sum of smooth strongly convex functions stored across the nodes of a network. For this problem, lower bounds on the number of gradient computations and the number of communication rounds required to achieve $\varepsilon$ accuracy have recently been proven. We propose two new algorithms for this decentralized optimization problem and equip them with complexity guarantees. We show that our first method is optimal both in terms of the number of communication rounds and in terms of the number of gradient computations. Unlike existing optimal algorithms, our algorithm does not rely on the expensive evaluation of dual gradients. Our second algorithm is optimal in terms of the number of communication rounds, without a logarithmic factor. Our approach relies on viewing the two proposed algorithms as accelerated variants of the Forward Backward algorithm to solve monotone inclusions associated with the decentralized optimization problem. We also verify the efficacy of our methods against state-of-the-art algorithms through numerical experiments.

Submitted 13 November, 2020; v1 submitted 21 June, 2020; originally announced June 2020.

arXiv:2006.09797 [pdf, other]

A Non-Asymptotic Analysis for Stein Variational Gradient Descent

Authors: Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, Arthur Gretton

Abstract: We study the Stein Variational Gradient Descent (SVGD) algorithm, which optimises a set of particles to approximate a target probability distribution $\pi \propto e^{-V}$ on $\mathbb{R}^d$. In the population limit, SVGD performs gradient descent in the space of probability distributions on the KL divergence with respect to $\pi$, where the gradient is smoothed through a kernel integral operator. In this paper, we provide a novel finite time analysis for the SVGD algorithm. We provide a descent lemma establishing that the algorithm decreases the objective at each iteration, and rates of convergence for the average Stein Fisher divergence (also referred to as Kernel Stein Discrepancy). We also provide a convergence result of the finite particle system corresponding to the practical implementation of SVGD to its population version.

Submitted 3 January, 2021; v1 submitted 17 June, 2020; originally announced June 2020.

Comments: Accepted to Neurips 2020

arXiv:2006.09270 [pdf, other]

Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm

Authors: Adil Salim, Peter Richtárik

Abstract: We consider the task of sampling with respect to a log concave probability distribution. The potential of the target distribution is assumed to be composite, i.e., written as the sum of a smooth convex term, and a nonsmooth convex term possibly taking infinite values. The target distribution can be seen as a minimizer of the Kullback-Leibler divergence defined on the Wasserstein space (i.e., the space of probability measures). In the first part of this paper, we establish a strong duality result for this minimization problem. In the second part of this paper, we use the duality gap arising from the first part to study the complexity of the Proximal Stochastic Gradient Langevin Algorithm (PSGLA), which can be seen as a generalization of the Projected Langevin Algorithm. Our approach relies on viewing PSGLA as a primal dual algorithm and covers many cases where the target distribution is not fully supported. In particular, we show that if the potential is strongly convex, the complexity of PSGLA is $O(1/\varepsilon^2)$ in terms of the 2-Wasserstein distance. In contrast, the complexity of the Projected Langevin Algorithm is $O(1/\varepsilon^{12})$ in terms of total variation when the potential is convex.

Submitted 22 February, 2021; v1 submitted 16 June, 2020; originally announced June 2020.

arXiv:2006.00416 [pdf, other]

Adaptive Digital PID Control of a Quadcopter with Unknown Dynamics

Authors: Ankit Goel, Abdulazeez Mohammed Salim, Ahmad Ansari, Sai Ravela, Dennis Bernstein

Abstract: This paper develops an adaptive autopilot for quadcopters with unknown dynamics. To do this, the PX4 autopilot architecture is modified so that the feedback and feedforward controllers are replaced by adaptive control laws based on retrospective cost adaptive control (RCAC). The present paper provides a numerical investigation of the performance of the adaptive autopilot on a quadcopter with unknown dynamics. In order to reflect the absence of prior modeling information, all of the adaptive digital controllers are initialized at zero gains. In addition, moment-of-inertia of the quadcopter is varied to test the robustness of the adaptive autopilot. In all test cases, the vehicle is commanded to follow a given trajectory, and the resulting performance is examined.

Submitted 30 May, 2020; originally announced June 2020.

Comments: Submitted to ACC2020

arXiv:2004.12354 [pdf, ps, other]

Subgroups of a finitary linear group

Authors: V. A. Bovdi, O. Yu. Dashkova, M. A. Salim

Abstract: Let FL_s(K) be the finitary linear group of degree s over an associative ring K with unity. We prove that the torsion subgroups of FL_s(K) are locally finite for certain classes of rings K. A description of some f.g. solvable subgroups of FL_s(K) is given.

Submitted 26 April, 2020; originally announced April 2020.

Comments: 7 pages

MSC Class: 20H25

arXiv:2004.02635 [pdf, other]

doi 10.1007/s10957-022-02061-8

Dualize, Split, Randomize: Toward Fast Nonsmooth Optimization Algorithms

Authors: Adil Salim, Laurent Condat, Konstantin Mishchenko, Peter Richtárik

Abstract: We consider minimizing the sum of three convex functions, where the first one F is smooth, the second one is nonsmooth and proximable and the third one is the composition of a nonsmooth proximable function with a linear operator L. This template problem has many applications, for instance, in image processing and machine learning. First, we propose a new primal-dual algorithm, which we call PDDY, for this problem. It is constructed by applying Davis-Yin splitting to a monotone inclusion in a primal-dual product space, where the operators are monotone under a specific metric depending on L. We show that three existing algorithms (the two forms of the Condat-Vu algorithm and the PD3O algorithm) have the same structure, so that PDDY is the fourth missing link in this self-consistent class of primal-dual algorithms. This representation eases the convergence analysis: it allows us to derive sublinear convergence rates in general, and linear convergence results in presence of strong convexity. Moreover, within our broad and flexible analysis framework, we propose new stochastic generalizations of the algorithms, in which a variance-reduced random estimate of the gradient of F is used, instead of the true gradient. Furthermore, we obtain, as a special case of PDDY, a linearly converging algorithm for the minimization of a strongly convex function F under a linear constraint; we discuss its important application to decentralized optimization.

Submitted 26 July, 2022; v1 submitted 3 April, 2020; originally announced April 2020.

arXiv:2003.06919 [pdf, ps, other]

doi 10.1016/j.laa.2020.03.004

Operators on positive semidefinite inner product spaces

Authors: Victor A. Bovdi, Tetiana Klymchuk, Tetiana Rybalkina, Mohamed A. Salim, Vladimir V. Sergeichuk

Abstract: We give canonical forms of selfadjoint and isometric operators on a complex vector space $U$ with scalar product given by a positive semidefinite Hermitian form, and of Hermitian forms on $U$. For an arbitrary system of semiunitary spaces and linear mappings on/between them, we give an algorithm that reduces their matrices to canonical form.

Submitted 15 March, 2020; originally announced March 2020.

Comments: 28 pages

MSC Class: 15A21; 15A42; 15A63; 47B50

Journal ref: Linear Algebra Appl. 596 (2020) 82-105

arXiv:2003.01346 [pdf, ps, other]

doi 10.1016/j.physletb.2020.135830

Derivations of group rings

Authors: Orest D. Artemovych, Victor A. Bovdi, Mohamed A. Salim

Abstract: Let R[G] be the group ring of a group G over an associative ring R with unity such that all prime divisors of orders of elements of G are invertible in R. If R is finite and G is a Chernikov (torsion FC-) group, then each R-derivation of R[G] is inner. Similar results are also obtained for other classes of groups G and rings R.

Submitted 3 March, 2020; originally announced March 2020.

Comments: 17 pages

MSC Class: 20C05; 16S34; 20F45; 20F19; 16W25

arXiv:2002.03035 [pdf, other]

The Wasserstein Proximal Gradient Algorithm

Authors: Adil Salim, Anna Korba, Giulia Luise

Abstract: Wasserstein gradient flows are continuous time dynamics that define curves of steepest descent to minimize an objective function over the space of probability measures (i.e., the Wasserstein space). This objective is typically a divergence w.r.t. a fixed target distribution. In recent years, these continuous time dynamics have been used to study the convergence of machine learning algorithms aiming at approximating a probability distribution. However, the discrete-time behavior of these algorithms might differ from the continuous time dynamics. Besides, although discretized gradient flows have been proposed in the literature, little is known about their minimization power. In this work, we propose a Forward Backward (FB) discretization scheme that can tackle the case where the objective function is the sum of a smooth term and a nonsmooth geodesically convex term. Using techniques from convex optimization and optimal transport, we analyze the FB scheme as a minimization algorithm on the Wasserstein space. More precisely, we show under mild assumptions that the FB scheme has convergence guarantees similar to the proximal gradient algorithm in Euclidean spaces.

Submitted 21 February, 2021; v1 submitted 7 February, 2020; originally announced February 2020.

arXiv:1912.09925 [pdf, other]

Distributed Fixed Point Methods with Compressed Iterates

Authors: Sélim Chraibi, Ahmed Khaled, Dmitry Kovalev, Peter Richtárik, Adil Salim, Martin Takáč

Abstract: We propose basic and natural assumptions under which iterative optimization methods with compressed iterates can be analyzed. This problem is motivated by the practice of federated learning, where a large model stored in the cloud is compressed before it is sent to a mobile device, which then proceeds with training based on local data. We develop standard and variance reduced methods, and establish communication complexity bounds. Our algorithms are the first distributed methods with compressed iterates, and the first fixed point methods with compressed iterates.

Submitted 20 December, 2019; originally announced December 2019.

Comments: 15 pages, 4 algorithms, 4 Theorems

arXiv:1910.04405 [pdf, ps, other]

A Strong Law of Large Numbers for Random Monotone Operators

Authors: Adil Salim

Abstract: Random monotone operators are stochastic versions of maximal monotone operators which play an important role in stochastic nonsmooth optimization. Several stochastic nonsmooth optimization algorithms have been shown to converge to a zero of a mean operator defined as the expectation, in the sense of the Aumann integral, of a random monotone operator. In this note, we prove a strong law of large numbers for random monotone operators where the limit is the mean operator. We apply this result to the empirical risk minimization problem appearing in machine learning. We show that if the empirical risk minimizers converge as the number of data points goes to infinity, then they converge to an expected risk minimizer.

Submitted 20 October, 2023; v1 submitted 10 October, 2019; originally announced October 2019.

arXiv:1910.01484 [pdf, ps, other]

doi 10.2478/cm-2020-0019

The variety of dual mock-Lie algebras

Authors: Luisa M. Camacho, Ivan Kaygorodov, Victor Lopatkin, Mohamed A. Salim

Abstract: We classify all complex $7$- and $8$-dimensional dual mock-Lie algebras, both algebraically and geometrically. We also find all non-trivial complex $9$-dimensional dual mock-Lie algebras.

Submitted 1 October, 2019; originally announced October 2019.

Comments: arXiv admin note: substantial text overlap with arXiv:1905.05361 and text overlap with arXiv:1907.00685

arXiv:1909.08704 [pdf, other]

Balsam: Automated Scheduling and Execution of Dynamic, Data-Intensive HPC Workflows

Authors: Michael A. Salim, Thomas D. Uram, J. Taylor Childers, Prasanna Balaprakash, Venkatram Vishwanath, Michael E. Papka

Abstract: We introduce the Balsam service to manage high-throughput task scheduling and execution on supercomputing systems. Balsam allows users to populate a task database with a variety of tasks ranging from simple independent tasks to dynamic multi-task workflows. With abstractions for the local resource scheduler and MPI environment, Balsam dynamically packages tasks into ensemble jobs and manages their scheduling lifecycle. The ensembles execute in a pilot "launcher" which (i) ensures concurrent, load-balanced execution of arbitrary serial and parallel programs with heterogeneous processor requirements, (ii) requires no modification of user applications, (iii) is tolerant of task-level faults and provides several options for error recovery, (iv) stores provenance data (e.g., task history, error logs) in the database, (v) supports dynamic workflows, in which tasks are created or killed at runtime. Here, we present the design and Python implementation of the Balsam service and launcher. The efficacy of this system is illustrated using two case studies: hyperparameter optimization of deep neural networks, and high-throughput single-point quantum chemistry calculations. We find that the unique combination of flexible job-packing and automated scheduling with dynamic (pilot-managed) execution facilitates excellent resource utilization. The scripting overheads typically needed to manage resources and launch workflows on supercomputers are substantially reduced, accelerating workflow development and execution.

Submitted 18 September, 2019; originally announced September 2019.

Comments: SC '18: 8th Workshop on Python for High-Performance and Scientific Computing (PyHPC 2018)

arXiv:1906.04370 [pdf, other]

Maximum Mean Discrepancy Gradient Flow

Authors: Michael Arbel, Anna Korba, Adil Salim, Arthur Gretton

Abstract: We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties. The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks. We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.

Submitted 3 December, 2019; v1 submitted 10 June, 2019; originally announced June 2019.

arXiv:1905.11768 [pdf, other]

Stochastic Proximal Langevin Algorithm: Potential Splitting and Nonasymptotic Rates

Authors: Adil Salim, Dmitry Kovalev, Peter Richtárik

Abstract: We propose a new algorithm, the Stochastic Proximal Langevin Algorithm (SPLA), for sampling from a log concave distribution. Our method is a generalization of the Langevin algorithm to potentials expressed as the sum of one stochastic smooth term and multiple stochastic nonsmooth terms. In each iteration, our splitting technique only requires access to a stochastic gradient of the smooth term and a stochastic proximal operator for each of the nonsmooth terms. We establish nonasymptotic sublinear and linear convergence rates under convexity and strong convexity of the smooth term, respectively, expressed in terms of the KL divergence and Wasserstein distance. We illustrate the efficiency of our sampling technique through numerical simulations on a Bayesian learning task.

Submitted 16 June, 2020; v1 submitted 28 May, 2019; originally announced May 2019.

Journal ref: Neurips 2019 (Spotlight)

arXiv:1901.08170 [pdf, ps, other]

A Fully Stochastic Primal-Dual Algorithm

Authors: Pascal Bianchi, Walid Hachem, Adil Salim

Abstract: A new stochastic primal-dual algorithm for solving a composite optimization problem is proposed. It is assumed that all the functions/operators that enter the optimization problem are given as statistical expectations. These expectations are unknown but revealed across time through i.i.d. realizations. The proposed algorithm is proven to converge to a saddle point of the Lagrangian function. In the framework of the monotone operator theory, the convergence proof relies on recent results on the stochastic Forward Backward algorithm involving random monotone operators. An example of convex optimization under stochastic linear constraints is considered.

Submitted 22 June, 2020; v1 submitted 23 January, 2019; originally announced January 2019.

arXiv:1808.06444 [pdf]

Synthetic Patient Generation: A Deep Learning Approach Using Variational Autoencoders

Authors: Ally Salim Jr

Abstract: Artificial Intelligence in healthcare is a new and exciting frontier and the possibilities are endless. With deep learning approaches beating human performances in many areas, the logical next step is to attempt their application in the health space. For these and other Machine Learning approaches to produce good results and have their potential realized, the need for, and importance of, large amounts of accurate data is second to none. This is a challenge faced by many industries and more so in the healthcare space. We present an approach that uses Variational Autoencoders (VAEs) to generate more data for training deeper networks, as well as to uncover underlying patterns in diagnoses and the patients suffering from them. A VAE trained on available data was able to learn the latent distribution of the patient features given the diagnosis. It is then possible, after training, to sample from the learnt latent distribution to generate new accurate patient records given the patient diagnosis.

Submitted 20 August, 2018; originally announced August 2018.

MSC Class: 68T00

arXiv:1804.00934 [pdf, other]

A Constant Step Stochastic Douglas-Rachford Algorithm with Application to Non Separable Regularizations

Authors: Adil Salim, Pascal Bianchi, Walid Hachem

Abstract: The Douglas Rachford algorithm is an algorithm that converges to a minimizer of a sum of two convex functions. The algorithm consists in fixed point iterations involving computations of the proximity operators of the two functions separately. The paper investigates a stochastic version of the algorithm where both functions are random and the step size is constant. We establish that the iterates of the algorithm stay close to the set of solutions with high probability when the step size is small enough. Application to structured regularization is considered.

Submitted 3 April, 2018; originally announced April 2018.

arXiv:1712.09186 [pdf, ps, other]

Completely simple endomorphism rings of modules

Authors: V. A. Bovdi, M. A. Salim, Mihail Ursul

Abstract: It is proved that if A_p is a countable elementary abelian p-group, then: (i) The ring End(A_p) does not admit a nondiscrete locally compact ring topology. (ii) Under (CH) the simple ring End(A_p)/I, where I is the ideal of End(A_p) consisting of all endomorphisms with finite images, does not admit a nondiscrete locally compact ring topology. (iii) The finite topology on End(A_p) is the only second metrizable ring topology on it. Moreover, a characterization of completely simple endomorphism rings of the endomorphism rings of modules over commutative rings is also obtained.

Submitted 26 December, 2017; originally announced December 2017.

Comments: 16 pages

MSC Class: 16W80; 16N20; 16S50; 16N40

arXiv:1712.08729 [pdf, ps, other]

doi 10.1016/j.laa.2017.12.013

Reduction of a pair of skew-symmetric matrices to its canonical form under congruence

Authors: V. A. Bovdi, T. G. Gerasimova, M. A. Salim, V. V. Sergeichuk

Abstract: Let $(A,B)$ be a pair of skew-symmetric matrices over a field of characteristic not 2. Its regularization decomposition is a direct sum \[ (\underline{\underline A},\underline{\underline B})\oplus (A_1,B_1)\oplus\dots\oplus(A_t,B_t) \] that is congruent to $(A,B)$, in which $(\underline{\underline A},\underline{\underline B})$ is a pair of nonsingular matrices and $(A_1,B_1),$ $\dots,$ $(A_t,B_t)$ are singular indecomposable canonical pairs of skew-symmetric matrices under congruence. We give an algorithm that constructs a regularization decomposition. We also give a constructive proof of the known canonical form of $(A,B)$ under congruence over an algebraically closed field of characteristic not 2.

Submitted 23 December, 2017; originally announced December 2017.

Comments: 16 pages

MSC Class: 15A21; 15A22; 15A63; 51A50

Journal ref: Linear Algebra Appl. 543 (2018) 17-30

arXiv:1712.07027 [pdf, other]

Snake: a Stochastic Proximal Gradient Algorithm for Regularized Problems over Large Graphs

Authors: Adil Salim, Pascal Bianchi, Walid Hachem

Abstract: A regularized optimization problem over a large unstructured graph is studied, where the regularization term is tied to the graph geometry. Typical regularization examples include the total variation and the Laplacian regularizations over the graph. When applying the proximal gradient algorithm to solve this problem, there exist quite affordable methods to implement the proximity operator (backward step) in the special case where the graph is a simple path without loops. In this paper, an algorithm, referred to as "Snake", is proposed to solve such regularized problems over general graphs, by taking benefit of these fast methods. The algorithm consists in properly selecting random simple paths in the graph and performing the proximal gradient algorithm over these simple paths. This algorithm is an instance of a new general stochastic proximal gradient algorithm, whose convergence is proven. Applications to trend filtering and graph inpainting are provided among others. Numerical experiments are conducted over large graphs.

Submitted 19 December, 2017; originally announced December 2017.

arXiv:1709.10350 [pdf, ps, other]

doi 10.1016/j.laa.2017.09.026

Symplectic spaces and pairs of symmetric and nonsingular skew-symmetric matrices under congruence

Authors: Victor A. Bovdi, Roger A. Horn, Mohamed A. Salim, Vladimir V. Sergeichuk

Abstract: Let $\mathbb F$ be a field of characteristic not $2$, and let $(A,B)$ be a pair of $n\times n$ matrices over $\mathbb F$, in which $A$ is symmetric and $B$ is skew-symmetric. A canonical form of $(A,B)$ with respect to congruence transformations $(S^TAS,S^TBS)$ was given by Sergeichuk (1988) up to classification of symmetric and Hermitian forms over finite extensions of $\mathbb F$. We obtain a simpler canonical form of $(A,B)$ if $B$ is nonsingular. Such a pair $(A,B)$ defines a quadratic form on a symplectic space, that is, on a vector space with scalar product given by a nonsingular skew-symmetric form. As an application, we obtain known canonical matrices of quadratic forms and Hamiltonian operators on real and complex symplectic spaces.

Submitted 29 September, 2017; originally announced September 2017.

Comments: 19 pages

MSC Class: 15A21; 15A22; 15A63; 51A50

Journal ref: Linear Algebra and Its Applications 537 (2018) 84-99

arXiv:1702.04144 [pdf, ps, other]

A constant step Forward-Backward algorithm involving random maximal monotone operators

Authors: Pascal Bianchi, Walid Hachem, Adil Salim

Abstract: A stochastic Forward-Backward algorithm with a constant step is studied. At each time step, this algorithm involves an independent copy of a couple of random maximal monotone operators. Defining a mean operator as a selection integral, the differential inclusion built from the sum of the two mean operators is considered. As a first result, it is shown that the interpolated process obtained from the iterates converges narrowly in the small step regime to the solution of this differential inclusion. In order to control the long term behavior of the iterates, a stability result is needed in addition. To this end, the sequence of the iterates is seen as a homogeneous Feller Markov chain whose transition kernel is parameterized by the algorithm step size. The cluster points of the Markov chains invariant measures in the small step regime are invariant for the semiflow induced by the differential inclusion. Conclusions regarding the long run behavior of the iterates for small steps are drawn. It is shown that when the sum of the mean operators is demipositive, the probabilities that the iterates are away from the set of zeros of this sum are small in Cesàro mean. The ergodic behavior of these iterates is studied as well. Applications of the proposed algorithm are considered. In particular, a detailed analysis of the random proximal gradient algorithm with constant step is performed.

Submitted 4 April, 2018; v1 submitted 14 February, 2017; originally announced February 2017.

arXiv:1612.03831 [pdf, ps, other]

Constant Step Stochastic Approximations Involving Differential Inclusions: Stability, Long-Run Convergence and Applications

Authors: Pascal Bianchi, Walid Hachem, Adil Salim

Abstract: We consider a Markov chain $(x_n)$ whose kernel is indexed by a scaling parameter $\gamma>0$, referred to as the step size. The aim is to analyze the behavior of the Markov chain in the doubly asymptotic regime where $n\to\infty$ then $\gamma\to 0$. First, under mild assumptions on the so-called drift of the Markov chain, we show that the interpolated process converges narrowly to the solutions of a Differential Inclusion (DI) involving an upper semicontinuous set-valued map with closed and convex values. Second, we provide verifiable conditions which ensure the stability of the iterates. Third, by putting the above results together, we establish the long run convergence of the iterates as $\gamma\to 0$, to the Birkhoff center of the DI. The ergodic behavior of the iterates is also provided. Application examples are investigated. We apply our findings to 1) the problem of nonconvex proximal stochastic optimization and 2) a fluid model of parallel queues.

Submitted 14 December, 2017; v1 submitted 12 December, 2016; originally announced December 2016.

arXiv:1611.03557 [pdf, ps, other]

doi 10.1016/j.laa.2016.09.026

Neighborhood radius estimation for Arnold's miniversal deformations of complex and $p$-adic matrices

Authors: Victor A. Bovdi, Mohammed A. Salim, Vladimir V. Sergeichuk

Abstract: V.I. Arnold (1971) constructed a simple normal form to which all complex matrices $B$ in a neighborhood $U$ of a given square matrix $A$ can be reduced by similarity transformations that smoothly depend on the entries of $B$. We calculate the radius of the neighborhood $U$. A.A. Mailybaev (1999, 2001) constructed a reducing similarity transformation in the form of Taylor series; we construct this transformation by another method. We extend Arnold's normal form to matrices over the field $\mathbb Q_p$ of $p$-adic numbers and the field $\mathbb F((T))$ of Laurent series over a field $\mathbb F$.

Submitted 10 November, 2016; originally announced November 2016.

Comments: 19 pages

MSC Class: 15A21; 15B33; 37J40

Journal ref: Linear Algebra Appl. 512 (2017) 97-112

arXiv:1610.07256 [pdf, other]

doi 10.1002/wcm.2692

Differential Modulation for Asynchronous Two-Way-Relay Systems over Frequency-Selective Fading Channels

Authors: Ahmad Salim, Tolga M. Duman

Abstract: In this paper, we propose two schemes for asynchronous multi-relay two-way relay (MR-TWR) systems in which neither the users nor the relays know the channel state information (CSI). In an MR-TWR system, two users exchange their messages with the help of $N_R$ relays. Most of the existing works on MR-TWR systems based on differential modulation assume perfect symbol-level synchronization between all communicating nodes. However, this assumption is not valid in many practical systems, which makes the design of differentially modulated schemes more challenging. Therefore, we design differential modulation schemes that can tolerate timing misalignment under frequency-selective fading. We investigate the performance of the proposed schemes in terms of either probability of bit error or pairwise error probability. Through numerical examples, we show that the proposed schemes outperform existing competing solutions in the literature, especially for high signal-to-noise ratio (SNR) values.

Submitted 23 October, 2016; originally announced October 2016.

Journal ref: Wirel. Commun. Mob. Comput., 16: 2422 to 2435 (2016)

arXiv:1606.07589 [pdf, ps, other]

Group algebras whose groups of normalized units have exponent 4

Authors: V. A. Bovdi, M. A. Salim

Abstract: We give a full description of locally finite p-groups G such that the normalized group of units V(FG) of the group algebra FG over a field F of characteristic p has exponent 4.

Submitted 21 July, 2016; v1 submitted 24 June, 2016; originally announced June 2016.

Comments: 7 pages

MSC Class: 16S34; 16U60

Showing 1–50 of 63 results for author: Salim, A