Statistics
- [1] arXiv:2408.04796 [pdf, html, other]
Title: A Density Ratio Super Learner
Comments: 10 pages, 3 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The estimation of the ratio of two probability density functions is of great interest in many fields of statistics, including causal inference. In this study, we develop an ensemble estimator of density ratios with a novel loss function based on super learning. We show that this novel loss function is qualified for building super learners. Two simulations corresponding to mediation analysis and longitudinal modified treatment policy in causal inference, where density ratios are nuisance parameters, are conducted to show our density ratio super learner's performance empirically.
- [2] arXiv:2408.04847 [pdf, html, other]
Title: A Pipeline for Data-Driven Learning of Topological Features with Applications to Protein Stability Prediction
Comments: 13 figures, 23 pages (without appendix and references)
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Data Analysis, Statistics and Probability (physics.data-an)
In this paper, we propose a data-driven method to learn interpretable topological features of biomolecular data and demonstrate the efficacy of parsimonious models trained on topological features in predicting the stability of synthetic mini proteins. We compare models that leverage automatically-learned structural features against models trained on a large set of biophysical features determined by subject-matter experts (SME). Our models, based only on topological features of the protein structures, achieved 92%-99% of the performance of SME-based models in terms of the average precision score. By interrogating model performance and feature importance metrics, we extract numerous insights that uncover high correlations between topological features and SME features. We further showcase how combining topological features and SME features can lead to improved model performance over either feature set used in isolation, suggesting that, in some settings, topological features may provide new discriminating information not captured in existing SME features that are useful for protein stability prediction.
- [3] arXiv:2408.04854 [pdf, html, other]
Title: A propensity score weighting approach to integrate aggregated data in random-effect individual-level data meta-analysis
Subjects: Methodology (stat.ME)
In evidence synthesis, collecting individual participant data (IPD) across eligible studies is the most reliable way to investigate the treatment effects in different subgroups defined by participant characteristics. Nonetheless, access to all IPD from all studies might be very challenging due to privacy concerns. To overcome this, many approaches such as multilevel modeling have been proposed to incorporate the vast amount of aggregated data from the literature into IPD meta-analysis. These methods, however, often rely on specifying separate models for trial-level versus patient-level data, which likely suffers from ecological bias when there are non-linearities in the outcome-generating mechanism. In this paper, we introduce a novel method to combine aggregated data and IPD in meta-analysis that is free from ecological bias. The proposed approach relies on modeling the study membership given covariates, then using inverse weighting to estimate the trial-specific coefficients in the individual-level outcome model of studies without accessible IPD. The weights derived from this approach also shed light on the similarity in the case-mix across studies, which is useful to assess whether eligible trials are sufficiently similar to be meta-analyzed. We evaluate the proposed method on synthetic data, then apply it to a real-world meta-analysis comparing the chance of response between guselkumab and adalimumab among patients with psoriasis.
- [4] arXiv:2408.04866 [pdf, html, other]
Title: Network and interaction models for data with hierarchical granularity via fragmentation and coagulation
Comments: 25 pages, 6 figures
Subjects: Statistics Theory (math.ST); Probability (math.PR)
We introduce a nested family of Bayesian nonparametric models for network and interaction data with a hierarchical granularity structure that naturally arises through finer and coarser population labelings. In the case of network data, the structure is easily visualized by merging and shattering vertices, while respecting the edge structure. We further develop Bayesian inference procedures for the model family, and apply them to synthetic and real data. The family provides a connection of practical and theoretical interest between the Hollywood model of Crane and Dempsey, and the generalized-gamma graphex model of Caron and Fox. A key ingredient for the construction of the family is fragmentation and coagulation duality for integer partitions, and for this we develop novel duality relations that generalize those of Pitman and Dong, Goldschmidt and Martin. The duality is also crucially used in our inferential procedures.
- [5] arXiv:2408.04907 [pdf, html, other]
Title: Causal Discovery of Linear Non-Gaussian Causal Models with Unobserved Confounding
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Methodology (stat.ME)
We consider linear non-Gaussian structural equation models that involve latent confounding. In this setting, the causal structure is identifiable, but, in general, it is not possible to identify the specific causal effects. Instead, a finite number of different causal effects result in the same observational distribution. Most existing algorithms for identifying these causal effects use overcomplete independent component analysis (ICA), which often suffers from convergence to local optima. Furthermore, the number of latent variables must be known a priori. To address these issues, we propose an algorithm that operates recursively rather than using overcomplete ICA. The algorithm first infers a source, estimates the effect of the source and its latent parents on their descendants, and then eliminates their influence from the data. For both source identification and effect size estimation, we use rank conditions on matrices formed from higher-order cumulants. We prove asymptotic correctness under the mild assumption that locally, the number of latent variables never exceeds the number of observed variables. Simulation studies demonstrate that our method achieves comparable performance to overcomplete ICA even though it does not know the number of latents in advance.
- [6] arXiv:2408.04933 [pdf, html, other]
Title: Variance-based sensitivity analysis in the presence of correlated input variables
Comments: presented at 5th International Conference on Reliable Engineering Computing (REC), Brno, Czech Republic, 13-15 June, 2012
Subjects: Methodology (stat.ME); Machine Learning (cs.LG); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this paper, we propose an extension of the classical Sobol' estimator for the estimation of variance-based sensitivity indices. The approach assumes a linear correlation model between the input variables, which is used to decompose the contribution of an input variable into a correlated and an uncorrelated part. This method provides sampling matrices following the original joint probability distribution, which are used directly to compute the model output without any assumptions or approximations of the model response function.
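For orientation, the classical first-order Sobol' index that the abstract takes as its starting point can be sketched as follows. This is a minimal illustration for independent uniform inputs only; it does not reproduce the paper's extension to correlated inputs. The function name `first_order_sobol`, the toy model, and the sample size are assumptions made for this example, and the output is centred by its sample mean purely to reduce Monte Carlo noise.

```python
import random

def first_order_sobol(model, dim, n, seed=0):
    """Estimate first-order Sobol' indices for independent U(0, 1) inputs."""
    rng = random.Random(seed)
    # Two independent sampling matrices A and B (n rows, dim columns).
    A = [[rng.random() for _ in range(dim)] for _ in range(n)]
    B = [[rng.random() for _ in range(dim)] for _ in range(n)]
    fA = [model(row) for row in A]
    fB = [model(row) for row in B]

    # Total output variance, estimated from both samples.
    all_f = fA + fB
    mean = sum(all_f) / len(all_f)
    var = sum((v - mean) ** 2 for v in all_f) / (len(all_f) - 1)

    indices = []
    for i in range(dim):
        acc = 0.0
        for j in range(n):
            row = A[j][:]        # copy row j of A
            row[i] = B[j][i]     # replace column i with the value from B
            # f(B) and f(A with column i from B) share only input i.
            acc += (fB[j] - mean) * (model(row) - fA[j])
        indices.append(acc / n / var)
    return indices

# Hypothetical toy model Y = X1 + 2 * X2; analytically S1 = 0.2 and S2 = 0.8.
S = first_order_sobol(lambda x: x[0] + 2.0 * x[1], dim=2, n=20000)
```

For the toy model the two estimates should land close to the analytical values 0.2 and 0.8, up to Monte Carlo error.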
- [7] arXiv:2408.05058 [pdf, html, other]
Title: Variational Bayesian Phylogenetic Inference with Semi-implicit Branch Length Distributions
Comments: 26 pages, 7 figures
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Reconstructing the evolutionary history relating a collection of molecular sequences is the main subject of modern Bayesian phylogenetic inference. However, the commonly used Markov chain Monte Carlo methods can be inefficient due to the complicated space of phylogenetic trees, especially when the number of sequences is large. An alternative approach is variational Bayesian phylogenetic inference (VBPI) which transforms the inference problem into an optimization problem. While effective, the default diagonal lognormal approximation for the branch lengths of the tree used in VBPI is often insufficient to capture the complexity of the exact posterior. In this work, we propose a more flexible family of branch length variational posteriors based on semi-implicit hierarchical distributions using graph neural networks. We show that this semi-implicit construction emits straightforward permutation equivariant distributions, and therefore can handle the non-Euclidean branch length space across different tree topologies with ease. To deal with the intractable marginal probability of semi-implicit variational distributions, we develop several alternative lower bounds for stochastic optimization. We demonstrate the effectiveness of our proposed method over baseline methods on benchmark data examples, in terms of both marginal likelihood estimation and branch length posterior approximation.
- [8] arXiv:2408.05071 [pdf, html, other]
Title: Functional Sieve Bootstrap for the Partial Sum Process with Application to Change-Point Detection without Dimension Reduction
Subjects: Statistics Theory (math.ST)
Change-points in functional time series can be detected using the CUSUM statistic, which is a non-linear functional of the partial sum process. Various methods have been proposed to obtain critical values for this statistic. In this paper, we use the functional autoregressive sieve bootstrap to imitate the behavior of the partial sum process, and we show that this procedure estimates critical values asymptotically correctly under the null hypothesis. We also establish the consistency of the corresponding bootstrap-based test under local alternatives. The finite-sample performance of the procedure is studied via simulations under the null hypothesis and under the alternative.
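To make the CUSUM statistic concrete, here is a minimal sketch for a scalar time series. The paper works with functional data and bootstrap critical values, neither of which is reproduced here. The function name `cusum_statistic` and the toy series are assumptions for illustration.

```python
import math

def cusum_statistic(xs):
    """Return (max scaled CUSUM deviation, index k where it is attained)."""
    n = len(xs)
    total = sum(xs)
    partial = 0.0
    best, best_k = 0.0, 0
    for k, x in enumerate(xs, start=1):
        partial += x  # partial sum S_k
        # Deviation of S_k from its expected share (k / n) * S_n, scaled by sqrt(n).
        dev = abs(partial - k / n * total) / math.sqrt(n)
        if dev > best:
            best, best_k = dev, k
    return best, best_k

# Hypothetical series with a mean shift from 0 to 1 after observation 50.
series = [0.0] * 50 + [1.0] * 50
stat, k_hat = cusum_statistic(series)
```

In this noise-free toy series the deviation peaks exactly at the change location. In practice the statistic would be compared against a critical value, which is what the bootstrap in the abstract is meant to supply.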
- [9] arXiv:2408.05085 [pdf, html, other]
Title: On expected signatures and signature cumulants in semimartingale models
Comments: arXiv admin note: text overlap with arXiv:2102.03345
Subjects: Machine Learning (stat.ML); Probability (math.PR)
The concept of signatures and expected signatures is vital in data science, especially for sequential data analysis. The signature transform, a Cartan type development, translates paths into high-dimensional feature vectors, capturing their intrinsic characteristics. Under natural conditions, the expectation of the signature determines the law of the signature, providing a statistical summary of the data distribution. This property facilitates robust modeling and inference in machine learning and stochastic processes. Building on previous work by the present authors [Unified signature cumulants and generalized Magnus expansions, FoM Sigma '22] we here revisit the actual computation of expected signatures, in a general semimartingale setting. Several new formulae are given. A log-transform of (expected) signatures leads to log-signatures (signature cumulants), offering a significant reduction in complexity.
- [10] arXiv:2408.05106 [pdf, html, other]
Title: Spatial Deconfounding is Reasonable Statistical Practice: Interpretations, Clarifications, and New Benefits
Subjects: Methodology (stat.ME)
The spatial linear mixed model (SLMM) consists of fixed and spatial random effects that can be confounded. Restricted spatial regression (RSR) models restrict the spatial random effects to be in the orthogonal column space of the covariates, which "deconfounds" the SLMM. Recent articles have shown that the RSR generally performs worse than the SLMM under a certain interpretation of the RSR. We show that every additive model can be reparameterized as a deconfounded model, leading to what we call the linear reparameterization of additive models (LRAM). Under this reparameterization, the coefficients of the covariates (referred to as deconfounded regression effects) are different from the (confounded) regression effects in the SLMM. It is shown that under the LRAM interpretation, existing deconfounded spatial models produce estimated deconfounded regression effects, spatial predictions, and spatial prediction variances equivalent to those of the SLMM in Bayesian contexts. Furthermore, a general RSR (GRSR) and the SLMM produce identical inferences on confounded regression effects. While our results are in complete agreement with recent criticisms, our new results under the LRAM interpretation provide clarifications that lead to different and sometimes contrary conclusions. Additionally, we discuss the inferential and computational benefits of deconfounding, which we illustrate via a simulation.
New submissions for Monday, 12 August 2024 (showing 10 of 10 entries)
- [11] arXiv:2408.04739 (cross-list from nlin.CD) [pdf, html, other]
Title: Deep learning-based sequential data assimilation for chaotic dynamics identifies local instabilities from single state forecasts
Authors: Marc Bocquet, Alban Farchi, Tobias S. Finn, Charlotte Durand, Sibo Cheng, Yumeng Chen, Ivo Pasmans, Alberto Carrassi
Subjects: Chaotic Dynamics (nlin.CD); Atmospheric and Oceanic Physics (physics.ao-ph); Machine Learning (stat.ML)
We investigate the ability to discover data assimilation (DA) schemes meant for chaotic dynamics with deep learning (DL). The focus is on learning the analysis step of sequential DA, from state trajectories and their observations, using a simple residual convolutional neural network, while assuming the dynamics to be known. Experiments are performed with the Lorenz 96 dynamics, which display spatiotemporal chaos and for which solid benchmarks for DA performance exist. The accuracy of the states obtained from the learned analysis approaches that of the best possibly tuned ensemble Kalman filter (EnKF), and is far better than that of variational DA alternatives. Critically, this can be achieved while propagating even just a single state in the forecast step. We investigate the reason for achieving ensemble filtering accuracy without an ensemble. We diagnose that the analysis scheme actually identifies key dynamical perturbations, mildly aligned with the unstable subspace, from the forecast state alone, without any ensemble-based covariance representation. This reveals that the analysis scheme has learned some multiplicative ergodic theorem associated with the DA process seen as a non-autonomous random dynamical system.
- [12] arXiv:2408.04765 (cross-list from physics.chem-ph) [pdf, html, other]
Title: Scalable learning of potentials to predict time-dependent Hartree-Fock dynamics
Comments: 24 pages, 8 figures
Subjects: Chemical Physics (physics.chem-ph); Computational Physics (physics.comp-ph); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
We propose a framework to learn the time-dependent Hartree-Fock (TDHF) inter-electronic potential of a molecule from its electron density dynamics. Though the entire TDHF Hamiltonian, including the inter-electronic potential, can be computed from first principles, we use this problem as a testbed to develop strategies that can be applied to learn a priori unknown terms that arise in other methods/approaches to quantum dynamics, e.g., emerging problems such as learning exchange-correlation potentials for time-dependent density functional theory. We develop, train, and test three models of the TDHF inter-electronic potential, each parameterized by a four-index tensor of size up to $60 \times 60 \times 60 \times 60$. Two of the models preserve Hermitian symmetry, while one model preserves an eight-fold permutation symmetry that implies Hermitian symmetry. Across seven different molecular systems, we find that accounting for the deeper eight-fold symmetry leads to the best-performing model across three metrics: training efficiency, test set predictive power, and direct comparison of true and learned inter-electronic potentials. All three models, when trained on ensembles of field-free trajectories, generate accurate electron dynamics predictions even in a field-on regime that lies outside the training set. To enable our models to scale to large molecular systems, we derive expressions for Jacobian-vector products that enable iterative, matrix-free training.
- [13] arXiv:2408.04819 (cross-list from cs.LG) [pdf, html, other]
Title: Interventional Causal Structure Discovery over Graphical Models with Convergence and Optimality Guarantees
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Learning causal structure from sampled data is a fundamental problem with applications in various fields, including healthcare, machine learning and artificial intelligence. Traditional methods predominantly rely on observational data, but there exist limits regarding the identifiability of causal structures with only observational data. Interventional data, on the other hand, helps establish a cause-and-effect relationship by breaking the influence of confounding variables. Developing a mathematical framework that seamlessly integrates both observational and interventional data in causal structure learning remains under-explored to date. Furthermore, existing studies often focus on centralized approaches, necessitating the transfer of entire datasets to a single server, which leads to considerable communication overhead and heightened risks to privacy. To tackle these challenges, we develop a bilevel polynomial optimization (Bloom) framework. Bloom not only provides a powerful mathematical modeling framework, underpinned by theoretical support, for causal structure discovery from both interventional and observational data, but also aspires to an efficient causal discovery algorithm with convergence and optimality guarantees. We further extend Bloom to a distributed setting to reduce the communication overhead and mitigate data privacy risks. It is seen through experiments on both synthetic and real-world datasets that Bloom markedly surpasses other leading learning algorithms.
- [14] arXiv:2408.04851 (cross-list from cs.LG) [pdf, html, other]
Title: Your Classifier Can Be Secretly a Likelihood-Based OOD Detector
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The ability to detect out-of-distribution (OOD) inputs is critical to guarantee the reliability of classification models deployed in an open environment. A fundamental challenge in OOD detection is that a discriminative classifier is typically trained to estimate the posterior probability p(y|z) for class y given an input z, but lacks the explicit likelihood estimation of p(z) ideally needed for OOD detection. While numerous OOD scoring functions have been proposed for classification models, the estimated scores are often heuristic-driven and cannot be rigorously interpreted as likelihood. To bridge the gap, we propose Intrinsic Likelihood (INK), which offers rigorous likelihood interpretation to modern discriminative-based classifiers. Specifically, our proposed INK score operates on the constrained latent embeddings of a discriminative classifier, which are modeled as a mixture of hyperspherical embeddings with constant norm. We draw a novel connection between the hyperspherical distribution and the intrinsic likelihood, which can be effectively optimized in modern neural networks. Extensive experiments on the OpenOOD benchmark empirically demonstrate that INK establishes a new state-of-the-art in a variety of OOD detection setups, including both far-OOD and near-OOD. Code is available at this https URL.
- [15] arXiv:2408.04869 (cross-list from cs.LG) [pdf, html, other]
Title: UCB Exploration for Fixed-Budget Bayesian Best Arm Identification
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
We study best-arm identification (BAI) in the fixed-budget setting. Adaptive allocations based on upper confidence bounds (UCBs), such as UCBE, are known to work well in BAI. However, it is well known that its optimal regret is theoretically dependent on instances, which we show to be an artifact in many fixed-budget BAI problems. In this paper, we propose a UCB exploration algorithm that is both theoretically and empirically efficient for the fixed-budget BAI problem under a Bayesian setting. The key idea is to learn prior information, which can enhance the performance of UCB-based BAI algorithms as it has done in the cumulative regret minimization problem. We establish bounds on the failure probability and the simple regret for the Bayesian BAI problem, providing upper bounds of order $\tilde{O}(\sqrt{K/n})$, up to logarithmic factors, where $n$ represents the budget and $K$ denotes the number of arms. Furthermore, we demonstrate through empirical results that our approach consistently outperforms state-of-the-art baselines.
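As background for the UCB-based allocation the abstract refers to, the following is a minimal sketch of the classical (non-Bayesian) UCB-E rule on Bernoulli arms: pull the arm with the largest empirical mean plus exploration bonus, then recommend the arm with the highest empirical mean once the budget is spent. It is not the paper's Bayesian algorithm and learns no prior. The function name `ucb_e`, the exploration parameter `a`, and the arm means are assumptions for illustration.

```python
import math
import random

def ucb_e(means, budget, a=2.0, seed=0):
    """Run a UCB-E style allocation on Bernoulli arms; return (recommended arm, pull counts)."""
    rng = random.Random(seed)
    k = len(means)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(budget):
        if t < k:
            arm = t  # pull every arm once to initialise the estimates
        else:
            # Index = empirical mean + exploration bonus sqrt(a / pulls).
            arm = max(range(k),
                      key=lambda i: sums[i] / pulls[i] + math.sqrt(a / pulls[i]))
        reward = 1.0 if rng.random() < means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    # Recommend the arm with the highest empirical mean.
    best = max(range(k), key=lambda i: sums[i] / pulls[i])
    return best, pulls

# Hypothetical three-armed instance; arm 2 has the highest mean.
best_arm, pulls = ucb_e([0.2, 0.5, 0.9], budget=2000)
```

The bonus here depends only on the pull count, so the choice of `a` controls how evenly the budget is spread across arms. Choosing it well typically requires instance-dependent knowledge.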
- [16] arXiv:2408.04928 (cross-list from physics.data-an) [pdf, html, other]
Title: Identification of the parameters of complex constitutive models: Least squares minimization vs. Bayesian updating
Comments: presented at 15th working conference IFIP Working Group 7.5 on Reliability and Optimization of Structural Systems, Munich, April 7-10, 2010
Subjects: Data Analysis, Statistics and Probability (physics.data-an); Probability (math.PR); Statistics Theory (math.ST); Computation (stat.CO)
In this study, the common least-squares minimization approach is compared to the Bayesian updating procedure. In the context of material parameter identification, the posterior parameter density function is obtained from its prior and the likelihood function of the measurements. By using Markov Chain Monte Carlo methods, such as the Metropolis-Hastings algorithm (Hastings, 1970), the global density function including local peaks can be computed. Thus, this procedure enables an accurate evaluation of the global parameter quality. However, the computational effort is remarkably larger compared to the minimization approach. Thus, several methodologies for an efficient approximation of the likelihood function are discussed in the present study.
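The contrast between least-squares fitting and Bayesian updating via Metropolis-Hastings can be illustrated with a one-parameter toy problem. This is a sketch under stated assumptions, not the constitutive models of the paper: the linear model `y = theta * x`, the data, the noise level `SIGMA`, the prior width `PRIOR_SD`, and the proposal step are all hypothetical.

```python
import math
import random

# Hypothetical measurements of a linear response y = theta * x + noise.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
SIGMA = 0.5       # assumed measurement noise standard deviation
PRIOR_SD = 10.0   # assumed zero-mean Gaussian prior standard deviation

def log_posterior(theta):
    """Unnormalised log posterior = log prior + log likelihood."""
    log_prior = -0.5 * (theta / PRIOR_SD) ** 2
    log_lik = -0.5 * sum((yi - theta * xi) ** 2 for xi, yi in zip(x, y)) / SIGMA ** 2
    return log_prior + log_lik

def metropolis_hastings(log_target, start, n_steps, step_sd, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta, logp = start, log_target(start)
    chain = []
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_sd)
        logp_new = log_target(proposal)
        # Symmetric proposal: accept with probability min(1, target ratio).
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            theta, logp = proposal, logp_new
        chain.append(theta)
    return chain

# Least-squares estimate for a line through the origin.
least_squares = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

chain = metropolis_hastings(log_posterior, start=0.0, n_steps=20000, step_sd=0.2)
samples = chain[2000:]  # discard burn-in
posterior_mean = sum(samples) / len(samples)
```

With a weak prior the posterior mean should sit close to the least-squares estimate. The chain additionally provides the spread of the parameter, at the cost of many more model evaluations than the single minimization.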
- [17] arXiv:2408.04948 (cross-list from cs.CL) [pdf, html, other]
Title: HybridRAG: Integrating Knowledge Graphs and Vector Retrieval Augmented Generation for Efficient Information Extraction
Comments: 9 pages, 2 figures, 5 tables
Subjects: Computation and Language (cs.CL); Machine Learning (cs.LG); Statistical Finance (q-fin.ST); Applications (stat.AP); Machine Learning (stat.ML)
Extraction and interpretation of intricate information from unstructured text data arising in financial applications, such as earnings call transcripts, present substantial challenges to large language models (LLMs), even when using current best practices for Retrieval Augmented Generation (RAG) (referred to as VectorRAG techniques, which utilize vector databases for information retrieval), due to challenges such as domain-specific terminology and complex document formats. We introduce a novel approach, called HybridRAG, that combines Knowledge Graph (KG) based RAG techniques (called GraphRAG) with VectorRAG techniques to enhance question-answer (Q&A) systems for information extraction from financial documents, and that is shown to be capable of generating accurate and contextually relevant answers. Using experiments on a set of financial earnings call transcripts, which come in Q&A format and hence provide a natural set of ground-truth Q&A pairs, we show that HybridRAG, which retrieves context from both a vector database and a KG, outperforms both traditional VectorRAG and GraphRAG individually when evaluated at both the retrieval and generation stages in terms of retrieval accuracy and answer generation. The proposed technique has applications beyond the financial domain.
- [18] arXiv:2408.05040 (cross-list from cs.LG) [pdf, html, other]
Title: BoFire: Bayesian Optimization Framework Intended for Real Experiments
Authors: Johannes P. Dürholt, Thomas S. Asche, Johanna Kleinekorte, Gabriel Mancino-Ball, Benjamin Schiller, Simon Sung, Julian Keupp, Aaron Osburg, Toby Boyne, Ruth Misener, Rosona Eldred, Wagner Steuer Costa, Chrysoula Kappatou, Robert M. Lee, Dominik Linzner, David Walz, Niklas Wulkow, Behrang Shafei
Comments: 6 pages, 1 figure, 1 listing
Subjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
Our open-source Python package BoFire combines Bayesian Optimization (BO) with other design of experiments (DoE) strategies, focusing on developing and optimizing new chemistry. Previous BO implementations, for example as they exist in the literature or software, require substantial adaptation for effective real-world deployment in the chemical industry. BoFire provides a rich feature-set with extensive configurability and realizes our vision of fast-tracking research contributions into industrial use via maintainable open-source software. Owing to quality-of-life features like JSON-serializability of problem formulations, BoFire enables seamless integration of BO into RESTful APIs, a common architecture component for both self-driving laboratories and human-in-the-loop setups. This paper discusses the differences between BoFire and other BO implementations and outlines ways that BO research needs to be adapted for real-world use in a chemistry setting.
- [19] arXiv:2408.05116 (cross-list from quant-ph) [pdf, html, other]
Title: Concept learning of parameterized quantum models from limited measurements
Comments: 16 + 8 pages, 4 figures
Subjects: Quantum Physics (quant-ph); Machine Learning (cs.LG); Machine Learning (stat.ML)
Classical learning of the expectation values of observables for quantum states is a natural variant of learning quantum states or channels. While learning-theoretic frameworks establish the sample complexity and the number of measurement shots per sample required for learning such statistical quantities, the interplay between these two variables has not been adequately quantified before. In this work, we take the probabilistic nature of quantum measurements into account in classical modelling and discuss these quantities under a single unified learning framework. We provide provable guarantees for learning parameterized quantum models that also quantify the asymmetrical effects and interplay of the two variables on the performance of learning algorithms. These results show that while increasing the sample size enhances the learning performance of classical machines, even with single-shot estimates, the improvements from increasing measurements become asymptotically trivial beyond a constant factor. We further apply our framework and theoretical guarantees to study the impact of measurement noise on the classical surrogation of parameterized quantum circuit models. Our work provides new tools to analyse the operational influence of finite measurement noise in the classical learning of quantum systems.
- [20] arXiv:2408.05209 (cross-list from econ.EM) [pdf, html, other]
Title: What are the real implications for $CO_2$ as generation from renewables increases?
Subjects: Econometrics (econ.EM); Applications (stat.AP)
Wind and solar electricity generation account for 14% of total electricity generation in the United States and are expected to continue to grow in the next decades. In low carbon systems, generation from renewable energy sources displaces conventional fossil fuel power plants resulting in lower system-level emissions and emissions intensity. However, we find that intermittent generation from renewables changes the way conventional thermal power plants operate, and that the displacement of generation is not 1 to 1 as expected. Our work provides a method that allows policy and decision makers to continue to track the effect of additional renewable capacity and the resulting thermal power plant operational responses.
Cross submissions for Monday, 12 August 2024 (showing 10 of 10 entries)
- [21] arXiv:2206.03038 (replaced) [pdf, html, other]
Title: Asymptotic Distribution-free Change-point Detection for Modern Data Based on a New Ranking Scheme
Subjects: Methodology (stat.ME)
Change-point detection (CPD) involves identifying distributional changes in a sequence of independent observations. Among nonparametric methods, rank-based methods are attractive due to their robustness and effectiveness and have been extensively studied for univariate data. However, they are not well explored for high-dimensional or non-Euclidean data. This paper proposes a new method, Rank INduced by Graph Change-Point Detection (RING-CPD), which utilizes graph-induced ranks to handle high-dimensional and non-Euclidean data. The new method is asymptotically distribution-free under the null hypothesis, and an analytic $p$-value approximation is provided for easy type-I error control. Simulation studies show that RING-CPD effectively detects change points across a wide range of alternatives and is also robust to heavy-tailed distributions and outliers. The new method is illustrated by the detection of seizures in a functional connectivity network dataset, changes of digit images, and travel pattern changes in the New York City Taxi dataset.
- [22] arXiv:2303.17765 (replaced) [pdf, other]
Title: Learning from Similar Linear Representations: Adaptivity, Minimaxity, and Robustness
Comments: 121 pages, 10 figures, 2 tables
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Representation multi-task learning (MTL) has achieved tremendous success in practice. However, the theoretical understanding of these methods is still lacking. Most existing theoretical works focus on cases where all tasks share the same representation, and claim that MTL almost always improves performance. Nevertheless, as the number of tasks grows, assuming all tasks share the same representation is unrealistic. Furthermore, empirical findings often indicate that a shared representation does not necessarily improve single-task learning performance. In this paper, we aim to understand how to learn from tasks with similar but not exactly the same linear representations, while dealing with outlier tasks. Assuming a known intrinsic dimension, we propose a penalized empirical risk minimization method and a spectral method that are adaptive to the similarity structure and robust to outlier tasks. Both algorithms outperform single-task learning when representations across tasks are sufficiently similar and the proportion of outlier tasks is small. Moreover, they always perform at least as well as single-task learning, even when the representations are dissimilar. We provide information-theoretic lower bounds to demonstrate that both methods are nearly minimax optimal in a large regime, with the spectral method being optimal in the absence of outlier tasks. Additionally, we introduce a thresholding algorithm to adapt to an unknown intrinsic dimension. We conduct extensive numerical experiments to validate our theoretical findings.
- [23] arXiv:2304.09310 (replaced) [pdf, other]
Title: The Adaptive $\tau$-Lasso: Robustness and Oracle Properties
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Signal Processing (eess.SP)
This paper introduces a new regularized version of the robust $\tau$-regression estimator for analyzing high-dimensional datasets subject to gross contamination in the response variables and covariates (explanatory variables). The resulting estimator, termed adaptive $\tau$-Lasso, is robust to outliers and high-leverage points. It also incorporates an adaptive $\ell_1$-norm penalty term, which enables the selection of relevant variables and reduces the bias associated with large true regression coefficients. More specifically, this adaptive $\ell_1$-norm penalty term assigns a weight to each regression coefficient. For a fixed number of predictors $p$, we show that the adaptive $\tau$-Lasso has the oracle property, ensuring both variable-selection consistency and asymptotic normality. Asymptotic normality applies only to the entries of the regression vector corresponding to the true support, assuming knowledge of the true regression vector support. We characterize its robustness by establishing the finite-sample breakdown point and the influence function. We carry out extensive simulations and observe that the class of $\tau$-Lasso estimators exhibits robustness and reliable performance in both contaminated and uncontaminated data settings. We also validate our theoretical findings on robustness properties through simulations. In the face of outliers and high-leverage points, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators achieve the best performance or close-to-best performance in terms of prediction and variable selection accuracy compared to other competing regularized estimators for all scenarios considered in this study. Therefore, the adaptive $\tau$-Lasso and $\tau$-Lasso estimators provide attractive tools for a variety of sparse linear regression problems, particularly in high-dimensional settings and when the data is contaminated by outliers and high-leverage points.
- [24] arXiv:2306.15199 (replaced) [pdf, html, other]
-
Title: A new classification framework for high-dimensional data
Subjects: Methodology (stat.ME)
Classification, a fundamental problem in many fields, faces significant challenges when handling a large number of features, a scenario commonly encountered in modern applications, such as identifying tumor subtypes from genomic data or categorizing customer attitudes based on online reviews. We propose a novel framework that utilizes the ranks of pairwise distances among observations and identifies consistent patterns in moderate- to high-dimensional data, which previous methods have overlooked. The proposed method exhibits superior performance across a variety of scenarios, from high-dimensional data to network data. We further explore a typical setting to investigate key quantities that play essential roles in our framework, which reveal the framework's capabilities in distinguishing differences in the first and/or second moment, as well as distinctions in higher moments.
- [25] arXiv:2308.04420 (replaced) [pdf, html, other]
-
Title: Contour Location for Reliability in Airfoil Simulation Experiments using Deep Gaussian Processes
Comments: 19 pages, 11 figures
Subjects: Methodology (stat.ME)
Bayesian deep Gaussian processes (DGPs) outperform ordinary GPs as surrogate models of complex computer experiments when response surface dynamics are non-stationary, which is especially prevalent in aerospace simulations. Yet DGP surrogates have not been deployed for the canonical downstream task in that setting: reliability analysis through contour location (CL). In that context, we are motivated by a simulation of an RAE-2822 transonic airfoil which demarcates efficient and inefficient flight conditions. Level sets separating passable versus failable operating conditions are best learned through strategic sequential designs. There are two limitations to modern CL methodology which hinder DGP integration in this setting. First, derivative-based optimization underlying acquisition functions is thwarted by sampling-based Bayesian (i.e., MCMC) inference, which is essential for DGP posterior integration. Second, canonical acquisition criteria, such as entropy, are famously myopic to the extent that optimization may even be undesirable. Here we tackle both of these limitations at once, proposing a hybrid criterion that explores along the Pareto front of entropy and (predictive) uncertainty, requiring evaluation only at strategically located "triangulation" candidates. We showcase DGP CL performance in several synthetic benchmark exercises and on the RAE-2822 airfoil.
- [26] arXiv:2310.20609 (replaced) [pdf, html, other]
-
Title: Graph Matching via convex relaxation to the simplex
Comments: We fixed some typos and added Lemma 4. Reference to the published version below. Added link to the code
Journal-ref: Ernesto Araya, Hemant Tyagi. Graph Matching via convex relaxation to the simplex. Foundations of Data Science. 2024
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Statistics Theory (math.ST)
This paper addresses the Graph Matching problem, which consists of finding the best possible alignment between two input graphs, and has many applications in computer vision, network deanonymization and protein alignment. A common approach to tackle this problem is through convex relaxations of the NP-hard \emph{Quadratic Assignment Problem} (QAP).
Here, we introduce a new convex relaxation onto the unit simplex and develop an efficient mirror descent scheme with closed-form iterations for solving this problem. Under the correlated Gaussian Wigner model, we show that the simplex relaxation admits a unique solution with high probability. In the noiseless case, this is shown to imply exact recovery of the ground truth permutation. Additionally, we establish a novel sufficiency condition for the input matrix in standard greedy rounding methods, which is less restrictive than the commonly used 'diagonal dominance' condition. We use this condition to show exact one-step recovery of the ground truth (holding almost surely) via the mirror descent scheme, in the noiseless setting. We also use this condition to obtain significantly improved conditions for the GRAMPA algorithm [Fan et al. 2019] in the noiseless setting.
- [27] arXiv:2312.05365 (replaced) [pdf, html, other]
-
Title: Product Centered Dirichlet Processes for Dependent Clustering
Subjects: Methodology (stat.ME)
While there is an immense literature on Bayesian methods for clustering, the multiview case has received little attention. This problem focuses on obtaining distinct but statistically dependent clusterings in a common set of entities for different data types. For example, one may cluster patients into subgroups, with subgroup membership varying according to the domain of the patient variables. A challenge is how to model the across-view dependence between the partitions of patients into subgroups. The complexities of the partition space make standard methods to model dependence, such as correlation, infeasible. In this article, we propose CLustering with Independence Centering (CLIC), a clustering prior that uses a single parameter to explicitly model dependence between clusterings across views. CLIC is induced by the product centered Dirichlet process (PCDP), a novel hierarchical prior that bridges between independent and equivalent partitions. We show appealing theoretical properties, provide a finite approximation and prove its accuracy, present a marginal Gibbs sampler for posterior computation, and derive closed-form expressions for the marginal and joint partition distributions for the CLIC model. On synthetic data and in an application to epidemiology, CLIC accurately characterizes view-specific partitions while providing inference on the dependence level.
- [28] arXiv:2402.16792 (replaced) [pdf, other]
-
Title: Rate-Optimal Rank Aggregation with Private Pairwise Rankings
Subjects: Machine Learning (stat.ML); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
In various real-world scenarios, such as recommender systems and political surveys, pairwise rankings are commonly collected and utilized for rank aggregation to obtain an overall ranking of items. However, preference rankings can reveal individuals' personal preferences, underscoring the need to protect them from being released for downstream analysis. In this paper, we address the challenge of preserving privacy while ensuring the utility of rank aggregation based on pairwise rankings generated from a general comparison model. Using the randomized response mechanism to perturb raw pairwise rankings is a common privacy protection strategy used in practice. However, a critical challenge arises because the privatized rankings no longer adhere to the original model, resulting in significant bias in downstream rank aggregation tasks. Motivated by this, we propose adaptively debiasing the rankings from the randomized response mechanism, ensuring consistent estimation of true preferences and enhancing the utility of downstream rank aggregation. Theoretically, we offer insights into the relationship between overall privacy guarantees and estimation errors from private ranking data, and establish minimax rates for estimation errors. This enables the determination of optimal privacy guarantees that balance consistency in rank aggregation with privacy protection. We also investigate convergence rates of expected ranking errors for partial and full ranking recovery, quantifying how privacy protection influences the specification of top-$K$ item sets and complete rankings. Our findings are validated through extensive simulations and a real application.
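The bias introduced by randomized response, and the basic idea of correcting it, can be sketched with the textbook binary mechanism. This is a minimal sketch under stated assumptions, not the paper's adaptive debiasing procedure: each pairwise outcome is assumed to be a 0/1 value that is kept with probability $e^{\epsilon}/(1+e^{\epsilon})$ and flipped otherwise, and the function names (`keep_probability`, `randomized_response`, `debias_win_rate`) and simulated data are hypothetical.

```python
import numpy as np

def keep_probability(epsilon):
    """Probability of reporting the true outcome under binary randomized response."""
    return np.exp(epsilon) / (1.0 + np.exp(epsilon))

def randomized_response(y, epsilon, rng):
    """Flip each binary outcome in y with probability 1 - keep_probability(epsilon)."""
    p = keep_probability(epsilon)
    flip = rng.random(y.shape) >= p
    return np.where(flip, 1 - y, y)

def debias_win_rate(private_mean, epsilon):
    """Invert E[z] = (1 - p) + (2p - 1) * theta to recover the true win rate theta."""
    p = keep_probability(epsilon)
    return (private_mean - (1.0 - p)) / (2.0 * p - 1.0)

# Simulated comparisons: item A beats item B with (assumed) probability 0.8.
rng = np.random.default_rng(0)
true_win_prob = 0.8
y = (rng.random(200_000) < true_win_prob).astype(int)
z = randomized_response(y, epsilon=1.0, rng=rng)

naive = z.mean()                                # biased toward 0.5
debiased = debias_win_rate(naive, epsilon=1.0)  # close to the true win rate
```

The naive average of the privatized outcomes is pulled toward 0.5, while the inverted estimate is consistent for the true win rate. The division by $2p - 1$ also shows why stronger privacy (smaller $\epsilon$) inflates estimation error.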
- [29] arXiv:2405.14840 (replaced) [pdf, html, other]
-
Title: Differentiable Annealed Importance Sampling Minimizes The Symmetrized Kullback-Leibler Divergence Between Initial and Target Distribution
Comments: 22 pages, including 9 pages of main text and 11 pages of appendix, conference paper at ICML 2024, updated terminology
Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Differentiable annealed importance sampling (DAIS), proposed by Geffner & Domke (2021) and Zhang et al. (2021), allows optimizing over the initial distribution of AIS. In this paper, we show that, in the limit of many transitions, DAIS minimizes the symmetrized Kullback-Leibler divergence between the initial and target distribution. Thus, DAIS can be seen as a form of variational inference (VI) as its initial distribution is a parametric fit to an intractable target distribution. We empirically evaluate the usefulness of the initial distribution as a variational distribution on synthetic and real-world data, observing that it often provides more accurate uncertainty estimates than VI (optimizing the reverse KL divergence), importance weighted VI, and Markovian score climbing (optimizing the forward KL divergence).
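For readers unfamiliar with the quantity, the symmetrized Kullback-Leibler divergence can be illustrated in the one case where it has a simple closed form, two univariate Gaussians. The sketch below takes the symmetrized divergence to be the plain sum of the forward and reverse KL terms; some authors include a factor of 1/2, and the abstract does not say which convention is used. The function names are hypothetical.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians (s1, s2 > 0)."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2.0 * s2**2) - 0.5

def symmetrized_kl(m1, s1, m2, s2):
    """Sum of the forward and reverse KL divergences (assumed convention)."""
    return kl_gauss(m1, s1, m2, s2) + kl_gauss(m2, s2, m1, s1)

# The two directions differ when the variances differ...
forward = kl_gauss(0.0, 1.0, 0.0, 2.0)
reverse = kl_gauss(0.0, 2.0, 0.0, 1.0)
# ...but the symmetrized version does not depend on the argument order.
sym_ab = symmetrized_kl(0.0, 1.0, 0.0, 2.0)
sym_ba = symmetrized_kl(0.0, 2.0, 0.0, 1.0)
```

The asymmetry between `forward` and `reverse` is why objectives based on one direction alone (reverse KL for standard VI, forward KL for Markovian score climbing) can behave differently from a symmetrized objective.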
- [30] arXiv:2406.09195 (replaced) [pdf, html, other]
-
Title: When Pearson $\chi^2$ and other divisible statistics are not goodness-of-fit tests
Subjects: Methodology (stat.ME); Statistics Theory (math.ST); Data Analysis, Statistics and Probability (physics.data-an); Computation (stat.CO)
Thousands of experiments are analyzed and papers are published each year involving the statistical analysis of grouped data. While this area of statistics is often perceived - somewhat naively - as saturated, several misconceptions still affect everyday practice, and new frontiers have so far remained unexplored. Researchers must be aware of the limitations affecting their analyses and of the new possibilities in their hands. Motivated by this need, the article introduces a unifying approach to the analysis of grouped data which allows us to study the class of divisible statistics - which includes Pearson's $\chi^2$ and the likelihood ratio as special cases - with a fresh perspective. The contributions collected in this manuscript range from modeling and estimation to distribution-free goodness-of-fit tests. Perhaps the most surprising result presented here is that, in a sparse regime, all tests proposed in the literature are dominated by a class of weighted linear statistics.
- [31] arXiv:2112.09741 (replaced) [pdf, html, other]
-
Title: Envisioning Future Deep Learning Theories: Some Basic Concepts and Characteristics
Comments: Accepted by Science China (Information Sciences)
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); Statistical Mechanics (cond-mat.stat-mech); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (stat.ML)
To advance deep learning methodologies in the next decade, a theoretical framework for reasoning about modern neural networks is needed. While efforts are increasing toward demystifying why deep learning is so effective, a comprehensive picture remains lacking, suggesting that a better theory is possible. We argue that a future deep learning theory should inherit three characteristics: a \textit{hierarchically} structured network architecture, parameters \textit{iteratively} optimized using stochastic gradient-based methods, and information from the data that evolves \textit{compressively}. As an instantiation, we integrate these characteristics into a graphical model called \textit{neurashed}. This model effectively explains some common empirical patterns in deep learning. In particular, neurashed enables insights into implicit regularization, information bottleneck, and local elasticity. Finally, we discuss how neurashed can guide the development of deep learning theories.
- [32] arXiv:2207.09768 (replaced) [pdf, html, other]
-
Title: Learning Counterfactually Invariant Predictors
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Notions of counterfactual invariance (CI) have proven essential for predictors that are fair, robust, and generalizable in the real world. We propose graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of a conditional independence in the observational distribution. In order to learn such predictors, we propose a model-agnostic framework, called Counterfactually Invariant Prediction (CIP), building on the Hilbert-Schmidt Conditional Independence Criterion (HSCIC), a kernel-based conditional dependence measure. Our experimental results demonstrate the effectiveness of CIP in enforcing counterfactual invariance across various simulated and real-world datasets including scalar and multi-variate settings.
- [33] arXiv:2212.12921 (replaced) [pdf, html, other]
-
Title: Learning k-Level Structured Sparse Neural Networks Using Group Envelope Regularization
Subjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
The extensive need for computational resources poses a significant obstacle to deploying large-scale Deep Neural Networks (DNN) on devices with constrained resources. At the same time, studies have demonstrated that a significant number of these DNN parameters are redundant and extraneous. In this paper, we introduce a novel approach for learning structured sparse neural networks, aimed at addressing DNN hardware deployment challenges. We develop a novel regularization technique, termed Weighted Group Sparse Envelope Function (WGSEF), generalizing the Sparse Envelope Function (SEF), to select (or nullify) neuron groups, thereby reducing redundancy and enhancing computational efficiency. The method speeds up inference time and aims to reduce memory demand and power consumption, thanks to its adaptability, which lets any hardware specify group definitions, such as filters, channels, filter shapes, layer depths, a single parameter (unstructured), etc. The properties of the WGSEF enable the pre-definition of a desired sparsity level to be achieved at training convergence. In the case of redundant parameters, this approach maintains negligible network accuracy degradation or can even lead to improvements in accuracy. Our method efficiently computes the WGSEF regularizer and its proximal operator, with worst-case linear complexity in the number of group variables. To train the model, we employ a proximal-gradient-based optimization technique that tackles the non-convex minimization problem incorporating the neural network loss and the WGSEF. Finally, we experiment and illustrate the efficiency of our proposed method in terms of the compression ratio, accuracy, and inference latency.
- [34] arXiv:2308.06422 (replaced) [pdf, html, other]
-
Title: Sensitivity-Aware Mixed-Precision Quantization and Width Optimization of Deep Neural Networks Through Cluster-Based Tree-Structured Parzen Estimation
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Neural and Evolutionary Computing (cs.NE); Machine Learning (stat.ML)
As the complexity and computational demands of deep learning models rise, the need for effective optimization methods for neural network designs becomes paramount. This work introduces an innovative search mechanism for automatically selecting the best bit-width and layer-width for individual neural network layers. This leads to a marked enhancement in deep neural network efficiency. The search domain is strategically reduced by leveraging Hessian-based pruning, ensuring the removal of non-crucial parameters. Subsequently, we detail the development of surrogate models for favorable and unfavorable outcomes by employing a cluster-based tree-structured Parzen estimator. This strategy allows for a streamlined exploration of architectural possibilities and swift pinpointing of top-performing designs. Through rigorous testing on well-known datasets, our method proves its distinct advantage over existing methods. Compared to leading compression strategies, our approach records an impressive 20% decrease in model size without compromising accuracy. Additionally, our method boasts a 12x reduction in search time relative to the best search-focused strategies currently available. As a result, our proposed method represents a leap forward in neural network design optimization, paving the way for quick model design and implementation in settings with limited resources, thereby propelling the potential of scalable deep learning solutions.
- [35] arXiv:2309.14512 (replaced) [pdf, html, other]
-
Title: Byzantine-Resilient Federated PCA and Low Rank Column-wise Sensing
Comments: 36 pages
Subjects: Information Theory (cs.IT); Machine Learning (stat.ML)
This work considers two related learning problems in a federated attack-prone setting: federated principal components analysis (PCA) and federated low rank column-wise sensing (LRCS). The node attacks are assumed to be Byzantine, which means that the attackers are omniscient and can collude. We introduce a novel provably Byzantine-resilient, communication-efficient, and sample-efficient algorithm, called Subspace-Median, that solves the PCA problem and is a key part of the solution for the LRCS problem. We also study the most natural Byzantine-resilient solution for federated PCA, a geometric-median-based modification of the federated power method, and explain why it is not useful. Our second main contribution is a complete alternating gradient descent (GD) and minimization (altGDmin) algorithm for Byzantine-resilient horizontally federated LRCS, together with sample and communication complexity guarantees for it. Extensive simulation experiments are used to corroborate our theoretical guarantees. The ideas that we develop for LRCS are easily extendable to other LR recovery problems as well.
- [36] arXiv:2403.02811 (replaced) [pdf, other]
-
Title: Linear quadratic control of nonlinear systems with Koopman operator learning and the Nyström method
Authors: Edoardo Caldarelli, Antoine Chatalic, Adrià Colomé, Cesare Molinari, Carlos Ocampo-Martinez, Carme Torras, Lorenzo Rosasco
Subjects: Optimization and Control (math.OC); Systems and Control (eess.SY); Machine Learning (stat.ML)
In this paper, we study how the Koopman operator framework can be combined with kernel methods to effectively control nonlinear dynamical systems. While kernel methods have typically large computational requirements, we show how random subspaces (Nyström approximation) can be used to achieve huge computational savings while preserving accuracy. Our main technical contribution is deriving theoretical guarantees on the effect of the Nyström approximation. More precisely, we study the linear quadratic regulator problem, showing that the approximated Riccati operator converges at the rate $m^{-1/2}$, and the regulator objective, for the associated solution of the optimal control problem, converges at the rate $m^{-1}$, where $m$ is the random subspace size. Theoretical findings are complemented by numerical experiments corroborating our results.
- [37] arXiv:2406.12916 (replaced) [pdf, html, other]
-
Title: Opening the Black Box: predicting the trainability of deep neural networks with reconstruction entropy
Comments: 21 pages, 5 figures, 1 table. Minor changes to presentation, typos corrected
Subjects: Machine Learning (cs.LG); Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics - Theory (hep-th); Machine Learning (stat.ML)
An important challenge in machine learning is to predict the initial conditions under which a given neural network will be trainable. We present a method for predicting the trainable regime in parameter space for deep feedforward neural networks, based on reconstructing the input from subsequent activation layers via a cascade of single-layer auxiliary networks. For both the MNIST and CIFAR10 datasets, we show that a single epoch of training of the shallow cascade networks is sufficient to predict the trainability of the deep feedforward network, thereby providing a significant reduction in overall training time. We achieve this by computing the relative entropy between reconstructed images and the original inputs, and show that this probe of information loss is sensitive to the phase behaviour of the network. Moreover, our approach illustrates the network's decision making process by displaying the changes performed on the input data at each layer. Our results provide a concrete link between the flow of information and the trainability of deep neural networks, further explaining the role of criticality in these systems.
- [38] arXiv:2407.17781 (replaced) [pdf, other]
-
Title: Ensemble data assimilation to diagnose AI-based weather prediction model: A case with ClimaX
Subjects: Machine Learning (cs.LG); Applications (stat.AP)
Artificial intelligence (AI)-based weather prediction research is growing rapidly and has been shown to be competitive with advanced dynamical numerical weather prediction (NWP) models. However, research combining AI-based weather prediction models with data assimilation remains limited, partially because long-term sequential data assimilation cycles are required to evaluate data assimilation systems. This study proposes using ensemble data assimilation for diagnosing AI-based weather prediction models, and marks the first successful implementation of the ensemble Kalman filter with AI-based weather prediction models. Our experiments with the AI-based model ClimaX demonstrated that the ensemble data assimilation cycled stably for the AI-based weather prediction model using covariance inflation and localization techniques within the ensemble Kalman filter. While ClimaX showed some limitations in capturing flow-dependent error covariance compared to dynamical models, the AI-based ensemble forecasts provided reasonable and beneficial error covariance in sparsely observed regions. In addition, ensemble data assimilation revealed that error growth based on ensemble ClimaX predictions was weaker than that of dynamical NWP models, leading to higher inflation factors. A series of experiments demonstrated that ensemble data assimilation can be used to diagnose properties of AI weather prediction models such as physical consistency and accurate error growth representation.
- [39] arXiv:2407.21424 (replaced) [pdf, html, other]
-
Title: Cost-Effective Hallucination Detection for LLMs
Comments: Accepted to GenAI Evaluation Workshop at KDD 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Machine Learning (stat.ML)
Large language models (LLMs) can be prone to hallucinations - generating unreliable outputs that are unfaithful to their inputs or to external facts, or that are internally inconsistent. In this work, we address several challenges for post-hoc hallucination detection in production settings. Our pipeline for hallucination detection entails: first, producing a confidence score representing the likelihood that a generated answer is a hallucination; second, calibrating the score conditional on attributes of the inputs and candidate response; finally, performing detection by thresholding the calibrated score. We benchmark a variety of state-of-the-art scoring methods on different datasets, encompassing question answering, fact checking, and summarization tasks. We employ diverse LLMs to ensure a comprehensive assessment of performance. We show that calibrating individual scoring methods is critical for ensuring risk-aware downstream decision making. Based on findings that no individual score performs best in all situations, we propose a multi-scoring framework, which combines different scores and achieves top performance across all datasets. We further introduce cost-effective multi-scoring, which can match or even outperform more expensive detection methods, while significantly reducing computational overhead.
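The three-stage pipeline described above (score, calibrate conditional on attributes, threshold) can be sketched roughly as follows. This is an illustrative sketch under strong assumptions, not the paper's method: the raw score is assumed to lie in [0, 1], the attributes are reduced to a single categorical `group` label, per-group histogram binning stands in for whatever calibration the authors use, and all names and data are hypothetical.

```python
import numpy as np

def fit_binned_calibrator(scores, labels, groups, n_bins=2):
    """Per-group histogram binning: map each raw-score bin to the empirical
    hallucination rate (label 1 = hallucination) observed in that group."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    table = {}
    for g in np.unique(groups):
        mask = groups == g
        bins = np.clip(np.digitize(scores[mask], edges) - 1, 0, n_bins - 1)
        # Empty bins fall back to the group's overall hallucination rate.
        rates = np.full(n_bins, labels[mask].mean())
        for b in range(n_bins):
            in_bin = bins == b
            if in_bin.any():
                rates[b] = labels[mask][in_bin].mean()
        table[str(g)] = rates
    return edges, table

def calibrate(score, group, edges, table):
    """Calibrated hallucination probability for one raw score and its group."""
    b = int(np.clip(np.digitize(score, edges) - 1, 0, len(edges) - 2))
    return float(table[str(group)][b])

def detect(score, group, edges, table, threshold=0.5):
    """Flag a response when its calibrated score reaches the threshold."""
    return calibrate(score, group, edges, table) >= threshold

# Toy calibration set: identical raw scores, different outcomes per task type.
scores = np.array([0.1, 0.2, 0.8, 0.9, 0.1, 0.2, 0.8, 0.9])
labels = np.array([0, 0, 1, 1, 0, 1, 1, 1])
groups = np.array(["qa"] * 4 + ["summarization"] * 4)
edges, table = fit_binned_calibrator(scores, labels, groups)

flag_qa = detect(0.2, "qa", edges, table)
flag_sum = detect(0.2, "summarization", edges, table)
```

In this toy data the same raw score of 0.2 is flagged for the summarization group but not for question answering, which illustrates why thresholding a score calibrated on input attributes can give different decisions than thresholding the raw score.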