Search | arXiv e-print repository

Generalization error of min-norm interpolators in transfer learning

Authors: Yanke Song, Sohom Bhattacharya, Pragya Sur

Abstract: This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during trai… ▽ More This paper establishes the generalization error of pooled min-$\ell_2$-norm interpolation in transfer learning where data from diverse distributions are available. Min-norm interpolators emerge naturally as implicit regularized limits of modern machine learning algorithms. Previous work characterized their out-of-distribution risk when samples from the test distribution are unavailable during training. However, in many applications, a limited amount of test data may be available during training, yet properties of min-norm interpolation in this setting are not well-understood. We address this gap by characterizing the bias and variance of pooled min-$\ell_2$-norm interpolation under covariate and model shifts. The pooled interpolator captures both early fusion and a form of intermediate fusion. Our results have several implications: under model shift, for low signal-to-noise ratio (SNR), adding data always hurts. For higher SNR, transfer learning helps as long as the shift-to-signal (SSR) ratio lies below a threshold that we characterize explicitly. By consistently estimating these ratios, we provide a data-driven method to determine: (i) when the pooled interpolator outperforms the target-based interpolator, and (ii) the optimal number of target samples that minimizes the generalization error. Under covariate shift, if the source sample size is small relative to the dimension, heterogeneity between between domains improves the risk, and vice versa. We establish a novel anisotropic local law to achieve these characterizations, which may be of independent interest in random matrix theory. We supplement our theoretical characterizations with comprehensive simulations that demonstrate the finite-sample efficacy of our results. △ Less

Submitted 19 June, 2024; originally announced June 2024.

Comments: 53 pages, 2 figures

arXiv:2406.11666 [pdf, other]

ROTI-GCV: Generalized Cross-Validation for right-ROTationally Invariant Data

Authors: Kevin Luo, Yufan Li, Pragya Sur

Abstract: Two key tasks in high-dimensional regularized regression are tuning the regularization strength for good predictions and estimating the out-of-sample risk. It is known that the standard approach -- $k$-fold cross-validation -- is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become incon… ▽ More Two key tasks in high-dimensional regularized regression are tuning the regularization strength for good predictions and estimating the out-of-sample risk. It is known that the standard approach -- $k$-fold cross-validation -- is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become inconsistent when samples are dependent or contain heavy-tailed covariates. To model structured sample dependence and heavy tails, we use right-rotationally invariant covariate distributions - a crucial concept from compressed sensing. In the common modern proportional asymptotics regime where the number of features and samples grow comparably, we introduce a new framework, ROTI-GCV, for reliably performing cross-validation. Along the way, we propose new estimators for the signal-to-noise ratio and noise variance under these challenging conditions. We conduct extensive experiments that demonstrate the power of our approach and its superiority over existing methods. △ Less

Submitted 17 June, 2024; originally announced June 2024.

Comments: 25 pages, 3 figures

arXiv:2406.11184 [pdf, other]

HEDE: Heritability estimation in high dimensions by Ensembling Debiased Estimators

Authors: Yanke Song, Xihong Lin, Pragya Sur

Abstract: Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sampl… ▽ More Estimating heritability remains a significant challenge in statistical genetics. Diverse approaches have emerged over the years that are broadly categorized as either random effects or fixed effects heritability methods. In this work, we focus on the latter. We propose HEDE, an ensemble approach to estimate heritability or the signal-to-noise ratio in high-dimensional linear models where the sample size and the dimension grow proportionally. Our method ensembles post-processed versions of the debiased lasso and debiased ridge estimators, and incorporates a data-driven strategy for hyperparameter selection that significantly boosts estimation performance. We establish rigorous consistency guarantees that hold despite adaptive tuning. Extensive simulations demonstrate our method's superiority over existing state-of-the-art methods across various signal structures and genetic architectures, ranging from sparse to relatively dense and from evenly to unevenly distributed signals. Furthermore, we discuss the advantages of fixed effects heritability estimation compared to random effects estimation. Our theoretical guarantees hold for realistic genotype distributions observed in genetic studies, where genotypes typically take on discrete values and are often well-modeled by sub-Gaussian distributed random variables. We establish our theoretical results by deriving uniform bounds, built upon the convex Gaussian min-max theorem, and leveraging universality results. Finally, we showcase the efficacy of our approach in estimating height and BMI heritability using the UK Biobank. △ Less

Submitted 16 June, 2024; originally announced June 2024.

Comments: 58 pages, 7 figures

arXiv:2403.16336 [pdf, other]

Predictive Inference in Multi-environment Scenarios

Authors: John C. Duchi, Suyash Gupta, Kuanhao Jiang, Pragya Sur

Abstract: We address the challenge of constructing valid confidence intervals and sets in problems of prediction across multiple environments. We investigate two types of coverage suitable for these problems, extending the jackknife and split-conformal methods to show how to obtain distribution-free coverage in such non-traditional, hierarchical data-generating scenarios. Our contributions also include exte… ▽ More We address the challenge of constructing valid confidence intervals and sets in problems of prediction across multiple environments. We investigate two types of coverage suitable for these problems, extending the jackknife and split-conformal methods to show how to obtain distribution-free coverage in such non-traditional, hierarchical data-generating scenarios. Our contributions also include extensions for settings with non-real-valued responses and a theory of consistency for predictive inference in these general problems. We demonstrate a novel resizing method to adapt to problem difficulty, which applies both to existing approaches for predictive inference with hierarchical data and the methods we develop; this reduces prediction set sizes using limited information from the test environment, a key to the methods' practical performance, which we evaluate through neurochemical sensing and species classification datasets. △ Less

Submitted 24 March, 2024; originally announced March 2024.

arXiv:2309.07810 [pdf, other]

Spectrum-Aware Debiasing: A Modern Inference Framework with Applications to Principal Components Regression

Authors: Yufan Li, Pragya Sur

Abstract: Debiasing is a fundamental concept in high-dimensional statistics. While degrees-of-freedom adjustment is the state-of-the-art technique in high-dimensional linear regression, it is limited to i.i.d. samples and sub-Gaussian covariates. These constraints hinder its broader practical use. Here, we introduce Spectrum-Aware Debiasing--a novel method for high-dimensional regression. Our approach appli… ▽ More Debiasing is a fundamental concept in high-dimensional statistics. While degrees-of-freedom adjustment is the state-of-the-art technique in high-dimensional linear regression, it is limited to i.i.d. samples and sub-Gaussian covariates. These constraints hinder its broader practical use. Here, we introduce Spectrum-Aware Debiasing--a novel method for high-dimensional regression. Our approach applies to problems with structured dependencies, heavy tails, and low-rank structures. Our method achieves debiasing through a rescaled gradient descent step, deriving the rescaling factor using spectral information of the sample covariance matrix. The spectrum-based approach enables accurate debiasing in much broader contexts. We study the common modern regime where the number of features and samples scale proportionally. We establish asymptotic normality of our proposed estimator (suitably centered and scaled) under various convergence notions when the covariates are right-rotationally invariant. Such designs have garnered recent attention due to their crucial role in compressed sensing. Furthermore, we devise a consistent estimator for its asymptotic variance. Our work has two notable by-products: first, we use Spectrum-Aware Debiasing to correct bias in principal components regression (PCR), providing the first debiased PCR estimator in high dimensions. Second, we introduce a principled test for checking alignment between the signal and the eigenvectors of the sample covariance matrix. This test is independently valuable for statistical methods developed using approximate message passing, leave-one-out, or convex Gaussian min-max theorems. We demonstrate our method through simulated and real data experiments. Technically, we connect approximate message passing algorithms with debiasing and provide the first proof of the Cauchy property of vector approximate message passing (V-AMP). △ Less

Submitted 22 July, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

Comments: Minor changes and updated references

arXiv:2210.12082 [pdf, other]

A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models

Authors: Lijia Zhou, Frederic Koehler, Pragya Sur, Danica J. Sutherland, Nathan Srebro

Abstract: We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the squa… ▽ More We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal $\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand's well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation. △ Less

Submitted 21 October, 2022; originally announced October 2022.

Comments: As published at NeurIPS 2022

arXiv:2207.04588 [pdf, other]

Multi-Study Boosting: Theoretical Considerations for Merging vs. Ensembling

Authors: Cathy Shyr, Pragya Sur, Giovanni Parmigiani, Prasad Patil

Abstract: Cross-study replicability is a powerful model evaluation criterion that emphasizes generalizability of predictions. When training cross-study replicable prediction models, it is critical to decide between merging and treating the studies separately. We study boosting algorithms in the presence of potential heterogeneity in predictor-outcome relationships across studies and compare two multi-study… ▽ More Cross-study replicability is a powerful model evaluation criterion that emphasizes generalizability of predictions. When training cross-study replicable prediction models, it is critical to decide between merging and treating the studies separately. We study boosting algorithms in the presence of potential heterogeneity in predictor-outcome relationships across studies and compare two multi-study learning strategies: 1) merging all the studies and training a single model, and 2) multi-study ensembling, which involves training a separate model on each study and ensembling the resulting predictions. In the regression setting, we provide theoretical guidelines based on an analytical transition point to determine whether it is more beneficial to merge or to ensemble for boosting with linear learners. In addition, we characterize a bias-variance decomposition of estimation error for boosting with component-wise linear learners. We verify the theoretical transition point result in simulation and illustrate how it can guide the decision on merging vs. ensembling in an application to breast cancer gene expression data. △ Less

Submitted 12 July, 2022; v1 submitted 10 July, 2022; originally announced July 2022.

arXiv:2205.10198 [pdf, other]

A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-Fit Covariance and Beyond

Authors: Kuanhao Jiang, Rajarshi Mukherjee, Subhabrata Sen, Pragya Sur

Abstract: Estimation of the average treatment effect (ATE) is a central problem in causal inference. In recent times, inference for the ATE in the presence of high-dimensional covariates has been extensively studied. Among the diverse approaches that have been proposed, augmented inverse probability weighting (AIPW) with cross-fitting has emerged a popular choice in practice. In this work, we study this cro… ▽ More Estimation of the average treatment effect (ATE) is a central problem in causal inference. In recent times, inference for the ATE in the presence of high-dimensional covariates has been extensively studied. Among the diverse approaches that have been proposed, augmented inverse probability weighting (AIPW) with cross-fitting has emerged a popular choice in practice. In this work, we study this cross-fit AIPW estimator under well-specified outcome regression and propensity score models in a high-dimensional regime where the number of features and samples are both large and comparable. Under assumptions on the covariate distribution, we establish a new central limit theorem for the suitably scaled cross-fit AIPW that applies without any sparsity assumptions on the underlying high-dimensional parameters. Our CLT uncovers two crucial phenomena among others: (i) the AIPW exhibits a substantial variance inflation that can be precisely quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the pre-cross-fit estimators is non-negligible even on the root-n scale. These findings are strikingly different from their classical counterparts. On the technical front, our work utilizes a novel interplay between three distinct tools--approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. We believe our proof techniques should be useful for analyzing other two-stage estimators in this high-dimensional regime. Finally, we complement our theoretical results with simulations that demonstrate both the finite sample efficacy of our CLT and its robustness to our assumptions. △ Less

Submitted 28 October, 2022; v1 submitted 20 May, 2022; originally announced May 2022.

Comments: 132 pages, 7 figures; In V2, we added extensive comparisons with the classical variance formula (c.f.~Sec 3, Fig 2, Fig 4) and elaborated on the non-trivial cross-fit covariance phenomenon further

arXiv:2204.04476 [pdf, other]

doi 10.1093/imaiai/iaad042

High-dimensional Asymptotics of Langevin Dynamics in Spiked Matrix Models

Authors: Tengyuan Liang, Subhabrata Sen, Pragya Sur

Abstract: We study Langevin dynamics for recovering the planted signal in the spiked matrix model. We provide a "path-wise" characterization of the overlap between the output of the Langevin algorithm and the planted signal. This overlap is characterized in terms of a self-consistent system of integro-differential equations, usually referred to as the Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) equa… ▽ More We study Langevin dynamics for recovering the planted signal in the spiked matrix model. We provide a "path-wise" characterization of the overlap between the output of the Langevin algorithm and the planted signal. This overlap is characterized in terms of a self-consistent system of integro-differential equations, usually referred to as the Crisanti-Horner-Sommers-Cugliandolo-Kurchan (CHSCK) equations in the spin glass literature. As a second contribution, we derive an explicit formula for the limiting overlap in terms of the signal-to-noise ratio and the injected noise in the diffusion. This uncovers a sharp phase transition -- in one regime, the limiting overlap is strictly positive, while in the other, the injected noise overcomes the signal, and the limiting overlap is zero. △ Less

Submitted 9 April, 2022; originally announced April 2022.

Comments: 26 pages

Journal ref: Information and Inference: A Journal of the IMA, 12(4):2720-2752, 2023

arXiv:2006.11478 [pdf, ps, other]

Representation via Representations: Domain Generalization via Adversarially Learned Invariant Representations

Authors: Zhun Deng, Frances Ding, Cynthia Dwork, Rachel Hong, Giovanni Parmigiani, Prasad Patil, Pragya Sur

Abstract: We investigate the power of censoring techniques, first developed for learning {\em fair representations}, to address domain generalization. We examine {\em adversarial} censoring techniques for learning invariant representations from multiple "studies" (or domains), where each study is drawn according to a distribution on domains. The mapping is used at test time to classify instances from a new… ▽ More We investigate the power of censoring techniques, first developed for learning {\em fair representations}, to address domain generalization. We examine {\em adversarial} censoring techniques for learning invariant representations from multiple "studies" (or domains), where each study is drawn according to a distribution on domains. The mapping is used at test time to classify instances from a new domain. In many contexts, such as medical forecasting, domain generalization from studies in populous areas (where data are plentiful), to geographically remote populations (for which no training data exist) provides fairness of a different flavor, not anticipated in previous work on algorithmic fairness. We study an adversarial loss function for $k$ domains and precisely characterize its limiting behavior as $k$ grows, formalizing and proving the intuition, backed by experiments, that observing data from a larger number of domains helps. The limiting results are accompanied by non-asymptotic learning-theoretic bounds. Furthermore, we obtain sufficient conditions for good worst-case prediction performance of our algorithm on previously unseen domains. Finally, we decompose our mappings into two components and provide a complete characterization of invariance in terms of this decomposition. To our knowledge, our results provide the first formal guarantees of these kinds for adversarial invariant domain generalization. △ Less

Submitted 19 June, 2020; originally announced June 2020.

arXiv:2004.01840 [pdf, other]

Abstracting Fairness: Oracles, Metrics, and Interpretability

Authors: Cynthia Dwork, Christina Ilvento, Guy N. Rothblum, Pragya Sur

Abstract: It is well understood that classification algorithms, for example, for deciding on loan applications, cannot be evaluated for fairness without taking context into account. We examine what can be learned from a fairness oracle equipped with an underlying understanding of ``true'' fairness. The oracle takes as input a (context, classifier) pair satisfying an arbitrary fairness definition, and accept… ▽ More It is well understood that classification algorithms, for example, for deciding on loan applications, cannot be evaluated for fairness without taking context into account. We examine what can be learned from a fairness oracle equipped with an underlying understanding of ``true'' fairness. The oracle takes as input a (context, classifier) pair satisfying an arbitrary fairness definition, and accepts or rejects the pair according to whether the classifier satisfies the underlying fairness truth. Our principal conceptual result is an extraction procedure that learns the underlying truth; moreover, the procedure can learn an approximation to this truth given access to a weak form of the oracle. Since every ``truly fair'' classifier induces a coarse metric, in which those receiving the same decision are at distance zero from one another and those receiving different decisions are at distance one, this extraction process provides the basis for ensuring a rough form of metric fairness, also known as individual fairness. Our principal technical result is a higher fidelity extractor under a mild technical constraint on the weak oracle's conception of fairness. Our framework permits the scenario in which many classifiers, with differing outcomes, may all be considered fair. Our results have implications for interpretablity -- a highly desired but poorly defined property of classification systems that endeavors to permit a human arbiter to reject classifiers deemed to be ``unfair'' or illegitimately derived. △ Less

Submitted 3 April, 2020; originally announced April 2020.

Comments: 17 pages, 1 figure

arXiv:2002.01586 [pdf, other]

doi 10.1214/22-AOS2170

A Precise High-Dimensional Asymptotic Theory for Boosting and Minimum-$\ell_1$-Norm Interpolated Classifiers

Authors: Tengyuan Liang, Pragya Sur

Abstract: This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data, taking statistical and computational perspectives. We consider a high-dimensional setting where the number of features (weak learners) $p$ scales with the sample size $n$, in an overparametrized regime. Under a class of statistical models, we provide an exact analysis of the generalization error of… ▽ More This paper establishes a precise high-dimensional asymptotic theory for boosting on separable data, taking statistical and computational perspectives. We consider a high-dimensional setting where the number of features (weak learners) $p$ scales with the sample size $n$, in an overparametrized regime. Under a class of statistical models, we provide an exact analysis of the generalization error of boosting when the algorithm interpolates the training data and maximizes the empirical $\ell_1$-margin. Further, we explicitly pin down the relation between the boosting test error and the optimal Bayes error, as well as the proportion of active features at interpolation (with zero initialization). In turn, these precise characterizations answer certain questions raised in \cite{breiman1999prediction, schapire1998boosting} surrounding boosting, under assumed data generating processes. At the heart of our theory lies an in-depth study of the maximum-$\ell_1$-margin, which can be accurately described by a new system of non-linear equations; to analyze this margin, we rely on Gaussian comparison techniques and develop a novel uniform deviation argument. Our statistical and computational arguments can handle (1) any finite-rank spiked covariance model for the feature distribution and (2) variants of boosting corresponding to general $\ell_q$-geometry, $q \in [1, 2]$. As a final component, via the Lindeberg principle, we establish a universality result showcasing that the scaled $\ell_1$-margin (asymptotically) remains the same, whether the covariates used for boosting arise from a non-linear random feature model or an appropriately linearized model with matching moments. △ Less

Submitted 22 July, 2021; v1 submitted 4 February, 2020; originally announced February 2020.

Comments: 68 pages, 4 figures

Journal ref: The Annals of Statistics, 50(3):1669-1695, 2022

arXiv:2001.09351 [pdf, other]

doi 10.3150/21-BEJ1401

The Asymptotic Distribution of the MLE in High-dimensional Logistic Models: Arbitrary Covariance

Authors: Qian Zhao, Pragya Sur, Emmanuel J. Candès

Abstract: We study the distribution of the maximum likelihood estimate (MLE) in high-dimensional logistic models, extending the recent results from Sur (2019) to the case where the Gaussian covariates may have an arbitrary covariance structure. We prove that in the limit of large problems holding the ratio between the number $p$ of covariates and the sample size $n$ constant, every finite list of MLE coordi… ▽ More We study the distribution of the maximum likelihood estimate (MLE) in high-dimensional logistic models, extending the recent results from Sur (2019) to the case where the Gaussian covariates may have an arbitrary covariance structure. We prove that in the limit of large problems holding the ratio between the number $p$ of covariates and the sample size $n$ constant, every finite list of MLE coordinates follows a multivariate normal distribution. Concretely, the $j$th coordinate $\hat βべーた_j$ of the MLE is asymptotically normally distributed with mean $αあるふぁ_\star βべーた_j$ and standard deviation $σしぐま_\star/τたう_j$; here, $βべーた_j$ is the value of the true regression coefficient, and $τたう_j$ the standard deviation of the $j$th predictor conditional on all the others. The numerical parameters $αあるふぁ_\star > 1$ and $σしぐま_\star$ only depend upon the problem dimensionality $p/n$ and the overall signal strength, and can be accurately estimated. Our results imply that the MLE's magnitude is biased upwards and that the MLE's standard deviation is greater than that predicted by classical theory. We present a series of experiments on simulated and real data showing excellent agreement with the theory. △ Less

Submitted 4 January, 2023; v1 submitted 25 January, 2020; originally announced January 2020.

Journal ref: Bernoulli 28 (3) 1835-1861, August 2022

arXiv:1804.09753 [pdf, other]

The phase transition for the existence of the maximum likelihood estimate in high-dimensional logistic regression

Authors: Emmanuel J. Candes, Pragya Sur

Abstract: This paper rigorously establishes that the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression models with Gaussian covariates undergoes a sharp `phase transition'. We introduce an explicit boundary curve $h_{\text{MLE}}$, parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients, with the following proper… ▽ More This paper rigorously establishes that the existence of the maximum likelihood estimate (MLE) in high-dimensional logistic regression models with Gaussian covariates undergoes a sharp `phase transition'. We introduce an explicit boundary curve $h_{\text{MLE}}$, parameterized by two scalars measuring the overall magnitude of the unknown sequence of regression coefficients, with the following property: in the limit of large sample sizes $n$ and number of features $p$ proportioned in such a way that $p/n \rightarrow κかっぱ$, we show that if the problem is sufficiently high dimensional in the sense that $κかっぱ> h_{\text{MLE}}$, then the MLE does not exist with probability one. Conversely, if $κかっぱ< h_{\text{MLE}}$, the MLE asymptotically exists with probability one. △ Less

Submitted 25 April, 2018; originally announced April 2018.

Comments: 15 pages, 2 figures

arXiv:1803.06964 [pdf, other]

doi 10.1073/pnas.1810420116

A modern maximum-likelihood theory for high-dimensional logistic regression

Authors: Pragya Sur, Emmanuel J. Candes

Abstract: Every student in statistics or data science learns early on that when the sample size largely exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates which are used for the purpose of statistical inference; for instance, to produce p-values for testi… ▽ More Every student in statistics or data science learns early on that when the sample size largely exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates which are used for the purpose of statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large sample asymptotics, we are often told that we are on reasonably safe grounds when $n$ is large in such a way that $n \ge 5p$ or $n \ge 10p$. This paper shows that this is far from the case, and consequently, inferences routinely produced by common software packages are often unreliable. Consider a logistic model with independent features in which $n$ and $p$ become increasingly large in a fixed ratio. Then we show that (1) the MLE is biased, (2) the variability of the MLE is far greater than classically predicted, and (3) the commonly used likelihood-ratio test (LRT) is not distributed as a chi-square. The bias of the MLE is extremely problematic as it yields completely wrong predictions for the probability of a case based on observed values of the covariates. We develop a new theory, which asymptotically predicts (1) the bias of the MLE, (2) the variability of the MLE, and (3) the distribution of the LRT. We empirically also demonstrate that these predictions are extremely accurate in finite samples. Further, an appealing feature is that these novel predictions depend on the unknown sequence of regression coefficients only through a single scalar, the overall strength of the signal. This suggests very concrete procedures to adjust inference; we describe one such procedure learning a single parameter from data and producing accurate inference △ Less

Submitted 16 June, 2018; v1 submitted 19 March, 2018; originally announced March 2018.

Comments: 29 pages, 14 figures, 4 tables

arXiv:1706.01191 [pdf, other]

The Likelihood Ratio Test in High-Dimensional Logistic Regression Is Asymptotically a Rescaled Chi-Square

Authors: Pragya Sur, Yuxin Chen, Emmanuel J. Candès

Abstract: Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we ha… ▽ More Logistic regression is used thousands of times a day to fit data, predict future outcomes, and assess the statistical significance of explanatory variables. When used for the purpose of statistical inference, logistic models produce p-values for the regression coefficients by using an approximation to the distribution of the likelihood-ratio test. Indeed, Wilks' theorem asserts that whenever we have a fixed number $p$ of variables, twice the log-likelihood ratio (LLR) $2Λらむだ$ is distributed as a $χかい^2_k$ variable in the limit of large sample sizes $n$; here, $k$ is the number of variables being tested. In this paper, we prove that when $p$ is not negligible compared to $n$, Wilks' theorem does not hold and that the chi-square approximation is grossly incorrect; in fact, this approximation produces p-values that are far too small (under the null hypothesis). Assume that $n$ and $p$ grow large in such a way that $p/n\rightarrowκかっぱ$ for some constant $κかっぱ< 1/2$. We prove that for a class of logistic models, the LLR converges to a rescaled chi-square, namely, $2Λらむだ~\stackrel{\mathrm{d}}{\rightarrow}~αあるふぁ(κかっぱ)χかい_k^2$, where the scaling factor $αあるふぁ(κかっぱ)$ is greater than one as soon as the dimensionality ratio $κかっぱ$ is positive. Hence, the LLR is larger than classically assumed. For instance, when $κかっぱ=0.3$, $αあるふぁ(κかっぱ)\approx1.5$. In general, we show how to compute the scaling factor by solving a nonlinear system of two equations with two unknowns. Our mathematical arguments are involved and use techniques from approximate message passing theory, non-asymptotic random matrix theory and convex geometry. We also complement our mathematical study by showing that the new limiting distribution is accurate for finite sample sizes. Finally, all the results from this paper extend to some other regression models such as the probit regression model. △ Less

Submitted 5 June, 2017; originally announced June 2017.

Comments: 58 pages, 7 figures

arXiv:1309.0579 [pdf]

Modeling Bimodal Discrete Data Using Conway-Maxwell-Poisson Mixture Models

Authors: Pragya Sur, Galit Shmueli, Smarajit Bose, Paromita Dubey

Abstract: Bimodal truncated count distributions are frequently observed in aggregate survey data and in user ratings when respondents are mixed in their opinion. They also arise in censored count data, where the highest category might create an additional mode. Modeling bimodal behavior in discrete data is useful for various purposes, from comparing shapes of different samples (or survey questions) to predi… ▽ More Bimodal truncated count distributions are frequently observed in aggregate survey data and in user ratings when respondents are mixed in their opinion. They also arise in censored count data, where the highest category might create an additional mode. Modeling bimodal behavior in discrete data is useful for various purposes, from comparing shapes of different samples (or survey questions) to predicting future ratings by new raters. The Poisson distribution is the most common distribution for fitting count data and can be modified to achieve mixtures of truncated Poisson distributions. However, it is suitable only for modeling equi-dispersed distributions and is limited in its ability to capture bimodality. The Conway-Maxwell-Poisson (CMP) distribution is a two-parameter generalization of the Poisson distribution that allows for over- and under-dispersion. In this work, we propose a mixture of CMPs for capturing a wide range of truncated discrete data, which can exhibit unimodal and bimodal behavior. We present methods for estimating the parameters of a mixture of two CMP distributions using an EM approach. Our approach introduces a special two-step optimization within the M step to estimate multiple parameters. We examine computational and theoretical issues. The methods are illustrated for modeling ordered rating data as well as truncated count data, using simulated and real examples. △ Less

Submitted 23 January, 2014; v1 submitted 2 September, 2013; originally announced September 2013.

Comments: 29 pages

Showing 1–17 of 17 results for author: Sur, P