-
Volatility Forecasting Using Similarity-based Parameter Correction and Aggregated Shock Information
Authors:
David P. Lundquist,
Daniel J. Eck
Abstract:
We develop a procedure for forecasting the volatility of a time series immediately following a news shock. Adapting the similarity-based framework of Lin and Eck (2020), we exploit series that have experienced similar shocks. We aggregate their shock-induced excess volatilities by positing the shocks to be affine functions of exogenous covariates. The volatility shocks are modeled as random effects and estimated as fixed effects. The aggregation of these estimates is done in service of adjusting the $h$-step-ahead GARCH forecast of the time series under study by an additive term. The adjusted and unadjusted forecasts are evaluated using the unobservable but easily estimated realized volatility (RV). A real-world application is provided, as are simulation results suggesting the conditions and hyperparameters under which our method thrives.
Submitted 6 August, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Comparing baseball players across eras via novel Full House Modeling
Authors:
Shen Yan,
Adrian Burgos Jr.,
Christopher Kinson,
Daniel J. Eck
Abstract:
A new methodological framework suitable for era-adjusting baseball statistics is developed in this article. Within this methodological framework specific models are motivated. We call these models Full House Models. Full House Models work by balancing the achievements of Major League Baseball (MLB) players within a given season and the size of the MLB talent pool from which a player came. We demonstrate the utility of Full House Models in an application of comparing baseball players' performance statistics across eras. Our results reveal a new ranking of baseball's greatest players which includes several modern players among the top all-time players. Modern players are elevated by Full House Modeling because they come from a larger talent pool. Sensitivity and multiverse analyses which investigate how results change with changes to modeling inputs, including the estimate of the talent pool, are presented.
Submitted 24 April, 2024; v1 submitted 22 July, 2022;
originally announced July 2022.
-
Robust model-based estimation for binary outcomes in genomics studies
Authors:
Suyoung Park,
Alexander E. Lipka,
Daniel J. Eck
Abstract:
In quantitative genetics, statistical modeling techniques are used to facilitate advances in the understanding of which genes underlie agronomically important traits and have enabled the use of genome-wide markers to accelerate genetic gain. The logistic regression model is a statistically optimal approach for quantitative genetics analysis of binary traits. To encourage more widespread use of the logistic model in such analyses, efforts need to be made to address separation, which occurs whenever a specific combination of predictors can perfectly predict the value of a binary trait. Data separation is especially prevalent in applications where the number of predictors is near the sample size. In this study we motivate a logistic model that is robust to separation, and we develop a novel prediction procedure for this robust model that is appropriate when separation exists. We show that this robust model offers superior inferences and comparable predictions to existing approaches while remaining true to the logistic model. This is an improvement over previously existing approaches, which treat separation as a modeling shortcoming and not an antagonistic data configuration. Previous approaches, therefore, change the modeling paradigm to accommodate separation, which was problematic for logistic models before our robust model existed. Our comparisons are conducted on several didactic examples and a genomics study on the kernel color in maize. The ensuing analyses reaffirm the billed superior inferences and comparable predictive performance of our robust model. Therefore, our approach provides scientists with an appropriate statistical modeling framework for analyses involving agronomically important binary traits.
Submitted 28 October, 2021;
originally announced October 2021.
-
Do Most Students Need In-Person Lectures? A Study of a Large Statistics Class
Authors:
Ellen S. Fireman,
Zachary S. Donnini,
Michael B. Weissman,
Daniel J. Eck
Abstract:
Over 1100 students over four semesters were given the option of taking an introductory undergraduate statistics class either by in-person attendance in lectures or by taking exactly the same class (same instructor, recorded lectures, homework, blind grading, website, etc.) without the in-person lectures. Roughly equal numbers of students chose each option. The online lectures were available to all. Attendance by online students was rare. The online students did slightly better on computer-graded exams. The causal effect of choosing only online lectures was estimated by adjusting for measured confounders, of which the incoming ACT math scores turned out to be most important, using four standard methods. The four nearly identical point estimates remained positive but were small and not statistically significant at the 95% confidence level. Sensitivity analysis indicated that unmeasured confounding was unlikely to be large but might plausibly reduce the point estimate to zero. No statistically significant differences were found in preliminary comparisons of effects on females/males, U.S./non-U.S. citizens, freshmen/non-freshmen, and lower-scoring/higher-scoring math ACT groups.
Submitted 7 April, 2023; v1 submitted 17 January, 2021;
originally announced January 2021.
-
Minimizing post-shock forecasting error through aggregation of outside information
Authors:
Jilei Lin,
Daniel J. Eck
Abstract:
We develop a forecasting methodology for providing credible forecasts for time series that have recently undergone a shock. We achieve this by borrowing knowledge from other time series that have undergone similar shocks for which post-shock outcomes are observed. Three shock effect estimators are motivated with the aim of minimizing average forecast risk. We propose risk-reduction propositions that provide conditions that establish when our methodology works. Bootstrap and leave-one-out cross validation procedures are provided to prospectively assess the performance of our methodology. Several simulated data examples and a real data example of forecasting ConocoPhillips stock price are provided for verification and illustration.
Submitted 26 August, 2020;
originally announced August 2020.
-
SEAM methodology for context-rich player matchup evaluations
Authors:
Julia Wapner,
David Dalpiaz,
Daniel J. Eck
Abstract:
We develop the SEAM (synthetic estimated average matchup) method for describing batter versus pitcher matchups in baseball. We first estimate the distribution of balls put into play by a batter facing a pitcher, called the empirical spray chart distribution. Many individual matchups have a sample size that is too small to be reliable for use in predicting future outcomes. Synthetic versions of the batter and pitcher under consideration are constructed in order to alleviate these concerns. Weights governing how much influence these synthetic players have on the overall estimated spray chart distribution are constructed to minimize expected mean square error. We provide a Shiny web application that allows users to visualize and evaluate any batter-pitcher matchup that has occurred or could have occurred during the Statcast era (specifically 2017-present). This methodology and web application could be used to determine defensive alignments, lineup construction, or pitcher selection through estimation of spray densities based on any input matchup. One can access this web application at https://seam.stat.illinois.edu/. The computational speed with which the method calculates the spray densities allows the app to display the visualizations for any input almost instantly. Therefore, SEAM offers computationally fast distributional interpretations of dependent matchup data.
Submitted 20 August, 2022; v1 submitted 15 May, 2020;
originally announced May 2020.
-
General model-free weighted envelope estimation
Authors:
Daniel J. Eck
Abstract:
Envelope methodology is succinctly pitched as a class of procedures for increasing efficiency in multivariate analyses without altering traditional objectives (Cook, 2018, first sentence of page 1). This description is true with the additional caveat that the efficiency gains obtained by envelope methodology are mitigated by model selection volatility to an unknown degree. The bulk of the current envelope methodology literature does not account for this added variance that arises from the uncertainty in model selection. Recent strides to account for model selection volatility have been made on two fronts: 1) development of a weighted envelope estimator, in the context of multivariate regression, to account for this variability directly; 2) development of a model selection criterion that facilitates consistent estimation of the correct envelope model for more general settings. In this paper, we unify these two directions and provide weighted envelope estimators that directly account for the variability associated with model selection and are appropriate for general multivariate estimation settings for vector valued parameters. Our weighted estimation technique provides practitioners with robust and useful variance reduction in finite samples. Theoretical justification is given for our estimators, and the validity of a nonparametric bootstrap procedure for estimating their asymptotic variance is established. Simulation studies and a real data analysis support our claims and demonstrate the advantage of our weighted envelope estimator when model selection variability is present.
Submitted 3 February, 2020;
originally announced February 2020.
-
Efficient and minimal length parametric conformal prediction regions
Authors:
Daniel J. Eck,
Forrest W. Crawford
Abstract:
Conformal prediction methods construct prediction regions for iid data that are valid in finite samples. We provide two parametric conformal prediction regions that are applicable for a wide class of continuous statistical models. This class of statistical models includes generalized linear models (GLMs) with continuous outcomes. Our parametric conformal prediction regions possess finite sample validity, even when the model is misspecified, and are asymptotically of minimal length when the model is correctly specified. The first parametric conformal prediction region is constructed through binning of the predictor space, guarantees finite-sample local validity and is asymptotically minimal at the $\sqrt{\log(n)/n}$ rate when the dimension $d$ of the predictor space is one or two, and converges at the $O\{(\log(n)/n)^{1/d}\}$ rate when $d > 2$. The second parametric conformal prediction region is constructed by transforming the outcome variable to a common distribution via the probability integral transform, guarantees finite-sample marginal validity, and is asymptotically minimal at the $\sqrt{\log(n)/n}$ rate. We develop a novel concentration inequality for maximum likelihood estimation that induces these convergence rates. We analyze prediction region coverage properties, large-sample efficiency, and robustness properties of four methods for constructing conformal prediction intervals for GLMs: fully nonparametric kernel-based conformal, residual based conformal, normalized residual based conformal, and parametric conformal which uses the assumed GLM density as a conformity measure. Extensive simulations compare these approaches to standard asymptotic prediction regions. The utility of the parametric conformal prediction region is demonstrated in an application to interval prediction of glycosylated hemoglobin levels, a blood measurement used to diagnose diabetes.
Submitted 25 October, 2019; v1 submitted 9 May, 2019;
originally announced May 2019.
-
Challenging nostalgia and performance metrics in baseball
Authors:
Daniel J. Eck
Abstract:
We show that the great baseball players that started their careers before 1950 are overrepresented among rankings of baseball's all-time greatest players. The year 1950 coincides with the decennial US Census that is closest to when Major League Baseball (MLB) was integrated in 1947. We also show that performance metrics used to compare players have substantial era biases that favor players who started their careers before 1950. In showing that these players are overrepresented, no individual statistics or era adjusted metrics are used. Instead, we argue that the eras in which players played are fundamentally different and are not comparable. In particular, there were significantly fewer eligible MLB players available at and before 1950. As a consequence of this and other differences across eras, we argue that popular opinion, performance metrics, and expert opinion overinclude players that started their careers before 1950 in their rankings of baseball's all-time greatest players.
Submitted 17 June, 2019; v1 submitted 18 October, 2018;
originally announced October 2018.
-
Randomization for the susceptibility effect of an infectious disease intervention
Authors:
Daniel J. Eck,
Olga Morozova,
Forrest W. Crawford
Abstract:
Randomized trials of infectious disease interventions, such as vaccines, often focus on groups of connected or potentially interacting individuals. When the pathogen of interest is transmissible between study subjects, interference may occur: individual infection outcomes may depend on treatments received by others. Epidemiologists have defined the primary causal effect of interest -- called the "susceptibility effect" -- as a contrast in infection risk under treatment versus no treatment, while holding exposure to infectiousness constant. A related quantity -- the "direct effect" -- is defined as an unconditional contrast between the infection risk under treatment versus no treatment. The purpose of this paper is to show that under a widely recommended randomization design, the direct effect may fail to recover the sign of the true susceptibility effect of the intervention in a randomized trial when outcomes are contagious. The analytical approach uses structural features of infectious disease transmission to define the susceptibility effect. A new probabilistic coupling argument reveals stochastic dominance relations between potential infection outcomes under different treatment allocations. The results suggest that estimating the direct effect under randomization may provide misleading inferences about the effect of an intervention -- such as a vaccine -- when outcomes are contagious.
Submitted 9 December, 2019; v1 submitted 16 August, 2018;
originally announced August 2018.
-
Estimating the size of a hidden finite set: large-sample behavior of estimators
Authors:
Si Cheng,
Daniel J. Eck,
Forrest W. Crawford
Abstract:
A finite set is "hidden" if its elements are not directly enumerable or if its size cannot be ascertained via a deterministic query. In public health, epidemiology, demography, ecology and intelligence analysis, researchers have developed a wide variety of indirect statistical approaches, under different models for sampling and observation, for estimating the size of a hidden set. Some methods make use of random sampling with known or estimable sampling probabilities, and others make structural assumptions about relationships (e.g. ordering or network information) between the elements that comprise the hidden set. In this review, we describe models and methods for learning about the size of a hidden finite set, with special attention to asymptotic properties of estimators. We study the properties of these methods under two asymptotic regimes, "infill" in which the number of fixed-size samples increases, but the population size remains constant, and "outfill" in which the sample size and population size grow together. Statistical properties under these two regimes can be dramatically different.
Submitted 15 October, 2019; v1 submitted 14 August, 2018;
originally announced August 2018.
-
Computationally efficient likelihood inference in exponential families when the maximum likelihood estimator does not exist
Authors:
Daniel J. Eck,
Charles J. Geyer
Abstract:
In a regular full exponential family, the maximum likelihood estimator (MLE) need not exist in the traditional sense. However, the MLE may exist in the completion of the exponential family. Existing algorithms for finding the MLE in the completion solve many linear programs; they are slow in small problems and too slow for large problems. We provide new, fast, and scalable methodology for finding the MLE in the completion of the exponential family. This methodology is based on conventional maximum likelihood computations which come close, in a sense, to finding the MLE in the completion of the exponential family. These conventional computations construct a likelihood maximizing sequence of canonical parameter values which goes uphill on the likelihood function until it meets a convergence criterion. Nonexistence of the MLE in this context results from a degeneracy of the canonical statistic of the exponential family: the canonical statistic is on the boundary of its support. There is a correspondence between this boundary and the null eigenvectors of the Fisher information matrix. Convergence of Fisher information along a likelihood maximizing sequence follows from cumulant generating function (CGF) convergence along a likelihood maximizing sequence, conditions for which are given. This allows for the construction of necessarily one-sided confidence intervals for mean value parameters when the MLE exists in the completion. We demonstrate our methodology on three examples in the main text and three additional examples in the Appendix. We show that when the MLE exists in the completion of the exponential family, our methodology provides statistical inference that is much faster than existing techniques.
Submitted 25 November, 2020; v1 submitted 29 March, 2018;
originally announced March 2018.
-
Multivariate Design of Experiments for Engineering Dimensional Analysis
Authors:
Daniel J. Eck,
Christopher J. Nachtsheim,
R. Dennis Cook,
Thomas A. Albrecht
Abstract:
We consider the design of dimensional analysis experiments when there is more than a single response. We first give a brief overview of dimensional analysis experiments and the dimensional analysis (DA) procedure. The validity of the DA method for univariate responses was established by the Buckingham $Π$-Theorem in the early 20th century. We extend the theorem to the multivariate case, develop basic criteria for multivariate design of DA and give guidelines for design construction. Finally, we illustrate the construction of designs for DA experiments for an example involving the design of a heat exchanger.
Submitted 7 August, 2018; v1 submitted 4 August, 2017;
originally announced August 2017.
-
Bootstrapping for multivariate linear regression models
Authors:
Daniel J. Eck
Abstract:
The multivariate linear regression model is an important tool for investigating relationships between several response variables and several predictor variables. The primary interest is in inference about the unknown regression coefficient matrix. We propose multivariate bootstrap techniques as a means for making inferences about the unknown regression coefficient matrix. These bootstrapping techniques are extensions of those developed in Freedman (1981), which are only appropriate for univariate responses. Extensions to the multivariate linear regression model are made without proof. We formalize this extension and prove its validity. A real data example and two simulated data examples which offer some finite sample verification of our theoretical results are provided.
Submitted 12 September, 2017; v1 submitted 24 April, 2017;
originally announced April 2017.
-
Combining Envelope Methodology and Aster Models for Variance Reduction in Life History Analyses
Authors:
Daniel J. Eck,
Charles J. Geyer,
R. Dennis Cook
Abstract:
Precise estimation of expected Darwinian fitness, the expected lifetime number of offspring of an organism, is a central component of life history analysis. The aster model serves as a defensible statistical model for distributions of Darwinian fitness. The aster model is equipped to incorporate the major life stages an organism travels through which separately may affect Darwinian fitness. Envelope methodology reduces asymptotic variability by establishing a link between unknown parameters of interest and the asymptotic covariance matrices of their estimators. It is known both theoretically and in applications that incorporation of envelope methodology reduces asymptotic variability. We develop an envelope framework, including a new envelope estimator, that is appropriate for aster analyses. The level of precision provided from our methods allows researchers to draw stronger conclusions about the driving forces of Darwinian fitness from their life history analyses than they could with the aster model alone. Our methods are illustrated on a simulated dataset and a life history analysis of Mimulus guttatus flowers is provided. Useful variance reduction is obtained in both analyses.
Submitted 27 February, 2018; v1 submitted 26 January, 2017;
originally announced January 2017.
-
Weighted envelope estimation to handle variability in model selection
Authors:
Daniel J. Eck,
R. Dennis Cook
Abstract:
Envelope methodology can provide substantial efficiency gains in multivariate statistical problems, but in some applications the estimation of the envelope dimension can induce selection volatility that may mitigate those gains. Current envelope methodology does not account for the added variance that can result from this selection. In this article, we circumvent dimension selection volatility through the development of a weighted envelope estimator. Theoretical justification is given for our estimator and validity of the residual bootstrap for estimating its asymptotic variance is established. A simulation study and an analysis on a real data set illustrate the utility of our weighted envelope estimator.
Submitted 14 April, 2017; v1 submitted 3 January, 2017;
originally announced January 2017.