Search | arXiv e-print repository

Tail calibration of probabilistic forecasts

Authors: Sam Allen, Jonathan Koh, Johan Segers, Johanna Ziegel

Abstract: Probabilistic forecasts comprehensively describe the uncertainty in the unknown future outcome, making them essential for decision making and risk management. While several methods have been introduced to evaluate probabilistic forecasts, existing evaluation techniques are ill-suited to the evaluation of tail properties of such forecasts. However, these tail properties are often of particular interest to forecast users due to the severe impacts caused by extreme outcomes. In this work, we introduce a general notion of tail calibration for probabilistic forecasts, which allows forecasters to assess the reliability of their predictions for extreme outcomes. We study the relationships between tail calibration and standard notions of forecast calibration, and discuss connections to peaks-over-threshold models in extreme value theory. Diagnostic tools are introduced and applied in a case study on European precipitation forecasts.

Submitted 3 July, 2024; originally announced July 2024.

arXiv:2404.18678 [pdf, other]

Sequential model confidence sets

Authors: Sebastian Arnold, Georgios Gavrilopoulos, Benedikt Schulz, Johanna Ziegel

Abstract: In most prediction and estimation situations, scientists consider various statistical models for the same problem, and naturally want to select amongst the best. Hansen et al. (2011) provide a powerful solution to this problem by the so-called model confidence set, a subset of the original set of available models that contains the best models with a given level of confidence. Importantly, model confidence sets respect the underlying selection uncertainty by being flexible in size. However, they presuppose a fixed sample size which stands in contrast to the fact that model selection and forecast evaluation are inherently sequential tasks where we successively collect new data and where the decision to continue or conclude a study may depend on the previous outcomes. In this article, we extend model confidence sets sequentially over time by relying on sequential testing methods. Recently, e-processes and confidence sequences have been introduced as new, safe methods for assessing statistical evidence. Sequential model confidence sets allow to continuously monitor the models' performances and come with time-uniform, nonasymptotic coverage guarantees.

Submitted 8 May, 2024; v1 submitted 29 April, 2024; originally announced April 2024.

arXiv:2311.14122 [pdf, other]

Decompositions of the mean continuous ranked probability score

Authors: Sebastian Arnold, Eva-Maria Walz, Johanna Ziegel, Tilmann Gneiting

Abstract: The continuous ranked probability score (CRPS) is the most commonly used scoring rule in the evaluation of probabilistic forecasts for real-valued outcomes. To assess and rank forecasting methods, researchers compute the mean CRPS over given sets of forecast situations, based on the respective predictive distributions and outcomes. We propose a new, isotonicity-based decomposition of the mean CRPS into interpretable components that quantify miscalibration (MSC), discrimination ability (DSC), and uncertainty (UNC), respectively. In a detailed theoretical analysis, we compare the new approach to empirical decompositions proposed earlier, generalize to population versions, analyse their properties and relationships, and relate to a hierarchy of notions of calibration. The isotonicity-based decomposition guarantees the nonnegativity of the components and quantifies calibration in a sense that is stronger than for other types of decompositions, subject to the nondegeneracy of empirical decompositions. We illustrate the usage of the isotonicity-based decomposition in case studies from weather prediction and machine learning.

Submitted 23 November, 2023; originally announced November 2023.

arXiv:2307.05846 [pdf, other]

Assessing the calibration of multivariate probabilistic forecasts

Authors: Sam Allen, Johanna Ziegel, David Ginsbourger

Abstract: Rank and PIT histograms are established tools to assess the calibration of probabilistic forecasts. They not only check whether an ensemble forecast is calibrated, but they also reveal what systematic biases (if any) are present in the forecasts. Several extensions of rank histograms have been proposed to evaluate the calibration of probabilistic forecasts for multivariate outcomes. These extensions introduce a so-called pre-rank function that condenses the multivariate forecasts and observations into univariate objects, from which a standard rank histogram can be produced. Existing pre-rank functions typically aim to preserve as much information as possible when condensing the multivariate forecasts and observations into univariate objects. Although this is sensible when conducting statistical tests for multivariate calibration, it can hinder the interpretation of the resulting histograms. In this paper, we demonstrate that there are few restrictions on the choice of pre-rank function, meaning forecasters can choose a pre-rank function depending on what information they want to extract from their forecasts. We introduce the concept of simple pre-rank functions, and provide examples that can be used to assess the location, scale, and dependence structure of multivariate probabilistic forecasts, as well as pre-rank functions tailored to the evaluation of probabilistic spatial field forecasts. The simple pre-rank functions that we introduce are easy to interpret, easy to implement, and they deliberately provide complementary information, meaning several pre-rank functions can be employed to achieve a more complete understanding of multivariate forecast performance. We then discuss how e-values can be employed to formally test for multivariate calibration over time. This is demonstrated in an application to wind speed forecasting using the EUPPBench post-processing benchmark data set.

Submitted 11 July, 2023; originally announced July 2023.

arXiv:2301.02692 [pdf, other]

Isotonic Recalibration under a Low Signal-to-Noise Ratio

Authors: Mario V. Wüthrich, Johanna Ziegel

Abstract: Insurance pricing systems should fulfill the auto-calibration property to ensure that there is no systematic cross-financing between different price cohorts. Often, regression models are not auto-calibrated. We propose to apply isotonic recalibration to a given regression model to ensure auto-calibration. Our main result proves that under a low signal-to-noise ratio, this isotonic recalibration step leads to explainable pricing systems because the resulting isotonically recalibrated regression functions have a low complexity.

Submitted 6 January, 2023; originally announced January 2023.

Comments: 21 pages, 9 figures

arXiv:2212.08376 [pdf, other]

Easy Uncertainty Quantification (EasyUQ): Generating Predictive Distributions from Single-valued Model Output

Authors: Eva-Maria Walz, Alexander Henzi, Johanna Ziegel, Tilmann Gneiting

Abstract: How can we quantify uncertainty if our favorite computational tool - be it a numerical, a statistical, or a machine learning approach, or just any computer model - provides single-valued output only? In this article, we introduce the Easy Uncertainty Quantification (EasyUQ) technique, which transforms real-valued model output into calibrated statistical distributions, based solely on training data of model output-outcome pairs, without any need to access model input. In its basic form, EasyUQ is a special case of the recently introduced Isotonic Distributional Regression (IDR) technique that leverages the pool-adjacent-violators algorithm for nonparametric isotonic regression. EasyUQ yields discrete predictive distributions that are calibrated and optimal in finite samples, subject to stochastic monotonicity. The workflow is fully automated, without any need for tuning. The Smooth EasyUQ approach supplements IDR with kernel smoothing, to yield continuous predictive distributions that preserve key properties of the basic form, including both stochastic monotonicity with respect to the original model output and asymptotic consistency. For the selection of kernel parameters, we introduce multiple one-fit grid search, a computationally much less demanding approximation to leave-one-out cross-validation. We use simulation examples and forecast data from weather prediction to illustrate the techniques. In a study of benchmark problems from machine learning, we show how EasyUQ and Smooth EasyUQ can be integrated into the workflow of neural network learning and hyperparameter tuning, and find EasyUQ to be competitive with conformal prediction, as well as more elaborate input-based approaches.

Submitted 24 July, 2023; v1 submitted 16 December, 2022; originally announced December 2022.

arXiv:2209.04872 [pdf, other]

Weighted verification tools to evaluate univariate and multivariate forecasts for high-impact weather events

Authors: Sam Allen, Jonas Bhend, Olivia Martius, Johanna Ziegel

Abstract: To mitigate the impacts associated with adverse weather conditions, meteorological services issue weather warnings to the general public. These warnings rely heavily on forecasts issued by underlying prediction systems. When deciding which prediction system(s) to utilise to construct warnings, it is important to compare systems in their ability to forecast the occurrence and severity of extreme weather events. However, evaluating forecasts for extreme events is known to be a challenging task. This is exacerbated further by the fact that high-impact weather often manifests as a result of several confounding features, a realisation that has led to considerable research on so-called compound weather events. Both univariate and multivariate methods are therefore required to evaluate forecasts for high-impact weather. In this paper, we discuss weighted verification tools, which allow particular outcomes to be emphasised during forecast evaluation. We review and compare different approaches to construct weighted scoring rules, both in a univariate and multivariate setting, and we leverage existing results on weighted scores to introduce weighted probability integral transform (PIT) histograms, allowing forecast calibration to be assessed conditionally on particular outcomes having occurred. To illustrate the practical benefit afforded by these weighted verification tools, they are employed in a case study to evaluate forecasts for extreme heat events issued by the Swiss Federal Office of Meteorology and Climatology (MeteoSwiss).

Submitted 11 September, 2022; originally announced September 2022.

arXiv:2209.00991 [pdf, other]

E-backtesting

Authors: Qiuqi Wang, Ruodu Wang, Johanna Ziegel

Abstract: In the recent Basel Accords, the Expected Shortfall (ES) replaces the Value-at-Risk (VaR) as the standard risk measure for market risk in the banking sector, making it the most important risk measure in financial regulation. One of the most challenging tasks in risk modeling practice is to backtest ES forecasts provided by financial institutions. To design a model-free backtesting procedure for ES, we make use of the recently developed techniques of e-values and e-processes. Backtest e-statistics are introduced to formulate e-processes for risk measure forecasts, and unique forms of backtest e-statistics for VaR and ES are characterized using recent results on identification functions. For a given backtest e-statistic, a few criteria for optimally constructing the e-processes are studied. The proposed method can be naturally applied to many other risk measures and statistical quantities. We conduct extensive simulation studies and data analysis to illustrate the advantages of the model-free backtesting method, and compare it with the ones in the literature.

Submitted 12 August, 2024; v1 submitted 26 August, 2022; originally announced September 2022.

arXiv:2206.07588 [pdf, ps, other]

Characteristic kernels on Hilbert spaces, Banach spaces, and on sets of measures

Authors: Johanna Ziegel, David Ginsbourger, Lutz Dümbgen

Abstract: We present new classes of positive definite kernels on non-standard spaces that are integrally strictly positive definite or characteristic. In particular, we discuss radial kernels on separable Hilbert spaces, and introduce broad classes of kernels on Banach spaces and on metric spaces of strong negative type. The general results are used to give explicit classes of kernels on separable $L^p$ spaces and on sets of measures.

Submitted 15 June, 2022; originally announced June 2022.

arXiv:2204.05680 [pdf, ps, other]

Anytime-valid sequential testing for elicitable functionals via supermartingales

Authors: Philippe Casgrain, Martin Larsson, Johanna Ziegel

Abstract: We design sequential tests for a large class of nonparametric null hypotheses based on elicitable and identifiable functionals. Such functionals are defined in terms of scoring functions and identification functions, which are ideal building blocks for constructing nonnegative supermartingales under the null. This in turn yields sequential tests via Ville's inequality. Using regret bounds from Online Convex Optimization, we obtain rigorous guarantees on the asymptotic power of the tests for a wide range of alternative hypotheses. Our results allow for bounded and unbounded data distributions, assuming that a sub-$\psi$ tail bound is satisfied.

Submitted 4 June, 2023; v1 submitted 12 April, 2022; originally announced April 2022.

Comments: 36 pages, 3 figures

MSC Class: 62L05; 62L10; 62L15 ACM Class: G.3

arXiv:2203.04065 [pdf, other]

doi 10.1093/biomet/asac068

Honest calibration assessment for binary outcome predictions

Authors: Timo Dimitriadis, Lutz Duembgen, Alexander Henzi, Marius Puke, Johanna Ziegel

Abstract: Probability predictions from binary regressions or machine learning methods ought to be calibrated: If an event is predicted to occur with probability $x$, it should materialize with approximately that frequency, which means that the so-called calibration curve $p(\cdot)$ should equal the identity, $p(x) = x$ for all $x$ in the unit interval. We propose honest calibration assessment based on novel confidence bands for the calibration curve, which are valid only subject to the natural assumption of isotonicity. Besides testing the classical goodness-of-fit null hypothesis of perfect calibration, our bands facilitate inverted goodness-of-fit tests whose rejection allows for the sought-after conclusion of a sufficiently well specified model. We show that our bands have a finite sample coverage guarantee, are narrower than existing approaches, and adapt to the local smoothness of the calibration curve $p$ and the local variance of the binary observations. In an application to model predictions of an infant having a low birth weight, the bounds give informative insights on model calibration.

Submitted 2 November, 2022; v1 submitted 8 March, 2022; originally announced March 2022.

arXiv:2203.00426 [pdf, other]

A safe Hosmer-Lemeshow test

Authors: Alexander Henzi, Marius Puke, Timo Dimitriadis, Johanna Ziegel

Abstract: This article proposes an alternative to the Hosmer-Lemeshow (HL) test for evaluating the calibration of probability forecasts for binary events. The approach is based on e-values, a new tool for hypothesis testing. An e-value is a random variable with expected value less than or equal to one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a p-value. Our test uses online isotonic regression to estimate the calibration curve as a `betting strategy' against the null hypothesis. We show that the test has power against essentially all alternatives, which makes it theoretically superior to the HL test and at the same time resolves the well-known instability problem of the latter. A simulation study shows that a feasible version of the proposed eHL test can detect slight miscalibrations in practically relevant sample sizes, but trades its universal validity and power guarantees against a reduced empirical power compared to the HL test in a classical simulation setup. We illustrate our test on recalibrated predictions for credit card defaults during the Taiwan credit card crisis, where the classical HL test delivers equivocal results.

Submitted 27 February, 2023; v1 submitted 1 March, 2022; originally announced March 2022.

arXiv:2202.12732 [pdf, other]

Evaluating forecasts for high-impact events using transformed kernel scores

Authors: Sam Allen, David Ginsbourger, Johanna Ziegel

Abstract: It is informative to evaluate a forecaster's ability to predict outcomes that have a large impact on the forecast user. Although weighted scoring rules have become a well-established tool to achieve this, such scores have been studied almost exclusively in the univariate case, with interest typically placed on extreme events. However, a large impact may also result from events not considered to be extreme from a statistical perspective: the interaction of several moderate events could also generate a high impact. Compound weather events provide a good example of this. To assess forecasts made for high-impact events, this work extends existing results on weighted scoring rules by introducing weighted multivariate scores. To do so, we utilise kernel scores. We demonstrate that the threshold-weighted continuous ranked probability score (twCRPS), arguably the most well-known weighted scoring rule, is a kernel score. This result leads to a convenient representation of the twCRPS when the forecast is an ensemble, and also permits a generalisation that can be employed with alternative kernels, allowing us to introduce, for example, a threshold-weighted energy score and threshold-weighted variogram score. To illustrate the additional information that these weighted multivariate scoring rules provide, results are presented for a case study in which the weighted scores are used to evaluate daily precipitation accumulation forecasts, with particular interest on events that could lead to flooding.

Submitted 25 February, 2022; originally announced February 2022.

arXiv:2109.11761 [pdf, other]

Sequentially valid tests for forecast calibration

Authors: Sebastian Arnold, Alexander Henzi, Johanna F. Ziegel

Abstract: Forecasting and forecast evaluation are inherently sequential tasks. Predictions are often issued on a regular basis, such as every hour, day, or month, and their quality is monitored continuously. However, the classical statistical tools for forecast evaluation are static, in the sense that statistical tests for forecast calibration are only valid if the evaluation period is fixed in advance. Recently, e-values have been introduced as a new, dynamic method for assessing statistical significance. An e-value is a non-negative random variable with expected value at most one under a null hypothesis. Large e-values give evidence against the null hypothesis, and the multiplicative inverse of an e-value is a conservative p-value. E-values are particularly suitable for sequential forecast evaluation, since they naturally lead to statistical tests which are valid under optional stopping. This article proposes e-values for testing probabilistic calibration of forecasts, which is one of the most important notions of calibration. The proposed methods are also more generally applicable for sequential goodness-of-fit testing. We demonstrate that the e-values are competitive in terms of power when compared to extant methods, which do not allow sequential testing. Furthermore, they provide important and useful insights in the evaluation of probabilistic weather forecasts.

Submitted 1 July, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

arXiv:2103.08402 [pdf, other]

Valid sequential inference on probability forecast performance

Authors: Alexander Henzi, Johanna F. Ziegel

Abstract: Probability forecasts for binary events play a central role in many applications. Their quality is commonly assessed with proper scoring rules, which assign forecasts a numerical score such that a correct forecast achieves a minimal expected score. In this paper, we construct e-values for testing the statistical significance of score differences of competing forecasts in sequential settings. E-values have been proposed as an alternative to p-values for hypothesis testing, and they can easily be transformed into conservative p-values by taking the multiplicative inverse. The e-values proposed in this article are valid in finite samples without any assumptions on the data generating processes. They also allow optional stopping, so a forecast user may decide to interrupt evaluation taking into account the available data at any time and still draw statistically valid inference, which is generally not true for classical p-value based tests. In a case study on postprocessing of precipitation forecasts, state-of-the-art forecast dominance tests and e-values lead to the same conclusions.

Submitted 1 July, 2022; v1 submitted 15 March, 2021; originally announced March 2021.

arXiv:2006.09219 [pdf, other]

Distributional (Single) Index Models

Authors: Alexander Henzi, Gian-Reto Kleger, Johanna F. Ziegel

Abstract: A Distributional (Single) Index Model (DIM) is a semi-parametric model for distributional regression, that is, estimation of conditional distributions given covariates. The method is a combination of classical single index models for the estimation of the conditional mean of a response given covariates, and isotonic distributional regression. The model for the index is parametric, whereas the conditional distributions are estimated non-parametrically under a stochastic ordering constraint. We show consistency of our estimators and apply them to a highly challenging data set on the length of stay (LoS) of patients in intensive care units. We use the model to provide skillful and calibrated probabilistic predictions for the LoS of individual patients, that outperform the available methods in the literature.

Submitted 3 August, 2022; v1 submitted 16 June, 2020; originally announced June 2020.

arXiv:1909.03725 [pdf, other]

doi 10.1111/rssb.12450

Isotonic Distributional Regression

Authors: Alexander Henzi, Johanna F. Ziegel, Tilmann Gneiting

Abstract: Isotonic distributional regression (IDR) is a powerful nonparametric technique for the estimation of conditional distributions under order restrictions. In a nutshell, IDR learns conditional distributions that are calibrated, and simultaneously optimal relative to comprehensive classes of relevant loss functions, subject to isotonicity constraints in terms of a partial order on the covariate space. Nonparametric isotonic quantile regression and nonparametric isotonic binary regression emerge as special cases. For prediction, we propose an interpolation method that generalizes extant specifications under the pool adjacent violators algorithm. We recommend the use of IDR as a generic benchmark technique in probabilistic forecast problems, as it does not involve any parameter tuning nor implementation choices, except for the selection of a partial order on the covariate space. The method can be combined with subsample aggregation, with the benefits of smoother regression functions and gains in computational efficiency. In a simulation study, we compare methods for distributional regression in terms of the continuous ranked probability score (CRPS) and $L_2$ estimation error, which are closely linked. In a case study on raw and postprocessed quantitative precipitation forecasts from a leading numerical weather prediction system, IDR is competitive with state of the art techniques.

Submitted 28 September, 2021; v1 submitted 9 September, 2019; originally announced September 2019.

arXiv:1805.09902 [pdf, other]

Generic Conditions for Forecast Dominance

Authors: Fabian Krüger, Johanna F. Ziegel

Abstract: Recent studies have analyzed whether one forecast method dominates another under a class of consistent scoring functions. While the existing literature focuses on empirical tests of forecast dominance, little is known about the theoretical conditions under which one forecast dominates another. To address this question, we derive a new characterization of dominance among forecasts of the mean functional. We present various scenarios under which dominance occurs. Unlike existing results, our results allow for the case that the forecasts' underlying information sets are not nested, and allow for uncalibrated forecasts that suffer, e.g., from model misspecification or parameter estimation error. We illustrate the empirical relevance of our results via data examples from finance and economics.

Submitted 18 December, 2019; v1 submitted 24 May, 2018; originally announced May 2018.

arXiv:1712.05279 [pdf, ps, other]

Strictly proper kernel scores and characteristic kernels on compact spaces

Authors: Ingo Steinwart, Johanna F. Ziegel

Abstract: Strictly proper kernel scores are a well-known tool in probabilistic forecasting, while characteristic kernels have been extensively investigated in the machine learning literature. We first show that both notions coincide, so that insights from one part of the literature can be used in the other. We then show that the metric induced by a characteristic kernel cannot reliably distinguish between distributions that are far apart in the total variation norm as soon as the underlying space of measures is infinite dimensional. In addition, we provide a characterization of characteristic kernels in terms of eigenvalues and -functions and apply this characterization to the case of continuous kernels on (locally) compact spaces. In the compact case we further show that characteristic kernels exist if and only if the space is metrizable. As special cases of our general theory we investigate translation-invariant kernels on compact Abelian groups and isotropic kernels on spheres. The latter are of particular interest for forecast evaluation of probabilistic predictions on spherical domains as frequently encountered in meteorology and climatology.

Submitted 14 December, 2017; originally announced December 2017.

arXiv:1705.04537 [pdf, other]

Murphy Diagrams: Forecast Evaluation of Expected Shortfall

Authors: Johanna F. Ziegel, Fabian Krüger, Alexander Jordan, Fernando Fasciati

Abstract: Motivated by the Basel 3 regulations, recent studies have considered joint forecasts of Value-at-Risk and Expected Shortfall. A large family of scoring functions can be used to evaluate forecast performance in this context. However, little intuitive or empirical guidance is currently available, which renders the choice of scoring function awkward in practice. We therefore develop graphical checks (Murphy diagrams) of whether one forecast method dominates another under a relevant class of scoring functions, and propose an associated hypothesis test. We illustrate these tools with simulation examples and an empirical analysis of S&P 500 and DAX returns.

Submitted 12 May, 2017; originally announced May 2017.

Report number: Discussion paper nr. 632, AWI, Heidelberg University

arXiv:1608.05498 [pdf, ps, other]

Elicitability and backtesting: Perspectives for banking regulation

Authors: Natalia Nolde, Johanna F. Ziegel

Abstract: Conditional forecasts of risk measures play an important role in internal risk management of financial institutions as well as in regulatory capital calculations. In order to assess forecasting performance of a risk measurement procedure, risk measure forecasts are compared to the realized financial losses over a period of time and a statistical test of correctness of the procedure is conducted. This process is known as backtesting. Such traditional backtests are concerned with assessing some optimality property of a set of risk measure estimates. However, they are not suited to compare different risk estimation procedures. We investigate the proposal of comparative backtests, which are better suited for method comparisons on the basis of forecasting accuracy, but necessitate an elicitable risk measure. We argue that supplementing traditional backtests with comparative backtests will enhance the existing trading book regulatory framework for banks by providing the correct incentive for accuracy of risk measure forecasts. In addition, the comparative backtesting framework could be used by banks internally as well as by researchers to guide selection of forecasting methods. The discussion focuses on three risk measures, Value-at-Risk, expected shortfall and expectiles, and is supported by a simulation study and data analysis.

Submitted 21 February, 2017; v1 submitted 19 August, 2016; originally announced August 2016.

arXiv:1505.05314 [pdf, other]

Cross-calibration of probabilistic forecasts

Authors: Christof Strähl, Johanna F. Ziegel

Abstract: When providing probabilistic forecasts for uncertain future events, it is common to strive for calibrated forecasts, that is, the predictive distribution should be compatible with the observed outcomes. Several notions of calibration are available in the case of a single forecaster, alongside diagnostic tools and statistical tests to assess calibration in practice. Often, there is more than one forecaster providing predictions, and these forecasters may use information of the others and therefore influence one another. We extend common notions of calibration, where each forecaster is analysed individually, to notions of cross-calibration where each forecaster is analysed with respect to the other forecasters in a natural way. It is shown theoretically and in simulation studies that cross-calibration is a stronger requirement on a forecaster than calibration. Analogously to calibration for individual forecasters, we provide diagnostic tools and statistical tests to assess forecasters in terms of cross-calibration. The methods are illustrated in simulation examples and applied to probabilistic forecasts for inflation rates by the Bank of England.

Submitted 20 May, 2015; originally announced May 2015.

arXiv:1307.7650 [pdf, ps, other]

Copula Calibration

Authors: Johanna F. Ziegel, Tilmann Gneiting

Abstract: We propose notions of calibration for probabilistic forecasts of general multivariate quantities. Probabilistic copula calibration is a natural analogue of probabilistic calibration in the univariate setting. It can be assessed empirically by checking for the uniformity of the copula probability integral transform (CopPIT), which is invariant under coordinate permutations and coordinatewise strictly monotone transformations of the predictive distribution and the outcome. The CopPIT histogram can be interpreted as a generalization and variant of the multivariate rank histogram, which has been used to check the calibration of ensemble forecasts. Climatological copula calibration is an analogue of marginal calibration in the univariate setting. Methods and tools are illustrated in a simulation study and applied to compare raw numerical model and statistically postprocessed ensemble forecasts of bivariate wind vectors.

Submitted 29 July, 2013; originally announced July 2013.

Showing 1–23 of 23 results for author: Ziegel, J