James' Empty Blog: statistics

Monday, November 09, 2020

BlueSkiesResearch.org.uk: Modelling the ONS COVID data

I’ve grumbled for a while about the ONS analyses of their infection survey pilot (pilot? isn’t it a full-blown survey yet?) without doing anything about it. The purpose of this blog is to outline the issue, get me started on fixing it (or at least presenting my own approach to an analysis) and commit me to actually doing it this time. There are a couple of minor obstacles that I’ve been using as an excuse for several weeks now and it’s time I had a go.

The survey itself seems good – they are regularly testing a large "random" cohort of people for their infection status, and thereby estimating the prevalence of the disease and how it varies over time. The problem is in how they are doing this estimation. They are fitting a curve through their data, using a method known as a "thin plate spline." I am not familiar with this approach but it’s essentially a generic smooth curve that attempts to minimise wiggles.

There are two fundamental problems with their analysis, which may be related (or not) but are both important IMO. The first is that this smooth curve isn’t necessarily a credible epidemic curve. Epidemics have a particular dynamical form (you can think of them as locally exponential for the most part, though this is a bit of an oversimplification) and while variation in R over time gives rise to some flexibility in the outcome of this process, there are also inevitably constraints due to the way that infections arise and are detected. In short, the curve they are fitting has no theoretical foundation as the model of an epidemic. In practice it has often appeared to me that the curves have looked a bit unrealistic though of course I don’t claim to have much experience to draw on here!

The second fundamental problem is the empirical observation that their analyses are inconsistent and incoherent. This is illustrated in the following analysis of their Sept 4th results:

Here we see on the right their model fit (plume) to the last few months of data (the data themselves are not shown). The dot and bar is their estimate for the final week, which is one of the main outputs of their analysis. On the left, we see the equivalent latest-week estimates from previous reports, which have been produced at roughly weekly intervals. The last dot and bar is a duplicate of the one on the right hand panel. It is straightforward to superimpose these graphics thusly:

The issue here is that (for example) their dot and bar estimate from several reports back, centred on the 3rd August, has no overlap with their new plume. Both of these are supposed to be 95% intervals, meaning they are both claimed to have a 95% chance of including the real value. At least one of them does not.

While it’s entirely to be expected that some 95% intervals will not contain reality, this should be rare, occurring only 5% of the time. Three of the dot and bar estimates in the above graphic are wholly disjoint with the plume, one more has negligible overlap and another two are in substantial disagreement. This is not just bad luck, it’s bad calibration. This phenomenon has continued subsequent to my observation, eg this is the equivalent from the 9th Oct:

The previous dot and bar from late Sept is again disjoint from the new plume, and the one just after the dotted line in mid-August looks to be right on the limit depending on graphical resolution. In the most recent reports they have changed the scaling of the graphics to make this overlaying more difficult, but there is no suggestion that they have fixed the underlying methodological issues.

The obvious solution to all this is to fit a model along the lines of my existing approach, using a standard Bayesian paradigm, and I propose to do just this. First, let’s look at the data. The spreadsheet that accompanies the report each week gives various numerical summaries of the data and the one that I think is most usable for my purposes is the weighted fortnightly means in Table 1d which take the raw infection numbers and adjust to represent the population more accurately (presumably, accounting for things like different age distributions in the sample vs the national population). Thanks to the weekly publication of these data, we can actually create a series of overlapping fortnightly means out of two consecutive reports and I’ve plotted such a set here:

It’s not quite the latest data; I’m not going to waste effort updating this throughout the process until I’ve got a working algorithm, at which point the latest set can just slot in. The black circles here are the mean estimates, with the black bars representing the 95% intervals from the table.

Now we get to the minor headaches that I’d been procrastinating over for a while. The black bars are not generally symmetric around the central points as they arise from a binomial (type) distribution. My methods (in common with many efficient approaches) require the likelihood P(obs|model) to be Gaussian. The issue here is easily illustrated with a simple example. Let’s say the observations on a given day contain 10 positives in 1000 samples. If the model predicts 5 positives in a sample of 1000, then it’s quite unlikely we would obtain 10: P(O=10|m=5) = 1.8%. However if the model predicts 15 positives, the chance of seeing 10 is rather larger: P(O=10|m=15) = 4.8%. So even though both model predictions are an equal distance from the observation, the latter has a higher likelihood. A Gaussian (of any given width) would assign equal likelihood to both 5 and 15 as the observations are equally far from either of these predictions. I’ve wondered about trying to build in a transformation from binomial to Gaussian but for the first draft I’ll just use a Gaussian approximation which is shown in the plot as the symmetric red error bars. You can see a couple of them actually coincide with the black bars, presumably due to rounding on the data as presented in the table. The ones that don’t are all biased slightly low relative to the consistent positive skew of the binomial. The skew in these data is rather small compared to that of my simple example but using the Gaussian approximation will result in all of my estimates being just a fraction low compared to the correct answer.
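For anyone who wants to check those likelihood numbers, here is a minimal sketch using scipy's binomial pmf; the 10-in-1000 observation and the predictions of 5 and 15 are just the illustrative values from the paragraph above, not the real ONS data.

```python
# Minimal check of the binomial-vs-Gaussian likelihood asymmetry described above.
from scipy.stats import binom

n = 1000          # sample size
obs = 10          # observed positives

for predicted in (5, 15):
    p = predicted / n                  # model-implied positive rate
    lik = binom.pmf(obs, n, p)         # P(obs | model prediction)
    print(f"P(O={obs} | m={predicted}) = {lik:.3f}")

# Prints ~0.018 for m=5 and ~0.048 for m=15: equal distances from the observation,
# but unequal likelihoods, which a symmetric Gaussian cannot capture.
```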

Another issue is that the underlying sample data contribute to two consecutive fortnightly means in these summaries. A simple heuristic to account for this double-counting is to increase uncertainties by a factor sqrt(2) as shown by the blue bars. This isn’t formally correct and I may eventually use the appropriate covariance matrix for observational uncertainties instead, but it’s easier to set up and debug this way and I bet it won’t make a detectable difference to the answer as the model will impose strong serial correlation on this time scale anyway.

So that’s how I’m going to approach the problem. Someone was previously asking for an introduction to how this Bayesian estimation process works anyway. The basic idea is that we have a prior distribution of parameters/inputs P(Φ) from which we can draw an ensemble of samples. In this case, our main uncertain input is a time series of R which I’m treating as Brownian motion with a random daily perturbation. For each sample of Φ, we can run the model simulation and work out how likely we would be to observe the values that have been seen, if the real world had been the model – ie P(Obs|Φ). Using these likelihood values as weights, the weighted ensemble is directly interpretable as the posterior P(Φ|Obs). That really is all there is to it. The difficulties are mostly in designing a computationally efficient algorithm as the approach I have described may need a vast ensemble to work accurately and is therefore sometimes far too slow and expensive to apply to interesting problems. For example, my iterative Kalman smoother doesn’t actually use this algorithm at all, but instead uses a far more efficient way of getting to the same answer. One limitation (as mentioned above) is that it requires the likelihood to be expressed in Gaussian form.
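For the curious, here is a toy sketch of that weighted-ensemble (importance sampling) recipe. The forward model, prior and observation below are made-up placeholders purely to show the mechanics; the real calculation uses the epidemic model and the R time series described above.

```python
# Toy sketch of the weighted-ensemble idea: draw parameters from the prior, compute a
# Gaussian likelihood for each draw, and treat the normalised likelihoods as posterior weights.
import numpy as np

rng = np.random.default_rng(0)

def model(phi):
    # placeholder forward model: maps a parameter to a predicted observation
    return 2.0 * phi

obs, obs_sigma = 3.0, 0.5                        # hypothetical observation and its uncertainty

phi_prior = rng.normal(0.0, 2.0, size=100_000)   # samples from the prior P(phi)
pred = model(phi_prior)
log_lik = -0.5 * ((obs - pred) / obs_sigma) ** 2  # Gaussian log-likelihood (up to a constant)
weights = np.exp(log_lik - log_lik.max())
weights /= weights.sum()

# Posterior summaries: weighted mean, and a resampled posterior ensemble
post_mean = np.sum(weights * phi_prior)
post_draws = rng.choice(phi_prior, size=10_000, p=weights)
print(post_mean, post_draws.std())
```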

Saturday, September 12, 2020

Weekly RRRRRRReport

A lot of different estimates of the growth rate (R) of the epidemic have come out in the last couple of days, so here's a summary of which ones are wrong (and why) and which ones you can believe. And who am I to do this, you might reasonably ask? While not an epidemiologist, my professional expertise is in fitting models to data, which is precisely what this question demands. And the available evidence suggests I'm rather better at it than many epidemiologists appear to be.

As you may recall, a month ago I posted an argument that R really couldn't be under 1 any longer, and the epidemic was starting to grow again. At the time, the "experts" of SAGE were still insisting that R was less than 1, and they kept on claiming that for a while, despite the very clear rise in reported case numbers. The rise has continued and indeed accelerated a bit, other than for a very brief hiatus in the middle of last month. Part of this steady rise might have been due to a bit more testing, but for a while now it's been pretty implausible to believe that all of it was. I'll come back to SAGE's ongoing incompetence later.

I'll start with my own estimate which currently comes out at R= ~1.3. This is based on fitting a very simple model to both case and death data, which the model struggles to reconcile due to its simplicity. The average death rate (as a percentage of infected people) has dropped in recent weeks, thanks to mostly younger people being infected recently, and perhaps also helped by some improvements in treatment. I could try to account for this in the model but haven't got round to it. So it consistently undershoots the case numbers and overshoots deaths a bit, but I don't think this biases the estimate of R enough to really matter (precisely because the biases are fairly constant). Incidentally, the method I'm using for the estimation is an iterative version of an ensemble Kalman smoother, which is a technique I developed about 15 years ago for a different purpose. It's rather effective for this problem and clearly superior to anything that the epidemiologists are aware of. Ho hum.

Here are my plots of the fit to cases (top) and deaths (bottom) along with the R number.


As pointed out, these graphs need better annotation. Top graph is modelled daily infections (blue plume), modelled daily cases (green plume with some blue lines sampled from the ensemble and the median shown as magenta) and case ascertainment ratio which is basically the ratio of these (red plume, RH scale). Reported case numbers are the red circles. Bottom graph is modelled deaths (green plume with lines again) with data as red circles. Red plume here is the effective R number (RH scale). R number and case ascertainment are the fundamental parameters that are being fitted in my approach. Infection fatality rate is fixed at 0.75%.

So far, so good. Well, bad, but hopefully you know what I mean.

Another relevant weekly analysis that came out recently is the infection pilot survey from ONS. Up to now it's been pretty flat and inconclusive, with estimates that have wobbled about a little but with no clear signal. This all changed with their latest result, in which the previous estimate of 27,100 cases (uncertainty range 19,300 - 36,700) in the week of 19 - 25 Aug increased to 39,700 (29,300 - 52,700) in the week 30 Aug - 5 Sept. That is a rise of 46% in 11 days or about 3.5% per day. R is roughly the 5-day growth rate (for this disease), so that corresponds to an R value of 1.2, but note that their analysis doesn't extend over the past week when the cases have increased more sharply.
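For those who want to check the arithmetic, here's the conversion from the ONS numbers quoted above to an approximate R, using the rule of thumb that R is roughly the 5-day growth factor:

```python
# Back-of-envelope check: 27,100 -> 39,700 over roughly 11 days.
prev, latest, days = 27_100, 39_700, 11

daily_growth = (latest / prev) ** (1 / days)   # ~1.035, i.e. ~3.5% per day
R_approx = daily_growth ** 5                   # ~1.19, i.e. R of about 1.2
print(daily_growth, R_approx)
```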



Actually, I don't really think the ONS modelling is particularly good - it's a rather arbitrary curve-fitting exercise - but when the data are clear enough it doesn't matter too much. Just looking at the raw data that they kindly make available, they had almost 1 positive test result per 1000 participants over the fortnight 23 Aug - 5 Sept (55 cases in 59k people) which was 65% up on the rate for the previous fortnight of 26 cases in 46k people. Again, that works out at R=1.2.

A rather worse perspective was provided by SAGE, who continue to baffle me with their inability to apply a bit of common sense and update their methods when they repeatedly give results so strikingly at odds with reality. They have finally noted the growth in the epidemic and managed to come up with an estimate marginally greater than 1, but only to the level of R=1.1 with a range of 1-1.2. And even this is a rounding-up of their estimate of daily growth rate of 1 ± 2% per day (which equates more closely to R=1.05 with range of 0.95-1.15). Yes, they really did say that the epidemic might be shrinking by 1% per day, even as cases are soaring and hospital admissions are rising. I do understand how they've managed to generate this answer - some of the estimates that feed into their calculation only use death data, and this is still very flat - but it's such obvious nonsense that they really ought to have pulled their heads out of their arses by now. I sometimes think my work is a bit artificial and removed from practical issues but their unwillingness to bend to reality gives ivory tower academics a bad name.

At the other extreme, a paper claiming R=1.7 was puffed in the press yesterday. It's a large survey from Imperial College, that bastion of incompetent modelling from the start of the epidemic. The 1.7 number comes from the bottom right hand panel in the below plot where they have fitted an exponential through this short subset of the full time series of data. There is of course a lot of uncertainty there. More importantly, it doesn't line up at all with the exponential fitted through the immediately preceding data set, starting at a lower level than the previous curve finishes. While R might not have been constant over this entire time frame, the epidemic has certainly progressed in a continuous manner, which would imply the gap is filled by something like the purple line I've added by hand.



It's obviously stupid to pretend that R was higher than 1 in both of the recent intervals where they made observations, and just happened to briefly drop below 1 exactly in the week where they didn't observe. The sad thing about the way they presented this work to the media is that they've actually done a rather more sensible analysis where they fit the 3rd and 4th intervals simultaneously, which is shown as the green results in the 3rd and 4th panels on the top row of the plots (the green in the 3rd panel is largely overlain by blue which is the fit to 2nd and 3rd intervals, but you can see if you click for a larger view). Which gives.....R=1.3. Who'd have thought it?

Of course R=1.7 is much more headline-grabbing. And it is possible that R has increased towards the end of their experimental period. Rather than fitting simple exponentials (ie fixed R values) to intervals of data, perhaps a more intelligent thing to do would have been to fit an epidemiological model where R is allowed to vary through time. Like I have been doing, for example. I'm available to help and my consultancy rates are very reasonable.

In conclusion, R=1.3(ish) but this is a significant rise on the value it took previously and it might well be heading higher.

Wednesday, April 22, 2020

BlueSkiesResearch.org.uk: Bayesian deconstruction of climate sensitivity estimates using simple models: implicit priors and the confusion of the inverse

It wasn’t really my intention, but somehow we never came up with a proper title so now we’re stuck with it!

This paper was born out of our long visit to Hamburg a few years ago, from some discussions relating to estimates of climate sensitivity. It was observed that there were two distinct ways of analysing the temperature trend over the 20th century: you could either (a): take an estimate of forced temperature change, and an estimate of the net forcing (accounting for ocean heat uptake) and divide one by the other, like Gregory et al, or else (b): use an explicitly Bayesian method in which you start with a prior over sensitivity (and an estimate of the forcing change), perform an energy balance calculation and update according to how well the calculation agrees with the observed warming, like this paper (though that one uses a slightly more complex model and obs – in principle the same model and obs could have been used though).

These give slightly different answers, raising the question of (a) why? and (b) is there a way of doing the first one that makes it look like a Bayesian calculation?

This is closely related to an issue that Nic Lewis once discussed many years ago with reference to the IPCC AR4, but that never got written up AFAIK and is a bit lost in the weeds. If you look carefully, you can see a clue in the caption to Figure 1, Box 10.1 in the AR4 where it says of the studies:
some are shown for different prior distributions than in the original studies
Anyway, there is a broader story to tell, because this issue also pops up in other areas including our own paleoclimate research (slaps wrist). The basic point we try to argue in the manuscript is that when a temperature (change) is observed, it can usually be assumed to be the result of a measurement equation like:

T_O = T_T + e        (1)

where T_O is the numerical value observed, T_T is the actual true value, and e is an observational error which we assume to be drawn from a known distribution, probably Gaussian N(0,σ²) though it doesn’t have to be. The critical point is that this equation automatically describes a likelihood P(T_O|T_T) and not a probability distribution P(T_T|T_O), and we claim that when researchers interpret a temperature estimate directly as a probability distribution in that second way they are probably committing a simple error known as “confusion of the inverse” which is incredibly common and often not hugely important but which can and should be avoided when trying to do proper probabilistic calculations.

Going back to equation (1), you may think it can be rewritten as

T_T = T_O + e       (2)

(since -e and e have the same distribution) but this is not the same thing at all because all these terms are random variables and e is actually independent of T_T, not T_O.

Further, we show that in committing the confusion of the inverse fallacy, researchers can be viewed as implicitly assuming a particular prior for the sensitivity, which probably isn’t the prior they would have chosen had they thought about it more explicitly.
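Here is a little simulation sketch of that point, with made-up numbers: treating the observed value directly as a pdf for the true value only reproduces the Bayesian answer under a flat prior; with any informative prior the two diverge.

```python
# The measurement equation gives a likelihood P(T_O | T_T) = N(T_O; T_T, sigma^2).
# Reading it as P(T_T | T_O) is only valid under an (implicit) flat prior.
# All numbers here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.2                     # observational error s.d.
T_obs = 1.0                     # observed temperature change

# An informative prior on the true value, e.g. N(0.6, 0.3^2)
T_true = rng.normal(0.6, 0.3, size=200_000)

# Weight each prior draw by the likelihood of the observation
w = np.exp(-0.5 * ((T_obs - T_true) / sigma) ** 2)
w /= w.sum()

bayes_mean = np.sum(w * T_true)   # ~0.88: pulled towards the prior
naive_mean = T_obs                # 1.0: what "confusion of the inverse" gives
print(bayes_mean, naive_mean)
```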

The manuscript had a surprisingly (to me) challenging time in review, with one reviewer in particular taking exception to it. I encourage you to read their review(s) if you are interested. We struggled to understand their comments initially, but think their main point was that when a researcher writes down a pdf for T_T such as N(T_O,σ²) it was a bit presumptuous of us to claim they had made an elementary error in logical reasoning when they might in fact have been making a carefully considered Bayesian estimate taking account of all their uncertainties.

While I think in theory it’s possible that they could be right in some cases, I am confident that in practice they are wrong in the vast majority of cases including all the situations under consideration in our manuscript. For starters, if their scenario was indeed the case, the question would not have arisen in the first place as all the researchers working on these problems would already have understood fully what they did and why. And one of the other cases in the manuscript was based on our own previous work, where I’m pretty confident in remembering correctly that we did this wrong 🙂 But readers can make up their own minds as to how generally applicable it is. It’s an idea, not a law.

Our overall recommendation is that people should always try to take the approach of the Bayesian calculation, as this makes all their assumptions explicit. It would have been a bit embarrassing if it had been rejected, because a large and wildly exciting manuscript which makes extensive use of this idea has just been (re-)submitted somewhere else today.  Watch this space!

I think this is also notable as the first time we’ve actually paid paper charges in the past few years – on previous occasions we have sometimes pleaded poverty, but now we’ve had a couple of contracts so that excuse no longer really applies. Should get it free really as a reward for all the editing work – especially by jules!

Friday, April 10, 2020

EUROMOMO

A couple of weeks ago, commenter David Young was scathing of the threat of COVID-19. Look at the EUROMOMO web page, he said. 
"there is no evidence I could see that mortality is above the expected numbers over the last few weeks."
 and 
"Italy does show a spike the last few weeks but well below the peak in 2016-17."
Moreover, this is all backed up by impeccable analysis from Nic Lewis, and we know how good he is. He has looked at the Diamond Princess, which had only 8 deaths out of 700 cases, mostly old, and without a single death in the 60-69 age bracket; he has shown the fatality rate is negligible.

Well, let's see how this stacks up a full 2 weeks later.

Week 14 in EUROMOMO is now out. Click for full size.

They've literally had to rescale their axes for multiple countries to keep the death numbers on the plot. Belgium, France, Italy, Spain, Switzerland and England are all at or above their highs from the cold 2016/17 spell. And this is after a few weeks of lockdown that has stopped the death rates from zooming up higher.


And as for the Diamond Princess, there are actually 12 deaths now from the 700 cases, including one in the 60-69 age range of 200 cases. An empirical fatality rate of 0.5%. According to David, Nic said the death rate was 0.11% for this age group, but what's a factor of 4 between friends?

I expect David will appear shortly and acknowledge that he was wrong and that Nic massively underestimated the death rate and that in fact mortality across much of Europe is at an extremely high level despite strong attempts to suppress the epidemic. I can see his flying pig coming in to land right now..

Tuesday, March 31, 2020

The new study from IC

I haven't had time to examine this new research in great detail but it looks pretty good to me (maybe one or two minor caveats). They have fitted a fairly simple mechanistic statistical model to time series data for deaths in a large number of European countries, using a Bayesian hierarchical modelling approach to estimate the effects of various "non-pharmaceutical interventions" like shutting schools and banning large meetings which have been widely adopted. I suspect (could be wrong tho) that the reason for using a primarily statistical model is that it's quicker and easier for their method than a fully dynamical model like the one I'm using. They say it is fundamentally similar to SIR though, so I don't think their results should be any the worse for it (though perhaps the E box makes a difference?).

First and foremost, by far the most significant result IMO is that they find the initial R0 to be much larger than in the complex mechanistic modelling of Ferguson et al. This confirms what I suggested a few days ago, which was based on the fact that their value of R0 = 2.4 (ish) was simply not compatible with the rapid exponential growth rate in the UK and just about everywhere else in the western world.

Here are pics of their values for a number of countries, for their preferred reproductive time scale of 6.5 days. Left is prior to interventions, right is current post-intervention estimates.

 


You may spot that all the values on the right imply a substantial probability of R0 > 1, as I also suggested in my recent post. If the initial R0 is high, it's hard to reduce it enough to get it below 1. If true, that would mean ongoing growth of the epidemic, albeit at a lower rate and with a longer lower peak than would be the case without the interventions. I will show over the next few days what the implications of this could be over the longer term.

It's also important to realise that these values on the right depend on the data for all countries - this is the "hierarchical" bit - there is no evidence from the UK data itself of any significant drop in R, as you can work out from the below plot which gives the fit and a one-week forecast. There simply hasn't been long enough since the restrictions were put in place, for anything to feed through into the deaths. Though if it bends in the near future, that will be a strong indication that something is going on. They appear to be 3 days behind reality here - it looks like they haven't got the two small drops and most recent large rise.




Despite my best intentions, I don't really see much to criticise, except that they should have done this a week or two ago before releasing the complex modelling results in which they made such poor parameter choices. Obviously, they would not have been able to estimate the effect of interventions using this method, but they'd have got the initial growth rate right.

It's a little like presenting forecasts from one of the new CMIP6 models, without checking that it has roughly the right warming rate over the last few decades. You'd be better off using an energy balance that was fitted to recent history, especially if you were primarily interested in the broad details such as large-scale temperature trend. In fact that's a pretty direct analogy, except that the epidemic model is much simpler and more easily tuned than a state of the art climate model. Though they also had a bit less time to do it all :-)

As for the caveats however - I had to laugh at this, especially the description above:



I remember seeing a similarly impressive "large correspondence" plotted in a climate science paper a while back. I pressed the authors for the correlation between the two sets of data, and they admitted there wasn't one.

But that's a minor quibble. It's a really interesting piece of work.

I am now just about in a position to show results from my attempts at doing proper MCMC fits to time series data with the SEIR model so it will be interesting to see how the results compare. This can also give longer-term forecasts (under various scenario assumptions of course).

Monday, March 23, 2020

Uncertainty in the COVID-19 model

Time for James to Speak to the Nation. Well, Speak to my Reader at least. Rather than subjecting myself to the waffling windbag, here are some fun experiments in parameter estimation that I've been doing.

The setup is very simple. The SEIR model I've been using has 3 main internal parameters, being the reproduction number R0, the latent period, and the infectious period. They can all be estimated to some extent from case studies, but the appropriate numbers to use in a model are not precisely known. Beyond these parameters, the only remaining tunable value is the number of infected people at the start date (this date isn't worth changing itself as you just go up and down the exponential curve). It seems that Ferguson et al picked the biological numbers from the recent literature and then chose the start value to hit a given number of deaths from their simulations on a date in mid-March. No uncertainties are reported in their analysis, though they have used a few different R0 values in some scenarios. So I think it's worth investigating how much the parameters could reasonably vary, and how much difference this might make.

I've just done a very simple tuning, fitting to values at the start of Feb and 20th March. Since the data are essentially exponential there is little point in using more intermediate data. All it can tell me is the rate of growth and its current level.

Priors for the model parameters are chosen rather arbitrarily by me, based on my previous tuning.
R0 ~ N(2.4,0.3)
Latent_p ~ N(4,1)
Infectious_p ~ N(4.5,1)
Start ~ exp(N(-20,1))

All uncertainties quoted at one standard deviation.

Sampling from this prior gives a wide range of possible epidemics, as shown below. The blue plume is the 5-95% range with the median line also drawn. This plume may be a bit misleading as it seems to include the possibility of no epidemic at all! Actually what is happening here is the different parameter sets lead to a different timing of the peak, so at any moment in time quite a lot of them have either not started, or ended. I've plotted a handful of the individual curves as red lines to show that. In fact all parameter sets do lead to a large epidemic, which is an immediate and unavoidable consequence of the R0 value.
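For anyone who wants to play along at home, here is a rough sketch of that prior-sampling step. It is not my actual code: the SEIR stepping is a crude daily Euler scheme, the population size is just a UK-ish number, and I'm assuming the Start parameter can be read as the initial infected fraction.

```python
# Sketch: sample the priors listed above and run a basic SEIR model for each draw.
import numpy as np

rng = np.random.default_rng(2)
N_pop, n_days, n_samples = 67e6, 300, 200   # assumed population size and run length

def seir(R0, latent_p, infectious_p, start_frac, days=n_days):
    # beta = R0 * gamma; sigma and gamma are the inverse latent/infectious periods
    beta, sigma, gamma = R0 / infectious_p, 1 / latent_p, 1 / infectious_p
    S, E, I, R = 1 - start_frac, 0.0, start_frac, 0.0
    daily_I = []
    for _ in range(days):                   # daily Euler step
        new_E, new_I, new_R = beta * S * I, sigma * E, gamma * I
        S, E, I, R = S - new_E, E + new_E - new_I, I + new_I - new_R, R + new_R
        daily_I.append(I * N_pop)
    return np.array(daily_I)

# Draw an ensemble from the priors quoted in the post
ens = [seir(rng.normal(2.4, 0.3),
            rng.normal(4.0, 1.0),
            rng.normal(4.5, 1.0),
            np.exp(rng.normal(-20.0, 1.0)))
       for _ in range(n_samples)]
ens = np.vstack(ens)
print(np.percentile(ens, [5, 50, 95], axis=0).shape)   # (3, n_days) plume
```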




Now for the tuning.

For the data....I assume a significant amount of under-reporting in the official case numbers, as discussed by many. Kucharski today suggested the true value (and note this is even for symptomatic cases) was about 16x the reported values, though I'm not sure if this applies from day 1. Vallance recently made the heroic claim that the number of cases might be 1000 times the number of deaths at any given time, presumably in a desperate attempt to pretend that the mortality rate is really low as that's what he staked his advice on. I guess if you include newly infected, it's reasonable. My constraint on March 20 (which I'm taking as relating to the total symptomatic and recovered cases) doesn't include his value, but ranges from 22k-162k (median of 60k) on the day when the official number was 4k and deaths were 177. My early Feb constraint is centred on 20 with a range of 7-55. Mostly just because the log of 20 is very close to 3, I guess on reflection that range is probably a bit high but note that it has to account also for further imported cases (which the model does not simulate) rather than just home-grown ones. The initial growth is exponential so it only really makes sense to do these constraints in log space. A little simplistically, I assume uncorrelated errors on these two data points. In reality, the degree of under-reporting might be expected to be somewhat related.
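Here's a sketch of the rejection sampling step, with the two constraints above interpreted as medians with (roughly) 2-sigma ranges in log space, which is an assumption on my part; the model outputs fed into it are hypothetical stand-ins.

```python
# Each prior run is kept with probability proportional to its likelihood against the two
# constraints, assuming independent Gaussian errors in log space.
import numpy as np

rng = np.random.default_rng(3)

log_obs = np.log([20.0, 60_000.0])                 # early-Feb and 20-March medians
log_sd = np.array([np.log(55 / 20) / 2,            # 7-55 range read as ~2 sigma
                   np.log(162_000 / 60_000) / 2])  # 22k-162k range read as ~2 sigma

def log_likelihood(cases_feb, cases_mar20):
    # hypothetical model outputs: cumulative symptomatic+recovered cases on the two dates
    resid = (np.log([cases_feb, cases_mar20]) - log_obs) / log_sd
    return -0.5 * np.sum(resid ** 2)

def accept(loglik, loglik_max=0.0):
    # rejection sampling: keep the ensemble member with probability L / L_max
    return np.log(rng.uniform()) < (loglik - loglik_max)

print(accept(log_likelihood(25.0, 50_000.0)))   # likely kept
print(accept(log_likelihood(500.0, 5_000.0)))   # almost certainly rejected
```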

The below plot shows the prior and posterior spread over the initial segment on a log scale. The two red Hs are the constraints I've used with red circles at their means, the dark circles below are the actual reported numbers. My pale blue prior has a broad spread, and the green posterior (just tuned by rejection sampling, nothing fancy) focusses nicely on the plausible range.



And what about the projected epidemic? Here we have it:



Compared to the prior, it's greatly constrained with the timing of the peak ranging from about mid-April to the end of May. Again the lower limit in that plume is made up of the start of late epidemics and end of early ones, it doesn't represent a realistic trajectory in itself.

I think one thing that can probably be concluded is that the parametric uncertainty isn't hugely important for the broad policy advice. Rather like climate science, in fact :-) In the absence of strong action, we are getting an epidemic in the next couple of months, and that's really all we need to know. Though the timing is somewhat uncertain...I may investigate that further...watch this space...

Edit: I realised overnight that the way I have calculated the percentiles was a bit clumsy and misleading and will change for future work. Basically I calculated the percentiles of each model compartment first, with the various diagnostics being generated from these, whereas I should have calculated the diagnostics first, and then presented the percentiles of them. A bit of a silly error really. I'll leave it up as a testament to the limitations of doing stuff quickly.




Monday, December 23, 2019

Outcomes

At the start of the year I made some predictions. It's now time to see how I did.

In reverse order....


6. The level of CO2 in the atmosphere will increase (p=0.999).

Yup. I don't really need to wait for 1 Jan for that.

5. 2019 will be warmer than most years this century so far (p=0.75 - not the result of any real analysis).

As above, I know we've a few days to go but no need to wait for this one, which has been very clear for a good while now.

4. We will also submit a highly impactful paper in collaboration with many others (p=0.85).

Done, reviews back which look broadly ok, revision planned for early next year when we all have a bit of time (31 Dec is a stupid IPCC submission deadline for lots of other stuff).

3. Jules and I will finish off the rather delayed work with Thorsten and Bjorn (p=0.95).

Yes, the project is done, though the write-up continues. Actually we hope to submit a paper by 31 Dec but there will be more to do next year too.

2. I will run a time (just!) under 2:45 at Manchester marathon (p=0.6).

Nope, 2:47:15 this time. The prediction was made just a few days after I'd run a big PB in a 10k but even then I thought it was barely more likely than not, and it got less likely as the date approached. 

1. Brexit won't happen (p=0.95).

On re-reading the old post, I have to admit I cannot remember the precise intention I had when I wrote this. Given the annual time frame of the remainder of the bets, and the narrative of that time being that we were certainly going to leave on the 29th March (as repeated over 100 times by May - remember her? - and the rest of them) I do believe I must have been referring to leaving during 2019. After all, I could never hope to validate a bet of infinite duration. So yes, I'm going to give myself this one.

On the other hand, I did actually think that we would probably not be stupid enough to leave at all, and clearly I misunderestimated the electorate and also the dishonesty of the Conservative Party, or perhaps as it should be known, the English National Party.

I have learnt from that misjudgment and will not be offering any predictions as to where we end up at the end of next year. Which is sort of inconvenient, as we are trying to arrange a new contract with our European friends for work which could extend into 2021. Our options would seem to include: limiting the scope of the contract to what we can confidently complete strictly within 2020, which is far from ideal, or shifting everything to Estonia (incurring additional costs and inconvenience for us, though it may be the best option in the long term). Or just take a punt and cross our fingers that it all turns out ok, despite there being as yet no hint of a sketch of a plan as to how the sales of services into the EU will be regulated or taxed past 2020. It is quite possible that we'll just shut down the (very modest) operation and put our feet up. 

I'm still waiting for the brexiters to tell me how any of this is in the country's interests. But that's a rant for another day. Perhaps it's something to do with having enough of experts.

As for scoring my predictions: the idea of a “proper scoring rule” is to provide a useful measure of performance for probabilistic prediction. A natural choice is the logarithmic scoring rule L = log(p) where p is the probability assigned to the outcome, and with all of my predictions having a binary yes/no basis I'll use base 2 for the calculation. The aim is to maximise the score (ie minimise its negativity, as the log of numbers in the range 0 to 1 is negative). A certain prediction where we assign a probability of p=1 to something that comes out right scores a maximum 0, a coin toss is -1 whether right or wrong but if you predict something to only have a p=0.1 chance and it happens, then the score is log(0.1) which in base 2 is a whopping -3.3. Assigning a probability of 0 to the event that happens is a bad idea, the score is infinitely negative...oops.

My score is therefore:
0 - 0.42 - 0.23 - 0.07 - 1.32 - 0.07 = -2.11

or about -0.35 per bet, which is equivalent to assigning p=0.78 to the correct outcome each time (which is just the geometric mean of the probabilities I did assign). Of course some were very easy, but that's why I gave them high p estimates which means a high score (but a big risk if I'd got them wrong). I could have given a higher probability to the temperature prediction if I'd bothered thinking about it a bit more carefully. The running one was the only truly difficult prediction, because I was specifically calibrating the threshold to be close to the border of what I might achieve. It might have been better presented as a distribution for my finish time, where I would have had to judge the sharpness of the pdf as well as its location (ie mean).
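For completeness, here's the score calculation spelled out in a few lines of code, using the probabilities I assigned above:

```python
# Base-2 logarithmic score: each prediction contributes log2 of the probability
# assigned to what actually happened.
import math

# (assigned probability, did it happen?) for the six predictions above
predictions = [(0.999, True), (0.75, True), (0.85, True),
               (0.95, True), (0.6, False), (0.95, True)]

scores = [math.log2(p if outcome else 1 - p) for p, outcome in predictions]
total = sum(scores)                  # ~ -2.1
per_bet = total / len(scores)        # ~ -0.35
print(total, per_bet, 2 ** per_bet)  # 2**per_bet ~ 0.78, the effective p per bet
```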

Saturday, November 09, 2019

BlueSkiesResearch.org.uk: Marty Weitzman: Dismally Wrong.

De mortuis nil nisi bonum and all that, but I realise I only wrote this down in a very abbreviated and perhaps unclear form many years ago, in fact prior to publication of the paper it concerns. I was sad to hear of his untimely death and especially by suicide when he surely had much to offer. But like all innovative researchers, he made mistakes too, and his Dismal Theorem was surely one of them. Since it’s been repeatedly brought up again recently, I thought I should explain why it’s wrong, or perhaps to be more precise, why it isn’t applicable or relevant to climate science in the way he presented it.

His basic claim in this famous paper was that a “fat tail” (which can be rigorously defined) on a pdf of climate sensitivity is inevitable, and leads to the possibility of catastrophic outcomes dominating any rational economic analysis. The error in his reasoning is, I believe, rather simple once you’ve seen it, but the number of people sufficiently well-versed in statistics, climate science and economics (and sufficiently well-motivated to carefully examine the basis of his claim) is approximately zero so as far as I’m aware no-one else ever spotted the problem, or at least I haven’t seen it mentioned elsewhere.

The basic paradigm that underpins his analysis is that if we try to estimate the parameters of a distribution by taking random draws from it, then our estimate of the distribution is going to naturally take the form of a t-distribution which is fat-tailed. And importantly, this remains true even when we know the distribution to be Gaussian (thin-tailed), but we don’t know the width and can only estimate it from the data. The presentation of this paradigm is hidden beyond several pages of verbiage and economics which you have to read through first, but it’s clear enough on page 7 onwards (starting with “The point of departure here”).

The simple point that I have to make is to observe that this paradigm is not relevant to how we generate estimates of the equilibrium climate sensitivity. We are not trying to estimate parameters of “the distribution of climate sensitivity”, in fact to even talk of such a thing would be to commit a category error. Climate sensitivity is an unknown parameter, it does not have a distribution. Furthermore, we do not generate an uncertainty estimate by comparing a handful of different observationally-based point estimates and building a distribution around them. (Amusingly, if we were to do this, we would actually end up with a much lower uncertainty than usually stated at the 1-sigma level, though in this case it could indeed end up being fat-tailed in the Weitzman sense.) Instead, we have independent uncertainty estimates attached to each observational analysis, which are based on analysis of how the observations are made and processed in each specific case. There is no fundamental reason why these uncertainty estimates should necessarily be either  fat- or thin-tailed, they just are what they are and in many cases the uncertainties we attach to them are a matter of judgment rather than detailed mathematical analysis. It is easy to create artificial toy scenarios (where we can control all structural errors and other “black swans”) where the correct posterior pdf arising from the analysis can be of either form.

Hence, or otherwise, things are not necessarily quite as dismal as they may have seemed.

Tuesday, June 04, 2019

BlueSkiesResearch.org.uk: How confident should you have been about confidence intervals?

OK, it’s answer time for these questions (also here on this blog). First, a little background. This is the paper, or rather, here it is to download. The questions were asked of over 100 psychology researchers and 400 students and virtually none of them got all the answers right, with more wrong than right answers overall.

The questions were modelled on a paper by Gigerenzer who had done a similar investigation into the misinterpretation of p-values arising in null hypothesis significance testing. Confidence intervals are often recommended as an improvement over p-values, but as this research shows, they are just as prone to misinterpretation.

Some of my commenters argued that one or two of the questions were a bit unclear or otherwise unsatisfactory, but the instructions were quite clear and the point was not whether one might think the statement probably right, but whether it could be deduced as correct from the stated experimental result. I do have my own doubts about statement 5, as I suspect that some scientists would assert that “We can be 95% confident” is exactly synonymous with “I have a 95% confidence interval”. That’s a confidence trick, of course, but that’s what confidence intervals are anyway. No untrained member of the public could ever guess what a confidence interval is.

Anyway, the answer, for those who have not yet guessed, is that all of the statements were false, broadly speaking because they were making probabilistic statements about the parameter of interest, which simply cannot be deduced from a frequentist confidence interval. Under repetition of an experiment, 95% of confidence intervals will contain the parameter of interest (assuming they are correctly constructed and all auxiliary hypotheses are true) but that doesn’t mean that, ONCE YOU HAVE CREATED A SPECIFIC INTERVAL, the parameter has a 95% probability of lying in that specific range.

In reading around the topic, I found one paper which had an example which is similar to my own favourite. We can generate valid confidence intervals for an unknown parameter with the following procedure: with probability 0.95, say “the whole number line”, otherwise say “the empty set”. If you repeat this many times, the long-run coverage frequency tends to 0.95, as 95% of the intervals do include the true parameter value. However, for a given example, we can state with absolute certainty whether the parameter is either in or outside the interval, so we will never be able to say, once we have generated an interval, that there is 95% probability that the parameter lies inside that interval.
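A quick simulation makes the point concrete: the silly procedure really does have 95% coverage in the long run, even though each individual interval contains the parameter with probability either 1 or 0.

```python
# The degenerate confidence procedure: report the whole real line with probability 0.95,
# and the empty set otherwise.
import numpy as np

rng = np.random.default_rng(4)
true_param = 3.7          # arbitrary unknown parameter (its value never matters here)
n_trials = 100_000

covered = 0
for _ in range(n_trials):
    if rng.uniform() < 0.95:
        covered += 1      # "whole number line" always contains the parameter
    # else: "empty set" never contains it

print(covered / n_trials)  # ~0.95 coverage, despite each interval being all-or-nothing
```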

(Someone is now going to raise the issue of Schrödinger’s interval, where the interval is calculated automatically, and sealed in a box. Yes, in this situation we can place 95% probability on that specific interval containing the parameter, but it’s not the situation we usually have where someone has published a confidence interval, and it’s not the situation in the quiz).

And how about my readers? These questions were asked on both blogs (here and here) and also on twitter, gleaning a handful of replies in all places. Votes here and on twitter were majority wrong (and no-one got them all right), interestingly all three of the commenters on the Empty Blog were basically correct though two of them gave slightly ambiguous replies but I think their intent was right. Maybe helps that I’ve been going on about this for years there 🙂

Sunday, May 26, 2019

BlueSkiesResearch.org.uk: How confident are you about confidence intervals?

Found a fun little quiz somewhere, which I thought some of my readers might like to take. My aim is not to embarrass people who may get some answers wrong – in testing, the vast majority of all respondents (including researchers who reported substantial  experience) were found to make mistakes. My hypothesis is that my readers are rather more intelligent than average 🙂 Please answer in comments but work out your answers before reading what others have said, so as not to be unduly influenced by them.

I will summarise and explain the quiz when enough have answered…


A researcher undertakes an experiment and reports “the 95% confidence interval for the mean ranges from 0.1 to 0.4”

Please mark each of the statements below as “true” or “false”. False means that the statement does not follow logically from the quoted result. Also note that all, several, or none of the statements may be correct:

1. The probability that the true mean is greater than 0 is at least 95%.

2. The probability that the true mean equals 0 is smaller than 5%.

3. The “null hypothesis” that the true mean equals 0 is likely to be incorrect.

4. There is a 95% probability that the true mean lies between 0.1 and 0.4.

5. We can be 95% confident that the true mean lies between 0.1 and 0.4.

6. If we were to repeat the experiment over and over, then 95% of the time the true mean falls between 0.1 and 0.4.

Thursday, November 30, 2017

Implicit priors and the energy balance of the earth system

So, this old chestnut seems to keep on coming back....

Back in 2002, Gregory et al proposed that we could generate “An observationally based estimate of the climate sensitivity” via the energy balance equation S = F_2x dT/Q, where S is the equilibrium sensitivity to 2xCO2, F_2x = 3.7 W/m² is the (known constant) forcing of 2xCO2, dT is the observed surface air temperature change and Q is the net radiative imbalance at the surface which takes account of both radiative forcing and the deep ocean heat uptake. (Their notation is marginally different, I'm simplifying a bit.)

Observational values for both dT and Q can be calculated/observed, albeit with uncertainties (reasonably taken to be Gaussian). Repeatedly sampling from these observationally-derived distributions and taking the ratio generates an ensemble of values for S which can be used as a probability distribution. Or can it? Is there a valid Bayesian interpretation of this, and if so, what was the prior for S? Because we know that it is not possible to generate a Bayesian posterior pdf from observations alone. And yet, it seems that one was generated.
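For illustration, here is the ratio-sampling recipe in a few lines; the observational values and uncertainties are placeholders rather than the actual Gregory et al numbers.

```python
# Sample dT and Q from their observationally-derived Gaussians and take the ratio.
import numpy as np

rng = np.random.default_rng(5)
F2x = 3.7                               # W/m^2, forcing for doubled CO2
n = 1_000_000

dT = rng.normal(0.8, 0.1, n)            # hypothetical observed warming (K)
Q = rng.normal(1.0, 0.3, n)             # hypothetical net forcing minus heat uptake (W/m^2)

S = F2x * dT / Q                        # "observationally based" sensitivity ensemble
print(np.percentile(S, [5, 50, 95]))    # note the long upper tail when Q can be small
```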

This method may date back to before Gregory et al, and is still used quite regularly. For example, Thorsten Mauritsen (who we were visiting in Hamburg recently) and Robert Pincus did it in their recent “Committed warming” paper. Using historical observations, they generated a rather tight estimate for S as 1.1-4.4C, though this wasn't really the main focus of their paper. It seems a bit optimistic compared to much of the literature (which indicates the 20th century to provide a rather weaker constraint than that) so what's the explanation for this?

The key is in the use of the observationally-derived distributions for the quantities dT and Q. It seems quite common among scientists to interpret a measurement x_o of an unknown x, with some known (or perhaps assumed) uncertainty σ, as implying the probability distribution N(x_o,σ) for x. However, this is not justifiable in general. In Bayesian terms, it may be considered equivalent to starting with a uniform prior for x and updating with the likelihood arising from the observation. In many cases, this may be a reasonable enough thing to do, but it's not automatically correct. For instance, if x is known to be positive definite, then the posterior distribution must be truncated at 0, making it no longer Gaussian (even if only to a negligible degree). (Note however that it is perfectly permissible to do things like use (x_o - 2σ, x_o + 2σ) as a 95% frequentist confidence interval for x, even when it is not a reasonable 95% Bayesian credible interval. Most scientists don't really understand the distinction between confidence intervals and credible intervals, which may help to explain why the error is so prevalent.)

So by using the observational estimates for dT and Q in this way, the researcher is implicitly making the assumption of independent uniform priors for these quantities. This implies, via the energy balance equation, that their prior on S is the quotient of two uniform priors. Which has a funny shape in general, with a flat region near 0 and then a quadratically-decaying tail. Moreover, this prior on S is not independent of the prior for either dT or Q. Although it looks like there are three unknown quantities, the energy balance equation tying them together means there are only two degrees of freedom here.

At the time of the IPCC AR4, this rather unconventional implicit prior for S was noticed by Nic Lewis who engaged in some correspondence with IPCC authors about the description and presentation of the Gregory et al results in that IPCC report. His interpretation and analysis is very slightly different to mine, in that he took the uncertainty in dT to be so (relatively) small that one could ignore it and consider the uniform prior on Q alone, which implies an inverse quadratic prior on S. However the principle of his analysis is similar enough.

In my opinion, a much more straightforward and natural way to approach the problem is instead to define the priors over Q and S directly. These can be whatever we want and are prepared to defend publicly. I've previously advocated a Cauchy prior for S which avoids the unreasonableness and arbitrariness of a uniform prior for this constant. In contrast, a uniform prior over Q (independent of S) is probably fairly harmless in this instance, and this does allow for directly using the observational estimate of Q as a pdf. Sampling from these priors to generate an ensemble of (S,Q) pairs allows us to calculate the resulting dT and weight the ensemble members according to how well the simulated values match the observed temperature rise. This is standard Monte Carlo integration using Bayes Theorem to update a prior with a likelihood. Applying this approach to Thorsten's data set (and using my preferred Cauchy prior), we obtain a slightly higher range for S of 1.2 - 4.8C. Here's a picture of the results (oops, ECS = S there, an inconsistent labelling that I can't be bothered fixing).
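Here's a sketch of that sampling-and-weighting calculation. The Cauchy parameters, the uniform range for Q and the observational numbers are all illustrative placeholders, not the values from Thorsten's data set.

```python
# Put priors directly on S and Q, run them through the energy balance, and weight by how
# well the implied warming matches the observed dT.
import numpy as np

rng = np.random.default_rng(6)
F2x = 3.7
n = 1_000_000

# Priors: a Cauchy on S (location 2.5, scale 2.5, truncated to a plausible range),
# and a uniform, independent prior on Q.
S = rng.standard_cauchy(n) * 2.5 + 2.5
S = S[(S > 0) & (S < 20)]
Q = rng.uniform(0.0, 3.0, size=S.size)

dT_pred = S * Q / F2x                        # energy-balance prediction of warming
dT_obs, dT_sigma = 0.8, 0.1                  # hypothetical observation and uncertainty

w = np.exp(-0.5 * ((dT_obs - dT_pred) / dT_sigma) ** 2)
w /= w.sum()

post = rng.choice(S, size=100_000, p=w)      # weighted (posterior) ensemble for S
print(np.percentile(post, [5, 50, 95]))
```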


The median and 5-95% ranges for prior and posterior are also given. As you can see, the Cauchy prior doesn't really cut off the high tail that aggressively. In fact it's a lot higher than a U[0,10] or even U[0,20] prior would imply.  

Monday, September 18, 2017

More D&A; and FvB.

By chance I happened to notice another paper with an interesting title appearing in Climatic Change on the very same day as the recent Mann et al paper: Is the choice of statistical paradigm critical in extreme event attribution studies? While my noticing it was fortuitous, the publication date was no coincidence, as it was clearly intended as a "comment on" in all but name. I am not particularly impressed by such shenanigans. I know that Nature often publishes a simultaneous commentary along with the article itself, but these are generally along the lines of a sycophantic laudation extolling the virtues of the new work. The climatic change version seems to be designed to invite a critical comment which does not provide the authors under attack any right to reply. Jules and I were previously supposed to be a victim of this behaviour when we published this paper. However the commentary never got written, so in the end we suffered nothing more than a lengthy delay to final publication.

Anyway, back to the commentary. Is the choice of statistical paradigm critical? I can't really be bothered discussing it in any detail. The arguments have been hashed out before (including on this blog, e.g. here). The authors provide a rather waffly defence of frequentist approaches without really providing any credible support IMO, based on little more than some rather silly straw men. Of course a badly-chosen prior can give poor results, but so can an error in your computer program or a typo in your manuscript, and no-one argues that it's better to just take a guess instead of doing a calculation and writing down the answer. Well, almost no-one.


Saturday, April 01, 2017

BlueSkiesResearch.org.uk: Independence day

We all know by now that Brexit means brexit. However, it is not so clear whether independence means independence or perhaps something else entirely. This has been an interesting and important question in climate science for at least 15 years and probably longer. The basic issue is, how do we interpret the fact that all the major climate models, which have been built at various research centres around the world, generally agree on the major issues of climate change? Ie, that the current increase in CO2 will generate warming at around the rate of 0.2C/decade globally, with the warming rate being higher at high latitudes, over land, at night, and in winter. And that this will be associated with increases in rainfall (though not uniformly everywhere – in fact this being focussed on the wettest areas, with many dry areas becoming drier). Etc etc at various levels of precision. Models disagree on the fine details but agree on the broad picture. But are these forecasts robust, or have we instead merely created multiple copies of the same fundamentally wrong model? We know for sure that some models in the IPCC/CMIP collection are basically copies of other models with very minor changes. Others appear to differ more substantially, but many common concepts and methods are widely shared. This has led some to argue that we can’t really read much into the CMIP model-based consensus as these models are all basically near-replicates and their agreement just means they are all making the same mistakes.

While people have been talking about this issue for a number of years, it seems to us that little real progress has been made in addressing it. In fact, there have been few attempts to even define what "independence" in this context should mean, let alone how it could be measured or how some degree of dependence could be accounted for correctly. Many papers present an argument that runs roughly like this:
  • We want models to be independent but aren’t defining independence rigorously
  • (Some analysis of a semi-quantitative nature)
  • Look, our analysis shows that the models are not independent!
Perhaps I’m not being entirely fair, but there really isn’t a lot to get your teeth into.
 
We’ve been pondering this for some time, and have given a number of presentations of varying levels of coherence over the last few years. Last August we finally managed to write something down in a manner that we thought tolerable for publication, as I wrote about at the time. During our trip to the USA, there was a small workshop on this topic which we found very interesting and useful, and that together with reviewer comments helped us to improve the paper in various ways. The final version was accepted recently and has now appeared in ESD. Our basic premise is that independence can and indeed must be defined in a mathematically rigorous manner in order to make any progress on this question. Further, we present one possible definition, show how it can be applied in practice, and what conclusions flow from this.

Our basic idea is to use the standard probabilistic definition of independence: A and B are independent if and only if P(A and B) = P(A) x P(B). In order to make sense of this approach, it has to be applied in a fundamentally Bayesian manner. That is to say, the probabilities (and therefore the independence or otherwise of the models) are not truly properties of the models themselves, but rather properties of a researcher’s knowledge (belief) about the models. So the issue is fundamentally subjective and depends on the background knowledge of the researcher: A and B are conditionally independent given X if and only if P(A and B given X) = P(A given  X) x P(B given X). Depending what one is conditioning on, this approach seems to be flexible and powerful enough to encapsulate in quantitative terms some of the ideas that have been discussed somewhat vaguely. For example, if X is the truth, then we arrive at the truth-centred hypothesis that the errors of an ensemble of models will generally cancel out and the ensemble mean will converge to the truth. It’s not true (or even remotely plausible), but we can see why it was appealing. More realistically, if X is the current distribution of model outputs, and A and B are two additional models, then this corresponds quite closely to the concept of model duplication or near-duplication. If you want more details then have a look at the paper.

Anyway, we don’t expect this to be the last word on the subject, though it may be the last we say about it for some while as we are planning on heading off in a different direction with our climate research for the coming months and perhaps years.

Monday, January 23, 2017

Is the Food Standards Agency fit for purpose?

Fans of Betteridge's Law will already know the answer....

This is about the latest food scare of course. Toast and “overcooked potatoes” which was later revealed to refer to roasted potatoes (and well-fried chips) are the supposed culprits this time. It's not the first time we've been warned about acrylamides, and probably won't be the last. It's all nonsense, sadly. The basic problem with the FSA approach is that it identifies and publicises chemicals as being likely to be a cancer causing agent, without any(*) consideration of the dose required. As David Spiegelhalter's excellent article explains, a typical human diet contains around 100th of the dose that has been observed to lead to a modest increase in the rate of tumours in animals. And despite all the studies that have been done, no-one has found any link between acrylamide and tumours in humans. But that doesn't stop the FSA generating scare stories about how we shouldn't toast bread properly, or roast our potatoes (I heard someone recommend 45 mins in a cool oven which would just produce soft greasy pallid lumps).

Incidentally, that article probably doesn't blow DS's horn sufficiently for people to realise how authoritative an expert he is. He is Winton Professor for the Public Understanding of Risk in the Statistical Laboratory, Centre for Mathematical Sciences, University of Cambridge. When he writes about something, he's probably right.

Anyway, I'm going to keep on roasting.

* not strictly true as a careful reading of David Spiegelhalter's article reveals. But the margin of safety has to be astronomical, rather than merely huge, in order for them to discount it.


Wednesday, February 24, 2016

No, Terence Mills does not believe his “forecast”

Another day, another bit of clickbait climate sceptic trash in the Times. Hidden behind a paywall, which saves you from having to read it, though some of it is republished here and the underlying report (not peer reviewed) is on the GWPF site here. The basic premise is that if you fit a nonsense model with no trend or drift, you generate a forecast with no trend or drift (though with huge uncertainty intervals, necessary to allow for the historical warming we've already seen). Amusingly, even with those huge uncertainty intervals, the temperature is already outside them as tweeted here by Gavin.



I was interested in whether the author really believed it, whether he'd been conned or misrepresented by the GWPF, or whatever else his explanation might be. So I emailed him, proposing a simple bet: for every month where the temperature lies above the red 50% line, he pays me £50. For every month where the temperature lies below the line, I pay him £60. Expected win for him: £5 per month, indefinitely. Assuming he believes the forecast, that is. Expected win for me, knowing his forecast to be junk: £50 per month, though I might lose a few at the outset :-)

There followed a fruitless exchange in which he declined to comment on whether he thought the forecast was credible and refused to even discuss any possible bet. I'm still baffled as to what might motivate a statistics professor to put their name to such obvious drivel, it's hardly something that will enhance his reputation in the academic community, or that he can feel proud to have written.

Friday, December 11, 2015

The benefits of an informed prior

Since Bayesian priors seem to have come up on a couple of blogs recently...

A few weeks ago I got a rather spammy email about a “climate change challenge” from someone I'd not come across before. Looking at his “publication list”, that is perhaps not so surprising (the most recent ACP submission listed there was rejected without review, so cannot be found on their web site). Anyway, it's just a typical kooky site full of claims about how everyone else is wrong apart from the author who cannot get his groundbreaking theories published, move along nothing to see etc...

Shortly afterwards, Doug Keenan announced a “contest” wherein all and sundry were challenged to identify from a large set of random time series which of them had been generated by a trendless random process, and which were generated by one with a known trend. Of course the trivial trick underlying his game is to make the trend small relative to the inherent variability of the random time series. For a simple example, consider if I generate a set A consisting of 500 samples from N(0,1) and a set B of 500 samples from N(1,1). If I publish the 1000 values, no-one could possibly hope to identify correctly which were from A and which were from B, because any value like 0.3 or 0.8 could easily have come from either set. If I'd used N(0,0.1) and N(1,0.1) for the two sets, on the other hand, it would have been rather different...
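To see just how hopeless the task is, here's a quick check: even the optimal classification rule (split at 0.5) gets roughly 31% of the 1000 values wrong in the first case, while in the second it would separate them almost perfectly.

```python
# Misclassification rate for the N(0,1) vs N(1,1) example above, using the optimal threshold.
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(0.0, 1.0, 500)    # "trendless" analogue set
B = rng.normal(1.0, 1.0, 500)    # "trended" analogue set

errors = np.sum(A > 0.5) + np.sum(B < 0.5)
print(errors / 1000)             # ~0.31

# With N(0,0.1) vs N(1,0.1) the same threshold would separate the sets almost perfectly.
```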

Amusingly, “I'm a genius time series analyst” Keenan bungled his calculations, as is documented in the comments here. His original set of trendless and trended series were sufficiently well separated that a successful partitioning might have been possible, at least with a bit of luck. Of course, his “$100,000” prize was “based on the honor system” so any entrant's chance of collecting would have to factor in their opinion of how honourable he is. I suggest that his juxtaposition of “My name would be mud if I reneged”, with the fact that he actually has reneged on his challenge by withdrawing the original set of samples and replacing it with another, might be usefully considered as evidence on the matter.

As for the Bayesian prior, that's what got me to the same answer as Andrew Gelman without the need to do all the calculations...