Thanks to Adina for the nomination but due to a lack of other entries, there is no winner for this week.
Stat of the Week Competition: October 29-November 4 2011
Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.
Here’s how it works:
- Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday November 4 2011.
- Statistics can be bad, exemplary or fascinating.
- The statistic must be in the NZ media during the period of October 29-November 4 2011 inclusive.
- Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.
Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.
The fine print:
- Judging will be conducted by the blog moderator in liaison with staff at the Department of Statistics, The University of Auckland.
- The judges’ decision will be final.
- The judges can decide not to award a prize if they do not believe a suitable statistic has been posted in the preceding week.
- Only the first nomination of any individual example of a statistic used in the NZ media will qualify for the competition.
- Employees (other than student employees) of the Statistics department at the University of Auckland are not eligible to win.
- The person posting the winning entry will receive a $20 iTunes voucher.
- The blog moderator will contact the winner via their notified email address and advise the details of the $20 iTunes voucher to that same email address.
- The competition will commence Monday 8 August 2011 and continue until cancellation is notified on the blog.
Stat of the Week Nominations: October 29-November 4 2011
If you’d like to comment on or debate any of this week’s Stat of the Week nominations, please do so below!
New climate change datasets: boring but useful results. (updated)
The Berkeley Earth Surface Temperature project has just released its first data sets and analyses. The aim of this project is to summarise all the temperature records around the world in a comprehensive and transparent way, both to get an estimate of changes in global temperature and to make it easy to see the impact of data quality filters that have been applied by previous climate modelling projects. So far, they have analysed all the land measurements; ocean measurements are coming next. You can download the data and code yourself, and check their findings or explore further (if you actually want 1.6 billion temperature measurements — don’t try this on a smartphone).
The main finding of the project won’t surprise most people: it’s getting hotter. More importantly, the estimates based on all available data agree almost perfectly with the previous estimates that were based on a small subset of ‘best’ weather stations. Incorporating lower-quality stations doesn’t change the estimates. Even using just the low quality stations gives pretty much the same estimates. Other things that don’t affect the results include the urban heat island effect: cities are hotter than the countryside, but most of the world isn’t in a city. They’ve also made a neat movie of the climate since 1800: you can see the normal oscillations over time, and the heating trend that eventually swamps them.
Combining temperature records with varying quality, measurement frequency, and duration, is a major statistical task even without considering the volume of data involved. The statistical expertise on the Berkeley Earth team includes an Auckland Stats graduate (and Berkeley Stats PhD), Charlotte Wickham.
Updated to add: this is now in the Kiwi media: NZ Herald, Stuff, 3 News.
Poisson variation strikes again
What’s the best strategy if you want to have a perfect record betting on the rugby? Just bet once: that gives you a 50:50 chance.
After national statistics on colorectal cancer were released in Britain, the charity Beating Bowel Cancer pointed out that there was a three-fold variation across local government areas in the death rate. They claimed that over 5,000 lives per year could be saved, presumably by adopting whatever practices were responsible for the lowest rates. Unfortunately, as UK blogger ‘plumbum’ noticed, the only distinctive factor about the lowest rates is shared by most of the highest rates: a small population, leading to large random variation.
His article was picked up by a number of other blogs interested in medicine and statistics, and Cambridge University professor David Spiegelhalter suggested a funnel plot as a way of displaying the information.
A funnel plot has rates on the vertical axis and population size (or some other measure of information) on the horizontal axis, with the ‘funnel’ lines showing what level of variation would be expected just by chance.
The funnel plot (click to embiggen) makes clear what the tables and press releases do not: almost all districts fall inside the funnel, and vary only as much as would be expected by chance. There is just one clear exception: Glasgow City has a substantially higher rate, not explainable by chance.
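The idea behind the funnel lines is simple enough to sketch. Below is a minimal Python simulation under an invented common death rate (20 per 100,000, chosen for illustration, not the actual UK figure): districts of very different sizes all share the same true rate, yet the smallest ones show the widest spread in observed rates, and roughly 95% of districts land inside the funnel.

```python
import math
import random

random.seed(1)

# Illustrative sketch only: the rate and populations are invented,
# not the actual UK bowel cancer data.
common_rate = 20 / 100_000   # assumed shared underlying death rate

def funnel_limits(pop, z=1.96):
    """Approximate 95% funnel limits for the observed rate in a district
    of population `pop`, if the true rate is `common_rate` everywhere.
    Deaths are roughly Poisson(pop * rate), so the sd of the observed
    rate is sqrt(rate / pop): small districts get a wide funnel."""
    se = math.sqrt(common_rate / pop)
    return common_rate - z * se, common_rate + z * se

# Simulate districts of widely varying size, all with the same true rate
# (normal approximation to the Poisson death counts, for speed).
pops = [random.randint(20_000, 800_000) for _ in range(100)]
obs_rates = []
for pop in pops:
    mean = pop * common_rate
    deaths = max(0, round(random.gauss(mean, math.sqrt(mean))))
    obs_rates.append(deaths / pop)

# Despite a large spread in raw rates, about 95% fall inside the funnel.
outside = sum(not (lo <= r <= hi)
              for r, pop in zip(obs_rates, pops)
              for lo, hi in [funnel_limits(pop)])
print(f"{outside} of {len(pops)} districts fall outside the 95% funnel")
```

Ranking these simulated districts by raw death rate would put small districts at both the top and the bottom of the table, exactly the pattern in the real data.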
Distinguishing random variation from real differences is critical if you want to understand cancer and alleviate the suffering it causes to victims and their families. Looking at the districts with the lowest death rates isn’t going to help, because there is nothing very special about them, but understanding what is different about Glasgow could be valuable both to the Glaswegians and to everyone else in Britain and even in the rest of the world.
All about election polls
November 26 is Election Day, and from now on, you’ll be getting election polls from all directions. So which ones can you trust? The easy answer is: none of them. However, some polls are worth more than others.
Assess their worth by asking these questions:
- Who is commissioning the poll? Is this done by an objective organisation or is it done by those who have a vested interest? Have they been clear about any conflict of interest?
- How have they collected the opinions of a representative sample of eligible voters? One of the cardinal sins of polls is to get people to select themselves (self-selection bias) to volunteer their opinions, like those ‘polls’ you see on newspaper websites. Here, you have no guarantee that the sample is representative of voters. “None of my mates down at the RSA vote that way, so all the polls are wrong” is a classic example of how self-selection manifests itself.
- How did they collect their sample? Any worthy pollster will have attempted to contact a random sample of voters via some mechanism that ensures they have no idea, beforehand, whom they will be able to contact. One of the easiest ways is computer-assisted telephone interviewing (CATI) of random household telephone numbers (landlines), typically sampled in proportion to geographical regions with a rural/urban split (usually called a stratified random sample). A random eligible voter then needs to be selected from that household – and it won’t necessarily be the person who most often answers the phone! That voter is usually found by asking which of the household’s eligible voters had the most recent birthday and talking to that person. The fact that not all households have landlines is an increasing concern with CATI interviewing, but in the absence of any substantiated better technique it remains the industry standard.
- What about people who refuse to cooperate? This is called non-response. Any pollster should try to reduce it as much as possible by re-contacting households that did not answer the phone the first time around, or, if the first call found that the person with the most recent birthday wasn’t home, trying to get hold of them later. If the voter still refuses, they become a ‘non-respondent’, and attempts should be made to re-weight the data so that the non-response effect is diminished. The catch is that the data is adjusted on the assumption that the respondents sampled represent the opinions of the non-respondents, on whom, by definition, we have no information. This is a big assumption that rarely gets verified. Any worthy polling company will mention its non-respondents and discuss how it attempts to adjust for them. Don’t trust any outfit that is not willing to discuss this!
- Has the polling company asked reasonable, unambiguous questions? If the voters are confused by the question, their answers will be too. The pollsters need to state what questions have been asked and why. Any fool can ask questions – asking the right question is one of the most important skills in polling. Pollsters should openly supply detail on what they ask and how they ask it.
- How can a sample of, say, 1000 randomly-selected voters represent the opinions of 3 million potential voters? This is one of the truly remarkable aspects of random sampling. The thing to realise is that, while this is a very small subset of voters, provided they have been randomly selected, the precision of the estimate is determined by the amount of information collected, not by the proportion of the total population sampled (provided this sampling fraction is quite small, e.g. 1000 out of 3 million).
- What is the margin of error (MOE)? It’s a measure of precision: the price paid for not taking a complete census, which happens once every three years on Election Day and which in statistical terms gives a population result. The MOE is based on the behaviour of all the similar possible poll results we could have obtained (at a given level of confidence, usually taken to be 95%). Once we know what that behaviour is (via probability theory and suitable approximations), we can use the data that has been collected to make inferences about the population that interests us. We know that 95% of all possible poll results, plus or minus their MOE, include the true unknown population value. Hence we say we are 95% confident that the interval around a poll result contains the population value.
- When we see a quoted MOE of 3.1% (from a random sample of n=1000 eligible voters), how has it been calculated? It is, in fact, the maximum margin of error that could have been obtained for any political party. It is only really valid for parties that are close to the 50% mark (National and Labour are okay here, but it is irrelevant for, say, NZ First, whose support is closer to 5%). So if National is quoted as having a party vote of 56%, we are 95% confident that the true population value for National support is anywhere between 56% plus or minus 3.1%, or about 53% to 59% – in this case, indicating a majority.
- Saying that a party is below the margin of error is saying it has few people supporting it, and not much else. Its MOE will be much lower than the maximum MOE. For back-of-the-envelope calculations, the maximum MOE is approximately 1/(square root of n); e.g. if n=1000 random voters are sampled, then the maximum MOE is 1/(square root of 1000) = 1/31.62 = 3.1%.
- Comparing parties becomes somewhat more complicated. If National is up, then no doubt Labour will be down. So to see whether National has a lead over Labour, we have to adjust for this negative correlation. A rough rule of thumb for comparing parties sitting around 50% is to see if they differ by more than 2×MOE. So if Labour has 43% of the party vote and National 53% (with MOE = 3.1% from n=1000), we can see that this 10% lead is greater than 2×3.1% = 6.2% – indicating that we have evidence to believe that National’s lead is ‘real’, or statistically significant.
- Note that any poll result only represents the opinions of those sampled at that place and time. A week is a long time in politics, and most polls become obsolete very quickly. Note also that today’s poll results can affect tomorrow’s, so these results are fluid, not fixed.
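The back-of-the-envelope calculations above can be sketched in a few lines of Python. The party shares below are the illustrative figures from the text, not real poll results:

```python
import math

def max_moe(n, z=1.96):
    """Maximum 95% margin of error for a proportion estimated from a
    simple random sample of size n: z * sqrt(0.25 / n), which is
    roughly 1 / sqrt(n)."""
    return z * math.sqrt(0.25 / n)

def moe(p, n, z=1.96):
    """Margin of error for a party polling at proportion p; this is
    smaller than the maximum MOE whenever p is away from 50%."""
    return z * math.sqrt(p * (1 - p) / n)

n = 1000
print(f"maximum MOE: {max_moe(n):.1%}")          # about 3.1%
print(f"MOE at 5% support: {moe(0.05, n):.1%}")  # much smaller, about 1.4%

# Rule of thumb for a lead between two large parties: the lead is
# 'real' (statistically significant) if it exceeds 2 x MOE.
labour, national = 0.43, 0.53
lead = national - labour
print(f"lead of {lead:.0%} vs 2xMOE = {2 * max_moe(n):.1%}:",
      "statistically significant" if lead > 2 * max_moe(n) else "within noise")
```

Note how quickly precision saturates: quadrupling the sample size only halves the MOE, which is why most published polls settle on roughly 1000 respondents.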
If you’re reading, watching or listening to poll results, be aware of their limitations. But note that although polls are fraught with difficulties, they remain useful. Any pollster who is open about the limitations of his or her methods is to be trusted over those who peddle certainty based on uncertain or biased information.
Stat of the Week Winner: October 15-21 2011
Thanks to Rob for his nomination of a misleading graph but due to a lack of other entries (perhaps we were all a little distracted by other events this week), there is no winner for this week.
Stat of the Week Competition: October 22-28 2011
Each week, we would like to invite readers of Stats Chat to submit nominations for our Stat of the Week competition and be in with the chance to win an iTunes voucher.
Here’s how it works:
- Anyone may add a comment on this post to nominate their Stat of the Week candidate before midday Friday October 28 2011.
- Statistics can be bad, exemplary or fascinating.
- The statistic must be in the NZ media during the period of October 22-28 2011 inclusive.
- Quote the statistic, when and where it was published and tell us why it should be our Stat of the Week.
Next Monday at midday we’ll announce the winner of this week’s Stat of the Week competition, and start a new one.
The fine print:
- Judging will be conducted by the blog moderator in liaison with staff at the Department of Statistics, The University of Auckland.
- The judges’ decision will be final.
- The judges can decide not to award a prize if they do not believe a suitable statistic has been posted in the preceding week.
- Only the first nomination of any individual example of a statistic used in the NZ media will qualify for the competition.
- Employees (other than student employees) of the Statistics department at the University of Auckland are not eligible to win.
- The person posting the winning entry will receive a $20 iTunes voucher.
- The blog moderator will contact the winner via their notified email address and advise the details of the $20 iTunes voucher to that same email address.
- The competition will commence Monday 8 August 2011 and continue until cancellation is notified on the blog.
Stat of the Week Nominations: October 22-28 2011
If you’d like to comment on or debate any of this week’s Stat of the Week nominations, please do so below!