Posts from May 2014 (77)

May 30, 2014

Trusting your data or your model

Even with large amounts of data, automated predictions must usually incorporate explicit or implicit prior understanding of the structure of the problem. “Look for anything” is not good enough: “anything” is too big.

Here, for your weekend light entertainment, are some examples where the prior structure was too strong or too weak:

The example that prompted this post, from the blog of Melville House Press, is about automated scanning of books to create digital editions:

 in many old texts the scanner is reading the word ‘arms’ as ‘anus’ and replacing it as such in the digital edition. As you can imagine, you don’t want to be getting those two things mixed up.

A similar phenomenon was pointed out at Language Log a decade ago:

Fear not your toes, though they are strong,
The conquest doth to you belong;

Daniel Dennett recounts two anecdotes of speech recognition, one human and one computer, which err in the opposite direction to the text recognition example. The computer one:

An AI speech-understanding system whose development was funded by DARPA (Defense Advanced Research Projects Agency), was being given its debut before the Pentagon brass at Carnegie Mellon University some years ago. To show off the capabilities of the system, it had been attached as the “front end” or “user interface” on a chess-playing program. The general was to play white, and it was explained to him that he should simply tell the computer what move he wanted to make. The general stepped up to the mike and cleared his throat–which the computer immediately interpreted as “Pawn to King-4.” 

And, the example that is frustratingly familiar to so many of us: mobile phone autocorrupt, which you can search for yourself.

Levels of evidence

If you find that changing your diet in some way makes you feel happier and healthier, that’s a good thing.  It doesn’t matter whether the same change would be useful for most people, or only useful for you. It doesn’t matter whether the change is a placebo effect. It doesn’t even matter if it’s an illusion, a combination of regression to the mean and confirmation bias. You might check with a doctor or dietician as to whether the change is dangerous, but otherwise, go for it.

If you want to campaign for the entire community to make a change in their diet, you need to have evidence that it’s better on average for the entire community. A few people’s subjective experience isn’t good enough.  Good quality observational data might be all you can manage if the benefits are subtle or take years to appear, but if you’re claiming dramatic short-term benefits you should be able to demonstrate them in a randomised controlled trial.

The reason for mentioning this is that PETA has been making friends again. They’re trying to link milk consumption to autism. They don’t even pretend to have any evidence that milk causes autism, and the evidence that a milk-free diet has a beneficial effect in people with autism is very weak.  That is, there are a few studies that suggest a benefit, but the benefit is smaller in studies with more reliable designs, and absent in the best-designed studies.  The most recent review of the evidence concluded that dairy-free or gluten-free diets should only be tried for people who have some separate evidence of food intolerance.  After reading the review, I would agree.

There are respectable arguments against dairy farming, both ethical and environmental. Scaremongering about autism isn’t one of them.

May 29, 2014

Lede program at Columbia

Columbia University in New York is running an amazing-looking data journalism certificate called The Lede Program. The program director is Cathy O’Neil of mathbabe.org and Occupy Finance, and the program advisor is Mark Hansen, statistician, computational scientist, and artist.

Anyway, their syllabus (and quite a bit of other content) is available on Github.

I’d like to quote a course outline by Cathy O’Neil:

This course begins with the idea that computing tools are the products of human ingenuity and effort. They are never neutral and carry with them the biases of their designers and their design process. “Platform studies” is a new term used to describe investigations into these relationships between computing technologies and the creative or research products that they help to generate. How you understand how data, code, and algorithms affect creative practices can be an effective first step toward critical thinking about technology. 

 

We like to say ‘second lowest’

From the Herald

New Zealand has the highest rate of obesity in Australasia, according to a new global analysis.

The “Australasia” group has two countries in it. The proportion overweight or obese differs between those two countries by 3.3 percentage points.

There’s a good interactive visualisation from the IHME group who put the data together.

Margins of error and our new party

Attention conservation notice:  if you’re not from NZ or Germany you probably don’t understand the electoral system, and if you’re not from NZ you don’t care.

Assessing the chances of the new Internet Mana party from polls will be even harder than usual. The Internet half of the chimera will get a List seat if the party gets exactly one electorate and enough votes for two seats (about 1.2%), or if they get two electorates (eg Hone Harawira and Annette Sykes) and enough votes for three seats (about 2%), or if they get no electorates and at least 5% of the vote. [Update: a correspondent points out that it’s more complicated. The orange man provides a nice calculator. Numbers in the rest of the post are updated]

With a poll of 1000 people, 1.2% is 12 people and 2% is 20 people.  Even if there were no other complications, the sampling uncertainty is pretty large: if the true support proportion is 0.02, a 95% prediction interval for the poll result goes from 0.9% to 2.9%, and if the true support proportion is 0.012, the interval goes from 0.6% to 1.8%.
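The sampling calculation can be sketched in a few lines of Python. This is only an illustration, assuming a simple random sample and using exact binomial quantiles; the function names are my own, and a different method (for example a normal approximation) will give slightly different endpoints, so the output need not match the rounded figures quoted above exactly.

```python
from math import comb


def binom_quantile(q, n, p):
    """Smallest k such that P(X <= k) >= q for X ~ Binomial(n, p)."""
    cumulative = 0.0
    for k in range(n + 1):
        cumulative += comb(n, k) * p**k * (1 - p) ** (n - k)
        if cumulative >= q:
            return k
    return n


def prediction_interval(n, p, level=0.95):
    """Central prediction interval for the polled proportion.

    Assumes a simple random sample of size n and true support p.
    """
    tail = (1 - level) / 2
    low = binom_quantile(tail, n, p)
    high = binom_quantile(1 - tail, n, p)
    return low / n, high / n


for support in (0.012, 0.02):
    low, high = prediction_interval(1000, support)
    print(f"true support {support:.1%}: poll result from {low:.1%} to {high:.1%}")
```

The point is the width of the intervals, not the last digit: a party whose true support is at one threshold can easily poll as if it were at the other.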

Any single poll is almost entirely useless — for example, if the party polls 1.5% it could have enough votes for one, two, or three total seats, and national polling data won’t tell us anything useful about the relevant electorates. Aggregating polls will help reduce the sampling uncertainty, but there’s not much to aggregate for the Internet Party and it’s not clear how the amalgamation will affect Mana’s vote, so we are limited to polls starting now.

Worse, we don’t have any data on how the polls are biased (compared to the election) for this party. The Internet half will presumably have larger support among people without landline phones,  even after age, ethnicity, and location are taken into account. Historically, the cell-phone problem doesn’t seem to have caused a lot of bias in NZ opinion polls (in contrast to the US), but this may well be an extreme case. The party may also have more support from younger and less well off people, who are less likely to vote on average, making it harder to translate poll responses into election predictions.

May 28, 2014

Monty Hall problem and data

Tonight’s Mythbusters episode on Prime looked at the Monty Hall/Pick-a-Door problem, using experimental data as well as theory.

For those of you who haven’t been exposed to it, the idea is as follows:

There are three doors. Behind one is a prize. The contestant picks a door. The host then always opens one of the other doors, which he knows does not contain the prize. The contestant is given an opportunity to change their choice to the other unopened door. Should they take this choice?

The stipulation that the host always makes the offer and always opens an empty door is critical to the analysis. It was present in the original game-show problem and was explicit in Mythbusters.

A probabilistic analysis is straightforward. The chance that the prize is behind the originally-chosen door is 1/3.  It has to be somewhere. So the chance of it being behind the remaining door is 2/3.  You can do this more carefully by enumerating all possibilities, and you get the same answer.
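The enumeration of all possibilities can be written out as a short Python sketch. It assumes the rules exactly as stated above (the host always opens an empty door other than the contestant's), plus one assumption of mine: when the host has two empty doors to choose from, he picks between them at random. The function name is illustrative.

```python
from fractions import Fraction
from itertools import product

DOORS = (0, 1, 2)


def enumerate_monty_hall():
    """Return (P(win by staying), P(win by switching)) by exact enumeration."""
    stay = Fraction(0)
    switch = Fraction(0)
    # Prize location and first pick are independent and uniform: 9 cases.
    for prize, pick in product(DOORS, repeat=2):
        case_weight = Fraction(1, 9)
        # The host may only open a door that is neither picked nor the prize.
        openable = [d for d in DOORS if d != pick and d != prize]
        for opened in openable:
            # ASSUMPTION: the host chooses uniformly among the doors he may open.
            weight = case_weight / len(openable)
            other = next(d for d in DOORS if d != pick and d != opened)
            if pick == prize:
                stay += weight
            if other == prize:
                switch += weight
    return stay, switch


stay, switch = enumerate_monty_hall()
print("stay:", stay, "switch:", switch)
```

Using exact fractions rather than random draws means there is no simulation noise to argue about.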

The conclusion is surprising. Almost everyone gets it wrong, famously including Paul Erdős and many of the readers who wrote to Marilyn vos Savant to dispute her (correct) published answer. Less impressively, so did I as an undergraduate, until I was convinced by writing a computer simulation (I didn’t need to run it; writing it was enough).  The compelling error is probably an example of the endowment effect.

All of the Mythbusters live subjects chose to keep their original choice, ruining the comparison.  The Mythbusters then ran a moderately large series of random choices where one person always switched and the other did not.  They got 38 wins out of 49 for switching and 11 out of 49 for not switching. That’s a bit more extreme than you’d expect, but not unreasonably so. It gives a 95% confidence interval (analogous to the polling margin of error) from 12% to 37% for the chance of winning without switching.
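For anyone who wants to check an interval like this, here is a minimal Python sketch using the exact (Clopper-Pearson) binomial interval, found by bisection. The post doesn't say which method produced the quoted interval, so treat the choice of Clopper-Pearson, and the function names, as assumptions.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def bisect_root(f, low, high, tol=1e-10):
    """Root of a monotone function f on [low, high], by bisection."""
    f_low_positive = f(low) > 0
    while high - low > tol:
        mid = (low + high) / 2
        if (f(mid) > 0) == f_low_positive:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def clopper_pearson(k, n, level=0.95):
    """Exact confidence interval for a binomial proportion, k successes in n."""
    alpha = 1 - level
    # Lower limit: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else bisect_root(
        lambda p: 1 - binom_cdf(k - 1, n, p) - alpha / 2, 0.0, 1.0)
    # Upper limit: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else bisect_root(
        lambda p: binom_cdf(k, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


low, high = clopper_pearson(11, 49)
print(f"not switching: 11/49 wins, 95% interval {low:.0%} to {high:.0%}")
```

The useful comparison is where 1/3 and 1/2 fall relative to the interval, rather than the exact endpoints.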

The Mythbusters are sometimes criticised for insufficient replication, but in this case 49 is plenty to distinguish the ‘obvious’ 50% success rate from the true 33%. It was a very nicely designed experiment.

‘Balanced’ Lotto reporting

From ChCh Press

Are you feeling lucky?

The number drawn most often in Saturday night’s Lotto is one.

The second is seven, the third is lucky 13, followed by 21, 38 and 12.

And if you are selecting a Powerball for Saturday’s draw, the record suggests two is a much better pick than seven.

The numbers are from Lotto Draw Frequency data provided by Lotto NZ for the 1406 Lottery family draws held to last Wednesday.

The Big Wednesday data shows the luckiest numbers are 30, 12, 20, 31, 28 and 16. And heads is drawn more often (232) than tails (216), based on 448 draws to last week.

In theory, selecting the numbers drawn most often would result in more prizes and avoiding the numbers drawn least would result in fewer losses. The record speaks for itself.

Of course this is utter bollocks. The record is entirely consistent with the draw being completely unpredictable, as you would also expect it to be if you’ve ever watched a Lotto draw on television and seen how they work.
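The heads and tails figures quoted in the story make a convenient check. Here is a rough Python sketch of an exact two-sided test of whether 232 heads in 448 draws is surprising for a fair 50:50 draw. The function name is illustrative, and the fair-draw model is the assumption being tested.

```python
from math import comb


def two_sided_fair_coin_p_value(heads, n):
    """Exact chance of a split at least this lopsided in n fair 50:50 draws."""
    expected = n / 2
    deviation = abs(heads - expected)
    # Count all outcomes at least as far from an even split as the one observed.
    extreme = sum(
        comb(n, k) for k in range(n + 1) if abs(k - expected) >= deviation
    )
    return extreme / 2**n


p_value = two_sided_fair_coin_p_value(232, 448)
print(f"chance of a split at least as uneven as 232:216 is {p_value:.2f}")
```

A split this uneven or worse turns up in a large fraction of fair sequences of this length, so the record says nothing about which side to pick.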

This story is better than the ones we used to see, because it does go on and quote people who know what they are talking about, who point out that predicting this way isn’t going to work, and then goes on to say that many people must understand this because they do just take random picks.  On the other hand, that’s the sort of journalistic balance that gets caricatured as “Opinions differ on shape of Earth.”

In world historical terms it doesn’t really matter how these lottery stories are written, but they are missing a relatively simple opportunity to demonstrate that a paper understands the difference between fact and fancy and thinks it matters.

NRL Predictions for Round 12

Team Ratings for Round 12

The basic method is described on my Department home page. I have made some changes to the methodology this year, including shrinking the ratings between seasons.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

| Team | Current Rating | Rating at Season Start | Difference |
|---|---|---|---|
| Roosters | 8.74 | 12.35 | -3.60 |
| Rabbitohs | 6.88 | 5.82 | 1.10 |
| Sea Eagles | 5.77 | 9.10 | -3.30 |
| Bulldogs | 5.12 | 2.46 | 2.70 |
| Cowboys | 3.92 | 6.01 | -2.10 |
| Storm | 2.90 | 7.64 | -4.70 |
| Warriors | 1.33 | -0.72 | 2.00 |
| Broncos | 1.26 | -4.69 | 5.90 |
| Panthers | -0.09 | -2.48 | 2.40 |
| Knights | -0.77 | 5.23 | -6.00 |
| Titans | -2.18 | 1.45 | -3.60 |
| Wests Tigers | -5.17 | -11.26 | 6.10 |
| Eels | -5.91 | -18.45 | 12.50 |
| Raiders | -6.26 | -8.99 | 2.70 |
| Sharks | -6.52 | 2.32 | -8.80 |
| Dragons | -10.80 | -7.57 | -3.20 |

 

Performance So Far

So far there have been 85 matches played, 46 of which were correctly predicted, a success rate of 54.1%.

Here are the predictions for last week’s games.

| # | Game | Date | Score | Prediction | Correct |
|---|---|---|---|---|---|
| 1 | Bulldogs vs. Roosters | May 23 | 12 – 32 | 5.30 | FALSE |
| 2 | Titans vs. Warriors | May 24 | 16 – 24 | 3.10 | FALSE |
| 3 | Wests Tigers vs. Broncos | May 24 | 14 – 16 | -1.90 | TRUE |
| 4 | Raiders vs. Cowboys | May 25 | 42 – 12 | -12.70 | FALSE |
| 5 | Sharks vs. Rabbitohs | May 26 | 0 – 18 | -6.80 | TRUE |

 

Predictions for Round 12

Here are the predictions for Round 12. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

| # | Game | Date | Winner | Prediction |
|---|---|---|---|---|
| 1 | Panthers vs. Eels | May 30 | Panthers | 10.30 |
| 2 | Roosters vs. Raiders | May 31 | Roosters | 19.50 |
| 3 | Cowboys vs. Storm | May 31 | Cowboys | 5.50 |
| 4 | Warriors vs. Knights | Jun 01 | Warriors | 6.60 |
| 5 | Broncos vs. Sea Eagles | Jun 01 | Sea Eagles | -0.00 |
| 6 | Rabbitohs vs. Dragons | Jun 02 | Rabbitohs | 22.20 |
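To see roughly how ratings turn into predicted margins, here is a small Python sketch. It is not the method described on the Department home page: it simply assumes the prediction is the home rating minus the away rating plus a constant home advantage. The 4.5-point value is inferred from the NRL numbers in this post, not stated anywhere in it, and the names are illustrative.

```python
# Current NRL ratings for the teams playing in Round 12, copied from this post.
RATINGS = {
    "Roosters": 8.74, "Rabbitohs": 6.88, "Sea Eagles": 5.77,
    "Cowboys": 3.92, "Storm": 2.90, "Warriors": 1.33, "Broncos": 1.26,
    "Panthers": -0.09, "Knights": -0.77, "Eels": -5.91,
    "Raiders": -6.26, "Dragons": -10.80,
}

# ASSUMPTION: a single constant home advantage, chosen to match the
# published Round 12 NRL predictions. The post does not state this value.
HOME_ADVANTAGE = 4.5


def predict_margin(home, away, home_advantage=HOME_ADVANTAGE):
    """Expected points difference; positive means a home win."""
    return RATINGS[home] - RATINGS[away] + home_advantage


ROUND_12 = [
    ("Panthers", "Eels"), ("Roosters", "Raiders"), ("Cowboys", "Storm"),
    ("Warriors", "Knights"), ("Broncos", "Sea Eagles"), ("Rabbitohs", "Dragons"),
]

for home, away in ROUND_12:
    print(f"{home} vs. {away}: {predict_margin(home, away):+.2f}")
```

The results agree with the published NRL margins to within rounding of the displayed ratings, but that is a consistency check on the sketch, not confirmation of the actual model.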

 

Super 15 Predictions for Round 16

Team Ratings for Round 16

The basic method is described on my Department home page. I have made some changes to the methodology this year, including shrinking the ratings between seasons.

Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

| Team | Current Rating | Rating at Season Start | Difference |
|---|---|---|---|
| Crusaders | 8.13 | 8.80 | -0.70 |
| Sharks | 6.25 | 4.57 | 1.70 |
| Waratahs | 4.91 | 1.67 | 3.20 |
| Bulls | 3.78 | 4.87 | -1.10 |
| Hurricanes | 3.65 | -1.44 | 5.10 |
| Brumbies | 2.67 | 4.12 | -1.40 |
| Chiefs | 2.50 | 4.38 | -1.90 |
| Stormers | 1.38 | 4.38 | -3.00 |
| Blues | -0.70 | -1.92 | 1.20 |
| Highlanders | -1.28 | -4.48 | 3.20 |
| Force | -2.30 | -5.37 | 3.10 |
| Cheetahs | -4.52 | 0.12 | -4.60 |
| Reds | -4.54 | 0.58 | -5.10 |
| Rebels | -5.76 | -6.36 | 0.60 |
| Lions | -7.17 | -6.93 | -0.20 |

 

Performance So Far

So far there have been 94 matches played, 62 of which were correctly predicted, a success rate of 66%.

Here are the predictions for last week’s games.

| # | Game | Date | Score | Prediction | Correct |
|---|---|---|---|---|---|
| 1 | Blues vs. Sharks | May 23 | 23 – 29 | -2.50 | TRUE |
| 2 | Rebels vs. Waratahs | May 23 | 19 – 41 | -6.30 | TRUE |
| 3 | Highlanders vs. Crusaders | May 24 | 30 – 32 | -7.70 | TRUE |
| 4 | Hurricanes vs. Chiefs | May 24 | 45 – 8 | -0.50 | FALSE |
| 5 | Force vs. Lions | May 24 | 29 – 19 | 8.70 | TRUE |
| 6 | Stormers vs. Cheetahs | May 24 | 33 – 0 | 5.20 | TRUE |
| 7 | Bulls vs. Brumbies | May 24 | 44 – 23 | 2.90 | TRUE |

 

Predictions for Round 16

Here are the predictions for Round 16. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

| # | Game | Date | Winner | Prediction |
|---|---|---|---|---|
| 1 | Crusaders vs. Force | May 30 | Crusaders | 14.40 |
| 2 | Reds vs. Highlanders | May 30 | Reds | 0.70 |
| 3 | Chiefs vs. Waratahs | May 31 | Chiefs | 1.60 |
| 4 | Blues vs. Hurricanes | May 31 | Hurricanes | -1.80 |
| 5 | Brumbies vs. Rebels | May 31 | Brumbies | 10.90 |
| 6 | Lions vs. Bulls | May 31 | Bulls | -8.50 |
| 7 | Sharks vs. Stormers | May 31 | Sharks | 7.40 |

 

$5 million followup

“It’s gettable, but it’s hard – that’s why it’s five million dollars.”

“The chances of picking every game correctly were astronomical”

  • NBR (paywalled)

“crystal ball gazing of such magnitude that University of Auckland statistics expert associate professor David Scott doesn’t think either will have to pay out.”

“quite hard to win”

“someone like you [non-expert] has as much chance because [an expert] wouldn’t pick an upset”

“An expert is less likely to win it than someone who just has a shot at it.”

“It’s only 64 games and, as I say, there’s only 20 tricky ones I reckon”

 

Yeah, nah.