Posts from March 2013 (75)

March 31, 2013

A simple genetics question

A rocket scientist and winner of the National Medal of Technology and Innovation died recently, and has an obituary in the New York Times.  The first paragraph of the obituary is about family and cooking.

Can you guess how many X chromosomes the scientist had?

 

[Yes, of course,  writing about her family is fine, especially as family life was clearly very important to her. But leading with beef stroganoff?]

[Update: the NYT has thought better of the stroganoff: see Newsdiffs for a comparison of the old and new versions.]

Briefly

Easter trading rules don’t appear to forbid blogging today, so here are a few links:

  • Using words like “common”, “uncommon”, “rare”, “very rare” to describe risks of drug side-effects is recommended by guidelines,  and patients like it better than numbers, but it leads to serious overestimation of the actual risks (PDF poster, via Hilda Bastian)
  • A map of gun deaths in the US since the Sandy Hook shootings
  • Stuff’s small-business section says: “Scientists believe the Kiwifruit virus Psa came to New Zealand in a 2009 shipment of flowers.” I hope it’s just the newspaper, not the scientists, that thinks Psa is a virus
  • Another story about petrol prices in the Herald, linked to remind you all that the government collects and publishes data.  You can find it, even if AA, the petrol companies, and the media can’t.  This time AA seems to be right: the importer margin is about 4c above the trend line, which itself is up 5c on last year.
March 30, 2013

Confirmation bias

From the Waikato Times, two quotes from a story on emergency services

He would not comment on what motivated the fracas or whether it was gang-related.

“We’re not jumping to conclusions.”

and

Though science dismisses any link between human behaviour and the moon, it’s cold comfort for hospitality staff and emergency workers who say the amount of trouble often spikes when the moon is at its brightest.

Ms Gill said staff reported that “the full moon often has an impact on the nature of presentations through ED”.

Science doesn’t dismiss a link. There’s nothing unscientific about the idea of a link. It’s just that people have looked carefully and it’s not true.

(via @petrajane)

March 29, 2013

Unclear on the concept: average time to event

One of our current Stat of the Week nominations is a story on Stuff claiming that criminals sentenced to preventive detention are being freed after an average of ‘only’ 11 years.

There’s a widely-linked story in the Guardian claiming that the average time until Google kills new services is 1459 days, based on services that have been cancelled in the past.  The story even goes on to say that more recent services have been cancelled more quickly.

As far as I know, no-one has yet produced a headline saying that the average life expectancy  for people born in the 21st century is only about 5 years, but the error in reasoning would be the same.

In all three cases, we’re interested in the average time until some event happens, but our data are incomplete, because the event hasn’t happened for everyone.  Some Google services are still running; some preventive-detention cases are still in prison; some people born this century are still alive.  A little thought reveals that the events which have occurred are a biased sample: they are likely to be the earliest events.   The 21st century kids who will live to 90 are still alive; those who have already died are not representative.

In medical statistics, the proper handling of times to death, to recurrence, or to recovery is a routine problem.  It’s still not possible to learn as much as you’d like without assumptions that are often unreasonable. The most powerful assumption you can make is that the rate of events is constant over time, in which case the life expectancy is the total observed time divided by the total number of events — you need to count all the observed time, even for the events that haven’t happened yet.  That is, to estimate the survival time for Google services, you add up all the time that all the Google services have operated, and divide by the number that have been cancelled.  People in the cricket-playing world will recognise this as the computation used for batting averages: total number of runs scored, divided by total number of times out.
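As a minimal sketch of the difference, here is the constant-rate estimate next to the naive average over completed events only. The function names and the six durations are made up for illustration; they are not the Google or preventive-detention data.

```python
def naive_mean(durations, event_observed):
    """Average duration over completed events only (the biased estimate)."""
    done = [d for d, e in zip(durations, event_observed) if e]
    return sum(done) / len(done)


def constant_rate_estimate(durations, event_observed):
    """Total observed time divided by the number of events.

    Only sensible if the event rate is roughly constant over time.
    """
    events = sum(event_observed)
    if events == 0:
        raise ValueError("no events observed, so the estimate is undefined")
    return sum(durations) / events


# Hypothetical example: years of operation for six imaginary services.
# True means the service has been cancelled; False means it is still running.
durations = [1.0, 2.0, 3.0, 6.0, 8.0, 10.0]
cancelled = [True, True, True, False, False, False]

print(naive_mean(durations, cancelled))              # 2.0
print(constant_rate_estimate(durations, cancelled))  # 10.0
```

The still-running services contribute time to the numerator but nothing to the denominator, which is exactly what a not-out innings does to a batting average.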

The simple estimator is often biased, since the risk of an event may increase or decrease with time.  A new Google service might be more at risk than an established one; a prisoner detained for many years might be less likely to be released than a more recent convict.  Even so, using it distinguishes people who have paid some attention to the survivors from those who haven’t.

I can’t be bothered chasing down the history of all the Google services, but if we add in search (from 1997), AdWords (from 2000), image search (2001), news (2002), Maps, Analytics, Scholar, Talk, and Transit (2005), and count Gmail only from when it became open to all in 2007, we increase the estimated life expectancy for a Google service from the 4 years quoted in the Guardian to about 6.5 years. Adding in other still-live services can only increase this number.

For a serious question such as the distribution of time in preventive detention you would need to consider trends over time, and differences between criminals, and the simple constant-rate model would not be appropriate.  You’d need a bit more data, unless what you wanted was just a headline.

March 28, 2013

Briefly

  • And since it’s a long weekend coming up: something that’s not remotely statistics, but is Quite Interesting. Siouxsie Wiles has another bioluminescence animation up on YouTube, on the Hawaiian bobtail squid, invisibility cloaks, and quorum sensing.

NRL Predictions, Round 4

Team Ratings for Round 4

Here are the team ratings prior to Round 4, along with the ratings at the start of the season. I have created a brief description of the method I use for predicting rugby games. Go to my Department home page to see this.

Team           Current Rating   Rating at Season Start   Difference
Storm          11.97            9.73                     2.20
Sea Eagles     7.51             4.78                     2.70
Bulldogs       6.09             7.33                     -1.20
Rabbitohs      5.95             5.23                     0.70
Cowboys        3.29             7.05                     -3.80
Knights        3.23             0.44                     2.80
Titans         2.03             -1.85                    3.90
Sharks         -0.15            -1.78                    1.60
Broncos        -1.02            -1.55                    0.50
Raiders        -2.91            2.03                     -4.90
Dragons        -3.98            -0.33                    -3.70
Wests Tigers   -4.51            -3.71                    -0.80
Panthers       -5.37            -6.58                    1.20
Roosters       -5.72            -5.68                    -0.00
Eels           -6.81            -8.82                    2.00
Warriors       -13.32           -10.01                   -3.30

Performance So Far

So far there have been 24 matches played, 17 of which were correctly predicted, a success rate of 70.83%.

Here are the predictions for last week’s games.

Game  Match                        Date    Score     Prediction  Correct
1     Storm vs. Bulldogs           Mar 21  22 – 18   11.97       TRUE
2     Wests Tigers vs. Eels        Mar 22  31 – 18   5.25        TRUE
3     Titans vs. Sea Eagles        Mar 23  16 – 14   -1.73       FALSE
4     Roosters vs. Broncos         Mar 23  8 – 0     -2.26       FALSE
5     Sharks vs. Warriors          Mar 24  28 – 4    16.09       TRUE
6     Panthers vs. Rabbitohs       Mar 24  32 – 44   -5.52       TRUE
7     Raiders vs. Dragons          Mar 24  30 – 17   3.72        TRUE
8     Knights vs. Cowboys          Mar 25  34 – 6    -1.44       FALSE
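The Correct column is consistent with a simple rule: a prediction counts as correct when its sign matches the sign of the actual margin, with positive meaning a home win. A rough sketch, assuming the score is listed home team first and treating a draw as a non-home-win (the function name and that draw rule are my assumptions):

```python
def prediction_correct(home_score, away_score, predicted_margin):
    """True when the predicted and actual margins favour the same team."""
    actual_margin = home_score - away_score
    return (actual_margin > 0) == (predicted_margin > 0)


# (home, away, home score, away score, predicted margin), copied from the table
last_week = [
    ("Storm", "Bulldogs", 22, 18, 11.97),
    ("Wests Tigers", "Eels", 31, 18, 5.25),
    ("Titans", "Sea Eagles", 16, 14, -1.73),
    ("Roosters", "Broncos", 8, 0, -2.26),
    ("Sharks", "Warriors", 28, 4, 16.09),
    ("Panthers", "Rabbitohs", 32, 44, -5.52),
    ("Raiders", "Dragons", 30, 17, 3.72),
    ("Knights", "Cowboys", 34, 6, -1.44),
]

results = [prediction_correct(hs, as_, p) for _, _, hs, as_, p in last_week]
print(results)
print(sum(results), "of", len(results), "correct")
```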

 

Predictions for Round 4

Here are the predictions for Round 4. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

Game  Match                        Date    Winner       Prediction
1     Sea Eagles vs. Wests Tigers  Mar 28  Sea Eagles   16.50
2     Bulldogs vs. Rabbitohs       Mar 29  Bulldogs     4.60
3     Broncos vs. Storm            Mar 29  Storm        -8.50
4     Sharks vs. Dragons           Mar 30  Sharks       8.30
5     Panthers vs. Titans          Mar 31  Titans       -2.90
6     Knights vs. Raiders          Mar 31  Knights      10.60
7     Warriors vs. Cowboys         Apr 01  Cowboys      -12.10
8     Roosters vs. Eels            Apr 01  Roosters     5.60
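The published margins above are consistent with a simple form: home rating minus away rating, plus a fixed home-ground advantage of about 4.5 points. That figure and the formula are inferred from the numbers in the two tables; they are not stated in the post, and the actual method is the one described on the Department home page. A sketch under that assumption:

```python
# Assumption: inferred from the published tables, not stated in the post.
HOME_ADVANTAGE = 4.5

# Current ratings prior to Round 4, copied from the ratings table.
ratings = {
    "Storm": 11.97, "Sea Eagles": 7.51, "Bulldogs": 6.09, "Rabbitohs": 5.95,
    "Cowboys": 3.29, "Knights": 3.23, "Titans": 2.03, "Sharks": -0.15,
    "Broncos": -1.02, "Raiders": -2.91, "Dragons": -3.98, "Wests Tigers": -4.51,
    "Panthers": -5.37, "Roosters": -5.72, "Eels": -6.81, "Warriors": -13.32,
}


def predict(home, away, home_advantage=HOME_ADVANTAGE):
    """Expected points difference; positive means a home win."""
    return ratings[home] - ratings[away] + home_advantage


def predicted_winner(home, away):
    """Team favoured by the sign of the predicted margin."""
    return home if predict(home, away) > 0 else away


# (home, away, published prediction), copied from the Round 4 table
round4 = [
    ("Sea Eagles", "Wests Tigers", 16.50), ("Bulldogs", "Rabbitohs", 4.60),
    ("Broncos", "Storm", -8.50), ("Sharks", "Dragons", 8.30),
    ("Panthers", "Titans", -2.90), ("Knights", "Raiders", 10.60),
    ("Warriors", "Cowboys", -12.10), ("Roosters", "Eels", 5.60),
]

for home, away, published in round4:
    margin = predict(home, away)
    print(f"{home} vs. {away}: {margin:.2f} (published {published:.2f}), "
          f"winner {predicted_winner(home, away)}")
```

All eight reconstructed margins land within 0.05 of the published values, which is what you would expect if the published predictions were rounded to one decimal place.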

 

Super 15 Predictions, Round 7

Team Ratings for Round 7

This year the predictions have been slightly changed with the help of a student, Joshua Dale. The home ground advantage now differs depending on whether the two teams are from the same country or from different countries. The basic method is described on my Department home page.

Here are the team ratings prior to Round 7, along with the ratings at the start of the season.

Team           Current Rating   Rating at Season Start   Difference
Crusaders      8.94             9.03                     -0.10
Chiefs         8.51             6.98                     1.50
Sharks         5.74             4.57                     1.20
Stormers       3.93             3.34                     0.60
Brumbies       3.83             -1.06                    4.90
Hurricanes     2.21             4.40                     -2.20
Bulls          1.65             2.55                     -0.90
Blues          -0.21            -3.02                    2.80
Reds           -1.31            0.46                     -1.80
Cheetahs       -3.51            -4.16                    0.60
Highlanders    -5.24            -3.41                    -1.80
Waratahs       -6.08            -4.10                    -2.00
Force          -9.29            -9.73                    0.40
Kings          -10.04           -10.00                   -0.00
Rebels         -13.95           -10.64                   -3.30

Performance So Far

So far there have been 35 matches played, 25 of which were correctly predicted, a success rate of 71.4%.

Here are the predictions for last week’s games.

Game  Match                        Date    Score     Prediction  Correct
1     Chiefs vs. Highlanders       Mar 22  19 – 7    17.10       TRUE
2     Crusaders vs. Kings          Mar 23  55 – 20   20.70       TRUE
3     Reds vs. Bulls               Mar 23  23 – 18   0.30        TRUE
4     Force vs. Cheetahs           Mar 23  10 – 19   -0.40       TRUE
5     Sharks vs. Rebels            Mar 23  64 – 7    17.30       TRUE
6     Stormers vs. Brumbies        Mar 23  35 – 22   2.40        TRUE
7     Waratahs vs. Blues           Mar 24  30 – 27   -2.80       FALSE

Predictions for Round 7

Here are the predictions for Round 7. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

Game  Match                        Date    Winner       Prediction
1     Highlanders vs. Reds         Mar 29  Highlanders  0.10
2     Hurricanes vs. Kings         Mar 30  Hurricanes   16.20
3     Chiefs vs. Blues             Mar 30  Chiefs       11.20
4     Brumbies vs. Bulls           Mar 30  Brumbies     6.20
5     Cheetahs vs. Rebels          Mar 30  Cheetahs     14.40
6     Stormers vs. Crusaders       Mar 30  Crusaders    -1.00
7     Waratahs vs. Force           Mar 31  Waratahs     5.70

March 27, 2013

Sensor data and the monitored life

The Herald has had two good stories recently (and other outlets have run similar ones) about new data collection and analysis with smartphone apps. If you read the stories carefully there’s been an outbreak of synecdoche in the newsroom (they are actually stories about new sensors that collect data, and smartphone apps that can be used to monitor the sensors), but that’s a minor detail.

As sensors based on electronics and micromechanics become cheap, it’s easy to collect more and more data.  This can be valuable — measuring blood glucose, or pollutants in the air — but it can also get out of hand.  Epidemiologist Hilda Bastian gives us this cartoon of the “Self-Stalker 5000” and writes about evidence-based monitoring in her blog at Scientific American

[Cartoon: Self-Stalker 5000]

and Nick Alex Harrowell points out

The stereotype application could be defined as “bugging granny”. We’re going to check some metrics at intervals, stick them into a control chart, and then badger you about it.

To be fair, he’s worrying most about sensors that don’t send to a smartphone app, but many of the same principles apply.

Does data visualisation matter?

“I wish there were more examples where data viz actually mattered. The case studies for us to lean on are sparser than they should be.”

Amanda Cox, NY Times chartmaker, interviewed at Harvard Business Review.  Includes a graph showing how the same unemployment report might be viewed by partisans of opposing parties.

Why PDFs are not open data

The US non-profit journalism group ProPublica has written a number of good stories recently about drug company payments to doctors. They used the data that some US states force the drug companies to release. They have a new story explaining why this isn’t as easy as it sounds, since there was no requirement that the data be released in any useful form. For example, a lot of it was in PDF files. As they write:

Here’s how a PDF works, deep down: It positions text by placing each character at minutely precise coordinates in relation to the bottom-left corner of the page. It does something similar for other elements like images. A PDF knows about shapes, characters and their precise positions on the page. Even if a PDF looks like a spreadsheet — in fact, even when it’s made using Microsoft Excel — the PDF format doesn’t retain any sense of the “cells” that once contained the data.

They used a wide range of techniques: in some cases they could use the grid cells on the tables to work out which digits belonged to the same number, but in other cases they basically had to treat the PDF file as an image and use optical text recognition software on it, just as you would for a scanned bitmap. Most people wouldn’t go to these heroic lengths, and would rapidly decide to investigate other exciting stories.
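To give a flavour of why this is fiddly, here is a toy sketch of rebuilding rows and numbers from individually positioned characters. The coordinates, tolerances, and function name are all made up for illustration; this is not ProPublica’s code, and a real PDF would first need a library to extract the character positions.

```python
def rebuild_rows(chars, row_tol=2.0, max_gap=7.0):
    """Group (x, y, char) tuples into rows of tokens.

    Characters whose y values are within row_tol of the first character in a
    row are put on that row; within a row, a horizontal gap wider than
    max_gap starts a new token.
    """
    rows = []
    # PDF y coordinates grow upward from the bottom-left corner,
    # so sort by descending y to read from the top of the page down.
    for x, y, ch in sorted(chars, key=lambda c: -c[1]):
        if rows and abs(rows[-1][0] - y) <= row_tol:
            rows[-1][1].append((x, ch))
        else:
            rows.append((y, [(x, ch)]))

    table = []
    for _, row_chars in rows:
        row_chars.sort()  # left to right
        tokens, prev_x = [], None
        for x, ch in row_chars:
            if prev_x is None or x - prev_x > max_gap:
                tokens.append(ch)   # start a new number
            else:
                tokens[-1] += ch    # same number as the previous digit
            prev_x = x
        table.append(tokens)
    return table


# Hypothetical characters, in the arbitrary order a PDF might store them.
chars = [
    (150.0, 700.0, "3"), (100.0, 700.0, "1"), (156.0, 688.0, "8"),
    (106.0, 699.6, "2"), (162.0, 700.2, "5"), (100.0, 688.4, "6"),
    (156.0, 699.8, "4"), (150.0, 688.0, "7"),
]

table = rebuild_rows(chars)
print(table)  # [['12', '345'], ['6', '78']]
```

Even in this tidy example the answer depends on two hand-picked tolerances; with proportional fonts, wrapped cells, or slightly skewed scans, choosing them becomes guesswork.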

Even Excel spreadsheets are only useful open data formats if they are structured so that it’s easy for a computer to find and extract the actual numbers from the worksheet. Stats NZ, who realise this, try to have data available both as Excel spreadsheets designed for visual display and in some useful downloadable form. Some other sources of NZ official data are not as helpful.

(via @adzebill on Twitter)