One of our current Stat of the Week nominations is a story on Stuff claiming that criminals sentenced to preventive detention are being freed after an average of ‘only’ 11 years.
There’s a widely-linked story in the Guardian claiming that the average time until Google kills new services is 1459 days, based on services that have been cancelled in the past. The story even goes on to say that more recent services have been cancelled more quickly.
As far as I know, no-one has yet produced a headline saying that the average life expectancy for people born in the 21st century is only about 5 years, but the error in reasoning would be the same.
In all three cases, we’re interested in the average time until some event happens, but our data are incomplete, because the event hasn’t happened for everyone. Some Google services are still running; some people sentenced to preventive detention are still in prison; some people born this century are still alive. A little thought reveals that the events which have already occurred are a biased sample: they are likely to be the earliest events. The 21st century kids who will live to 90 are still alive; those who have already died are not representative.
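To see how large the bias can be, here is a minimal simulation sketch. Everything in it is made up for illustration: the function name `naive_average_lifetime`, the true mean lifetime of 10 years, the 5-year observation window, and the assumption that lifetimes follow an exponential distribution (that is, a constant event rate). It averages the lifetimes of only those units whose event has already happened by the end of the window.

```python
import random


def naive_average_lifetime(n=10000, true_mean=10.0, window=5.0, seed=0):
    """Average lifetime among only the units whose event has already occurred.

    Hypothetical setup: each unit starts at a random time inside the
    observation window and has an exponential lifetime with mean `true_mean`.
    """
    rng = random.Random(seed)
    finished = []
    for _ in range(n):
        start = rng.uniform(0.0, window)             # when the unit started
        lifetime = rng.expovariate(1.0 / true_mean)  # its true time to event
        if start + lifetime <= window:               # event seen before "now"
            finished.append(lifetime)
        # Units whose event has not happened yet are silently dropped here,
        # which is exactly the mistake being illustrated.
    return sum(finished) / len(finished)


print(naive_average_lifetime())
```

Whatever the true mean is, the number printed cannot exceed the length of the window, because no completed lifetime can be longer than the time we have been watching. With a true mean of 10 years and a 5-year window, the naive average has to come out below 5.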
In medical statistics, the proper handling of times to death, to recurrence, or to recovery is a routine problem. It’s still not possible to learn as much as you’d like without assumptions that are often unreasonable. The most powerful assumption you can make is that the rate of events is constant over time, in which case the life expectancy is the total observed time divided by the total number of events. You need to count all the observed time, including the time for cases where the event hasn’t happened yet. That is, to estimate the survival time for Google services, you add up all the time that all the Google services have operated, cancelled or not, and divide by the number that have been cancelled. People in the cricket-playing world will recognise this as the computation used for batting averages: total number of runs scored, divided by total number of times out.
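As a sketch of that calculation, assuming a constant event rate: the function name `constant_rate_life_expectancy` and the five durations below are invented for illustration and are not data about real Google services.

```python
def constant_rate_life_expectancy(durations, event_happened):
    """Total observed time divided by the number of events.

    durations      -- observed time for every unit, including those
                      whose event has not happened yet
    event_happened -- True where the event (cancellation, release, death)
                      has been observed, False where the unit is still going
    """
    if len(durations) != len(event_happened):
        raise ValueError("durations and event_happened must have equal length")
    events = sum(1 for happened in event_happened if happened)
    if events == 0:
        raise ValueError("no events observed, so the estimate is undefined")
    return sum(durations) / events


# Hypothetical example: three cancelled services and two still running.
years = [2.0, 3.0, 1.0, 10.0, 8.0]
cancelled = [True, True, True, False, False]

# Naive average over cancelled services only: (2 + 3 + 1) / 3
naive = sum(y for y, c in zip(years, cancelled) if c) / sum(cancelled)

# Constant-rate estimate: (2 + 3 + 1 + 10 + 8) / 3
estimate = constant_rate_life_expectancy(years, cancelled)

print(naive, estimate)  # 2.0 8.0
```

The survivors contribute time to the numerator but nothing to the denominator, just as a not-out innings adds runs without adding a dismissal.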
The simple estimator is often biased, since the risk of an event may increase or decrease with time. A new Google service might be more at risk than an established one; a prisoner detained for many years might be less likely to be released than a more recent convict. Even so, using it distinguishes people who have paid some attention to the survivors from those who haven’t.
I can’t be bothered chasing down the history of all the Google services, but if we add in search (from 1997), AdWords (from 2000), image search (2001), news (2002), Maps, Analytics, Scholar, Talk, and Transit (2005), and count Gmail only from when it became open to all in 2007, we increase the estimated life expectancy for a Google service from the 4 years quoted in the Guardian to about 6.5 years. Adding in other still-live services can only increase this number.
For a serious question such as the distribution of time in preventive detention you would need to consider trends over time, and differences between criminals, and the simple constant-rate model would not be appropriate. You’d need a bit more data, unless what you wanted was just a headline.