Looking for ChatGPT again
When I wrote about ChatGPT detection last time, I said that overall accuracy wasn’t enough; you’d also need to know the accuracy for relevant groups of people:
In addition to knowing the overall false positive rate, we’d want to know the false positive rate for important groups of students. Does using a translator app on text you wrote in another language make you more likely to get flagged? Using a grammar checker? Speaking Kiwi? Are people who use semicolons safe?
Some evidence is accumulating. The automated detectors can tell opinion pieces published in Science from ChatGPT imitations (of limited practical use, since Science has an actual list of things it has published).
More importantly, there’s a new preprint that claims the detectors do extremely poorly on material written by non-native English speakers. Specifically, on essays from the TOEFL English exam, the false positive rate averaged over several detectors was over 50%. The preprint also claims that ChatGPT could be used to edit the TOEFL essays (“Enhance the word choices to sound more like that of a native speaker”) or its own creations (“Elevate the provided text by employing advanced technical language”) to reduce detection.
False positives for non-native speakers are an urgent problem with using the detectors in education. Non-native speakers may already fall under more suspicion, which makes false positives even more damaging for them. However, it’s quite possible that future versions of the detectors can reduce this specific bias (and it will be important to verify that they do).
The ability to get around the detectors by editing is a longer-term problem. If you have access to a publicly-available detector and the ability to modify your text, you can keep making changes until the detector no longer reports a problem. There’s no real way around this, and if the rewriting can be automated it becomes even easier. Keeping the detector unavailable to students would remove the ability to edit against it, but “this program we won’t let you see says you cheated” shouldn’t be an acceptable solution either.
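To make the point concrete, here’s a minimal sketch of that edit-until-it-passes loop. Both `detector_score` and `rewrite` are placeholder names I’ve made up for illustration, standing in for a real detector API and a real rewriting step (such as asking ChatGPT to “enhance the word choices”); they are not any particular product’s interface.

```python
import random


def detector_score(text: str) -> float:
    """Placeholder for a call to a public detector: returns the
    detector's estimated probability that `text` is AI-generated."""
    return random.random()  # stand-in; a real attack would call the detector's API


def rewrite(text: str) -> str:
    """Placeholder for a rewriting step, e.g. prompting an LLM to
    'enhance the word choices to sound more like a native speaker'."""
    return text  # stand-in; a real attack would return an edited paraphrase


def evade(text: str, threshold: float = 0.5, max_rounds: int = 10) -> str:
    """Keep editing the text until the detector no longer flags it,
    or until we give up after max_rounds attempts."""
    for _ in range(max_rounds):
        if detector_score(text) < threshold:
            return text       # detector no longer reports a problem
        text = rewrite(text)  # otherwise, edit and try again
    return text
```

If the rewriting step is itself an automated call rather than hand editing, the whole loop runs without any human effort, which is exactly why public access to the detector undermines it.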