Posts written by Thomas Lumley (2542)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

January 20, 2025

Do I look like Wikipedia?

Ipsos, the polling firm, has again been asking people questions to which they can’t reasonably be expected to know the answer, and finding that they in fact don’t.  For example, this graph shows what happens when you ask people what proportion of their country are immigrants, i.e., born in another country.  Everywhere except Singapore they overestimate the proportion, often by a lot.  New Zealand comes off fairly well here, with only slight overestimation. South Africa and Canada do quite badly. Indonesia, notably, has almost no immigrants but thinks it has 20%.

Some of this is almost certainly prejudice, but to be fair the only way you could know these numbers reliably would be if someone did a reliable national count and told you.  Just walking around Auckland you can’t tell accurately who is an immigrant, and you certainly can’t walk around Auckland and tell how many immigrants there are in Ashburton.  Specifically, while you might know from your own knowledge how many immigrants there were in your part of the country, it would be very unusual for you to know this for the country as a whole.  You might expect, then, that the best-case response to surveys such as these would be an average of immigrant proportions over areas in the country, weighted towards the populous areas where people spend their time. If the proportion of immigrants is correlated with population density, that average will be higher than the true nationwide proportion.

That is to say, if people in Auckland accurately estimate the proportion of immigrants in Auckland, and people in Wellington accurately estimate the proportion in Wellington, and people in Huntly accurately estimate the proportion in Huntly, and people in Central Otago accurately estimate the proportion in Central Otago, you don’t get an accurate nationwide estimate if areas with more people have a higher proportion of immigrants. Which, in New Zealand, they do.  If we work with regions and Census data, the correlation between population and proportion born overseas is about 50%.  That’s enough for about a 5 percentage point bias: we would expect to overestimate the proportion of immigrants by about 5 percentage points if everyone based their survey response on the true proportion in their part of the country.
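As a minimal sketch of that mechanism, here are some invented regional figures (not Census data) in Python. The true nationwide share is the population-weighted average; the “perceived” share uses a toy assumption of my own, weighting regions by population squared, to stand in for everyone’s experience over-representing big, dense places:

```python
import numpy as np

# Invented regional figures for illustration -- not Census data.
# Populous regions get higher immigrant shares, mimicking the roughly
# 50% correlation mentioned above.
population = np.array([1_700_000, 550_000, 650_000, 900_000, 700_000, 500_000])
share      = np.array([0.42, 0.32, 0.25, 0.18, 0.15, 0.12])

# True nationwide proportion: total immigrants over total population.
true_share = (population * share).sum() / population.sum()

# Toy model of perception: experience over-represents populous, dense
# places (commuting, shopping, media), crudely captured here by
# weighting each region by its population squared.
weight = population.astype(float) ** 2
perceived = (weight * share).sum() / weight.sum()

print(f"true nationwide share: {true_share:.1%}")   # about 28%
print(f"perceived share:       {perceived:.1%}")    # about 32%
```

With figures like these the gap is about five percentage points, the size of bias suggested above; the exact number of course depends on the invented weights.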

Fortunately, if the proportion of immigrants in your neighbourhood or in the country as a whole matters to you, you don’t need to guess. Official statistics are useful! Someone has done an accurate national count, and while they probably didn’t tell you, they did put the number somewhere on the web for you to look up.

January 17, 2025

Briefly

  • Official statistics agencies are very conservative about survey questions, because changing them causes problems.  Another example: in the last US census, the number of people reporting more than one ethnicity increased. The Census Bureau said

“These improvements reveal that the U.S. population is much more multiracial and diverse than what we measured in the past,” Census Bureau officials said at the time.

But does that mean there are more people now with the same sort of multiple heritage, or that the same people are newly identifying as multi-ethnic, or just that the question has changed? According to the Associated Press, new research suggests it’s mostly measurement.

  • “And surveys are especially useless when respondents have the option of answering in a way that is both ‘respectable’ and self-flattering.”  Fred Clark, talking about a survey of ‘politics’ in religion in the US.
  • Greater Auckland on last year’s road deaths. It’s a good post, with breakdowns by subgroup and discussion of appropriate denominators and so on. I’d still ideally like to see random variability shown in these sorts of trend lines.  The simplest level of this, so-called Poisson variability, is fairly easy (see the sketch below): you take a reported count, take the square root, add and subtract 1 to get limits, and square again. You don’t need to go to the lengths of full-on Bayesian modelling unless you want to make stronger claims.
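For the record, here is that square-root recipe as code, with a made-up count. It works because the square root of a Poisson count has standard deviation close to 1/2, so adding and subtracting 1 is roughly a two-standard-error interval:

```python
from math import sqrt

def poisson_limits(count):
    """Quick ~95% limits for a Poisson count: take the square root,
    add and subtract 1 to get limits, and square again."""
    root = sqrt(count)
    return max(root - 1, 0) ** 2, (root + 1) ** 2

# e.g. a year with 289 reported deaths (a hypothetical count)
print(poisson_limits(289))   # (256.0, 324.0)
```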
January 14, 2025

Only a flesh wound

An article from ABC News in Adelaide, South Australia, describes incidents where fencing wire was strung across a bike path.  According to police

The riders were travelling about 35 kilometres per hour and fell from their bikes. Two suffered minor injuries, while the third was not injured.

Police said each of their bicycles was severely damaged.

That sounds at first like extraordinary good luck: if you come off a bike at 35 km/h and your bike was wrecked, you’d expect to be damaged too.  I think the problem, as with a lot of discussions of road crashes, is the official assessment metrics for injuries.  In South Australia, according to this and similar documents:

Serious Injury – A person who sustains injuries and is admitted to hospital for a duration of at least 24 hours as a result of a road crash and who does not die as a result of those injuries within 30 days of the crash.

Minor Injury – A person who sustains injuries requiring medical treatment, either by a doctor or in a hospital, as a result of a road crash and who does not die as a result of those injuries within 30 days of the crash.

A broken bone leading to substantial disability might easily not be a Serious Injury, and several square inches of road rash may well not be even a Minor Injury. (New Zealand has the same definition of a “serious” injury, one that gets you admitted to hospital for an overnight stay, but doesn’t have restrictive standards for minor injury.)

It’s not that these definitions are necessarily bad for collecting data — there’s a lot to be said for a definition that’s administratively checkable — but it does mean you might want to translate from officialese to ordinary language when reporting individual injuries or aggregated statistics to ordinary people.


Update: One of the cyclists in the first group has talked to the ABC. One of the “minor injuries” required five stitches.

January 10, 2025

All-day breakfast

Q: Did you see that coffee only works if you drink it in the morning?

A: Works?

Q: Reduces your risk of heart disease

A: That’s … not my top reason for drinking coffee. Maybe not even top three.

Q: But is it true?

A: It could be, but probably not

Q: Mice?

A: No, this is in people. Twenty years of survey responses from the big US health survey called NHANES. They divided people into three groups depending on their coffee consumption on the day of their dietary interview: no coffee, coffee only between 4am and 11:59am, and coffee throughout the day

Q: Couldn’t there be other differences between people who drink coffee in the morning and afternoon? Like, cultural differences or age or health or something?

A: Yes, the researchers aimed to control for these statistically: differences in age, sex, race and ethnicity, survey year, family income, education levels, body mass index, diabetes, hypertension, high cholesterol, smoking status, time of smoking cessation, physical activity, Alternative Healthy Eating Index, total calorie intake, caffeinated coffee intake, decaffeinated coffee intake, tea intake, caffeinated soda intake, short sleep duration, and difficulty sleeping

Q: Um. Wow?

A: I mean, it’s a good effort. One of the real benefits of NHANES is it measures so much stuff.  On the other hand a lot of these things aren’t measured all that precisely, and it’s not like you have a lot of people left when you chop up the sample that many ways.  And the evidence for any difference is pretty marginal

Q: What’s their theory about how the coffee is supposed to be working?

A: The Guardian and BBC versions of the story quote experts who think it’s all about coffee disrupting sleep

Q: That sounds kind of plausible — but didn’t you say they adjusted for sleep?

A: Yes, so if the adjustments work it isn’t sleep

Q: Should we be campaigning for cafes to close early, like the new Auckland alcohol regulations?

A: It’s much too early for that, and in any case there isn’t any real suggestion coffee is harmful after noon.  It might be worth someone repeating the research in a very different population from the US but where people still drink coffee. There are plenty of those.

Q: And what about advice to readers on their coffee consumption?

A: The standard StatsChat advice: if you’re drinking coffee in the morning primarily for the good of your heart, you may be doing it wrong.


January 9, 2025

Briefly

  • The top baby names from New Zealand last year are out.  As we’ve seen in the past, the most-common names keep getting less common. “Noah” came top for boys, with only 250 uses, and “Isla” for girls, with only 190 uses.
  • The Daily Mail (because of course) has something purporting to be a map of penis sizes around the world, credited to this site, which gives no sources for the data. Wikipedia (possibly NSFW) points out that a lot of data on this topic is self-reported claims, noting that “measurements vary, with studies that rely on self-measurement reporting a significantly higher average than those with a health professional measuring.” Even when it’s measured, it tends to be on volunteer samples, and there isn’t good standardisation of measurement protocols across sites.
  • If you live in one of these Aussie suburbs buy a lottery ticket NOW, says the headline on MSN.com, from the Daily Mail (Australia version).  This is a much more extreme headline than the NZ versions I usually complain about, and the text is more measured. Of course, there are two reasons why a suburb will see more lottery wins. The first is just chance, which doesn’t project into the future like that (see the sketch after this list). The second is that these are suburbs where more money is lost on the lottery. Those trends probably will continue, but lottery advertising stories never seem to print the amounts lost on lotto.
  • We’ve seen a number of times that salary/wage ranges generated from advertising at Seek are not very similar to those reported from actual payments by StatsNZ.  This is worse: via Carl Bergstrom and Eduardo Hebkost, on Bluesky, apparently ziprecruiter.com will (in the US at least; not in NZ) give you salaries for any job you ask about, if you just forge a URL pointing to where the graph should be.
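On the lottery item: a toy simulation shows why chance wins don’t project forward. Give every suburb the same expected number of division-one wins (all numbers here invented), and last year’s ‘luckiest’ suburbs do no better than anywhere else next year:

```python
import numpy as np

rng = np.random.default_rng(2025)

# 200 suburbs buying tickets at the same rate: wins land purely by
# chance, Poisson with the same mean everywhere.
n_suburbs, mean_wins = 200, 1.5
year1 = rng.poisson(mean_wins, n_suburbs)
year2 = rng.poisson(mean_wins, n_suburbs)

top10 = np.argsort(year1)[-10:]   # the 'luckiest' suburbs of year 1
print("year-2 wins, year-1 top 10:  ", year2[top10].mean())
print("year-2 wins, everywhere else:", np.delete(year2, top10).mean())
# Both averages come out near 1.5: last year's luck tells you nothing.
```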
January 4, 2025

Size matters

Ok, this is a bit late, but I didn’t see the poll (in a physical Sunday Star-Times) until this week.  An established Australian polling firm, Freshwater Strategy, has been doing polls here, too.  Stuff reports that the poll (also, at the Post)

…reveals 37% of New Zealand voters have seriously considered emigrating to Australia in the past 12 months.

By comparison, of Australian voters, only 8% have considered moving to New Zealand, including just 1% who have spent time looking into it.

If you don’t think too carefully, that gives the impression of a giant sucking sound and the lights going out in New Zealand.  Australia is a lot larger than New Zealand, though.  If 8% of people in Australia moved to New Zealand and 37% of people in New Zealand moved to Australia, the population of New Zealand would go up, not down.  The total populations are about 5 million and about 27 million. Of those, about 3.6 million are enrolled to vote in NZ and nearly 18 million enrolled to vote in Australia, so 37% of NZ voters is 1.3 million and 8% of Oz voters is 1.44 million.
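The arithmetic is easy to check:

```python
# Enrolment figures as quoted above.
nz_voters, au_voters = 3.6e6, 18e6
print(f"37% of NZ voters: {0.37 * nz_voters / 1e6:.2f} million")  # 1.33
print(f"8% of Oz voters:  {0.08 * au_voters / 1e6:.2f} million")  # 1.44
```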

Another useful comparison number is that the largest ever number of people migrating out of NZ to all destinations, not just Australia, over any 12 months is about 130,000, a tenth of the ‘seriously considered’ number. A lot of people (apparently) seriously consider a lot of things they don’t end up doing.

The other important aspect of the story is the estimates quoted for small subpopulations.  Overall, the poll claims a maximum margin of error of about  3 percentage points. That’s for the population as a whole. Proportions are given for different age groups, including 18-34 year olds, people earning more than $150,000, and voters for Te Pāti Māori.  We aren’t told the uncertainty in these numbers, but it’s obviously higher.  About 1/3 of adults are 18-34, about 5% earn over $150k (IRD spreadsheet), and about 3% voted for Te Pāti Māori.  The maximum margin of error for subpopulations this big would be 5, 13, and 17 percentage points respectively, assuming equal sampling.   You can’t easily learn much about wealthy people or Pāti Māori voters just by contacting random people throughout the country — and the assumption that you can make your sample representative by reweighting gets increasingly dodgy.
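As a rough check on those figures, here’s the standard worst-case formula, assuming simple random sampling and a total sample of about 1,070 (back-calculated from the quoted 3-point maximum margin of error; the poll’s actual design may differ):

```python
from math import sqrt

def max_moe(n):
    """Worst-case (p = 0.5) 95% margin of error, in percentage points."""
    return 100 * 1.96 * sqrt(0.25 / n)

n_total = 1067   # implied by a 3.0-point maximum margin of error
for label, frac in [("overall", 1.0),
                    ("18-34 year olds (about 1/3)", 1 / 3),
                    ("income over $150k (about 5%)", 0.05),
                    ("Te Pāti Māori voters (about 3%)", 0.03)]:
    print(f"{label}: ±{max_moe(n_total * frac):.0f} points")
```

This reproduces the 5, 13, and 17 point figures above.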


January 1, 2025

Black spatulas

If you’ve been paying attention to the food scare news, you will have heard that black plastic spatulas are EVIL and then that they are probably ok.  Predicted exposures to brominated flame retardants were reported as close to the ‘reference level’ only because of a simple decimal-place error; with the error corrected there’s a ten-fold safety margin. The summary by Joel Schwarcz at McGill University in Canada is good; as he notes, there is no actual need for spatulas to be flame-retardant, and while the level of fire retardants doesn’t look dangerous, the target level should be roughly zero.

There are two additional points I want to make. First, units.  The original scare paper quoted the ‘oral reference dose’ as 7000 nanograms per kilogram per day. The EPA document that it cited said 0.007 mg per kilogram per day. These are terrible units.  The SI system gives us names every three orders of magnitude precisely so we don’t have to do this sort of thing and can say 7 micrograms per kilogram per day. It’s a lot easier to work with numbers like this.

Second, what does the reference dose mean?  If you look at the relevant EPA document you will see that the 7 micrograms per kilogram per day comes from taking a dose with no observed non-cancer effects in mice and dividing it by 10 because humans might be more sensitive than mice, by a further 10 because you might be more sensitive than the typical human, and by a further 3 because long-term exposure might matter more than short-term exposure.  And, on reading further, that the dose for mice was the highest dose tested in that experiment and did not show any adverse effects.  So, the 7 micrograms per kilogram per day reference dose is 300 times lower than the highest dose they even tested in mice.  Some other experiments did find adverse effects in rats, but at doses nearly a thousand times higher: 6 milligrams per kilogram per day.
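In consistent units, the bookkeeping (using only the figures quoted above) looks like this:

```python
# Reference-dose arithmetic, in micrograms per kilogram per day.
reference = 7                       # EPA oral reference dose (= 7000 ng, or 0.007 mg)
safety_factor = 10 * 10 * 3         # mice->humans, typical->sensitive, chronic exposure
mouse_noael = reference * safety_factor   # highest dose tested in mice, no effects seen

print(f"highest mouse dose tested: {mouse_noael / 1000} mg/kg/day")   # 2.1
print(f"rat adverse dose / reference dose: {6000 / reference:.0f}x")  # 857x, 'nearly a thousand'
```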

Taking all this together you can see the fuzziness in the calculations. There’s now a ten-fold margin of safety between a generously estimated dose and a reference dose — which is not a danger dose, but one that should have no effect.  On top of that, there’s an unknown (and possibly large) safety factor because of the choice of doses in the mouse safety experiments.  The basic problem is that you can’t tell accurately what doses will cause harm in humans without causing harm in humans; some sort of extrapolation is unavoidable in safety assessment.

Away with the ferries

The Isle of Arran, as I’m sure you all know, is on the west side of Scotland.  Being an island, it has somewhat limited means of access: Caledonian MacBrayne run two ferries from the mainland.  These ferries are being replaced with allegedly better ferries. However, the BBC headline said ‘Green’ ferry emits more CO2 than old diesel ship.

In reply, “Ferries procurement agency CMAL, which owns the ship, said the comparison was ‘inaccurate’ as Glen Sannox is a larger vessel.”

While New Zealand is very attached to per capita representations of everything, sometimes they aren’t helpful.  The new ship is bigger. Precisely for that reason, it would emit more CO2 if run on the same fuel as the old ship.  The plan is actually to run the ship on liquefied fossil gas imported from Qatar and trucked up from the south of England. This would reduce the CO2 emissions, but would produce methane emissions that pretty much compensate for the reduction — and the UK follows mainstream science in recognising that methane actually matters.

In some settings, such as comparing Auckland’s double-decker buses to traditional buses, it’s important to take account of the fact that they’re bigger and so you don’t need as many of them to carry all your passengers.  But when you’re talking about a ferry route with two ships there isn’t the same room for per capita savings to pay off the larger per-ship emissions.   If you run the same number of trips with bigger ships you’ll get more emissions. And if you can carry more cars on the bigger ferries that’s not really going to reduce emissions, either.

February 16, 2024

Say the magic word?

Q: Did you see you can be 50% more influential by using this one word!!

A:  Not convinced

Q: But it’s a Harvard study! With Science!

A: How did they measure influentialness?

Q:

A: <eyeroll emoji>

Q: How did they measure influentialness?

A: By whether someone let you in front of them at the photocopier

Q: What’s a photocopier?

A: When we were very young, books and academic journals were published on this stuff called paper, and stored in a special building, and you had to use a special machine to download them, one page at a time, on to your own paper

Q: That must have sucked.  Wait, why are they asking about photocopiers in a study about influencers now?

A: It’s a study from 50 years ago (PDF)

Q: It says 1978, though. That’s nowhere near… fifty…….. Ok, moving right along here. Why is a study from 50 years ago about photocopiers going to be useful now?

A: If it supports the message you just wrote a book about, it might be.

Q: So the study compared different ways of asking if you could use the photocopier?

A: Yes

Q: And the ones where they used the magic word worked better?

A: Not really. They had three versions of the request. Two of them gave a reason and also used the magic word; the third did neither.

Q: But the ones that gave a reason were 50% more influential?

A: In the case where someone was asking for a short use of the photocopier, the success rate was 60% with no reason and over 90% with a reason (and the magic word)

Q: And when it wasn’t short?

A: 24% with no reason, 24% with a bad reason (and the magic word), and 42% with a good reason (and the magic word)

Q: So what really matters is how long you want someone to wait and whether you have a good reason?

A: That would be an interpretation, yes

Q: In 1978

A: Yes

Q: Still, our parents always told us to “say the magic word” when making requests

A: Actually, they didn’t

Q: Well, no, but they might have

A: And the word they were looking for wasn’t “Because”

November 24, 2023

Detecting ChatGPT

Many news stories and some StatsChat posts have talked about detecting the output of Large Language Models. At the moment, tools to do this are very inaccurate.  Denouncing a student paper based on these detectors, for example, wouldn’t be supportable. Even worse, the error rate is higher for people who aren’t native English speakers, a group already at risk of being accused unfairly.

We might hope for better detectors in the future.  If people using ChatGPT have access to the detector, though, there’s a pretty reliable way of getting around it. Take a ChatGPT-produced document, and make small changes to it until it doesn’t trigger the detector.  Here we’re assuming that you can make small changes and still get a good-quality document, but if that’s not true — if there’s only one good answer to the question — there’s no hope for a ChatGPT detector to work.  Additionally, we’re assuming that you can tell which random changes still produce a good answer.  If you can’t, then you might still be able to ask GPT whether the answer is good.
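As a sketch, the attack is just a loop. Everything here is a hypothetical stand-in: `detector` and `still_good` represent a real detector and a real quality check (which, as noted above, might be the language model itself), and the synonym swap stands in for more serious paraphrasing:

```python
import random

# Toy perturbation table -- a real attack would paraphrase properly.
SYNONYMS = {"utilise": "use", "commence": "begin", "additionally": "also",
            "therefore": "so", "assist": "help"}

def perturb(text):
    """One small random edit: swap a word for a synonym if we know one."""
    words = text.split()
    i = random.randrange(len(words))
    words[i] = SYNONYMS.get(words[i], words[i])
    return " ".join(words)

def evade(text, detector, still_good, max_tries=1000):
    """Make small random changes, keeping only those that preserve
    quality, until the detector stops firing."""
    for _ in range(max_tries):
        if not detector(text):
            return text            # detector no longer triggers
        candidate = perturb(text)
        if still_good(candidate):  # e.g. ask the model itself to judge
            text = candidate
    return None                    # gave up within the budget
```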

A related question is whether Large Language Model outputs can be ‘watermarked’ invisibly so as to be easier to detect. ChatGPT might encode a signature in the first letters of each sentence, or it might have subtle patterns in word frequencies or sentence lengths. Regrettably, any such watermark falls to the same attack: just make random changes until the detector doesn’t detect.

On the preprint server arXiv recently was a computer science article arguing that even non-public detectors can be attacked in a similar way. Simply take the Large Language Model output and try random changes to it, keeping the changes that don’t mess up the quality.  This produces a random sample from a cloud of similar answers. If there aren’t any similar answers accessible by small changes, it’s going to be hard for the AI to insert a watermark, so we can assume there will be.  ChatGPT didn’t actually produce these similar answers, so a reasonable fraction of them should not trigger the ChatGPT detector.  Skeptics might be reassured that the researchers tried this approach on some real watermarking schemes and it seems to work.