Posts written by Thomas Lumley (2548)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

January 9, 2025

Briefly

  • The top baby names from New Zealand last year are out.  As we’ve seen in the past, the most-common names keep getting less common. “Noah” came top for boys, with only 250 uses, and “Isla” for girls, with only 190 uses.
  • The Daily Mail (because of course) has something purporting to be a map of penis sizes around the world, credited to this site, which gives no sources for the data. Wikipedia (possibly NSFW) points out that a lot of the data on this topic is self-reported: “Measurements vary, with studies that rely on self-measurement reporting a significantly higher average than those with a health professional measuring.” Even when it’s measured, it tends to be on volunteer samples, and there isn’t good standardisation of measurement protocols across sites.
  • If you live in one of these Aussie suburbs buy a lottery ticket NOW, says the headline on MSN.com, from the Daily Mail (Australia version).  This is a much more extreme headline than the NZ versions I usually complain about, and the text is more measured. Of course, there are two reasons why a suburb will see more lottery wins. The first is just chance, which doesn’t project into the future like that (there’s a small simulation after this list). The second is that these are suburbs where more money is lost on the lottery. Those trends probably will continue, but lottery advertising stories never seem to print the amounts lost on lotto.
  • We’ve seen a number of times that salary/wage ranges generated from advertising at Seek are not very similar to those reported from actual payments by StatsNZ.  This is worse: via Carl Bergstrom and Eduardo Hebkost, on Bluesky, apparently ziprecruiter.com will (in the US at least; not in NZ) give you salaries for any job you ask about, if you just forge a URL pointing to where the graph should be.
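On the lottery point above: here’s a small simulation under the assumption that every suburb has the same underlying chance of a win, so the differences are pure luck. All the numbers are made up for illustration. The ‘luckiest’ suburbs in one year turn out to be just average the next:

```python
import random

random.seed(1)
n_suburbs, wins_per_year = 200, 400   # made-up numbers: 2 wins per suburb on average

def one_year():
    counts = [0] * n_suburbs
    for _ in range(wins_per_year):
        counts[random.randrange(n_suburbs)] += 1   # each win lands at random
    return counts

this_year, next_year = one_year(), one_year()
luckiest = sorted(range(n_suburbs), key=lambda s: this_year[s], reverse=True)[:10]

print(sum(this_year[s] for s in luckiest) / 10)   # well above the overall average of 2
print(sum(next_year[s] for s in luckiest) / 10)   # back to about 2: luck doesn't persist
```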
January 4, 2025

Size matters

Ok, this is a bit late, but I didn’t see the poll (in a physical Sunday Star-Times) until this week.  An established Australian polling firm, Freshwater Strategy, have been doing polls here, too.  Stuff reports that the poll (also, at the Post)

…reveals 37% of New Zealand voters have seriously considered emigrating to Australia in the past 12 months.

By comparison, of Australian voters, only 8% have considered moving to New Zealand, including just 1% who have spent time looking into it.

If you don’t think too carefully, that gives the impression of a giant sucking sound and the lights going out in New Zealand.  Australia is a lot larger than New Zealand, though.  If 8% of people in Australia moved to New Zealand and 37% of people in New Zealand moved to Australia, the population of New Zealand would go up, not down.  The total populations are about 5 million and about 27 million. Of those, about 3.6 million are enrolled to vote in NZ and nearly 18 million enrolled to vote in Australia, so 37% of NZ voters is 1.3 million and 8% of Oz voters is 1.44 million.
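As a one-line check (using the rough enrolment figures above):

```python
nz_voters = 3.6e6    # enrolled voters, approximately
au_voters = 18e6

print(0.37 * nz_voters)   # 1,332,000 NZ voters 'seriously considered' Australia
print(0.08 * au_voters)   # 1,440,000 Australian voters considered NZ
```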

Another useful comparison number is that the largest ever number of people migrating out of NZ to all destinations, not just Australia, over any 12 months is about 130,000, a tenth of the ‘seriously considered’ number. A lot of people (apparently) seriously consider a lot of things they don’t end up doing.

The other important aspect of the story is the estimates quoted for small subpopulations.  Overall, the poll claims a maximum margin of error of about  3 percentage points. That’s for the population as a whole. Proportions are given for different age groups, including 18-34 year olds, people earning more than $150,000, and voters for Te Pāti Māori.  We aren’t told the uncertainty in these numbers, but it’s obviously higher.  About 1/3 of adults are 18-34, about 5% earn over $150k (IRD spreadsheet), and about 3% voted for Te Pāti Māori.  The maximum margin of error for subpopulations this big would be 5, 13, and 17 percentage points respectively, assuming equal sampling.   You can’t easily learn much about wealthy people or Pāti Māori voters just by contacting random people throughout the country — and the assumption that you can make your sample representative by reweighting gets increasingly dodgy.
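Here’s that margin-of-error arithmetic as a quick sketch, using the usual 1/√n approximation to the maximum 95% margin of error and assuming each subgroup shows up in the sample in proportion to its population share:

```python
from math import sqrt

n_total = 1 / 0.03**2    # a 3-point maximum margin of error implies n of about 1100

shares = {
    "18-34 year olds":      1 / 3,   # population shares quoted above
    "earning over $150k":   0.05,
    "Te Pāti Māori voters": 0.03,
}

for group, share in shares.items():
    moe = 1 / sqrt(n_total * share)   # maximum margin of error for the subgroup
    print(f"{group}: max MoE of about {100 * moe:.0f} percentage points")
# 18-34 year olds: 5;  over $150k: 13;  Te Pāti Māori voters: 17
```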


January 1, 2025

Black spatulas

If you’ve been paying attention to the food scare news, you will have heard that black plastic spatulas are EVIL and then that they are probably ok.  Predicted exposures to brominated flame retardants were claimed to be close to the ‘reference level’, but only because of a simple decimal-place error; with the arithmetic corrected there’s a ten-fold safety margin. The summary by Joe Schwarcz at McGill University in Canada is good; as he notes, there is no actual need for spatulas to be flame-retardant, and while the level of flame retardants doesn’t look dangerous, the target level should be roughly zero.

There are two additional points I want to make. First, units.  The original scare paper quoted the ‘oral reference dose’ as 7000 nanograms per kg per day. The EPA document that it cited said 0.007 mg per kilogram per day. These are terrible units.  The SI system gives us prefixes every three orders of magnitude precisely so we don’t have to do this sort of thing and can say 7 micrograms per kilogram per day. It’s a lot easier to work with numbers like this.
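As a trivial check of the conversions (just the numbers quoted above):

```python
ng_per_kg_day = 7000         # as the scare paper wrote it
mg_per_kg_day = 0.007        # as the EPA document wrote it

ug_per_kg_day = ng_per_kg_day / 1000     # 1000 ng in a microgram
print(ug_per_kg_day)                     # 7.0
print(mg_per_kg_day * 1000)              # 7.0: 1000 micrograms in a mg, same dose
```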

Second, what does the reference dose mean?  If you look at the relevant EPA document you will see that the 7 micrograms per kilogram per day comes from taking a dose with no observed adverse (non-cancer) effects in mice and dividing it by 10 because humans might be more sensitive than mice, a further 10 because you might be more sensitive than the typical human, and a further 3 because long-term exposure might matter more than short-term exposure.  And, on reading further, that the dose for mice was the highest dose tested in that experiment and did not show any adverse effects.  So, the 7 micrograms per kilogram per day reference dose is 300 times lower than the highest dose they even tested in mice.  Some other experiments did find adverse effects in rats, but at doses nearly a thousand times higher: 6 milligrams per kg per day.
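Putting the safety factors together (the mouse dose here is back-calculated from the 7 µg figure and the factor of 300 described above):

```python
noael_mice = 2100      # µg/kg/day: highest dose tested in mice, no adverse effects

safety_factor = 10 * 10 * 3   # mice to humans, sensitive humans, exposure duration
reference_dose = noael_mice / safety_factor
print(reference_dose)         # 7.0 µg/kg/day

rat_effect_dose = 6000        # µg/kg/day (6 mg): where rats showed adverse effects
print(rat_effect_dose / reference_dose)   # about 857: 'nearly a thousand times higher'
```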

Taking all this together you can see the fuzziness in the calculations. There’s now a ten-fold margin of safety between a generously estimated dose and a reference dose — which is not a danger dose, but one that should have no effect.  On top of that, there’s an unknown (and possibly large) safety factor because of the choice of doses in the mouse safety experiments.  The basic problem is that you can’t tell accurately what doses will cause harm in humans without causing harm in humans; some sort of extrapolation is unavoidable in safety assessment.

Away with the ferries

The Isle of Arran, as I’m sure you all know, is on the west side of Scotland.  Being an island, it has somewhat limited means of access: Caledonian MacBrayne run two ferries from the mainland.  These ferries are being replaced with allegedly better ferries. However, the BBC headline said ‘Green’ ferry emits more CO2 than old diesel ship.

In reply, “Ferries procurement agency CMAL, which owns the ship, said the comparison was ‘inaccurate’ as Glen Sannox is a larger vessel.”

While New Zealand is very attached to per capita representations of everything, sometimes they aren’t helpful.  The new ship is bigger. Precisely for that reason, it would emit more CO2 if run on the same fuel as the old ship.  The plan is actually to run the ship on liquefied fossil gas imported from Qatar and trucked up from the south of England. This would reduce the CO2 emissions, but would produce methane emissions that pretty much compensate for the reduction — and the UK follows mainstream science in recognising that methane actually matters.

In some settings, such as comparing Auckland’s double-decker buses to traditional buses, it’s important to take account of the fact that they’re bigger and so you don’t need as many of them to carry all your passengers.  But when you’re talking about a ferry route with two ships there isn’t the same room for per capita savings to pay off the larger per-ship emissions.   If you run the same number of trips with bigger ships you’ll get more emissions. And if you can carry more cars on the bigger ferries that’s not really going to reduce emissions, either.

February 16, 2024

Say the magic word?

Q: Did you see you can be 50% more influential by using this one word!!

A:  Not convinced

Q: But it’s a Harvard study! With Science!

A: How did they measure influentialness?

Q:

A: <eyeroll emoji>

Q: How did they measure influentialness?

A: By whether someone let you in front of them at the photocopier

Q: What’s a photocopier?

A: When we were very young, books and academic journals were published on this stuff called paper, and stored in a special building, and you had to use a special machine to download them, one page at a time, on to your own paper

Q: That must have sucked.  Wait, why are they asking about photocopiers in a study about influencers now?

A: It’s a study from 50 years ago (PDF)

Q: It says 1978, though. That’s nowhere near… fifty…….. Ok, moving right along here. Why is a study from 50 years ago about photocopiers going to be useful now?

A: If it supports the message you just wrote a book about, it might be.

Q: So the study compared different ways of asking if you could use the photocopier?

A: Yes

Q: And the ones where they used the magic word worked better?

A: Not really. They had three versions of the request. Two of them gave a reason and also used the magic word, the third didn’t do either.

Q: But the ones that gave a reason were 50% more influential?

A: In the case where someone was asking for a short use of the photocopier, the success rate was 60% with no reason and over 90% with a reason (and the magic word)

Q: And when it wasn’t short?

A: 24% with no reason, 24% with a bad reason (and the magic word), and 42% with a good reason (and the magic word)

Q: So what really matters is how long you want someone to wait and whether you have a good reason?

A: That would be an interpretation, yes

Q: In 1978

A: Yes

Q: Still, our parents always told us to “say the magic word” when making requests

A: Actually, they didn’t

Q: Well, no, but they might have

A: And the word they were looking for wasn’t “Because”

November 24, 2023

Detecting ChatGPT

Many news stories and some StatsChat posts have talked about detecting the output of Large Language Models. At the moment, tools to do this are very inaccurate.  Denouncing a student paper on the basis of these detectors, for example, wouldn’t be supportable. Even worse, the error rate is higher for people who aren’t native English speakers, a group already at risk of being accused unfairly.

We might hope for better detectors in the future.  If people using ChatGPT have access to the detector, though, there’s a pretty reliable way of getting around it. Take a ChatGPT-produced document, and make small changes to it until it doesn’t trigger the detector.  Here we’re assuming that you can make small changes and still get a good-quality document, but if that’s not true — if there’s only one good answer to the question — there’s no hope for a ChatGPT detector to work.  Additionally, we’re assuming that you can tell which random changes still produce a good answer.  If you can’t, then you might still be able to ask GPT whether the answer is good.
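In code, the attack is just a loop. This is a sketch, not any real API: `detector`, `looks_ok`, and `perturb` are stand-ins for whatever detector you can query, a quality check, and something that makes one small random edit such as a synonym swap:

```python
def evade(text, detector, looks_ok, perturb, threshold=0.5, max_tries=10_000):
    """Randomly edit `text` until `detector` stops flagging it.

    detector(text) -> score, higher meaning 'more likely machine-generated'
    looks_ok(text) -> True if the document is still a good answer
    perturb(text)  -> the text with one small random change (e.g. a synonym swap)
    """
    for _ in range(max_tries):
        if detector(text) < threshold:
            return text            # slipped past the detector
        candidate = perturb(text)
        if looks_ok(candidate):    # keep only changes that preserve quality
            text = candidate
    return None                    # gave up: detector (or quality check) too strict
```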

A related question is whether Large Language Model outputs can be ‘watermarked’ invisibly so as to be easier to detect. ChatGPT might encode a signature in the first letters of each sentence, or it might have subtle patterns in word frequencies or sentence lengths. Regrettably, any such watermark falls to the same attack: just make random changes until the detector doesn’t detect.

On the preprint server arXiv recently was a computer science article arguing that even non-public detectors can be attacked in a similar way. Simply take the Large Language Model output and try random changes to it, keeping the changes that don’t mess up the quality.  This produces a random sample from a cloud of similar answers. If there aren’t any similar answers accessible by small changes, it’s going to be hard for the AI to insert a watermark, so we can assume there will be.  ChatGPT didn’t actually produce these similar answers, so a reasonable fraction of them should not trigger the ChatGPT detector.  Skeptics might be reassured that the researchers tried this approach on some real watermarking schemes and it seems to work.

November 23, 2023

Whole lotta baseball

From Ars Technica (and while it’s not a story about baseball, it is trying to use numbers to mean something)

It’s actually 162 regular season games a year for 30 teams which means, 2,430 games a year. That’s 32,805 hours of baseball based on the average length of a game lasting 162 minutes. The regular season is 185 days long, which equals 4,440 hours. So there’s more baseball than time.

These numbers struck me as wrong immediately.  If there are 32k hours of baseball in 4k hours of regular season, it means an average of eight baseball games being played at any hour of the day or night. Since there’s a maximum of 15 games being played simultaneously (because 30 teams), that would mean a full baseball schedule for an average of nearly 12 hours every day.  There is a lot of baseball, but not that much.  They don’t play at 3am, and they take occasional days off to travel.

So, let’s run the numbers:

  • 162 games by 30 teams is 162×15 games, or 2430 games.
  • Average game lasts 162 minutes. 162×2430 is 393660 minutes, or 393660/60=6561 hours.
  • 185 day season is 185×24=4440 hours

The total hours of baseball seems off. In fact, it’s off by exactly a factor of five, suggesting the story was working with 12-minute hours for some reason.  With 6561 hours of baseball in a 4440 hour season, we’re looking at about 1.5 baseball games simultaneously, averaged over the season, which is more plausible.
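The same arithmetic, as a quick script:

```python
games = 162 * 30 / 2                   # each game involves two teams: 2430.0
baseball_hours = games * 162 / 60      # at 162 minutes per game: 6561.0 hours
season_hours = 185 * 24                # 4440 hours

print(32805 / baseball_hours)          # 5.0 -- the story's figure is exactly 5x too big
print(baseball_hours / season_hours)   # about 1.48 games going on at once, on average
```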

While we’re at it, we might want to check on the 162 minutes/game since it’s a bit suspicious for two unrelated numbers in the same calculation to both be 162.  It’s right, at least for 2023, though it’s down from over 3 hours the previous season.

November 18, 2023

Bird of the Century

For years, Bird of the Year has been the only self-selected (‘bogus’, ‘straw-poll’, ‘unscientific’) survey endorsed by StatsChat.  The unique feature of Bird of the Year as a bogus poll is that no-one pretends it is anything else. The pūteketeke won Bird of the Century fair and square, due to John Oliver’s marketing efforts, and no-one seriously thinks this says anything real about the relative popularity of New Zealand birds.

The key takeaway from Bird of the Century is that this is what bogus polls are like. All of them. When a bogus poll agrees with population opinion, it’s just an accident.  When someone claims to have information from a survey result, it’s always good to ask whether it’s the type of survey that’s more accurate than just pulling a number out of your arse, or not.

A couple of weeks ago, there was a widely reported claim made by a spokesperson for AA Insurance that 53% of people in New Zealand wanted a ban on domestic use of fireworks. None of the media outlets asked (or reported asking) anything about how he got that number.  When a news report says something that’s attributed to an anonymous source like that, you want to know who is vouching for the credibility of the source.

I happened to see a post on social media by someone who had been in a survey that could have been the one quoted, which was run by my2cents.  I don’t know how good their surveys are, but they at least qualify as trying to get the right answer.  If that survey was actually the one reported by AA Insurance, it would be good to know.

In some contexts, such as election polling or policy decisions, you might want to know more about the methods used and the reputation of the pollsters.  Even in simple news reporting, though, it’s important to ask if this is the sort of survey that gives you information or the sort of survey that just gives you grebes.

July 20, 2023

Election poll accuracy

1News had a new electoral opinion poll out this week, headlined Poll: National, ACT maintain wafer-thin advantage. The report gave National and ACT 61 seats out of 120. However, if you put the reported percentages and assumptions into the Electoral Commission’s seat allocation calculator, as various people did, you would get a different result, with National and ACT holding 60 seats, and a potential Labour/Green/Te Pāti Māori alliance also holding 60.  That’s strange, but you wouldn’t really expect 1News to get this wrong (unless you were the sort of social media commenter who immediately assumed it was a conspiracy rather than an error).

The first thing to check in a situation like this is the full report from the polling company. The percentages match what 1News reported, and if you scroll down to the bottom, there’s a corresponding seat allocation that matches the one reported.  At this point a simple error looks even less likely.  So what did happen?

Looking at either the 1News report or the Verian report, we see that percentages were rounded to the nearest percentage point, or to the nearest tenth of a percentage point below 4.85%. So, can we construct a set of numbers that would round to the same reported percentages and match the reported seat allocation? Yes, easily. We need National and Labour to have been rounded down and the Greens to have been rounded up.  So there’s no evidence of a mistake, and rounding is easily the most plausible explanation.
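Here’s a minimal sketch of that rounding sensitivity, using a bare-bones Sainte-Laguë allocator (the highest-averages method used for NZ party-list seats, here ignoring the 5% threshold, electorate seats, and overhang). The party names and percentages are made up for illustration, not the actual poll’s numbers: two sets of support that round to identical reported figures give different seat totals:

```python
def sainte_lague(support, seats=120):
    """Bare-bones Sainte-Laguë (highest-averages) seat allocation.
    Ignores the 5% threshold, electorate seats, and overhang."""
    alloc = dict.fromkeys(support, 0)
    for _ in range(seats):
        nxt = max(support, key=lambda p: support[p] / (2 * alloc[p] + 1))
        alloc[nxt] += 1
    return alloc

# Two made-up sets of party support (NOT the actual poll numbers):
# both round to 37 / 35 / 11 / 9 / 8, but the seats come out differently.
as_reported = {"A": 37.00, "B": 35.00, "C": 11.00, "D": 9.00, "E": 8.00}
also_rounds = {"A": 37.49, "B": 34.51, "C": 11.00, "D": 8.51, "E": 8.49}

print(sainte_lague(as_reported))   # {'A': 44, 'B': 42, 'C': 13, 'D': 11, 'E': 10}
print(sainte_lague(also_rounds))   # {'A': 45, 'B': 42, 'C': 13, 'D': 10, 'E': 10}
```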

On the other hand, one might hope 1News or Verian would notice that the headline figures are very close to even, and consider how sensitive the results might be to rounding.

On the other other hand, though, what this shows is that a single poll is never going to be enough to support a claim of “maintain wafer-thin advantage”.  The left-right split can easily be off by five seats, and that’s a big difference; if the advantage is wafer-thin, it’s way below the ability of the polling system to measure.  You can do somewhat better by combining polls and estimating biases of different pollsters, as the Herald’s poll of polls is doing, using a fairly sophisticated model.  They have been fairly consistent in predicting that (assuming no major events that change things) Labour/Green are unlikely to get a majority without Te Pāti Māori, but that National/ACT have reasonable odds of doing so.

Even then, you can’t really do ‘wafer-thin’ comparisons.

Briefly

  • Substack counts as the media, right?  David Farrier, a NZ journalist and filmmaker, wrote about getting a spine MRI.
  • Aspartame is now in Group IIb on the International Agency for Research on Cancer (IARC) scale of hazards.  We had reports from at least the Herald, Stuff, 1News, RNZ.  It’s important to remember that IIb, “possible carcinogen”, is effectively the lowest on a three-point scale. IARC has Group I (definitely carcinogenic at some dose), Group IIa (probably carcinogenic at some dose), and Group IIb (possibly carcinogenic at some dose). They also have Group III (insufficient evidence). They once had Group IV (not carcinogenic) but it only ever got used once and was retired.  The “at some dose” proviso is also important; for example, sunlight is a Group I carcinogen.  The reports were all pretty good — much better than when bacon got into Group I several years ago.  Perhaps the best was at Stuff, where they actually quoted the recommended dose limit: David Spiegelhalter, an emeritus statistics professor at Cambridge University, said the guidance means that “average people are safe to drink up to 14 cans of diet drink a day … and even this ‘acceptable daily limit’ has a large built-in safety factor.” That’s safe for cancer, not for all possible problems, but it’s still a lot.  I’d also link to the Twitter thread by Martyn Plummer, a statistician and former IARC researcher, but linking to Twitter threads now doesn’t work because of the War on Twitter.
  • Katie Kenny at Stuff had an excellent story about measurement accuracy and her infant son’s weight.
  • Mediawatch writes about one weird trick for getting your press releases covered in the news (be sure always to call it research)