Posts written by Thomas Lumley (2534)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

February 16, 2024

Say the magic word?

Q: Did you see you can be 50% more influential by using this one word!!

A:  Not convinced

Q: But it’s a Harvard study! With Science!

A: How did they measure influentialness?

Q:

A: <eyeroll emoji>

Q: How did they measure influentialness?

A: By whether someone let you in front of them at the photocopier

Q: What’s a photocopier?

A: When we were very young, books and academic journals were published on this stuff called paper, and stored in a special building, and you had to use a special machine to download them, one page at a time, on to your own paper

Q: That must have sucked.  Wait, why are they asking about photocopiers in a study about influencers now?

A: It’s a study from 50 years ago (PDF)

Q: It says 1978, though. That’s nowhere near… fifty…….. Ok, moving right along here. Why is a study from 50 years ago about photocopiers going to be useful now?

A: If it supports the message you just wrote a book about, it might be.

Q: So the study compared different ways of asking if you could use the photocopier?

A: Yes

Q: And the ones where they used the magic word worked better?

A: Not really. They had three versions of the request. Two of them gave a reason and also used the magic word, the third didn’t do either.

Q: But the ones that gave a reason were 50% more influential?

A: In the case where someone was asking for a short use of the photocopier, the success rate was 60% with no reason and over 90% with a reason (and the magic word)

Q: And when it wasn’t short?

A: 24% with no reason, 24% with a bad reason (and the magic word), and 42% with a good reason (and the magic word)

Q: So what really matters is how long you want someone to wait and whether you have a good reason?

A: That would be an interpretation, yes

Q: In 1978

A: Yes

Q: Still, our parents always told us to “say the magic word” when making requests

A: Actually, they didn’t

Q: Well, no, but they might have

A: And the word they were looking for wasn’t “Because”

November 24, 2023

Detecting ChatGPT

Many news stories and some StatsChat posts have talked about detecting the output of Large Language Models. At the moment, tools to do this are very inaccurate.  Denouncing a student paper based on these detectors, for example, wouldn’t be supportable. Even worse, the error rate is higher for people who aren’t native English speakers, a group who can already be accused unfairly.

We might hope for better detectors in the future.  If people using ChatGPT have access to the detector, though, there’s a pretty reliable way of getting around it. Take a ChatGPT-produced document, and make small changes to it until it doesn’t trigger the detector.  Here we’re assuming that you can make small changes and still get a good-quality document, but if that’s not true — if there’s only one good answer to the question — there’s no hope for a ChatGPT detector to work.  Additionally, we’re assuming that you can tell which random changes still produce a good answer.  If you can’t, then you might still be able to ask GPT whether the answer is good.
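In code terms, the attack is just a loop. Here’s a minimal sketch in Python, where detector, looks_good and perturb are placeholders for whatever tools the person actually has (hypothetical names, not real APIs):

    def evade(text, detector, looks_good, perturb, max_tries=10000):
        # Propose small random edits; keep an edit only if the result
        # still reads well, and stop once the detector no longer flags
        # the text (or we give up).
        for _ in range(max_tries):
            if not detector(text):
                return text                  # detector no longer triggers
            candidate = perturb(text)        # e.g. swap a word for a synonym
            if looks_good(candidate):        # quality check, perhaps by an LLM
                text = candidate
        return text

The point isn’t the code; it’s that nothing in the loop needs access to ChatGPT itself, only to the detector and some way of judging quality.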

A related question is whether Large Language Model outputs can be ‘watermarked’ invisibly so as to be easier to detect. ChatGPT might encode a signature in the first letters of each sentence, or it might have subtle patterns in word frequencies or sentence lengths. Regrettably, any such watermark falls to the same attack: just make random changes until the detector doesn’t detect.

On the preprint server arXiv recently was a computer science article arguing that even non-public detectors can be attacked in a similar way. Simply take the Large Language Model output and try random changes to it, keeping the changes that don’t mess up the quality.  This produces a random sample from a cloud of similar answers. If there aren’t any similar answers accessible by small changes, it’s going to be hard for the AI to insert a watermark, so we can assume there will be.  ChatGPT didn’t actually produce these similar answers, so a reasonable fraction of them should not trigger the ChatGPT detector.  Skeptics might be reassured that the researchers tried this approach on some real watermarking schemes and it seems to work.

November 23, 2023

Whole lotta baseball

From Ars Technica (and while it’s not a story about baseball, it is trying to use numbers to mean something)

It’s actually 162 regular season games a year for 30 teams which means, 2,430 games a year. That’s 32,805 hours of baseball based on the average length of a game lasting 162 minutes. The regular season is 185 days long, which equals 4,440 hours. So there’s more baseball than time.

These numbers struck me as wrong immediately.  If there are 32k hours of baseball in 4k hours of regular season, it means an average of more than seven baseball games being played at any hour of the day or night. Since there’s a maximum of 15 games being played simultaneously (because 30 teams), that would mean a full baseball schedule for an average of nearly 12 hours every day.  There is a lot of baseball, but not that much.  They don’t play at 3am, and they take occasional days off to travel.

So, let’s run the numbers:

  • 162 games by 30 teams is 162×15 games, or 2430 games.
  • Average game lasts 162 minutes. 162×2430 is 393660 minutes, or 393660/60=6561 hours.
  • A 185-day season is 185×24=4440 hours

The total hours of baseball seems off. In fact, it’s off by exactly a factor of five, suggesting the story was working with 12-minute hours for some reason.  With 6561 hours of baseball in a 4440 hour season, we’re looking at about 1.5 baseball games simultaneously, averaged over the season, which is more plausible.
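For anyone who wants to check, the whole calculation fits in a few lines of Python, using the story’s own inputs of 162 games per team, 30 teams, 162 minutes per game, and a 185-day season:

    games = 162 * 30 // 2             # each game involves two teams: 2,430 games
    game_hours = games * 162 / 60     # at 162 minutes per game: 6,561 hours of baseball
    season_hours = 185 * 24           # 185 days of 24 hours: 4,440 hours of season

    print(game_hours)                   # 6561.0, not 32,805
    print(32805 / game_hours)           # 5.0: the story is out by exactly a factor of five
    print(game_hours / season_hours)    # about 1.48 games in progress, on average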

While we’re at it, we might want to check on the 162 minutes/game since it’s a bit suspicious for two unrelated numbers in the same calculation to both be 162.  It’s right, at least for 2023, though it’s down from over 3 hours the previous season.

November 18, 2023

Bird of the Century

For years, Bird of the Year has been the only self-selected (‘bogus’, ‘straw-poll’, ‘unscientific’) survey endorsed by StatsChat.  The unique feature of Bird of the Year as a bogus poll is that no-one pretends it is anything else. The pūteketeke won Bird of the Century fair and square, due to John Oliver’s marketing efforts, and no-one seriously thinks this says anything real about the relative popularity of New Zealand birds.

The key takeaway from Bird of the Century is that this is what bogus polls are like. All of them. When a bogus poll agrees with population opinion, it’s just an accident.  When someone claims to have information from a survey result, it’s always good to ask whether it’s the type of survey that’s more accurate than just pulling a number out of your arse, or not.

A couple of weeks ago, there was a widely reported claim made by a spokesperson for AA Insurance that 53% of people in New Zealand wanted a ban on domestic use of fireworks. None of the media outlets asked (or reported asking) anything about how he got that number.  When a news report says something that’s attributed to an anonymous source like that, you want to know who is vouching for the credibility of the source.

I happened to see a post on social media by someone who had been in a survey that could have been the one quoted, which was run by my2cents.  I don’t know how good their surveys are, but they at least qualify as trying to get the right answer.  If that survey was actually the one reported by AA Insurance, it would be good to know.

In some contexts, such as election polling or policy decisions, you might want to know more about the methods used and the reputation of the pollsters.  Even in simple news reporting, though, it’s important to ask if this is the sort of survey that gives you information or the sort of survey that just gives you grebes.

July 20, 2023

Election poll accuracy

1News had a new electoral opinion poll out this week, headlined Poll: National, ACT maintain wafer-thin advantage. The report gave National and ACT 61 seats out of 120. However, if you put the reported percentages and assumptions into the Electoral Commission’s seat allocation calculator, as various people did, you would get a different result, with National and ACT holding 60 seats, and a potential Labour/Green/Te Pāti Māori alliance also holding 60.  That’s strange, but you wouldn’t really expect 1News to get this wrong (unless you were the sort of social media commenter who immediately assumed it was a conspiracy rather than an error).

The first thing to check in a situation like this is the full report from the polling company. The percentages match what 1News reported, and if you scroll down to the bottom, there’s a corresponding seat allocation that matches the one reported.  At this point a simple error looks even less likely.  So what did happen?

Looking at either the 1News report or the Verian report, we see that percentages were rounded to the nearest percentage point, or to the nearest tenth of a percentage point below 4.85%. So, can we construct a set of numbers that would round to the same reported percentages and match the reported seat allocation? Yes, easily. We need National and Labour to have been rounded down and the Greens to have been rounded up.  So there’s no evidence of a mistake, and rounding is easily the most plausible explanation.
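If you want to try this yourself, the seat allocation is easy to reproduce. Here’s a minimal sketch in Python of the Sainte-Laguë (highest averages) formula behind the Electoral Commission’s calculator, ignoring the 5% threshold, electorate seats and overhang; the shares you feed it are whatever numbers you want to test:

    def sainte_lague(shares, seats=120):
        # Each party's share is divided by 1, 3, 5, ... and the `seats`
        # largest quotients each win a seat.
        quotients = [(share / (2 * k + 1), party)
                     for party, share in shares.items()
                     for k in range(seats)]
        quotients.sort(reverse=True)
        allocation = {party: 0 for party in shares}
        for _, party in quotients[:seats]:
            allocation[party] += 1
        return allocation

    # Compare allocations for made-up shares that all round to the same
    # published figures (say, a party at 34.6% versus 35.4%, both reported
    # as 35%) to see whether a seat moves between blocs.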

On the other hand, one might hope 1News or Verian would notice that the headline figures are very close to even, and consider how sensitive the results might be to rounding.

On the other other hand, though, what this shows is that a single poll is never going to be enough to support a claim of “maintain wafer-thin advantage”.  The left-right split can easily be off by five seats, and that’s a big difference; if the advantage is wafer-thin, it’s way below the ability of the polling system to measure.  You can do somewhat better by combining polls and estimating biases of different pollers, as the Herald’s poll of polls is doing, using a fairly sophisticated model.  They have been fairly consistent in predicting that (assuming no major events that change things) Labour/Green are unlikely to get a majority without Te Pāti Māori, but that National/ACT have reasonable odds of doing so.

Even then, you can’t really do ‘wafer-thin’ comparisons.

Briefly

  • Substack counts as the media, right?  David Farrier, a NZ journalist and filmmaker, wrote about getting a spine MRI.
  • Aspartame is now in Group IIb on the International Agency for Research on Cancer (IARC) scale of hazards.  We had reports from at least the Herald, Stuff, 1News, RNZ.  It’s important to remember that IIb, “possible carcinogen”, is effectively the lowest on a three-point scale. IARC has Group I (definitely carcinogenic at some dose), Group IIa (probably carcinogenic at some dose), and Group IIb (possibly carcinogenic at some dose). They also have Group III (insufficient evidence). They once had Group IV (not carcinogenic) but it only ever got used once and was retired.  The “at some dose” proviso is also important; for example, sunlight is a Group I carcinogen.  The reports were all pretty good — much better than when bacon got into Group I several years ago.  Perhaps the best was at Stuff, where they actually quoted the recommended dose limit: David Spiegelhalter, an emeritus statistics professor at Cambridge University, said the guidance means that “average people are safe to drink up to 14 cans of diet drink a day … and even this ‘acceptable daily limit’ has a large built-in safety factor.” That’s safe for cancer, not for all possible problems, but it’s still a lot.  I’d also link to the Twitter thread by Martyn Plummer, a statistician and former IARC researcher, but linking to Twitter threads now doesn’t work because of the War on Twitter.
  • Katie Kenny at Stuff had an excellent story about measurement accuracy and her infant son’s weight.
  • Mediawatch writes about one weird trick for getting your press releases covered in the news (be sure always to call it research).

June 18, 2023

Looking for ChatGPT again

When I wrote about ChatGPT-detecting last time, I said that overall accuracy wasn’t enough; you’d need to know about the accuracy in relevant groups of people:

In addition to knowing the overall false positive rate, we’d want to know the false positive rate for important groups of students.  Does using a translator app on text you wrote in another language make you more likely to get flagged? Using a grammar checker? Speaking Kiwi? Are people who use semicolons safe?

Some evidence is accumulating.   The automated detectors can tell opinion pieces published in Science from ChatGPT imitations  (of limited practical use, since Science has an actual list of things it has published).

More importantly, there’s a new preprint that claims the detectors do extremely poorly on material written by non-native English speakers. Specifically, on essays from the TOEFL English exam, the false positive rate averaged over several detectors was over 50%.  The preprint also claims that ChatGPT could be used to edit the TOEFL essays (“Enhance the word choices to sound more like that of a native speaker”) or its own creations (“Elevate the provided text by employing advanced technical language”) to reduce detection.

False positives for non-native speakers are an urgent problem with using the detectors in education. Non-native speakers may already fall under more suspicion, so false positives for them are even more of a problem. However, it’s quite possible that future versions of the detectors can reduce this specific bias (and it will  be important to verify this).

The ability to get around the detectors by editing is a longer-term problem.  If you have a publicly-available detector and the ability to modify your text, you can make changes until the detector no longer reports a problem.   There’s fundamentally no real way around this, and if the process can be automated it will be even easier.  Having a detector that isn’t available to students would remove the ability to edit — but “this program we won’t let you see says you cheated” shouldn’t be an acceptable solution either.

Smartphone blood pressure

At Ars Technica there’s an interesting story about a smartphone add-on being developed to provide rapid, inexpensive blood pressure measurement.  Home blood pressure machines aren’t all that cheap and they’re a bit tricky to use reliably on your own. Smartphones, obviously, are more expensive than the blood pressure machines, but people might already own them because they are multi-use devices — you can read StatsChat on them as well.  Blood pressure measurement is important because high blood pressure not only presents a risk of heart and brain and kidney damage but is something we can actually treat, safely and effectively, at very low cost.

The gadget uses a smartphone torch and camera to shine light into your finger, and also measures how hard you’re pressing on it. Taking a range of readings at different pressures over a few minutes lets it estimate your blood pressure.  If you heard about problems with pulse oximeter (oxygen saturation) measurements during the early pandemic, you might worry that this isn’t going to work well in people with dark skin.  The researchers tried to look at this, but it’s not clear whether they had any particularly dark-skinned people in the study (they divided their participants up into Asian, Hispanic, and White, and this was in San Diego).  The other question is accuracy. The story says

 The current version of BPClip produces measurements that may differ from cuff readings by as much as 8 mmHg.

The linked research paper, on the other hand, says 8 mmHg is the mean absolute error — the average difference between the new gadget and a traditional reading. You’d expect about 40% of readings to differ by more than that, and maybe 10% to differ by twice as much.  In fact, the research paper has this picture where the vertical axis is the difference between the new gadget and traditional BP measurements (and the horizontal axis is the average of the two)

It’s clear that the gadget produces measurements that may differ from cuff readings by as much as 20 mmHg — and that’s after dropping out 5 of the 29 study participants because they couldn’t get good measurements.  When researchers do carefully report their accuracy, it’s a pity to get it wrong in translation.
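For what it’s worth, the “about 40% … maybe 10%” expectation above comes from a simple error model. Here’s a rough check in Python, assuming (my assumption, not anything the paper reports) that the errors are Normally distributed around zero with a mean absolute error of 8 mmHg:

    from math import erf, pi, sqrt

    mae = 8.0
    sigma = mae * sqrt(pi / 2)    # for a zero-mean Normal, MAE = sigma * sqrt(2/pi)

    def prob_exceeds(x, sigma):
        # P(|error| > x) for a zero-mean Normal with standard deviation sigma
        return 1 - erf(x / (sigma * sqrt(2)))

    print(prob_exceeds(8, sigma))     # about 0.42: roughly 40% differ by more than the MAE
    print(prob_exceeds(16, sigma))    # about 0.11: roughly 10% differ by twice as much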

June 11, 2023

72% of landlords?

Chris Luxon, via Q+A:

“72% of all landlords are Mum & Dad investors… they are not evil property speculators”

This is attracting a lot of comment, because it can be interpreted as two quite different claims: either that the landlord of a randomly chosen rental has a 72% chance of being a “Mum & Dad investor” or that a randomly chosen person or corporation who is a landlord has a 72% chance of being a “Mum & Dad investor”.  Mr Luxon presumably means the latter, which is the more natural statistical interpretation, though (and this is a frequent Statschat theme) not necessarily the most relevant number to the policy question at hand.
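A made-up toy example shows how far apart the two readings can be:

    # Hypothetical: ten landlords, nine owning one rental each
    # and one owning 91 rentals.
    holdings = [1] * 9 + [91]

    small = [h for h in holdings if h == 1]
    share_of_landlords = len(small) / len(holdings)   # 0.90: 90% of landlords are small
    share_of_rentals = sum(small) / sum(holdings)     # 0.09: but they own only 9% of the rentals

Both numbers describe the same market; they just answer different questions.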

Strictly speaking, we don’t have data on the fraction of  landlords who are Mum & Dad investors or evil property speculators. No-one makes landlords report whether they have children, whether they own property for speculation or investment, or even whether they are evil.  Some landlords might be evil and have children and not be speculators!  However, if we follow Kirsty Johnston a few years ago in the Herald in dividing up landlords by the numbers of properties they own, it turns out that (a) most of the actual people who own homes don’t own a lot and (b) the occasional people who each own a lot of homes collectively own a lot of homes.

Proportion of properties owned by people who own

One property – 27.1 per cent
Two properties – 13.2 per cent
Three properties – 6.5 per cent
Four to six properties – 10.7 per cent
Seven to 20 properties – 9.7 per cent
More than 20 properties – 9.4 per cent

Of the people who own more than one property, 52% own only two and 70% own two or three.

The word ‘people’ is important here: homes are also owned by companies.  Companies are definitionally not Mum & Dad investors, but they also aren’t necessarily evil property speculators.  The largest in that Herald story was Housing New Zealand, as it then was, who had 60,000 rental properties, and the category would include some other non-profit providers.  At least some of the people who don’t like landlords feel differently about community housing providers.

And, finally, homes are owned by trusts, which is a much more difficult category to survey, since there isn’t a centralised registry of beneficiaries of trusts. The Herald didn’t even hazard a guess.

So, there’s plenty of room for Chris Luxon’s claim to be largely true. Most landlords are probably small-scale, but the small proportion who aren’t will still add up to a lot of rentals.  However, he almost certainly doesn’t have reliable population data about either the parental status or moral alignment of landlords. Nor is it clear that the typical portfolio size of landlords should be decisive in deciding how to tax and regulate them.  Mum & Dad takeaways still need to follow food safety rules.

June 10, 2023

Chocolate for the brain?

Q: Did you see that chocolate and wine really are good for the brain?

A: Not convinced

Q: A randomised trial, though. 16% improvement in memory!

A: A randomised trial, but not 16% improvement in memory

Q: It’s what the Herald says

A: Well, more precisely, it’s what the Daily Telegraph says

Q: What does the research say?

A: It’s a good trial. People were randomly allocated to either flavanols from cocoa or placebo, and they did tests that are supposed to evaluate memory. The scores of both groups improved about 15% over the trial period. The researchers think this is due to practice.

Q: You mean like how Donald Trump remembered the words he was asked in a cognitive function test?

A: Yes, like that.

Q: But the story says there was a benefit in people who had low-flavanol diets before the study.

A: Yes, there was borderline evidence of a difference between people with high and low levels of flavanols in their urine. But that’s a difference of 0.1 points in average score compared to an average change of about 1 point. Nearly all the change in the treated people also showed up in the placebo group, so only a tiny fraction of it could be an effect of treatment.

Q: Was this the planned analysis of the trial or just something they thought up afterwards?

A: It was one of the planned analyses. One of quite a lot of planned analyses.

Q: That’s ok, isn’t it?

A: It’s ok if you don’t get too excited about borderline results in one of the analyses, yes.

Q: So it doesn’t work?

A: It might work — this is relatively impressive as dietary micronutrient evidence goes.  But if it works, it only works for people with low intakes of tea and apples and cherries and citrus and peppers and chocolate and soy.

Q: <sigh> We don’t really qualify, do we?

A: Probably not.

Q: So if we were eating chocolate primarily for the health effects?

A: We’d still be doing it wrong.