Posts written by Thomas Lumley (2548)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

June 18, 2023

Looking for ChatGPT again

When I wrote about detecting ChatGPT last time, I said that overall accuracy wasn’t enough; you’d need to know about the accuracy in relevant groups of people:

In addition to knowing the overall false positive rate, we’d want to know the false positive rate for important groups of students.  Does using a translator app on text you wrote in another language make you more likely to get flagged? Using a grammar checker? Speaking Kiwi? Are people who use semicolons safe?

Some evidence is accumulating.   The automated detectors can tell opinion pieces published in Science from ChatGPT imitations  (of limited practical use, since Science has an actual list of things it has published).

More importantly, there’s a new preprint that claims the detectors do extremely poorly on material written by non-native English speakers. Specifically, on essays from the TOEFL English exam, the false positive rate averaged over several detectors was over 50%.  The preprint also claims that ChatGPT could be used to edit the TOEFL essays (“Enhance the word choices to sound more like that of a native speaker”) or its own creations (“Elevate the provided text by employing advanced technical language”) to reduce detection.

False positives for non-native speakers are an urgent problem for using the detectors in education, all the more so because non-native speakers may already fall under more suspicion. However, it’s quite possible that future versions of the detectors can reduce this specific bias (and it will be important to verify that they do).

The ability to get around the detectors by editing is a longer-term problem.  If you have a publicly-available detector and the ability to modify your text, you can make changes until the detector no longer reports a problem.   There’s fundamentally no real way around this, and if the process can be automated it will be even easier.  Having a detector that isn’t available to students would remove the ability to edit — but “this program we won’t let you see says you cheated” shouldn’t be an acceptable solution either.
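That edit-until-undetected loop can be made concrete with a toy sketch. Everything here is invented for illustration — a real detector scores statistical properties of the text, not a word list, and the rewrite rules stand in for whatever edits a student (or a script) might try:

```python
# Toy sketch of the edit-until-undetected loop. The "detector" and the
# rewrite rules are invented for illustration only.

def toy_detector(text: str) -> bool:
    """Pretend detector: flags text containing these words."""
    flagged = {"delve", "tapestry", "furthermore"}
    return any(word in text.lower() for word in flagged)

# Hypothetical substitutions an editor (human or automated) might try.
rewrites = {"delve": "dig", "tapestry": "mix", "furthermore": "also"}

def evade(text: str, max_rounds: int = 10) -> str:
    """Keep editing until the detector no longer reports a problem (or give up)."""
    for _ in range(max_rounds):
        if not toy_detector(text):
            return text
        for old, new in rewrites.items():
            text = text.replace(old, new)
    return text

essay = "We delve into the rich tapestry of survey sampling; furthermore, it works."
print(toy_detector(essay))         # True: flagged
print(toy_detector(evade(essay)))  # False: edited until it passes
```

If the rewriting step is itself automated (say, by asking ChatGPT to rephrase), the loop runs without any human effort at all.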

Smartphone blood pressure

At Ars Technica there’s an interesting story about a smartphone addon being developed to provide rapid, inexpensive blood pressure measurement.   Home blood pressure machines aren’t all that cheap and they’re a bit tricky to use reliably on your  own. Smartphones, obviously, are more expensive than the blood pressure machines, but people might already own them because they are multi-use devices — you can read StatsChat on them as well.  Blood pressure measurement is important because high blood pressure not only presents a risk of heart and brain and kidney damage but is something we can actually treat, safely and effectively, at very low cost.

The gadget uses a smartphone torch and camera to shine light into your finger, and also measures how hard you’re pressing on it. Taking a range of readings at different pressures over a few minutes lets it estimate your blood pressure.  If you heard about problems with pulse oximeter readings during the early pandemic, you might worry that this isn’t going to work well in people with dark skin.  The researchers tried to look at this, but it’s not clear whether they had any particularly dark-skinned people in the study (they divided their participants into Asian, Hispanic, and White, and this was in San Diego).  The other question is accuracy. The story says

 The current version of BPClip produces measurements that may differ from cuff readings by as much as 8 mmHg.

The linked research paper, on the other hand, says 8 mmHg is the mean absolute error — the average difference between the new gadget and a traditional reading. You’d expect about 40% of readings to differ by more than that, and maybe 10% to differ by twice as much.  In fact, the research paper has this picture where the vertical axis is the difference between the new gadget and traditional BP measurements (and the horizontal axis is the average of the two)

It’s clear that the gadget produces measurements that may differ from cuff readings by as much as 20 mmHg — and that’s after dropping out 5 of the 29 study participants because they couldn’t get good measurements.  When researchers do carefully report their accuracy, it’s a pity to get it wrong in translation.
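The 40%-and-10% back-of-envelope above can be sketched in a few lines, under the assumption (which the paper doesn’t state) that the errors are roughly normal with mean zero:

```python
# If errors are N(0, sigma), the mean absolute error is sigma*sqrt(2/pi),
# so an MAE of 8 mmHg implies a standard deviation of about 10 mmHg.
from math import sqrt, pi, erf

def normal_cdf(x: float) -> float:
    return 0.5 * (1 + erf(x / sqrt(2)))

mae = 8.0
sigma = mae * sqrt(pi / 2)  # invert MAE = sigma*sqrt(2/pi)

p_over_mae = 2 * (1 - normal_cdf(mae / sigma))       # error beyond 8 mmHg
p_over_2mae = 2 * (1 - normal_cdf(2 * mae / sigma))  # error beyond 16 mmHg
print(f"sd ≈ {sigma:.1f} mmHg")
print(f"P(error > 8 mmHg)  ≈ {p_over_mae:.0%}")   # about 40%
print(f"P(error > 16 mmHg) ≈ {p_over_2mae:.0%}")  # about 10%
```

The real error distribution in the paper’s Bland–Altman-style picture is if anything wider than this normal approximation suggests.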

June 11, 2023

72% of landlords?

Chris Luxon, via Q+A:

“72% of all landlords are Mum & Dad investors… they are not evil property speculators”

This is attracting a lot of comment, because it can be interpreted as two quite different claims: either that the landlord of a randomly chosen rental has a 72% chance of being a “Mum & Dad investor” or that a randomly chosen person or corporation who is a landlord has a 72% chance of being a “Mum & Dad investor”.  Mr Luxon presumably means the latter, which is the more natural statistical interpretation, though (and this is a frequent StatsChat theme) not necessarily the most relevant number to the policy question at hand.
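The gap between the two interpretations can be large. With made-up numbers for illustration — a market with 95 small landlords owning one rental each and 5 large landlords owning 40 each:

```python
# Landlord-weighted vs rental-weighted shares (invented numbers).
small_landlords, small_rentals_each = 95, 1
large_landlords, large_rentals_each = 5, 40

share_of_landlords = small_landlords / (small_landlords + large_landlords)
share_of_rentals = (small_landlords * small_rentals_each) / (
    small_landlords * small_rentals_each + large_landlords * large_rentals_each
)
print(f"{share_of_landlords:.0%} of landlords are small")      # 95%
print(f"but they own only {share_of_rentals:.0%} of rentals")  # 32%
```

So “most landlords are small” and “most rentals belong to large landlords” can both be true at once.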

Strictly speaking, we don’t have data on the fraction of  landlords who are Mum & Dad investors or evil property speculators. No-one makes landlords report whether they have children, whether they own property for speculation or investment, or even whether they are evil.  Some landlords might be evil and have children and not be speculators!  However, if we follow Kirsty Johnston a few years ago in the Herald in dividing up landlords by the numbers of properties they own, it turns out that (a) most of the actual people who own homes don’t own a lot and (b) the occasional people who each own a lot of homes collectively own a lot of homes.

Proportion of properties owned by people who own

One property – 27.1 per cent
Two properties – 13.2 per cent
Three properties – 6.5 per cent
Four to six properties – 10.7 per cent
Seven to 20 properties – 9.7 per cent
More than 20 properties – 9.4 per cent

Of the people who own more than one property, 52% own only two and 70% own two or three.
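Those percentages can be roughly checked from the property-share figures above: dividing each category’s share of properties by a typical holding size (the midpoints here are my guesses, not the Herald’s) gives relative numbers of people in each category.

```python
# Property shares from the Herald figures, keyed by an assumed typical
# holding size for each category (2, 3, midpoint 5 for "four to six",
# 13 for "seven to 20", and a guessed 30 for "more than 20").
shares = {1: 27.1, 2: 13.2, 3: 6.5, 5: 10.7, 13: 9.7, 30: 9.4}

# Relative headcounts: people in a category = (share of properties) / (size).
people = {size: share / size for size, share in shares.items()}

multi = {size: n for size, n in people.items() if size > 1}
total_multi = sum(multi.values())
own_two = multi[2] / total_multi
own_two_or_three = (multi[2] + multi[3]) / total_multi
print(f"own exactly two:  {own_two:.0%}")
print(f"own two or three: {own_two_or_three:.0%}")
```

With these guessed midpoints the sketch gives numbers in the mid-50s and low-70s — in the same ballpark as the 52% and 70% quoted, which is as close as a back-of-envelope calculation deserves to get.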

The word ‘people’ is important here: homes are also owned by companies.  Companies are definitionally not Mum & Dad investors, but they also aren’t necessarily evil property speculators.  The largest in that Herald story was Housing New Zealand, as it then was, which had 60,000 rental properties, and the category would include some other non-profit providers.  At least some of the people who don’t like landlords feel differently about community housing providers.

And, finally, homes are owned by trusts, a much more difficult category to survey, since there isn’t a centralised registry of the beneficiaries of trusts. The Herald didn’t even hazard a guess.

So, there’s plenty of room for Chris Luxon’s claim to be largely true. Most landlords are probably small-scale, but the small proportion who aren’t will still add up to a lot of rentals.  However, he almost certainly doesn’t have reliable population data about either the parental status or moral alignment of landlords. Nor is it clear that the typical portfolio size of landlords should be decisive in deciding how to tax and regulate them.  Mum & Dad takeaways still need to follow food safety rules.

June 10, 2023

Chocolate for the brain?

Q: Did you see that chocolate and wine really are good for the brain?

A: Not convinced

Q: A randomised trial, though. 16% improvement in memory!

A: A randomised trial, but not 16% improvement in memory

Q: It’s what the Herald says

A: Well, more precisely, it’s what the Daily Telegraph says

Q: What does the research say?

A: It’s a good trial. People were randomly allocated to either flavanols from cocoa or placebo, and they did tests that are supposed to evaluate memory. The scores of both groups improved about 15% over the trial period. The researchers think this is due to practice.

Q: You mean like how Donald Trump remembered the words he was asked in a cognitive function test?

A: Yes, like that.

Q: But the story says there was a benefit in people who had low-flavanol diets before the study.

A: Yes, there was borderline evidence of a difference between people with high and low levels of flavanols in their urine. But that’s a difference of 0.1 points in average score compared to an average change of about 1 point. Nearly all the change in the treated people also showed up in the placebo group, so only a tiny fraction of it could be an effect of treatment.

Q: Was this the planned analysis of the trial or just something they thought up afterwards?

A: It was one of the planned analyses. One of quite a lot of planned analyses.

Q: That’s ok, isn’t it?

A: It’s ok if you don’t get too excited about borderline results in one of the analyses, yes.

Q: So it doesn’t work?

A: It might work — this is relatively impressive as dietary micronutrient evidence goes.  But if it works, it only works for people with low intakes of tea and apples and cherries and citrus and peppers and chocolate and soy.

Q: <sigh> We don’t really qualify, do we?

A: Probably not.

Q: So if we were eating chocolate primarily for the health effects?

A: We’d still be doing it wrong.

June 7, 2023

Briefly

  • Michael Neilson and Chris Knox at the NZ Herald have an excellent look at crime statistics and what you can’t straightforwardly conclude from them
  • Outsourcing:  there was a story about garlic to stop Covid.  Here are responses from a blog at the University of Waikato, and from misinformation.wiki
  • Many French labour regulations start to apply when you have 50 employees, says the Economist, showing the graph on the right that has a big drop in the number of businesses reporting 50 or more employees

    The graph on the left shows the same self-reporting from a different source.  The graph in the middle shows actual numbers of employees estimated using payroll data, with nothing happening at 50 — an interesting difference.

May 31, 2023

Recycling

From Newshub: New study into therapeutic cannabis finds 96pct of participants report benefits in medical conditions.

This is deeply unsurprising, as I pointed out when similar results came out in 2017.  People who keep on using cannabis therapeutically are likely to think it works, because otherwise they wouldn’t.


May 22, 2023

Sniffing out Covid accuracy?

This research, reported at Newshub, is pretty good: dogs are being evaluated for detecting Covid in reasonably large numbers of people at schools, and the results are actually published in a respectable peer-reviewed journal.  The numbers are still getting oversold a bit:

The study said [the dogs] were 83 percent accurate at identifying COVID-19-positive students, and 90 percent on the mark at picking out virus-negative students.

There are two issues here. First, 90% accuracy on virus-negative students is not at all good.  The current prevalence of Covid in NZ is maybe 1%, and the idea of dog-based-testing is for the subset of people who are possibly infected but apparently healthy, which will cut the numbers further.  Most of the people who test positive will be false positives, who will then need to be isolated for further testing.   Antigen tests have an accuracy in uninfected people (specificity) more like 99%.
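The false-positive arithmetic is just Bayes’ rule: at low prevalence, even a small false-positive rate swamps the true positives. The 1% prevalence is the rough figure from the post.

```python
# Positive predictive value: the chance that a flagged student is
# actually infected, given sensitivity, specificity, and prevalence.
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

ppv_dogs = ppv(0.83, 0.90, 0.01)     # dogs, as reported
ppv_antigen = ppv(0.83, 0.99, 0.01)  # same sensitivity, antigen-like specificity
print(f"dogs:    {ppv_dogs:.0%} of positives are real")
print(f"antigen: {ppv_antigen:.0%} of positives are real")
```

At 90% specificity, fewer than one in ten dog ‘positives’ would be a real infection; at 99% specificity it’s closer to half — still not great, but a very different screening tool.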

Achieving 83% accuracy for identifying actual infections sounds good, and I’ve seen social media comments that it’s better than the antigen tests. It isn’t really: as the story says, the accuracy figure measures how well the dogs agree with the antigen tests.  83% accuracy means that the dogs are missing 17% of the infections that the antigen tests pick up.

If your school was doing regular testing of all students, this study suggests that you could reduce the amount of testing by using dogs to screen people first and only testing the kids who the dogs select. The accuracy would be lower than testing everyone, but it might be cheaper.   The study doesn’t really argue that dogs would be good on their own.

The story goes on to say

Dogs have also been using their powerful noses to detect other diseases such as cancer with shockingly high accuracy. 

Again, this is true (at least in a sense). Research studies, mostly small, have reported results of this sort for years. The fact that we aren’t currently using dogs to screen for any of these diseases suggests either that there are practical barriers to implementation, or that it doesn’t actually work all that well in practice.

May 5, 2023

Old people live longer?

From Ars Technica

An analysis of 2,007 damaged or defective hard disk drives (HDDs) has led a data recovery firm to conclude that “in general, old drives seem more durable and resilient than new drives.”

If you go out and look at human deaths over a period of years, you will also find that Baby Boomers are much more likely to die over age 70 than Gen X, and Gen X are more likely to die after 50 than millennials.  It’s too late for Boomers to die young.

Going on, we see that

backup and cloud storage company Backblaze uses hard drives that surpass the average life span Secure Data Recovery saw among the HDDs clients sent it last year. At the end of 2022, Backblaze’s 230,921 hard drives had an average age of 3.6 years versus the 2-year, 10-month average time before failure among the drives Secure Data Recovery worked on last year.

Again, if you look at an older group of people, the average age at death will be older than if you use a younger group.
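A small simulation makes the survivorship effect concrete: two fleets of drives with exactly the same lifetime distribution, one installed ten years ago and one two years ago. The exponential lifetimes and the six-year mean are assumptions for illustration, not Backblaze’s or Secure Data Recovery’s numbers.

```python
# Among drives that have already failed, the older fleet's average age
# at failure is higher simply because it has been observed for longer.
import random

random.seed(1)
MEAN_LIFE = 6.0  # years; assumed exponential lifetimes for simplicity

def mean_failure_age(observation_years: float, n: int = 100_000) -> float:
    lifetimes = [random.expovariate(1 / MEAN_LIFE) for _ in range(n)]
    failed = [t for t in lifetimes if t < observation_years]
    return sum(failed) / len(failed)

old = mean_failure_age(10)  # fleet installed 10 years ago
new = mean_failure_age(2)   # fleet installed 2 years ago
print(f"old fleet mean age at failure: {old:.1f} years")
print(f"new fleet mean age at failure: {new:.1f} years")
```

The new fleet’s drives look dramatically shorter-lived, even though nothing about the drives differs — no failure observed in a two-year-old fleet can be more than two years old.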

This all wouldn’t matter so much, except that they are also trying to draw conclusions about novel disk drive technologies being less reliable.  There are statistical techniques to account for the different follow-up time of different groups of drives, but it’s quite possible those techniques would just tell you “nope, too soon to tell”.

April 27, 2023

Having your say

There’s a 1News headline Only 1 in 4 Aucklanders back Wayne Brown’s sweeping cuts

I was expecting this to be based on proportions supporting the cuts in public submissions, but I was glad to see that the Council commissioned a proper survey in addition to accepting general comments and 1News reported it correctly.  That’s an excellent combination: open public comment allows people to point out problems or solutions you haven’t considered and a survey allows quantitative measurement for issues where you know the options in advance. More organisations should do this.

As you’d expect, trying to get quantitative results from the public comment doesn’t work very well. For example, 1News reported that 40% of submitters didn’t want any cuts, as opposed to the estimate of 7% for all Auckland in the survey.  People who have something to say are more likely to say it.

The number of submissions was very large: more than 10% of the number of votes in the mayoral election.  Even so, the submissions are very unrepresentative. That’s not a problem if you aren’t trying to get quantitative results; having the input biased towards experts and people who care a lot can be helpful.  It would be a problem if you were just counting the results.

April 26, 2023

Missing injuries

Stuff has an important report on road injuries in Auckland, based on a review by Auckland Transport (which doesn’t seem to be publicly available).   They looked at the numbers of serious injuries over three years, both as reported by the police crash analysis system and as reported by the hospitals where the people turned up for treatment. These numbers were not the same.

Even the discrepancy for cars is a bit surprising: ‘serious injuries’ here mean hospital admission (an overnight stay).  It’s not clear whether the police are missing half the serious-injury crashes or under-reporting how many people end up in hospital, but I would have expected more completeness of reporting for hospitalisation of people in vehicles. (According to the law, even accidents involving bikes or e-scooters without cars  that result in injury have to be reported to the police, but I’m less surprised this doesn’t happen.)

Some of the bike and pedestrian and ‘transport device’ injuries that are being missed will be crashes that didn’t involve cars.  That’s the dashed/solid distinction in the graph. The data match international research on e-scooter crashes suggesting that most injuries are to the rider, and so may be relatively less likely to get reported to police.  For bikes, a potentially worrying category is injuries involving stationary cars — ‘doorings’ — which might be less likely to get reported than those involving moving cars.

Another important concern, though, is the definition of ‘serious injury’. An injury causing broken bones and resulting in weeks or months of significant disability, but not involving an overnight hospital stay, would not meet the threshold.  This implies that even the MoH statistics are missing a lot of injuries that a normal person would call ‘serious’.