Posts written by Thomas Lumley (2534)


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

March 2, 2023

Pie and anti-pie

There are other issues with this graph (from the ABC’s Dan Ziffer): these are components of inflation rather than causes, and why ‘beyond 3%’?  The big issue, though, is the pie, where the positive numbers add to 104% and then there’s a negative 4%.

You can’t have negative numbers in a pie chart; that isn’t how pies work.  If you combine 104% of a pie and 4% of an anti-pie, you’ll end up on this list.

January 9, 2023

Briefly

  • “We were able to put together a relatively good data set of case numbers for all states, but we were explicitly forbidden to make the data publicly available, even though our data was more accurate than what was appearing in the media.” Rob Hyndman, quoted by the ABC
  • Yet another example that counting isn’t simply neutral, from the Wikipedia entry for the Bechdel Test, via depths of wikipedia: “What counts as a character or as a conversation is not defined. For example, the Sir Mix-a-Lot song ‘Baby Got Back’ has been described as passing the Bechdel test, because it begins with a valley girl saying to another ‘oh my god, Becky, look at her butt’.”
  • From the Washington Post: is your name more common for dogs or people? (in the US, of course)
  • From the New York Times, estimated carbon emissions by neighbourhood across the USA.
  • From David Hood, using the Ministry of Health public data, our holiday Covid wave. Something different seems to have happened in Tairāwhiti, and it seems to have happened at roughly the same time as the Rhythm’N’Vines festival.
January 8, 2023

Murderous Kiwis

Newshub has a story Map: New Zealand’s murder hotspots revealed.

This is the map

Neither the map nor the text says what these geographical units are. Based on the context, and the presence of “Counties Manukau” as one of them, I would expect them to be police districts: this (just a map, no data) is from the NZ Police website.

There are a few confusing things about the Newshub map, though.  We seem to be missing Wellington (in the text, too), along with Auckland City and Northland. The ‘Southern’, ‘Eastern’, and ‘Central’ police districts sit under a label ‘Auckland’ at the top right, making them look as though they might be southern, eastern, and central Auckland.

As always, there’s the question of the appropriate denominator.  Police districts are large enough that the distinction between the location of the murder and the residence of the victim might not matter too much (in contrast to census area units and assault), and I’m going to assume that the data include homicides in private homes, because an exclusion like that would have been mentioned. So it seems reasonable to use a general population denominator. This is trickier than I would have expected; it seems quite hard to find the police district populations. If you’re putting in a police OIA request like this one, you might want to ask them for populations as well.

Looking at maps, the police districts seem to (at least approximately) be combinations of DHBs*, so I used the populations of those DHBs. Here are the comparisons, just by counts of homicides over nearly three years (we’re missing Wellington and Northland):

And here are the (approximated) rates per thousand people over those three years. You might worry about how well the three Auckland districts can be separated; it wouldn’t be hard to combine them.

Bay of Plenty looks higher and Canterbury, Counties, and Waitematā look lower when you account for the differences in numbers of people.  Comparisons like this usually want rates (how dangerous), not counts (how many), if a relevant denominator is available.
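The rate calculation itself is trivial once you have denominators; here is a minimal sketch with made-up counts and populations (not the actual Police or DHB figures):

```python
# Illustrative only: hypothetical homicide counts over ~3 years and
# hypothetical DHB-based district populations.
counts = {"Bay of Plenty": 30, "Canterbury": 20, "Counties Manukau": 35}
populations = {"Bay of Plenty": 260_000, "Canterbury": 580_000,
               "Counties Manukau": 600_000}

# Rate per 1,000 people over the whole reporting period.
rates = {d: 1000 * counts[d] / populations[d] for d in counts}

# Districts ranked by rate can order quite differently from districts
# ranked by raw count, which is the point of the comparison.
for district, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{district:18s} {rate:.3f} per 1,000")
```

With these invented numbers, Counties Manukau has the most homicides but Bay of Plenty has the highest rate, which is exactly the count-versus-rate distinction the post is making.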

Newshub does get points, though, for correctly saying all these numbers are pretty low by international standards.

 

* DHB: Deprecated Health Boundary

January 5, 2023

How common is long covid and why don’t we know?

You see widely varying estimates for the probability of getting long Covid and for the recovery prognosis. Some of this is because people are picking numbers to recirculate that match their prejudices, but some of it is because these are hard questions to answer.

For example, the Hamilton Spectator (other Hamilton, not ours) reports a Canadian study following 106 people for a year. The headline was initially “75 per cent of COVID ‘long haulers’ free of symptoms in 12 months: McMaster study”. It’s now “25 per cent of COVID patients become ‘long haulers’ after 12 months: Mac study”. Both are misleading, though the second is better.

This study started out with 106 people, with an average age of 57. They had substantially more severe Covid than average:

Twenty-six patients recovered from COVID-19 at home, 35 were admitted to the ICU, and 45 were hospitalized but not ICU-admitted

For comparison, in New Zealand the hospitalisation rate has been about 1% of reported cases, with about 0.03% of reported cases admitted to the ICU. It’s not a representative sample, and this matters for estimating overall prevalence. On top of that, only half the study participants have 12-month data. That means the proportion known to have become ‘long-haulers’ is only about 12%; the 25% figure assumes that the people who didn’t continue with the study were similar to those who did.
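The attrition arithmetic can be made explicit, using only the figures quoted above:

```python
# Figures as reported: 106 enrolled, about half with 12-month data,
# and 25% of those with data still symptomatic at 12 months.
enrolled = 106
with_12m_data = enrolled // 2               # ~53 people actually followed up
long_haulers = round(0.25 * with_12m_data)  # ~13 people

# The proportion *known* to be long-haulers uses the full cohort as
# denominator; the 25% figure assumes dropouts resemble completers.
known_fraction = long_haulers / enrolled
print(f"{known_fraction:.0%} of the original cohort are known long-haulers")
```

This is where the “about 12%” in the text comes from: 13 of 106, versus 13 of the roughly 53 who stayed in the study.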

A more general problem is that “long covid” isn’t an easily measurable thing. There are people who are still unwell in various ways a long time after they get Covid. There are multiple theories about what exactly the mechanism is, and it’s quite possible that more than one of these theories is true — we don’t even know that ‘long covid’ is a single condition.  Because we aren’t sure about the mechanism or mechanisms, there isn’t a test for long Covid the way there is for Covid.  If you have symptoms plus a positive RAT or PCR test for the SARS-CoV-2 virus you have Covid; that’s what ‘having Covid’ means. There isn’t a simple, objective definition like that for long Covid.

Because there isn’t a simple, objective test for long covid, different studies define it in different ways: usually as having had Covid plus some set of symptoms later in time. Different studies use different symptoms. The larger the study, the more generic the symptom measurements tend to be, and so you’d expect more people to report having those symptoms.  If you simply ask about ‘fatigue’ you’ll pick up people with ordinary everyday <gestures-broadly-at-internet-and-world> as well as people with crushing post-Covid exhaustion, even though they’re very different.

There are also different time-frames in different studies: more people will have symptoms for three months than for twelve months just because twelve months is longer.  Twelve-month follow-up also implies the study must have started earlier; a study that followed people for twelve months after initial illness won’t include anyone who had Omicron and might include a lot of unvaccinated people.

The different definitions and different populations matter. The majority of people in New Zealand have had Covid. There’s no way that 25% of them have the sort of long Covid that someone like Jenene Crossan or Daniel Freeman did; it would be obvious in the basic functioning of society. Some people do have disabling long Covid; some people have milder versions; some have annoying post-Covid symptoms; some people seem to recover ok (though they might be at higher risk of other disease in the future). We don’t have good numbers on the size of these groups, or ways to predict who is who, or treatments, and it’s partly because it’s difficult and partly because the pandemic keeps changing.

It’s also partly because we haven’t put enough resources into it.

Ok boomers?

A graph, which has been popular on the internets, in this instance via Matthew Yglesias:

Another graph, showing the same thing per capita rather than as shares of total wealth, also via Matthew Yglesias. This one appears to have a very different message.

And a third graph, from the FRED system operated by the Federal Reserve Bank of St. Louis, showing US real per-capita GDP:

So: Gen X have a much lower share of US wealth than the Baby Boomers did at the same age.  This is partly because we are a smaller fraction of the population than they were: per-capita wealth is similar.  But per-capita wealth being similar isn’t as good as it sounds, because the US as a whole is substantially richer now than when the Boomers were 50.
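A toy example, with invented numbers rather than the actual Fed data, of how a smaller cohort can match another cohort’s per-capita wealth while holding a much smaller share of the total:

```python
# Two hypothetical snapshots, each taken when the cohort's typical
# member was about 50: (cohort population in millions,
#                       cohort wealth in $bn,
#                       total US wealth in $bn at that time).
snapshots = {
    "Boomers at 50": (75, 2100, 10_000),
    "Gen X at 50":   (50, 1400, 16_000),
}

for cohort, (pop, wealth, total) in snapshots.items():
    share = wealth / total          # fraction of all US wealth
    per_capita = wealth / pop       # $bn per million people = $1000s per person
    print(f"{cohort}: share {share:.0%}, per-capita ${per_capita:.0f}k")
```

In this sketch both cohorts hold $28k per person, yet the smaller cohort’s share of the total is less than half the larger one’s, and the total itself grew in the meantime. Different questions, different answers.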

This isn’t a gotcha for either of the first two graphs — different questions are allowed to have different answers — but it might be useful context for the comparison.

January 1, 2023

Briefly

  • The “Great Kiwi Christmas Survey” led to stories at Herald, Newshub, Farmers Weekly, and Radio NZ on what people were eating for their Christmas meal.  The respondents for the “Great Kiwi Christmas Survey” were variously described as “over 1000”, “over 1800”, and “over 3300” Kiwis, which seems a bit vague. According to Newsroom, this was actually a bogus poll: “We promoted the survey through social media channels and sent the survey to those people who had signed up to receive information from us,” concedes Lisa Moloney, the promotions manager for Retail Meat NZ and Beef + Lamb NZ.  Headlines based on bogus polls aren’t ever ok — even when you don’t think the facts really matter. Newsroom argued that the results under-represented vegetarians, which is plausible, but you can’t really tell, since the data presented don’t give the number of vegetarians. Not all Christmas meals at which vegetarians are present will be centred around plant-based food, as any vegetarian can tell you.
  • Stuff, with the help of Auckland Transport, wrote about Auckland’s most prolific public transport user. Apparently, someone took 3400 trips over a year.  It’s surprising that’s even possible: nearly ten trips per day, every day,  and since the person is doing this on a gold card, starting no earlier than 9am on weekdays.  Assuming the numbers are correct — actually, whether the numbers are correct or not — it’s also a bit disturbing that this analysis was done.  The summaries of typical and top 100 users seem a lot more reasonable. The piece says “Stuff asked to interview the person, however Auckland Transport would not reveal their identity for privacy reasons.”, which is good, but you might want them not to be in a position to reveal it.
  • “Support for low-income housing followed a similar pattern, with broad approval for building it someplace in the country (82 percent) but much less for building it locally (65 percent)” at 538. There should be a word for this.
  • Interesting discussion on the Slate Money podcast about a data display, the “Fed Dot Plot”, which shows the best guesses of members of the Federal Reserve Open Market Committee as to what interest rates they will want in the future; each dot is one person.  The Fed is trying to de-emphasise this graph at the moment — partly because people tend to over-interpret it. Importantly, there’s no individual uncertainty shown, and there’s no way to tell how much of the difference between people is due to difference in what they think the economic situation will be and how much is due to differences in how they expect they will want to react to it.
December 31, 2022

Death by Chocolate?

The BBC: Hershey sued in US over metal in dark chocolate claim. 

This is a slight variation on normal headline grammar:  Hershey isn’t being sued over something they claimed; they are being sued because Consumer Reports claims to have found surprisingly high concentrations of lead and cadmium in dark chocolate from a wide range of manufacturers, small and large, organic and conventional, fair-trade and … whatever the opposite of that is.  The cadmium seems to come from the soil — chocolate eaters are on the wrong end of phytoremediation here — and the experts don’t actually know where the lead comes from. Hershey is being sued because they’re a potentially rewarding target, not because they are more at fault than other chocolate makers.

So, how bad is it? Consumer Reports say that the heavy-metal concentrations exceed health standards if you eat an ounce (like, 30g) every day. To get this result, they used the strictest health thresholds they could find: as they phrase it, “CR’s scientists believe that California’s levels are the most protective available”.  We can look at how California computed its threshold (the MADL, or maximum allowable dose level) for cadmium — at least, how it did in 2001; it’s possible there’s a stricter threshold that I haven’t found on Google.  The procedure was to take the highest concentration with no observed adverse effects in animals, scale it by weight, and divide by 1000 for safety.  With cadmium they didn’t have a no-effect study, only a study showing adverse effects, so they put in an extra factor of 10 to account for that.  So, the threshold we’re comparing to is 10,000 times lower than the lowest concentration definitely shown to be harmful.  The California law doesn’t say it’s dangerous to exceed this threshold; it says that if you’re under it you’re so safe that you don’t have to warn consumers that there’s cadmium present. (PDF)

For chemicals known to the state to cause reproductive toxicity, an exemption from the warning requirement is provided by the Act when a person in the course of doing business is able to demonstrate that an exposure for which the person is responsible will have no observable reproductive effect, assuming exposure at 1,000 times the level in question

Presumably the same is basically true of lead.  Now, lead and cadmium are well worth avoiding, even at levels not specifically known to be harmful. Lead, in particular, seems to have small adverse effects even at very low concentrations.  But the level of risk from doses anywhere in the vicinity of the California MADL is, by careful design, very low.

We can look at NZ dietary exposures to cadmium, in the incredibly detailed NZ Total Diet Study (PDF). We’re averaging about 5.2 µg per kg of bodyweight per month for women, 6.6 for men, and 12 for 5–6-year-old kids. The provisional tolerable monthly intake given in that report is 25 µg/kg.

Our numbers are a bit higher than France and Australia, a bit lower than Hong Kong, and about the same as Italy.  If you take the hypothetical 58kg woman used in the California regulatory maths, she would consume about 10 µg/day of cadmium. The California limit is 4.1 µg/day and the NZ limit is 48 µg/day. So, an ounce of high-cadmium dark chocolate per day, if it’s, say, twice the California limit, is a significant fraction of the typical cadmium consumption, but well under any levels actually known to have health risks.
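The unit conversions in the paragraph above are easy to check, taking the figures as quoted from the NZ Total Diet Study and the California MADL:

```python
# Figures as quoted in the text: 58 kg reference woman, NZ average
# cadmium intake of 5.2 µg/kg bodyweight/month for women, provisional
# tolerable monthly intake of 25 µg/kg, California MADL of 4.1 µg/day.
body_weight_kg = 58
nz_intake_ug_kg_month = 5.2
days_per_month = 30  # a rough month

# Monthly per-kg intake -> daily whole-body intake.
daily_ug = body_weight_kg * nz_intake_ug_kg_month / days_per_month
print(f"Typical NZ intake: {daily_ug:.1f} µg/day")       # about 10

nz_limit_ug_day = body_weight_kg * 25 / days_per_month
print(f"NZ tolerable dose: {nz_limit_ug_day:.0f} µg/day")  # about 48

ca_madl_ug_day = 4.1
print(f"California MADL:   {ca_madl_ug_day} µg/day")
```

So typical NZ consumption already sits between the two limits: more than double the California MADL, but well under a fifth of the NZ tolerable dose.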

For years, the StatsChat rule on dark chocolate has been “If you’re eating it primarily for the health benefits, you’re doing it wrong”. That still seems to hold.


December 7, 2022

Good reporting of numbers

Stuff has a new fact-check column, “The Whole Truth”, and there’s a good example with discussion of youth crime trends, by James Halpin.

The graphs are just the sort of thing I use and recommend: enough history and (where appropriate) enough context to see which trends are just continuing and where things might have changed.

It’s clear that the orange line in the left panel is different from basically everything else.  It looks as though the blue line might be going up, but it’s clearly still lower than it was in recent years.

That is, one category of crime in one age group is up.  Overall, robberies and burglaries, even those specifically committed by young people, aren’t increasing, but these vehicle crimes are.  They go on to say that ram-raids by young people are up; the absolute numbers are small, but these are serious crimes, with damage out of proportion to the amount stolen. It’s unlikely to be reporting bias — again, these are serious crimes that would usually be reported.

The data can’t really support a general ‘kids today’ narrative, but there is a real, specific problem.

December 6, 2022

Briefly

  • I’ve often complained about misleading bar graphs in reporting electoral opinion polls. 1News just punted on the whole issue with this:
  • The cost of the Meola Road rebuild, $47.5 million, has been inaccurately portrayed as the cost of the bike lane that’s a minor component of it. Twitter user @ArcCyclist got the actual breakdown from the Council:

    While I’m at it, I do want to note one way it’s a bad table: the cycleway number is given to whole dollars, with everything else given in cents, so it looks even smaller than it really is. You usually don’t want to delete trailing zeroes in a table.
  • The ESR Covid wastewater dashboard is now at poops.nz. Yes, really.
  • There’s a new “technical report for future UK Chief Medical Officers, Government Chief Scientific Advisers, National Medical Directors and public health leaders in a pandemic” from the UK. Even if you aren’t among that exalted company, some of the information may be useful to ordinary citizens as well.
  • The Ministry of Health is seeking public comment on something it wrote about ‘precision health’. There might be StatsChat readers who have reckons.
  • Eric Crampton notes that cost-benefit ratios for transport projects are defined in an idiosyncratic way that makes them hard to compare either with each other or with non-transport projects.
  • The first drug to convincingly delay type 1 diabetes onset has been approved. The average benefit is about two years, and the treatment will be marketed at US$200,000.  Cost-effectiveness research suggests this is way more than it’s worth for most people, even in the US, where insulin for type 1 diabetes is very expensive.
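On the trailing-zeroes point in the Meola Road table item above, a minimal sketch (with hypothetical line items, not the actual Council breakdown) of why you format a whole money column the same way:

```python
# Hypothetical project line items in dollars; one value happens to be a
# round number. Formatting every row with the same precision keeps the
# column visually comparable instead of making round numbers look shorter.
costs = [
    ("Cycleway",     13_000_000.00),
    ("Road rebuild", 28_123_456.78),
    ("Drainage",      6_376_543.22),
]

for item, dollars in costs:
    # ",.2f" keeps the thousands separators and two decimal places
    # (including trailing zeroes) on every row.
    print(f"{item:14s} ${dollars:>15,.2f}")
```

Dropping the `.00` from the round number, as in the Council’s table, makes it look shorter (and so smaller) than the values around it.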
November 28, 2022

99.44% pure

From the Guardian: Computer says there is a 80.58% probability painting is a real Renoir. The story goes on to say that Dr Carina Popovici, Art Recognition’s CEO, believes this ability to put a number on the degree of uncertainty is important.

It’s definitely valuable to put a number on the degree of uncertainty. What’s much less clear is that it’s valuable to quote that number to four-digit precision. Let’s think about what it would take to be that precise.

If the 80.58% number was estimated from a proportion of observed data in some sense, quoting it to four digits would only make sense if the uncertainty was less than about 0.005 percentage points.  A standard error of 0.005 percentage points would already need a sample size of more than sixty million, and making the last digit genuinely reliable would take several hundred million.
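Assuming the number came from a simple binomial proportion, the back-of-envelope is easy to reproduce:

```python
import math

p = 0.8058  # the quoted probability, treated as an observed proportion

def standard_error(n: int) -> float:
    """Standard error of a binomial proportion estimated from n observations."""
    return math.sqrt(p * (1 - p) / n)

for n in (10_000, 10_000_000, 500_000_000):
    # Report in percentage points, to compare against the 0.01-point
    # resolution of the last quoted digit.
    print(f"n = {n:>11,}: se = {100 * standard_error(n):.4f} percentage points")
```

Only at sample sizes in the hundreds of millions does the standard error shrink to a couple of thousandths of a percentage point, small enough for the fourth digit of 80.58% to mean anything.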

Another way you can get an estimate with high precision is including subjective expert opinion, which would be entirely appropriate in a context like this. There’s no limit to how precise this can be for the person whose opinion it is — you believe exactly what you believe — but there are very strong limits on how precise it can realistically be as a guide to others.  If the computer isn’t the one buying the Renoir, other people probably shouldn’t care about its opinion to more than one or two digits of accuracy.

Sometimes when you come up with an estimate you want to quote it to higher precision than is directly useful — lots of statistical software, including some I write, quotes four or more digits in the default output. This allows rounding to happen closer to the point of use, such as just before it ends up in a headline in the mainstream media.