Posts from July 2022 (9)

July 28, 2022

Counting bots better

I wrote before about estimating the proportion of spam bots among the apparent people on Twitter. The way Twitter does it seems ok. According to some people on the internet who seem to know about Delaware mergers and acquisitions law, it doesn’t even matter whether the way Twitter does it is ok, as long as it roughly matches what they have claimed they do. But it’s still interesting from a statistics point of view to ask whether it could be done better given the existence of predictive models (“AI”, if you must). It’s also connected to my research.

Imagine we have a magic black box that spits out “Bot” or “Not” for each user. We don’t know how it works (it’s magic) and we don’t know how much to trust it (it’s magic). We feed in the account details of 217 million monetisable daily active users and it chugs and whirrs for a while before saying “You have 15696969 bots.”

We’re not going to just tell investors “A magic box says we have 15696969 bots among our daily active users”, but it’s still useful information. We have also reviewed a genuine random sample of 1000 accounts by hand, over a couple of weeks, and we get 54 bots. We don’t want to just ignore the magic box and say “we have 5.4% bots”. What should our estimate be, combining the two? It obviously depends on how accurate the magic box is! We can get some idea by looking at what the magic box says for the 1000 accounts reviewed by hand.

Maybe the magic box says 74 of the 1000 accounts are bots: 50 of the ones that really are, and 24 others. That means it’s fairly accurate, but it overcounts by about 40%. Over all of Twitter, you probably don’t have 15696969 bots; maybe you have more like 11,420,000 bots. If we want the best estimate that doesn’t require trusting the magic box and only requires trusting the random sampling, we can divide up Twitter into accounts the box says are bots and ones that it says aren’t bots, estimate the true proportion in each group, and combine. In this example, we’d get 5.3% with a 95% confidence interval of (4.4%, 6.2%). If we didn’t have the magic box at all, we’d get an estimate of 5.4% with a confidence interval of (4.0%, 6.8%). The magic box has improved the precision of the estimate. With this technique, the magic box can only be helpful. If it’s accurate, we’ll get a big improvement in precision. If it’s not accurate, we’ll get little or no improvement in precision, but we still won’t introduce any bias.
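The arithmetic behind those intervals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name is made up, the intervals use a plain normal approximation, the stratum sample sizes are treated as fixed, and there is no finite-population correction.

```python
from math import sqrt

def poststratified_proportion(pop_sizes, sample_sizes, sample_hits, z=1.96):
    """Post-stratified estimate of a proportion with a normal-approximation CI.

    pop_sizes    -- known population count in each stratum (from the magic box)
    sample_sizes -- how many hand-reviewed accounts fell in each stratum
    sample_hits  -- how many of those were really bots
    """
    total = sum(pop_sizes)
    estimate = 0.0
    variance = 0.0
    for N_h, n_h, x_h in zip(pop_sizes, sample_sizes, sample_hits):
        w_h = N_h / total          # stratum share of the population
        p_h = x_h / n_h            # true-bot rate observed in this stratum
        estimate += w_h * p_h
        variance += w_h ** 2 * p_h * (1 - p_h) / n_h
    half_width = z * sqrt(variance)
    return estimate, estimate - half_width, estimate + half_width

POPULATION = 217_000_000   # monetisable daily active users
BOX_BOTS = 15_696_969      # accounts the magic box calls "Bot"

# Strata: box says "Bot" (74 sampled, 50 real bots), box says "Not" (926 sampled, 4 real bots)
est, lower, upper = poststratified_proportion(
    pop_sizes=[BOX_BOTS, POPULATION - BOX_BOTS],
    sample_sizes=[74, 926],
    sample_hits=[50, 4],
)

# Ignoring the box: 54 bots in a simple random sample of 1000
naive = 54 / 1000
naive_half = 1.96 * sqrt(naive * (1 - naive) / 1000)

print(f"post-stratified: {est:.1%} ({lower:.1%}, {upper:.1%})")
print(f"sample only:     {naive:.1%} ({naive - naive_half:.1%}, {naive + naive_half:.1%})")
```

With the numbers in the example this gives roughly 5.3% (4.4%, 6.2%) with the box and 5.4% (4.0%, 6.8%) without it.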

The technique is called post-stratification, and it’s the simplest form of a very general approach to using information about a whole population to improve an estimate from a random sample. Improving estimates of proportions or counts with post-stratification is a very old idea (well, very old by the standards of statistics). More recent research in this area includes ways to improve estimation of more complicated statistical estimates, such as regression models. We also look at ways to use the magic box to pick a better random sample: in this example, instead of picking 1000 users at random we might pick a random sample of 500 accounts that the magic box says are bots and 500 accounts that it says are people. Or maybe it’s more reliable on old accounts than new ones, and we want to take random samples from more new accounts and fewer old accounts.
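To see why the 500/500 design might help, here is a rough Python comparison of standard errors. It assumes, purely for illustration, that the true bot rates in the two groups equal the rates seen in the hand-reviewed sample (50/74 and 4/926). The function name is hypothetical and the formula ignores finite-population corrections.

```python
from math import sqrt

def stratified_se(weights, rates, allocation):
    """Standard error of a stratified estimate of an overall proportion.

    weights    -- each stratum's share of the population
    rates      -- assumed true proportion of bots in each stratum
    allocation -- number of accounts sampled from each stratum
    """
    return sqrt(sum(w ** 2 * p * (1 - p) / n
                    for w, p, n in zip(weights, rates, allocation)))

W_BOT = 15_696_969 / 217_000_000      # share of accounts the box calls "Bot"
weights = [W_BOT, 1 - W_BOT]
rates = [50 / 74, 4 / 926]            # assumption: sample rates are the true rates

# What a simple random sample of 1000 happened to give: 74 "Bot", 926 "Not"
se_random = stratified_se(weights, rates, [74, 926])
# The design in the text: 500 from each group
se_equal = stratified_se(weights, rates, [500, 500])

print(f"random 1000, post-stratified: SE = {se_random:.4f}")
print(f"500 + 500 stratified sample:  SE = {se_equal:.4f}")
```

Under these assumptions the 500/500 design has the smaller standard error, because it spends more of the hand-review effort on the group where the box is most often wrong.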

In practical applications the real limit on this idea is the difficulty of doing random sampling. For Twitter, that’s easy. It’s feasible when you’re choosing which medical records from a database to check by hand, or which frozen blood samples to analyse, or which Covid PCR swabs to send for genome sequencing. If you’re sampling people, though, the big challenge is non-response. Many people just won’t fill in your forms or talk to you on the phone or whatever. Post-stratification can be part of the solution there, too, but the problem is a lot messier.

July 27, 2022

Attendance figures

By Thomas Lumley

Chris Luxon said today on RNZ Morning Report that “55% of kids aren’t going to school regularly”. On Twitter, Simon Britten said “In Term 1 of 2022 the Unjustified Absence rate was 6.4%, up from 4.1% the year prior. Not great, but also not 50%.”

It’s pretty unusual for NZ politicians to make straightforwardly false statements about publicly available statistics, so if there are numbers that seem to disagree or are just surprising, the most likely explanation is that the number doesn’t mean what you think it means. It sounds like we have a disagreement about facts here, but we actually have a disagreement about which summary is most useful.

New Zealand does have an ongoing problem with school attendance — according to the Government, not just the Opposition. The new Attendance and Engagement Strategy document (PDF) says that the percentage of regular attendance was 59.7% in 2021, down from 69.5% in 2015. The aim is to raise this to 70% by 2024 and 75% by 2026.

So if the unjustified absence rate is 6.4%, how can the regular attendance rate be 59.7% or 45%? “Regular attendance” is defined as attending school at least 90% of the time — so if you miss more than one day per fortnight, or more than one week per term, you are not attending regularly.

For example, suppose half the kids in NZ missed one week and one day in term 1 and the other half missed nothing. In a term of about ten weeks, each of those kids would be absent about 12% of the time, so the overall absence rate would be about 6%, but the regular attendance rate would be 50%. The unjustified absence rate could be anything from 0% to 6%. It’s quite possible to have a 5% unjustified absence rate and a 50% regular attendance rate.
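The two summaries can be checked on a toy roster. The sketch below assumes a 50-day term (ten five-day weeks, which the post does not state), a hypothetical 1000 students, and that the other half of the students miss nothing. On those assumptions the affected students are each absent 12% of the time, which averages to 6% over everyone.

```python
TERM_DAYS = 50        # assumption: ten weeks of five school days
DAYS_MISSED = 6       # one week and one day
N_STUDENTS = 1000     # hypothetical roster size

# Half the students miss six days, the rest miss none (an assumption)
half = N_STUDENTS // 2
days_absent = [DAYS_MISSED] * half + [0] * (N_STUDENTS - half)

# "Regular attendance" as defined in the post: present at least 90% of the time
attendance = [1 - d / TERM_DAYS for d in days_absent]
regular_rate = sum(a >= 0.90 for a in attendance) / N_STUDENTS

absence_rate_affected = DAYS_MISSED / TERM_DAYS
overall_absence_rate = sum(days_absent) / (N_STUDENTS * TERM_DAYS)

print(f"regular attendance rate:       {regular_rate:.0%}")
print(f"absence rate, affected kids:   {absence_rate_affected:.0%}")
print(f"absence rate, all kids:        {overall_absence_rate:.0%}")
```

The point survives any reasonable choice of term length: a modest overall absence rate is compatible with half the students falling below the 90% threshold.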

Now we want more details. They are available here. The regular attendance rate is down dramatically this year, from 66.8% in term 1 last year to 46.1% in term 1 this year. The proportion of half-days attended is down less dramatically, from 90.5% in term 1 last year to 84.5% in term 1 this year. Justified absences are up 4.5 percentage points and unjustified absences up by just under 2 percentage points.

What’s different between term 1 this year and term 1 last year?

Well…

It wouldn’t be surprising if a fair fraction of NZ kids took a week off school in term 1, either because they had Covid or because they were in isolation as household contacts. That’s what should have happened, from a public health point of view. It’s actually a bit surprising to me that justified absences weren’t even higher. Term 1, 2022, isn’t really representative of the long-term state of schools in NZ. Attendance rates were higher before the Omicron spike; they will probably be higher in the future even without anti-truancy interventions.

It’s reasonable to be worried about school attendance, as the Government and Opposition both claim they are. I don’t think “55% of kids aren’t going to school regularly” is a particularly good way to describe a Covid outbreak. Last year’s figures are more relevant if you want to talk about the problem seriously.

July 26, 2022

Briefly

By Thomas Lumley

Derek Lowe writes “Late last week came this report in Science about doctored images in a series of very influential papers on amyloid and Alzheimer’s disease. That’s attracted a lot of interest, as well it should, and as a longtime observer of the field (and onetime researcher in it), I wanted to offer my own opinions on the controversy.” As he says, the interest in amyloid is not just (or primarily) driven by the allegedly fraudulent research. There’s a lot of support for the importance of beta-amyloid from genetics: mutations that cause early-onset Alzheimer’s, and perhaps even more convincingly, a mutation found in Icelanders that protects against Alzheimer’s. The alleged fraud is bad, as is the current complete failure of research into treatments, but the link between the two isn’t as strong as some people are implying.
Prof Casey Fiesler, who teaches in the area of tech ethics and governance, is developing a TikTok-based tech ethics and privacy course.
ESR’s Covid wastewater dashboard is live. This is important because Everyone Poops. We don’t have an exact conversion from measured viruses to active cases, and the conversion could vary with the strain of Covid and with age of the patients, but at least it won’t depend on who decides to get tested and report their test results.
The wastewater data will be an excellent complement for the prevalence survey that the Ministry of Health is starting up. The survey, assuming that a reasonable fraction of people go along with getting tested, will give a direct estimate of the true population infection rate, but it will not be as detailed as the wastewater data, which can give estimates for relatively small areas and short time frames.
Briefing on the Data and Statistics Bill from the NZ Council of Civil Liberties. If you follow StatsChat you’ve seen these points before. And you will see them again.

NRL Predictions for Round 20

By David Scott

Team Ratings for Round 20

The basic method is described on my Department home page.
Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team	Current Rating	Rating at Season Start	Difference
Panthers	14.17	14.26	-0.10
Storm	11.32	19.20	-7.90
Rabbitohs	6.28	15.81	-9.50
Sharks	3.35	-1.10	4.50
Roosters	2.59	2.23	0.40
Sea Eagles	2.07	10.99	-8.90
Cowboys	1.90	-12.27	14.20
Broncos	0.62	-8.90	9.50
Eels	-0.52	2.54	-3.10
Raiders	-1.53	-1.10	-0.40
Dragons	-2.91	-7.99	5.10
Bulldogs	-6.09	-10.25	4.20
Titans	-7.16	1.05	-8.20
Knights	-7.27	-6.54	-0.70
Wests Tigers	-9.27	-10.94	1.70
Warriors	-9.54	-8.99	-0.60

Performance So Far

So far there have been 144 matches played, 98 of which were correctly predicted, a success rate of 68.1%.
Here are the predictions for last week’s games.

	Game	Date	Score	Prediction	Correct
1	Eels vs. Broncos	Jul 21	14 – 36	5.00	FALSE
2	Dragons vs. Sea Eagles	Jul 22	20 – 6	-4.10	FALSE
3	Knights vs. Roosters	Jul 22	12 – 42	-3.80	TRUE
4	Raiders vs. Warriors	Jul 23	26 – 14	13.80	TRUE
5	Panthers vs. Sharks	Jul 23	20 – 10	14.50	TRUE
6	Rabbitohs vs. Storm	Jul 23	24 – 12	-4.00	FALSE
7	Bulldogs vs. Titans	Jul 24	36 – 26	2.90	TRUE
8	Cowboys vs. Wests Tigers	Jul 24	27 – 26	16.00	TRUE
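The Correct column appears to follow from comparing the sign of the predicted margin with the sign of the actual margin. A small Python sketch of that check, using the rows above, is below. It assumes the home team is listed first and the score is written home – away. The function names are made up for illustration, and draws or a predicted margin of exactly zero are not handled.

```python
# (home, away, home_score, away_score, predicted_margin) copied from the table above
games = [
    ("Eels", "Broncos", 14, 36, 5.00),
    ("Dragons", "Sea Eagles", 20, 6, -4.10),
    ("Knights", "Roosters", 12, 42, -3.80),
    ("Raiders", "Warriors", 26, 14, 13.80),
    ("Panthers", "Sharks", 20, 10, 14.50),
    ("Rabbitohs", "Storm", 24, 12, -4.00),
    ("Bulldogs", "Titans", 36, 26, 2.90),
    ("Cowboys", "Wests Tigers", 27, 26, 16.00),
]

def predicted_winner(home, away, margin):
    """A positive predicted margin means the home team is tipped to win."""
    return home if margin > 0 else away

def is_correct(home_score, away_score, margin):
    """True when the predicted and actual margins have the same sign."""
    actual_margin = home_score - away_score
    return (actual_margin > 0) == (margin > 0)

results = [is_correct(hs, as_, m) for _, _, hs, as_, m in games]
tips = [predicted_winner(h, a, m) for h, a, _, _, m in games]

for (home, away, *_), tip, ok in zip(games, tips, results):
    print(f"{home} vs. {away}: tipped {tip}, correct={ok}")
print(f"{sum(results)} of {len(results)} correct this round")
```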

Predictions for Round 20

Here are the predictions for Round 20. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

	Game	Date	Winner	Prediction
1	Sea Eagles vs. Roosters	Jul 28	Sea Eagles	2.50
2	Warriors vs. Storm	Jul 29	Storm	-15.40
3	Eels vs. Panthers	Jul 29	Panthers	-11.70
4	Titans vs. Raiders	Jul 30	Raiders	-2.60
5	Sharks vs. Rabbitohs	Jul 30	Sharks	0.10
6	Broncos vs. Wests Tigers	Jul 30	Broncos	12.90
7	Knights vs. Bulldogs	Jul 31	Knights	1.80
8	Dragons vs. Cowboys	Jul 31	Cowboys	-1.80

July 19, 2022

NRL Predictions for Round 19

By David Scott

Team Ratings for Round 19

The basic method is described on my Department home page.
Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team	Current Rating	Rating at Season Start	Difference
Panthers	15.25	14.26	1.00
Storm	11.64	19.20	-7.60
Sea Eagles	6.52	10.99	-4.50
Rabbitohs	6.30	15.81	-9.50
Roosters	3.96	2.23	1.70
Sharks	3.85	-1.10	5.00
Cowboys	1.96	-12.27	14.20
Eels	0.71	2.54	-1.80
Broncos	0.02	-8.90	8.90
Raiders	-0.73	-1.10	0.40
Dragons	-6.88	-7.99	1.10
Titans	-7.55	1.05	-8.60
Bulldogs	-7.64	-10.25	2.60
Knights	-9.13	-6.54	-2.60
Warriors	-9.37	-8.99	-0.40
Wests Tigers	-10.90	-10.94	0.00

Performance So Far

So far there have been 136 matches played, 96 of which were correctly predicted, a success rate of 70.6%.
Here are the predictions for last week’s games.

	Game	Date	Score	Prediction	Correct
1	Cowboys vs. Sharks	Jul 15	12 – 26	3.20	FALSE
2	Eels vs. Warriors	Jul 15	28 – 18	16.60	TRUE
3	Roosters vs. Dragons	Jul 16	54 – 26	11.90	TRUE
4	Sea Eagles vs. Knights	Jul 16	42 – 12	17.00	TRUE
5	Titans vs. Broncos	Jul 16	12 – 16	-4.70	TRUE
6	Wests Tigers vs. Panthers	Jul 17	16 – 18	-25.90	TRUE
7	Storm vs. Raiders	Jul 17	16 – 20	18.00	FALSE
8	Bulldogs vs. Rabbitohs	Jul 17	28 – 36	-11.50	TRUE

Predictions for Round 19

Here are the predictions for Round 19. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

	Game	Date	Winner	Prediction
1	Eels vs. Broncos	Jul 21	Eels	3.70
2	Dragons vs. Sea Eagles	Jul 22	Sea Eagles	-10.40
3	Knights vs. Roosters	Jul 22	Roosters	-10.10
4	Raiders vs. Warriors	Jul 23	Raiders	14.10
5	Panthers vs. Sharks	Jul 23	Panthers	14.40
6	Rabbitohs vs. Storm	Jul 23	Storm	-2.30
7	Bulldogs vs. Titans	Jul 24	Bulldogs	2.90
8	Cowboys vs. Wests Tigers	Jul 24	Cowboys	15.90

July 18, 2022

Briefly

By Thomas Lumley

Training data for emotions/sentiment from Google appears to be badly wrong (Inconceivable!)
About 12% of people surveyed in the UK said they knew “a great deal” or “a fair amount” about a non-existent candidate for leader of the Conservative Party. More reassuringly, the proportion who had ‘never heard of’ this candidate was much higher than for the real candidates.
The New York Times asks what’s the chance that Trump adversaries McCabe and Comey got tax audits — and, much more usefully, shows how the answer to this question depends on how you define the comparison
Hilda Bastian looks at the evidence on whether female national leaders handled the pandemic better, now that we have more follow-up
From the President of the Royal Society (of London), the need for data literacy, but also the need to “avoid shoehorning everything to do with numbers into a box labelled “Maths”, which has negative connotations for many. If you use that box as a place to pigeonhole quantitative literacy, you are shooting yourself in the foot.” (disclaimer: he’s a statistician)
A re-analysis suggests that the vaccine effectiveness data for the Sputnik coronavirus vaccine cannot possibly be correct. Among other red flags, the estimated effectiveness in different age groups was far more similar than would be expected even if the true effectiveness was identical in the groups.

Sampling and automation

By Thomas Lumley

Q: Did you see Elon Musk is trying to buy or maybe not buy Twitter?

A: No, I have been on Mars for the last month, in a cave, with my eyes shut and my fingers in my ears

Q: <poop emoji>. But the bots? Sampling 100 accounts and no AI?

A: There are two issues here: estimating the number of bots, and removing spam accounts

Q: But don’t you need to know how many there are to remove them?

A: Not at all. You block porn bots and crypto spammers and terfs, right?

Q: Yes?

A: How many?

Q: Basically all the ones I run across.

A: That’s what Twitter does, too. Well, obviously not the same categories. And they use automation for that. Their court filing says they suspend over a million accounts a day (paragraph 65)

Q: But the 100 accounts?

A: They also manually inspect about 100 accounts per day, taken from the accounts that they are counting as real people — or as they call us, “monetizable daily active users” — to see if they are bots. Some perfectly nice accounts are bots — like @pomological or @ThreeBodyBot or @geonet or the currently dormant @tuureiti — but bots aren’t likely to read ads with the same level of interest as monetizable daily active users do, so advertisers care about the difference.

Q: Why not just use AI for estimation, too?

A: One reason is that you need representative samples of bots and non-bots to train the AI, and you need to keep coming up with these samples over time as the bots learn to game the AI

Q: But how can 100 be enough when there are 74.3 bazillion Twitter users?

A: The classic analogy is that you only need to taste a teaspoon of soup to know if it’s salty enough. Random sampling really works, if you can do it. In many applications, it’s hard to do: election polls try to take a random sample, but most of the people they sample don’t cooperate. In this case, Twitter should be able to do a genuine random sample of the accounts they are counting as monetizable daily active users, and taking a small sample allows them to put more effort into each account. It’s a lot better to look at 100 accounts carefully than to do a half-arsed job on 10,000.

Q: 100, though? Really?

A: 100 per day. They report the proportion every 90 days, and 9000 is plenty. They’ll get good estimates of the average even over a couple of weeks
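For a rough sense of “plenty”, here is a back-of-envelope margin of error in Python. It assumes a genuine simple random sample and, purely for illustration, a true bot rate of about 5%. The function name is hypothetical and the interval is the usual normal approximation.

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p from a random sample of n."""
    return z * sqrt(p * (1 - p) / n)

ASSUMED_BOT_RATE = 0.05   # assumption for illustration, not a figure from Twitter
ACCOUNTS_PER_DAY = 100

for days in (1, 14, 90):
    n = ACCOUNTS_PER_DAY * days
    moe = margin_of_error(ASSUMED_BOT_RATE, n)
    print(f"{days:>2} days, n = {n:>4}: +/- {100 * moe:.1f} percentage points")
```

On these assumptions, 90 days of sampling pins the proportion down to within about half a percentage point, and two weeks gets within a bit over one percentage point.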

July 12, 2022

NRL Predictions for Round 18

By David Scott

Team Ratings for Round 18

The basic method is described on my Department home page.
Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team	Current Rating	Rating at Season Start	Difference
Panthers	16.64	14.26	2.40
Storm	12.93	19.20	-6.30
Rabbitohs	6.58	15.81	-9.20
Sea Eagles	5.71	10.99	-5.30
Cowboys	2.99	-12.27	15.30
Roosters	2.98	2.23	0.70
Sharks	2.81	-1.10	3.90
Eels	1.24	2.54	-1.30
Broncos	0.08	-8.90	9.00
Raiders	-2.02	-1.10	-0.90
Dragons	-5.90	-7.99	2.10
Titans	-7.60	1.05	-8.60
Bulldogs	-7.92	-10.25	2.30
Knights	-8.32	-6.54	-1.80
Warriors	-9.90	-8.99	-0.90
Wests Tigers	-12.29	-10.94	-1.40

Performance So Far

So far there have been 128 matches played, 90 of which were correctly predicted, a success rate of 70.3%.
Here are the predictions for last week’s games.

	Game	Date	Score	Prediction	Correct
1	Sharks vs. Storm	Jul 07	28 – 6	-10.80	FALSE
2	Knights vs. Rabbitohs	Jul 08	28 – 40	-11.90	TRUE
3	Wests Tigers vs. Eels	Jul 09	20 – 28	-11.00	TRUE
4	Broncos vs. Dragons	Jul 10	32 – 18	8.00	TRUE

Predictions for Round 18

Here are the predictions for Round 18. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

	Game	Date	Winner	Prediction
1	Cowboys vs. Sharks	Jul 15	Cowboys	3.20
2	Eels vs. Warriors	Jul 15	Eels	16.60
3	Roosters vs. Dragons	Jul 16	Roosters	11.90
4	Sea Eagles vs. Knights	Jul 16	Sea Eagles	17.00
5	Titans vs. Broncos	Jul 16	Broncos	-4.70
6	Wests Tigers vs. Panthers	Jul 17	Panthers	-25.90
7	Storm vs. Raiders	Jul 17	Storm	18.00
8	Bulldogs vs. Rabbitohs	Jul 17	Rabbitohs	-11.50

July 5, 2022

NRL Predictions for Round 17

By David Scott

Team Ratings for Round 17

The basic method is described on my Department home page.
Here are the team ratings prior to this week’s games, along with the ratings at the start of the season.

Team	Current Rating	Rating at Season Start	Difference
Panthers	16.64	14.26	2.40
Storm	14.78	19.20	-4.40
Rabbitohs	6.57	15.81	-9.20
Sea Eagles	5.71	10.99	-5.30
Cowboys	2.99	-12.27	15.30
Roosters	2.98	2.23	0.70
Eels	1.48	2.54	-1.10
Sharks	0.96	-1.10	2.10
Broncos	-0.40	-8.90	8.50
Raiders	-2.02	-1.10	-0.90
Dragons	-5.42	-7.99	2.60
Titans	-7.60	1.05	-8.60
Bulldogs	-7.92	-10.25	2.30
Knights	-8.31	-6.54	-1.80
Warriors	-9.90	-8.99	-0.90
Wests Tigers	-12.53	-10.94	-1.60

Performance So Far

So far there have been 124 matches played, 87 of which were correctly predicted, a success rate of 70.2%.
Here are the predictions for last week’s games.

	Game	Date	Score	Prediction	Correct
1	Sea Eagles vs. Storm	Jun 30	36 – 30	-7.80	FALSE
2	Knights vs. Titans	Jul 01	38 – 12	-0.80	FALSE
3	Panthers vs. Roosters	Jul 01	26 – 18	17.90	TRUE
4	Bulldogs vs. Sharks	Jul 02	6 – 18	-4.70	TRUE
5	Cowboys vs. Broncos	Jul 02	40 – 26	5.30	TRUE
6	Rabbitohs vs. Eels	Jul 02	30 – 12	6.70	TRUE
7	Warriors vs. Wests Tigers	Jul 03	22 – 2	6.50	TRUE
8	Dragons vs. Raiders	Jul 03	12 – 10	-0.90	FALSE

Predictions for Round 17

Here are the predictions for Round 17. The prediction is my estimated expected points difference with a positive margin being a win to the home team, and a negative margin a win to the away team.

	Game	Date	Winner	Prediction
1	Sharks vs. Storm	Jul 07	Storm	-10.80
2	Knights vs. Rabbitohs	Jul 08	Rabbitohs	-11.90
3	Wests Tigers vs. Eels	Jul 09	Eels	-11.00
4	Broncos vs. Dragons	Jul 10	Broncos	8.00
