April 17, 2023

Looking for ChatGPT

Turnitin, a company that looks for similar word sequences in student assignments, says that it can now detect ChatGPT writing.  (Stuff, RNZ)

The company is 98% confident it can spot when students use ChatGPT and other AI writing tools in their work, Turnitin’s Asia Pacific vice president James Thorley said.

“We’re under 1% in terms of false positive rate,” he said.

It’s worth looking at what that 1% actually means.  It appears to mean that of material they tested that was genuinely written by students, only 1% was classified as being written by ChatGPT.  This sounds pretty good, and it’s a substantial achievement if it’s true. This doesn’t mean that only 1% of accusations from the system are wrong. The proportion of false accusations will depend on how many students are really using ChatGPT. If none of them are, 100% of accusations will be false; if all of them are, 100% of accusations will be true.
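To make that concrete, here is a minimal sketch of the base-rate arithmetic. The 1% false positive rate is Turnitin's claim; treating their "98% confident" figure as a detection rate is an assumption on my part, and the prevalence values are purely illustrative.

```python
# Sketch (not Turnitin's method) of how the share of wrong accusations
# depends on how many students actually use ChatGPT.
# false_positive_rate is Turnitin's claimed 1%; detection_rate treats
# their "98% confident" figure as a true positive rate (an assumption);
# the prevalence values are illustrative only.

false_positive_rate = 0.01  # innocent work wrongly flagged as AI-written
detection_rate = 0.98       # AI-written work correctly flagged (assumed)

for prevalence in [0.0, 0.01, 0.05, 0.20, 0.50, 1.0]:
    flagged_guilty = prevalence * detection_rate
    flagged_innocent = (1 - prevalence) * false_positive_rate
    share_false = flagged_innocent / (flagged_guilty + flagged_innocent)
    print(f"if {prevalence:4.0%} of students use ChatGPT, "
          f"{share_false:5.1%} of accusations are false")
```

With these illustrative numbers, if 5% of students were actually using ChatGPT, roughly one accusation in six would still be wrong.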

What does the 1% rate mean for a typical student?  An average student might hand in 4 assignments per course, for 4 courses per semester, two semesters per year.  That’s nearly 100 assignments in a three-year degree.  A false accusation rate of 1 in 100 means an average of one false accusation for each innocent student, which doesn’t sound quite as satisfactory.
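Here is the back-of-the-envelope version of that calculation, assuming (purely for illustration) that each assignment is flagged independently with probability 1%:

```python
# Back-of-the-envelope arithmetic for an innocent student, assuming
# each assignment is flagged independently with probability 1%
# (the independence assumption is itself illustrative).

assignments = 4 * 4 * 2 * 3   # per course x courses x semesters x years = 96
false_positive_rate = 0.01

expected_false_flags = assignments * false_positive_rate
prob_at_least_one = 1 - (1 - false_positive_rate) ** assignments

print(f"assignments over a three-year degree: {assignments}")
print(f"expected false flags per innocent student: {expected_false_flags:.2f}")
print(f"chance of at least one false flag: {prob_at_least_one:.0%}")
```

Under these assumptions, an innocent student expects about one false flag over the degree, and has roughly a 60% chance of being flagged at least once.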

The average is likely to be misleading, though.  Some people will be more likely than others to be accused.  In addition to knowing the overall false positive rate, we’d want to know the false positive rate for important groups of students.  Does using a translator app on text you wrote in another language make you more likely to get flagged? Using a grammar checker? Speaking Kiwi? Are people who use semicolons safe?

Turnitin emphasize, as they do with plagiarism, that they don’t want to be blamed for any mistakes — that all their tools do is raise questions.  For plagiarism, that’s a reasonable argument.  The tool shows you which words match, and you can then look at other evidence for or against copying. Maybe the words are largely boilerplate. Maybe they are properly attributed, so there is copying but not plagiarism.  In the other direction, maybe there are text similarities beyond exact word matches, or there are matching errors — both papers think Louis Armstrong was the first man on the moon, or something.  With ChatGPT there’s none of this. It’s hard to look for additional evidence in the text, since there is no real way to know whether something you see is additional or is part of the evidence that Turnitin already used.


Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • Antonio Rinaldi

    The numbers are very different from the ones reported here:
    https://realworlddatascience.net/viewpoints/editors-blog/posts/2023/03/15/AI-screening.html
    OpenAI stresses that the current version of the classifier “should not be used as a primary decision-making tool”, and users should take that statement to heart – especially if they are planning to vet student homework with it. In evaluations, OpenAI reports that its classifier correctly identifies AI-written text as “likely AI-written” only 26% of the time, while human written text is incorrectly labelled as AI-written 9% of the time.

    2 years ago

    • Thomas Lumley

      It’s a different classifier, and also a different corpus of real text, so it’s not that surprising the accuracy is different.

      I assume Turnitin are correct about some corpus of text they’ve tested, but it’s hard to tell how that will generalise.

      2 years ago