May 30, 2014

Trusting your data or your model

Even with large amounts of data, automated predictions must usually incorporate explicit or implicit prior understanding of the structure of the problem. “Look for anything” is not good enough: “anything” is too big.

Here, for your weekend light entertainment, are some examples where the prior structure was too strong or too weak:

The example that prompted this post, from the blog of Melville House Press, is about automated scanning of books to create digital editions

 in many old texts the scanner is reading the word ‘arms’ as ‘anus’ and replacing it as such in the digital edition. As you can imagine, you don’t want to be getting those two things mixed up.

A similar phenomenon was pointed out at Language Log a decade ago

Fear not your toes, though they are strong,
The conquest doth to you belong;

Daniel Dennett recounts two anecdotes of speech recognition, one human and one computer, which err in the opposite direction to the text recognition example. The computer one:

An AI speech-understanding system whose development was funded by DARPA (Defense Advanced Research Projects Agency), was being given its debut before the Pentagon brass at Carnegie Mellon University some years ago. To show off the capabilities of the system, it had been attached as the “front end” or “user interface” on a chess-playing program. The general was to play white, and it was explained to him that he should simply tell the computer what move he wanted to make. The general stepped up to the mike and cleared his throat–which the computer immediately interpreted as “Pawn to King-4.” 

And, the example that is frustratingly familiar to so many of us: mobile phone autocorrupt, which you can search for yourself.

avatar

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient See all posts by Thomas Lumley »