January 18, 2017

Recognising te reo

Those of you on Twitter will have seen the little ‘translate this tweet’ suggestions that it puts up. If you’re from or in New Zealand you will probably have seen that reo Māori is often recognised by the algorithm as Latvian, presumably because Latvian also has long vowels indicated by macrons. I’ve always been surprised by this, because Latvian looks so different.

It turns out I’m right. Even looking just at individual letters, it’s very easy to distinguish the two. I downloaded 74,000 paragraphs of Latvian Wikipedia, a total of 6.5 million letters, and looked at how long the Latvians can go without using letters that don’t appear in te reo: specifically, s, z, j, v, d, c, any g that isn’t part of ‘ng’, the six accented consonants, and any consonant at the end of a word. On average, I only needed to wait five letters to know the language is Latvian rather than Māori, and 99% of the time it took fewer than 21 letters.
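A minimal Python sketch of that letter-counting rule follows. It is not the code behind the numbers above, and several details are assumptions: the function names are hypothetical, an "accented consonant" is taken to be any consonant that carries a diacritic, every letter other than a, e, i, o, u (with or without a macron) counts as a consonant, and counting restarts after each giveaway.

```python
import unicodedata

VOWELS = set("aeiou")
GIVEAWAY_LETTERS = set("szjvdc")  # the plain letters listed in the post


def base_letter(ch):
    """Strip any diacritic from a single letter, e.g. 'š' -> 's', 'ā' -> 'a'."""
    return unicodedata.normalize("NFD", ch)[0]


def is_consonant(ch):
    # Assumption: anything alphabetic whose base letter is not a, e, i, o, u.
    return ch.isalpha() and base_letter(ch) not in VOWELS


def is_giveaway(text, i):
    """True if the letter at text[i] rules out te reo under the rule above."""
    ch = text[i]
    if ch in GIVEAWAY_LETTERS:
        return True
    if ch == "g" and (i == 0 or text[i - 1] != "n"):
        return True  # 'g' that is not part of 'ng'
    if is_consonant(ch):
        if base_letter(ch) != ch:
            return True  # consonant with a diacritic
        at_word_end = i + 1 == len(text) or not text[i + 1].isalpha()
        if at_word_end:
            return True  # te reo words end in vowels
    return False


def waiting_times(text):
    """Number of letters read up to and including each successive giveaway."""
    text = unicodedata.normalize("NFC", text.lower())
    waits, count = [], 0
    for i, ch in enumerate(text):
        if not ch.isalpha():
            continue  # spaces, digits and punctuation are not counted
        count += 1
        if is_giveaway(text, i):
            waits.append(count)
            count = 0
    return waits


def letters_until_giveaway(text):
    """Letters needed to rule out te reo, or None if nothing rules it out."""
    waits = waiting_times(text)
    return waits[0] if waits else None
```

For example, `letters_until_giveaway("Latvija")` stops at the ‘v’, the fourth letter, while `letters_until_giveaway("Kia ora")` returns `None`. Applied to a large corpus, the mean and 99th percentile of `waiting_times` would give summaries of the kind quoted above, under these assumptions.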

Another language that Twitter often guesses is Finnish. That makes more sense: many of the letters not used in Māori are also rare or absent in Finnish, and ‘g’ appears mostly as ‘ng’. However, Finnish does have ‘s’, ‘ä’, ‘ö’ and ‘y’, and has words ending in consonants, so the two should also be feasible to distinguish.

Update: Indonesian is another popular guess, but it has ‘d’, ‘j’, ‘y’ and ‘b’, and it has lots of words ending with consonants. The average time to rule out te reo is slightly longer, at nearly 6 characters, and the 99th percentile is 22 letters. So if the algorithm can’t tell, it should probably guess it’s not Indonesian.

Update: For very short tweets, and those in mixed languages, nothing’s going to work, but this is about tweets where the answer is obvious to a human.

Thomas Lumley (@tslumley) is Professor of Biostatistics at the University of Auckland. His research interests include semiparametric models, survey sampling, statistical computing, foundations of statistics, and whatever methodological problems his medical collaborators come up with. He also blogs at Biased and Inefficient.

Comments

  • David Hood

    At the moment your unit of analysis is the single tweet in isolation. If you aggregate to the level of author, you can add the assumption that if an author has recognisably tweeted in Te Reo in the past, then their short or mixed tweets are less likely to be Finnish than those of an author who has written some tweets in identifiable Finnish.

    Similarly, if you aggregate to the network, you can extend that logic: if a person is replying to tweets by others that are identifiably in Te Reo, and not engaging with Finnish tweets, that network pattern offers evidence about stray words.

    8 years ago

    • Thomas Lumley

      That’s certainly true. Twitter has the information to do much better, if they cared enough.

      However, I think Twitter is using an off-the-shelf character-set-based language recognition algorithm, probably from Microsoft, and my point is *even that* should do much better.

      8 years ago

      • Thomas Lumley

        Even better, Twitter could ask users for languages they’re likely to tweet in, for example. There can’t be many people with more than half a dozen of those.

        8 years ago