The Turing Test: Why I Think Eugene Goostman Didn’t Pass

On June 7th, 2014, the 60th anniversary of the death of computer scientist Alan Turing, the University of Reading announced that a chatbot named Eugene Goostman had passed the Turing Test, the litmus test for artificial intelligence.  As stories of our impending robot overlords tend to do, this one went viral.  It was picked up by news outlets from Slate to the Washington Post, and even tech outlets like The Verge and PC World jumped on the Eugene Goostman bandwagon.  To hear the media tell it, we had just entered a Blade Runner-esque world where the line between human and machine intelligence had blurred.  Does your wife dream of electric sheep yet?

The issue with all this coverage is that, in my opinion, Eugene Goostman didn’t actually pass the Turing Test!  To understand why, you first need to understand the Turing Test.  Proposed by Alan Turing in 1950, the idea behind the test is that a computer that can reliably convince people talking to it that it is human is functionally intelligent.  Because we are basically gigantic intelligent meat computers, Turing holds, a computer that can pass for one of us is also intelligent.  It’s a devilishly simple idea that in my opinion is quite sound.  If we can’t distinguish something from intelligence based on its output, who are we to say it isn’t intelligent?

Every year, thousands of chatbots take the Turing Test in competitions large and small.  The general setup is this: a group of judges sit down at computers.  They then talk via some sort of IM program to either a chatbot or a human in another room—they don’t know which.  After some period of time, they decide whether the person they were chatting with was a human or artificial.  It’s that simple.
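Just as a sketch, that setup can be mocked up in a few lines of Python.  Everything here (the function name, the canned questions) is my own invention for illustration, not any real competition's harness:

```python
import random

# Canned questions stand in for a real judge's side of the conversation.
QUESTIONS = ["Hello! How are you?", "What do you do for a living?", "Tell me a joke."]

def run_session(judge, human, bot, rng=None):
    """One imitation-game round: the judge chats with either a human or a
    bot (chosen by a hidden coin flip), then guesses which it was."""
    rng = rng or random.Random()
    is_bot = rng.random() < 0.5          # hidden ground truth
    respond = bot if is_bot else human   # route all messages to one party
    transcript = [(q, respond(q)) for q in QUESTIONS]
    guessed_bot = judge(transcript)      # judge sees only the transcript
    return is_bot, guessed_bot           # ground truth vs. verdict
```

A bot "fools" a judge whenever `is_bot` is true but the judge guessed human; tallying that across many sessions gives the deception rate everyone argues about.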

My issue with Eugene Goostman’s performance at the University of Reading hinges on two key points.  First, the programmers presented their bot as a Ukrainian teenager who didn’t speak English well, which allowed the bot to talk in a way no normal human would.  Second, I don’t think the bot actually reliably fooled the judges into thinking it was human.  In essence, I don’t think any of Turing’s parameters were followed.

Sadly, the full transcripts are unavailable and Eugene’s website is down because of too many requests.  However, plenty of people have published transcripts of their conversations with Eugene (see here, here, and here), and frankly the results are unimpressive at best.  Eugene’s MO seems to be steering the conversation toward a few topics he knows how to discuss using keywords.  When you bring up something Eugene doesn’t know, he generally just makes a snarky comment and asks you about something he does know how to talk about: your job, say, or where you work.  Often he is unintelligible or just plain bizarre.  But because he claims to be a foreign child, the judges are willing to chalk it all up to xenophobic ideas about those crazy Ukrainians and their bad English skills.  Without the character, it all falls apart.

The other major issue I have with calling Eugene’s results a “pass” is that he only convinced 33% of the judges, and in a small sample at that.  Of the 30 judges who spoke to Eugene, 10 thought he was a human and 20 thought he was a computer.  The threshold the University of Reading used for a pass was 30%, a number taken out of context from Turing’s 1950 paper, where he predicted that by the year 2000 machines would fool an average interrogator at least 30% of the time; it was never intended as a pass level for the Turing Test.  Turing never set a numeric goal; his only criterion was that the deception be “reliable”.  Eugene Goostman’s results are a 33%-of-the-time-it-works-all-the-time deal.  I doubt anyone would call a car that starts 33% of the time reliable.  Further, there were only 30 judges, which brings the √n rule into play: as a rough rule of thumb, a count obtained from n trials can be expected to fluctuate by up to about √n.  For n = 30 judges, √n ≈ 5.5, meaning Eugene Goostman’s true count could plausibly have been as low as about 5 in 30, or 17%.  To mean anything significant, we’d need a much larger sample size.
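You can check the small-sample worry above with a quick simulation of my own (nothing to do with the Reading methodology): give a hypothetical bot a true deception rate well below 30% and count how often a panel of only 30 judges would still hand it 10 or more “human” verdicts.

```python
import random

def pass_rate(p_true, n_judges=30, threshold=10, trials=100_000, seed=1):
    """Fraction of simulated panels in which a bot with true deception
    probability p_true fools at least `threshold` of n_judges judges."""
    rng = random.Random(seed)
    passes = 0
    for _ in range(trials):
        fooled = sum(rng.random() < p_true for _ in range(n_judges))
        if fooled >= threshold:
            passes += 1
    return passes / trials

# A bot that fools only 1 judge in 5 on average still clears the
# 10-of-30 bar in a few percent of simulated panels.
print(pass_rate(0.20))
```

In other words, a single 30-judge panel can’t distinguish a genuinely 33%-convincing bot from a clearly sub-threshold one, which is exactly why the headline number means so little.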

At the end of the day, I do think it’s really cool that we’ve made something like Eugene Goostman.  It means that we’re getting somewhere by scientifically breaking down language, and it definitely is a harbinger of better understanding of our own brains.  I think that the day a machine passes the Turing Test will come soon—computer development tends to be exponential in nature.  However, I don’t think that day was June 7th.  Until that day comes, keep doing what only humans are known to do—wondering infinitely about the infinite wonder that is life.


One thought on “The Turing Test: Why I Think Eugene Goostman Didn’t Pass”

  1. Taking the results with a grain of salt was probably the best reaction any news source should have had. Of course, I’m hardly flabbergasted by the exaggeration and misinterpretation of the Reading test, considering many journalists’ nascent knowledge of the actual methodology and procedures required for science.

    The Reading test does highlight a few important issues, however: the need to take cultural bias into account and the idea that perhaps a more accurate AI masquerade of human speech ought to replicate most humans’ imperfect grasp of language.

    On the subject of linguistics, do you take a more behaviorist approach to the origin of language, or do you think the ability to construct and comprehend language is inherently wired in the brain?

