Why Watson on Jeopardy is AI’s Moon Landing Moment

In case you missed it the first two nights, tonight is the final human vs. machine match on Jeopardy! Watson, a question-answering AI three years in the making, has been pitted against two of the best human players of all time. And it is cleaning up.

As an AI guy, I feel a little like I’m watching the moon landing, and a little like somebody is showing embarrassing home movies.

First, the moon landing part. This is amazing stuff here, people. Do not be jaded by Google. There is a tremendous difference between retrieving something very related to your question, and actually answering a question. (Or in this case, posing the question; but even IBM calls the project “DeepQA,” for “Question-answering.”) Sentences are extraordinarily tricky, supple, varied things, and AI that attempts to understand natural language sentences using parse trees and deterministic rules usually falls flat on its face. The difference between “man bites dog” and “dog bites man” isn’t captured in a lot of search retrieval algorithms, but when Watson must understand a phrase such as “its largest airport is named for a World War II hero; its second largest, for a World War II battle”—no Google search for “world war II airport” is going to suffice. (Try it.)
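To see why word order defeats keyword methods, here is a toy illustration (my own, not anything from IBM): a bag-of-words representation, the kind many retrieval systems reduce text to, literally cannot tell the two sentences apart.

```python
# Toy illustration: a bag-of-words representation discards word order,
# so "man bites dog" and "dog bites man" become identical.
from collections import Counter

def bag_of_words(sentence):
    """Reduce a sentence to a multiset of its words."""
    return Counter(sentence.lower().split())

s1 = bag_of_words("man bites dog")
s2 = bag_of_words("dog bites man")
print(s1 == s2)  # True: who bit whom is gone
```

Anything Watson does beyond this level of representation is where the hard NLP work lives.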

In the cases where Watson has fallen down, as in the previous example, I think it’s generally been because of a failure to parse, or its version thereof; but it’s been remarkably resilient against extremely tricky phrasings. The first night, I was blown away by its answer to the Daily Double. The category was “Literary APB,” and the clue was what seemed an extremely sideways reference to Mr. Hyde: “Wanted for killing Sir Danvers Carew; appearance—pale & dwarfish; seems to have a split personality.” This is the kind of thing that can cause natural language processing (NLP) researchers fits if they’re trying to write code that parses the sentence.

What I didn’t notice the first time I saw the clue, though, was that “Sir Danvers Carew” was a dead giveaway to a machine with huge databases of text associations at its digital fingertips. It would be likely to point to other things in the classic book with extremely high confidence, by virtue of its commonly appearing near them in text. Of course, the machine must still understand that the correct answer is “Hyde” and not the book title or author or place—so its answer was still extremely impressive.

But the second night was on the whole less exciting than the first, precisely because there were fewer sideways references like this, and more "keyword" type answers. A whole category was devoted to providing the common name for an obscure medical term or its symptoms—easy for Watson, because its starting point for its searches is likely to be the most specific words in the clue. The Beatles lyrics category in the first round was like this—every time a human chose it, I yelled at the screen, "Don't do it! It's a trap!" Still, even in this kind of clue, I was amazed at Watson's breadth of phrase knowledge—the most remarkable being its knowing that "Isn't that special" was a favorite saying of The Church Lady.

Okay, but about the embarrassing home movies. As much as we AI researchers are making fantastic progress in solving real problems in artificial cognition, we are fundamentally still too ready to hype, and to believe our own hype. Watching the IBM infomercials on the second night, with their promises of revolutions in medical science, prompted a mental montage of overoptimistic "Future Work" sections of papers and "Broader Impacts" sections of NSF grants. It's how the work often gets funded, this maybe-you-could-use-this-to-save-babies kind of argument, but in many cases, it just seems like so much hot air. For one thing, the kinds of statistical reasoning that Watson presumably uses, called Bayesian networks, have been applied to medical diagnosis for quite a while, at least in academic work. What Watson really seems to be about is the same thing the chess-playing Deep Blue was about—namely, raising the prestige of a technology consulting company.
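The core of that diagnostic reasoning is nothing exotic. A minimal sketch, with made-up numbers and no claim to resemble Watson's internals: update the probability of a disease from one observed symptom via Bayes' rule.

```python
# Minimal Bayesian diagnosis sketch (illustrative only; all numbers invented).
def posterior(prior, p_symptom_given_disease, p_symptom_given_healthy):
    """P(disease | symptom) by Bayes' rule."""
    evidence = (p_symptom_given_disease * prior
                + p_symptom_given_healthy * (1 - prior))
    return p_symptom_given_disease * prior / evidence

# A rare disease (1% prior), but the symptom is 20x likelier with it:
print(round(posterior(0.01, 0.8, 0.04), 3))  # 0.168
```

A full Bayesian network chains many such updates over many symptoms and conditions, which is exactly what the academic diagnosis systems have done for years.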

And then there was the little matter that, shortly after the “we could use this for medicine” argument, Watson responded to the U.S. Cities question with “What is Toronto??????” This kind of thing is why AI people always show videos instead of doing live demos. It worked in testing, we swear! But it’s extremely difficult to catch this kind of thing beforehand in machine learning, precisely because the learner ultimately acquires more complexity than we put in.

Watson’s successes and failures both point to the fact that it was ultimately engineered by people. For example, the first night, when Ken Jennings got a question wrong, Watson acted as if it hadn’t heard his answer and simply repeated it. I’m told the IBM team’s reaction was sheer surprise that Ken Jennings would ever get something wrong; they hadn’t counted on the possibility. It’s that brittleness that reminds us that Watson is ultimately a human triumph—it’s not a machine that’s up there, it’s a team of quite a few researchers pulling all-nighters in order to make something truly awesome. And in that way, it is like a moon landing.

The overall winner is apparently being determined by the sum of the dollar amounts of the two games—which is maybe too bad, because Watson’s carefully engineered bet-deciding mechanism now seems like it will go to waste. (Watson’s bets seem weirdly specific just because it’s presumably optimizing an expected payoff equation, one which may put different weights on winning versus winning more.) It seems unlikely that the humans will pull out an upset tonight if the questions are as keyphrase-able as the medical and Beatles categories of the previous nights. But who knows? Perhaps the producers have picked some questions that will require some tricky understanding of the sentences. Whatever Watson’s underlying algorithm, it still seems clear that it’s sometimes not actually understanding what the question is asking, but “going with its gut.” But more often than not, I’m extremely impressed with how well it handles the crazy sentence structures of Jeopardy! clues.
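For what it's worth, expected-payoff betting of the kind I'm speculating about is easy to sketch (this is my guess at the flavor of the mechanism, not IBM's actual code): score every possible bet by its expected utility, where the utility function can weight "winning at all" differently from raw dollars.

```python
# Speculative sketch of expected-utility wagering (not Watson's real code).
def best_bet(confidence, bankroll, utility):
    """Return the whole-dollar bet with the highest expected utility."""
    def expected_utility(bet):
        return (confidence * utility(bankroll + bet)
                + (1 - confidence) * utility(bankroll - bet))
    return max(range(bankroll + 1), key=expected_utility)

# With plain linear utility and confidence over 50%, bet it all:
print(best_bet(0.7, 10000, lambda dollars: dollars))  # 10000

# But cap the utility once the lead is clinched (say, at $12,001), and the
# optimizer holds back just enough—producing oddly specific bets:
print(best_bet(0.7, 10000, lambda dollars: min(dollars, 12001)))  # 2001
```

That second, oddly specific number is the point: once extra dollars past the clinching threshold stop adding utility, risking them only adds downside, which would explain bets that look strange to a human bettor.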

What’s hard for Watson is easy for us, and vice versa; but what’s hard or easy for Watson was surely hard for its team, and they deserve the highest kudos for this remarkable accomplishment.

Kevin Gold is an Assistant Professor in the Department of Interactive Games and Media at RIT. He received his Ph.D. in Computer Science from Yale University in 2008, and his B.A. from Harvard in 2001. When he is not thinking up new ideas for his research, he enjoys reading really good novels, playing geeky games, listening to funny, clever music, and reading the webcomics xkcd and Dresden Codak.
