Contextualizing AI: The Cat and The Mistaken Hat

Stories of AI errors are becoming increasingly well known to non-technical audiences. Yet the hype around artificial intelligence continues unabated — most recently with the CEO of Google declaring AI “more profound than… electricity or fire.”

This hype persists because people, technical and non-technical alike, tend to:

  1. underestimate what our brains are doing, and
  2. overestimate what machine learning is doing.

A close look at vision and language, two areas where artificial intelligence is making big progress, shows why.

The Mistaken Hat

A great way to understand a computer program is to examine its errors. The same is true of the human brain.

In the titular story of his famous work, The Man Who Mistook His Wife for a Hat, Oliver Sacks describes his patient, Dr P.:

“Sometimes a student would present himself, and Dr P. would not recognise him; or, specifically, would not recognise his face. The moment the student spoke, he would be recognised by his voice. Such incidents multiplied, causing embarrassment, perplexity, fear — and, sometimes, comedy. For not only did Dr P. increasingly fail to see faces, but he saw faces when there were no faces to see: genially, Magoo-like, when in the street he might pat the heads of water hydrants and parking meters, taking these to be the heads of children; he would amiably address carved knobs on the furniture and be astounded when they did not reply.”

Dr. Sacks recounts an episode where:

“[Dr. P.] reached out his hand and took hold of his wife’s head, tried to lift it off, to put it on. He had apparently mistaken his wife for a hat! His wife looked as if she was used to such things.”

Dr. P.’s condition, known as visual agnosia, sheds light on how human visual perception works. It involves three steps:

  1. taking in the visual scene
  2. perceiving forms (i.e., shapes and patterns)
  3. associating those forms with known objects

Deficiency at the first level is blindness. Deficiency at the second level is known as apperceptive agnosia. And deficiency at the third level is known as associative agnosia.

That we can see but not recognize surprises many people. Yet it demonstrates that our brains are far more complex than most of us realize. We take for granted the significant processing that goes into perceiving forms and associating them with our existing knowledge, and our ignorance of this complexity leads us to underestimate just how much work is going on in our brains.

Computer Vision

At a high level, visual perception in computers is similar to visual perception in humans. Computers also need to:

  1. take in the visual scene
  2. perceive forms
  3. associate those forms with known objects

But there’s an additional challenge: the machine needs to know which forms to look for. Historically, humans have been responsible for identifying the important forms, a task known as feature engineering. With the relevant forms pre-specified, the computer then only has to scan the image for those forms (called “features”).

A classic example of this is optical character recognition (OCR) — the task of recognizing letters in images.

The computer looks for features like lines, intersections, and loops, comparing what it finds in the image with the known characters it is trying to match. If the identified features are similar enough to those of a particular character, then a match is declared.
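
To make this concrete, here is a minimal sketch of what feature-based matching might look like. The features and threshold are hypothetical (a real OCR system would extract them from pixel data and use many more of them), but the structure is the same: compare the features found in the image against the features of each known character and declare the closest one a match.

```python
# A minimal sketch of feature-based character matching. The features and
# threshold here are hypothetical, for illustration only.

# Hand-engineered features per character: (loops, straight strokes, endpoints)
KNOWN_CHARACTERS = {
    "O": (1, 0, 0),
    "L": (0, 2, 2),
    "T": (0, 2, 3),
    "B": (2, 1, 0),
}

def distance(a, b):
    """L1 distance between two feature tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def match_character(observed_features, threshold=1):
    """Return the known character whose features are closest, if close enough."""
    best_char, best_dist = None, float("inf")
    for char, features in KNOWN_CHARACTERS.items():
        d = distance(observed_features, features)
        if d < best_dist:
            best_char, best_dist = char, d
    return best_char if best_dist <= threshold else None

# Features extracted from an image region (the extraction itself is not shown).
print(match_character((1, 0, 0)))  # -> "O"
print(match_character((0, 2, 3)))  # -> "T"
```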

The problem is that feature engineering is difficult and expensive. For complex forms, feature engineering can be so difficult as to be effectively impossible. Yet without good features, machine learning algorithms are generally inaccurate to the point of being useless.

Seeing Cats

With recent advances in deep learning, the story of visual perception in computers is changing. Though deep learning ideas have been around for decades, the enormous amount of data coming online today — via companies like Google and Facebook — is finally bringing them to life.

The technical details aside, the value of deep learning is that it enables computers to discover far more complex features (i.e., forms) on their own. This can eliminate the need for feature engineering and greatly reduce the difficulty and expense of building accurate machine learning algorithms.
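
For a rough sense of what “discovering features on their own” means in code, here is a toy sketch of unsupervised feature learning using a tiny autoencoder written with NumPy. Nothing here resembles a production system; the sizes, data, and learning rate are invented. The point is only that the features fall out of the data rather than being specified by an engineer.

```python
import numpy as np

# Toy sketch of learning features from raw pixels instead of hand-engineering
# them: a tiny one-layer autoencoder. The data and sizes are made up.

rng = np.random.default_rng(0)
images = rng.random((500, 64))      # 500 fake 8x8 "images", flattened to vectors
n_features = 16                     # how many features to discover

W_enc = rng.normal(scale=0.1, size=(64, n_features))
W_dec = rng.normal(scale=0.1, size=(n_features, 64))
lr = 0.01

for step in range(1000):
    hidden = np.tanh(images @ W_enc)    # the learned "features" (activations)
    recon = hidden @ W_dec              # reconstruction of the input
    error = recon - images
    # Push the reconstruction error back through both weight matrices.
    grad_dec = hidden.T @ error / len(images)
    grad_hidden = (error @ W_dec.T) * (1 - hidden ** 2)
    grad_enc = images.T @ grad_hidden / len(images)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# Each column of W_enc is a discovered feature that can be viewed as an 8x8 image.
print("learned feature, viewable as an image of shape", W_enc[:, 0].reshape(8, 8).shape)
```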

A nice demonstration of this comes from one of the first deep learning papers to receive mainstream attention — a Google paper on the automatic identification of features in YouTube videos. The Google researchers took 10 million images from YouTube videos and let a deep learning algorithm (a neural network) loose, seeing what features the algorithm would discover. The effort was massive — a cluster of 1,000 computers comprising 16,000 cores that ran for three days.

The results were equally impressive. The neural network learned complex features like the detailed face of a cat:

Although not super clear (the image is a sort of “average” of the cat faces the neural network encountered in the YouTube stills), there are eyes, ears, whiskers, a nose, and even patterns in the fur. This level of detail in an automatically generated feature was unprecedented, and it was a critical step towards enabling computers to accurately recognize cat faces like this:

When contrasted with the earlier work on optical character recognition, it becomes clear why there is so much excitement around deep learning. Deep learning helps correct computer vision’s apperceptive agnosia — its difficulty in perceiving forms!

Of course, without being told that this detailed form corresponds to a “cat,” the computer would not know what it is looking at. It would still suffer from associative agnosia. But that is a straightforward association to learn.
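
That last step can be sketched in a few lines: with learned feature activations in hand, attaching the label “cat” is just fitting a simple classifier on top of them. The data below is made up for illustration.

```python
import numpy as np

# Associating learned features with a label: a plain logistic regression on top
# of the feature activations. The activations and labels here are synthetic.

rng = np.random.default_rng(1)
activations = rng.random((200, 16))                 # feature activations for 200 images
is_cat = (activations[:, 3] > 0.5).astype(float)    # pretend feature 3 is the cat-face feature

w, b = np.zeros(16), 0.0
for _ in range(2000):
    p = 1 / (1 + np.exp(-(activations @ w + b)))    # predicted probability of "cat"
    grad = p - is_cat
    w -= 0.1 * activations.T @ grad / len(is_cat)
    b -= 0.1 * grad.mean()

print("weight on the cat-face feature:", round(w[3], 2))  # clearly larger than the others
```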

Can Men Cook?

So far this might sound like machines are getting pretty close to human-level performance. This progress is why people are so excited about the advances being made in AI (along with computers finally beating top humans at Go). But a closer examination shows just how far from human-level capabilities we really are.

First, it’s important to understand that the cat-face feature would not match a cat shown from a different perspective:

This is despite the fact that this cat also has a visible eye, ear, whiskers, a nose, and patterned fur. While the neural network might have a feature that corresponds to this cat-on-its-back pose, there is no guarantee. In fact, when this image is uploaded to the Wolfram Language Image Identification Project (another neural network), we are told it is a “Chesapeake Bay retriever” (a breed of dog).

Why? The reason is unknown, which gets us to the second issue: machine learning has a transparency problem. We usually don’t know which features these algorithms are learning. This is a major problem for high-stakes fields like law and medicine. (Overcoming this limitation is an area of research called “Explainable AI”.)

This wouldn’t be a problem if the algorithms were always right, but they are not. The reason they are so often wrong relates to a third consideration to keep in mind: there is no requirement that the machine learn features and associations that are intelligent, coherent, or what a person would look for. Rather, these programs are designed to learn statistical correlations; whether those correlations are incidental or bizarre is secondary.

Unfortunately, people misjudge the number and impact of these incidental correlations because they only hear about them when the correlations are insidious and easily understandable, for example when they reflect racism or sexism:

This image is one of the examples discussed in a nice research paper focused on reducing bias in machine learning algorithms: Men Also Like Shopping: Reducing Gender Bias Amplification using Corpus-level Constraints.

But bias is only the tip of the iceberg. The problem is a fundamental artifact of how these algorithms are designed to work — to think fast, not slow. They lack a representation of the world within which to contextualize their learnings, and they lack the ability to think logically. By contrast, when a human is asked to identify the gender of a person in a picture, they know to look at the person’s features, not their surroundings.
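
Here is a toy demonstration of that difference, using entirely synthetic data. If most cooking images in the training set are labeled “woman,” a correlation-driven model will happily lean on the kitchen context rather than on the person in the picture.

```python
import numpy as np

# Synthetic demonstration of learning an incidental correlation. The "kitchen"
# context correlates with the label, so the model uses it; the person-appearance
# signal is noisy and gets a much smaller weight. All numbers are invented.

rng = np.random.default_rng(2)
n = 1000
in_kitchen = rng.random(n) < 0.5
# Built-in bias: 80% of kitchen images show a woman, 80% of the rest show a man.
is_woman = np.where(in_kitchen, rng.random(n) < 0.8, rng.random(n) < 0.2)

person_signal = is_woman + rng.normal(scale=2.0, size=n)   # weak, noisy cue
X = np.column_stack([person_signal, in_kitchen.astype(float)])
y = is_woman.astype(float)

w, b = np.zeros(2), 0.0
for _ in range(3000):                      # plain logistic regression
    p = 1 / (1 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * X.T @ grad / n
    b -= 0.1 * grad.mean()

print("weight on person appearance:", round(w[0], 2))
print("weight on kitchen context:  ", round(w[1], 2))  # typically the larger weight
```

Nothing in the training objective tells the model that the kitchen is beside the point; the correlation is simply there to be learned.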

From John Elway To Jeff Dean

In language processing — my own area of relative expertise — the discrepancy between human and machine is more sobering. Though the machine learning fundamentals are the same in both language and vision, reading comprehension is a far more difficult task.

Two points should make this clear. First, many animals have excellent visual perception, yet only humans have the complex language skills needed to produce philosophy, literature, and mathematics. Second, while small alterations to an image usually don’t change what it shows, altering even a single letter in a paragraph can change its meaning drastically.

Despite what many people believe, machine learning algorithms don’t read. There’s been a lot of hype in the legal tech space about question answering AIs, but it has only ever been hype. The reason is simple: these machine learning algorithms are still limited to looking for correlations — there is no comprehension going on.

So while Microsoft is reporting that their AI is “now as good as humans on [the Stanford Question Answering Dataset] Reading Test”, an excellent paper by Robin Jia and Percy Liang demonstrates just how unintelligent these algorithms really are. (No offense intended to Microsoft. They do a good job of qualifying their software’s achievement, including discussing the Jia and Liang paper. But they do severely overstate how impressive their algorithm is — by downplaying what it is that humans do.)

What Jia and Liang showed was that adding irrelevant sentences to paragraphs in the Stanford Question Answering Dataset could cause machine learning algorithms to start giving incorrect answers.

The following is an example:

When the sentence in blue is added, the algorithm switches from answering “John Elway” for the question “What is the name of the quarterback who was 38 in Super Bowl XXXIII?” to answering “Jeff Dean.”

The point is not that the machine learning algorithm now gets the answer wrong. Rather, it is what this tells us about how the algorithm works and how it ever got the answer right.

The algorithm knows nothing about football, quarterbacks, or Super Bowls. It knows nothing of the real world. What it does know is the form of the question, from which it can infer the expected form of the answer. Once it knows what form to look for, it simply has to find an example in the paragraph. The new sentence causes confusion because it adds an additional possible answer, meaning that a random selection is now more likely to be wrong.
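
The following caricature captures this, written as ordinary pattern-matching code. It is not how real SQuAD models work internally, and the paragraph and added sentence are reconstructed from memory of the paper’s example (the wording may differ slightly), but it shows the failure mode: the question’s form suggests the answer is a person’s name in a sentence mentioning a Bowl and a number, and the distractor supplies a second span of exactly that form.

```python
import re

# A caricature of form-based answer extraction, not the internals of a real model.
# The "expected form" of the answer: a person-like name in a sentence that
# mentions a Bowl and contains a number.

QUESTION = "What is the name of the quarterback who was 38 in Super Bowl XXXIII?"
QUESTION_WORDS = set(re.findall(r"[a-z]+", QUESTION.lower()))

def name_candidates(paragraph):
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
        # Filter to sentences of the expected form.
        if not (re.search(r"\bBowl\b", sentence) and re.search(r"\b\d+\b", sentence)):
            continue
        # Collect runs of capitalized words that are not recycled question words.
        words = re.findall(r"[A-Za-z]+", sentence)
        run = []
        for word in words + [""]:          # trailing "" flushes the final run
            if word.istitle() and word.lower() not in QUESTION_WORDS:
                run.append(word)
            else:
                if len(run) >= 2:
                    candidates.append(" ".join(run))
                run = []
    return candidates

paragraph = (
    "The past record was held by John Elway, who led the Broncos to victory "
    "in Super Bowl XXXIII at age 38."
)
distractor = " Quarterback Jeff Dean had jersey number 37 in Champ Bowl XXXIV."

print(name_candidates(paragraph))               # ['John Elway']
print(name_candidates(paragraph + distractor))  # ['John Elway', 'Jeff Dean']
```

With one candidate of the right form, the system looks smart. With two, it has no principled way to choose, because nothing in it knows what a quarterback is.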

Modeling Time

Contrast this with one of the language processing tasks we have at Judicata: identifying the causes of action at issue in a judicial decision. (These are the wrongs alleged in a lawsuit: for example, negligence, fraud, or breach of contract.) To automate this task accurately, we needed to write code that modeled time. It had to understand verb tenses, dates, and events known to be sequential. This gave our code the ability to figure out what was happening in the past and what was happening in the present.
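
To give a flavor of what that looks like (this is a simplified sketch, not our actual implementation, and the cue words and year are invented for illustration), the rules combine explicit dates with tense cues to decide whether a clause describes a past event or the court’s present action:

```python
import re

# Simplified sketch of rule-based time modeling: classify a clause from an
# opinion as describing a past event or the court's present action. The cue
# words and the opinion year are placeholders; a real system needs far more rules.

PAST_CUES = {"was", "were", "had", "alleged", "filed", "sued"}
PRESENT_CUES = {"is", "are", "affirm", "reverse", "remand", "conclude"}
OPINION_YEAR = 1998  # hypothetical year of the decision being analyzed

def classify_clause(clause):
    """Return 'PAST', 'PRESENT', or 'UNKNOWN' for a clause."""
    words = set(re.findall(r"[a-z]+", clause.lower()))
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", clause)]
    if any(year < OPINION_YEAR for year in years):
        return "PAST"                      # an explicit earlier date wins
    if words & PAST_CUES:
        return "PAST"
    if words & PRESENT_CUES:
        return "PRESENT"
    return "UNKNOWN"

print(classify_clause("In 1994, the plaintiff filed suit for breach of contract"))  # PAST
print(classify_clause("The judgment of the Court of Appeal is affirmed in part"))   # PRESENT
```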

This sort of thinking slow is something that machine learning can’t do. In fact, state-of-the-art machine learning still has a hard time figuring out a verb’s tense. SyntaxNet — Google’s deep learning code that identifies parts of speech and the relationships between words in a sentence — struggles on relatively simple legal language like:

“The judgment of the Court of Appeal is affirmed in part and reversed in part, and the case remanded for further proceedings consistent with this opinion.”

SyntaxNet gets the dependencies right (which is impressive):

But it messes up the verb tense for the words “reversed” and “remanded.” SyntaxNet is a great open source contribution and a valuable part of our system, but it is not accurate enough to be relied upon alone.

The Hype Matters

Hype about artificial intelligence is nothing new. But the hype is usually followed by a period of reduced funding and interest known as an AI winter.

While I don’t expect another AI winter — there are enough areas where artificial intelligence is now quite powerful and useful — I do think our AI summer has started to turn cold. This isn’t a bad thing.

My concern is that a generation of CEOs and software engineers now believes the only way to solve a problem is to use machine learning. There’s an insightful joke about regular expressions (a tool for matching patterns in text) that many programmers are familiar with:

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

The idea is that regular expressions are often the wrong tool for the job, and because they can be so unwieldy to use, they make the programmer’s job more difficult.

I believe we’ve entered an era where machine learning is the new regular expression. As I write this, there are countless companies using machine learning to tackle problems that could be solved better and more quickly with simple rule-based code.
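
A hypothetical example of the kind of problem I mean: pulling a filing date out of a docket line written in a known format. A few lines of deterministic code are exact, fast, and easy to debug, and they need no training data.

```python
from datetime import datetime

# Hypothetical rule-based extraction: the date in a line like "Filed June 1, 1998".
# No machine learning required; the format is known and the rule is exact.

def extract_filing_date(line):
    marker = "Filed "
    if marker not in line:
        return None
    date_text = line.split(marker, 1)[1].strip().rstrip(".")
    try:
        return datetime.strptime(date_text, "%B %d, %Y").date()
    except ValueError:
        return None

print(extract_filing_date("Filed June 1, 1998"))        # 1998-06-01
print(extract_filing_date("Opinion by Justice Brown"))  # None
```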

Fortunately, this problem will work itself out over time. Machine learning will soon become just one more well-understood tool in the software engineer’s belt.