AI’s Third Wave: A Perspective From The World Of Law

Itai Gurari
Judicata
Oct 16, 2018 · 15 min read


A few months ago, the Defense Advanced Research Projects Agency (DARPA) announced that it:

“is now interested in researching and developing ‘third wave’ AI theory and applications that address the limitations of first and second wave technologies by making it possible for machines to contextually adapt to changing situations.”

As DARPA describes it:

  • The first wave of AI was about “Handcrafted Knowledge”;
  • The second wave of AI was about “Statistical Learning”; and
  • The third wave should be about “Contextual Adaptation.”

This is a welcome acknowledgment of the limitations inherent in the machine learning techniques that dominate the field of Artificial Intelligence today.

While we won’t see significant advances in “third wave” AI for many years to come — or even a consensus around what precisely the “third wave” is — these next-generation technologies will likely have a big impact on the field of law, a welcome prospect for a field severely in need of them.

Understanding why requires an examination of the first two waves — AI’s past and present — and their critical shortcomings.

Machines Thinking Fast And Slow

In his book, Thinking, Fast and Slow, Daniel Kahneman describes two different modes of human thought: “System 1” is fast, automatic, and subconscious; “System 2” is slower, more calculating, and more logical.

System 1 — thinking fast — can quickly do things like solve “2+2=?” and complete the phrase “war and ….”

System 2 — thinking slow — can count the number of vowels on a page and reason about complex logic problems.

Distilled down to their essence: thinking fast recognizes, thinking slow reasons.

These two modes of thought parallel and exemplify the goals of the two periods of AI development that DARPA calls the first and second “waves.”

The first wave was focused on developing System 2 — thinking slow — logic capabilities. Much of the effort was directed towards expert systems — traditionally programmed software that “emulates the decision-making ability of a human expert.” The best known of these is probably TurboTax.

The second wave was focused on developing System 1 — thinking fast — pattern-matching capabilities. Its crowning achievement was the ascendance of deep learning — a collection of techniques that turbo-charged statistical machine learning and led to impressive gains on problems like image recognition, speech recognition, and game play.

These waves — their different concerns and orthogonal techniques — have deep historical and philosophical origins. Michael I. Jordan, one of the world’s leading Artificial Intelligence researchers, links them back to the birth of the field, noting in a broad manifesto, Artificial Intelligence — The Revolution Hasn’t Happened Yet:

“It was John McCarthy (while a professor at Dartmouth, and soon to take a position at MIT) who coined the term “AI,” apparently to distinguish his budding research agenda from that of Norbert Wiener (then an older professor at MIT). Wiener had coined “cybernetics” to refer to his own vision of intelligent systems — a vision that was closely tied to operations research, statistics, pattern recognition, information theory and control theory. McCarthy, on the other hand, emphasized the ties to logic. In an interesting reversal, it is Wiener’s intellectual agenda that has come to dominate in the current era, under the banner of McCarthy’s terminology. (This state of affairs is surely, however, only temporary; the pendulum swings more in AI than in most fields.)”

Jordan follows this observation with the declaration:

“But we need to move beyond the particular historical perspectives of McCarthy and Wiener.”

Between A Rock And A Hard Place

This “need to move beyond” is rooted in limitations inherent in the approaches born of the two perspectives.

At the highest level, the challenges behind AI’s first and second waves are similar — the rock and hard place of cost and complexity. The cost is the expense of developing intelligent software. The complexity is the multiplicity of the universe being modeled.

For logic-based approaches, the cost is straightforward. It’s the labor — the dollars and cents needed to pay people to write traditional if-then-statement code. As the complexity of the problem increases, so does the logic, making the solution ever more expensive.

For statistics-based approaches, the cost is more indirect. It’s the product of two factors (ignoring the money it takes to run the computers):

  1. the unit cost of an item of training data; and
  2. the overall amount of data needed to fully train the algorithm.

As the complexity of a problem increases, so does the amount of training data needed. If the unit cost of adding another item of training data is more than zero, then the cost of the machine’s education will grow as the complexity increases.
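
To make the arithmetic concrete, here is a minimal sketch of that two-factor cost model in Python. The unit cost and the growth curves are illustrative assumptions, not real figures; how fast the data requirement actually grows is taken up in the sections below.

```python
# A toy model of the two-factor cost: total cost = unit cost of a labeled
# example * number of examples needed. The numbers are made up for illustration.

def training_data_cost(unit_cost: float, examples_needed: int) -> float:
    return unit_cost * examples_needed

UNIT_COST = 0.05  # hypothetical: five cents per labeled example

for complexity in (10, 20, 30):
    modest_need = 1_000 * complexity   # data needs that grow linearly with complexity
    explosive_need = 2 ** complexity   # data needs that grow exponentially with complexity
    print(f"complexity {complexity:2d}: "
          f"${training_data_cost(UNIT_COST, modest_need):>12,.2f} (linear growth)  "
          f"${training_data_cost(UNIT_COST, explosive_need):>12,.2f} (exponential growth)")
```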

Deep learning algorithms throw fuel on this fire. The underlying power of deep learning is its ability to pull nuance from very large amounts of data. In our Internet-connected world — with more and more data coming online every day — algorithms capable of feasting on this bounty are a blessing.

It should thus come as no surprise that the big successes of the second wave came on problems where the available data was plentiful and the cost of collecting that data was relatively small.

Deep learning did wonders with image recognition because companies like Facebook have billions of photos manually labeled by the individuals who uploaded them. The marginal cost to collect and train on an additional photo is basically zero.

Deep learning had a similarly large impact on game play because the cost of having the computer play itself one more time is negligible. On its way to beating the world’s top Go players, AlphaGo played itself more than one million times.

Yet beyond image recognition, speech recognition, game play, and a few other areas, deep learning hasn’t had the enormous impact that many prognosticators predicted.

The reason is that for most problems the cost characteristics are not that favorable, and for many problems there is little prospect of that changing in the next few decades — if ever. Understanding why requires a small diversion into the Theory of Computation. (For a deep dive, see Chapter 5 of this book; it’s a book I hold dear to my heart.)

Computational Complexity

Computer scientists have a concept known as computational complexity, which is a measure of the resources an algorithm needs to solve a given problem — where the resource is typically measured in time or space (which boils down to computing time or computer memory).

For some problems, the needed resources are minimal and solving the problem requires only a small amount of time or space — even as the input gets bigger. An example is the problem of determining whether an integer is odd. No matter how large the input number, the solution always requires the same fixed amount of time — checking if the last digit is a 1, 3, 5, 7, or 9. It doesn’t matter if the number has one digit, one hundred digits, or one thousand digits.

For other problems, the needed resources are more substantial — as the input to the algorithm gets bigger, an increasing amount of time or space is needed. An example is the problem of identifying the smallest number in a list of numbers. Solving the problem involves scanning the list and remembering the smallest number encountered so far. The longer the list, the longer it takes to scan. So the time needed to solve the problem grows linearly with the length of the list.

Other problems can require even more resources to solve. For example, there are problems where each one-unit increase in input size doubles the time needed for the algorithm to run. For these, the amount of resources required grows exponentially with the size of the input, and the time needed to solve the problem can quickly become prohibitively large — taking days, weeks, months, or even years.
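
Here is a minimal sketch of these three growth rates, with toy functions standing in for each kind of problem:

```python
# Toy illustrations of the three growth rates described above.

def is_odd(digits: str) -> bool:
    """Constant time: only the last digit matters, no matter how many digits there are."""
    return digits[-1] in "13579"

def smallest(numbers):
    """Linear time: one pass over the list, remembering the smallest value seen so far."""
    best = numbers[0]
    for n in numbers[1:]:
        if n < best:
            best = n
    return best

def all_subsets(items):
    """Exponential time: n items have 2**n subsets, so even listing them doubles
    in cost every time one more item is added."""
    subsets = [[]]
    for item in items:
        subsets += [subset + [item] for subset in subsets]
    return subsets

print(is_odd("987654321987654321"))      # True, as fast as is_odd("3")
print(smallest([42, 7, 19]))             # 7, after scanning all three numbers
print(len(all_subsets(list("abcde"))))   # 32; adding a sixth item makes it 64
```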

Feature Complexity

With this background in mind, it’s important to understand that a similar complexity dynamic exists with the training data that machine learning algorithms need for learning. For some problems, the amount of training data required is minimal. For others, the amount of training data required can be quite large.

The rate at which these data needs grow is what I call a problem’s feature complexity, since it depends on the relationships between the features in the universe being modeled.

Consider, for example, a universe of dogs that bark. To keep it small, we’ll limit it to just two breeds: Yorkies and Pomeranians.

There are two things we can say in this universe:

  1. The Yorkie is barking.
  2. The Pomeranian is barking.

If we add a color descriptor with two possible values (brown and orange), we can say four things:

  1. The brown Yorkie is barking.
  2. The brown Pomeranian is barking.
  3. The orange Yorkie is barking.
  4. The orange Pomeranian is barking.

If we add a size descriptor with two possible values (small and large), we can say eight things:

  1. The small brown Yorkie is barking.
  2. The small brown Pomeranian is barking.
  3. The small orange Yorkie is barking.
  4. The small orange Pomeranian is barking.
  5. The large brown Yorkie is barking.
  6. The large brown Pomeranian is barking.
  7. The large orange Yorkie is barking.
  8. The large orange Pomeranian is barking.

The pattern should be clear: each time we add an independent feature to this universe (or, in the language of statistics, a degree of freedom), the number of things that can happen in the universe doubles. The number of permutations therefore grows exponentially with the number of independent features.
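
The doubling can be generated mechanically. Here is a small sketch built on the same toy universe:

```python
# Each independent two-valued feature doubles the number of distinct sentences.
from itertools import product

breeds = ["Yorkie", "Pomeranian"]
colors = ["brown", "orange"]
sizes = ["small", "large"]

for features in ([breeds], [colors, breeds], [sizes, colors, breeds]):
    sentences = ["The " + " ".join(combo) + " is barking." for combo in product(*features)]
    print(f"{len(sentences)} things we can say, e.g.: {sentences[0]}")
```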

Why does this matter?

The most interesting problems in Artificial Intelligence connect back to the real world, and the number of independent features found there is quite large. The exponential feature complexity of these problems makes it effectively impossible to collect and train on data that captures all the different ways the features can interact.

So if an algorithm needs to be trained on all the permutations of the universe in order to model it well, then the algorithm is going to struggle when the feature complexity is exponential and there are more than a small number of independent features.

Exponentially Dumb

One of the important advantages of the deep learning algorithms that dominate today is their ability to discover meaningful features on their own (i.e., they don’t need to be pre-identified and programmed by a person).

But this advantage becomes a problem when the training data is not sufficiently comprehensive to teach all the important features (and only those features) and how they interact with one another. The problem isn’t obvious when the algorithms are doing the right thing — that’s why we get grandiose predictions extrapolating from research and narrow proof-of-concept demonstrations.

Yet it becomes crystal clear when the algorithms make mistakes, like labeling anything with baseball-style stitching a baseball, or anything with leopard-print fur a leopard.

These algorithms aren’t learning important features and are instead focusing on shortcuts that behave poorly on less common inputs. That’s because they need training data that covers the full exponentially-expanding universe. They need data to show the impact of both the presence and absence of every single feature.

Put differently, the algorithms aren’t learning that baseballs are round because they’re not being sufficiently trained on images with baseball-like stitching but which aren’t baseballs. Similarly, they’re not learning that leopards are shaped like cats because they’re not being sufficiently trained on leopard-skin objects that aren’t cats. These are features fundamental to the identity of the objects, yet they are not being learned.

In each of the above examples — and in countless others like them — there’s always a sense that with more data the algorithm would have performed better. But that need for more data — exponentially more data — is what makes these algorithms so limited.

This is a big part of the reason we need to “move beyond” the second wave.

From Eight To Infinity

A second important reason to “move beyond” is the inability of the technologies dominating the second wave to think logically — to think slowly.

Andrew Ng, one of the heavyweights of the deep learning community, noted a couple of years ago:

“Almost all of AI’s recent progress is through one type, in which some input data (A) is used to quickly generate some simple response (B) … If a typical person can do a mental task with less than one second of thought, we can probably automate it using AI either now or in the near future.”

While I disagree with his sanguine assessment of the timeline (“now or in the near future”), it’s an important acknowledgment that recent progress in the field of Artificial Intelligence is limited to capabilities of the thinking fast variety.

Some people may hold out hope that the statistical pattern-matching approaches of the second wave will one day develop the logic-based capabilities targeted in the first wave. But that is a lost cause.

While any individual logic-based problem can be re-framed as a pattern-matching problem, the exponential feature complexity of the world means the amount of training data needed quickly explodes. Moreover, the needed training data doesn’t just grow fast — it grows endlessly.

Statistical pattern-matching is how we get chatbots that answer “seventy-two” to the question “how much is ten minus two” — notwithstanding that calculators (like the abacus) have been able to solve that problem for millennia.

We could train a machine learning algorithm to recognize that the answer to the question “how much is ten minus two” is “eight” — but what would be the point? There are an infinite number of math problems out there; so no pattern-matching algorithm will ever be able to function as an open-ended and reliable calculator.
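
To caricature the difference, here is a deliberately crude sketch (not how any real chatbot or calculator works) contrasting an algorithm that has only memorized question-answer pairs with one that applies a rule:

```python
# The "pattern-matching" bot can only answer questions it has already seen;
# the "rule-applying" bot computes the answer. Both are toys for illustration.

MEMORIZED_ANSWERS = {
    "how much is ten minus two": "eight",   # a single training example
}

NUMBER_WORDS = {"two": 2, "eight": 8, "ten": 10, "seventy": 70}

def pattern_matching_bot(question):
    return MEMORIZED_ANSWERS.get(question, "no idea")

def rule_applying_bot(question):
    # One rule: "how much is X minus Y" means compute X - Y.
    words = question.split()
    return str(NUMBER_WORDS[words[3]] - NUMBER_WORDS[words[5]])

print(pattern_matching_bot("how much is ten minus two"))      # "eight" (memorized)
print(pattern_matching_bot("how much is seventy minus two"))  # "no idea"
print(rule_applying_bot("how much is seventy minus two"))     # "68"
```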

And that is just the story when it comes to math.

In their recent book, The Book of Why, Judea Pearl (another AI heavyweight and the father of Bayesian Networks) and Dana Mackenzie address the limits of statistical pattern-matching at an even more fundamental level. They dive into why statistics can never lead to human-level intelligence: simply put, “statistics alone cannot tell which is the cause and which is the effect.”

Back To The Future

So if we’re now looking toward the future of AI and no longer so enamored with the present or disparaging of the past, what should some of the functional capabilities of the next generation of AI algorithms be?

The limitations found in the second wave provide a good start.

First, we need to extricate ourselves from the crushing data dependence that exists today. We need algorithms whose data needs grow linearly with the number of independent features (or something closer to that) and not exponentially. While we don’t really understand how humans learn — or the scale of our own training data needs when we are young — humans can learn new and complicated lessons from even a single data point.

Second, we need machine learning algorithms that can also think slowly — logically — bridging the gap between the first and second waves. There are important questions to ask about whether these logical reasoning abilities should be learned or innate, and how these thinking fast and thinking slow capabilities should interact. Should there be one unified system? Two separate systems? Three or more? However it is done, we need to think of logic as a fundamental capability and not an emergent skill.

Third, we need algorithms that are transparent. Although expressed in the language of computer code, the first wave technologies were readable and comprehensible. It was possible to understand why they did what they did. By contrast, the deep learning algorithms that dominate the second wave are inscrutable black boxes. They are difficult to interrogate — to the point where why they did what they did is effectively unknowable. This not only makes trust an issue, but it also limits the utility of the applications, since they are hard to debug and therefore slow to iterate on and improve.

Abstracting And Reasoning

In comparing the first and second waves — and setting out goals for the third wave — DARPA identified four key capabilities that comprise a “notional intelligence scale”:

  1. Perceiving rich, complex and subtle information
  2. Learning within an environment
  3. Abstracting to create new meanings
  4. Reasoning to plan and to decide

DARPA’s goal is the development of third wave technologies that can perform each of these well. The second wave technologies were good at perceiving and learning, but stumbled on abstracting and reasoning.

Will machine learning software that supports linear growth, logical thinking, and transparency get us to abstracting and reasoning?

The devil will be in the details, but there is reason to believe that it can.

Let’s start with the target of being able to learn linearly rather than exponentially. In order to do this, we’ll need algorithms (and architectures) that can generalize broadly and accurately from the limited training data provided to them. Recognizing that certain features and values are similar — and can be modeled in comparable ways — reduces the feature complexity of the problem. For example, abstracting “brown” and “black” into a concept called “color,” and “small” and “large” into a concept called “size,” will make it possible to understand how “color” and “size” interact without needing to evaluate every permutation of the two.

Thus, abstraction is a critical means towards our linear-machine-learning end.
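
As a rough sketch of the payoff, assume every feature is binary and that, once values are grouped into abstract dimensions like “color” and “size,” those dimensions can be modeled independently of one another (that independence is the critical assumption):

```python
# With k independent binary features, modeling every permutation requires 2**k
# cases, while modeling each abstracted dimension on its own requires only 2*k.

values_per_feature = 2

for num_features in (3, 10, 20):
    every_permutation = values_per_feature ** num_features
    abstracted_cases = values_per_feature * num_features
    print(f"{num_features:2d} features: {every_permutation:>9,} permutations "
          f"vs. {abstracted_cases} abstracted cases")
```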

As for thinking logically, we’ll need software that can learn rules found in the real world. Propositional logic provides a good language and starting point for doing this (although it won’t get us all the way there). Software that can learn and express patterns in terms of implication (if-then; →), conjunction (and; ∧), disjunction (or; ∨), negation (not; ¬), and biconditional (if and only if; ↔) will give us the rule-based statements we need to build a reasoning machine. We’ll also need a higher-level controller to link these rule-based building blocks — to sequentially apply them in a logical and contextually sensible way.
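
Here is a minimal sketch of what such rule-based building blocks and a simple controller might look like. The rules are written by hand (and cover only conjunction and implication); in a third wave system they would have to be learned:

```python
# Two hand-written rules (a conjunction of premises implies a conclusion) and a
# controller that applies them repeatedly until nothing new can be derived.
from typing import NamedTuple

class Rule(NamedTuple):
    premises: frozenset  # facts that must all hold (conjunction)
    conclusion: str      # fact that follows (implication)

RULES = [
    Rule(frozenset({"is_small", "is_furry"}), "is_lap_dog"),
    Rule(frozenset({"is_lap_dog", "is_barking"}), "wants_attention"),
]

def forward_chain(facts, rules):
    """Sequentially apply every rule whose premises are satisfied until a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.premises <= derived and rule.conclusion not in derived:
                derived.add(rule.conclusion)
                changed = True
    return derived

print(forward_chain({"is_small", "is_furry", "is_barking"}, RULES))
# {'is_small', 'is_furry', 'is_barking', 'is_lap_dog', 'wants_attention'} (set order may vary)
```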

It’s worth noting that just as deep learning research was active for decades before hitting prime time in the last decade, there is prior research into topics like logic learning machines that may find a new renaissance as part of this third wave.

Finally, while not specifically relevant to abstracting or reasoning, software transparency is an accelerant for the adoption of Artificial Intelligence technologies. It not only serves the critical purpose of enabling oversight and trust (which help with real-world usage), but it also tightens the feedback-improvement loop which helps speed up development and deployment. Transparency drives more and better feedback, which is critical to enabling a virtuous loop where we have: (1) usage, which (2) leads to feedback, which (3) drives improvement to the software, which (4) enables more usage.

Artificial Intelligence And Law

Having worked at the intersection of law and technology for the last decade (and worked and studied in the individual fields for a decade before that) I’m excited that the field of law is likely to be an area of development for these third wave technologies.

The intersection between AI and law goes back many years — to the 1940s — before the phrase “Artificial Intelligence” was even coined. Yet seventy years later, very little that looks like artificial intelligence is actually used by lawyers (or is replacing them). The few exceptions — software for e-discovery and due diligence — are fundamentally about information retrieval (search and ranking) and have little to do with the application of logic or rules, abstracting or reasoning.

The reason for the minimal impact is twofold. First, when it comes to abstracting and reasoning — core skills needed by lawyers — the second wave technologies have been largely irrelevant. Second, language is hard, and law is inextricably tied to language. Natural Language Processing (NLP) is probably the area where deep learning has most famously not had a revolutionary impact — leading to relatively small incremental gains rather than the big leaps found in speech and image recognition.

However, this doesn’t mean law isn’t a good area of focus for the field of Artificial Intelligence — or a place where “third wave” AI can’t have a big impact. To the contrary, the incredibly challenging (and changing) nature of the law — constantly shaped by and shaping the real world — makes it a very rich source of problems.

L. Thorne McCarty’s explanation for why he focused on tax law — from his classic 1977 paper, Reflections on TAXMAN: An Experiment in Artificial Intelligence and Legal Reasoning — is just as apt today for why to focus on the field of law more generally:

“Superimposed on a manageable foundation of manageable complexity is another system of concepts as unruly as any that can be found in the law, with all the classical dilemmas of legal reasoning: contrasts between ‘form’ and ‘substance,’ between statutory ‘rules’ and judicially created ‘principles,’ between ‘legal formality’ and ‘substantive rationality.’”

McCarty continues:

“It is perhaps no accident that two of the more quotable passages on the problems of legal interpretation have arisen in corporate tax cases. Holmes’ famous metaphor — ‘[a] word is not a crystal, transparent and unchanged, it is the skin of a living thought’ — appeared first in Towne v. Eisner, the earliest stock dividend case. And the simile of Learned Hand — ‘the meaning of a sentence may be more than that of the separate words, as a melody is more than the notes’ — appeared first in Helvering v. Gregory.”

Over the past year, there has been an important shift in the discourse around Artificial Intelligence.

Now that well-regarded AI researchers and technologists like Judea Pearl, Michael I. Jordan, and Rodney Brooks are saying that statistical machine learning isn’t enough to get us where we want to go, AI skepticism is no longer a sign of ignorance. This matters because it opens the door for serious research that takes as its starting point the limits of statistical machine learning. And that research will be increasingly targeted at the challenges that have kept the second wave from being more useful to lawyers — the ability to abstract, reason, and think slow like a real human being.

The demanding nature of the law led my company — Judicata — to address the technical limitations of statistical machine learning early on. We had to develop ways to blend together first and second wave approaches. For example, we married a simple “first wave” rule-based language parser with a sophisticated “second wave” statistical parser (originally Stanford Parser and later SyntaxNet). The statistical parsers hadn’t learned a number of important and basic grammar rules; the hybrid approach allowed us to produce a more accurate parser than either approach offered on its own.
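
For a flavor of the general pattern (this is a greatly simplified sketch, not Judicata’s actual code, and `statistical_parse` is just a stand-in for whichever statistical parser is being wrapped):

```python
import re

def rule_based_parse(sentence):
    """A narrow, hand-written grammar rule; returns None when it doesn't apply."""
    # Hypothetical rule: treat "X v. Y" as a single case-name unit, a construction
    # that a generic statistical parser may not have learned to handle.
    match = re.fullmatch(r"(\w+) v\. (\w+)", sentence.strip())
    if match:
        return {"type": "case_name", "parties": [match.group(1), match.group(2)]}
    return None

def hybrid_parse(sentence, statistical_parse):
    """Use the hand-written rules where they apply; otherwise defer to the statistical parser."""
    return rule_based_parse(sentence) or statistical_parse(sentence)

# Example usage, with a placeholder standing in for the statistical parser:
print(hybrid_parse("Towne v. Eisner", lambda s: {"type": "unknown", "text": s}))
```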

With more companies working to get past the limits of the second wave technologies, we’re in a much better position to make meaningful progress as a field.

That’s an exciting prospect and one I’m happy to be a part of.
