Computational Linguistics

So I was on this “make an earth-shattering new programming language” kick the other week and bought a whole bunch of used books on Amazon. One of them was “Computational Linguistics” by Ralph Grishman. It’s an old book, from 1986, but apparently it’s a classic in the field. It’s about teaching computers to read human languages like English. That’s what computational linguistics is, basically.

It’s funny to read it. You read each part and think, “this approach will never scale to real-world problems.” Over and over. For everything you read. But the thing is, the authors of these approaches know their solutions don’t scale. They are just computer scientists building things in the hopes that other people can build better things from them. And they are so hopeful for the future.

But the future turned out bleak for computational linguists. The concepts we had in 1986 are the same things we have now: interpreters that handle only very small subsets of the language. No real progress on the knowledge-storing problem, let alone the grammar-parsing problem.

To my untrained eye, the book breaks down into two big problems. The first is grammar: if you want a computer to be able to understand English, you need to teach it the grammar of English. This is a massive undertaking. The book is full of great examples of why this is hard, like the “gapping phenomena” we find in English conjunctions.

In English, a conjunction can join a whole sentence to a fragment that follows the same form as the first half, optionally dropping parts from the front. For instance, if your sentence is “subject verb object”, then you can write “subject verb object AND subject verb object”, or “subject verb object AND verb object”, or “subject verb object AND object”. All these are allowed:

I ate the cake and Bob ate the pasta.
I ate the cake and ate the pasta.
I ate the cake and the pasta.

But there are special cases. It’s also okay to say “subject verb object AND subject object”. It looks a little archaic but it’s totally legal.

I ate the cake and Bob the pasta.
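Here is a minimal sketch of the conjunction patterns above in Python. It is not from the book; the Clause type and the expand_conjunct function are names I made up for illustration. The idea is just that whatever the second half leaves out gets copied from the first half.

```python
from typing import NamedTuple, Optional


class Clause(NamedTuple):
    """A toy "subject verb object" sentence (hypothetical type)."""
    subject: str
    verb: str
    obj: str


def expand_conjunct(first: Clause,
                    subject: Optional[str] = None,
                    verb: Optional[str] = None,
                    obj: Optional[str] = None) -> Clause:
    """Fill in any part missing from the second conjunct using the first clause."""
    return Clause(subject or first.subject,
                  verb or first.verb,
                  obj or first.obj)


first = Clause("I", "ate", "the cake")

# "I ate the cake and ate the pasta."  (subject dropped)
no_subject = expand_conjunct(first, verb="ate", obj="the pasta")

# "I ate the cake and the pasta."  (subject and verb dropped)
object_only = expand_conjunct(first, obj="the pasta")

# "I ate the cake and Bob the pasta."  (verb dropped: the gapping case)
gapped = expand_conjunct(first, subject="Bob", obj="the pasta")
```

This handles exactly four patterns for exactly one sentence shape. Every new shape or exception needs another rule, which is the scaling problem in miniature.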

The thousands of special-case rules in the language make a traditional grammar impractical. The book outlines many theoretical approaches to fixing this, but none of them scale to the size needed.

And then we get to the “context” part of English. If you see a paragraph like:

“In the early morning dawn, the battleship fired two torpedoes at the cruise liner. It sank beneath the ocean, leaving no survivors.”

Did the battleship sink or the cruise liner? You know it was the cruise liner, because you know that when a torpedo is fired at a boat, that boat is likely to sink. A computer does not know that, and cannot work out what “it” refers to.
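To make the problem concrete, here is a rough sketch of what “knowing that” might look like if you wrote it down by hand. Everything here is my own assumption, not something from the book: the LIKELY_OUTCOMES table, the event and role names, and the resolve_pronoun function are all hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical hand-written world knowledge: for each event, which follow-up
# actions are plausible for each participant role.
LIKELY_OUTCOMES: Dict[str, Dict[str, Set[str]]] = {
    "fire_torpedo_at": {"agent": set(), "target": {"sink"}},
}


def resolve_pronoun(event: str,
                    participants: Dict[str, str],
                    predicate: str) -> List[str]:
    """Return the participants for whom `predicate` is a plausible outcome of `event`."""
    outcomes = LIKELY_OUTCOMES.get(event, {})
    return [name for role, name in participants.items()
            if predicate in outcomes.get(role, set())]


participants = {"agent": "the battleship", "target": "the cruise liner"}

# "It sank": which participant is "it"?
candidates = resolve_pronoun("fire_torpedo_at", participants, "sink")

# An event nobody wrote a rule for gives no answer at all.
unknown = resolve_pronoun("ram", participants, "sink")
```

The lookup itself is trivial. The hard part is that somebody has to type in a row of that table for every event in the world.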

There are lots of approaches to this problem, too. For instance, the concept of “scripts” could help. The idea is that we store data about the world as stories: little scripts about how something works. Take a passage like:

“Bob was seated at the booth and his order was taken quickly, but thirty minutes later he still had no food. Bob was furious.”

Why was Bob furious? If we think of “going to a restaurant” as a script, then the script goes something like “you are seated, you place your order, you get your food, the waiter checks on you, you receive the check,” etc. And somewhere in there you could codify the notion that steps shouldn’t take more than a certain amount of time to complete.
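A restaurant script with time limits could be sketched like this. To be clear, this is my own toy version: the step names, the minute values, and the overdue_step function are all invented for illustration, and the book’s actual formalism may look nothing like it.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical script: each step paired with the maximum number of minutes
# it should take after the previous step. The numbers are made up.
RESTAURANT_SCRIPT: List[Tuple[str, int]] = [
    ("seated", 15),
    ("order_taken", 10),
    ("food_served", 20),
    ("check_received", 60),
]


def overdue_step(script: List[Tuple[str, int]],
                 observed: Dict[str, int],
                 now: int) -> Optional[str]:
    """Return the next expected step if it is past its time limit, else None.

    `observed` maps each step that has happened to the minute it happened.
    """
    previous_time = 0
    for step, max_wait in script:
        if step in observed:
            previous_time = observed[step]
            continue
        # This is the first step that has not happened yet.
        return step if now - previous_time > max_wait else None
    return None


# Bob was seated, his order was taken quickly, and thirty minutes later
# he still had no food.
bob = {"seated": 0, "order_taken": 5}
violation = overdue_step(RESTAURANT_SCRIPT, bob, now=35)
```

With these made-up numbers, the function reports that the food step is overdue, which is the script’s explanation for why Bob is furious. It also only works for restaurants.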

If you thought, “Wow, that approach will never scale,” then you win a vapor cookie. Because of course it won’t. Well, didn’t. None of the other approaches have scaled either, as is evidenced by the lack of computers that can interpret English paragraphs.

I think the whole approach is wrong. This is the “brute force” approach to a solution. The final correct solution won’t use anything even remotely resembling these approaches. It will have to tackle the problem from a completely different angle.