This could lead to the next big breakthrough in common sense AI

You’ve probably heard us say this countless times: GPT-3, the gargantuan AI that spews uncannily human-like language, is a marvel. It’s also largely a mirage. You can tell with a simple trick: Ask it the color of sheep, and it will suggest “black” as often as “white”—reflecting the phrase “black sheep” in our vernacular.

That’s the problem with language models: because they’re only trained on text, they lack common sense. Now researchers from the University of North Carolina, Chapel Hill, have designed a new technique to change that. They call it “vokenization,” and it gives language models like GPT-3 the ability to “see.”

It’s not the first time people have sought to combine language models with computer vision. This is actually a rapidly growing area of AI research. The idea is that both types of AI have different strengths. Language models like GPT-3 are trained through unsupervised learning, which requires no manual data labeling, making them easy to scale. Image models like object recognition systems, by contrast, learn more directly from reality. In other words, their understanding doesn’t rely on the kind of abstraction of the world that text provides. They can “see” from pictures of sheep that they are in fact white.

AI models that can parse both language and visual input also have very practical uses. If we want to build robotic assistants, for example, they need computer vision to navigate the world and language to communicate about it to humans.

But combining both types of AI is easier said than done. It isn’t as simple as stapling together an existing language model with an existing object recognition system. It requires training a new model from scratch with a data set that includes text and images, otherwise known as a visual-language data set.

The most common approach for curating such a data set is to compile a collection of images with descriptive captions. A picture of an orange cat sitting in a suitcase, for example, might be captioned “An orange cat sits in the suitcase ready to be packed.” This differs from typical image data sets, which would label the same picture with only one noun, like “cat.” A visual-language data set can therefore teach an AI model not just how to recognize objects but how they relate to and act on one another, using verbs and prepositions.
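To make the contrast concrete, here is a minimal, hypothetical sketch of what one record in each kind of data set might look like. The field names are invented for illustration (loosely in the spirit of MS COCO’s caption annotations), not the researchers’ actual schema:

```python
# A typical image-classification record: one noun label per picture.
classification_record = {
    "image_file": "000001.jpg",   # hypothetical file name
    "label": "cat",
}

# A visual-language record: the same picture paired with a descriptive
# caption, so a model can learn relations ("sits in", "ready to be packed"),
# not just object identities.
visual_language_record = {
    "image_file": "000001.jpg",
    "caption": "An orange cat sits in the suitcase ready to be packed.",
}
```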

But you can see why this data curation process would take forever. This is why existing visual-language data sets are so puny. A popular text-only data set like English Wikipedia (which includes nearly all English-language Wikipedia entries) might contain nearly 3 billion words. A visual-language data set like Microsoft Common Objects in Context, or MS COCO, contains only 7 million words. It’s simply not enough data to train an AI model for anything useful.

“Vokenization” gets around this problem, using unsupervised learning methods to scale the tiny amount of data in MS COCO to the size of English Wikipedia. The resultant visual-language model outperforms state-of-the-art models in some of the hardest tests used to evaluate AI language comprehension today.

“You don’t beat state of the art on these tests by just trying a little bit,” says Thomas Wolf, the cofounder and chief science officer of the natural-language processing startup Hugging Face, who was not part of the research. “This is not a toy test. This is why this is super exciting.”

From tokens to vokens

Let’s first sort out some terminology. What on earth is a “voken”?

In AI speak, the words that are used to train language models are known as tokens. So the UNC researchers decided to call the image associated with each token in their visual-language model a voken. Vokenizer is what they call the algorithm that finds vokens for each token, and vokenization is what they call the whole process.

The point of this isn’t just to show how much AI researchers love making up words. (They really do.) It also helps break down the basic idea behind vokenization. Instead of starting with an image data set and manually writing sentences to serve as captions—a very slow process—the UNC researchers started with a language data set and used unsupervised learning to match each word with a relevant image (more on this later). This is a highly scalable process.

That unsupervised learning technique is ultimately the paper’s main contribution. How do you actually find a relevant image for each word?

Vokenization

Let’s go back for a moment to GPT-3. GPT-3 is part of a family of language models known as transformers, which represented a major breakthrough in applying unsupervised learning to natural-language processing when the first one was introduced in 2017. Transformers learn the patterns of human language by observing how words are used in context and then creating a mathematical representation of each word, known as a “word embedding,” based on that context. The embedding for the word “cat” might show, for example, that it is frequently used around the words “meow” and “orange” but less often around the words “bark” or “blue.”

This is how transformers approximate the meanings of words, and how GPT-3 can write such human-like sentences. It relies in part on these embeddings to tell it how to assemble words into sentences, and sentences into paragraphs.
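If you want to poke at contextual word embeddings yourself, here’s a minimal sketch using the open-source Hugging Face transformers library and an off-the-shelf BERT model (the same family of model the researchers later retrain). The sentences are made up for illustration; this shows the general idea, not the researchers’ code:

```python
# pip install torch transformers
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "The orange cat curled up and started to meow.",  # "cat" near "meow" and "orange"
    "The cat watched the dog bark at a blue car.",    # "cat" in a different context
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        outputs = model(**inputs)
        # One vector per token; each vector depends on the surrounding words,
        # so "cat" gets a somewhat different embedding in each sentence.
        token_vectors = outputs.last_hidden_state[0]
        print(sentence, tuple(token_vectors.shape))
```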

There’s a parallel technique that can also be used for images. Instead of scanning text for word usage patterns, it scans images for visual patterns. It tabulates how often a cat, say, appears on a bed versus on a tree, and creates a “cat” embedding with this contextual information.
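On the image side, a rough sketch of pulling a fixed-length visual embedding out of a picture with an off-the-shelf convolutional network might look like this. The file name is hypothetical, and this generic ResNet feature extractor simply stands in for whatever visual encoder a given model actually uses:

```python
# pip install torch torchvision pillow
import torch
from PIL import Image
from torchvision import models, transforms

# Standard ImageNet preprocessing for a pretrained ResNet.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = torch.nn.Identity()  # drop the classifier; keep the 2048-d features
model.eval()

image = Image.open("cat_on_bed.jpg").convert("RGB")  # hypothetical image file
with torch.no_grad():
    visual_embedding = model(preprocess(image).unsqueeze(0))[0]
print(visual_embedding.shape)  # torch.Size([2048])
```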

The insight of the UNC researchers was to use both embedding techniques on MS COCO. They converted the images into visual embeddings and the captions into word embeddings. What’s really neat about these embeddings is that they can then be mapped into a shared space (and projected down to three dimensions if you want to visualize them), so you can literally see how they relate to one another. Visual embeddings that are closely related to word embeddings appear closer together. In other words, the visual cat embedding should (in theory) overlap with the text-based cat embedding. Pretty cool.

You can see where this is going. Once the embeddings are all graphed and compared and related to one another, it’s easy to start matching images (vokens) with words (tokens). And remember, because the images and words are matched based on their embeddings, they’re also matched based on context. This is useful when one word can have totally different meanings. The technique successfully handles that by finding different vokens for each instance of the word.

For example:

Here is her contact.
Some cats love human contact.

The token is the word “contact” in both examples. But in the first sentence, context suggests that the word refers to contact information, so the voken is the contact icon. In the second sentence, the context suggests the word refers to touch, so the voken shows a cat being stroked.
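At a high level, you can think of the matching step as a nearest-neighbor search in that shared embedding space. The sketch below is a deliberately simplified illustration using cosine similarity; it assumes the token and image embeddings have already been projected to the same dimensionality, and the function name is invented. The researchers’ actual vokenizer is a trained model, not this short lookup:

```python
import torch
import torch.nn.functional as F

def assign_vokens(token_embeddings, image_embeddings):
    """For each contextual token embedding, pick the most similar image.

    token_embeddings: (num_tokens, dim) tensor, one vector per token in context
    image_embeddings: (num_images, dim) tensor, one vector per candidate image
    Returns the index of the best-matching image (voken) for each token.
    """
    tokens = F.normalize(token_embeddings, dim=-1)
    images = F.normalize(image_embeddings, dim=-1)
    similarity = tokens @ images.T          # cosine similarity matrix
    return similarity.argmax(dim=-1)        # best image index per token

# Because the token vectors are contextual, "contact" in "Here is her contact."
# and "contact" in "Some cats love human contact." produce different vectors,
# so they can land on different vokens.
```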

The researchers used the visual and word embeddings they created with MS COCO to train their vokenizer algorithm. Once trained, the vokenizer was then able to find vokens for the tokens in English Wikipedia. It’s not perfect. The algorithm only found vokens for roughly 40% of the tokens. But that’s still 40% of a data set with nearly 3 billion words.
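Conceptually, the output is a text data set in which some tokens now carry a pointer to an image. The record below is purely illustrative; the field names and voken IDs are made up, and the gaps reflect the roughly 60% of tokens for which no voken was found:

```python
# Illustrative shape of a "vokenized" training example (all values invented).
vokenized_example = {
    "tokens": ["some", "cats", "love", "human", "contact", "."],
    # One image index per token where a match was found; None where the
    # vokenizer could not find a relevant image.
    "vokens": [None, 8231, None, 517, 4096, None],
}
```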

With this new data set, the researchers retrained a language model known as BERT, an open-source transformer developed by Google that predates GPT-3. They then tested the new and improved BERT on six different language comprehension tests, including SQuAD, the Stanford Question Answering Dataset, which asks models to answer reading comprehension questions about a series of articles, and SWAG, which tries to trip up models with subtleties of the English language to probe whether they’re merely mimicking and memorizing. The improved BERT performed better on all of them, which Wolf says is nothing to sneeze at.
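For reference, benchmarks like SQuAD are publicly available, and loading one with the open-source datasets library takes a few lines. This is just a peek at the test data, not the researchers’ evaluation pipeline:

```python
# pip install datasets
from datasets import load_dataset

squad = load_dataset("squad")          # Stanford Question Answering Dataset
example = squad["validation"][0]
print(example["question"])             # the reading comprehension question
print(example["context"][:200])        # the article passage to read
print(example["answers"]["text"])      # reference answer(s)
```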

The researchers, Hao Tan, a PhD student, and Mohit Bansal, his advisor, will be presenting their new vokenization technique in two weeks at the Conference on Empirical Methods in Natural Language Processing. While the work is still early, Wolf sees it as an important conceptual breakthrough in getting unsupervised learning to work for visual-language models. It was a similar spark that helped dramatically advance natural-language processing back in the day.

“In NLP, we had this huge breakthrough over two years ago, and then suddenly NLP was a field where a lot of things were happening and it kind of got ahead of all the other AI fields,” he says. “But we have this problem of connecting text with other things. So it’s like this robot that is only able to talk but cannot see, cannot hear.”

“This paper is one example where they managed to connect it to another modality and it works better,” he says. “You could imagine that maybe some of these techniques could be reused when you want to leverage this really powerful language model in a robot. Maybe you use the same thing to connect the robot’s senses to text.”