How do DALL-E and other forms of generative AI work?


DALL-E is incredibly good. Not too many years ago, it was easy to conclude that AI technologies would never produce anything approaching the quality of artistic composition or human writing. Now the generative model programs that power DALL-E 2 and Google’s LaMDA chatbot produce pictures and words eerily like a real person’s work. DALL-E makes artistic or photorealistic images of a wide variety of objects and scenes.

How do these image-forming models work? Do they function like a person and should we consider them intelligent?

How diffusion models work

Generative Pre-trained Transformer 3 (GPT-3) is a state-of-the-art AI technology. The proprietary computer code was developed by the somewhat misnamed OpenAI, a Bay Area technology company that began as a non-profit organization before pivoting to for-profit status and licensing GPT-3 to Microsoft. GPT-3 was built to produce words, but OpenAI adapted a version to produce DALL-E and its sequel, DALL-E 2, using a technique called diffusion modeling.

Diffusion models perform two sequential processes. They ruin pictures, then try to rebuild them. Programmers give the model real images with human-attributed meanings: dog, oil painting, banana, sky, 1960s sofa, etc. The model diffuses—that is, moves—them through a long chain of sequential steps. In the destruction sequence, each step slightly alters the image passed to it by the previous step, adds random noise in the form of meaningless scattershot pixels, and then passes it to the next step. Repeated over and over, this gradually turns the original image into static, and its meaning disappears.
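The destruction sequence can be sketched in a few lines. This is a toy illustration, not OpenAI’s code; the step size, step count, and stand-in “image” are all made up:

```python
import numpy as np

def forward_diffuse(image, num_steps=1000, beta=0.02, seed=0):
    """Destruction sequence: each step slightly shrinks the image it was
    handed and mixes in random, meaningless scattershot pixels (noise)."""
    rng = np.random.default_rng(seed)
    x = image.astype(float)
    for _ in range(num_steps):
        x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)
    return x

# A tiny stand-in "image": after enough steps it is essentially pure static.
image = np.ones((8, 8))
static = forward_diffuse(image)
```

After a thousand steps, almost none of the original signal survives: the output is statistically indistinguishable from pure noise.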


When this process is complete, the model runs the chain in reverse. Beginning with near-meaningless noise, it pushes the image back through the series of sequential steps, this time trying to reduce noise and restore meaning. At each step, the model’s performance is judged by the probability that the less noisy image produced at that step has the same meaning as the original, real image.
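Run in reverse, the chain becomes a denoising loop. The sketch below is a toy: in DALL-E the denoiser is a trained neural network, while here we cheat and point a hypothetical `denoiser` straight at a known target, just to show the mechanics of stepping back from noise toward meaning:

```python
import numpy as np

def reverse_diffuse(noisy, denoiser, num_steps=1000, step_size=0.01):
    """Reconstruction sequence: starting from near-meaningless noise, each
    step nudges the image in the direction the denoiser says is 'less noisy'."""
    x = noisy.copy()
    for _ in range(num_steps):
        x = x + step_size * denoiser(x)
    return x

# Toy denoiser: a real model learns this; here it simply points at a target.
target = np.ones((8, 8))
denoiser = lambda x: target - x
noise = np.random.default_rng(0).standard_normal((8, 8))
restored = reverse_diffuse(noise, denoiser)
```

Each step removes only a sliver of noise, but a thousand slivers carry the image from static all the way back to the target.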

While blurring the image is a mechanical process, making it clear again is a search for something like meaning. The model is gradually “trained” by adjusting hundreds of billions of parameters — think of small dimmer switches that control a circuit of lights from full off to full on — within neural networks in the code, to “turn up” steps that improve the likelihood of a meaningful image and “turn down” steps that don’t. Performing this process over and over on many images, adjusting the model’s parameters each time, eventually tunes the model to take a meaningless image and develop it through a series of steps into an image that resembles the original input image.
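Here is a cartoon version of that tuning loop, with one “dimmer switch” per step and a simple keep-or-reject rule. This is pure illustration: real training uses gradient descent over billions of parameters, and `target` here stands in for one real training image:

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.ones((8, 8))       # stand-in for one real training image
num_steps = 20
weights = np.zeros(num_steps)  # the "dimmer switches", all starting at off

def denoise(noise, w):
    """Run the reverse chain: each weight controls how hard that
    step pulls the image back toward meaning."""
    x = noise.copy()
    for t in range(num_steps):
        x = x + w[t] * (target - x)
    return x

def score(w):
    """Judge the chain by how closely its output matches the real image."""
    noise = np.random.default_rng(1).standard_normal(target.shape)
    return -float(np.mean((denoise(noise, w) - target) ** 2))

for _ in range(500):
    t = rng.integers(num_steps)
    trial = weights.copy()
    trial[t] += 0.1 * rng.standard_normal()  # nudge one dimmer
    if score(trial) > score(weights):        # "turn up" only if images improve
        weights = trial
```

After a few hundred keep-or-reject rounds, the tuned dimmers reconstruct the target from noise far better than the untouched ones.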


To generate images that have text meanings associated with them, words describing the training images are simultaneously passed through the noising and denoising chains. In this way, the model is trained to produce not only an image with a high probability of meaning, but also one with a high probability that the same descriptive words are associated with it. The creators of DALL-E trained it on a huge collection of images with associated meanings, pulled from all over the Internet. DALL-E can generate images matching such a strange range of input phrases because that is what the Internet contains.
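The conditioning idea can be sketched in miniature. Everything below is hypothetical: a bag-of-words caption “embedding” stands in for the learned text encoders real models use, and one step pulls the image toward training imagery whose caption best matches the prompt:

```python
import numpy as np

# Tiny caption "embeddings": real models learn these; here it's bag-of-words.
vocab = {"dog": 0, "banana": 1, "sky": 2}

def embed_caption(words):
    v = np.zeros(len(vocab))
    for w in words:
        v[vocab[w]] += 1.0
    return v

def conditioned_step(x, caption_vec, image_bank, caption_bank):
    """One toy denoising step: pull the image toward training imagery
    whose caption best matches the prompt, so words steer the denoising."""
    best = int(np.argmax(caption_bank @ caption_vec))
    return x + 0.5 * (image_bank[best] - x)

# Three training images, each paired with a one-word caption.
image_bank = np.stack([np.full((4, 4), v) for v in (1.0, 2.0, 3.0)])
caption_bank = np.stack([embed_caption([w]) for w in ("dog", "banana", "sky")])

x = np.zeros((4, 4))
x = conditioned_step(x, embed_caption(["banana"]), image_bank, caption_bank)
```

The prompt’s words, not the starting pixels, decide which direction each denoising step takes — which is how a text phrase ends up steering an image.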


These images were created using a generative AI called Stable Diffusion, which is similar to DALL-E. The prompt for the creation of the images read, “Color photo of Abraham Lincoln drinking beer in front of the Seattle Space Needle with Taylor Swift.” Taylor Swift looked a little creepy in the first image, but maybe that’s how she would look to Abraham Lincoln after a few beers. (Image credit: Big Think, Stable Diffusion)

The inner workings of a diffusion model are complex. Despite the organic feel of its creations, the process is entirely mechanical and based on probability calculations. (This paper works through some of the equations. Warning: the math is difficult.)

At its core, the math involves breaking difficult operations into separate, smaller, simpler steps that computers can do almost as well, but much faster. The mechanics of the code are understandable, but the system of optimized parameters that its neural networks pick up during training is utter gibberish. A set of parameters that produces good images is indistinguishable from a set that produces bad images — or near-perfect images with some unknown but serious flaw. Therefore, we cannot predict how well or why an AI like this will work. We can only judge whether the results look good.

Are generative AI models intelligent?

So it’s very hard to tell how much DALL-E is like a person. The best answer is probably not at all. People don’t learn or create that way. We don’t take sensory data from the world and reduce it to random noise, nor do we create new things by starting with total randomness and then denoising it. The eminent linguist Noam Chomsky has pointed out that a generative model like GPT-3 produces words in a meaningful language no differently than it would produce words in a meaningless or impossible language. In this sense, it has no concept of the meaning of language, a fundamentally human quality.


These images were created using a generative AI called Stable Diffusion, which is similar to DALL-E. The prompt used to create the images: “Portrait of Conan Obrien in the style of Vincent Van Gogh.” (Image credit: Big Think, Stable Diffusion)

Even if they aren’t like us, are they intelligent in some other way? In the sense that they can do very complex things, sort of. Then again, a computer-controlled lathe can produce highly complex metal parts. By the definition of the Turing test (that is, determining whether a machine’s output is indistinguishable from a real person’s), such an AI certainly could be called intelligent. On the other hand, extremely simplistic and hollow chatbot programs have passed for people for decades. But nobody thinks machine tools or rudimentary chatbots are intelligent.

A better intuitive understanding of current generative AI programs might be to think of them as extraordinarily capable idiot mimics. They are like a parrot that can listen to human speech and produce not just human words, but groups of words in the right patterns. If a parrot listened to soap operas for a million years, it could probably learn to string together emotionally overwrought, dramatic interpersonal dialogue. And if you spent those million years giving it crackers for finding better phrases and yelling at it for bad ones, it could get better still.

Or consider another analogy. DALL-E is like a painter who lives his whole life in a gray, windowless room. You show him millions of landscape paintings with the names of the colors and motifs attached. Then you give him paints with color labels and ask him to mix the colors and create patterns that statistically mimic the motif labels. He makes millions of random paintings, compares each one to a real landscape, and then adjusts his technique until they start looking realistic. But he couldn’t tell you anything about what a real landscape is.

Another way to gain insight into diffusion models is to look at the images produced by a simpler one. DALL-E 2 is the most advanced of its kind. Version one of DALL-E often produced images that were almost right but clearly not quite, such as dragon-giraffes whose wings were not properly attached to their bodies. A less powerful open-source competitor is known for producing disturbing images that are dreamlike, bizarre, and not entirely realistic. The flaws inherent in a diffusion model’s meaningless statistical mashups are on plain display there, rather than hidden as they are in the far more sophisticated DALL-E 2.

The future of generative AI

Whether you find it wondrous or terrifying, it seems we have just entered an age in which computers can generate convincing fake images and phrases. It is bizarre that an image meaningful to a person can be generated from mathematical operations on near-meaningless statistical noise. While the machinations are lifeless, the result looks like something more. We’ll see whether DALL-E and other generative models evolve into something with a deeper kind of intelligence, or remain only the world’s greatest idiot mimics.

