I finally got access to GPT-4V and DALL·E 3. So to test it out, I decided to take some posts I’ve tweeted about cute things my son has done in the past year, and see whether DALL·E 3 could illustrate them.
I said: ‘Create a light-hearted image to illustrate the following story involving a toddler’ and pasted the text, then saved the first image it produced. I didn’t refine the images with any follow-up prompts.
There’s lots more to explore in what this visual functionality could do, and how it could be combined with other emerging tools. But for now, let’s start with a simpler question: was it any good at producing images from this simple prompt? Read on and judge for yourself…
My son didn’t believe me that the brown bear in The Snail and the Whale wasn’t a polar bear, so I showed him a photo…
Which made him want to fact check the existence of every animal in the book…
“HUMPBACK WHALE PHOTO PLEASE”
“PENGUIN PHOTO PLEASE”
Large parcel just got delivered.
“Is it croissants?” asks my son.
My son has been getting various pots and dishes out and calling them all ‘bowl’ and now I’m doubting my knowledge of what a bowl is.
Mindfulness tip from my toddler: when the world gets a bit stressful, go outside and count the bees.
TOOTHBRUSH AIRPLANE! Although my son is happy to see me after a week away, he seems especially interested in the adventures of my toiletries.
Toddler now insisting that I read my own book (which he’s chosen) and he reads his… but of course then listening in on me reading.
So how did it do?
You’ll probably have your own opinions on the artistic qualities of the above. But it’s notable that, with such sparse prompts, DALL·E 3 shows a clear default style preference – and cultural bias – in the images it produces. There are also occasional errors and typos in the generation process (although it’s much, much better than DALL·E 2).
I think another big factor in how useful this becomes will be how it works with memory. The Transformer architecture that converts language input to output (i.e. the ‘T’ in GPT, and part of the pipeline behind DALL·E) was revolutionary because it framed language as a problem of where to place attention. This approach has become very good at taking text requests and generating short text or image outputs – and, in the case of GPT-4V, taking images as input and generating text outputs. However, it seems to me that generating a series of stylistically consistent images (e.g. multiple images for a story) is a much harder problem, because the model has to retain so much visual information – and hence manage memory – at each step.
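To make the ‘where to place attention’ idea a bit more concrete, here’s a minimal sketch of scaled dot-product attention in Python/NumPy. The dimensions and variable names are illustrative only, not taken from any particular model:

```python
import numpy as np

def scaled_dot_product_attention(queries, keys, values):
    """Minimal illustration of the attention step at the heart of the Transformer.

    Each output row is a weighted average of the value vectors, where the
    weights reflect how strongly each query position 'attends' to each key.
    """
    d_k = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d_k)           # similarity between positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax: where to place attention
    return weights @ values                            # blend of values, guided by the weights

# Toy example: 4 token positions, each with an 8-dimensional embedding
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)            # self-attention over the sequence
print(out.shape)  # (4, 8)
```

Each output position is built by looking back across the whole input, which is exactly what makes the approach so good at short, self-contained text or image requests.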
There are already lots of examples out there of people using ChatGPT to generate written stories, then Midjourney to generate accompanying images. But often these examples either generate only a single image for the story, or generate a sequence of images that are stylistically quite variable (e.g. characters and settings don’t look the same from page to page, which would definitely frustrate my two-year-old).
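One rough workaround people try is to fix a ‘style and character sheet’ once, then prepend it to every page’s prompt, so each request carries the same visual context even though the image model itself has no memory of earlier pages. Here’s a minimal sketch of that idea; `generate_image`, the style text and the page descriptions are all hypothetical placeholders, not any particular API:

```python
# Hypothetical sketch: keep a story's images consistent by threading a shared
# style/character description through every per-page prompt. `generate_image`
# stands in for whichever text-to-image API is actually being called.

STYLE_SHEET = (
    "Children's picture-book illustration, soft watercolour palette. "
    "Recurring character: a curious two-year-old boy with brown hair, "
    "wearing a yellow raincoat. Keep the character and palette identical "
    "in every image."
)

PAGES = [
    "The boy checks a photo of a humpback whale on a tablet.",
    "The boy counts bees in the garden to calm down.",
    "The boy flies a toothbrush around the room like an airplane.",
]

def generate_image(prompt: str) -> str:
    """Placeholder for an image-generation call; returns a fake URL here."""
    return f"https://example.com/image?prompt={hash(prompt) & 0xffff:x}"

def illustrate(pages: list[str]) -> list[str]:
    # The key step: every request re-states the same style sheet, because the
    # image model has no memory of what it drew for earlier pages.
    return [generate_image(f"{STYLE_SHEET} Scene: {page}") for page in pages]

for url in illustrate(PAGES):
    print(url)
```

Re-stating the context on every call is a blunt substitute for real memory, which is partly why page-to-page consistency remains hard.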
I’ve been working on a few applications of LLMs recently (more on this in a follow-up post), and the difference between good and bad performance often comes down to how information is synthesised and passed between steps. So I’m really curious to see what progress people can make on this challenge for narrative sequences of images – and what the next chapter will bring.