Do Androids Dream of Electric Sheep?
Humans dream, almost magically: we process the data we collected in the past by putting it into context, based on what we know about the world, and spice it with emotions and other subjective factors.
Is it possible to repeat this process with AI engines that can process data and generate an interpretable visual output? If yes, how will the outputs differ if we feed the same input to different image generators? A variety of factors, including recent experiences, emotions, and memories, can influence our dreams. Are these influences a uniquely human characteristic? As the pictures are ultimately zeros and ones, and image generators lack emotions, we can expect the output to be synthetic.
In this experiment, I will try to mimic this mystic process of dreaming and interpretation with multiple AI engines, using each for what it does best. The goal is to find the similarities and differences in their outputs to better understand the various kinds of image generation processes.
The pictures come from my trip to Berlin. They are the inputs of the first tool, similar to the visual inputs that we collect for our dreams throughout the day.
I feed them to the CLIP Interrogator, which collects prompts capable of describing each picture, similar to when you try to recall the most important and tangible parts of your dream. I will use the most descriptive and accurate prompt for each image; in my experience so far, that is the first one.
Those first interrogated prompts will be the base prompts for Midjourney, similar to a condensed summary of the pictures that you recall from your dream. Midjourney then creates the final output from the information provided, similar to a friend who is passionate about both our dreams and drawing.
This process is like talking about your recent dream to a friend and asking him/her to draw your dream! But in our case, it is more like a robot talking about its recent dream to a robot friend and asking it to draw the dream!
(Yes, humans process way more information than a couple of images when they’re dreaming. The chosen pictures are meant to illustrate the process; accepting them as condensed visual summaries of the day may be excusable. There are far more sophisticated methods for this, but hey, dreams are interpreted from a vast amount of chaotic information anyway.)
After inputting the pictures above into the CLIP Interrogator, I received the following prompts:
From these, I created one long prompt by joining them with commas, and used it as input for Midjourney. It generated the following set of pictures:
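The prompt-building step can be sketched in a few lines. This is only an illustration: the helper name `build_prompt` is hypothetical, and the phrases below are placeholders loosely based on the elements discussed in this post, not the actual CLIP Interrogator output.

```python
def build_prompt(first_prompts):
    """Join the top-ranked prompt of each picture into one comma-separated prompt."""
    # Drop empty entries and stray trailing commas before joining.
    cleaned = [p.strip().rstrip(",") for p in first_prompts if p.strip()]
    return ", ".join(cleaned)


# Placeholder phrases, NOT the real interrogator output.
first_prompts = [
    "music in the park",
    "a bunch of bananas",
    "a person sitting at a table reading",
]

long_prompt = build_prompt(first_prompts)
print(long_prompt)
```

The order of the list is preserved, which matters here because the first picture ends up as the first part of the final prompt.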
So what do we see here?
- Music in the park is a central element in all the pictures. It could be the result of the order, as it was the first picture and the first prompt as well. Music is attributed to string and keyboard instruments (guitars and pianos) but not drums.
- Bananas are there as well, but somehow yellow is strongly represented in the final output, probably because the blue-yellow flag appears in one of the later prompts
- A person sitting at a table, reading is in most pictures, but the book is not identifiable
- Bicycles are in 3 out of 4 outputs but are very distorted
- Tunnels are not emphasized, but there are arches in most of the pictures. It could be an effect of having the words ‘statue’ and ‘art’ among the prompts. By contrast, buildings and stairs are far easier to detect
- The cans are detectable but placed in seemingly random places
- I do not see the statues there, but (on top of the surrealistic nature of the process itself) multiple elements seem to be influenced by Dalí’s artworks :)
- No flags at all, even though blue and yellow are the dominant colors of the pictures
- No escalator. Maybe MJ felt that the stairs would make it useless in a setting like this, or after a certain number of prompts the weights of the later elements become so small that they are not displayed in the output at all
The same prompts generate a different output with Stable Diffusion:
- Music in the park is the main element of the picture but somehow contains more instruments and diverse national characteristics
- Banana is quite hidden, but you can find it next to the graffiti on the wall
- No tunnels, but railing and stairs are framing the whole picture
- No cans at all, nor the counter
- Statues are in the background, realistically placed close to the buildings
- No flags, but the yellow color is quite visible
- No escalator, but the building in the background is quite tall
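To compare the two lists side by side, the observations can be encoded by hand and grouped. This is a rough sketch: the labels "yes", "no", "partial" and "unreported" are a simplification of the bullets above (for example, "partial" stands for the arches that appear instead of tunnels), and the `compare` helper is a hypothetical name.

```python
# Hand-encoded from the two bullet lists above; the labels are a
# simplification of the text, not measurements.
observations = {
    # element: (Midjourney, Stable Diffusion)
    "music": ("yes", "yes"),
    "banana": ("yes", "yes"),
    "reading person": ("yes", "unreported"),
    "bicycles": ("yes", "unreported"),
    "tunnels": ("partial", "no"),
    "cans": ("yes", "no"),
    "statues": ("no", "yes"),
    "flags": ("no", "no"),
    "escalator": ("no", "no"),
}


def compare(obs):
    """Group elements by whether the two generators agree on them."""
    groups = {"both show": [], "both miss": [], "differ": [], "unknown": []}
    for element, (mj, sd) in obs.items():
        if "unreported" in (mj, sd):
            groups["unknown"].append(element)
        elif mj == sd == "yes":
            groups["both show"].append(element)
        elif mj == sd == "no":
            groups["both miss"].append(element)
        else:
            groups["differ"].append(element)
    return groups


for label, elements in compare(observations).items():
    print(f"{label}: {', '.join(elements)}")
```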
So, to wrap up:
- Midjourney is an artistic dreamer, Stable Diffusion is a rational thinker: the overall output of Stable Diffusion has fewer details and more distortion but is somehow more coherent. While MJ’s output looks like a chaotic (but stylish) dream with a bunch of unrelated components, Stable Diffusion creates a more consistent scene, where citizens are on the street jamming together, next to graffiti and right behind the park. Although the individual parts of Stable Diffusion’s image look more artificial, the elements are placed in a more deliberate, realistic manner
- Abstract concepts will be materialized as objects unless you define an activity (music vs. reading while sitting at a table): in the case of MJ, some concepts, like music, are represented by certain objects, while others, like reading, are more likely to be displayed as an activity. It could be the result of the prompt itself, since reading is tied in the prompt to a person who is sitting at a table
- Recurring prompts will define the whole picture: only the flag and the banana should be yellow in the outputs, yet many instruments and clothes are also yellow, although the prompts do not require this
- Complex objects could be distorted: this is not necessarily general, but look at those bicycles... if you can even find them.
- The order matters: it could happen that after a certain point the generator won’t try to fit every prompt into one picture, as the escalator is missing from both generators’ outputs.
I am just a dreamer!
The complex process of dreaming has puzzled scientists and philosophers for centuries. During dreaming, certain sensory systems of the brain are activated, but the information is not always processed logically or coherently. This often results in bizarre and surreal dreams.
Interestingly, Midjourney is better at recalling and visualizing more of the input prompts, while Stable Diffusion creates a picture that is more abstract and coherent: it draws a scene from a weird street music festival, while Midjourney focuses on packing as many attributes as possible into the picture at the expense of consistency. It seems that the order of the prompts matters and that the prompts are not weighted equally.
Okay, but how are those pictures created?
Like many recent ML models, the AI image generation tool Midjourney and the prompt engineering tool CLIP Interrogator are challenging to interpret and explain, but a high-level summary could help you understand the core methodology. It is important to highlight that MJ is not open source, so I briefly summarise the technology assumed to be behind it, and use it to introduce Generative Adversarial Networks (or GANs), “the most interesting idea in the last 10 years in machine learning”, as Yann LeCun called them.
Hugging Face’s CLIP Interrogator
According to its creators, CLIP (Contrastive Language-Image Pre-Training), the model that provided the prompts from our input images, is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task.
To extract related text prompts from our image input, CLIP does not write text itself. It embeds the image and a set of candidate text snippets into a shared vector space and scores how well each snippet matches the image. The interrogator then ranks the candidates by this score and combines the best matches into prompts that describe the image’s content.
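The ranking idea can be illustrated with a toy example. This is a minimal sketch, not the interrogator's actual code: the three-dimensional vectors are made up and only stand in for real CLIP embeddings (which have hundreds of dimensions), and the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(image_embedding, text_embeddings):
    """Return (text, score) pairs, best match first."""
    scored = [
        (text, cosine_similarity(image_embedding, vec))
        for text, vec in text_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors standing in for real image and text embeddings.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "music in the park": [0.8, 0.2, 0.1],
    "a bunch of bananas": [0.1, 0.9, 0.1],
    "an escalator": [0.1, 0.2, 0.9],
}

ranking = rank_candidates(image_embedding, text_embeddings)
best_prompt = ranking[0][0]
print(best_prompt)
```

Picking `ranking[0]` mirrors the choice made in this experiment: only the first, best-scoring prompt of each picture is kept.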
Image generators
Image generation tools are trained on huge datasets of text-image pairs. A semantic representation of the text prompt is used as input to the image generation model. Here I describe one classic type of such model, the generative adversarial network (GAN); note that Stable Diffusion is a latent diffusion model rather than a GAN, and Midjourney’s architecture is not public. A GAN consists of two neural networks: a generator network and a discriminator network.
The generator network generates an image from random noise that should match the description. Similar to an honest, helpful friend, a trained discriminator network then evaluates the generated image and provides feedback to the generator on how to improve it. Both are neural networks, which can learn complex functions such as the underlying distribution of the data, so the generator keeps generating images and receiving feedback from the discriminator. This process goes on and on until the generator produces an image that is deemed to be of high quality and to match the text prompt description.
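The generator/discriminator loop described above can be sketched with a toy, one-dimensional GAN. This is an illustration under strong assumptions, not how Midjourney or Stable Diffusion work: the "images" are single numbers drawn around 4.0, the generator is a straight line, the discriminator is a logistic unit, and the learning rate and step count are arbitrary. Toy GANs like this tend to oscillate rather than converge cleanly.

```python
import math
import random


def sigmoid(s):
    """Numerically stable logistic function."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def discriminate(d, x):
    """Probability (0..1) that sample x is real, according to the discriminator."""
    return sigmoid(d["w"] * x + d["c"])


def generate(g, z):
    """Turn noise z into a fake sample."""
    return g["a"] * z + g["b"]


def discriminator_step(d, real, fake, lr):
    """One gradient step on -[log D(real) + log(1 - D(fake))]."""
    p_real = discriminate(d, real)
    p_fake = discriminate(d, fake)
    grad_w = -(1.0 - p_real) * real + p_fake * fake
    grad_c = -(1.0 - p_real) + p_fake
    d["w"] -= lr * grad_w
    d["c"] -= lr * grad_c


def generator_step(g, d, z, lr):
    """One gradient step on -log D(G(z)): the discriminator's feedback."""
    p_fake = discriminate(d, generate(g, z))
    common = -(1.0 - p_fake) * d["w"]  # gradient w.r.t. the fake sample
    g["a"] -= lr * common * z
    g["b"] -= lr * common


def train(steps=2000, lr=0.05, seed=0):
    rng = random.Random(seed)
    g = {"a": 1.0, "b": 0.0}  # generator parameters
    d = {"w": 0.0, "c": 0.0}  # discriminator parameters
    for _ in range(steps):
        real = rng.gauss(4.0, 0.5)           # a "real" sample
        fake = generate(g, rng.gauss(0, 1))  # a generated sample
        discriminator_step(d, real, fake, lr)
        generator_step(g, d, rng.gauss(0, 1), lr)
    return g, d


g, d = train()
print("generator offset b:", round(g["b"], 2))
```

The two step functions take turns: the discriminator learns to tell real from fake, and the generator uses the discriminator's gradient as its only feedback, which is the "honest friend" role from the paragraph above.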
You can find more info about GANs in this well-written article here.
May you rest in a deep and restless slumber
In the TV series Westworld, robots are programmed to dream during their sleep phase. Those dreams are controlled, but they can sometimes experience lucid dreaming, where they become aware that they are dreaming. This can lead to the bots questioning their reality and existence.
By the end of this experiment, we can conclude that abstraction, feature selection, and the weighting of different inputs are not things that only humans do while they are dreaming.
Currently, the process of creating images from prompts relies on iterating over a dataset that has been labeled by humans. The image creation process does not contain any anger, love, or any subjective feeling toward the specific image or topic. Even though those models have no emotional involvement toward the subjects, their choices seem quite arbitrary. Or maybe the labels already carry the labelers’ feelings, and in this way shape the outputs?