Credit: Alex Mordvintsev
For the layman, DeepDream is perhaps best explained as a kind of algorithmic pareidolia (the phenomenon of seeing faces or familiar objects in, for instance, a patch of clouds), and more often than not, the algorithm produces bizarre and fascinating images that are sure to catch the attention of anyone who comes across them. However, DeepDream is not just a fun gimmick.
While the behavior of trained neural networks are usually opaque to humans, techniques like DeepDream are helping researchers shed a light on their inner workings by allowing the networks to reveal to us what they are seeing. This has given rise to the field known as interpretability, which aims to make sense of the behavior of trained neural networks.
In this post, we will explore how DeepDream works at a high level through interactive visualizations. We will also program an implementation of the algorithm using PyTorch and apply it to a pretrained version of the Inception-v3 model. This implementation will make it easy for you to create customized âdreamsâ by running the algorithm on your own images and tweaking its parameters.
To understand how DeepDream works at an intuitive level, recall that neural networks rely on backpropagation and are therefore differentiable end-to-end. DeepDream exploits this property â not by minimizing a loss function â but by maximizing the output of a hidden layer of oneâs own choosing. Crucially, this optimization is not performed by tweaking the networkâs parameters, but by tweaking the input image itself. This has the interesting effect of the input image slowly morphing into an often surreal version of itself. As shown in the animation below, the modifications are applied iteratively, with each run accentuating the effects more and more.
Select a layer and click play. Darker colors represent higher output values.
More formally, assume that $x_n$ is the input image for iteration $n$ of the algorithm, and $y$ is the L2 norm of the corresponding output of an arbitrary hidden layer of a CNN. To maximize $y$, we obtain $\frac{\partial y}{\partial x_n}$ and update the input image using the update rule $x_{n+1} := x_n + \alpha \frac{\partial y}{\partial x_n}$ where $\alpha$ is some predetermined step size. This process is also known as gradient ascent.
At a high level, that is all there is to it. However, if we implement this exactly as described, the patterns that appear will occur at a low level of resolution across the image and will often suffer from high-frequency artifacts. While there is nothing inherently wrong with this, we can make use of a few tricks to give our images that extra âpop.â
To solve the resolution issue, we can start by constructing a Gaussian pyramid from the original image. This is simply a set of increasingly downscaled versions of the image referred to as âoctavesâ. The octaves are then processed using gradient ascent, from smallest to largest, for a fixed number of iterations. Importantly, since the gradient values are often extremely small, the gradient is normalized before being added to the image.
Following the gradient ascent steps, the difference between the untampered octave and its âdreamyâ counterpart is obtained as dream - octave
. In the subsequent iteration (this time with a larger octave), this difference is scaled up to match the size of the new octave. These are then added before being sent through the network, and the process is repeated.
When an octave has been processed, the changes to the image are extracted, upscaled and applied to the next octave before it too gets processed.
Upscaling the changes in this way ensures that the patterns become bigger and more defined, which usually leads to a more intriguing result. You will in other words notice that adjusting the number of octaves is an effective way of managing the size of the patterns in the final image.
The final thing we need to take care of is the occurrence of high-frequency artifacts that reduce the âsmoothnessâ of our images. One way to combat this is by using what the authors of DeepDream call âjitter,â which is simply randomized application of the np.roll
function. This means that the image is shifted along the $x$ and $y$ axis while maintaining its original dimensions. The animation below illustrates the combined application of octaves and jitter.
For the best results, jitter should be applied before each step of gradient ascent. I struggled for many hours trying to eliminate the noisy artifcats, with jitter seeming to make almost no difference. The problem turned out to be caused by the fact that I was only applying jitter once per octave.
Below is a side-by-side comparison of images optimized using either none, some, or all of the techniques mentioned here.
(1) original, (2) only gradient normalization, (3) gradient normalization and octaves, (4) gradient normalization, octaves, and jitter.
Each of the images below are the result of optimizing for different layers in the Inception-v3 model (Mixed_5b
through Mixed_7c
). You will notice that the patterns become gradually more complex as the depth of the layer increases. This is a clear demonstration of how each of the networkâs representations are built on top of other more rudimentary ones, going all the way down to the basic colors, edges, and shadows that are encoded in the earliest layers.
While optimizing for the output of deeper layers produces gradually more complex patterns, the images also reach a point at which most of the high-level structure appears more noisy than when optimizing for shallower layers. Unfortunately, this makes it difficult for complex patterns such as dogs, cats, cars or people to show up in the final image. I am still not exactly sure what is responsible for this but I would love to know!
Optimizing photographs is fun, but it is not the most useful application of DeepDream from a research perspective. To get the clearest view into a networkâs learned abstractions, one can prevent the input image from biasing the output by starting with an image of random noise. The output thus becomes a more crystallized version of what the layer or filter has learned to recognize.
Personally, I find that the most effective (and beautiful) way of optimizing random noise is by progressively zooming into the image while running gradient ascent for a fixed number of steps in each frame. Rather than optimizing for whole layers, optimizing for specific channels usually gives the best results. Below are a few examples.
Optimizing for random channels in the Inception-v3 modelâs Mixed_5
and Mixed_6
layers, starting with random noise and using progressive zoom.
The provided implementation (see below) makes it easy for you to create images like this, at any resolution you like. Further, it allows you to not only optimize for any layer or channel in the network, but for any combination of layers and channels. The possibilities are endless!
To help you run DeepDream on your own images, check out the GitHub repo here.
The entrypoint for the algorithm is contained in the file dream.py
. This is also where you can tweak its hyperparameters such as num_octaves
, steps_per_octave
, and step_size
. Perhaps most importantly, the code contains the following dictionary:
Changing the dictionary values lets you select which channels and/or layers to optimize. Supplying "all"
as a value optimizes the whole layer while a tuple of ints optimizes the specific channels located at the corresponding indices.
To optimize an image, simply load it using PIL
and pass it to the dream
function:
img = PIL.Image.load("your_image.jpg")
dream_img = dream(img, num_octaves=3, steps_per_octave=100, settings=settings)
dream_img.save("dream.jpg")
In order to optimize random noise using progressive zoom, use the functions random_noise
and zoom_dream
:
noise = helpers.random_noise(250, 250)
dream_noise = zoom_dream(noise, num_frames=50, steps_per_frame=50, settings=settings)
dream_noise.save("dream_noise.jpg")
On GitHub, you will find a Jupyter notebook which makes it easy to try out both approaches.
While DeepDream can be a useful tool for researchers, it is also simply a fun and interesting way of breathing new, surreal life into images of all kinds.
Personally, I find many of DeepDreamâs images incredibly strange yet oddly familiar. They are reminiscent of phosphenes (the phenomenon of seeing light without light actually entering the eye), which is fascinating given that it hints at the fact that the visual abstractions in the brain might be similar to the ones encoded in neural networks.
Despite the fact that neural networks are âtoy modelsâ of the human brain, there are similarities between the two. Perhaps we will one day be using interpretability techniques to gain insight into the inner workings of the human mind. If that is not an exciting prospect, I donât know what is.