Stable Diffusion V1

Generate photorealistic images

This model is also available through the function StableDiffusionSynthesize in the Wolfram Function Repository.

Released in 2022, Stable Diffusion V1 is a latent text-to-image diffusion model. The model is conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder. The authors introduce a progressive training approach for generating high-quality, diverse images with latent diffusion models, which gradually refine an image by iteratively optimizing the latent-space representation obtained from powerful autoencoders. Latent diffusion models (LDMs) achieve highly competitive performance on various tasks, including unconditional image generation, inpainting and super-resolution, while significantly reducing computational requirements compared to pixel-based diffusion models (DMs). The model is trained on 512x512 images obtained from a subset of the LAION-5B database.

Training Set Information

LAION-5B, a dataset of 5.85 billion CLIP-filtered image-text pairs. It is 14 times larger than LAION-400M, previously the largest openly accessible image-text dataset in the world.

Model Information

Examples

Resource retrieval

Get the pre-trained net:

In[1]:=

Out[1]=

NetModel parameters

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:=

Out[2]=

Pick a non-default net by specifying the parameters:

In[3]:=

Out[5]=

Pick a non-default uninitialized net:

In[6]:=

Out[7]=

Evaluation function

Write an evaluation function to automate the image generation:

In[8]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/56e2e139-7e1a-4568-8225-d4174bcf71d2"]

Basic usage

Define a test prompt:

In[9]:=

prompt = "funny photo of a cute fluffy cat enjoying a morning croissant and drinking coffee in a balcony with breathtaking view, 3d render, cinematic, hyperdetailed, cartoon, animation, pixar disney render";

Generate an image based on the test prompt:

In[10]:=

In[11]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/f8238ede-4392-44ea-8e2b-366848205b3a"]

Set the guidance scale, which defaults to 7.5, and regenerate the image. Lowering the guidance scale lets the output deviate further from the prompt and can introduce excessive randomness, reducing fidelity to the requested characteristics:

In[12]:=

Out[13]=
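
As a rough illustration of what the guidance scale controls, the sketch below applies the usual classifier-free guidance rule to toy vectors. The rule, the helper name guide and the numbers are assumptions made for illustration; the source does not show how the evaluation function combines the two predictions internally.

```wolfram
(* Assumed classifier-free guidance rule: start from the unconditioned
   prediction and move toward the prompt-conditioned one, scaled by s *)
guide[epsCond_, epsUncond_, s_] := epsUncond + s*(epsCond - epsUncond);

(* Toy stand-ins for the two noise predictions of the U-Net *)
epsCond = {1/5, -1/2, 1};
epsUncond = {1/10, 3/10, -2/5};

(* Distance of the guided prediction from the unconditioned one for several scales *)
Table[
 {s, N[Norm[guide[epsCond, epsUncond, s] - epsUncond]]},
 {s, {1, 3, 15/2}}]
```

Under this rule the distance grows linearly with the scale, which is consistent with the observation that a low scale leaves the output only loosely tied to the prompt.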

A negative prompt addresses specific problems in the generated images by steering the model away from undesired features or qualities. Define a negative prompt and regenerate the image:

In[14]:=

negativePrompt = "Images cut out at the top, left, right, bottom, bad anatomy,bad composition, poorly rendered face, poorly rendered paws";

In[15]:=

netEvaluate[<|
"Prompt" -> prompt, "NegativePrompt" -> negativePrompt
|>]

Out[16]=

A reference image guides the model toward a desired style or desired attributes. Regenerate the image using a negative prompt and a reference image, letting the diffusion process run through 90% of its steps:

In[17]:=

newPrompt = "funny photo of a cute fluffy cat enjoying a morning croissant and drinking coffee in a balcony with breathtaking view to the Eiffel Tower, 3d render.";
newNegativePrompt = "bad view, bad composition, cropped image, missing view, missing cup of coffee";
diffusionStrength = 0.9;

Note that a negative prompt does not guarantee that every problem is resolved:

In[18]:=

netEvaluate[<|
"Prompt" -> newPrompt, "NegativePrompt" -> newNegativePrompt,
"Image" -> (img -> diffusionStrength)
|>]

In[19]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/7f4d9124-9c73-4ae0-a368-ab8a9a8c22ce"]
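
To make the effect of the diffusion strength concrete, here is a minimal sketch of how a strength value could translate into a number of denoising steps. The helper stepsToRun and the floor-based rule are assumptions borrowed from common image-to-image pipelines; the evaluation function used on this page may count steps differently.

```wolfram
(* Hypothetical helper: number of denoising steps actually run when the
   process starts from a partially noised reference image *)
stepsToRun[totalSteps_Integer, strength_] :=
  Min[Floor[totalSteps*strength], totalSteps];

stepsToRun[50, 0.9]  (* 45 *)
stepsToRun[20, 0.9]  (* 18 *)
stepsToRun[20, 1]    (* 20: the reference image is fully replaced by noise *)
```

Under this assumption, a strength close to 1 leaves little of the reference image in the result, while a small strength keeps the output close to it.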

Possible issues

Using a classifier-free guidance scale smaller than 1 might not produce a meaningful result:

In[20]:=

netEvaluate[
<|"Prompt" -> (prompt -> 0)|>,
MaxIterations -> 20,
RandomSeeding -> 1234
]

Out[21]=

Note that using the same text for both the prompt and the negative prompt yields a result identical to the one obtained without classifier-free guidance. However, it takes twice as long to process:

In[22]:=

netEvaluate[
<|
"Prompt" -> prompt, "NegativePrompt" -> prompt
|>,
MaxIterations -> 20,
RandomSeeding -> 1234
]

Out[23]=
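
The reason identical prompts cancel out can be seen symbolically. This is a sketch under the assumption that the negative-prompt prediction takes the place of the unconditioned one in the usual guidance rule; the helper guide and the symbols eps and s are illustrative and do not come from the model's implementation.

```wolfram
(* Assumed guidance rule with a negative prompt: extrapolate from the
   negative-prompt prediction toward the prompt prediction *)
guide[epsCond_, epsNeg_, s_] := epsNeg + s*(epsCond - epsNeg);

(* Same text on both sides: both predictions are the same symbol eps,
   so the guidance term vanishes for every scale s *)
guide[eps, eps, s]  (* eps *)
```

The doubled processing time is plausibly explained by the two predictions still being computed at every step, even though their difference is zero.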

Visualize backward diffusion

Define the prompt and the negative prompt:

In[24]:=

prompt = "Red convertible car parked near hot air balloons";
negativePrompt = "Images cut out at the top, left, right, bottom, bad composition, watermark, out of frame, unreal engine, unnatural";

Visualize the backward diffusion at intervals of 25% of the generation process:

In[25]:=

Out[69]=

Variational autoencoder

Define a test image:

In[70]:=

(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/49ac9611-eae9-4264-81e5-147fa233bbd4"]

In the Stable Diffusion training pipeline, the encoder maps the input image to a low-dimensional latent representation, which serves as the input to the U-Net model. After training, it is not needed to generate images from text alone. Get the variational autoencoder (VAE) encoder:

In[71]:=

Out[71]=

Encode the test image to obtain the mean and variance of the latent space distribution conditioned on the input image:

In[72]:=

In[73]:=

Out[73]=
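
The mean and variance describe a Gaussian over latent values, so a latent can be drawn from them. The sketch below does this with the standard VAE reparameterization on toy arrays; the numbers and variable names are made up, and the source does not say whether the evaluation function samples from the distribution or simply uses the mean.

```wolfram
(* Toy stand-ins for the encoder outputs; the real ones are arrays
   over the latent grid *)
mean = {{0.5, -1.2}, {0.3, 0.8}};
variance = {{0.01, 0.04}, {0.09, 0.0025}};

(* Reparameterization: sample = mean + standard deviation * unit Gaussian noise *)
SeedRandom[1234];
noise = RandomVariate[NormalDistribution[], Dimensions[mean]];
sample = mean + Sqrt[variance]*noise
```

With small variances, as in this toy case, a sample stays close to the mean, which is why the mean alone can be read as a compressed version of the image.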

The distribution mean can be interpreted as a compressed version of the image. Calculate the compression ratio:

In[74]:=

(* Number of elements in the encoder input *)
inputSize = Times @@ NetExtract[
    NetModel[{"Stable Diffusion V1", "Part" -> "Encoder"}], {"Input", "Output"}];
(* Number of elements in the latent mean *)
outputSize = Times @@ Dimensions[encoded["Mean"]];
N[outputSize/inputSize]

Out[75]=
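
For orientation, the same ratio can be worked out by hand. The dimensions below are assumptions (a 3-channel 512x512 input and a 4-channel 64x64 latent, the figures commonly quoted for Stable Diffusion V1), not values read from the net:

```wolfram
inputDims = {3, 512, 512};   (* assumed encoder input: RGB image at 512x512 *)
latentDims = {4, 64, 64};    (* assumed latent mean: 4 channels, 8x spatial downsampling *)

ratio = (Times @@ latentDims)/(Times @@ inputDims)  (* 1/48 *)
N[ratio]                                            (* 0.0208333 *)
```

If these dimensions hold, the latent carries about 2% as many values as the input image.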

In the Stable Diffusion pipeline, the denoised latent representations generated through the reverse diffusion process are transformed back into images using the VAE decoder. Get the VAE decoder:

In[76]:=