# Wolfram Neural Net Repository

Immediate Computable Access to Neural Net Models

Generate photorealistic images

Released in 2022, Stable Diffusion V1 is a latent text-to-image diffusion model. The model is conditioned on the (non-pooled) text embeddings of a CLIP ViT-L/14 text encoder. The authors introduce an approach that generates high-quality and diverse images with latent diffusion models, gradually refining the image by iteratively optimizing a latent-space representation obtained from powerful autoencoders. Latent diffusion models (LDMs) achieve highly competitive performance on various tasks, including unconditional image generation, inpainting and super-resolution, while significantly reducing computational requirements compared to pixel-based diffusion models (DMs). The model is trained on 512×512 images obtained from a subset of the LAION-5B database.

- LAION-5B, containing 5.85 billion CLIP-filtered image-text pairs, is 14x larger than LAION-400M, previously the largest openly accessible image-text dataset in the world.
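The reverse-diffusion process described above can be sketched in a few lines of numpy. This is an illustrative toy, not the Wolfram implementation: the stand-in denoiser and the crude update rule are hypothetical, and the real U-Net additionally conditions on the CLIP text embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(z, t):
    """Stand-in for the U-Net noise predictor (hypothetical; the real
    model also conditions on CLIP text embeddings and the timestep)."""
    return 0.1 * z  # pretend the predicted noise is proportional to z

def sample_latent(num_steps=50, latent_shape=(4, 64, 64)):
    """Minimal sketch of reverse diffusion in latent space: start from
    Gaussian noise and iteratively refine the latent representation."""
    z = rng.standard_normal(latent_shape)
    for t in range(num_steps, 0, -1):
        eps_pred = toy_denoiser(z, t)
        z = z - eps_pred / num_steps  # crude update; real samplers follow a DDPM/DDIM schedule
    return z

z = sample_latent()
print(z.shape)  # (4, 64, 64) -- the VAE decoder maps this to a 512x512 image
```

The key point is that every iteration operates on the small 4×64×64 latent rather than the 512×512×3 pixel array, which is where the computational savings of LDMs come from.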

Get the pre-trained net:

In[1]:= |

Out[1]= |

This model consists of a family of individual nets, each identified by a specific parameter combination. Inspect the available parameters:

In[2]:= |

Out[2]= |

Pick a non-default net by specifying the parameters:

In[3]:= |

Out[5]= |

Pick a non-default uninitialized net:

In[6]:= |

Out[7]= |

Write an evaluation function to automate the image generation:

In[8]:= |

Define a test prompt:

In[9]:= |

Generate an image based on the test prompt:

In[10]:= |

In[11]:= |

Define the guidance scale, which is set to 7.5 by default, and regenerate the image. Lowering the guidance scale can produce outputs that deviate further from the specified prompt and introduce excessive randomness, resulting in less fidelity to the desired characteristics:

In[12]:= |

Out[13]= |
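The guidance scale enters through the classifier-free guidance combination of the conditional and unconditional noise predictions. A minimal numpy sketch (illustrative only; the function name is hypothetical, not part of the Wolfram API):

```python
import numpy as np

def apply_guidance(eps_uncond, eps_cond, scale=7.5):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. scale=1 reproduces the plain
    conditional prediction; larger scales follow the prompt more
    closely at the cost of diversity."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u = np.zeros(4)  # toy unconditional noise prediction
eps_c = np.ones(4)   # toy prompt-conditional noise prediction
print(apply_guidance(eps_u, eps_c, scale=7.5))  # [7.5 7.5 7.5 7.5]
print(apply_guidance(eps_u, eps_c, scale=1.0))  # [1. 1. 1. 1.]
```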

The negative prompt is used to address specific issues or challenges in generating desired images. Providing a negative prompt lets the model avoid or minimize certain undesired features or qualities in the generated image. Define the negative prompt and recreate an image:

In[14]:= |

In[15]:= |

Out[16]= |

The reference image is used to guide the model toward generating images with a desired style or attributes. Recreate an image using both the negative prompt and the reference image, allowing the diffusion process to progress through 90% of its steps:

In[17]:= |
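Running "through 90% of the steps" follows the usual image-to-image scheme: the reference image's latent is forward-diffused to an intermediate timestep, and denoising covers only the remaining schedule. A hedged numpy sketch, with a hypothetical helper and a toy noise schedule (exact indexing conventions vary between implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_to_strength(z0, alphas_cumprod, strength=0.9):
    """Forward-diffuse a reference latent z0 to the timestep implied by
    `strength`; denoising then runs only over the remaining steps
    (strength=0.9 -> roughly 90% of the schedule). One plausible
    indexing choice; real pipelines may differ by one step."""
    num_steps = len(alphas_cumprod)
    t_start = int(num_steps * strength) - 1
    abar = alphas_cumprod[t_start]
    eps = rng.standard_normal(z0.shape)
    # standard forward-diffusion closed form: z_t = sqrt(abar)*z0 + sqrt(1-abar)*eps
    z_t = np.sqrt(abar) * z0 + np.sqrt(1.0 - abar) * eps
    return z_t, t_start

alphas_cumprod = np.linspace(0.999, 0.01, 50)  # toy noise schedule
z0 = rng.standard_normal((4, 64, 64))          # latent of the reference image
z_t, t_start = noise_to_strength(z0, alphas_cumprod, strength=0.9)
print(t_start)  # 44 -- denoising proceeds from here back to step 0
```

At strength 1.0 the reference latent is fully noised and its influence vanishes; at low strengths most of the reference structure survives.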

Please note that even when the negative prompt is used, it does not guarantee that every issue is resolved:

In[18]:= |

In[19]:= |

Using a guidance scale smaller than 1 (which disables classifier-free guidance) might not produce a meaningful result:

In[20]:= |

Out[21]= |

Note that using the same text for both the prompt and the negative prompt cancels the guidance term, giving an outcome identical to disabling classifier-free guidance. However, this approach takes twice as long in terms of processing time, since both prediction branches are still evaluated:

In[22]:= |

Out[23]= |
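This equivalence follows directly from the guidance formula: when the negative prompt equals the prompt, the two noise predictions coincide and the scale term cancels for every guidance scale. A quick numerical check (illustrative numpy, not the Wolfram implementation):

```python
import numpy as np

def apply_guidance(eps_uncond, eps_cond, scale):
    # classifier-free guidance combination
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps = np.array([0.2, -0.5, 1.0])  # toy noise prediction

# Same text for prompt and negative prompt => identical predictions,
# so the guidance term vanishes regardless of the scale.
for scale in (1.0, 7.5, 20.0):
    assert np.allclose(apply_guidance(eps, eps, scale), eps)
print("guidance cancels when both predictions agree")
```

The cost, however, is unchanged: both branches are still run through the U-Net, hence the doubled processing time.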

Define the prompt and the negative prompt:

In[24]:= |

Visualize the backward diffusion at intervals of 25% of the generation process:

In[25]:= |

Out[69]= |
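Choosing which steps to snapshot at 25% intervals is straightforward bookkeeping over the sampling loop. A small sketch (illustrative; the denoising step itself is elided):

```python
num_steps = 50
fractions = (0.25, 0.50, 0.75, 1.00)
# steps (1-indexed) at which to capture the intermediate latent
snapshot_steps = {round(num_steps * f) for f in fractions}

snapshots = []
for step in range(1, num_steps + 1):
    # ... one reverse-diffusion update of the latent would happen here ...
    if step in snapshot_steps:
        snapshots.append(step)  # in practice: decode the latent and store the image
print(snapshots)  # [12, 25, 38, 50]
```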

Define a test image:

In[70]:= |

In the Stable Diffusion training pipeline, the encoder is responsible for encoding the reference image into a low-dimensional latent representation, which serves as the input to the U-Net model. After training, it is not used when generating images from text alone. Get the variational autoencoder (VAE) encoder:

In[71]:= |

Out[71]= |

Encode the test image to obtain the mean and variance of the latent space distribution conditioned on the input image:

In[72]:= |

In[73]:= |

Out[73]= |
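Given the mean and (log-)variance, a latent is drawn with the standard VAE reparameterization trick; taking the mean directly corresponds to the deterministic "compressed" latent discussed next. A hedged numpy sketch with a hypothetical helper name:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior(mean, logvar):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).
    As sigma -> 0 the sample collapses onto the mean."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * logvar) * eps

mean = np.zeros((4, 64, 64))
logvar = np.full((4, 64, 64), -30.0)  # tiny variance: sample ~ mean
z = sample_posterior(mean, logvar)
print(np.allclose(z, mean, atol=1e-4))  # True
```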

The distribution mean can be interpreted as a compressed version of the image. Calculate the compression ratio:

In[74]:= |

Out[75]= |
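Assuming the shapes used elsewhere on this page (a 512×512×3 RGB image and a 4-channel latent downsampled 8× per side to 64×64), the element-count ratio works out to 48:

```python
image_elems  = 512 * 512 * 3   # RGB pixel values
latent_elems = 64 * 64 * 4     # 4-channel latent, 8x smaller per side
print(image_elems / latent_elems)  # 48.0
```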

In the Stable Diffusion pipeline, the denoised latent representations generated through the reverse diffusion process are transformed back into images using the VAE decoder. Get the VAE decoder:

In[76]:= |

Out[76]= |

Decode the result to reconstruct the image by choosing the mean of the posterior distribution:

In[77]:= |

Out[77]= |

Compare the results:

In[78]:= |

Out[78]= |

Inspect the number of parameters of all arrays in the net:

In[79]:= |

Out[79]= |

Obtain the total number of parameters:

In[80]:= |

Out[80]= |
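The total is simply the sum of the element counts of every weight array in the net. A sketch with hypothetical array names and shapes (not the real Stable Diffusion arrays):

```python
import numpy as np

# hypothetical weight arrays keyed by layer name
arrays = {
    "conv1.weights": (320, 4, 3, 3),
    "conv1.biases":  (320,),
    "attn.qkv":      (320, 960),
}
counts = {name: int(np.prod(shape)) for name, shape in arrays.items()}
total = sum(counts.values())
print(total)  # 319040
```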

Obtain the layer type counts:

In[81]:= |

Out[81]= |

- R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, "High-Resolution Image Synthesis with Latent Diffusion Models," arXiv:2112.10752 (2022)
- Available from: https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/main
- Rights: Copyright (c) 2022 Robin Rombach, Patrick Esser and contributors. Licensed under the CreativeML Open RAIL-M license dated August 22, 2022