Wolfram Function Repository
Instant-use add-on functions for the Wolfram Language
Function Repository Resource: StableDiffusionSynthesize
Synthesize images using the Stable Diffusion neural network
ResourceFunction["StableDiffusionSynthesize"][prompt] synthesize an image given a string or explicit text embedding vector as prompt. | |
ResourceFunction["StableDiffusionSynthesize"][prompt→latent] use an initial image or a noise as latent for the diffusion starting point. | |
ResourceFunction["StableDiffusionSynthesize"][prompt→latent→guidanceScale] specify a guidance scale. | |
ResourceFunction["StableDiffusionSynthesize"][{negativeprompt,prompt}→latent→guidanceScale] specify a guidance scale with a negative prompt. | |
ResourceFunction["StableDiffusionSynthesize"][<|"Prompt"→…,"NegativePrompt"→…,"Latent"→…,"GuidanceScale"→…,…|>] provide an association with explicit arguments. | |
ResourceFunction["StableDiffusionSynthesize"][prompt,n] generate n instances for the same prompt specification. | |
ResourceFunction["StableDiffusionSynthesize"][{p1,p2,…}] generate multiple images. | |
ResourceFunction["StableDiffusionSynthesize"][{p1,p2,…},n] generate multiple images for each prompt. |
Generate an image by giving a text prompt:
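A minimal sketch of this call; the prompt string is an illustrative assumption:

    ResourceFunction["StableDiffusionSynthesize"]["a photograph of an astronaut riding a horse"]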
Generate multiple images:
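A sketch requesting several instances of one prompt; the prompt and the count are illustrative assumptions:

    ResourceFunction["StableDiffusionSynthesize"]["a watercolor painting of a red fox", 4]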
Guide an initial image with a prompt:
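A sketch of image-guided synthesis, passing an initial image as the latent; the test image and prompt are illustrative assumptions:

    img = ExampleData[{"TestImage", "House"}];
    ResourceFunction["StableDiffusionSynthesize"]["an oil painting of a country house" -> img]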
Use a negative prompt for additional guidance:
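A sketch using the association form to add a negative prompt; both strings are illustrative assumptions:

    ResourceFunction["StableDiffusionSynthesize"][
     <|"Prompt" -> "a studio portrait of a cat", "NegativePrompt" -> "blurry, low quality, deformed"|>]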
Use a precomputed text embedding:
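A sketch of the calling form only: a random array stands in for a real text embedding (which would normally come from the checkpoint's CLIP text encoder), and the {77, 768} shape is an assumption based on the standard Stable Diffusion v1 text-conditioning shape:

    (* placeholder for a precomputed text embedding; a random array will not give a meaningful image *)
    embedding = RandomReal[{-1, 1}, {77, 768}];
    ResourceFunction["StableDiffusionSynthesize"][embedding]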
Use explicit initial noise:
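A sketch passing explicit Gaussian noise as the latent; the {4, 64, 64} shape is an assumption matching the usual Stable Diffusion latent dimensions for 512×512 output:

    noise = RandomVariate[NormalDistribution[], {4, 64, 64}];
    ResourceFunction["StableDiffusionSynthesize"]["a castle on a cliff at dawn" -> noise]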
A higher guidance scale encourages generation of images that are more closely linked to the prompt, usually at the expense of lower image quality:
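A sketch comparing several guidance scales via the association form; the prompt, the scale values and the omission of an explicit "Latent" are assumptions:

    Table[
     ResourceFunction["StableDiffusionSynthesize"][
      <|"Prompt" -> "a bowl of fruit on a wooden table", "GuidanceScale" -> g|>],
     {g, {1, 7.5, 20}}]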
Specify encoding strength (how much to transform the reference image):
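A sketch assuming a hypothetical "EncodingStrength" association key; the key name, image and value are illustrative assumptions, not the documented interface:

    (* "EncodingStrength" is a hypothetical key name used for illustration only *)
    img = ExampleData[{"TestImage", "House"}];
    ResourceFunction["StableDiffusionSynthesize"][
     <|"Prompt" -> "a pencil sketch of a house", "Latent" -> img, "EncodingStrength" -> 0.6|>]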
Specify the number of diffusion iterations (the default is 50):
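A sketch assuming the iteration count is set with the MaxIterations option; the option name and the values are illustrative assumptions:

    (* hypothetical option name; the documented spelling may differ *)
    ResourceFunction["StableDiffusionSynthesize"]["a snowy mountain landscape", MaxIterations -> 25]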
By default, ProgressReporting→Automatic shows the latent image as the diffusion progresses; ProgressReporting→False disables this display.
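A sketch that turns the progress display off; the prompt is an illustrative assumption:

    ResourceFunction["StableDiffusionSynthesize"]["a lighthouse at sunset", ProgressReporting -> False]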
Return intermediate images for each diffusion iteration:
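A sketch assuming a hypothetical "IntermediateImages" key selects this return form; the key name and prompt are illustrative assumptions:

    (* hypothetical key name used for illustration only *)
    ResourceFunction["StableDiffusionSynthesize"][
     <|"Prompt" -> "a forest path in autumn", "IntermediateImages" -> True|>]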
Return a pair {latents, result} containing the list of intermediate latents and the final result:
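A sketch assuming a hypothetical "ReturnLatents" key selects the {latents, result} return form; the key name and prompt are illustrative assumptions:

    (* hypothetical key name used for illustration only *)
    {latents, result} = ResourceFunction["StableDiffusionSynthesize"][
       <|"Prompt" -> "a forest path in autumn", "ReturnLatents" -> True|>];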
Specify custom neural network parts from different trained checkpoints, modified by Textual Inversion, LoRA or other techniques:
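A purely hypothetical sketch, assuming a "UNet" key accepts a replacement network and that a fine-tuned net has been loaded separately:

    (* customUNet: assumed to be a fine-tuned UNet loaded separately; the "UNet" key is hypothetical *)
    ResourceFunction["StableDiffusionSynthesize"][
     <|"Prompt" -> "a portrait in the style of the fine-tuned checkpoint", "UNet" -> customUNet|>]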
By default, TargetDevice is "GPU"; the network is extremely slow on "CPU" and running it there is not recommended:
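A sketch setting TargetDevice explicitly; the prompt is an illustrative assumption:

    ResourceFunction["StableDiffusionSynthesize"]["a city skyline at night", TargetDevice -> "GPU"]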
"UNetTargetDevice" and "UNetBatchSize" options can overwrite TargetDevice and BatchSize, which may be useful when Decoder can't handle the same BatchSize for decoding too many images:

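A sketch combining these options so that the UNet runs with a larger batch than the decoder; the prompt and the specific values are illustrative assumptions:

    ResourceFunction["StableDiffusionSynthesize"]["a field of sunflowers", 8,
     "UNetBatchSize" -> 8, BatchSize -> 2, TargetDevice -> "GPU"]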
Progressively modify the neural network to see how it gradually breaks down.
Requirements: Wolfram Language 13.0 (December 2021) or above
This work is licensed under a Creative Commons Attribution 4.0 International License