Function Repository Resource:

StableDiffusionSynthesize


Synthesize images using the Stable Diffusion neural network

Contributed by: Maria Sargsyan and Nikolay Murzin

ResourceFunction["StableDiffusionSynthesize"][prompt]

synthesize an image given a string or an explicit text embedding vector as the prompt.

ResourceFunction["StableDiffusionSynthesize"][prompt→latent]

use an initial image or noise as the latent starting point for the diffusion.

ResourceFunction["StableDiffusionSynthesize"][prompt→latent→guidanceScale]

specify a guidance scale.

ResourceFunction["StableDiffusionSynthesize"][{negativeprompt,prompt}→latent→guidanceScale]

specify a guidance scale with a negative prompt.

ResourceFunction["StableDiffusionSynthesize"][<|"Prompt"→…,"NegativePrompt"→…,"Latent"→…,"GuidanceScale"→…,…|>]

provide an association with explicit arguments.

ResourceFunction["StableDiffusionSynthesize"][prompt,n]

generate n instances for the same prompt specification.

ResourceFunction["StableDiffusionSynthesize"][{p1,p2,…}]

generate multiple images.

ResourceFunction["StableDiffusionSynthesize"][{p1,p2,…},n]

generate multiple images for each prompt.

Details and Options

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
Stable Diffusion uses NetModel["Stable Diffusion V1"].
A precomputed embedding can be used instead of a text prompt via NetModel["CLIP Multi-domain Feature Extractor","EvaluationNet:ViT-L/14-Text"], which should produce a real-valued vector with 768 elements.
For a list of inputs, a pair {negativeprompt,prompt} can be used as a shorthand for <|"Prompt"→prompt,"NegativePrompt"→negativeprompt|>. A negative prompt is the opposite of a prompt insofar as it tells the model what not to generate (see the sketch after these notes).
Instead of giving an image as a starting point, any precomputed latent or noise array of dimensions {4, 64, 64} can be used.
If an image is given as the latent, it will be encoded using NetModel[{"Stable Diffusion V1","Part"→"Encoder"}].
NetGraph/NetChain evaluation options such as TargetDevice and BatchSize are supported.
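
As an illustration of the input forms above, the following sketch shows the rule shorthand and the equivalent explicit association side by side; the prompt texts and guidance scale value are purely illustrative:

noise = RandomVariate[NormalDistribution[], {4, 64, 64}];

(* rule shorthand: {negative prompt, prompt} -> latent -> guidance scale *)
ResourceFunction["StableDiffusionSynthesize"][
 {"blurry, low quality", "A watercolor landscape"} -> noise -> 7.5]

(* equivalent explicit association form *)
ResourceFunction["StableDiffusionSynthesize"][<|
  "Prompt" -> "A watercolor landscape",
  "NegativePrompt" -> "blurry, low quality",
  "Latent" -> noise,
  "GuidanceScale" -> 7.5|>]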

Examples

Basic Examples (3) 

Generate an image by giving a text prompt:

In[1]:=
ResourceFunction["StableDiffusionSynthesize"]["A cat in party hat"]
Out[1]=

Generate multiple images:

In[2]:=
ResourceFunction["StableDiffusionSynthesize"]["A cat in party hat", 5]
Out[2]=

Guide an initial image with a prompt:

In[3]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/6df4f0b4-d044-4fe5-8947-8b066939f673"]
Out[3]=

Scope (3) 

Use a negative prompt for additional guidance (bottom row):

In[4]:=
Grid@ResourceFunction["StableDiffusionSynthesize"][{
   "A cat in party hat",
   {"image cut out, bad anatomy, bad composition, odd looking, weird, odd, unnatural", "A cat in party hat"}}, 3, BatchSize -> 3]
Out[4]=

Use a precomputed text embedding:

In[5]:=
embedding = NetModel[{"CLIP Multi-domain Feature Extractor", "InputDomain" -> "Text", "Architecture" -> "ViT-L/14"}][
   "A cat in a party hat", NetPort[{"post_normalize", "Output"}]];
In[6]:=
ResourceFunction["StableDiffusionSynthesize"][embedding]
Out[6]=

Use an explicit initial noise:

In[7]:=
SeedRandom[33];
noise = RandomVariate[NormalDistribution[], {4, 64, 64}];
In[8]:=
ResourceFunction["StableDiffusionSynthesize"][
 "A cat in a party hat" -> noise]
Out[8]=

Options (7) 

GuidanceScale (1) 

A higher guidance scale encourages generation of images that are more closely linked to the prompt, usually at the expense of lower image quality:

In[9]:=
SeedRandom[7];
noise = RandomVariate[NormalDistribution[], {4, 64, 64}];
prompt = "A cat in party hat";
negativePrompt = "image cut out, bad anatomy, bad composition, odd looking, weird, odd, unnatural";
In[10]:=
GraphicsRow@ResourceFunction["StableDiffusionSynthesize"][{
   prompt -> noise,
   {negativePrompt, prompt} -> noise -> 3,
   {negativePrompt, prompt} -> noise,
   {negativePrompt, prompt} -> noise -> 14}]
Out[10]=

Strength (1) 

Specify encoding strength (how much to transform the reference image):

In[11]:=
SeedRandom[1];
noise = RandomVariate[NormalDistribution[], {4, 64, 64}];
In[12]:=
(* Evaluate this cell to get the example input *) CloudGet["https://www.wolframcloud.com/obj/2b5a1059-1d25-4155-b1b3-1d6b4059ef84"]
Out[12]=

MaxIterations (1) 

Specify the number of diffusion iterations (the default is 50):

In[13]:=
ResourceFunction["StableDiffusionSynthesize"]["A cat in a party hat", MaxIterations -> 4]
Out[13]=

ProgressReporting (1) 

The default setting ProgressReporting→Automatic shows the latent images as the diffusion progresses; ProgressReporting→False disables it.
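
A minimal sketch of disabling the progress display (the prompt is illustrative):

ResourceFunction["StableDiffusionSynthesize"]["A cat in party hat", ProgressReporting -> False]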

ReturnIntermediates (1) 

Return intermediate images for each diffusion iteration:

In[14]:=
GraphicsRow[
 BlockRandom[SeedRandom[33]; ResourceFunction["StableDiffusionSynthesize"][
    "A cat in a party hat", "ReturnIntermediates" -> True, BatchSize -> 3][[;; ;; 5]]]]
Out[14]=

ReturnLatents (1) 

Return a pair {latents,result} containing the list of latents and the final result:

In[15]:=
GraphicsRow[
 Image /@ Transpose[
   BlockRandom[SeedRandom[33]; ResourceFunction["StableDiffusionSynthesize"][
      "A cat in a party hat", "ReturnLatents" -> True][[
     1, ;; ;; 5]]], {1, 4, 2, 3}]]
Out[15]=

TextEncoder, UNet, Encoder and Decoder (1) 

Specify custom neural network parts from different trained checkpoints, modified by Textual Inversion, LoRA or other techniques:

In[16]:=
ResourceFunction[
 "StableDiffusionSynthesize"]["A cat in a party hat", 5, "TextEncoder" -> modifiedTextEncoder, "UNet" -> modifiedUnet, "Decoder" -> modifiedDecoder]
Out[16]=

Possible Issues (2) 

By default, TargetDevice is set to "GPU", since the network is extremely slow on a CPU and running it with TargetDevice→"CPU" is not recommended:

In[17]:=
ResourceFunction["StableDiffusionSynthesize"]["A cat in a party hat", TargetDevice -> "CPU"]
Out[17]=

"UNetTargetDevice" and "UNetBatchSize" options can overwrite TargetDevice and BatchSize, which may be useful when Decoder can't handle the same BatchSize for decoding too many images:

Neat Examples (1) 

Progressively modify the neural network to see how it gradually breaks down:

In[18]:=
unet = NetModel["Stable Diffusion V1"];
In[19]:=
arrays = Information[unet, "Arrays"];
In[20]:=
seedimage = BlockRandom[SeedRandom[33]; RandomVariate[NormalDistribution[], {4, 64, 64}]];
In[21]:=
randomArrays = Information[NetInitialize[unet, All, RandomSeeding -> 33563], "Arrays"];
In[22]:=
GraphicsGrid[Partition[images = Table[
    ResourceFunction["StableDiffusionSynthesize"][
     "A cat in a party hat" -> seedimage,
     "UNet" -> NetReplacePart[unet,
       Merge[{arrays, randomArrays}, Apply[Normal[#1 + p (#2 - #1)] &]]]],
    {p, 0, .15, 0.01}], 8]]
Out[22]=

Requirements

Wolfram Language 13.0 (December 2021) or above

Version History

  • 1.0.0 – 30 June 2023
