In this series of articles, I will introduce image generation using Stable Diffusion and ComfyUI. This first article discusses the theory and provides basic examples. My goal is to generate precise images of yoga poses using only my local machine. Due to legal restrictions, I cannot use images from the Internet. Given these limitations, how can generative AI assist?
Stable Diffusion
Let's introduce Stable Diffusion, a family of models for image generation. It began in Germany in 2021 as a collaboration between universities and companies, built around an innovation called the latent diffusion model. Organizations like Runway and EleutherAI participated in the project. Initially, Stability AI contributed computing resources; it later hired many of the original researchers and is now the official maintainer of the Stable Diffusion models.
These models are distributed under "open" licenses that emphasize responsible AI, with OpenRAIL restrictions on their usage. From version 3.5 onwards, enterprises with over $1 million in annual revenue must pay a licensing fee to Stability AI.
The open availability of these models has fostered a dynamic ecosystem of contributors. A community has emerged around fine-tuned versions and additional models such as Refiners, Upscalers, ControlNets, and Low-Rank Adapters (which will be introduced in this series). This vibrant community also offers tools like user-friendly interfaces to interact with the models (such as the Automatic1111 Web UI or ComfyUI) and tools that aid the fine-tuning process, like Dreambooth or Kohya SS. Platforms like Hugging Face and civitai.com allow the community to share models, prompts, images, and tutorials.
ComfyUI
ComfyUI is a fascinating GUI for Stable Diffusion that allows users to run Stable Diffusion models locally and create image generation workflows. It's an intuitive, modular, and customizable platform distributed under the GPL 3 license.
Workflows interconnect nodes through links to design and execute processes like text-to-image or image-to-image generation. Custom nodes, contributed by the community, make it highly extensible. We'll use ComfyUI throughout this tutorial series. Installation is straightforward and detailed [here]. To make the GUI more user-friendly, I recommend installing the following plugins:
- ComfyUI-Manager: manage custom nodes by installing, removing, disabling, and enabling them
- ComfyUI-Crystools: provides resource monitoring, a progress bar, time elapsed, and metadata management
- ComfyUI-Workspace-Manager: helps organize and manage workflows and models
As this tutorial series progresses, we'll need to install additional custom nodes (using the ComfyUI Manager) to create increasingly complex workflows.
Simple Text-to-Image Workflow
Now that we're set, let's begin with a simple workflow: I want Stable Diffusion to generate an image of a girl doing yoga in a park.
Here is the workflow in ComfyUI:
Watch the following video to see it in action.
All information and parameters of the workflow are embedded in the generated PNG images as metadata. This allows you to load an image back into ComfyUI to visualize and manipulate its original workflow, which is very convenient!
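If you want to inspect that metadata outside ComfyUI, here is a minimal sketch using Pillow. The filename is hypothetical, and it assumes the image was saved by the default "Save Image" node, which writes the workflow as PNG text chunks.

```python
import json
from PIL import Image

img = Image.open("ComfyUI_00001_.png")       # hypothetical output filename
workflow_json = img.info.get("workflow")     # the full node graph, if present
prompt_json = img.info.get("prompt")         # the executed prompt/graph

if workflow_json:
    workflow = json.loads(workflow_json)
    print(f"{len(workflow['nodes'])} nodes in the embedded workflow")
```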
Image generation relies on deterministic mathematical algorithms, meaning the same workflow with identical parameters (including the seed) always produces the same image.
We've generated our first image; let's explain this simple workflow and its nodes.
Model
We'll use Stable Diffusion models. The most popular versions are Stable Diffusion 1.5 and Stable Diffusion XL, known as SDXL. Many fine-tuned models based on these versions are available, specialized in certain styles or subjects.
To install a model, download the checkpoint file and store it in ComfyUI's models/checkpoints folder, or use the Manager or Workspace Manager to search for and install it automatically.
For this initial tutorial, we'll use the vanilla SDXL 1.0 model.
Prompts
Prompts are a crucial part of the workflow. They guide generation based on text describing the desired image (positive prompt) and elements to avoid (negative prompt).
They utilize the CLIP model (Contrastive Language-Image Pretraining), created by OpenAI in 2021 under an MIT license. Trained on 400 million image-text pairs, it links descriptive text with images and handles a wide range of prompts. These prompts are converted into embeddings (mathematical vectors) that guide the model during generation.
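As an illustration only (this is not what ComfyUI runs internally; its "CLIP Text Encode" node uses the text encoder(s) bundled with the checkpoint), here is a sketch of how a prompt becomes embeddings with the original OpenAI CLIP model via the Hugging Face transformers library:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "girl, doing yoga, lotus pose, park"
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # (1, 77, 512): one vector per token, used to condition the model
```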
Descriptions can be sentences or comma-separated keywords. Keywords or features can be weighted using parentheses or explicit numeric weights, like ((yoga)) or (yoga:1.2).
Our positive prompt is: girl, doing yoga, lotus pose, green eyes, cinematic lighting, long hair, ((blue yoga outfit)), best quality, park.
Latent Space
The diffusion process takes place in a multi-dimensional mathematical space known as the latent space: the model works not on pixels but on a compressed vector representation of the image.
This is why we must provide an empty Latent Image for the generation process in our workflow.
After the latent image has been transformed, it must be converted back into a pixel-based image by the "VAE Decode" node, which runs a Variational Autoencoder, an additional neural network.
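To give an idea of what the "VAE Decode" node does, here is a rough sketch with the diffusers library; the VAE repository name is an assumption, and the random latent is only there to show the tensor shapes involved.

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")  # assumed SDXL VAE weights

# A 1024x1024 SDXL image lives in latent space as a 4x128x128 tensor (8x downscaling).
latent = torch.randn(1, 4, 128, 128)

with torch.no_grad():
    # Pipelines rescale the latent before decoding it back to pixel space.
    image = vae.decode(latent / vae.config.scaling_factor).sample

print(image.shape)  # (1, 3, 1024, 1024): an RGB image tensor
```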
Generation Process
The generation process unfolds step by step in the latent space, starting from a random noisy image that is incrementally denoised, with details guided by the prompts, as illustrated in the following video.
The detailed final image is then decoded into pixels by the VAE.
Several parameters can be adjusted in the KSampler node, responsible for this generation process:
- The seed determines the randomness of the initial noisy image
- We keep the default number of steps at 20
- CFG (classifier-free guidance) controls how strongly the prompts influence the process
- Various sampling and scheduling algorithms are available, differing in speed and quality; we’ll stick with the default for now
- We set the denoising strength to 100% (a value of 1.00), since we start from a completely noisy latent image
As explained earlier, due to the mathematical basis of this process, the same model, input image (here, the same seed), prompts, and sampling/scheduling parameters will always produce the exact same image.
When tweaking parameters and optimizing your workflow, it's recommended to fix the seed initially to observe how changing inputs or parameters affects the final image.
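Outside ComfyUI, the same parameters (seed, steps, CFG, sampler) appear in other toolkits too. Below is a minimal sketch with the diffusers library and the SDXL base weights, assuming a CUDA GPU; the negative prompt and CFG value are illustrative, not taken from the workflow above.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed -> reproducible output

image = pipe(
    prompt="girl, doing yoga, lotus pose, green eyes, cinematic lighting, "
           "long hair, blue yoga outfit, best quality, park",
    negative_prompt="blurry, deformed, low quality",  # illustrative negative prompt
    num_inference_steps=20,                           # same step count as our KSampler
    guidance_scale=7.0,                               # CFG: how strongly prompts steer denoising
    generator=generator,
).images[0]

image.save("yoga_seed_42.png")  # rerunning with seed 42 reproduces this exact image
```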
Back to our ComfyUI output: it captures the prompt features such as the blue yoga outfit, the park, and the long hair, but the pose isn't quite the expected Lotus pose... So, how can we refine our prompts?
Embeddings (Textual Inversions)
Instead of an exhaustive prompt to describe image features, we can use Textual Inversions as shortcuts. Also called Embeddings, they are vector representations of text providing preset instructions for style, theme, texture, etc.
The community shares many Embeddings as downloadable files to place in ComfyUI's models/embeddings folder; they are then referenced directly in the prompt text (for example, embedding:filename). Note that an Embedding cannot create something not already present in the base model, and it is tied to that model's text encoder, which is why we need Embeddings tailored to our SDXL model.
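For reference, the same mechanism exists outside ComfyUI. Here is a minimal sketch with the diffusers library, assuming an SDXL embedding packaged as a safetensors file with the usual "clip_l" and "clip_g" weights; the file path and trigger token are hypothetical.

```python
import torch
from safetensors.torch import load_file
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# SDXL has two text encoders, so the embedding is loaded into both of them.
state = load_file("embeddings/my_style.safetensors")
pipe.load_textual_inversion(state["clip_l"], token="<my-style>",
                            text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
pipe.load_textual_inversion(state["clip_g"], token="<my-style>",
                            text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

image = pipe("girl, doing yoga, lotus pose, park, <my-style>").images[0]
image.save("yoga_with_embedding.png")
```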
I've downloaded several Embeddings to determine the style of our image, so let’s test some by adding them to the positive prompt.
Below are some output examples generated without changing anything in our initial workflow.
Multiple Embeddings can be used in both positive and negative prompts. For instance, specific Embeddings are available to avoid poorly generated hands.
There are also Embeddings for poses, but not as precise as needed for generating yoga poses... Clearly, a simple Text-to-Image workflow isn't sufficient for my requirements.
The poses are essentially incorrect, with even the first image failing to depict the desired Lotus yoga pose. How can we address the issue of pose accuracy? We'll be experimenting with some new techniques in the second part of this tutorial.
In the meantime, have fun creating workflows in ComfyUI! Feel free to explore my YouTube tutorials as well.