raphiki

Originally published at blog.worldline.tech

The Yoga of Image Generation – Part 1

In this series of articles, I will introduce image generation with Stable Diffusion and ComfyUI. This first article covers the theory and provides basic examples. My goal is to generate precise images of yoga poses using only my local machine, since legal restrictions prevent me from using images found on the Internet. Given these constraints, how can generative AI help?

Stable Diffusion

Stable Diffusion Logo

Let's start by introducing Stable Diffusion, a family of image generation models. It originated in Germany in 2021 as a collaboration between companies and universities, built around an innovation called the latent diffusion model. Organizations such as Runway and EleutherAI participated in the project. Stability AI initially contributed computing resources, later hired many of the original researchers, and is now the official maintainer of the Stable Diffusion models.

These models are distributed under "open" licenses that emphasize responsible AI: usage is subject to OpenRAIL restrictions, and from version 3.5 onwards enterprises with over $1 million in annual revenue must pay a license fee to Stability AI.

The open availability of these models has fostered a dynamic ecosystem of contributors. A community has emerged around fine-tuned versions and additional models like Refiners, Upscalers, ControlNets, and Low-Rank Adapters (which will be introduced in this series). This vibrant community also offers user-friendly interfaces to interact with the models (such as the Automatic1111 Web UI or ComfyUI) and tools that aid the fine-tuning process, like Dreambooth or Kohya SS. Platforms like Hugging Face and civitai.com allow the community to share models, prompts, images, and tutorials.

ComfyUI

ComfyUI Logo

ComfyUI is a fascinating GUI for Stable Diffusion that lets users run the models locally and build image generation workflows. It is an intuitive, modular, and customizable platform distributed under the GPL 3 license.

Workflows interconnect nodes through links to design and execute processes like text-to-image or image-to-image generation. Custom nodes contributed by the community make it highly extensible. We'll use ComfyUI throughout this tutorial series. Installation is straightforward and detailed in the ComfyUI documentation. To make the GUI more user-friendly, I recommend installing the following plugins:

  • ComfyUI Manager, to search for and install models and custom nodes
  • ComfyUI Workspace Manager, to organize workflows and models

As this tutorial series progresses, we'll need to install additional custom nodes (using the ComfyUI Manager) to create increasingly complex workflows.

Simple Text-to-Image Workflow

Now that we're set, let's begin with a simple workflow: I want Stable Diffusion to generate an image of a girl doing yoga in a park.

Here is the workflow in ComfyUI:

Simple Text-to-Image workflow

Watch the following video to see it in action.

All the information and parameters of a workflow are embedded as metadata in the generated PNG files. This allows you to load an image back into ComfyUI to visualize and manipulate its original workflow, which is very convenient!
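As a quick illustration of what is stored, here is a minimal sketch for inspecting that embedded workflow outside ComfyUI. It assumes you have Pillow installed and that the image was saved by ComfyUI, which writes the workflow as JSON into the PNG text chunks; the file name is a made-up example.

```python
# Minimal sketch: reading the workflow ComfyUI embeds in a generated PNG.
# Assumes Pillow is installed; the PNG text chunks typically contain
# "workflow" and "prompt" entries written by ComfyUI.
import json
from PIL import Image

image = Image.open("ComfyUI_00001_.png")  # hypothetical output file name
metadata = image.info  # PNG text chunks end up in this dictionary

workflow_json = metadata.get("workflow")
if workflow_json:
    workflow = json.loads(workflow_json)
    print(f"Workflow contains {len(workflow.get('nodes', []))} nodes")
else:
    print("No embedded workflow found in this image")
```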

Image generation relies on deterministic mathematical algorithms, meaning the same workflow with identical parameters always produces the same image.

We've generated our first image; let's explain this simple workflow and its nodes.

Model

Load Checkpoint Node

We'll use Stable Diffusion models. The most popular versions are Stable Diffusion 1.5 and Stable Diffusion XL (SDXL). Many fine-tuned models based on these versions are available, each specialized in certain styles or subjects.

To install a model, download its checkpoint file and place it in ComfyUI's models/checkpoints folder, or use the Manager or Workspace Manager to search for and install it automatically.

For this initial tutorial, we'll use the vanilla SDXL 1.0 model.
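If you prefer to fetch the checkpoint from the command line rather than through the Manager, here is a minimal sketch using the huggingface_hub Python package (my own suggestion, not something the workflow requires; the destination path assumes a default ComfyUI install in the current directory).

```python
# Minimal sketch: downloading the vanilla SDXL 1.0 checkpoint straight into
# ComfyUI's checkpoint folder. Assumes huggingface_hub is installed and that
# ComfyUI lives in ./ComfyUI (adjust the path to your install).
from huggingface_hub import hf_hub_download

hf_hub_download(
    repo_id="stabilityai/stable-diffusion-xl-base-1.0",
    filename="sd_xl_base_1.0.safetensors",
    local_dir="ComfyUI/models/checkpoints",
)
```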

Prompts

Prompt Nodes

Prompts are a crucial part of the workflow. They guide generation based on text describing the desired image (positive prompt) and elements to avoid (negative prompt).

They rely on the CLIP model (Contrastive Language-Image Pretraining), released by OpenAI in 2021 under an MIT license. Trained on 400 million image-text pairs, it links descriptive text with images and handles a wide range of prompts. The prompts are converted into embeddings (mathematical vectors) that guide the model during generation.

Descriptions can be full sentences or comma-separated keywords. Keywords or features can be weighted using parentheses or explicit numeric weights, like ((yoga)) or (yoga:1.2).

Our positive prompt is: girl, doing yoga, lotus pose, green eyes, cinematic lighting, long hair, ((blue yoga outfit)), best quality, park.
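To make the "embeddings" idea more concrete, here is a minimal sketch of how a prompt is turned into CLIP text embeddings using the Hugging Face transformers library (my own illustration; ComfyUI does this internally through its CLIP Text Encode nodes). The openai/clip-vit-large-patch14 checkpoint shown here is the text encoder used by Stable Diffusion 1.x; SDXL pairs it with a second, larger text encoder.

```python
# Minimal sketch: turning a prompt into CLIP text embeddings.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = ("girl, doing yoga, lotus pose, green eyes, cinematic lighting, "
          "long hair, blue yoga outfit, best quality, park")
tokens = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt")

with torch.no_grad():
    embeddings = text_encoder(**tokens).last_hidden_state

print(embeddings.shape)  # (1, 77, 768): one vector per token position
```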

Latent Space

The diffusion process takes place in a multi-dimensional mathematical space known as the Latent Space. The model does not work on pixels but on a compressed vector representation of the image.

This is why we must provide an empty Latent Image for the generation process in our workflow.

Latent Space Nodes

Once the generation has transformed the latent image, it must be converted back into a pixel-based image by a “VAE Decode” node, which uses a Variational Autoencoder, an additional neural network.
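As an illustration of this latent-to-pixel step outside ComfyUI, here is a minimal sketch using the diffusers library and the publicly available SDXL VAE (my choice of library and model identifier, not something prescribed by the workflow above).

```python
# Minimal sketch of the latent-to-pixel step with the SDXL VAE.
# An "empty latent image" for a 1024x1024 picture is a 4 x 128 x 128 tensor:
# the VAE compresses each 8x8 pixel block into one 4-channel latent value.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")

latent = torch.zeros(1, 4, 128, 128)  # what the "Empty Latent Image" node produces
with torch.no_grad():
    image = vae.decode(latent / vae.config.scaling_factor).sample

print(image.shape)  # torch.Size([1, 3, 1024, 1024]) -- back in pixel space
```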

Generation Process

The generation process unfolds step by step in the latent space: it starts from a random noisy latent image that is incrementally denoised, with the details guided by the prompts, as illustrated in the following video.

The detailed final image is then decoded into pixels by the VAE.

KSampler Node

Several parameters can be adjusted in the KSampler node, responsible for this generation process:

  • The seed determines the randomness of the initial noisy image
  • We keep the default number of steps at 20
  • CFG (classifier-free guidance) controls how strongly the prompts influence the process
  • Various sampling and scheduling algorithms are available, differing in speed and quality; we’ll stick with the default for now
  • We set the denoise value to 100% (a value of 1.00), given that we provided a completely noisy image as input

As explained earlier, because the process is purely mathematical, the same model, starting noise (i.e., the same seed), prompts, and sampling/scheduling parameters will always produce the exact same image.

When tweaking parameters and optimizing your workflow, it's recommended to fix the seed initially to observe how changing inputs or parameters affects the final image.
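To show how these parameters map onto code, here is a rough equivalent of the same text-to-image step using the diffusers library instead of ComfyUI. This is my own illustration with an assumed negative prompt and seed; only the GUI workflow above is the article's actual setup.

```python
# Rough diffusers equivalent of the ComfyUI text-to-image workflow (illustrative):
# the same model, prompts, seed, steps and CFG reproduce the same image each run.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(42)  # fixed seed -> reproducible output

image = pipe(
    prompt="girl, doing yoga, lotus pose, green eyes, cinematic lighting, "
           "long hair, blue yoga outfit, best quality, park",
    negative_prompt="blurry, deformed, extra limbs",  # hypothetical negative prompt
    num_inference_steps=20,   # the "steps" parameter of the KSampler
    guidance_scale=7.0,       # the CFG value
    generator=generator,
).images[0]

image.save("yoga_t2i.png")
```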

T2I - Generated Image

The generated image captures prompt features such as the blue yoga outfit, the park, and the long hair, but the pose isn't quite the expected Lotus pose... So, how can we refine our prompts?

Embeddings (Textual Inversions)

Instead of writing an exhaustive prompt to describe image features, we can use Textual Inversions as shortcuts. Also called Embeddings, they are learned vector representations of text that provide preset instructions for style, theme, texture, and so on.

The community shares many Embeddings as downloadable files to place in ComfyUI's models/embeddings folder. Note that an Embedding cannot create anything that isn't already in the model itself, which is why we need Embeddings trained for our SDXL model.

I've downloaded several Embeddings to determine the style of our image, so let’s test some by adding them to the positive prompt.
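For reference, a downloaded Textual Inversion is invoked in a ComfyUI prompt with the embedding: prefix followed by its file name. The file name below is a made-up example for illustration.

```python
# Hypothetical example of invoking a Textual Inversion in a ComfyUI prompt.
# "watercolor_style" stands for a file named watercolor_style.safetensors
# placed in ComfyUI/models/embeddings (the name is an assumption).
positive_prompt = (
    "embedding:watercolor_style, girl, doing yoga, lotus pose, "
    "green eyes, cinematic lighting, long hair, ((blue yoga outfit)), "
    "best quality, park"
)
```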

Embeddings

Below are some output examples generated without changing anything in our initial workflow.

Styles through embedding

Multiple Embeddings can be used in both positive and negative prompts. For instance, specific Embeddings are available to avoid poorly generated hands.

There are also Embeddings for poses, but they are not as precise as needed for generating yoga poses... Clearly, a simple text-to-image workflow isn't sufficient for my requirements.

The Problem with Yoga Poses

The poses are essentially incorrect, with even the first image failing to depict the desired Lotus yoga pose. How can we address the issue of pose accuracy? We'll be experimenting with some new techniques in the second part of this tutorial.

In the meantime, have fun creating workflows in ComfyUI! Feel free to explore my YouTube tutorials as well.
