Shawn Wang

Alibaba Releases Wan2.1: A Breakthrough in Open-Source Video Generation Models

Introduction

Alibaba has recently open-sourced Wan2.1, a powerful video generation model that has achieved state-of-the-art performance in the field of AI video generation. Released under the Apache 2.0 license, Wan2.1 is now available for developers worldwide through GitHub and HuggingFace platforms.

Key Features of Wan2.1

Wan2.1 stands out in the AI video generation landscape with several impressive capabilities:

  • Superior Performance: Ranks #1 on the VBench leaderboard, outperforming both open-source and commercial models
  • High Resolution Support: Capable of generating videos up to 720P resolution
  • Low Hardware Requirements: Can run on consumer-grade GPUs with as little as 8GB VRAM
  • Multilingual Text Support: Uniquely able to generate videos with both Chinese and English text/subtitles
  • Natural Motion: Produces videos with natural movement, avoiding the distortions common in earlier AI video models


Multiple Task Support

Wan2.1 supports a variety of generation tasks, summarized in the table below (an illustrative usage sketch follows it):

| Task Type | Description |
| --- | --- |
| Text-to-Video (T2V) | Generates complete videos from text descriptions |
| Image-to-Video (I2V) | Creates dynamic videos from a single image |
| Video Editing | AI optimization or modification of existing videos |
| Text-to-Image (T2I) | Generates high-quality images from text |
| Video-to-Audio (V2A) | Creates AI audio that matches video content |
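
To make these entry points concrete, here is an illustrative image-to-video sketch using the Hugging Face diffusers library. The pipeline class and checkpoint names (WanImageToVideoPipeline, Wan-AI/Wan2.1-I2V-14B-480P-Diffusers) and the frame/fps settings are assumptions based on the diffusers documentation at the time of writing, so double-check them before running anything.

```python
# Illustrative image-to-video (I2V) sketch; class, checkpoint, and
# parameter names are assumptions -- verify against the current
# diffusers documentation.
import torch
from diffusers import WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = WanImageToVideoPipeline.from_pretrained(
    "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()   # trade speed for lower peak VRAM

image = load_image("photo.jpg")   # the single still frame to animate
frames = pipe(
    image=image,
    prompt="The camera slowly pans right as waves roll onto the beach",
    height=480,
    width=832,
    num_frames=81,                # roughly 5 seconds at 16 fps
).frames[0]

export_to_video(frames, "i2v.mp4", fps=16)
```

A text-to-video counterpart using the lightweight 1.3B model is sketched in the next section.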

Available Models

The open-source release includes four specific models across two parameter sizes:

  1. Wan2.1-I2V-14B-720P: 14B parameter model for generating high-definition 720P videos from images
  2. Wan2.1-I2V-14B-480P: 14B parameter model for generating 480P videos from images
  3. Wan2.1-T2V-14B: 14B parameter model for text-to-video generation, supporting both 480P and 720P resolutions
  4. Wan2.1-T2V-1.3B: A lightweight 1.3B parameter model that can run on almost any consumer GPU, requiring only 8.19GB VRAM to generate a 5-second 480P video

The 1.3B model is particularly noteworthy: it outperforms several 5B-parameter models and even some larger ones, making it an efficient option for developers with limited computational resources.
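
As a rough illustration of how approachable the 1.3B variant is, here is a minimal text-to-video sketch via diffusers. Again, the class names (WanPipeline, AutoencoderKLWan), the Wan-AI/Wan2.1-T2V-1.3B-Diffusers checkpoint, and the chosen resolution and frame count are assumptions taken from the diffusers documentation at the time of writing.

```python
# Illustrative text-to-video sketch with the 1.3B model; names and
# settings are assumptions -- check the current diffusers docs.
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
# The VAE is typically kept in float32 for quality; the transformer in bf16.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()   # keeps peak VRAM low on consumer GPUs

frames = pipe(
    prompt="A cat walking through tall grass at sunset, cinematic lighting",
    height=480,
    width=832,
    num_frames=81,                # roughly 5 seconds at 16 fps
    guidance_scale=5.0,
).frames[0]

export_to_video(frames, "t2v_1_3b.mp4", fps=16)
```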

Technical Innovations

Wan2.1 incorporates several technical innovations:

3D Spatiotemporal VAE

Wan2.1 utilizes an advanced 3D spatiotemporal variational autoencoder (Wan-VAE) that achieves the following (a minimal sketch of the core idea appears after the list):

  • More efficient video compression while maintaining temporal consistency
  • Support for long 1080P videos without losing temporal information
  • Faster processing and higher quality compared to traditional VAEs
  • 2.5x faster video reconstruction on A800 GPUs compared to HunyuanVideo
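
Alibaba describes Wan-VAE as a causal 3D VAE, meaning each frame's latent depends only on the current and earlier frames. The snippet below is not Wan-VAE's implementation, just a minimal PyTorch sketch (with made-up names) of the causal 3D convolution that gives a spatiotemporal VAE this property and lets it process long clips chunk by chunk without losing temporal information.

```python
# Minimal sketch of a temporally causal 3D convolution; illustrative
# only, not Wan-VAE's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """Pads only the past side of the time axis, so the output at frame t
    never depends on frames t+1, t+2, ..."""
    def __init__(self, in_ch, out_ch, kernel=3):
        super().__init__()
        self.time_pad = kernel - 1                 # pad with past frames only
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):                          # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))   # (W, H, T) pad pairs
        return self.conv(x)

video = torch.randn(1, 3, 9, 64, 64)               # 9 RGB frames, 64x64 each
print(CausalConv3d(3, 16)(video).shape)             # torch.Size([1, 16, 9, 64, 64])
```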

Video Diffusion Transformer (DiT)

The model employs a mainstream video DiT structure with the following components (a generic block sketch follows the list):

  • Full Attention mechanism for effective modeling of long-term spatiotemporal dependencies
  • Flow Matching framework combined with T5 encoder
  • MLP for processing time embeddings
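
To ground those terms, here is a generic (not Wan2.1-specific) sketch of a DiT block: full self-attention across all spatiotemporal tokens, plus an MLP that turns the timestep embedding into scale/shift/gate modulation. The real model also cross-attends to T5 text features, which this sketch omits, and every name here is illustrative.

```python
# Generic video-DiT block sketch; illustrative only, not Wan2.1's block.
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # MLP on the time embedding -> per-block scale/shift/gate parameters
        self.time_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))

    def forward(self, x, t_emb):       # x: (B, N, D) tokens, t_emb: (B, D)
        s1, sc1, g1, s2, sc2, g2 = self.time_mlp(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1[:, None]) + s1[:, None]
        x = x + g1[:, None] * self.attn(h, h, h, need_weights=False)[0]   # full attention
        h = self.norm2(x) * (1 + sc2[:, None]) + s2[:, None]
        return x + g2[:, None] * self.ffn(h)

tokens = torch.randn(2, 128, 512)      # 2 clips, 128 latent tokens each
t_emb = torch.randn(2, 512)            # timestep embeddings
print(DiTBlock()(tokens, t_emb).shape) # torch.Size([2, 128, 512])
```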

Data Processing

The training process involved a four-step data curation workflow focusing on:

  • Basic dimensions
  • Visual quality
  • Motion quality

The pre-training process was divided into four stages, gradually increasing resolution and video duration to optimize training within computational constraints.

💡 Looking for AI Image Inspiration?

Explore VisionGeni AI: a completely free, no-signup gallery of Stable Diffusion 3.5 & Flux images with prompts. Try our Flux prompt generator instantly to spark your creativity.

How to Use

Developers can download and use the models through the project's GitHub repository and Hugging Face.

The models can be run locally using a Gradio Web interface for an interactive experience.
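
For a sense of what that looks like, below is a hypothetical, stripped-down Gradio wrapper. The official repository ships its own demo scripts, so treat this purely as an illustration of the pattern; the generate stub is a placeholder you would replace with a real pipeline call (for example one of the diffusers sketches above).

```python
# Hypothetical minimal Gradio front end; the generate() body is a stub.
import gradio as gr

def generate(prompt: str) -> str:
    # Plug in whichever Wan2.1 pipeline you loaded, save the result as an
    # .mp4 file, and return its path.
    video_path = "output.mp4"          # placeholder
    return video_path

demo = gr.Interface(
    fn=generate,
    inputs=gr.Textbox(label="Prompt"),
    outputs=gr.Video(label="Generated video"),
    title="Wan2.1 local demo",
)
demo.launch()   # serves a local web UI, typically at http://127.0.0.1:7860
```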

Wan2.1 has also been integrated with ComfyUI, allowing users to leverage the model within ComfyUI workflows: https://comfyanonymous.github.io/ComfyUI_examples/wan/
