Model Overview Card for Wan 2.1
Basic Information
- Model Name: Wan 2.1
- Developer/Creator: Alibaba
- Release Date: February 25, 2025
- Version: 2.1
- Model Type: AI Video Generation Model
Description
Overview:
Wan 2.1, developed by Alibaba's Wan AI team, is a state-of-the-art video foundation model designed for advanced generative video tasks. Supporting Text-to-Video (T2V) and Image-to-Video (I2V) generation, it incorporates architectural innovations that deliver high-quality outputs with exceptional computational efficiency.
Key Features:
- Visual text generation: Generates text in both Chinese and English within videos.
- 3D Variational Autoencoder (Wan-VAE): Encodes and decodes unlimited-length 1080P videos with temporal precision.
- High-quality outputs: Produces visually dynamic and temporally consistent videos at resolutions of up to 720P.
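The temporal-causality property behind Wan-VAE's unlimited-length encoding can be illustrated with a toy sketch. This is not Wan-VAE itself: a scalar moving-average filter stands in for the causal temporal convolutions, but it shows why causal processing lets a video of any length be handled in fixed-size chunks with a small carried state.

```python
# Toy illustration (not Wan-VAE): a causal temporal filter applied chunk by
# chunk. Because each output frame depends only on past frames, arbitrarily
# long videos can be processed in fixed-size chunks while carrying a small
# state forward -- the property that enables unlimited-length encoding.

def causal_filter(frames, state=0.0, alpha=0.7):
    """Exponential moving average over frame values (stand-in for a causal conv)."""
    out = []
    for f in frames:
        state = alpha * state + (1 - alpha) * f
        out.append(state)
    return out, state

def process_in_chunks(frames, chunk_size=4):
    """Process the sequence chunk by chunk, carrying the filter state forward."""
    out, state = [], 0.0
    for i in range(0, len(frames), chunk_size):
        chunk_out, state = causal_filter(frames[i:i + chunk_size], state)
        out.extend(chunk_out)
    return out

frames = [float(i % 5) for i in range(20)]          # a fake 20-frame "video"
full, _ = causal_filter(frames)                     # process all frames at once
chunked = process_in_chunks(frames, chunk_size=4)   # process in chunks of 4
# Chunked processing reproduces full-sequence processing exactly.
```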
Intended Use:
Wan 2.1 is designed for applications in:
- Creative industries (video production).
- Content generation for social media and marketing campaigns.
- Automated workflows involving multimedia processing.
Language Support:
The model supports multilingual text generation, including Chinese and English.
Technical Details
Architecture:
Wan 2.1 is built on the diffusion transformer paradigm with several innovative features:
- 3D Variational Autoencoder (Wan-VAE): Enhances spatio-temporal compression and ensures temporal causality during video generation.
- Video Diffusion DiT Framework: Uses Flow Matching with a T5 Encoder for text encoding and cross-attention layers embedded in transformer blocks.
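The Flow Matching objective mentioned above can be sketched in a few lines. This is a generic scalar illustration (plain Python values stand in for video latents), not Wan's training code: the model learns to predict the constant velocity along the straight path between a noise sample and a data sample.

```python
# Sketch of the Flow Matching training target: given noise x0 and data x1,
# the interpolated sample at time t is x_t = (1 - t) * x0 + t * x1, and the
# regression target is the constant velocity v = x1 - x0.

def flow_matching_pair(x0, x1, t):
    """Return the interpolated sample and its velocity target at time t in [0, 1]."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target

noise = [0.0, 1.0, -1.0]   # x0: sampled noise
data = [2.0, 0.0, 1.0]     # x1: a (toy) data latent
x_t, v = flow_matching_pair(noise, data, t=0.5)
# At t = 0.5 the sample lies exactly halfway between noise and data.
```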
Performance Metrics:
Wan 2.1 achieves a VBench score of 84.7%, excelling in dynamic scenes, spatial consistency, and aesthetics. It generates 1080P video at 30 FPS with realistic motion, thanks to its space-time attention mechanism. As a leading open-source video generation model, it rivals proprietary alternatives such as Sora, though proprietary models may still outperform it in certain areas.
Usage
Code Samples
The model is available on the AI/ML API platform as "Wan 2.1".
Params:
- negative_prompt [str]: The negative prompt to use. Use it to exclude details you don't want in the video, such as colors, objects, scenery, or even small details (e.g., a moustache, blurriness, low resolution).
- seed [int]: Random seed for reproducibility. If None, a random seed is chosen.
- aspect_ratio ['9:16', '16:9']: Aspect ratio of the generated video.
- inference_steps [int]: Number of inference steps for sampling. Higher values give better quality but take longer.
- guidance_scale [number]: Classifier-free guidance scale. Higher values increase prompt adherence; lower values allow more creative freedom.
- shift [number]: Noise schedule shift parameter. Affects temporal dynamics.
- sampler ['unipc', 'dpm+']: The sampler to use for generation.
- enable_safety_checker [boolean]: Whether to enable the safety checker.
- enable_prompt_expansion [boolean]: Whether to enable prompt expansion.
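A minimal sketch of assembling a request with the parameters listed above. The endpoint URL, model identifier, and response shape are assumptions, not documented values; consult the AI/ML API documentation for the exact details.

```python
# Hedged sketch: build a request payload using the parameters documented
# above. The model identifier "wan-2.1" and the endpoint URL shown in the
# comment are placeholders -- check the API docs for the real values.

def build_wan_request(prompt, **params):
    """Assemble a JSON-serializable payload, rejecting unknown parameters."""
    allowed = {"negative_prompt", "seed", "aspect_ratio", "inference_steps",
               "guidance_scale", "shift", "sampler",
               "enable_safety_checker", "enable_prompt_expansion"}
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"model": "wan-2.1", "prompt": prompt, **params}  # model id is a placeholder

payload = build_wan_request(
    "A red fox running through fresh snow",
    aspect_ratio="16:9",
    inference_steps=30,
    guidance_scale=5.0,
    sampler="unipc",
)
# POST this as JSON with your API key, e.g. (endpoint is a placeholder):
# requests.post("https://api.aimlapi.com/...", json=payload,
#               headers={"Authorization": f"Bearer {API_KEY}"})
```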
To get the generated video, poll the API with the returned generation ID until the job completes.
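Retrieval can be sketched as a generic polling loop. The status values and the `fetch_status` callable here are placeholders for illustration; the real endpoint and response fields are described in the API documentation.

```python
import time

# Generic polling sketch for an asynchronous video-generation job. The
# "status"/"video_url" fields and the fetch function are assumptions, not
# the documented response schema.

def wait_for_video(fetch_status, poll_interval=2.0, max_polls=100):
    """Poll fetch_status() until the job completes; return the video URL."""
    for _ in range(max_polls):
        status = fetch_status()
        if status.get("status") == "completed":
            return status["video_url"]
        if status.get("status") == "failed":
            raise RuntimeError(f"generation failed: {status}")
        time.sleep(poll_interval)
    raise TimeoutError("generation did not finish in time")

# Example with a fake status function that completes on the third poll:
responses = iter([{"status": "queued"}, {"status": "processing"},
                  {"status": "completed", "video_url": "https://example.com/out.mp4"}])
url = wait_for_video(lambda: next(responses), poll_interval=0.0)
```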
API Documentation
Detailed API Documentation is available here.
Ethical Guidelines
Alibaba emphasizes responsible use of Wan 2.1 for ethical content creation and discourages misuse such as deepfake generation or the creation of inappropriate content.
Licensing
Wan 2.1 is licensed under Apache 2.0, allowing both commercial and research use with transparent terms.
Get Wan 2.1 API here.