Wan 2.7 video

Whether you're generating videos from scratch, animating images, or guiding motion with references, Wan 2.7 is designed to balance visual quality, speed, and cost-efficiency.

Wan 2.7 is Alibaba Tongyi Lab's most capable video generation system to date. It collapses four distinct creation modes into a single system: text-to-video, image-to-video, reference-to-video, and natural-language video editing.

Four ways to create video

Each mode in the Wan 2.7 suite targets a specific production scenario. They share the same underlying diffusion transformer architecture but expose different input contracts and motion-handling strategies.

T2V

Text to Video

Turn written prompts into 720p–1080p video clips. Thinking mode handles dense, multi-shot scene descriptions with higher compositional accuracy than prior versions.

I2V

Image to Video

Animate a single image or a 9-grid multi-angle set. Specify both first and last frame, and the model auto-infers the motion in between while holding subject identity stable.

R2V

Reference to Video

Pass up to five image, video, or audio references in one call. Locks appearance, voice tone, lip sync, camera movement, and effects simultaneously — industry-leading reference count.

VideoEdit

Video Edit

Rewrite existing footage with a plain-language instruction. Handles local edits, style transfer, colorization, and restoration without full re-generation.

Text to Video

Most text-to-video models treat a prompt as a flat string. Wan 2.7's T2V endpoint feeds it through an internal reasoning pass — what the team calls "thinking mode" — before generation begins. The result is noticeably better layout on complex prompts: multi-character scenes hold spatial logic, camera directions land where you expect them, and lighting descriptions actually propagate across the full clip.

Key T2V capabilities

  • Accepts natural-language scene descriptions with embedded camera and lighting instructions
  • Generates multi-shot sequences from a single prompt with automatic transition planning
  • Supports audio URL input for synchronized background music or sound design
  • Prompt expansion option rewrites short inputs with cinematographic detail before generation
  • Output at 720p or 1080p, in 16:9, 9:16, 1:1, 4:3, and 3:4 aspect ratios
  • Duration configurable from 2 to 15 seconds depending on scene complexity

Prompt expansion is worth enabling when you're working from short or incomplete descriptions. The model internally elaborates on scene depth, focal length, and motion dynamics — then exposes the actual prompt used so you can inspect and iterate on it.
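
Putting those options together, a T2V call might look like the sketch below. The endpoint URL and exact field names are assumptions made for illustration, not the documented contract; only the parameters themselves (prompt, audio URL, resolution, aspect ratio, duration, prompt expansion) come from this page.

```python
import requests

# Hypothetical endpoint and request schema -- the page documents the
# parameters but not the exact field names or URL.
API_URL = "https://api.example.com/v1/wan-2.7/t2v"  # placeholder

payload = {
    "prompt": (
        "Night market street scene: handheld camera pushes in on a "
        "noodle stall, warm tungsten key light, steam drifting left"
    ),
    "audio_url": "https://example.com/assets/ambience.mp3",  # optional sound design
    "resolution": "1080p",     # 720p or 1080p
    "aspect_ratio": "16:9",    # 16:9, 9:16, 1:1, 4:3, 3:4
    "duration": 8,             # seconds, 2-15 for T2V
    "prompt_extension": True,  # let the model rewrite short prompts
}

resp = requests.post(API_URL, json=payload,
                     headers={"Authorization": "Bearer <YOUR_KEY>"})
resp.raise_for_status()
job = resp.json()  # per the page, the expanded prompt is exposed for inspection
```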

Image to Video

Where most image-to-video tools animate from a starting frame and let motion drift wherever physics and chance take it, Wan 2.7's I2V gives you explicit control over both endpoints. You supply the first and last frames, and the model fills in the motion path. Subject identity stays consistent across the transition, which eliminates the ghosting and gradual drift that typically ruin longer clips.

Multi-angle and 9-grid support

When you need a product shown from multiple perspectives in the same sequence, the 9-grid input lets you feed in a contact sheet of reference angles. The model stitches these into a coherent multi-shot clip rather than treating each angle as a separate generation, keeping brand visuals consistent across every frame.
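
If you need to assemble that contact sheet yourself, a 3×3 grid is straightforward to build. This is a minimal sketch using Pillow; the file names are placeholders and it assumes nine same-size angle shots on disk.

```python
from PIL import Image

# Assemble nine same-size angle shots into a 3x3 contact sheet.
# File names are placeholders; any nine views of the subject work.
paths = [f"angle_{i}.png" for i in range(9)]
tiles = [Image.open(p) for p in paths]

w, h = tiles[0].size
grid = Image.new("RGB", (w * 3, h * 3))
for i, tile in enumerate(tiles):
    grid.paste(tile, ((i % 3) * w, (i // 3) * h))  # row-major placement

grid.save("nine_grid.png")  # upload this as the 9-grid I2V input
```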

I2V also accepts

  • A preceding video clip as a continuation reference (first_clip_url) for scene-to-scene flow
  • A driving audio track for lip-sync or rhythm-matched motion
  • Optional text prompt layered on top of image references for guided motion direction
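
Combined, an I2V call might look like the following sketch. The page names these inputs but not the exact schema, so apart from first_clip_url (mentioned above) the field names are assumptions.

```python
import requests

# Hypothetical I2V request shape; only first_clip_url is named on this page.
API_URL = "https://api.example.com/v1/wan-2.7/i2v"  # placeholder

payload = {
    "first_frame_url": "https://example.com/shots/opening.png",
    "last_frame_url": "https://example.com/shots/closing.png",
    "first_clip_url": "https://example.com/clips/previous_scene.mp4",  # continuation reference
    "audio_url": "https://example.com/audio/dialogue.wav",  # drives lip sync / rhythm
    "prompt": "Slow dolly right as the subject turns toward camera",
    "resolution": "720p",
    "duration": 6,  # 2-10 seconds for I2V
}

job = requests.post(API_URL, json=payload,
                    headers={"Authorization": "Bearer <YOUR_KEY>"}).json()
```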

Reference to Video

R2V is arguably the most technically ambitious mode in the suite. It's built for teams that need the same person, character, or product to appear consistently across many clips — without a traditional fine-tuning or LoRA workflow. You pass references in; the model extracts identity embeddings and locks them into the generation process.

The five-reference ceiling is the highest in the industry right now. You can mix image, video, and audio references freely within that budget: supply a front-facing photo, a side profile, two motion clips showing how the character moves, and an audio clip capturing the voice, and the output holds all of those attributes simultaneously.

What R2V locks in

  • Visual appearance and facial geometry across varied lighting and camera angles
  • Voice tone and lip sync from a reference audio clip (reference_voice)
  • Camera movement style extracted from a reference video
  • Special effects or visual motifs carried forward from reference material
  • Complex, high-motion actions reproduced stably without identity collapse

For AI avatar production, digital influencer pipelines, or any project where character consistency has historically meant expensive per-subject fine-tuning, R2V changes the economics significantly.
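
An R2V call using the full five-reference budget might look like this sketch. The five-reference limit, mixed media types, and the reference_voice input come from this page; the endpoint and remaining field names are assumptions for illustration.

```python
import requests

# Hypothetical R2V request; reference_voice is named on this page,
# the rest of the schema is assumed.
API_URL = "https://api.example.com/v1/wan-2.7/r2v"  # placeholder

payload = {
    "references": [  # four visual refs + one voice ref = five total
        "https://example.com/refs/front_face.png",
        "https://example.com/refs/side_profile.png",
        "https://example.com/refs/walk_cycle.mp4",
        "https://example.com/refs/gesture_loop.mp4",
    ],
    "reference_voice": "https://example.com/refs/voice_sample.wav",
    "prompt": "The character delivers a product pitch at a trade-show booth",
    "resolution": "1080p",
    "duration": 10,  # 2-10 seconds for R2V
}

job = requests.post(API_URL, json=payload,
                    headers={"Authorization": "Bearer <YOUR_KEY>"}).json()
```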

API Pricing

  • 720p: $0.13 per second
  • 1080p: $0.195 per second
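
Billing is flat per second of output, so per-clip cost is simple arithmetic:

```python
# Per-second rates from the pricing list above (USD).
RATE_PER_SECOND = {"720p": 0.13, "1080p": 0.195}

def clip_cost(seconds: int, resolution: str) -> float:
    """Return the USD cost of a clip at flat per-second billing."""
    return seconds * RATE_PER_SECOND[resolution]

print(f"${clip_cost(10, '720p'):.2f}")   # $1.30
print(f"${clip_cost(10, '1080p'):.2f}")  # $1.95
```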

Where teams are using it in production

Wan 2.7 spans a wide range of commercial applications. The combination of high-resolution output, character-stable R2V, and natural-language editing removes dependencies that previously required dedicated production crews or per-project model fine-tuning.

  • Ad creative generation
  • AI avatar pipelines
  • Product showcases
  • Social content at scale
  • Storyboarding & pitches
  • Archival restoration
  • Concept visualization
  • Training data synthesis

Choosing the right mode

Mode | Primary input                         | Best for                   | Max refs
T2V  | Text prompt + optional audio          | New scenes from scratch    | N/A
I2V  | 1 or 9-grid images, first/last frame  | Animating existing visuals | 9 images
R2V  | Images, videos, audio refs            | Character-consistent clips | 5 mixed

Architecture and output parameters

Wan 2.7 is built on a Diffusion Transformer (DiT) foundation combined with Flow Matching — the same architectural direction that has driven consistent scaling gains in both image and video generation over the past two years. Cross-attention handles text conditioning; full spatio-temporal attention captures motion dynamics across both spatial and temporal axes simultaneously.
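
Wan's training code isn't public, so the following is only a toy illustration of the flow-matching objective that architecture note refers to: sample a point on the straight path between noise and data, then regress the network's predicted velocity onto the constant displacement between them. The two-layer MLP stands in for the actual DiT, and the tensor shapes are arbitrary.

```python
import torch

# Toy flow-matching step -- not Wan's code, just the objective it names.
# x1 is a batch of (flattened) latents; the MLP stands in for the DiT.
model = torch.nn.Sequential(
    torch.nn.Linear(65, 128), torch.nn.ReLU(), torch.nn.Linear(128, 64)
)

def flow_matching_loss(x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                   # noise endpoint
    t = torch.rand(x1.shape[0], 1)              # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                  # point on the straight path
    v_target = x1 - x0                          # constant velocity along that path
    v_pred = model(torch.cat([xt, t], dim=-1))  # network conditioned on time
    return ((v_pred - v_target) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, 64))
loss.backward()
```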

Shared output parameters across all modes

  • Resolution: 720p or 1080p
  • Aspect ratio: 16:9, 9:16, 1:1, 4:3, 3:4
  • Duration: 2–10 seconds for I2V and R2V; up to 15 seconds for T2V
  • Prompt extension: optional intelligent rewriting for short inputs
  • Seed control: full seed parameter for reproducible outputs
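
Seed control is what makes iterating on these parameters tractable: hold the seed fixed while varying one setting, and any change in output is attributable to that setting. A sketch, reusing the hypothetical payload shape from the T2V example (only the seed parameter itself is documented here):

```python
# `seed` is the documented reproducibility control; the surrounding
# field names are the same assumptions as in the earlier sketches.
base = {
    "prompt": "Aerial pullback over a coastline at golden hour",
    "resolution": "720p",
    "aspect_ratio": "16:9",
    "duration": 5,
    "seed": 42,  # same seed + same parameters -> reproducible output
}

variant = dict(base, resolution="1080p")  # seed held fixed; only resolution differs
```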
