
Seedance 1.5 Pro

Enjoy optimized performance across resolutions and formats, from prototypes to production-grade outputs.

Seedance 1.5 API supports multi-modal inputs for seamless integration of visuals, audio, and text.

What Is Seedance 1.5 Pro?

Most AI video tools operate in two steps: generate the visual, then stitch in audio afterward. That two-step process is why so much AI-generated content looks and sounds slightly off: sound effects land a beat late, lips don't quite match words, and ambient noise feels pasted in.

Seedance 1.5 Pro takes a different approach entirely. Developed by ByteDance's Seed team and released in December 2025, it is a foundation model built from the ground up for native, joint audio-video generation. Audio and video aren't added to each other; they're created together, sharing the same generation process, the same attention layers, and the same loss functions.

The result is millisecond-level synchronization between what you see and what you hear: lips that move in precise time with spoken words, ambient sounds that materialize exactly when objects collide on screen, background music that breathes with the pacing of the shot.

Seedance 1.5 Pro API Pricing

  • Video with audio: $2.81
  • Video without audio: $1.56
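For budgeting, the prices above translate directly into a batch cost estimate. A minimal sketch, assuming the listed prices are charged per generated clip (the function name and structure here are illustrative, not part of the API):

```python
# Listed per-clip prices, kept in integer cents to avoid float drift.
PRICE_CENTS = {True: 281, False: 156}  # True = with audio, False = without

def estimate_batch_cost(num_clips: int, with_audio: bool = True) -> float:
    """Estimated USD cost for a batch of clips, assuming per-clip pricing."""
    return num_clips * PRICE_CENTS[with_audio] / 100

# 100 clips with audio vs. without:
print(estimate_batch_cost(100, with_audio=True))   # → 281.0
print(estimate_batch_cost(100, with_audio=False))  # → 156.0
```

Disabling audio roughly cuts the per-clip price by 45%, which matters at social-content volume.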

What Seedance 1.5 Pro Can Do

Six capabilities that distinguish this model from other video generation APIs.

Native Audio-Video Joint Generation

Ambient sounds, action effects, background music, and human voices are generated simultaneously with the video frames, not appended afterward. The dual-branch Diffusion Transformer processes both modalities in parallel, synchronized at the architectural level.

Multilingual Lip-Sync with Dialect Support

The model understands phonemes (the individual sounds that make up speech) and maps them to the correct lip shapes. This works across English, Mandarin, Japanese, Korean, Spanish, Portuguese, Indonesian, Cantonese, Shanxi dialect, and Sichuan dialect, with each language's natural rhythm preserved.

Cinematic Camera Control

Specify professional camera movements directly in your prompt: dolly zooms, Hitchcock effects, crane movements, tracking shots, whip pans, and orbits. The model also parses compositional language (golden hour lighting, rack focus, shallow depth of field) and executes it accurately across the generated clip.

Film-Grade Emotional Performance

Subtle micro-expressions (a slight swallow, eyes widening, anxiety transitioning to relief) are rendered accurately based on the prompt context and image input. This removes the mechanical stiffness common in AI video. Characters behave, not just move.

Character Consistency Across Shots

When generating multiple clips for the same narrative, the model preserves character identity: faces don't morph, clothing stays consistent, and proportions remain stable even during complex movements and across full 12-second clips. Provide a reference image as an anchor to lock appearance across a full sequence.

10× Inference Acceleration

Through multi-stage distillation and quantization, ByteDance achieved a 10× speedup in inference over the base model. What once took 20–30 minutes now takes 2–3 minutes without meaningful quality loss — fast enough for real products, not just demos.

How Seedance 1.5 Pro Works Under the Hood

Understanding the architecture helps you write better prompts and predict model behavior. Here's what's actually happening when you make an API call.

Dual-Branch Diffusion Transformer (DB-DiT)

The core is a 4.5 billion parameter Dual-Branch Diffusion Transformer. Two parallel branches — one for video frames, one for audio waveforms — run concurrently and share information through cross-modal attention fusion modules. Because both branches see each other's representations during generation, they stay in lock-step from the very first denoising step.

Multi-Stage Training Pipeline

The model was trained on mixed-modal datasets using curriculum-based data scheduling, robust captioning, and semantic enrichment. Pre-training covers text-to-audio-video (T2VA), image-to-audio-video (I2VA), and unimodal tasks (T2V, I2V). This multi-task approach means a single model handles all input modes without switching contexts between API calls.

Post-Training Alignment (SFT + RLHF)

After pre-training, the team ran Supervised Fine-Tuning on curated high-quality data, followed by Reinforcement Learning from Human Feedback with multi-dimensional reward models calibrated for audio-visual contexts — not just visual preference signals. This is why the model follows complex narrative prompts reliably, rather than generating visually attractive but semantically incoherent clips.

Benchmarked on SeedVideoBench-1.5

Performance was measured using SeedVideoBench-1.5, an internally developed benchmark covering both the video stream (subjects, motion, interaction, cinematography) and the audio stream (vocal types, non-speech audio properties, synchronization). Evaluation uses both a 5-point Likert scale and pairwise Good-Same-Bad metrics for subjective quality — the same methodology used for professional production content review.

Full Seedance 1.5 Pro Spec Sheet

Everything you need to plan your integration before writing a line of code.

Model ID: seedance-1-5-pro
Architecture: Dual-Branch Diffusion Transformer (DB-DiT), extended MMDiT backbone
Parameters: 4.5 billion
Input Modes: Text-to-Video (T2V), Image-to-Video (I2V), Text-to-Audio-Video (T2VA), Image-to-Audio-Video (I2VA)
Output Resolutions: 480p, 720p, 1080p
Aspect Ratios: 16:9, 9:16, 1:1, 4:3, 21:9
Video Duration: 4–12 seconds per generation; "auto" mode selects length based on the prompt
Audio Generation: native joint generation (ambient sound, action effects, SFX, instruments, BGM, human voice)
Lip-Sync Precision: millisecond-level, phoneme-accurate synchronization
Supported Languages: English, Mandarin, Japanese, Korean, Spanish, Portuguese, Indonesian, Cantonese, Shanxi dialect, Sichuan dialect
Camera Controls: pan, tilt, zoom, dolly, truck, orbit, crane, tracking shot, whip pan, Hitchcock zoom
Post-Training: SFT on curated high-quality data + RLHF with multi-dimensional reward models
Inference Speed: 10× acceleration vs. base model (2–3 min generation vs. 20–30 min)
Character Consistency: reference-frame conditioning for multi-shot identity preservation

What Developers Are Building with Seedance 1.5 Pro

These are the real-world categories where native audio-video generation creates the most concrete value.

Short-Form Social Content at Scale

Generate TikTok and Reels-format content in 9:16 at volume. With character consistency and natural dialogue, content teams can produce multi-episode virtual creator series without actors or studios.

Multilingual Product Videos

Create the same product demo in English, Japanese, Mandarin, and Spanish from a single source image with native lip-sync in each language. One key visual becomes a market-ready cut for every target language, with no localization agency.

Film & TV Pre-Visualization

Generate storyboard animatics with camera movements and emotionally expressive characters to communicate director intent to crews before principal photography — faster and cheaper than traditional animatics.

E-Commerce Demonstration Videos

Animate product images into short cinematic demonstrations with voiceover and ambient environment sounds. A still photo of a coffee maker can become a 10-second atmospheric clip with steam, pouring sounds, and narration.

Interactive Narrative Apps & Games

Build interactive story experiences where player choices generate new video clips in real time, each with synchronized dialogue, effects, and music. Game cutscenes without a cutscene budget.

Micro-Drama & Advertising Production

Seedance 1.5 Pro's RLHF training specifically targeted advertising, micro-dramas, and narrative content. Short emotional arcs, dialogue-heavy scenes, and brand voice all come through with production coherence.

Multilingual Lip-Sync: Which Languages Does It Support?

The lip-sync capabilities of Seedance 1.5 Pro go deeper than simple language detection. The model was trained on phoneme-level data across each of these languages and dialects, meaning it doesn't just move lips — it moves the correct lip shapes for the actual sounds being made in that language's phonological system.

This is particularly visible in dialect support, where standard Mandarin and Sichuan dialect have genuinely different phoneme distributions. The model handles both distinctly, not as variants of the same thing.

Frequently Asked Questions

Does Seedance 1.5 Pro always generate audio, or can I get silent video?

Both modes are supported. Set audio: false in your request for a silent clip at the same quality. Audio generation is on by default since it's a core differentiator, but disabling it does not affect video quality and slightly reduces generation time.

How does the character consistency feature work in practice?

Pass a reference image in the image_url field alongside your prompt. The model uses this as an anchor for face, clothing, and style. Across multiple calls with the same reference image, character identity is preserved even when camera angles, lighting, and actions vary substantially between shots.
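In practice, "same reference image across calls" is just a loop that pins image_url while varying the prompt. A sketch, assuming hypothetical request fields (the FAQ names image_url; the other field names are illustrative):

```python
REFERENCE_IMAGE = "https://example.com/character_anchor.png"  # identity anchor

SHOT_PROMPTS = [
    "Medium shot: she reads the letter, hands trembling, soft window light",
    "Close-up: relief washes over her face, tears welling, golden hour",
    "Wide shot: she runs down the stairs, coat flaring, tracking shot",
]

def build_shot_requests(prompts, image_url):
    """One request per shot; the shared image_url keeps the character's
    face, clothing, and proportions consistent across the sequence."""
    return [{"model": "seedance-1-5-pro",
             "prompt": p,
             "image_url": image_url,   # identical anchor in every call
             "aspect_ratio": "16:9"} for p in prompts]

requests = build_shot_requests(SHOT_PROMPTS, REFERENCE_IMAGE)
```

Camera angle, lighting, and action vary per prompt; only the anchor stays fixed, which is what preserves identity shot to shot.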

What resolution should I target for social media vertical video?

For TikTok, Instagram Reels, and YouTube Shorts, use aspect_ratio: "9:16" at 720p for the best speed-to-quality tradeoff at scale, or 1080p for hero content where quality justifies the extra generation time. The 9:16 aspect ratio is natively supported — no cropping or letterboxing artifacts.
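That recommendation can be captured as a small preset helper, a sketch under the assumption that resolution and aspect_ratio are plain request fields (illustrative names, not the official schema):

```python
def vertical_preset(hero: bool = False) -> dict:
    """9:16 social preset: 720p for volume work, 1080p for hero content,
    per the speed-to-quality tradeoff described above."""
    return {
        "aspect_ratio": "9:16",
        "resolution": "1080p" if hero else "720p",
    }

print(vertical_preset())            # → {'aspect_ratio': '9:16', 'resolution': '720p'}
print(vertical_preset(hero=True))   # → {'aspect_ratio': '9:16', 'resolution': '1080p'}
```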

