Seedance 2.0 — One Pager
Video Model
Seedance 2.0 is ByteDance's latest video model and a top contender alongside Veo 3.1 and Kling 3. Its main draw is that it generates video and synchronized audio together in a single pass — sound effects, ambient audio, and lip-synced dialogue, with no post-production layering. It's strong on realistic physics, multi-shot editing within a single generation, and director-level camera control. It takes text, image, and multi-reference (image / video / audio) inputs. It originally shipped in two tiers — a Standard tier for maximum quality and a Fast tier for lower-latency production work — and the family has since added Mini (lightweight) and 4K (high-resolution) variants.
Variants at a glance
UPDATE (June 2026): Two variants have been added to the Seedance 2.0 family: Mini (lightweight, speed/cost-optimized) and 4K (high-resolution output).

| Variant | Type | Modalities | Max Res. | Talking points |
|---|---|---|---|---|
| Seedance 2.0 | Video | Text / Image / Reference → Video | 4K | Max-quality tier; supports up to 4K; higher latency. |
| Seedance 2.0 Fast | Video | Text / Image / Reference → Video | 720p | Lower latency and cost for production workloads; capped at 720p. |
| Seedance 2.0 Mini | Video | Text / Image / Reference → Video | 720p | Lightweight tier for speed- and cost-sensitive work. Early access. |
| Seedance 2.0 4K | Video | Text / Image / Reference → Video | 4K | High-resolution variant; outputs up to 4K. Early access. |
Key Features
Audio & Output
- Native audio: Generates SFX, ambient sound, and lip-synced dialogue jointly with the video in a single pass — no post layering. Wrap dialogue in double quotes for lip-sync.
- Adaptive duration & aspect: Set duration or aspect ratio to “auto” and the model picks the optimal length and framing for the inputs.
Inputs
- Three input modes: Text-to-video; image-to-video (start frame plus optional end frame); and reference-to-video.
- Multimodal references: Combine up to 9 images, 3 video clips, and 3 audio files (12 files max) in one generation; reference them in-prompt as @Image1 / @Video1 / @Audio1.
Motion & Camera
- Director-level camera control: Dolly zooms, rack focuses, tracking shots, POV switches, and handheld movement, controlled via prompt.
- Realistic physics & motion: Collisions, cloth/fabric, fluid, and character motion; handles complex action like sports, dancing, and fight scenes.
- Multi-shot editing: Natural cuts and multiple shots within a single generation.
- Editing & extension: Provide a reference video and describe changes (replace an object, swap a background, alter style), or describe what happens next to extend the clip while keeping characters and style consistent.
Technical Capabilities
| Spec | Details |
|---|---|
| Modalities | Text-to-video, image-to-video, reference-to-video (images + video + audio) |
| Duration | “auto”, or 4–15 seconds |
| Resolution / quality | 480p / 720p / 1080p (Standard up to 1080p; Fast capped at 720p); the new 4K variant outputs up to 4K (early access). Note: the 4K variant uses H.265/HEVC encoding and outputs 10-bit depth directly at 4K resolution. |
| Aspect ratios | auto, 21:9, 16:9, 4:3, 1:1, 3:4, 9:16 |
| Audio | Native, synchronized: SFX, ambient, and lip-synced dialogue; on by default |
| Reference inputs | Up to 9 images, 3 videos, and 3 audio files (12 total) |
| Input image formats | jpg, jpeg, png, webp, gif, avif |
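The duration and aspect-ratio values listed under Technical Capabilities can be checked client-side before a job is submitted. The following is a minimal sketch in Python; the helper `validate_settings` and its parameter names are hypothetical and do not come from any official SDK, and accepting fractional durations is an assumption.

```python
# Values copied from the Technical Capabilities table.
ASPECT_RATIOS = {"auto", "21:9", "16:9", "4:3", "1:1", "3:4", "9:16"}
MIN_SECONDS, MAX_SECONDS = 4, 15


def validate_settings(duration, aspect_ratio):
    """Hypothetical helper: return a list of problems (empty means OK)."""
    problems = []
    # Duration is either the literal string "auto" or a number of seconds.
    if duration != "auto":
        is_number = isinstance(duration, (int, float))
        if not is_number or not (MIN_SECONDS <= duration <= MAX_SECONDS):
            problems.append(
                f"duration must be 'auto' or {MIN_SECONDS}-{MAX_SECONDS} seconds"
            )
    if aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio!r}")
    return problems
```

For example, `validate_settings("auto", "16:9")` returns an empty list, while `validate_settings(20, "2:1")` reports both the duration and the aspect ratio.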
Limitations
- Fast tier is capped at 720p — 1080p output requires the Standard tier.
- Maximum clip length is 15 seconds.
- Reference-to-video constraints: reference videos must total 2–15 s, at 480p–720p each; reference audio must total ≤ 15 s and requires at least one image or video reference.
- Only Seedance 2.0 supports 4K resolution output; Seedance 2.0 Fast and Seedance 2.0 Mini do not.
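The reference-input limits (file counts, total durations, and the rule that audio needs a visual reference) are easy to get wrong when assembling a request by hand. The sketch below shows one way to check them up front. The `References` class and `check_references` function are hypothetical names for illustration, not part of any Seedance API, and the 480p–720p resolution rule for reference videos is not checked here.

```python
from dataclasses import dataclass, field
from typing import List

# Limits taken from the Reference inputs row and the Limitations list.
MAX_IMAGES, MAX_VIDEOS, MAX_AUDIO, MAX_FILES = 9, 3, 3, 12
MIN_VIDEO_TOTAL_S, MAX_VIDEO_TOTAL_S, MAX_AUDIO_TOTAL_S = 2, 15, 15


@dataclass
class References:
    """Hypothetical container: durations are in seconds, one entry per file."""
    image_count: int = 0
    video_seconds: List[float] = field(default_factory=list)
    audio_seconds: List[float] = field(default_factory=list)


def check_references(refs: References) -> List[str]:
    """Return a list of violated constraints (empty means OK)."""
    problems = []
    n_video, n_audio = len(refs.video_seconds), len(refs.audio_seconds)
    if refs.image_count > MAX_IMAGES:
        problems.append(f"at most {MAX_IMAGES} reference images")
    if n_video > MAX_VIDEOS:
        problems.append(f"at most {MAX_VIDEOS} reference videos")
    if n_audio > MAX_AUDIO:
        problems.append(f"at most {MAX_AUDIO} reference audio files")
    # The per-type maxima add up to 15, so the 12-file cap is a separate check.
    if refs.image_count + n_video + n_audio > MAX_FILES:
        problems.append(f"at most {MAX_FILES} reference files in total")
    if n_video and not (MIN_VIDEO_TOTAL_S <= sum(refs.video_seconds) <= MAX_VIDEO_TOTAL_S):
        problems.append("reference videos must total 2-15 s")
    if n_audio and sum(refs.audio_seconds) > MAX_AUDIO_TOTAL_S:
        problems.append("reference audio must total at most 15 s")
    if n_audio and refs.image_count == 0 and n_video == 0:
        problems.append("reference audio requires at least one image or video")
    return problems
```

A request with audio only, such as `check_references(References(audio_seconds=[5.0]))`, returns a single problem: the missing image or video reference.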
Prompting Tips
- Be specific — describe camera movement, lighting, mood, and the exact actions you want.
- Wrap spoken lines in double quotes for lip-synced speech (e.g., he says: “Remember this moment.”).
- Label reference assets explicitly in the prompt: @Image1, @Video1, @Audio1.
- For edits, state both what to change and what to preserve.
- Start with 5-second generations to nail the style, then increase the duration.
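The tips on quoting dialogue and labeling references can be applied mechanically when prompts are generated from structured data. The sketch below is one possible approach. The `build_prompt` function is hypothetical, and the "Use @Image1 as ..." phrasing is an assumed convention; only the `@Image1` / `@Video1` / `@Audio1` tags and the double-quoted dialogue come from the tips above.

```python
def build_prompt(scene, spoken_lines=None, reference_roles=None):
    """Hypothetical helper that assembles a prompt string.

    scene: free-text description (camera, lighting, mood, actions).
    spoken_lines: list of (speaker, line) pairs; each line is wrapped in
        double quotes so it is treated as lip-synced speech.
    reference_roles: dict mapping a tag such as "Image1" to its role.
    """
    parts = [scene.strip()]
    for tag, role in (reference_roles or {}).items():
        # Label each reference asset explicitly with its @-tag.
        parts.append(f"Use @{tag} as {role}.")
    for speaker, line in (spoken_lines or []):
        parts.append(f'{speaker} says: "{line}"')
    return " ".join(parts)


prompt = build_prompt(
    "Slow dolly-in on a rainy street at night, neon reflections.",
    spoken_lines=[("He", "Remember this moment.")],
    reference_roles={"Image1": "the main character"},
)
print(prompt)
```

Keeping the scene description, reference labels, and dialogue as separate inputs makes it easier to iterate on one of them (for example, the camera direction) while holding the others fixed.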