Published: June 21, 2026
Happy Horse 1.1 is Alibaba's video model, among the top performers on the Artificial Analysis arena. It keeps the single-pipeline architecture: text-to-video, image-to-video, and reference-to-video all run through one model, with joint audio-video synthesis in a single forward pass. Dialogue, ambient sound, music, and Foley are produced natively alongside the visuals, with multilingual lip-sync across seven languages. Native audio is its strongest suit, and what makes it a real competitor to Veo 3.1 and Seedance 2.0.
Key Features
Generation Modes
- Text-to-Video — Build a full scene with synchronized audio from a prompt alone.
- Image-to-Video — Animate a still as the first frame, adding motion and native sound.
- Reference-to-Video — Drive a scene from 1–9 reference images to keep characters and style consistent.
Audio
- Native joint audio — Dialogue, ambient, music, and Foley generated with the picture in one pass.
- Multilingual lip-sync — Phoneme-level lip-sync across seven languages, matched to the spoken audio.
Consistency & Motion
- Character consistency — Subjects referenced as character1…character9 stay on-model across shots.
- Coherent motion — Strong temporal coherence and prompt adherence across shots.
- Multi-shot in one prompt — Sequence several shots in one generation: lead each segment with a timecode range (e.g. 00-05, then 05-10), each with its own action.
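As a rough illustration of the timecode layout described above, the following Python sketch assembles a multi-shot prompt from per-shot durations. The helper name `build_multishot_prompt`, the whole-second durations, and the single space between the timecode range and the action are assumptions made for illustration. Only the `00-05`, `05-10` range pattern and the 3–15s clip length come from this page.

```python
def build_multishot_prompt(shots):
    """Build a multi-shot prompt from (duration_seconds, action) pairs.

    Hypothetical helper: the exact separator the model expects between
    the timecode range and the action is not specified on this page.
    """
    lines = []
    start = 0
    for duration, action in shots:
        if duration <= 0:
            raise ValueError("each shot needs a positive duration")
        end = start + duration
        # Two-digit, zero-padded seconds, e.g. "00-05" then "05-10".
        lines.append(f"{start:02d}-{end:02d} {action.strip()}")
        start = end
    # The whole clip has to fit the documented 3-15s range.
    if not 3 <= start <= 15:
        raise ValueError(f"total length {start}s is outside 3-15s")
    return "\n".join(lines)


prompt = build_multishot_prompt([
    (5, "Wide shot of a harbor at dawn, gulls calling."),
    (5, "Close-up of a fisherman coiling rope, waves lapping."),
])
print(prompt)
```

Each segment starts where the previous one ends, so the ranges stay contiguous and the total length is checked once at the end.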
Technical Capabilities
| Provider | Alibaba, served on fal.ai · commercial use OK |
| Resolution | 720p, 1080p |
| Duration | 3–15s (default 5) |
| Frame rate | 24 fps |
| Aspect ratio | T2V & R2V — 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, 9:21, 5:4, 4:5 (default 16:9); I2V inherits the source image |
| Lip-sync languages | English, Mandarin, Cantonese, Japanese, Korean, German, French |
| Reference images (R2V) | 1–9 · JPEG / PNG / WEBP |
| Input image (I2V) | JPEG / PNG / BMP / WEBP · aspect 1:2.5–2.5:1 |
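The limits in this table can be checked locally before a request is sent. The sketch below is one possible pre-flight check, not part of any official client: the function names, the short mode strings (`"t2v"`, `"i2v"`, `"r2v"`), and the argument names are hypothetical, while the allowed values are copied from the table above.

```python
from fractions import Fraction

# Values taken from the Technical Capabilities table.
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9", "9:21", "5:4", "4:5"}
MIN_SECONDS, MAX_SECONDS = 3, 15


def check_settings(mode, resolution, duration=5, aspect_ratio=None):
    """Return a list of problems; an empty list means the settings look valid."""
    problems = []
    if mode not in {"t2v", "i2v", "r2v"}:  # hypothetical mode labels
        problems.append(f"unknown mode: {mode}")
    if resolution not in RESOLUTIONS:
        problems.append(f"unsupported resolution: {resolution}")
    if not MIN_SECONDS <= duration <= MAX_SECONDS:
        problems.append(f"duration {duration}s is outside 3-15s")
    if mode == "i2v":
        # I2V inherits the aspect ratio of the source image.
        if aspect_ratio is not None:
            problems.append("i2v takes its aspect ratio from the source image")
    elif (aspect_ratio or "16:9") not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    return problems


def i2v_image_ok(width, height):
    """Check the I2V input constraint: aspect between 1:2.5 and 2.5:1."""
    ratio = Fraction(width, height)
    return Fraction(2, 5) <= ratio <= Fraction(5, 2)


print(check_settings("t2v", "1080p"))
print(check_settings("r2v", "4k", duration=20, aspect_ratio="2:1"))
print(i2v_image_ok(1920, 1080))
```

`Fraction` keeps the boundary comparison exact, so an image at precisely 1:2.5 passes without floating-point rounding deciding the result.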
Limitations
- Caps at 1080p — no native 4K. For 4K or heavy multi-character scenes, Seedance 2.0 or Kling 3.0 may fit better.
- Clips top out at 15s; longer sequences must be stitched from multiple generations.
- In image-to-video, lip-sync only activates when the source image shows a face.
- Camera and lighting control is prompt-driven — less granular than tools with explicit camera controls.
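Because clips top out at 15s, a longer sequence has to be planned as several generations and stitched afterwards. The following sketch shows one way to split a target length into clip durations that each fit the 3–15s range. `plan_clips` is a hypothetical helper, it assumes whole-second durations, and it covers planning only; the stitching itself is left to whatever editing tool is used.

```python
import math


def plan_clips(total_seconds, min_len=3, max_len=15):
    """Split a target length into the fewest clips of near-equal length.

    Hypothetical planning helper. Every returned duration falls inside
    [min_len, max_len] for the default 3-15s limits.
    """
    if total_seconds < min_len:
        raise ValueError(f"a sequence must be at least {min_len}s long")
    count = math.ceil(total_seconds / max_len)
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips carry one additional second each.
    return [base + 1] * extra + [base] * (count - extra)


print(plan_clips(40))  # [14, 13, 13]
print(plan_clips(16))  # [8, 8]
```

Spreading the length evenly avoids a very short final clip, which would otherwise fall below the 3s minimum for totals such as 16s.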
Prompting Tips
- Spell out motion, framing, lighting, and pacing — the model follows scene direction closely.
- Multi-shot: lead each segment with a timecode range (00-05, 05-10…), followed by that segment's own action.
- For talking-head shots, put the spoken lines in the prompt; audio and lip-sync match them.
- In reference-to-video, name subjects character1…character9 in upload order.
- Use clean, high-res reference images (single clear subject) to lock identity and wardrobe.
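The naming rule for reference-to-video can be made explicit with a small helper that assigns `character1`…`character9` in upload order. This is a sketch under assumptions: `label_references` and the file names are hypothetical, and the extension check is only a stand-in for the JPEG / PNG / WEBP requirement listed in the table.

```python
from pathlib import Path

# File extensions standing in for the JPEG / PNG / WEBP requirement.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def label_references(paths):
    """Map reference images to character1..characterN in upload order."""
    if not 1 <= len(paths) <= 9:
        raise ValueError("reference-to-video takes 1-9 reference images")
    labels = {}
    for index, path in enumerate(paths, start=1):
        if Path(path).suffix.lower() not in ALLOWED_SUFFIXES:
            raise ValueError(f"unsupported reference format: {path}")
        labels[f"character{index}"] = path
    return labels


refs = label_references(["hero.png", "sidekick.webp"])  # hypothetical files
prompt = "character1 hands a lantern to character2 in a rainy alley."
print(refs)
print(prompt)
```

Keeping the mapping next to the prompt makes it easy to confirm that each `characterN` mentioned in the text corresponds to the intended upload.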