Happy Horse 1.1 – Artlist

Published: June 21, 2026

Happy Horse 1.1 is Alibaba's video model and ranks among the top performers on the Artificial Analysis arena. It keeps the single-pipeline architecture: text-to-video, image-to-video, and reference-to-video all run through one model, with joint audio-video synthesis in a single forward pass. Dialogue, ambient sound, music, and Foley are produced natively alongside the visuals, with multilingual lip-sync across seven languages. Native, lip-synced audio is its strongest suit and what makes it a real competitor to Veo 3.1 and Seedance 2.0.

Key Features

Generation Modes

Text-to-Video — Build a full scene with synchronized audio from a prompt alone.
Image-to-Video — Animate a still as the first frame, adding motion and native sound.
Reference-to-Video — Drive a scene from 1–9 reference images to keep characters and style consistent.

Audio

Native joint audio — Dialogue, ambient, music, and Foley generated with the picture in one pass.
Multilingual lip-sync — Phoneme-level lip-sync across seven languages, matched to the spoken audio.

Consistency & Motion

Character consistency — Subjects referenced as character1…character9 stay on-model across shots.
Coherent motion — Strong temporal coherence and prompt adherence across shots.
Multi-shot in one prompt — Sequence several shots in one generation: lead each segment with a timecode range (e.g. 00-05, then 05-10), each with its own action.
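
As an illustration of the timecode convention described above, the following minimal Python sketch assembles a multi-shot prompt from a list of shot durations and actions. The helper name build_multishot_prompt, the one-segment-per-line layout, and the space between the timecode and the action are assumptions made for this example; the source only states that each segment is led by a timecode range such as 00-05.

```python
def build_multishot_prompt(shots, min_total=3, max_total=15):
    """Build a multi-shot prompt from (seconds, action) pairs.

    Hypothetical helper: the line layout is an assumption, only the
    leading "SS-SS" timecode range comes from the model description.
    """
    if not shots:
        raise ValueError("at least one shot is required")
    lines = []
    start = 0
    for seconds, action in shots:
        if seconds <= 0:
            raise ValueError("each shot needs a positive duration")
        end = start + seconds
        # Lead the segment with its timecode range, e.g. "00-05".
        lines.append(f"{start:02d}-{end:02d} {action.strip()}")
        start = end
    # The whole clip must fit the documented 3-15 second window.
    if not min_total <= start <= max_total:
        raise ValueError(f"total duration {start}s is outside {min_total}-{max_total}s")
    return "\n".join(lines)


prompt = build_multishot_prompt([
    (5, "A rider mounts a horse at dawn."),
    (5, "The horse gallops across an open field."),
])
print(prompt)
```

Computing each range from a running total keeps consecutive segments contiguous, so the second shot always starts where the first one ends.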

Technical Capabilities

Provider	Alibaba, served on fal.ai · commercial use OK
Resolution	720p, 1080p
Duration	3–15s (default 5)
Frame rate	24 fps
Aspect ratio	Text-to-video (T2V) and reference-to-video (R2V): 16:9, 9:16, 1:1, 4:3, 3:4, 21:9, 9:21, 5:4, 4:5 (default 16:9); image-to-video (I2V) inherits the source image's aspect ratio
Lip-sync languages	English, Mandarin, Cantonese, Japanese, Korean, German, French
Reference images (R2V)	1–9 · JPEG / PNG / WEBP
Input image (I2V)	JPEG / PNG / BMP / WEBP · aspect ratio between 1:2.5 and 2.5:1
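
The limits in the table above can be checked before a request is sent. The Python sketch below is a hypothetical pre-flight validator, not part of any official SDK: the function name check_request, the mode labels "t2v", "i2v", and "r2v", and the parameter names are all assumptions. It also assumes the I2V aspect limit refers to width:height.

```python
RESOLUTIONS = {"720p", "1080p"}
ASPECT_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4", "21:9", "9:21", "5:4", "4:5"}
MODES = {"t2v", "i2v", "r2v"}  # hypothetical labels for the three modes


def check_request(mode, resolution, duration=5, aspect_ratio="16:9",
                  num_references=0, image_size=None):
    """Return a list of problems; an empty list means the request fits the specs."""
    problems = []
    if mode not in MODES:
        problems.append(f"unknown mode: {mode}")
    if resolution not in RESOLUTIONS:
        problems.append(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if not 3 <= duration <= 15:
        problems.append("duration must be 3-15 seconds")
    if mode == "i2v":
        # I2V inherits the source image's aspect ratio, so check the image itself.
        if image_size is None or min(image_size) <= 0:
            problems.append("i2v needs a valid (width, height) for the source image")
        else:
            width, height = image_size
            if not 1 / 2.5 <= width / height <= 2.5:
                problems.append("source image aspect must be within 1:2.5 to 2.5:1")
    elif aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"unsupported aspect ratio: {aspect_ratio}")
    if mode == "r2v" and not 1 <= num_references <= 9:
        problems.append("r2v needs 1-9 reference images")
    return problems


print(check_request("t2v", "1080p"))
print(check_request("i2v", "1080p", image_size=(3000, 1000)))
```

Returning a list of problems instead of raising on the first one lets a caller report every violated limit at once.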

Limitations

Caps at 1080p — no native 4K. For 4K or heavy multi-character scenes, Seedance 2.0 or Kling 3.0 may fit better.
Clips top out at 15s; longer sequences must be stitched from multiple generations.
In image-to-video, lip-sync only activates when the source image shows a face.
Camera and lighting control is prompt-driven — less granular than tools with explicit camera controls.
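
Because clips top out at 15s, a longer sequence has to be planned as several generations. The following Python sketch shows one possible way to split a target runtime into clip lengths that each stay within the 3–15s window. The helper plan_clips is hypothetical, and it assumes whole-second durations and an even split, neither of which is stated in the source.

```python
import math


def plan_clips(total_seconds, min_len=3, max_len=15):
    """Split a target runtime into the fewest clips of near-equal length."""
    if total_seconds < min_len:
        raise ValueError(f"target must be at least {min_len} seconds")
    # Fewest generations that can cover the runtime at max_len each.
    count = math.ceil(total_seconds / max_len)
    # Spread the seconds evenly; the first `extra` clips get one more second.
    base, extra = divmod(total_seconds, count)
    return [base + 1 if i < extra else base for i in range(count)]


print(plan_clips(40))  # [14, 13, 13]
```

An even split avoids ending on a very short final clip, which a greedy "fill 15s first" plan could produce (for example 15 + 1 for a 16s target).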

Prompting Tips

Spell out motion, framing, lighting, and pacing — the model follows scene direction closely.
Multi-shot: lead each segment with a timecode range (00-05, 05-10…), each with its own prompt.
For talking-head shots, put the spoken lines in the prompt; audio and lip-sync match them.
In reference-to-video, name subjects character1…character9 in upload order.
Use clean, high-res reference images (single clear subject) to lock identity and wardrobe.
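
To keep the character1…character9 names aligned with upload order, the mapping can be generated rather than typed by hand. The Python sketch below is a hypothetical helper (label_references is not an official function) that assumes references are given as file names whose extensions reflect the JPEG, PNG, or WEBP formats listed for reference-to-video.

```python
ALLOWED_EXTENSIONS = (".jpg", ".jpeg", ".png", ".webp")  # assumed file-name suffixes


def label_references(image_paths):
    """Map reference images to character1..character9 in upload order."""
    if not 1 <= len(image_paths) <= 9:
        raise ValueError("reference-to-video takes 1-9 reference images")
    for path in image_paths:
        if not path.lower().endswith(ALLOWED_EXTENSIONS):
            raise ValueError(f"unsupported reference format: {path}")
    # Position in the list decides the name: first upload is character1.
    return {f"character{i}": path for i, path in enumerate(image_paths, start=1)}


labels = label_references(["rider.png", "horse.jpg"])
print(labels)
# A prompt can then refer to the subjects by those names, for example:
# "character1 leads character2 out of the stable."
```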