Omnihuman v1.5 – Artlist

ByteDance’s Omnihuman 1.5 AI Avatar model transforms static images into talking videos, prioritizing lip-sync precision over general motion capabilities. The model is audio-driven: it accepts any image and an audio track as inputs, and animates the image to match the speech.
It is built for avatar creation and video use cases where realistic speech animation matters more than general motion and camera movement.

Key Features

Image to Avatar: Converts any static image into a dynamic video where the subject appears to speak the provided audio.
Audio-Driven Lip Synchronization: Precisely matches mouth movements to the phonetics and timing of the input audio, ensuring natural speech animation.
Studio-Grade Output: Produces high-quality, realistic videos suitable for professional use, especially talking heads, product commercials, and UGC.
Continuous Camera Movement: Enables the generation of videos with highly dynamic motion and continuous camera movement, enhancing cinematic quality.
Rhythmic and Emotional Performances: Handles musical inputs and singing well.
Text Directions: Accepts a text prompt for directing the output, in addition to the audio input.

Technical Capabilities

Modalities: Image-and-audio-to-Video (I2V), optional text prompt
Resolution: 720p, 1080p
Durations: Up to 30 seconds
Asset Limits: Requires one image and one audio file
Language Support: All languages
Output Formats: MP4 video
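
The limits above can be checked before a job is submitted. The following is a minimal sketch in Python, not the actual Artlist or ByteDance API: the AvatarRequest class, its field names, and the validate_request function are hypothetical, and it assumes the output duration follows the length of the audio input.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits copied from the list above; every name here is hypothetical.
ALLOWED_RESOLUTIONS = {"720p", "1080p"}
MAX_DURATION_SECONDS = 30.0


@dataclass
class AvatarRequest:
    """Hypothetical request shape, not a real Artlist or ByteDance type."""
    image_paths: List[str]
    audio_paths: List[str]
    audio_duration_seconds: float  # assumed to determine the video length
    resolution: str = "720p"
    prompt: Optional[str] = None  # optional text direction


def validate_request(req: AvatarRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits the limits."""
    problems = []
    if len(req.image_paths) != 1:
        problems.append("exactly one image is required")
    if len(req.audio_paths) != 1:
        problems.append("exactly one audio file is required")
    if req.resolution not in ALLOWED_RESOLUTIONS:
        problems.append(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    if not 0 < req.audio_duration_seconds <= MAX_DURATION_SECONDS:
        problems.append("audio must be longer than 0 and at most 30 seconds")
    return problems
```

Calling validate_request on a request with one image, one 12-second audio file, and "1080p" returns an empty list; a request with two images or a 45-second audio track returns one message per violated limit.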

Limitations

Works with a single speaker only.