Published: June 24, 2025
Video Model
Gemini Omni Flash is Google's multimodal video generator that turns text, images, audio, and video into short clips with physics-aware motion.
It is meant to replace Veo: it specializes in video editing but can also generate videos from text, references, or both.
Omni is first and foremost a video editor, and it excels at video-to-video generation.
Variants at a Glance
UPDATE — June 2026
Not supported at launch:
- Audio References
- Video References (not to be confused with V2V)
- Last Frame
- Scene Extension
Note: All outputs carry an invisible SynthID watermark and C2PA metadata.
| Variant | Type | Modalities | Max Res. | Talking Points |
|---|---|---|---|---|
| Gemini Omni Flash | Video | Text / Image / Reference → Video | 720p | Accepts video and images as input |
Note: Audio and video references TBD
Key Features
Audio & Output
- Native audio: Omni generates audio alongside the video, such as dialogue synced to lip movement and environmental sounds like footsteps.
Inputs
- Two input modes: Text-to-video and image-to-video (note: audio-to-video and reference-to-video are coming soon).
- Multimodal references: Combine up to 5 images and 1 video clip in one generation; reference them in-prompt as @Image1 / @Video1 or conversationally as "use the first image as..."
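The @Image1 / @Video1 labels can be checked before a prompt is submitted. The sketch below is a hypothetical helper, not part of any official SDK: the function name and tag pattern are assumptions based on the labeling convention described above. It lists any tag in the prompt that points past the assets actually supplied.

```python
import re
from typing import List

# Hypothetical helper for illustration; not part of any official SDK.
# Matches tags such as @Image1 or @Video1 as described in this document.
TAG_PATTERN = re.compile(r"@(Image|Video)(\d+)")

def find_unmatched_tags(prompt: str, num_images: int, num_videos: int) -> List[str]:
    """Return tags in the prompt that have no matching supplied asset."""
    available = {"Image": num_images, "Video": num_videos}
    unmatched = []
    for kind, index in TAG_PATTERN.findall(prompt):
        n = int(index)
        # Tags are 1-based, so @Image1 refers to the first supplied image.
        if n < 1 or n > available[kind]:
            unmatched.append(f"@{kind}{n}")
    return unmatched

# One image and one video supplied, but the prompt mentions a second image.
print(find_unmatched_tags("Put the hat from @Image2 on the dancer in @Video1", 1, 1))
```

An empty list means every tag in the prompt resolves to a supplied asset.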
Video Edit
- Director-level camera control: Switch the camera style to handheld movement, or change the camera position and POV, all through the prompt.
- Sketch-to-Direct: The model excels at understanding annotations and sketch lines. You can draw a camera path, or cross out an object to tell the model to remove it from the video.
- Editing: Omni excels at editing existing video footage. You can tell the model to remove an object, replace a character, or change the lighting, among other edits.
- Camera Controls: The model excels at spatial understanding. For example, you can take an existing video and prompt the model to "rotate the camera so we see the subject from the left".
Technical Capabilities
| Attribute | Details |
|---|---|
| Modalities | Text-to-video, image-to-video |
| Duration | 3–10 seconds |
| Resolution / Quality | 720p |
| Aspect Ratios | 9:16, 16:9 |
| Audio | Native and synchronized: SFX, ambient sound, and lip-synced dialogue; on by default |
| Reference Inputs | Up to 5 reference images |
| Video to Video Inputs | 1 video up to 10 seconds |
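The limits in the table above can be checked on the client side before a request is sent. The following is a minimal sketch under stated assumptions: `GenerationSettings` and `validate` are hypothetical names that do not correspond to a real API, and the constants simply mirror the values listed in this document.

```python
from dataclasses import dataclass
from typing import List, Optional

# Limits copied from the Technical Capabilities table in this document.
MIN_SECONDS, MAX_SECONDS = 3, 10
ASPECT_RATIOS = ("9:16", "16:9")
MAX_REFERENCE_IMAGES = 5
MAX_INPUT_VIDEO_SECONDS = 10

@dataclass
class GenerationSettings:
    """Hypothetical container for request settings; not a real API type."""
    duration_seconds: float
    aspect_ratio: str
    reference_images: int = 0
    input_video_seconds: Optional[float] = None  # None means no video input

def validate(settings: GenerationSettings) -> List[str]:
    """Return a list of human-readable problems; empty if settings fit the limits."""
    problems = []
    if not MIN_SECONDS <= settings.duration_seconds <= MAX_SECONDS:
        problems.append(f"duration must be {MIN_SECONDS}-{MAX_SECONDS} seconds")
    if settings.aspect_ratio not in ASPECT_RATIOS:
        problems.append("aspect ratio must be one of " + ", ".join(ASPECT_RATIOS))
    if not 0 <= settings.reference_images <= MAX_REFERENCE_IMAGES:
        problems.append(f"at most {MAX_REFERENCE_IMAGES} reference images")
    if (settings.input_video_seconds is not None
            and settings.input_video_seconds > MAX_INPUT_VIDEO_SECONDS):
        problems.append(f"input video must be at most {MAX_INPUT_VIDEO_SECONDS} seconds")
    return problems

# A 5-second landscape clip with two reference images passes every check.
print(validate(GenerationSettings(5, "16:9", reference_images=2)))
```

Returning a list of problems rather than raising on the first one lets a caller report every out-of-range setting at once.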
Limitations
- The model is capped at 720p.
- Maximum clip length is 10 seconds.
Prompting Tips
- Be specific: describe camera movement, lighting, mood, and the exact actions you want.
- Label reference assets explicitly in the prompt: @Image1, @Video1 or "in the first image..."
- For edits, state both what to change and what to preserve.
- Start with 5-second generations to nail the style, then increase the duration.
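The tips on labeling assets and on stating what to change and what to preserve can be combined into a small prompt template. This is only an illustrative sketch: `build_edit_prompt` is a hypothetical helper, and the exact wording it produces is an assumption rather than a format the model requires.

```python
from typing import List

def build_edit_prompt(change: str, preserve: List[str], camera: str = "") -> str:
    """Assemble an edit prompt that names the change, what to keep, and the camera.

    Hypothetical helper for illustration; the phrasing is not an official format.
    """
    # Lead with the labeled input video and the requested change.
    parts = [f"Edit @Video1: {change.strip().rstrip('.')}."]
    if preserve:
        # Spell out what must stay the same so the edit stays targeted.
        parts.append("Keep unchanged: " + ", ".join(preserve) + ".")
    if camera:
        parts.append(f"Camera: {camera.strip().rstrip('.')}.")
    return " ".join(parts)

prompt = build_edit_prompt(
    "replace the red car with the bicycle from @Image1",
    ["lighting", "background"],
)
print(prompt)
# Edit @Video1: replace the red car with the bicycle from @Image1. Keep unchanged: lighting, background.
```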