Gemini Omni Flash – Artlist

Gemini Omni Flash is Google's multimodal video generator that turns text, images, audio, and video into short clips with physics-aware motion. It replaces Veo and specializes in video editing, though it can also generate videos from text, references, or both.

Omni is first and foremost a video editor: it excels at video-to-video generation.

Variants at a Glance

Gemini Omni Flash: Video generation model. Accepts text, image, and video reference inputs. Max resolution: 720p. Talking point: Accepts video and image as inputs.

Key Features

Audio & Output
- Native audio: Omni generates audio alongside the video, such as dialogue synced to lip movement and environmental sounds like footsteps.
Inputs
- Two input modes: Text-to-video and image-to-video (audio-to-video and reference-to-video are coming soon).
- Multimodal references: Combine up to 5 images and 1 video clip in one generation; reference them in the prompt as @Image1 or @Video1, or conversationally (for example, "use the first image as...").
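Before submitting, it can help to confirm that every tag in a prompt points at something you actually uploaded. The sketch below is a hypothetical pre-flight check, not part of any Artlist or Google API: it assumes tags follow the `@ImageN` / `@VideoN` pattern shown above, numbered from 1, and applies this page's limits of 5 images and 1 video.

```python
import re

# Limits taken from this page: up to 5 reference images and 1 reference video.
MAX_IMAGES = 5
MAX_VIDEOS = 1

# Matches in-prompt tags such as @Image1 or @Video1 (format assumed from the examples above).
TAG_PATTERN = re.compile(r"@(Image|Video)(\d+)")


def check_reference_tags(prompt: str, num_images: int, num_videos: int) -> list[str]:
    """Return a list of problems; an empty list means every tag has a matching upload."""
    problems = []
    if num_images > MAX_IMAGES:
        problems.append(f"{num_images} images uploaded; the limit is {MAX_IMAGES}")
    if num_videos > MAX_VIDEOS:
        problems.append(f"{num_videos} videos uploaded; the limit is {MAX_VIDEOS}")
    for kind, index_text in TAG_PATTERN.findall(prompt):
        index = int(index_text)
        available = num_images if kind == "Image" else num_videos
        # Tags are assumed to be 1-based, so @Image3 needs at least three uploaded images.
        if not 1 <= index <= available:
            problems.append(f"@{kind}{index} has no matching upload")
    return problems


prompt = "Restyle @Video1 so the jacket matches @Image1 and the lighting matches @Image3."
print(check_reference_tags(prompt, num_images=2, num_videos=1))
# ['@Image3 has no matching upload']
```

Catching a dangling tag locally is cheaper than discovering it after a paid generation, since the model may silently ignore or misread a reference that does not exist.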
Video Editing
- Director-level camera control: Switch to a handheld camera style, change camera positions and angles, and apply cinematic moves such as dolly zooms and tracking shots.
- Motion and transitions: Smooth cuts between clips and complex multi-shot scenes that keep subjects and lighting consistent.
Advanced Controls
- Negative prompts: Specify what you don't want in the output to refine quality.
- Seed control: Reproduce specific outputs for consistency across variations.

Technical Capabilities

Modalities: Text-to-video, Reference-to-video (image/video references)
Maximum Resolution: 720p
Aspect Ratios: 16:9, 9:16
Duration: 3–10 seconds per generation
Audio: Native audio generation included
Max Reference Images: Up to 5 images
Max Reference Videos: 1 video clip
Credit cost: 150 credits per second (audio included)
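The limits above translate directly into a pre-submission check and a cost estimate. This is a minimal sketch: `GenerationSettings`, `validate`, and `credit_cost` are illustrative names rather than a real Artlist API, and it assumes durations are billed in whole seconds at the listed 150 credits per second.

```python
from dataclasses import dataclass

# Limits copied from the Technical Capabilities list; all names here are hypothetical.
MIN_SECONDS, MAX_SECONDS = 3, 10
ASPECT_RATIOS = {"16:9", "9:16"}
CREDITS_PER_SECOND = 150  # audio is included in this rate


@dataclass
class GenerationSettings:
    duration_seconds: int  # assumed to be whole seconds
    aspect_ratio: str = "16:9"


def validate(settings: GenerationSettings) -> list[str]:
    """Return human-readable problems; an empty list means the settings fit the published limits."""
    problems = []
    if not MIN_SECONDS <= settings.duration_seconds <= MAX_SECONDS:
        problems.append(
            f"duration {settings.duration_seconds} s is outside {MIN_SECONDS}-{MAX_SECONDS} s"
        )
    if settings.aspect_ratio not in ASPECT_RATIOS:
        problems.append(f"aspect ratio {settings.aspect_ratio} is not supported")
    return problems


def credit_cost(settings: GenerationSettings) -> int:
    """Estimated credits: duration multiplied by the flat per-second rate."""
    return settings.duration_seconds * CREDITS_PER_SECOND


clip = GenerationSettings(duration_seconds=8, aspect_ratio="9:16")
print(validate(clip))     # []
print(credit_cost(clip))  # 1200
```

At this rate a maximum-length 10-second clip costs 1,500 credits and the 3-second minimum costs 450.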

Best Use Cases

Professional Video Production: Create cinematic video content with advanced camera control, multi-shot storytelling, and precise motion handling for commercial, advertising, and narrative projects.

Video Editing & Transformation: Edit existing footage to change style, lighting, or narrative while maintaining original motion and structure. Well suited to rapid iteration and creative variations.

Multi-Language Content: Generate videos with consistent characters and styling across multiple language variations using reference images and multimodal inputs.

Strengths

Physics-aware motion that feels natural and believable
Director-level camera control for cinematic results
Native audio with lip-sync and environmental sounds
Strong multi-shot generation with consistent subjects
Flexible multimodal reference system (up to 5 images + 1 video)
Excellent for video editing and transformation workflows

Limitations

Resolution capped at 720p — no 1080p or 4K output.
Features in development: Audio-to-video and reference-to-video inputs are not yet available.
Duration: Limited to 3–10 seconds per generation.
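Because each generation tops out at 10 seconds, a longer sequence has to be built from several clips. One way to plan the split, sketched below under the assumption that clip lengths are whole seconds, is to use the fewest clips possible and spread the seconds evenly so no clip falls below the 3-second minimum. `plan_clips` is a hypothetical helper, not a platform feature.

```python
import math

MIN_SECONDS, MAX_SECONDS = 3, 10  # per-generation limits from this page


def plan_clips(total_seconds: int) -> list[int]:
    """Split a sequence into the fewest clips that each last 3-10 seconds."""
    if total_seconds < MIN_SECONDS:
        raise ValueError(f"{total_seconds} s is below the {MIN_SECONDS}-second minimum")
    count = math.ceil(total_seconds / MAX_SECONDS)  # fewest clips that fit under the cap
    base, extra = divmod(total_seconds, count)
    # The first `extra` clips get one more second so the lengths add up exactly.
    return [base + 1] * extra + [base] * (count - extra)


print(plan_clips(25))  # [9, 8, 8]
print(plan_clips(11))  # [6, 5]
```

Once a sequence needs two or more clips, this even split never produces a clip shorter than 5 seconds, so the minimum only matters for sequences under 3 seconds. Keeping subjects consistent across the resulting clips still relies on reference images and a locked seed.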

Prompting Tips

Use references: Upload reference images to define character appearance and keep styling consistent.
Describe camera movement: Specify camera style (handheld, steady, tracking shot) for cinematic control.
Layer instructions: Combine text prompts with image references for maximum control and consistency.
Negative prompts: Use negative prompts to exclude unwanted elements and refine output quality.
Seed control: Lock the seed value to reproduce specific variations or maintain consistency across related clips.
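The tips above amount to keeping each part of a request separate so you can change one thing at a time. The sketch below models that with a hypothetical `PromptSpec` record (the field names are illustrative; Artlist's interface may label them differently): the text prompt is layered from action, camera direction, and reference instructions, while the negative prompt and seed travel alongside it.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class PromptSpec:
    """Hypothetical request record; field names are illustrative only."""
    action: str
    camera: str
    references: Tuple[str, ...] = ()
    negative_prompt: str = ""
    seed: Optional[int] = None  # a fixed value is what "locking the seed" refers to

    def prompt_text(self) -> str:
        """Layer the instructions into one prompt: action, then camera, then references."""
        parts = [self.action, f"Camera: {self.camera}."]
        parts.extend(self.references)
        return " ".join(parts)


base = PromptSpec(
    action="A courier cycles through a rainy night market.",
    camera="handheld tracking shot at shoulder height",
    references=("Use @Image1 for the courier's face and jacket.",),
    negative_prompt="text overlays, warped hands",
    seed=1234,
)

# Vary only the camera move; seed, references, and negative prompt stay locked.
variant = replace(base, camera="slow dolly zoom toward the courier")

print(base.prompt_text())
print(variant.prompt_text())
```

Holding every field but one constant makes it easier to attribute a difference between two clips to that field; how closely a repeated seed reproduces an earlier output depends on the service.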