Gemini Omni: Google's new multimodal model turns text, images, and audio into video

Announced by Sundar Pichai on the Mountain View stage, Gemini Omni is the model that reasons across all media. The first member of the family, Gemini Omni Flash, will arrive this summer on the Gemini app, YouTube Shorts, and Flow.

Published: 2026-05-19T19:30:00+02:00
Topic area: How AI infrastructure is evolving

Google has raised the bar on video generation. Gemini Omni is a new family of multimodal models that, in CEO Sundar Pichai's words, can "create anything from any input," starting with video. Combining text, images, audio, and existing clips, Omni doesn't simply stitch material together: it reasons across the inputs to produce a coherent output, with an explicit understanding of physics, culture, history, and science.

What changes versus previous video models

The novelty isn't that Omni generates video, but how it does so. According to TechCrunch, Omni introduces conversational editing: you can modify characters, backgrounds, and scene elements through voice commands, as if speaking to an editor. The model also supports photo editing via natural language, building on Nano Banana, the experimental tool Google had previously introduced.

Another feature extends the idea behind OpenAI Sora's Cameos: users will be able to create videos with their own digital avatar. Onboarding requires a specific procedure, recording yourself reading out a sequence of numbers, which is designed to discourage unauthorized deepfakes.

SynthID and traceability

All content generated by Omni will include Google's SynthID digital watermark, which allows users to verify whether a video was generated by a Gemini-family model. Google is extending the same approach through C2PA Content Credentials verification in the Gemini app, in Search, and in Chrome.

Availability

Gemini Omni Flash is the first model in the family and can render 10-second clips. A summer rollout is planned for the Gemini app, YouTube Shorts, and the Flow creative studio.
