Following in the footsteps of startups like Runway and tech giants such as Google and Meta, OpenAI is getting into video generation.
Today, OpenAI announced Sora, a generative AI model that creates video from text. Given a brief or detailed description, or even a still image, Sora can produce high-definition, movie-like scenes with multiple characters, different types of motion and background details, OpenAI claims.
Sora can also "extend" existing video clips, doing its best to fill in the missing details.
"Sora has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions," OpenAI writes in a blog post. "The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world."
OpenAI's demo page for Sora is full of grand claims, the statement above among them. But the carefully chosen samples from the model do look quite impressive, at least when compared to the other text-to-video technologies out there.
For a start, Sora can generate videos in a range of styles (e.g., photorealistic, animated, monochrome) up to a minute long, far longer than most text-to-video models. And these videos hold together reasonably well, mostly avoiding what I like to call "AI weirdness," where objects move in physically implausible ways.
Take, for instance, this virtual tour of an art gallery, entirely generated by Sora.
Or this animation of a flower blooming:
I will say that some of Sora's videos with a humanoid subject (a robot standing against a cityscape, for example, or a person walking down a snowy path) have a video game-y quality to them, perhaps because there's not a lot going on in the background. And AI weirdness creeps into plenty of other clips anyway, like cars driving in one direction and then abruptly reversing, or arms melting into a duvet cover.
OpenAI, for all its superlatives, acknowledges the model isn't perfect. It writes:
“[Sora] may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory.”
OpenAI’s very much positioning Sora as a research preview, revealing little about what data was used to train the model (short of ~10,000 hours of “high-quality” video) and refraining from making Sora generally available. Its rationale is the potential for abuse; OpenAI correctly points out that bad actors could misuse a model like Sora in myriad ways.
OpenAI says it’s working with experts to probe the model for exploits and building tools to detect whether a video was generated by Sora. The company also says that, should it choose to build the model into a public-facing product, it’ll ensure that provenance metadata is included in the generated outputs.
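OpenAI hasn't said what that provenance metadata will look like. Purely as an illustration of the general idea, here is a minimal, hypothetical Python sketch of a machine-readable manifest that records the generator and is cryptographically tied to the file it describes; the function name, fields and sidecar format are invented for this example, and real-world schemes such as C2PA content credentials are considerably more involved.

```python
# Hypothetical illustration only: OpenAI has not published Sora's provenance format.
# This sketch writes a JSON sidecar that binds a small manifest to a video file
# via a SHA-256 hash of the file's bytes.
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def write_provenance_sidecar(video_path: str, generator: str = "example-video-model") -> Path:
    """Write a <video>.provenance.json manifest next to the video file."""
    video = Path(video_path)
    digest = hashlib.sha256(video.read_bytes()).hexdigest()

    manifest = {
        "generator": generator,                              # which model/tool produced the asset
        "created_at": datetime.now(timezone.utc).isoformat(),
        "asset_sha256": digest,                               # ties the manifest to this exact file
        "synthetic": True,                                    # flags the content as AI-generated
    }

    sidecar = video.parent / (video.name + ".provenance.json")
    sidecar.write_text(json.dumps(manifest, indent=2))
    return sidecar


if __name__ == "__main__":
    # Example usage: produces clip.mp4.provenance.json alongside clip.mp4
    # write_provenance_sidecar("clip.mp4")
    pass
```

A sidecar file like this is only one possible design; metadata can also be embedded in the video container itself, and any serious scheme would sign the manifest so it can't simply be stripped or forged.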
“We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology,” OpenAI writes. “Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.”