Text-to-Video Generation Tools: Trends, Challenges, and Opportunities

July 2024

Video generation technology has undergone multiple evolutionary leaps. In the 2010s, convolutional neural networks (CNNs), a class of deep learning models, revolutionized the ability to classify and analyze images, with applications spreading across facial recognition, medical imaging, and more. In 2014, generative adversarial networks (GANs) emerged: another deep learning approach capable of creating realistic images, but limited to simpler tasks such as anime-style art, face swapping, and basic video analytics because the models lacked contextual understanding and gesture synchronization.

Despite these advances, the mass commercialization of AI video generation tools faced significant challenges. Applications were restricted to handling only a few images, were heavily reliant on ML engineers, and were plagued by lengthy model training cycles and substantial training data requirements. Early AI-driven video applications were also prohibitively expensive; for example, the visual effects for face-swapping Paul Walker in the “Fast and Furious” franchise cost around $50 million.

In late 2022, the landscape changed dramatically with the introduction of large language models (LLMs) that opened up a new market for cost-effective text-to-video generation capabilities. This enabled mass adoption through a natural language interface. Startups like Runway and Pika Labs brought sophisticated video generation to a broader audience. However, challenges like photo-realism, visual clarity, stability, video customization, long-form video creation, and prompt adherence remained.

On February 15, 2024, OpenAI shook the market with the launch of Sora, an advanced text-to-video generation tool. Sora outperformed its predecessors by supporting long-form video creation of up to two minutes and producing hyper-realistic videos with meticulous visual details. It also demonstrated an improved emotional and contextual understanding of prompts. While the industry eagerly anticipates Sora’s commercial launch, the market is rapidly evolving. Sora is just the beginning, with new solutions emerging to push the boundaries further, such as the following:

    • In June 2024, Runway launched Gen-3 Alpha, its latest text-to-video model. This model can generate videos with hyper-realistic human characters displaying diverse emotions, along with improved temporal consistency across video sequences.
    • In June 2024, Kuaishou Technology launched Kling, a text-to-video model that generates photorealistic 1080p videos at 30 frames per second and supports clips up to two minutes.
    • In May 2024, Google released Veo, with advanced text-to-video generation capabilities, including enhanced prompt adherence and support for longer video durations of up to one minute or more.

Will Video Generation Tools Survive in the Era of Multimodality?

With the recent unveiling of GPT-4o and its advanced multimodal input and output capabilities, enterprises face a critical question: is there still a need for a standalone generative AI video generation platform? The answer is a resounding yes, and here’s why:

    • Built-in, domain-specific customizations: Many platform vendors have tailored their text-to-video solutions with domain-specific data, allowing enterprises to generate videos for specialized use cases such as corporate communication, employee training, product marketing, customer onboarding, and sales pitches. For example, Uniform, a virtual workspace provider, used HeyGen’s video generation tool to create highly personalized video sales presentations for new clients, resulting in a 38% increase in sales and a 70% boost in employee productivity.
    • Prebuilt visual customizations: Text-to-video tools often have extensive libraries of millions of royalty-free videos, images, avatars, and templates. This variety lets users infuse different visual tones, nuances, and emotions into their videos on the fly. For instance, HENNGE utilized Synthesia’s text-to-video tool to generate investor results announcement videos with customized, realistic digital avatars, reducing total video creation time by 50%.
    • Competitive pricing model: Vendors such as Synthesia and Fliki offer competitive pricing for text-to-video tools at around USD 22 per month, which includes all prebuilt customizations, high-quality video outputs, and templates without extra costs for prompts. In comparison, ChatGPT’s Plus plan, which offers a multimodality feature, starts at USD 20 per month with limited input/output capacity and video library support, making dedicated text-to-video tools more cost-effective.
    • Comprehensive suite of low-code video editing features: Many platforms, including Pictory and Runway, offer a comprehensive suite of low-code video editing features, such as custom voice addition, motion tracking integration, video denoising, video frame adjustment, and video captioning. Furthermore, ChatGPT and ERNIE Bot offer basic video editing capabilities like background modification, color grading, and text overlays.
    • Alignment challenges in multimodal AI: Multimodal AI platforms, still in their infancy, face significant challenges, as highlighted in Avasant’s Research Byte on Harnessing Multimodal AI: Innovations and Applications. These platforms are trained on diverse modalities such as text, audio, images, and video, making data alignment difficult. Determining correlations among these varied data types is inherently complex; single-modality tools like text-to-video avoid evaluating cross-modal relationships and therefore produce fewer hallucinations.

So, despite the rise of multimodal AI like GPT-4o, standalone text-to-video generation platforms will continue to thrive. Their domain-specific customizations, extensive visual libraries, competitive pricing, higher accuracy levels, and comprehensive editing features remain indispensable for enterprises seeking high-quality, tailored video content.

Businesses Are Increasingly Experimenting with Video Generation Tools

Video generation tools like Sora and Mora (an open-source text-to-video generation platform) are driving a transformative shift in enterprise applications far beyond traditional movie production and scene generation. As these tools become more cost-effective and sophisticated, with enhanced semantic coherence and prebuilt customizations, their applications are rapidly expanding across industries.

Educational institutions are leveraging these tools to create more interactive and engaging videos, revolutionizing the learning experience. The film and television industry uses them to efficiently generate complex scenes, saving time and resources. Advertisers are crafting personalized advertisements with ease, all through simple text commands.


Figure 1: Experimentation of video generation tools across domains over the next 18–24 months

This technological advancement is not just an incremental improvement; it represents a dramatic leap forward in creating and consuming video content. Enterprises across sectors can now harness the power of text-to-video generation tools in unimaginable ways, leading to a new era of innovation and creativity.

Inherent Risks of Text-To-Video Tools and Enterprise Strategies for Mitigation

Although text-to-video platforms have garnered significant interest and are poised to become indispensable, they also bring inherent risks. Therefore, enterprises must establish robust governance and risk strategies to mitigate the following threats posed by these tools:

  1. Copyright Infringement
    • Unauthorized use: Text-to-video tools might inadvertently use copyrighted material without proper licensing, risking Digital Millennium Copyright Act (DMCA) takedown notices and litigation. Enterprises should ensure that models only access licensed or royalty-free content.
    • Content ownership: Determining rights for AI-generated content is complex due to unclear legal frameworks. Enterprises need clear guidelines from their AI governance teams to avoid disputes over content ownership.
  2. Model Bias and Ethical Implications
    • Biased outputs: AI models trained on biased data can produce discriminatory content, unfairly portraying individuals based on race or gender. Enterprises should conduct bias audits and implement mitigation techniques like resampling or fairness-aware algorithms.
    • Ethical concerns: Risks include leaking sensitive or personally identifiable information and distributing harmful content. Enterprises must ensure transparency in model training data and deploy a red team to monitor tool performance against regional legal and ethical guidelines before production deployment.

Enterprises must thoroughly examine a vendor’s data security and privacy measures when sourcing a text-to-video platform. For instance, OpenAI plans to embed Coalition for Content Provenance and Authenticity (C2PA) metadata watermarks in content generated through Sora, alongside an advanced content moderation system. These vendor-provided safeguards can mitigate initial risks but are not sufficient on their own. Security threats can escalate quickly; for example, less than 12 hours after its release in June 2024, Luma AI’s Dream Machine, a generative AI model capable of producing high-quality videos from text and image inputs, was jailbroken.
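To make the provenance idea behind C2PA watermarking concrete, here is a minimal sketch of tamper-evident content verification. This is an illustrative simplification, not the actual C2PA manifest format: real C2PA manifests are cryptographically signed and embedded in the media file, and the `generator` name and record structure below are hypothetical.

```python
import hashlib

def sign_asset(asset_bytes: bytes, generator: str) -> dict:
    """Bind a provenance claim to the exact bytes of an asset (simplified).

    A real C2PA manifest carries signed assertions inside the file itself;
    here we just record a content hash alongside the claimed generator.
    """
    return {
        "generator": generator,  # hypothetical tool name, for illustration
        "sha256": hashlib.sha256(asset_bytes).hexdigest(),
    }

def verify_asset(asset_bytes: bytes, record: dict) -> bool:
    """True only if the asset bytes still match the recorded hash."""
    return record.get("sha256") == hashlib.sha256(asset_bytes).hexdigest()

video = b"\x00\x01example-video-bytes"
record = sign_asset(video, "example-text-to-video-tool")

print(verify_asset(video, record))          # untampered asset passes
print(verify_asset(video + b"x", record))   # any edit breaks verification
```

The point of the sketch is the tamper-evidence property: any modification to the content invalidates the provenance record, which is why watermarking and provenance metadata help downstream consumers detect altered or unattributed AI-generated video.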

Thus, enterprises must establish a robust internal governance policy with explicit guardrails. This policy should ensure continuous monitoring of AI-generated content and verify alignment with company policies and global AI guidelines. By proactively addressing these risks and implementing comprehensive governance and risk management strategies, enterprises can harness the power of text-to-video tools while safeguarding their integrity and trust.

However, some external risks beyond an organization’s control can arise as the technology advances, especially as malicious actors gain access to sophisticated tools. One significant risk is misinformation through deepfakes, which undermines credibility. Advanced text-to-video tools can create high-quality deepfakes used to carry out scams, impersonate CEOs, and create fraudulent content. Enterprises should implement detection algorithms and use emerging tools such as Sentinel and Sensity to detect deepfakes. Legal and compliance issues may also arise, as deepfake videos can falsely implicate a company, leading to legal battles and regulatory scrutiny. To counter this, enterprises should adopt blockchain verification and digital watermarking, and train employees on deepfake risks, verification protocols, and the legal preparedness needed to address such content.


By Chandrika Dutt, Associate Research Director, Avasant, and Abhisekh Satapathy, Lead Analyst, Avasant