The rapid development of artificial intelligence has enabled the production of realistic multimedia content, opening new possibilities in human-computer interaction. This study investigates a novel deep learning framework for text-to-video generation that combines Generative Adversarial Networks (GANs) with BERT-based natural language processing. The goal is to generate realistic short videos that faithfully capture the semantics of textual descriptions, seamlessly connecting language and video content.
The framework comprises two core components: a video generator and a discriminator. The generator combines text embeddings from a pre-trained BERT model with random noise to produce sequences of video frames, which are assembled into a coherent video while maintaining spatial and temporal consistency. The discriminator evaluates generated videos against real ones, driving the generator's outputs toward being indistinguishable from authentic footage.
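The two components above can be sketched in PyTorch as follows. This is a minimal illustrative sketch, not the study's actual architecture: the layer sizes, frame count, resolution, and the simple MLP heads are assumptions, and the 768-dimensional text input stands in for a pooled BERT embedding.

```python
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    """Maps a text embedding plus noise to a sequence of video frames."""
    def __init__(self, text_dim=768, noise_dim=100, frames=8, ch=3, size=32):
        super().__init__()
        self.frames, self.ch, self.size = frames, ch, size
        self.net = nn.Sequential(
            nn.Linear(text_dim + noise_dim, 512),
            nn.ReLU(),
            nn.Linear(512, frames * ch * size * size),
            nn.Tanh(),  # pixel values scaled to [-1, 1]
        )

    def forward(self, text_emb, noise):
        x = torch.cat([text_emb, noise], dim=1)  # condition on the text
        out = self.net(x)
        # Reshape the flat output into (batch, frames, channels, H, W)
        return out.view(-1, self.frames, self.ch, self.size, self.size)

class VideoDiscriminator(nn.Module):
    """Scores a frame sequence: high logit = real, low logit = generated."""
    def __init__(self, frames=8, ch=3, size=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(frames * ch * size * size, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, 1),  # one real/fake logit per video
        )

    def forward(self, video):
        return self.net(video)
```

In practice the generator and discriminator would use spatio-temporal convolutions rather than flat linear layers, but the conditioning pattern (text embedding concatenated with noise) is the same.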
To train and evaluate the model, this study used a publicly available short-video dataset from Kaggle, preprocessed into frame sequences. Data augmentation techniques were applied to enhance diversity, and adversarial learning was employed, enabling the generator and discriminator to iteratively improve through competition.
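One adversarial update of the kind described above can be sketched as a single training step. This is a generic GAN step under stated assumptions, not the study's exact procedure: the tiny linear networks, dimensions, learning rate, and standard binary cross-entropy objective are placeholders, and the random tensors stand in for BERT embeddings and real frame sequences.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dimensions: in the real framework, TEXT_DIM would be the BERT
# embedding size and VID_DIM a flattened frame sequence.
TEXT_DIM, NOISE_DIM, VID_DIM = 16, 8, 64

G = nn.Sequential(nn.Linear(TEXT_DIM + NOISE_DIM, 32), nn.ReLU(),
                  nn.Linear(32, VID_DIM), nn.Tanh())
D = nn.Sequential(nn.Linear(VID_DIM, 32), nn.LeakyReLU(0.2),
                  nn.Linear(32, 1))
optG = torch.optim.Adam(G.parameters(), lr=2e-4)
optD = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(text_emb, real_videos):
    b = real_videos.size(0)
    noise = torch.randn(b, NOISE_DIM)
    fake = G(torch.cat([text_emb, noise], dim=1))

    # Discriminator step: push real scores toward 1, fake toward 0.
    # detach() keeps this step from updating the generator.
    d_loss = (bce(D(real_videos), torch.ones(b, 1)) +
              bce(D(fake.detach()), torch.zeros(b, 1)))
    optD.zero_grad(); d_loss.backward(); optD.step()

    # Generator step: try to make fakes score as real.
    g_loss = bce(D(fake), torch.ones(b, 1))
    optG.zero_grad(); g_loss.backward(); optG.step()
    return d_loss.item(), g_loss.item()
```

Iterating this step over batches of (caption embedding, real video) pairs is the competition the abstract describes: each network's loss depends on the other's current parameters.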
Currently, the model does not yet effectively capture the complex relationships between textual semantics and visual elements, and further exploration is ongoing. The study aims to investigate potential applications in content creation, entertainment, and education, and explores scalability by extending the framework to longer videos and more complex textual inputs. To improve realism, future work will integrate more sophisticated loss functions, fine-tune the architecture, and optimize the generator for higher-resolution outputs.
By combining state-of-the-art language understanding and video synthesis, this research contributes to bridging the gap between text and video generation, paving the way for innovative AI-driven multimedia solutions.
Text-to-Video Generation with GANs and BERT: Synthesizing Realistic Videos from Natural Language Descriptions
Category
Student Abstract Submission