Free-bloom: Zero-shot Text-to-video Generator With LLM Director And LDM Animator

Huang Hanzhuo, Feng Yufan, Shi Cheng, Xu Lan, Yu Jingyi, Yang Sibei. Arxiv 2023

Attention Mechanism, Merging, Model Architecture, Prompting, Tools, Training Techniques

Text-to-video is a rapidly growing research area that aims to generate a sequence of frames that is semantically, identically, and temporally coherent and that accurately aligns with the input text prompt. This study focuses on zero-shot text-to-video generation because it is data- and cost-efficient. To generate a semantically coherent video that richly portrays temporal semantics, such as the whole process of a flower blooming rather than a set of “moving images”, we propose a novel Free-Bloom pipeline that uses large language models (LLMs) as the director to generate a semantically coherent prompt sequence and pre-trained latent diffusion models (LDMs) as the animator to generate high-fidelity frames. Furthermore, to ensure temporal and identical coherence while maintaining semantic coherence, we propose a series of annotative modifications that adapt LDMs in the reverse process, including joint noise sampling, step-aware attention shift, and dual-path interpolation. Without requiring any video data or training, Free-Bloom generates vivid, high-quality videos and is particularly strong at generating complex scenes with semantically meaningful frame sequences. In addition, Free-Bloom is naturally compatible with LDM-based extensions.
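The abstract names joint noise sampling but does not give its formula. As a rough illustration only, the sketch below shows one common way to correlate the initial noise across frames: mix a single shared Gaussian sample with an independent per-frame sample so that every frame still has unit variance. The function name `joint_noise`, the `shared_weight` parameter, and the mixing rule are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def joint_noise(num_frames, latent_shape, shared_weight=0.8, seed=0):
    """Hypothetical sketch: correlated initial noise for a sequence of frames.

    Each frame receives sqrt(w) * shared + sqrt(1 - w) * individual, where
    both parts are standard normal. The result stays unit-variance, and the
    correlation between any two frames is approximately w.
    """
    if not 0.0 <= shared_weight <= 1.0:
        raise ValueError("shared_weight must lie in [0, 1]")

    rng = np.random.default_rng(seed)
    # One sample shared by all frames (assumed to carry the common content).
    shared = rng.standard_normal(latent_shape)
    # One independent sample per frame (assumed to allow frame-level variation).
    individual = rng.standard_normal((num_frames, *latent_shape))

    # Broadcasting repeats the shared sample along the frame axis.
    return (np.sqrt(shared_weight) * shared[None]
            + np.sqrt(1.0 - shared_weight) * individual)


# Example: four frames of a 4x64x64 latent, as used by many LDMs.
noise = joint_noise(num_frames=4, latent_shape=(4, 64, 64))
print(noise.shape)
```

Under this assumed rule, `shared_weight=1.0` gives every frame identical noise, while `shared_weight=0.0` gives fully independent noise per frame; values in between trade consistency across frames against per-frame variation.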
