
Language Grounded Qformer For Efficient Vision Language Understanding

Choraria Moulik, Sekhar Nitesh, Wu Yue, Zhang Xu, Singhal Prateek, Varshney Lav R. Arxiv 2023

[Paper]    
Efficiency And Optimization · Model Architecture · Multimodal Models · Pretraining Methods · Training Techniques · Transformer

Large-scale pretraining and instruction tuning have been successful in training general-purpose language models with broad competencies. However, extending them to general-purpose vision-language models is challenging due to the distributional diversity of visual inputs. A recent line of work explores vision-language instruction tuning, taking inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 for bridging frozen modalities. However, these approaches rely heavily on large-scale multimodal pretraining for representation learning before eventual finetuning, which incurs substantial computational overhead, scales poorly, and limits accessibility. To address this, we propose a more efficient method for QFormer-based vision-language alignment and demonstrate the effectiveness of our strategy compared to existing baselines in improving the efficiency of vision-language pretraining.
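
To make the QFormer idea concrete, the sketch below shows one common way such a bridge is structured: a small set of learned query tokens cross-attends to frozen image features and is projected into the input-embedding space of a frozen language model. This is a minimal illustration, not the paper's implementation; the module name `QFormerBridge`, the layer composition, and all dimensions (32 queries, 768-dim features, 4096-dim LLM embeddings) are assumptions chosen for the example.

```python
# Minimal sketch of a QFormer-style bridge between frozen modalities
# (illustration only; hyperparameters and structure are assumed, not taken
# from the paper). Only the queries and bridge layers are trainable.
import torch
import torch.nn as nn


class QFormerBridge(nn.Module):
    def __init__(self, num_queries=32, dim=768, num_heads=12, llm_dim=4096):
        super().__init__()
        # Learned query embeddings: the trainable "interface" tokens.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        # Projection into the frozen language model's embedding space.
        self.to_llm = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (batch, num_patches, dim) from a frozen vision encoder.
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = self.norm1(q + self.self_attn(q, q, q)[0])
        q = self.norm2(q + self.cross_attn(q, image_feats, image_feats)[0])
        q = self.norm3(q + self.ffn(q))
        # Returns (batch, num_queries, llm_dim): soft visual tokens for the LLM.
        return self.to_llm(q)


if __name__ == "__main__":
    bridge = QFormerBridge()
    feats = torch.randn(2, 257, 768)   # e.g. patch features from a frozen ViT
    print(bridge(feats).shape)         # torch.Size([2, 32, 4096])
```

In this setup the vision encoder and language model stay frozen, so the cost of alignment is concentrated in training the small bridge; the paper's contribution is a more efficient way to train such a bridge than the large-scale multimodal pretraining used by BLIP-2-style pipelines.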

Similar Work