Language Grounded Qformer For Efficient Vision Language Understanding

Choraria Moulik, Sekhar Nitesh, Wu Yue, Zhang Xu, Singhal Prateek, Varshney Lav R.. Arxiv 2023

[Paper]
Efficiency And Optimization Model Architecture Multimodal Models Pretraining Methods Training Techniques Transformer

Large-scale pretraining and instruction tuning have been successful for training general-purpose language models with broad competencies. However, extending to general-purpose vision-language models is challenging due to the distributional diversity in visual inputs. A recent line of work explores vision-language instruction tuning, taking inspiration from the Query Transformer (QFormer) approach proposed in BLIP-2 models for bridging frozen modalities. However, these approaches rely heavily on large-scale multi-modal pretraining for representation learning before eventual finetuning, incurring a huge computational overhead, poor scaling, and limited accessibility. To that end, we propose a more efficient method for QFormer-based vision-language alignment and demonstrate the effectiveness of our strategy compared to existing baselines in improving the efficiency of vision-language pretraining.

The Large Language Model Bible

Language Grounded Qformer For Efficient Vision Language Understanding

Choraria Moulik, Sekhar Nitesh, Wu Yue, Zhang Xu, Singhal Prateek, Varshney Lav R.. Arxiv 2023

Similar Work