Pixlore: A Dataset-driven Approach To Rich Image Captioning

Bonilla Diego. Arxiv 2023

[Paper]
Fine Tuning GPT Model Architecture Multimodal Models Pretraining Methods RAG Training Techniques Transformer

In the domain of vision-language integration, generating detailed image captions poses a significant challenge due to the lack of a curated and rich dataset. This study introduces PixLore, a novel method that leverages Querying Transformers through the fine-tuning of the BLIP-2 model using the LoRa method on a standard commercial GPU. Our approach, which involves training on a carefully assembled dataset from state-of-the-art Computer Vision models combined and augmented by ChatGPT, addresses the question of whether intricate image understanding can be achieved with an ensemble of smaller-scale models. Comparative evaluations against major models such as GPT-4 and Google Bard demonstrate that PixLore-2.7B, despite having considerably fewer parameters, is rated higher than the existing State-of-the-Art models in over half of the assessments. This research not only presents a groundbreaking approach but also highlights the importance of well-curated datasets in enhancing the performance of smaller models.

The Large Language Model Bible

Pixlore: A Dataset-driven Approach To Rich Image Captioning

Bonilla Diego. Arxiv 2023

Similar Work