Learning To Decompose Visual Features With Latent Textual Prompts

Wang Feng, Li Manling, Lin Xudong, Lv Hairong, Schwing Alexander G., Ji Heng. Arxiv 2022

Model Architecture, Multimodal Models, Prompting, RAG, Security, Training Techniques

Recent advances in pre-training vision-language models like CLIP have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when text descriptions are inaccurate during retrieval-based inference (the challenge for the zero-shot protocol); or 2) a break in the well-established vision-language alignment (the challenge for linear probing). To address these challenges, we propose Decomposed Feature Prompting (DeFo). DeFo leverages a flexible number of learnable embeddings as textual input while maintaining the vision-language dual-model architecture, which enables the model to learn decomposed visual features with the help of feature-level textual prompts. We further use an additional linear layer to perform classification, allowing a scalable size of language inputs. Our empirical study shows DeFo’s significance in improving vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0%, and outperforming the state-of-the-art vision-language prompt tuning method by 7.6%.
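
The data flow described in the abstract can be illustrated with a minimal forward-pass sketch in NumPy. This is not the authors' implementation: the two encoders are hypothetical random projections standing in for frozen CLIP encoders, the mean pooling in the text stub is an assumption, and every size and name (`NUM_PROMPTS`, `defo_forward`, and so on) is invented for illustration. It only shows the shape of the idea: n learnable textual prompts are encoded by a frozen text encoder, the image feature is compared against each of them to give an n-dimensional decomposed feature, and a linear layer maps that feature to class logits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
IMAGE_DIM = 64     # flattened toy "image" size
TOKEN_DIM = 16     # text token embedding size
EMBED_DIM = 32     # shared vision-language embedding size
NUM_PROMPTS = 8    # n learnable textual prompts (a free hyperparameter)
PROMPT_LEN = 4     # tokens per prompt
NUM_CLASSES = 5

# Frozen stand-ins for the pretrained encoders (random projections, NOT CLIP).
W_IMAGE = rng.normal(size=(IMAGE_DIM, EMBED_DIM))
W_TEXT = rng.normal(size=(TOKEN_DIM, EMBED_DIM))


def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def encode_image(images):
    """(batch, IMAGE_DIM) -> (batch, EMBED_DIM), unit norm."""
    return l2_normalize(images @ W_IMAGE)


def encode_text(prompt_tokens):
    """(n, PROMPT_LEN, TOKEN_DIM) -> (n, EMBED_DIM), unit norm.

    Mean pooling over tokens is an assumption made to keep the stub short.
    """
    pooled = prompt_tokens.mean(axis=1)
    return l2_normalize(pooled @ W_TEXT)


# The only trainable parameters in this sketch:
# the latent textual prompts and the linear classification layer.
prompts = rng.normal(size=(NUM_PROMPTS, PROMPT_LEN, TOKEN_DIM))
W_cls = rng.normal(size=(NUM_PROMPTS, NUM_CLASSES)) * 0.1
b_cls = np.zeros(NUM_CLASSES)


def defo_forward(images):
    """Return (decomposed features, class logits) for a batch of images."""
    image_features = encode_image(images)        # (batch, EMBED_DIM)
    text_features = encode_text(prompts)         # (n, EMBED_DIM)
    decomposed = image_features @ text_features.T  # (batch, n) similarities
    logits = decomposed @ W_cls + b_cls          # (batch, NUM_CLASSES)
    return decomposed, logits


images = rng.normal(size=(3, IMAGE_DIM))
decomposed, logits = defo_forward(images)
print(decomposed.shape, logits.shape)  # (3, 8) (3, 5)
```

Because the classifier operates on the n similarity scores rather than on a fixed set of class-name prompts, the number of prompts in this sketch can be changed independently of the number of classes, which is one reading of the "scalable size of language inputs" mentioned above. In a training loop, only `prompts`, `W_cls`, and `b_cls` would receive gradient updates.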
