[Paper]
[Code]
Although perception systems have made remarkable advancements in recent
years, they still rely on explicit human instruction or pre-defined categories
to identify the target objects before executing visual recognition tasks. Such
systems cannot actively reason and comprehend implicit user intention. In this
work, we propose a new segmentation task – reasoning segmentation. The task is
designed to output a segmentation mask given a complex and implicit query text.
Furthermore, we establish a benchmark comprising over one thousand
image-instruction-mask data samples, incorporating intricate reasoning and
world knowledge for evaluation purposes. Finally, we present LISA: large
Language Instructed Segmentation Assistant, which inherits the language
generation capabilities of multimodal Large Language Models (LLMs) while also
possessing the ability to produce segmentation masks. We expand the original
vocabulary with a