
Unified Questioner Transformer For Descriptive Question Generation In Goal-oriented Visual Dialogue

Matsumori Shoya, Shingyouchi Kosuke, Abe Yuki, Fukuchi Yosuke, Sugiura Komei, Imai Michita. arXiv 2021

[Paper]    
Agentic, Attention Mechanism, Model Architecture, Pretraining Methods, Reinforcement Learning, Transformer

Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision-and-language research. In particular, goal-oriented visual dialogue, in which the agent seeks information by asking questions during a turn-taking dialogue, has been gaining scholarly attention. While several models based on the GuessWhat?! dataset have been proposed, the Questioner typically asks simple category-based or absolute spatial questions. This is problematic in complex scenes where objects share attributes, or where descriptive questions are required to distinguish objects. In this paper, we propose a novel Questioner architecture, the Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions. In addition, we build a goal-oriented visual dialogue task called CLEVR Ask, which synthesizes complex scenes that require the Questioner to generate descriptive questions. We train our model on two variants of the CLEVR Ask dataset. The results of quantitative and qualitative evaluations show that UniQer outperforms the baseline.
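To make the task setup concrete, here is a minimal, hypothetical sketch of a goal-oriented visual dialogue loop in the style of GuessWhat?!/CLEVR Ask: a Questioner asks yes/no attribute questions, an Oracle (who knows the hidden target) answers truthfully, and the candidate set is filtered each turn. All names and the greedy question policy are illustrative assumptions; this is not the UniQer architecture or the paper's actual pipeline.

```python
# Toy turn-taking dialogue: Questioner vs. Oracle over a symbolic scene.
# Objects are attribute dicts; the Oracle knows which one is the target.

def oracle_answer(target, attribute, value):
    """Oracle answers truthfully about the hidden target object."""
    return target[attribute] == value

def questioner_loop(scene, target, attributes):
    """Greedy Questioner: ask about an attribute value, then filter the
    remaining candidates by the Oracle's yes/no answer."""
    candidates = list(scene)
    dialogue = []
    for attribute in attributes:
        if len(candidates) == 1:
            break  # target identified
        # Ask about the most common value among remaining candidates.
        values = [obj[attribute] for obj in candidates]
        value = max(set(values), key=values.count)
        answer = oracle_answer(target, attribute, value)
        dialogue.append((f"Is its {attribute} {value}?", answer))
        # Keep candidates consistent with the answer.
        candidates = [o for o in candidates if (o[attribute] == value) == answer]
    return candidates, dialogue

scene = [
    {"shape": "cube", "color": "red"},
    {"shape": "cube", "color": "blue"},
    {"shape": "sphere", "color": "red"},
]
final, dialogue = questioner_loop(scene, scene[1], ["shape", "color"])
```

Because the two cubes share a shape, shape-only questions cannot separate them; the loop must fall back on color, which mirrors the paper's motivation for descriptive questions in scenes where objects share attributes.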

Similar Work