MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (arxiv.org)

Large language models have shown remarkable capabilities as a general interface for various language-related applications. Motivated by this, we aim to build a unified interface for completing many vision-language tasks, including image description, visual question answering, and visual grounding, among others. The...