BuboGPT:

    Enabling Visual Grounding in Multi-Modal LLMs


    Bytedance Inc.

    BuboGPT is an advanced Large Language Model (LLM) that incorporates multi-modal inputs including text, image and audio, with a unique ability to ground its responses to visual objects. It demonstrates remarkable chat abilities for arbitrary image-audio data understanding, whether aligned or unaligned.

    Bubo owls are well known for having strong vision and hearing abilities that help them thrive.

    Abstract

    LLMs have demonstrated remarkable abilities in interacting with humans through language, especially with the use of instruction-following data. Recent advancements in LLMs, such as MiniGPT-4, LLaVA, and X-LLM, further extend their abilities by incorporating multi-modal inputs, including image, video, and speech. Despite their effectiveness at generating precise and detailed language understanding of the given modality signal, these LLMs give up the ability to ground specific parts of the inputs, thus constructing only a coarse-grained mapping. However, an explicit and informative correspondence between text and other modalities will not only improve the user experience but also help to expand the application scenarios of multi-modal LLMs.

    1. BuboGPT Architecture. We build a multi-modal LLM, BuboGPT, for multi-modal understanding of image, audio and text by learning a common semantic space, and further explore the fine-grained relations between different visual objects and different modalities.
    2. Multi-Modal Instruct Data. We construct a high-quality multi-modal instruction-tuning dataset including fine-grained audio descriptions and cross-modal sound localization, and introduce both positive and negative image-audio pairs for semantic matching to facilitate cross-modal understanding (an illustrative sample format is sketched below).
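
    For concreteness, a hypothetical instruction-tuning sample with positive and negative audio-image pairing could look like the Python dicts below. The field names, file paths and wording are purely illustrative assumptions and do not reflect the released data format.

      # Illustrative only: keys, paths and wording are assumptions, not the released schema.
      positive_sample = {
          "image": "vggss/dog_barking_frame.jpg",   # image containing the sound source
          "audio": "vggss/dog_barking.wav",         # matching audio clip
          "instruction": "Which object in the image is producing the sound?",
          "response": "The barking sound comes from the dog in the image.",
      }

      negative_sample = {
          "image": "vggss/street_scene_frame.jpg",  # image unrelated to the audio
          "audio": "clotho/heavy_rain.wav",
          "instruction": "Is the sound in the audio produced by anything shown in the image?",
          "response": "No, the audio is rainfall, which does not match any object in the image.",
      }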

    BuboGPT Architecture

    As shown in the figure, we perform joint multi-modal understanding and chatting for text, vision and audio, achieved by learning a shared representation space that aligns well with the pre-trained Vicuna LLM. We also build an off-the-shelf visual grounding pipeline to explore the fine-grained relations between different visual objects and modalities.

    The framework of BuboGPT.
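
    To make the data flow concrete, the following is a minimal PyTorch-style sketch of how per-modality features could be projected into the LLM's embedding space and concatenated with the text tokens. The module names, dimensions and single-linear-layer projection are assumptions for illustration; only the overall wiring (modality encoder → Q-Former → linear projection → Vicuna) follows the description above.

      import torch
      import torch.nn as nn

      class ModalityBranch(nn.Module):
          """One modality branch: pre-trained encoder + Q-Former, plus a projection.

          `encoder` and `qformer` are stand-ins for pre-trained nn.Module components;
          the dimensions and the single linear projection are assumptions for this sketch.
          """
          def __init__(self, encoder, qformer, qformer_dim=768, llm_dim=4096):
              super().__init__()
              self.encoder = encoder                        # frozen modality encoder (image or audio)
              self.qformer = qformer                        # modality Q-Former
              self.proj = nn.Linear(qformer_dim, llm_dim)   # projection into Vicuna's embedding space

          def forward(self, x):
              feats = self.encoder(x)                 # raw modality features
              queries = self.qformer(feats)           # (batch, num_queries, qformer_dim)
              return self.proj(queries)               # tokens aligned with the LLM embedding space

      def build_llm_input(text_embeds, image_embeds=None, audio_embeds=None):
          """Concatenate projected modality tokens with text token embeddings for the frozen LLM."""
          parts = [e for e in (image_embeds, audio_embeds) if e is not None]
          parts.append(text_embeds)
          return torch.cat(parts, dim=1)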

    BuboGPT: Training Procedure

    BuboGPT connects the Q-Former of each modality to the pre-trained large language model Vicuna through a simple linear projection matrix. We consider a two-stage instruction-tuning procedure (a parameter-freezing sketch follows the list below):

    • Stage 1: Single-modal Pre-training. We train the corresponding modality Q-Former and linear projection layer on a large amount of modality-text paired data.
    • Stage 2: Multi-Modal Instruct Tuning. We curate a high-quality multi-modal instruction-following dataset to fine-tune only the linear projection layer:
      • Image-Text: We employ two previously published datasets from MiniGPT-4 and LLaVA for visual instruction tuning.
      • Audio-Text: We build a series of expressive and descriptive data based on the Clotho dataset to facilitate this process.
      • Audio-Image-Text: We build <audio, image, text> triples as a tri-modal instruction-tuning dataset based on the VGGSS dataset, and further introduce a negative set of unmatched pairs to enhance our model.
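
    A minimal sketch of how the two stages might differ in which parameters receive gradients, assuming the ModalityBranch sketch above. The stage-wise freezing pattern (Q-Former + projection in stage 1, projection only in stage 2) follows the description; the optimizer choice and learning rates are assumptions, not the reported settings.

      import torch

      def configure_stage(branch, stage):
          """Select trainable parameters for each training stage of one modality branch.

          Stage 1: the modality Q-Former and the linear projection train on
          modality-text paired data. Stage 2: only the linear projection is
          fine-tuned on the instruction-following data.
          """
          for p in branch.parameters():
              p.requires_grad = False
          if stage == 1:
              for p in branch.qformer.parameters():
                  p.requires_grad = True
          for p in branch.proj.parameters():
              p.requires_grad = True
          trainable = [p for p in branch.parameters() if p.requires_grad]
          # Learning rates below are illustrative assumptions.
          return torch.optim.AdamW(trainable, lr=1e-4 if stage == 1 else 3e-5)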


    Examples on Fine-grained Visual Understanding

    We first consider using a single image as input for fine-grained visual understanding with grounding. As the examples show, the model can accurately associate textual words or phrases with image regions across scenarios of varying complexity.
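
    As a rough illustration of how such word-to-region association could be realised with off-the-shelf components, the sketch below tags the image, grounds the candidate labels to boxes, and matches phrases in the response to those labels. `tagger`, `detector` and `matcher` are hypothetical callables standing in for generic components, not the exact modules used.

      def ground_response(image, response_text, tagger, detector, matcher):
          """Associate words or phrases in an LLM response with image regions.

          `tagger`, `detector` and `matcher` are hypothetical stand-ins for
          off-the-shelf components; the actual grounding pipeline may differ.
          """
          tags = tagger(image)                                  # candidate object labels in the image
          boxes = detector(image, tags)                         # dict: label -> bounding box
          matches = matcher(response_text, list(boxes.keys()))  # dict: response phrase -> label
          return {phrase: boxes[label] for phrase, label in matches.items()}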


    Examples on Audio Understanding

    When a single audio clip is provided for audio understanding, BuboGPT gives informative descriptions covering nearly all acoustic parts, even when some audio fragments are too short for humans to notice; see the examples for details.


    Examples on Aligned Audio-Image Understanding

    We show that BuboGPT can perform sound localization when a matched audio-image pair is provided, which serves as a strong example of aligned audio-image understanding; see the examples for details.


    Examples on Arbitrary Audio-Image Understanding

    BuboGPT can also tell whether an image and an audio clip are relevant to each other and generates high-quality responses for arbitrary audio-image understanding; see the examples for details.

    BibTeX

    
      @article{zhao2023bubogpt,
        author      = {Yang Zhao and Zhijie Lin and Daquan Zhou and Zilong Huang and Jiashi Feng and Bingyi Kang},
        title       = {BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs},
        journal     = {arXiv preprint arXiv:2307.08581},
        year        = {2023}
      }
      