Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models

    The Chinese University of Hong Kong

    Updates: Mini-Gemini is coming! We release the paper, code, data, models, and demo for Mini-Gemini.

    Abstract

    In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although advances in VLMs have enabled basic visual dialog and reasoning, a performance gap persists relative to advanced models such as GPT-4 and Gemini. We attempt to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects: high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance the visual tokens, we propose utilizing an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. Overall, Mini-Gemini further mines the potential of VLMs and empowers the current framework with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B parameters. It is demonstrated to achieve leading performance on several zero-shot benchmarks and even surpasses well-developed private models.



    Model

    The framework of Mini-Gemini is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidate regions; patch info mining conducts patch-level mining between the high-resolution regions and the low-resolution visual queries; and an LLM marries text with images for both comprehension and generation at the same time.
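
    The mining step can be pictured as a small, region-restricted cross-attention module. The sketch below is illustrative only, not the released implementation: it assumes PyTorch, and the module and parameter names (PatchInfoMining, to_q, to_k, to_v, proj) are our own. Each low-resolution token queries only the high-resolution patches covering its own region, so the number of visual tokens passed to the LLM stays unchanged.

        import math
        import torch
        import torch.nn as nn

        class PatchInfoMining(nn.Module):
            """Fuse high-resolution detail into low-resolution visual tokens
            via region-restricted cross-attention (illustrative sketch)."""
            def __init__(self, dim: int):
                super().__init__()
                self.to_q = nn.Linear(dim, dim)  # low-res tokens -> queries
                self.to_k = nn.Linear(dim, dim)  # high-res patches -> keys
                self.to_v = nn.Linear(dim, dim)  # high-res patches -> values
                self.proj = nn.Linear(dim, dim)

            def forward(self, low_res, high_res):
                # low_res:  (B, N, D)    one token per low-resolution patch
                # high_res: (B, N, M, D) M high-res patches per low-res region
                q = self.to_q(low_res).unsqueeze(2)            # (B, N, 1, D)
                k = self.to_k(high_res)                        # (B, N, M, D)
                v = self.to_v(high_res)                        # (B, N, M, D)
                attn = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
                mined = (attn.softmax(dim=-1) @ v).squeeze(2)  # (B, N, D)
                return low_res + self.proj(mined)  # residual keeps N tokens

        # Example: 576 visual tokens, each refined by a 4x4 grid of
        # high-resolution patches (shapes are hypothetical).
        pim = PatchInfoMining(dim=1024)
        fused = pim(torch.randn(2, 576, 1024), torch.randn(2, 576, 16, 1024))

    The residual connection is the point to note in this sketch: the output stays aligned with the original low-resolution token stream, so the LLM's input length and positional layout are untouched while each token gains high-resolution detail.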

    BibTeX

    
    @article{li2024minigemini,
      title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
      author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
      journal={arXiv preprint arXiv:2403.18814},
      year={2024}
    }
      

    Acknowledgement

    This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
