Mini-Gemini:

    Mining the Potential of Multi-modality Vision Language Models

    The Chinese University of Hong Kong

    Updates: Mini-Gemini is coming! We release the paper, code, data, models, and demo for Mini-Gemini.

    Abstract

    In this work, we introduce Mini-Gemini, a simple and effective framework for enhancing multi-modality Vision Language Models (VLMs). Although advances in VLMs have enabled basic visual dialog and reasoning, a performance gap persists relative to advanced models such as GPT-4 and Gemini. We try to narrow this gap by mining the potential of VLMs for better performance and an any-to-any workflow from three aspects, i.e., high-resolution visual tokens, high-quality data, and VLM-guided generation. To enhance the visual tokens, we propose utilizing an additional visual encoder for high-resolution refinement without increasing the visual token count. We further construct a high-quality dataset that promotes precise image comprehension and reasoning-based generation, expanding the operational scope of current VLMs. In general, Mini-Gemini further mines the potential of VLMs and empowers the current framework with image understanding, reasoning, and generation simultaneously. Mini-Gemini supports a series of dense and MoE Large Language Models (LLMs) from 2B to 34B parameters. It achieves leading performance on several zero-shot benchmarks and even surpasses well-developed private models.



    Model

    The framework of Mini-Gemini is conceptually simple: dual vision encoders provide low-resolution visual embeddings and high-resolution candidates; patch info mining conducts patch-level mining between high-resolution regions and low-resolution visual queries; and the LLM marries text with images for comprehension and generation at the same time.
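    The patch info mining step can be sketched as a cross-attention in which each low-resolution visual token queries its corresponding high-resolution patch candidates, so that fine detail is absorbed without increasing the token count. The following is a minimal NumPy sketch under assumed shapes; the function name, the omission of learned projections, and the residual connection are illustrative assumptions, not the paper's exact implementation.

    ```python
    import numpy as np

    def softmax(x, axis=-1):
        # Numerically stable softmax over the given axis.
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def patch_info_mining(lr_tokens, hr_patches):
        """Hypothetical sketch of patch-level mining.

        lr_tokens:  (N, D)    low-resolution visual queries
        hr_patches: (N, M, D) M high-resolution candidates per low-res token
        Returns:    (N, D)    refined tokens -- same count N as the input
        """
        d = lr_tokens.shape[-1]
        q = lr_tokens[:, None, :]                        # (N, 1, D)
        scores = q @ hr_patches.transpose(0, 2, 1)       # (N, 1, M)
        attn = softmax(scores / np.sqrt(d))              # attend over M candidates
        mined = (attn @ hr_patches)[:, 0, :]             # (N, D)
        # Residual refinement: token count is unchanged, only content is enriched.
        return lr_tokens + mined
    ```

    The key property the sketch preserves is that the output has exactly as many tokens as the low-resolution input, which is what keeps the LLM's sequence length (and cost) fixed while still exposing high-resolution detail.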

    BibTeX

    
    @article{li2024minigemini,
      title={Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models},
      author={Li, Yanwei and Zhang, Yuechen and Wang, Chengyao and Zhong, Zhisheng and Chen, Yixin and Chu, Ruihang and Liu, Shaoteng and Jia, Jiaya},
      journal={arXiv preprint arXiv:2403.18814},
      year={2024}
    }
      

    Acknowledgement

    This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

    Examples








