MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective

1Beihang University,  2Alibaba Group,  3Tsinghua University
*Work done during an internship at Alibaba Group · Project Leader · Corresponding Author

Abstract

Large Multimodal Models (LMMs) have demonstrated remarkable capabilities. While existing benchmarks for evaluating LMMs mainly focus on image comprehension, few works evaluate them from the image generation perspective. To address this issue, we propose a straightforward automated evaluation pipeline. Specifically, the pipeline requires an LMM to generate an image-prompt from a given input image. A text-to-image generative model then creates a new image based on the generated prompt. Finally, we evaluate the performance of the LMM by comparing the original image with the generated one. Furthermore, we introduce MMGenBench-Test, a comprehensive benchmark developed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, which targets the performance of LMMs within the generative image domain. A thorough evaluation of over 50 popular LMMs demonstrates the effectiveness and reliability of both the pipeline and the benchmark. Our observations indicate that numerous LMMs excelling on existing benchmarks fail to adequately complete basic tasks related to image understanding and description. This finding highlights the substantial potential for performance improvement in current LMMs and suggests avenues for future model optimization. Concurrently, our pipeline enables efficient assessment of LMM performance across diverse domains using only image inputs.

🔥Highlights

  • The proposed MMGenBench is the first automated pipeline designed to evaluate the image understanding and description capabilities of LMMs using only images. It leverages text-to-image models and image representation models for automated evaluation, markedly reducing human involvement and improving the efficiency and objectivity of the evaluation procedure.
  • We developed MMGenBench-Test, a comprehensive benchmark designed to evaluate LMMs across 13 image patterns, and MMGenBench-Domain, which assesses the performance of LMMs in the generative image domain.
  • Our study includes a broad evaluation of over 50 popular LMMs, providing critical insights into their capabilities and limitations in basic image understanding and description tasks.


MMGenBench Pipeline Overview

The understanding and generation of images remain largely separate fields, with the strongest models in each domain adhering to distinct paradigms. For instance, GPT-4, grounded in the next-token-prediction paradigm, exhibits strong image comprehension, while Flux has achieved noteworthy success in text-to-image synthesis by leveraging diffusion models. This divergence underscores the difficulty of unifying image understanding and synthesis, as state-of-the-art techniques continue to evolve along separate trajectories. Furthermore, LMMs are extensively employed to generate training data for generative models. Notably, LMMs excel at image-to-text tasks, while diffusion models are particularly effective at text-to-image tasks. A robust understanding of an image implies that an LMM can distill its essential information into a text prompt, which a text-to-image model can then use to reconstruct the scene to a certain extent. This process can be viewed as a form of "compression".
Hence, it is both reasonable and meaningful to evaluate the performance of LMMs using diffusion models. Our work aims to bridge this gap by providing a comprehensive evaluation pipeline.
We propose MMGenBench-Test, a comprehensive benchmark designed to evaluate LMMs across 13 distinct image patterns, and MMGenBench-Domain, which focuses on assessing LMM performance within the generative image domain. To achieve this, we introduce a pipeline that first has LMMs generate image-prompts from input images, then employs text-to-image generative models to create new images from those prompts. Finally, we use an image representation model to obtain embeddings of the original and generated images and perform post-processing to assess the performance of LMMs in image understanding and description. The proposed pipeline, built on text-to-image generative models and image representation models, consists of three components: image-prompt generation, new image generation, and quantitative metric computation.
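As a concrete illustration, the sketch below outlines the three stages in Python. The lmm.chat interface and the instruction text are hypothetical placeholders, and the diffusers Stable Diffusion pipeline together with a CLIP image encoder from transformers stand in for whichever text-to-image and image representation models are actually used.

# A minimal sketch of the three-stage pipeline, assuming a generic LMM
# exposing a chat(image, text) method (hypothetical), a Stable Diffusion
# text-to-image model from diffusers, and a CLIP image encoder from
# transformers as the image representation model.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPImageProcessor

INSTRUCTION = ("Describe this image with one detailed prompt that a "
               "text-to-image model could use to reproduce it.")

def generate_image_prompt(lmm, image: Image.Image) -> str:
    # Stage 1: the LMM distills the input image into an image-prompt.
    return lmm.chat(image=image, text=INSTRUCTION)

def generate_new_image(t2i: StableDiffusionPipeline, prompt: str) -> Image.Image:
    # Stage 2: a text-to-image model reconstructs an image from the prompt.
    return t2i(prompt).images[0]

@torch.no_grad()
def sim_score(encoder: CLIPModel, processor: CLIPImageProcessor,
              original: Image.Image, generated: Image.Image) -> float:
    # Stage 3: cosine similarity between the two image embeddings.
    inputs = processor(images=[original, generated], return_tensors="pt")
    feats = encoder.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float(feats[0] @ feats[1])

In practice, the per-image similarities are aggregated over the whole benchmark during post-processing to obtain the metrics reported in the results section.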

MMGenBench Construction & Statistics

To effectively measure the understanding and description capabilities of LMMs across various types of images, we constructed MMGenBench-Test, a high-quality test set covering 13 image patterns, from the JourneyDB test set. We propose a multi-stage method for extracting and annotating image patterns, as illustrated in the accompanying figure. To ensure accuracy, we manually double-checked the image patterns and performed the final annotations. In addition, we constructed a dataset in the image-generation domain, termed MMGenBench-Domain, to evaluate the ability of LMMs to understand and describe generated images. It is important to emphasize that our proposed pipeline can measure the ability of LMMs to understand and describe images in any domain: by using images from a particular domain, we can easily assess the performance of LMMs specific to that domain.


The MMGenBench-Test set contains 1,284 high-quality images spanning 13 distinct image patterns. The distribution of images per pattern is shown in the figure on the left, which illustrates that each pattern contains at least 114 images. Note that an image may exhibit multiple patterns; for instance, the first annotated image in the previous figure carries four patterns: "Surreal", "Natural", "Artistic" and "Color". To construct MMGenBench-Domain, we randomly sampled 10,000 images from the JourneyDB validation set. With the proposed pipeline, we can evaluate the image understanding and description performance of LMMs within this domain without requiring any additional annotation.
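As a small illustration of this flexibility, the sketch below assembles a domain-specific test set by randomly sampling images from a local folder. The folder name, file pattern, default sample size, and output file are hypothetical choices for illustration only.

# Hypothetical sketch: build a domain benchmark by sampling images
# from any directory of domain-specific images.
import json
import random
from pathlib import Path

def build_domain_benchmark(image_dir: str, num_samples: int = 10_000,
                           seed: int = 0) -> list[dict]:
    # Collect candidate images and draw a reproducible random sample.
    paths = sorted(Path(image_dir).glob("*.jpg"))
    random.seed(seed)
    sampled = random.sample(paths, min(num_samples, len(paths)))
    return [{"image_path": str(p)} for p in sampled]

if __name__ == "__main__":
    # Example: build a domain set from a local folder of JourneyDB images.
    records = build_domain_benchmark("journeydb_val_images")
    with open("mmgenbench_domain.jsonl", "w") as f:
        f.write("\n".join(json.dumps(r) for r in records))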

Results on MMGenBench

The experimental evaluation of advanced LMMs on the MMGenBench-Test reveals that their SIM-Scores remain suboptimal, all falling below 0.600, with GPT-4o scoring 0.566 and the open-source InternVL2-76B achieving a slightly higher 0.599. Notably, there is no straightforward correlation between model size and performance, underscoring the critical roles of training data quality and training methodologies in enhancing image understanding and descriptive capabilities. LMMs that perform well on existing benchmarks, such as LLaVA-OV, underperform on MMGenBench-Test, highlighting the benchmark's unique challenges.
Further analysis on MMGenBench-Domain shows SIM-Scores consistent with MMGenBench-Test, but its larger image set introduces variation in FID-Scores, leading us to recommend SIM-Score as the primary evaluation metric. When dissecting image patterns, LMMs perform robustly on categories such as "Artistic" and "Color" but struggle with "Contextual" and "Motion", indicating proficiency in coarse-grained but not fine-grained image understanding. Investigations into model scalability reveal that increasing parameters within a model series, such as Qwen2-VL and InternVL2, enhances performance, and improvements in training protocols and data further boost scores, as seen with Ovis1.6 outperforming Ovis1.5. However, three prevalent issues emerge: first, some LMMs fail to adhere strictly to the instruction prompt, regardless of their size; second, many LMMs cannot generate detailed image-prompts because they were trained on datasets with short captions, limiting their descriptive depth; and third, task-specific training leads to overfitting, as exemplified by xGen-MM's disproportionate emphasis on "safety" even when it is irrelevant.
These findings collectively suggest that while scaling and improved training can enhance LMM performance, ensuring strict adherence to instructions, fostering detailed descriptive abilities, and preventing overfitting are essential for developing more robust and versatile LMMs.
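For reference, the two metrics discussed in this section can be approximated as follows. This is a sketch under the assumption that SIM-Score is the mean per-image embedding similarity over the benchmark and that FID is computed with a standard implementation; torchmetrics is used here only as a stand-in for the actual post-processing code.

# Sketch of benchmark-level metric computation (assumptions noted above).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def aggregate_sim_score(per_image_sims: list[float]) -> float:
    # SIM-Score over a benchmark: average of per-image similarities.
    return sum(per_image_sims) / len(per_image_sims)

@torch.no_grad()
def fid_score(original_images: torch.Tensor, generated_images: torch.Tensor) -> float:
    # FID between the sets of original and generated images
    # (both given as uint8 tensors of shape [N, 3, H, W]).
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(original_images, real=True)
    fid.update(generated_images, real=False)
    return float(fid.compute())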



BibTeX

@misc{huang2024MMGenBench,
    title={MMGenBench: Evaluating the Limits of LMMs from the Text-to-Image Generation Perspective},
    author={Hailang Huang and Yong Wang and Zixuan Huang and Huaqiu Li and Tongwen Huang and Xiangxiang Chu and Richong Zhang},
    year={2024},
    eprint={2411.14062},
    archivePrefix={arXiv},
    primaryClass={cs.CV},
    url={https://arxiv.org/abs/2411.14062}, 
}