COLM
2024

Can MLLMs Perform Multimodal In-Context Learning for Text-to-Image Generation?

Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee
text-to-image
in-context-learning

Abstract

The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies like fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements.

Resources

Read paper
View Github

Related Publications

Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

TMLR
2026
search
transformer
View Job

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

EACL
2026
speculative-decoding
vision-lanuage
View Job

Draft-based Approximate Inference for LLMs

ICLR
2026
speculative-decoding
kv-cache
View Job