Can MLLMs Perform Multimodal In-Context Learning for Text-to-Image Generation?

Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee

text-to-image

in-context-learning

Abstract

The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Utilizing our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation. To overcome these challenges, we explore strategies like fine-tuning and Chain-of-Thought prompting, demonstrating notable improvements.

Related Publications

Transformers in the Dark: Navigating Unknown Search Spaces via Bandit Feedback

TMLR

2026

transformer

TABED: Test-Time Adaptive Ensemble Drafting for Robust Speculative Decoding in LVLMs

EACL

2026

speculative-decoding

vision-lanuage

Draft-based Approximate Inference for LLMs

ICLR

2026

speculative-decoding

kv-cache