FuriosaAI presents papers at ECCV and COLM conferences to improve AI image generation
News
AI image generators can create detailed, incredible images of almost anything. But almost anyone who’s tried them has at least occasionally looked at the output and thought, “That’s amazing! But it’s not quite what I wanted.”
FuriosaAI engineers recently published two papers that address these limitations, making it possible to build new AI models that can follow complex, nuanced instructions in ways that current systems struggle with. Both papers were accepted by leading AI conferences, ECCV (European Conference on Computer Vision) and COLM (Conference on Language Modeling), and are linked at the end of this post.
Our RNGD (pronounced “Renegade”) chip, which is now in production and sampling with customers, was designed to run not only text-based large language models but also multimodal models. In addition to delivering the compute and bandwidth needed for large multimodal models, RNGD is highly programmable, so engineers can deploy new kinds of models and model architectures without difficult, time-consuming manual optimizations.
Multimodal AI is the direction the industry is heading. OpenAI’s GPT-4o is “natively multimodal,” as the company’s CEO Sam Altman recently noted, and new models like Meta’s Llama 3.2 Vision, Google Gemini, Mistral Pixtral, and others offer both text and image capabilities.
With the two papers described below, we are excited to contribute to the multimodal AI research community and we look forward to running new kinds of generative AI using RNGD.
More effective multimodal in-context learning
In their paper, Can MLLMs Perform Text-to-Image In-Context Learning?, Furiosa engineers Wonjun Kang and Hyung Il Koo collaborated with researchers at the University of Wisconsin-Madison to tackle multimodal in-context learning, where an AI model is given both text and images as input and is then asked to extrapolate from them to generate a suitable new image.
Kang, Koo and their collaborators created CoBSAT, the first comprehensive benchmark for evaluating text-to-image in-context learning across different models. With this new tool, researchers can systematically assess how well different approaches work.

Multimodal in-context learning is crucial for building effective AI models that work with different kinds of inputs and outputs, such as text, video, images, audio, or structured data. For example, an AI tool might take images of architectural plans for a house, along with a text prompt like “revise the design to add a fourth bedroom without changing the footprint.” Or “review these images of our company’s products and then create mockups of new products tailored for new moms.”
To execute these kinds of tasks, the AI must understand and integrate information from both textual and visual modalities. This is inherently more complex than unimodal tasks or even image-to-text tasks, as it involves mapping from low-dimensional textual input to high-dimensional visual output. It must also learn effectively from a very limited number of visual examples and correctly generalize to novel tasks.
CoBSAT features 10 tasks across 5 themes. The team tested several state-of-the-art MLLMs on this benchmark and analyzed their performance.
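To give a concrete sense of the task format, the sketch below shows how a text-to-image in-context prompt might be assembled: a few text–image demonstrations that share a latent attribute, followed by a text-only query that the model must answer with a new image. This is a simplified illustration rather than code from the paper or the benchmark; the Demonstration class, the build_t2i_icl_prompt helper, and the toy images are purely hypothetical.

```python
from dataclasses import dataclass
from typing import List

from PIL import Image


@dataclass
class Demonstration:
    """One in-context example: a text label paired with an image."""
    text: str
    image: Image.Image


def build_t2i_icl_prompt(demos: List[Demonstration], query: str) -> List[dict]:
    """Interleave (text, image) demonstrations, then append a text-only query.

    The model must infer the attribute shared by the demonstration images
    (e.g., "everything is red") and generate a new image that applies it
    to the query subject.
    """
    prompt: List[dict] = []
    for demo in demos:
        prompt.append({"type": "text", "content": demo.text})
        prompt.append({"type": "image", "content": demo.image})
    prompt.append({"type": "text", "content": query})  # expected output: an image
    return prompt


# Toy demonstrations whose shared latent attribute is the color red;
# given the query "book", a capable model should generate a red book.
demos = [
    Demonstration("car", Image.new("RGB", (64, 64), "red")),
    Demonstration("apple", Image.new("RGB", (64, 64), "red")),
]
prompt = build_t2i_icl_prompt(demos, query="book")
# generated_image = mllm.generate(prompt)  # hypothetical model call
```

The difficulty the benchmark measures is exactly this inference step: nothing in the query says “red,” so the model has to recover the attribute from the images alone and carry it into a generated image.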
This research could significantly influence the development of more capable and flexible multimodal models, benefiting end users in creative fields, design, and visual communication by letting them more easily create and manipulate images using natural language.
Fine-grained control with diffusion models
At ECCV 2024 in Milan, Furiosa engineers Wonjun Kang, Kevin Galim and Hyung Il Koo presented a novel technique, called Eta Inversion, to enhance diffusion-based image editing models. By enabling more precise control over the editing process, Eta Inversion could lead to more powerful and intuitive tools for designers, artists, and content creators.

Eta Inversion introduces a time- and region-dependent η function to control the noise added during the diffusion process, allowing for more flexible and effective image editing. The method is designed to achieve a balance between faithfully reconstructing the original image and enabling meaningful edits based on text prompts.
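To show where η enters the sampler, the sketch below implements a single DDIM reverse step in which the usual global scalar η is replaced by a per-timestep, per-pixel map. This is a simplified illustration of the idea, not the paper’s actual implementation; the function name, tensor shapes, and the way the map would be chosen are assumptions, and the released code linked at the end of this post is the authoritative reference.

```python
import torch


def ddim_step_with_eta(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta_map):
    """One DDIM reverse step with a per-timestep, per-pixel eta.

    x_t, eps_pred: current latent and the model's noise prediction.
    alpha_bar_t, alpha_bar_prev: cumulative noise-schedule terms (0-dim tensors).
    eta_map: tensor broadcastable to x_t, e.g. larger inside the region being
             edited and near zero elsewhere, chosen separately at each timestep.
    """
    # Predicted clean image under the standard epsilon parameterization.
    x0_pred = (x_t - (1 - alpha_bar_t).sqrt() * eps_pred) / alpha_bar_t.sqrt()

    # Usual DDIM sigma_t, scaled per pixel by eta_map instead of a global eta.
    sigma_t = (
        eta_map
        * ((1 - alpha_bar_prev) / (1 - alpha_bar_t)).sqrt()
        * (1 - alpha_bar_t / alpha_bar_prev).sqrt()
    )

    # Deterministic direction toward x_t, plus the injected region-weighted noise.
    dir_xt = (1 - alpha_bar_prev - sigma_t**2).clamp(min=0).sqrt() * eps_pred
    noise = sigma_t * torch.randn_like(x_t)
    return alpha_bar_prev.sqrt() * x0_pred + dir_xt + noise
```

With η set to zero everywhere, this reduces to deterministic DDIM sampling that faithfully reconstructs the inverted image; letting η grow inside the edited region at the right timesteps injects the noise needed for the edit to take hold there.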
The method achieves state-of-the-art results across various metrics, outperforming existing approaches in text-image alignment while maintaining structural similarity. It's particularly effective for tasks like style transfer that require more extensive image modifications.
The connection between chip engineering and AI research
Furiosa engineers presented our paper “TCP: A Tensor Contraction Processor for AI Workloads” in July at ISCA (International Symposium on Computer Architecture), the premier forum for new ideas and experimental results in silicon design.
We also presented “Functional Coverage Closure With Python” at the 2024 Design and Verification Conference (DVCon U.S.) in San Jose earlier this spring, and spoke at the Kubernetes Contributor Summit in Chicago last year.
Through these efforts we aim to contribute to the AI research and engineering community and help push the entire field forward. More to come soon!
Read the papers and try the Eta Inversion code and CoBSAT dataset yourself:
Eta Inversion paper, code, and video presentation.
CoBSAT paper on in-context text-to-image learning.