Explain why multimodal prompting with frozen language models works

Determine the mechanisms and conditions under which multimodal prompting approaches that freeze the language model and inject learned image representations into its input embeddings (as in systems such as Frozen and Flamingo) enable vision–language generation. Characterize how the frozen decoder interprets and uses this visual information to produce text.

Background

The survey by Cao et al. (2023) reviews vision–language generation methods that freeze the LLM and train only an image encoder, injecting visual embeddings into the LLM’s input to enable multimodal in-context learning (e.g., Frozen and Flamingo). These approaches achieve strong zero- and few-shot performance although the language model's own weights are not updated, which raises the question of how a frozen decoder makes use of the injected visual features.

Following this discussion, the authors explicitly note that it remains an open question why such prompting-based multimodal methods are effective. This motivates investigation into the representational and architectural factors that allow visual embeddings to be turned into fluent text by a decoder that was trained on text alone.
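
To make the injection step concrete, the sketch below shows only the data flow of the prefix-style setup described in this section: image features are mapped by a trainable module into vectors of the same width as the language model's token embeddings and placed in front of the embedded text. It is a minimal illustration, not the implementation of Frozen or Flamingo. All names, sizes, and values (D_MODEL, N_PREFIX, D_IMAGE, FROZEN_EMBEDDINGS, visual_projections) are hypothetical, and a per-position linear map stands in for a real image encoder.

```python
import random

random.seed(0)

# Hypothetical sizes chosen only for illustration.
D_MODEL = 4    # embedding width of the frozen language model
N_PREFIX = 2   # number of visual "prefix" vectors per image
D_IMAGE = 3    # size of the raw image feature vector

# Frozen part: the LM's token-embedding table. Nothing here is ever updated.
FROZEN_EMBEDDINGS = {
    "a":     [0.1, 0.2, 0.3, 0.4],
    "photo": [0.5, 0.1, 0.0, 0.2],
    "of":    [0.3, 0.3, 0.1, 0.0],
}


def make_projection(d_in, d_out):
    """Random (d_out x d_in) matrix standing in for trainable weights."""
    return [[random.uniform(-0.1, 0.1) for _ in range(d_in)]
            for _ in range(d_out)]


# Trainable part: one projection per prefix position (assumed stand-in
# for an image encoder; only these weights would receive gradients).
visual_projections = [make_projection(D_IMAGE, D_MODEL)
                      for _ in range(N_PREFIX)]


def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def encode_image(image_features):
    """Map image features to N_PREFIX vectors in the LM's embedding space."""
    return [matvec(p, image_features) for p in visual_projections]


def embed_text(tokens):
    """Look up frozen token embeddings (copied so the table stays intact)."""
    return [list(FROZEN_EMBEDDINGS[t]) for t in tokens]


def build_decoder_input(image_features, tokens):
    """Visual prefix followed by text embeddings, as one input sequence."""
    return encode_image(image_features) + embed_text(tokens)


sequence = build_decoder_input([0.5, -1.0, 2.0], ["a", "photo", "of"])
print(len(sequence), "input vectors of width", len(sequence[0]))
```

From the decoder's point of view the visual vectors are indistinguishable in type from token embeddings: they occupy ordinary sequence positions and have the same width. The open question above concerns why a decoder that never saw such vectors during its own training can nevertheless condition on them usefully.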

References

However, it still remains a question why this kind of prompting based method work in multimodal generation.

A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT (2303.04226 - Cao et al., 2023), Section 4.2, Vision Language Decoders, "Frozen decoders"