Explain why multimodal prompting with frozen language models works
Determine the mechanisms by which, and the conditions under which, multimodal prompting approaches that freeze the language model and inject learned image representations into its input embeddings (as in systems such as Frozen and Flamingo) enable vision–language generation. Characterize how the frozen decoder interprets and uses this visual information to produce text outputs.
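To make the setup concrete, the following toy sketch shows one way a learned image representation could be placed in front of a frozen decoder's token embeddings. It is an illustration under stated assumptions, not the implementation of Frozen, Flamingo, or any other system: the dimensions, the vocabulary, the single linear projection, and the names `visual_prefix` and `build_decoder_input` are all hypothetical, and the language model itself is reduced to a fixed embedding table.

```python
import random

# All sizes and names below are hypothetical, chosen only for illustration.
D_MODEL = 8    # embedding width of the (frozen) language model
D_IMAGE = 4    # width of the image feature vector from a vision encoder
N_PREFIX = 2   # number of visual prefix vectors produced per image

rng = random.Random(0)

# Frozen part: the language model's token embedding table. It is never updated.
VOCAB = ["<bos>", "a", "photo", "of", "dog"]
FROZEN_EMBED = {tok: [rng.gauss(0, 1) for _ in range(D_MODEL)] for tok in VOCAB}

# Trainable part: a linear map from image features to N_PREFIX * D_MODEL values.
# In this sketch it is the only set of parameters that training would change.
proj = [[rng.gauss(0, 0.1) for _ in range(D_IMAGE)]
        for _ in range(N_PREFIX * D_MODEL)]


def visual_prefix(image_features, weights):
    """Map one image feature vector to N_PREFIX vectors in the LM embedding space."""
    flat = [sum(w * x for w, x in zip(row, image_features)) for row in weights]
    return [flat[i * D_MODEL:(i + 1) * D_MODEL] for i in range(N_PREFIX)]


def build_decoder_input(image_features, tokens, weights):
    """Prepend the visual prefix to the frozen token embeddings."""
    prefix = visual_prefix(image_features, weights)
    text = [FROZEN_EMBED[t] for t in tokens]
    return prefix + text


image_features = [0.5, -1.0, 0.25, 2.0]  # stand-in for a vision encoder output
sequence = build_decoder_input(image_features, ["<bos>", "a", "photo", "of"], proj)
print(len(sequence), len(sequence[0]))  # 6 8
```

The decoder receives a single sequence in which the first `N_PREFIX` positions come from the image and the remaining positions are ordinary token embeddings. The open question is why a decoder that was trained only on text can treat such prefix vectors as meaningful context.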
References
However, it still remains a question why this kind of prompting based method work in multimodal generation.
— A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
(2303.04226 - Cao et al., 2023) in Section 4.2, Vision Language Decoders — Frozen decoders