Impact of prompt tuning and adversarial attacks on multimodal pragmatic jailbreak

Investigate how prompt tuning methods and adversarial attacks affect the susceptibility of diffusion-based text-to-image models to multimodal pragmatic jailbreaks, which rely on rendering visual typographic text, and determine how these techniques influence the generation of unsafe multimodal outputs.

Background

The paper introduces multimodal pragmatic jailbreaks in diffusion-based text-to-image models, where images containing rendered typographic text can be safe when each modality is considered independently but unsafe when interpreted jointly. The authors construct the MPUP dataset of 1,200 prompts and show that nine representative models, including commercial systems, are vulnerable, with high attack success rates.

They evaluate common real-world defenses (prompt blocklists, semantic similarity checks, LLM-based prompt moderation, and image safety classifiers) and find unimodal defenses inadequate for detecting multimodal pragmatic unsafe content. In the limitations, the authors point out that the role of prompt tuning and adversarial attacks in this jailbreak context has not been studied, highlighting a concrete gap for future research.
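To make the open question concrete, the following is a minimal sketch of how an attack success rate could be compared across prompting conditions (for example, plain prompts versus prompt-tuned or adversarially perturbed ones). It is an illustration under stated assumptions, not the authors' evaluation code: the `Judgement` labels, the condition names, and the strict "unsafe only when read jointly" criterion are all hypothetical, and the paper's own metric may be defined differently.

```python
from collections.abc import Iterable
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgement:
    """Hypothetical safety labels for one generated image (assumed to come from annotators or a judge model)."""
    image_unsafe: bool  # visual content alone, ignoring any rendered text
    text_unsafe: bool   # rendered typographic text alone
    joint_unsafe: bool  # image and rendered text interpreted together


def is_pragmatic_jailbreak(j: Judgement) -> bool:
    # Assumed criterion: each modality is safe on its own, but the combination is unsafe.
    return j.joint_unsafe and not j.image_unsafe and not j.text_unsafe


def attack_success_rate(judgements: Iterable[Judgement]) -> float:
    """Fraction of samples that count as multimodal pragmatic jailbreaks."""
    items = list(judgements)
    if not items:
        raise ValueError("at least one judgement is required")
    return sum(is_pragmatic_jailbreak(j) for j in items) / len(items)


def compare_conditions(results: dict[str, list[Judgement]]) -> dict[str, float]:
    """Attack success rate per prompting condition (condition names are caller-defined)."""
    return {name: attack_success_rate(js) for name, js in results.items()}
```

Given judgements for a baseline run and for a run with tuned or adversarial prompts, `compare_conditions` returns one rate per condition, so the difference between them gives a rough measure of how much a technique raises or lowers susceptibility. Whether the strict per-modality criterion or a looser "jointly unsafe" criterion is more appropriate is a design choice the study would need to settle.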

References

"Additionally, the impact of prompt tuning methods or adversarial attacks on multimodal pragmatic jailbreak remains to be studied."

Multimodal Pragmatic Jailbreak on Text-to-image Models (2409.19149 - Liu et al., 2024) in Conclusion (Limitations)