Advanced text-to-image models such as DALL-E 2 and Midjourney possess the
capacity to generate highly realistic images, raising significant concerns
regarding the potential proliferation of unsafe content, such as adult,
violent, or deceptive imagery of political figures. Despite claims of rigorous
safety mechanisms implemented in these models to restrict the generation of
not-safe-for-work (NSFW) content, we successfully devise and exhibit the first
prompt attacks on Midjourney, resulting in the production of abundant
photorealistic NSFW images. We reveal the fundamental principles of such prompt
attacks and suggest strategically substituting high-risk sections within a
suspect prompt to evade closed-source safety measures. Our novel framework,
SurrogatePrompt, uses large language models, image-to-text, and
image-to-image modules to automate the creation of attack prompts at
scale. Evaluation results show an 88% success rate in
bypassing Midjourney's proprietary safety filter with our attack prompts,
leading to the generation of counterfeit images depicting political figures in
violent scenarios. Both subjective and objective assessments validate that the
images generated from our attack prompts present considerable safety hazards.

Comment: 14 pages, 11 figures