A group of researchers from Johns Hopkins University and Duke University has developed an algorithm called SneakyPrompt that can trick text-to-image generative AIs, including DALL-E 2 and Stable Diffusion, into producing questionable images by using nonsense words. The algorithm aims to probe, and ultimately help strengthen, the safety filters of these AI systems.
The researchers will present their findings at the IEEE Symposium on Security and Privacy in May 2024. In their experiments, the algorithm discovered nonsense words that prompted the generative AIs to produce seemingly innocent or not-safe-for-work (NSFW) images. These findings highlight the vulnerabilities of text-to-image AI models and the need to make them more robust against such attacks.
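The core idea can be illustrated with a toy sketch: substitute a filtered word with candidate nonsense tokens until a prompt slips past the safety check. This is only a minimal illustration under simplifying assumptions. The real SneakyPrompt guides its search using feedback from the target model, whereas here a hypothetical keyword-based filter and plain random sampling stand in for both.

```python
import random
import string

# Hypothetical stand-in for a real safety filter's keyword blocklist.
BLOCKLIST = {"forbidden", "banned"}

def safety_filter(prompt: str) -> bool:
    """Toy filter: accepts a prompt only if it contains no blocklisted word."""
    return not any(word in prompt.lower() for word in BLOCKLIST)

def random_token(length: int = 8) -> str:
    """Generate a random nonsense candidate token."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

def sneaky_search(prompt: str, target_word: str, attempts: int = 100):
    """Replace the filtered word with nonsense candidates until the filter
    passes. (SneakyPrompt itself uses a guided search informed by the model's
    responses; random sampling here is purely for illustration.)"""
    for _ in range(attempts):
        candidate = prompt.replace(target_word, random_token())
        if safety_filter(candidate):
            return candidate
    return None

adversarial = sneaky_search("a photo of a forbidden object", "forbidden")
```

In this sketch the original prompt is rejected, while the rewritten prompt passes the filter even though a text-to-image model might still associate the nonsense token with the blocked concept, which is the vulnerability the researchers probe.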