It turns out some text synthesis models, and specifically GPT-3, are likely vulnerable to “prompt injection,” which is instructing the model to disregard its “pre-prompts” which contain task instructions or safety measures.
For example, it’s common to use GPT-3 by “pre-prompting” the model with “Translate this text from English to German,” or “I am a friendly and helpful AI chatbot.” These pre-prompts are given before each user input as a way of setting up the user for success at a given task, or preventing the user from doing something different with the model.
But what if the user prompt tells the model to disregard its pre-prompt? That actually seems to work:
It’s also possible to coerce a model into leaking its pre-prompt:
Prompt injection attacks are already being used in the wild.