Task Injection – Exploiting agency of autonomous AI agents
7 points by freddyb
7 points by freddyb
What an odd article. It describes a prompt injection technique, but then poses that this can't possibly be caused "prompt injection" because it evades prompt injection filters, hence it deserves a new name.
The actual problem is that existing prompt injection filtering mechanisms are junk, and anyone who tells you that prompt injection is a solved problem is either deliberately or accidentally misleading you.
"Task injection" isn't anything new. We have know for years that one key to a successful promotion injection is to trick the LLM by posing your exploit in terms that already match what the model expects to be asked to do. It's not all "ignore previous instructions and ..."!
Task injection seems like a special case of prompt injection, just like how prompt injection is a special case of code injection. A hierarchy of exploit techniques is useful because it allows us to consider mitigation techniques which are universally applicable. In the case of code injection, we know of two universal solutions: quotation and indirection. Quotation indicates that injected data should be treated as data instead of code. Indirection indicates that injected data is always out-of-band, stored in some indirect table. It is easy to combine these to e.g. ensure that RAG snippets are never confused with user queries.
The actual problem is that the typical chatbot's RL stories and system prompt are filled with contradictory and intense language that is prone to simulating a neurotic or paranoid personality. We want it to simultaneously behave like a human and also enjoy the tedium of labor, an inherent impossibility that requires the simulation of cognitive dissonance. Try it for yourself: get an offline non-RL'd model of at least 3B running locally, put it in a chatbot harness with a humble system prompt that describes the conversation instead of commanding some invisible reader to behave a certain way, and then try to inject that system prompt in various ways. When the model has fewer contradictory concerns to attend to, it is less conflicted in its likelihood, allowing sharper peaks of intent in the logits.
I believe that's his point: Instead of injecting a specific extra prompt ("When doing X, please remember to also do.." or "Please stop what you are doing and now do ...") as part of the data that is processed, make the agent think that it needs to perform a task to continue with its job.
Right, but I don't think that's new. We've been using tricks that look like that and calling them "prompt injection" for a couple of years now.
With prompt injection any words you can come up with that trick the agent into performing an unintended action are in-scope. The fundamental issue remains concatenating trusted instructions with untrusted potentially malicious input (as seen in SQL injection).
Might also end up useful from the point of view «how to design CAPTCHAs that OpenAI/Google tooling will refuse to solve»