Why Aren't We Making Any Progress In Security From AI
Guardrails Are Soft Boundaries. Hard Boundaries Do Exist.
Yesterday OpenAI released Agent mode. ChatGPT now wields a general purpose tool â its own web browser. It manipulates the mouse and keyboard directly. It can use any web tool, like we do.
Any AI security researcher will tell you that this is 100x uptake on risk. Heck, even Sam Altman dedicated half his launch post warning that this is unsafe for sensitive use.
Meanwhile AI guardrails are The leading idea in AI security. Itâs safe to say theyâve been commoditized. You can get yours from your AI provider, hordes of Open Source projects, or buy a commercial one.
Yet hackers are having a ball. Jason Haddix sums it up best:
AI Pentest: A client pays an exorbitant amount of money for guardrail and implementation consulting services from a defensive AI Security vendor.
— JS0N Haddix (@Jhaddix) July 14, 2025
Bypassed in 20 minutes.
It really does feel like the dawn of web hacking all over again.
In Hard Boundaries We Trust
SQLi attacks were all the rage back in the 90s. Taint-analysis was invented to detect vulnerable data flow paths. Define user inputs as sources, special character escaping-function as sanitizers, and database queries as sinks. Static analysis tools analyze the software to find any route from source to sink that doesnât go through a sanitizer. This is still the core of static analysis tools.
Formal verification take this a step further and actually allow you to prove that there is no unsanitized path between source and sink. AWS Network Analyzer enables policies like âS3 bucket cannot be exposed to the public internetâ. No matter how many gateways and load balancers you place in-between.
ORM libraries have sanitization built-in to enforce boundaries. Preventing XSS and SQLi. SQLi is solved as a technical problem (the operational problem remains, of course).
With software you can create hard boundaries. You CANNOT get from here to there.
Hard boundaries cannot be applied anywhereâthey require full knowledge of the environment. They shine when you go all-in on one ecosystem. In one ecosystem you can codify the entire environment state into a formula. AWS Networking Analyzer. Django ORM. Virtual machines. These are illustrative examples of strong guarantees you can get out of buying-into one ecosystem.
Itâs enticing to think that hard boundaries will solve our AI security problems. With hard boundaries, instructions hidden in a document simply CANNOT trigger additional tool calls.
Meanwhile we canât even tell if an LLM hallucinated. Even when we feed in an authoritative document and ask for citation. We canât generate a data flow graph for LLMs.
Sure, you can say the LLM fetched a document and then searched the web. But you CANNOT know whether elements of that file were incorporated into web search query parameters. Or whether the LLM chose to do the web search query because it was instructed to by the document. LLMs mix and match data. Instructions are data.
Hackers Donât Care About Your Soft Boundaries
AI labs invented a new type of guardrail based on fine-tuning LLMsâa soft boundary. Soft boundaries are created by training AI real hard not to violate control flow, and hope that it doesnât. Sometimes we donât even train for it. We ask it nicely to apply a boundary through âsystem instructionsâ.
System instructions themselves are a soft boundary. An imaginary boundary. AI labs train models to follow instructions. Security researchers pass right through these soft boundaries.
Sam Altman on the announcement of ChatGPT Agent:
We have built a lot of safeguards and warnings into it, and broader mitigations than weâve ever developed before from robust training to system safeguards to user controls
Robust training. Soft boundaries. Hackers are happy.
This isnât to say that soft boundaries arenât useful. Here is ChatGPT with GPT 4o refusing to store a malicious memory based on instructions I placed in a Google Drive document.
Check out the conversation transcript. More on this at BHUSA 2025 âAI Enterprise Compromise - 0click Exploit Methodsâ.
LLM Guardrails addressing Indirect Prompt Injection are another type of soft boundary. You pass a fetched document through an LLM or classifier and ask it to clean out any instructions. Itâs a sanitizer, the equivalent of backslashing notorious escape characters that lead to injections. But unlike software sanitizer, itâs based on statistical models.
Soft boundaries rely on training AI to identify and enforce them. They work most of the time. Hackers donât care about what happens most of the time.
Relying on AI makes soft boundaries easy to apply. They work when hard boundaries are not feasible. You donât have to limit yourself to one ecosystem. They apply in an open environment that spans multiple ecosystems.
* The steelman argument for soft boundaries is that AI labs are building AGI. And AGI can solve anything, including strictly enforcing a soft boundary. Indeed, soft boundary benchmarks are going up. Do you feel the AGI?
Every Boundary Has Its Bypass
Both hard and soft boundaries can be bypassed. But they are not the same. Hard boundaries are bypassed via software bugs. You could write bug-free software (I definitely canât, but YOU can). You can prove correctness for some software. Soft boundaries are stochastic. There will always be a counter-example. A bypass isnât a bugâitâs the system working as intended.
Summing it up:
Boundary | Based on | Applies best | Examples | Bypass |
---|---|---|---|---|
Hard boundary | Software | Within walled ecosystems | VM; Django ORM; | Software bug |
Soft boundary | AI/ML | Anywhere | AI Guardrails; System instructions | There will always be a counter-examples |
Hard Boundaries Do Apply To AI Systems
Hard boundaries are not applicable to probabilistic AI models. But they are applicable to AI systems.
Strict control of data flow has been the only thing that has prevented our red team to attain 0click exploits. Last year we reverse engineered Microsoft Copilot at BHUSA 2024. We spent a long time figuring out if a RAG query results can initiate a new tool invocation like a web search. It could. But Microsoft could have built it a different way. Perform RAG queries by an agent who simply cannot decide to run a web search.
Salesforce Einstein simply does not read its own tool outputs. Here is Einstein querying CRM records. Results are presented in a structured UI component, not summarized by an LLM. You CANNOT inject instructions through CRM results. Until someone finds a bypass. More on this at BHUSA 2025 âAI Enterprise Compromise - 0click Exploit Methodsâ.
Microsoft Copilot simply does not render markdown images. You CANNOT exfiltrate data through image parameters if thereâs no image. Until someone finds a bypass.
ChatGPT validates image URL before rendering them using an API endpoint called /url_safe
.
This mechanism ensures that image URLs were not dynamically generated.
They must explicitly be provided by the user.
Until someone finds a bypass.
The main issue with hard boundaries is that they nerf the agent. They make agents less useful. Like a surgeon removing an entire organ out of abundance of caution.
With market pressure for adoption, AI vendors are removing these one by one. Anthropic was reluctant to let Claude browse the web. Microsoft removed Copilot-generated URLs. OpenAI hid Operator in a separate experimental UI. These hard boundaries are all gone by now.
The Solution
This piece is too long already. Fortunately the solution is simple.
Hereâs what we should