A language model treats everything it reads as potential instruction. That single property, which makes these systems flexible, is also what makes them hard to secure. When user content, retrieved documents, and tool outputs all flow into the same context as the developer's instructions, the boundary that security usually relies on, between code and data, becomes blurred.

Key Takeaways

  • The core vulnerability is that instructions and data share one channel, so untrusted text can carry commands.
  • Prompt injection, jailbreaking, and data extraction are the primary attack patterns to plan for.
  • No single defense is sufficient. Robust systems layer input handling, privilege limits, and output checks.
  • Adversarial testing should be a standing process, not a one time exercise before launch.

Prompt injection

In a prompt injection, an attacker plants instructions inside content the model will later read, a web page, an email, a document, that override or subvert the developer's intent. The indirect form is especially dangerous: the user may be entirely innocent while a retrieved document silently instructs the model to exfiltrate data or take an unwanted action. Because the malicious text is just words, conventional input filtering rarely catches it.

Jailbreaking

Jailbreaking aims to bypass a model's safety training through carefully constructed prompts, role play framings, or obfuscation. The arms race here is ongoing: each new guardrail invites new evasions. Treating a model's refusal behavior as a hard security boundary is a mistake. It is a useful layer, but it is probabilistic and can be worn down.

Data extraction and leakage

Models can surface sensitive information from their training data, their system prompts, or the documents in their context. Attackers probe for proprietary instructions, private data, and credentials. In retrieval augmented systems, the risk extends to whatever the model can fetch, which is why what a model is allowed to access matters as much as what it is asked.

Why a single safety layer is a red flag

Vendors sometimes present a content filter or a fine tuned refusal model as the answer to LLM security. It is not. These are useful components inside a layered design, but any claim that one mechanism makes a system safe should be treated with suspicion. Security here is about reducing and bounding risk, not eliminating it.

Defense in depth

Practical mitigation combines several layers. Constrain what the model can do by giving it the least privilege necessary, so a successful injection cannot reach sensitive systems. Separate trusted instructions from untrusted content as much as the architecture allows. Validate and constrain outputs before they trigger actions. Keep a human in the loop for high consequence operations. And test continuously with adversarial inputs, because the threat landscape moves.

An Independent Perspective

We assess these systems the way an attacker would, then the way a regulator would. The pattern that separates resilient deployments from fragile ones is governance, not cleverness: clear privilege boundaries, logged actions, and a standing red team beat any single guardrail. Assume injection will succeed sometimes, and design so that when it does, the blast radius is small.

Assessing your exposure to LLM attacks?

We run independent adversarial reviews and risk assessments for deployed AI systems.

Request an AI Audit