← All terms

Prompt Injection

A security risk where untrusted input manipulates a language model into ignoring its instructions.

What Is Prompt Injection

Prompt injection is a security issue in which untrusted input, such as text from a web page, document, email, or tool output, contains instructions that manipulate a large language model into ignoring its original instructions or performing unintended actions. It is conceptually similar to injection vulnerabilities in traditional software, such as SQL injection, in that untrusted data is misinterpreted as instructions rather than treated purely as content.

How It Happens

A model processes its system prompt, conversation history, and any retrieved or tool-provided content within the same context window, without a reliable built-in way to distinguish trusted instructions from untrusted data embedded inside that content. If an attacker can influence any text the model reads, for instance a webpage an agent fetches, a file it opens, or an email it summarizes, they can embed text designed to look like an instruction, such as directing the model to disclose sensitive information, ignore prior constraints, or invoke a tool it should not use. Because the model has no inherent way to verify the source or trustworthiness of text within its context, it can treat injected instructions as legitimate.

Direct vs Indirect Prompt Injection

  • Direct prompt injection: a user interacting with the model directly tries to override its system prompt or safety instructions through the conversation itself.
  • Indirect prompt injection: malicious instructions are placed in external content the model processes later, such as a document, webpage, or tool result, without the person operating the model being aware.

Indirect prompt injection is generally considered the higher-risk category for autonomous agents, since it can affect a system without any direct interaction from an attacker.

Why It Matters for Agent Systems

The risk of prompt injection grows with the amount of autonomy and tool access an agent has. An agent that can only produce text is limited to whatever damage a misleading answer can cause, while an agent that can run commands, send messages, or modify files can be pushed toward harmful actions if it processes injected instructions from an untrusted source. Defensive measures include treating retrieved or tool-provided content as data rather than instructions, restricting what actions an agent can take without explicit approval, sandboxing execution so that a compromised agent has limited impact, and applying least-privilege access to credentials and systems an agent can reach. Running agents in isolated, sandboxed environments, such as separate Docker containers per agent, is one practical way to limit the impact of a successful prompt injection, since it constrains what the agent can affect even if its instructions are manipulated.

Get started

Deploy your fleet.

Put a fleet of sandboxed agents to work on your own infrastructure, provisioned in seconds and watched live from one console.

Get started

Admin-provisioned · Self-host in one command · Your data never leaves your VM