An LLM in production is a new kind of attack surface: it takes untrusted natural-language input, reasons over sensitive context, and — increasingly — calls tools that do things. Traditional input validation doesn't cover it. Shipping AI safely means adding guardrails at the boundaries, with the same rigor you'd apply to any other untrusted-input system.
The threats, concretely
- Prompt injection — user (or retrieved) text that hijacks the model's instructions ("ignore previous instructions and…"). The #1 LLM-specific risk.
- Data leakage — the model exposes secrets, other users' data, or internal context it shouldn't.
- Unsafe tool use — an agent calls a destructive or money-moving tool with attacker-influenced arguments.
- Harmful / off-policy output — content that violates your policy or brand.
- Cost/DoS abuse — adversarial inputs that drive huge token usage.
Guardrails by layer
Input. Treat all input as untrusted — including retrieved documents (indirect injection hides there). Separate instructions from data with clear delimiters, keep the trusted system prompt isolated, and don't blindly concatenate user text into a position of authority.
Tool / action. This is where injection turns into damage. Enforce permissions in code, not the prompt: allowlist callable tools, validate every argument, require confirmation/human approval for irreversible actions, and make writes idempotent. The model proposes; your code decides.
Output. Validate before use — schema-check structured output, scan for leaked secrets/PII, and apply content filtering. Never render model output as trusted HTML or execute it without sanitization.
Context / data. Scope what the model can see to the current user (no cross-tenant context), redact PII before it enters the prompt where possible, and keep provenance so you can audit what informed an answer.
Operational. Rate-limit and budget per user to blunt cost/DoS abuse, and log every prompt, tool call, and output for audit and incident response.
A pre-launch checklist
- Is user and retrieved text treated as untrusted (injection-aware)?
- Are tool permissions enforced in code, with validation + approval gates on risky actions?
- Is structured output schema-validated and scanned for secrets/PII before use?
- Is context scoped per user (no cross-tenant leakage)?
- Are there per-user rate limits and token budgets?
- Are prompts, tool calls, and outputs logged for audit?
- Do you have a human-escalation path for low-confidence / high-stakes cases?
Wrap-up
LLM security isn't a model setting — it's guardrails at every boundary: untrusted input handling, code-enforced tool permissions, validated output, scoped context, and full auditability. Add them before launch; bolting them on after an incident is the expensive path.
Related reading
- Building Enterprise AI Agents — where unsafe tool use bites hardest.
- AI Gateways: Managing LLM Traffic in the Enterprise — enforce guardrails once, at the boundary.
- Tool Calling with Spring AI — validating model-chosen tool arguments.