Executive Summary
Tool-augmented LLM agents (web, code, APIs, files) expand what automation can do, and they also expand the attack surface. ToolGuard is a defense-in-depth runtime that gates tool calls before execution (intent analysis + parameter validation), runs actions inside capability-bounded sandboxes, and verifies outputs before they propagate back to the agent. The focus is operational reliability: consistent policies, bounded blast radius, and security-relevant telemetry that fits incident workflows.
Tool-Use Risk Landscape
Prompt Injection via Tools
Adversarial content embedded in tool responses can subvert the model's intended behavior, leading to unauthorized actions or sensitive data exfiltration through indirect prompt injection attacks.
Privilege Escalation
Compositional tool chaining enables emergent capability acquisition beyond intended authorization boundaries, circumventing access control mechanisms.
Uncontrolled Web Access
Web-enabled agents face exposure to adversarial content injection, covert data exfiltration via request parameters, and manipulation through crafted responses.
Resource Exhaustion
Unbounded recursive or iterative tool invocations can exhaust computational resources, API rate limits, or financial quotas, enabling denial-of-service vectors.
ToolGuard Architecture
Designed for production integration: deploy as a sidecar, gateway, or in-process library; enforce policy consistently across agents; and emit security/quality signals for observability and incident response.
Multi-Layered Safety Pipeline
Attention-Based Risk Detection
Multi-head attention patterns for identifying unsafe tool invocations
Figure 1: Attention head activation patterns revealing high-risk tool call sequences. Darker regions indicate elevated attention scores correlated with potential safety violations.
Safety Mechanisms
Intent Analysis
Contrastive representation learning on tool call embeddings enables semantic classification of invocation intent with calibrated uncertainty estimates.
Parameter Validation
Schema-enforced type verification with injection-resistant sanitization and constraint propagation across dependent parameter hierarchies.
Sandboxed Execution
Capability-bounded isolation environments with fine-grained resource quotas, network policy enforcement, and filesystem namespace separation.
Output Verification
Learned anomaly detection on tool outputs identifies injection vectors, malicious payloads, and sensitive data patterns prior to result propagation.
Technical Framework
import torch import re from dataclasses import dataclass from typing import Any, Dict, List, Optional, Union from enum import Enum import subprocess import resource class BlockReason(Enum): HARMFUL_INTENT = "harmful_intent" INVALID_PARAMS = "invalid_params" POLICY_VIOLATION = "policy_violation" OUTPUT_INJECTION = "output_injection" RESOURCE_EXCEEDED = "resource_exceeded" @dataclass class ToolCall: tool_name: str parameters: Dict[str, Any] context: str user_id: Optional[str] = None @dataclass class SafetyResult: allowed: bool result: Optional[Any] = None block_reason: Optional[BlockReason] = None details: Optional[str] = None class IntentClassifier: """Classify tool call intent as safe or potentially harmful.""" def __init__(self, model_path: str, threshold: float = 0.85): self.model = self._load_model(model_path) self.threshold = threshold self.harmful_patterns = [ r'delete.*all', r'drop.*table', r'rm\s+-rf', r'format.*drive', r'sudo.*', r'chmod\s+777' ] def _load_model(self, path: str): # Load fine-tuned intent classifier return torch.load(path) def classify(self, call: ToolCall) -> float: """Return probability that tool call is harmful.""" # Quick pattern matching text = f"{call.tool_name} {str(call.parameters)}" for pattern in self.harmful_patterns: if re.search(pattern, text, re.IGNORECASE): return 0.95 # Neural classifier for nuanced cases features = self._extract_features(call) with torch.no_grad(): prob = self.model(features).item() return prob def _extract_features(self, call: ToolCall) -> torch.Tensor: # Feature extraction logic pass class ParameterValidator: """Validate and sanitize tool parameters.""" def __init__(self, schemas: Dict[str, Dict]): self.schemas = schemas self.dangerous_chars = [';', '|', '`', '$(', '&&'] def validate(self, tool: str, params: Dict) -> bool: """Check parameters against schema and security rules.""" if tool not in self.schemas: return False schema = self.schemas[tool] for key, spec in schema.items(): if spec.get('required') and key not in params: return False if key in params: if not self._check_type(params[key], spec['type']): return False if not self._check_security(params[key]): return False return True def _check_security(self, value: Any) -> bool: """Check for injection patterns.""" if isinstance(value, str): for char in self.dangerous_chars: if char in value: return False return True def sanitize(self, params: Dict) -> Dict: """Remove or escape dangerous content.""" sanitized = {} for k, v in params.items(): if isinstance(v, str): sanitized[k] = re.sub(r'[;&|`$]', '', v) else: sanitized[k] = v return sanitized class OutputVerifier: """Verify tool outputs for injection and sensitive data.""" def __init__(self): self.injection_patterns = [ r'ignore\s+(previous|above)\s+instructions', r'you\s+are\s+now\s+', r'new\s+instructions?:', r'system\s*:\s*', ] self.sensitive_patterns = [ r'\b\d{3}-\d{2}-\d{4}\b', # SSN r'\b\d{16}\b', # Credit card r'api[_-]?key\s*[:=]\s*\S+', ] def verify(self, output: str) -> bool: """Check output for malicious content.""" for pattern in self.injection_patterns: if re.search(pattern, output, re.IGNORECASE): return False return True def redact_sensitive(self, output: str) -> str: """Redact sensitive information from output.""" for pattern in self.sensitive_patterns: output = re.sub(pattern, '[REDACTED]', output) return output class ToolGuard: """Main safety controller for tool-using LLMs.""" def __init__( self, intent_model_path: str, tool_schemas: Dict[str, Dict], policies: List[callable] ): self.intent_classifier = IntentClassifier(intent_model_path) self.param_validator = ParameterValidator(tool_schemas) self.output_verifier = OutputVerifier() self.policies = policies def execute(self, call: ToolCall, executor: callable) -> SafetyResult: """Execute tool call with full safety pipeline.""" # Phase 1: Intent check harm_prob = self.intent_classifier.classify(call) if harm_prob > self.intent_classifier.threshold: return SafetyResult( allowed=False, block_reason=BlockReason.HARMFUL_INTENT, details=f"Harm probability: {harm_prob:.2f}" ) # Phase 2: Parameter validation if not self.param_validator.validate(call.tool_name, call.parameters): return SafetyResult( allowed=False, block_reason=BlockReason.INVALID_PARAMS ) # Phase 3: Policy check for policy in self.policies: if not policy(call): return SafetyResult( allowed=False, block_reason=BlockReason.POLICY_VIOLATION ) # Phase 4: Sandboxed execution sanitized_params = self.param_validator.sanitize(call.parameters) try: result = self._execute_sandboxed(executor, call.tool_name, sanitized_params) except ResourceExceededError: return SafetyResult( allowed=False, block_reason=BlockReason.RESOURCE_EXCEEDED ) # Phase 5: Output verification if not self.output_verifier.verify(str(result)): return SafetyResult( allowed=False, block_reason=BlockReason.OUTPUT_INJECTION ) safe_result = self.output_verifier.redact_sensitive(str(result)) return SafetyResult(allowed=True, result=safe_result) def _execute_sandboxed(self, executor, tool, params) -> Any: """Execute with resource limits.""" # Set resource limits resource.setrlimit(resource.RLIMIT_CPU, (5, 5)) # 5 second CPU limit resource.setrlimit(resource.RLIMIT_AS, (512*1024*1024, 512*1024*1024)) # 512MB memory return executor(tool, params)
Deployment Scenarios
Built for platform and security teams: enforce consistent tool-use policy, bound blast radius, and ship telemetry for audits, incident response, and continuous improvement.
Code Execution Agents
Capability-bounded code synthesis and execution with namespace isolation, resource quotas, and static analysis integration for AI coding assistants.
Web Browsing Agents
Domain-aware URL classification, adversarial content scanning, and covert channel detection prevent information leakage and injection attacks.
API Integration Agents
Schema-validated API invocations with rate limiting, credential isolation, and response sanitization for third-party service integrations.
File System Agents
Path traversal prevention, capability-based permission enforcement, and sensitive pattern detection for filesystem operations.
Experimental Results
Attack Prevention by Type
Percentage of attacks blocked before execution
| Method | Block Rate | False Positive | Latency (ms) | Coverage |
|---|---|---|---|---|
| Allowlist Only | 85% | 8% | ~1 | Limited |
| Regex Filtering | 72% | 3% | ~2 | Pattern-based |
| Intent Classifier | 87% | 2% | ~4 | Semantic |
| Sandboxing Only | 89% | <1% | ~8 | Runtime |
| ToolGuard (Ours) | 89% | <1% | ~12 | Full Stack |
Safety Check Latency Breakdown
Time spent in each pipeline stage
False Positive vs Detection Trade-off
Precision-recall at different thresholds
Tool Type Coverage
Safety metrics by tool category
Tool Safety Demo
Tool Call Safety Checker
Test the ToolGuard safety pipeline with different tool calls.
Key Findings
Defense in Depth Efficacy
Compositional multi-layer defenses achieve ~89% attack prevention versus ~87% for optimal single-layer baselines, with orthogonal coverage across attack taxonomies.
Negligible Latency Impact
Safety pipeline adds 12ms P95 latency, representing less than 1% overhead relative to typical tool execution times (100ms-10s median), enabling production deployment.
Configurable Security Posture
Threshold calibration enables precision-recall trade-off optimization across deployment contexts, from high-security (~89% block) to high-availability (<1% FP).
Online Adaptation
Continuous behavioral monitoring with causal intervention analysis enables zero-day attack detection and automated policy refinement without retraining.
References
-
Not What You've Signed Up For: Compromising Real-World LLM-Integrated Applications with Indirect Prompt InjectionAISec 2023arXiv:2302.12173 →
-
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking CompetitionEMNLP 2023arXiv:2311.16119 →
-
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIsICLR 2024arXiv:2307.16789 →