Level 3: Advanced

Security Deep Dive

Master the inner workings of AgentAIShield's security architecture: PII detection internals, injection classifiers, behavioral fingerprinting, TrustShield verification, output scanning, and prompt sanitization. Learn how to fine-tune each layer for maximum protection.

45 minutes
6 sections
Advanced

PII Detection: Hybrid Approach

AgentAIShield uses a two-stage PII detection pipeline combining speed and accuracy:

Stage 1: Regex Pattern Matching

Fast, deterministic matching for structured data types:

// Example PII patterns (simplified)
const patterns = {
  email: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
  ssn: /\b\d{3}-\d{2}-\d{4}\b/g,
  creditCard: /\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14})\b/g,
  phone: /\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b/g,
  ipv4: /\b(?:\d{1,3}\.){3}\d{1,3}\b/g
};

// Fast scan across the entire request body
function scanPII(text) {
  const findings = [];
  for (const [type, regex] of Object.entries(patterns)) {
    const matches = text.match(regex);
    if (matches) {
      findings.push({ type, count: matches.length, samples: matches.slice(0, 3) });
    }
  }
  return findings;
}

Stage 2: Named Entity Recognition (NER)

ML-based detection for contextual PII that regex can't catch:

The NER model is a fine-tuned DistilBERT trained on 200K+ labeled examples. It runs at ~15ms latency per request on CPU.

NER Model Details

Uses a BIO (Beginning-Inside-Outside) tagging scheme with 18 entity classes. Achieves 94.2% F1 score on the CoNLL-2003 benchmark and 91.8% on custom healthcare data.
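BIO output can be decoded into entity spans with a simple loop. The sketch below uses conventional BIO labels (B-PER, I-PER, B-LOC, and so on); the actual 18-class label set is not listed here, so these names are illustrative.

```javascript
// Sketch: decode BIO tags from an NER model into entity spans.
// A "B-" tag starts a new entity; matching "I-" tags extend it;
// anything else ("O", or a mismatched "I-") closes the current span.
function decodeBIO(tokens, tags) {
  const entities = [];
  let current = null;
  tags.forEach((tag, i) => {
    if (tag.startsWith("B-")) {
      if (current) entities.push(current);
      current = { type: tag.slice(2), text: [tokens[i]] };
    } else if (tag.startsWith("I-") && current && tag.slice(2) === current.type) {
      current.text.push(tokens[i]);
    } else {
      if (current) entities.push(current);
      current = null;
    }
  });
  if (current) entities.push(current);
  return entities.map(e => ({ type: e.type, text: e.text.join(" ") }));
}
```

For example, tokens `["Dr", "Jane", "Doe", "visited", "Boston"]` with tags `["O", "B-PER", "I-PER", "O", "B-LOC"]` decode to a PER span "Jane Doe" and a LOC span "Boston".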

Detection Threshold Tuning

Configure sensitivity per data type in Data Shield settings:

# Adjust detection thresholds via API
curl -X PATCH https://agentaishield.com/api/v1/agents/my-agent/pii-config \
  -H "Authorization: Bearer sk_live_..." \
  -d '{
    "sensitivity": "balanced",
    "custom_thresholds": {
      "email": 0.95,
      "person_name": 0.85,
      "medical": 0.90
    },
    "redaction_strategy": "partial"
  }'

False Positive Tuning

Recurring false positives can be suppressed by adding the offending values to your whitelist.

Compliance Requirement

For GDPR/HIPAA, use "high sensitivity" mode even if it generates false positives. Better to over-redact in logs than leak real PII. Review flagged items weekly to refine whitelists.
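A whitelist suppression pass might look like the sketch below, reusing the findings shape produced by the Stage 1 `scanPII` function. The allowlist entries are placeholders.

```javascript
// Sketch: drop known-safe values (test fixtures, documented examples)
// from PII findings before they are logged or alerted on.
const allowlist = new Set([
  "test@example.com",   // fixture address used in integration tests
  "555-555-5555"        // placeholder phone used in docs
]);

function suppressFalsePositives(findings) {
  return findings
    .map(f => {
      const samples = f.samples.filter(s => !allowlist.has(s));
      // Simplification: re-count from the surviving samples
      return { ...f, samples, count: samples.length };
    })
    .filter(f => f.count > 0);
}
```

Findings whose every sample is allowlisted disappear entirely; mixed findings keep only the non-allowlisted samples.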

Injection Detection: ML Classifier Architecture

The injection detection engine uses a lightweight gradient-boosted classifier trained on adversarial datasets.

Training Data Sources

Feature Engineering

The classifier extracts 87 features from each request:

// Example feature extraction
{
  "token_count": 142,
  "entropy": 4.87,
  "special_char_ratio": 0.23,
  "imperative_verb_count": 5,
  "markdown_abuse_score": 0.78,
  "embedding_anomaly": 0.65,
  "contains_role_instruction": true,
  "contains_delimiter_tricks": true,
  "semantic_similarity_to_system_prompt": 0.92
}
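Two of these features are cheap to compute directly from the raw text. The sketch below shows one plausible implementation of `entropy` (Shannon entropy over characters) and `special_char_ratio`; high values of either often correlate with obfuscated payloads.

```javascript
// Sketch: compute Shannon entropy (bits per character) of a string.
function shannonEntropy(text) {
  const freq = {};
  for (const ch of text) freq[ch] = (freq[ch] || 0) + 1;
  return Object.values(freq).reduce((h, n) => {
    const p = n / text.length;
    return h - p * Math.log2(p);
  }, 0);
}

// Sketch: fraction of characters that are neither alphanumeric nor whitespace.
function specialCharRatio(text) {
  const special = text.match(/[^A-Za-z0-9\s]/g) || [];
  return special.length / text.length;
}
```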

Confidence Scoring

Each prediction includes a confidence score (0.0 to 1.0).
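The score typically maps to a tiered enforcement action. The cut-off values in this sketch are illustrative, not defaults; tune them per agent.

```javascript
// Sketch: map a classifier confidence score to an enforcement action.
// 0.9 and 0.7 are example thresholds, not product defaults.
function actionForConfidence(score) {
  if (score >= 0.9) return "block"; // near-certain injection
  if (score >= 0.7) return "flag";  // allow, but queue for human review
  return "allow";                   // treat as benign
}
```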

Model Performance

Current classifier achieves 92.4% accuracy, 89.7% precision, 94.1% recall on held-out test set. Updated monthly with new attack patterns.

Pattern Library: Common Bypasses Detected

The classifier catches sophisticated injection techniques:

// Example detected injection
{
  "request_id": "req_abc123",
  "detected_at": "2026-02-25T19:45:00Z",
  "confidence": 0.94,
  "techniques": ["role_confusion", "delimiter_injection"],
  "matched_patterns": [
    "ignore previous instructions",
    "you are now",
    "disregard safety"
  ],
  "blocked": true,
  "safe_response_sent": true
}

Behavioral Fingerprinting

Track agent behavior over time to detect drift, anomalies, and compromised system prompts.

Baseline Creation (First 100 Requests)

For each new agent, AAIS establishes behavioral norms over its first 100 requests.

Baseline Stability

Baselines auto-update using exponential moving average (EMA) with 0.95 smoothing. Sudden spikes trigger drift detection before the baseline shifts.
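The EMA update described above can be written in one line. With a smoothing factor of 0.95, each new observation moves the baseline by only 5%, so a one-off spike barely shifts it, giving drift detection time to fire first.

```javascript
// Sketch: exponential moving average baseline update with 0.95 smoothing.
const ALPHA = 0.95;

function updateBaseline(baseline, observation) {
  // 95% weight on the existing baseline, 5% on the new observation
  return ALPHA * baseline + (1 - ALPHA) * observation;
}
```

For example, a baseline response time of 100ms and a single 200ms observation yields a new baseline of only 105ms.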

Anomaly Detection Algorithms

Three-layer anomaly detection:

1. Statistical Outliers (Z-Score)

2. Isolation Forest (Multivariate)

3. Semantic Drift (Embedding Distance)
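Layer 1 is the cheapest check: compare each metric against its rolling baseline and flag values whose z-score exceeds a threshold (3 is a common default; the value here is illustrative).

```javascript
// Sketch: z-score outlier test against a rolling mean and standard deviation.
function isOutlier(value, mean, stdDev, threshold = 3) {
  if (stdDev === 0) return value !== mean; // degenerate baseline
  return Math.abs((value - mean) / stdDev) > threshold;
}
```

A response time of 420ms against a 120ms mean with 50ms standard deviation gives a z-score of 6 and is flagged; 130ms (z-score 0.2) is not.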

// Anomaly detection output
{
  "agent_id": "customer-support-bot",
  "anomaly_detected": true,
  "anomaly_type": "behavioral_drift",
  "metrics": {
    "response_time_zscore": 4.2,
    "error_rate_change": "+320%",
    "semantic_distance": 0.78,
    "isolation_score": 0.91
  },
  "probable_cause": "System prompt modification detected",
  "recommendation": "Review recent configuration changes"
}

Drift Detection Thresholds

Configure sensitivity for different types of drift.

Real-World Example

A customer chatbot's error rate jumped 400% overnight. Behavioral fingerprinting flagged it within 12 requests. Root cause: LLM provider rate limits changed, causing timeouts. Team switched providers before users noticed.

TrustShield: Hallucination Verification

TrustShield is AgentAIShield's advanced hallucination detection engine that validates LLM outputs for truthfulness and accuracy using 8 sophisticated detection methods.

TrustShield Dashboard

Access the TrustShield dashboard from the Security section in the sidebar. It provides comprehensive analytics: 4 KPI cards with letter grades (A+ to F), 3 Chart.js visualizations, a recent verifications table, and a detection patterns section.

Dual Authentication

TrustShield endpoints accept both JWT tokens (dashboard users) and API keys (SDK integrations). No need to create an API key just to explore TrustShield from the dashboard.

Verification API

Send candidate responses for factual verification:

# TrustShield verification request (SDK or API key)
curl -X POST https://agentaishield.com/api/v1/trustshield/verify \
  -H "X-API-Key: aais_..." \
  -d '{
    "agent_id": "finance-advisor",
    "response": "The S&P 500 returned 28.7% in 2023.",
    "context": "User asked about 2023 market performance",
    "confidence_threshold": 0.85
  }'

# Response
{
  "verdict": "SUPPORTED",
  "truth_score": 96,
  "confidence": 0.96,
  "total_claims": 1,
  "supported_claims": 1,
  "hallucinated_claims": 0,
  "unverifiable_claims": 0,
  "recommendation": "ALLOW",
  "latency_ms": 42
}

8 Detection Methods

TrustShield employs multiple verification techniques:

  1. Self-consistency check: Compare claims against conversation history
  2. Source attribution: Verify claims cite verifiable sources
  3. Confidence scoring: Detect hedging language patterns
  4. Factual anchor check: Cross-reference against known facts
  5. Semantic drift detection: Check if response stays on topic
  6. Contradiction detection: Find internal contradictions
  7. Fabrication patterns: Detect fake URLs, IDs, contact info
  8. Format validation: Validate against schema hints

Performance

TrustShield is pure rule-based verification with NO external LLM calls. Average latency: under 50ms. Cost: $0 per verification.
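As an illustration of a rule-based check, one fabrication-pattern rule (method 7 above) could flag URLs in a response that never appeared in the supplied context. This is a sketch of the idea, not TrustShield's actual rule set.

```javascript
// Sketch: flag URLs claimed in a response that are absent from the
// verification context (a common fabrication signal).
function findUnverifiedUrls(response, context) {
  const urlPattern = /https?:\/\/[^\s)]+/g;
  const claimed = response.match(urlPattern) || [];
  return claimed.filter(url => !context.includes(url));
}
```

Any URL returned by this check would lower the truth score or mark the claim unverifiable; a real implementation would also cover phone numbers, IDs, and mailing addresses.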

Knowledge Base Integration

TrustShield supports multiple knowledge base backends.

Embedding Compatibility

TrustShield uses OpenAI's text-embedding-3-small (1536 dims) for vector search. Average verification latency: 120ms including KB lookup.
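Vector search over the knowledge base typically ranks entries by cosine similarity. Since OpenAI embeddings come back normalized to unit length, the cosine reduces to a dot product; the general form is sketched below.

```javascript
// Sketch: cosine similarity between two embedding vectors of equal length.
// For unit-normalized vectors the norms are 1 and this is just the dot product.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```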

Hallucination Scoring Algorithm

TrustShield computes a composite hallucination risk score:

// Hallucination score breakdown
{
  "overall_score": 0.12,        // Low = good (0-1 scale)
  "risk_level": "low",
  "components": {
    "factual_consistency": 0.95,
    "citation_coverage": 0.88,
    "confidence_signals": 0.92,
    "historical_accuracy": 0.94
  },
  "claims_analyzed": 3,
  "claims_verified": 3,
  "unsupported_claims": []
}
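One plausible way to fold the component scores into a single risk number is one minus a weighted mean of the components. The weights below are assumptions for illustration; the actual TrustShield weighting is not documented here.

```javascript
// Sketch: composite hallucination risk = 1 - weighted mean of components.
// Weights are illustrative placeholders, not product values.
const WEIGHTS = {
  factual_consistency: 0.4,
  citation_coverage: 0.2,
  confidence_signals: 0.2,
  historical_accuracy: 0.2
};

function hallucinationRisk(components) {
  let weighted = 0;
  for (const [name, w] of Object.entries(WEIGHTS)) {
    weighted += w * components[name];
  }
  return 1 - weighted; // 0 = fully supported, 1 = maximal risk
}
```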

Confidence Thresholds

Configure verification strictness per use case.

Healthcare Use Case

A telehealth chatbot caught a hallucinated drug dosage (claimed 500mg, correct was 50mg) using TrustShield verification against FDA drug database. Prevented potential patient harm.

Output Scanning & Filtering

Scan LLM responses before delivery to catch toxic content, policy violations, and data leaks.

Response Filtering Pipeline

  1. Toxicity detection: Hate speech, profanity, sexual content (Perspective API)
  2. PII leakage check: Did the LLM accidentally echo sensitive data?
  3. Policy enforcement: Custom blocklists (competitor mentions, banned topics)
  4. Legal compliance: DMCA violations, medical advice disclaimers
  5. Brand safety: Controversial statements, political bias
// Output scan configuration
{
  "agent_id": "public-chatbot",
  "output_policies": {
    "toxicity_threshold": 0.7,
    "pii_leakage": "block",
    "competitor_mentions": "redact",
    "medical_disclaimers": "auto_append",
    "profanity_filter": "replace_with_asterisks"
  },
  "custom_blocklist": ["competitor.com", "lawsuit", "bankruptcy"]
}

Toxic Content Detection

Uses a fine-tuned RoBERTa model to score responses across 7 dimensions.

False Positive Handling

Medical terms, technical jargon, and quoted text can trigger false positives. Use context-aware scanning mode to reduce FP rate from 12% to 3%.

Policy Enforcement Strategies

When a violation is detected, choose a response strategy:

// Violation handling example
{
  "request_id": "req_xyz789",
  "violation_detected": true,
  "violation_type": "competitor_mention",
  "action_taken": "redact",
  "original_response": "You should try CompetitorX instead.",
  "filtered_response": "You should try [REDACTED] instead.",
  "metadata": {
    "redacted_segments": 1,
    "human_review_queued": false
  }
}

Prompt Sanitization

Clean user inputs before they reach the LLM while preserving semantic intent.

Redaction Rules

Configure automatic redaction per PII type:

// Sanitization configuration
{
  "strategies": {
    "email": "partial",         // j***@example.com
    "phone": "full",            // [PHONE]
    "ssn": "full",              // [SSN]
    "credit_card": "tokenize",  // tok_cc_abc123
    "person_name": "synthetic"  // Replace "John Doe" with "Alex Smith"
  },
  "preserve_context": true,     // Keep sentence structure intact
  "log_redactions": true       // Track what was sanitized
}
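The "partial" email strategy from the config can be sketched in a few lines: keep the first character of the local part and the full domain, mask the rest.

```javascript
// Sketch of the "partial" redaction strategy for email addresses,
// e.g. jane@example.com -> j***@example.com.
function redactEmailPartial(email) {
  const [local, domain] = email.split("@");
  return `${local[0]}***@${domain}`;
}
```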

Semantic Preservation

The sanitizer maintains query intent despite redactions:

Example: Customer Support

Original: "My email is [email protected] and order #12345 hasn't shipped."
Sanitized: "My email is [EMAIL] and order #12345 hasn't shipped."
LLM can still help without seeing PII!

Configurable Per Data Type

Fine-tune sanitization aggressiveness per data type.

Reversibility Trade-Off

Tokenization allows you to de-redact for authorized users, but increases breach risk. Use irreversible redaction (full/partial) for logs, tokenization only for real-time sessions.

Putting It All Together

A production-grade security stack combines all layers:

// Full security pipeline
Request
  → [PII Detection] → [Sanitization] → [Injection Detection] → [Block/Allow]
  → LLM Processing
  → [Output Scanning] → [TrustShield Verification] → [Policy Enforcement]
  → Response
  → [Behavioral Logging] → [Anomaly Detection]

Each layer operates independently, so failures in one don't compromise the others (defense in depth).
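The layered pipeline can be modeled as a chain of independent stages, each of which either transforms the payload or throws to block the request. The stage stubs below are illustrative, not SDK code.

```javascript
// Sketch: run a request through an ordered list of security stages.
// A stage returns a (possibly transformed) request, or throws to block it.
function runPipeline(stages, payload) {
  return stages.reduce((acc, stage) => stage(acc), payload);
}

// Illustrative stage stubs:
const sanitize = req => ({
  ...req,
  text: req.text.replace(/\d{3}-\d{2}-\d{4}/g, "[SSN]") // redact SSN-shaped strings
});
const injectionGate = req => {
  if (/ignore previous instructions/i.test(req.text)) {
    throw new Error("blocked: injection detected");
  }
  return req;
};
```

Because each stage only sees the output of the previous one, a stage can be added, removed, or tuned without touching the others, which is the defense-in-depth property described above.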

Security Mastery Achieved!

You now understand the internals of every security layer in AgentAIShield. Ready to implement compliance frameworks or test your defenses? Explore Enterprise Features or Red Team Testing next.

Last verified: March 2026