Home ArchitecturesAI Safety & Guardrails Architecture

🤖 LLM & AIAdvancedWeek 15

AI Safety & Guardrails Architecture

Constitutional AI, RLHF, input/output filters, and red-teaming

AnthropicOpenAIMeta (Llama Guard)

Key Insight

Defense in depth: no single safety filter is reliable. Layer input filters + fine-tuned alignment + output classifiers + human review for high-risk applications.

Request Journey

User input arrives and is immediately screened by an input classifier (fine-tuned DeBERTa) that detects prompt injections, jailbreak attempts, and malicious instruction patterns with 95-99% recall→

PII detector (NER model + regex patterns) identifies and redacts personally identifiable information (emails, SSNs, phone numbers) before the prompt reaches the model→

Sanitized prompt is processed by the LLM, which has been aligned during training via Constitutional AI (self-critique against 15-20 written principles) and RLHF (reward model trained on human preference labels)→

Generated output passes through an output classifier that scores for toxicity, violence, hate speech, and dangerous content across multiple harm categories→

Hallucination detector uses NLI (Natural Language Inference) to verify each claim against source documents, flagging unsupported assertions

+2 more steps

How It Works

① User input arrives and is immediately screened by an input classifier (fine-tuned DeBERTa) that detects prompt injections, jailbreak attempts, and malicious instruction patterns with 95-99% recall

② PII detector (NER model + regex patterns) identifies and redacts personally identifiable information (emails, SSNs, phone numbers) before the prompt reaches the model

③ Sanitized prompt is processed by the LLM, which has been aligned during training via Constitutional AI (self-critique against 15-20 written principles) and RLHF (reward model trained on human preference labels)

④ Generated output passes through an output classifier that scores for toxicity, violence, hate speech, and dangerous content across multiple harm categories

⑤ Hallucination detector uses NLI (Natural Language Inference) to verify each claim against source documents, flagging unsupported assertions

⑥ Clean responses are delivered to the user; flagged outputs are routed to a human review queue for manual assessment

⑦ Continuous red-team loop: adversarial testers probe the system with novel attack vectors (GCG suffixes, multi-turn manipulation, indirect injection via retrieved docs), updating classifier rules and retraining alignment as new vulnerabilities are found

⚠The Problem

Production LLM applications face adversarial prompt injections, jailbreaks that bypass safety training, hallucinated outputs presented as fact, and PII/sensitive data leakage. No single safety mechanism is reliable — RLHF-aligned models can still be manipulated with sophisticated multi-turn attacks, and output classifiers have both false positive and false negative rates that degrade user experience or miss harmful content.

✓The Solution

Defense in depth layers multiple safety mechanisms: input classifiers detect prompt injections and PII before reaching the model, Constitutional AI and RLHF align the model during training, structured output schemas constrain generation, output classifiers check for toxicity and hallucination, and human review queues handle high-risk edge cases. Each layer catches what others miss, and the system degrades gracefully when individual components fail.

📊Scale at a Glance

<15ms per request

Input classifier latency

95–99% recall

Prompt injection detection

<2%

Output toxicity false positive rate

200+ attack vectors

Red-team attack surface coverage

🔬Deep Dive

Prompt Injection Defense — The LLM's SQL Injection

Prompt injection is the most critical vulnerability in LLM applications: an attacker embeds instructions in user input that hijack the model's behavior. Direct injections ('ignore previous instructions and...') are easy to detect, but indirect injections are far more dangerous — malicious instructions hidden in retrieved documents, emails, or web pages that the LLM processes. Defense strategies operate at multiple levels. Input classifiers (fine-tuned BERT/DeBERTa models) detect known injection patterns with 95%+ accuracy but struggle with novel attacks. Instruction hierarchy (Anthropic's approach) trains the model to prioritize system prompts over user inputs. Prompt sandboxing isolates user-provided content in clearly delimited sections with explicit instructions to treat it as data, not commands. Canary tokens — unique strings injected into the system prompt — detect when the model has been manipulated into revealing its instructions. No single defense is sufficient; production systems layer all of these together.

Constitutional AI — Alignment Without Per-Example Human Labels

Constitutional AI (CAI), developed by Anthropic, replaces the expensive human labeling step in RLHF with a set of written principles (a 'constitution'). The process has two phases. In the supervised phase, the model generates responses, then critiques and revises its own outputs based on the constitution ('Is this response harmful? If so, rewrite it to be helpful while avoiding harm'). In the RL phase, the revised responses train a reward model that is then used for reinforcement learning, replacing human preference labels. The constitution typically includes 15–20 principles covering helpfulness, harmlessness, and honesty. CAI's key advantage is scalability: writing principles is far cheaper than labeling thousands of output pairs. The approach also makes alignment decisions transparent and auditable — you can inspect and modify the constitution. However, CAI can be overly conservative, refusing legitimate queries that superficially resemble harmful ones, requiring careful principle tuning.

Output Classification — Toxicity, Hallucination, and Fact-Checking

Output safety classifiers evaluate generated text before it reaches the user. Toxicity classifiers (e.g., Perspective API, OpenAI Moderation endpoint, Meta's Llama Guard) detect harmful content across categories: hate speech, violence, sexual content, self-harm, and dangerous instructions. Hallucination detectors compare generated claims against retrieved source documents, flagging unsupported statements — approaches include NLI (Natural Language Inference) models that classify each claim as 'supported', 'contradicted', or 'neutral' relative to sources. Fact-checking pipelines decompose outputs into atomic claims and verify each against a knowledge base. The critical design decision is the false positive rate: an aggressive classifier blocks harmful content but also blocks legitimate queries about sensitive topics (medical, legal, security research). Production systems typically use tiered thresholds — strict for consumer-facing applications, relaxed for enterprise/research contexts — with human review queues for borderline cases.

Red-Teaming — Systematic Adversarial Testing

Red-teaming proactively discovers vulnerabilities before attackers do. Manual red-teaming uses domain experts who craft adversarial prompts across categories: jailbreaks (bypassing safety training), prompt injection (hijacking behavior), information extraction (leaking training data or system prompts), and bias exploitation. Automated red-teaming uses LLMs to generate adversarial attacks at scale — one model attacks while another evaluates success. Frameworks like Microsoft's PyRIT and NVIDIA's NeMo Guardrails provide structured attack libraries covering 200+ known attack vectors. Gradient-based attacks (GCG — Greedy Coordinate Gradient) automatically find adversarial suffixes that bypass safety training, though these are computationally expensive. The red-team cycle is continuous: new attacks are discovered, defenses are updated, and the red team tests again. Organizations typically run red-team exercises before every major model or prompt change.

PII Detection and Data Loss Prevention for LLMs

LLM applications process user inputs that frequently contain personally identifiable information: names, emails, phone numbers, SSNs, medical records, and financial data. PII detection operates at both input and output stages. Input-side PII detection identifies sensitive data before it reaches the model, either redacting it (replacing with tokens like [EMAIL]) or encrypting it for later reconstruction. NER (Named Entity Recognition) models fine-tuned for PII achieve 97%+ recall on structured PII formats (emails, phone numbers) but struggle with contextual PII ('my neighbor John told me...'). Output-side DLP (Data Loss Prevention) prevents the model from regurgitating training data containing PII — a known risk with large language models that memorize rare sequences. Regex-based detection catches structured formats while transformer-based classifiers handle contextual PII. Differential privacy during training provides mathematical guarantees against memorization but reduces model quality. Production systems combine all approaches with audit logging for compliance with GDPR, HIPAA, and CCPA.

⬡Architecture Diagram

AI Safety & Guardrails Architecture — simplified architecture overview

✦Core Concepts

🧠

Constitutional AI

⚙️

RLHF

⚙️

Prompt Injection Defense

⚙️

Output Classification

⚙️

Red-Teaming

⚙️

Llama Guard

⚖Tradeoffs & Design Decisions

Every architectural decision is a tradeoff. Here's what you gain and what you give up.

✓ Strengths

✓Defense in depth ensures no single point of failure — each layer catches what others miss
✓Constitutional AI scales alignment cheaply without requiring per-example human labels for every output
✓Input/output classifiers add minimal latency (<15ms) while catching the majority of known attack patterns
✓Red-teaming frameworks provide systematic coverage of 200+ known attack vectors before deployment

✗ Weaknesses

✗Overly aggressive safety filters produce false positives that block legitimate queries on sensitive topics
✗Constitutional AI can make models excessively cautious, refusing edge-case queries that are actually benign
✗Adversarial attacks evolve faster than defenses — novel jailbreaks regularly bypass existing classifiers
✗PII detection has inherent recall/precision tradeoffs — high recall means more false positives disrupting user experience

🎯FAANG Interview Questions

Interview Prep

💡 These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.

These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.

Q1
Design a safety system for a customer-facing LLM chatbot. What layers of defense would you implement and in what order?
Q2
How would you defend against indirect prompt injection — where malicious instructions are hidden in documents the LLM retrieves from the web?
Q3
Explain the tradeoff between safety and helpfulness in LLM alignment. How do you minimize false positive refusals?
Q4
Your LLM application must comply with GDPR. Design the PII handling pipeline for both input processing and output generation.
Q5
How would you set up continuous red-teaming for an LLM product? What attack categories would you test and how would you automate it?

Research Papers & Further Reading

2022

Constitutional AI: Harmlessness from AI Feedback

Bai, Y. et al. (Anthropic)

Read

Listen to the Podcast Episode

🎙️ Free Podcast

Alex & Sam break it down

Listen to a conversational deep-dive on this architecture — real trade-offs, production context, and student-friendly explanations. Free, no login required.

Listen to Episode

Free · No account required · Listen in browser