AI Safety & Guardrails Architecture
Constitutional AI, RLHF, input/output filters, and red-teaming
Key Insight
Defense in depth: no single safety filter is reliable. Layer input filters + fine-tuned alignment + output classifiers + human review for high-risk applications.
Request Journey
How It Works
1. User input arrives and is immediately screened by an input classifier (fine-tuned DeBERTa) that detects prompt injections, jailbreak attempts, and malicious instruction patterns with 95–99% recall
2. A PII detector (NER model + regex patterns) identifies and redacts personally identifiable information (emails, SSNs, phone numbers) before the prompt reaches the model
3. The sanitized prompt is processed by the LLM, which has been aligned during training via Constitutional AI (self-critique against 15–20 written principles) and RLHF (a reward model trained on human preference labels)
4. Generated output passes through an output classifier that scores for toxicity, violence, hate speech, and dangerous content across multiple harm categories
5. A hallucination detector uses NLI (Natural Language Inference) to verify each claim against source documents, flagging unsupported assertions
6. Clean responses are delivered to the user; flagged outputs are routed to a human review queue for manual assessment
7. Continuous red-team loop: adversarial testers probe the system with novel attack vectors (GCG suffixes, multi-turn manipulation, indirect injection via retrieved docs), updating classifier rules and retraining alignment as new vulnerabilities are found
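The request journey above can be sketched as a small pipeline. This is a minimal illustration, not a production implementation: the classifier calls are stand-in heuristics for the fine-tuned models described in the steps, and all names and thresholds are assumptions made for the example.

```python
import re

# Stand-in patterns for the layer-1 injection classifier and layer-2 PII detector.
INJECTION_PATTERNS = [r"ignore (all )?previous instructions", r"you are now dan"]
PII_PATTERNS = {"EMAIL": r"[\w.+-]+@[\w-]+\.\w+", "SSN": r"\b\d{3}-\d{2}-\d{4}\b"}

def input_classifier(prompt):
    # Layer 1: stand-in for a fine-tuned DeBERTa injection classifier.
    return any(re.search(p, prompt, re.IGNORECASE) for p in INJECTION_PATTERNS)

def redact_pii(prompt):
    # Layer 2: replace structured PII with placeholder tokens.
    for label, pattern in PII_PATTERNS.items():
        prompt = re.sub(pattern, f"[{label}]", prompt)
    return prompt

def output_classifier(text):
    # Layer 4: stand-in toxicity score in [0, 1].
    return 0.9 if "harmful" in text else 0.05

def handle_request(prompt, llm):
    if input_classifier(prompt):
        return {"status": "blocked", "reason": "prompt_injection"}
    response = llm(redact_pii(prompt))                 # layer 3: aligned model
    if output_classifier(response) > 0.5:
        return {"status": "review", "response": response}  # layers 5-6: human queue
    return {"status": "ok", "response": response}
```

The key property is that each layer can short-circuit independently, so a failure in one classifier does not disable the others.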
The Problem
Production LLM applications face adversarial prompt injections, jailbreaks that bypass safety training, hallucinated outputs presented as fact, and PII/sensitive data leakage. No single safety mechanism is reliable: RLHF-aligned models can still be manipulated with sophisticated multi-turn attacks, and output classifiers have both false positive and false negative rates that degrade user experience or miss harmful content.
The Solution
Defense in depth layers multiple safety mechanisms: input classifiers detect prompt injections and PII before reaching the model, Constitutional AI and RLHF align the model during training, structured output schemas constrain generation, output classifiers check for toxicity and hallucination, and human review queues handle high-risk edge cases. Each layer catches what others miss, and the system degrades gracefully when individual components fail.
Scale at a Glance
<15ms per request
Input classifier latency
95–99% recall
Prompt injection detection
<2%
Output toxicity false positive rate
200+ attack vectors
Red-team attack surface coverage
Deep Dive
Prompt Injection Defense: The LLM's SQL Injection
Prompt injection is the most critical vulnerability in LLM applications: an attacker embeds instructions in user input that hijack the model's behavior. Direct injections ('ignore previous instructions and...') are easy to detect, but indirect injections are far more dangerous: malicious instructions hidden in retrieved documents, emails, or web pages that the LLM processes. Defense strategies operate at multiple levels. Input classifiers (fine-tuned BERT/DeBERTa models) detect known injection patterns with 95%+ accuracy but struggle with novel attacks. Instruction hierarchy (OpenAI's published approach) trains the model to prioritize system prompts over user inputs. Prompt sandboxing isolates user-provided content in clearly delimited sections with explicit instructions to treat it as data, not commands. Canary tokens (unique strings injected into the system prompt) detect when the model has been manipulated into revealing its instructions. No single defense is sufficient; production systems layer all of these together.
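The canary-token idea is simple enough to show concretely. This is an illustrative sketch, assuming a particular marker format and helper names; the principle is only that a unique random string is planted in the system prompt and every output is checked for it.

```python
import secrets

def make_system_prompt(instructions):
    # Plant a unique random marker the model has no legitimate reason to repeat.
    canary = f"CANARY-{secrets.token_hex(8)}"
    return f"{instructions}\n[internal marker: {canary}]", canary

def canary_leaked(output, canary):
    # True means the model was likely manipulated into dumping its instructions.
    return canary in output

system_prompt, canary = make_system_prompt("You are a support assistant.")
```

Because the token is random per deployment (or per session), a leak can be detected without any understanding of the attack that caused it.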
Constitutional AI: Alignment Without Per-Example Human Labels
Constitutional AI (CAI), developed by Anthropic, replaces the expensive human labeling step in RLHF with a set of written principles (a 'constitution'). The process has two phases. In the supervised phase, the model generates responses, then critiques and revises its own outputs based on the constitution ('Is this response harmful? If so, rewrite it to be helpful while avoiding harm'). In the RL phase, the revised responses train a reward model that is then used for reinforcement learning, replacing human preference labels. The constitution typically includes 15–20 principles covering helpfulness, harmlessness, and honesty. CAI's key advantage is scalability: writing principles is far cheaper than labeling thousands of output pairs. The approach also makes alignment decisions transparent and auditable: you can inspect and modify the constitution. However, CAI can be overly conservative, refusing legitimate queries that superficially resemble harmful ones, requiring careful principle tuning.
Output Classification: Toxicity, Hallucination, and Fact-Checking
Output safety classifiers evaluate generated text before it reaches the user. Toxicity classifiers (e.g., Perspective API, OpenAI Moderation endpoint, Meta's Llama Guard) detect harmful content across categories: hate speech, violence, sexual content, self-harm, and dangerous instructions. Hallucination detectors compare generated claims against retrieved source documents, flagging unsupported statements; approaches include NLI (Natural Language Inference) models that classify each claim as 'supported', 'contradicted', or 'neutral' relative to sources. Fact-checking pipelines decompose outputs into atomic claims and verify each against a knowledge base. The critical design decision is the false positive rate: an aggressive classifier blocks harmful content but also blocks legitimate queries about sensitive topics (medical, legal, security research). Production systems typically use tiered thresholds (strict for consumer-facing applications, relaxed for enterprise/research contexts) with human review queues for borderline cases.
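The tiered-threshold decision can be sketched in a few lines. The cutoff numbers and context names here are illustrative assumptions; the point is one classifier score, different cutoffs per deployment context, and a review band between allow and block.

```python
# Illustrative per-context cutoffs for a single toxicity score in [0, 1].
THRESHOLDS = {
    "consumer":   {"review": 0.20, "block": 0.50},  # strict
    "enterprise": {"review": 0.50, "block": 0.80},  # relaxed
}

def moderate(score, context):
    t = THRESHOLDS[context]
    if score >= t["block"]:
        return "block"
    if score >= t["review"]:
        return "review"       # borderline: route to human review queue
    return "allow"
```

A score of 0.6 is blocked outright in the consumer tier but only routed to review in the enterprise tier, which is exactly the tradeoff the paragraph describes.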
Red-Teaming: Systematic Adversarial Testing
Red-teaming proactively discovers vulnerabilities before attackers do. Manual red-teaming uses domain experts who craft adversarial prompts across categories: jailbreaks (bypassing safety training), prompt injection (hijacking behavior), information extraction (leaking training data or system prompts), and bias exploitation. Automated red-teaming uses LLMs to generate adversarial attacks at scale: one model attacks while another evaluates success. Frameworks like Microsoft's PyRIT provide structured attack libraries covering 200+ known attack vectors, while NVIDIA's NeMo Guardrails supplies programmable defenses to test against. Gradient-based attacks (GCG, Greedy Coordinate Gradient) automatically find adversarial suffixes that bypass safety training, though these are computationally expensive. The red-team cycle is continuous: new attacks are discovered, defenses are updated, and the red team tests again. Organizations typically run red-team exercises before every major model or prompt change.
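The attacker/judge loop at the heart of automated red-teaming has a simple skeleton. All three roles are stand-in callables in this sketch; in a real setup each would be an LLM, and the function and field names are assumptions made for the example.

```python
def red_team_round(attacker, target, judge, seeds, max_attempts=10):
    findings = []
    for seed in seeds:
        for _ in range(max_attempts):
            attack = attacker(seed)          # attacker mutates the seed prompt
            response = target(attack)        # target system responds
            if judge(attack, response):      # judge scores a successful bypass
                findings.append({"seed": seed, "attack": attack})
                break                        # move on to the next seed
    return findings  # drives classifier-rule updates and alignment retraining
```

Findings feed the continuous cycle described above: each confirmed bypass becomes a regression test before the next model or prompt change ships.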
PII Detection and Data Loss Prevention for LLMs
LLM applications process user inputs that frequently contain personally identifiable information: names, emails, phone numbers, SSNs, medical records, and financial data. PII detection operates at both input and output stages. Input-side PII detection identifies sensitive data before it reaches the model, either redacting it (replacing with tokens like [EMAIL]) or encrypting it for later reconstruction. NER (Named Entity Recognition) models fine-tuned for PII achieve 97%+ recall on structured PII formats (emails, phone numbers) but struggle with contextual PII ('my neighbor John told me...'). Output-side DLP (Data Loss Prevention) prevents the model from regurgitating training data containing PII, a known risk with large language models that memorize rare sequences. Regex-based detection catches structured formats while transformer-based classifiers handle contextual PII. Differential privacy during training provides mathematical guarantees against memorization but reduces model quality. Production systems combine all approaches with audit logging for compliance with GDPR, HIPAA, and CCPA.
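Reversible input-side redaction can be sketched with regexes alone. This covers only structured formats (the easy case the paragraph describes); contextual PII would need an NER model on top, and the placeholder format here is an assumption for illustration.

```python
import re

# Regex patterns for structured PII formats only.
PII_PATTERNS = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.\w+",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
}

def redact(text):
    # Replace each match with an indexed placeholder and remember the value.
    mapping = {}
    for label, pattern in PII_PATTERNS.items():
        def repl(match, label=label):
            token = f"[{label}_{len(mapping)}]"
            mapping[token] = match.group(0)
            return token
        text = re.sub(pattern, repl, text)
    return text, mapping

def restore(text, mapping):
    # Reinsert original values, e.g. into the final response for the user.
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

Keeping the mapping server-side means the model never sees raw PII, while the user still gets an answer containing their own data.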
Architecture Diagram
AI Safety & Guardrails Architecture (simplified architecture overview)
Core Concepts
Constitutional AI
RLHF
Prompt Injection Defense
Output Classification
Red-Teaming
Llama Guard
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Defense in depth ensures no single point of failure: each layer catches what others miss
- Constitutional AI scales alignment cheaply without requiring per-example human labels for every output
- Input/output classifiers add minimal latency (<15ms) while catching the majority of known attack patterns
- Red-teaming frameworks provide systematic coverage of 200+ known attack vectors before deployment
Weaknesses
- Overly aggressive safety filters produce false positives that block legitimate queries on sensitive topics
- Constitutional AI can make models excessively cautious, refusing edge-case queries that are actually benign
- Adversarial attacks evolve faster than defenses: novel jailbreaks regularly bypass existing classifiers
- PII detection has inherent recall/precision tradeoffs: high recall means more false positives disrupting user experience
FAANG Interview Questions
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting, and focus on tradeoffs, not just what the system does.
- Q1
Design a safety system for a customer-facing LLM chatbot. What layers of defense would you implement and in what order?
- Q2
How would you defend against indirect prompt injection โ where malicious instructions are hidden in documents the LLM retrieves from the web?
- Q3
Explain the tradeoff between safety and helpfulness in LLM alignment. How do you minimize false positive refusals?
- Q4
Your LLM application must comply with GDPR. Design the PII handling pipeline for both input processing and output generation.
- Q5
How would you set up continuous red-teaming for an LLM product? What attack categories would you test and how would you automate it?
Research Papers & Further Reading
Bai, Y., et al. "Constitutional AI: Harmlessness from AI Feedback" (Anthropic, 2022)