LLM API Gateway Architecture
Rate limiting, token tracking, model routing, and cost management
Key Insight
LLM rate limits are expressed in tokens per minute, not requests per minute, so a token bucket algorithm with separate limits for input and output tokens is required.
Request Journey
How It Works
1. Application sends a request with an API key and model preference
2. Gateway authenticates the key and resolves team/user identity for quota enforcement
3. Token bucket rate limiter checks the token budget (not request count, since LLMs charge per token), with separate limits for input and output tokens
4. Semantic cache embeds the prompt and checks cosine similarity against cached prompt-response pairs (threshold ~0.95); a hit returns instantly
5. On a cache miss, the model router selects the optimal provider based on routing rules: the cheapest model meeting the quality tier, latency SLA, and provider health
6. Request is forwarded to the selected provider (OpenAI, Anthropic, or self-hosted)
7. If the primary provider fails (rate limit, timeout, 5xx), the fallback chain retries with the next provider in priority order
8. Response streams back through the gateway; the cost tracker records input/output token counts, calculates cost per team, and checks against budget thresholds
9. Usage metrics feed into dashboards with real-time budget alerts and per-team cost attribution
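The request journey above can be sketched as a single pipeline function. This is a minimal, hypothetical sketch: `limiter`, `cache`, `providers`, and `tracker` stand in for the subsystems described in the Deep Dive sections, and are passed in as plain callables and dicts so the flow is runnable on its own.

```python
class RateLimited(Exception):
    pass

class AllProvidersFailed(Exception):
    pass

def handle_request(prompt, est_tokens, *, limiter, cache, fallback_chain,
                   providers, tracker):
    """Hypothetical gateway pipeline mirroring the request journey above.

    limiter: callable(est_tokens) -> bool   (token budget check)
    cache: dict-like prompt -> response     (stands in for the semantic cache)
    fallback_chain: models in priority order (router output)
    providers: model name -> callable(prompt) -> response
    tracker: callable(model, prompt, response) for cost attribution
    """
    if not limiter(est_tokens):              # token-aware rate limiting
        raise RateLimited()
    cached = cache.get(prompt)               # cache hit returns instantly
    if cached is not None:
        return cached
    for model in fallback_chain:             # route, then fall back on failure
        try:
            response = providers[model](prompt)
            break
        except (TimeoutError, ConnectionError):
            continue
    else:
        raise AllProvidersFailed()
    tracker(model, prompt, response)         # record tokens and cost
    cache[prompt] = response                 # warm the cache for similar prompts
    return response
```

A real gateway streams the response and reconciles token counts after completion; this sketch returns whole responses for clarity.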
The Problem
Organizations using multiple LLM providers face runaway costs, reliability gaps when a provider is down, token-based rate limits that differ from request-based limits, and no visibility into which teams or features are consuming how much AI budget. Managing this directly in application code creates a fragmented, unmanageable mess.
The Solution
An LLM API gateway sits between applications and LLM providers, providing unified token-aware rate limiting, cost attribution, intelligent model routing, semantic caching, and fallback chains, all transparent to the calling application. Token bucket algorithms track per-user/team token consumption; semantic caching returns cached responses for similar (not just identical) prompts.
Scale at a Glance
20-40%
Semantic Cache Hit Rate
50-70%
Cost Reduction (routing)
< 5ms p99
Gateway Overhead Target
99.99% uptime
Provider Fallback SLA
Deep Dive
Token-Based Rate Limiting
LLM providers rate-limit in tokens per minute (TPM), not requests per minute. A single GPT-4 request can consume anywhere from one token to 32,000 tokens, so request-based rate limiters are useless here. The gateway must track running token consumption per API key using a token bucket algorithm with two buckets: one for input tokens and one for output tokens (output tokens are typically 3-4x more expensive). When a bucket is exhausted, the gateway queues or rejects new requests.
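A minimal sketch of the dual-bucket limiter described above. The refill rate and capacities are illustrative; a production limiter would also reconcile estimated output tokens against actual usage after the response completes.

```python
import time

class TokenBucket:
    """Refills continuously at `rate` tokens/sec, up to `capacity`."""
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate                  # e.g. TPM / 60
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

class TokenRateLimiter:
    """Two buckets per API key: input and output tokens limited separately."""
    def __init__(self, input_tpm: int, output_tpm: int):
        self.input = TokenBucket(input_tpm, input_tpm / 60)
        self.output = TokenBucket(output_tpm, output_tpm / 60)

    def admit(self, est_input: int, est_output: int) -> bool:
        # Check both buckets before consuming from either, so a rejected
        # request does not drain the input bucket.
        self.input.refill()
        self.output.refill()
        if self.input.tokens >= est_input and self.output.tokens >= est_output:
            self.input.tokens -= est_input
            self.output.tokens -= est_output
            return True
        return False
```

Because output size is unknown until the response finishes, `admit` takes an estimate up front; the gateway would credit back (or debit further) once actual counts are known.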
Semantic Caching
Exact-match caching (cache the response for identical prompts) has low hit rates, because users rarely send identical prompts. Semantic caching embeds each prompt and stores it in a vector database. On a new request, if the similarity to a cached prompt exceeds a threshold (e.g., cosine similarity > 0.97), return the cached response. This can achieve 20-40% cache hit rates on repetitive workloads like customer support or code generation for common patterns.
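The threshold check can be sketched as follows. This linear-scan version assumes an `embed` function is supplied (in practice, an embedding model); a production cache would use a vector index (HNSW, IVF) rather than scanning every entry.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Sketch of a similarity-threshold cache over (embedding, response) pairs."""
    def __init__(self, embed, threshold=0.97):
        self.embed = embed          # prompt -> vector; assumed provided
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def get(self, prompt):
        if not self.entries:
            return None
        q = self.embed(prompt)
        best_emb, best_resp = max(self.entries,
                                  key=lambda e: cosine(q, e[0]))
        return best_resp if cosine(q, best_emb) >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

The threshold is the quality knob the Tradeoffs section warns about: lowering it raises hit rate but risks returning a cached answer to a subtly different question.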
Model Routing and Fallback Chains
Not all queries need GPT-4. A router can classify query complexity (simple factual lookup vs. multi-step reasoning) and route to the cheapest model that meets quality requirements. A cost-optimized routing table might be: GPT-4o for complex reasoning (top 10% of queries), GPT-4o-mini for standard queries, Claude Haiku for simple tasks. Fallback chains handle provider outages: if the primary model times out after 3 seconds, retry with a fallback model automatically.
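A routing table plus fallback chain can be sketched as below. The tier names and model lists mirror the example in the text; how queries get classified into tiers (a small classifier model, heuristics on prompt length, etc.) is a separate assumed component.

```python
# Priority-ordered fallback chain per complexity tier (illustrative).
ROUTES = {
    "complex":  ["gpt-4o", "claude-3-5-sonnet"],
    "standard": ["gpt-4o-mini", "gpt-4o"],
    "simple":   ["claude-haiku", "gpt-4o-mini"],
}

class AllProvidersFailed(Exception):
    pass

def route_with_fallback(tier, call_model):
    """Try each model in the tier's chain; fall through on provider failure."""
    for model in ROUTES[tier]:
        try:
            return model, call_model(model)
        except (TimeoutError, ConnectionError):
            continue   # outage or timeout: advance to the next model
    raise AllProvidersFailed(tier)
```

`call_model` is a stand-in for the provider client; in the real gateway it would carry the 3-second timeout mentioned above.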
Cost Attribution and Budgeting
In a multi-team organization, the gateway tags every request with team ID, feature name, and model tier. A time-series cost database (ClickHouse, TimescaleDB) records token counts and dollar costs per request. This enables per-team cost dashboards, budget alerts ('Team X has spent 80% of this month's AI budget'), and automated throttling when teams exceed budgets. Showback and chargeback reports enable finance teams to allocate AI costs accurately.
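Per-request cost attribution boils down to a priced usage record and a running per-team total. A minimal sketch, with illustrative per-million-token prices (real prices come from each provider's price sheet):

```python
from dataclasses import dataclass
from collections import defaultdict

# (input_price, output_price) in USD per 1M tokens -- illustrative values.
PRICE_PER_M = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

@dataclass
class UsageRecord:
    team: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICE_PER_M[self.model]
        return (self.input_tokens * in_price
                + self.output_tokens * out_price) / 1_000_000

class CostTracker:
    """Accumulates spend per team and flags budget-threshold crossings."""
    def __init__(self, budgets):
        self.budgets = budgets            # team -> monthly budget in USD
        self.spend = defaultdict(float)

    def record(self, rec: UsageRecord) -> float:
        self.spend[rec.team] += rec.cost_usd
        fraction = self.spend[rec.team] / self.budgets[rec.team]
        if fraction >= 0.8:
            print(f"ALERT: {rec.team} at {fraction:.0%} of monthly AI budget")
        return fraction
    # In production, records land in a time-series store for dashboards.
```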
Prompt Management and Versioning
The gateway can store versioned prompt templates and inject them automatically. When an application calls generateResponse with a template ID and variables, the gateway fetches the template, fills in the variables, and sends the result to the LLM. A/B testing different prompt versions (randomly route 10% to summarize-v4) with outcome logging enables systematic prompt optimization without code deploys. The prompt registry also helps mitigate prompt injection by confining user input to strict variable substitution.
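A registry of versioned templates can be sketched with the standard library's `string.Template`, which does strict, non-evaluating variable substitution; the template ID and version names below are hypothetical.

```python
import string

class PromptRegistry:
    """Versioned prompt templates with strict variable substitution."""
    def __init__(self):
        self.templates = {}   # (template_id, version) -> template text

    def register(self, template_id: str, version: str, text: str):
        self.templates[(template_id, version)] = text

    def render(self, template_id: str, version: str, variables: dict) -> str:
        tmpl = string.Template(self.templates[(template_id, version)])
        # substitute() raises KeyError on missing variables and performs no
        # evaluation of the substituted values, only plain text insertion.
        return tmpl.substitute(variables)
```

An A/B split would pick the version (e.g., 90% v3, 10% v4) before calling `render`, logging the chosen version alongside the outcome.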
Architecture Diagram
LLM API Gateway Architecture: simplified architecture overview
Core Concepts
Token Bucket Rate Limiting
Semantic Caching
Model Routing
Cost Attribution
Fallback Chains
Prompt Management
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Centralizes LLM governance: rate limiting, cost control, and audit logging in one place
- Semantic caching can reduce LLM costs by 20-40% on repetitive workloads
- Model routing reduces average cost by 50-70% without degrading quality for most requests
- Provider fallback chains improve reliability from ~99.9% to ~99.99%
Weaknesses
- The gateway becomes a critical single point of failure; it must be highly available with under 5ms overhead
- Semantic caching requires careful threshold tuning: too loose returns wrong answers, too strict misses cache hits
- Model routing quality depends on the routing classifier; a bad classifier sends complex queries to cheap models
- Adds operational complexity: another service to deploy, monitor, and scale
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
Q1. Design a token-aware rate limiter for an LLM API gateway. How do you handle token consumption that is not known until the response completes?
Q2. How does semantic caching work? What threshold would you use for cache hits, and how would you evaluate quality?
Q3. Design a model routing system that minimizes cost while maintaining quality. What signals would the router use?
Q4. Your LLM gateway needs to handle 10,000 requests per minute with p99 latency under 50ms overhead. What architecture would you use?
Q5. How would you implement per-team LLM cost attribution in a multi-tenant gateway? What data would you store and how?