LLM API Gateway Architecture
Rate limiting, token tracking, model routing, and cost management
Key Insight
LLM rate limits are expressed in tokens per minute, not requests per minute, so a token bucket algorithm with separate limits for input and output tokens is required.
Request Journey
How It Works
1. Application sends a request with an API key and model preference
2. Gateway authenticates the key and resolves team/user identity for quota enforcement
3. Token bucket rate limiter checks the token budget (not request count, since LLMs charge per token), with separate limits for input and output tokens
4. Semantic cache embeds the prompt and checks cosine similarity against cached prompt-response pairs (threshold ~0.95); a hit returns instantly
5. On a cache miss, the model router selects the optimal provider based on routing rules: the cheapest model meeting the quality tier, latency SLA, and provider health
6. Request is forwarded to the selected provider (OpenAI, Anthropic, or self-hosted)
7. If the primary provider fails (rate limit, timeout, 5xx), the fallback chain retries with the next provider in priority order
8. Response streams back through the gateway; the cost tracker records input/output token counts, calculates cost per team, and checks against budget thresholds
9. Usage metrics feed into dashboards with real-time budget alerts and per-team cost attribution
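The request journey above can be sketched as a single pipeline function. This is a minimal, hypothetical sketch: `limiter`, `cache`, `providers`, and `tracker` stand in for the subsystems described in the Deep Dive sections, and are passed in as plain callables and dicts so the flow is runnable on its own.

```python
class RateLimited(Exception):
    pass

class AllProvidersFailed(Exception):
    pass

def handle_request(prompt, est_tokens, *, limiter, cache, fallback_chain,
                   providers, tracker):
    """Hypothetical gateway pipeline mirroring the request journey above.

    limiter: callable(est_tokens) -> bool   (token budget check)
    cache: dict-like prompt -> response     (stands in for the semantic cache)
    fallback_chain: models in priority order (router output)
    providers: model name -> callable(prompt) -> response
    tracker: callable(model, prompt, response) for cost attribution
    """
    if not limiter(est_tokens):              # token-aware rate limiting
        raise RateLimited()
    cached = cache.get(prompt)               # cache hit returns instantly
    if cached is not None:
        return cached
    for model in fallback_chain:             # route, then fall back on failure
        try:
            response = providers[model](prompt)
            break
        except (TimeoutError, ConnectionError):
            continue
    else:
        raise AllProvidersFailed()
    tracker(model, prompt, response)         # record tokens and cost
    cache[prompt] = response                 # warm the cache for similar prompts
    return response
```

A real gateway streams the response and reconciles token counts after completion; this sketch returns whole responses for clarity.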
The Problem
Organizations using multiple LLM providers face runaway costs, reliability gaps when a provider is down, token-based rate limits that differ from request-based limits, and no visibility into which teams or features are consuming how much AI budget. Managing this directly in application code creates a fragmented, unmanageable mess.
The Solution
An LLM API gateway sits between applications and LLM providers, providing unified token-aware rate limiting, cost attribution, intelligent model routing, semantic caching, and fallback chains, all transparent to the calling application. Token bucket algorithms track per-user/team token consumption; semantic caching returns cached responses for similar (not just identical) prompts.
Scale at a Glance
20-40%
Semantic Cache Hit Rate
50-70%
Cost Reduction (routing)
< 5ms p99
Gateway Overhead Target
99.99% uptime
Provider Fallback SLA
Deep Dive
Token-Based Rate Limiting
LLM providers rate-limit in tokens per minute (TPM), not requests per minute. A single GPT-4 request can consume anywhere from one token to 32,000 tokens, so request-based rate limiters are useless here. The gateway must track running token consumption per API key using a token bucket algorithm with two buckets: one for input tokens and one for output tokens (output tokens are typically 3-4x more expensive). When a bucket is exhausted, the gateway queues or rejects new requests.
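A minimal sketch of the dual-bucket limiter described above. The refill rate and capacities are illustrative; a production limiter would also reconcile estimated output tokens against actual usage after the response completes.

```python
import time

class TokenBucket:
    """Refills continuously at `rate` tokens/sec, up to `capacity`."""
    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate                  # e.g. TPM / 60
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

class TokenRateLimiter:
    """Two buckets per API key: input and output tokens limited separately."""
    def __init__(self, input_tpm: int, output_tpm: int):
        self.input = TokenBucket(input_tpm, input_tpm / 60)
        self.output = TokenBucket(output_tpm, output_tpm / 60)

    def admit(self, est_input: int, est_output: int) -> bool:
        # Check both buckets before consuming from either, so a rejected
        # request does not drain the input bucket.
        self.input.refill()
        self.output.refill()
        if self.input.tokens >= est_input and self.output.tokens >= est_output:
            self.input.tokens -= est_input
            self.output.tokens -= est_output
            return True
        return False
```

Because output size is unknown until the response finishes, `admit` takes an estimate up front; the gateway would credit back (or debit further) once actual counts are known.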
Semantic Caching
Exact-match caching (cache the response for identical prompts) has low hit rates, because users rarely send identical prompts. Semantic caching embeds each prompt and stores it in a vector database. On a new request, if the similarity to a cached prompt exceeds a threshold (e.g., cosine similarity > 0.97), return the cached response. This can achieve 20-40% cache hit rates on repetitive workloads like customer support or code generation for common patterns.
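The threshold check can be sketched as follows. This linear-scan version assumes an `embed` function is supplied (in practice, an embedding model); a production cache would use a vector index (HNSW, IVF) rather than scanning every entry.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

class SemanticCache:
    """Sketch of a similarity-threshold cache over (embedding, response) pairs."""
    def __init__(self, embed, threshold=0.97):
        self.embed = embed          # prompt -> vector; assumed provided
        self.threshold = threshold
        self.entries = []           # list of (embedding, response)

    def get(self, prompt):
        if not self.entries:
            return None
        q = self.embed(prompt)
        best_emb, best_resp = max(self.entries,
                                  key=lambda e: cosine(q, e[0]))
        return best_resp if cosine(q, best_emb) >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))
```

The threshold is the quality knob the Tradeoffs section warns about: lowering it raises hit rate but risks returning a cached answer to a subtly different question.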
Model Routing and Fallback Chains
Not all queries need GPT-4. A router can classify query complexity (simple factual lookup vs. multi-step reasoning) and route to the cheapest model that meets quality requirements. A cost-optimized routing table might be: GPT-4o for complex reasoning (top 10% of queries), GPT-4o-mini for standard queries, Claude Haiku for simple tasks. Fallback chains handle provider outages: if the primary model times out after 3 seconds, retry with a fallback model automatically.
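A routing table plus fallback chain can be sketched as below. The tier names and model lists mirror the example in the text; how queries get classified into tiers (a small classifier model, heuristics on prompt length, etc.) is a separate assumed component.

```python
# Priority-ordered fallback chain per complexity tier (illustrative).
ROUTES = {
    "complex":  ["gpt-4o", "claude-3-5-sonnet"],
    "standard": ["gpt-4o-mini", "gpt-4o"],
    "simple":   ["claude-haiku", "gpt-4o-mini"],
}

class AllProvidersFailed(Exception):
    pass

def route_with_fallback(tier, call_model):
    """Try each model in the tier's chain; fall through on provider failure."""
    for model in ROUTES[tier]:
        try:
            return model, call_model(model)
        except (TimeoutError, ConnectionError):
            continue   # outage or timeout: advance to the next model
    raise AllProvidersFailed(tier)
```

`call_model` is a stand-in for the provider client; in the real gateway it would carry the 3-second timeout mentioned above.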
Cost Attribution and Budgeting
In a multi-team organization, the gateway tags every request with team ID, feature name, and model tier. A time-series cost database (ClickHouse, TimescaleDB) records token counts and dollar costs per request. This enables per-team cost dashboards, budget alerts ('Team X has spent 80% of this month's AI budget'), and automated throttling when teams exceed budgets. Showback and chargeback reports enable finance teams to allocate AI costs accurately.
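Per-request cost attribution boils down to a priced usage record and a running per-team total. A minimal sketch, with illustrative per-million-token prices (real prices come from each provider's price sheet):

```python
from dataclasses import dataclass
from collections import defaultdict

# (input_price, output_price) in USD per 1M tokens -- illustrative values.
PRICE_PER_M = {"gpt-4o": (2.50, 10.00), "gpt-4o-mini": (0.15, 0.60)}

@dataclass
class UsageRecord:
    team: str
    feature: str
    model: str
    input_tokens: int
    output_tokens: int

    @property
    def cost_usd(self) -> float:
        in_price, out_price = PRICE_PER_M[self.model]
        return (self.input_tokens * in_price
                + self.output_tokens * out_price) / 1_000_000

class CostTracker:
    """Accumulates spend per team and flags budget-threshold crossings."""
    def __init__(self, budgets):
        self.budgets = budgets            # team -> monthly budget in USD
        self.spend = defaultdict(float)

    def record(self, rec: UsageRecord) -> float:
        self.spend[rec.team] += rec.cost_usd
        fraction = self.spend[rec.team] / self.budgets[rec.team]
        if fraction >= 0.8:
            print(f"ALERT: {rec.team} at {fraction:.0%} of monthly AI budget")
        return fraction
    # In production, records land in a time-series store for dashboards.
```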
Prompt Management and Versioning
The gateway can store versioned prompt templates and inject them automatically. When an application calls generateResponse with a template ID and variables, the gateway fetches the template, fills in the variables, and sends the result to the LLM. A/B testing different prompt versions (randomly route 10% to summarize-v4) with outcome logging enables systematic prompt optimization without code deploys. The prompt registry also helps mitigate prompt injection by confining user input to strict variable substitution.
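A registry of versioned templates can be sketched with the standard library's `string.Template`, which does strict, non-evaluating variable substitution; the template ID and version names below are hypothetical.

```python
import string

class PromptRegistry:
    """Versioned prompt templates with strict variable substitution."""
    def __init__(self):
        self.templates = {}   # (template_id, version) -> template text

    def register(self, template_id: str, version: str, text: str):
        self.templates[(template_id, version)] = text

    def render(self, template_id: str, version: str, variables: dict) -> str:
        tmpl = string.Template(self.templates[(template_id, version)])
        # substitute() raises KeyError on missing variables and performs no
        # evaluation of the substituted values, only plain text insertion.
        return tmpl.substitute(variables)
```

An A/B split would pick the version (e.g., 90% v3, 10% v4) before calling `render`, logging the chosen version alongside the outcome.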
Architecture Diagram
LLM API Gateway Architecture: simplified architecture overview
Core Concepts
Token Bucket Rate Limiting
Semantic Caching
Model Routing
Cost Attribution
Fallback Chains
Prompt Management
Tradeoffs & Design Decisions
Every architectural decision is a tradeoff. Here's what you gain and what you give up.
Strengths
- Centralizes LLM governance: rate limiting, cost control, and audit logging in one place
- Semantic caching can reduce LLM costs by 20-40% on repetitive workloads
- Model routing reduces average cost by 50-70% without degrading quality for most requests
- Provider fallback chains improve reliability from ~99.9% to ~99.99%
Weaknesses
- The gateway becomes a critical single point of failure; it must be highly available with under 5ms overhead
- Semantic caching requires careful threshold tuning: too loose returns wrong answers, too strict misses cache hits
- Model routing quality depends on the routing classifier; a bad classifier sends complex queries to cheap models
- Adds operational complexity: another service to deploy, monitor, and scale
FAANG Interview Questions
Interview Prep: These questions appear in FAANG system design rounds. Focus on tradeoffs, not just what the system does.
These are real system design interview questions asked at Google, Meta, Amazon, Apple, Netflix, and Microsoft. Study the architecture above before attempting.
Q1. Design a token-aware rate limiter for an LLM API gateway. How do you handle token consumption that is not known until the response completes?
Q2. How does semantic caching work? What threshold would you use for cache hits, and how would you evaluate quality?
Q3. Design a model routing system that minimizes cost while maintaining quality. What signals would the router use?
Q4. Your LLM gateway needs to handle 10,000 requests per minute with p99 latency under 50ms overhead. What architecture would you use?
Q5. How would you implement per-team LLM cost attribution in a multi-tenant gateway? What data would you store and how?