LLM Gateway Patterns: What We Learned After 50,000 Requests
Circuit breakers, confidence scoring, failover chains — an LLM gateway isn't a proxy. After 50,000 production requests through our internal gateway, here's what the patterns actually look like.
My LLM gateway has been in production for six months across seven of my own projects. The 50,000 requests below are real traffic from those projects, not a synthetic benchmark. Here is what that traffic showed.
An LLM gateway is not an HTTP proxy. It's a pipeline that manages quality, cost, latency, and availability across multiple LLM providers with different capability profiles. The difference becomes clear at the first production incident: a provider degrades at 3pm on a Tuesday and your application needs to decide, in under 200ms, what to do next.
LLM APIs fail in ways that differ from traditional APIs. HTTP 500 is straightforward — retry or fail. LLM degradation is subtler: responses come back HTTP 200 but quality drops, latency increases from 800ms to 8 seconds, or the model starts refusing requests that worked yesterday due to silent safety policy updates.
Traditional circuit breakers trip on error rates. LLM circuit breakers need to trip on quality degradation, latency percentile drift, and confidence score drops — none of which appear in HTTP status codes.
The composite score determines routing: high-confidence, low-complexity requests go to the fastest/cheapest model. High-complexity or low-confidence requests route to the most capable model. Borderline requests trigger the confidence gate — a secondary smaller model evaluates routing clarity before the main routing decision.
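The routing rule above can be sketched in a few lines. This is a minimal illustration, not the gateway's code: the `Scores` fields, the tier names, every threshold value, and the `gate` callable (standing in for the secondary smaller model) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scores:
    confidence: float  # 0..1, how clear the routing signal is
    complexity: float  # 0..1, how demanding the request looks


def route(scores: Scores, gate: Callable[[Scores], str], *,
          hi_conf: float = 0.8, lo_conf: float = 0.5,
          lo_cx: float = 0.3, hi_cx: float = 0.7) -> str:
    """Return a tier name; all thresholds are hypothetical."""
    # High confidence and low complexity: fastest/cheapest model.
    if scores.confidence >= hi_conf and scores.complexity <= lo_cx:
        return "fast"
    # Low confidence or high complexity: most capable model.
    if scores.confidence < lo_conf or scores.complexity >= hi_cx:
        return "capable"
    # Borderline: defer to the confidence gate before deciding.
    return gate(scores)


def conservative_gate(scores: Scores) -> str:
    # Placeholder for the secondary model: when unsure, pick the capable tier.
    return "capable" if scores.complexity > 0.5 else "fast"
```

For example, `route(Scores(0.9, 0.1), conservative_gate)` takes the fast path without ever calling the gate, while `Scores(0.6, 0.5)` falls into the borderline band and is decided by it.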
Free-tier chain: Cerebras (ultra-fast, simple tasks) → Groq (fast, mid-complexity) → Mistral AI (balanced) → NVIDIA NIM (structured output) → Cloudflare Workers AI (always available, lower ceiling).
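A failover chain like this one can be walked with a simple loop. The sketch below assumes each provider is wrapped in a plain callable that raises on failure; the provider keys, `call_with_failover`, `is_available`, and `AllProvidersFailed` are hypothetical names for illustration, not the gateway's API.

```python
from typing import Callable, Dict, Sequence, Tuple

# Order taken from the chain above; the string keys are made up for this sketch.
FREE_TIER_ORDER = ["cerebras", "groq", "mistral", "nvidia-nim", "cloudflare-workers-ai"]

Provider = Tuple[str, Callable[[str], str]]


class AllProvidersFailed(Exception):
    def __init__(self, errors: Dict[str, str]):
        super().__init__("every provider in the chain failed or was skipped")
        self.errors = errors


def call_with_failover(prompt: str, chain: Sequence[Provider],
                       is_available: Callable[[str], bool] = lambda name: True
                       ) -> Tuple[str, str]:
    """Try providers in order; return (provider_name, response) from the first success."""
    errors: Dict[str, str] = {}
    for name, call in chain:
        if not is_available(name):  # e.g. the provider's breaker is open
            errors[name] = "skipped"
            continue
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next provider
            errors[name] = repr(exc)
    raise AllProvidersFailed(errors)
```

Passing the circuit breaker's availability check as `is_available` keeps excluded providers out of the walk without the chain needing to know how exclusion is decided.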
Each provider has a circuit breaker state: closed (normal routing), half-open (testing after degradation), or open (excluded). State transitions are based on three signals: error rate over 60-second windows, p95 latency against a rolling baseline, and quality scores from spot-check evaluations.
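Here is a minimal sketch of such a breaker, tripping on any of the three signals. All thresholds, the cooldown, the single-probe recovery rule, and the class and method names are assumptions for illustration. For brevity the p95 baseline is a fixed number here, where a real implementation would maintain it as a rolling value.

```python
import math
import time
from collections import deque
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal routing
    HALF_OPEN = "half-open"  # testing after degradation
    OPEN = "open"            # excluded from routing


class ProviderBreaker:
    """Per-provider breaker; every default below is an illustrative assumption."""

    def __init__(self, *, baseline_p95_ms=800.0, window_s=60.0, max_error_rate=0.2,
                 max_p95_ratio=2.0, min_quality=0.7, min_samples=5,
                 cooldown_s=30.0, clock=time.monotonic):
        self.baseline_p95_ms = baseline_p95_ms
        self.window_s = window_s
        self.max_error_rate = max_error_rate
        self.max_p95_ratio = max_p95_ratio
        self.min_quality = min_quality
        self.min_samples = min_samples
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.samples = deque()           # (timestamp, ok, latency_ms)
        self.quality = deque(maxlen=20)  # recent spot-check scores in [0, 1]
        self.state = State.CLOSED
        self.opened_at = 0.0

    def allow(self):
        """True if the router may send traffic; moves open -> half-open after cooldown."""
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.cooldown_s:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record(self, ok, latency_ms, quality=None):
        """Record one request outcome and update the breaker state."""
        now = self.clock()
        self.samples.append((now, ok, latency_ms))
        if quality is not None:
            self.quality.append(quality)
        while self.samples and self.samples[0][0] < now - self.window_s:
            self.samples.popleft()  # keep only the last window_s seconds
        limit_ms = self.max_p95_ratio * self.baseline_p95_ms
        if self.state is State.HALF_OPEN:
            # A single probe decides: healthy closes the breaker, anything else reopens it.
            healthy = (ok and latency_ms <= limit_ms
                       and (quality is None or quality >= self.min_quality))
            if healthy:
                self.samples.clear()
                self.quality.clear()
                self.state = State.CLOSED
            else:
                self._trip(now)
        elif self.state is State.CLOSED and self._unhealthy(limit_ms):
            self._trip(now)

    def _trip(self, now):
        self.state = State.OPEN
        self.opened_at = now

    def _unhealthy(self, limit_ms):
        n = len(self.samples)
        if n < self.min_samples:
            return False  # too little data to judge
        error_rate = sum(1 for _ts, ok, _lat in self.samples if not ok) / n
        latencies = sorted(lat for _ts, _ok, lat in self.samples)
        p95 = latencies[math.ceil(0.95 * n) - 1]  # nearest-rank percentile
        low_quality = (len(self.quality) > 0
                       and sum(self.quality) / len(self.quality) < self.min_quality)
        return error_rate > self.max_error_rate or p95 > limit_ms or low_quality
```

Note that a provider returning HTTP 200 on every request can still trip this breaker, either through latency drift or through low spot-check scores, which is the case a plain error-rate breaker misses.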
The 6.8% quality gate intervention rate is the most valuable number. Without the gate, those 3,441 requests would have been routed to the wrong tier: either under-provisioned (a complex request on a cheap model, producing poor output) or over-provisioned (a simple request on the most expensive model, wasting money). The confidence scoring pays for its complexity through better cost and quality.
12ms of overhead is acceptable at 800ms upstream latency, where it adds 1.5%. At 150ms (Cerebras), it's 8% overhead. Fix: a hot path for fast models, using a 5-dimension fast scoring pass that runs in under 3ms. Latency overhead for fast-tier routing is now 3–4ms, or roughly 2–3% at 150ms. Acceptable.
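The overhead percentages follow from dividing gateway time by upstream latency. The short sketch below reproduces that arithmetic and shows one way the hot-path choice could be expressed. The function names and the 300ms cutoff are assumptions for illustration only.

```python
def overhead_pct(gateway_ms: float, upstream_ms: float) -> float:
    """Gateway overhead as a percentage of upstream latency."""
    return 100.0 * gateway_ms / upstream_ms


def pick_scorer(expected_upstream_ms: float, hot_path_cutoff_ms: float = 300.0) -> str:
    """Use the cheap scorer when the upstream is fast enough for overhead to matter.

    The cutoff value is hypothetical.
    """
    return "fast-5-dim" if expected_upstream_ms <= hot_path_cutoff_ms else "full"


if __name__ == "__main__":
    print(overhead_pct(12, 800))           # full scoring on an 800ms upstream
    print(overhead_pct(12, 150))           # full scoring on a 150ms upstream
    print(round(overhead_pct(4, 150), 1))  # hot path, worst case of 3-4ms
    print(pick_scorer(150), pick_scorer(800))
```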
The gateway code is private for now — open source release planned for Q3 2026 after another month of production hardening.