
LLM Gateway Patterns: What We Learned After 50,000 Requests

Circuit breakers, confidence scoring, failover chains — an LLM gateway isn't a proxy. After 50,000 production requests through our internal gateway, here's what the patterns actually look like.

Rene Fichtmueller / 2026-05-21 / ~2 min read

My LLM Gateway has been in production for six months across seven of my own projects. The 50,000 requests below are the dataset I have, not a benchmark I'm running. Here is what the production traffic actually showed.

An LLM gateway is not an HTTP proxy. It's a pipeline that manages quality, cost, latency, and availability across multiple LLM providers with different capability profiles. The difference becomes clear at the first production incident: a provider degrades at 3pm on a Tuesday and your application needs to decide, in under 200ms, what to do next.

The Core Problem

LLM APIs fail in ways that differ from traditional APIs. HTTP 500 is straightforward — retry or fail. LLM degradation is subtler: responses come back HTTP 200 but quality drops, latency increases from 800ms to 8 seconds, or the model starts refusing requests that worked yesterday due to silent safety policy updates.

Traditional circuit breakers trip on error rates. LLM circuit breakers need to trip on quality degradation, latency percentile drift, and confidence score drops — none of which appear in HTTP status codes.
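As a rough illustration of tripping on latency drift rather than status codes, the sketch below compares a recent p95 against a baseline p95. The function names, the nearest-rank percentile method, and the 2x drift factor are assumptions made for this example, not the gateway's actual implementation.

```typescript
// Hypothetical sketch: detect degradation that HTTP status codes do not show.
// Names and thresholds are illustrative assumptions.

/** Nearest-rank percentile over a list of latency samples in milliseconds. */
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

/** True when the recent p95 exceeds the baseline p95 by the given factor. */
function latencyDrifted(
  baselineMs: number[],
  recentMs: number[],
  factor = 2, // assumed drift threshold
): boolean {
  return percentile(recentMs, 95) > factor * percentile(baselineMs, 95);
}

// Every response in both windows could be HTTP 200; only the latency changed.
const baselineLatencies = [780, 800, 810, 790, 805, 820, 795, 800, 815, 790];
const degradedLatencies = [900, 4200, 7800, 8100, 8000, 950, 7900, 8200, 8050, 7700];

console.log(latencyDrifted(baselineLatencies, degradedLatencies)); // true
```

An error-rate breaker would see nothing wrong with the second window, which is why the latency signal has to be tracked separately.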

The 23-Dimension Confidence Gate

Confidence Score Dimensions (Sample)

Request complexity: token count, nested instructions, multi-step reasoning
Domain specificity: technical vs general, specialized vocabulary density
Session momentum: consistency with previous turns in session
Intent clarity: ambiguity score from Trie-based pattern matching
Safety risk score: ShieldX integration, prompt injection probability

The composite score determines routing: high-confidence, low-complexity requests go to the fastest/cheapest model. High-complexity or low-confidence requests route to the most capable model. Borderline requests trigger the confidence gate — a secondary smaller model evaluates routing clarity before the main routing decision.
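The routing rule above can be sketched as a small decision function. The two-field score, the 0 to 1 ranges, the cut-off values, and the tier names are assumptions chosen for illustration; the real composite score has 23 dimensions and its thresholds are not published here.

```typescript
// Hypothetical sketch of the routing rule described above.
// Score ranges, cut-offs and tier names are assumptions.

type Tier = "fast" | "capable";

type RoutingDecision =
  | { kind: "route"; tier: Tier }
  | { kind: "gate" }; // borderline: defer to the secondary evaluator model

interface CompositeScore {
  confidence: number; // 0..1, higher means clearer intent
  complexity: number; // 0..1, higher means a harder request
}

const HIGH_CUTOFF = 0.7; // assumed
const LOW_CUTOFF = 0.4; // assumed

function decideRoute(score: CompositeScore): RoutingDecision {
  // High confidence and low complexity: fastest/cheapest model.
  if (score.confidence >= HIGH_CUTOFF && score.complexity <= LOW_CUTOFF) {
    return { kind: "route", tier: "fast" };
  }
  // High complexity or low confidence: most capable model.
  if (score.complexity >= HIGH_CUTOFF || score.confidence <= LOW_CUTOFF) {
    return { kind: "route", tier: "capable" };
  }
  // Everything in between is borderline and goes through the confidence gate.
  return { kind: "gate" };
}

console.log(decideRoute({ confidence: 0.9, complexity: 0.2 }).kind); // "route"
console.log(decideRoute({ confidence: 0.55, complexity: 0.5 }).kind); // "gate"
```

Modelling the gate as a separate decision kind keeps the secondary model call off the path for the clear-cut cases.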

Failover Chain

Free-tier chain: Cerebras (ultra-fast, simple tasks) → Groq (fast, mid-complexity) → Mistral AI (balanced) → NVIDIA NIM (structured output) → Cloudflare Workers AI (always available, lower ceiling).
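One way to walk a chain like this is to try providers in order, skip any whose breaker is open, and fall through to the next on failure. The sketch below assumes a hypothetical `ChainProvider` shape and uses stub `call` functions in place of real provider SDKs; treating half-open providers as still eligible is also an assumption.

```typescript
// Hypothetical sketch: ordered failover that skips providers with an open breaker.
// The ChainProvider shape and the stub calls are assumptions, not real provider APIs.

type BreakerState = "closed" | "half-open" | "open";

interface ChainProvider {
  id: string;
  breaker: BreakerState;
  call: (prompt: string) => Promise<string>;
}

/** Providers that may receive traffic, in chain order. */
function eligibleProviders(chain: ChainProvider[]): ChainProvider[] {
  return chain.filter((p) => p.breaker !== "open");
}

/** Try each eligible provider in order; return the first successful answer. */
async function callWithFailover(
  chain: ChainProvider[],
  prompt: string,
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const p of eligibleProviders(chain)) {
    try {
      return { provider: p.id, text: await p.call(prompt) };
    } catch (err) {
      errors.push(`${p.id}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`all providers failed or were excluded: ${errors.join("; ")}`);
}

// Stub that stands in for a real provider client.
const stub = (id: string) => async (prompt: string) => `${id} answered: ${prompt}`;

const freeTierChain: ChainProvider[] = [
  { id: "cerebras", breaker: "open", call: stub("cerebras") }, // excluded
  { id: "groq", breaker: "closed", call: stub("groq") },
  { id: "mistral", breaker: "closed", call: stub("mistral") },
  { id: "nvidia-nim", breaker: "closed", call: stub("nvidia-nim") },
  { id: "cloudflare-workers-ai", breaker: "closed", call: stub("cloudflare-workers-ai") },
];

callWithFailover(freeTierChain, "ping").then((r) => console.log(r.provider)); // "groq"
```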

Each provider has a circuit breaker state: closed (normal routing), half-open (testing after degradation), or open (excluded). State transitions are based on three signals: error rate over 60-second windows, p95 latency against a rolling baseline, and quality scores from spot-check evaluations.
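The three states and three signals can be sketched as a pure transition function. The limit values, the `cooldownElapsed` flag, and all type names below are assumptions for illustration; the article does not state the actual thresholds or how long a breaker stays open.

```typescript
// Hypothetical sketch of a breaker that trips on quality and latency, not just errors.
// All limits and names are assumptions.

type BreakerState = "closed" | "half-open" | "open";

interface HealthWindow {
  errorRate: number; // fraction of failed requests in the last 60 seconds
  p95LatencyMs: number; // p95 latency over the same window
  baselineP95Ms: number; // rolling baseline p95
  qualityScore: number; // 0..1, from spot-check evaluations
}

interface BreakerLimits {
  maxErrorRate: number;
  maxLatencyFactor: number; // allowed multiple of the baseline p95
  minQuality: number;
}

const ASSUMED_LIMITS: BreakerLimits = {
  maxErrorRate: 0.2,
  maxLatencyFactor: 3,
  minQuality: 0.6,
};

function isHealthy(w: HealthWindow, limits: BreakerLimits): boolean {
  return (
    w.errorRate <= limits.maxErrorRate &&
    w.p95LatencyMs <= limits.maxLatencyFactor * w.baselineP95Ms &&
    w.qualityScore >= limits.minQuality
  );
}

/**
 * Next state after evaluating one window.
 * cooldownElapsed: whether an open breaker has waited long enough to be probed.
 */
function nextState(
  current: BreakerState,
  w: HealthWindow,
  cooldownElapsed: boolean,
  limits = ASSUMED_LIMITS,
): BreakerState {
  const healthy = isHealthy(w, limits);
  switch (current) {
    case "closed":
      return healthy ? "closed" : "open";
    case "open":
      return cooldownElapsed ? "half-open" : "open";
    case "half-open":
      return healthy ? "closed" : "open";
  }
}

// Zero errors, but p95 went from 800ms to 8 seconds: the breaker still opens.
const slowButOk: HealthWindow = {
  errorRate: 0,
  p95LatencyMs: 8000,
  baselineP95Ms: 800,
  qualityScore: 0.9,
};
console.log(nextState("closed", slowButOk, false)); // "open"
```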

What 50,000 Requests Revealed

Production Gateway Stats

Total requests routed: 50,847
Free tier usage: 78.3%
Failover activations: 1,247 (2.5%)
Circuit breaker trips: 23 events across 6 providers
Quality gate interventions: 3,441 (6.8%)
Average gateway overhead: 12ms

The 6.8% quality gate intervention rate is the most valuable number. Without the gate, those 3,441 requests would have been routed to the wrong tier: either underpowered (a complex request on a cheap model, producing poor output) or wastefully capable (a simple request on the most expensive model). The confidence scoring pays for its complexity in cost and quality optimization.

The Latency Problem

A 12ms overhead is acceptable at 800ms upstream latency, where it adds 1.5%. At 150ms (Cerebras), it adds 8%. The fix is a hot path for fast models: a 5-dimension fast scoring pass that runs in under 3ms. That brings the latency overhead for fast-tier routing down to 3–4ms, roughly 2 to 3% of a 150ms call, which is acceptable.
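The overhead arithmetic, and one possible rule for choosing the hot path, can be sketched as follows. The 5% budget, the path names, and the assumption that the full 23-dimension scoring accounts for the whole 12ms are illustrative choices, not figures from the gateway.

```typescript
// Hypothetical sketch: relative gateway overhead and a budget-based path choice.
// The budget and path names are assumptions.

/** Gateway overhead as a percentage of upstream latency. */
function overheadPercent(gatewayMs: number, upstreamMs: number): number {
  return (gatewayMs * 100) / upstreamMs;
}

type ScoringPath = "full-23" | "fast-5";

/** Use the fast path when full scoring would exceed the overhead budget. */
function pickScoringPath(
  expectedUpstreamMs: number,
  fullScoringMs = 12, // average overhead reported above
  budgetPercent = 5, // assumed budget
): ScoringPath {
  return overheadPercent(fullScoringMs, expectedUpstreamMs) > budgetPercent
    ? "fast-5"
    : "full-23";
}

console.log(overheadPercent(12, 800)); // 1.5
console.log(overheadPercent(12, 150)); // 8
console.log(pickScoringPath(150)); // "fast-5"
console.log(pickScoringPath(800)); // "full-23"
```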

The gateway code is private for now — open source release planned for Q3 2026 after another month of production hardening.