Ollama in Production: Running 70B Locally
Mac Studio M4 Pro with 48GB unified memory runs llama3.3:70b for reasoning tasks. Real latency numbers, model selection logic, and where local inference actually beats cloud.
Circuit breakers, confidence scoring, failover chains — an LLM gateway isn't a proxy. After 50,000 production requests through our internal gateway, here's what the patterns actually look like.
Tutorials show RAG working at 95% recall on toy datasets. Production systems hit 60%. The gap isn't the technology — it's three implementation choices that look harmless and aren't.