Ollama in Production: Running 70B Locally
Mac Studio M4 Pro with 48GB unified memory runs llama3.3:70b for reasoning tasks. Real latency numbers, model selection logic, and where local inference actually beats cloud.
The standard argument against local LLMs is latency and quality. Both are partially correct and increasingly wrong. A Mac Studio M4 Pro with 48GB unified memory runs llama3.3:70b at 8–12 tokens/second. For batch tasks — blog generation, document analysis, classification — that's adequate. For real-time inference it isn't. The selection logic matters.
The unified memory architecture is the key. Apple Silicon doesn't have separate GPU VRAM — the 48GB pool is shared between CPU and GPU. A 70B model at Q4 quantization fits in 38–42GB, leaving 6–10GB for system and context. No model swapping, no quantization compromise that degrades quality.
qwen2.5:3b: Classification, intent detection, real-time chat where latency beats depth. At 70+ t/s, responses feel instant. Don't use for complex reasoning or long-form generation — the quality ceiling is real.
qwen2.5:14b: The default for most tasks. Blog drafts, code generation, document summarization, Q&A on retrieved context. 30 t/s is fast enough for interactive use with a loading indicator. Quality is competitive with GPT-3.5-turbo on most technical tasks.
qwen2.5:32b: Architecture decisions, code review, multi-step reasoning. Background tasks where latency doesn't matter. 12–16 t/s means a 500-token response takes 30–40 seconds — acceptable for non-interactive work.
llama3.3:70b: Complex reasoning where cloud LLM isn't an option due to data sensitivity. The Transceiver Intelligence Platform uses this for generating detailed product analyses that contain internal pricing data.
Data sensitivity: Internal pricing, customer data, security incident details — data with regulatory or competitive sensitivity. Local inference eliminates the trust requirement for cloud transport and provider infrastructure.
High-volume repetitive tasks: At 2,000+ classification requests per day across my LLM Gateway, local inference at zero marginal cost versus $0.002/1K tokens is meaningful economics.
Predictable latency: Cloud LLM cold start latency is 500ms–2s after idle. Ollama on a local machine serves every request at consistent latency. For always-on services, local is structurally better.
Tasks requiring frontier reasoning: complex analysis, nuanced creative work. The quality gap between local 70B and GPT-4o or Claude Sonnet is shrinking but real on hard tasks.
Burst capacity: local inference is limited by the hardware in the room. For unpredictable traffic spikes, the combination works — local handles baseline, cloud handles spikes. The LLM Gateway manages this routing automatically based on queue depth and model availability.