stack local-first llm infrastructure

The Stack I Actually Ship With in 2026

The stack is the side effect, not the project. Two principles drive every decision — keep it local, and move every model to one I trained myself. Everything else is downstream.

Rene Fichtmueller / 2026-05-14 / ~4 min read

The stack isn't the point. I get asked which framework, which database, which model, which orchestrator. The honest answer is that I don't think about those choices much anymore. The stack is bycatch from the real projects, not the project itself.

What I do think about — every architectural decision in the last twenty-four months — comes down to two things. Keep it local. Push every model interaction toward LLMs I've trained myself, as fast as I can.

This post is the stack as a side effect of those two principles.

Local-first is the only principle that mattered

When I say local-first, I mean: nothing runs on infrastructure I don't control, except what genuinely has to. The list of "has to" is shorter than most people think.

Inference: local. Mac Studio M4 Pro, 48 GB unified memory, Ollama running qwen2.5:32b for background tasks and qwen2.5:3b for sub-second real-time. Free-tier cloud LLMs as fallback for simple work, frontier models as a last resort for the hardest reasoning. Local stays the backbone whenever private data is in the loop.

Storage: local. PostgreSQL 17 on a dedicated server in Frankfurt, pgvector for embeddings, TimescaleDB for time-series. Eighteen months without a data-loss event. The operational knowledge of running my own database is worth more than every managed-service feature I'm missing.

Code: local. Gitea on my own server is the primary Git remote. All development, all experimentation, all private packages go there first. GitHub gets public releases only.

Vector search: local. Qdrant in a container alongside Postgres. HTTP API, filterable, HNSW index tuning that works. Chroma, Weaviate, Milvus all evaluated. Qdrant was the only one that did filtered search correctly on the first attempt.

If a vendor disappears or changes their terms tomorrow, none of these break. That's the test.

The second principle — move everything to my own models

The reason local-first matters for me specifically: I am moving every model interaction toward LLMs I've trained myself.

Right now, a lot of the inference still goes through off-the-shelf open weights and the occasional free-tier cloud model. That's the transition state, not the destination. The destination is that every domain my projects touch gets a fine-tuned local model that understands my data, my voice, my conventions.

Three reasons this matters more than the stack choices ever will.

The cloud LLMs don't know my domain. They guess. A fine-tuned version doesn't have to.

The cloud LLMs change without warning. A prompt that worked in March doesn't work in April. My fine-tuned versions don't drift unless I change them.

The data I work with isn't something I want to send to a third party. Not because it's classified — because it's mine.

The roadmap is concrete. LLM Gateway today routes between Ollama on the Mac Studio (default) and a free-tier cloud fallback chain (Cerebras, Groq, Mistral, NVIDIA NIM, Cloudflare Workers AI). Frontier models last, only when nothing else handles the query. Fine-tuning happens on the Mac Studio via MPS-accelerated LoRA. GGUF export. Back into Ollama. Every domain-specific task earns its own small fine-tune over time. Generic foundation models stay generic. Specific projects deserve specific models.

The stack, as a side effect of those two principles

Everything else is downstream of "local-first" plus "moving to my own models". The choices stop feeling like choices.

TypeScript everywhere. Solo developer with no QA team. The compiler catches things I'd otherwise find in production at 2am.

Fastify, not Express. JSON Schema on every route eliminates a class of defensive checks. Structured logging by default.

Cloudflare Tunnels for public-facing services. No open ports. TLS handled at the edge. Ten tunnels run by one cloudflared daemon. Zero cost at my scale.

PM2 for process management. Not Kubernetes, not Docker Swarm. A solo developer running container orchestration is paying complexity tax for nothing.

These choices are noise. They could be different. They're not what makes the system work.

What I deliberately avoid

Frameworks that assume I have a team. Next.js gets a pass for projects where SSR complexity is justified. Most projects get vanilla HTML plus Fastify serving static files, which ships faster.

Managed databases. PostgreSQL on a VPS I control costs nothing beyond the lease. The dependency risk on managed-service pricing changes outweighs the features I'd gain. Same logic for managed vector DBs, managed Redis, managed anything.

Anything that requires a third-party account to function in development. If a dev environment depends on a SaaS being up, the SaaS will be down on the day I need to fix a real problem. Local-first means local-development too.

The one thing I'd change

Structured logging from day one. I spent six months on console.log and then retrofitted pino into eight services. Twenty hours of avoidable refactoring. Pino from the first commit on every new project.

And — the part that wasn't true twelve months ago and is true now — start fine-tuning earlier. The LoRA pipeline isn't expensive to set up. The first fine-tune doesn't need to be impressive. It just needs to exist, so the muscle is there when a project actually needs it.

That's the post. The stack is whatever supports those two decisions. It will be different next year. The two principles won't be.