PaperCortex: Adding a Brain to Your Document Archive

Paperless-ngx is great at storing documents. It's terrible at understanding them. PaperCortex fixes that.

I have a Paperless-ngx instance with thousands of documents. Invoices, contracts, receipts, technical specs, tax records. Paperless runs OCR on them, stores them, and lets me tag them. It's excellent at what it does.

But it doesn't understand anything. Search is keyword-based. If I search for "hotel expenses" and the receipt says "Marriott Bonn — Accommodation", Paperless won't find it. There's no semantic understanding. No automatic classification. No financial data extraction.

I built PaperCortex because I was spending hours every month manually tagging documents and extracting numbers from receipts for expense reports.

// what it does

// papercortex capabilities
semantic search            find by meaning, not keywords
auto-classification        type, category, correspondent, dates
receipt extraction         vendor, amounts, tax, line items
bank statement matching    fuzzy match receipts to transactions
DATEV export               German tax standard format
natural language queries   "How much on travel in Q1?"
MCP server                 5 tools for Claude Code integration

// semantic search changes everything

Search for "accommodation costs Germany" and PaperCortex will find your Marriott receipt from Bonn, the Airbnb invoice from Munich, and the hotel booking confirmation from Berlin. Even if none of them contain the word "accommodation". Because it understands meaning, not just strings.

This runs on local embeddings via Ollama (nomic-embed-text). Vectors stored in SQLite with HNSW indexing. No cloud. No API costs. Your documents never leave your machine.
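To make "meaning, not strings" concrete, here is a minimal sketch of the ranking step. It is not PaperCortex's actual code: the function names, document IDs, and the toy 3-dimensional vectors are all made up for illustration. Real vectors would come from the embedding model, and this sketch scans every document by brute force rather than using an HNSW index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy vectors standing in for real embeddings.
DOCS = {
    "marriott_bonn_receipt": [0.9, 0.1, 0.0],
    "airbnb_munich_invoice": [0.8, 0.2, 0.1],
    "office_chair_invoice": [0.0, 0.1, 0.9],
}
QUERY = [1.0, 0.0, 0.0]  # pretend embedding of "accommodation costs Germany"

for doc_id, score in rank_documents(QUERY, DOCS, top_k=2):
    print(f"{doc_id}: {score:.3f}")
```

The two lodging documents rank above the office chair because their vectors point in nearly the same direction as the query, regardless of which words they contain.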

// receipt intelligence

This is the feature that saves me the most time. Drop a receipt — scanned, photographed, PDF, doesn't matter — and PaperCortex extracts: vendor name, date, total amount, tax rate, individual line items. Multi-page receipts. Multi-currency. It handles the German "Bewirtungsbeleg" format that makes accountants cry.
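As a rough picture of what an extracted receipt might look like once it is structured, here is a sketch of a record with two sanity checks. The class and field names are my own assumptions, not PaperCortex's schema, and it simplifies to a single tax rate per receipt with gross line-item amounts.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    amount: Decimal  # gross amount (assumption: tax included)


@dataclass
class Receipt:
    vendor: str
    issued_on: date
    total: Decimal     # gross total
    tax_rate: Decimal  # e.g. Decimal("0.19") for 19 % VAT
    currency: str = "EUR"
    line_items: list[LineItem] = field(default_factory=list)

    def tax_amount(self) -> Decimal:
        """Tax contained in the gross total, rounded to cents."""
        net = self.total / (Decimal(1) + self.tax_rate)
        return (self.total - net).quantize(Decimal("0.01"))

    def is_consistent(self) -> bool:
        """True if the line items (when present) add up to the total."""
        if not self.line_items:
            return True
        return sum((i.amount for i in self.line_items), Decimal(0)) == self.total


receipt = Receipt(
    vendor="Marriott Bonn",
    issued_on=date(2024, 3, 10),
    total=Decimal("119.00"),
    tax_rate=Decimal("0.19"),
    line_items=[
        LineItem("Room", Decimal("109.00")),
        LineItem("Breakfast", Decimal("10.00")),
    ],
)
print(receipt.tax_amount(), receipt.is_consistent())
```

Using `Decimal` instead of floats keeps the cent arithmetic exact, which matters once these numbers end up in an expense report.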

The extracted data feeds into bank statement matching with fuzzy logic and confidence scoring. It finds unmatched transactions automatically. If you're German and your tax advisor wants DATEV format — one click, SKR03/SKR04 mapping included.
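One way to think about fuzzy matching with a confidence score: require the amount to agree, then blend a string-similarity score for the vendor name with a penalty for the gap between receipt date and booking date. The sketch below does that with the standard library. The weights (0.6 and 0.4), the 5-day window, the 0.5 threshold, and all names are assumptions chosen for illustration, not PaperCortex's actual scoring.

```python
from dataclasses import dataclass
from datetime import date
from difflib import SequenceMatcher


@dataclass
class ReceiptRecord:
    vendor: str
    issued_on: date
    total_cents: int


@dataclass
class Transaction:
    description: str
    booked_on: date
    amount_cents: int  # negative for money leaving the account


def match_confidence(receipt, txn, max_days=5):
    """Score in [0, 1]; 0.0 whenever the amounts disagree."""
    if receipt.total_cents != abs(txn.amount_cents):
        return 0.0
    name_score = SequenceMatcher(
        None, receipt.vendor.lower(), txn.description.lower()
    ).ratio()
    day_gap = abs((txn.booked_on - receipt.issued_on).days)
    date_score = max(0.0, 1.0 - day_gap / max_days)
    return round(0.6 * name_score + 0.4 * date_score, 3)


def best_match(receipt, txns, threshold=0.5):
    """Return (transaction, score), or (None, score) below the threshold."""
    scored = [(match_confidence(receipt, t), t) for t in txns]
    score, txn = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return (txn, score) if score >= threshold else (None, score)


receipt = ReceiptRecord("Marriott Bonn", date(2024, 3, 10), 18900)
statement = [
    Transaction("MARRIOTT HOTEL BONN DE", date(2024, 3, 12), -18900),
    Transaction("REWE MARKT KOELN", date(2024, 3, 10), -2345),
]
txn, score = best_match(receipt, statement)
print(txn.description if txn else "unmatched", score)
```

Any transaction for which no receipt clears the threshold would end up on the unmatched list.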

// mcp server for claude code

PaperCortex exposes five tools via the Model Context Protocol:

  • search — semantic document search
  • classify — auto-classify a document
  • extract — pull structured data from a document
  • query — natural language questions about your archive
  • export — generate DATEV or CSV exports

This means I can ask Claude: "How much did we spend on office supplies in March?" and it queries my actual document archive to give me a real answer with source references.
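Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request for the `query` tool. The envelope follows the Model Context Protocol, but the argument name `question` is a guess for illustration; the real tool's input schema may differ.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need unique ids


def tool_call_request(tool, arguments):
    """Build an MCP tools/call request as a plain dict."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


request = tool_call_request(
    "query",
    # "question" is a hypothetical argument name.
    {"question": "How much did we spend on office supplies in March?"},
)
print(json.dumps(request, indent=2))
```

In practice you never write this by hand. Claude Code assembles the request from the tool's declared schema, and the server's reply comes back as the tool result.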

100% local. No cloud. No subscriptions. The code is on GitHub, MIT licensed.