190 years of Spanish law as a Git repository.
Ley Abierta reconstructs every Spanish law since 1835 as a Markdown file with a Git commit dated to its official BOE publication: 12,272 laws, 43,883 commits, 18 jurisdictions. On top of the repository sits a hybrid retrieval engine: a hand-rolled int8 SIMD kernel in C for vector search, plus a Qwen stack hosted on Nan (EU servers, no-log policy) for embeddings, query analysis, reranking and synthesis. AGPL-3.0. Live at leyabierta.es.
Spanish legislation is nominally public: the BOE exposes every law via an XML API. But the raw feed is a stream of documents, not a navigable history. No diff between versions, no graph of which reform amended which article, no way to ask "what changed in labor law in 2018?" and get a useful answer. The official search returns keyword matches in legal prose, not structured documents. Ley Abierta treats the law as source code. Each reform is a commit, each jurisdiction a folder, each version a checkable diff. The result is 190 years of legislation that's queryable and diffable, with full version history.
BOE open data API (XML/JSON · no auth required)
│
▼
┌─ pipeline [ TypeScript · Bun ] ──────────────────────────────────┐
│ fetch XML → parse → Markdown + YAML frontmatter │
│ git commit --date=BOE_PUBLICATION_DATE │
│ pre-1970 laws: date in frontmatter, git date = 1970-01-02 │
│ ELI folder structure (es + 17 CCAA = 18 jurisdictions) │
└───────────────────────────────────────────────────────────────────┘
│                                   │
▼                                   ▼
leyes/ git repo                     JSON cache (scratch)
(public human-readable artifact)
│
▼
┌─ api [ Elysia · SQLite · FTS5 ] ────────────────────────────────┐
│ │
│ BM25 full-text search (FTS5) │
│ + │
│ semantic search — 486k int8 vectors │
│ ┌────────────────────────────────────┐ │
│ │ C SIMD int8 cosine kernel │ │
│ │ -81% index size (7.6 GB → 1.4 GB) │ │
│ │ AVX2/FMA + NEON paths │ │
│ │ SharedArrayBuffer worker pool │ │
│ └────────────────────────────────────┘ │
│ + │
│ RRF fusion · P50 vector stage: 2.1s → 0.8s │
│ │
│ Stack: Qwen on Nan (embed + analyzer + rerank + synthesis) │
└───────────────────────────────────────────────────────────────────┘
│
▼
┌─ web [ Astro SSG · Cloudflare Pages ] ──────────────────────────┐
│ law content from leyes/ checkout at build time │
│ derived data fetched from API at build time │
│ live at leyabierta.es │
└───────────────────────────────────────────────────────────────────┘
Infrastructure: self-hosted Docker · Watchtower · GitHub Actions cron
(daily BOE ingestion) · Resend (email alerts)
Laws are versioned documents, and Git is the right tool for them: git log -- Codigo_Penal.md is a meaningful civic interface, and git diff between two commits is a legislative audit trail. Git stores commit times as seconds since the Unix epoch and its tooling does not reliably accept dates before 1970, so pre-1970 laws are committed with a clamped date of 1970-01-02 and keep their real publication date in the YAML frontmatter. The repo is the product.
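The clamping rule can be sketched as a small pure function. This is a minimal illustration, not the project's pipeline code: the names `commitDatesFor`, `CommitDates` and `CLAMP_DATE`, and the choice of midnight UTC as the commit time, are assumptions made for the example.

```typescript
// Hypothetical helper: names and the midnight-UTC convention are assumptions.
const CLAMP_DATE = "1970-01-02T00:00:00Z";

interface CommitDates {
  gitDate: string;         // value to pass to `git commit --date=...`
  publicationDate: string; // real BOE date, kept in the YAML frontmatter
}

function commitDatesFor(publicationDate: string): CommitDates {
  // Expect a plain calendar date such as "1889-07-25".
  if (!/^\d{4}-\d{2}-\d{2}$/.test(publicationDate)) {
    throw new Error(`expected YYYY-MM-DD, got "${publicationDate}"`);
  }
  const published = Date.parse(`${publicationDate}T00:00:00Z`);
  if (Number.isNaN(published)) {
    throw new Error(`not a valid date: "${publicationDate}"`);
  }
  // Anything earlier than the clamp date is committed at the clamp date;
  // the real date survives only in the frontmatter.
  const gitDate =
    published < Date.parse(CLAMP_DATE)
      ? CLAMP_DATE
      : `${publicationDate}T00:00:00Z`;
  return { gitDate, publicationDate };
}
```

With this sketch, a law published on 1889-07-25 gets the clamped Git date while its frontmatter still carries 1889-07-25, and a 2018 reform keeps its own date in both places.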
Off-the-shelf vector libraries were too slow at 486k vectors on a single VM. Quantizing the embeddings to int8 and scanning them with a hand-rolled cosine kernel written with SIMD intrinsics (AVX2+FMA on x86_64, NEON on arm64, two-way unrolled accumulator) cut the index from 7.6 GB to 1.4 GB (-81%) and brought the vector-search stage p50 from 2.1 s to 0.8 s. A SharedArrayBuffer worker pool keeps a single copy of the index in memory, shared across 4 Bun Workers. No GPU and no managed vector DB.
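As a scalar reference for what such a kernel computes, here is a minimal TypeScript sketch of int8 cosine similarity. It is not the project's C kernel and uses no SIMD. The quantization scheme (symmetric, per-vector, scaled to ±127) is an assumption, since the write-up does not describe one, and the function names are hypothetical.

```typescript
// Assumed scheme: symmetric per-vector quantization to the range [-127, 127].
function quantizeInt8(v: Float32Array): Int8Array {
  let maxAbs = 0;
  for (let i = 0; i < v.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(v[i]));
  }
  const scale = maxAbs === 0 ? 0 : 127 / maxAbs;
  const q = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) {
    q[i] = Math.round(v[i] * scale);
  }
  return q;
}

// Scalar int8 cosine: integer dot product and squared norms, one final divide.
// A SIMD kernel would compute the same three sums several lanes at a time.
function cosineInt8(a: Int8Array, b: Int8Array): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}
```

Because cosine similarity is invariant to a positive rescaling of either vector, this scheme does not need to store the per-vector scale alongside the int8 values. The only error comes from rounding each component.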
BM25 looked like the obvious choice for legal text. The early hybrid (BM25 + RRF over Gemini dense embeddings) actually regressed on real citizen queries: FTS5 OR-expansion of "horas extras que no me pagan" ("overtime they don't pay me") matched noise across centuries of legal prose. The fix wasn't tuning BM25 weights. It was a modern-bias prompt on the Qwen embeddings that steers retrieval toward recent statutes over 19th-century codes. A reproducible eval harness, run against citizen and omnibus question sets, caught the regression before it reached prod.
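For readers unfamiliar with RRF (reciprocal rank fusion), the standard formula gives each document a score of Σ 1/(k + rank) over the rankings it appears in. The sketch below shows that formula only. The function name, the document IDs and the constant k = 60 (a commonly used default) are assumptions, not values taken from the project.

```typescript
interface FusedHit {
  id: string;
  score: number;
}

// Reciprocal rank fusion over any number of ranked ID lists (best hit first).
// k = 60 is a conventional default, assumed here.
function rrfFuse(rankings: string[][], k = 60): FusedHit[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based in the RRF formula
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (x, y) => y.score - x.score
  );
}

// Hypothetical IDs: one list from BM25, one from vector search.
const bm25Hits = ["a", "b", "c"];
const vectorHits = ["b", "d", "a"];
const fused = rrfFuse([bm25Hits, vectorHits]);
```

RRF only sees ranks, never raw scores. That is why a noisy BM25 list can drag down a good dense ranking: a spurious keyword match at rank 1 contributes as much as a genuinely relevant hit at rank 1.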
The original stack used Gemini embeddings, Gemini Flash Lite as the query analyzer, and Cohere Rerank 4 Pro, all routed through OpenRouter at ~$2 per 1,000 queries, with every request leaving the server. After A/B evaluation, each component was replaced with a Qwen model on Nan, an EU-based inference provider with a no-log policy: embeddings, query analyzer, LLM reranker, and synthesis. Queries stay in the EU, and nothing is logged on the inference side.
Every /v1/ask call creates an Opik trace with spans for embed_query, bm25, vector_knn, aggregate_pool, rrf_fusion, rerank, and synthesis. Tracing is fail-safe: a span failure never breaks the pipeline. Opik runs self-hosted on the same VM (full backend + frontend + Python OSS stack), so per-stage latency is debuggable in prod without exfiltrating queries to a SaaS. The catch that justified the work: the BM25 OR-expansion bug was pinpointed at the bm25 span before it tanked recall.
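One way to get this fail-safe property is to wrap each stage so that tracer errors are swallowed while stage errors still propagate. The sketch below uses a hypothetical `Tracer` interface invented for the example. It is not the Opik SDK, and `withSpan` is an assumed name.

```typescript
// Hypothetical tracing interface, NOT the Opik SDK.
interface Span {
  end(meta?: Record<string, unknown>): void;
}
interface Tracer {
  startSpan(name: string): Span;
}

// Runs one pipeline stage inside a span. Tracer failures are swallowed;
// failures of the stage itself still reach the caller.
async function withSpan<T>(
  tracer: Tracer,
  name: string,
  stage: () => Promise<T>
): Promise<T> {
  let span: Span | undefined;
  try {
    span = tracer.startSpan(name);
  } catch {
    span = undefined; // tracing unavailable: run the stage untraced
  }
  const started = Date.now();
  try {
    return await stage();
  } finally {
    try {
      span?.end({ durationMs: Date.now() - started });
    } catch {
      // observability must never fail the request
    }
  }
}
```

A call such as `withSpan(tracer, "bm25", () => runBm25(query))`, where `runBm25` is a placeholder for the real stage, returns the stage's result even if the tracing backend is down.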
Public legal infrastructure shouldn't be capturable by a SaaS. The AGPL requires anyone who serves a modified Ley Abierta over a network to offer the source of those modifications to its users. The license matches the project's politics: if public law is a commons, so is the engine that makes it searchable.
Ley Abierta turns 190 years of Spanish legislation, technically public and practically unusable, into a queryable Git repository with a search layer that holds up on real citizen language. A structured A/B program replaced every Gemini and Cohere component (embeddings, query analyzer, reranker, and synthesis) with Qwen on Nan, an EU-based inference provider with a no-log policy, with each swap measured in isolation against citizen and omnibus question sets. An early hybrid regression, in which BM25 OR-expansion on citizen Spanish matched noise across centuries of legal prose, was caught by the eval harness before prod. Index size dropped from 7.6 GB to 1.4 GB via the custom int8 SIMD C kernel. Queries stay in the EU. The whole stack is AGPL-3.0 and live at leyabierta.es.