Legal-tech RAG — multi-tenant retrieval with citations

Built the document ingestion, embeddings pipeline, and citation-anchored RAG chatbot for an AI legal-tech startup. Multi-tenant across orgs, lawyers, and cases. Migrated the vector store from Pinecone to a self-hosted Elasticsearch cluster on GCP.

Context

The client is an early-stage AI legal-tech startup. The product lets lawyers upload case documents — pleadings, contracts, transcripts, evidence exhibits, often hundreds of pages each — and ask plain-English questions, getting answers anchored to citations in the source files. Multi-tenant: organisations, lawyers within them, cases within those. Confidentiality is the whole product, so isolation between tenants is non-negotiable.

What I built

Document ingestion. Pipeline that accepts PDF, DOCX, and plain text. PDFs are the interesting case: legal documents are often image-based scans or multi-column layouts, with footnotes and exhibits stitched in. Documents are parsed, normalised, then chunked with overlap to preserve context across chunk boundaries.
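As an illustration of chunking with overlap, here is a minimal sketch. It assumes fixed-size character windows; the function name, the window sizes, and the choice to keep character offsets are all assumptions for the example, not the production chunker, which may well have split on tokens or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[dict]:
    """Split text into overlapping character windows.

    Each chunk keeps its start/end offsets so a citation can later be
    mapped back to a position in the source document.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append({"start": start, "end": end, "text": text[start:end]})
        if end == len(text):
            break  # last window reached the end of the document
        start += step
    return chunks
```

Consecutive chunks share `overlap` characters, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.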

Embeddings and vector store. Initially Pinecone: fast to start with, no infra to run. After the first few real customers it became clear that managed vector-store costs would grow faster than the customer base, and we wanted more control over indexing strategies. Migrated to a self-hosted Elasticsearch cluster on GCP, with dense-vector fields and metadata filtering for tenant scoping. Cost dropped, throughput went up, and we got better tooling for hybrid search experiments.
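A tenant-scoped vector query of the kind described above might look like the following sketch. It assumes the Elasticsearch 8.x top-level `knn` search option; the field names (`embedding`, `org_id`, `case_id`, `doc_id`, `chunk_id`) and the function itself are hypothetical, chosen for illustration. The function only builds the request body, so it runs without a cluster.

```python
def build_knn_query(query_vector, org_id, case_id, k=10, num_candidates=100):
    """Build an Elasticsearch search body for a tenant-scoped kNN query.

    The tenant filter sits inside the knn clause, so it restricts which
    vectors are considered rather than trimming results afterwards.
    """
    if not org_id or not case_id:
        # Refuse to build an unscoped query at all.
        raise ValueError("tenant scope (org_id and case_id) is required")

    tenant_filter = {
        "bool": {
            "filter": [
                {"term": {"org_id": org_id}},    # hypothetical keyword field
                {"term": {"case_id": case_id}},  # hypothetical keyword field
            ]
        }
    }
    return {
        "knn": {
            "field": "embedding",  # hypothetical dense_vector field
            "query_vector": list(query_vector),
            "k": k,
            "num_candidates": num_candidates,
            "filter": tenant_filter,
        },
        "_source": ["doc_id", "chunk_id", "text", "org_id", "case_id"],
    }
```

Making the tenant arguments mandatory means a caller cannot forget the filter; the resulting body would be passed to the search endpoint of whichever client is in use.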

Retrieval and answers. RAG with citation-anchored output: every answer carries pointers back to specific passages in specific documents, so a lawyer can verify before they act. The prompt gave the model explicit citation slots to fill, and outputs that left them unfilled were rejected, which kept it from making unsupported claims. Tenancy was enforced both in the retrieval filter and in answer post-processing.
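The rejection step can be sketched as a post-processing check on the model output. Everything here is an assumption for illustration: the `[doc_id:chunk_id]` marker format, the shape of the `retrieved` mapping, and the function name. The idea is only that an answer is refused if it has no citations, cites something that was never retrieved, or cites a chunk belonging to another organisation.

```python
import re

# Hypothetical citation marker format: [doc_id:chunk_id], e.g. [contract:3]
CITATION_RE = re.compile(r"\[([A-Za-z0-9_-]+):(\d+)\]")


def verify_citations(answer: str, retrieved: dict, org_id: str) -> list:
    """Return the (doc_id, chunk_id) pairs cited in `answer`, or raise.

    `retrieved` maps (doc_id, chunk_id) to the metadata of each chunk that
    was actually handed to the model for this question.
    """
    cited = [(doc_id, int(chunk)) for doc_id, chunk in CITATION_RE.findall(answer)]
    if not cited:
        raise ValueError("answer has no citations")

    for key in cited:
        if key not in retrieved:
            # The model pointed at a passage it was never given.
            raise ValueError(f"citation {key} was not in the retrieved set")
        if retrieved[key]["org_id"] != org_id:
            # Second tenancy check, independent of the retrieval filter.
            raise ValueError(f"citation {key} belongs to another organisation")
    return cited
```

For example, `verify_citations("The term is five years [contract:3].", {("contract", 3): {"org_id": "org-1"}}, "org-1")` returns `[("contract", 3)]`, while the same answer without the marker raises.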

Case timelines. Derived feature on top of the same corpus — the system reads through case documents and produces a structured timeline of events with citations. Required different prompting, span extraction, and de-duplication across overlapping sources.
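De-duplication across overlapping sources could be approached along these lines. This is a deliberately simple sketch: it treats two extracted events as the same when they share a date and a normalised description, and merges their citations. The event shape, the function names, and exact-match normalisation are assumptions; real documents would likely need fuzzier matching than this.

```python
import re


def normalise(description: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", "", description.lower())
    return " ".join(cleaned.split())


def merge_events(events: list[dict]) -> list[dict]:
    """Merge duplicate events and return them in chronological order.

    Each event is a dict with an ISO-format "date", a "description", and a
    list of "citations" given as (doc_id, chunk_id) pairs.
    """
    merged = {}
    for event in events:
        key = (event["date"], normalise(event["description"]))
        if key not in merged:
            merged[key] = {
                "date": event["date"],
                "description": event["description"],  # keep first wording seen
                "citations": [],
            }
        for citation in event["citations"]:
            if citation not in merged[key]["citations"]:
                merged[key]["citations"].append(citation)
    # ISO dates sort correctly as plain strings.
    return sorted(merged.values(), key=lambda e: e["date"])
```

An event reported in both a contract and a transcript then appears once in the timeline, carrying both citations.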

What was hard

Citation fidelity. Easy to get the LLM to cite; harder to get it to cite the right passage, every time. Solved with a combination of retrieval-pinning, citation-slot prompting, and a verification pass.

The migration. Moving live customer data from Pinecone to Elasticsearch without downtime, while keeping the query API stable, required a dual-write/dual-read phase and careful index-version handling.

Multi-tenancy in retrieval. Vector search will happily ignore your isolation if you let it. Tenant filters had to be enforced both at index time (tenant ID in the document metadata) and at query time (filter clause), and we wrote tests asserting that no query could return documents belonging to another tenant.
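The dual-write/dual-read phase mentioned under the migration can be sketched as a thin wrapper around two stores. This is an assumed shape, not the actual migration code: `InMemoryStore` is a stand-in for the Pinecone and Elasticsearch clients, and `DualStore`, `read_from`, and `mismatches` are hypothetical names. Writes go to both stores; reads are served from one while the other is queried in shadow and any disagreement is recorded.

```python
class InMemoryStore:
    """Toy vector store standing in for a real backend."""

    def __init__(self):
        self.rows = {}

    def upsert(self, chunk_id, vector, metadata):
        self.rows[chunk_id] = (vector, metadata)

    def query(self, vector, tenant_id, k):
        # Dot-product ranking, restricted to the caller's tenant.
        scored = [
            (sum(a * b for a, b in zip(vector, vec)), chunk_id)
            for chunk_id, (vec, meta) in self.rows.items()
            if meta["tenant_id"] == tenant_id
        ]
        scored.sort(reverse=True)
        return [chunk_id for _, chunk_id in scored[:k]]


class DualStore:
    """Write to both stores; read from one and shadow-compare the other."""

    def __init__(self, old, new, read_from="old"):
        if read_from not in ("old", "new"):
            raise ValueError("read_from must be 'old' or 'new'")
        self.old, self.new, self.read_from = old, new, read_from
        self.mismatches = []

    def upsert(self, chunk_id, vector, metadata):
        self.old.upsert(chunk_id, vector, metadata)
        self.new.upsert(chunk_id, vector, metadata)

    def query(self, vector, tenant_id, k=5):
        if self.read_from == "old":
            primary, shadow = self.old, self.new
        else:
            primary, shadow = self.new, self.old
        result = primary.query(vector, tenant_id, k)
        shadow_result = shadow.query(vector, tenant_id, k)
        if shadow_result != result:
            # Recorded for review; callers only ever see the primary result.
            self.mismatches.append((tenant_id, result, shadow_result))
        return result
```

Once the mismatch log stays empty for long enough, `read_from` is flipped to `"new"`, and the old store can be retired later without the query API changing.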

Stack

Python, OpenAI / Anthropic LLM APIs, embeddings (text-embedding-3-large class), Pinecone (early phase), Elasticsearch on GCP (current), PostgreSQL for app state, FastAPI for the application layer.