Case study

RAG System for Document QA

Local-first document QA with embeddings + vector retrieval and grounded LLM answering.

Oct 2025 – Dec 2025
RAG · NLP · Search

Overview

A retrieval-augmented generation (RAG) system that answers questions grounded in a document set. It indexes documents into embeddings, retrieves the most relevant chunks, and generates answers constrained to retrieved context.

Problem

Teams lose time searching long documents. Keyword search misses paraphrases, and pure LLM answers hallucinate without grounding.

Solution

I implemented an end-to-end RAG pipeline: document parsing → chunking → embeddings → cosine-similarity retrieval → answer generation with guardrails.

Architecture

  • Ingest docs → clean + chunk into consistent segments
  • Compute embeddings → store in a vector index
  • Query → embed → retrieve top-k chunks (cosine similarity)
  • Answer generation → LLM uses retrieved context only; returns answer + sources
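The ingest step above hinges on splitting documents into consistent segments. A minimal sketch of a fixed-size character chunker with overlap (the function name and parameters are illustrative, not the project's actual code):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlap keeps sentences that straddle a boundary retrievable
    from at least one chunk.
    """
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

Overlap trades a little index size for recall: a fact split across a chunk boundary still appears intact in one of the two neighboring chunks.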

Tech stack

  • Embedding model: all-mpnet-base-v2 (768-dim, normalized embeddings)
  • Retriever: NumPy cosine similarity (TOP_K=5)
  • LLM: Gemma GGUF via llama-cpp-python (ctx=4096, max_tokens=256)
  • PDF extraction: PyMuPDF (fitz)
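Because all-mpnet-base-v2 embeddings are L2-normalized, cosine similarity reduces to a dot product, so top-k retrieval is a single NumPy matrix-vector product. A sketch under that assumption (function name is illustrative; the real system embeds text with the model rather than using raw vectors):

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return row indices of the k most similar vectors in `index`.

    Assumes both `query_vec` (shape (d,)) and the rows of `index`
    (shape (n, d)) are already L2-normalized, so the dot product
    equals cosine similarity.
    """
    sims = index @ query_vec          # (n,) cosine similarities
    return np.argsort(-sims)[:k]      # indices sorted by descending similarity
```

This brute-force scan is O(n·d) per query, which is fine for a local-first index of a few thousand chunks; an approximate index (e.g. FAISS) only pays off at much larger scale.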

Key engineering decisions

  • Local-first setup to control privacy and per-query cost.
  • Top-k retrieval + strict prompting to reduce hallucinations.
  • Basic relevance scoring to track and improve retrieval quality.
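The "strict prompting" guardrail above amounts to constraining the LLM to the retrieved chunks and giving it an explicit way out. A minimal sketch of such a prompt builder (the template wording and function name are my illustration, not the project's exact prompt):

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Build a grounded-answering prompt from retrieved chunks.

    Numbering the chunks lets the model cite sources, and the
    explicit refusal instruction reduces hallucinated answers.
    """
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        'If the answer is not in the context, reply "I don\'t know."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The numbered `[1]`, `[2]` markers also make it cheap to map the model's citations back to source documents when returning answer + sources.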

Results

  • 88% answer relevance score.


What I’d improve next

  • Add reranking (cross-encoder) to improve precision on near-duplicate chunks.
  • Introduce hybrid search (BM25 + vector) for better recall on rare terms.
  • Add caching + incremental indexing for large document sets.
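For the hybrid-search idea above, one common way to combine BM25 and vector rankings without tuning score scales is reciprocal rank fusion (RRF); a sketch, assuming each retriever returns an ordered list of chunk IDs:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked result lists via reciprocal rank fusion.

    Each document scores 1 / (k + rank) per list it appears in;
    k (conventionally 60) damps the advantage of a single #1 hit.
    Works on ranks only, so BM25 and cosine scores never need to
    be put on a common scale.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked moderately well by both retrievers beats one ranked highly by only one, which is exactly the recall benefit hybrid search is after.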