Case study
RAG System for Document QA
Local-first document QA with embeddings + vector retrieval and grounded LLM answering.
Tags: RAG · NLP · Search
Overview
A retrieval-augmented generation (RAG) system that answers questions grounded in a document set. It indexes documents into embeddings, retrieves the most relevant chunks, and generates answers constrained to retrieved context.
Problem
Teams lose time searching long documents. Keyword search misses paraphrases, and pure LLM answers hallucinate without grounding.
Solution
I implemented an end-to-end RAG pipeline: document parsing → chunking → embeddings → cosine-similarity retrieval → answer generation with guardrails.
Architecture
- Ingest docs → clean + chunk into consistent segments
- Compute embeddings → store in a vector index
- Query → embed → retrieve top-k chunks (cosine similarity)
- Answer generation → LLM uses retrieved context only; returns answer + sources
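The chunking step above can be sketched as a simple fixed-size splitter with overlap, so sentences cut at a chunk boundary still appear intact in a neighboring chunk. The function name, chunk size, and overlap here are illustrative assumptions, not the exact values used in the project:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks with overlap.

    Overlap preserves context across boundaries so a sentence cut at
    the end of one chunk also appears at the start of the next.
    """
    step = size - overlap
    chunks = []
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append(text[start:start + size])
    return chunks
```

In practice a production splitter would respect sentence or paragraph boundaries; character-based splitting is the minimal baseline.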
Tech stack
- Embedding model: all-mpnet-base-v2 (768-dim, normalized embeddings)
- Retriever: NumPy cosine similarity (TOP_K=5)
- LLM: Gemma GGUF via llama-cpp-python (ctx=4096, max_tokens=256)
- PDF extraction: PyMuPDF (fitz)
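Because the embeddings are L2-normalized, cosine similarity reduces to a plain dot product, which is why a NumPy retriever is enough. A minimal sketch of that retriever (the function name is illustrative; the index is assumed to be a matrix of normalized chunk embeddings):

```python
import numpy as np

TOP_K = 5

def retrieve(query_vec: np.ndarray, index: np.ndarray, k: int = TOP_K) -> np.ndarray:
    """Return indices of the k chunks most similar to the query.

    `index` has shape (n_chunks, dim) with L2-normalized rows, and
    `query_vec` is a normalized (dim,) vector, so the matrix-vector
    product gives cosine similarities directly.
    """
    scores = index @ query_vec            # (n_chunks,) cosine scores
    return np.argsort(scores)[::-1][:k]   # indices, best first
```

For larger corpora this brute-force scan would be swapped for an approximate-nearest-neighbor index, but at document-QA scale the exact dot product is fast and simple.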
Key engineering decisions
- Local-first setup to control privacy and per-query cost.
- Top-k retrieval + strict prompting to reduce hallucinations.
- Basic relevance scoring to track and improve retrieval quality.
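The "strict prompting" guardrail can be sketched as a prompt builder that instructs the model to answer only from the retrieved chunks and to refuse otherwise. The exact wording and function name here are illustrative assumptions, not the project's literal prompt:

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Assemble a grounded prompt: the model may use only the retrieved
    context, must refuse when the context is insufficient, and is asked
    to cite numbered sources so answers can be traced back to chunks."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using ONLY the context below. "
        'If the context does not contain the answer, say "I don\'t know." '
        "Cite sources by number, e.g. [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

The numbered sources are what allow the system to return "answer + sources" as described in the architecture.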
Results
- 88% answer relevance score.
What I’d improve next
- Add reranking (cross-encoder) to improve precision on near-duplicate chunks.
- Introduce hybrid search (BM25 + vector) for better recall on rare terms.
- Add caching + incremental indexing for large document sets.
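One common way to combine BM25 and vector results, hinted at by the hybrid-search item above, is reciprocal-rank fusion: each retriever contributes a score based only on rank, so the two scales never need to be calibrated against each other. A minimal sketch (function name and the conventional k=60 constant are assumptions, not project code):

```python
def rrf_fuse(bm25_rank: list[int], vec_rank: list[int], k: int = 60) -> list[int]:
    """Fuse two ranked lists of doc ids with reciprocal-rank fusion.

    Each list contributes 1 / (k + rank) per document; documents that
    appear high in either list, or moderately high in both, win.
    """
    scores: dict[int, float] = {}
    for ranking in (bm25_rank, vec_rank):
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + pos + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

Rank-based fusion sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.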