Case study
Tenderz Scraper
Modular tender-scraping and AI-extraction suite: downloads, validates, chunks, and converts tender documents into CSV/JSON reports.
Overview
Tenderz is a two-part system: (1) Selenium-based scrapers that collect tenders and download documents from multiple procurement platforms, and (2) an AI processing suite that ingests those documents (PDF/DOCX/DOC/XLSX/CSV), chunks large files, and extracts structured tender information.
Problem
Tender data is fragmented across many portals and often buried behind dynamic UIs. The documents are inconsistent in format and structure, and manual processing doesn’t scale.
Solution
I built a modular pipeline: per-platform scrapers plus a standardized processing and AI-extraction workflow that produces structured outputs (CSV/JSON) and formatted reports for decision-making.
Architecture
- Scraper modules (per site) → navigate dynamic pages, download documents, export baseline CSV.
- File processing pipeline → discover docs → extract text (PDF/DOCX/DOC/XLSX/CSV) → clean → chunk → validate.
- AI analysis → GPT-based extraction on chunks → consolidate results → export JSON + formatted TXT report.
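The middle stages of the processing pipeline (discover → extract → clean → chunk) can be sketched as small composable functions. The bodies below are simplified stand-ins, not the project's actual code; real extractors (e.g. pdfplumber, python-docx) would plug into the extract step.

```python
from pathlib import Path

def discover(root: Path) -> list[Path]:
    """Find candidate tender documents under a download directory."""
    exts = {".pdf", ".docx", ".doc", ".xlsx", ".csv"}
    return sorted(p for p in root.rglob("*") if p.suffix.lower() in exts)

def clean(text: str) -> str:
    """Normalize whitespace so chunk boundaries are stable."""
    return " ".join(text.split())

def chunk(text: str, max_chars: int = 4000) -> list[str]:
    """Split cleaned text into bounded chunks for the AI step."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Keeping each stage a pure function makes the pipeline easy to test and to re-run from any intermediate step.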
Tech stack
- Selenium for scraping dynamic portal UIs
- GPT-based extraction for structured tender fields
- Multi-format document parsing (PDF/DOCX/DOC/XLSX/CSV)
- CSV/JSON exports plus formatted TXT reports
Key engineering decisions
- Portal-specific scraper modules to isolate breakage when sites change.
- Chunk-based processing to handle large tenders and control token/cost boundaries.
- Validation + dedupe to reduce noisy data and prevent downstream errors.
- Multi-format extraction support so procurement workflows don’t depend on a single file type.
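The validation-and-dedupe decision can be illustrated as a content-hash filter over normalized record fields. The field names here ("title", "buyer", "deadline") are hypothetical, not the project's actual schema:

```python
import hashlib

def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record seen for each normalized content hash."""
    seen: set[str] = set()
    out = []
    for rec in records:
        # Normalize case and whitespace so near-duplicates collapse.
        blob = "|".join(str(rec.get(f, "")).strip().lower()
                        for f in ("title", "buyer", "deadline"))
        key = hashlib.sha256(blob.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out
```

Hashing a normalized projection of the record, rather than the raw row, is what lets the filter catch duplicates that differ only in casing or stray whitespace.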
Results
- 40+ procurement platforms supported (extensible module structure).
- Supports PDF/DOCX/DOC/XLSX/CSV document ingestion with standardized outputs.
What I’d improve next
- Move brittle selectors toward self-healing strategies (visual + semantic locators).
- Add queue-based orchestration (e.g., Celery/Redis) for parallelism and backpressure.
- Introduce extraction evaluation sets and automated diffing for regression control.
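In production the orchestration item would likely use Celery with a Redis broker; the backpressure idea itself can be sketched with a bounded stdlib queue, where producers block once workers fall behind. All names here are illustrative:

```python
import queue
import threading

def run_with_backpressure(jobs, worker, n_workers=4, max_pending=100):
    """Process jobs with a worker pool behind a bounded queue.

    The queue's maxsize is the backpressure valve: put() blocks
    when workers can't keep up, throttling the producer.
    """
    q = queue.Queue(maxsize=max_pending)
    results, lock = [], threading.Lock()

    def loop():
        while True:
            job = q.get()
            if job is None:          # sentinel: shut this worker down
                q.task_done()
                break
            try:
                r = worker(job)
                with lock:
                    results.append(r)
            finally:
                q.task_done()

    threads = [threading.Thread(target=loop) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for j in jobs:
        q.put(j)                     # blocks when the queue is full
    for _ in threads:
        q.put(None)
    q.join()
    for t in threads:
        t.join()
    return results
```

Celery/Redis adds persistence, retries, and cross-machine distribution on top of this same bounded-queue pattern.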