Case study

Tenderz Scraper

Modular tender scraping + AI extraction suite: download, validate, chunk, and convert tender docs into CSV/JSON reports.

Jun 2025 – Oct 2025
Scraping · Automation · LLM

Overview

Tenderz is a two-part system: (1) Selenium-based scrapers that collect tenders and download documents across multiple procurement platforms, and (2) an AI algorithm suite that processes those documents (PDF/DOCX/DOC/XLSX/CSV), chunks large files, and extracts structured tender information.

Problem

Tender data is fragmented across many portals and often buried behind dynamic UIs. The documents are inconsistent in format and structure, and manual processing doesn’t scale.

Solution

I built a modular pipeline: scrapers per platform + a standardized processing and AI extraction workflow that produces structured outputs (CSV/JSON) and formatted reports for decision-making.
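To make "structured outputs" concrete, here is a hedged sketch of what one JSON tender summary could look like. The field names and values are hypothetical illustrations, not the project's actual schema:

```python
import json

# Hypothetical tender summary; field names and values are illustrative
# assumptions, not the project's actual export schema.
tender_summary = {
    "tender_id": "T-2025-0142",      # assumed identifier format
    "title": "Road maintenance framework agreement",
    "buyer": "Example Municipality",
    "deadline": "2025-11-30",
    "source_documents": ["notice.pdf", "requirements.docx"],
}

print(json.dumps(tender_summary, indent=2))
```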

Architecture

  • Scraper modules (per site) → navigate dynamic pages, download documents, export baseline CSV.
  • File processing pipeline → discover docs → extract text (PDF/DOCX/DOC/XLSX/CSV) → clean → chunk → validate.
  • AI analysis → GPT-based extraction on chunks → consolidate results → export JSON + formatted TXT report.
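The three stages above can be sketched as a minimal pipeline. Every function here is a placeholder standing in for the real Selenium-, PyMuPDF-, and OpenAI-backed modules; names and return shapes are assumptions for illustration:

```python
# Minimal pipeline sketch; each stage is a stub for the real module.

def discover_documents(folder: str) -> list[str]:
    # Placeholder: the real pipeline walks the scraper's download directory.
    return [f"{folder}/tender_notice.pdf"]

def extract_text(path: str) -> str:
    # Placeholder: the real extractor dispatches on file extension
    # (PyMuPDF for PDF, python-docx for DOCX, pandas for XLSX/CSV).
    return f"text extracted from {path}"

def chunk(text: str, max_chars: int = 80_000) -> list[str]:
    # Split into bounded chunks so each fits a single LLM request.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def analyze(chunks: list[str]) -> dict:
    # Placeholder for the GPT-based extraction + consolidation step.
    return {"chunks_processed": len(chunks)}

def run_pipeline(folder: str) -> list[dict]:
    results = []
    for path in discover_documents(folder):
        text = extract_text(path)
        results.append(analyze(chunk(text)))
    return results

print(run_pipeline("downloads"))
```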

Tech stack

  • Scraping: Python + Selenium (headless capable) + webdriver-manager
  • Text extraction: PyMuPDF (PDF), python-docx + COM automation for DOC→DOCX (pywin32), pandas/openpyxl (XLSX/CSV)
  • AI extraction: OpenAI API with configurable max chunk size (e.g., 80,000 chars) and consolidation logic
  • Outputs: standardized CSV exports, JSON tender summaries, formatted TXT reports
  • Deployment: Azure VE
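The configurable chunk cap mentioned above (e.g., 80,000 chars) could be applied along paragraph boundaries rather than by blind slicing. The following is a sketch under that assumption; the project's actual splitting heuristic is not shown in this case study:

```python
# Sketch: chunk text under a size cap, preferring paragraph boundaries.
# A single over-long paragraph is kept whole here; production code
# would need a hard-slicing fallback for that case.

def chunk_text(text: str, max_chars: int = 80_000) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in text.split("\n\n"):
        # +2 accounts for the "\n\n" separator we re-join with.
        if current and current_len + len(para) + 2 > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para) + 2
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```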

Key engineering decisions

  • Portal-specific scraper modules to isolate breakage when sites change.
  • Chunk-based processing to handle large tenders and control token/cost boundaries.
  • Validation + dedupe to reduce noisy data and prevent downstream errors.
  • Multi-format extraction support so procurement workflows don’t depend on a single file type.
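The dedupe decision above can be sketched with content hashing: re-downloaded or cross-posted documents with identical bytes are skipped before they reach extraction. This is a minimal illustration, not the project's exact implementation:

```python
import hashlib

# Sketch of the dedupe step: skip any document whose content hash has
# already been seen, so re-downloads don't produce duplicate rows.

def dedupe(documents: list[bytes]) -> list[bytes]:
    seen: set[str] = set()
    unique: list[bytes] = []
    for doc in documents:
        digest = hashlib.sha256(doc).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```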

Results

  • 40+ procurement platforms supported (extensible module structure).
  • Supports PDF/DOCX/DOC/XLSX/CSV document ingestion with standardized outputs.

What I’d improve next

  • Move brittle selectors toward self-healing strategies (visual + semantic locators).
  • Add queue-based orchestration (e.g., Celery/Redis) for parallelism and backpressure.
  • Introduce extraction evaluation sets and automated diffing for regression control.