Orwell Web Scraper

Solo Developer · July 2023

An async scraping pipeline resilient to blocking and churn, maintaining access on ~90% of runs and producing 26k labeled assets for downstream classification.

Python · Playwright · aiohttp · asyncio

The Problem

Collecting structured data from dynamic, JavaScript-heavy web pages at scale requires more than simple HTTP requests. Rate limiting, bot detection, and client-side rendering make naive approaches brittle. The goal was to produce a large, labeled dataset for downstream classification work.

Role & Constraints

Solo Developer

  • Must handle JavaScript-rendered content reliably
  • Must maintain access despite blocking and site churn
  • Must produce clean, labeled output for downstream ML use

Approach

Designed an async scraping pipeline using Playwright for browser automation and aiohttp for lightweight HTTP work. Built resilience against blocking and site churn through pacing controls, proxy rotation, and anti-detection patterns. Structured the output as labeled assets ready for downstream classification pipelines.
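The pacing and proxy-rotation ideas above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the names `fetch_with_rotation`, `scrape`, and `BlockedError` are hypothetical, and the `fetch` callable is an assumed stand-in for whatever wraps a Playwright page load or an aiohttp request.

```python
import asyncio
import itertools
import random
from typing import Awaitable, Callable, Dict, Iterator, List, Optional

# Hypothetical signature: fetch(url, proxy) -> page body.
# In a real pipeline this would wrap Playwright or aiohttp.
Fetch = Callable[[str, str], Awaitable[str]]


class BlockedError(Exception):
    """Raised by a fetch callable when the target appears to block the request."""


async def fetch_with_rotation(
    url: str,
    fetch: Fetch,
    proxies: Iterator[str],
    limiter: asyncio.Semaphore,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> Optional[str]:
    """Try a URL through successive proxies, backing off between attempts."""
    for attempt in range(max_attempts):
        proxy = next(proxies)  # rotate: each attempt takes the next proxy
        async with limiter:  # cap the number of in-flight requests
            # Jittered pause so requests do not arrive on a fixed rhythm.
            await asyncio.sleep(random.uniform(0, base_delay))
            try:
                return await fetch(url, proxy)
            except BlockedError:
                pass  # fall through and retry on another proxy
        if attempt < max_attempts - 1:
            # Exponential backoff before the next attempt.
            await asyncio.sleep(base_delay * 2 ** attempt)
    return None  # every attempt was blocked


async def scrape(
    urls: List[str],
    fetch: Fetch,
    proxy_pool: List[str],
    concurrency: int = 4,
    base_delay: float = 0.5,
) -> Dict[str, Optional[str]]:
    """Fetch all URLs concurrently, sharing one round-robin proxy cycle."""
    proxies = itertools.cycle(proxy_pool)
    limiter = asyncio.Semaphore(concurrency)
    results = await asyncio.gather(
        *(
            fetch_with_rotation(u, fetch, proxies, limiter, base_delay=base_delay)
            for u in urls
        )
    )
    return dict(zip(urls, results))
```

Passing the fetch function in as a parameter keeps the retry and pacing logic independent of the transport, so the same wrapper could serve both browser-driven and plain HTTP requests. A URL that stays blocked after all attempts maps to `None`, which lets a caller count failed runs separately from successful ones.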

Media

Orwell — async scraping pipeline with resilience layer

Outcomes

  • Built an async scraping pipeline maintaining access on ~90% of runs
  • Produced 26,000 labeled assets for downstream classification
  • Implemented proxy rotation and anti-detection without external services
  • Demonstrated systems thinking applied to data collection infrastructure

Proof

GitHub Repository

Reflection

This project is in the portfolio not for its scale but because it demonstrates the systems thinking I bring to every problem. Reliable data collection is an engineering challenge: it demands concurrency, error recovery, resilience against churn, and clean output under adversarial conditions. The same discipline applies to production services.
