Orwell Web Scraper
Solo Developer · July 2023
An async scraping pipeline resilient to blocking and churn, maintaining access on ~90% of runs and producing 26k labeled assets for downstream classification.
The Problem
Collecting structured data from dynamic, JavaScript-heavy web pages at scale requires more than simple HTTP requests. Rate limiting, bot detection, and client-side rendering make naive approaches brittle. The goal was to produce a large, labeled dataset for downstream classification work.
Role & Constraints
Solo Developer
- Must handle JavaScript-rendered content reliably
- Must maintain access despite blocking and site churn
- Must produce clean, labeled output for downstream ML use
Approach
Designed an async scraping pipeline using Playwright for browser automation and aiohttp for lightweight HTTP work. Built resilience against blocking and site churn through pacing controls, proxy rotation, and anti-detection patterns. Structured the output as labeled assets ready for downstream classification pipelines.
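The pacing side of this design can be illustrated with a minimal sketch. It is not the project's actual code: the names `fetch_with_retry` and `fetch_all`, the default delays, and the retry count are all assumptions chosen for illustration, and `fetch` stands in for whatever coroutine does the real work (for example a Playwright page load or an aiohttp request).

```python
import asyncio
import random
from typing import Awaitable, Callable, Dict, List

# Hypothetical type: any coroutine function that takes a URL and returns a body.
Fetch = Callable[[str], Awaitable[str]]


async def fetch_with_retry(
    url: str,
    fetch: Fetch,
    sem: asyncio.Semaphore,
    *,
    base_delay: float = 0.5,   # assumed default, seconds
    jitter: float = 0.5,       # assumed default, seconds
    max_retries: int = 3,      # assumed default
) -> str:
    """Fetch one URL with a paced start and exponential backoff on failure."""
    async with sem:  # cap the number of requests in flight
        for attempt in range(max_retries + 1):
            # Wait before every attempt. The delay doubles after each failure,
            # and random jitter keeps the request timing from looking regular.
            delay = base_delay * 2 ** attempt + random.uniform(0, jitter)
            await asyncio.sleep(delay)
            try:
                return await fetch(url)
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: surface the last error
    raise RuntimeError("unreachable")


async def fetch_all(
    urls: List[str],
    fetch: Fetch,
    *,
    max_concurrency: int = 4,  # assumed default
    **pacing: float,
) -> Dict[str, object]:
    """Fetch every URL; each value is either a body or the exception raised."""
    # The semaphore is created inside the running event loop on purpose.
    sem = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(
        *(fetch_with_retry(u, fetch, sem, **pacing) for u in urls),
        return_exceptions=True,  # one dead URL should not cancel the rest
    )
    return dict(zip(urls, results))
```

Keeping `fetch` injectable means the same pacing logic could wrap both the browser path and the lightweight HTTP path, and it can be exercised in tests with a fake coroutine instead of a live site.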
Media
┌─────────────┐   ┌──────────────┐   ┌────────────────┐
│ Target URLs │──▶│  Playwright  │──▶│    aiohttp     │
│             │   │   Browser    │   │   Async Pool   │
└─────────────┘   └──────┬───────┘   └───────┬────────┘
                         │                   │
                  ┌──────▼───────┐   ┌───────▼────────┐
                  │    Anti-     │   │ Rate Limiting  │
                  │  Detection   │   │   + Pacing     │
                  └──────┬───────┘   └───────┬────────┘
                         └────────┬──────────┘
                                  │
                         ┌────────▼────────┐
                         │ Proxy Rotation  │
                         │ (~90% success)  │
                         └────────┬────────┘
                                  │
                         ┌────────▼────────┐
                         │  26k Labeled    │
                         │     Assets      │
                         └─────────────────┘
Orwell: async scraping pipeline with resilience layer
Outcomes
- Built an async scraping pipeline maintaining access on ~90% of runs
- Produced 26,000 labeled assets for downstream classification
- Implemented proxy rotation and anti-detection without external services
- Demonstrated systems thinking applied to data collection infrastructure
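As an illustration of how proxy rotation can live in-process, without an external service, here is a minimal sketch. It is an assumption about one reasonable design, not the project's implementation: the `ProxyPool` class, its method names, the cooldown value, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Dict, List, Optional


class ProxyPool:
    """Round-robin proxy selection that skips proxies cooling down after a failure."""

    def __init__(
        self,
        proxies: List[str],
        cooldown_s: float = 60.0,                      # assumed default
        clock: Callable[[], float] = time.monotonic,   # injectable for tests
    ) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._benched_until: Dict[str, float] = {}
        self._next = 0

    def acquire(self) -> Optional[str]:
        """Return the next usable proxy, or None if every proxy is cooling down."""
        now = self._clock()
        # Look at each proxy at most once, starting from the rotation cursor.
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._next]
            self._next = (self._next + 1) % len(self._proxies)
            if self._benched_until.get(proxy, 0.0) <= now:
                return proxy
        return None

    def report_failure(self, proxy: str) -> None:
        """Bench a proxy for the cooldown period after a block or timeout."""
        self._benched_until[proxy] = self._clock() + self._cooldown_s

    def report_success(self, proxy: str) -> None:
        """Clear any cooldown once a proxy works again."""
        self._benched_until.pop(proxy, None)
```

A caller would take a proxy from `acquire()` before each request, pass it to the HTTP client or browser launch, and then call `report_success` or `report_failure` depending on the result. When `acquire()` returns `None`, the caller should wait rather than reuse a proxy that was just blocked.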
Reflection
This project sits in the portfolio not because of scale, but because it demonstrates the kind of systems thinking I bring to every problem. Reliable data collection is an engineering challenge: concurrency, error recovery, resilience against churn, and producing clean output under adversarial conditions. The same discipline applies to production services.