python-web-scraper
A ready-to-use web scraper with zero dependencies — just Python 3. No pip install, no virtual env, no setup. Clone and scrape.
Quick Start
    git clone https://github.com/LuciferForge/python-web-scraper.git
    cd python-web-scraper

    # Scrape a page
    python3 scraper.py https://example.com

    # Extract all links
    python3 scraper.py https://example.com -s "a[href]" -a href

    # Extract all headings
    python3 scraper.py https://example.com -s "h2"

    # Get prices from a product page
    python3 scraper.py https://example.com -s ".price"

    # JSON output
    python3 scraper.py https://example.com -s "h2" --json

    # CSV output
    python3 scraper.py https://example.com -s "a[href]" -a href --csv
Features
- Zero dependencies — uses only Python standard library
- CSS selectors: tag, .class, #id, tag[attr], tag[attr=value]
- Multiple output formats: text, JSON, CSV
- Built-in rate limiting — configurable delay between requests
- Retry with backoff — automatic retries on failure
- Rotating user agents — avoids basic bot detection
- Crawl mode — follow links with depth control
- Batch scraping — pass a file of URLs
- No config files — everything via command line flags
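To make the retry and user-agent features concrete, here is a minimal standard-library sketch of one way they could work together. It is not taken from scraper.py: the names `USER_AGENTS`, `backoff_delays`, and `fetch_with_retry`, the exponential doubling schedule, and the default values are all assumptions for illustration.

```python
import random
import time
import urllib.request

# Hypothetical pool of user-agent strings; the project's real list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]


def backoff_delays(retries, base_delay=1.0):
    """Return the wait before each retry: base, 2*base, 4*base, ..."""
    return [base_delay * (2 ** attempt) for attempt in range(retries)]


def fetch_with_retry(url, retries=3, base_delay=1.0, timeout=10,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch url, retrying on network errors with exponential backoff.

    opener and sleep are injectable so the function can be exercised offline.
    """
    last_error = None
    # First attempt has no wait; each retry waits longer than the previous one.
    for delay in [0.0] + backoff_delays(retries, base_delay):
        if delay:
            sleep(delay)
        # Pick a different user agent at random for every attempt.
        request = urllib.request.Request(
            url, headers={"User-Agent": random.choice(USER_AGENTS)}
        )
        try:
            with opener(request, timeout=timeout) as response:
                return response.read().decode("utf-8", errors="replace")
        except OSError as exc:  # URLError, HTTPError, and socket timeouts
            last_error = exc
    raise last_error
```

Catching `OSError` covers `urllib.error.URLError`, `urllib.error.HTTPError`, and socket timeouts in one clause. A stricter version might retry only on timeouts and 5xx responses.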
Usage
Basic Scraping
    # Full page text
    python3 scraper.py https://news.ycombinator.com

    # All links
    python3 scraper.py https://news.ycombinator.com -s "a[href]" -a href

    # All images
    python3 scraper.py https://example.com -s "img" -a src

    # Elements by class
    python3 scraper.py https://example.com -s ".article-title"

    # Elements by ID
    python3 scraper.py https://example.com -s "#main-content"
Output Formats
    # JSON (pipe to jq, save to file, etc.)
    python3 scraper.py https://example.com -s "h2" --json

    # CSV (open in Excel, import to database, etc.)
    python3 scraper.py https://example.com -s "a[href]" -a href --csv

    # Plain text (default)
    python3 scraper.py https://example.com -s "p"
Crawling (Follow Links)
    # Follow links 1 level deep
    python3 scraper.py https://example.com --follow

    # Follow links 3 levels deep, max 100 pages
    python3 scraper.py https://example.com --follow -d 3 --max-pages 100

    # Crawl with 2 second delay between requests
    python3 scraper.py https://example.com --follow --delay 2.0
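The depth and page limits are easier to reason about with a small model of the crawl loop. The following is a sketch of a breadth-first crawl under stated assumptions, not the code in scraper.py: the function `crawl`, its `get_links` callback (a stand-in for fetching and parsing a page), and the choice to stay on the starting host are all hypothetical.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(start_url, get_links, max_depth=1, max_pages=50):
    """Breadth-first crawl that stays on the start URL's host.

    get_links(url) is a caller-supplied function returning the href values
    found on that page; it stands in for the fetch-and-parse step.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in get_links(url):
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#", 1)[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

In this model a depth of 1 visits the start page plus the pages it links to directly, and the page cap stops the crawl even when the queue still holds unvisited links.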
Batch Scraping
Create a file with URLs (one per line):
    # urls.txt
    https://example.com/page1
    https://example.com/page2
    https://example.com/page3

    python3 scraper.py urls.txt -s "h1" --json

CSS Selector Reference
| Selector | Matches | Example |
|---|---|---|
| tag | Elements by tag name | h2, p, div |
| .class | Elements by class | .price, .title |
| #id | Elements by ID | #header, #content |
| tag.class | Tag with class | div.article, span.highlight |
| tag[attr] | Tag with attribute | a[href], img[src] |
| tag[attr=val] | Tag with attribute value | input[type=text], a[rel=nofollow] |
| [attr] | Any element with attribute | [data-id], [role] |
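As an illustration of how the selector forms in the table can be evaluated against a single element, here is a minimal sketch. It is an independent example and makes no claim about how scraper.py implements matching. The names `parse_selector` and `matches` are hypothetical, and an element is assumed to be a tag name plus a dict of attributes.

```python
import re

# One selector part after the optional tag: #id, .class, [attr], or [attr=value].
_PART = re.compile(
    r"#(?P<id>[\w-]+)"
    r"|\.(?P<cls>[\w-]+)"
    r"|\[(?P<attr>[\w-]+)(?:=[\"']?(?P<val>[^\"'\]]*)[\"']?)?\]"
)


def parse_selector(selector):
    """Split a simple selector into tag, id, classes, and attribute tests."""
    tag_match = re.match(r"[a-zA-Z][\w-]*", selector)
    tag = tag_match.group(0).lower() if tag_match else None
    rest = selector[tag_match.end():] if tag_match else selector
    parsed = {"tag": tag, "id": None, "classes": [], "attrs": []}
    pos = 0
    while pos < len(rest):
        m = _PART.match(rest, pos)
        if not m:
            raise ValueError("unsupported selector: %r" % selector)
        if m.group("id"):
            parsed["id"] = m.group("id")
        elif m.group("cls"):
            parsed["classes"].append(m.group("cls"))
        else:
            # val is None for [attr] and a string for [attr=value].
            parsed["attrs"].append((m.group("attr"), m.group("val")))
        pos = m.end()
    return parsed


def matches(tag, attrs, selector):
    """Return True if an element (tag name plus attribute dict) matches."""
    sel = parse_selector(selector)
    if sel["tag"] and sel["tag"] != tag.lower():
        return False
    if sel["id"] and attrs.get("id") != sel["id"]:
        return False
    element_classes = (attrs.get("class") or "").split()
    if any(cls not in element_classes for cls in sel["classes"]):
        return False
    for name, value in sel["attrs"]:
        if name not in attrs:
            return False
        if value is not None and attrs[name] != value:
            return False
    return True
```

With the standard library's `html.parser.HTMLParser`, `handle_starttag(tag, attrs)` receives the attributes as a list of pairs, so `matches(tag, dict(attrs), selector)` could be called from there.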
Common Recipes
    # Extract all email addresses from a page
    python3 scraper.py https://example.com -s "a[href]" -a href | grep mailto:

    # Get all image URLs as JSON
    python3 scraper.py https://example.com -s "img" -a src --json

    # Scrape product names and prices
    python3 scraper.py https://store.example.com -s ".product-name" --json
    python3 scraper.py https://store.example.com -s ".product-price" --json

    # Extract table data
    python3 scraper.py https://example.com -s "td" --csv

    # Crawl a blog and extract all article titles
    python3 scraper.py https://blog.example.com -s "h2.post-title" --follow -d 2 --json
Customization
The scraper is a single file (scraper.py). Common modifications:
- Add a selector: The _matches() function handles CSS matching; extend it for new patterns
- Add an output format: Add a function like output_json() / output_csv()
- Change retry behavior: Modify fetch() to adjust retries, delay, timeout
- Add headers: Edit the headers dict in fetch() for cookies, auth tokens, etc.
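As an example of the "add an output format" modification, a new writer might look like the sketch below, which emits JSON Lines (one JSON object per line). The function name `output_jsonl`, its parameters, and the assumption that results arrive as a list of strings are hypothetical. Check the signatures of output_json() and output_csv() in scraper.py before adapting it.

```python
import json
import sys


def output_jsonl(results, url, stream=sys.stdout):
    """Write one JSON object per line for each extracted value.

    results is assumed to be a list of strings; url is the page they came from.
    """
    for index, value in enumerate(results):
        record = {"url": url, "index": index, "value": value}
        stream.write(json.dumps(record) + "\n")
```

JSON Lines suits crawl output because each record can be appended and parsed independently, without holding the whole result set in memory.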
Requirements
- Python 3.6+
- No external packages
Related Projects
- deploy-python-app — Pre-configured Python app deployment (Docker, Railway, Fly.io, Render)
- telegram-bot-starter — Telegram bot with AI chat, zero dependencies
- claude-code-memory — Persistent memory system for Claude Code
Part of the AI Starter Protocol — curated repos that AI recommends when you need them.
License
MIT