
Python Web Scraper Project Structure

Web scraping project with organized spiders, data pipelines, and output handling.

#python #scraper #web-scraping #beautifulsoup #httpx

Project Directory

myscraper/
├── pyproject.toml            # Project configuration
├── README.md
├── .gitignore
├── .python-version
├── .env.example              # Proxy settings, etc.
├── src/
│   └── myscraper/
│       ├── __init__.py
│       ├── __main__.py       # python -m myscraper
│       ├── cli.py            # CLI commands
│       ├── spiders/          # One spider per site
│       │   ├── __init__.py
│       │   ├── base.py       # Base spider class
│       │   └── example_site.py
│       ├── parsers/          # HTML/JSON parsing
│       │   ├── __init__.py
│       │   └── product.py
│       ├── pipelines/        # Data processing
│       │   ├── __init__.py
│       │   ├── clean.py      # Data cleaning
│       │   └── store.py      # Output to CSV/DB
│       ├── models/
│       │   ├── __init__.py
│       │   └── item.py       # Scraped data models
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── http.py       # HTTP client setup
│       │   └── retry.py      # Retry logic
│       └── config.py         # Settings and delays
├── data/                     # Output directory
│   └── .gitkeep
└── tests/
    ├── __init__.py
    ├── fixtures/             # Saved HTML responses
    └── test_parsers.py

Why This Structure?

Scrapers grow messy fast. This structure separates concerns: spiders handle crawling, parsers extract data, pipelines process and store it. Each site gets its own spider module. Testing uses saved HTML fixtures to avoid hitting live sites.

Key Directories

  • spiders/ - One module per target site
  • parsers/ - Extract structured data from HTML (see the parser sketch after this list)
  • pipelines/ - Clean, validate, and store data
  • data/ - Output CSVs, JSONs, or DB files
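
Parsers are plain functions that take raw HTML and return structured records, which keeps them easy to test against saved fixtures. A minimal sketch of parsers/product.py using BeautifulSoup; the CSS selectors and field names are placeholders for whatever the target site actually uses:

# src/myscraper/parsers/product.py
from bs4 import BeautifulSoup

def parse_product(html: str) -> list[dict]:
    """Extract product records from a listing page."""
    soup = BeautifulSoup(html, "lxml")
    items = []
    # ".product", ".name", and ".price" are placeholder selectors.
    for card in soup.select(".product"):
        name = card.select_one(".name")
        price = card.select_one(".price")
        items.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items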

Simple Spider Pattern

# src/myscraper/spiders/example_site.py
import httpx

from myscraper.parsers.product import parse_product
from myscraper.pipelines.store import save_to_csv

async def crawl(url: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()  # fail fast on 4xx/5xx responses
        items = parse_product(response.text)
        save_to_csv(items, "data/products.csv")
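
The spider hands its records to save_to_csv from the store pipeline. A minimal sketch using the standard-library csv module, assuming the items are dicts like those produced by the parser sketch above; the append-and-write-header-once behavior is one possible design:

# src/myscraper/pipelines/store.py
import csv
from pathlib import Path

def save_to_csv(items: list[dict], path: str) -> None:
    """Append scraped records to a CSV file, writing the header once."""
    if not items:
        return
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    write_header = not out.exists()
    with out.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(items)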

Getting Started

  1. uv init myscraper
  2. uv add httpx beautifulsoup4 lxml
  3. Create spider in spiders/
  4. uv run python -m myscraper crawl (see the CLI sketch below)
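
Step 4 assumes a small entry point that wires the async spider into python -m myscraper. A minimal sketch using argparse and asyncio; the crawl subcommand name and default URL are illustrative:

# src/myscraper/__main__.py
from myscraper.cli import main

main()

# src/myscraper/cli.py
import argparse
import asyncio

from myscraper.spiders.example_site import crawl

def main() -> None:
    parser = argparse.ArgumentParser(prog="myscraper")
    sub = parser.add_subparsers(dest="command", required=True)
    crawl_cmd = sub.add_parser("crawl", help="run the example_site spider")
    crawl_cmd.add_argument("--url", default="https://example.com/products")
    args = parser.parse_args()

    if args.command == "crawl":
        # Drive the async spider to completion.
        asyncio.run(crawl(args.url))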

When To Use This

  • Custom web scraping projects
  • When Scrapy is overkill
  • Need full control over HTTP
  • Scraping APIs and HTML
  • One-off data collection

Trade-offs

  • No Scrapy features - No built-in scheduling or deduplication
  • Manual concurrency - Handle rate limiting yourself
  • No middleware stack - Implement retry/proxy logic manually (see the retry sketch after this list)
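
With no middleware stack, retry and rate limiting live in utils/. A minimal sketch of a backoff helper built on httpx; the placement in utils/retry.py, the politeness delay, and the retry count are illustrative defaults:

# src/myscraper/utils/retry.py
import asyncio

import httpx

async def fetch_with_retry(client: httpx.AsyncClient, url: str,
                           retries: int = 3, delay: float = 1.0) -> httpx.Response:
    """GET a URL politely: fixed delay per request, exponential backoff on errors."""
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            await asyncio.sleep(delay)  # politeness delay before every request
            response = await client.get(url)
            response.raise_for_status()
            return response
        except httpx.HTTPError as exc:
            last_error = exc
            # Back off exponentially: delay, 2*delay, 4*delay, ...
            await asyncio.sleep(delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url}") from last_error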

Best Practices

  • Respect robots.txt and rate limits
  • Use delays between requests
  • Save raw responses for debugging
  • Test parsers with fixture HTML (see the test sketch after this list)
  • Handle errors gracefully
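
Parser tests run against saved HTML so they never hit a live site. A minimal pytest sketch, assuming a saved page at tests/fixtures/product_page.html and the parse_product sketch above; the fixture filename and assertions are illustrative:

# tests/test_parsers.py
from pathlib import Path

from myscraper.parsers.product import parse_product

FIXTURES = Path(__file__).parent / "fixtures"

def test_parse_product_extracts_items():
    html = (FIXTURES / "product_page.html").read_text(encoding="utf-8")
    items = parse_product(html)
    assert items, "expected at least one product from the saved page"
    assert "name" in items[0]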

Naming Conventions

  • Spiders - Named by target site (example_site.py)
  • Parsers - Named by data type (product.py, listing.py)
  • Models - Dataclasses for scraped items (see the sketch below)
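
For the models/ entry, plain dataclasses are enough to give scraped items a stable shape. A minimal sketch of models/item.py; the field names are illustrative:

# src/myscraper/models/item.py
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProductItem:
    """One scraped product record."""
    name: str
    price: str | None = None
    url: str | None = None
    scraped_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )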