
Python Web Scraper Project Structure

Web scraping project with organized spiders, data pipelines, and output handling.

#python #scraper #web-scraping #beautifulsoup #httpx

Project Directory

myscraper/
├── pyproject.toml            # Project configuration
├── README.md
├── .gitignore
├── .python-version
├── .env.example              # Proxy settings, etc.
├── src/
│   └── myscraper/
│       ├── __init__.py
│       ├── __main__.py       # python -m myscraper
│       ├── cli.py            # CLI commands
│       ├── spiders/          # One spider per site
│       │   ├── __init__.py
│       │   ├── base.py       # Base spider class
│       │   └── example_site.py
│       ├── parsers/          # HTML/JSON parsing
│       │   ├── __init__.py
│       │   └── product.py
│       ├── pipelines/        # Data processing
│       │   ├── __init__.py
│       │   ├── clean.py      # Data cleaning
│       │   └── store.py      # Output to CSV/DB
│       ├── models/
│       │   ├── __init__.py
│       │   └── item.py       # Scraped data models
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── http.py       # HTTP client setup
│       │   └── retry.py      # Retry logic
│       └── config.py         # Settings and delays
├── data/                     # Output directory
│   └── .gitkeep
└── tests/
    ├── __init__.py
    ├── fixtures/             # Saved HTML responses
    └── test_parsers.py

Why This Structure?

Scrapers grow messy fast. This structure separates concerns: spiders handle crawling, parsers extract data, pipelines process and store it. Each site gets its own spider module. Testing uses saved HTML fixtures to avoid hitting live sites.

Key Directories

  • spiders/ - One module per target site
  • parsers/ - Extract structured data from HTML (see the parser sketch after this list)
  • pipelines/ - Clean, validate, and store data
  • data/ - Output CSVs, JSONs, or DB files
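
Parsers are plain functions that take raw HTML and return structured records, which keeps them easy to test against saved fixtures. A minimal sketch of parsers/product.py using BeautifulSoup; the CSS selectors and field names are placeholders for whatever the target site actually uses:

# src/myscraper/parsers/product.py
from bs4 import BeautifulSoup

def parse_product(html: str) -> list[dict]:
    """Extract product records from a listing page."""
    soup = BeautifulSoup(html, "lxml")
    items = []
    # ".product", ".name", and ".price" are placeholder selectors.
    for card in soup.select(".product"):
        name = card.select_one(".name")
        price = card.select_one(".price")
        items.append({
            "name": name.get_text(strip=True) if name else None,
            "price": price.get_text(strip=True) if price else None,
        })
    return items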

Simple Spider Pattern

# src/myscraper/spiders/example_site.py
import httpx

from myscraper.parsers.product import parse_product
from myscraper.pipelines.store import save_to_csv

async def crawl(url: str) -> None:
    async with httpx.AsyncClient() as client:
        response = await client.get(url)
        response.raise_for_status()  # fail fast on 4xx/5xx responses
        items = parse_product(response.text)
        save_to_csv(items, "data/products.csv")
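
The spider hands its records to save_to_csv from the store pipeline. A minimal sketch using the standard-library csv module, assuming the items are dicts like those produced by the parser sketch above; the append-and-write-header-once behavior is one possible design:

# src/myscraper/pipelines/store.py
import csv
from pathlib import Path

def save_to_csv(items: list[dict], path: str) -> None:
    """Append scraped records to a CSV file, writing the header once."""
    if not items:
        return
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    write_header = not out.exists()
    with out.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        if write_header:
            writer.writeheader()
        writer.writerows(items)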

Getting Started

  1. uv init myscraper
  2. uv add httpx beautifulsoup4 lxml
  3. Create spider in spiders/
  4. uv run python -m myscraper crawl (see the CLI sketch below)
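
Step 4 assumes a small entry point that wires the async spider into python -m myscraper. A minimal sketch using argparse and asyncio; the crawl subcommand name and default URL are illustrative:

# src/myscraper/__main__.py
from myscraper.cli import main

main()

# src/myscraper/cli.py
import argparse
import asyncio

from myscraper.spiders.example_site import crawl

def main() -> None:
    parser = argparse.ArgumentParser(prog="myscraper")
    sub = parser.add_subparsers(dest="command", required=True)
    crawl_cmd = sub.add_parser("crawl", help="run the example_site spider")
    crawl_cmd.add_argument("--url", default="https://example.com/products")
    args = parser.parse_args()

    if args.command == "crawl":
        # Drive the async spider to completion.
        asyncio.run(crawl(args.url))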

When To Use This

  • Custom web scraping projects
  • When Scrapy is overkill
  • Need full control over HTTP
  • Scraping APIs and HTML
  • One-off data collection

Trade-offs

  • No Scrapy features - No built-in scheduling or deduplication
  • Manual concurrency - Handle rate limiting yourself
  • No middleware stack - Implement retry/proxy logic manually (see the retry sketch after this list)
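
With no middleware stack, retry and rate limiting live in utils/. A minimal sketch of a backoff helper built on httpx; the placement in utils/retry.py, the politeness delay, and the retry count are illustrative defaults:

# src/myscraper/utils/retry.py
import asyncio

import httpx

async def fetch_with_retry(client: httpx.AsyncClient, url: str,
                           retries: int = 3, delay: float = 1.0) -> httpx.Response:
    """GET a URL politely: fixed delay per request, exponential backoff on errors."""
    last_error: Exception | None = None
    for attempt in range(retries):
        try:
            await asyncio.sleep(delay)  # politeness delay before every request
            response = await client.get(url)
            response.raise_for_status()
            return response
        except httpx.HTTPError as exc:
            last_error = exc
            # Back off exponentially: delay, 2*delay, 4*delay, ...
            await asyncio.sleep(delay * 2 ** attempt)
    raise RuntimeError(f"giving up on {url}") from last_error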

Best Practices

  • Respect robots.txt and rate limits
  • Use delays between requests
  • Save raw responses for debugging
  • Test parsers with fixture HTML (see the test sketch after this list)
  • Handle errors gracefully
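
Parser tests run against saved HTML so they never hit a live site. A minimal pytest sketch, assuming a saved page at tests/fixtures/product_page.html and the parse_product sketch above; the fixture filename and assertions are illustrative:

# tests/test_parsers.py
from pathlib import Path

from myscraper.parsers.product import parse_product

FIXTURES = Path(__file__).parent / "fixtures"

def test_parse_product_extracts_items():
    html = (FIXTURES / "product_page.html").read_text(encoding="utf-8")
    items = parse_product(html)
    assert items, "expected at least one product from the saved page"
    assert "name" in items[0]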

Naming Conventions

  • Spiders - Named by target site (example_site.py)
  • Parsers - Named by data type (product.py, listing.py)
  • Models - Dataclasses for scraped items (see the sketch below)
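
For the models/ entry, plain dataclasses are enough to give scraped items a stable shape. A minimal sketch of models/item.py; the field names are illustrative:

# src/myscraper/models/item.py
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProductItem:
    """One scraped product record."""
    name: str
    price: str | None = None
    url: str | None = None
    scraped_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )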