Python Web Scraper Project Structure
Web scraping project with organized spiders, data pipelines, and output handling.
Project Directory
myscraper/
├── pyproject.toml            # Project configuration
├── README.md
├── .gitignore
├── .python-version
├── .env.example              # Proxy settings, etc.
├── src/
│   └── myscraper/
│       ├── __init__.py
│       ├── __main__.py       # python -m myscraper
│       ├── cli.py            # CLI commands
│       ├── spiders/          # One spider per site
│       │   ├── __init__.py
│       │   ├── base.py       # Base spider class
│       │   └── example_site.py
│       ├── parsers/          # HTML/JSON parsing
│       │   ├── __init__.py
│       │   └── product.py
│       ├── pipelines/        # Data processing
│       │   ├── __init__.py
│       │   ├── clean.py      # Data cleaning
│       │   └── store.py      # Output to CSV/DB
│       ├── models/
│       │   ├── __init__.py
│       │   └── item.py       # Scraped data models
│       ├── utils/
│       │   ├── __init__.py
│       │   ├── http.py       # HTTP client setup
│       │   └── retry.py      # Retry logic
│       └── config.py         # Settings and delays
├── data/                     # Output directory
│   └── .gitkeep
└── tests/
    ├── __init__.py
    ├── fixtures/             # Saved HTML responses
    └── test_parsers.py
Why This Structure?
Scrapers grow messy fast. This structure separates concerns: spiders handle crawling, parsers extract data, pipelines process and store it. Each site gets its own spider module. Testing uses saved HTML fixtures to avoid hitting live sites.
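The fixture-based testing mentioned above might look like the sketch below; the fixture filename and the expected item fields are assumptions for illustration, not part of the project.

# tests/test_parsers.py (sketch; fixture name and expected fields are assumptions)
from pathlib import Path

from myscraper.parsers.product import parse_product

FIXTURES = Path(__file__).parent / "fixtures"


def test_parse_product_extracts_items():
    # Parse a saved HTML response instead of hitting the live site
    html = (FIXTURES / "example_site_product.html").read_text(encoding="utf-8")
    items = parse_product(html)
    assert items, "expected at least one item from the fixture page"
    assert "title" in items[0]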
Key Directories
- spiders/ - One module per target site
- parsers/ - Extract structured data from HTML (see the parser sketch after this list)
- pipelines/ - Clean, validate, and store data
- data/ - Output CSVs, JSONs, or DB files
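A minimal parser sketch, assuming BeautifulSoup with the lxml backend and a hypothetical .product card markup; the CSS selectors are assumptions:

# src/myscraper/parsers/product.py (sketch; selectors are assumptions)
from bs4 import BeautifulSoup


def parse_product(html: str) -> list[dict]:
    # Pull one dict per product card on the page
    soup = BeautifulSoup(html, "lxml")
    items: list[dict] = []
    for card in soup.select(".product"):
        title = card.select_one(".title")
        price = card.select_one(".price")
        if title and price:
            items.append({
                "title": title.get_text(strip=True),
                "price": price.get_text(strip=True),
            })
    return items

Keeping selectors in one parser module per data type means a markup change on the target site only touches one file.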
Simple Spider Pattern
# src/myscraper/spiders/example_site.py
import httpx

from myscraper.parsers.product import parse_product
from myscraper.pipelines.store import save_to_csv


async def crawl(url: str):
    async with httpx.AsyncClient() as client:
        # Fetch the page, hand the HTML to the parser, then store the results
        response = await client.get(url)
        items = parse_product(response.text)
        save_to_csv(items, "data/products.csv")
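The save_to_csv call above could be backed by something like this sketch; it assumes items are dicts with uniform keys and appends rows to the output file:

# src/myscraper/pipelines/store.py (sketch; assumes items are dicts with uniform keys)
import csv
from pathlib import Path


def save_to_csv(items: list[dict], path: str) -> None:
    if not items:
        return
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # Write a header row on first write, then append rows on later runs
    new_file = not out.exists()
    with out.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(items[0].keys()))
        if new_file:
            writer.writeheader()
        writer.writerows(items)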
Getting Started
- uv init myscraper
- uv add httpx beautifulsoup4 lxml
- Create a spider in spiders/
- uv run python -m myscraper crawl
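One way to make the python -m myscraper crawl command work is the sketch below; the subcommand layout, argument names, and default URL are assumptions.

# src/myscraper/cli.py (sketch; command and argument names are assumptions)
import argparse
import asyncio

from myscraper.spiders.example_site import crawl


def main() -> None:
    parser = argparse.ArgumentParser(prog="myscraper")
    sub = parser.add_subparsers(dest="command", required=True)
    crawl_cmd = sub.add_parser("crawl", help="Run the example_site spider")
    crawl_cmd.add_argument("--url", default="https://example.com/products")
    args = parser.parse_args()
    if args.command == "crawl":
        asyncio.run(crawl(args.url))


# src/myscraper/__main__.py just delegates:
# from myscraper.cli import main
# main()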
When To Use This
- Custom web scraping projects
- When Scrapy is overkill
- Need full control over HTTP
- Scraping APIs and HTML
- One-off data collection
Trade-offs
- No Scrapy features - No built-in scheduling or deduplication
- Manual concurrency - Handle rate limiting yourself
- No middleware stack - Implement retry/proxy logic manually (a retry sketch follows this list)
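Since there is no middleware stack, retry logic lives in your own code. A sketch of what utils/retry.py might hold, with arbitrary attempt and backoff values:

# src/myscraper/utils/retry.py (sketch; attempt count and backoff are arbitrary choices)
import asyncio

import httpx


async def get_with_retry(client: httpx.AsyncClient, url: str,
                         attempts: int = 3, backoff: float = 1.0) -> httpx.Response:
    # Retry transient network errors and 5xx responses with exponential backoff
    last_response = None
    for attempt in range(1, attempts + 1):
        try:
            last_response = await client.get(url)
            if last_response.status_code < 500:
                return last_response
        except httpx.TransportError:
            if attempt == attempts:
                raise
        if attempt < attempts:
            await asyncio.sleep(backoff * 2 ** (attempt - 1))
    return last_response  # retries exhausted; caller decides what to do with the 5xx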
Best Practices
- Respect robots.txt and rate limits
- Use delays between requests (see the politeness sketch after this list)
- Save raw responses for debugging
- Test parsers with fixture HTML
- Handle errors gracefully
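A sketch of the robots.txt check and per-request delay mentioned above, using the standard-library robot parser; the delay value and user agent string are arbitrary:

# Politeness sketch (delay and user agent are arbitrary)
import asyncio
import urllib.parse
import urllib.robotparser

import httpx

DELAY_SECONDS = 2.0       # arbitrary per-request delay
USER_AGENT = "myscraper"  # assumed user agent string


def allowed_by_robots(url: str) -> bool:
    # Check the site's robots.txt before crawling a URL (note: this read is blocking)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urllib.parse.urljoin(url, "/robots.txt"))
    robots.read()
    return robots.can_fetch(USER_AGENT, url)


async def polite_get(client: httpx.AsyncClient, url: str) -> httpx.Response | None:
    if not allowed_by_robots(url):
        return None
    response = await client.get(url)
    # Pause between requests so the target site is not hammered
    await asyncio.sleep(DELAY_SECONDS)
    return response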
Naming Conventions
- Spiders - Named after the target site (example_site.py)
- Parsers - Named by data type (product.py, listing.py)
- Models - Dataclasses for scraped items (see the sketch below)
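A dataclass item model could be as small as this sketch; the field names are assumptions that would mirror whatever the parsers return:

# src/myscraper/models/item.py (sketch; field names are assumptions)
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProductItem:
    # One scraped product; fields mirror what the parser extracts
    title: str
    price: str
    url: str
    scraped_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))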