# ML Project with DVC Structure

DVC-centric structure for data versioning, pipeline reproducibility, and remote storage.

## Project Directory
```
ml-project/
├── src/
│   ├── __init__.py
│   ├── stages/              # DVC pipeline stages
│   │   ├── prepare.py       # Data preparation
│   │   ├── featurize.py     # Feature engineering
│   │   ├── train.py         # Model training
│   │   └── evaluate.py      # Model evaluation
│   └── utils.py
├── data/                    # DVC-tracked data
│   ├── raw/                 # .dvc files track remote data
│   ├── prepared/
│   └── features/
├── models/                  # DVC-tracked models
│   └── .gitkeep
├── metrics/                 # DVC metrics output
│   └── scores.json
├── dvc.yaml                 # Pipeline definition
├── dvc.lock                 # Pipeline state (auto-generated)
├── params.yaml              # Hyperparameters
├── .dvc/                    # DVC config
├── requirements.txt
├── .gitignore
└── README.md
```
## Why This Structure?

DVC tracks large files and datasets outside git. The `dvc.yaml` file defines the pipeline as a graph of stages, which DVC runs in dependency order; on `dvc repro`, only stages whose dependencies or parameters changed re-execute. Data stays on remote storage (S3, GCS, etc.) while git stores only small `.dvc` pointer files.
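In practice, the round trip looks roughly like this (a sketch; it assumes a DVC remote is already configured, as covered under Trade-offs below):

```bash
# Re-run the pipeline; stages whose deps and params are unchanged are skipped
dvc repro

# Git stores only the small pipeline and lock files; the data goes to the remote
git add dvc.yaml dvc.lock
git commit -m "Reproduce pipeline"
dvc push
```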
## Key Directories

- `src/stages/` - One Python file per DVC stage
- `data/` - Tracked by DVC, not git
- `dvc.yaml` - Pipeline DAG with dependencies
- `params.yaml` - Hyperparameters DVC can track (see the sketch after this list)
- `metrics/` - JSON metrics for `dvc metrics`
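The `params.yaml` keys referenced by the pipeline below might look like this (a minimal sketch; the values are illustrative):

```yaml
# params.yaml -- keys match the `params:` entries in dvc.yaml; values illustrative
train:
  lr: 0.001
  epochs: 20
```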
## dvc.yaml Example

```yaml
stages:
  prepare:
    cmd: python src/stages/prepare.py
    deps: [data/raw]
    outs: [data/prepared]
  train:
    cmd: python src/stages/train.py
    deps: [data/prepared, src/stages/train.py]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
```
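A stage script reads its parameters from `params.yaml` and writes the outputs declared under `outs:`. Here is a minimal sketch of what `src/stages/train.py` might look like; the CSV layout and the `SGDClassifier` model are illustrative assumptions, not a prescribed implementation:

```python
# src/stages/train.py -- minimal sketch; data layout and model are assumptions
import pickle
from pathlib import Path

import pandas as pd
import yaml
from sklearn.linear_model import SGDClassifier


def main():
    # Read the hyperparameters that dvc.yaml declares under `params:`
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]

    # Assumes the prepare stage wrote a CSV with a `target` column
    df = pd.read_csv("data/prepared/train.csv")
    X, y = df.drop(columns=["target"]), df["target"]

    model = SGDClassifier(eta0=params["lr"], learning_rate="constant",
                          max_iter=params["epochs"])
    model.fit(X, y)

    # Write the output that dvc.yaml declares under `outs:`
    Path("models").mkdir(exist_ok=True)
    with open("models/model.pkl", "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    main()
```

Because both `data/prepared` and the script itself are listed under `deps`, editing either one makes `dvc repro` re-run this stage (and anything downstream of it) while skipping the rest.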
## When To Use This
- Projects with large datasets
- Need reproducible training pipelines
- Team collaboration on ML experiments
- Data stored in cloud (S3, GCS, Azure)
## Trade-offs

- Remote setup - need to configure DVC remote storage (see the snippet after this list)
- Learning curve - DVC concepts take time
- Git+DVC workflow - two tools to manage
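The remote setup is a one-time step. A sketch for S3 (the bucket path is a placeholder; GCS and Azure use `gs://` and `azure://` URLs):

```bash
# Register an S3 remote and make it the default with -d
dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"

# Upload tracked data and models; teammates fetch them with `dvc pull`
dvc push
```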