# ML Project with DVC Structure

DVC-centric structure for data versioning, pipeline reproducibility, and remote storage.

## Project Directory
```
ml-project/
├── src/
│   ├── __init__.py
│   ├── stages/              # DVC pipeline stages
│   │   ├── prepare.py       # Data preparation
│   │   ├── featurize.py     # Feature engineering
│   │   ├── train.py         # Model training
│   │   └── evaluate.py      # Model evaluation
│   └── utils.py
├── data/                    # DVC-tracked data
│   ├── raw/                 # .dvc files track remote data
│   ├── prepared/
│   └── features/
├── models/                  # DVC-tracked models
│   └── .gitkeep
├── metrics/                 # DVC metrics output
│   └── scores.json
├── dvc.yaml                 # Pipeline definition
├── dvc.lock                 # Pipeline state (auto-generated)
├── params.yaml              # Hyperparameters
├── .dvc/                    # DVC config
├── requirements.txt
├── .gitignore
└── README.md
```
## Why This Structure?

DVC tracks large files and datasets outside git. The `dvc.yaml` file defines the pipeline as a graph of stages, which DVC runs in dependency order; on `dvc repro`, only stages whose dependencies or parameters changed re-execute. Data stays on remote storage (S3, GCS, etc.) while git stores only small `.dvc` pointer files.
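In practice, the round trip looks roughly like this (a sketch; it assumes a DVC remote is already configured, as covered under Trade-offs below):

```bash
# Re-run the pipeline; stages whose deps and params are unchanged are skipped
dvc repro

# Git stores only the small pipeline and lock files; the data goes to the remote
git add dvc.yaml dvc.lock
git commit -m "Reproduce pipeline"
dvc push
```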
## Key Directories

- `src/stages/` - One Python file per DVC stage
- `data/` - Tracked by DVC, not git
- `dvc.yaml` - Pipeline DAG with dependencies
- `params.yaml` - Hyperparameters DVC can track (see the sketch after this list)
- `metrics/` - JSON metrics for `dvc metrics`
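The `params.yaml` keys referenced by the pipeline below might look like this (a minimal sketch; the values are illustrative):

```yaml
# params.yaml -- keys match the `params:` entries in dvc.yaml; values illustrative
train:
  lr: 0.001
  epochs: 20
```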
## dvc.yaml Example

```yaml
stages:
  prepare:
    cmd: python src/stages/prepare.py
    deps: [data/raw]
    outs: [data/prepared]
  train:
    cmd: python src/stages/train.py
    deps: [data/prepared, src/stages/train.py]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
```
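A stage script reads its parameters from `params.yaml` and writes the outputs declared under `outs:`. Here is a minimal sketch of what `src/stages/train.py` might look like; the CSV layout and the `SGDClassifier` model are illustrative assumptions, not a prescribed implementation:

```python
# src/stages/train.py -- minimal sketch; data layout and model are assumptions
import pickle
from pathlib import Path

import pandas as pd
import yaml
from sklearn.linear_model import SGDClassifier


def main():
    # Read the hyperparameters that dvc.yaml declares under `params:`
    with open("params.yaml") as f:
        params = yaml.safe_load(f)["train"]

    # Assumes the prepare stage wrote a CSV with a `target` column
    df = pd.read_csv("data/prepared/train.csv")
    X, y = df.drop(columns=["target"]), df["target"]

    model = SGDClassifier(eta0=params["lr"], learning_rate="constant",
                          max_iter=params["epochs"])
    model.fit(X, y)

    # Write the output that dvc.yaml declares under `outs:`
    Path("models").mkdir(exist_ok=True)
    with open("models/model.pkl", "wb") as f:
        pickle.dump(model, f)


if __name__ == "__main__":
    main()
```

Because both `data/prepared` and the script itself are listed under `deps`, editing either one makes `dvc repro` re-run this stage (and anything downstream of it) while skipping the rest.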
## When To Use This
- Projects with large datasets
- Need reproducible training pipelines
- Team collaboration on ML experiments
- Data stored in cloud (S3, GCS, Azure)
## Trade-offs

- Remote setup - need to configure DVC remote storage (see the snippet after this list)
- Learning curve - DVC concepts take time
- Git+DVC workflow - two tools to manage
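The remote setup is a one-time step. A sketch for S3 (the bucket path is a placeholder; GCS and Azure use `gs://` and `azure://` URLs):

```bash
# Register an S3 remote and make it the default with -d
dvc remote add -d storage s3://my-bucket/dvc-store
git add .dvc/config
git commit -m "Configure DVC remote"

# Upload tracked data and models; teammates fetch them with `dvc pull`
dvc push
```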