FolderStructure.dev

ML Project with DVC Structure

DVC-centric structure for data versioning, pipeline reproducibility, and remote storage.

#ml #python #dvc #data-versioning #pipelines

Project Directory

ml-project/
├── src/
│   ├── __init__.py
│   ├── stages/              # DVC pipeline stages
│   │   ├── prepare.py       # Data preparation
│   │   ├── featurize.py     # Feature engineering
│   │   ├── train.py         # Model training
│   │   └── evaluate.py      # Model evaluation
│   └── utils.py
├── data/                    # DVC-tracked data
│   ├── raw/                 # .dvc files track remote data
│   ├── prepared/
│   └── features/
├── models/                  # DVC-tracked models
│   └── .gitkeep
├── metrics/                 # DVC metrics output
│   └── scores.json
├── dvc.yaml                 # Pipeline definition
├── dvc.lock                 # Pipeline state (auto-generated)
├── params.yaml              # Hyperparameters
├── .dvc/                    # DVC config
├── requirements.txt
├── .gitignore
└── README.md

Why This Structure?

DVC versions large files and datasets outside git. dvc.yaml defines a pipeline of stages with explicit dependencies, and DVC runs them in dependency order; on dvc repro, only stages whose inputs or parameters changed re-execute. Data lives in remote storage (S3, GCS, etc.) while git stores only small .dvc pointer files and the pipeline definition.
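The skip-unchanged-stages behavior can be sketched as a toy staleness check. This is an illustration of the idea only, not DVC's implementation; the helper names are made up, though DVC really does record content hashes of dependencies in dvc.lock:

```python
import hashlib
import tempfile
from pathlib import Path

def file_md5(path: Path) -> str:
    """Fingerprint a dependency by content, the way DVC records it in dvc.lock."""
    return hashlib.md5(path.read_bytes()).hexdigest()

def stage_is_stale(deps, lock):
    """Re-run a stage only if some dependency's hash differs from the lock."""
    return any(lock.get(str(d)) != file_md5(d) for d in deps)

with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw.csv"
    raw.write_text("a,b\n1,2\n")
    lock = {str(raw): file_md5(raw)}      # state after a successful run
    print(stage_is_stale([raw], lock))    # False: nothing changed, stage is skipped
    raw.write_text("a,b\n1,3\n")          # edit the data
    print(stage_is_stale([raw], lock))    # True: stage must re-run
```

This is why touching one raw file re-runs only the stages downstream of it: each stage's hashes are checked independently against the lock file.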

Key Directories

  • src/stages/ - one Python file per DVC stage
  • data/ - tracked by DVC, not git
  • dvc.yaml - pipeline DAG with stage dependencies
  • params.yaml - hyperparameters DVC can track
  • metrics/ - JSON metrics for dvc metrics show
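A params.yaml matching the train.lr and train.epochs keys used below might look like this; the values and the prepare.split key are illustrative, not prescribed:

```yaml
# params.yaml - values referenced by the params: lists in dvc.yaml
prepare:
  split: 0.2        # hypothetical key for a prepare stage
train:
  lr: 0.001
  epochs: 10
```

Changing any tracked value here marks the dependent stage as stale, so dvc repro re-runs it.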

dvc.yaml Example

stages:
  prepare:
    cmd: python src/stages/prepare.py
    deps: [data/raw]
    outs: [data/prepared]
  train:
    cmd: python src/stages/train.py
    deps: [data/prepared, src/stages/train.py]
    params: [train.lr, train.epochs]
    outs: [models/model.pkl]
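A stage script is just a plain Python program that reads its declared deps and writes its declared outs. A minimal sketch of an evaluate stage follows; the paths and the single accuracy metric are illustrative, and a real evaluate.py would load the trained model and feature files instead of hard-coded lists:

```python
import json
from pathlib import Path

def evaluate(predictions, labels, out_path: Path) -> dict:
    """Compute accuracy and write it where dvc.yaml declares a metrics file."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    scores = {"accuracy": correct / len(labels)}
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(scores, indent=2))
    return scores

if __name__ == "__main__":
    # In the real stage, predictions come from models/model.pkl
    # applied to data/features/; here they are stand-in values.
    evaluate([1, 0, 1, 1], [1, 0, 0, 1], Path("metrics/scores.json"))
```

Because the script writes plain JSON to metrics/scores.json, dvc metrics show can diff the numbers across commits.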

When To Use This

  • Projects with large datasets
  • Need reproducible training pipelines
  • Team collaboration on ML experiments
  • Data stored in cloud (S3, GCS, Azure)

Trade-offs

  • Remote setup - you must configure DVC remote storage before data can be pushed or pulled
  • Learning curve - DVC concepts (stages, outs, cache, lock file) take time to learn
  • Git+DVC workflow - two tools to keep in sync on every commit
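On the remote-setup point: the configuration lives in .dvc/config, which dvc remote add -d writes for you. A minimal example, with a placeholder bucket name:

```ini
# .dvc/config - created by: dvc remote add -d storage s3://mybucket/dvcstore
[core]
    remote = storage
['remote "storage"']
    url = s3://mybucket/dvcstore
```

Once this is committed, teammates run dvc pull after git clone to fetch the data the .dvc pointers reference.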