ML Project MLOps Pipeline Structure
A full MLOps setup with DVC, MLflow, and CI/CD integration, intended for production ML systems that require reproducibility and automation.
Project Directory
```
ml-pipeline/
├── src/
│   └── ml_pipeline/
│       ├── __init__.py
│       ├── data/
│       │   ├── make_dataset.py
│       │   └── validate.py          # Great Expectations
│       ├── features/
│       │   └── build_features.py
│       ├── models/
│       │   ├── train.py
│       │   ├── predict.py
│       │   └── evaluate.py
│       └── pipelines/               # DVC pipeline definitions
│           ├── dvc.yaml             # Pipeline DAG
│           └── params.yaml          # Hyperparameters
├── data/                            # DVC-tracked
│   ├── raw/.gitkeep
│   ├── processed/.gitkeep
│   └── .gitignore                   # /*.csv, /*.parquet
├── models/                          # DVC-tracked artifacts
│   └── .gitignore
├── configs/
│   ├── train.yaml                   # Hydra config
│   ├── model/
│   └── data/
├── tests/
│   ├── test_data.py
│   └── test_model.py
├── pyproject.toml
├── dvc.yaml                         # Root pipeline config
├── dvc.lock                         # Pipeline state
├── mlflow.yaml                      # MLflow tracking config
├── Makefile                         # Common commands
├── .dvc/                            # DVC internals
└── README.md
```
Why This Structure?
Built for reproducible ML. DVC versions data and models alongside code; MLflow tracks experiments and registers models. The `dvc.yaml` pipeline ensures `dvc repro` rebuilds only the stages whose inputs changed. Hydra configs make hyperparameter sweeps easy.
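A minimal sketch of what the root `dvc.yaml` might look like for this layout. The script paths follow the tree above; the stage names, the `train` params key, and the `model.pkl` / `metrics.json` filenames are illustrative assumptions:

```yaml
# dvc.yaml - each stage declares its command, dependencies, and outputs.
# `dvc repro` re-runs a stage only when something it declares has changed.
stages:
  prepare:
    cmd: python src/ml_pipeline/data/make_dataset.py
    deps:
      - src/ml_pipeline/data/make_dataset.py
      - data/raw
    outs:
      - data/processed
  train:
    cmd: python src/ml_pipeline/models/train.py
    deps:
      - src/ml_pipeline/models/train.py
      - data/processed
    params:
      - train                  # the `train:` block in params.yaml
    outs:
      - models/model.pkl
  evaluate:
    cmd: python src/ml_pipeline/models/evaluate.py
    deps:
      - src/ml_pipeline/models/evaluate.py
      - models/model.pkl
    metrics:
      - metrics.json:
          cache: false         # small metrics file stays in git, not the DVC cache
```

The `params` entry points at a key in `params.yaml`, so editing a hyperparameter invalidates only the stages that declare it (the values here are illustrative):

```yaml
# params.yaml
train:
  learning_rate: 0.01
  n_estimators: 200
  max_depth: 8
```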
Key Directories
- `pipelines/` - `dvc.yaml` defines stage dependencies
- `configs/` - Hydra configs for experiment variations (see the sketch after this list)
- `data/` - DVC-tracked; the actual files live in remote storage
- `models/` - trained models tracked by DVC
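Hydra composes the final training config from the `model/` and `data/` groups, which is what makes experiment variations cheap (overrides like `python src/ml_pipeline/models/train.py model=xgboost`, where `xgboost` is a hypothetical group value). A minimal sketch of `configs/train.yaml`, with group entries and field names as illustrative assumptions:

```yaml
# configs/train.yaml - composed by Hydra; group values and fields are illustrative
defaults:
  - model: default       # loads configs/model/default.yaml
  - data: default        # loads configs/data/default.yaml
  - _self_

trainer:
  max_epochs: 50
  seed: 42
```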
Getting Started
- `pip install -e '.[dev]'` and `dvc init`
- `dvc remote add -d storage s3://bucket/path`
- `dvc pull` (fetch data from the remote)
- `dvc repro` (run the full pipeline)
- `mlflow ui` (view experiments at localhost:5000)
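Note that `mlflow.yaml` in the tree is a project convention rather than a file MLflow reads on its own; a plausible sketch, assuming `train.py` loads it and feeds the values to `mlflow.set_tracking_uri` and `mlflow.set_experiment` (all keys are illustrative):

```yaml
# mlflow.yaml - loaded by the project's own code; MLflow does not read this file itself
tracking_uri: http://localhost:5000
experiment_name: ml-pipeline
```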
When To Use This
- Production ML systems
- Teams needing reproducible experiments
- Projects with large datasets
- Automated retraining pipelines
- Model versioning and A/B testing
Trade-offs
- Significant setup - a DVC remote and an MLflow server are required
- Learning curve - DVC pipelines, Hydra, and MLflow concepts
- Overkill for small projects - use a notebook-first structure instead
Best Practices
- Tag releases with
dvc metrics diffoutputs - Use
params.yamlfor all hyperparameters - Run
dvc reproin CI before merge - Store data checksums, not data, in git
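One way to wire `dvc repro` into CI before merge, sketched as a GitHub Actions workflow. The workflow path, job name, secret names, and the S3-credentials setup are all hypothetical; adapt them to your remote:

```yaml
# .github/workflows/repro.yml (hypothetical path and names)
name: dvc-repro
on: [pull_request]

jobs:
  repro:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0            # full history so `dvc metrics diff` can see main
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e '.[dev]'
      - env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          dvc pull                       # fetch DVC-tracked inputs from the remote
          dvc repro                      # rebuild only stages whose declared deps changed
          dvc metrics diff origin/main   # compare metrics against the main branch
```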