EmpiricML

EmpiricML is an open-source Python framework designed to bring the rigor of empirical science to the Machine Learning development process. Are you tired of scattered Jupyter Notebooks and untracked experiments? EmpiricML provides a structured "Laboratory" environment to help you move from messy scripts to reproducible science.

The Philosophy: ML as an Empirical Science

The core idea behind EmpiricML is that building a machine learning model is an iterative, scientific process. You form a hypothesis (e.g., "Adding these specific features will decrease the error"), and you must test it in a controlled environment. EmpiricML provides that environment through the Lab class. It encapsulates everything needed for rigorous ML experimentation:

Train and test data management
Cross-validation strategies
Evaluation metrics
Standardized criteria for comparing models
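To make the idea concrete, here is a minimal sketch of what such a controlled environment pins down. It is not EmpiricML's Lab class or API: `MiniLab`, `run`, `fit_mean`, and `fit_line` are hypothetical names invented for illustration, and the data is synthetic. The point is that the data, the fold splits, and the metric are fixed once, so two hypotheses are always compared on identical terms.

```python
import random
import statistics
from dataclasses import dataclass, field


def mean_absolute_error(y_true, y_pred):
    return statistics.fmean(abs(t - p) for t, p in zip(y_true, y_pred))


@dataclass
class MiniLab:
    """Hypothetical stand-in for a lab: fixed data, fixed folds, fixed metric."""
    x: list
    y: list
    n_folds: int = 5
    seed: int = 0
    results: dict = field(default_factory=dict)

    def folds(self):
        # Shuffle with a fixed seed so every experiment sees the same splits.
        idx = list(range(len(self.y)))
        random.Random(self.seed).shuffle(idx)
        return [idx[i::self.n_folds] for i in range(self.n_folds)]

    def run(self, name, fit):
        # `fit(x_train, y_train)` must return a function mapping one x to a prediction.
        scores = []
        for val_idx in self.folds():
            held_out = set(val_idx)
            train_idx = [i for i in range(len(self.y)) if i not in held_out]
            predict = fit([self.x[i] for i in train_idx],
                          [self.y[i] for i in train_idx])
            preds = [predict(self.x[i]) for i in val_idx]
            scores.append(mean_absolute_error([self.y[i] for i in val_idx], preds))
        self.results[name] = scores  # keep per-fold scores for later comparison
        return statistics.fmean(scores)


def fit_mean(x_train, y_train):
    # Baseline hypothesis: always predict the training mean.
    mu = statistics.fmean(y_train)
    return lambda x: mu


def fit_line(x_train, y_train):
    # Competing hypothesis: a least-squares line through the training data.
    mx, my = statistics.fmean(x_train), statistics.fmean(y_train)
    sxx = sum((x - mx) ** 2 for x in x_train)
    sxy = sum((x - mx) * (y - my) for x, y in zip(x_train, y_train))
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


# Synthetic data: y is roughly 2x + 1 with Gaussian noise.
rng = random.Random(42)
xs = [float(i) for i in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]

lab = MiniLab(xs, ys)
baseline_mae = lab.run("mean", fit_mean)
line_mae = lab.run("line", fit_line)
print(f"mean baseline MAE: {baseline_mae:.2f}, line MAE: {line_mae:.2f}")
```

Because both runs go through the same `folds()` and the same metric, the per-fold scores in `lab.results` are directly comparable.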

Key Features

Experiment Tracking

Keep a detailed ledger of every run. EmpiricML automatically stores:

Metric performance and overfitting percentages
Training and inference latency
Generated predictions for downstream analysis
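As a rough illustration of what one ledger entry could hold, the sketch below records scores, latencies, and predictions for a toy model. `ExperimentRecord` and `timed` are hypothetical names, not EmpiricML's API, and the overfitting formula (relative gap between validation and training error) is an assumption; the library may define it differently.

```python
import time
from dataclasses import dataclass


@dataclass
class ExperimentRecord:
    """Hypothetical ledger entry for a single run."""
    name: str
    train_score: float      # error on the training data (lower is better)
    val_score: float        # error on held-out data
    fit_seconds: float
    predict_seconds: float
    predictions: list

    @property
    def overfitting_pct(self):
        # Assumed definition: how much worse validation error is than
        # training error, as a percentage of the training error.
        return 100.0 * (self.val_score - self.train_score) / self.train_score


def timed(fn, *args):
    # Return the function's result together with its wall-clock duration.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_train = [1.0, 2.0, 3.0]
y_val = [2.0, 4.0]

# Toy "model": the mean of the training targets.
model, fit_s = timed(lambda y: sum(y) / len(y), y_train)
val_preds, pred_s = timed(lambda n: [model] * n, len(y_val))

record = ExperimentRecord(
    name="mean-baseline",
    train_score=mae(y_train, [model] * len(y_train)),
    val_score=mae(y_val, val_preds),
    fit_seconds=fit_s,
    predict_seconds=pred_s,
    predictions=val_preds,
)
print(record.name, round(record.overfitting_pct, 1))
```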

Polars-Native Pipelines

Performance is at the heart of EmpiricML. Unlike scikit-learn pipelines, which operate on NumPy arrays, EmpiricML transformations use Polars LazyFrames. Because a lazy query is optimized as a whole and only executed when it is collected, data handling stays fast and memory-efficient even with large datasets.

Automated Workflows

Stop writing boilerplate code for standard tasks. EmpiricML automates:

Hyperparameter Optimization (HPO)
Feature Importance calculation
Automated Feature Selection

Rigorous Model Comparison

Compare experiments with statistical confidence. Define comparison criteria in your Lab class based on:

Performance Thresholds: Does Model B outperform Model A by a significant margin?
Statistical Tests: Use built-in tests to ensure your improvements aren't just noise

EmpiricML can automatically update and store your "Best Model" based on these predefined rules.
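The sketch below shows one way a threshold and a statistical test can be combined into a single "is it better?" rule. It is an illustration only: the function names, the 1% threshold, and the 0.05 significance level are assumptions, and the test used here is an exact paired sign-flip permutation test on per-fold errors, which may not be one of the tests EmpiricML ships.

```python
import itertools
import statistics


def paired_permutation_pvalue(scores_a, scores_b):
    """One-sided exact sign-flip test on paired fold errors (lower is better).

    Returns the share of sign assignments whose mean difference is at least
    as large as the observed one, i.e. how often chance alone does as well.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]  # positive: B is better
    observed = statistics.fmean(diffs)
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = statistics.fmean(s * d for s, d in zip(signs, diffs))
        if flipped >= observed - 1e-12:  # small tolerance for float ties
            hits += 1
    return hits / total


def is_better(scores_a, scores_b, min_rel_improvement=0.01, alpha=0.05):
    # B must beat A by a relative margin AND the gain must be unlikely under chance.
    mean_a, mean_b = statistics.fmean(scores_a), statistics.fmean(scores_b)
    passes_threshold = (mean_a - mean_b) / mean_a >= min_rel_improvement
    passes_test = paired_permutation_pvalue(scores_a, scores_b) <= alpha
    return passes_threshold and passes_test


# Per-fold errors from the same five folds (made-up numbers).
model_a = [0.50, 0.52, 0.49, 0.51, 0.53]
model_b = [0.45, 0.47, 0.46, 0.46, 0.48]  # consistently lower error
model_c = [0.51, 0.51, 0.50, 0.50, 0.53]  # differences look like noise

best = "A"
if is_better(model_a, model_b):
    best = "B"  # promote the challenger only when both criteria pass
print("best model:", best)
```

With five folds there are only 32 sign assignments, so the smallest achievable p-value is 1/32; more folds or repeated cross-validation give the test more resolution.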

Fast ML Baselines

Go from zero to a leaderboard in seconds. With just a few lines of code, you can evaluate up to 10 baseline models (including LightGBM, XGBoost, Random Forest, MLP, and more) to establish a performance floor for your project.

Multi-Metric Evaluation

Evaluate models on multiple metrics simultaneously. Define a list of metrics and the Lab tracks each one independently; a model counts as better only if it improves on every metric. Each metric can be configured to minimize or maximize, and multi-metric HPO is also supported.
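The "improve on every metric" rule can be sketched in a few lines. The function name, the dictionary layout, and the `"minimize"`/`"maximize"` strings are assumptions made for this example, not EmpiricML's configuration format.

```python
def improves_on_all(candidate, incumbent, directions):
    """Return True only if the candidate is strictly better on every metric.

    `directions` maps each metric name to "minimize" or "maximize".
    """
    for metric, direction in directions.items():
        new, old = candidate[metric], incumbent[metric]
        if direction == "minimize":
            if not new < old:
                return False
        elif direction == "maximize":
            if not new > old:
                return False
        else:
            raise ValueError(f"unknown direction for {metric!r}: {direction!r}")
    return True


directions = {"rmse": "minimize", "r2": "maximize"}
incumbent = {"rmse": 0.50, "r2": 0.80}

both_better = {"rmse": 0.45, "r2": 0.84}
one_worse = {"rmse": 0.45, "r2": 0.79}  # lower error but worse fit

print(improves_on_all(both_better, incumbent, directions))
print(improves_on_all(one_worse, incumbent, directions))
```

A candidate that trades one metric for another, like `one_worse`, is rejected under this rule.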

Early Stopping

Abort unpromising experiments early to save compute resources.
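The text does not say how an experiment is judged unpromising, so the following is only one plausible scheme, with hypothetical names and thresholds: evaluate folds one at a time and stop once the running mean error is clearly worse than the best result so far.

```python
def run_with_early_abort(fold_scores, best_mean, tolerance=0.10, min_folds=2):
    """Consume per-fold errors lazily and stop when the run looks hopeless.

    Assumes lower scores are better and at least one fold is provided.
    `tolerance` is the relative slack allowed over `best_mean`.
    """
    seen = []
    for score in fold_scores:
        seen.append(score)
        running = sum(seen) / len(seen)
        if len(seen) >= min_folds and running > best_mean * (1 + tolerance):
            return {"aborted": True, "folds_run": len(seen), "mean": running}
    return {"aborted": False, "folds_run": len(seen), "mean": sum(seen) / len(seen)}


evaluated = []


def lazy_folds(scores):
    # Stand-in for expensive per-fold training: each fold is only
    # "computed" when the consumer asks for it.
    for s in scores:
        evaluated.append(s)
        yield s


result = run_with_early_abort(lazy_folds([0.9, 0.8, 0.7, 0.6, 0.5]), best_mean=0.5)
print(result["aborted"], result["folds_run"], len(evaluated))
```

Because the folds are produced by a generator, the remaining folds are never evaluated once the run is aborted, which is where the compute saving comes from.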

Checkpointing

Save and restore your Lab state to pause and resume work seamlessly.
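The general save-and-restore pattern can be illustrated with the standard library. This sketch uses `pickle` on a plain dictionary; the state layout and the function names are hypothetical, and EmpiricML's actual checkpoint format and methods may differ.

```python
import pickle
import tempfile
from pathlib import Path


def save_checkpoint(state, path):
    # Serialize the whole state object to a single file.
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_checkpoint(path):
    # Only load checkpoints you created yourself: unpickling can run code.
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical lab state: the current best model and per-fold results so far.
state = {
    "best_model": "B",
    "results": {"A": [0.50, 0.52, 0.49], "B": [0.45, 0.47, 0.46]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "lab_checkpoint.pkl"
    save_checkpoint(state, path)
    restored = load_checkpoint(path)

print(restored["best_model"])
```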