Quickstart

This 5-minute beginner guide shows how to use mlvern to manage your machine learning models and experiments. Fasten your seat belts, and let’s dive in!

Initializing the Project

Create a new Python script or Jupyter notebook (preferably in a new directory) and run the following code to initialize your mlvern project:

from mlvern import Forge

forge = Forge(project="your_project_name", base_dir="your_base_directory")
forge.init()

This creates the directory structure for your project. A new folder, .mlvern_your_project_name, appears in your base directory; it contains sub-directories for datasets, models, and runs, plus a registry file.

your_base_directory/
│
├── .mlvern_your_project_name/              ## Default Folder Created by mlvern
│   ├── datasets/                           ## Preprocessed Datasets and EDA Reports
│   ├── models/                             ## Trained Models
│   ├── runs/                               ## Experiment Runs and Metrics
│   └── registry.json                       ## Universal Registry File for the Project
│
└── your_script.py (or) your_notebook.ipynb
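
To confirm the layout from a script, a minimal sketch along these lines may help. It uses only the Python standard library and assumes the folder naming shown in the tree above (.mlvern_ followed by the project name). The helper name describe_project_dir is hypothetical and is not part of mlvern.

```python
from pathlib import Path


def describe_project_dir(base_dir: str, project: str) -> list[str]:
    """Return the sorted entry names inside the .mlvern_<project> folder.

    Hypothetical helper, not part of mlvern. It assumes the folder naming
    shown in the quickstart tree. Directories get a trailing slash.
    """
    root = Path(base_dir) / f".mlvern_{project}"
    if not root.is_dir():
        raise FileNotFoundError(f"{root} not found; has forge.init() been run?")
    return sorted(
        entry.name + "/" if entry.is_dir() else entry.name
        for entry in root.iterdir()
    )
```

Calling describe_project_dir("your_base_directory", "your_project_name") after forge.init() should list the entries described above, provided the layout matches.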

Registering the Dataset

Here comes the beautiful part! You can register your dataset using the following code snippet. For demonstration purposes, we will use the popular Iris dataset from scikit-learn.

When you register a dataset, mlvern automatically inspects the data, runs Exploratory Data Analysis (EDA) using pandas-profiling, and saves the reports and plots under the datasets/ sub-directory.

from sklearn.datasets import load_iris

data = load_iris(as_frame=True)
df = data.frame
target = "target"

dataset_fp, is_new = forge.register_dataset(df, target)

This registers the Iris dataset within your project. The registered dataset is stored in the datasets folder along with the data_inspection report, the statistics_report, and the necessary plots.

datasets/
└── dataset_hash/                ## Unique Folder for the Registered Dataset
      ├── plots/                 ## Data Distribution and Visualization Plots
      ├── reports/               ## EDA and Data Inspection Reports
      └── schema.json            ## Dataset Schema and Metadata
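
To look at the stored schemas programmatically, something like the following sketch could work. It relies only on the folder layout shown above and on schema.json being valid JSON. The keys inside the file are not documented in this guide, so the sketch returns the parsed content without interpreting it. The helper name load_dataset_schemas is hypothetical and is not part of mlvern.

```python
import json
from pathlib import Path


def load_dataset_schemas(datasets_dir: str) -> dict:
    """Map each dataset folder name to the parsed content of its schema.json.

    Hypothetical helper, not part of mlvern. Folders without a schema.json
    are skipped, and the schema content is returned as-is.
    """
    schemas = {}
    for folder in sorted(Path(datasets_dir).iterdir()):
        schema_file = folder / "schema.json"
        if folder.is_dir() and schema_file.is_file():
            schemas[folder.name] = json.loads(schema_file.read_text(encoding="utf-8"))
    return schemas
```

Printing the keys of each returned schema is a reasonable first step before relying on any particular field.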

Training the Model

Now that you have registered your dataset, you can proceed to train your machine learning model. Here is an example of training a Logistic Regression model on the Iris dataset.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from mlvern import ModelTrainer

X = df.drop(columns=[target])
y = df[target]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

lr_model = LogisticRegression(max_iter=200, random_state=42)

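# Configuration logged alongside the run (mirrors the model's settings)
config_lr = {
    "model_type": "LogisticRegression",
    "max_iter": 200,
    "solver": "lbfgs",
    "random_state": 42,
}

# Train, evaluate on the validation split, and log the run
run_id_1, metrics_1 = forge.run(lr_model, X_train, y_train, X_val, y_val, config_lr, dataset_fp)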

This trains the Logistic Regression model and logs the experiment run, including model configuration and evaluation metrics, under the runs/ sub-directory. You can explore the runs/ folder to see the saved model, metrics, and configuration files.

runs/
└── run_id_1/                    ## Unique Folder for the Experiment Run
      ├── model.pkl              ## Saved Trained Model
      ├── metrics.json           ## Evaluation Metrics
      ├── config.json            ## Model Configuration
      └── logs/                  ## Training Logs
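
The JSON files in a run folder can also be read back directly. The sketch below assumes only the file names shown in the tree above (config.json and metrics.json). The helper name load_run_summary is hypothetical and is not part of mlvern, and the sketch deliberately leaves model.pkl alone because this guide does not state how the model is serialized.

```python
import json
from pathlib import Path


def load_run_summary(run_dir: str) -> dict:
    """Collect config.json and metrics.json from one run folder.

    Hypothetical helper, not part of mlvern. A missing file maps to None
    so that incomplete run folders can still be inspected.
    """
    run_path = Path(run_dir)
    summary = {}
    for name in ("config", "metrics"):
        file_path = run_path / f"{name}.json"
        if file_path.is_file():
            summary[name] = json.loads(file_path.read_text(encoding="utf-8"))
        else:
            summary[name] = None
    return summary
```

Pointing it at one run folder under runs/ should return that run's configuration and metrics as plain Python objects.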

You can repeat the training process with different models or configurations. For example, let’s train a Random Forest Classifier on the same dataset.

rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

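# Configuration logged alongside the run (mirrors the model's settings)
config_rf = {
    "model_type": "RandomForestClassifier",
    "n_estimators": 100,
    "random_state": 42,
}

# Train, evaluate on the validation split, and log the run
run_id_2, metrics_2 = forge.run(rf_model, X_train, y_train, X_val, y_val, config_rf, dataset_fp)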

This trains the Random Forest model and logs the experiment run similarly under the runs/ sub-directory. You can now compare the metrics of both models to see which one performed better.

Comparing Models

You can easily compare the performance of different models using the logged metrics. Here is an example of how to compare the Logistic Regression and Random Forest models trained earlier.

from mlvern import ModelComparator

comparator = ModelComparator(forge)

comparison_df = comparator.compare_models([run_id_1, run_id_2])
print(comparison_df)

This will display a comparison table of the evaluation metrics for both models, allowing you to easily identify which model performed better on the validation set.
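
For a quick ranking without the comparator, a plain-Python sketch like the one below may be enough. It assumes each run's metrics can be treated as a dictionary of metric names to numbers. This guide does not confirm that shape or the metric names, so the "accuracy" key in the usage note is only an illustration. The function rank_runs is hypothetical and is not part of mlvern.

```python
def rank_runs(
    metrics_by_run: dict[str, dict[str, float]],
    metric: str,
    higher_is_better: bool = True,
) -> list[tuple[str, float]]:
    """Order runs by one metric, best first.

    Hypothetical helper, not part of mlvern. Runs that do not report the
    requested metric are left out of the ranking.
    """
    scored = [
        (run_id, metrics[metric])
        for run_id, metrics in metrics_by_run.items()
        if metric in metrics
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=higher_is_better)
```

If the metrics returned by forge.run() follow that shape, a call such as rank_runs({run_id_1: metrics_1, run_id_2: metrics_2}, "accuracy") would put the stronger run first. For error-style metrics, pass higher_is_better=False.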

Congratulations! You have successfully initialized an mlvern project, registered a dataset, trained multiple models, and compared their performance.

See the Tutorials and API documentation for more advanced features and functionalities of mlvern.