Describing your model in our YAML-based format
To make your model Chap-compatible, your train and predict endpoints (as discussed here) need to be formally defined in a YAML format that follows the popular MLflow standard.
Your codebase needs to contain a file named MLproject.
After adding the MLproject file, your project typically looks like this:
model_folder/
├── MLproject (Add this file)
├── input (input data, e.g. disease, climate)
├── output (predictions / evaluations)
├── train.py (or R-file)
├── predict.py (or R-file)
└── pyproject.toml (or R-file)
MLproject
The MLproject file needs to contain:
- An entry point called train with parameters train_data and model
- An entry point called predict with parameters historic_data, future_data, model and out_file
Example MLproject file
Taken from our minimalist_example
name: minimalist_example_uv
uv_env: pyproject.toml
entry_points:
  train:
    parameters:
      train_data: str
      model: str
    command: "python train.py {train_data} {model}"
  predict:
    parameters:
      historic_data: str
      future_data: str
      model: str
      out_file: str
    command: "python predict.py {model} {historic_data} {future_data} {out_file}"
Predict and train
When Chap runs your model, it calls the train and predict commands defined in your MLproject file. Chap creates CSV files with the data and passes their filenames as command-line arguments by substituting the {parameter} placeholders in the command string.
Train
Chap calls the train entry point with two parameters:
| Parameter | Description |
|---|---|
| train_data | Filename of a CSV file containing the training data |
| model | Filename where your script must save the trained model |
Your train script should:
- Read the CSV from train_data (columns: time_period, location, disease_cases, plus any covariates)
- Fit your model on this data
- Save the trained model to the path given by model (format is up to you: JSON, pickle, RDS, etc.)
Example train script (Python):
import json
import sys

import pandas as pd

def train(training_data_filename: str, model_path: str):
    df = pd.read_csv(training_data_filename)
    # The "model" is just the mean and standard deviation of cases per location
    stats = df.groupby("location")["disease_cases"].agg(["mean", "std"]).to_dict()
    # Save the model to the path Chap passed in
    with open(model_path, "w") as f:
        json.dump(stats, f)
Call the function from the command line:
# train.py (bottom of file)
if __name__ == "__main__":
    train(sys.argv[1], sys.argv[2])
Predict
Chap calls the predict entry point with four parameters:
| Parameter | Description |
|---|---|
| historic_data | Filename of a CSV with observed data up to the prediction period |
| future_data | Filename of a CSV with the future time periods and covariates to predict for |
| model | Filename of the saved model (produced by the train step) |
| out_file | Filename where your script must write the predictions |
Your predict script should:
- Load the trained model from model
- Read historic_data and/or future_data as needed
- Generate predictions for each (location, time_period) row in future_data
- Write a CSV to out_file with columns: time_period, location, sample_0, sample_1, ..., sample_N
Each sample_i column represents one draw from the predictive distribution. Chap uses these samples to compute uncertainty intervals. Typically, models produce 100 samples.
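To make the sample format concrete, the sketch below builds a small predictions table in the shape described above and summarises each row with a median and a 90% interval. It only illustrates how sample columns can be turned into intervals: the function name summarise_samples, the example values and the chosen quantiles are assumptions, not Chap's actual implementation.

```python
import numpy as np
import pandas as pd


def summarise_samples(predictions: pd.DataFrame) -> pd.DataFrame:
    """Summarise the sample_i columns of each row with three quantiles."""
    # Select every column named sample_0, sample_1, ...
    sample_cols = [c for c in predictions.columns if c.startswith("sample_")]
    samples = predictions[sample_cols].to_numpy(dtype=float)
    # Quantiles are taken across the samples of each row (axis=1)
    low, median, high = np.quantile(samples, [0.05, 0.5, 0.95], axis=1)
    summary = predictions[["time_period", "location"]].copy()
    summary["q05"] = low
    summary["median"] = median
    summary["q95"] = high
    return summary


# Tiny hand-made example (hypothetical values): two rows, five samples each
example = pd.DataFrame({
    "time_period": ["2024-01", "2024-02"],
    "location": ["loc_a", "loc_a"],
    "sample_0": [10.0, 0.0],
    "sample_1": [20.0, 0.0],
    "sample_2": [30.0, 0.0],
    "sample_3": [40.0, 0.0],
    "sample_4": [50.0, 0.0],
})

print(summarise_samples(example))
```

With only a handful of samples the outer quantiles are coarse, which is one reason to write a larger number of samples per row.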
Example predict script (Python):
import json
import sys

import numpy as np
import pandas as pd

def predict(model_filename: str, historic_data_filename: str,
            future_data_filename: str, output_filename: str):
    # Load the per-location statistics saved by the train step
    with open(model_filename) as f:
        stats = json.load(f)
    future_df = pd.read_csv(future_data_filename)
    n_samples = 100
    rows = []
    for _, row in future_df.iterrows():
        loc = row["location"]
        # Fall back to mean 0 and std 1 for unseen locations or a zero std
        mean = stats["mean"].get(loc, 0)
        std = stats["std"].get(loc, 1) or 1
        # Draw samples from a normal distribution, clipped at zero
        samples = np.maximum(0, np.random.normal(mean, std, n_samples))
        row_data = {"time_period": row["time_period"], "location": loc}
        row_data.update({f"sample_{i}": s for i, s in enumerate(samples)})
        rows.append(row_data)
    # One row per (location, time_period), one column per sample
    pd.DataFrame(rows).to_csv(output_filename, index=False)
Call the function from the command line:
# predict.py (bottom of file)
if __name__ == "__main__":
    predict(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
How parameters map to files
When Chap executes the command "python train.py {train_data} {model}", it replaces {train_data} and {model} with the actual filenames of the CSV and model files it has created. Both train and predict run in the same working directory, so the model file saved during training is directly accessible during prediction.
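For example, given the MLproject entry:
predict:
  parameters:
    historic_data: str
    future_data: str
    model: str
    out_file: str
  command: "python predict.py {model} {historic_data} {future_data} {out_file}"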
Chap might execute:
python predict.py model.json historic_data_2024-01-01.csv future_data_2024-01-01.csv predictions_2024-01-01.csv
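The substitution can be pictured with Python's built-in str.format. This is a sketch of the idea only: how Chap performs the substitution internally is not shown here, and the filenames are the illustrative ones from the command above.

```python
# Command template as written in the MLproject predict entry point
command_template = "python predict.py {model} {historic_data} {future_data} {out_file}"

# Illustrative filenames; the real names are chosen by Chap at run time
arguments = {
    "historic_data": "historic_data_2024-01-01.csv",
    "future_data": "future_data_2024-01-01.csv",
    "model": "model.json",
    "out_file": "predictions_2024-01-01.csv",
}

# Each {name} placeholder is replaced by the value stored under that name
command = command_template.format(**arguments)
print(command)
```

Because placeholders are filled by name, the order in which your script receives its arguments is the order of the placeholders in the command string, not the order of the entries under parameters. In the example, model comes first, which is why predict.py reads it from sys.argv[1].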
Optional parameters
Your MLproject entry points can also accept these optional parameters:
| Parameter | Description |
|---|---|
| polygons | Filename of a GeoJSON file with location polygon boundaries (only passed if spatial data is available) |
| model_config | Filename of a YAML configuration file for model-specific settings |
To use these, include the corresponding {polygons} or {model_config} placeholder in your command string.
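On the script side, a placeholder added to the command string becomes one more filename argument. As a minimal sketch, assuming the polygons file is a standard GeoJSON FeatureCollection, it can be read with the standard json module. The helper name load_polygon_features and the example feature are made up for illustration, and how features are matched to the location column is not covered here.

```python
import json
import os
import tempfile


def load_polygon_features(polygons_filename: str) -> list:
    """Return the list of features from a GeoJSON FeatureCollection file."""
    with open(polygons_filename) as f:
        geojson = json.load(f)
    # A FeatureCollection stores its polygons under the "features" key
    return geojson.get("features", [])


# A tiny made-up GeoJSON document standing in for the file Chap would pass
example = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "id": "loc_a",  # hypothetical identifier
            "properties": {},
            "geometry": {
                "type": "Polygon",
                "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 0]]],
            },
        }
    ],
}

# Write the example to a temporary file and read it back
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "polygons.geojson")
    with open(path, "w") as f:
        json.dump(example, f)
    features = load_polygon_features(path)

print(len(features), features[0]["geometry"]["type"])
```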