Evaluation Workflow: Comparing Models with CLI
This guide walks through the complete workflow for evaluating models, visualizing results, and comparing metrics using the CHAP CLI.
Overview
The workflow consists of three main steps:
- eval: Run a backtest and export results to NetCDF format
- plot-backtest: Generate visualizations from evaluation results
- export-metrics: Compare metrics across multiple evaluations in CSV format
Prerequisites
- CHAP Core installed (see Setup guide)
- A dataset CSV file with disease case data
- A GeoJSON file with region polygons (optional; auto-discovered if it has the same base name as the CSV)
Verify Installation
Before starting, verify that the CLI tools are installed correctly:
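# One way to verify (this assumes the chap executable is on your PATH)
chap --help
chap eval --help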
Example Dataset
CHAP includes a small example dataset for testing and learning:
- example_data/laos_subset.csv - Monthly dengue data for 3 provinces (2010-2012)
- example_data/laos_subset.geojson - Matching polygon boundaries
This dataset contains 108 rows with rainfall, temperature, disease cases, and population data for Bokeo, Vientiane, and Savannakhet provinces.
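To take a quick look at the data before running an evaluation:

# Preview the header and first rows of the example dataset
head -n 3 example_data/laos_subset.csv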
Step 1: Create an Evaluation
Use eval to run a backtest on a model and export results to NetCDF format.
Standard Models (GitHub URL or Local Directory)
For models hosted on GitHub or cloned locally:
chap eval \
--model-name https://github.com/dhis2-chap/minimalist_example_r \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_a_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
Or using a local directory:
chap eval \
--model-name /path/to/minimalist_example_r \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_a_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
Chapkit Models
Chapkit models are REST API-based models that follow the chapkit specification. See Running models with chapkit for more details.
From a running chapkit service (URL):
chap eval \
--model-name http://localhost:8000 \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/chapkit_eval.nc \
--run-config.is-chapkit-model \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
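Before running, you can optionally confirm the service is reachable. This assumes the service exposes a /health endpoint, as noted in the Tips at the end of this guide:

curl http://localhost:8000/health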
From a local chapkit model directory (auto-starts the service):
When you provide a directory path with --run-config.is-chapkit-model, CHAP automatically:
- Starts a FastAPI dev server from the model directory using uv run fastapi dev
- Waits for the service to become healthy
- Runs the evaluation
- Stops the service when complete

(A rough manual equivalent of this flow is sketched after the command below.)
chap eval \
--model-name /path/to/your/chapkit/model \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/chapkit_eval.nc \
--run-config.is-chapkit-model \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
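For reference, the auto-start flow is roughly equivalent to the manual steps below. This is a sketch, not CHAP's exact implementation: port 8000 is the FastAPI dev default, and the /health path follows the Tips at the end of this guide.

# Start the service from the model directory (background job)
cd /path/to/your/chapkit/model
uv run fastapi dev &
# Check the health endpoint (repeat until it responds)
curl http://localhost:8000/health
# Run the evaluation against the running service
chap eval --model-name http://localhost:8000 --run-config.is-chapkit-model ...
# Stop the service when finished
kill $!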
Parameters
| Parameter | Description | Default |
|---|---|---|
| --model-name | Model path, GitHub URL, or chapkit service URL | Required |
| --dataset-csv | Path to CSV with disease data | Required |
| --output-file | Path for output NetCDF file | Required |
| --backtest-params.n-periods | Forecast horizon (periods ahead) | 3 |
| --backtest-params.n-splits | Number of train/test splits | 7 |
| --backtest-params.stride | Step size between splits | 1 |
| --model-configuration-yaml | Optional YAML with model config | None |
| --run-config.is-chapkit-model | Flag to indicate a chapkit model | false |
| --run-config.ignore-environment | Skip environment setup | false |
| --run-config.debug | Enable debug logging | false |
| --run-config.run-directory-type | Directory handling: latest, timestamp, or use_existing | timestamp |
| --historical-context-years | Years of historical data for plot context | 6 |
| --data-source-mapping | JSON file mapping model covariate names to CSV columns | None |
For detailed parameter descriptions and examples, see the eval Reference.
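As an illustration of --data-source-mapping, a mapping file might look like the sketch below. The covariate and column names here are hypothetical, and the exact schema is described in the eval Reference:

# Hypothetical mapping: keys are model covariate names, values are CSV columns
cat > ./mapping.json <<'EOF'
{
  "rainfall": "precip_mm",
  "mean_temperature": "temp_c"
}
EOF

Then pass --data-source-mapping ./mapping.json to chap eval.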
GeoJSON Auto-Discovery
If your dataset is vietnam_data.csv, CHAP will automatically look for vietnam_data.geojson in the same directory.
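For example, with the layout below the GeoJSON is picked up without any extra flags:

data/
  vietnam_data.csv
  vietnam_data.geojson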
Step 2: Visualize the Evaluation
Use plot-backtest to generate visualizations from the evaluation results:
chap plot-backtest \
--input-file ./results/model_a_eval.nc \
--output-file ./results/model_a_plot.html \
--plot-type metrics_dashboard
Available Plot Types
| Plot Type | Description |
|---|---|
| metrics_dashboard | Dashboard showing various metrics by forecast horizon and time period |
| evaluation_plot | Evaluation summary plot with forecasts vs observations and uncertainty bands |
| ratio_of_samples_above_truth | Shows forecast bias relative to observations |
Output Formats
The output format is determined by file extension:
- .html - Interactive HTML (recommended)
- .png - Static PNG image
- .svg - Vector SVG image
- .pdf - PDF document
- .json - Vega JSON specification
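For example, to render the same dashboard as a static image, change only the extension:

chap plot-backtest \
--input-file ./results/model_a_eval.nc \
--output-file ./results/model_a_plot.png \
--plot-type metrics_dashboard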
Step 3: Create Another Evaluation
Run the same process with a different model for comparison:
chap eval \
--model-name https://github.com/dhis2-chap/chap_auto_ewars_weekly \
--dataset-csv ./data/vietnam_data.csv \
--output-file ./results/model_b_eval.nc \
--backtest-params.n-periods 3 \
--backtest-params.n-splits 7
Step 4: Export and Compare Metrics
Use export-metrics to compute metrics from multiple evaluations and export to CSV:
chap export-metrics \
--input-files example_data/example_evaluation.nc \
--input-files example_data/example_evaluation_2.nc \
--output-file ./comparison_doctest.csv
Output Format
The CSV contains one row per evaluation with metadata and metric columns:
filename,model_name,model_version,rmse_aggregate,mae_aggregate,crps,ratio_within_10th_90th,ratio_within_25th_75th,test_sample_count
model_a_eval.nc,minimalist_example_r,1.0.0,45.2,32.1,0.045,0.85,0.65,168
model_b_eval.nc,chap_auto_ewars_weekly,2.0.0,38.7,28.4,0.038,0.88,0.70,168
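To eyeball the comparison in a terminal, the CSV can be aligned into readable columns (column ships with most Unix systems):

# Pretty-print the comparison CSV
column -s, -t ./comparison_doctest.csv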
Available Metrics
| Metric ID | Description |
|---|---|
| rmse_aggregate | Root Mean Squared Error (across all data) |
| mae_aggregate | Mean Absolute Error (across all data) |
| crps | Continuous Ranked Probability Score |
| ratio_within_10th_90th | Coverage ratio for the 10th-90th percentile interval |
| ratio_within_25th_75th | Coverage ratio for the 25th-75th percentile interval |
| test_sample_count | Number of test samples |
Selecting Specific Metrics
To export only specific metrics:
chap export-metrics \
--input-files example_data/example_evaluation.nc \
--input-files example_data/example_evaluation_2.nc \
--output-file ./comparison_specific_doctest.csv \
--metric-ids rmse_aggregate \
--metric-ids mae_aggregate \
--metric-ids crps
Complete Example: Standard Models
Here's a complete workflow using the included example dataset (example_data/laos_subset.csv) with a minimal model for fast testing:
# Step 1: Evaluate model
chap eval \
--model-name external_models/naive_python_model_uv \
--dataset-csv example_data/laos_subset.csv \
--output-file ./eval_doctest.nc \
--backtest-params.n-splits 2 \
--backtest-params.n-periods 1
# Step 2: Plot results
chap plot-backtest \
--input-file ./eval_doctest.nc \
--output-file ./plot_doctest.html
# Step 3: Export metrics
chap export-metrics \
--input-files ./eval_doctest.nc \
--output-file ./metrics_doctest.csv
The GeoJSON file example_data/laos_subset.geojson is automatically discovered since it has the same base name as the CSV.
Complete Example: Chapkit Models
Here's a workflow using chapkit models, including both a running service and a local directory:
Option A: Using a running chapkit service
First, start your chapkit model service (e.g., using Docker):
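For example (the image name below is a placeholder for your model's image; the service should listen on port 8000 to match the URL used in the next step):

docker run --rm -p 8000:8000 your-chapkit-model-image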
Then run the evaluation:
# Evaluate the chapkit model
chap eval \
--model-name http://localhost:8000 \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_chapkit.nc \
--run-config.is-chapkit-model \
--backtest-params.n-splits 3
# Plot results
chap plot-backtest \
--input-file ./eval_chapkit.nc \
--output-file ./plot_chapkit.html
Option B: Using a local chapkit model directory (auto-start)
If you have a chapkit model in a local directory, CHAP can automatically start and stop the service:
# Clone or create your chapkit model
git clone https://github.com/your-org/your-chapkit-model /path/to/chapkit-model
# Evaluate with auto-start (CHAP starts the service automatically)
chap eval \
--model-name /path/to/chapkit-model \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_local_chapkit.nc \
--run-config.is-chapkit-model \
--backtest-params.n-splits 3
# Plot results
chap plot-backtest \
--input-file ./eval_local_chapkit.nc \
--output-file ./plot_local_chapkit.html
Comparing chapkit and standard models
You can compare chapkit models with standard models using export-metrics:
# Evaluate a standard model
chap eval \
--model-name https://github.com/dhis2-chap/minimalist_example_r \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_standard.nc \
--backtest-params.n-splits 3
# Evaluate a chapkit model
chap eval \
--model-name /path/to/chapkit-model \
--dataset-csv ./example_data/laos_subset.csv \
--output-file ./eval_chapkit.nc \
--run-config.is-chapkit-model \
--backtest-params.n-splits 3
# Compare both
chap export-metrics \
--input-files ./eval_standard.nc \
--input-files ./eval_chapkit.nc \
--output-file ./comparison.csv
Tips
- Consistent parameters: Use the same n-periods and n-splits when comparing models
- Same dataset: Always use identical datasets for fair comparison
- Multiple runs: Consider running evaluations with different random seeds for robustness
- Metric interpretation: Lower RMSE/MAE/CRPS is better; higher coverage ratios indicate better-calibrated uncertainty
- Chapkit auto-start: When using local chapkit directories, ensure uv is installed and the model directory has a valid FastAPI app structure with a /health endpoint