
Evaluation Pipeline

This document explains how model evaluation (backtesting) works internally in CHAP, with a focus on the expanding window cross-validation strategy used to split time series data.

Overview

The evaluation pipeline answers the question: "How well does a model predict disease cases on data it has not seen?"

It does this by:

  1. Splitting a historical dataset into training and test portions
  2. Training the model on the training data
  3. Generating predictions for each test window
  4. Comparing predictions against observed values (ground truth)

Because disease surveillance data is a time series, we cannot use random train/test splits. Instead, CHAP uses expanding window cross-validation, where the training data always precedes the test data chronologically.

Pipeline Architecture

The evaluation flow from entry point to results:

Evaluation.create()                          # evaluation.py
    |
    +--> backtest()                          # prediction_evaluator.py
    |       |
    |       +--> train_test_generator()      # dataset_splitting.py
    |       |       Returns (train_set, splits_iterator)
    |       |
    |       +--> estimator.train(train_set)
    |       |       Returns predictor
    |       |
    |       +--> for each split:
    |               predictor.predict(historic, future)
    |               Merge predictions with ground truth
    |               Yield DataSet[SamplesWithTruth]
    |
    +--> Evaluation.from_samples_with_truth()
            Wraps results in an Evaluation object
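
For orientation, the snippet below sketches how this flow might be driven from Python. It is a hedged illustration: Evaluation.create() and to_file() appear in this document, but the keyword argument names (estimator, dataset, prediction_length, n_test_sets, stride) are assumptions and may not match the actual signatures.

from chap_core.assessment.evaluation import Evaluation

def run_backtest_example(estimator, dataset):
    # Hedged sketch -- keyword names are assumptions, not the verified signature.
    evaluation = Evaluation.create(
        estimator=estimator,        # anything exposing train() -> predictor
        dataset=dataset,            # historical DataSet with disease_cases and covariates
        prediction_length=3,        # periods per test window
        n_test_sets=3,              # number of expanding-window splits
        stride=1,                   # periods to advance between windows
    )
    evaluation.to_file("evaluation.nc")  # NetCDF export (see evaluation.py below)
    return evaluation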

Expanding Window Cross-Validation

The Problem

Standard k-fold cross-validation randomly assigns data points to folds. This is invalid for time series because:

  • Models would train on future data and predict the past
  • Temporal autocorrelation would leak information between folds

The Strategy

CHAP uses an expanding window approach where:

  • The model is trained once on an initial training set
  • Multiple test windows are created by sliding forward through the data
  • For each test window, the model receives all historical data up to that point

The key parameters are:

  • prediction_length: how many periods each test window covers
  • n_test_sets: how many test windows to create
  • stride: how many periods to advance between windows

How Split Indices Are Calculated

The train_test_generator function computes the split index by working backwards from the end of the dataset:

split_idx = -(prediction_length + (n_test_sets - 1) * stride + 1)

This ensures the last test window ends at the final period of the dataset.

Concrete Example

Consider a dataset with 20 monthly periods (indices 0-19), prediction_length=3, n_test_sets=3, stride=1:

split_idx = -(3 + (3 - 1) * 1 + 1) = -6  -> index 14

Split 0: historic = [0..14],  future = [15, 16, 17]
Split 1: historic = [0..15],  future = [16, 17, 18]
Split 2: historic = [0..16],  future = [17, 18, 19]

Train set = [0..14]  (same as split 0 historic data)

Visually, with T = train, H = extra historic context, F = future/test:

Period:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19

Train:    T  T  T  T  T  T  T  T  T  T  T  T  T  T  T
Split 0:  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  F  F  F
Split 1:  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  H  F  F  F
Split 2:  T  T  T  T  T  T  T  T  T  T  T  T  T  T  T  H  H  F  F  F

Note how the historic data expands with each split while the future window slides forward.
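
The arithmetic above can be reproduced in a few lines of plain Python. The helper below is a minimal illustration of the indexing scheme described in this section, not the actual train_test_generator implementation:

def expanding_window_indices(n_periods, prediction_length, n_test_sets, stride):
    # Illustrative re-implementation of the split arithmetic described above.
    split_idx = n_periods - (prediction_length + (n_test_sets - 1) * stride + 1)
    windows = []
    for i in range(n_test_sets):
        future_start = split_idx + 1 + i * stride
        historic = list(range(future_start))                               # [0 .. future_start - 1]
        future = list(range(future_start, future_start + prediction_length))
        windows.append((historic, future))
    return split_idx, windows

split_idx, windows = expanding_window_indices(20, prediction_length=3, n_test_sets=3, stride=1)
print(split_idx)                 # 14
for historic, future in windows:
    print(historic[-1], future)  # 14 [15, 16, 17], then 15 [16, 17, 18], then 16 [17, 18, 19]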

What the Model Sees

For each test split, the predictor receives:

  • historic_data: the full dataset (all features, including disease_cases) up to the start of that test window
  • future_data (masked): future covariates (e.g. climate data) without disease_cases -- this is what the model uses to make predictions
  • future_data (truth): full future data including disease_cases -- used after prediction to evaluate accuracy
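
To make the distinction concrete, here is a minimal sketch of the three views for a single split, using plain dictionaries keyed by period index. The real objects are DataSet instances, and the covariate names and numbers are placeholders, not real data:

historic_data = {
    # ... periods 0-14, each with covariates and the observed target ...
    14: {"rainfall": 120.0, "temperature": 27.1, "disease_cases": 42},
}
future_data_masked = {
    15: {"rainfall": 98.0, "temperature": 26.4},    # no disease_cases -> model input
    16: {"rainfall": 110.0, "temperature": 25.9},
    17: {"rainfall": 85.0, "temperature": 26.7},
}
future_data_truth = {
    15: {"rainfall": 98.0, "temperature": 26.4, "disease_cases": 37},   # used only for scoring
    16: {"rainfall": 110.0, "temperature": 25.9, "disease_cases": 41},
    17: {"rainfall": 85.0, "temperature": 26.7, "disease_cases": 33},
}
# predictor.predict(historic_data, future_data_masked) -> samples, compared against future_data_truth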

Key Components

chap_core/assessment/dataset_splitting.py

Handles splitting datasets into train/test portions:

  • train_test_generator() -- main function implementing expanding window splits
  • train_test_split() -- single split at one time point
  • split_test_train_on_period() -- generates splits at multiple split points
  • get_split_points_for_data_set() -- computes evenly-spaced split points
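
A hedged usage sketch of train_test_generator(), based on the return shape shown in the pipeline diagram (a training set plus an iterator of (historic, masked_future, future_truth) tuples). The keyword names mirror the parameters described earlier and may differ from the actual signature:

from chap_core.assessment.dataset_splitting import train_test_generator

def iterate_splits(dataset, prediction_length=3, n_test_sets=3, stride=1):
    # Keyword names are assumptions taken from the parameter list above.
    train_set, splits = train_test_generator(
        dataset,
        prediction_length=prediction_length,
        n_test_sets=n_test_sets,
        stride=stride,
    )
    for historic, masked_future, future_truth in splits:
        # historic grows with each split; masked_future lacks disease_cases
        yield historic, masked_future, future_truth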

chap_core/assessment/prediction_evaluator.py

Runs the model and collects predictions:

  • backtest() -- trains model once, yields predictions for each split
  • evaluate_model() -- full evaluation with GluonTS metrics and PDF report
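
A hedged sketch of consuming backtest() directly; per the pipeline diagram it trains the estimator once and then yields one DataSet[SamplesWithTruth] per split (the argument names are assumptions):

from chap_core.assessment.prediction_evaluator import backtest

def collect_backtest_results(estimator, dataset, prediction_length=3, n_test_sets=3, stride=1):
    # backtest() trains once, then yields predictions merged with ground truth per split.
    return list(
        backtest(
            estimator,
            dataset,
            prediction_length=prediction_length,
            n_test_sets=n_test_sets,
            stride=stride,
        )
    )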

chap_core/assessment/evaluation.py

High-level evaluation abstraction:

  • Evaluation.create() -- end-to-end factory: runs backtest and wraps results
  • Evaluation.from_samples_with_truth() -- builds evaluation from raw prediction results
  • Evaluation.to_file() / from_file() -- NetCDF serialization for sharing results
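
A short round-trip sketch built on the methods listed above; any arguments to to_file()/from_file() beyond a path are assumptions:

from chap_core.assessment.evaluation import Evaluation

def save_and_reload(evaluation: Evaluation, path: str = "evaluation.nc") -> Evaluation:
    # Serialize forecasts and observations to NetCDF, then reload them.
    evaluation.to_file(path)
    return Evaluation.from_file(path)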

Code Flow: Evaluation.create()

Step-by-step walkthrough of what happens when Evaluation.create() is called (e.g. from the CLI chap evaluate command):

  1. backtest() is called with the estimator and dataset
  2. Inside backtest(), train_test_generator() computes the split index and creates:
    • A training set (data up to the first split point)
    • An iterator of (historic, masked_future, future_truth) tuples
  3. The estimator is trained once on the training set, producing a predictor
  4. For each test split, the predictor generates samples and they are merged with ground truth into SamplesWithTruth objects
  5. Back in create(), train_test_generator() is called again to determine the last training period
  6. from_samples_with_truth() assembles an Evaluation object containing:
    • BackTest with all forecasts and observations
    • Historical observations for plotting context
  7. The Evaluation can then be exported to NetCDF, used for metric computation, or visualized
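
The walkthrough condenses into the following pseudocode. It is a sketch of the flow described above, not the real implementation; the function signatures and the arguments to from_samples_with_truth() are assumptions:

from chap_core.assessment.dataset_splitting import train_test_generator
from chap_core.assessment.prediction_evaluator import backtest
from chap_core.assessment.evaluation import Evaluation

def create_evaluation_sketch(estimator, dataset, prediction_length, n_test_sets, stride):
    # Steps 1-4: split, train once, predict each window, merge with ground truth
    samples_with_truth = list(
        backtest(estimator, dataset,
                 prediction_length=prediction_length,
                 n_test_sets=n_test_sets,
                 stride=stride)
    )
    # Step 5: recompute the splits to recover the last training period
    train_set, _ = train_test_generator(dataset,
                                        prediction_length=prediction_length,
                                        n_test_sets=n_test_sets,
                                        stride=stride)
    # Steps 6-7: wrap forecasts, observations, and historical context for export or plotting
    return Evaluation.from_samples_with_truth(samples_with_truth, train_set)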