Creating Feature Generators
This guide explains how to create custom feature generators that compute derived features for Chap datasets.
Overview
Feature generators produce computed features (e.g., cluster assignments, derived covariates) that are added to a dataset before it is passed to a model. The system provides:
- Automatic registration: A decorator registers each generator so that it is discoverable throughout Chap
- DataSet-level API: Generators receive a full DataSet and return an augmented DataSet
- Config-driven activation: Models opt in to generated features using a gen: prefix in required_covariates
Quick Start
Here is a minimal feature generator that adds a constant column:
from chap_core.feature_generators import (
    FeatureGenerator,
    FeatureGeneratorSpec,
    feature_generator,
)
from chap_core.spatio_temporal_data.temporal_dataclass import DataSet


@feature_generator()
class MyConstantGenerator(FeatureGenerator):
    spec = FeatureGeneratorSpec(
        generator_id="my_constant",
        name="My Constant Feature",
        description="Adds a constant column to the dataset.",
    )

    def generate(self, dataset: DataSet) -> DataSet:
        df = dataset.to_pandas()
        df["my_constant"] = 1.0
        return DataSet.from_pandas(df)
A model would use this by including gen:my_constant in its required_covariates.
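To make the gen: convention concrete, the sketch below shows one way a covariate list could be split into regular covariates and generator IDs. The helper name split_covariates is hypothetical and is not Chap's parse_generated_covariates, whose exact signature this guide does not show; only the gen: prefix itself comes from the guide.

```python
GEN_PREFIX = "gen:"  # prefix described in this guide


def split_covariates(required_covariates):
    """Split covariate names into (regular, generator_ids).

    Hypothetical helper for illustration only; not part of chap_core.
    """
    regular, generator_ids = [], []
    for name in required_covariates:
        if name.startswith(GEN_PREFIX):
            # Strip the prefix to recover the generator_id.
            generator_ids.append(name[len(GEN_PREFIX):])
        else:
            regular.append(name)
    return regular, generator_ids


regular, generator_ids = split_covariates(
    ["rainfall", "mean_temperature", "gen:my_constant"]
)
print(regular)        # ['rainfall', 'mean_temperature']
print(generator_ids)  # ['my_constant']
```

Keeping the two lists separate mirrors the behaviour described later in this guide: regular covariates are validated against the raw data, while generated ones are computed before the model runs.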
FeatureGeneratorSpec
from chap_core.feature_generators import FeatureGeneratorSpec

spec = FeatureGeneratorSpec(
    generator_id="unique_id",  # Used in config as gen:<generator_id>
    name="Display Name",  # Human-readable name
    description="What this generator computes",
)
The generate() Method
The generate() method receives the full DataSet and must return a new DataSet with the generated feature(s) added as column(s). The typical pattern is:
- Convert to pandas with dataset.to_pandas()
- Compute and add columns
- Return DataSet.from_pandas(df)
The generator has access to all locations and time periods, which allows cross-location features like clustering.
Complete Example: Location Population Rank
This example ranks locations by their mean population:
from chap_core.feature_generators import (
    FeatureGenerator,
    FeatureGeneratorSpec,
    feature_generator,
)
from chap_core.spatio_temporal_data.temporal_dataclass import DataSet


@feature_generator()
class PopulationRankGenerator(FeatureGenerator):
    """Ranks locations by average population."""

    spec = FeatureGeneratorSpec(
        generator_id="population_rank",
        name="Population Rank",
        description="Assigns a rank to each location based on mean population.",
    )

    def generate(self, dataset: DataSet) -> DataSet:
        df = dataset.to_pandas()
        mean_pop = df.groupby("location")["population"].mean().rank()
        df["population_rank"] = df["location"].map(mean_pop)
        return DataSet.from_pandas(df)
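To see what the groupby, rank, and map steps produce, here is a standalone sketch on a toy pandas DataFrame. The values and the time_period column name are made up for illustration, and the sketch uses plain pandas rather than a Chap DataSet.

```python
import pandas as pd

# Toy long-format frame: one row per location and time period (made-up values).
df = pd.DataFrame(
    {
        "location": ["A", "A", "B", "B", "C", "C"],
        "time_period": ["2023-01", "2023-02"] * 3,
        "population": [100.0, 110.0, 500.0, 520.0, 50.0, 54.0],
    }
)

# Mean population per location, ranked ascending (1.0 = smallest mean).
mean_pop_rank = df.groupby("location")["population"].mean().rank()

# Broadcast the per-location rank back onto every row of that location.
df["population_rank"] = df["location"].map(mean_pop_rank)

print(df[["location", "population_rank"]].drop_duplicates())
```

Because Series.rank() ranks in ascending order by default, the least populous location receives rank 1.0. Passing ascending=False to rank() would instead give rank 1.0 to the most populous location.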
Built-in Generator: Seasonality Cluster
The seasonality_cluster generator clusters locations by their normalized seasonal disease profiles using KMeans. It adds a cluster_id column to the dataset.
from chap_core.feature_generators import get_feature_generator
cls = get_feature_generator("seasonality_cluster")
assert cls is not None
print(f"ID: {cls.spec.generator_id}")
print(f"Name: {cls.spec.name}")
Registration and Discovery
The @feature_generator() Decorator
The decorator registers your generator class when the module is imported:
from chap_core.feature_generators import feature_generator, FeatureGenerator, FeatureGeneratorSpec


@feature_generator()
class RegisteredGenerator(FeatureGenerator):
    spec = FeatureGeneratorSpec(
        generator_id="registered_example",
        name="Registered Example",
        description="Example of a registered generator",
    )

    def generate(self, dataset):
        return dataset
File Location
Place your generator file in chap_core/feature_generators/ and add an import to _discover_feature_generators() in chap_core/feature_generators/__init__.py:
def _discover_feature_generators():
    """Import all feature generator modules to trigger registration."""
    from chap_core.feature_generators import (
        seasonality_cluster,
        my_new_generator,  # Add your module here
    )
Integration with Models
When a model lists gen:seasonality_cluster in its required_covariates, Chap automatically:
- Parses the gen: prefix to identify the generator
- Runs the generator on the dataset before passing data to the model
- For predictions, copies location-constant generated features from historic data to future data
- Skips gen: covariates during input validation (they are not expected in the raw data)
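The prediction step above can be pictured with a small, library-independent sketch. It assumes rows are plain dicts with a "location" key and that the generated feature takes a single value per location. The helper copy_location_constant_feature is hypothetical and does not reflect how Chap implements this internally.

```python
def copy_location_constant_feature(historic_rows, future_rows, feature):
    """Copy a per-location constant feature from historic rows onto future rows.

    Hypothetical helper for illustration; rows are dicts with a "location" key.
    """
    value_by_location = {}
    for row in historic_rows:
        location, value = row["location"], row[feature]
        # The feature is location-constant only if every historic row agrees.
        if value_by_location.setdefault(location, value) != value:
            raise ValueError(f"{feature!r} is not constant for location {location!r}")
    # Build new dicts so the caller's future rows are not mutated.
    return [{**row, feature: value_by_location[row["location"]]} for row in future_rows]


historic = [
    {"location": "A", "time_period": "2023-01", "cluster_id": 0},
    {"location": "A", "time_period": "2023-02", "cluster_id": 0},
    {"location": "B", "time_period": "2023-01", "cluster_id": 1},
]
future = [
    {"location": "A", "time_period": "2023-03"},
    {"location": "B", "time_period": "2023-03"},
]

augmented = copy_location_constant_feature(historic, future, "cluster_id")
print([row["cluster_id"] for row in augmented])  # [0, 1]
```

In this sketch, a future location that never appears in the historic rows raises a KeyError, since there is no value to copy for it.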
Using the Registry
from chap_core.feature_generators import (
    get_feature_generator,
    get_feature_generators_registry,
    list_feature_generators,
)

# Get a specific generator by ID
cls = get_feature_generator("seasonality_cluster")
assert cls is not None

# List all generators with metadata
for info in list_feature_generators():
    print(f" {info['id']}: {info['name']}")

# Get full registry
registry = get_feature_generators_registry()
print(f"Registered generators: {list(registry.keys())}")
Testing Your Generator
Write tests in tests/test_feature_generators.py using the existing health_population_data fixture:
def test_my_generator(health_population_data):
    from chap_core.feature_generators.my_generator import MyGenerator

    generator = MyGenerator()
    result = generator.generate(health_population_data)
    assert "my_feature" in result.field_names()
Reference
Existing Implementations
| File | Description |
|---|---|
| seasonality_cluster.py | Clusters locations by seasonal disease profiles using KMeans |
API Summary
from chap_core.feature_generators import (
    feature_generator,  # Decorator to register generators
    FeatureGenerator,  # Base class (abstract)
    FeatureGeneratorSpec,  # Metadata dataclass
    get_feature_generator,  # Get generator class by ID
    get_feature_generators_registry,  # Get all registered generators
    list_feature_generators,  # List generators with metadata
    parse_generated_covariates,  # Split gen: prefixed from regular covariates
    apply_feature_generators,  # Apply generators to a dataset
)