Use this guide when you want to compare estimators on a synthetic benchmark where the true average response curve is known.
## What `evaluate_multiple_dataset_seeds` does
This helper answers a specific question: if I keep the estimator definition fixed, how does performance change when I regenerate the synthetic dataset with different random seeds?
That is useful because a single synthetic dataset draw can be misleading. Some seeds produce easier treatment assignments or cleaner outcome surfaces than others. Running across multiple dataset seeds gives you a more stable view of model quality.
## Setup

We will compare two estimators on `SyntheticDataset2`:
| Model | What it learns |
|---|---|
| `DirectNoCovariates` | A naive observational baseline that fits \(\mathbb{E}[Y \mid T=t]\) and ignores confounders \(X\) |
| `DirectRegressor` | A direct response model that fits \(\mathbb{E}[Y \mid X, T=t]\) and averages over the fitted covariate sample |
This is a good teaching example because `SyntheticDataset2` is confounded by construction. Treatment depends on the covariates, and the outcome depends on them too. The naive model therefore mixes confounding with the treatment effect, while the direct model has a chance to adjust for it.
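The naive bias is easy to reproduce outside skcausal. Here is a tiny self-contained sketch (plain NumPy, with a hand-built data-generating process whose coefficients are illustrative assumptions) where the true effect is 0.5 but the naive contrast lands far above it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)                     # confounder
t = (x + rng.normal(size=x.size)) > 0            # treatment depends on x
y = 2.0 * x + 0.5 * t + rng.normal(size=x.size)  # true treatment effect: 0.5

# The naive contrast E[Y | T=1] - E[Y | T=0] absorbs the effect of x,
# because treated units systematically have larger x.
print(y[t].mean() - y[~t].mean())  # well above 0.5
```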
`evaluate_multiple_dataset_seeds` is designed for synthetic datasets that subclass `BaseSyntheticDataset`, because the metrics compare predicted curves against the dataset's known ground-truth response.
### Step 1: Imports

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from skcausal.causal_estimators import DirectNoCovariates, DirectRegressor
from skcausal.causal_estimators.benchmarking import evaluate_multiple_dataset_seeds
from skcausal.causal_estimators.benchmarking.metrics import MAE, RMSE
from skcausal.datasets import SyntheticDataset2
```
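### Step 2: Define the outcome regressor

Both estimators delegate outcome modelling to a scikit-learn regressor, built fresh for each estimator through a small `make_regressor` helper. A minimal sketch, assuming a random forest as the outcome model; the exact model and hyperparameters are an illustrative choice, and any scikit-learn regressor with `fit`/`predict` works:

```python
def make_regressor():
    # A fresh regressor per estimator, so the two models share no state.
    # n_estimators and random_state here are illustrative defaults.
    return RandomForestRegressor(n_estimators=200, random_state=0)
```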
### Step 3: Create the synthetic benchmark and metrics
```python
dataset = SyntheticDataset2(
    n=1500,
    n_features=6,
    random_state=0,
)

metrics = [
    MAE(n_treatments=128, random_state=0),
    RMSE(n_treatments=128, random_state=0),
]
metric_columns = [str(metric) for metric in metrics]
metric_display_names = {
    metric_columns[0]: "MAE",
    metric_columns[1]: "RMSE",
}

dataset_seeds = [0, 1, 2, 3, 4]
```

Here:

- `dataset` defines the data-generating process, not just one observed table.
- `metrics` specify how to compare each fitted estimator against the true average response curve.
- `dataset_seeds` controls how many independent synthetic dataset draws we evaluate.
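The metric objects double as column names. If you want to see exactly which columns the benchmark will produce, print `metric_columns`; the strings match the headers of the results table in the next step:

```python
print(metric_columns)
# ['MAE(n_treatments=128, random_state=0)', 'RMSE(n_treatments=128, random_state=0)']
```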
### Step 4: Benchmark the two estimators
The helper benchmarks one estimator at a time, so we wrap it in a small function and then concatenate the outputs.
```python
def benchmark_model(model_name, estimator):
    results = evaluate_multiple_dataset_seeds(
        dataset=dataset,
        estimator=estimator,
        metrics=metrics,
        random_states=dataset_seeds,
    )
    results["model"] = model_name
    return results


naive_results = benchmark_model(
    "DirectNoCovariates",
    DirectNoCovariates(outcome_regressor=make_regressor()),
)
direct_results = benchmark_model(
    "DirectRegressor",
    DirectRegressor(outcome_regressor=make_regressor()),
)

benchmark_results = pd.concat(
    [naive_results, direct_results],
    ignore_index=True,
)
benchmark_results
```

|  | MAE(n_treatments=128, random_state=0) | RMSE(n_treatments=128, random_state=0) | dataset_seed | model |
|---|---|---|---|---|
| 0 | 0.298278 | 0.377048 | 0 | DirectNoCovariates |
| 1 | 0.469495 | 0.564435 | 1 | DirectNoCovariates |
| 2 | 0.820632 | 0.957171 | 2 | DirectNoCovariates |
| 3 | 0.900588 | 1.016855 | 3 | DirectNoCovariates |
| 4 | 0.733818 | 0.845488 | 4 | DirectNoCovariates |
| 5 | 0.125409 | 0.157969 | 0 | DirectRegressor |
| 6 | 0.185232 | 0.216804 | 1 | DirectRegressor |
| 7 | 0.353658 | 0.403821 | 2 | DirectRegressor |
| 8 | 0.365656 | 0.399559 | 3 | DirectRegressor |
| 9 | 0.301836 | 0.327065 | 4 | DirectRegressor |
Each row is one dataset seed. The metric columns are named from the metric objects passed in `metrics`, and `dataset_seed` tells you which regenerated synthetic dataset produced that score.
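The full metric reprs are unambiguous but verbose. For display, the `metric_display_names` mapping from Step 3 shortens them without modifying the stored results:

```python
# Display-only rename; benchmark_results keeps the full metric reprs.
benchmark_results.rename(columns=metric_display_names)
```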
### Step 5: Summarize the comparison
```python
summary = (
    benchmark_results
    .groupby("model")[metric_columns]
    .agg(["mean", "std"])
    .rename(columns=metric_display_names, level=0)
    .round(3)
)
summary
```

| model | MAE mean | MAE std | RMSE mean | RMSE std |
|---|---|---|---|---|
| DirectNoCovariates | 0.645 | 0.253 | 0.752 | 0.272 |
| DirectRegressor | 0.266 | 0.106 | 0.301 | 0.110 |
The main pattern to look for is whether the direct model beats the naive baseline on average and whether that advantage is consistent across seeds.
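To make the consistency check concrete, you can count seed-by-seed wins with a pivot over the columns already in `benchmark_results`:

```python
# One row per seed, one MAE column per model.
mae_col = metric_columns[0]
per_seed = benchmark_results.pivot(
    index="dataset_seed", columns="model", values=mae_col
)
wins = int((per_seed["DirectRegressor"] < per_seed["DirectNoCovariates"]).sum())
print(f"DirectRegressor has the lower MAE on {wins} of {len(per_seed)} seeds")
```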
## A reusable pattern
Once this pattern works, you can swap in:
- Another synthetic dataset, such as a new subclass of `BaseSyntheticDataset`.
- Another estimator, such as `GPS` or `DoublyRobustPseudoOutcome` (see the sketch after this list).
- Another metric, including your own subclass of `AverageResponseMetric`.
- A longer list of dataset seeds for a more stable benchmark.
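For instance, a third estimator is one more call to the same helper. This is a sketch, not a verified recipe: it assumes `GPS` is importable from `skcausal.causal_estimators` like the estimators above and accepts an `outcome_regressor` argument; check the actual signature before running:

```python
from skcausal.causal_estimators import GPS  # assumed import path

# Hypothetical third benchmark entry; the constructor arguments are
# assumed to mirror DirectRegressor and may differ in practice.
gps_results = benchmark_model(
    "GPS",
    GPS(outcome_regressor=make_regressor()),
)
benchmark_results = pd.concat([benchmark_results, gps_results], ignore_index=True)
```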
## Where to go next
- See Continuous Treatments for a fuller estimator walkthrough on the same dataset family.
- See Implement your own method if you want to benchmark a custom estimator or synthetic dataset.
- See Dataset Catalog for other built-in synthetic benchmarks.