Dataset Catalog

This page is generated from the dataset registry. It enumerates the classes returned by all_datasets, calls create_test_instances_and_names(), and renders each returned example with plot_marginal_curves.

Datasets expose a single sample. Any train/test or cross-fitting holdout should be created outside the dataset layer with a split object.
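As a minimal sketch of that separation, the holdout can be built with a scikit-learn splitter such as KFold. The arrays X, t, and y below are hypothetical placeholders for one loaded sample, not the registry's API, and KFold is only one possible choice of split object.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder arrays standing in for a single dataset sample (hypothetical names).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
t = rng.integers(0, 3, size=64)
y = rng.normal(size=64)

# The split object lives outside the dataset and only produces row indices.
splitter = KFold(n_splits=4, shuffle=True, random_state=0)
folds = list(splitter.split(X))

for train_idx, test_idx in folds:
    # Every array from the sample is sliced with the same indices.
    X_train, t_train, y_train = X[train_idx], t[train_idx], y[train_idx]
    X_test, t_test, y_test = X[test_idx], t[test_idx], y[test_idx]
```

Keeping the indices separate from the data means the same sample can be reused with different fold layouts without regenerating it.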

Use the dropdown to switch between datasets without leaving the page.


ExampleCategorical

Three-level categorical treatment dataset with observed confounding.

ExampleCategorical-0

Rows: 64. Treatment columns: treatment. Parameters: control_effect=0.0, covariate_effect=0.8, n=64, outcome_noise=0.2, placebo_effect=-0.5, random_state=7, score_x1_weight=-0.5, treated_effect=1.2, treatment_noise=0.4, treatment_threshold=0.6.

The observed covariates satisfy

$$ X_0, X_1 \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, 1). $$

A latent treatment score is generated as

$$ S = X_0 + w_1 X_1 + \varepsilon_T, \qquad \varepsilon_T \sim \mathcal{N}(0, \sigma_T^2), $$

where $w_1 =$ score_x1_weight and $\sigma_T =$ treatment_noise. The observed treatment is then the three-level threshold rule

$$ A = \begin{cases} \mathrm{treated}, & S > \tau, \\ \mathrm{placebo}, & S < -\tau, \\ \mathrm{control}, & \mathrm{otherwise}, \end{cases} $$

with $\tau =$ treatment_threshold.

The noiseless response surface is

$$ m(X, A) = \beta X_0 + \alpha(A), $$

where $\beta =$ covariate_effect and $\alpha(A)$ is the category-specific shift given by control_effect, placebo_effect, and treated_effect. Observed outcomes satisfy

$$ Y \mid X, A \sim \mathcal{N}(m(X, A), \sigma_Y^2), $$

with $\sigma_Y =$ outcome_noise.
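The equations above can be sketched in NumPy as follows. This is an illustrative re-implementation, not the library code: the function name simulate_example_categorical and the order of the random draws are assumptions, so it will not reproduce the registry's exact sample even with the same random_state.

```python
import numpy as np

def simulate_example_categorical(
    n=64, score_x1_weight=-0.5, treatment_noise=0.4, treatment_threshold=0.6,
    covariate_effect=0.8, control_effect=0.0, placebo_effect=-0.5,
    treated_effect=1.2, outcome_noise=0.2, random_state=7,
):
    """Hypothetical helper that follows the documented generating process."""
    rng = np.random.default_rng(random_state)
    x0 = rng.normal(size=n)
    x1 = rng.normal(size=n)
    # Latent score S = X0 + w1 * X1 + eps_T.
    score = x0 + score_x1_weight * x1 + rng.normal(scale=treatment_noise, size=n)
    # Three-level threshold rule on the latent score.
    treatment = np.where(
        score > treatment_threshold, "treated",
        np.where(score < -treatment_threshold, "placebo", "control"),
    )
    # Category-specific shift alpha(A).
    shift = {"control": control_effect, "placebo": placebo_effect, "treated": treated_effect}
    alpha = np.array([shift[a] for a in treatment])
    # Noiseless response m(X, A) and noisy outcome Y.
    mean = covariate_effect * x0 + alpha
    y = mean + rng.normal(scale=outcome_noise, size=n)
    return x0, x1, score, treatment, mean, y

x0, x1, score, treatment, mean, y = simulate_example_categorical()
```

Because $S$ depends on $X_0$ and the outcome also depends on $X_0$, the covariate confounds the treatment-outcome relationship, which is the "observed confounding" named in the summary.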

SemiSyntheticClassifier

Semi-synthetic categorical-treatment dataset from a classification task.

SemiSyntheticClassifier-0

Rows: 178. Treatment columns: t. Parameters: classifier=LogisticRegression(max_iter=2000), classifier__C=1.0, classifier__class_weight=None, classifier__dual=False, classifier__fit_intercept=True, classifier__intercept_scaling=1, classifier__l1_ratio=None, classifier__max_iter=2000, classifier__multi_class='deprecated', classifier__n_jobs=None, classifier__penalty='l2', classifier__random_state=None, classifier__solver='lbfgs', classifier__tol=0.0001, classifier__verbose=0, classifier__warm_start=False, load_dataset=functools.partial(<function load_wine at 0x7f9e3dd24540>, return_X_y=True, as_frame=True), outcome_noise_scale=1.0, random_state=0, treatment_effect_scale=2.0.

The dataset starts from a supervised classification sample (X, y) returned by _load_dataset or by the optional load_dataset callable. The covariates are standardized, a clone of the supplied scikit-learn classifier is fit on the normalized features, and its fitted class predictions become the observed treatment labels. The structural response combines the fitted class probability for a requested treatment level with a random centered treatment-specific shift.

If the raw classification sample is $(X_i^{\mathrm{raw}}, y_i^{\mathrm{raw}})$ for $i = 1, \ldots, n$, the dataset first standardizes each feature column:

$$ X_{ij} = \frac{X_{ij}^{\mathrm{raw}} - \mu_j}{s_j}, $$

where $\mu_j$ and $s_j$ are the empirical mean and standard deviation of feature column $j$, and zero empirical standard deviations are replaced by 1 in the code.
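A minimal sketch of this standardization step is shown below. The helper name standardize_columns is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption, since the page does not state which estimator the code uses.

```python
import numpy as np

def standardize_columns(x_raw):
    """Center and scale each column; constant columns are left at zero."""
    x_raw = np.asarray(x_raw, dtype=float)
    mean = x_raw.mean(axis=0)
    std = x_raw.std(axis=0)  # assumption: population standard deviation
    # Replace zero standard deviations by 1 to avoid division by zero.
    std = np.where(std == 0.0, 1.0, std)
    return (x_raw - mean) / std

# The second column is constant, so its standard deviation is zero.
x = standardize_columns([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
```

With the replacement in place, a constant column maps to all zeros instead of producing NaN values.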

A cloned classifier is then fit on the normalized covariates and its fitted predictions define the realized treatment labels:

$$ \hat{c} = \operatorname{fit}(\text{classifier}, X, y), \qquad A_i = \hat{c}(X_i). $$

Let $\hat{p}_a(x)$ denote the fitted class probability assigned by predict_proba to treatment level $a$. For the distinct realized treatment levels $a_1, \ldots, a_K$, the implementation samples random offsets

$$ \beta_k \stackrel{\mathrm{iid}}{\sim} \mathcal{N}\!\left(0, \frac{1}{K}\right), \qquad k = 1, \ldots, K, $$

and centers them over the observed treatment sample:

$$ g(a) = \lambda \left( \beta(a) - \frac{1}{n} \sum_{i=1}^n \beta(A_i) \right), $$

where $\beta(a_k) = \beta_k$ is the offset drawn for level $a_k$ and $\lambda =$ treatment_effect_scale.

The noiseless response surface exposed by predict_y is therefore

$$ \mu(x, a) = \hat{p}_a(x) + g(a). $$

The observed outcomes returned by load are sampled at the realized treatments with additive Gaussian noise:

$$ Y_i = \mu(X_i, A_i) + \varepsilon_i, \qquad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), $$

where $\sigma =$ outcome_noise_scale. The supplied classifier must implement both predict and predict_proba.
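The construction above can be sketched end to end with scikit-learn, using the wine data and logistic regression listed in the example parameters. This is a hedged illustration rather than the library's implementation: the module-level variable names, the local predict_y function, and the order of the random draws are assumptions, so the numbers will not match the registry's sample.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression

treatment_effect_scale = 2.0
outcome_noise_scale = 1.0
rng = np.random.default_rng(0)

# Standardize the features, replacing zero standard deviations by 1.
X_raw, y_raw = load_wine(return_X_y=True)
std = X_raw.std(axis=0)
X = (X_raw - X_raw.mean(axis=0)) / np.where(std == 0.0, 1.0, std)

# Fit a clone of the classifier; its predictions are the realized treatments.
clf = clone(LogisticRegression(max_iter=2000)).fit(X, y_raw)
A = clf.predict(X)

# One random offset per realized level, centered over the observed sample.
levels = np.unique(A)
K = len(levels)
beta = rng.normal(scale=np.sqrt(1.0 / K), size=K)  # variance 1 / K
level_index = np.searchsorted(levels, A)
g = treatment_effect_scale * (beta - beta[level_index].mean())

def predict_y(X_new, a):
    """Noiseless response mu(x, a) = p_hat_a(x) + g(a) for one realized level a."""
    col = list(clf.classes_).index(a)
    k = int(np.searchsorted(levels, a))
    return clf.predict_proba(X_new)[:, col] + g[k]

# Outcomes at the realized treatments with additive Gaussian noise.
proba = clf.predict_proba(X)
mu_realized = proba[np.arange(len(A)), np.searchsorted(clf.classes_, A)] + g[level_index]
Y = mu_realized + rng.normal(scale=outcome_noise_scale, size=len(A))
```

Centering over the observed sample means the shifts $g(A_i)$ average to zero across rows, so the treatment-specific term changes contrasts between levels without moving the overall mean of the noiseless response.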

SemiSyntheticRegressor

Semi-synthetic continuous-treatment dataset from a regression problem.

SemiSyntheticRegressor-0

Rows: 442. Treatment columns: t. Parameters: load_dataset=functools.partial(<function load_diabetes at 0x7f9e3dd24a40>, return_X_y=True, as_frame=True), n_spline_knots=6, outcome_noise_scale=1.0, random_state=0, regressor=LinearRegression(), regressor__copy_X=True, regressor__fit_intercept=True, regressor__n_jobs=None, regressor__positive=False, regressor__tol=1e-06, spline_degree=3, treatment_effect_scale=5.

The dataset starts from a supervised regression sample (X, y) returned by _load_dataset or by the optional load_dataset callable. The features and target are standardized, a clone of the supplied scikit-learn regressor is fit on the normalized regression problem, and the fitted predictions become the observed treatment values. The structural response then adds a random spline effect of the treatment to the fitted regression mean.

If the raw regression sample is $(X_i^{\mathrm{raw}}, y_i^{\mathrm{raw}})$ for $i = 1, \ldots, n$, the dataset first standardizes each feature column and the target:

$$ X_{ij} = \frac{X_{ij}^{\mathrm{raw}} - \mu_j}{s_j}, \qquad y_i = \frac{y_i^{\mathrm{raw}} - \mu_y}{s_y}, $$

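where $\mu_j, s_j$ and $\mu_y, s_y$ are the empirical means and standard deviations of feature column $j$ and of the target, and zero empirical standard deviations are replaced by 1 in the code.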

A cloned regressor is then fit on the normalized regression task and its fitted predictions define the treatment:

$$ \hat{m} = \operatorname{fit}(\text{regressor}, X, y), \qquad T_i = \hat{m}(X_i). $$

Let $B(t) \in \mathbb{R}^K$ denote the spline basis produced by SplineTransformer after fitting on the realized treatments $T_1, \ldots, T_n$. The random spline coefficients are sampled as

$$ \beta_k \stackrel{\mathrm{iid}}{\sim} \mathcal{N}\!\left(0, \frac{1}{K}\right), \qquad k = 1, \ldots, K, $$

and the treatment effect is centered over the observed treatment sample:

$$ g(t) = \lambda \left( B(t)^\top \beta - \frac{1}{n} \sum_{i=1}^n B(T_i)^\top \beta \right), $$

where $\lambda =$ treatment_effect_scale. If all realized treatments are identical, the implementation skips the spline fit and uses $g(t) \equiv 0$.

The noiseless response surface exposed by predict_y is therefore

$$ \mu(x, t) = \hat{m}(x) + g(t). $$

The observed outcomes returned by load are sampled at the realized treatments with additive Gaussian noise:

$$ Y_i = \mu(X_i, T_i) + \varepsilon_i, \qquad \varepsilon_i \stackrel{\mathrm{iid}}{\sim} \mathcal{N}(0, \sigma^2), $$

where $\sigma =$ outcome_noise_scale is the noise standard deviation used in the implementation.
