Data preprocessing#

Module for algorithms that pre-process the data in some way.

Classes:

BalancedTestSplit

Split data such that the test set is balanced and the training set is proportional.

BiasedDebiasedSubsets

Split the given data into a biased subset and a debiased subset.

BiasedSubset

Split the given data into a biased subset and a normal subset.

DataSplitter

Base class for classes that split data.

LabelBinarizer

If a dataset has labels [-1, 1], transform them to [0, 1].

ProportionalSplit

Split into train and test while preserving the proportion of s and y.

RandomSplit

Standard train test split.

ScalerType

Protocol describing a scaler class.

SequentialSplit

Take the first N samples for the train set and the rest for the test set; no shuffling.

Functions:

bin_cont_feats

Bin the continuous features.

dataset_from_cond

Return the dataframe that meets some condition.

domain_split

Split a datatuple based on a condition.

get_biased_and_debiased_subsets

Split the given data into a biased subset and a debiased subset.

get_biased_subset

Split the given data into a biased subset and a normal subset.

query_dt

Query a datatuple.

scale_continuous

Use a scaler on just the continuous features.

train_test_split

Split a data tuple into two data tuples along the rows of the DataFrames.

class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)#

Bases: RandomSplit

Split data such that the test set is balanced and the training set is proportional.

Parameters:
  • balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing

  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split

class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, *, fixed_unbiased=True)#

Bases: DataSplitter

Split the given data into a biased subset and a debiased subset.

Parameters:
  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID

  • seed (int) – random seed for the splitting

  • fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, *, data_efficient=True)#

Bases: DataSplitter

Split the given data into a biased subset and a normal subset.

Parameters:
  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID

  • seed (int) – random seed for the splitting

  • data_efficient (bool) – if True, try to keep as many data points as possible

class DataSplitter#

Bases: ABC

Base class for classes that split data.

class LabelBinarizer#

Bases: object

If a dataset has labels [-1, 1], transform them to [0, 1].

adjust(dataset)#

Take a datatuple and make the labels [0,1].

Parameters:

dataset (DataTuple) –

Return type:

DataTuple

post(dataset)#

Inverse of adjust.

Parameters:

dataset (DataTuple) –

Return type:

DataTuple

post_only_labels(labels)#

Inverse of adjust, but applied to a plain Series of labels instead of a DataTuple.

Parameters:

labels (Series) –

Return type:

Series
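
The remapping itself can be sketched with plain pandas. `DataTuple` is library-specific, so this hypothetical sketch operates on a bare Series of labels (the helper names are illustrative, not the class's methods):

```python
import pandas as pd

def adjust_labels(labels: pd.Series) -> pd.Series:
    """Map labels from {-1, 1} to {0, 1}."""
    return labels.replace({-1: 0})

def restore_labels(labels: pd.Series) -> pd.Series:
    """Inverse: map labels from {0, 1} back to {-1, 1}."""
    return labels.replace({0: -1})

y = pd.Series([-1, 1, 1, -1], name="y")
adjusted = adjust_labels(y)
restored = restore_labels(adjusted)
```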

class ProportionalSplit(train_percentage=0.8, start_seed=0)#

Bases: RandomSplit

Split into train and test while preserving the proportion of s and y.

Parameters:
  • train_percentage (float) –

  • start_seed (int | None) – random seed for the first split
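
A minimal sketch of proportion-preserving splitting, using scikit-learn's stratification on the joint (s, y) combination. This illustrates the idea only and is not this class's implementation:

```python
import pandas as pd
from sklearn.model_selection import train_test_split as sk_split  # aliased to avoid clashing with this module's function

df = pd.DataFrame({
    "x": range(12),
    "s": [0, 0, 0, 1, 1, 1] * 2,
    "y": [0, 1] * 6,
})
# Stratify on the joint (s, y) combination so both splits
# preserve the group proportions.
strata = df["s"].astype(str) + df["y"].astype(str)
train, test = sk_split(df, train_size=0.5, stratify=strata, random_state=0)
```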

class RandomSplit(train_percentage=0.8, start_seed=0)#

Bases: DataSplitter

Standard train test split.

Parameters:
  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int | None) – random seed for the first split

class ScalerType(*args, **kwargs)#

Bases: Protocol

Protocol describing a scaler class.

fit(X)#

Fit parameters of the transformation to the given data.

Parameters:

X (DataFrame) –

Return type:

Self

fit_transform(X)#

Fit parameters of the transformation to the given data and then transform.

Parameters:

X (DataFrame) –

Return type:

ndarray[Any, dtype[_ScalarType_co]]

inverse_transform(X)#

Invert the transformation.

Parameters:

X (DataFrame) –

Return type:

ndarray[Any, dtype[_ScalarType_co]]

transform(X)#

Transform the given data.

Parameters:

X (DataFrame) –

Return type:

ndarray[Any, dtype[_ScalarType_co]]
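
Structural typing means any object with these four methods satisfies the protocol without inheriting from it. A hypothetical sketch with a minimal mean-centering scaler (not part of the library):

```python
from typing import Protocol, runtime_checkable
import numpy as np

@runtime_checkable
class Scaler(Protocol):
    """Structural type: anything with these methods qualifies."""
    def fit(self, X) -> "Scaler": ...
    def fit_transform(self, X) -> np.ndarray: ...
    def inverse_transform(self, X) -> np.ndarray: ...
    def transform(self, X) -> np.ndarray: ...

class MeanCenter:
    """Minimal scaler: subtract the per-column means."""
    def fit(self, X):
        self.mean_ = np.asarray(X).mean(axis=0)
        return self
    def transform(self, X):
        return np.asarray(X) - self.mean_
    def fit_transform(self, X):
        return self.fit(X).transform(X)
    def inverse_transform(self, X):
        return np.asarray(X) + self.mean_

# Satisfied purely structurally, no subclassing needed.
assert isinstance(MeanCenter(), Scaler)
```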

class SequentialSplit(train_percentage)#

Bases: DataSplitter

Take the first N samples for the train set and the rest for the test set; no shuffling.

Parameters:

train_percentage (float) –
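
The idea can be sketched in a few lines of pandas (a hypothetical `sequential_split` helper, not this class's code):

```python
import pandas as pd

def sequential_split(df: pd.DataFrame, train_percentage: float):
    """First N rows go to train, the rest to test; no shuffling."""
    n_train = round(len(df) * train_percentage)
    return df.iloc[:n_train], df.iloc[n_train:]

df = pd.DataFrame({"x": range(10)})
train, test = sequential_split(df, 0.8)  # rows 0-7 and rows 8-9
```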

bin_cont_feats(data)#

Bin the continuous features.

Given a datatuple, bin the columns that have ordinal features and return the result as a fresh DataTuple.

Parameters:

data (DataTuple) – The data to bin.

Returns:

A DataTuple where the ordinal columns have been replaced.

Return type:

DataTuple
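
Binning a continuous column into ordinal categories can be sketched with `pd.qcut`; this is an illustration of the idea, and the function's actual binning strategy may differ:

```python
import pandas as pd

df = pd.DataFrame({"age": [22, 35, 47, 58, 63, 71]})
# Bin the continuous column into quartiles, yielding an ordinal column
# with integer codes 0..3 instead of raw values.
df["age_binned"] = pd.qcut(df["age"], q=4, labels=False)
```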

dataset_from_cond(dataset, cond)#

Return the dataframe that meets some condition.

Parameters:
  • dataset (DataFrame) –

  • cond (str) –

Return type:

DataFrame
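
String conditions like this map naturally onto pandas `DataFrame.query`; a minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"s": [0, 0, 1, 1], "y": [0, 1, 0, 1]})
# A query string expresses the condition directly.
subset = df.query("s == 1 & y == 1")
```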

domain_split(datatup, tr_cond, te_cond, seed=888)#

Split a datatuple based on a condition.

Parameters:
  • datatup (DataTuple) – DataTuple

  • tr_cond (str) – condition for the training set

  • te_cond (str) – condition for the test set

  • seed (int) – (Default: 888)

Returns:

Tuple of DataTuple split into train and test. The test is all those that meet the test condition plus the same percentage again of the train set.

Return type:

tuple[DataTuple, DataTuple]

get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, *, fixed_unbiased=True)#

Split the given data into a biased subset and a debiased subset.

In contrast to get_biased_subset(), this function makes the unbiased subset truly unbiased.

The two subsets don’t generally sum up to the whole set.

Example behavior:

  • mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y

  • mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y

  • mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y

Parameters:
  • data (DataTuple) – Data in form of a DataTuple.

  • mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.

  • unbiased_pcnt (float) – How much of the data should be reserved for the unbiased subset?

  • seed (int) – random seed for the splitting (Default: 42)

  • fixed_unbiased (bool) – If True, then the unbiased dataset is independent from the mixing factor (Default: True)

Returns:

biased and unbiased dataset

Return type:

tuple[DataTuple, DataTuple]
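
The invariants from the example behavior above (for mixing_factor=0.0) can be sketched with plain pandas. This illustrates the s=y / s!=y bookkeeping only, not the library's algorithm:

```python
import pandas as pd

df = pd.DataFrame({"s": [0, 0, 1, 1] * 3, "y": [0, 1, 0, 1] * 3})

# Reserve the first half of the rows for the debiased subset.
reserved, rest = df.iloc[:6], df.iloc[6:]

# Debiased: equal parts s == y and s != y from the reserved rows.
res_agree = reserved[reserved["s"] == reserved["y"]]
res_dis = reserved[reserved["s"] != reserved["y"]]
k = min(len(res_agree), len(res_dis))
debiased = pd.concat([res_agree.iloc[:k], res_dis.iloc[:k]])

# Biased with mixing_factor=0.0: only rows where s == y.
biased = rest[rest["s"] == rest["y"]]
```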

get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, *, data_efficient=True)#

Split the given data into a biased subset and a normal subset.

The two subsets don’t generally sum up to the whole set.

Example behavior:

  • mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data

  • mixing_factor=0.5: biased and unbiased are both just subsets of data

  • mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data

Parameters:
  • data (DataTuple) – data in form of a DataTuple

  • mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting (Default: 42)

  • data_efficient (bool) – if True, try to keep as many data points as possible (Default: True)

Returns:

biased and unbiased dataset

Return type:

tuple[DataTuple, DataTuple]

query_dt(datatup, query_str)#

Query a datatuple.

Parameters:
  • datatup (DataTuple) –

  • query_str (str) –

Return type:

DataTuple

scale_continuous(dataset, datatuple, scaler, *, inverse=False, fit=True)#

Use a scaler on just the continuous features.

Example:
>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()
>>> train, scaler = scale_continuous(dataset, train, scaler)
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)

Parameters:
  • dataset (Dataset) – Dataset object. Used to find the continuous features.

  • datatuple (DataTuple) – DataTuple on which to scale the continuous features.

  • scaler (ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.

  • inverse (bool) – Should the scaling be reversed? (Default: False)

  • fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform. (Default: True)

Returns:

Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).

Return type:

tuple[DataTuple, ScalerType]
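
The underlying pattern — fit on train, transform test, touch only the continuous columns — can be sketched directly with scikit-learn's StandardScaler on plain DataFrames:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "sex": [0, 1, 0]})
test = pd.DataFrame({"age": [30.0, 50.0], "sex": [1, 0]})
cont = ["age"]  # continuous columns; "sex" is left untouched

scaler = StandardScaler()
train[cont] = scaler.fit_transform(train[cont])  # the fit=True path
test[cont] = scaler.transform(test[cont])        # the fit=False path
```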

train_test_split(data, train_percentage=0.8, random_seed=0, num_test_samples=None)#

Split a data tuple into two data tuples along the rows of the DataFrames.

Parameters:
  • data (DataTuple) – Data tuple to split.

  • train_percentage (float | None) – Percentage for train split. Must be None if num_test_samples given. (Default: 0.8)

  • random_seed (int) – Seed to make splitting reproducible (Default: 0)

  • num_test_samples (int | None) – Number of samples to make the test set. Must be None if train_percentage given. (Default: None)

Returns:

train split and test split

Raises:

ValueError – If either none or both of num_test_samples and train_percentage are passed in as arguments.

Return type:

tuple[DataTuple, DataTuple]
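
A minimal sketch of seeded row-wise splitting on a plain DataFrame (a hypothetical `row_split` helper, not this function's implementation):

```python
import numpy as np
import pandas as pd

def row_split(df: pd.DataFrame, train_percentage: float, seed: int):
    """Shuffle row indices with a fixed seed, then split by percentage."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_train = round(len(df) * train_percentage)
    return df.iloc[idx[:n_train]], df.iloc[idx[n_train:]]

df = pd.DataFrame({"x": range(10), "y": range(10)})
train, test = row_split(df, train_percentage=0.8, seed=0)
```

The fixed seed makes the split reproducible: calling `row_split` again with the same seed yields the same partition.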