Data preprocessing#

Module for algorithms that pre-process the data in some way.

Classes:

`BalancedTestSplit`	Split data such that the test set is balanced and the training set is proportional.
`BiasedDebiasedSubsets`	Split the given data into a biased subset and a debiased subset.
`BiasedSubset`	Split the given data into a biased subset and a normal subset.
`DataSplitter`	Base class for classes that split data.
`LabelBinarizer`	If a dataset has labels in {-1, 1}, remap them to {0, 1}.
`ProportionalSplit`	Split into train and test while preserving the proportion of s and y.
`RandomSplit`	Standard train test split.
`ScalerType`	Protocol describing a scaler class.
`SequentialSplit`	Take the first N samples for train set and the rest for test set; no shuffle.

Functions:

`bin_cont_feats`	Bin the continuous features.
`dataset_from_cond`	Return the dataframe that meets some condition.
`domain_split`	Split a datatuple based on a condition.
`get_biased_and_debiased_subsets`	Split the given data into a biased subset and a debiased subset.
`get_biased_subset`	Split the given data into a biased subset and a normal subset.
`query_dt`	Query a datatuple.
`scale_continuous`	Use a scaler on just the continuous features.
`train_test_split`	Split a datatuple into two datatuples along the rows of the DataFrames.

class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)#

Bases: RandomSplit

Split data such that the test set is balanced and the training set is proportional.

Parameters:

balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
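
As an illustration of what the 'P(s|y)=0.5' balance type asks of a test set, the sketch below subsamples rows so that, within each value of y, every s group has the same size. It is a plain-Python sketch under stated assumptions, not the library's implementation: `balance_s_given_y` is a hypothetical helper, and s and y are assumed to be plain lists of labels.

```python
import random
from collections import defaultdict


def balance_s_given_y(s, y, seed=0):
    """Return sorted row indices such that, for each y, all s groups are equally large."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, (s_i, y_i) in enumerate(zip(s, y)):
        groups[(s_i, y_i)].append(index)

    s_values = sorted(set(s))
    kept = []
    for y_value in sorted(set(y)):
        # The smallest (s, y) group limits how many rows each s group may keep.
        smallest = min(len(groups[(s_value, y_value)]) for s_value in s_values)
        for s_value in s_values:
            kept.extend(rng.sample(groups[(s_value, y_value)], smallest))
    return sorted(kept)


s = [0, 0, 0, 1, 1, 0, 1, 1, 1, 1]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
kept = balance_s_given_y(s, y)
```

Balancing by subsampling discards rows from the larger groups, which is one reason to balance only the test set and leave the training set proportional.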

class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, *, fixed_unbiased=True)#

Bases: DataSplitter

Split the given data into a biased subset and a debiased subset.

Parameters:

unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, *, data_efficient=True)#

Bases: DataSplitter

Split the given data into a biased subset and a normal subset.

Parameters:

unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible

class DataSplitter#

Bases: ABC

Base class for classes that split data.

class LabelBinarizer#

Bases: object

If a dataset has labels in {-1, 1}, remap them to {0, 1}.

adjust(dataset)#

Take a datatuple and make the labels [0,1].

Parameters: dataset (DataTuple)
Return type: DataTuple

post(dataset)#

Inverse of adjust.

Parameters: dataset (DataTuple)
Return type: DataTuple

post_only_labels(labels)#

Inverse of adjust, but for a Series of labels instead of a DataTuple.

Parameters: labels (Series)
Return type: Series
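
The mapping that adjust and post describe can be illustrated without the library. The sketch below is a minimal plain-Python analogue, assuming the labels are exactly -1 and 1; `binarize` and `unbinarize` are hypothetical helper names, not part of this module.

```python
def binarize(labels):
    """Map labels from {-1, 1} to {0, 1} (the role of ``adjust``)."""
    if not set(labels) <= {-1, 1}:
        raise ValueError("expected labels in {-1, 1}")
    return [(label + 1) // 2 for label in labels]


def unbinarize(labels):
    """Map labels from {0, 1} back to {-1, 1} (the role of ``post``)."""
    if not set(labels) <= {0, 1}:
        raise ValueError("expected labels in {0, 1}")
    return [2 * label - 1 for label in labels]


original = [-1, 1, 1, -1]
adjusted = binarize(original)
restored = unbinarize(adjusted)
```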

class ProportionalSplit(train_percentage=0.8, start_seed=0)#

Bases: RandomSplit

Split into train and test while preserving the proportion of s and y.

Parameters:

train_percentage (float) – how much of the data to use for the train split
start_seed (int | None) – random seed for the first split
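
Preserving the proportion of s and y amounts to a split stratified on the (s, y) pair. The following is a minimal sketch of that idea in plain Python, assuming s and y are lists of labels; `proportional_split` is a hypothetical name and the rounding rule is an assumption, so the library may differ in detail.

```python
import random
from collections import defaultdict


def proportional_split(s, y, train_percentage=0.8, seed=0):
    """Return (train_indices, test_indices), stratified on the (s, y) pair."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, key in enumerate(zip(s, y)):
        groups[key].append(index)

    train, test = [], []
    for key in sorted(groups):
        indices = groups[key]
        rng.shuffle(indices)
        # Apply the same train fraction inside every (s, y) group.
        cut = round(len(indices) * train_percentage)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


s = [0] * 10 + [1] * 10
y = ([0] * 5 + [1] * 5) * 2
train, test = proportional_split(s, y)
```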

class RandomSplit(train_percentage=0.8, start_seed=0)#

Bases: DataSplitter

Standard train test split.

Parameters:

train_percentage (float) – how much of the data to use for the train split
start_seed (int | None) – random seed for the first split

class ScalerType(*args, **kwargs)#

Bases: Protocol

Protocol describing a scaler class.

fit(X)#

Fit parameters of the transformation to the given data.

Parameters: X (DataFrame)
Return type: Self

fit_transform(X)#

Fit parameters of the transformation to the given data and then transform.

Parameters: X (DataFrame)
Return type: ndarray[Any, dtype[_ScalarType_co]]

inverse_transform(X)#

Invert the transformation.

Parameters: X (DataFrame)
Return type: ndarray[Any, dtype[_ScalarType_co]]

transform(X)#

Transform the given data.

Parameters: X (DataFrame)
Return type: ndarray[Any, dtype[_ScalarType_co]]
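
Because ScalerType is a Protocol, any class that provides these four methods satisfies it without inheriting from it. The sketch below shows that structural typing with a deliberately simplified stand-in: `SimpleScaler` and `MinMaxScaler1D` are hypothetical names, and they operate on lists of floats rather than on DataFrames.

```python
from typing import List, Protocol, runtime_checkable


@runtime_checkable
class SimpleScaler(Protocol):
    """Hypothetical, simplified analogue of ScalerType (lists instead of DataFrames)."""

    def fit(self, X: List[float]) -> "SimpleScaler": ...
    def fit_transform(self, X: List[float]) -> List[float]: ...
    def inverse_transform(self, X: List[float]) -> List[float]: ...
    def transform(self, X: List[float]) -> List[float]: ...


class MinMaxScaler1D:
    """Scales values to [0, 1]; note that it does not inherit from SimpleScaler."""

    def fit(self, X):
        self.low, self.high = min(X), max(X)
        return self

    def transform(self, X):
        span = (self.high - self.low) or 1.0  # avoid dividing by zero for constant data
        return [(x - self.low) / span for x in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        span = (self.high - self.low) or 1.0
        return [x * span + self.low for x in X]


scaler = MinMaxScaler1D()
scaled = scaler.fit_transform([2.0, 4.0, 6.0])
restored = scaler.inverse_transform(scaled)
```

The runtime check only confirms that the methods exist; a static type checker is what verifies the signatures.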

class SequentialSplit(train_percentage)#

Bases: DataSplitter

Take the first N samples for train set and the rest for test set; no shuffle.

Parameters: train_percentage (float)
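
A sequential split needs no random seed, since the row order alone decides membership. The sketch below shows the idea on a plain list; `sequential_split` is a hypothetical helper, and rounding N to the nearest integer is an assumption.

```python
def sequential_split(rows, train_percentage):
    """Put the first N rows in train and the rest in test, preserving order."""
    if not 0.0 <= train_percentage <= 1.0:
        raise ValueError("train_percentage must be between 0 and 1")
    n_train = round(len(rows) * train_percentage)
    return rows[:n_train], rows[n_train:]


train, test = sequential_split(list(range(10)), train_percentage=0.8)
```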

bin_cont_feats(data)#

Bin the continuous features.

Given a datatuple, bin the columns that have ordinal features and return the result as a new DataTuple.

Parameters: data (DataTuple) – The data to bin.
Returns: A DataTuple where the ordinal columns have been replaced.
Return type: DataTuple

dataset_from_cond(dataset, cond)#

Return the dataframe that meets some condition.

Parameters:

dataset (DataFrame)
cond (str)

Return type:

DataFrame

domain_split(datatup, tr_cond, te_cond, seed=888)#

Split a datatuple based on a condition.

Parameters:

datatup (DataTuple) – the data to split
tr_cond (str) – condition for the training set
te_cond (str) – condition for the test set
seed (int) – random seed for the splitting (Default: 888)

Returns:

Tuple of DataTuple split into train and test. The test is all those that meet the test condition plus the same percentage again of the train set.

Return type:

tuple[DataTuple, DataTuple]

get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, *, fixed_unbiased=True)#

Split the given data into a biased subset and a debiased subset.

In contrast to get_biased_subset(), this function makes the unbiased subset really unbiased.

The two subsets don’t generally sum up to the whole set.

Example behavior:

mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y
mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y
mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y

Parameters:

data (DataTuple) – Data in form of a DataTuple.
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – How much of the data should be reserved for the unbiased subset?
seed (int) – random seed for the splitting (Default: 42)
fixed_unbiased (bool) – If True, then the unbiased dataset is independent from the mixing factor (Default: True)

Returns:

biased and unbiased dataset

Return type:

tuple[DataTuple, DataTuple]
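
The property shared by all three example cases, a debiased subset in which half the points have s=y and half have s!=y, can be obtained by subsampling both groups to the size of the smaller one. This is a plain-Python sketch under that assumption; `debiased_subset` is a hypothetical helper working on lists of labels, and it ignores unbiased_pcnt.

```python
import random


def debiased_subset(s, y, seed=42):
    """Return sorted indices where half the points have s == y and half have s != y."""
    rng = random.Random(seed)
    equal = [i for i, (s_i, y_i) in enumerate(zip(s, y)) if s_i == y_i]
    opposite = [i for i, (s_i, y_i) in enumerate(zip(s, y)) if s_i != y_i]
    # The smaller group determines how many points each group contributes.
    size = min(len(equal), len(opposite))
    return sorted(rng.sample(equal, size) + rng.sample(opposite, size))


s = [0, 0, 1, 1, 1, 0, 1, 0]
y = [0, 0, 1, 1, 0, 1, 1, 0]
chosen = debiased_subset(s, y)
```

Subsampling like this is also why the two subsets don't generally sum up to the whole set.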

get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, *, data_efficient=True)#

Split the given data into a biased subset and a normal subset.

The two subsets don’t generally sum up to the whole set.

Example behavior:

mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data
mixing_factor=0.5: biased and unbiased are both just subsets of data
mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data

Parameters:

data (DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting (Default: 42)
data_efficient (bool) – if True, try to keep as many data points as possible (Default: True)

Returns:

biased and unbiased dataset

Return type:

tuple[DataTuple, DataTuple]
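
One interpolation that reproduces the three example cases for the biased subset is to keep a fraction min(1, 2(1 - mixing_factor)) of the s=y points and min(1, 2 * mixing_factor) of the s!=y points. The sketch below uses that rule purely as an assumption chosen to match the examples; `biased_subset` is a hypothetical helper on lists of labels, and it leaves out unbiased_pcnt.

```python
import random


def biased_subset(s, y, mixing_factor, seed=42):
    """Return sorted indices of a subset whose bias is controlled by mixing_factor."""
    rng = random.Random(seed)
    equal = [i for i, (s_i, y_i) in enumerate(zip(s, y)) if s_i == y_i]
    opposite = [i for i, (s_i, y_i) in enumerate(zip(s, y)) if s_i != y_i]
    # Assumed interpolation, chosen only to match the three documented cases.
    equal_fraction = min(1.0, 2.0 * (1.0 - mixing_factor))
    opposite_fraction = min(1.0, 2.0 * mixing_factor)
    keep_equal = rng.sample(equal, round(len(equal) * equal_fraction))
    keep_opposite = rng.sample(opposite, round(len(opposite) * opposite_fraction))
    return sorted(keep_equal + keep_opposite)


s = [0, 0, 1, 1, 1, 0, 1, 0]
y = [0, 0, 1, 1, 0, 1, 1, 0]
fully_biased = biased_subset(s, y, mixing_factor=0.0)
unchanged = biased_subset(s, y, mixing_factor=0.5)
reversed_bias = biased_subset(s, y, mixing_factor=1.0)
```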

query_dt(datatup, query_str)#

Query a datatuple.

Parameters:

datatup (DataTuple)
query_str (str)

Return type:

DataTuple

scale_continuous(dataset, datatuple, scaler, *, inverse=False, fit=True)#

Use a scaler on just the continuous features.

Example:

>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()
>>> train, scaler = scale_continuous(dataset, train, scaler)
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)

Parameters:

dataset (Dataset) – Dataset object. Used to find the continuous features.
datatuple (DataTuple) – DataTuple on which to scale the continuous features.
scaler (ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.
inverse (bool) – Should the scaling be reversed? (Default: False)
fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform. (Default: True)

Returns:

Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).

Return type:

tuple[DataTuple, ScalerType]
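
The interaction of the inverse and fit flags can be sketched without the library. In the plain-Python sketch below, `ToyStandardScaler` and `scale_columns` are hypothetical stand-ins, a table is assumed to be a dict of column lists, and the continuous column names are passed in directly instead of being looked up on a Dataset object.

```python
class ToyStandardScaler:
    """Hypothetical per-column standardiser working on dicts of lists."""

    def fit(self, columns):
        self.stats = {}
        for name, values in columns.items():
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            self.stats[name] = (mean, var ** 0.5 or 1.0)  # fall back to 1.0 for constant columns
        return self

    def transform(self, columns):
        return {
            name: [(v - self.stats[name][0]) / self.stats[name][1] for v in values]
            for name, values in columns.items()
        }

    def fit_transform(self, columns):
        return self.fit(columns).transform(columns)

    def inverse_transform(self, columns):
        return {
            name: [v * self.stats[name][1] + self.stats[name][0] for v in values]
            for name, values in columns.items()
        }


def scale_columns(table, continuous, scaler, *, inverse=False, fit=True):
    """Scale only the ``continuous`` columns; all other columns pass through unchanged."""
    selected = {name: table[name] for name in continuous}
    if inverse:
        scaled = scaler.inverse_transform(selected)
    elif fit:
        scaled = scaler.fit_transform(selected)
    else:
        scaled = scaler.transform(selected)
    return {**table, **scaled}, scaler


train = {"age": [20.0, 40.0], "is_married": [0, 1]}
test = {"age": [30.0], "is_married": [1]}
scaler = ToyStandardScaler()
train_scaled, scaler = scale_columns(train, ["age"], scaler)
test_scaled, scaler = scale_columns(test, ["age"], scaler, fit=False)
train_restored, scaler = scale_columns(train_scaled, ["age"], scaler, inverse=True)
```

Passing fit=False for the test set reuses the statistics learned on the training set, so no information from the test set reaches the scaler.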

train_test_split(data, train_percentage=0.8, random_seed=0, num_test_samples=None)#

Split a datatuple into two datatuples along the rows of the DataFrames.

Parameters:

data (DataTuple) – Data tuple to split.
train_percentage (float | None) – Percentage for train split. Must be None if num_test_samples given. (Default: 0.8)
random_seed (int) – Seed to make splitting reproducible (Default: 0)
num_test_samples (int | None) – Number of samples to make the test set. Must be None if train_percentage given. (Default: None)

Returns:

train split and test split

Raises:

ValueError – If neither or both of num_test_samples and train_percentage are passed in as arguments.

Return type:

tuple[DataTuple, DataTuple]
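
The argument rule above means that a caller who wants a fixed-size test set has to pass train_percentage=None explicitly, because train_percentage defaults to 0.8. The sketch below mirrors that rule on a plain list; `split_rows` is a hypothetical helper, and the shuffling and rounding details are assumptions rather than the library's behaviour.

```python
import random


def split_rows(rows, train_percentage=0.8, random_seed=0, num_test_samples=None):
    """Shuffle ``rows`` reproducibly and split them into (train, test)."""
    # Exactly one of the two size arguments may be set.
    if (train_percentage is None) == (num_test_samples is None):
        raise ValueError("pass exactly one of train_percentage and num_test_samples")

    shuffled = list(rows)
    random.Random(random_seed).shuffle(shuffled)

    if num_test_samples is not None:
        if not 0 <= num_test_samples <= len(shuffled):
            raise ValueError("num_test_samples must be between 0 and len(rows)")
        n_train = len(shuffled) - num_test_samples
    else:
        n_train = round(len(shuffled) * train_percentage)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(10))
train, test = split_rows(rows)
train_fixed, test_fixed = split_rows(rows, train_percentage=None, num_test_samples=3)
```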