Data preprocessing

This module contains algorithms that pre-process the data in some way.

Classes:

`BalancedTestSplit`	Split data such that the test set is balanced and the training set is proportional.
`BiasedDebiasedSubsets`	Split the given data into a biased subset and a debiased subset.
`BiasedSubset`	Split the given data into a biased subset and a normal subset.
`DataSplitter`	Base class for classes that split data.
`LabelBinarizer`	If a dataset has labels [-1,1], convert them to [0,1].
`ProportionalSplit`	Split into train and test while preserving the proportion of s and y.
`RandomSplit`	Standard train test split.
`SequentialSplit`	Take the first N samples for train set and the rest for test set; no shuffle.

Functions:

`bin_cont_feats`	Bin the continuous features.
`dataset_from_cond`	Return the dataframe that meets some condition.
`domain_split`	Split a datatuple based on a condition.
`get_biased_and_debiased_subsets`	Split the given data into a biased subset and a debiased subset.
`get_biased_subset`	Split the given data into a biased subset and a normal subset.
`query_dt`	Query a datatuple.
`scale_continuous`	Use a scaler on just the continuous features.
`train_test_split`	Split a data tuple into two datatuples along the rows of the DataFrames.

class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)

Bases: ethicml.preprocessing.train_test_split.RandomSplit

Split data such that the test set is balanced and the training set is proportional.

The constructor takes the following arguments.

Parameters

balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
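
To illustrate what the 'P(s|y)=0.5' balance type means, here is a minimal pure-Python sketch. It is not EthicML's implementation: `balance_s_given_y` is a hypothetical helper, rows are plain `(s, y)` tuples, and balancing by downsampling to the smallest group is an assumption.

```python
import random
from collections import defaultdict


def balance_s_given_y(rows, seed=0):
    """Downsample (s, y) rows so that, for each y, every s value is equally frequent.

    Hypothetical helper for illustration only; not part of EthicML.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s, y in rows:
        groups[(s, y)].append((s, y))

    balanced = []
    for y in sorted({y for _, y in rows}):
        s_values = sorted({s for s, y_ in rows if y_ == y})
        # The smallest s-group for this y limits how many rows can be kept.
        n_keep = min(len(groups[(s, y)]) for s in s_values)
        for s in s_values:
            balanced.extend(rng.sample(groups[(s, y)], n_keep))
    return balanced


rows = [(0, 0)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 3 + [(1, 1)] * 5
balanced = balance_s_given_y(rows)
```

After balancing, each value of y has as many rows with s=0 as with s=1, which is what P(s|y)=0.5 asks for when s is binary.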

class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, fixed_unbiased=True)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Split the given data into a biased subset and a debiased subset.

The constructor takes the following arguments.

Parameters

mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, data_efficient=True)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Split the given data into a biased subset and a normal subset.

The constructor takes the following arguments.

Parameters

mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible

class DataSplitter

Bases: abc.ABC

Base class for classes that split data.

class LabelBinarizer

Bases: object

If a dataset has labels [-1,1], convert them to [0,1].

Return type: None

adjust(dataset)

Take a datatuple and make the labels [0,1].

Parameters: dataset (ethicml.utility.data_structures.DataTuple) –
Return type: ethicml.utility.data_structures.DataTuple

post(dataset)

Inverse of adjust.

Parameters: dataset (ethicml.utility.data_structures.DataTuple) –
Return type: ethicml.utility.data_structures.DataTuple

post_only_labels(labels)

Inverse of adjust, but for a Series of labels instead of a DataTuple.

Parameters: labels (pandas.Series) –
Return type: pandas.Series
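
The mapping behind adjust and its inverse can be sketched on a plain list of labels. This is an illustration only: `adjust_labels` and `restore_labels` are hypothetical helpers, and the fixed arithmetic mapping between {-1, 1} and {0, 1} is an assumption about how the conversion could be done, not the class's actual code.

```python
def adjust_labels(labels):
    """Map labels from {-1, 1} to {0, 1} (hypothetical helper)."""
    return [(label + 1) // 2 for label in labels]


def restore_labels(labels):
    """Inverse mapping, from {0, 1} back to {-1, 1} (hypothetical helper)."""
    return [2 * label - 1 for label in labels]


original = [-1, 1, 1, -1]
binary = adjust_labels(original)
restored = restore_labels(binary)
```

Applying the inverse after the forward mapping returns the original labels, which is the round trip that adjust and post are described as providing.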

class ProportionalSplit(train_percentage=0.8, start_seed=0)

Bases: ethicml.preprocessing.train_test_split.RandomSplit

Split into train and test while preserving the proportion of s and y.

The constructor takes the following arguments.

Parameters

train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
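
Preserving the proportion of s and y amounts to a stratified split over the (s, y) groups. The sketch below shows the idea in pure Python under stated assumptions: `proportional_split` is a hypothetical helper that works on index lists, and it is not EthicML's implementation.

```python
import random
from collections import defaultdict


def proportional_split(s, y, train_percentage=0.8, seed=0):
    """Return train/test indices so that each (s, y) group is split at the same ratio.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for index, key in enumerate(zip(s, y)):
        groups[key].append(index)

    train_idx, test_idx = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        # Cut every group at the same percentage to keep its share constant.
        cut = round(len(members) * train_percentage)
        train_idx.extend(members[:cut])
        test_idx.extend(members[cut:])
    return train_idx, test_idx


s = [0] * 80 + [1] * 20
y = [0] * 50 + [1] * 30 + [0] * 10 + [1] * 10
train_idx, test_idx = proportional_split(s, y)
```

Because every group is cut at the same percentage, the joint distribution of s and y in the train split matches the one in the full data (up to rounding).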

class RandomSplit(train_percentage=0.8, start_seed=0)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Standard train test split.

The constructor takes the following arguments.

Parameters

train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split

class SequentialSplit(train_percentage)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Take the first N samples for train set and the rest for test set; no shuffle.

Parameters: train_percentage (float) –
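
A sequential split needs no randomness, so it reduces to slicing. The following sketch assumes the cut point is the rounded product of the sample count and train_percentage; `sequential_split` is a hypothetical helper, not the class itself.

```python
def sequential_split(rows, train_percentage):
    """Take the first rows for training and the remainder for testing, in order."""
    # Assumption: the cut point is rounded to the nearest whole sample.
    cut = round(len(rows) * train_percentage)
    return rows[:cut], rows[cut:]


train, test = sequential_split(list(range(10)), train_percentage=0.7)
```

Keeping the original order is useful when the rows are time-ordered and the test set should come strictly after the training set.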

bin_cont_feats(data)

Bin the continuous features.

Given a datatuple, bin the columns that have ordinal features and return a fresh DataTuple.

Parameters: data (ethicml.utility.data_structures.DataTuple) –
Return type: ethicml.utility.data_structures.DataTuple
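
As a rough illustration of binning, the sketch below assigns each value of one continuous column to an equal-width bin. The binning strategy and the number of bins are assumptions made for the example, since the documentation above does not specify them; `bin_equal_width` is a hypothetical helper.

```python
def bin_equal_width(values, n_bins=5):
    """Assign each value to one of n_bins equal-width bins (hypothetical helper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # A constant column has nothing to separate; put everything in bin 0.
        return [0] * len(values)
    # Clamp so the maximum value lands in the last bin rather than one past it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 52, 68]
bins = bin_equal_width(ages, n_bins=5)
```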

dataset_from_cond(dataset, cond)

Return the dataframe that meets some condition.

Parameters

dataset (pandas.DataFrame) –
cond (str) –

Return type

pandas.DataFrame
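
The documentation does not state the syntax of cond. Assuming it is a pandas query expression, filtering a DataFrame by a condition string could look like the sketch below; the column names are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"age": [23, 41, 35, 58], "sex": [0, 1, 1, 0]})

# Assumption: the condition is written in pandas query syntax.
cond = "(age > 30) & (sex == 1)"
subset = df.query(cond)
```

Here `subset` holds only the rows for which the condition is true, with their original index labels.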

domain_split(datatup, tr_cond, te_cond, seed=888)

Split a datatuple based on a condition.

Parameters

datatup (ethicml.utility.data_structures.DataTuple) – DataTuple
tr_cond (str) – condition for the training set
te_cond (str) – condition for the test set
seed (int) –

Returns

A tuple of two DataTuples: the train split and the test split. The test split contains all samples that meet the test condition, plus the same percentage again of the train set.

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]

get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, fixed_unbiased=True)

Split the given data into a biased subset and a debiased subset.

In contrast to get_biased_subset(), this function makes the unbiased subset truly unbiased.

The two subsets don’t generally sum up to the whole set.

Example behavior:

mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y
mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y
mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y

Parameters

data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

Returns

biased and unbiased dataset

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
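
The example behavior listed for this function can be reproduced with a small pure-Python sketch. It rests on several assumptions: rows are `(s, y)` tuples with binary values, the weighting formula is one scheme that matches the three listed cases rather than the library's confirmed formula, and `unbiased_pcnt` and `fixed_unbiased` are ignored. Both helpers are hypothetical.

```python
import random


def biased_subset(rows, mixing_factor, seed=42):
    """Keep a share of the s == y rows and of the s != y rows set by mixing_factor."""
    rng = random.Random(seed)
    equal = [r for r in rows if r[0] == r[1]]
    opposite = [r for r in rows if r[0] != r[1]]
    # Assumed weighting: at 0.0 only s == y rows survive, at 1.0 only s != y rows,
    # and at 0.5 both groups are kept in full.
    equal_fraction = min(1.0, 2 * (1 - mixing_factor))
    opposite_fraction = min(1.0, 2 * mixing_factor)
    kept_equal = rng.sample(equal, round(len(equal) * equal_fraction))
    kept_opposite = rng.sample(opposite, round(len(opposite) * opposite_fraction))
    return kept_equal + kept_opposite


def debiased_subset(rows, seed=42):
    """Downsample so that half of the rows have s == y and half have s != y."""
    rng = random.Random(seed)
    equal = [r for r in rows if r[0] == r[1]]
    opposite = [r for r in rows if r[0] != r[1]]
    n = min(len(equal), len(opposite))
    return rng.sample(equal, n) + rng.sample(opposite, n)


rows = [(0, 0)] * 30 + [(1, 1)] * 30 + [(0, 1)] * 20 + [(1, 0)] * 20
fully_biased = biased_subset(rows, mixing_factor=0.0)
unchanged = biased_subset(rows, mixing_factor=0.5)
fully_opposite = biased_subset(rows, mixing_factor=1.0)
debiased = debiased_subset(rows)
```

In this sketch the debiased subset does not depend on the mixing factor at all, which mirrors the behavior described for `fixed_unbiased=True`.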

get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, data_efficient=True)

Split the given data into a biased subset and a normal subset.

The two subsets don’t generally sum up to the whole set.

Example behavior:

mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data
mixing_factor=0.5: biased and unbiased are both just subsets of data
mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data

Parameters

data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible

Returns

biased and unbiased dataset

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]

query_dt(datatup, query_str)

Query a datatuple.

Parameters

datatup (ethicml.utility.data_structures.DataTuple) –
query_str (str) –

Return type

ethicml.utility.data_structures.DataTuple

scale_continuous(dataset, datatuple, scaler, inverse=False, fit=True)

Use a scaler on just the continuous features.

Parameters

dataset (ethicml.data.dataset.Dataset) – Dataset object. Used to find the continuous features.
datatuple (ethicml.utility.data_structures.DataTuple) – DataTuple on which to scale the continuous features.
scaler (ethicml.preprocessing.scaling.ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.
inverse (bool) – Should the scaling be reversed?
fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform.

Returns

Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.preprocessing.scaling.ScalerType]

Examples

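>>> from sklearn.preprocessing import StandardScaler
>>> from ethicml.data import adult
>>> from ethicml.preprocessing import scale_continuous, train_test_split
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()  # any scaler that follows the scikit-learn API
>>> train, scaler = scale_continuous(dataset, train, scaler)  # fit on train
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)  # reuse the fitted scaler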
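
How the inverse and fit flags interact can be sketched without any dependencies. The code below is an illustration under stated assumptions, not the library's implementation: `MeanStdScaler` is a tiny stand-in for a scikit-learn-style scaler, tables are plain dicts of columns, and `scale_columns` is a hypothetical helper that receives the continuous column names directly instead of reading them from a Dataset object.

```python
class MeanStdScaler:
    """Tiny stand-in for a scikit-learn-style scaler (illustration only)."""

    def fit(self, columns):
        self.stats_ = {}
        for name, values in columns.items():
            mean = sum(values) / len(values)
            std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
            # Fall back to 1.0 so constant columns do not cause division by zero.
            self.stats_[name] = (mean, std or 1.0)
        return self

    def transform(self, columns):
        return {
            name: [(v - self.stats_[name][0]) / self.stats_[name][1] for v in values]
            for name, values in columns.items()
        }

    def inverse_transform(self, columns):
        return {
            name: [v * self.stats_[name][1] + self.stats_[name][0] for v in values]
            for name, values in columns.items()
        }


def scale_columns(table, continuous, scaler, inverse=False, fit=True):
    """Scale only the named continuous columns; all other columns pass through."""
    selected = {name: table[name] for name in continuous}
    if inverse:
        scaled = scaler.inverse_transform(selected)
    else:
        if fit:
            scaler.fit(selected)
        scaled = scaler.transform(selected)
    return {**table, **scaled}, scaler


train = {"age": [20.0, 30.0, 40.0], "is_married": [0, 1, 1]}
test = {"age": [30.0, 50.0], "is_married": [1, 0]}

scaler = MeanStdScaler()
train_scaled, scaler = scale_columns(train, ["age"], scaler)           # fit, then transform
test_scaled, scaler = scale_columns(test, ["age"], scaler, fit=False)  # transform only
train_back, _ = scale_columns(train_scaled, ["age"], scaler, inverse=True)
```

Fitting on the train split only and reusing the fitted scaler for the test split keeps test statistics from leaking into the preprocessing.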

train_test_split(data, train_percentage=0.8, random_seed=0)

Split a data tuple into two datatuples along the rows of the DataFrames.

Parameters

data (ethicml.utility.data_structures.DataTuple) – data tuple to split
train_percentage (float) – percentage for train split
random_seed (int) – seed to make splitting reproducible

Returns

train split and test split

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
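
The role of random_seed can be shown with a minimal shuffle-and-cut sketch. It is an illustration under assumptions, not the function's actual code: `split_rows` is a hypothetical helper that works on a plain list, and the cut point is assumed to be the rounded product of the row count and train_percentage.

```python
import random


def split_rows(rows, train_percentage=0.8, random_seed=0):
    """Shuffle row indices with a fixed seed, then cut them into train and test."""
    rng = random.Random(random_seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    cut = round(len(rows) * train_percentage)
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test


rows = list(range(10))
train, test = split_rows(rows)
train_again, test_again = split_rows(rows)
```

Calling the helper twice with the same seed yields the same split, which is what makes experiments that depend on the split reproducible.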