Data preprocessing
This module contains algorithms that preprocess the data in some way.
Classes:
BalancedTestSplit – Split data such that the test set is balanced and the training set is proportional.
BiasedDebiasedSubsets – Split the given data into a biased subset and a debiased subset.
BiasedSubset – Split the given data into a biased subset and a normal subset.
DataSplitter – Base class for classes that split data.
LabelBinarizer – If a dataset has labels [-1,1], then this will make it so the labels = [0,1].
ProportionalSplit – Split into train and test while preserving the proportion of s and y.
RandomSplit – Standard train test split.
SequentialSplit – Take the first N samples for train set and the rest for test set; no shuffle.
Functions:
bin_cont_feats – Bin the continuous features.
dataset_from_cond – Return the dataframe that meets some condition.
domain_split – Split a datatuple based on a condition.
get_biased_and_debiased_subsets – Split the given data into a biased subset and a debiased subset.
get_biased_subset – Split the given data into a biased subset and a normal subset.
query_dt – Query a datatuple.
scale_continuous – Use a scaler on just the continuous features.
train_test_split – Split a DataTuple into two DataTuples along the rows of the DataFrames.
 class BalancedTestSplit(balance_type='P(sy)=0.5', train_percentage=0.8, start_seed=0)
Bases:
ethicml.preprocessing.train_test_split.RandomSplit
Split data such that the test set is balanced and the training set is proportional.
The constructor takes the following arguments.
 Parameters
balance_type (Literal['P(sy)=0.5', 'P(ys)=0.5', 'P(s,y)=0.25']) – how to do the balancing
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
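As an illustration of what "balanced" can mean for the 'P(s,y)=0.25' setting, the sketch below subsamples a pool of rows so that every (s, y) group ends up the same size. It is a plain-Python sketch, not EthicML's implementation: the helper name balance_by_group and the (x, s, y) row layout are assumptions made for this example only.

```python
import random
from collections import defaultdict


def balance_by_group(rows, seed=0):
    """Subsample rows so that every (s, y) group has the same size.

    Hypothetical helper for illustration; each row is an (x, s, y) tuple.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row[1], row[2])].append(row)
    # The smallest group limits how many rows each group can contribute.
    smallest = min(len(members) for members in groups.values())
    balanced = []
    for key in sorted(groups):
        # Draw the same number of rows from every group, without replacement.
        balanced.extend(rng.sample(groups[key], smallest))
    return balanced
```

With binary s and y and all four groups present, each group makes up a quarter of the returned rows. The training set would be left proportional, so no such subsampling is applied to it.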
 class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, fixed_unbiased=True)
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Split the given data into a biased subset and a debiased subset.
The constructor takes the following arguments.
 Parameters
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor
 class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, data_efficient=True)
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Split the given data into a biased subset and a normal subset.
The constructor takes the following arguments.
 Parameters
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible
 class DataSplitter
Bases:
abc.ABC
Base class for classes that split data.
 class LabelBinarizer
Bases:
object
If a dataset has labels [-1,1], then this will make it so the labels = [0,1].
 Return type
None
 adjust(dataset)
Take a datatuple and make the labels [0,1].
 Parameters
dataset (ethicml.utility.data_structures.DataTuple) –
 Return type
ethicml.utility.data_structures.DataTuple
 post(dataset)
Inverse of adjust.
 Parameters
dataset (ethicml.utility.data_structures.DataTuple) –
 Return type
ethicml.utility.data_structures.DataTuple
 post_only_labels(labels)
Inverse of adjust, but for a Series of labels instead of a DataTuple.
 Parameters
labels (pandas.Series) –
 Return type
pandas.Series
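The adjust/post round trip can be pictured with a small plain-Python stand-in. This is only a sketch under assumptions: the class name SimpleLabelBinarizer is hypothetical, it works on plain lists rather than a DataTuple, and it assumes exactly two distinct label values (for example -1 and 1), mapping the smaller one to 0 and the larger one to 1.

```python
class SimpleLabelBinarizer:
    """Hypothetical stand-in for illustration, not EthicML's LabelBinarizer."""

    def __init__(self):
        self.min_val = None
        self.max_val = None

    def adjust(self, labels):
        """Map the two label values to 0 and 1 and remember the originals."""
        values = sorted(set(labels))
        if len(values) != 2:
            raise ValueError("expected exactly two distinct label values")
        self.min_val, self.max_val = values
        return [0 if v == self.min_val else 1 for v in labels]

    def post(self, labels):
        """Inverse of adjust: map 0 and 1 back to the original values."""
        if self.min_val is None:
            raise RuntimeError("call adjust() before post()")
        return [self.min_val if v == 0 else self.max_val for v in labels]
```

A typical use is to call adjust before training a model that expects labels in [0,1], then call post on the predictions so they are reported in the original label space.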
 class ProportionalSplit(train_percentage=0.8, start_seed=0)
Bases:
ethicml.preprocessing.train_test_split.RandomSplit
Split into train and test while preserving the proportion of s and y.
The constructor takes the following arguments.
 Parameters
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
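Preserving the proportion of s and y amounts to a stratified split: each (s, y) group is split separately with the same train percentage. The sketch below shows the idea in plain Python. The function name proportional_split and the dict-based rows are assumptions for illustration and do not reflect how EthicML implements the class.

```python
import random
from collections import defaultdict


def proportional_split(rows, train_percentage=0.8, seed=0):
    """Split rows so each (s, y) group keeps its share in train and test.

    Hypothetical helper; each row is a dict with at least the keys "s" and "y".
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[(row["s"], row["y"])].append(row)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        # Apply the same train percentage inside every group.
        cut = round(len(members) * train_percentage)
        train.extend(members[:cut])
        test.extend(members[cut:])
    return train, test
```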
 class RandomSplit(train_percentage=0.8, start_seed=0)
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Standard train test split.
The constructor takes the following arguments.
 Parameters
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
 class SequentialSplit(train_percentage)
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Take the first N samples for train set and the rest for test set; no shuffle.
 Parameters
train_percentage (float) –
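A split without shuffling is just a slice at the train boundary. The following minimal sketch works on a plain list; the function name sequential_split and the rounding of the boundary are assumptions for illustration.

```python
def sequential_split(rows, train_percentage):
    """Take the first N rows for train and the rest for test; no shuffle.

    Hypothetical helper; N is the train share of the rows, rounded.
    """
    n_train = round(len(rows) * train_percentage)
    return rows[:n_train], rows[n_train:]
```

Because the order is kept, this kind of split suits data with a natural ordering, such as time-stamped records, where the test set should come after the training set.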
 bin_cont_feats(data)
Bin the continuous features.
Given a datatuple, bin the columns that have ordinal features and return them as a fresh DataTuple.
 Parameters
data (ethicml.utility.data_structures.DataTuple) –
 Return type
ethicml.utility.data_structures.DataTuple
 dataset_from_cond(dataset, cond)
Return the dataframe that meets some condition.
 Parameters
dataset (pandas.DataFrame) –
cond (str) –
 Return type
pandas.DataFrame
 domain_split(datatup, tr_cond, te_cond, seed=888)
Split a datatuple based on a condition.
 Parameters
datatup (ethicml.utility.data_structures.DataTuple) – DataTuple
tr_cond (str) – condition for the training set
te_cond (str) – condition for the test set
seed (int) –
 Returns
Tuple of DataTuple split into train and test. The test is all those that meet the test condition plus the same percentage again of the train set.
 Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
 get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, fixed_unbiased=True)
Split the given data into a biased subset and a debiased subset.
In contrast to get_biased_subset(), this function makes the unbiased subset really unbiased. The two subsets don’t generally sum up to the whole set.
Example behavior:
mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y
mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y
mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y
 Parameters
data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor
 Returns
biased and unbiased dataset
 Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
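The example behavior listed above can be reproduced with a simplified plain-Python sketch. It is not EthicML's algorithm: the helper names are hypothetical, rows are reduced to (s, y) pairs, and unbiased_pcnt and fixed_unbiased are ignored, so the two parts returned here may overlap.

```python
import random


def split_by_agreement(rows, rng):
    """Shuffle rows and separate those with s == y from those with s != y."""
    same = [r for r in rows if r[0] == r[1]]
    diff = [r for r in rows if r[0] != r[1]]
    rng.shuffle(same)
    rng.shuffle(diff)
    return same, diff


def biased_part(rows, mixing_factor, seed=42):
    """Keep a (1 - mixing_factor) share of s == y rows and a mixing_factor share of s != y rows."""
    same, diff = split_by_agreement(rows, random.Random(seed))
    n_same = round(len(same) * (1 - mixing_factor))
    n_diff = round(len(diff) * mixing_factor)
    return same[:n_same] + diff[:n_diff]


def debiased_part(rows, seed=42):
    """Keep equally many s == y and s != y rows, so each makes up 50%."""
    same, diff = split_by_agreement(rows, random.Random(seed))
    n = min(len(same), len(diff))
    return same[:n] + diff[:n]
```

In this sketch, mixing_factor=0.0 leaves only rows with s == y, mixing_factor=1.0 leaves only rows with s != y, and mixing_factor=0.5 takes half of each group, which is an ordinary subset of the data.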
 get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, data_efficient=True)
Split the given data into a biased subset and a normal subset.
The two subsets don’t generally sum up to the whole set.
Example behavior:
mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data
mixing_factor=0.5: biased and unbiased are both just subsets of data
mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data
 Parameters
data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible
 Returns
biased and unbiased dataset
 Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
 query_dt(datatup, query_str)
Query a datatuple.
 Parameters
datatup (ethicml.utility.data_structures.DataTuple) –
query_str (str) –
 Return type
 scale_continuous(dataset, datatuple, scaler, inverse=False, fit=True)
Use a scaler on just the continuous features.
 Parameters
dataset (ethicml.data.dataset.Dataset) – Dataset object. Used to find the continuous features.
datatuple (ethicml.utility.data_structures.DataTuple) – DataTuple on which to scale the continuous features.
scaler (ethicml.preprocessing.scaling.ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.
inverse (bool) – Should the scaling be reversed?
fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform.
 Returns
Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).
 Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.preprocessing.scaling.ScalerType]
Examples
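>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()  # any scaler that follows the scikit-learn API
>>> train, scaler = scale_continuous(dataset, train, scaler)
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)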
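To make the roles of fit and inverse concrete, here is a self-contained plain-Python sketch of the same control flow. Everything in it is an assumption for illustration: MiniStandardScaler is a hypothetical scaler that mimics the fit/transform/inverse_transform methods, tables are dicts of column lists rather than DataTuples, and the continuous column names are passed in directly instead of being read from a Dataset object.

```python
import statistics


class MiniStandardScaler:
    """Hypothetical scaler with fit, transform and inverse_transform."""

    def fit(self, columns):
        # Store mean and standard deviation per column; avoid dividing by zero.
        self.stats = {
            name: (statistics.fmean(vals), statistics.pstdev(vals) or 1.0)
            for name, vals in columns.items()
        }
        return self

    def transform(self, columns):
        return {
            name: [(x - self.stats[name][0]) / self.stats[name][1] for x in vals]
            for name, vals in columns.items()
        }

    def inverse_transform(self, columns):
        return {
            name: [x * self.stats[name][1] + self.stats[name][0] for x in vals]
            for name, vals in columns.items()
        }


def scale_continuous_sketch(continuous, table, scaler, inverse=False, fit=True):
    """Scale only the continuous columns; leave all other columns untouched."""
    cont = {name: table[name] for name in continuous}
    if inverse:
        scaled = scaler.inverse_transform(cont)
    elif fit:
        scaled = scaler.fit(cont).transform(cont)
    else:
        scaled = scaler.transform(cont)
    return {**table, **scaled}, scaler
```

Fitting on the training data only and passing fit=False for the test data keeps test statistics from leaking into the scaler, which is the pattern the example above follows.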
 train_test_split(data, train_percentage=0.8, random_seed=0)
Split a DataTuple into two DataTuples along the rows of the DataFrames.
 Parameters
data (ethicml.utility.data_structures.DataTuple) – data tuple to split
train_percentage (float) – percentage for train split
random_seed (int) – seed to make splitting reproducible
 Returns
train split and test split
 Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
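A seeded row-wise split can be sketched as shuffling the row indices once and cutting at the train boundary. The version below works on a plain list and is only an illustration; the function name train_test_split_sketch is hypothetical and the shuffling method is an assumption, not necessarily what EthicML does internally.

```python
import random


def train_test_split_sketch(rows, train_percentage=0.8, random_seed=0):
    """Shuffle row indices with a fixed seed, then split them into train and test."""
    indices = list(range(len(rows)))
    random.Random(random_seed).shuffle(indices)
    n_train = round(len(rows) * train_percentage)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test
```

Using the same random_seed reproduces the same split, and every row lands in exactly one of the two parts.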