Data preprocessing

This module contains algorithms that pre-process the data in some way.




class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)

Bases: ethicml.preprocessing.train_test_split.RandomSplit

Split data such that the test set is balanced and the training set is proportional.

The constructor takes the following arguments.

  • balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing

  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split
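To make the balance_type options more concrete, here is a minimal sketch in plain Python of what a test set balanced as P(s|y)=0.5 looks like: within each class label y, both values of s occur equally often. This is not EthicML's implementation. The helper name balance_s_given_y and the (s, y) tuple format are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def balance_s_given_y(samples, seed=0):
    """Downsample (s, y) pairs so that, for each y, every s value is equally frequent.

    Hypothetical helper for illustration; assumes every s value occurs for every y.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for s, y in samples:
        groups[(s, y)].append((s, y))

    balanced = []
    for y in sorted({y for _, y in samples}):
        s_values = sorted({s for s, y_ in samples if y_ == y})
        # the smallest s-group for this y limits how many samples each group keeps
        n_keep = min(len(groups[(s, y)]) for s in s_values)
        for s in s_values:
            balanced.extend(rng.sample(groups[(s, y)], n_keep))
    return balanced


# toy data in which s and y are strongly correlated
samples = [(0, 0)] * 30 + [(1, 0)] * 10 + [(0, 1)] * 5 + [(1, 1)] * 25
balanced = balance_s_given_y(samples)
```

The other two options would condition the other way round (P(y|s)=0.5) or equalise all four (s, y) groups (P(s,y)=0.25); the sketch covers only the first case.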

class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, fixed_unbiased=True)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Split the given data into a biased subset and a debiased subset.

The constructor takes the following arguments.

  • mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, data_efficient=True)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Split the given data into a biased subset and a normal subset.

The constructor takes the following arguments.

  • mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • data_efficient (bool) – if True, try to keep as many data points as possible

class DataSplitter

Bases: abc.ABC

Base class for classes that split data.

class LabelBinarizer

Bases: object

If a dataset has labels [-1, 1], this remaps them to [0, 1].

Return type



Take a datatuple and make the labels [0,1].


dataset (ethicml.utility.data_structures.DataTuple) –

Return type



Inverse of adjust.


dataset (ethicml.utility.data_structures.DataTuple) –

Return type



Inverse of adjust, but for a pandas Series of labels only instead of a whole DataTuple.


labels (pandas.Series) –

Return type

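The adjust-and-invert round trip described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: the class name TinyLabelBinarizer is hypothetical, and it works on plain lists of labels rather than on a DataTuple.

```python
class TinyLabelBinarizer:
    """Hypothetical stand-in that remaps two-valued labels to 0/1 and back."""

    def __init__(self):
        self.min_val = 0
        self.max_val = 1

    def adjust(self, labels):
        """Map the smaller label to 0 and the larger label to 1."""
        values = sorted(set(labels))
        if len(values) != 2:
            raise ValueError("expected exactly two distinct label values")
        self.min_val, self.max_val = values
        return [0 if label == self.min_val else 1 for label in labels]

    def post(self, labels):
        """Inverse of adjust: map 0/1 back to the original label values."""
        return [self.min_val if label == 0 else self.max_val for label in labels]


binarizer = TinyLabelBinarizer()
original = [-1, 1, 1, -1]
adjusted = binarizer.adjust(original)
restored = binarizer.post(adjusted)
```

The binarizer has to remember the original label values, which is why the mapping lives on an object rather than in two free functions.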

class ProportionalSplit(train_percentage=0.8, start_seed=0)

Bases: ethicml.preprocessing.train_test_split.RandomSplit

Split into train and test while preserving the proportion of s and y.

The constructor takes the following arguments.

  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split
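"Preserving the proportion of s and y" amounts to a stratified split over the joint (s, y) groups. The following is a minimal sketch of that idea in plain Python, assuming samples are given as (s, y) tuples. The function name proportional_split is hypothetical, and EthicML's own implementation may differ in detail.

```python
import random
from collections import defaultdict


def proportional_split(samples, train_percentage=0.8, seed=0):
    """Split each (s, y) group separately so both splits keep the group proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample].append(sample)  # a sample is an (s, y) tuple

    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)
        n_train = round(len(members) * train_percentage)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


samples = [(0, 0)] * 40 + [(1, 0)] * 10 + [(0, 1)] * 20 + [(1, 1)] * 30
train, test = proportional_split(samples, train_percentage=0.8)
```

Each group contributes 80% of its members to the train split, so a rare group such as (1, 0) is represented in both splits in the same ratio as in the full data.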

class RandomSplit(train_percentage=0.8, start_seed=0)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Standard train test split.

The constructor takes the following arguments.

  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split

class SequentialSplit(train_percentage)

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Take the first N samples for train set and the rest for test set; no shuffle.


train_percentage (float) –
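As a rough illustration of a split without shuffling, the sketch below cuts a list of rows at the position given by train_percentage. The function name sequential_split and the rounding rule are assumptions; the source does not state how N is derived from the percentage.

```python
def sequential_split(rows, train_percentage):
    """Take the first N rows for training and the rest for testing, keeping the order."""
    n_train = round(len(rows) * train_percentage)  # rounding rule is an assumption
    return rows[:n_train], rows[n_train:]


rows = list(range(10))
train_rows, test_rows = sequential_split(rows, train_percentage=0.7)
```

Because the order is preserved, this kind of split suits data where later rows must not leak into training, such as time-ordered records.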


Bin the continuous features.

Given a datatuple, bin the columns that have ordinal features and return the result as a fresh DataTuple.


data (ethicml.utility.data_structures.DataTuple) –

Return type

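The source does not say how the bins are chosen. Purely as an illustration of what binning a continuous column means, the sketch below assumes equal-width bins and returns a bin index per value. The function name bin_equal_width and the default of five bins are assumptions, not part of the documented API.

```python
def bin_equal_width(values, n_bins=5):
    """Assign each value to one of n_bins equal-width bins (indices 0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # all values are identical, so everything falls into the first bin
        return [0 for _ in values]
    # the maximum would land in bin n_bins, so clamp it into the last bin
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 33, 47, 52, 68]
age_bins = bin_equal_width(ages, n_bins=5)
```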

dataset_from_cond(dataset, cond)

Return the dataframe that meets some condition.

  • dataset (pandas.DataFrame) –

  • cond (str) –

Return type


domain_split(datatup, tr_cond, te_cond, seed=888)

Splits a datatuple based on a condition.


A tuple of two DataTuples, split into train and test. The test set contains all samples that meet the test condition, plus the same percentage again of the train set.

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]

get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, fixed_unbiased=True)

Split the given data into a biased subset and a debiased subset.

In contrast to get_biased_subset(), this function makes the unbiased subset really unbiased.

The two subsets don’t generally sum up to the whole set.

Example behavior:

  • mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y

  • mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y

  • mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y

  • data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple

  • mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor


biased and unbiased dataset

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
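The example behavior listed above can be reproduced with a small sketch in plain Python. It is an illustration under stated assumptions, not the library's algorithm: samples are (s, y) tuples, the function name biased_and_debiased is hypothetical, and the fixed_unbiased option is ignored.

```python
import random


def biased_and_debiased(samples, mixing_factor, unbiased_pcnt, seed=42):
    """Return (biased, debiased) subsets of (s, y) samples."""
    rng = random.Random(seed)
    same = [p for p in samples if p[0] == p[1]]  # s == y
    diff = [p for p in samples if p[0] != p[1]]  # s != y
    rng.shuffle(same)
    rng.shuffle(diff)

    # reserve the tail of each group for the debiased subset
    cut_same = round(len(same) * (1 - unbiased_pcnt))
    cut_diff = round(len(diff) * (1 - unbiased_pcnt))
    same_b, same_d = same[:cut_same], same[cut_same:]
    diff_b, diff_d = diff[:cut_diff], diff[cut_diff:]

    # biased: keep (1 - mixing_factor) of the s == y pool and mixing_factor of the s != y pool
    biased = (
        same_b[: round(len(same_b) * (1 - mixing_factor))]
        + diff_b[: round(len(diff_b) * mixing_factor)]
    )
    # debiased: equally many s == y and s != y samples
    n = min(len(same_d), len(diff_d))
    debiased = same_d[:n] + diff_d[:n]
    return biased, debiased


samples = [(0, 0)] * 30 + [(1, 1)] * 30 + [(0, 1)] * 20 + [(1, 0)] * 20
biased_0, debiased_0 = biased_and_debiased(samples, mixing_factor=0.0, unbiased_pcnt=0.5)
biased_1, debiased_1 = biased_and_debiased(samples, mixing_factor=1.0, unbiased_pcnt=0.5)
```

In this sketch the biased and debiased subsets together contain fewer samples than the input, which matches the remark that the two subsets do not generally sum up to the whole set.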

get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, data_efficient=True)

Split the given data into a biased subset and a normal subset.

The two subsets don’t generally sum up to the whole set.

Example behavior:

  • mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data

  • mixing_factor=0.5: biased and unbiased are both just subsets of data

  • mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data

  • data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple

  • mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • data_efficient (bool) – if True, try to keep as many data points as possible


biased and unbiased dataset

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]

query_dt(datatup, query_str)

Query a datatuple.

Return type


scale_continuous(dataset, datatuple, scaler, inverse=False, fit=True)

Use a scaler on just the continuous features.

  • dataset – Dataset object. Used to find the continuous features.

  • datatuple (ethicml.utility.data_structures.DataTuple) – DataTuple on which to scale the continuous features.

  • scaler (ethicml.preprocessing.scaling.ScalerType) – Scaler object to scale the features. Must follow the scikit-learn scaler API.

  • inverse (bool) – Should the scaling be reversed?

  • fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform.


Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.preprocessing.scaling.ScalerType]
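How the inverse and fit flags interact can be sketched without any dependencies. The code below is an illustration only: MinMaxScalerSketch is a hypothetical stand-in for a scaler with fit_transform, transform and inverse_transform methods, tables are plain dicts of columns rather than DataTuples, and the list of continuous feature names is passed in directly instead of being read from a Dataset object.

```python
class MinMaxScalerSketch:
    """Hypothetical scaler working on a dict of columns (name -> list of numbers)."""

    def fit(self, columns):
        self.ranges = {name: (min(col), max(col)) for name, col in columns.items()}
        return self

    def transform(self, columns):
        out = {}
        for name, col in columns.items():
            lo, hi = self.ranges[name]
            span = (hi - lo) or 1.0  # avoid division by zero for constant columns
            out[name] = [(v - lo) / span for v in col]
        return out

    def fit_transform(self, columns):
        return self.fit(columns).transform(columns)

    def inverse_transform(self, columns):
        out = {}
        for name, col in columns.items():
            lo, hi = self.ranges[name]
            span = (hi - lo) or 1.0
            out[name] = [v * span + lo for v in col]
        return out


def scale_continuous_sketch(continuous_features, table, scaler, inverse=False, fit=True):
    """Apply the scaler to the continuous columns only; leave all other columns untouched."""
    cont = {name: table[name] for name in continuous_features}
    if inverse:
        scaled = scaler.inverse_transform(cont)
    elif fit:
        scaled = scaler.fit_transform(cont)
    else:
        scaled = scaler.transform(cont)
    return {**table, **scaled}, scaler


train = {"age": [20, 30, 40], "sex": [0, 1, 0]}
test = {"age": [25, 50], "sex": [1, 1]}
train_scaled, scaler = scale_continuous_sketch(["age"], train, MinMaxScalerSketch())
test_scaled, _ = scale_continuous_sketch(["age"], test, scaler, fit=False)
train_back, _ = scale_continuous_sketch(["age"], train_scaled, scaler, inverse=True)
```

Fitting on the train split and passing fit=False for the test split keeps test-set statistics out of the scaler, which is why a test value can fall outside the [0, 1] range here.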


>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()  # any scaler following the scikit-learn API
>>> train, scaler = scale_continuous(dataset, train, scaler)
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)

train_test_split(data, train_percentage=0.8, random_seed=0)

Split a data tuple into two datatuple along the rows of the DataFrames.


train split and test split

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
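Splitting "along the rows" with a fixed seed can be sketched as a seeded shuffle of row indices followed by a cut. This is a plain-Python illustration rather than the library's code: the function name random_row_split is hypothetical, rows are a simple list, and the rounding rule is an assumption.

```python
import random


def random_row_split(rows, train_percentage=0.8, random_seed=0):
    """Shuffle the row indices with a fixed seed, then cut them into train and test."""
    indices = list(range(len(rows)))
    random.Random(random_seed).shuffle(indices)
    n_train = round(len(rows) * train_percentage)  # rounding rule is an assumption
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(10))
train_rows, test_rows = random_row_split(rows)
train_again, test_again = random_row_split(rows)
```

Using the same seed gives the same split on every call, which keeps experiments reproducible; a different seed gives a different partition of the same rows.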