Data preprocessing#

This module contains algorithms that pre-process the data in some way.

Classes:

BalancedTestSplit

Split data such that the test set is balanced and the training set is proportional.

BiasedDebiasedSubsets

Split the given data into a biased subset and a debiased subset.

BiasedSubset

Split the given data into a biased subset and a normal subset.

DataSplitter

Base class for classes that split data.

LabelBinarizer

If a dataset has labels in [-1, 1], remap them to [0, 1].

ProportionalSplit

Split into train and test while preserving the proportion of s and y.

RandomSplit

Standard train test split.

SequentialSplit

Take the first N samples for train set and the rest for test set; no shuffle.

Functions:

bin_cont_feats

Bin the continuous features.

dataset_from_cond

Return the dataframe that meets some condition.

domain_split

Split a DataTuple based on a condition.

get_biased_and_debiased_subsets

Split the given data into a biased subset and a debiased subset.

get_biased_subset

Split the given data into a biased subset and a normal subset.

query_dt

Query a datatuple.

scale_continuous

Use a scaler on just the continuous features.

train_test_split

Split a DataTuple into two DataTuples along the rows of the DataFrames.

class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)#

Bases: ethicml.preprocessing.train_test_split.RandomSplit

Split data such that the test set is balanced and the training set is proportional.

The constructor takes the following arguments.

Parameters
  • balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing

  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split
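
The three balance_type options can be read as constraints on how many samples of each (s, y) group end up in the test set. The following is a minimal sketch of that reading, assuming binary s and y and that all four groups are present; the helper `balanced_counts` is hypothetical and is not part of EthicML, and the library's actual procedure may differ.

```python
def balanced_counts(counts, balance_type):
    """Return how many samples of each (s, y) group to keep.

    `counts` maps (s, y) pairs to the number of available samples.
    Hypothetical helper; not EthicML's implementation.
    """
    s_values = sorted({s for s, _ in counts})
    y_values = sorted({y for _, y in counts})
    if balance_type == "P(s|y)=0.5":
        # Within every y, keep equally many samples for each s.
        return {
            (s, y): min(counts[(other_s, y)] for other_s in s_values)
            for s in s_values
            for y in y_values
        }
    if balance_type == "P(y|s)=0.5":
        # Within every s, keep equally many samples for each y.
        return {
            (s, y): min(counts[(s, other_y)] for other_y in y_values)
            for s in s_values
            for y in y_values
        }
    if balance_type == "P(s,y)=0.25":
        # All four groups are cut down to the size of the smallest one.
        smallest = min(counts.values())
        return {key: smallest for key in counts}
    raise ValueError(f"unknown balance type: {balance_type}")


counts = {(0, 0): 40, (0, 1): 10, (1, 0): 20, (1, 1): 30}
per_y = balanced_counts(counts, "P(s|y)=0.5")
per_s = balanced_counts(counts, "P(y|s)=0.5")
joint = balanced_counts(counts, "P(s,y)=0.25")
```

In this sketch balancing only ever discards samples, so a balanced test set is never larger than the unbalanced candidate it was drawn from.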

class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, fixed_unbiased=True)#

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Split the given data into a biased subset and a debiased subset.

The constructor takes the following arguments.

Parameters
  • mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, data_efficient=True)#

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Split the given data into a biased subset and a normal subset.

The constructor takes the following arguments.

Parameters
  • mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • data_efficient (bool) – if True, try to keep as many data points as possible

class DataSplitter#

Bases: abc.ABC

Base class for classes that split data.

class LabelBinarizer#

Bases: object

If a dataset has labels in [-1, 1], remap them to [0, 1].

adjust(dataset)#

Take a datatuple and make the labels [0,1].

Parameters

dataset (ethicml.utility.data_structures.DataTuple) –

Return type

ethicml.utility.data_structures.DataTuple

post(dataset)#

Inverse of adjust.

Parameters

dataset (ethicml.utility.data_structures.DataTuple) –

Return type

ethicml.utility.data_structures.DataTuple

post_only_labels(labels)#

Inverse of adjust, but for a Series of labels instead of a DataTuple.

Parameters

labels (pandas.Series) –

Return type

pandas.Series
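
The relabelling and its inverse can be illustrated without EthicML. This is a minimal sketch on a plain list, assuming the two label values are exactly -1 and 1; the function names `to_zero_one` and `from_zero_one` are hypothetical stand-ins for the direction of adjust and post, not library functions.

```python
def to_zero_one(labels):
    """Map labels in {-1, 1} to {0, 1} (the direction of `adjust`)."""
    return [(label + 1) // 2 for label in labels]


def from_zero_one(labels):
    """Map labels in {0, 1} back to {-1, 1} (the direction of `post`)."""
    return [2 * label - 1 for label in labels]


original = [-1, 1, 1, -1]
binarized = to_zero_one(original)
restored = from_zero_one(binarized)
```

Applying the two mappings in sequence returns the original labels, which is the property that makes post usable on predictions produced from adjusted data.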

class ProportionalSplit(train_percentage=0.8, start_seed=0)#

Bases: ethicml.preprocessing.train_test_split.RandomSplit

Split into train and test while preserving the proportion of s and y.

The constructor takes the following arguments.

Parameters
  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split
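
Preserving the proportion of s and y amounts to a stratified split over the (s, y) groups. Below is a minimal sketch of that idea on a list of dictionaries, assuming rounding to the nearest integer per group; `proportional_split` is a hypothetical helper and not the library's code.

```python
import random
from collections import Counter, defaultdict


def proportional_split(records, train_percentage=0.8, seed=0):
    """Split records so every (s, y) group keeps its share in both parts."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[(record["s"], record["y"])].append(record)

    train, test = [], []
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)  # shuffle within the group only
        n_train = round(len(members) * train_percentage)
        train.extend(members[:n_train])
        test.extend(members[n_train:])
    return train, test


records = (
    [{"s": 0, "y": 0}] * 40
    + [{"s": 0, "y": 1}] * 10
    + [{"s": 1, "y": 0}] * 20
    + [{"s": 1, "y": 1}] * 30
)
train, test = proportional_split(records)
train_counts = Counter((r["s"], r["y"]) for r in train)
test_counts = Counter((r["s"], r["y"]) for r in test)
```

With these inputs each group contributes 80% of its members to the train part, so the group ratios 4:1:2:3 are identical in both parts.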

class RandomSplit(train_percentage=0.8, start_seed=0)#

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Standard train test split.

The constructor takes the following arguments.

Parameters
  • train_percentage (float) – how much of the data to use for the train split

  • start_seed (int) – random seed for the first split

class SequentialSplit(train_percentage)#

Bases: ethicml.preprocessing.train_test_split.DataSplitter

Take the first N samples for train set and the rest for test set; no shuffle.

Parameters

train_percentage (float) –
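
A split without shuffling reduces to slicing. The sketch below shows this on a plain list; `sequential_split` is a hypothetical helper, and rounding N to the nearest integer is an assumption rather than documented behaviour.

```python
def sequential_split(rows, train_percentage):
    """Return (train, test): the first N rows, then the remainder, unshuffled."""
    n_train = round(len(rows) * train_percentage)
    return rows[:n_train], rows[n_train:]


rows = list(range(10))
train, test = sequential_split(rows, train_percentage=0.8)
```

Because the original order is kept, this kind of split suits data where later rows must not leak into training, such as time-ordered records.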

bin_cont_feats(data)#

Bin the continuous features.

Given a DataTuple, bin the columns that hold ordinal features and return the result as a fresh DataTuple.

Parameters

data (ethicml.utility.data_structures.DataTuple) –

Return type

ethicml.utility.data_structures.DataTuple

dataset_from_cond(dataset, cond)#

Return the dataframe that meets some condition.

Parameters
  • dataset (pandas.DataFrame) –

  • cond (str) –

Return type

pandas.DataFrame

domain_split(datatup, tr_cond, te_cond, seed=888)#

Split a DataTuple based on a condition.

Parameters
Returns

Tuple of DataTuple split into train and test. The test is all those that meet the test condition plus the same percentage again of the train set.

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]

get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, fixed_unbiased=True)#

Split the given data into a biased subset and a debiased subset.

In contrast to get_biased_subset(), this function actually debiases the second subset, so that it contains s=y and s!=y samples in equal proportion.

The two subsets don’t generally sum up to the whole set.

Example behavior:

  • mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y

  • mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y

  • mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y

Parameters
  • data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple

  • mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor

Returns

biased and unbiased dataset

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
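
The debiased side described above (50% s=y and 50% s!=y) can be pictured as downsampling the larger of the two groups. This is a minimal sketch under that assumption; `debiased_part` is a hypothetical helper, and the library may construct the subset differently.

```python
import random


def debiased_part(records, seed=42):
    """Return a subset in which s == y and s != y are equally frequent."""
    rng = random.Random(seed)
    equal = [r for r in records if r["s"] == r["y"]]
    opposite = [r for r in records if r["s"] != r["y"]]
    # Keep as many samples per group as the smaller group can supply.
    n_per_group = min(len(equal), len(opposite))
    return rng.sample(equal, n_per_group) + rng.sample(opposite, n_per_group)


records = (
    [{"s": 1, "y": 1}] * 30   # s == y
    + [{"s": 0, "y": 1}] * 10  # s != y
)
debiased = debiased_part(records)
n_equal = sum(r["s"] == r["y"] for r in debiased)
n_opposite = sum(r["s"] != r["y"] for r in debiased)
```

Downsampling discards data, which is one reason the two returned subsets need not add up to the whole set.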

get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, data_efficient=True)#

Split the given data into a biased subset and a normal subset.

The two subsets don’t generally sum up to the whole set.

Example behavior:

  • mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data

  • mixing_factor=0.5: biased and unbiased are both just subsets of data

  • mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data

Parameters
  • data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple

  • mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.

  • unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset

  • seed (int) – random seed for the splitting

  • data_efficient (bool) – if True, try to keep as many data points as possible

Returns

biased and unbiased dataset

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
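
The effect of mixing_factor on the biased subset can be sketched as keeping a fraction 1 - mixing_factor of the s=y samples and a fraction mixing_factor of the s!=y samples. The helper `biased_part` below is hypothetical and covers only the biased side; it ignores unbiased_pcnt and data_efficient and should not be read as the library's implementation.

```python
import random


def biased_part(records, mixing_factor, seed=42):
    """Sample a subset whose s/y agreement is controlled by mixing_factor."""
    rng = random.Random(seed)
    equal = [r for r in records if r["s"] == r["y"]]
    opposite = [r for r in records if r["s"] != r["y"]]
    n_equal = round(len(equal) * (1 - mixing_factor))
    n_opposite = round(len(opposite) * mixing_factor)
    return rng.sample(equal, n_equal) + rng.sample(opposite, n_opposite)


records = (
    [{"s": i % 2, "y": i % 2} for i in range(20)]        # s == y
    + [{"s": i % 2, "y": 1 - i % 2} for i in range(20)]  # s != y
)
all_equal = biased_part(records, mixing_factor=0.0)
half_half = biased_part(records, mixing_factor=0.5)
all_opposite = biased_part(records, mixing_factor=1.0)
```

The three calls reproduce the listed behaviours: s=y everywhere at 0.0, an even mixture at 0.5, and s!=y everywhere at 1.0.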

query_dt(datatup, query_str)#

Query a datatuple.

Parameters
Return type

ethicml.utility.data_structures.DataTuple

scale_continuous(dataset, datatuple, scaler, inverse=False, fit=True)#

Use a scaler on just the continuous features.

Parameters
  • dataset (ethicml.data.dataset.Dataset) – Dataset object. Used to find the continuous features.

  • datatuple (ethicml.utility.data_structures.DataTuple) – DataTuple on which to scale the continuous features.

  • scaler (ethicml.preprocessing.scaling.ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.

  • inverse (bool) – Should the scaling be reversed?

  • fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform.

Returns

Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.preprocessing.scaling.ScalerType]
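
The interplay of fit and the restriction to continuous features can be illustrated independently of EthicML and scikit-learn. The sketch below uses a dict of column lists in place of a DataTuple and a tiny stand-in scaler; `TinyStandardScaler` and `scale_columns` are hypothetical names, and using one scaler per column is a simplification of the scikit-learn API.

```python
from statistics import mean, pstdev


class TinyStandardScaler:
    """Hypothetical stand-in for a scikit-learn-style scaler."""

    def fit(self, column):
        self.mean_ = mean(column)
        self.scale_ = pstdev(column) or 1.0  # avoid division by zero
        return self

    def transform(self, column):
        return [(value - self.mean_) / self.scale_ for value in column]


def scale_columns(table, continuous, scalers, fit=True):
    """Scale only the named continuous columns of a dict-of-lists table."""
    scaled = dict(table)  # untouched columns are carried over as they are
    for name in continuous:
        if fit:
            scalers[name] = TinyStandardScaler().fit(table[name])
        scaled[name] = scalers[name].transform(table[name])
    return scaled, scalers


train = {"age": [20.0, 30.0, 40.0], "is_married": [0, 1, 1]}
test = {"age": [30.0, 50.0], "is_married": [1, 0]}
train_scaled, scalers = scale_columns(train, ["age"], {}, fit=True)
test_scaled, _ = scale_columns(test, ["age"], scalers, fit=False)
```

Fitting on the train split and only transforming the test split keeps test statistics out of the scaling parameters, which is the reason for the fit=False call.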

Examples

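>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()  # any scaler following the scikit-learn API
>>> train, scaler = scale_continuous(dataset, train, scaler)  # fit on train
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)  # reuse the fitted scaler

train_test_split(data, train_percentage=0.8, random_seed=0)#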

Split a DataTuple into two DataTuples along the rows of the DataFrames.

Parameters
Returns

train split and test split

Return type

Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
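
A row-wise split with a fixed seed can be sketched as shuffling the row indices and cutting them at the train percentage. `shuffled_split` below is a hypothetical helper on a plain list; it mirrors the parameter names of train_test_split but makes no claim about how the library shuffles.

```python
import random


def shuffled_split(rows, train_percentage=0.8, random_seed=0):
    """Shuffle row indices with a fixed seed, then cut into train and test."""
    rng = random.Random(random_seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(len(rows) * train_percentage)
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(10))
train, test = shuffled_split(rows)
train_again, test_again = shuffled_split(rows)
```

Reusing the same seed reproduces the same partition, which is what makes results comparable across runs.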