Data preprocessing#
This module contains algorithms that pre-process the data in some way.
Classes:
BalancedTestSplit | Split data such that the test set is balanced and the training set is proportional.
BiasedDebiasedSubsets | Split the given data into a biased subset and a debiased subset.
BiasedSubset | Split the given data into a biased subset and a normal subset.
DataSplitter | Base class for classes that split data.
LabelBinarizer | If a dataset has labels [-1,1], then this will make it so the labels = [0,1].
ProportionalSplit | Split into train and test while preserving the proportion of s and y.
RandomSplit | Standard train test split.
SequentialSplit | Take the first N samples for train set and the rest for test set; no shuffle.
Functions:
bin_cont_feats | Bin the continuous features.
dataset_from_cond | Return the dataframe that meets some condition.
domain_split | Splits a datatuple based on a condition.
get_biased_and_debiased_subsets | Split the given data into a biased subset and a debiased subset.
get_biased_subset | Split the given data into a biased subset and a normal subset.
query_dt | Query a datatuple.
scale_continuous | Use a scaler on just the continuous features.
train_test_split | Split a DataTuple into two DataTuples along the rows of the DataFrames.
- class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)#
Bases:
ethicml.preprocessing.train_test_split.RandomSplit
Split data such that the test set is balanced and the training set is proportional.
The constructor takes the following arguments.
- Parameters
balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
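As a rough illustration of the 'P(s|y)=0.5' balance type, the sketch below first splits every (s, y) group proportionally for training and then trims the held-out samples so that, for each value of y, both values of s are equally frequent in the test set. The function name `balanced_test_split` and the plain-list inputs are assumptions made for this example; it is not EthicML's implementation.

```python
import random
from collections import defaultdict


def balanced_test_split(s, y, train_percentage=0.8, seed=0):
    """Return (train_indices, test_indices) for parallel lists s and y."""
    rng = random.Random(seed)

    # Group row indices by their (s, y) combination.
    groups = defaultdict(list)
    for i, key in enumerate(zip(s, y)):
        groups[key].append(i)

    train, candidates = [], {}
    for key in sorted(groups):
        idx = groups[key]
        rng.shuffle(idx)
        cut = round(train_percentage * len(idx))
        train.extend(idx[:cut])       # proportional training part
        candidates[key] = idx[cut:]   # held-out part, balanced below

    test = []
    for y_val in sorted(set(y)):
        same_y = [idx for (_, y_key), idx in candidates.items() if y_key == y_val]
        smallest = min(len(idx) for idx in same_y)
        for idx in same_y:
            # Keep the same number of samples for every s value;
            # surplus held-out samples are discarded.
            test.extend(idx[:smallest])
    return sorted(train), sorted(test)


# Toy data: group sizes are (s=0,y=0): 30, (s=1,y=0): 10, (s=0,y=1): 20, (s=1,y=1): 40.
s = [0] * 30 + [1] * 10 + [0] * 20 + [1] * 40
y = [0] * 40 + [1] * 60
train_idx, test_idx = balanced_test_split(s, y)
```

Note that in this sketch the train and test sets together do not cover the whole dataset, because surplus held-out samples are dropped to achieve the balance.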
- class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, fixed_unbiased=True)#
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Split the given data into a biased subset and a debiased subset.
The constructor takes the following arguments.
- Parameters
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor
- class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, data_efficient=True)#
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Split the given data into a biased subset and a normal subset.
The constructor takes the following arguments.
- Parameters
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible
- class DataSplitter#
Bases:
abc.ABC
Base class for classes that split data.
- class LabelBinarizer#
Bases:
object
If a dataset has labels [-1,1], then this will make it so the labels = [0,1].
- Return type
None
- adjust(dataset)#
Take a datatuple and make the labels [0,1].
- Parameters
dataset (ethicml.utility.data_structures.DataTuple) –
- Return type
ethicml.utility.data_structures.DataTuple
- post(dataset)#
Inverse of adjust.
- Parameters
dataset (ethicml.utility.data_structures.DataTuple) –
- Return type
ethicml.utility.data_structures.DataTuple
- post_only_labels(labels)#
Inverse of adjust, but for a Series of labels instead of a DataTuple.
- Parameters
labels (pandas.Series) –
- Return type
pandas.Series
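The label mapping and its inverse can be sketched on a bare pandas Series. The class name `SimpleLabelBinarizer` is hypothetical, and remembering the two original label values is an assumption about how the inverse might be made possible; the real class operates on DataTuples.

```python
import pandas as pd


class SimpleLabelBinarizer:
    """Illustrative stand-in: maps two-valued labels to {0, 1} and back."""

    def __init__(self):
        self.min_val = None
        self.max_val = None

    def adjust(self, labels: pd.Series) -> pd.Series:
        values = sorted(labels.unique())
        if len(values) != 2:
            raise ValueError("expected exactly two distinct label values")
        # Remember the original values so that the mapping can be undone.
        self.min_val, self.max_val = values
        return labels.map({self.min_val: 0, self.max_val: 1})

    def post(self, labels: pd.Series) -> pd.Series:
        return labels.map({0: self.min_val, 1: self.max_val})


binarizer = SimpleLabelBinarizer()
adjusted = binarizer.adjust(pd.Series([-1, 1, 1, -1]))
restored = binarizer.post(adjusted)
```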
- class ProportionalSplit(train_percentage=0.8, start_seed=0)#
Bases:
ethicml.preprocessing.train_test_split.RandomSplit
Split into train and test while preserving the proportion of s and y.
The constructor takes the following arguments.
- Parameters
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
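Preserving the proportion of s and y amounts to a split stratified on the (s, y) combination. A minimal sketch with pandas, assuming a single DataFrame with columns named `s` and `y` (the function name `proportional_split` and this layout are assumptions, not the library's code):

```python
import pandas as pd


def proportional_split(df, train_percentage=0.8, seed=0):
    # Sample the same fraction from every (s, y) group for the training set.
    train = df.groupby(["s", "y"]).sample(frac=train_percentage, random_state=seed)
    # Everything that was not sampled becomes the test set.
    test = df.drop(train.index)
    return train, test


df = pd.DataFrame(
    {
        "s": [0] * 30 + [1] * 10 + [0] * 20 + [1] * 40,
        "y": [0] * 40 + [1] * 60,
    }
)
train, test = proportional_split(df)
```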
- class RandomSplit(train_percentage=0.8, start_seed=0)#
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Standard train test split.
The constructor takes the following arguments.
- Parameters
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
- class SequentialSplit(train_percentage)#
Bases:
ethicml.preprocessing.train_test_split.DataSplitter
Take the first N samples for train set and the rest for test set; no shuffle.
- Parameters
train_percentage (float) –
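A split without shuffling reduces to slicing at a cut-off index. The sketch below shows the idea on a plain list; the function name `sequential_split` and the use of `round` to compute the cut-off are assumptions.

```python
def sequential_split(rows, train_percentage):
    """First N rows go to train, the remainder to test; order is preserved."""
    n_train = round(train_percentage * len(rows))
    return rows[:n_train], rows[n_train:]


train_rows, test_rows = sequential_split(list(range(10)), train_percentage=0.8)
```

This kind of split is mainly useful when the row order carries meaning, for example with time-ordered data.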
- bin_cont_feats(data)#
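Bin the continuous features.
Given a DataTuple, bin the columns that have ordinal features and return them as a fresh new DataTuple.
- Parameters
data (ethicml.utility.data_structures.DataTuple) –
- Return type
ethicml.utility.data_structures.DataTuple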
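One common way to bin a continuous column is to cut it into equal-width intervals and one-hot encode the result. The sketch below does this for a single column of a DataFrame; the helper name `bin_column` and the choice of five bins are assumptions for illustration, not a description of what bin_cont_feats does internally.

```python
import pandas as pd


def bin_column(df, column, n_bins=5):
    # Cut the value range into equal-width intervals.
    binned = pd.cut(df[column], bins=n_bins)
    # One indicator column per interval.
    one_hot = pd.get_dummies(binned, prefix=column)
    # Replace the continuous column by its indicator columns.
    return pd.concat([df.drop(columns=[column]), one_hot], axis=1)


people = pd.DataFrame({"age": [18, 25, 33, 47, 62, 80], "s": [0, 1, 0, 1, 0, 1]})
binned_people = bin_column(people, "age")
```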
- dataset_from_cond(dataset, cond)#
Return the dataframe that meets some condition.
- Parameters
dataset (pandas.DataFrame) –
cond (str) –
- Return type
pandas.DataFrame
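Since the condition is passed as a string, a natural reading is that it follows pandas query syntax. Under that assumption, the filtering step can be sketched directly with `DataFrame.query`; the column names below are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 40, 61], "s": [0, 1, 1]})

# Keep only the rows that satisfy the condition string.
subset = df.query("age > 30 and s == 1")
```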
- domain_split(datatup, tr_cond, te_cond, seed=888)#
Splits a datatuple based on a condition.
- Parameters
datatup (ethicml.utility.data_structures.DataTuple) – DataTuple
tr_cond (str) – condition for the training set
te_cond (str) – condition for the test set
seed (int) –
- Returns
Tuple of DataTuple split into train and test. The test is all those that meet the test condition plus the same percentage again of the train set.
- Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
- get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, fixed_unbiased=True)#
Split the given data into a biased subset and a debiased subset.
In contrast to get_biased_subset(), this function makes the unbiased subset really unbiased. The two subsets don’t generally sum up to the whole set.
Example behavior:
mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y
mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y
mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y
- Parameters
data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor
- Returns
biased and unbiased dataset
- Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
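The "50% s=y and 50% s!=y" property of the debiased subset can be illustrated by sampling equally many rows from both groups. This is a simplified sketch on a list of (s, y) pairs; the function name `debiased_subset` is hypothetical and the library's actual sampling scheme may differ.

```python
import random


def debiased_subset(samples, seed=42):
    """Return a subset in which half the pairs have s == y and half have s != y."""
    rng = random.Random(seed)
    sy_equal = [pair for pair in samples if pair[0] == pair[1]]
    sy_opposite = [pair for pair in samples if pair[0] != pair[1]]
    # The smaller group limits how many pairs can be drawn from each side.
    n = min(len(sy_equal), len(sy_opposite))
    return rng.sample(sy_equal, n) + rng.sample(sy_opposite, n)


pairs = [(0, 0)] * 20 + [(1, 1)] * 10 + [(0, 1)] * 6 + [(1, 0)] * 4
debiased = debiased_subset(pairs)
```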
- get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, data_efficient=True)#
Split the given data into a biased subset and a normal subset.
The two subsets don’t generally sum up to the whole set.
Example behavior:
mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data
mixing_factor=0.5: biased and unbiased are both just subsets of data
mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data
- Parameters
data (ethicml.utility.data_structures.DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible
- Returns
biased and unbiased dataset
- Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
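The effect of mixing_factor on the biased subset can be reproduced with a small sketch: keep a fraction 1 - mixing_factor of the rows with s=y and a fraction mixing_factor of the rows with s!=y. The function name `biased_subset` and the list-of-pairs input are assumptions, and the sketch ignores the data_efficient option and the separate unbiased subset.

```python
import random


def biased_subset(samples, mixing_factor, seed=42):
    """samples is a list of (s, y) pairs; returns the biased selection."""
    assert 0 <= mixing_factor <= 1
    rng = random.Random(seed)
    sy_equal = [pair for pair in samples if pair[0] == pair[1]]
    sy_opposite = [pair for pair in samples if pair[0] != pair[1]]
    # mixing_factor=0 keeps only s == y, mixing_factor=1 keeps only s != y.
    keep_equal = rng.sample(sy_equal, round((1 - mixing_factor) * len(sy_equal)))
    keep_opposite = rng.sample(sy_opposite, round(mixing_factor * len(sy_opposite)))
    return keep_equal + keep_opposite


pairs = [(0, 0)] * 10 + [(1, 1)] * 10 + [(0, 1)] * 10 + [(1, 0)] * 10
fully_biased = biased_subset(pairs, mixing_factor=0.0)
mixed = biased_subset(pairs, mixing_factor=0.5)
inverted = biased_subset(pairs, mixing_factor=1.0)
```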
- query_dt(datatup, query_str)#
Query a datatuple.
- Parameters
datatup (ethicml.utility.data_structures.DataTuple) –
query_str (str) –
- Return type
ethicml.utility.data_structures.DataTuple
- scale_continuous(dataset, datatuple, scaler, inverse=False, fit=True)#
Use a scaler on just the continuous features.
- Parameters
dataset (ethicml.data.dataset.Dataset) – Dataset object. Used to find the continuous features.
datatuple (ethicml.utility.data_structures.DataTuple) – DataTuple on which to scale the continuous features.
scaler (ethicml.preprocessing.scaling.ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.
inverse (bool) – Should the scaling be reversed?
fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform.
- Returns
Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).
- Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.preprocessing.scaling.ScalerType]
Examples
>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()
>>> train, scaler = scale_continuous(dataset, train, scaler)
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)
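To make the roles of the inverse and fit flags concrete, here is a self-contained sketch that scales only selected columns of a DataFrame. `MeanStdScaler` is a hypothetical stand-in that merely mimics the fit/transform/inverse_transform methods of an sklearn-style scaler, and `scale_columns` is an illustrative helper, not the library function.

```python
import pandas as pd


class MeanStdScaler:
    """Hypothetical scaler exposing an sklearn-like interface."""

    def fit(self, x):
        self.mean_ = x.mean()
        self.std_ = x.std(ddof=0)
        return self

    def transform(self, x):
        return (x - self.mean_) / self.std_

    def fit_transform(self, x):
        return self.fit(x).transform(x)

    def inverse_transform(self, x):
        return x * self.std_ + self.mean_


def scale_columns(df, continuous, scaler, inverse=False, fit=True):
    out = df.copy()
    if inverse:
        out[continuous] = scaler.inverse_transform(df[continuous])
    elif fit:
        out[continuous] = scaler.fit_transform(df[continuous])
    else:
        # Reuse statistics learned earlier, e.g. from the training set.
        out[continuous] = scaler.transform(df[continuous])
    return out, scaler


train = pd.DataFrame({"age": [20.0, 40.0, 60.0], "sex_Male": [0, 1, 1]})
test = pd.DataFrame({"age": [40.0, 80.0], "sex_Male": [1, 0]})

scaler = MeanStdScaler()
train_scaled, scaler = scale_columns(train, ["age"], scaler)
test_scaled, scaler = scale_columns(test, ["age"], scaler, fit=False)
test_restored, scaler = scale_columns(test_scaled, ["age"], scaler, inverse=True)
```

Fitting on the training set only and passing fit=False for the test set keeps test statistics from leaking into the scaler.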
- train_test_split(data, train_percentage=0.8, random_seed=0)#
Split a DataTuple into two DataTuples along the rows of the DataFrames.
- Parameters
data (ethicml.utility.data_structures.DataTuple) – data tuple to split
train_percentage (float) – percentage for train split
random_seed (int) – seed to make splitting reproducible
- Returns
train split and test split
- Return type
Tuple[ethicml.utility.data_structures.DataTuple, ethicml.utility.data_structures.DataTuple]
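The role of random_seed can be seen in a minimal seeded shuffle-and-cut sketch on a plain list. The function name `shuffle_split` is hypothetical, and the library may use a different random number generator and rounding rule.

```python
import random


def shuffle_split(rows, train_percentage=0.8, random_seed=0):
    indices = list(range(len(rows)))
    # A dedicated generator makes the shuffle reproducible for a given seed.
    random.Random(random_seed).shuffle(indices)
    n_train = round(train_percentage * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


rows = list(range(10))
train_rows, test_rows = shuffle_split(rows)
train_again, test_again = shuffle_split(rows)
```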