Data preprocessing
Module for algorithms that pre-process the data in some way.
Classes:
BalancedTestSplit – Split data such that the test set is balanced and the training set is proportional.
BiasedDebiasedSubsets – Split the given data into a biased subset and a debiased subset.
BiasedSubset – Split the given data into a biased subset and a normal subset.
DataSplitter – Base class for classes that split data.
LabelBinarizer – If a dataset has labels [-1,1], then this will make it so the labels = [0,1].
ProportionalSplit – Split into train and test while preserving the proportion of s and y.
RandomSplit – Standard train test split.
ScalerType – Protocol describing a scaler class.
SequentialSplit – Take the first N samples for train set and the rest for test set; no shuffle.
Functions:
bin_cont_feats – Bin the continuous features.
dataset_from_cond – Return the dataframe that meets some condition.
domain_split – Split a datatuple based on a condition.
get_biased_and_debiased_subsets – Split the given data into a biased subset and a debiased subset.
get_biased_subset – Split the given data into a biased subset and a normal subset.
query_dt – Query a datatuple.
scale_continuous – Use a scaler on just the continuous features.
train_test_split – Split a DataTuple into two DataTuples along the rows of the DataFrames.
- class BalancedTestSplit(balance_type='P(s|y)=0.5', train_percentage=0.8, start_seed=0)
Bases:
RandomSplit
Split data such that the test set is balanced and the training set is proportional.
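As a rough illustration of what the 'P(s|y)=0.5' balance type asks for, the sketch below subsamples rows so that, within each class y, both values of s are equally frequent. It is not the library's implementation: the helper `balance_s_given_y` and the toy lists are hypothetical, a binary s is assumed, and only the standard library is used.

```python
import random
from collections import defaultdict


def balance_s_given_y(s, y, seed=0):
    """Return sorted row indices of a subset in which P(s|y) = 0.5 (binary s)."""
    rng = random.Random(seed)
    groups = defaultdict(list)  # (s, y) -> row indices
    for idx, key in enumerate(zip(s, y)):
        groups[key].append(idx)

    s_values = sorted(set(s))
    keep = []
    for y_val in sorted(set(y)):
        # Within this class, cut every s-group down to the size of the smallest.
        n = min(len(groups[(s_val, y_val)]) for s_val in s_values)
        for s_val in s_values:
            keep.extend(rng.sample(groups[(s_val, y_val)], n))
    return sorted(keep)


# Toy data: s is the sensitive attribute, y is the class label.
s = [0, 0, 0, 1, 0, 1, 1, 1, 1, 0]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
subset = balance_s_given_y(s, y)
```

Balancing by subsampling discards rows from the larger groups, so the balanced subset is usually smaller than the data it was drawn from.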
- Parameters:
balance_type (Literal['P(s|y)=0.5', 'P(y|s)=0.5', 'P(s,y)=0.25']) – how to do the balancing
train_percentage (float) – how much of the data to use for the train split
start_seed (int) – random seed for the first split
- class BiasedDebiasedSubsets(unbiased_pcnt, mixing_factors=(0,), seed=42, *, fixed_unbiased=True)
Bases:
DataSplitter
Split the given data into a biased subset and a debiased subset.
- Parameters:
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
seed (int) – random seed for the splitting
fixed_unbiased (bool) – if True, then the unbiased dataset is independent from the mixing factor
- class BiasedSubset(unbiased_pcnt, mixing_factors=(0,), seed=42, *, data_efficient=True)
Bases:
DataSplitter
Split the given data into a biased subset and a normal subset.
- Parameters:
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
mixing_factors (Sequence[float]) – List of mixing factors; they are chosen based on the split ID
seed (int) – random seed for the splitting
data_efficient (bool) – if True, try to keep as many data points as possible
- class DataSplitter
Bases:
ABC
Base class for classes that split data.
- class LabelBinarizer
Bases:
object
If a dataset has labels [-1,1], then this will make it so the labels = [0,1].
- adjust(dataset)
Take a DataTuple and make the labels [0,1].
- post_only_labels(labels)
Inverse of adjust, but for a Series of labels instead of a DataTuple.
- Parameters:
labels (Series) –
- Return type:
Series
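The label mapping described above can be sketched with plain lists. This is only an illustration of the arithmetic, not the class itself: `to_zero_one` and `to_minus_one_one` are hypothetical helper names, and the real methods operate on a DataTuple and a Series.

```python
def to_zero_one(labels):
    """Map labels from {-1, 1} to {0, 1}."""
    return [(label + 1) // 2 for label in labels]


def to_minus_one_one(labels):
    """Map labels from {0, 1} back to {-1, 1}."""
    return [2 * label - 1 for label in labels]


original = [-1, 1, 1, -1]
adjusted = to_zero_one(original)        # forward direction, as in adjust
restored = to_minus_one_one(adjusted)   # inverse direction, as in post_only_labels
```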
- class ProportionalSplit(train_percentage=0.8, start_seed=0)
Bases:
RandomSplit
Split into train and test while preserving the proportion of s and y.
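One way to preserve the proportion of s and y is to split each (s, y) group separately and then merge the pieces. The sketch below shows that idea with the standard library only. The function name `proportional_split`, the rounding rule, and the toy data are assumptions made for illustration; the class may work differently.

```python
import random
from collections import defaultdict


def proportional_split(s, y, train_percentage=0.8, seed=0):
    """Split row indices so that every (s, y) group keeps the same train share."""
    rng = random.Random(seed)
    groups = defaultdict(list)  # (s, y) -> row indices
    for idx, key in enumerate(zip(s, y)):
        groups[key].append(idx)

    train, test = [], []
    for key in sorted(groups):
        indices = groups[key]
        rng.shuffle(indices)
        # Assumption: each group's cut-off is rounded to the nearest integer.
        cut = round(len(indices) * train_percentage)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Toy data with four equally sized (s, y) groups of five rows each.
s = [0] * 10 + [1] * 10
y = ([0] * 5 + [1] * 5) * 2
train_idx, test_idx = proportional_split(s, y)
```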
- Parameters:
train_percentage (float) – how much of the data to use for the train split
start_seed (int | None) – random seed for the first split
- class RandomSplit(train_percentage=0.8, start_seed=0)
Bases:
DataSplitter
Standard train test split.
- Parameters:
train_percentage (float) – how much of the data to use for the train split
start_seed (int | None) – random seed for the first split
- class ScalerType(*args, **kwargs)
Bases:
Protocol
Protocol describing a scaler class.
- fit(X)
Fit parameters of the transformation to the given data.
- Parameters:
X (DataFrame) –
- Return type:
Self
- fit_transform(X)
Fit parameters of the transformation to the given data and then transform.
- Parameters:
X (DataFrame) –
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- inverse_transform(X)
Invert the transformation.
- Parameters:
X (DataFrame) –
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
- transform(X)
Transform the given data.
- Parameters:
X (DataFrame) –
- Return type:
ndarray[Any, dtype[_ScalarType_co]]
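Because ScalerType is a Protocol, a scaler only needs to provide the four methods listed above; it does not have to inherit from anything. The sketch below shows that structural-typing idea with a simplified stand-in. `ScalerLike`, `Matrix`, and `MinMaxSketch` are hypothetical names, and nested lists replace the DataFrame and ndarray types so that the example needs only the standard library.

```python
from typing import List, Protocol, runtime_checkable

Matrix = List[List[float]]  # hypothetical stand-in for DataFrame / ndarray


@runtime_checkable
class ScalerLike(Protocol):
    """Simplified protocol with the same four method names as ScalerType."""

    def fit(self, X: Matrix) -> "ScalerLike": ...
    def fit_transform(self, X: Matrix) -> Matrix: ...
    def inverse_transform(self, X: Matrix) -> Matrix: ...
    def transform(self, X: Matrix) -> Matrix: ...


class MinMaxSketch:
    """Toy min-max scaler; note that it does not inherit from ScalerLike."""

    def fit(self, X):
        columns = list(zip(*X))
        self.lo = [min(col) for col in columns]
        # Constant columns get a span of 1.0 to avoid division by zero.
        self.span = [(max(col) - min(col)) or 1.0 for col in columns]
        return self

    def transform(self, X):
        return [[(v - lo) / sp for v, lo, sp in zip(row, self.lo, self.span)] for row in X]

    def fit_transform(self, X):
        return self.fit(X).transform(X)

    def inverse_transform(self, X):
        return [[v * sp + lo for v, lo, sp in zip(row, self.lo, self.span)] for row in X]


scaler = MinMaxSketch()
scaled = scaler.fit_transform([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
```

An `isinstance(scaler, ScalerLike)` check passes here purely because the method names match, which is the property that lets scalers following the scikit-learn API be used without subclassing.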
- class SequentialSplit(train_percentage)
Bases:
DataSplitter
Take the first N samples for train set and the rest for test set; no shuffle.
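A sequential split amounts to slicing at a single cut-off. The minimal sketch below uses a plain list. The helper `sequential_split` is hypothetical, and rounding the cut-off to the nearest integer is an assumption; the class may round differently.

```python
def sequential_split(rows, train_percentage):
    """Return (first N rows, remaining rows) without shuffling."""
    # Assumption: N is the row count times train_percentage, rounded.
    cut = round(len(rows) * train_percentage)
    return rows[:cut], rows[cut:]


rows = list(range(10))
train, test = sequential_split(rows, 0.8)
```

Keeping the original order matters when rows are ordered in time, because a shuffled split would let later samples leak into the training set.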
- Parameters:
train_percentage (float) – how much of the data to use for the train split
- bin_cont_feats(data)
Bin the continuous features.
Given a DataTuple, bin the columns that contain ordinal features and return the result as a fresh DataTuple.
- dataset_from_cond(dataset, cond)
Return the dataframe that meets some condition.
- Parameters:
dataset (DataFrame) –
cond (str) –
- Return type:
DataFrame
- domain_split(datatup, tr_cond, te_cond, seed=888)
Split a datatuple based on a condition.
- Parameters:
datatup (DataTuple) – the DataTuple to split
tr_cond (str) – condition for the training set
te_cond (str) – condition for the test set
seed (int) – random seed (Default: 888)
- Returns:
Tuple of DataTuple split into train and test. The test is all those that meet the test condition plus the same percentage again of the train set.
- Return type:
- get_biased_and_debiased_subsets(data, mixing_factor, unbiased_pcnt, seed=42, *, fixed_unbiased=True)
Split the given data into a biased subset and a debiased subset.
In contrast to get_biased_subset(), this function makes the unbiased subset really unbiased.
The two subsets don’t generally sum up to the whole set.
Example behavior:
mixing_factor=0.0: in biased, s=y everywhere; in debiased, 50% s=y and 50% s!=y
mixing_factor=0.5: biased is just a subset of data; in debiased, 50% s=y and 50% s!=y
mixing_factor=1.0: in biased, s!=y everywhere; in debiased, 50% s=y and 50% s!=y
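The three example behaviors can be reproduced with a small standard-library sketch. The interpolation in `biased_fractions` is an assumption, chosen only because it matches the three documented cases; it is not confirmed to be what the function does. All helper names and the toy data are hypothetical, and the `fixed_unbiased` option is not modelled.

```python
import random


def biased_fractions(mixing_factor):
    """Share of the s=y rows and of the s!=y rows kept in the biased subset."""
    keep_equal = min(1.0, 2 * (1 - mixing_factor))
    keep_opposite = min(1.0, 2 * mixing_factor)
    return keep_equal, keep_opposite


def biased_and_debiased(s, y, mixing_factor, unbiased_pcnt, seed=42):
    rng = random.Random(seed)
    equal = [i for i in range(len(s)) if s[i] == y[i]]
    opposite = [i for i in range(len(s)) if s[i] != y[i]]
    rng.shuffle(equal)
    rng.shuffle(opposite)

    # Reserve the tail of each group for the debiased subset.
    eq_cut = round(len(equal) * (1 - unbiased_pcnt))
    op_cut = round(len(opposite) * (1 - unbiased_pcnt))
    eq_pool, eq_rest = equal[:eq_cut], equal[eq_cut:]
    op_pool, op_rest = opposite[:op_cut], opposite[op_cut:]

    keep_eq, keep_op = biased_fractions(mixing_factor)
    biased = eq_pool[: round(keep_eq * len(eq_pool))] + op_pool[: round(keep_op * len(op_pool))]

    # Debiased: the same number of s=y and s!=y rows; surplus rows are dropped.
    n = min(len(eq_rest), len(op_rest))
    debiased = eq_rest[:n] + op_rest[:n]
    return sorted(biased), sorted(debiased)


# Toy data: 12 rows with s=y and 8 rows with s!=y.
s = [0] * 10 + [1] * 10
y = [0] * 6 + [1] * 4 + [1] * 6 + [0] * 4
biased_0, debiased_0 = biased_and_debiased(s, y, mixing_factor=0.0, unbiased_pcnt=0.5)
biased_1, debiased_1 = biased_and_debiased(s, y, mixing_factor=1.0, unbiased_pcnt=0.5)
```

In this sketch, rows are discarded both when the biased subset is thinned out and when the debiased subset is balanced, which is one way the two subsets can fail to sum up to the whole set.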
- Parameters:
data (DataTuple) – Data in form of a DataTuple.
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – How much of the data should be reserved for the unbiased subset?
seed (int) – random seed for the splitting (Default: 42)
fixed_unbiased (bool) – If True, then the unbiased dataset is independent from the mixing factor (Default: True)
- Returns:
biased and unbiased dataset
- Return type:
- get_biased_subset(data, mixing_factor, unbiased_pcnt, seed=42, *, data_efficient=True)
Split the given data into a biased subset and a normal subset.
The two subsets don’t generally sum up to the whole set.
Example behavior:
mixing_factor=0.0: in biased, s=y everywhere; unbiased is just a subset of data
mixing_factor=0.5: biased and unbiased are both just subsets of data
mixing_factor=1.0: in biased, s!=y everywhere; unbiased is just a subset of data
- Parameters:
data (DataTuple) – data in form of a DataTuple
mixing_factor (float) – How much of the debiased data should be mixed into the biased subset? If this factor is 0, the biased subset is maximally biased.
unbiased_pcnt (float) – how much of the data should be reserved for the unbiased subset
seed (int) – random seed for the splitting (Default: 42)
data_efficient (bool) – if True, try to keep as many data points as possible (Default: True)
- Returns:
biased and unbiased dataset
- Return type:
- query_dt(datatup, query_str)
Query a datatuple.
- scale_continuous(dataset, datatuple, scaler, *, inverse=False, fit=True)
Use a scaler on just the continuous features.
- Example:
>>> from sklearn.preprocessing import StandardScaler
>>> dataset = adult()
>>> datatuple = dataset.load()
>>> train, test = train_test_split(datatuple)
>>> scaler = StandardScaler()  # any scaler that follows the scikit-learn API
>>> train, scaler = scale_continuous(dataset, train, scaler)
>>> test, scaler = scale_continuous(dataset, test, scaler, fit=False)
- Parameters:
dataset (Dataset) – Dataset object. Used to find the continuous features.
datatuple (DataTuple) – DataTuple on which to scale the continuous features.
scaler (ScalerType) – Scaler object to scale the features. Must fit the SKLearn scaler API.
inverse (bool) – Should the scaling be reversed? (Default: False)
fit (bool) – If not inverse, should the scaler be fit to the data? If True, do fit_transform operation, else just transform. (Default: True)
- Returns:
Tuple of (scaled) DataTuple, and the Scaler (which may have been fit to the data).
- Return type:
tuple[DataTuple, ScalerType]
- train_test_split(data, train_percentage=0.8, random_seed=0, num_test_samples=None)
Split a DataTuple into two DataTuples along the rows of the DataFrames.
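The rule that exactly one of train_percentage and num_test_samples may be set can be sketched on plain row indices. The helper `split_indices` is hypothetical and the rounding of the train size is an assumption; only the argument check mirrors the documented ValueError.

```python
import random


def split_indices(n_rows, train_percentage=0.8, random_seed=0, num_test_samples=None):
    """Shuffle row indices and cut them into a train part and a test part."""
    # Exactly one of the two size arguments may be set.
    if (train_percentage is None) == (num_test_samples is None):
        raise ValueError("Pass exactly one of train_percentage and num_test_samples.")

    if num_test_samples is not None:
        n_train = n_rows - num_test_samples
    else:
        # Assumption: the train size is rounded to the nearest integer.
        n_train = round(n_rows * train_percentage)

    indices = list(range(n_rows))
    random.Random(random_seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


train, test = split_indices(10)
train_fixed, test_fixed = split_indices(10, train_percentage=None, num_test_samples=3)
```

Because train_percentage defaults to 0.8, a caller who wants a fixed number of test samples has to pass `train_percentage=None` explicitly; leaving it at the default while also passing num_test_samples triggers the error.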
- Parameters:
data (DataTuple) – Data tuple to split.
train_percentage (float | None) – Percentage for train split. Must be None if num_test_samples given. (Default: 0.8)
random_seed (int) – Seed to make splitting reproducible (Default: 0)
num_test_samples (int | None) – Number of samples to make the test set. Must be None if train_percentage given. (Default: None)
- Returns:
train split and test split
- Raises:
ValueError – If either none or both of num_test_samples and train_percentage are passed in as arguments.
- Return type: