ethicml.data#
Module for items related to data, such as raw CSV files and data objects.
Classes:
The ACS Employment Dataset from EAAMO21/NeurIPS21 - Retiring Adult.
The ACS Income Dataset from EAAMO21/NeurIPS21 - Retiring Adult.
UFRGS Admissions dataset.
Splits for the Admissions dataset.
UCI Adult dataset.
Available dataset splits for the Adult dataset.
Dataset class that loads data from CSV files.
Dataset that uses the default load function.
Compas (or ProPublica) dataset.
Available dataset splits for the COMPAS dataset.
UCI Credit Card dataset.
Splits for the Credit dataset.
UCI Communities and Crime dataset.
Splits for the Crime dataset.
Data structure that holds all the information needed to load a given dataset.
Order of features in the loaded datatuple.
A dictionary of the list of columns that belong to the feature groups.
German credit dataset.
Splits for the German dataset.
Heritage Health dataset.
Splits for the Health dataset.
Definition of a group of columns that should be interpreted as a single label.
A pair of label specs.
LSAC Law School dataset.
Splits for the Law dataset.
Dataset base class.
Synthetic dataset from Lipton et al. 2018.
Dataset with toy data for testing.
UCI Adult dataset.
Available dataset splits for the Adult dataset.
Stop, question and frisk dataset.
Splits for the SQF dataset.
Dataset whose size and file location do not depend on constructor arguments.
Dataset with synthetic data.
Scenarios for the synthetic dataset.
Targets for the synthetic dataset.
Dataset with toy data for testing.
Functions:
List of tabular dataset names.
Create a ConfigurableDataset from the given file.
Filter the features by prefixes.
Flatten a dictionary of lists by joining all lists to one big list.
Convert one-hot encoded columns into categorical columns.
Given a dataset name, get the corresponding dataset object.
Get a list of the discrete features in a dataset.
Group discrete feature names according to the first segment of their name.
Extract all the feature column names from a dictionary of label specifications.
Load dataset from its CSV file.
Construct a new label according to the given
Drop all features in the given feature group except the ones in to_keep.
Create a label spec for the case where the label is defined by a single column.
Create label specs for the most common case where columns contain 0s and 1s.
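The two list helpers described above (flattening a dictionary of lists, and grouping feature names by their first segment) can be sketched in plain Python. This is only an illustration and not the library's implementation: the function names `flatten` and `group_by_first_segment` are hypothetical, as is the assumption that `_` separates the first segment from the rest of the name.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def flatten(groups: Dict[str, List[str]]) -> List[str]:
    """Join all lists of a dictionary into one big list, in insertion order."""
    return [column for columns in groups.values() for column in columns]


def group_by_first_segment(features: Sequence[str], separator: str = "_") -> Dict[str, List[str]]:
    """Group feature names by the part before the first separator (assumed to be '_')."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in features:
        groups[name.split(separator, 1)[0]].append(name)
    return dict(groups)


columns = ["race_White", "race_Black", "sex_Male", "sex_Female"]
grouped = group_by_first_segment(columns)
flattened = flatten(grouped)
print(grouped)
print(flattened)
```

In this sketch the two helpers are inverses of each other as long as columns of the same group are adjacent in the input.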
- class AcsEmployment(root, year, horizon, states, split='Sex', *, discrete_only=False, invert_s=False)#
Bases:
_AcsBase
The ACS Employment Dataset from EAAMO21/NeurIPS21 - Retiring Adult.
- Parameters:
root (str | Path) –
year (str) –
horizon (int) –
states (list[Literal['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'PR']]) –
split (str) –
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- static cat_lookup(key)#
Look up categories.
- Parameters:
key (str) –
- Return type:
list[int]
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
List of features that are continuous.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property features_to_remove: list[str]#
Features that have to be removed from x.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- class AcsIncome(root, year, horizon, states, split='Sex', target_threshold=50000, *, discrete_only=False, invert_s=False)#
Bases:
_AcsBase
The ACS Income Dataset from EAAMO21/NeurIPS21 - Retiring Adult.
- Parameters:
root (str | Path) –
year (str) –
horizon (int) –
states (list[Literal['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'PR']]) –
split (str) –
target_threshold (int) –
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- static cat_lookup(key)#
Look up categories.
- Parameters:
key (str) –
- Return type:
list[int]
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
List of features that are continuous.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property features_to_remove: list[str]#
Features that have to be removed from x.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
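The `target_threshold` parameter of `AcsIncome` suggests that a continuous income value is turned into a binary target. A minimal sketch of such a thresholding step follows. The helper `binarize_income` is hypothetical, and the strict `>` comparison is an assumption; the reference above does not say whether the boundary value counts as positive.

```python
from typing import List


def binarize_income(incomes: List[float], target_threshold: int = 50000) -> List[int]:
    """Map each income to 1 if it exceeds the threshold, else 0.

    Assumption: the comparison is strict, so an income equal to the
    threshold is labelled 0.
    """
    return [int(income > target_threshold) for income in incomes]


labels = binarize_income([12000.0, 50000.0, 82000.0])
print(labels)
```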
- class Admissions(discrete_only=False, invert_s=False, split=AdmissionsSplits.GENDER)#
Bases:
LegacyDataset
UFRGS Admissions dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (AdmissionsSplits) –
- Splits#
alias of
AdmissionsSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
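The `map_to_binary` flag is documented as converting labels from {-1, 1} to {0, 1}. One way to express that mapping is sketched below. The standalone function is hypothetical and only illustrates the arithmetic; it is not the library's code.

```python
from typing import List


def map_to_binary(labels: List[int]) -> List[int]:
    """Convert labels from {-1, 1} to {0, 1} via (label + 1) // 2."""
    if any(label not in (-1, 1) for label in labels):
        raise ValueError("expected labels in {-1, 1}")
    return [(label + 1) // 2 for label in labels]


converted = map_to_binary([-1, 1, 1, -1])
print(converted)
```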
- class AdmissionsSplits(value)#
Bases:
Enum
Splits for the Admissions dataset.
- class Adult(discrete_only=False, invert_s=False, split=AdultSplits.SEX, binarize_nationality=False, binarize_race=False)#
Bases:
StaticCSVDataset
UCI Adult dataset.
- Parameters:
discrete_only (bool) – If True, continuous features are dropped. (Default: False)
invert_s (bool) – If True, the (binary) s values are inverted. (Default: False)
split (AdultSplits) – What to use as s. (Default: “Sex”)
binarize_nationality (bool) – If True, nationality will be USA vs rest. (Default: False)
binarize_race (bool) – If True, race will be white vs rest. (Default: False)
- Splits#
alias of
AdultSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
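The `invert_s` parameter is described as inverting the (binary) `s` values. Assuming `s` is encoded as 0 and 1, the inversion amounts to swapping the two values, as in this sketch. The helper `invert_binary` is hypothetical and the 0/1 encoding is an assumption.

```python
from typing import List


def invert_binary(s: List[int]) -> List[int]:
    """Swap 0 and 1 in a binary sensitive attribute (assumed 0/1 encoding)."""
    if any(value not in (0, 1) for value in s):
        raise ValueError("expected binary values in {0, 1}")
    return [1 - value for value in s]


inverted = invert_binary([0, 1, 1, 0])
print(inverted)
```

Applying the inversion twice returns the original values, so the flag only changes which group is encoded as 1.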
- class AdultSplits(value)#
Bases:
Enum
Available dataset splits for the Adult dataset.
- class CSVDataset#
Bases:
Dataset, ABC
Dataset class that loads data from CSV files.
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- abstract property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- abstract get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- abstract get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- abstract get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- abstract property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- abstract property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- abstract property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- abstract property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
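The `discard_non_one_hot` flag refers to entries in `s` or `y` that are not correctly one-hot encoded. As a rough illustration, a row of a one-hot group can be treated as valid when exactly one column is 1 and all others are 0. The helpers below are hypothetical, and the real check in the library may differ.

```python
from typing import List


def is_one_hot(row: List[int]) -> bool:
    """True if exactly one entry is 1 and every other entry is 0."""
    return sorted(row) == [0] * (len(row) - 1) + [1]


def discard_non_one_hot(rows: List[List[int]]) -> List[List[int]]:
    """Keep only the rows that are correctly one-hot encoded."""
    return [row for row in rows if is_one_hot(row)]


rows = [[1, 0, 0], [0, 0, 0], [0, 1, 1], [0, 0, 1]]
kept = discard_non_one_hot(rows)
print(kept)
```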
- class CSVDatasetDC(discrete_only=False, invert_s=False)#
Bases:
CSVDataset, ABC
Dataset that uses the default load function.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- abstract property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- abstract get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- abstract get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- abstract get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- abstract property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- abstract property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class Compas(discrete_only=False, invert_s=False, split=CompasSplits.SEX)#
Bases:
LegacyDataset
Compas (or ProPublica) dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (CompasSplits) –
- Splits#
alias of
CompasSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class CompasSplits(value)#
Bases:
Enum
Available dataset splits for the COMPAS dataset.
- class Credit(discrete_only=False, invert_s=False, split=CreditSplits.SEX)#
Bases:
LegacyDataset
UCI Credit Card dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (CreditSplits) –
- Splits#
alias of
CreditSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class CreditSplits(value)#
Bases:
Enum
Splits for the Credit dataset.
- class Crime(discrete_only=False, invert_s=False, split=CrimeSplits.RACE_BINARY)#
Bases:
LegacyDataset
UCI Communities and Crime dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (CrimeSplits) –
- Splits#
alias of
CrimeSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class CrimeSplits(value)#
Bases:
Enum
Splits for the Crime dataset.
- class Dataset#
Bases:
ABC
Data structure that holds all the information needed to load a given dataset.
- abstract __len__()#
Return number of elements in the dataset.
- Return type:
int
- abstract property continuous_features: list[str]#
Continuous features.
- abstract property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups.
- abstract property discrete_features: list[str]#
List of features that are discrete.
- abstract feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- abstract load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- abstract property name: str#
Name of the dataset.
- class FeatureOrder(value)#
Bases:
StrEnum
Order of features in the loaded datatuple.
- cont_first = 'cont_first'#
Continuous features first.
- disc_first = 'disc_first'#
Discrete features first.
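To illustrate how the two `FeatureOrder` values could affect column ordering, here is a small self-contained sketch. It defines a stand-in enum with the same two values as documented above (using `str` plus `Enum` rather than `StrEnum`, so it also runs on older Python versions); the helper `order_features` is hypothetical and not part of the documented API.

```python
from enum import Enum
from typing import List


class FeatureOrder(str, Enum):
    """Stand-in for the documented enum; the values are copied from the reference."""

    cont_first = "cont_first"
    disc_first = "disc_first"


def order_features(discrete: List[str], continuous: List[str], order: FeatureOrder) -> List[str]:
    """Concatenate the two feature lists in the requested order."""
    if order is FeatureOrder.disc_first:
        return discrete + continuous
    return continuous + discrete


discrete = ["workclass_Private", "workclass_State-gov"]
continuous = ["age", "hours-per-week"]
disc_first = order_features(discrete, continuous, FeatureOrder.disc_first)
# Members can also be looked up by their string value.
cont_first = order_features(discrete, continuous, FeatureOrder("cont_first"))
print(disc_first)
print(cont_first)
```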
- class FeatureSplit#
Bases:
TypedDict
A dictionary of the list of columns that belong to the feature groups.
- __len__()#
Return len(self).
- clear() -> None. Remove all items from D. #
- copy() -> a shallow copy of D #
- fromkeys(value=None, /)#
Create a new dictionary with keys from iterable and values set to value.
- get(key, default=None, /)#
Return the value for key if key is in the dictionary, else default.
- items() -> a set-like object providing a view on D's items #
- keys() -> a set-like object providing a view on D's keys #
- pop(k[, d]) -> v, remove specified key and return the corresponding value. #
If the key is not found, return the default if given; otherwise, raise a KeyError.
- popitem()#
Remove and return a (key, value) pair as a 2-tuple.
Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
- setdefault(key, default=None, /)#
Insert key with a value of default if key is not in the dictionary.
Return the value for key if key is in the dictionary, else default.
- update([E, ]**F) -> None. Update D from dict/iterable E and F. #
If E is present and has a .keys() method, then does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, then does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
- values() -> an object providing a view on D's values #
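Because `FeatureSplit` is a `TypedDict`, instances are ordinary dictionaries at runtime, which is why the standard `dict` methods are listed above. The sketch below shows the idea with a stand-in definition. The key names `x`, `s` and `y` are an assumption based on the description of `feature_split` (features, sensitive attributes and labels) and are not confirmed by this reference.

```python
from typing import List, TypedDict


class FeatureSplit(TypedDict):
    """Stand-in definition; the key names are assumed, not taken from the library."""

    x: List[str]
    s: List[str]
    y: List[str]


split: FeatureSplit = {
    "x": ["age", "hours-per-week"],
    "s": ["sex_Male"],
    "y": ["salary_>50K"],
}

# At runtime this is a plain dict, so the usual dict methods apply.
is_plain_dict = isinstance(split, dict)
keys = list(split.keys())
features = split.get("x", [])
print(is_plain_dict, keys, features)
```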
- class German(discrete_only=False, invert_s=False, split=GermanSplits.SEX)#
Bases:
LegacyDataset
German credit dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (GermanSplits) –
- Splits#
alias of
GermanSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class GermanSplits(value)#
Bases:
Enum
Splits for the German dataset.
- class Health(discrete_only=False, invert_s=False, split=HealthSplits.SEX)#
Bases:
LegacyDataset
Heritage Health dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (HealthSplits) –
- Splits#
alias of
HealthSplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, then the continuous ones.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class HealthSplits(value)#
Bases:
Enum
Splits for the Health dataset.
- class LabelGroup(columns, multiplier=1)#
Bases:
NamedTuple
Definition of a group of columns that should be interpreted as a single label.
- Parameters:
columns (list[str]) –
multiplier (int) –
- __len__()#
Return len(self).
- columns: list[str]#
Alias for field number 0
- count(value, /)#
Return number of occurrences of value.
- index(value, start=0, stop=9223372036854775807, /)#
Return first index of value.
Raises ValueError if the value is not present.
- multiplier: int#
Alias for field number 1
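Because LabelGroup is a NamedTuple, the tuple members listed above (length, count, index, positional field aliases) behave like those of any named tuple. The sketch below uses a local stand-in class that only mirrors the documented signature; it is not imported from ethicml, and the column names are made up for illustration.

```python
from typing import NamedTuple


class LabelGroup(NamedTuple):
    """Local stand-in mirroring the documented fields (not the ethicml class)."""

    columns: list[str]
    multiplier: int = 1


# Hypothetical one-hot columns that together encode a single label.
group = LabelGroup(columns=["race_A", "race_B"])

n_fields = len(group)      # a NamedTuple's length is its number of fields
first_field = group[0]     # field number 0 is `columns`
position = group.index(1)  # position of the first field equal to 1 (the multiplier)
```

Note that `len()` here counts fields, not columns; the number of columns is `len(group.columns)`.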
- class LabelSpecsPair(s, y, to_remove=<factory>)#
Bases:
object
A pair of label specs.
- Parameters:
s (Mapping[str, LabelGroup]) – Spec for building the s label.
y (Mapping[str, LabelGroup]) – Spec for building the y label.
to_remove (list[str]) – List of feature groups that need to be removed because they are label building blocks. (Default: [])
- class Law(discrete_only=False, invert_s=False, split=LawSplits.SEX)#
Bases:
LegacyDataset
LSAC Law School dataset.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (LawSplits) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class LawSplits(value)#
Bases:
Enum
Splits for the Law dataset.
- class LegacyDataset(*, name, filename_or_path, features, cont_features, sens_attr_spec, class_label_spec, num_samples, s_feature_groups=None, class_feature_groups=None, discrete_feature_groups=None)#
Bases:
CSVDataset
Dataset base class.
This base class is considered legacy now. Please use CSVDatasetDC or StaticCSVDataset instead.
- Parameters:
name (str) –
filename_or_path (str | Path) –
features (Sequence[str]) –
cont_features (Sequence[str]) –
sens_attr_spec (str | Mapping[str, LabelGroup]) –
class_label_spec (str | Mapping[str, LabelGroup]) –
num_samples (int) –
s_feature_groups (Sequence[str] | None) –
class_feature_groups (Sequence[str] | None) –
discrete_feature_groups (dict[str, list[str]] | None) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class Lipton(discrete_only=False, invert_s=False)#
Bases:
LegacyDataset
Synthetic dataset from Lipton et al. (2018).
Described in section 4.1 of Does mitigating ML’s impact disparity require treatment disparity?
@article{lipton2018does,
  title={Does mitigating ML's impact disparity require treatment disparity?},
  author={Lipton, Zachary and McAuley, Julian and Chouldechova, Alexandra},
  journal={Advances in neural information processing systems},
  volume={31},
  pages={8125--8135},
  year={2018}
}
- Parameters:
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class NonBinaryToy(discrete_only=False, invert_s=False)#
Bases:
LegacyDataset
Dataset with toy data for testing.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class Nursery(discrete_only=False, invert_s=False, split=NurserySplits.FINANCE)#
Bases:
LegacyDataset
UCI Nursery dataset.
- Parameters:
discrete_only (bool) – If True, continuous features are dropped. (Default: False)
invert_s (bool) – If True, the (binary) s values are inverted. (Default: False)
split (NurserySplits) – What to use as s. (Default: NurserySplits.FINANCE)
- Splits#
alias of
NurserySplits
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class NurserySplits(value)#
Bases:
Enum
Available dataset splits for the Nursery dataset.
- class Sqf(discrete_only=False, invert_s=False, split=SqfSplits.SEX)#
Bases:
LegacyDataset
Stop, question and frisk dataset.
This data is from 2016. Source: http://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page
- Parameters:
discrete_only (bool) –
invert_s (bool) –
split (SqfSplits) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class SqfSplits(value)#
Bases:
Enum
Splits for the SQF dataset.
- class StaticCSVDataset(discrete_only=False, invert_s=False)#
Bases:
CSVDatasetDC, ABC
Dataset whose size and file location do not depend on constructor arguments.
- Example:
How to subclass this:
    from dataclasses import dataclass
    from typing import ClassVar

    from ethicml.data import DiscFeatureGroups, LabelSpecsPair, StaticCSVDataset, single_col_spec


    @dataclass
    class Toy(StaticCSVDataset):
        '''Dataset with toy data for testing.'''

        # Size and file location are fixed at class level, not per instance.
        num_samples: ClassVar[int] = 400
        csv_file: ClassVar[str] = "toy.csv"

        @property
        def name(self) -> str:
            return "Toy"

        def get_label_specs(self) -> LabelSpecsPair:
            return LabelSpecsPair(
                s=single_col_spec("sens"), y=single_col_spec("class")
            )

        @property
        def unfiltered_disc_feat_groups(self) -> DiscFeatureGroups:
            return {"disc_1": ["a_1", "a_2", "a_3"], "disc_2": ["b_1", "b_2"]}

        @property
        def continuous_features(self) -> list[str]:
            return ["c1", "c2"]
- Parameters:
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- abstract property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- abstract get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- abstract property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- abstract property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class Synthetic(discrete_only=False, invert_s=False, scenario=SyntheticScenarios.S1, target=SyntheticTargets.Y3, fair=False, num_samples=1000)#
Bases:
CSVDatasetDC
Dataset with synthetic data.
Notation: ⊥ = "is independent of"; ~ = "is an ancestor of" in the causal model used to generate the data.
- Scenario 1 = X⊥S & Y⊥S.
This models completely fair data.
- Scenario 2 = X_2⊥S & Y_2⊥S; X_1~S, Y_1~S & Y_3~S
This models data where the inputs are biased. This is propagated through to the target.
- Scenario 3 = X⊥S, Y_1⊥S, Y_2⊥S; Y_3~S
This models data where the target is biased.
- Scenario 4 = X_2⊥S, Y_2⊥S; X_1~S, Y_1~S, Y_3~S
This models data where both the input and target are directly biased.
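The difference between "X⊥S" and "X~S" in the scenarios above can be illustrated with a toy generator. This is only a sketch of the two relationships: the function names, the Gaussian noise and the unit shift are assumptions made for illustration, and it is not the causal model that the Synthetic dataset actually uses.

```python
import random
from statistics import mean


def sample_feature(n: int, depends_on_s: bool, seed: int = 0) -> tuple[list[int], list[float]]:
    """Draw a binary s and one feature x; x is shifted by s only if requested."""
    rng = random.Random(seed)
    s = [rng.randint(0, 1) for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) + (si if depends_on_s else 0.0) for si in s]
    return s, x


def group_gap(s: list[int], x: list[float]) -> float:
    """Absolute difference between the mean of x in the two s groups."""
    x0 = [xi for si, xi in zip(s, x) if si == 0]
    x1 = [xi for si, xi in zip(s, x) if si == 1]
    return abs(mean(x1) - mean(x0))


# X independent of S: the group means should be close to each other.
gap_independent = group_gap(*sample_feature(5000, depends_on_s=False))
# S is an ancestor of X: the group means differ by roughly the shift of 1.
gap_dependent = group_gap(*sample_feature(5000, depends_on_s=True))
```

A gap between group means is only one simple symptom of dependence; it is used here because it is easy to check by eye.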
- Parameters:
discrete_only (bool) –
invert_s (bool) –
scenario (SyntheticScenarios) –
target (SyntheticTargets) –
fair (bool) –
num_samples (int) –
- Scenarios#
alias of
SyntheticScenarios
- Targets#
alias of
SyntheticTargets
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- class SyntheticScenarios(value)#
Bases:
Enum
Scenarios for the synthetic dataset.
- class SyntheticTargets(value)#
Bases:
Enum
Targets for the synthetic dataset.
- class Toy(discrete_only=False, invert_s=False)#
Bases:
StaticCSVDataset
Dataset with toy data for testing.
- Parameters:
discrete_only (bool) –
invert_s (bool) –
- __len__()#
Return number of elements in the dataset.
- Return type:
int
- property class_labels: list[str]#
Get the list of class labels.
- property continuous_features: list[str]#
Continuous features.
- property disc_feature_groups: Dict[str, List[str]]#
Return Dictionary of feature groups, without s and y labels.
- discard_non_one_hot: ClassVar[bool] = False#
If some entries in s or y are not correctly one-hot encoded, discard those.
- property discrete_features: list[str]#
List of features that are discrete.
- feature_split(order=FeatureOrder.disc_first)#
Return an ordered features dictionary.
This should have separate entries for the features, the labels and the sensitive attributes, but the x features are ordered so first are the discrete features, then the continuous.
- Parameters:
order (FeatureOrder) –
- Return type:
- property filepath: Path#
Filepath from which to load the data.
- get_filename_or_path()#
Filename of CSV files containing the data.
- Return type:
str | Path
- get_label_specs()#
Label specs and discrete feature groups that have to be removed from x.
- Return type:
LabelSpecsPair
- get_num_samples()#
Number of samples in the dataset.
- Return type:
int
- property invert_sens_attr: bool#
Whether to invert the sensitive attribute.
- load(order=FeatureOrder.disc_first, *, labels_as_features=False)#
Load dataset from its CSV file.
- Parameters:
order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.
labels_as_features (bool) – If True, the s and y labels are included in the x features.
- Returns:
DataTuple with dataframes of features, labels and sensitive attributes.
- Return type:
DataTuple
- load_aif()#
Load the dataset as an AIF360 dataset.
Experimental. Requires the aif360 library.
Ignores the type check as the return type is not yet defined.
- property load_discrete_only: bool#
Whether to only load discrete features.
- map_to_binary: ClassVar[bool] = False#
If True, convert labels from {-1, 1} to {0, 1}.
- property name: str#
Name of the dataset.
- property sens_attrs: list[str]#
Get the list of sensitive attributes.
- property unfiltered_disc_feat_groups: Dict[str, List[str]]#
Discrete feature groups, including features for the labels.
- available_tabular()#
List of tabular dataset names.
- Return type:
list[str]
- create_data_obj(filepath, s_column, y_column, additional_to_drop=None)#
Create a ConfigurableDataset from the given file.
- Parameters:
filepath (Path) – path to a CSV file
s_column (str) – column that represents sensitive attributes
y_column (str) – column that contains labels
additional_to_drop (list[str] | None) – other columns that should be dropped (Default: None)
- Returns:
Dataset object
- Return type:
ConfigurableDataset
- filter_features_by_prefixes(features, prefixes)#
Filter the features by prefixes.
- Parameters:
features (Sequence[str]) – list of features names
prefixes (Sequence[str]) – list of prefixes
- Returns:
filtered feature names
- Return type:
list[str]
- flatten_dict(dictionary)#
Flatten a dictionary of lists by joining all lists to one big list.
- Parameters:
dictionary (Mapping[str, list[str]] | None) –
- Return type:
list[str]
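The flattening described for flatten_dict can be sketched in a few lines of standard-library Python. The function below is a stand-in written for illustration, not the library's code; returning an empty list for None is an assumption based on the parameter type, which allows None.

```python
from collections.abc import Mapping
from itertools import chain
from typing import Optional


def flatten_dict_sketch(dictionary: Optional[Mapping[str, list[str]]]) -> list[str]:
    """Join all list values into one list, in the mapping's iteration order."""
    if dictionary is None:  # assumption: None is treated like an empty mapping
        return []
    return list(chain.from_iterable(dictionary.values()))


# Feature groups borrowed from the Toy example on this page.
groups = {"disc_1": ["a_1", "a_2", "a_3"], "disc_2": ["b_1", "b_2"]}
flat = flatten_dict_sketch(groups)
```

The group names (the keys) are discarded; only the feature names survive.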
- from_dummies(data, categorical_cols)#
Convert one-hot encoded columns into categorical columns.
- Parameters:
data (DataFrame) –
categorical_cols (Mapping[str, Sequence[str]]) –
- Return type:
DataFrame
- get_dataset_obj_by_name(name)#
Given a dataset name, get the corresponding dataset object.
- Parameters:
name (str) – Name of the dataset.
- Returns:
A callable that can be used to construct the dataset object.
- Raises:
NotImplementedError – If the given name does not correspond to a dataset.
- Return type:
Callable[[], Dataset]
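The lookup returns a constructor rather than an instance, so the caller decides when the dataset object is built. The sketch below mimics that contract with a hypothetical registry and a placeholder class; the registry contents and the case-insensitive matching are assumptions, since this reference does not show how names are resolved.

```python
from typing import Callable


class ToyStandIn:
    """Hypothetical placeholder for a dataset class."""

    name = "Toy"


# Hypothetical registry; the library's real lookup table is not shown in this reference.
_REGISTRY: dict[str, Callable[[], object]] = {"toy": ToyStandIn}


def get_dataset_by_name_sketch(name: str) -> Callable[[], object]:
    """Return a zero-argument constructor, or raise NotImplementedError."""
    try:
        return _REGISTRY[name.lower()]  # assumption: matching ignores case
    except KeyError:
        raise NotImplementedError(f"No dataset named {name!r}") from None


constructor = get_dataset_by_name_sketch("Toy")  # a callable, not yet an instance
dataset = constructor()                          # calling it builds the object
```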
- get_discrete_features(all_feats, feats_to_remove, cont_feats)#
Get a list of the discrete features in a dataset.
- Parameters:
all_feats (list[str]) – List of all features in the dataset.
feats_to_remove (list[str]) – List of features that aren’t used.
cont_feats (list[str]) – List of continuous features in the dataset.
- Returns:
List of features not marked as continuous or to be removed.
- Return type:
list[str]
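The selection rule stated above (keep what is neither continuous nor marked for removal) can be sketched as a simple filter. This stand-in compares names exactly; whether the library matches removed features exactly or by prefix is not stated here, so treat that detail as an assumption.

```python
def get_discrete_features_sketch(
    all_feats: list[str], feats_to_remove: list[str], cont_feats: list[str]
) -> list[str]:
    """Keep features that are neither removed nor continuous, preserving order."""
    excluded = set(feats_to_remove) | set(cont_feats)
    return [feat for feat in all_feats if feat not in excluded]


# Hypothetical feature names for illustration.
discrete = get_discrete_features_sketch(
    all_feats=["age", "job_a", "job_b", "id"],
    feats_to_remove=["id"],
    cont_feats=["age"],
)
```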
- group_disc_feat_indices(disc_feat_names, prefix_sep='_')#
Group discrete features names according to the first segment of their name.
Returns a list of their corresponding slices (assumes order is maintained).
- Parameters:
disc_feat_names (list[str]) – List of discrete feature names.
prefix_sep (str) – Separator between the prefix and the rest of the name. (Default: “_”)
- Returns:
List of slices.
- Return type:
list[slice]
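One way to produce such slices is to group consecutive names by the text before the first separator and track a running offset. The function below is an illustrative stand-in, not the library's implementation; like the description above, it assumes that features of the same group are adjacent in the input.

```python
from itertools import groupby


def group_disc_feat_indices_sketch(disc_feat_names: list[str], prefix_sep: str = "_") -> list[slice]:
    """Return one slice per run of consecutive names sharing the same first segment."""
    slices: list[slice] = []
    start = 0
    for _, run in groupby(disc_feat_names, key=lambda name: name.split(prefix_sep)[0]):
        length = len(list(run))
        slices.append(slice(start, start + length))
        start += length
    return slices


names = ["a_1", "a_2", "a_3", "b_1", "b_2"]
slices = group_disc_feat_indices_sketch(names)
groups = [names[s] for s in slices]  # apply the slices back to the name list
```

Because `groupby` only merges adjacent items, an interleaved input such as `["a_1", "b_1", "a_2"]` would yield three slices rather than two.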
- label_spec_to_feature_list(spec)#
Extract all the feature column names from a dictionary of label specifications.
- Parameters:
spec (Mapping[str, LabelGroup]) – Dictionary of label specifications.
- Returns:
A flattened list of all the columns occurring in the label specs.
- Return type:
list[str]
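The extraction amounts to concatenating the columns field of every group in the spec. The sketch below uses a local LabelGroup stand-in with the documented fields and made-up column names; it is not the library's code.

```python
from collections.abc import Mapping
from typing import NamedTuple


class LabelGroup(NamedTuple):
    """Local stand-in with the documented fields (not the ethicml class)."""

    columns: list[str]
    multiplier: int = 1


def label_spec_to_feature_list_sketch(spec: Mapping[str, LabelGroup]) -> list[str]:
    """Collect the columns of every label group into one flat list."""
    return [column for group in spec.values() for column in group.columns]


# Hypothetical spec: two named groups, each listing its one-hot columns.
spec = {
    "sex": LabelGroup(["sex_Male", "sex_Female"]),
    "race": LabelGroup(["race_A", "race_B"], multiplier=2),
}
columns = label_spec_to_feature_list_sketch(spec)
```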
- load_data(dataset)#
Load dataset from its CSV file.
This function only exists for backwards compatibility. Use dataset.load() instead.
- one_hot_encode_and_combine(attributes, label_spec, *, discard_non_one_hot)#
Construct a new label according to the given LabelSpec.
This function is at the heart of the label spec API in EthicML.
- Parameters:
attributes (DataFrame) – DataFrame containing the attributes.
label_spec (Mapping[str, LabelGroup]) – A label spec.
discard_non_one_hot (bool) – If True, a mask is returned which masks out all rows which are not properly one-hot (i.e., either all classes are 0 or more than one is 1).
- Returns:
A tuple of a Series with the new labels and, if discard_non_one_hot is True, a mask for filtering out the rows that were not properly one-hot.
- Return type:
tuple[Series, Series | None]
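The "properly one-hot" condition can be made concrete with plain lists. The sketch below is a simplification: it works on lists of 0/1 values instead of a DataFrame, the helper names are hypothetical, and it ignores the multiplier of a LabelGroup, whose exact effect is not described in this reference.

```python
def one_hot_mask(rows: list[list[int]]) -> list[bool]:
    """True for rows with exactly one 1 and zeros elsewhere."""
    return [sum(row) == 1 and all(v in (0, 1) for v in row) for row in rows]


def class_index(row: list[int]) -> int:
    """Index of the active class in a properly one-hot row."""
    return row.index(1)


rows = [
    [1, 0, 0],  # valid
    [0, 0, 0],  # invalid: no class set
    [0, 1, 1],  # invalid: more than one class set
    [0, 0, 1],  # valid
]
mask = one_hot_mask(rows)
# Keep only the rows that pass the mask, then read off the class index.
labels = [class_index(row) for row, keep in zip(rows, mask) if keep]
```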
- reduce_feature_group(disc_feature_groups, feature_group, to_keep, remaining_feature_name)#
Drop all features in the given feature group except the ones in to_keep.
- Parameters:
disc_feature_groups (Mapping[str, list[str]]) – Dictionary of feature groups.
feature_group (str) – Name of the feature group that will be replaced by to_keep.
to_keep (Sequence[str]) – List of features that will be kept in the feature group.
remaining_feature_name (str) – Name of the dummy feature that will be used to summarize the removed features.
- Returns:
Modified dictionary of feature groups.
- Return type:
Dict[str, List[str]]
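A rough sketch of the reduction, under stated assumptions: the stand-in below returns a copy rather than modifying its input, and it appends remaining_feature_name verbatim after the kept features. The reference does not say how the library names or positions the dummy feature, so both choices are guesses made for illustration.

```python
from collections.abc import Mapping, Sequence


def reduce_feature_group_sketch(
    disc_feature_groups: Mapping[str, list[str]],
    feature_group: str,
    to_keep: Sequence[str],
    remaining_feature_name: str,
) -> dict[str, list[str]]:
    """Return a copy in which one group is shrunk to `to_keep` plus a dummy feature."""
    reduced = {name: list(features) for name, features in disc_feature_groups.items()}
    # Assumption: the dummy name is used verbatim and appended after the kept features.
    reduced[feature_group] = [*to_keep, remaining_feature_name]
    return reduced


# Hypothetical feature groups for illustration.
groups = {"country": ["country_US", "country_FR", "country_DE"], "sex": ["sex_F", "sex_M"]}
reduced = reduce_feature_group_sketch(groups, "country", ["country_US"], "country_other")
```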
- single_col_spec(col, feature_name=None)#
Create a label spec for the case where the label is defined by a single column.
- Parameters:
col (str) –
feature_name (str | None) –
- Return type:
Mapping[str, LabelGroup]
- spec_from_binary_cols(label_defs)#
Create label specs for the most common case where columns contain 0s and 1s.
- Parameters:
label_defs (Mapping[str, Sequence[str]]) – Mapping of label names to column names.
- Returns:
Label specifications.
- Return type:
Mapping[str, LabelGroup]
Aliases#
- DiscFeatureGroups#
A grouping of discrete features such that the result is one-hot encoded.
- LabelSpec#
A label specification consisting of LabelGroup entries with names.