ethicml.data#

Module for items related to data, such as raw CSV files and data objects.

Classes:

AcsEmployment

The ACS Employment Dataset from EAAMO21/NeurIPS21 - Retiring Adult.

AcsIncome

The ACS Income Dataset from EAAMO21/NeurIPS21 - Retiring Adult.

Admissions

UFRGS Admissions dataset.

AdmissionsSplits

Splits for the Admissions dataset.

Adult

UCI Adult dataset.

AdultSplits

Available dataset splits for the Adult dataset.

CSVDataset

Dataset class that loads data from CSV files.

CSVDatasetDC

Dataset that uses the default load function.

Compas

COMPAS (or ProPublica) dataset.

CompasSplits

Available dataset splits for the COMPAS dataset.

Credit

UCI Credit Card dataset.

CreditSplits

Splits for the Credit dataset.

Crime

UCI Communities and Crime dataset.

CrimeSplits

Splits for the Crime dataset.

Dataset

Data structure that holds all the information needed to load a given dataset.

FeatureOrder

Order of features in the loaded datatuple.

FeatureSplit

A dictionary of the list of columns that belong to the feature groups.

German

German credit dataset.

GermanSplits

Splits for the German dataset.

Health

Heritage Health dataset.

HealthSplits

Splits for the Health dataset.

LabelGroup

Definition of a group of columns that should be interpreted as a single label.

LabelSpecsPair

A pair of label specs.

Law

LSAC Law School dataset.

LawSplits

Splits for the Law dataset.

LegacyDataset

Dataset base class.

Lipton

Synthetic dataset from Lipton et al. 2018.

NonBinaryToy

Dataset with toy data for testing.

Nursery

UCI Nursery dataset.

NurserySplits

Available dataset splits for the Nursery dataset.

Sqf

Stop, question and frisk dataset.

SqfSplits

Splits for the SQF dataset.

StaticCSVDataset

Dataset whose size and file location do not depend on constructor arguments.

Synthetic

Dataset with synthetic data.

SyntheticScenarios

Scenarios for the synthetic dataset.

SyntheticTargets

Targets for the synthetic dataset.

Toy

Dataset with toy data for testing.

Functions:

available_tabular

List of tabular dataset names.

create_data_obj

Create a ConfigurableDataset from the given file.

filter_features_by_prefixes

Filter the features by prefixes.

flatten_dict

Flatten a dictionary of lists by joining all lists to one big list.

from_dummies

Convert one-hot encoded columns into categorical columns.

get_dataset_obj_by_name

Given a dataset name, get the corresponding dataset object.

get_discrete_features

Get a list of the discrete features in a dataset.

group_disc_feat_indices

Group discrete feature names according to the first segment of their name.

label_spec_to_feature_list

Extract all the feature column names from a dictionary of label specifications.

load_data

Load dataset from its CSV file.

one_hot_encode_and_combine

Construct a new label according to the given LabelSpec.

reduce_feature_group

Drop all features in the given feature group except the ones in to_keep.

single_col_spec

Create a label spec for the case where the label is defined by a single column.

spec_from_binary_cols

Create label specs for the most common case where columns contain 0s and 1s.
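To make the flatten_dict summary above concrete, here is a minimal stand-alone sketch of that behaviour. It is not EthicML's source code; the helper name flatten_dict_sketch and the example feature groups are made up for illustration.

```python
from itertools import chain


def flatten_dict_sketch(groups: dict[str, list[str]]) -> list[str]:
    """Join all lists in a dictionary into one list, in insertion order."""
    return list(chain.from_iterable(groups.values()))


# Hypothetical discrete feature groups (one-hot column names per feature).
feature_groups = {
    "colour": ["colour_red", "colour_blue"],
    "size": ["size_small", "size_large"],
}
flat = flatten_dict_sketch(feature_groups)
print(flat)
```

The keys are dropped and only the column names survive, which is convenient when a flat list of discrete columns is needed from a dictionary of feature groups.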

class AcsEmployment(root, year, horizon, states, split='Sex', *, discrete_only=False, invert_s=False)#

Bases: _AcsBase

The ACS Employment Dataset from EAAMO21/NeurIPS21 - Retiring Adult.

Parameters:
  • root (str | Path) –

  • year (str) –

  • horizon (int) –

  • states (list[Literal['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'PR']]) –

  • split (str) –

  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

static cat_lookup(key)#

Look up categories.

Parameters:

key (str) –

Return type:

list[int]

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

List of features that are continuous.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property features_to_remove: list[str]#

Features that have to be removed from x.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

class AcsIncome(root, year, horizon, states, split='Sex', target_threshold=50000, *, discrete_only=False, invert_s=False)#

Bases: _AcsBase

The ACS Income Dataset from EAAMO21/NeurIPS21 - Retiring Adult.

Parameters:
  • root (str | Path) –

  • year (str) –

  • horizon (int) –

  • states (list[Literal['AL', 'AK', 'AZ', 'AR', 'CA', 'CO', 'CT', 'DE', 'FL', 'GA', 'HI', 'ID', 'IL', 'IN', 'IA', 'KS', 'KY', 'LA', 'ME', 'MD', 'MA', 'MI', 'MN', 'MS', 'MO', 'MT', 'NE', 'NV', 'NH', 'NJ', 'NM', 'NY', 'NC', 'ND', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY', 'PR']]) –

  • split (str) –

  • target_threshold (int) –

  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

static cat_lookup(key)#

Look up categories.

Parameters:

key (str) –

Return type:

list[int]

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

List of features that are continuous.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property features_to_remove: list[str]#

Features that have to be removed from x.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

class Admissions(discrete_only=False, invert_s=False, split=AdmissionsSplits.GENDER)#

Bases: LegacyDataset

UFRGS Admissions dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (AdmissionsSplits) –

Splits#

alias of AdmissionsSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.
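One way to read this flag is as a row filter: a row is kept only if, within the columns that encode a label, exactly one column is set. The sketch below illustrates that idea with plain dictionaries. The column names and the helper is_one_hot are hypothetical, and the exact check EthicML performs is not shown in this reference.

```python
def is_one_hot(row: dict[str, int], columns: list[str]) -> bool:
    """Return True if exactly one of the given columns is 1 and the rest are 0."""
    values = [row[col] for col in columns]
    return all(v in (0, 1) for v in values) and sum(values) == 1


# Hypothetical one-hot columns that together encode a single label.
label_columns = ["group_a", "group_b"]
rows = [
    {"group_a": 1, "group_b": 0},  # valid
    {"group_a": 0, "group_b": 0},  # invalid: no category set
    {"group_a": 1, "group_b": 1},  # invalid: two categories set
    {"group_a": 0, "group_b": 1},  # valid
]
kept = [row for row in rows if is_one_hot(row, label_columns)]
print(len(kept))
```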

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.
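The conversion described here amounts to a fixed relabelling. A minimal sketch, assuming labels arrive as a plain list of integers (the helper name to_binary is made up for this example):

```python
def to_binary(labels: list[int]) -> list[int]:
    """Map labels from {-1, 1} to {0, 1}."""
    mapping = {-1: 0, 1: 1}
    # A KeyError here signals a label outside {-1, 1}.
    return [mapping[label] for label in labels]


converted = to_binary([-1, 1, 1, -1])
print(converted)
```

Using an explicit mapping rather than arithmetic makes unexpected label values fail loudly instead of being silently transformed.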

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class AdmissionsSplits(value)#

Bases: Enum

Splits for the Admissions dataset.

class Adult(discrete_only=False, invert_s=False, split=AdultSplits.SEX, binarize_nationality=False, binarize_race=False)#

Bases: StaticCSVDataset

UCI Adult dataset.

Parameters:
  • discrete_only (bool) – If True, continuous features are dropped. (Default: False)

  • invert_s (bool) – If True, the (binary) s values are inverted. (Default: False)

  • split (AdultSplits) – What to use as s. (Default: AdultSplits.SEX)

  • binarize_nationality (bool) – If True, nationality will be USA vs rest. (Default: False)

  • binarize_race (bool) – If True, race will be white vs rest. (Default: False)
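The parameters above could be combined as in the following sketch. It uses only names that appear in this reference (Adult, AdultSplits.SEX, FeatureOrder.cont_first, load, name, __len__) and assumes that EthicML is installed and that all three classes can be imported from ethicml.data. The wrapper function summarise_adult is hypothetical.

```python
def summarise_adult():
    """Load UCI Adult with binarized race and return (name, size, data)."""
    # Imported lazily so the function can be defined without EthicML installed.
    from ethicml.data import Adult, AdultSplits, FeatureOrder

    dataset = Adult(split=AdultSplits.SEX, binarize_race=True)
    # Request continuous features first; the documented return type is DataTuple.
    data = dataset.load(order=FeatureOrder.cont_first)
    return dataset.name, len(dataset), data
```

Calling summarise_adult() reads the dataset's CSV file, so it is only meaningful in an environment where EthicML and its bundled data are available.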

Splits#

alias of AdultSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class AdultSplits(value)#

Bases: Enum

Available dataset splits for the Adult dataset.

class CSVDataset#

Bases: Dataset, ABC

Dataset class that loads data from CSV files.

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

abstract property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

abstract get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

abstract get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

abstract get_num_samples()#

Number of samples in the dataset.

Return type:

int

abstract property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

abstract property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

abstract property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

abstract property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class CSVDatasetDC(discrete_only=False, invert_s=False)#

Bases: CSVDataset, ABC

Dataset that uses the default load function.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

abstract property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

abstract get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

abstract get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

abstract get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

abstract property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

abstract property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class Compas(discrete_only=False, invert_s=False, split=CompasSplits.SEX)#

Bases: LegacyDataset

COMPAS (or ProPublica) dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (CompasSplits) –

Splits#

alias of CompasSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class CompasSplits(value)#

Bases: Enum

Available dataset splits for the COMPAS dataset.

class Credit(discrete_only=False, invert_s=False, split=CreditSplits.SEX)#

Bases: LegacyDataset

UCI Credit Card dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (CreditSplits) –

Splits#

alias of CreditSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class CreditSplits(value)#

Bases: Enum

Splits for the Credit dataset.

class Crime(discrete_only=False, invert_s=False, split=CrimeSplits.RACE_BINARY)#

Bases: LegacyDataset

UCI Communities and Crime dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (CrimeSplits) –

Splits#

alias of CrimeSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class CrimeSplits(value)#

Bases: Enum

Splits for the Crime dataset.

class Dataset#

Bases: ABC

Data structure that holds all the information needed to load a given dataset.

abstract __len__()#

Return number of elements in the dataset.

Return type:

int

abstract property continuous_features: list[str]#

Continuous features.

abstract property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups.

abstract property discrete_features: list[str]#

List of features that are discrete.

abstract feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

abstract load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

abstract property name: str#

Name of the dataset.

class FeatureOrder(value)#

Bases: StrEnum

Order of features in the loaded datatuple.

cont_first = 'cont_first'#

Continuous features first.

disc_first = 'disc_first'#

Discrete features first.
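The two members can be pictured as selecting how discrete and continuous column lists are concatenated. The sketch below uses a stand-in enum (FeatureOrderSketch) rather than the real class, and the helper order_features and the column names are hypothetical.

```python
from enum import Enum


class FeatureOrderSketch(str, Enum):
    """Stand-in for a string-valued enum with the two members listed above."""

    cont_first = "cont_first"
    disc_first = "disc_first"


def order_features(
    discrete: list[str], continuous: list[str], order: FeatureOrderSketch
) -> list[str]:
    """Concatenate the two feature lists in the requested order."""
    if order is FeatureOrderSketch.cont_first:
        return continuous + discrete
    return discrete + continuous


# Hypothetical column names.
disc = ["job_clerk", "job_nurse"]
cont = ["age", "hours"]
disc_first = order_features(disc, cont, FeatureOrderSketch.disc_first)
# Members of a string-valued enum can be looked up from their plain string value.
cont_first = order_features(disc, cont, FeatureOrderSketch("cont_first"))
print(disc_first, cont_first)
```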

class FeatureSplit#

Bases: TypedDict

A dictionary of the list of columns that belong to the feature groups.
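Because the class derives from TypedDict, instances are ordinary dictionaries at runtime, which is why the standard dict methods appear in this listing. A minimal sketch follows; the stand-in class FeatureSplitSketch, its keys "x", "s" and "y", and the column names are assumptions made for illustration and are not confirmed by this reference.

```python
from typing import TypedDict


class FeatureSplitSketch(TypedDict):
    """Stand-in typed dictionary; the keys "x", "s" and "y" are assumed."""

    x: list[str]
    s: list[str]
    y: list[str]


split: FeatureSplitSketch = {
    "x": ["age", "hours"],
    "s": ["sex"],
    "y": ["salary"],
}
# At runtime a TypedDict value is a plain dict, so dict methods apply.
is_plain_dict = type(split) is dict
num_entries = len(split)
all_columns = [col for cols in split.values() for col in cols]
print(is_plain_dict, num_entries, all_columns)
```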

__len__()#

Return len(self).

clear()#

Remove all items from the dictionary.

copy()#

Return a shallow copy of the dictionary.

fromkeys(iterable, value=None, /)#

Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)#

Return the value for key if key is in the dictionary, else default.

items()#

Return a set-like object providing a view on the dictionary's items.

keys()#

Return a set-like object providing a view on the dictionary's keys.

pop(k[, d])#

Remove the specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()#

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.

setdefault(key, default=None, /)#

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F)#

Update the dictionary D from dict/iterable E and keyword arguments F.

If E is present and has a .keys() method, this does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, this does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].

values()#

Return an object providing a view on the dictionary's values.

class German(discrete_only=False, invert_s=False, split=GermanSplits.SEX)#

Bases: LegacyDataset

German credit dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (GermanSplits) –

Splits#

alias of GermanSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

The dictionary should have separate entries for the features, the labels and the sensitive attributes. The x features are ordered as specified by order: by default, discrete features first, then continuous.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class GermanSplits(value)#

Bases: Enum

Splits for the German dataset.

class Health(discrete_only=False, invert_s=False, split=HealthSplits.SEX)#

Bases: LegacyDataset

Heritage Health dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (HealthSplits) –

Splits#

alias of HealthSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit
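
As a rough illustration of the ordering described above, the sketch below builds such a dictionary by hand. The helper name `ordered_feature_split`, the keys `x`, `s` and `y`, and the plain-string order values are assumptions made for this example; they are not taken from the EthicML source.

```python
from __future__ import annotations


def ordered_feature_split(
    disc: list[str],
    cont: list[str],
    s: list[str],
    y: list[str],
    order: str = "disc_first",
) -> dict[str, list[str]]:
    """Hypothetical helper: place discrete or continuous x features first."""
    if order == "disc_first":
        x = disc + cont
    elif order == "cont_first":
        x = cont + disc
    else:
        raise ValueError(f"unknown order: {order!r}")
    # Assumed layout: one entry each for features, sensitive attributes, labels.
    return {"x": x, "s": list(s), "y": list(y)}


split = ordered_feature_split(["a_1", "a_2"], ["c1"], ["sens"], ["class"])
print(split["x"])  # ['a_1', 'a_2', 'c1']
```

Passing `order="cont_first"` to the sketch reverses the two blocks of x features and leaves the other entries untouched.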

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.
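
The conversion from {-1, 1} to {0, 1} can be written as (y + 1) / 2. The following is a minimal sketch of that arithmetic on a plain list; the function name `to_zero_one` is hypothetical, and how the library applies the mapping internally is not shown in this reference.

```python
from __future__ import annotations


def to_zero_one(labels: list[int]) -> list[int]:
    """Map labels from {-1, 1} to {0, 1} via (y + 1) // 2."""
    if any(y not in (-1, 1) for y in labels):
        raise ValueError("expected labels in {-1, 1}")
    return [(y + 1) // 2 for y in labels]


print(to_zero_one([-1, 1, 1, -1]))  # [0, 1, 1, 0]
```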

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class HealthSplits(value)#

Bases: Enum

Splits for the Health dataset.

class LabelGroup(columns, multiplier=1)#

Bases: NamedTuple

Definition of a group of columns that should be interpreted as a single label.

Parameters:
  • columns (list[str]) –

  • multiplier (int) –

__len__()#

Return len(self).

columns: list[str]#

Alias for field number 0

count(value, /)#

Return number of occurrences of value.

index(value, start=0, stop=9223372036854775807, /)#

Return first index of value.

Raises ValueError if the value is not present.

multiplier: int#

Alias for field number 1
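
Because LabelGroup is a NamedTuple, the inherited tuple methods act on its two fields and not on the column list. The stand-in below mirrors the documented fields to make that visible; `LabelGroupSketch` and the column names are hypothetical and do not come from the library.

```python
from __future__ import annotations

from typing import NamedTuple


class LabelGroupSketch(NamedTuple):
    """Stand-in mirroring the documented fields; not the library class."""

    columns: list[str]
    multiplier: int = 1


group = LabelGroupSketch(columns=["race_white", "race_black"])
print(len(group))  # 2: the number of fields, not the number of columns
print(group[0] is group.columns)  # True: columns is field number 0
print(group.multiplier)  # 1: the documented default
```

To count the columns in a group, use `len(group.columns)` instead of `len(group)`.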

class LabelSpecsPair(s, y, to_remove=<factory>)#

Bases: object

A pair of label specs.

Parameters:
  • s (Mapping[str, LabelGroup]) – Spec for building the s label.

  • y (Mapping[str, LabelGroup]) – Spec for building the y label.

  • to_remove (list[str]) – List of feature groups that need to be removed because they are label building blocks. (Default: [])
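
The `<factory>` default in the signature suggests that each instance receives its own fresh list for to_remove. The sketch below reproduces that pattern with a dataclass; `GroupSketch` and `LabelSpecsPairSketch` are hypothetical stand-ins, and whether the real class is implemented this way is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Mapping, NamedTuple


class GroupSketch(NamedTuple):
    """Hypothetical stand-in for a label group."""

    columns: list[str]
    multiplier: int = 1


@dataclass
class LabelSpecsPairSketch:
    """Hypothetical stand-in with the three documented fields."""

    s: Mapping[str, GroupSketch]
    y: Mapping[str, GroupSketch]
    # default_factory builds a new list per instance instead of sharing one.
    to_remove: list[str] = field(default_factory=list)


first = LabelSpecsPairSketch(
    s={"sens": GroupSketch(["sens"])}, y={"class": GroupSketch(["class"])}
)
second = LabelSpecsPairSketch(
    s={"sens": GroupSketch(["sens"])}, y={"class": GroupSketch(["class"])}
)
first.to_remove.append("sex")
print(second.to_remove)  # []: the default list is not shared
```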

class Law(discrete_only=False, invert_s=False, split=LawSplits.SEX)#

Bases: LegacyDataset

LSAC Law School dataset.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (LawSplits) –

Splits#

alias of LawSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class LawSplits(value)#

Bases: Enum

Splits for the Law dataset.

class LegacyDataset(*, name, filename_or_path, features, cont_features, sens_attr_spec, class_label_spec, num_samples, s_feature_groups=None, class_feature_groups=None, discrete_feature_groups=None)#

Bases: CSVDataset

Dataset base class.

This base class is considered legacy now. Please use CSVDatasetDC or StaticCSVDataset instead.

Parameters:
  • name (str) –

  • filename_or_path (str | Path) –

  • features (Sequence[str]) –

  • cont_features (Sequence[str]) –

  • sens_attr_spec (str | Mapping[str, LabelGroup]) –

  • class_label_spec (str | Mapping[str, LabelGroup]) –

  • num_samples (int) –

  • s_feature_groups (Sequence[str] | None) –

  • class_feature_groups (Sequence[str] | None) –

  • discrete_feature_groups (dict[str, list[str]] | None) –

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class Lipton(discrete_only=False, invert_s=False)#

Bases: LegacyDataset

Synthetic dataset from Lipton et al. 2018.

Described in section 4.1 of “Does mitigating ML’s impact disparity require treatment disparity?”

@article{lipton2018does,
    title={Does mitigating ML's impact disparity require treatment disparity?},
    author={Lipton, Zachary and McAuley, Julian and Chouldechova, Alexandra},
    journal={Advances in neural information processing systems},
    volume={31},
    pages={8125--8135},
    year={2018}
}

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class NonBinaryToy(discrete_only=False, invert_s=False)#

Bases: LegacyDataset

Dataset with toy data for testing.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class Nursery(discrete_only=False, invert_s=False, split=NurserySplits.FINANCE)#

Bases: LegacyDataset

UCI Nursery dataset.

Parameters:
  • discrete_only (bool) – If True, continuous features are dropped. (Default: False)

  • invert_s (bool) – If True, the (binary) s values are inverted. (Default: False)

  • split (NurserySplits) – What to use as s. (Default: NurserySplits.FINANCE)

Splits#

alias of NurserySplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class NurserySplits(value)#

Bases: Enum

Available dataset splits for the Nursery dataset.

class Sqf(discrete_only=False, invert_s=False, split=SqfSplits.SEX)#

Bases: LegacyDataset

Stop, question and frisk dataset.

This data is from 2016. Source: http://www1.nyc.gov/site/nypd/stats/reports-analysis/stopfrisk.page

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • split (SqfSplits) –

Splits#

alias of SqfSplits

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class SqfSplits(value)#

Bases: Enum

Splits for the SQF dataset.

class StaticCSVDataset(discrete_only=False, invert_s=False)#

Bases: CSVDatasetDC, ABC

Dataset whose size and file location do not depend on constructor arguments.

Example:

How to subclass this:

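from dataclasses import dataclass
from typing import ClassVar

from ethicml.data import DiscFeatureGroups, LabelSpecsPair, StaticCSVDataset, single_col_spec


@dataclass
class Toy(StaticCSVDataset):
    '''Dataset with toy data for testing.'''

    # Size and file name are fixed for the class, not set per instance.
    num_samples: ClassVar[int] = 400
    csv_file: ClassVar[str] = "toy.csv"

    @property
    def name(self) -> str:
        return "Toy"

    def get_label_specs(self) -> LabelSpecsPair:
        # Both s and y are read from a single column each.
        return LabelSpecsPair(
            s=single_col_spec("sens"), y=single_col_spec("class")
        )

    @property
    def unfiltered_disc_feat_groups(self) -> DiscFeatureGroups:
        return {"disc_1": ["a_1", "a_2", "a_3"], "disc_2": ["b_1", "b_2"]}

    @property
    def continuous_features(self) -> list[str]:
        return ["c1", "c2"]

Parameters: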
  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

abstract property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

abstract get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

abstract property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

abstract property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class Synthetic(discrete_only=False, invert_s=False, scenario=SyntheticScenarios.S1, target=SyntheticTargets.Y3, fair=False, num_samples=1000)#

Bases: CSVDatasetDC

Dataset with synthetic data.

Notation: ⊥ = “is independent of”; ~ = “is an ancestor of” in the causal model used to generate the data.

Scenario 1 = X⊥S & Y⊥S.
  • This models completely fair data.

Scenario 2 = X_2⊥S & Y_2⊥S; X_1~S, Y_1~S & Y_3~S
  • This models data where the inputs are biased. This is propagated through to the target.

Scenario 3 = X⊥S, Y_1⊥S, Y_2⊥S; Y_3~S
  • This models data where the target is biased.

Scenario 4 = X_2⊥S, Y_2⊥S; X_1~S, Y_1~S, Y_3~S
  • This models data where both the input and target are directly biased.

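Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

  • scenario (SyntheticScenarios) –

  • target (SyntheticTargets) –

  • fair (bool) –

  • num_samples (int) –
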
Scenarios#

alias of SyntheticScenarios

Targets#

alias of SyntheticTargets

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

class SyntheticScenarios(value)#

Bases: Enum

Scenarios for the synthetic dataset.

class SyntheticTargets(value)#

Bases: Enum

Targets for the synthetic dataset.

class Toy(discrete_only=False, invert_s=False)#

Bases: StaticCSVDataset

Dataset with toy data for testing.

Parameters:
  • discrete_only (bool) –

  • invert_s (bool) –

__len__()#

Return number of elements in the dataset.

Return type:

int

property class_labels: list[str]#

Get the list of class labels.

property continuous_features: list[str]#

Continuous features.

property disc_feature_groups: Dict[str, List[str]]#

Return Dictionary of feature groups, without s and y labels.

discard_non_one_hot: ClassVar[bool] = False#

If some entries in s or y are not correctly one-hot encoded, discard those.

property discrete_features: list[str]#

List of features that are discrete.

feature_split(order=FeatureOrder.disc_first)#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

Parameters:

order (FeatureOrder) –

Return type:

FeatureSplit

property filepath: Path#

Filepath from which to load the data.

get_filename_or_path()#

Filename of CSV files containing the data.

Return type:

str | Path

get_label_specs()#

Label specs and discrete feature groups that have to be removed from x.

Return type:

LabelSpecsPair

get_num_samples()#

Number of samples in the dataset.

Return type:

int

property invert_sens_attr: bool#

Whether to invert the sensitive attribute.

load(order=FeatureOrder.disc_first, *, labels_as_features=False)#

Load dataset from its CSV file.

Parameters:
  • order (FeatureOrder) – Order of the columns in the dataframes. Can be disc_first or cont_first. See FeatureOrder.

  • labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns:

DataTuple with dataframes of features, labels and sensitive attributes.

Return type:

DataTuple

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property load_discrete_only: bool#

Whether to only load discrete features.

map_to_binary: ClassVar[bool] = False#

If True, convert labels from {-1, 1} to {0, 1}.

property name: str#

Name of the dataset.

property sens_attrs: list[str]#

Get the list of sensitive attributes.

property unfiltered_disc_feat_groups: Dict[str, List[str]]#

Discrete feature groups, including features for the labels.

available_tabular()#

List of tabular dataset names.

Return type:

list[str]

create_data_obj(filepath, s_column, y_column, additional_to_drop=None)#

Create a ConfigurableDataset from the given file.

Parameters:
  • filepath (Path) – path to a CSV file

  • s_column (str) – column that represents sensitive attributes

  • y_column (str) – column that contains labels

  • additional_to_drop (list[str] | None) – other columns that should be dropped (Default: None)

Returns:

Dataset object

Return type:

ConfigurableDataset

filter_features_by_prefixes(features, prefixes)#

Filter the features by prefixes.

Parameters:
  • features (Sequence[str]) – list of feature names

  • prefixes (Sequence[str]) – list of prefixes

Returns:

filtered feature names

Return type:

list[str]

flatten_dict(dictionary)#

Flatten a dictionary of lists by joining all lists to one big list.

Parameters:

dictionary (Mapping[str, list[str]] | None) –

Return type:

list[str]
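
A minimal sketch of the flattening described above, using only the standard library. The name `flatten_dict_sketch` is hypothetical, and returning an empty list for None is an assumption based on the documented parameter type, not a confirmed behaviour.

```python
from __future__ import annotations

from itertools import chain
from typing import Mapping


def flatten_dict_sketch(dictionary: Mapping[str, list[str]] | None) -> list[str]:
    """Join all lists of a mapping into one list, in insertion order."""
    if dictionary is None:  # assumption: None is treated as "no groups"
        return []
    return list(chain.from_iterable(dictionary.values()))


groups = {"disc_1": ["a_1", "a_2", "a_3"], "disc_2": ["b_1", "b_2"]}
print(flatten_dict_sketch(groups))  # ['a_1', 'a_2', 'a_3', 'b_1', 'b_2']
```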

from_dummies(data, categorical_cols)#

Convert one-hot encoded columns into categorical columns.

Parameters:
  • data (DataFrame) –

  • categorical_cols (Mapping[str, Sequence[str]]) –

Return type:

DataFrame

get_dataset_obj_by_name(name)#

Given a dataset name, get the corresponding dataset object.

Parameters:

name (str) – Name of the dataset.

Returns:

A callable that can be used to construct the dataset object.

Raises:

NotImplementedError – If the given name does not correspond to a dataset.

Return type:

Callable[[], Dataset]
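
Note that the return value is a callable, so it still has to be called to obtain a dataset object. The sketch below shows that two-step pattern with a hypothetical registry; `ToySketch`, `_REGISTRY` and `lookup_dataset` are invented for illustration and say nothing about how the library stores its datasets.

```python
from __future__ import annotations

from typing import Callable


class ToySketch:
    """Hypothetical dataset class used only for this illustration."""

    name = "Toy"


# Hypothetical registry; the real lookup table is not shown in this reference.
_REGISTRY: dict[str, Callable[[], object]] = {"Toy": ToySketch}


def lookup_dataset(name: str) -> Callable[[], object]:
    """Return a zero-argument constructor for the named dataset."""
    try:
        return _REGISTRY[name]
    except KeyError:
        # Mirrors the documented error type for unknown names.
        raise NotImplementedError(f"No dataset named {name!r}") from None


make_dataset = lookup_dataset("Toy")
dataset = make_dataset()  # the returned callable still has to be called
print(dataset.name)  # Toy
```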

get_discrete_features(all_feats, feats_to_remove, cont_feats)#

Get a list of the discrete features in a dataset.

Parameters:
  • all_feats (list[str]) – List of all features in the dataset.

  • feats_to_remove (list[str]) – List of features that aren’t used.

  • cont_feats (list[str]) – List of continuous features in the dataset.

Returns:

List of features not marked as continuous or to be removed.

Return type:

list[str]
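
The documented result, features that are neither continuous nor marked for removal, amounts to an order-preserving filter. A minimal sketch under that reading follows; `discrete_features_sketch` and the feature names are hypothetical.

```python
from __future__ import annotations


def discrete_features_sketch(
    all_feats: list[str], feats_to_remove: list[str], cont_feats: list[str]
) -> list[str]:
    """Keep features that are neither removed nor continuous, in order."""
    excluded = set(feats_to_remove) | set(cont_feats)
    return [feat for feat in all_feats if feat not in excluded]


all_feats = ["age", "sex_male", "sex_female", "id", "income"]
print(discrete_features_sketch(all_feats, ["id"], ["age", "income"]))
# ['sex_male', 'sex_female']
```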

group_disc_feat_indices(disc_feat_names, prefix_sep='_')#

Group discrete feature names according to the first segment of their name.

Returns a list of their corresponding slices (assumes order is maintained).

Parameters:
  • disc_feat_names (list[str]) – List of discrete feature names.

  • prefix_sep (str) – Separator between the prefix and the rest of the name. (Default: “_”)

Returns:

List of slices.

Return type:

list[slice]
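
One way to picture the grouping is as one slice per run of consecutive names that share a prefix. The sketch below follows that reading; `group_slices_sketch` is a hypothetical name, and treating only consecutive names as a group is an interpretation of "assumes order is maintained".

```python
from __future__ import annotations

from itertools import groupby


def group_slices_sketch(disc_feat_names: list[str], prefix_sep: str = "_") -> list[slice]:
    """Return one slice per run of consecutive names sharing a first segment."""
    slices = []
    start = 0
    runs = groupby(disc_feat_names, key=lambda name: name.split(prefix_sep)[0])
    for _, run in runs:
        stop = start + len(list(run))
        slices.append(slice(start, stop))
        start = stop
    return slices


names = ["race_white", "race_black", "sex_male", "sex_female", "age"]
print(group_slices_sketch(names))
# [slice(0, 2, None), slice(2, 4, None), slice(4, 5, None)]
```

Each slice can then be used to index the matching block of columns, for example `names[slice(2, 4)]` gives the two `sex` columns.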

label_spec_to_feature_list(spec)#

Extract all the feature column names from a dictionary of label specifications.

Parameters:

spec (Mapping[str, LabelGroup]) – Dictionary of label specifications.

Returns:

A flattened list of all the columns occurring in the label specs.

Return type:

list[str]

load_data(dataset)#

Load dataset from its CSV file.

This function only exists for backwards compatibility. Use dataset.load() instead.

Parameters:

dataset (Dataset) – dataset object

Returns:

DataTuple with dataframes of features, labels and sensitive attributes

Return type:

DataTuple

one_hot_encode_and_combine(attributes, label_spec, *, discard_non_one_hot)#

Construct a new label according to the given LabelSpec.

This function is at the heart of the label spec API in EthicML.

Parameters:
  • attributes (DataFrame) – DataFrame containing the attributes.

  • label_spec (Mapping[str, LabelGroup]) – A label spec.

  • discard_non_one_hot (bool) – If True, a mask is returned which masks out all rows which are not properly one-hot (i.e., either all classes are 0 or more than one is 1).

Returns:

A tuple of a Series with the new labels and – if discard_non_one_hot is True – a mask for filtering out the rows that were not properly one-hot.

Return type:

tuple[Series, Series | None]
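
The "properly one-hot" condition used for the mask can be checked row by row: every entry is 0 or 1, and exactly one entry is 1. The sketch below applies that check to plain lists instead of a DataFrame; `one_hot_mask_sketch` is hypothetical, and the convention that True means "keep this row" is an assumption.

```python
from __future__ import annotations


def one_hot_mask_sketch(rows: list[list[int]]) -> list[bool]:
    """True for rows with exactly one 1 and 0 everywhere else."""
    return [
        all(value in (0, 1) for value in row) and sum(row) == 1
        for row in rows
    ]


rows = [[1, 0, 0], [0, 0, 0], [0, 1, 1], [0, 0, 1]]
mask = one_hot_mask_sketch(rows)
print(mask)  # [True, False, False, True]

# Filtering with the mask drops the all-zero row and the row with two 1s.
kept = [row for row, ok in zip(rows, mask) if ok]
print(kept)  # [[1, 0, 0], [0, 0, 1]]
```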

reduce_feature_group(disc_feature_groups, feature_group, to_keep, remaining_feature_name)#

Drop all features in the given feature group except the ones in to_keep.

Parameters:
  • disc_feature_groups (Mapping[str, list[str]]) – Dictionary of feature groups.

  • feature_group (str) – Name of the feature group that will be replaced by to_keep.

  • to_keep (Sequence[str]) – List of features that will be kept in the feature group.

  • remaining_feature_name (str) – Name of the dummy feature that will be used to summarize the removed features.

Returns:

Modified dictionary of feature groups.

Return type:

Dict[str, List[str]]

single_col_spec(col, feature_name=None)#

Create a label spec for the case where the label is defined by a single column.

Parameters:
  • col (str) –

  • feature_name (str | None) –

Return type:

Mapping[str, LabelGroup]

spec_from_binary_cols(label_defs)#

Create label specs for the most common case where columns contain 0s and 1s.

Parameters:

label_defs (Mapping[str, Sequence[str]]) – Mapping of label names to column names.

Returns:

Label specifications.

Return type:

Mapping[str, LabelGroup]

Aliases#

DiscFeatureGroups#

A grouping of discrete features such that the result is one-hot encoded.

LabelSpec#

A label specification consisting of LabelGroup entries with names.