Datasets#

This module contains everything related to data, such as raw CSV files and data objects.

Tabular datasets
- Adult
- Compas
- Credit
- Crime
- German
- Health
- Non-binary toy
- Stop, question, frisk
- Synthetic
- Toy
Utils
- Loading
- Lookup
- Other
Vision dataset
- CelebA
- Generated faces

Dataset base#

Data structure for all datasets that come with the framework.

class Dataset(name, filename_or_path, features, cont_features, sens_attr_spec, class_label_spec, discrete_only, num_samples, discard_non_one_hot=False, map_to_binary=False, s_prefix=None, class_label_prefix=None, discrete_feature_groups=None)#

Bases: object

Data structure that holds all the information needed to load a given dataset.

Parameters

discard_non_one_hot (bool) – If True, discard entries in s or y that are not correctly one-hot encoded.
map_to_binary (bool) – If True, convert labels from {-1, 1} to {0, 1}.
name (str) –
filename_or_path (Union[str, pathlib.Path]) –
features (Sequence[str]) –
cont_features (Sequence[str]) –
sens_attr_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –
class_label_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –
discrete_only (bool) –
num_samples (int) –
s_prefix (Optional[Sequence[str]]) –
class_label_prefix (Optional[Sequence[str]]) –
discrete_feature_groups (Optional[Dict[str, List[str]]]) –

Return type

None
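The two documented flags can be illustrated with a small standard-library sketch. It is not the framework's implementation: the helper names `is_one_hot`, `discard_non_one_hot` and `map_to_binary` are hypothetical and only mirror the parameter names, and plain tuples stand in for dataframe rows.

```python
from typing import List, Sequence, Tuple


def is_one_hot(row: Sequence[int]) -> bool:
    """Return True if the row holds only 0s and 1s and exactly one 1."""
    return all(value in (0, 1) for value in row) and sum(row) == 1


def discard_non_one_hot(rows: List[Tuple[int, ...]]) -> List[Tuple[int, ...]]:
    """Keep only the rows that are valid one-hot encodings (hypothetical helper)."""
    return [row for row in rows if is_one_hot(row)]


def map_to_binary(labels: Sequence[int]) -> List[int]:
    """Map labels from {-1, 1} to {0, 1} (hypothetical helper)."""
    return [(label + 1) // 2 for label in labels]


rows = [(1, 0, 0), (0, 1, 0), (1, 1, 0), (0, 0, 0)]
kept = discard_non_one_hot(rows)  # the last two rows are not one-hot
binary = map_to_binary([-1, 1, 1, -1])
```

Here `kept` retains only the first two rows, and `binary` holds the labels shifted into {0, 1}.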

__len__()#

Number of elements in the dataset.

Return type: int

property class_labels: List[str]#: Get the list of class labels.

property continuous_features: List[str]#: List of features that are continuous.

property disc_feature_groups: Optional[Dict[str, List[str]]]#: Dictionary of feature groups.

property discrete_features: List[str]#: List of features that are discrete.

expand_labels(label, label_type)#

Expand a label in the form of an index into all the subfeatures.

Parameters

label (pandas.DataFrame) –
label_type (Literal['s', 'y']) –

Return type

pandas.DataFrame
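One plausible reading of "expanding an index into subfeatures" is turning an integer index back into one-hot columns. The sketch below shows only that idea for a single label group, using plain dictionaries instead of a pandas.DataFrame; the function `expand_index` and the column names are hypothetical, and the real method may handle several label groups differently.

```python
from typing import Dict, List, Sequence


def expand_index(indices: Sequence[int], columns: Sequence[str]) -> List[Dict[str, int]]:
    """Turn each integer index into a one-hot row over `columns`."""
    expanded = []
    for index in indices:
        if not 0 <= index < len(columns):
            raise ValueError(f"index {index} is out of range for {len(columns)} columns")
        # Exactly one column (the one at position `index`) is set to 1.
        expanded.append({col: int(pos == index) for pos, col in enumerate(columns)})
    return expanded


# Hypothetical sub-feature columns for a single sensitive attribute.
race_columns = ["race_A", "race_B", "race_C"]
rows = expand_index([2, 0], race_columns)
```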

property feature_split: ethicml.data.dataset.FeatureSplit#

Return a feature split dictionary.

This should have separate entries for the features, the labels and the sensitive attributes.

property features_to_remove: List[str]#: Features that have to be removed from x.

property filepath: pathlib.Path#: Filepath from which to load the data.

abstract load(ordered=False, labels_as_features=False)#

Load the dataset.

Parameters

ordered (bool) –
labels_as_features (bool) –

Return type

ethicml.utility.data_structures.DataTuple
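Because load is abstract, a concrete subclass has to supply it. The following is a minimal sketch of that pattern using only the standard library; `SketchDataset` and `InMemoryDataset` are hypothetical stand-ins, and a plain dictionary is returned instead of a DataTuple.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchDataset(ABC):
    """Hypothetical stand-in for an abstract dataset base class."""

    @abstractmethod
    def load(self, ordered: bool = False, labels_as_features: bool = False) -> Dict[str, List[int]]:
        """Subclasses decide where the data comes from."""


class InMemoryDataset(SketchDataset):
    """Hypothetical subclass that serves a tiny, fixed dataset."""

    def load(self, ordered: bool = False, labels_as_features: bool = False) -> Dict[str, List[int]]:
        # The flags are accepted for signature compatibility but ignored here.
        return {"x": [1, 2, 3], "s": [0, 1, 0], "y": [1, 1, 0]}


try:
    SketchDataset()  # a class with an unimplemented abstract method cannot be instantiated
    instantiated = True
except TypeError:
    instantiated = False

data = InMemoryDataset().load()
```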

property name: str#: Name of the dataset.

property ordered_features: ethicml.data.dataset.FeatureSplit#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.
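The "discrete first, then continuous" ordering can be sketched in a few lines. This assumes that every feature not listed as continuous counts as discrete; the function `order_features` and the feature names are hypothetical.

```python
from typing import List, Sequence


def order_features(features: Sequence[str], continuous: Sequence[str]) -> List[str]:
    """Return the features with discrete ones first, then continuous ones."""
    cont = set(continuous)
    # Relative order within each group is preserved.
    discrete_part = [f for f in features if f not in cont]
    continuous_part = [f for f in features if f in cont]
    return discrete_part + continuous_part


features = ["age", "sex_Male", "hours", "race_White"]
ordered = order_features(features, continuous=["age", "hours"])
```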

property sens_attrs: List[str]#: Get the list of sensitive attributes.

class FeatureSplit(_typename, _fields=None, /, **kwargs)#

Bases: dict

A dictionary of the list of columns that belong to the feature groups.
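As an illustration of such a dictionary, the sketch below declares a typed dictionary with one list of column names per group. The class name `FeatureSplitSketch`, the keys `x`, `s` and `y`, and the column names are assumptions made for the example, chosen to match the x, s and y naming used elsewhere on this page.

```python
from typing import List, TypedDict


class FeatureSplitSketch(TypedDict):
    """Hypothetical mirror of a feature-split dictionary."""

    x: List[str]  # feature columns
    s: List[str]  # sensitive-attribute columns
    y: List[str]  # class-label columns


split: FeatureSplitSketch = {
    "x": ["age", "hours"],
    "s": ["sex_Male"],
    "y": ["salary_>50K"],
}

# At runtime a TypedDict instance is an ordinary dict.
all_columns = split["x"] + split["s"] + split["y"]
```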

__len__()#: Return len(self).

clear() → None#: Remove all items from D.

copy() → a shallow copy of D#

fromkeys(iterable, value=None, /)#: Create a new dictionary with keys from iterable and values set to value.

get(key, default=None, /)#: Return the value for key if key is in the dictionary, else default.

items() → a set-like object providing a view on D's items#

keys() → a set-like object providing a view on D's keys#

pop(k[, d]) → v#: Remove the specified key and return the corresponding value. If the key is not found, return the default if given; otherwise, raise a KeyError.

popitem()#

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
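The LIFO behaviour is standard dict behaviour and can be checked directly. In this sketch a plain dict stands in for a FeatureSplit, and the keys and values are illustrative.

```python
# A plain dict stands in for a FeatureSplit; keys and values are illustrative.
split = {"x": ["age", "hours"], "s": ["sex"], "y": ["salary"]}

# popitem removes the most recently inserted pair first.
last_key, last_value = split.popitem()

# Drain the remaining pairs, then show that an empty dict raises KeyError.
split.popitem()
split.popitem()
try:
    split.popitem()
    raised = False
except KeyError:
    raised = True
```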

setdefault(key, default=None, /)#

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.

update([E, ]**F) → None#: Update D from dict/iterable E and F. If E is present and has a .keys() method, this does: for k in E: D[k] = E[k]. If E is present and lacks a .keys() method, this does: for k, v in E: D[k] = v. In either case, this is followed by: for k in F: D[k] = F[k].
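The three cases of update can be seen in a short example. As before, a plain dict stands in for a FeatureSplit and the values are illustrative.

```python
split = {"x": ["age"]}

# E is a mapping (it has .keys()): each key/value pair is copied over.
split.update({"s": ["sex"]})

# E is an iterable of (key, value) pairs (it has no .keys()).
split.update([("y", ["salary"])])

# Keyword arguments (F) are applied after E, so they win on conflicts.
split.update({"x": ["from_mapping"]}, x=["from_keyword"])
```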

values() → an object providing a view on D's values#

class LoadableDataset(name, filename_or_path, features, cont_features, sens_attr_spec, class_label_spec, discrete_only, num_samples, discard_non_one_hot=False, map_to_binary=False, s_prefix=None, class_label_prefix=None, discrete_feature_groups=None)#

Bases: ethicml.data.dataset.Dataset

Dataset that uses the default load function.

Parameters

name (str) –
filename_or_path (Union[str, pathlib.Path]) –
features (Sequence[str]) –
cont_features (Sequence[str]) –
sens_attr_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –
class_label_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –
discrete_only (bool) –
num_samples (int) –
discard_non_one_hot (bool) –
map_to_binary (bool) –
s_prefix (Optional[Sequence[str]]) –
class_label_prefix (Optional[Sequence[str]]) –
discrete_feature_groups (Optional[Dict[str, List[str]]]) –

Return type

None

__len__()#

Number of elements in the dataset.

Return type: int

property class_labels: List[str]#: Get the list of class labels.

property continuous_features: List[str]#: List of features that are continuous.

property disc_feature_groups: Optional[Dict[str, List[str]]]#: Dictionary of feature groups.

property discrete_features: List[str]#: List of features that are discrete.

expand_labels(label, label_type)#

Expand a label in the form of an index into all the subfeatures.

Parameters

label (pandas.DataFrame) –
label_type (Literal['s', 'y']) –

Return type

pandas.DataFrame

property feature_split: ethicml.data.dataset.FeatureSplit#

Return a feature split dictionary.

This should have separate entries for the features, the labels and the sensitive attributes.

property features_to_remove: List[str]#: Features that have to be removed from x.

property filepath: pathlib.Path#: Filepath from which to load the data.

load(ordered=False, labels_as_features=False)#

Load dataset from its CSV file.

Parameters

ordered (bool) – If True, return the features ordered so that the discrete ones come first, then the continuous ones.
labels_as_features (bool) – If True, the s and y labels are included in the x features.

Returns

DataTuple with dataframes of features, labels and sensitive attributes

Return type

ethicml.utility.data_structures.DataTuple
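To make the effect of the two flags concrete, here is a rough standard-library sketch of loading a CSV and splitting it into x, s and y. It is not the library's load function: `TupleSketch` and `load_sketch` are hypothetical, rows are plain dictionaries of strings instead of dataframes, the `features` argument is assumed to exclude the label columns, and appending the labels at the end of x is an arbitrary choice.

```python
import csv
import io
from typing import Dict, List, NamedTuple, Sequence

Row = Dict[str, str]


class TupleSketch(NamedTuple):
    """Hypothetical stand-in for a DataTuple: features, sensitive attrs, labels."""

    x: List[Row]
    s: List[Row]
    y: List[Row]


def load_sketch(
    csv_text: str,
    features: Sequence[str],
    cont_features: Sequence[str],
    sens_attrs: Sequence[str],
    class_labels: Sequence[str],
    ordered: bool = False,
    labels_as_features: bool = False,
) -> TupleSketch:
    x_cols = list(features)
    if ordered:
        # Discrete columns first, then continuous ones.
        cont = set(cont_features)
        x_cols = [f for f in x_cols if f not in cont] + [f for f in x_cols if f in cont]
    if labels_as_features:
        # Assumption: the s and y columns are appended to the end of x.
        x_cols = x_cols + list(sens_attrs) + list(class_labels)

    rows = list(csv.DictReader(io.StringIO(csv_text)))

    def select(cols: Sequence[str]) -> List[Row]:
        return [{col: row[col] for col in cols} for row in rows]

    return TupleSketch(x=select(x_cols), s=select(sens_attrs), y=select(class_labels))


csv_text = "age,married,hours,sex,label\n39,1,40,1,0\n50,0,13,0,1\n"
spec = dict(
    features=["age", "married", "hours"],
    cont_features=["age", "hours"],
    sens_attrs=["sex"],
    class_labels=["label"],
)
plain = load_sketch(csv_text, **spec)
reordered = load_sketch(csv_text, ordered=True, **spec)
with_labels = load_sketch(csv_text, labels_as_features=True, **spec)
```

With `ordered=True` the discrete `married` column moves ahead of `age` and `hours`; with `labels_as_features=True` the `sex` and `label` columns also appear in x.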

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property name: str#: Name of the dataset.

property ordered_features: ethicml.data.dataset.FeatureSplit#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

property sens_attrs: List[str]#: Get the list of sensitive attributes.