Datasets#

This module contains data-related items, such as raw CSV files and data objects.

Dataset base#

Data structure for all datasets that come with the framework.

class Dataset(name, filename_or_path, features, cont_features, sens_attr_spec, class_label_spec, discrete_only, num_samples, discard_non_one_hot=False, map_to_binary=False, s_prefix=None, class_label_prefix=None, discrete_feature_groups=None)#

Bases: object

Data structure that holds all the information needed to load a given dataset.

Parameters
  • discard_non_one_hot (bool) – If some entries in s or y are not correctly one-hot encoded, discard those.

  • map_to_binary (bool) – If True, convert labels from {-1, 1} to {0, 1}.

  • name (str) – Name of the dataset.

  • filename_or_path (Union[str, pathlib.Path]) – File name or path from which to load the data.

  • features (Sequence[str]) –

  • cont_features (Sequence[str]) –

  • sens_attr_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –

  • class_label_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –

  • discrete_only (bool) –

  • num_samples (int) –

  • s_prefix (Optional[Sequence[str]]) –

  • class_label_prefix (Optional[Sequence[str]]) –

  • discrete_feature_groups (Optional[Dict[str, List[str]]]) –

Return type

None
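
The two documented flags, discard_non_one_hot and map_to_binary, describe simple clean-up steps. The following is a minimal sketch of what such steps could look like on plain Python data. It is not EthicML's implementation; the helper names and the column names "s_a" and "s_b" are hypothetical.

```python
from typing import Dict, List


def is_one_hot(row: Dict[str, int], columns: List[str]) -> bool:
    """Return True if exactly one of `columns` is 1 and all others are 0."""
    values = [row[c] for c in columns]
    return all(v in (0, 1) for v in values) and sum(values) == 1


def discard_non_one_hot_rows(
    rows: List[Dict[str, int]], columns: List[str]
) -> List[Dict[str, int]]:
    """Keep only the rows whose `columns` form a valid one-hot encoding."""
    return [row for row in rows if is_one_hot(row, columns)]


def map_to_binary(labels: List[int]) -> List[int]:
    """Map labels from {-1, 1} to {0, 1}."""
    return [(label + 1) // 2 for label in labels]


# Hypothetical data: only the first row is a valid one-hot encoding.
rows = [
    {"s_a": 1, "s_b": 0},  # valid
    {"s_a": 1, "s_b": 1},  # invalid: two columns set
    {"s_a": 0, "s_b": 0},  # invalid: no column set
]
kept = discard_non_one_hot_rows(rows, ["s_a", "s_b"])
binary = map_to_binary([-1, 1, 1, -1])
```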

__len__()#

Number of elements in the dataset.

Return type

int

property class_labels: List[str]#

Get the list of class labels.

property continuous_features: List[str]#

List of features that are continuous.

property disc_feature_groups: Optional[Dict[str, List[str]]]#

Dictionary of feature groups.

property discrete_features: List[str]#

List of features that are discrete.

expand_labels(label, label_type)#

Expand a label in the form of an index into all the subfeatures.

Parameters
  • label (pandas.DataFrame) –

  • label_type (Literal['s', 'y']) –

Return type

pandas.DataFrame
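
One plausible reading of "expanding" an index label is turning an integer index back into one 0/1 column per subfeature. The sketch below illustrates that idea on plain lists and dicts rather than on a pandas.DataFrame. The function name and the column names are hypothetical, and the real method may behave differently.

```python
from typing import Dict, List


def expand_index_labels(indices: List[int], columns: List[str]) -> List[Dict[str, int]]:
    """Turn each integer index into a one-hot row over `columns`."""
    return [
        {col: int(position == idx) for position, col in enumerate(columns)}
        for idx in indices
    ]


# Index 0 selects the first subfeature, index 2 the third.
expanded = expand_index_labels([0, 2], ["y_low", "y_mid", "y_high"])
```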

property feature_split: ethicml.data.dataset.FeatureSplit#

Return a feature split dictionary.

This should have separate entries for the features, the labels and the sensitive attributes.

property features_to_remove: List[str]#

Features that have to be removed from x.

property filepath: pathlib.Path#

Filepath from which to load the data.

abstract load(ordered=False, labels_as_features=False)#

Load the dataset.

Parameters
  • ordered (bool) –

  • labels_as_features (bool) –

Return type

ethicml.utility.data_structures.DataTuple
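
Because load is abstract here, a concrete subclass has to provide it. The sketch below shows the general Python pattern with stand-in classes. BaseDataset and ToyDataset are hypothetical, and a plain dict stands in for the DataTuple return type.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class BaseDataset(ABC):
    """Hypothetical stand-in for a base class with an abstract load()."""

    @abstractmethod
    def load(
        self, ordered: bool = False, labels_as_features: bool = False
    ) -> Dict[str, List[str]]:
        """Load the dataset."""


class ToyDataset(BaseDataset):
    """Concrete subclass that supplies the missing load() method."""

    def load(
        self, ordered: bool = False, labels_as_features: bool = False
    ) -> Dict[str, List[str]]:
        return {"x": ["age"], "s": ["sex"], "y": ["label"]}


# The abstract base cannot be instantiated directly.
try:
    BaseDataset()
    instantiated = True
except TypeError:
    instantiated = False

toy = ToyDataset().load()
```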

property name: str#

Name of the dataset.

property ordered_features: ethicml.data.dataset.FeatureSplit#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.
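
The ordering described above (discrete first, then continuous) can be sketched as a stable partition of the feature list. This is an illustration only; the function name and the feature names are hypothetical.

```python
from typing import List


def order_features(features: List[str], continuous: List[str]) -> List[str]:
    """Return `features` with discrete ones first, then continuous ones.

    The relative order within each group is preserved.
    """
    cont = set(continuous)
    discrete_part = [f for f in features if f not in cont]
    continuous_part = [f for f in features if f in cont]
    return discrete_part + continuous_part


ordered = order_features(["age", "job_a", "hours", "job_b"], ["age", "hours"])
```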

property sens_attrs: List[str]#

Get the list of sensitive attributes.

class FeatureSplit(_typename, _fields=None, /, **kwargs)#

Bases: dict

A dictionary of the list of columns that belong to the feature groups.
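
The constructor signature shown for FeatureSplit is consistent with a typing.TypedDict, which is an ordinary dict at runtime. The sketch below assumes three keys named "x", "s" and "y" for the features, sensitive attributes and labels. Those key names and the class name ToyFeatureSplit are assumptions, not confirmed by this page.

```python
from typing import List, TypedDict


class ToyFeatureSplit(TypedDict):
    """Hypothetical feature split: column names grouped by role."""

    x: List[str]  # feature columns
    s: List[str]  # sensitive-attribute columns
    y: List[str]  # class-label columns


split: ToyFeatureSplit = {"x": ["age", "hours"], "s": ["sex"], "y": ["income"]}

# At runtime a TypedDict instance is a plain dict, so dict methods apply.
is_plain_dict = isinstance(split, dict)
all_columns = split["x"] + split["s"] + split["y"]
```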

__len__()#

Return len(self).

clear()#

Remove all items from D. Returns None.

copy()#

Return a shallow copy of D.

fromkeys(iterable, value=None, /)#

Create a new dictionary with keys from iterable and values set to value.
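
A short illustration of the standard dict.fromkeys behaviour, using hypothetical keys. Note that every key shares the same value object, which matters when the value is mutable.

```python
# Every key gets the default value None.
groups = dict.fromkeys(["x", "s", "y"])

# With a mutable value, all keys refer to the same list object.
shared = dict.fromkeys(["x", "s"], [])
shared["x"].append("age")
same_object = shared["x"] is shared["s"]
```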

get(key, default=None, /)#

Return the value for key if key is in the dictionary, else default.

items()#

Return a set-like object providing a view on D's items.

keys()#

Return a set-like object providing a view on D's keys.

pop(k[, d])#

Remove the specified key and return the corresponding value.

If the key is not found, return the default if given; otherwise, raise a KeyError.
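
The three outcomes of pop (key present, key missing with a default, key missing without one) can be seen in a few lines. The dictionary contents are hypothetical.

```python
split = {"x": ["age"], "s": ["sex"]}

removed = split.pop("s")       # key present: value is removed and returned
fallback = split.pop("y", [])  # key missing: the given default is returned

try:
    split.pop("y")             # key missing and no default: KeyError
    raised = False
except KeyError:
    raised = True
```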

popitem()#

Remove and return a (key, value) pair as a 2-tuple.

Pairs are returned in LIFO (last-in, first-out) order. Raises KeyError if the dict is empty.
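
The LIFO order and the empty-dict error can be checked with a small example. The keys and values are hypothetical.

```python
split = {"x": ["age"], "s": ["sex"], "y": ["income"]}

last = split.popitem()    # most recently inserted pair comes out first
second = split.popitem()  # then the one inserted before it

empty = {}
try:
    empty.popitem()
    raised = False
except KeyError:
    raised = True
```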

setdefault(key, default=None, /)#

Insert key with a value of default if key is not in the dictionary.

Return the value for key if key is in the dictionary, else default.
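
A brief illustration of both branches of setdefault, with hypothetical contents: the stored value is returned when the key exists, and the default is inserted and returned when it does not.

```python
split = {"x": ["age"]}

existing = split.setdefault("x", [])  # key present: stored list is returned
created = split.setdefault("s", [])   # key absent: [] is inserted and returned
created.append("sex")                 # mutates the list now stored under "s"
```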

update([E, ]**F)#

Update D from dict/iterable E and F. Returns None.

  • If E is present and has a .keys() method: for k in E: D[k] = E[k]
  • If E is present and lacks a .keys() method: for k, v in E: D[k] = v
  • In either case, this is followed by: for k in F: D[k] = F[k]
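
The cases described above map onto the following calls. The keys and values are hypothetical.

```python
d = {"x": 1}

d.update({"s": 2})         # E has .keys(): copied key by key
d.update([("y", 3)])       # E lacks .keys(): treated as (key, value) pairs
d.update({"x": 10}, s=20)  # E is applied first, then the keyword arguments F
```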

values()#

Return an object providing a view on D's values.

class LoadableDataset(name, filename_or_path, features, cont_features, sens_attr_spec, class_label_spec, discrete_only, num_samples, discard_non_one_hot=False, map_to_binary=False, s_prefix=None, class_label_prefix=None, discrete_feature_groups=None)#

Bases: ethicml.data.dataset.Dataset

Dataset that uses the default load function.

Parameters
  • name (str) –

  • filename_or_path (Union[str, pathlib.Path]) –

  • features (Sequence[str]) –

  • cont_features (Sequence[str]) –

  • sens_attr_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –

  • class_label_spec (Union[str, Mapping[str, ethicml.data.util.LabelGroup]]) –

  • discrete_only (bool) –

  • num_samples (int) –

  • discard_non_one_hot (bool) –

  • map_to_binary (bool) –

  • s_prefix (Optional[Sequence[str]]) –

  • class_label_prefix (Optional[Sequence[str]]) –

  • discrete_feature_groups (Optional[Dict[str, List[str]]]) –

Return type

None

__len__()#

Number of elements in the dataset.

Return type

int

property class_labels: List[str]#

Get the list of class labels.

property continuous_features: List[str]#

List of features that are continuous.

property disc_feature_groups: Optional[Dict[str, List[str]]]#

Dictionary of feature groups.

property discrete_features: List[str]#

List of features that are discrete.

expand_labels(label, label_type)#

Expand a label in the form of an index into all the subfeatures.

Parameters
  • label (pandas.DataFrame) –

  • label_type (Literal['s', 'y']) –

Return type

pandas.DataFrame

property feature_split: ethicml.data.dataset.FeatureSplit#

Return a feature split dictionary.

This should have separate entries for the features, the labels and the sensitive attributes.

property features_to_remove: List[str]#

Features that have to be removed from x.

property filepath: pathlib.Path#

Filepath from which to load the data.

load(ordered=False, labels_as_features=False)#

Load dataset from its CSV file.

Parameters
  • ordered (bool) – If True, return the features with the discrete ones first, then the continuous ones.

  • labels_as_features (bool) – If True, include the s and y labels in the x features.

Returns

DataTuple with dataframes of features, labels and sensitive attributes

Return type

ethicml.utility.data_structures.DataTuple
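
To make the effect of the two flags concrete, here is a minimal standard-library sketch of a CSV loader that splits columns into x, s and y. It is not EthicML's load function. ToyDataTuple, load_csv, the column names, and the choice to append s and y at the end of x are all assumptions made for illustration.

```python
import csv
import io
from typing import Dict, List, NamedTuple

Rows = List[Dict[str, str]]


class ToyDataTuple(NamedTuple):
    """Hypothetical stand-in for DataTuple: rows of x, s and y columns."""

    x: Rows
    s: Rows
    y: Rows


def load_csv(
    text: str,
    x_cols: List[str],
    s_cols: List[str],
    y_cols: List[str],
    cont_cols: List[str],
    ordered: bool = False,
    labels_as_features: bool = False,
) -> ToyDataTuple:
    """Split CSV text into feature, sensitive-attribute and label columns."""
    rows = list(csv.DictReader(io.StringIO(text)))
    x_columns = list(x_cols)
    if ordered:
        # Discrete features first, then continuous ones.
        cont = set(cont_cols)
        x_columns = [c for c in x_columns if c not in cont] + [
            c for c in x_columns if c in cont
        ]
    if labels_as_features:
        # Assumption: s and y columns are appended after the x columns.
        x_columns = x_columns + list(s_cols) + list(y_cols)

    def select(cols: List[str]) -> Rows:
        return [{c: row[c] for c in cols} for row in rows]

    return ToyDataTuple(select(x_columns), select(s_cols), select(y_cols))


CSV_TEXT = "age,job_a,sex,label\n39,1,0,1\n50,0,1,0\n"
data = load_csv(CSV_TEXT, ["age", "job_a"], ["sex"], ["label"], ["age"], ordered=True)
x_column_order = list(data.x[0].keys())
```

Values stay as strings here because csv.DictReader does no type conversion; a real loader would typically parse them into numeric columns.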

load_aif()#

Load the dataset as an AIF360 dataset.

Experimental. Requires the aif360 library.

Ignores the type check as the return type is not yet defined.

property name: str#

Name of the dataset.

property ordered_features: ethicml.data.dataset.FeatureSplit#

Return an ordered features dictionary.

This should have separate entries for the features, the labels and the sensitive attributes, with the x features ordered so that the discrete features come first, followed by the continuous ones.

property sens_attrs: List[str]#

Get the list of sensitive attributes.