fair_forge.data_metrics¶
A collection of metrics to be used on datasets instead of classification results.
Module Attributes

- LabelAndCombinationType: The type for specifying the label (class label or group label or combination).

Functions

- hgr_corr(y, groups): Calculate the Hirschfeld-Gebelein-Rényi correlation between y labels and groups.

Classes

- DataMetric(*args, **kwargs): Protocol for data metrics that can be calculated based on labels and groups.
- DistanceFromUniform(label): Kullback-Leibler distance of the distribution of a label from uniformity.
- LabelProportions(label, agg): Calculate the imbalance in the class labels or the group labels.
- MissingDataProportions(label, agg): Determine the proportion of NaN per class or per group.
- class fair_forge.data_metrics.DataMetric(*args, **kwargs)[source]¶
Bases: Protocol
Protocol for data metrics that can be calculated based on labels and groups.
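Because DataMetric is a Protocol, any callable object with a matching __call__ signature satisfies it without subclassing. A minimal sketch of a conforming custom metric, assuming the protocol requires only the (y, groups) -> float64 call signature shown for the classes below (the MajorityShare name and its behavior are hypothetical, not part of the library):

import numpy as np
import numpy.typing as npt

class MajorityShare:
    """Hypothetical metric: share of samples carrying the most common y label."""

    def __call__(
        self,
        y: npt.NDArray[np.int32],
        groups: npt.NDArray[np.int32],
    ) -> np.float64:
        # The group labels are ignored; only the class labels matter here.
        _, counts = np.unique(y, return_counts=True)
        return np.float64(counts.max() / y.size)

An instance of such a class can then be passed anywhere a DataMetric is expected.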
- class fair_forge.data_metrics.DistanceFromUniform(label: LabelAndCombinationType)[source]¶
Bases: DataMetric
Kullback-Leibler distance of the distribution of a label from uniformity.
It is assumed that all possible label values are present in the data.
Example
>>> y = np.array([0, 0, 0, 1, 1, 1], dtype=np.int32)
>>> groups = np.zeros_like(y, dtype=np.int32)
>>> dist = DistanceFromUniform("y")
>>> dist(y, groups)
np.float64(0.0)
>>> y = np.array([0, 0, 0, 0, 1, 1], dtype=np.int32)
>>> round(dist(y, groups), 3)
np.float64(0.057)
>>> dist = DistanceFromUniform("combination")
>>> y = np.array([-1, -1, -1, -1, 1, 1, 1, 1], dtype=np.int32)
>>> groups = np.array([2, 3, 2, 3, 2, 3, 2, 3], dtype=np.int32)
>>> dist(y, groups)
np.float64(0.0)
- __call__(y: ndarray[tuple[Any, ...], dtype[int32]], groups: ndarray[tuple[Any, ...], dtype[int32]]) → float64[source]¶
Compute the metric based on the provided labels and groups.
- label: LabelAndCombinationType¶
Which label to use for the computation.
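For intuition, the 0.057 in the example is consistent with the Kullback-Leibler divergence KL(P ‖ U) of the empirical label distribution P from the uniform distribution U, using natural logarithms. A standalone sketch of that computation, not necessarily the library's internals:

import numpy as np

def kl_from_uniform(labels: np.ndarray) -> np.float64:
    """KL(P || U): empirical label distribution P against uniform U (natural log)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()  # empirical distribution over observed label values
    u = 1.0 / p.size           # uniform probability for each label value
    return np.float64(np.sum(p * np.log(p / u)))

y = np.array([0, 0, 0, 0, 1, 1], dtype=np.int32)
print(round(kl_from_uniform(y), 3))  # 0.057, matching the example above

Here P = (2/3, 1/3), so KL(P ‖ U) = (2/3)·ln(4/3) + (1/3)·ln(2/3) ≈ 0.057.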
- type fair_forge.data_metrics.LabelAndCombinationType = fair_forge.metrics.LabelType | Literal['combination']¶
The type for specifying the label (class label or group label or combination).
- class fair_forge.data_metrics.LabelProportions(label: LabelAndCombinationType, agg: BinaryAggregation = 'ratio')[source]¶
Bases: DataMetric
Calculate the imbalance in the class labels or the group labels.
The imbalance can be expressed as either a difference or a ratio. In the case of the difference, positive values indicate an imbalance favoring the positive label, while negative values indicate an imbalance favoring the negative label. The range of the imbalance is [-1, 1], where 0 indicates a perfect balance between the two classes.
In the case of the ratio, the values range from 0 to infinity, where 1 indicates a perfect balance between the two classes. If one of the labels is not present at all, the result is positive infinity. In the case of binary labels, the ratio is greater than 1 if the positive label is more common than the negative label, and less than 1 if the negative label is more common than the positive label.
If there are more than two label values, the most common and least common label values are used to calculate the imbalance. In this case, the result is always non-negative for the difference and always greater than or equal to 1 for the ratio.
Example
>>> y = np.array([0, 0, 1, 1, 1], dtype=np.int32)
>>> groups = np.zeros_like(y, dtype=np.int32)
>>> binary_class_imbalance = LabelProportions("y", agg="ratio")
>>> round(binary_class_imbalance(y, groups), 2)
np.float64(1.5)
>>> y = np.array([0, 0, 0, 0, 1], dtype=np.int32)
>>> round(binary_class_imbalance(y, groups), 2)
np.float64(0.25)
>>> # Example with only one type of label
>>> y = np.array([0, 0, 0, 0, 0], dtype=np.int32)
>>> round(binary_class_imbalance(y, groups), 2)
np.float64(inf)
>>> # Example with non-binary labels
>>> y = np.array([0, 1, 1, 2, 2, 2], dtype=np.int32)
>>> round(binary_class_imbalance(y, groups), 2)
np.float64(3.0)
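The ratio aggregation for binary labels can be reproduced by hand. A sketch, assuming the ratio is the count of the positive label over the count of the negative label, with the documented special case when a label value is absent:

import numpy as np

def binary_label_ratio(y: np.ndarray) -> float:
    """Sketch of the 'ratio' aggregation for binary labels {0, 1}."""
    n_pos = int(np.count_nonzero(y == 1))
    n_neg = int(np.count_nonzero(y == 0))
    if n_pos == 0 or n_neg == 0:
        # Only one label value present: positive infinity, as documented above.
        return float("inf")
    return n_pos / n_neg

print(binary_label_ratio(np.array([0, 0, 1, 1, 1], dtype=np.int32)))  # 1.5
print(binary_label_ratio(np.array([0, 0, 0, 0, 1], dtype=np.int32)))  # 0.25

For more than two label values, the same idea applies with the count of the most common label in the numerator and the count of the least common label in the denominator.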
- class fair_forge.data_metrics.MissingDataProportions(label: LabelType, agg: BinaryAggregation)[source]¶
Bases: object
Determine the proportion of NaN per class or per group.
Example
>>> y = np.array([1, 0, 1, 0, 1, 1], dtype=np.int32)
>>> groups = np.array([1, 0, 1, 0, 0, 1], dtype=np.int32)
>>> x = np.array(
...     [[0.2], [np.nan], [-0.3], [1.2], [0.8], [0.9]],
...     dtype=np.float32)
>>> ds = GroupDataset(x, y, groups, "", [], [])
>>> max_nans = MissingDataProportions("y", "max")
>>> max_nans(ds)
np.float64(0.5)
- __call__(dataset: GroupDataset) → Float[source]¶
Compute the metric.
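The 0.5 in the example can be reproduced by hand. A sketch, assuming the per-class proportion is the share of rows containing at least one NaN and that the "max" aggregation takes the largest such share across classes (both are assumptions about the implementation):

import numpy as np

def max_nan_share_per_class(x: np.ndarray, y: np.ndarray) -> np.float64:
    """Largest per-class share of rows containing at least one NaN."""
    shares = [
        np.isnan(x[y == label]).any(axis=1).mean()  # NaN-row share in this class
        for label in np.unique(y)
    ]
    return np.float64(max(shares))

y = np.array([1, 0, 1, 0, 1, 1], dtype=np.int32)
x = np.array([[0.2], [np.nan], [-0.3], [1.2], [0.8], [0.9]], dtype=np.float32)
print(max_nan_share_per_class(x, y))  # 0.5: half the rows of class 0 contain NaN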
- fair_forge.data_metrics.hgr_corr(y: ndarray[tuple[Any, ...], dtype[int32]], groups: ndarray[tuple[Any, ...], dtype[int32]]) → float64[source]¶
Calculate the Hirschfeld-Gebelein-Rényi correlation between y labels and groups.
The result ranges from 0 to 1, where 0 indicates no correlation and 1 indicates perfect correlation. Note that in the sense in which “correlation” is understood here, anti-correlation also counts as correlation. The metric measures the highest correlation that can be achieved by transforming the values of each variable in any possible way.
Example
>>> y = np.array([1, 0, 1, 0, 1, 1], dtype=np.int32)
>>> groups = np.array([1, 0, 1, 0, 0, 1], dtype=np.int32)
>>> round(hgr_corr(y, groups), 3)
np.float64(0.707)
>>> # Example with perfect anti-correlation
>>> y = np.array([1, 0, 1, 0, 1, 0], dtype=np.int32)
>>> groups = np.array([0, 1, 0, 1, 0, 1], dtype=np.int32)
>>> round(hgr_corr(y, groups), 3)
np.float64(1.0)
>>> # Example with random classes and groups
>>> gen = np.random.Generator(np.random.MT19937(42))
>>> y = gen.integers(0, 2, size=1000, dtype=np.int32)
>>> groups = gen.integers(0, 2, size=1000, dtype=np.int32)
>>> round(hgr_corr(y, groups), 3)
np.float64(0.004)
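For two discrete variables, the HGR correlation has a closed form: it is the second-largest singular value of the matrix Q with entries Q[i, j] = P(i, j) / sqrt(P(i)·P(j)), built from the joint and marginal distributions (Witsenhausen's characterization). A sketch that reproduces the first example above; this is one standard construction, not necessarily the library's implementation:

import numpy as np

def hgr_discrete(y: np.ndarray, groups: np.ndarray) -> np.float64:
    """HGR correlation of two discrete variables via singular values."""
    y_vals, y_idx = np.unique(y, return_inverse=True)
    g_vals, g_idx = np.unique(groups, return_inverse=True)
    # Empirical joint distribution P(y, g) from co-occurrence counts.
    joint = np.zeros((y_vals.size, g_vals.size))
    np.add.at(joint, (y_idx, g_idx), 1.0)
    joint /= joint.sum()
    p_y = joint.sum(axis=1)  # marginal distribution of y
    p_g = joint.sum(axis=0)  # marginal distribution of groups
    # Q[i, j] = P(i, j) / sqrt(P(i) * P(j)); its largest singular value is
    # always 1, and the second-largest is the HGR correlation.
    q = joint / np.sqrt(np.outer(p_y, p_g))
    return np.float64(np.linalg.svd(q, compute_uv=False)[1])

y = np.array([1, 0, 1, 0, 1, 1], dtype=np.int32)
groups = np.array([1, 0, 1, 0, 0, 1], dtype=np.int32)
print(round(hgr_discrete(y, groups), 3))  # 0.707, matching the example above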