ethicml.metrics#

Module for metrics which can be applied to prediction results.

Example usage:

from ethicml.metrics import Accuracy, TPR, run_metrics

# predictions is a Prediction object; test_data holds the corresponding ground-truth labels
run_metrics(predictions, test_data, metrics=[Accuracy(), TPR()])

Classes:

AS

Anti-spurious metric.

AbsCV

Absolute value of Calder-Verwer.

Accuracy

Classification accuracy.

AverageOddsDiff

Average Odds Difference.

BCR

Balanced Classification Rate.

BalancedAccuracy

Accuracy that is balanced with respect to the class labels.

CV

Calder-Verwer.

CfmMetric

Confusion Matrix based metric.

DependencyTarget

The variable that is compared to the predictions in order to check how similar they are.

F1

F1 score: harmonic mean of precision and recall.

FNR

False negative rate.

FPR

False positive rate.

Hsic

HSIC (Hilbert-Schmidt Independence Criterion), averaged over subsets of the data.

Metric

Base class for all metrics.

MetricStaticName

Metric base class for metrics whose name does not depend on instance variables.

NMI

Normalized Mutual Information.

NPV

Negative predictive value.

PPV

Positive predictive value.

PerSens

Aggregation methods for metrics that are computed per sensitive attributes.

ProbNeg

Probability of negative prediction.

ProbOutcome

Mean of logits.

ProbPos

Probability of positive prediction.

RenyiCorrelation

Rényi correlation.

RobustAccuracy

Minimum classification accuracy across sensitive-attribute (S) groups.

SklearnMetric

Wrapper around an sklearn metric.

TNR

True negative rate.

TPR

True positive rate.

Theil

Theil Index.

Yanovich

Yanovich Metric.

Exceptions:

LabelOutOfBoundsError

Raised when a label value lies outside the set of expected labels.

MetricNotApplicableError

Metric not applicable per sensitive attribute; apply it to the whole dataset instead.

Functions:

aggregate_over_sens

Aggregate metrics over sensitive attributes.

diff_per_sens

Compute the difference in the metrics per sensitive attribute.

max_per_sens

Compute the maximum value of the metrics per sensitive attribute.

metric_per_sens

Compute a metric repeatedly on subsets of the data that share a sensitive attribute.

min_per_sens

Compute the minimum value of the metrics per sensitive attribute.

per_sens_metrics_check

Check if the given metrics allow application per sensitive attribute.

ratio_per_sens

Compute the ratios in the metrics per sensitive attribute.

run_metrics

Run all the given metrics on the given predictions and return the results.

class AS#

Bases: MetricStaticName

Anti-spurious metric.

Computes \(P(\hat{y}=y|y\neq s)\).

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float
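Not the library's implementation, but a minimal NumPy sketch of the quantity \(P(\hat{y}=y \mid y\neq s)\) on hypothetical toy arrays:

import numpy as np

# hypothetical toy data: hard predictions, true labels, binary sensitive attribute
y_hat = np.array([1, 0, 1, 1, 0])
y = np.array([1, 0, 0, 1, 1])
s = np.array([0, 0, 1, 1, 0])

mask = y != s  # condition on y != s
anti_spurious = float((y_hat[mask] == y[mask]).mean())
print(anti_spurious)  # fraction of correct predictions among samples where the label differs from s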

class AbsCV(pos_class=1, labels=None)#

Bases: CV

Absolute value of Calder-Verwer.

Taking the absolute value of the Calder-Verwer score makes results easier to compare.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Accuracy#

Bases: SklearnMetric

Classification accuracy.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class AverageOddsDiff(pos_class=1, labels=None)#

Bases: CfmMetric

Average Odds Difference.

\(\tfrac{1}{2}\left[(FPR_{s=0} - FPR_{s=1}) + (TPR_{s=0} - TPR_{s=1})\right]\).

A value of 0 indicates equality of odds.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float
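Not the library's implementation, but a minimal NumPy sketch that instantiates the formula above on hypothetical toy arrays:

import numpy as np

# hypothetical toy data: hard predictions, true labels, binary sensitive attribute
y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y = np.array([1, 0, 0, 1, 1, 1, 0, 1])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def tpr(pred, true):
    # fraction of actual positives predicted positive
    return float((pred[true == 1] == 1).mean())

def fpr(pred, true):
    # fraction of actual negatives predicted positive
    return float((pred[true == 0] == 1).mean())

g0, g1 = s == 0, s == 1
average_odds_diff = 0.5 * (
    (fpr(y_hat[g0], y[g0]) - fpr(y_hat[g1], y[g1]))
    + (tpr(y_hat[g0], y[g0]) - tpr(y_hat[g1], y[g1]))
)
print(average_odds_diff)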

class BCR(pos_class=1, labels=None)#

Bases: CfmMetric

Balanced Classification Rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class BalancedAccuracy(pos_class=1, labels=None)#

Bases: CfmMetric

Accuracy that is balanced with respect to the class labels.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class CV(pos_class=1, labels=None)#

Bases: CfmMetric

Calder-Verwer.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class CfmMetric(pos_class=1, labels=None)#

Bases: MetricStaticName, ABC

Confusion Matrix based metric.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

abstract score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class DependencyTarget(value)#

Bases: StrEnum

The variable that is compared to the predictions in order to check how similar they are.

class F1#

Bases: SklearnMetric

F1 score: harmonic mean of precision and recall.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class FNR(pos_class=1, labels=None)#

Bases: CfmMetric

False negative rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class FPR(pos_class=1, labels=None)#

Bases: CfmMetric

False positive rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Hsic(seed=888)#

Bases: MetricStaticName

HSIC (Hilbert-Schmidt Independence Criterion), averaged over subsets of the data.

The score is averaged because computing HSIC on a larger dataset in one pass is prohibitively expensive in memory and time.

Parameters:

seed (int) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

exception LabelOutOfBoundsError#

Bases: Exception

Raised when a label value lies outside the set of expected labels.

with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class Metric#

Bases: ABC

Base class for all metrics.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

abstract get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

abstract score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

exception MetricNotApplicableError#

Bases: Exception

Metric not applicable per sensitive attribute; apply it to the whole dataset instead.

with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class MetricStaticName#

Bases: Metric, ABC

Metric base class for metrics whose name does not depend on instance variables.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

abstract score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class NMI(base=DependencyTarget.s)#

Bases: _DependenceMeasure

Normalized Mutual Information.

Also called V-Measure. Defined in this paper: https://www.aclweb.org/anthology/D07-1043.pdf

Parameters:

base (DependencyTarget) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float
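A hedged usage sketch, reusing the predictions and test_data placeholders from the module-level example; the base argument selects which variable the predictions are compared against (the sensitive attribute s by default):

from ethicml.metrics import NMI, DependencyTarget, run_metrics

# normalized mutual information between the predictions and the sensitive attribute
run_metrics(predictions, test_data, metrics=[NMI(base=DependencyTarget.s)])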

class NPV(pos_class=1, labels=None)#

Bases: CfmMetric

Negative predictive value.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class PPV(pos_class=1, labels=None)#

Bases: CfmMetric

Positive predictive value.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class PerSens(value)#

Bases: Enum

Aggregation methods for metrics that are computed per sensitive attributes.

ALL: ClassVar[frozenset[PerSens]] = frozenset({PerSens.DIFFS, PerSens.MAX, PerSens.MIN, PerSens.RATIOS})#

All aggregations.

DIFFS = (<function diff_per_sens>,)#

Differences of the per-group results.

DIFFS_RATIOS: ClassVar[frozenset[PerSens]] = frozenset({PerSens.DIFFS, PerSens.RATIOS})#

Equivalent to {DIFFS, RATIOS}.

MAX = (<function max_per_sens>,)#

Maximum of the per-group results.

MIN = (<function min_per_sens>,)#

Minimum of the per-group results.

MIN_MAX: ClassVar[frozenset[PerSens]] = frozenset({PerSens.MAX, PerSens.MIN})#

Equivalent to {MIN, MAX}.

RATIOS = (<function ratio_per_sens>,)#

Ratios of the per-group results.
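A hedged sketch of selecting aggregations through this enum when calling run_metrics (documented at the end of this page), again reusing the predictions and test_data placeholders from the module-level example:

from ethicml.metrics import PerSens, ProbPos, TPR, run_metrics

# report only the min and max of each per-group metric, instead of the default diffs and ratios
run_metrics(
    predictions,
    test_data,
    per_sens_metrics=[ProbPos(), TPR()],
    aggregation=PerSens.MIN_MAX,
)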

class ProbNeg(pos_class=1, labels=None)#

Bases: CfmMetric

Probability of negative prediction.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class ProbOutcome(pos_class=1)#

Bases: MetricStaticName

Mean of logits.

Parameters:

pos_class (int) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class ProbPos(pos_class=1, labels=None)#

Bases: CfmMetric

Probability of positive prediction.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class RenyiCorrelation(base=DependencyTarget.s)#

Bases: _DependenceMeasure

Rényi correlation. Measures how dependent two random variables are.

As defined in this paper: https://link.springer.com/content/pdf/10.1007/BF02024507.pdf, titled “On Measures of Dependence” by Alfréd Rényi.

Parameters:

base (DependencyTarget) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class RobustAccuracy#

Bases: SklearnMetric

Minimum classification accuracy across sensitive-attribute (S) groups.

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class SklearnMetric#

Bases: MetricStaticName, ABC

Wrapper around an sklearn metric.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class TNR(pos_class=1, labels=None)#

Bases: CfmMetric

True negative rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class TPR(pos_class=1, labels=None)#

Bases: CfmMetric

True positive rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Theil#

Bases: MetricStaticName

Theil Index.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Yanovich(base=DependencyTarget.s)#

Bases: _DependenceMeasure

Yanovich Metric. Measures how dependent two random variables are.

As defined in this paper: https://arxiv.org/abs/1008.0492

Parameters:

base (DependencyTarget) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

aggregate_over_sens(per_sens_res, aggregator, infix, prefix='', suffix='')#

Aggregate metrics over sensitive attributes.

Parameters:
  • per_sens_res (Mapping[str, float]) – Dictionary of the results.

  • aggregator (Callable[[float, float], float]) – A callable that is used to aggregate results.

  • infix (str) – A string that will be displayed between the sensitive attributes in the final metric name.

  • prefix (str) – A prefix for the final metric name.

  • suffix (str) – A suffix for the final metric name.

Returns:

Dictionary of the aggregated results.

Return type:

dict[str, float]
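A hedged example with a hypothetical per-group result dictionary (real keys depend on the metric and on the sensitive attribute's name and values):

from ethicml.metrics import aggregate_over_sens

per_sens_res = {"Accuracy_sex_0": 0.80, "Accuracy_sex_1": 0.65}  # hypothetical keys
# combine per-group results with an absolute difference (the aggregator takes two floats);
# infix/prefix/suffix only shape the name of the resulting entry
aggregated = aggregate_over_sens(
    per_sens_res,
    aggregator=lambda a, b: abs(a - b),
    infix="-",
    suffix=" abs_diff",
)
print(aggregated)  # dict[str, float]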

diff_per_sens(per_sens_res)#

Compute the difference in the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of differences

Return type:

dict[str, float]
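A hedged example with hypothetical keys; in practice the input would typically be the output of metric_per_sens (see below):

from ethicml.metrics import diff_per_sens

per_sens_res = {"TPR_sex_0": 0.72, "TPR_sex_1": 0.55}  # hypothetical keys
print(diff_per_sens(per_sens_res))  # dictionary of differences between the per-group values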

max_per_sens(per_sens_res)#

Compute the maximum value of the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of max values

Return type:

dict[str, float]

metric_per_sens(prediction, actual, metric, *, use_sens_name=True)#

Compute a metric repeatedly on subsets of the data that share a sensitive attribute.

Parameters:
Return type:

dict[str, float]
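A hedged usage sketch, reusing the predictions and test_data placeholders from the module-level example:

from ethicml.metrics import Accuracy, metric_per_sens

# accuracy computed separately on each subset that shares a sensitive-attribute value
per_group = metric_per_sens(predictions, test_data, Accuracy())
for name, value in per_group.items():
    print(name, value)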

min_per_sens(per_sens_res)#

Compute the minimum value of the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of min values

Return type:

dict[str, float]

per_sens_metrics_check(per_sens_metrics)#

Check if the given metrics allow application per sensitive attribute.

Parameters:

per_sens_metrics (Sequence[Metric]) –

Return type:

None

ratio_per_sens(per_sens_res)#

Compute the ratios in the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of ratios

Return type:

dict[str, float]

run_metrics(predictions, actual, metrics=(), per_sens_metrics=(), aggregation=frozenset({PerSens.DIFFS, PerSens.RATIOS}), *, use_sens_name=True)#

Run all the given metrics on the given predictions and return the results.

Parameters:
  • predictions (Prediction) – DataFrame with predictions

  • actual (LabelTuple | DataTuple) – EvalTuple with the labels

  • metrics (Sequence[Metric]) – list of metrics (Default: ())

  • per_sens_metrics (Sequence[Metric]) – list of metrics that are computed per sensitive attribute (Default: ())

  • aggregation (PerSens | Set[PerSens]) – Optionally specify aggregations that are performed on the per-sens metrics. (Default: DIFFS_RATIOS)

  • use_sens_name (bool) – if True, use the name of the sensitive variable in the returned results. If False, refer to the sensitive variable as “S”. (Default: True)

Returns:

A dictionary of all the metric results.

Return type:

dict[str, float]
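A hedged end-to-end sketch combining the pieces above; predictions and test_data are the same placeholders as in the module-level example, and the keys of the returned dictionary depend on the chosen metrics and aggregations:

from ethicml.metrics import Accuracy, PerSens, ProbPos, TPR, run_metrics

results = run_metrics(
    predictions,                          # a Prediction (assumed available)
    test_data,                            # ground truth including the sensitive attribute (assumed available)
    metrics=[Accuracy()],                 # computed on the whole dataset
    per_sens_metrics=[ProbPos(), TPR()],  # computed per sensitive-attribute group
    aggregation=PerSens.ALL,              # report diffs, ratios, min and max of the per-group values
    use_sens_name=False,                  # refer to the sensitive variable simply as "S"
)
print(results)  # dict[str, float]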