ethicml.metrics#

Module for metrics which can be applied to prediction results.

Example usage:

from ethicml.metrics import Accuracy, TPR, run_metrics

# predictions is a Prediction object; test_data holds the corresponding ground-truth labels
run_metrics(predictions, test_data, metrics=[Accuracy(), TPR()])

Classes:

AS

Anti-spurious metric.

AbsCV

Absolute value of Calder-Verwer.

Accuracy

Classification accuracy.

AverageOddsDiff

Average Odds Difference.

BCR

Balanced Classification Rate.

BalancedAccuracy

Accuracy that is balanced with respect to the class labels.

CV

Calder-Verwer.

CfmMetric

Confusion Matrix based metric.

DependencyTarget

The variable that is compared to the predictions in order to check how similar they are.

F1

F1 score: harmonic mean of precision and recall.

FNR

False negative rate.

FPR

False positive rate.

Hsic

HSIC (Hilbert-Schmidt Independence Criterion), averaged over subsets of the data.

Metric

Base class for all metrics.

MetricStaticName

Metric base class for metrics whose name does not depend on instance variables.

NMI

Normalized Mutual Information.

NPV

Negative predictive value.

PPV

Positive predictive value.

PerSens

Aggregation methods for metrics that are computed per sensitive attributes.

ProbNeg

Probability of negative prediction.

ProbOutcome

Mean of logits.

ProbPos

Probability of positive prediction.

RenyiCorrelation

Rényi correlation.

RobustAccuracy

Minimum classification accuracy across sensitive-attribute (S) groups.

SklearnMetric

Wrapper around an sklearn metric.

TNR

True negative rate.

TPR

True positive rate.

Theil

Theil Index.

Yanovich

Yanovich Metric.

Exceptions:

LabelOutOfBoundsError

Raised when a label value lies outside the set of expected labels.

MetricNotApplicableError

Metric not applicable per sensitive attribute; apply it to the whole dataset instead.

Functions:

aggregate_over_sens

Aggregate metrics over sensitive attributes.

diff_per_sens

Compute the difference in the metrics per sensitive attribute.

max_per_sens

Compute the maximum value of the metrics per sensitive attribute.

metric_per_sens

Compute a metric repeatedly on subsets of the data that share a sensitive attribute.

min_per_sens

Compute the minimum value of the metrics per sensitive attribute.

per_sens_metrics_check

Check if the given metrics allow application per sensitive attribute.

ratio_per_sens

Compute the ratios in the metrics per sensitive attribute.

run_metrics

Run all the given metrics on the given predictions and return the results.

class AS#

Bases: MetricStaticName

Anti-spurious metric.

Computes \(P(\hat{y}=y|y\neq s)\).

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float
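Not the library's implementation, but a minimal NumPy sketch of the quantity \(P(\hat{y}=y \mid y\neq s)\) on hypothetical toy arrays:

import numpy as np

# hypothetical toy data: hard predictions, true labels, binary sensitive attribute
y_hat = np.array([1, 0, 1, 1, 0])
y = np.array([1, 0, 0, 1, 1])
s = np.array([0, 0, 1, 1, 0])

mask = y != s  # condition on y != s
anti_spurious = float((y_hat[mask] == y[mask]).mean())
print(anti_spurious)  # fraction of correct predictions among samples where the label differs from s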

class AbsCV(pos_class=1, labels=None)#

Bases: CV

Absolute value of Calder-Verwer.

Taking the absolute value of the Calder-Verwer score makes results easier to compare.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Accuracy#

Bases: SklearnMetric

Classification accuracy.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class AverageOddsDiff(pos_class=1, labels=None)#

Bases: CfmMetric

Average Odds Difference.

\(\tfrac{1}{2}\left[(FPR_{s=0} - FPR_{s=1}) + (TPR_{s=0} - TPR_{s=1})\right]\).

A value of 0 indicates equality of odds.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float
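Not the library's implementation, but a minimal NumPy sketch that instantiates the formula above on hypothetical toy arrays:

import numpy as np

# hypothetical toy data: hard predictions, true labels, binary sensitive attribute
y_hat = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y = np.array([1, 0, 0, 1, 1, 1, 0, 1])
s = np.array([0, 0, 0, 0, 1, 1, 1, 1])

def tpr(pred, true):
    # fraction of actual positives predicted positive
    return float((pred[true == 1] == 1).mean())

def fpr(pred, true):
    # fraction of actual negatives predicted positive
    return float((pred[true == 0] == 1).mean())

g0, g1 = s == 0, s == 1
average_odds_diff = 0.5 * (
    (fpr(y_hat[g0], y[g0]) - fpr(y_hat[g1], y[g1]))
    + (tpr(y_hat[g0], y[g0]) - tpr(y_hat[g1], y[g1]))
)
print(average_odds_diff)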

class BCR(pos_class=1, labels=None)#

Bases: CfmMetric

Balanced Classification Rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class BalancedAccuracy(pos_class=1, labels=None)#

Bases: CfmMetric

Accuracy that is balanced with respect to the class labels.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class CV(pos_class=1, labels=None)#

Bases: CfmMetric

Calder-Verwer.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class CfmMetric(pos_class=1, labels=None)#

Bases: MetricStaticName, ABC

Confusion Matrix based metric.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

abstract score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class DependencyTarget(value)#

Bases: StrEnum

The variable that is compared to the predictions in order to check how similar they are.

class F1#

Bases: SklearnMetric

F1 score: harmonic mean of precision and recall.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class FNR(pos_class=1, labels=None)#

Bases: CfmMetric

False negative rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class FPR(pos_class=1, labels=None)#

Bases: CfmMetric

False positive rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Hsic(seed=888)#

Bases: MetricStaticName

HSIC (Hilbert-Schmidt Independence Criterion), averaged over subsets of the data.

The score is averaged because computing HSIC on a larger dataset in one pass is prohibitively expensive in memory and time.

Parameters:

seed (int) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

exception LabelOutOfBoundsError#

Bases: Exception

Raised when a label value lies outside the set of expected labels.

with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class Metric#

Bases: ABC

Base class for all metrics.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

abstract get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

abstract score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

exception MetricNotApplicableError#

Bases: Exception

Metric not applicable per sensitive attribute; apply it to the whole dataset instead.

with_traceback()#

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class MetricStaticName#

Bases: Metric, ABC

Metric base class for metrics whose name does not depend on instance variables.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

abstract score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class NMI(base=DependencyTarget.s)#

Bases: _DependenceMeasure

Normalized Mutual Information.

Also called V-Measure. Defined in this paper: https://www.aclweb.org/anthology/D07-1043.pdf

Parameters:

base (DependencyTarget) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float
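A hedged usage sketch, reusing the predictions and test_data placeholders from the module-level example; the base argument selects which variable the predictions are compared against (the sensitive attribute s by default):

from ethicml.metrics import NMI, DependencyTarget, run_metrics

# normalized mutual information between the predictions and the sensitive attribute
run_metrics(predictions, test_data, metrics=[NMI(base=DependencyTarget.s)])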

class NPV(pos_class=1, labels=None)#

Bases: CfmMetric

Negative predictive value.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class PPV(pos_class=1, labels=None)#

Bases: CfmMetric

Positive predictive value.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class PerSens(value)#

Bases: Enum

Aggregation methods for metrics that are computed per sensitive attributes.

ALL: ClassVar[frozenset[PerSens]] = frozenset({PerSens.DIFFS, PerSens.MAX, PerSens.MIN, PerSens.RATIOS})#

All aggregations.

DIFFS = (<function diff_per_sens>,)#

Differences of the per-group results.

DIFFS_RATIOS: ClassVar[frozenset[PerSens]] = frozenset({PerSens.DIFFS, PerSens.RATIOS})#

Equivalent to {DIFFS, RATIOS}.

MAX = (<function max_per_sens>,)#

Maximum of the per-group results.

MIN = (<function min_per_sens>,)#

Minimum of the per-group results.

MIN_MAX: ClassVar[frozenset[PerSens]] = frozenset({PerSens.MAX, PerSens.MIN})#

Equivalent to {MIN, MAX}.

RATIOS = (<function ratio_per_sens>,)#

Ratios of the per-group results.
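A hedged sketch of selecting aggregations through this enum when calling run_metrics (documented at the end of this page), again reusing the predictions and test_data placeholders from the module-level example:

from ethicml.metrics import PerSens, ProbPos, TPR, run_metrics

# report only the min and max of each per-group metric, instead of the default diffs and ratios
run_metrics(
    predictions,
    test_data,
    per_sens_metrics=[ProbPos(), TPR()],
    aggregation=PerSens.MIN_MAX,
)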

class ProbNeg(pos_class=1, labels=None)#

Bases: CfmMetric

Probability of negative prediction.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class ProbOutcome(pos_class=1)#

Bases: MetricStaticName

Mean of logits.

Parameters:

pos_class (int) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class ProbPos(pos_class=1, labels=None)#

Bases: CfmMetric

Probability of positive prediction.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class RenyiCorrelation(base=DependencyTarget.s)#

Bases: _DependenceMeasure

Rényi correlation. Measures how dependent two random variables are.

As defined in this paper: https://link.springer.com/content/pdf/10.1007/BF02024507.pdf, titled “On Measures of Dependence” by Alfréd Rényi.

Parameters:

base (DependencyTarget) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class RobustAccuracy#

Bases: SklearnMetric

Minimum classification accuracy across sensitive-attribute (S) groups.

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class SklearnMetric#

Bases: MetricStaticName, ABC

Wrapper around an sklearn metric.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class TNR(pos_class=1, labels=None)#

Bases: CfmMetric

True negative rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class TPR(pos_class=1, labels=None)#

Bases: CfmMetric

True positive rate.

Parameters:
  • pos_class (int) –

  • labels (List[int] | None) –

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

labels: List[int] | None = None#

List of possible target values. If None, then this is inferred from the data when run.

property name: str#

Name of the metric.

pos_class: int = 1#

The class to treat as being “positive”.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Theil#

Bases: MetricStaticName

Theil Index.

apply_per_sensitive: ClassVar[bool] = True#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

class Yanovich(base=DependencyTarget.s)#

Bases: _DependenceMeasure

Yanovich Metric. Measures how dependent two random variables are.

As defined in this paper: https://arxiv.org/abs/1008.0492

Parameters:

base (DependencyTarget) –

apply_per_sensitive: ClassVar[bool] = False#

Whether the metric can be applied per sensitive attribute.

get_name()#

Name of the metric.

Return type:

str

property name: str#

Name of the metric.

score(prediction, actual)#

Compute score.

Parameters:
Returns:

the score as a single number

Return type:

float

aggregate_over_sens(per_sens_res, aggregator, infix, prefix='', suffix='')#

Aggregate metrics over sensitive attributes.

Parameters:
  • per_sens_res (Mapping[str, float]) – Dictionary of the results.

  • aggregator (Callable[[float, float], float]) – A callable that is used to aggregate results.

  • infix (str) – A string that will be displayed between the sensitive attributes in the final metric name.

  • prefix (str) – A prefix for the final metric name.

  • suffix (str) – A suffix for the final metric name.

Returns:

Dictionary of the aggregated results.

Return type:

dict[str, float]
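A hedged example with a hypothetical per-group result dictionary (real keys depend on the metric and on the sensitive attribute's name and values):

from ethicml.metrics import aggregate_over_sens

per_sens_res = {"Accuracy_sex_0": 0.80, "Accuracy_sex_1": 0.65}  # hypothetical keys
# combine per-group results with an absolute difference (the aggregator takes two floats);
# infix/prefix/suffix only shape the name of the resulting entry
aggregated = aggregate_over_sens(
    per_sens_res,
    aggregator=lambda a, b: abs(a - b),
    infix="-",
    suffix=" abs_diff",
)
print(aggregated)  # dict[str, float]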

diff_per_sens(per_sens_res)#

Compute the difference in the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of differences

Return type:

dict[str, float]
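A hedged example with hypothetical keys; in practice the input would typically be the output of metric_per_sens (see below):

from ethicml.metrics import diff_per_sens

per_sens_res = {"TPR_sex_0": 0.72, "TPR_sex_1": 0.55}  # hypothetical keys
print(diff_per_sens(per_sens_res))  # dictionary of differences between the per-group values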

max_per_sens(per_sens_res)#

Compute the maximum value of the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of max values

Return type:

dict[str, float]

metric_per_sens(prediction, actual, metric, *, use_sens_name=True)#

Compute a metric repeatedly on subsets of the data that share a sensitive attribute.

Parameters:
Return type:

dict[str, float]
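A hedged usage sketch, reusing the predictions and test_data placeholders from the module-level example:

from ethicml.metrics import Accuracy, metric_per_sens

# accuracy computed separately on each subset that shares a sensitive-attribute value
per_group = metric_per_sens(predictions, test_data, Accuracy())
for name, value in per_group.items():
    print(name, value)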

min_per_sens(per_sens_res)#

Compute the minimum value of the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of min values

Return type:

dict[str, float]

per_sens_metrics_check(per_sens_metrics)#

Check if the given metrics allow application per sensitive attribute.

Parameters:

per_sens_metrics (Sequence[Metric]) –

Return type:

None

ratio_per_sens(per_sens_res)#

Compute the ratios in the metrics per sensitive attribute.

Parameters:

per_sens_res (dict[str, float]) – dictionary of the results

Returns:

dictionary of ratios

Return type:

dict[str, float]

run_metrics(predictions, actual, metrics=(), per_sens_metrics=(), aggregation=frozenset({PerSens.DIFFS, PerSens.RATIOS}), *, use_sens_name=True)#

Run all the given metrics on the given predictions and return the results.

Parameters:
  • predictions (Prediction) – DataFrame with predictions

  • actual (LabelTuple | DataTuple) – EvalTuple with the labels

  • metrics (Sequence[Metric]) – list of metrics (Default: ())

  • per_sens_metrics (Sequence[Metric]) – list of metrics that are computed per sensitive attribute (Default: ())

  • aggregation (PerSens | Set[PerSens]) – Optionally specify aggregations that are performed on the per-sens metrics. (Default: DIFFS_RATIOS)

  • use_sens_name (bool) – if True, use the name of the sensitive variable in the returned results. If False, refer to the sensitive variable as “S”. (Default: True)

Returns:

A dictionary of all the metric results.

Return type:

dict[str, float]
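A hedged end-to-end sketch combining the pieces above; predictions and test_data are the same placeholders as in the module-level example, and the keys of the returned dictionary depend on the chosen metrics and aggregations:

from ethicml.metrics import Accuracy, PerSens, ProbPos, TPR, run_metrics

results = run_metrics(
    predictions,                          # a Prediction (assumed available)
    test_data,                            # ground truth including the sensitive attribute (assumed available)
    metrics=[Accuracy()],                 # computed on the whole dataset
    per_sens_metrics=[ProbPos(), TPR()],  # computed per sensitive-attribute group
    aggregation=PerSens.ALL,              # report diffs, ratios, min and max of the per-group values
    use_sens_name=False,                  # refer to the sensitive variable simply as "S"
)
print(results)  # dict[str, float]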