Evaluation (PAN suite)¶
Forensic publications and courts expect more than raw accuracy. tamga ships the standard PAN verification-task metric menu behind one call.
One-call evaluation¶
from tamga.forensic import compute_pan_report
report = compute_pan_report(
    probs=calibrated_probs,      # from CalibratedScorer
    y=ground_truth_labels,
    log_lrs=log_lr_values,       # optional; enables cllr_bits
)
report.to_dict()
# {
# "auc": 0.94, "c_at_1": 0.88, "f05u": 0.87,
# "brier": 0.11, "ece": 0.042, "cllr_bits": 0.31,
# "n_target": 80, "n_nontarget": 120,
# }
The metrics¶
| Metric | Measures | Use for | Range | Reference |
|---|---|---|---|---|
| auc | Ranking quality | Choosing between systems. Higher AUC → the system ranks same-author pairs above different-author pairs more reliably. | 0.5 (random) – 1.0 (perfect) | — |
| c_at_1 | Accuracy with abstention credit | Operational decisions where "don't know" is safer than a wrong answer. | 0 – 1 | Peñas & Rodrigo 2011 |
| f05u | Precision-weighted F with non-answer penalty | PAN-style evaluation. Penalises over-confident wrong answers. | 0 – 1 | Bevendorff et al. PAN 2022 |
| brier | Posterior calibration | Probabilistic output quality. Lower = better-calibrated probabilities. | 0 (perfect) – 1 (worst) | Brier 1950 |
| ece | Expected calibration error | Is predict_proba honest? Bins predictions by confidence; compares claimed vs. actual accuracy. | 0 (perfect) – 1 | — |
| cllr | Log-likelihood-ratio cost | Forensic LR quality. The strict proper scoring rule for evidential output. | 0 (perfect) – ∞ | Brümmer & du Preez 2006 |
| tippett | LR distribution plot | Sanity-checking calibration visually. Cumulative target vs. non-target LR curves should separate. | — | — |
c@1¶
$$ \text{c@1} = \frac{1}{n}\left( n_\text{correct} + n_\text{unanswered} \cdot \frac{n_\text{correct}}{n} \right) $$
where unanswered trials are those with calibrated probability inside
[0.5 − margin, 0.5 + margin]. Margin = 0 (default) reduces c@1 to raw accuracy.
The PAN verification shared task has used c@1 as its primary metric since 2013 because it rewards systems that know when to abstain — directly aligned with the forensic notion of "insufficient evidence".
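A minimal NumPy sketch of the formula above, with toy arrays (the library's c_at_1 function documented below is the reference implementation):

import numpy as np

probs = np.array([0.9, 0.2, 0.52, 0.7, 0.48])   # calibrated P(same-author), toy values
y     = np.array([1,   0,   1,    0,   0])
margin = 0.05                                    # half-width of the abstention band

unanswered = np.abs(probs - 0.5) <= margin                      # inside [0.5 - margin, 0.5 + margin]
answered_correct = ((probs > 0.5).astype(int) == y) & ~unanswered

n = len(y)
n_correct = answered_correct.sum()
c_at_1_value = (n_correct + unanswered.sum() * n_correct / n) / n   # 0.56 for these toy numbers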
C_llr¶
$$ C_\text{llr} = \frac{1}{2}\left[ \frac{1}{|T|}\sum_{i \in T} \log_2\left(1 + \tfrac{1}{\text{LR}_i}\right) + \frac{1}{|N|}\sum_{i \in N} \log_2\left(1 + \text{LR}_i\right) \right] $$
where $T$ = target trials, $N$ = non-target. Interpreted as average information loss (in bits) per trial relative to an optimally-calibrated reference system.
- Prior-only system (every log-LR = 0) → C_llr = 1.0 exactly.
- Perfect, confident system → C_llr ≈ 0.
- Misleading system (wrong sign) → C_llr > 1.
A system with C_llr < 1 beats prior-only. Forensic publications routinely report C_llr alongside AUC because C_llr captures both discrimination and calibration in one scalar.
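The definition can be sanity-checked directly with NumPy (toy log10_lrs and y arrays; cllr in tamga.forensic.metrics is the reference implementation):

import numpy as np

log10_lrs = np.array([1.2, 0.8, -0.3, -1.5, -0.9])   # toy log10 likelihood ratios
y         = np.array([1,   1,    1,    0,    0])      # 1 = target (same-author), 0 = non-target

lr = 10.0 ** log10_lrs
target_term    = np.mean(np.log2(1 + 1 / lr[y == 1]))   # penalises small LRs on target trials
nontarget_term = np.mean(np.log2(1 + lr[y == 0]))       # penalises large LRs on non-target trials
cllr_bits = 0.5 * (target_term + nontarget_term)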
Tippett plots¶
tippett(log_lrs, y) returns per-class cumulative distributions you can plot directly:
import matplotlib.pyplot as plt
from tamga.forensic import tippett
data = tippett(log_lrs, y)
plt.step(data["thresholds"], data["target_cdf"], label="same-author")
plt.step(data["thresholds"], data["nontarget_cdf"], label="different-author")
plt.xlabel("log₁₀(LR) threshold")
plt.ylabel("P(log-LR ≥ threshold | class)")
plt.legend()
A well-discriminating system shows the target CDF accumulating on the right (high log-LRs are predominantly target) and the non-target CDF on the left.
Reference¶
compute_pan_report¶
Use when: you have a labelled batch of verification trials and want every standard
metric in one call — AUC, c@1, F0.5u, Brier, ECE, (optionally) C_llr.
Don't use when: you only need one metric; each metric function is callable directly.
Expect: a PANReport dataclass with every field populated.
tamga.forensic.metrics.compute_pan_report ¶
compute_pan_report(probs: ndarray, y: ndarray, *, log_lrs: ndarray | None = None, ece_bins: int = 10, c_at_1_margin: float = 0.0) -> PANReport
Run the full PAN evaluation suite on one set of trials.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Calibrated probabilities of the target hypothesis. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
| log_lrs | ndarray | log10-LRs for each trial. If provided, cllr_bits is computed. | None |
| ece_bins | int | Number of equal-width probability bins used for ECE. | 10 |
| c_at_1_margin | float | Half-width of the non-decision band around 0.5 used for c@1. | 0.0 |
tamga.forensic.metrics.PANReport dataclass ¶
Bundled PAN-style evaluation summary for a verification system.
All metrics are computed over binary (same-author / different-author) trials. cllr_bits
requires log-LR inputs and so is optional — set when available. Fields match the metric
menu reported in PAN verification-task overviews (Stamatatos et al. 2014 onward).
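A consumption sketch, assuming the dataclass fields mirror the to_dict() keys shown in the one-call example and that cllr_bits is None when no log-LRs were supplied (arrays as in that example):

from tamga.forensic import compute_pan_report

report = compute_pan_report(probs=calibrated_probs, y=ground_truth_labels, log_lrs=log_lr_values)

print(f"AUC {report.auc:.3f}  c@1 {report.c_at_1:.3f}  F0.5u {report.f05u:.3f}")
if report.cllr_bits is not None:          # only populated when log_lrs were passed
    print(f"C_llr {report.cllr_bits:.3f} bits")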
AUC¶
Use when: comparing two verification systems on the same benchmark — AUC is
threshold-independent.
Don't use when: you need an operational decision — AUC says nothing about where to
set the threshold.
Expect: a single number in [0.5, 1]. Does not depend on predicted probabilities
being calibrated.
tamga.forensic.metrics.auc ¶
Area under the ROC curve — computed via the Mann-Whitney U statistic.
Invariant to monotone transformations of scores; 1.0 = perfect ranking, 0.5 = random,
0.0 = perfectly inverse. Ties contribute 0.5 each (standard convention).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scores | np.ndarray of shape (n,) | Higher scores should indicate the target (y=1) hypothesis. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
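For intuition, a minimal NumPy version of the Mann-Whitney construction described above (pairwise comparison of target vs. non-target scores, ties credited 0.5; a sketch, not the library code):

import numpy as np

def auc_mann_whitney(scores: np.ndarray, y: np.ndarray) -> float:
    pos, neg = scores[y == 1], scores[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()    # target ranked above non-target
    ties = (pos[:, None] == neg[None, :]).sum()   # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))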
c@1¶
Use when: your system can abstain ("don't know") and you want to credit that
honestly — accuracy plus a partial-credit bonus for abstention.
Don't use when: your system always outputs a decision; c@1 reduces to accuracy.
Expect: a single number in [0, 1]. Dominates accuracy only when abstention rate > 0.
tamga.forensic.metrics.c_at_1 ¶
c@1 (Peñas & Rodrigo 2011): accuracy with a credit for non-answers.
Canonical formula (Peñas & Rodrigo 2011, eq. 1):
c@1 = (1 / n) * (n_correct + n_unanswered * (n_correct / n))
where n is the total number of trials, n_correct is the number of correct
answered trials, and n_unanswered is the number of abstentions. The bonus for
abstention scales with the overall accuracy of the system (n_correct / n), not with
its accuracy on the answered subset — so a system that abstains frequently but is
also inaccurate on its answers does not get rewarded.
Non-answers are defined as trials whose probability lies within
[0.5 - unanswered_margin, 0.5 + unanswered_margin]. If unanswered_margin = 0,
the default, there are no non-answers and c@1 reduces to raw accuracy.
Forensically principled because it rewards a system that knows when to abstain (vs. forcing a coin-flip on ambiguous evidence). The PAN verification shared task has used c@1 as the primary metric since 2013.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Calibrated probabilities of the target hypothesis, in [0, 1]. Decisions are taken at threshold 0.5. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
| unanswered_margin | float | Half-width of the non-decision band around 0.5. 0.0 = no abstention; common PAN settings use 0.0 or a small value like 0.05. | 0.0 |
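A usage sketch, assuming the import path and keyword shown in this reference:

import numpy as np
from tamga.forensic.metrics import c_at_1

probs = np.array([0.92, 0.18, 0.51, 0.73])   # calibrated P(same-author)
y     = np.array([1,    0,    1,    0])

plain     = c_at_1(probs, y)                              # margin 0.0: reduces to accuracy
with_band = c_at_1(probs, y, unanswered_margin=0.05)      # 0.51 now counts as a non-answer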
F0.5u¶
Use when: you're scoring a PAN-CLEF verification track — it's the official metric
since PAN 2022, precision-weighted and with a non-answer penalty.
Don't use when: you're reporting to a non-PAN audience; it's a specialist metric.
Expect: a single number in [0, 1].
tamga.forensic.metrics.f05u ¶
F0.5-unanswered (Bevendorff et al. PAN 2022) — a precision-weighted F-measure that penalises both wrong answers and (weakly) non-answers.
F0.5u uses the classical F-beta with beta=0.5 (weighting precision over recall), but counts trials falling in the [0.4, 0.6] decision band as non-answers, which are neither true positives nor false positives (they lower recall).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Probabilities of the target hypothesis, in [0, 1]. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
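A sketch of one reading of the description above (trials in the [0.4, 0.6] band counted against recall only; the library's f05u is authoritative and may differ in detail):

import numpy as np

def f05u_sketch(probs: np.ndarray, y: np.ndarray) -> float:
    answered = (probs < 0.4) | (probs > 0.6)      # outside the non-answer band
    pred_pos = answered & (probs > 0.5)
    tp = np.sum(pred_pos & (y == 1))
    fp = np.sum(pred_pos & (y == 0))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(np.sum(y == 1), 1)          # unanswered positives still count in the denominator
    beta2 = 0.25                                  # beta = 0.5 weights precision over recall
    return (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)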
C_llr¶
Use when: you need to quantify how good your LR output is in forensic terms —
this is the strict proper scoring rule the speaker-recognition community settled on.
Don't use when: your scorer outputs accuracy-style probabilities; C_llr expects
log-likelihood ratios.
Expect: a single non-negative number; 0 is perfect; 1 is uninformative (matches a
coin flip).
tamga.forensic.metrics.cllr ¶
Log-likelihood-ratio cost (Brümmer & du Preez 2006).
A proper scoring rule for binary forensic LR output, capturing both calibration and discrimination in a single scalar. Interpreted as the average information loss (in bits) per trial relative to an optimally-calibrated reference system.
C_llr = 0.5 * ( mean_{i: y=1} log2(1 + 1/LR_i) + mean_{i: y=0} log2(1 + LR_i) )
Zero is unattainable in practice; a prior-only (uninformative) system gives C_llr = 1. Values below 1 indicate the system is better than prior-only; values above 1 indicate the system's LRs are actively misleading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| log_lrs | np.ndarray of shape (n,) | log10 likelihood ratios for each trial. | required |
| y | np.ndarray of shape (n,) | Binary labels: 1 = target (same-author trial), 0 = non-target. | required |
Returns:
| Type | Description |
|---|---|
| float | C_llr cost, in bits. |
ECE¶
Use when: you want to audit probabilistic honesty — ECE bins predictions by
claimed confidence and checks whether actual accuracy matches.
Don't use when: your dev set is small (<200 trials); ECE's bin estimates become
noisy.
Expect: a single number in [0, 1]; 0 is perfect calibration.
tamga.forensic.metrics.ece ¶
Expected Calibration Error with equal-width binning.
ECE = sum_{b=1..B} (|B_b|/n) * |accuracy(B_b) - confidence(B_b)|
Zero indicates perfect calibration (empirical frequency matches predicted probability in every bin). Typical forensic thresholds: ECE < 0.05 is considered well-calibrated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Predicted probability of the positive class for each trial, in [0, 1]. | required |
| y | np.ndarray of shape (n,) | Binary labels (0 or 1). | required |
| n_bins | int | Number of equal-width probability bins. Default 10. | 10 |
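A compact NumPy rendering of the binning formula above (equal-width bins, with 1.0 assigned to the top bin; a sketch, not the library implementation):

import numpy as np

def ece_sketch(probs: np.ndarray, y: np.ndarray, n_bins: int = 10) -> float:
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)   # equal-width bins over [0, 1]
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            confidence = probs[in_bin].mean()   # mean claimed probability in the bin
            accuracy = y[in_bin].mean()         # empirical frequency of the positive class
            total += in_bin.mean() * abs(accuracy - confidence)
    return total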
Brier¶
Use when: you want a proper scoring rule for probabilistic classifiers (not LR
outputs) — classic squared-error between predicted probability and ground truth.
Don't use when: you need a forensic LR-specific metric — use C_llr.
Expect: a single number in [0, 1]; 0 is perfect.
tamga.forensic.metrics.brier ¶
Brier score (mean squared error between predicted probability and binary label).
Zero = perfect probabilistic prediction; 0.25 = uninformed (all probs = 0.5); 1.0 = worst possible (confident-wrong on every trial).
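The definition fits in one line of NumPy (toy arrays; a sketch of the definition above, not the library call):

import numpy as np

probs = np.array([0.9, 0.2, 0.5, 0.7])
y     = np.array([1,   0,   1,   0])
brier_score = np.mean((probs - y) ** 2)   # 0 = perfect, 0.25 = all-0.5 uninformed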
Tippett¶
Use when: you want a visual calibration check — plot target-trial vs. non-target
log-LRs as cumulative distributions.
Don't use when: you need a single-number summary (use C_llr).
Expect: two arrays of cumulative LRs (target and non-target) ready for a matplotlib
plot.
tamga.forensic.metrics.tippett ¶
Tippett-plot data: cumulative proportion of target and non-target trials at or above each threshold.
The classical forensic visualisation: plot both CDFs together on a log-LR x-axis.
- The target CDF should accumulate at high log-LR (right of zero).
- The non-target CDF should accumulate at low log-LR (left of zero).
- Where they cross is an empirical equal-error threshold.
Returns:
| Type | Description |
|---|---|
| dict | Per-class cumulative distributions keyed by thresholds, target_cdf and nontarget_cdf, as used in the plotting example above. |