
Evaluation (PAN suite)

Forensic publications and courts expect more than raw accuracy. tamga ships the standard PAN verification-task metric menu behind one call.

One-call evaluation

from tamga.forensic import compute_pan_report

report = compute_pan_report(
    probs=calibrated_probs,     # from CalibratedScorer
    y=ground_truth_labels,
    log_lrs=log_lr_values,      # optional; enables cllr_bits
)
report.to_dict()
# {
#   "auc": 0.94, "c_at_1": 0.88, "f05u": 0.87,
#   "brier": 0.11, "ece": 0.042, "cllr_bits": 0.31,
#   "n_target": 80, "n_nontarget": 120,
# }

The metrics

| Metric | Measures | Use for | Range | Reference |
|--------|----------|---------|-------|-----------|
| auc | Ranking quality | Choosing between systems. Higher AUC → the system ranks same-author pairs above different-author pairs more reliably. | 0.5 (random) – 1.0 (perfect) | |
| c_at_1 | Accuracy with abstention credit | Operational decisions where "don't know" is safer than a wrong answer. | 0 – 1 | Peñas & Rodrigo 2011 |
| f05u | Precision-weighted F with non-answer penalty | PAN-style evaluation. Penalises over-confident wrong answers. | 0 – 1 | Bevendorff et al. PAN 2022 |
| brier | Posterior calibration | Probabilistic output quality. Lower = better-calibrated probabilities. | 0 (perfect) – 1 (worst) | Brier 1950 |
| ece | Expected calibration error | Is predict_proba honest? Bins predictions by confidence; compares claimed vs. actual accuracy. | 0 (perfect) – 1 | |
| cllr | Log-likelihood-ratio cost | Forensic LR quality. The strict proper scoring rule for evidential output. | 0 (perfect) – ∞ | Brümmer & du Preez 2006 |
| tippett | LR distribution plot | Sanity-checking calibration visually. Cumulative target vs. non-target LR curves should separate. | | |
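
Every metric in the table is also callable on its own from tamga.forensic.metrics (see the Reference section below). Reusing the variables from the example above:

from tamga.forensic.metrics import auc, c_at_1, ece, cllr

auc(calibrated_probs, ground_truth_labels)      # ranking quality: 0.5 random, 1.0 perfect
c_at_1(calibrated_probs, ground_truth_labels)   # accuracy with abstention credit
ece(calibrated_probs, ground_truth_labels)      # calibration audit
cllr(log_lr_values, ground_truth_labels)        # expects log10-LRs, not probabilities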

c@1

$$ \text{c@1} = \frac{1}{n}\!\left( n_\text{correct} + n_\text{unanswered} \cdot \frac{n_\text{correct}}{n} \right) $$

where unanswered trials are those with calibrated probability inside [0.5 − margin, 0.5 + margin]. Margin = 0 (default) reduces c@1 to raw accuracy.

The PAN verification shared task has used c@1 as its primary metric since 2013 because it rewards systems that know when to abstain — directly aligned with the forensic notion of "insufficient evidence".
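
To make the formula concrete, here is an illustrative NumPy sketch (not the library's implementation; use c_at_1 from the reference below in practice):

import numpy as np

def c_at_1_sketch(probs, y, margin=0.0):
    """Illustrative c@1: accuracy plus partial credit for abstentions."""
    probs, y = np.asarray(probs, float), np.asarray(y, int)
    n = len(y)
    unanswered = np.abs(probs - 0.5) < margin            # non-decision band; margin = 0 → no abstentions
    answered = ~unanswered
    correct = ((probs > 0.5).astype(int) == y) & answered
    n_correct, n_unanswered = correct.sum(), unanswered.sum()
    return (n_correct + n_unanswered * n_correct / n) / n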

C_llr

$$ C_\text{llr} = \frac{1}{2}\!\left[ \frac{1}{|T|}\sum_{i \in T} \log_2\!\left(1 + \tfrac{1}{\text{LR}_i}\right) + \frac{1}{|N|}\sum_{i \in N} \log_2\!\left(1 + \text{LR}_i\right) \right] $$

where $T$ = target trials, $N$ = non-target. Interpreted as average information loss (in bits) per trial relative to an optimally-calibrated reference system.

  • Prior-only system (every log-LR = 0) → C_llr = 1.0 exactly.
  • Perfect, confident system → C_llr ≈ 0.
  • Misleading system (wrong sign) → C_llr > 1.

A system with C_llr < 1 beats prior-only. Forensic publications routinely report C_llr alongside AUC because C_llr captures both discrimination and calibration in one scalar.
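
An illustrative NumPy sketch of the same quantity, taking base-10 log-LRs as input (the library's cllr, documented below, is the supported route):

import numpy as np

def cllr_sketch(log_lrs, y):
    """Illustrative C_llr in bits, from log10 likelihood ratios."""
    lr = 10.0 ** np.asarray(log_lrs, float)
    y = np.asarray(y, int)
    target, nontarget = lr[y == 1], lr[y == 0]
    # Penalise small LRs on target trials and large LRs on non-target trials.
    return 0.5 * (np.mean(np.log2(1 + 1 / target)) + np.mean(np.log2(1 + nontarget)))

cllr_sketch(np.zeros(10), np.repeat([0, 1], 5))   # 1.0: the prior-only baseline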

Tippett plots

tippett(log_lrs, y) returns per-class cumulative distributions you can plot directly:

import matplotlib.pyplot as plt
from tamga.forensic import tippett

data = tippett(log_lrs, y)
plt.step(data["thresholds"], data["target_cdf"], label="same-author")
plt.step(data["thresholds"], data["nontarget_cdf"], label="different-author")
plt.xlabel("log₁₀(LR) threshold")
plt.ylabel("P(log-LR ≥ threshold | class)")
plt.legend()

A well-discriminating system shows the target CDF accumulating on the right (high log-LRs are predominantly target) and the non-target CDF on the left.
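
For intuition, the same arrays can be rebuilt by hand from the raw log-LRs (an illustrative sketch, not the library's implementation):

import numpy as np

def tippett_sketch(log_lrs, y):
    """Per-class proportion of trials with log-LR >= each threshold."""
    log_lrs, y = np.asarray(log_lrs, float), np.asarray(y, int)
    thresholds = np.unique(log_lrs)                      # sorted unique log-LR values
    at_or_above = log_lrs[None, :] >= thresholds[:, None]
    return {
        "thresholds": thresholds,
        "target_cdf": at_or_above[:, y == 1].mean(axis=1),
        "nontarget_cdf": at_or_above[:, y == 0].mean(axis=1),
    }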

Reference

compute_pan_report

Use when: you have a labelled batch of verification trials and want every standard metric in one call — AUC, c@1, F0.5u, Brier, ECE, (optionally) C_llr. Don't use when: you only need one metric; each metric function is callable directly. Expect: a PANReport dataclass with every field populated.

tamga.forensic.metrics.compute_pan_report

compute_pan_report(probs: ndarray, y: ndarray, *, log_lrs: ndarray | None = None, ece_bins: int = 10, c_at_1_margin: float = 0.0) -> PANReport

Run the full PAN evaluation suite on one set of trials.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| probs | np.ndarray of shape (n,) | Calibrated probabilities of the target hypothesis. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
| log_lrs | ndarray | log10-LRs for each trial. If provided, cllr_bits is included in the report. | None |
| ece_bins | int | Number of equal-width probability bins for ece. | 10 |
| c_at_1_margin | float | Half-width of the non-decision band around 0.5 for c_at_1. | 0.0 |

tamga.forensic.metrics.PANReport dataclass

Bundled PAN-style evaluation summary for a verification system.

All metrics are computed over binary (same-author / different-author) trials. cllr_bits requires log-LR inputs and so is optional — set when available. Fields match the metric menu reported in PAN verification-task overviews (Stamatatos et al. 2014 onward).
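
For example, to log the headline numbers from a run (using the to_dict() keys shown in the opening example; pass log_lrs as well if you want cllr_bits):

report = compute_pan_report(probs=calibrated_probs, y=ground_truth_labels)

# cllr_bits is only populated when log_lrs is passed, as noted above.
summary = report.to_dict()
print(f"AUC={summary['auc']:.3f}  c@1={summary['c_at_1']:.3f}  ECE={summary['ece']:.3f}")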

AUC

Use when: comparing two verification systems on the same benchmark; AUC is threshold-independent. Don't use when: you need an operational decision; AUC says nothing about where to set the threshold. Expect: a single number in [0, 1], where 0.5 is chance level and values below 0.5 indicate inverted ranking. Does not depend on predicted probabilities being calibrated.

tamga.forensic.metrics.auc

auc(scores: ndarray, y: ndarray) -> float

Area under the ROC curve — computed via the Mann-Whitney U statistic.

Invariant to monotone transformations of scores; 1.0 = perfect ranking, 0.5 = random, 0.0 = perfectly inverse. Ties contribute 0.5 each (standard convention).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| scores | np.ndarray of shape (n,) | Higher scores should indicate the target (y=1) hypothesis. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
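
For intuition, the Mann-Whitney formulation can be sketched directly in NumPy (illustrative only; call auc(scores, y) in practice):

import numpy as np

def auc_sketch(scores, y):
    """P(random target score > random non-target score), ties counted as 0.5."""
    scores, y = np.asarray(scores, float), np.asarray(y, int)
    pos, neg = scores[y == 1], scores[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))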

c@1

Use when: your system can abstain ("don't know") and you want to credit that honestly — accuracy plus a partial-credit bonus for abstention. Don't use when: your system always outputs a decision; c@1 reduces to accuracy. Expect: a single number in [0, 1]. Dominates accuracy only when abstention rate > 0.

tamga.forensic.metrics.c_at_1

c_at_1(probs: ndarray, y: ndarray, *, unanswered_margin: float = 0.0) -> float

c@1 (Peñas & Rodrigo 2011): accuracy with a credit for non-answers.

Canonical formula (Peñas & Rodrigo 2011, eq. 1):

c@1 = (1 / n) * (n_correct + n_unanswered * (n_correct / n))

where n is the total number of trials, n_correct is the number of correct answered trials, and n_unanswered is the number of abstentions. The bonus for abstention scales with the overall accuracy of the system (n_correct / n), not with its accuracy on the answered subset — so a system that abstains frequently but is also inaccurate on its answers does not get rewarded.

Non-answers are defined as trials whose probability lies within [0.5 - unanswered_margin, 0.5 + unanswered_margin]. If unanswered_margin = 0, the default, there are no non-answers and c@1 reduces to raw accuracy.

Forensically principled because it rewards a system that knows when to abstain (vs. forcing a coin-flip on ambiguous evidence). The PAN verification shared task has used c@1 as the primary metric since 2013.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| probs | np.ndarray of shape (n,) | Calibrated probabilities of the target hypothesis, in [0, 1]. Decisions are taken at threshold 0.5. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
| unanswered_margin | float | Half-width of the non-decision band around 0.5. 0.0 = no abstention; common PAN settings use 0.0 or a small value like 0.05. | 0.0 |

F0.5u

Use when: you're scoring a PAN-CLEF verification track — it's the official metric since PAN 2022, precision-weighted and with a non-answer penalty. Don't use when: you're reporting to a non-PAN audience; it's a specialist metric. Expect: a single number in [0, 1].

tamga.forensic.metrics.f05u

f05u(probs: ndarray, y: ndarray) -> float

F0.5-unanswered (Bevendorff et al. PAN 2022) — a precision-weighted F-measure that penalises both wrong answers and (weakly) non-answers.

F0.5u uses the classical F-beta with beta=0.5 (weighting precision over recall), but counts trials falling in the [0.4, 0.6] decision band as non-answers, which are neither true positives nor false positives (they lower recall).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| probs | np.ndarray of shape (n,) | Probabilities of the target hypothesis, in [0, 1]. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
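
One way to read that definition into code, treated as an illustrative sketch rather than the official PAN scoring script:

import numpy as np

def f05u_sketch(probs, y):
    """Illustrative F0.5-unanswered: beta = 0.5 F-measure, non-answers lower recall."""
    probs, y = np.asarray(probs, float), np.asarray(y, int)
    unanswered = (probs >= 0.4) & (probs <= 0.6)     # the non-decision band
    tp = np.sum((probs > 0.6) & (y == 1))
    fp = np.sum((probs > 0.6) & (y == 0))
    # Misses: answered negatives on target trials plus unanswered target trials.
    fn = np.sum((probs < 0.4) & (y == 1)) + np.sum(unanswered & (y == 1))
    beta2 = 0.25                                     # beta = 0.5 weights precision over recall
    denom = (1 + beta2) * tp + beta2 * fn + fp
    return float((1 + beta2) * tp / denom) if denom else 0.0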

C_llr

Use when: you need to quantify how good your LR output is in forensic terms; this is the strict proper scoring rule the speaker-recognition community settled on. Don't use when: your scorer outputs accuracy-style probabilities; C_llr expects log-likelihood ratios. Expect: a single non-negative number; 0 is perfect; 1 is what a prior-only system (every log-LR = 0) scores.

tamga.forensic.metrics.cllr

cllr(log_lrs: ndarray, y: ndarray) -> float

Log-likelihood-ratio cost (Brümmer & du Preez 2006).

A proper scoring rule for binary forensic LR output, capturing both calibration and discrimination in a single scalar. Interpreted as the average information loss (in bits) per trial relative to an optimally-calibrated reference system.

C_llr = 0.5 * ( mean_{i: y=1} log2(1 + 1/LR_i) + mean_{i: y=0} log2(1 + LR_i) )

Zero is unattainable in practice; a prior-only (uninformative) system gives C_llr = 1. Values below 1 indicate the system is better than prior-only; values above 1 indicate the system's LRs are actively misleading.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| log_lrs | np.ndarray of shape (n,) | log10 likelihood ratios for each trial. | required |
| y | np.ndarray of shape (n,) | Binary labels: 1 = target (same-author trial), 0 = non-target. | required |

Returns:

| Type | Description |
|------|-------------|
| float | C_llr cost, in bits. |

ECE

Use when: you want to audit probabilistic honesty — ECE bins predictions by claimed confidence and checks whether actual accuracy matches. Don't use when: your dev set is small (<200 trials); ECE's bin estimates become noisy. Expect: a single number in [0, 1]; 0 is perfect calibration.

tamga.forensic.metrics.ece

ece(probs: ndarray, y: ndarray, *, n_bins: int = 10) -> float

Expected Calibration Error with equal-width binning.

ECE = sum_{b=1..B} (|B_b|/n) * |accuracy(B_b) - confidence(B_b)|

Zero indicates perfect calibration (empirical frequency matches predicted probability in every bin). Typical forensic thresholds: ECE < 0.05 is considered well-calibrated.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| probs | np.ndarray of shape (n,) | Predicted probability of the positive class for each trial, in [0, 1]. | required |
| y | np.ndarray of shape (n,) | Binary labels (0 or 1). | required |
| n_bins | int | Number of equal-width probability bins. | 10 |
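
The binning scheme is simple enough to sketch directly (illustrative only; call ece(probs, y, n_bins=10) in practice):

import numpy as np

def ece_sketch(probs, y, n_bins=10):
    """Illustrative equal-width-bin ECE: weighted |accuracy - mean confidence| per bin."""
    probs, y = np.asarray(probs, float), np.asarray(y, int)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if in_bin.any():
            acc = y[in_bin].mean()            # empirical frequency of positives in the bin
            conf = probs[in_bin].mean()       # mean claimed probability in the bin
            total += in_bin.mean() * abs(acc - conf)
    return total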

Brier

Use when: you want a proper scoring rule for probabilistic classifiers (not LR outputs) — classic squared-error between predicted probability and ground truth. Don't use when: you need a forensic LR-specific metric — use C_llr. Expect: a single number in [0, 1]; 0 is perfect.

tamga.forensic.metrics.brier

brier(probs: ndarray, y: ndarray) -> float

Brier score (mean squared error between predicted probability and binary label).

Zero = perfect probabilistic prediction; 0.25 = uninformed (all probs = 0.5); 1.0 = worst possible (confident-wrong on every trial).
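
The definition fits in a couple of lines of NumPy, shown here only to make the scale concrete:

import numpy as np

def brier_sketch(probs, y):
    """Illustrative Brier score: mean squared error against the binary label."""
    return np.mean((np.asarray(probs, float) - np.asarray(y, float)) ** 2)

brier_sketch(np.full(4, 0.5), np.array([0, 1, 0, 1]))   # 0.25: the uninformed baseline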

Tippett

Use when: you want a visual calibration check — plot target-trial vs. non-target log-LRs as cumulative distributions. Don't use when: you need a single-number summary (use C_llr). Expect: two arrays of cumulative LRs (target and non-target) ready for a matplotlib plot.

tamga.forensic.metrics.tippett

tippett(log_lrs: ndarray, y: ndarray) -> dict[str, np.ndarray]

Tippett-plot data: cumulative proportion of target and non-target trials at or above each threshold.

The classical forensic visualisation: plot both CDFs together on a log-LR x-axis.

  • The target CDF should accumulate at high log-LR (right of zero).
  • The non-target CDF should accumulate at low log-LR (left of zero).
  • Where they cross is an empirical equal-error threshold.

Returns:

dict with three aligned arrays:

  • thresholds: sorted unique log-LR values.
  • target_cdf: P(log-LR ≥ t | target) at each threshold.
  • nontarget_cdf: P(log-LR ≥ t | non-target) at each threshold.