Evaluation (PAN suite)¶
Forensic publications and courts expect more than raw accuracy. tamga ships the standard PAN verification-task metric menu behind one call.
One-call evaluation¶
from tamga.forensic import compute_pan_report
report = compute_pan_report(
    probs=calibrated_probs,      # from CalibratedScorer
    y=ground_truth_labels,
    log_lrs=log_lr_values,       # optional; enables cllr_bits
)
report.to_dict()
# {
# "auc": 0.94, "c_at_1": 0.88, "f05u": 0.87,
# "brier": 0.11, "ece": 0.042, "cllr_bits": 0.31,
# "n_target": 80, "n_nontarget": 120,
# }
The metrics¶
| Metric | Measures | Use for | Range | Reference |
|---|---|---|---|---|
| auc | Ranking quality | Choosing between systems. Higher AUC → the system ranks same-author pairs above different-author pairs more reliably. | 0.5 (random) – 1.0 (perfect) | — |
| c_at_1 | Accuracy with abstention credit | Operational decisions where "don't know" is safer than a wrong answer. | 0 – 1 | Peñas & Rodrigo 2011 |
| f05u | Precision-weighted F with non-answer penalty | PAN-style evaluation. Penalises over-confident wrong answers. | 0 – 1 | Bevendorff et al. PAN 2022 |
| brier | Posterior calibration | Probabilistic output quality. Lower = better-calibrated probabilities. | 0 (perfect) – 1 (worst) | Brier 1950 |
| ece | Expected calibration error | Is predict_proba honest? Bins predictions by confidence; compares claimed vs. actual accuracy. | 0 (perfect) – 1 | — |
| cllr | Log-likelihood-ratio cost | Forensic LR quality. The strict proper scoring rule for evidential output. | 0 (perfect) – ∞ | Brümmer & du Preez 2006 |
| tippett | LR distribution plot | Sanity-checking calibration visually. Cumulative target vs. non-target LR curves should separate. | — | — |
c@1¶
$$ \text{c@1} = \frac{1}{n}\left( n_\text{correct} + n_\text{unanswered} \cdot \frac{n_\text{correct}}{n} \right) $$
where unanswered trials are those with calibrated probability inside
[0.5 − margin, 0.5 + margin]. Margin = 0 (default) reduces c@1 to raw accuracy.
The PAN verification shared task has used c@1 as its primary metric since 2013 because it rewards systems that know when to abstain — directly aligned with the forensic notion of "insufficient evidence".
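A minimal NumPy sketch of the formula above, with toy arrays (the library's c_at_1 function documented below is the reference implementation):

import numpy as np

probs = np.array([0.9, 0.2, 0.52, 0.7, 0.48])   # calibrated P(same-author), toy values
y     = np.array([1,   0,   1,    0,   0])
margin = 0.05                                    # half-width of the abstention band

unanswered = np.abs(probs - 0.5) <= margin                      # inside [0.5 - margin, 0.5 + margin]
answered_correct = ((probs > 0.5).astype(int) == y) & ~unanswered

n = len(y)
n_correct = answered_correct.sum()
c_at_1_value = (n_correct + unanswered.sum() * n_correct / n) / n   # 0.56 for these toy numbers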
C_llr¶
$$ C_\text{llr} = \frac{1}{2}\left[ \frac{1}{|T|}\sum_{i \in T} \log_2\left(1 + \tfrac{1}{\text{LR}_i}\right) + \frac{1}{|N|}\sum_{i \in N} \log_2\left(1 + \text{LR}_i\right) \right] $$
where $T$ = target trials, $N$ = non-target. Interpreted as average information loss (in bits) per trial relative to an optimally-calibrated reference system.
- Prior-only system (every log-LR = 0) → C_llr = 1.0 exactly.
- Perfect, confident system → C_llr ≈ 0.
- Misleading system (wrong sign) → C_llr > 1.
A system with C_llr < 1 beats prior-only. Forensic publications routinely report C_llr alongside AUC because C_llr captures both discrimination and calibration in one scalar.
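The definition can be sanity-checked directly with NumPy (toy log10_lrs and y arrays; cllr in tamga.forensic.metrics is the reference implementation):

import numpy as np

log10_lrs = np.array([1.2, 0.8, -0.3, -1.5, -0.9])   # toy log10 likelihood ratios
y         = np.array([1,   1,    1,    0,    0])      # 1 = target (same-author), 0 = non-target

lr = 10.0 ** log10_lrs
target_term    = np.mean(np.log2(1 + 1 / lr[y == 1]))   # penalises small LRs on target trials
nontarget_term = np.mean(np.log2(1 + lr[y == 0]))       # penalises large LRs on non-target trials
cllr_bits = 0.5 * (target_term + nontarget_term)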
Tippett plots¶
tippett(log_lrs, y) returns per-class cumulative distributions you can plot directly:
import matplotlib.pyplot as plt
from tamga.forensic import tippett
data = tippett(log_lrs, y)
plt.step(data["thresholds"], data["target_cdf"], label="same-author")
plt.step(data["thresholds"], data["nontarget_cdf"], label="different-author")
plt.xlabel("log₁₀(LR) threshold")
plt.ylabel("P(log-LR ≥ threshold | class)")
plt.legend()
A well-discriminating system shows the target CDF accumulating on the right (high log-LRs are predominantly target) and the non-target CDF on the left.
Reference¶
compute_pan_report¶
Use when: you have a labelled batch of verification trials and want every standard
metric in one call — AUC, c@1, F0.5u, Brier, ECE, (optionally) C_llr.
Don't use when: you only need one metric; each metric function is callable directly.
Expect: a PANReport dataclass with every field populated.
tamga.forensic.metrics.compute_pan_report ¶
compute_pan_report(probs: ndarray, y: ndarray, *, log_lrs: ndarray | None = None, ece_bins: int = 10, c_at_1_margin: float = 0.0) -> PANReport
Run the full PAN evaluation suite on one set of trials.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Calibrated probabilities of the target hypothesis. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
| log_lrs | ndarray | log10-LRs for each trial. If provided, cllr_bits is computed. | None |
| ece_bins | int | Number of equal-width probability bins used for ECE. | 10 |
| c_at_1_margin | float | Half-width of the non-decision band around 0.5 used for c@1. | 0.0 |
tamga.forensic.metrics.PANReport dataclass ¶
Bundled PAN-style evaluation summary for a verification system.
All metrics are computed over binary (same-author / different-author) trials. cllr_bits
requires log-LR inputs and so is optional — set when available. Fields match the metric
menu reported in PAN verification-task overviews (Stamatatos et al. 2014 onward).
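A consumption sketch, assuming the dataclass fields mirror the to_dict() keys shown in the one-call example and that cllr_bits is None when no log-LRs were supplied (arrays as in that example):

from tamga.forensic import compute_pan_report

report = compute_pan_report(probs=calibrated_probs, y=ground_truth_labels, log_lrs=log_lr_values)

print(f"AUC {report.auc:.3f}  c@1 {report.c_at_1:.3f}  F0.5u {report.f05u:.3f}")
if report.cllr_bits is not None:          # only populated when log_lrs were passed
    print(f"C_llr {report.cllr_bits:.3f} bits")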
AUC¶
Use when: comparing two verification systems on the same benchmark — AUC is
threshold-independent.
Don't use when: you need an operational decision — AUC says nothing about where to
set the threshold.
Expect: a single number in [0.5, 1]. Does not depend on predicted probabilities
being calibrated.
tamga.forensic.metrics.auc ¶
Area under the ROC curve — computed via the Mann-Whitney U statistic.
Invariant to monotone transformations of scores; 1.0 = perfect ranking, 0.5 = random,
0.0 = perfectly inverse. Ties contribute 0.5 each (standard convention).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| scores | np.ndarray of shape (n,) | Higher scores should indicate the target (y=1) hypothesis. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
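For intuition, a minimal NumPy version of the Mann-Whitney construction described above (pairwise comparison of target vs. non-target scores, ties credited 0.5; a sketch, not the library code):

import numpy as np

def auc_mann_whitney(scores: np.ndarray, y: np.ndarray) -> float:
    pos, neg = scores[y == 1], scores[y == 0]
    wins = (pos[:, None] > neg[None, :]).sum()    # target ranked above non-target
    ties = (pos[:, None] == neg[None, :]).sum()   # ties count half
    return (wins + 0.5 * ties) / (len(pos) * len(neg))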
c@1¶
Use when: your system can abstain ("don't know") and you want to credit that
honestly — accuracy plus a partial-credit bonus for abstention.
Don't use when: your system always outputs a decision; c@1 reduces to accuracy.
Expect: a single number in [0, 1]. Dominates accuracy only when abstention rate > 0.
tamga.forensic.metrics.c_at_1 ¶
c@1 (Peñas & Rodrigo 2011): accuracy with a credit for non-answers.
Canonical formula (Peñas & Rodrigo 2011, eq. 1):
c@1 = (1 / n) * (n_correct + n_unanswered * (n_correct / n))
where n is the total number of trials, n_correct is the number of correct
answered trials, and n_unanswered is the number of abstentions. The bonus for
abstention scales with the overall accuracy of the system (n_correct / n), not with
its accuracy on the answered subset — so a system that abstains frequently but is
also inaccurate on its answers does not get rewarded.
Non-answers are defined as trials whose probability lies within
[0.5 - unanswered_margin, 0.5 + unanswered_margin]. If unanswered_margin = 0,
the default, there are no non-answers and c@1 reduces to raw accuracy.
Forensically principled because it rewards a system that knows when to abstain (vs. forcing a coin-flip on ambiguous evidence). The PAN verification shared task has used c@1 as the primary metric since 2013.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Calibrated probabilities of the target hypothesis, in [0, 1]. Decisions are taken at threshold 0.5. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
| unanswered_margin | float | Half-width of the non-decision band around 0.5. 0.0 = no abstention; common PAN settings use 0.0 or a small value like 0.05. | 0.0 |
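A usage sketch, assuming the import path and keyword shown in this reference:

import numpy as np
from tamga.forensic.metrics import c_at_1

probs = np.array([0.92, 0.18, 0.51, 0.73])   # calibrated P(same-author)
y     = np.array([1,    0,    1,    0])

plain     = c_at_1(probs, y)                              # margin 0.0: reduces to accuracy
with_band = c_at_1(probs, y, unanswered_margin=0.05)      # 0.51 now counts as a non-answer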
F0.5u¶
Use when: you're scoring a PAN-CLEF verification track — it's the official metric
since PAN 2022, precision-weighted and with a non-answer penalty.
Don't use when: you're reporting to a non-PAN audience; it's a specialist metric.
Expect: a single number in [0, 1].
tamga.forensic.metrics.f05u ¶
F0.5-unanswered (Bevendorff et al. PAN 2022) — a precision-weighted F-measure that penalises both wrong answers and (weakly) non-answers.
F0.5u uses the classical F-beta with beta=0.5 (weighting precision over recall), but counts trials falling in the [0.4, 0.6] decision band as non-answers, which are neither true positives nor false positives (they lower recall).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Probabilities of the target hypothesis, in [0, 1]. | required |
| y | np.ndarray of shape (n,) | Binary labels. | required |
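A sketch of one reading of the description above (trials in the [0.4, 0.6] band counted against recall only; the library's f05u is authoritative and may differ in detail):

import numpy as np

def f05u_sketch(probs: np.ndarray, y: np.ndarray) -> float:
    answered = (probs < 0.4) | (probs > 0.6)      # outside the non-answer band
    pred_pos = answered & (probs > 0.5)
    tp = np.sum(pred_pos & (y == 1))
    fp = np.sum(pred_pos & (y == 0))
    precision = tp / max(tp + fp, 1)
    recall = tp / max(np.sum(y == 1), 1)          # unanswered positives still count in the denominator
    beta2 = 0.25                                  # beta = 0.5 weights precision over recall
    return (1 + beta2) * precision * recall / max(beta2 * precision + recall, 1e-12)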
C_llr¶
Use when: you need to quantify how good your LR output is in forensic terms —
this is the strict proper scoring rule the speaker-recognition community settled on.
Don't use when: your scorer outputs accuracy-style probabilities; C_llr expects
log-likelihood ratios.
Expect: a single non-negative number; 0 is perfect; 1 is uninformative (matches a
coin flip).
tamga.forensic.metrics.cllr ¶
Log-likelihood-ratio cost (Brümmer & du Preez 2006).
A proper scoring rule for binary forensic LR output, capturing both calibration and discrimination in a single scalar. Interpreted as the average information loss (in bits) per trial relative to an optimally-calibrated reference system.
C_llr = 0.5 * ( mean_{i: y=1} log2(1 + 1/LR_i) + mean_{i: y=0} log2(1 + LR_i) )
Zero is unattainable in practice; a prior-only (uninformative) system gives C_llr = 1. Values below 1 indicate the system is better than prior-only; values above 1 indicate the system's LRs are actively misleading.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| log_lrs | np.ndarray of shape (n,) | log10 likelihood ratios for each trial. | required |
| y | np.ndarray of shape (n,) | Binary labels: 1 = target (same-author trial), 0 = non-target. | required |
Returns:
| Type | Description |
|---|---|
| float | C_llr cost, in bits. |
ECE¶
Use when: you want to audit probabilistic honesty — ECE bins predictions by
claimed confidence and checks whether actual accuracy matches.
Don't use when: your dev set is small (<200 trials); ECE's bin estimates become
noisy.
Expect: a single number in [0, 1]; 0 is perfect calibration.
tamga.forensic.metrics.ece ¶
Expected Calibration Error with equal-width binning.
ECE = sum_{b=1..B} (|B_b|/n) * |accuracy(B_b) - confidence(B_b)|
Zero indicates perfect calibration (empirical frequency matches predicted probability in every bin). Typical forensic thresholds: ECE < 0.05 is considered well-calibrated.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| probs | np.ndarray of shape (n,) | Predicted probability of the positive class for each trial, in [0, 1]. | required |
| y | np.ndarray of shape (n,) | Binary labels (0 or 1). | required |
| n_bins | int | Number of equal-width probability bins. Default 10. | 10 |
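A compact NumPy rendering of the binning formula above (equal-width bins, with 1.0 assigned to the top bin; a sketch, not the library implementation):

import numpy as np

def ece_sketch(probs: np.ndarray, y: np.ndarray, n_bins: int = 10) -> float:
    bin_idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)   # equal-width bins over [0, 1]
    total = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            confidence = probs[in_bin].mean()   # mean claimed probability in the bin
            accuracy = y[in_bin].mean()         # empirical frequency of the positive class
            total += in_bin.mean() * abs(accuracy - confidence)
    return total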
Brier¶
Use when: you want a proper scoring rule for probabilistic classifiers (not LR
outputs) — classic squared-error between predicted probability and ground truth.
Don't use when: you need a forensic LR-specific metric — use C_llr.
Expect: a single number in [0, 1]; 0 is perfect.
tamga.forensic.metrics.brier ¶
Brier score (mean squared error between predicted probability and binary label).
Zero = perfect probabilistic prediction; 0.25 = uninformed (all probs = 0.5); 1.0 = worst possible (confident-wrong on every trial).
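The definition fits in one line of NumPy (toy arrays; a sketch of the definition above, not the library call):

import numpy as np

probs = np.array([0.9, 0.2, 0.5, 0.7])
y     = np.array([1,   0,   1,   0])
brier_score = np.mean((probs - y) ** 2)   # 0 = perfect, 0.25 = all-0.5 uninformed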
Tippett¶
Use when: you want a visual calibration check — plot target-trial vs. non-target
log-LRs as cumulative distributions.
Don't use when: you need a single-number summary (use C_llr).
Expect: two arrays of cumulative LRs (target and non-target) ready for a matplotlib
plot.
tamga.forensic.metrics.tippett ¶
Tippett-plot data: cumulative proportion of target and non-target trials at or above each threshold.
The classical forensic visualisation: plot both CDFs together on a log-LR x-axis.
- The target CDF should accumulate at high log-LR (right of zero).
- The non-target CDF should accumulate at low log-LR (left of zero).
- Where they cross is an empirical equal-error threshold.
Returns:
| Type | Description |
|---|---|
| dict | Per-class cumulative distributions keyed by thresholds, target_cdf and nontarget_cdf, as used in the plotting example above. |