Evaluation (PAN suite)¶

Forensic publications and courts expect more than raw accuracy. tamga exposes the standard metric menu of the PAN verification task behind a single call.

Single-call evaluation¶

from tamga.forensic import compute_pan_report

report = compute_pan_report(
    probs=calibrated_probs,     # from CalibratedScorer
    y=ground_truth_labels,
    log_lrs=log_lr_values,      # optional; enables cllr_bits
)
report.to_dict()
# {
#   "auc": 0.94, "c_at_1": 0.88, "f05u": 0.87,
#   "brier": 0.11, "ece": 0.042, "cllr_bits": 0.31,
#   "n_target": 80, "n_nontarget": 120,
# }

Metrics¶

Metric	What it measures	Used for	Range	Source
`auc`	Ranking quality	Choosing between systems. A higher AUC means the system ranks same-author pairs above different-author pairs more reliably.	0.5 (random) – 1.0 (perfect)	-
`c_at_1`	Accuracy with credit for abstention	Operational decisions where "I don't know" is safer than a wrong answer.	0 – 1	Peñas & Rodrigo 2011
`f05u`	Precision-weighted F with a non-answer penalty	PAN-style evaluation. Penalises overconfident wrong answers.	0 – 1	Bevendorff et al. PAN 2022
`brier`	Posterior calibration	Quality of probabilistic output. Lower = better-calibrated probabilities.	0 (perfect) – 1 (worst)	Brier 1950
`ece`	Expected calibration error	Is `predict_proba` honest? Bins predictions by confidence and compares claimed with actual accuracy.	0 (perfect) – 1	-
`cllr`	Log-likelihood-ratio cost	Forensic LR quality. A strictly proper scoring rule for evidential output.	0 (perfect) – ∞	Brümmer & du Preez 2006
`tippett`	LR distribution plot	Visual inspection of calibration. The cumulative target and non-target LR curves should separate.	-	-

c@1¶

$$ \text{c@1} = \frac{1}{n}\!\left( n_\text{correct} + n_\text{unanswered} \cdot \frac{n_\text{correct}}{n} \right) $$

Unanswered trials are those whose calibrated probability lies within [0.5 − margin, 0.5 + margin]. With margin = 0 (the default), c@1 reduces to raw accuracy.

The PAN verification shared task has used c@1 as its primary metric since 2013 because it rewards systems that know when to abstain, which maps directly onto the forensic notion of "insufficient evidence".
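The formula and the margin rule can be traced on a handful of trials. The function below is a minimal standalone sketch in plain Python, not the library's implementation; the name `c_at_1_sketch` is hypothetical, and it assumes that a margin of 0 produces no non-answers at all.

```python
def c_at_1_sketch(probs, y, unanswered_margin=0.0):
    """c@1 = (1/n) * (n_correct + n_unanswered * n_correct / n)."""
    n = len(probs)
    n_correct = 0
    n_unanswered = 0
    for p, label in zip(probs, y):
        # Assumption: with margin 0 nothing counts as a non-answer,
        # so c@1 falls back to raw accuracy at threshold 0.5.
        if unanswered_margin > 0 and abs(p - 0.5) <= unanswered_margin:
            n_unanswered += 1
        elif (p > 0.5) == (label == 1):
            n_correct += 1
    return (n_correct + n_unanswered * n_correct / n) / n


# Two borderline trials (0.48 and 0.52) are on the wrong side of 0.5.
probs = [0.9, 0.8, 0.48, 0.52, 0.2, 0.1]
y = [1, 1, 1, 0, 0, 0]

forced = c_at_1_sketch(probs, y)                              # all trials answered
abstaining = c_at_1_sketch(probs, y, unanswered_margin=0.05)  # borderline trials skipped
print(forced, abstaining)
```

In this toy set, leaving the two borderline trials unanswered scores higher than forcing a decision on them, which is the behaviour the metric is meant to reward.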

C_llr¶

$$ C_\text{llr} = \frac{1}{2}\!\left[ \frac{1}{|T|}\!\sum_{i \in T} \log_2\!\left(1 + \tfrac{1}{\text{LR}_i}\right) + \frac{1}{|N|}\!\sum_{i \in N} \log_2\!\left(1 + \text{LR}_i\right) \right] $$

$T$ = target trials, $N$ = non-target trials. C_llr is interpreted as the average information loss per trial (in bits) relative to an optimally calibrated reference system.

Prior-only system (every log-LR = 0) → C_llr = 1.0 exactly.
Perfect, confident system → C_llr ≈ 0.
Misleading system (wrong sign) → C_llr > 1.

A system with C_llr < 1 outperforms the prior-only system. Forensic publications routinely report C_llr alongside AUC because C_llr captures discrimination and calibration in a single scalar.
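The three reference cases can be checked numerically. This is a minimal sketch in plain Python that follows the C_llr formula directly; `cllr_sketch` is a hypothetical name, and the inputs are assumed to be log10 likelihood ratios with labels 1 = target and 0 = non-target.

```python
import math


def cllr_sketch(log_lrs, y):
    """C_llr in bits from log10 likelihood ratios and binary labels."""
    target_costs = []
    nontarget_costs = []
    for log_lr, label in zip(log_lrs, y):
        lr = 10.0 ** log_lr
        if label == 1:
            # Target trials are penalised for small LRs.
            target_costs.append(math.log2(1.0 + 1.0 / lr))
        else:
            # Non-target trials are penalised for large LRs.
            nontarget_costs.append(math.log2(1.0 + lr))
    return 0.5 * (
        sum(target_costs) / len(target_costs)
        + sum(nontarget_costs) / len(nontarget_costs)
    )


y = [1, 1, 0, 0]
prior_only = cllr_sketch([0.0, 0.0, 0.0, 0.0], y)    # LR = 1 on every trial
confident = cllr_sketch([3.0, 2.0, -2.0, -3.0], y)   # correct sign, large magnitude
misleading = cllr_sketch([-2.0, -1.0, 1.0, 2.0], y)  # wrong sign on every trial
print(prior_only, confident, misleading)
```

The sketch raises on an empty class (division by zero), which is a reasonable reminder that C_llr needs both target and non-target trials.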

Tippett plots¶

tippett(log_lrs, y) returns per-class cumulative distributions that you can plot directly:

import matplotlib.pyplot as plt
from tamga.forensic import tippett

data = tippett(log_lrs, y)
plt.step(data["thresholds"], data["target_cdf"], label="same-author")
plt.step(data["thresholds"], data["nontarget_cdf"], label="different-author")
plt.xlabel("log₁₀(LR) threshold")
plt.ylabel("P(log-LR ≥ threshold | class)")
plt.legend()

In a well-discriminating system, the target CDF accumulates on the right (high log-LRs are predominantly target trials) and the non-target CDF accumulates on the left.
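To make the plotted quantities concrete, the following sketch computes the same three arrays in plain Python. It is an illustration rather than the library's code: `tippett_sketch` is a hypothetical name, and only the dictionary keys and the "proportion at or above each threshold" definition are taken from this page.

```python
def tippett_sketch(log_lrs, y):
    """Per-class proportion of trials with log-LR at or above each threshold."""
    thresholds = sorted(set(log_lrs))
    target = [s for s, label in zip(log_lrs, y) if label == 1]
    nontarget = [s for s, label in zip(log_lrs, y) if label == 0]

    def proportion_at_or_above(scores, t):
        return sum(s >= t for s in scores) / len(scores)

    return {
        "thresholds": thresholds,
        "target_cdf": [proportion_at_or_above(target, t) for t in thresholds],
        "nontarget_cdf": [proportion_at_or_above(nontarget, t) for t in thresholds],
    }


data = tippett_sketch([2.0, 1.0, -0.5, 0.5, -1.0, -2.0], [1, 1, 1, 0, 0, 0])
for t, tc, nc in zip(data["thresholds"], data["target_cdf"], data["nontarget_cdf"]):
    print(f"{t:+.1f}  target={tc:.2f}  non-target={nc:.2f}")
```

Both curves start at 1.0 at the lowest threshold and fall as the threshold rises; in this toy set the non-target curve falls first, so the target curve stays at or above it throughout.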

Reference¶

compute_pan_report¶

Use when: you have a labelled batch of verification trials and want every standard metric (AUC, c@1, F0.5u, Brier, ECE and, optionally, C_llr) in a single call. Don't use when: you need only one metric; each metric function can be called directly. Expect: a PANReport dataclass with every field populated (cllr_bits only when log_lrs is supplied).

tamga.forensic.metrics.compute_pan_report ¶

compute_pan_report(probs: ndarray, y: ndarray, *, log_lrs: ndarray | None = None, ece_bins: int = 10, c_at_1_margin: float = 0.0) -> PANReport

Run the full PAN evaluation suite on one set of trials.

Parameters:

Name	Type	Description	Default
`probs`	`np.ndarray of shape (n,)`	Calibrated probabilities of the target hypothesis.	required
`y`	`np.ndarray of shape (n,)`	Binary labels.	required
`log_lrs`	`ndarray`	log10-LRs for each trial. If provided, `cllr_bits` is included in the report.	`None`
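`ece_bins`	`int`	Number of equal-width probability bins used for the ECE computation.	`10`
`c_at_1_margin`	`float`	Half-width of the non-answer band around 0.5 used for c@1.	`0.0`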

tamga.forensic.metrics.PANReport `dataclass` ¶

Bundled PAN-style evaluation summary for a verification system.

All metrics are computed over binary (same-author / different-author) trials. cllr_bits requires log-LR inputs and so is optional — set when available. Fields match the metric menu reported in PAN verification-task overviews (Stamatatos et al. 2014 onward).

AUC¶

Use when: comparing two verification systems on the same benchmark; AUC is threshold-independent. Don't use when: you need to make an operational decision; AUC says nothing about where to set the threshold. Expect: a single number in [0, 1], where 0.5 is random ranking and values below 0.5 indicate an inverted ranking. It does not depend on the predicted probabilities being calibrated.

tamga.forensic.metrics.auc ¶

auc(scores: ndarray, y: ndarray) -> float

Area under the ROC curve — computed via the Mann-Whitney U statistic.

Invariant to monotone transformations of scores; 1.0 = perfect ranking, 0.5 = random, 0.0 = perfectly inverse. Ties contribute 0.5 each (standard convention).

Parameters:

Name	Type	Description	Default
`scores`	`np.ndarray of shape (n,)`	Higher scores should indicate the target (y=1) hypothesis.	required
`y`	`np.ndarray of shape (n,)`	Binary labels.	required
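The Mann-Whitney view of AUC is the fraction of (target, non-target) pairs that the scores order correctly, with ties counted as half. The sketch below spells that out as a quadratic-time pairwise loop in plain Python; `auc_sketch` is a hypothetical name, and a real implementation would normally use ranks instead of nested loops.

```python
def auc_sketch(scores, y):
    """AUC as the share of correctly ordered (target, non-target) pairs."""
    positives = [s for s, label in zip(scores, y) if label == 1]
    negatives = [s for s, label in zip(scores, y) if label == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


y = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

mixed = auc_sketch(scores, y)                        # 3 of 4 pairs ordered correctly
cubed = auc_sketch([s ** 3 for s in scores], y)      # monotone transform of the scores
inverse = auc_sketch([1.0 - s for s in [0.9, 0.8, 0.3, 0.1]], y)
tied = auc_sketch([0.5, 0.5, 0.5, 0.5], y)
print(mixed, cubed, inverse, tied)
```

Because only the ordering of scores matters, cubing them leaves the result unchanged, which is the invariance the docstring describes.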

c@1¶

Use when: your system can abstain ("I don't know") and you want to credit that honestly; c@1 is accuracy plus a partial-credit bonus for abstention. Don't use when: your system always produces a decision; c@1 then reduces to accuracy. Expect: a single number in [0, 1]. It exceeds accuracy only when the abstention rate is > 0.

tamga.forensic.metrics.c_at_1 ¶

c_at_1(probs: ndarray, y: ndarray, *, unanswered_margin: float = 0.0) -> float

c@1 (Peñas & Rodrigo 2011): accuracy with a credit for non-answers.

Canonical formula (Peñas & Rodrigo 2011, eq. 1):

c@1 = (1 / n) * (n_correct + n_unanswered * (n_correct / n))

where n is the total number of trials, n_correct is the number of correct answered trials, and n_unanswered is the number of abstentions. The bonus for abstention scales with the overall accuracy of the system (n_correct / n), not with its accuracy on the answered subset — so a system that abstains frequently but is also inaccurate on its answers does not get rewarded.

Non-answers are defined as trials whose probability lies within [0.5 - unanswered_margin, 0.5 + unanswered_margin]. If unanswered_margin = 0, the default, there are no non-answers and c@1 reduces to raw accuracy.

Forensically principled because it rewards a system that knows when to abstain (vs. forcing a coin-flip on ambiguous evidence). The PAN verification shared task has used c@1 as the primary metric since 2013.

Parameters:

Name	Type	Description	Default
`probs`	`np.ndarray of shape (n,)`	Calibrated probabilities of the target hypothesis, in [0, 1]. Decisions are taken at threshold 0.5.	required
`y`	`np.ndarray of shape (n,)`	Binary labels.	required
`unanswered_margin`	`float`	Half-width of the non-decision band around 0.5. 0.0 = no abstention; common PAN settings use 0.0 or a small value like 0.05.	`0.0`

F0.5u¶

Use when: you are scoring a PAN-CLEF verification track; it has been the official metric since PAN 2022, and it is precision-weighted with a penalty for non-answers. Don't use when: you are reporting to a non-PAN audience; it is a specialist metric. Expect: a single number in [0, 1].

tamga.forensic.metrics.f05u ¶

f05u(probs: ndarray, y: ndarray) -> float

F0.5-unanswered (Bevendorff et al. PAN 2022) — a precision-weighted F-measure that penalises both wrong answers and (weakly) non-answers.

F0.5u uses the classical F-beta with beta=0.5 (weighting precision over recall), but counts trials falling in the [0.4, 0.6] decision band as non-answers, which are neither true positives nor false positives (they lower recall).

Parameters:

Name	Type	Description	Default
`probs`	`np.ndarray of shape (n,)`	Probabilities of the target hypothesis, in [0, 1].	required
`y`	`np.ndarray of shape (n,)`	Binary labels.	required
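One way to read the description above is as an F-beta score in which non-answers are added to the false-negative side of the denominator. The sketch below assumes that formulation, F0.5u = 1.25·TP / (1.25·TP + 0.25·(FN + U) + FP), as used in PAN-style evaluators; whether tamga computes it exactly this way is an assumption. The name `f05u_sketch` and the `beta` and `band` parameters are hypothetical and exist only for illustration.

```python
def f05u_sketch(probs, y, beta=0.5, band=(0.4, 0.6)):
    """F-beta with non-answers (probabilities inside `band`) lowering recall."""
    lo, hi = band
    tp = fp = fn = unanswered = 0
    for p, label in zip(probs, y):
        if lo <= p <= hi:
            unanswered += 1      # neither a true nor a false positive
        elif p > hi and label == 1:
            tp += 1
        elif p > hi:
            fp += 1
        elif label == 1:
            fn += 1              # answered "different author" on a target trial
    b2 = beta ** 2
    denominator = (1 + b2) * tp + b2 * (fn + unanswered) + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


probs = [0.9, 0.8, 0.5, 0.3, 0.2, 0.7]
y = [1, 1, 1, 0, 0, 0]

score = f05u_sketch(probs, y)              # 2 TP, 1 FP, 1 non-answer
perfect = f05u_sketch([0.9, 0.1], [1, 0])  # every answer correct, none withheld
print(score, perfect)
```

With beta = 0.5 the single false positive costs four times as much as the single non-answer, which is how the metric penalises confident wrong answers more heavily than abstentions.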

C_llr¶

Use when: you need to measure how good your LR output is forensically; this is the strictly proper scoring rule adopted by the speaker-recognition community. Don't use when: your scorer produces accuracy-style probabilities; C_llr expects log-likelihood ratios. Expect: a single non-negative number; 0 is perfect; 1 is uninformative (equivalent to a coin flip).

tamga.forensic.metrics.cllr ¶

cllr(log_lrs: ndarray, y: ndarray) -> float

Log-likelihood-ratio cost (Brümmer & du Preez 2006).

A proper scoring rule for binary forensic LR output, capturing both calibration and discrimination in a single scalar. Interpreted as the average information loss (in bits) per trial relative to an optimally-calibrated reference system.

C_llr = 0.5 * ( mean_{i: y=1} log2(1 + 1/LR_i) + mean_{i: y=0} log2(1 + LR_i) )

Zero is unattainable in practice; a prior-only (uninformative) system gives C_llr = 1. Values below 1 indicate the system is better than prior-only; values above 1 indicate the system's LRs are actively misleading.

Parameters:

Name	Type	Description	Default
`log_lrs`	`np.ndarray of shape (n,)`	log10 likelihood ratios for each trial.	required
`y`	`np.ndarray of shape (n,)`	Binary labels: 1 = target (same-author trial), 0 = non-target.	required

Returns:

Type	Description
`float`	C_llr cost, in bits.

ECE¶

Use when: you want to audit probabilistic honesty; ECE bins predictions by claimed confidence and checks whether the actual accuracy matches. Don't use when: your development set is small (<200 trials); ECE's per-bin estimates become noisy. Expect: a single number in [0, 1]; 0 means perfect calibration.

tamga.forensic.metrics.ece ¶

ece(probs: ndarray, y: ndarray, *, n_bins: int = 10) -> float

Expected Calibration Error with equal-width binning.

ECE = sum_{b=1..B} (|B_b|/n) * |accuracy(B_b) - confidence(B_b)|

Zero indicates perfect calibration (empirical frequency matches predicted probability in every bin). Typical forensic thresholds: ECE < 0.05 is considered well-calibrated.

Parameters:

Name	Type	Description	Default
`probs`	`np.ndarray of shape (n,)`	Predicted probability of the positive class for each trial, in [0, 1].	required
`y`	`np.ndarray of shape (n,)`	Binary labels (0 or 1).	required
`n_bins`	`int`	Number of equal-width probability bins. Default 10.	`10`
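The binning step is easier to follow on a small example. The sketch below is a plain-Python reading of the formula above, under two assumptions that are not confirmed by this page: "accuracy" in a bin is taken as the observed fraction of positive labels, "confidence" as the mean predicted probability, and a probability of exactly 1.0 is placed in the last bin. `ece_sketch` is a hypothetical name.

```python
def ece_sketch(probs, y, n_bins=10):
    """Expected calibration error with equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, label in zip(probs, y):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((p, label))

    n = len(probs)
    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        confidence = sum(p for p, _ in members) / len(members)
        frequency = sum(label for _, label in members) / len(members)
        total += (len(members) / n) * abs(frequency - confidence)
    return total


# Claims 0.9 on every trial but is right only half the time.
overconfident = ece_sketch([0.9] * 10, [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
# Claims 0.8 and four out of five trials are positive.
calibrated = ece_sketch([0.8] * 5, [1, 1, 1, 1, 0])
print(overconfident, calibrated)
```

With only a few trials per bin the per-bin frequencies are coarse, which is why the guidance above warns against ECE on small development sets.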

Brier¶

Use when: you want a proper scoring rule for probabilistic classifiers (not LR outputs); it is the classic squared error between the predicted probability and the true outcome. Don't use when: you need a metric specific to forensic LRs; use C_llr instead. Expect: a single number in [0, 1]; 0 is perfect.

tamga.forensic.metrics.brier ¶

brier(probs: ndarray, y: ndarray) -> float

Brier score (mean squared error between predicted probability and binary label).

Zero = perfect probabilistic prediction; 0.25 = uninformed (all probs = 0.5); 1.0 = worst possible (confident-wrong on every trial).
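The three reference values quoted above follow directly from the definition. A minimal plain-Python sketch (the name `brier_sketch` is hypothetical):

```python
def brier_sketch(probs, y):
    """Mean squared error between predicted probability and binary label."""
    return sum((p - label) ** 2 for p, label in zip(probs, y)) / len(probs)


uninformed = brier_sketch([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])  # always 0.5
perfect = brier_sketch([1.0, 0.0], [1, 0])                     # confident and right
worst = brier_sketch([0.0, 1.0], [1, 0])                       # confident and wrong
print(uninformed, perfect, worst)
```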

Tippett¶

Use when: you want a visual calibration check; plot the target and non-target log-LRs as cumulative distributions. Don't use when: you need a single-number summary (use C_llr). Expect: two cumulative LR arrays (target and non-target), ready for a matplotlib plot.

tamga.forensic.metrics.tippett ¶

tippett(log_lrs: ndarray, y: ndarray) -> dict[str, np.ndarray]

Tippett-plot data: cumulative proportion of target and non-target trials at or above each threshold.

The classical forensic visualisation: plot both CDFs together on a log-LR x-axis.

- The target CDF should accumulate at high log-LR (right of zero).
- The non-target CDF should accumulate at low log-LR (left of zero).
- Where they cross is an empirical equal-error threshold.

Returns:

Type	Description
`dict`	`thresholds`: sorted unique log-LR values. `target_cdf`: P(log-LR ≥ t | target) at each threshold. `nontarget_cdf`: P(log-LR ≥ t | non-target) at each threshold.
