Verification¶
Authorship verification is a one-class decision: did this specific candidate produce this questioned document? Real case-work rarely offers a closed candidate set, so verification — not attribution — is the forensically canonical task.
tamga ships two complementary verifiers.
General Impostors¶
Use when: you have one questioned document, one candidate's known documents, and a
pool of ~100+ impostor documents from other authors — the forensically canonical
same-author-or-not question for a single fixed candidate.
Don't use when: you have no impostor pool available, or your candidate's known
writings total fewer than ~1000 words (the test becomes sample-size-bound).
Expect: a score in [0, 1]; calibrate with CalibratedScorer before reporting as
an LR.
Koppel & Winter (2014). For a questioned document Q, a candidate's known documents K, and a pool of impostor documents I drawn from other authors, repeatedly:
- Sample a random feature subspace.
- Sample m impostors from the pool.
- Check whether Q is closer to K than to any sampled impostor.
The fraction of winning iterations is the verification score in [0, 1].
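The loop above can be sketched in plain numpy. This is a simplified illustration, not tamga's implementation; the `gi_score` helper and the synthetic vectors are assumptions made for the example:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def gi_score(q, known_centroid, impostor_pool, n_iterations=100,
             subsample_rate=0.5, seed=42):
    """Fraction of iterations in which Q is closer to the candidate than to
    every sampled impostor (strict >, so ties count as losses)."""
    rng = np.random.default_rng(seed)
    n_features = q.shape[0]
    m = int(np.ceil(np.sqrt(len(impostor_pool))))  # sub-linear impostor sample
    wins = 0
    for _ in range(n_iterations):
        # 1. Sample a random feature subspace.
        cols = rng.choice(n_features, size=int(subsample_rate * n_features),
                          replace=False)
        # 2. Sample m impostors from the pool.
        rows = rng.choice(len(impostor_pool), size=m, replace=False)
        # 3. Q wins the iteration iff it beats every sampled impostor.
        sim_known = cosine(q[cols], known_centroid[cols])
        if all(sim_known > cosine(q[cols], imp[cols])
               for imp in impostor_pool[rows]):
            wins += 1
    return wins / n_iterations

rng = np.random.default_rng(0)
q = rng.normal(size=50)
known_centroid = q + 0.1 * rng.normal(size=50)  # stylistically close candidate
impostor_pool = rng.normal(size=(30, 50))       # unrelated impostor documents
score = gi_score(q, known_centroid, impostor_pool)
```

With a candidate this close to Q, the score lands near the top of [0, 1]; random impostors rarely beat it in any subspace.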
from tamga.features import MFWExtractor
from tamga.forensic import GeneralImpostors
# Build features over the pooled corpus so Q, K, and impostors share one vocabulary.
fm = MFWExtractor(n=200, scale="zscore", lowercase=True).fit_transform(pooled_corpus)
q_fm = slice_by_ids(fm, ["questioned"])
known_fm = slice_by_ids(fm, known_doc_ids)
impostors = slice_by_ids(fm, impostor_doc_ids)
gi = GeneralImpostors(n_iterations=100, feature_subsample_rate=0.5, seed=42)
result = gi.verify(questioned=q_fm, known=known_fm, impostors=impostors)
result.values["score"] # in [0, 1]
result.values["wins"] # raw winning-iteration count
Knobs¶

| Parameter | Default | Purpose |
|---|---|---|
| `n_iterations` | `100` | Number of random subspace + impostor-sample iterations |
| `feature_subsample_rate` | `0.5` | Fraction of features sampled per iteration |
| `impostor_sample_size` | `ceil(sqrt(pool_size))` | Impostors per iteration — scales sub-linearly so large pools don't trivialise the test |
| `similarity` | `"cosine"` | `"cosine"` (real-valued) or `"minmax"` (non-negative features only) |
| `aggregate` | `"centroid"` | `"centroid"` (mean of K) or `"nearest"` (most-similar known — conservative under within-author style heterogeneity) |
| `seed` | `42` | RNG seed (feature + impostor sampling) |
Ties¶
Ties break toward the impostors (strict >). If Q is equally close to K and an
impostor, the iteration counts as a loss — the forensically conservative choice.
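A toy check of the tie rule (the similarity values are made up for illustration):

```python
# With strict >, a tie between the candidate and an impostor is a loss.
sim_known = 0.80                       # similarity of Q to the candidate aggregate
sims_impostors = [0.62, 0.80, 0.47]    # similarities of Q to sampled impostors

iteration_win = all(sim_known > s for s in sims_impostors)
# iteration_win is False: Q ties one impostor, so the round goes to the impostors.
```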
Unmasking¶
Use when: you have long same-author prose candidates (novel chapters, long essays,
blog archives) and want a distribution-free verification — the accuracy-drop curve
itself is interpretable evidence.
Don't use when: your documents are short (< ~1500 words per side) — Unmasking needs
enough chunks to run cross-validation meaningfully.
Expect: an accuracy curve across elimination rounds; same-author pairs show a steep
drop, different-author pairs stay near random or above.
Koppel & Schler (2004). A distribution-free, long-text verification method. Chunk Q and K into word-windows, then iteratively:
- Train a binary classifier to distinguish Q-chunks from K-chunks.
- Measure CV accuracy.
- Remove the top-N most-Q-discriminating and top-N most-K-discriminating features (2 × N per round per Koppel & Schler).
- Repeat.
Same-author documents are stylistically similar: once a few surface differences are removed, the classifier collapses quickly (large drop). Different-author documents keep yielding discriminating features, so accuracy stays high (small drop).
from tamga.features import MFWExtractor
from tamga.forensic import Unmasking
unmasking = Unmasking(chunk_size=500, n_rounds=10, n_eliminate=3, seed=42)
result = unmasking.verify(
questioned=questioned_text, # str, Document, or Corpus
known=known_text,
extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
)
result.values["accuracy_curve"] # list[float], length n_rounds
result.values["accuracy_drop"] # scalar summary (curve[0] - curve[-1])
result.values["eliminated_per_round"] # auditable per-round feature removal
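Under the hood, one run looks roughly like the following simplified sketch. It substitutes a nearest-centroid rule with leave-one-out accuracy for the linear classifier and k-fold CV described above; the function name and synthetic chunks are illustrative, not tamga's code:

```python
import numpy as np

def unmasking_curve(q_chunks, k_chunks, n_rounds=5, n_eliminate=3):
    """Simplified Unmasking: per round, measure how separable Q-chunks and
    K-chunks are, then delete the most class-discriminating features."""
    X = np.vstack([q_chunks, k_chunks]).astype(float)
    y = np.array([0] * len(q_chunks) + [1] * len(k_chunks))
    active = np.arange(X.shape[1])  # indices of features still in play
    curve = []
    for _ in range(n_rounds):
        Xa = X[:, active]
        # Leave-one-out accuracy of a nearest-centroid classifier.
        correct = 0
        for i in range(len(y)):
            mask = np.arange(len(y)) != i
            c0 = Xa[mask & (y == 0)].mean(axis=0)
            c1 = Xa[mask & (y == 1)].mean(axis=0)
            pred = 0 if np.linalg.norm(Xa[i] - c0) < np.linalg.norm(Xa[i] - c1) else 1
            correct += int(pred == y[i])
        curve.append(correct / len(y))
        # Eliminate the top discriminators in both directions (up to 2 * N).
        diff = Xa[y == 0].mean(axis=0) - Xa[y == 1].mean(axis=0)
        drop = np.concatenate([np.argsort(diff)[:n_eliminate],
                               np.argsort(diff)[-n_eliminate:]])
        active = np.delete(active, np.unique(drop))
    return curve

rng = np.random.default_rng(0)
q_chunks = rng.normal(size=(6, 40))  # 6 chunks of Q over 40 features
k_chunks = rng.normal(size=(6, 40))  # 6 chunks of K
curve = unmasking_curve(q_chunks, k_chunks)
```

For a same-author pair the curve collapses quickly once the few separating features are removed; for different authors, new discriminators keep appearing and the curve stays high.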
When to pick which¶
| Situation | Pick |
|---|---|
| Short CMC / threat texts (< ~2000 words total) | GeneralImpostors. Unmasking needs more text per side to run CV meaningfully. |
| Long prose (novels, essays, blog archives) | Unmasking — the accuracy-drop curve is directly interpretable. Pair with GI as a second opinion. |
| Building an evidential report | Run both, calibrate both with CalibratedScorer. Agreement between the two is itself evidential signal (Juola-style multi-method verdict). |
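The "run both" row can be sketched as a plain-Python decision rule. The scores, the threshold, and the verdict wording are illustrative assumptions, not a tamga API:

```python
# Hypothetical calibrated outputs from the two verifiers, both in [0, 1].
score_gi = 0.87         # GeneralImpostors verification score
score_unmasking = 0.79  # e.g. a calibrated accuracy-drop summary

# Multi-method verdict: report agreement rather than fusing into one number.
threshold = 0.5
verdicts = [s > threshold for s in (score_gi, score_unmasking)]
if all(verdicts):
    verdict = "both methods support same-authorship"
elif not any(verdicts):
    verdict = "both methods support different-authorship"
else:
    verdict = "methods disagree; report inconclusive"
```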
Reference¶
GeneralImpostors ¶
Authorship verification via the General Impostors method.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_iterations` | `int` | Number of random feature-subspace + impostor-subsample iterations (Koppel & Winter 2014 used 100; PAN baselines typically use 50-200). | `100` |
| `feature_subsample_rate` | `float` | Fraction of feature columns sampled per iteration, in (0, 1]. 0.5 is the classical default; smaller values increase variance but decorrelate impostor rankings. | `0.5` |
| `impostor_sample_size` | `int` or `None` | Number of impostors to sample per iteration. If `None`, uses `ceil(sqrt(pool_size))`. | `None` |
| `similarity` | `'cosine'` or `'minmax'` | Similarity function used to compare the questioned document to the candidate and to each impostor: `"cosine"` (real-valued) or `"minmax"` (non-negative features only). | `"cosine"` |
| `aggregate` | `'centroid'` or `'nearest'` | How to build a single comparison point from the candidate's known documents: `"centroid"` (mean of K) or `"nearest"` (most-similar known document). | `"centroid"` |
| `seed` | `int` | Seed for the numpy random generator used to sample features and impostors. | `42` |
Examples:
>>> from tamga.forensic import GeneralImpostors
>>> gi = GeneralImpostors(n_iterations=100, seed=42)
>>> result = gi.verify(questioned=q_fm, known=known_fm, impostors=pool_fm)
>>> result.values["score"] # fraction of iterations where known beat all impostors
0.87
verify ¶
Run the GI algorithm for one questioned document against one candidate's known set.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `questioned` | `FeatureMatrix` | Exactly one row (the Q document). | required |
| `known` | `FeatureMatrix` | The candidate author's known documents (>= 1 row). | required |
| `impostors` | `FeatureMatrix` | Pool of documents from other authors (>= 2 rows so each iteration can sample distinct impostors even with the smallest default `impostor_sample_size`). | required |
Returns:

| Type | Description |
|---|---|
| `Result` | With `values["score"]` (winning-iteration fraction in [0, 1]) and `values["wins"]` (raw winning-iteration count). |
Notes
All three FeatureMatrix inputs must share the same feature space — i.e., identical
feature_names in the same order. Callers that build features independently should
use a single fit_transform on the pooled corpus and then slice by document id.
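A defensive check of this invariant might look like the following sketch, where `SimpleNamespace` stands in for `FeatureMatrix` and the helper name is an assumption rather than a tamga API:

```python
from types import SimpleNamespace

def assert_shared_feature_space(*matrices):
    """Raise if the matrices don't share identical, identically ordered features."""
    spaces = {tuple(m.feature_names) for m in matrices}
    if len(spaces) != 1:
        raise ValueError(
            "feature_names differ across inputs; run one fit_transform on the "
            "pooled corpus, then slice by document id"
        )

# Stand-ins for FeatureMatrix objects built from one pooled fit_transform.
q_fm = SimpleNamespace(feature_names=["the", "of", "and"])
known_fm = SimpleNamespace(feature_names=["the", "of", "and"])
assert_shared_feature_space(q_fm, known_fm)  # passes: same space, same order
```

Note that even the same vocabulary in a different column order fails the check, which is exactly what independent `fit_transform` calls tend to produce.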
Unmasking ¶
Koppel & Schler (2004) Unmasking for authorship verification.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_size` | `int` | Words per chunk. 500 is the common default in the literature. | `500` |
| `n_rounds` | `int` | Number of iteration rounds (each round eliminates up to `2 * n_eliminate` features). | `10` |
| `n_eliminate` | `int` | Per-class top-N features to remove each round, following Koppel & Schler 2004: the N most Q-discriminating (top positive coefficients) and the N most K-discriminating (top negative coefficients), so up to 2 * N features drop per round. Default 3 (6 features per round), matching the classical literature setup. | `3` |
| `n_folds` | `int` | CV folds for accuracy estimation per round. Must be >= 2 and <= min(#Q chunks, #K chunks). | `10` |
| `min_chunks_per_class` | `int` | Minimum chunks required on each side before Unmasking is meaningful. Raises `ValueError` if either side falls below this threshold. | `3` |
| `seed` | `int` | Seed for the CV split's random state. | `42` |
Examples:
>>> from tamga.features import MFWExtractor
>>> from tamga.forensic import Unmasking
>>> unmasking = Unmasking(chunk_size=500, n_rounds=10, seed=42)
>>> result = unmasking.verify(
... questioned=questioned_text,
... known=known_text,
... extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
... )
>>> result.values["accuracy_curve"]
[0.82, 0.79, 0.70, 0.58, 0.55, ...]
>>> result.values["accuracy_drop"]
0.27 # large drop = same author; small drop = different author
verify ¶
verify(*, questioned: Corpus | Document | str, known: Corpus | Document | str, extractor: BaseFeatureExtractor) -> Result
Run Unmasking and return the accuracy-degradation curve plus summary stats.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `questioned` | `Corpus`, `Document`, or `str` | The questioned text. | required |
| `known` | `Corpus`, `Document`, or `str` | The candidate author's known text. | required |
| `extractor` | `BaseFeatureExtractor` | Any tamga feature extractor. Fit on the combined chunks, so the feature space is the union of terms seen in Q and K. | required |
Returns:

| Type | Description |
|---|---|
| `Result` | With `values["accuracy_curve"]` (per-round CV accuracies), `values["accuracy_drop"]` (curve[0] - curve[-1]), and `values["eliminated_per_round"]` (auditable per-round feature removal). |