Verification

Authorship verification is a one-class decision: did this specific candidate produce this questioned document? Real case-work rarely offers a closed candidate set, so verification — not attribution — is the forensically canonical task.

tamga ships two complementary verifiers.

General Impostors

Use when: you have one questioned document, one candidate's known documents, and a pool of ~100+ impostor documents from other authors — the forensically canonical same-author-or-not question for a single fixed candidate.

Don't use when: you have no impostor pool available, or the candidate's known writings total fewer than ~1000 words (the test becomes sample-size-bound).

Expect: a score in [0, 1]; calibrate with CalibratedScorer before reporting it as a likelihood ratio (LR).

Koppel & Winter (2014). For a questioned document Q, a candidate's known documents K, and a pool of impostor documents I drawn from other authors, repeatedly:

  1. Sample a random feature subspace.
  2. Sample m impostors from the pool.
  3. Check whether Q is closer to K than to any sampled impostor.

The fraction of winning iterations is the verification score in [0, 1].
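The loop above can be sketched with NumPy. This is a minimal illustration, not tamga's implementation: `gi_score` and the raw-array inputs are assumptions for the sketch (GeneralImpostors itself operates on FeatureMatrix objects), and cosine similarity with a known-side centroid is hard-coded here.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def gi_score(q, known_centroid, impostors, n_iterations=100,
             feature_rate=0.5, m=None, seed=42):
    """Fraction of iterations in which Q is closer to the candidate
    than to every sampled impostor (strict >, so ties count as losses)."""
    rng = np.random.default_rng(seed)
    n_features = q.shape[0]
    if m is None:
        # default impostor sample scales sub-linearly with pool size
        m = int(np.ceil(np.sqrt(len(impostors))))
    wins = 0
    for _ in range(n_iterations):
        # 1. sample a random feature subspace
        cols = rng.choice(n_features,
                          size=max(1, int(feature_rate * n_features)),
                          replace=False)
        # 2. sample m impostors from the pool
        rows = rng.choice(len(impostors), size=m, replace=False)
        # 3. does Q beat every sampled impostor in this subspace?
        sim_known = cosine(q[cols], known_centroid[cols])
        if all(sim_known > cosine(q[cols], impostors[i][cols]) for i in rows):
            wins += 1
    return wins / n_iterations
```

A score near 1 means the candidate's known profile beat the sampled impostors in almost every random subspace; a score near 0 means it almost never did.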

from tamga.features import MFWExtractor
from tamga.forensic import GeneralImpostors

# Build features over the pooled corpus so Q, K, and impostors share one vocabulary.
fm = MFWExtractor(n=200, scale="zscore", lowercase=True).fit_transform(pooled_corpus)
q_fm      = slice_by_ids(fm, ["questioned"])
known_fm  = slice_by_ids(fm, known_doc_ids)
impostors = slice_by_ids(fm, impostor_doc_ids)

gi = GeneralImpostors(n_iterations=100, feature_subsample_rate=0.5, seed=42)
result = gi.verify(questioned=q_fm, known=known_fm, impostors=impostors)
result.values["score"]       # in [0, 1]
result.values["wins"]        # raw winning-iteration count

Knobs

Parameter                Default                Purpose
n_iterations             100                    Number of random subspace + impostor-sample iterations
feature_subsample_rate   0.5                    Fraction of features sampled per iteration
impostor_sample_size     ceil(sqrt(pool_size))  Impostors per iteration — scales sub-linearly so large pools don't trivialise the test
similarity               "cosine"               "cosine" (real-valued) or "minmax" (non-negative features only)
aggregate                "centroid"             "centroid" (mean of K) or "nearest" (most-similar known — conservative under within-author style heterogeneity)
seed                     42                     RNG seed (feature + impostor sampling)
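The two similarity options behave as follows. This is an illustrative NumPy sketch of the formulas, assuming plain 1-D arrays; it is not tamga's internal code.

```python
import numpy as np

def cosine_sim(u, v):
    # dot(u, v) / (||u|| * ||v||); works for any real-valued features
    return float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def minmax_sim(u, v):
    # sum(min(u, v)) / sum(max(u, v)); defined for non-negative features only
    if (u < 0).any() or (v < 0).any():
        raise ValueError("minmax similarity requires non-negative features")
    return float(np.minimum(u, v).sum()) / float(np.maximum(u, v).sum())
```

Note that minmax reaches 1.0 only for identical vectors, which is why it pairs naturally with raw relative frequencies rather than z-scored features.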

Ties

Ties break toward the impostors (strict >). If Q is equally close to K and an impostor, the iteration counts as a loss — the forensically conservative choice.

Unmasking

Use when: you have long same-author prose candidates (novel chapters, long essays, blog archives) and want a distribution-free verification — the accuracy-drop curve itself is interpretable evidence.

Don't use when: your documents are short (< ~1500 words per side) — Unmasking needs enough chunks to run cross-validation meaningfully.

Expect: an accuracy curve across elimination rounds; same-author pairs show a steep drop, while different-author pairs keep high accuracy (a small drop).

Koppel & Schler (2004). A distribution-free, long-text verification method. Chunk Q and K into word-windows, then iteratively:

  1. Train a binary classifier to distinguish Q-chunks from K-chunks.
  2. Measure CV accuracy.
  3. Remove the top-N most-Q-discriminating and top-N most-K-discriminating features (2 × N per round per Koppel & Schler).
  4. Repeat.
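The word-window chunking that precedes the loop can be sketched as follows. This is a minimal illustration; whether tamga drops or keeps a short trailing remainder is an assumption here, not documented behaviour.

```python
def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive, non-overlapping word windows.
    A trailing remainder shorter than chunk_size is dropped (assumption)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]
```

With the default chunk_size of 500, a ~1250-word text yields only two chunks, which is why short documents cannot support meaningful cross-validation.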

Same-author documents are stylistically similar: once a few surface differences are removed, the classifier collapses quickly (large drop). Different-author documents keep yielding discriminating features, so accuracy stays high (small drop).
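The per-round elimination (step 3) reduces to picking the extreme classifier weights. A minimal sketch, assuming a linear classifier whose positive coefficients favour Q-chunks and negative coefficients favour K-chunks:

```python
import numpy as np

def eliminate_round(coefs: np.ndarray, n_eliminate: int = 3) -> np.ndarray:
    """Indices to drop in one round: the n most K-discriminating
    (most negative) and the n most Q-discriminating (most positive)
    weights, i.e. up to 2 * n_eliminate features per round."""
    order = np.argsort(coefs)  # ascending: most negative first
    return np.concatenate([order[:n_eliminate], order[-n_eliminate:]])
```

Removing both tails each round, rather than only the single best feature, is what makes the same-author accuracy curve collapse within a handful of rounds.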

from tamga.features import MFWExtractor
from tamga.forensic import Unmasking

unmasking = Unmasking(chunk_size=500, n_rounds=10, n_eliminate=3, seed=42)
result = unmasking.verify(
    questioned=questioned_text,            # str, Document, or Corpus
    known=known_text,
    extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
)
result.values["accuracy_curve"]    # list[float], length n_rounds
result.values["accuracy_drop"]     # scalar summary (curve[0] - curve[-1])
result.values["eliminated_per_round"]   # auditable per-round feature removal

When to pick which

  • Short CMC / threat texts (< ~2000 words total): GeneralImpostors. Unmasking needs more text per side to run CV meaningfully.
  • Long prose (novels, essays, blog archives): Unmasking — the accuracy-drop curve is directly interpretable. Pair with GI as a second opinion.
  • Building an evidential report: run both, and calibrate both with CalibratedScorer. Agreement between the two is itself evidential signal (Juola-style multi-method verdict).

Reference

GeneralImpostors

Authorship verification via the General Impostors method.

Parameters:

n_iterations (int, default 100)
    Number of random feature-subspace + impostor-subsample iterations (Koppel & Winter 2014 used 100; PAN baselines typically use 50-200).

feature_subsample_rate (float, default 0.5)
    Fraction of feature columns sampled per iteration, in (0, 1]. 0.5 is the classical default; smaller values increase variance but decorrelate impostor rankings.

impostor_sample_size (int or None, default None)
    Number of impostors to sample per iteration. If None, uses ceil(sqrt(n_impostors)), a defensible default that scales sub-linearly with the pool size, so a large pool does not make every iteration trivially win.

similarity ("cosine" or "minmax", default "cosine")
    Similarity function used to compare the questioned document to the candidate and to each impostor.

      • cosine: dot(u, v) / (||u|| * ||v||); works for any real-valued features.
      • minmax: sum(min(u, v)) / sum(max(u, v)); Koppel et al.'s MinMax similarity, which requires non-negative features (e.g., raw relative frequencies). Raises if any feature in Q/known/impostors is negative.

aggregate ("centroid" or "nearest", default "centroid")
    How to build a single comparison point from the candidate's known documents:

      • centroid: compare Q to the mean vector of the known samples (simple, standard).
      • nearest: compare Q to the nearest known sample in the projected subspace (more conservative under stylistic heterogeneity within an author's corpus).

seed (int, default 42)
    Seed for the numpy random generator used to sample features and impostors.

Examples:

>>> from tamga.forensic import GeneralImpostors
>>> gi = GeneralImpostors(n_iterations=100, seed=42)
>>> result = gi.verify(questioned=q_fm, known=known_fm, impostors=pool_fm)
>>> result.values["score"]    # fraction of iterations where known beat all impostors
0.87

verify

verify(*, questioned: FeatureMatrix, known: FeatureMatrix, impostors: FeatureMatrix) -> Result

Run the GI algorithm for one questioned document against one candidate's known set.

Parameters:

questioned (FeatureMatrix, required)
    Exactly one row (the Q document).

known (FeatureMatrix, required)
    The candidate author's known documents (>= 1 row).

impostors (FeatureMatrix, required)
    Pool of documents from other authors (>= 2 rows, so each iteration can sample distinct impostors even with the smallest default impostor_sample_size).

Returns:

Result
    With values["score"] in [0, 1] (higher = more likely same author), values["wins"] (raw iteration-win count), and sampling counts in params.

Notes

All three FeatureMatrix inputs must share the same feature space — i.e., identical feature_names in the same order. Callers that build features independently should use a single fit_transform on the pooled corpus and then slice by document id.
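A defensive check along these lines can catch mismatched feature spaces early. It is purely illustrative; `check_aligned` is not a tamga API, and it assumes feature names are exposed as ordered sequences.

```python
def check_aligned(*feature_name_lists) -> None:
    """Raise if the inputs were built over different vocabularies
    (feature names must match exactly, including order)."""
    first = list(feature_name_lists[0])
    for names in feature_name_lists[1:]:
        if list(names) != first:
            raise ValueError(
                "FeatureMatrix inputs must share identical feature_names "
                "in the same order; re-fit one extractor on the pooled corpus"
            )
```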

Unmasking

Koppel & Schler (2004) Unmasking for authorship verification.

Parameters:

chunk_size (int, default 500)
    Words per chunk. 500 is the common default in the literature.

n_rounds (int, default 10)
    Number of iteration rounds (each round eliminates n_eliminate features per class and retrains). 10 is the standard setting.

n_eliminate (int, default 3)
    Per-class top-N features to remove each round, following Koppel & Schler 2004: the N most Q-discriminating (top positive coefficients) and the N most K-discriminating (top negative coefficients), so up to 2 * N features drop per round. Default 3 (6 features per round), matching the classical literature setup.

n_folds (int, default 10)
    CV folds for accuracy estimation per round. Must be >= 2 and <= min(#Q chunks, #K chunks).

min_chunks_per_class (int, default 3)
    Minimum chunks required on each side before Unmasking is meaningful. Raises ValueError if either side falls below this threshold.

seed (int, default 42)
    Seed for the CV split's random state.

Examples:

>>> from tamga.features import MFWExtractor
>>> from tamga.forensic import Unmasking
>>> unmasking = Unmasking(chunk_size=500, n_rounds=10, seed=42)
>>> result = unmasking.verify(
...     questioned=questioned_text,
...     known=known_text,
...     extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
... )
>>> result.values["accuracy_curve"]
[0.82, 0.79, 0.70, 0.58, 0.55, ...]
>>> result.values["accuracy_drop"]
0.27  # large drop = same author; small drop = different author

verify

verify(*, questioned: Corpus | Document | str, known: Corpus | Document | str, extractor: BaseFeatureExtractor) -> Result

Run Unmasking and return the accuracy-degradation curve plus summary stats.

Parameters:

questioned (Corpus, Document, or str, required)
    The questioned text.

known (Corpus, Document, or str, required)
    The candidate author's known text.

extractor (BaseFeatureExtractor, required)
    Any tamga feature extractor. Fit on the combined chunks, so the feature space is the union of terms seen in Q and K.

Returns:

Result
    values["accuracy_curve"] (list of per-round CV accuracies, length n_rounds), values["accuracy_drop"] (accuracy at round 0 minus accuracy at the final round), values["n_q_chunks"], and values["n_k_chunks"].