Verification

Authorship verification is a one-class decision: did this specific candidate produce this questioned document? Real case-work rarely offers a closed candidate set, so verification — not attribution — is the forensically canonical task.

tamga ships two complementary verifiers.

General Impostors

Use when: you have one questioned document, one candidate's known documents, and a pool of ~100+ impostor documents from other authors — the forensically canonical same-author-or-not question for a single fixed candidate.

Don't use when: you have no impostor pool available, or the candidate's known writings total fewer than ~1000 words (the test becomes sample-size-bound).

Expect: a score in [0, 1]; calibrate with CalibratedScorer before reporting it as a likelihood ratio (LR).

Koppel & Winter (2014). For a questioned document Q, a candidate's known documents K, and a pool of impostor documents I drawn from other authors, repeatedly:

  1. Sample a random feature subspace.
  2. Sample m impostors from the pool.
  3. Check whether Q is closer to K than to any sampled impostor.

The fraction of winning iterations is the verification score in [0, 1].
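The loop above can be sketched with NumPy. This is a minimal illustration, not tamga's implementation: `gi_score` and the raw-array inputs are assumptions for the sketch (GeneralImpostors itself operates on FeatureMatrix objects), and cosine similarity with a known-side centroid is hard-coded here.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def gi_score(q, known_centroid, impostors, n_iterations=100,
             feature_rate=0.5, m=None, seed=42):
    """Fraction of iterations in which Q is closer to the candidate
    than to every sampled impostor (strict >, so ties count as losses)."""
    rng = np.random.default_rng(seed)
    n_features = q.shape[0]
    if m is None:
        # default impostor sample scales sub-linearly with pool size
        m = int(np.ceil(np.sqrt(len(impostors))))
    wins = 0
    for _ in range(n_iterations):
        # 1. sample a random feature subspace
        cols = rng.choice(n_features,
                          size=max(1, int(feature_rate * n_features)),
                          replace=False)
        # 2. sample m impostors from the pool
        rows = rng.choice(len(impostors), size=m, replace=False)
        # 3. does Q beat every sampled impostor in this subspace?
        sim_known = cosine(q[cols], known_centroid[cols])
        if all(sim_known > cosine(q[cols], impostors[i][cols]) for i in rows):
            wins += 1
    return wins / n_iterations
```

A score near 1 means the candidate's known profile beat the sampled impostors in almost every random subspace; a score near 0 means it almost never did.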

from tamga.features import MFWExtractor
from tamga.forensic import GeneralImpostors

# Build features over the pooled corpus so Q, K, and impostors share one vocabulary.
fm = MFWExtractor(n=200, scale="zscore", lowercase=True).fit_transform(pooled_corpus)
q_fm      = slice_by_ids(fm, ["questioned"])
known_fm  = slice_by_ids(fm, known_doc_ids)
impostors = slice_by_ids(fm, impostor_doc_ids)

gi = GeneralImpostors(n_iterations=100, feature_subsample_rate=0.5, seed=42)
result = gi.verify(questioned=q_fm, known=known_fm, impostors=impostors)
result.values["score"]       # in [0, 1]
result.values["wins"]        # raw winning-iteration count

Knobs

Parameter                Default                Purpose
n_iterations             100                    Number of random subspace + impostor-sample iterations
feature_subsample_rate   0.5                    Fraction of features sampled per iteration
impostor_sample_size     ceil(sqrt(pool_size))  Impostors per iteration — scales sub-linearly so large pools don't trivialise the test
similarity               "cosine"               "cosine" (real-valued) or "minmax" (non-negative features only)
aggregate                "centroid"             "centroid" (mean of K) or "nearest" (most-similar known — conservative under within-author style heterogeneity)
seed                     42                     RNG seed (feature + impostor sampling)
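The two similarity options behave as follows. This is an illustrative NumPy sketch of the formulas, assuming plain 1-D arrays; it is not tamga's internal code.

```python
import numpy as np

def cosine_sim(u, v):
    # dot(u, v) / (||u|| * ||v||); works for any real-valued features
    return float(u @ v) / float(np.linalg.norm(u) * np.linalg.norm(v))

def minmax_sim(u, v):
    # sum(min(u, v)) / sum(max(u, v)); defined for non-negative features only
    if (u < 0).any() or (v < 0).any():
        raise ValueError("minmax similarity requires non-negative features")
    return float(np.minimum(u, v).sum()) / float(np.maximum(u, v).sum())
```

Note that minmax reaches 1.0 only for identical vectors, which is why it pairs naturally with raw relative frequencies rather than z-scored features.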

Ties

Ties break toward the impostors (strict >). If Q is equally close to K and an impostor, the iteration counts as a loss — the forensically conservative choice.

Unmasking

Use when: you have long same-author prose candidates (novel chapters, long essays, blog archives) and want a distribution-free verification — the accuracy-drop curve itself is interpretable evidence.

Don't use when: your documents are short (< ~1500 words per side) — Unmasking needs enough chunks to run cross-validation meaningfully.

Expect: an accuracy curve across elimination rounds; same-author pairs show a steep drop, while different-author pairs keep high accuracy (a small drop).

Koppel & Schler (2004). A distribution-free, long-text verification method. Chunk Q and K into word-windows, then iteratively:

  1. Train a binary classifier to distinguish Q-chunks from K-chunks.
  2. Measure CV accuracy.
  3. Remove the top-N most-Q-discriminating and top-N most-K-discriminating features (2 × N per round per Koppel & Schler).
  4. Repeat.
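The word-window chunking that precedes the loop can be sketched as follows. This is a minimal illustration; whether tamga drops or keeps a short trailing remainder is an assumption here, not documented behaviour.

```python
def chunk_words(text: str, chunk_size: int = 500) -> list[str]:
    """Split text into consecutive, non-overlapping word windows.
    A trailing remainder shorter than chunk_size is dropped (assumption)."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words) - chunk_size + 1, chunk_size)]
```

With the default chunk_size of 500, a ~1250-word text yields only two chunks, which is why short documents cannot support meaningful cross-validation.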

Same-author documents are stylistically similar: once a few surface differences are removed, the classifier collapses quickly (large drop). Different-author documents keep yielding discriminating features, so accuracy stays high (small drop).
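The per-round elimination (step 3) reduces to picking the extreme classifier weights. A minimal sketch, assuming a linear classifier whose positive coefficients favour Q-chunks and negative coefficients favour K-chunks:

```python
import numpy as np

def eliminate_round(coefs: np.ndarray, n_eliminate: int = 3) -> np.ndarray:
    """Indices to drop in one round: the n most K-discriminating
    (most negative) and the n most Q-discriminating (most positive)
    weights, i.e. up to 2 * n_eliminate features per round."""
    order = np.argsort(coefs)  # ascending: most negative first
    return np.concatenate([order[:n_eliminate], order[-n_eliminate:]])
```

Removing both tails each round, rather than only the single best feature, is what makes the same-author accuracy curve collapse within a handful of rounds.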

from tamga.features import MFWExtractor
from tamga.forensic import Unmasking

unmasking = Unmasking(chunk_size=500, n_rounds=10, n_eliminate=3, seed=42)
result = unmasking.verify(
    questioned=questioned_text,            # str, Document, or Corpus
    known=known_text,
    extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
)
result.values["accuracy_curve"]    # list[float], length n_rounds
result.values["accuracy_drop"]     # scalar summary (curve[0] - curve[-1])
result.values["eliminated_per_round"]   # auditable per-round feature removal

When to pick which

  • Short CMC / threat texts (< ~2000 words total): GeneralImpostors. Unmasking needs more text per side to run CV meaningfully.
  • Long prose (novels, essays, blog archives): Unmasking — the accuracy-drop curve is directly interpretable. Pair with GI as a second opinion.
  • Building an evidential report: run both, and calibrate both with CalibratedScorer. Agreement between the two is itself evidential signal (Juola-style multi-method verdict).

Reference

GeneralImpostors

Authorship verification via the General Impostors method.

Parameters:

n_iterations (int, default 100)
    Number of random feature-subspace + impostor-subsample iterations (Koppel & Winter 2014 used 100; PAN baselines typically use 50-200).

feature_subsample_rate (float, default 0.5)
    Fraction of feature columns sampled per iteration, in (0, 1]. 0.5 is the classical default; smaller values increase variance but decorrelate impostor rankings.

impostor_sample_size (int or None, default None)
    Number of impostors to sample per iteration. If None, uses ceil(sqrt(n_impostors)), a defensible default that scales sub-linearly with the pool size, so a large pool does not make every iteration trivially win.

similarity ("cosine" or "minmax", default "cosine")
    Similarity function used to compare the questioned document to the candidate and to each impostor.

      • cosine: dot(u, v) / (||u|| * ||v||); works for any real-valued features.
      • minmax: sum(min(u, v)) / sum(max(u, v)); Koppel et al.'s MinMax similarity, which requires non-negative features (e.g., raw relative frequencies). Raises if any feature in Q/known/impostors is negative.

aggregate ("centroid" or "nearest", default "centroid")
    How to build a single comparison point from the candidate's known documents:

      • centroid: compare Q to the mean vector of the known samples (simple, standard).
      • nearest: compare Q to the nearest known sample in the projected subspace (more conservative under stylistic heterogeneity within an author's corpus).

seed (int, default 42)
    Seed for the numpy random generator used to sample features and impostors.

Examples:

>>> from tamga.forensic import GeneralImpostors
>>> gi = GeneralImpostors(n_iterations=100, seed=42)
>>> result = gi.verify(questioned=q_fm, known=known_fm, impostors=pool_fm)
>>> result.values["score"]    # fraction of iterations where known beat all impostors
0.87

verify

verify(*, questioned: FeatureMatrix, known: FeatureMatrix, impostors: FeatureMatrix) -> Result

Run the GI algorithm for one questioned document against one candidate's known set.

Parameters:

questioned (FeatureMatrix, required)
    Exactly one row (the Q document).

known (FeatureMatrix, required)
    The candidate author's known documents (>= 1 row).

impostors (FeatureMatrix, required)
    Pool of documents from other authors (>= 2 rows, so each iteration can sample distinct impostors even with the smallest default impostor_sample_size).

Returns:

Result
    With values["score"] in [0, 1] (higher = more likely same author), values["wins"] (raw iteration-win count), and sampling counts in params.

Notes

All three FeatureMatrix inputs must share the same feature space — i.e., identical feature_names in the same order. Callers that build features independently should use a single fit_transform on the pooled corpus and then slice by document id.
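A defensive check along these lines can catch mismatched feature spaces early. It is purely illustrative; `check_aligned` is not a tamga API, and it assumes feature names are exposed as ordered sequences.

```python
def check_aligned(*feature_name_lists) -> None:
    """Raise if the inputs were built over different vocabularies
    (feature names must match exactly, including order)."""
    first = list(feature_name_lists[0])
    for names in feature_name_lists[1:]:
        if list(names) != first:
            raise ValueError(
                "FeatureMatrix inputs must share identical feature_names "
                "in the same order; re-fit one extractor on the pooled corpus"
            )
```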

Unmasking

Koppel & Schler (2004) Unmasking for authorship verification.

Parameters:

chunk_size (int, default 500)
    Words per chunk. 500 is the common default in the literature.

n_rounds (int, default 10)
    Number of iteration rounds (each round eliminates n_eliminate features per class and retrains). 10 is the standard setting.

n_eliminate (int, default 3)
    Per-class top-N features to remove each round, following Koppel & Schler 2004: the N most Q-discriminating (top positive coefficients) and the N most K-discriminating (top negative coefficients), so up to 2 * N features drop per round. Default 3 (6 features per round), matching the classical literature setup.

n_folds (int, default 10)
    CV folds for accuracy estimation per round. Must be >= 2 and <= min(#Q chunks, #K chunks).

min_chunks_per_class (int, default 3)
    Minimum chunks required on each side before Unmasking is meaningful. Raises ValueError if either side falls below this threshold.

seed (int, default 42)
    Seed for the CV split's random state.

Examples:

>>> from tamga.features import MFWExtractor
>>> from tamga.forensic import Unmasking
>>> unmasking = Unmasking(chunk_size=500, n_rounds=10, seed=42)
>>> result = unmasking.verify(
...     questioned=questioned_text,
...     known=known_text,
...     extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
... )
>>> result.values["accuracy_curve"]
[0.82, 0.79, 0.70, 0.58, 0.55, ...]
>>> result.values["accuracy_drop"]
0.27  # large drop = same author; small drop = different author

verify

verify(*, questioned: Corpus | Document | str, known: Corpus | Document | str, extractor: BaseFeatureExtractor) -> Result

Run Unmasking and return the accuracy-degradation curve plus summary stats.

Parameters:

questioned (Corpus, Document, or str, required)
    The questioned text.

known (Corpus, Document, or str, required)
    The candidate author's known text.

extractor (BaseFeatureExtractor, required)
    Any tamga feature extractor. Fit on the combined chunks, so the feature space is the union of terms seen in Q and K.

Returns:

Result
    values["accuracy_curve"] (list of per-round CV accuracies, length n_rounds), values["accuracy_drop"] (accuracy at round 0 minus accuracy at the final round), values["n_q_chunks"], and values["n_k_chunks"].