
Topic-invariant features

Use when: your questioned and known documents might be on different topics — you need features that capture style without leaking topic. Don't use when: topic is part of the question (for example, a plagiarism check where the two documents should share content). Use regular features then. Expect: feature extractors that discard most content-word signal while preserving function-word, morphology, and punctuation patterns.

Two techniques live under tamga.forensic: Sapkota char-n-gram categorisation and Stamatatos distortion. Both compose with any downstream verifier.

Cross-topic is the most common failure mode of classical stylometry on real forensic data. A suspect's threat letter and personal email are typically on different topics but presumably the same author; unfiltered character-n-gram and word-n-gram features collapse into topic detection in that setting.

tamga ships two complementary tools.

Sapkota character n-gram categories

Use when: you want char-n-gram features for verification but need to strip topic-sensitive whole-word n-grams, keeping only the affix, punctuation-adjacent, and space-adjacent categories. Don't use when: your corpus is so small that further filtering collapses the feature space below ~500 dimensions. Expect: a sparse count matrix with only the chosen categories; ("prefix", "suffix", "punct") is the affix-plus-punctuation recipe that generalises best across topics.

CategorizedCharNgramExtractor classifies each character n-gram occurrence (not just the string) by its position in the source text. Feature columns are named <ngram>|<category>, so the|whole_word and the|prefix are separate channels — explicit and auditable.
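
To make "each occurrence, not just the string" concrete, here is a minimal sketch of occurrence enumeration with left/right context. `ngram_occurrences` is a hypothetical helper for illustration, not tamga's internal API:

```python
def ngram_occurrences(text: str, n: int = 3):
    """Yield (ngram, left, right) for every occurrence.

    left/right are the characters immediately before/after the occurrence,
    or "" at the document boundaries.
    """
    for i in range(len(text) - n + 1):
        yield (
            text[i:i + n],
            text[i - 1] if i > 0 else "",
            text[i + n] if i + n < len(text) else "",
        )

# The same trigram string shows up in different contexts, which is what
# drives the per-occurrence category tagging:
occs = list(ngram_occurrences("to be"))
# [("to ", "", "b"), ("o b", "t", "e"), (" be", "o", "")]
```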

Seven categories:

Category     Description
prefix       starts at a word boundary, ends inside the word (e.g., "the" in "there")
suffix       starts inside the word, ends at a word boundary (e.g., "ing" in "running")
whole_word   exactly one word, boundaries at both ends
mid_word     entirely internal to a single word
multi_word   spans the whitespace between two words
punct        contains any punctuation character
space        contains whitespace but does not qualify as multi_word (whitespace at the n-gram's start or end)

Sapkota et al. (2015) showed that selecting only affix (prefix + suffix) + punct dramatically improves cross-topic attribution — the forensic default.

from tamga.forensic import CategorizedCharNgramExtractor

extractor = CategorizedCharNgramExtractor(
    n=3,
    categories=("prefix", "suffix", "punct"),  # topic-invariant subset
    scale="zscore",
    lowercase=True,
)
fm = extractor.fit_transform(corpus)

Stamatatos distortion

Use when: you want aggressive topic removal via content-word masking: content words are replaced with placeholders while function words, morphology, and punctuation are preserved. Don't use when: you need any content-word signal downstream (e.g., Zeta on distinctive vocabulary). Expect: a new Corpus object you pass to any existing extractor. Modes: "dv_ma" replaces each content-word character with * (length-preserving); "dv_sa" collapses each content word to a single *.

distort_corpus pre-processes documents to mask content while preserving style: function words, punctuation, digits, and whitespace remain verbatim; content-word characters are replaced.

Two modes

DV-MA (Distortion View — Multiple Asterisks): each content-word character → *. Length-preserving — morphological habits (typical word lengths) remain visible.

DV-SA (Distortion View — Single Asterisk): each content word → single *. Aggressive; only function-word and punctuation pattern survives.

from tamga.forensic import distort_corpus
from tamga import MFWExtractor

distorted = distort_corpus(corpus, mode="dv_ma")

# Downstream extractors see the distorted text — topic signal is masked out.
fm = MFWExtractor(n=200, scale="zscore").fit_transform(distorted)

Contractions

Both _TOKEN_RE and the bundled function-word list treat common English contractions (don't, it's, we'll, they've, …) as single words, so they are preserved verbatim. o'clock and other apostrophised content words are likewise tokenised as one word and masked as a single contiguous string (e.g., *******) rather than split into fragments.
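
A minimal sketch of this masking behaviour, assuming a simplified token pattern and a toy function-word set (neither is tamga's actual _TOKEN_RE or bundled list):

```python
import re

# Simplified stand-in for the real token pattern:
# runs of letters, with internal apostrophes kept inside one token.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def distort_sketch(text: str, mode: str = "dv_ma",
                   function_words: frozenset = frozenset()) -> str:
    """Mask content words; function words, punctuation, digits, spaces survive."""
    def repl(m: re.Match) -> str:
        word = m.group(0)
        if word.lower() in function_words:
            return word                                  # preserved verbatim
        return "*" * len(word) if mode == "dv_ma" else "*"
    return TOKEN_RE.sub(repl, text)

fw = {"the", "on", "don't"}
distort_sketch("The cat sat on the mat.", "dv_ma", fw)
# "The *** *** on the ***."
distort_sketch("I don't know o'clock.", "dv_ma", fw)
# "* don't **** *******."
```

Note how "don't" survives because it is on the function-word list, while "o'clock" is masked as one seven-character string.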

Custom function-word list

distorted = distort_corpus(
    corpus,
    mode="dv_ma",
    function_words={"the", "a", "of", "to", "and"},   # minimal stoplist
)

Pass frozenset() to treat every word as content (DV-MA will produce an all-* text).

Combining the two

Sapkota categories + Stamatatos distortion compose cleanly:

distorted = distort_corpus(corpus, mode="dv_ma")
extractor = CategorizedCharNgramExtractor(
    n=3, categories=("prefix", "suffix", "punct"), lowercase=True
)
fm = extractor.fit_transform(distorted)

This produces a feature set that is doubly topic-invariant — affix-and-punctuation n-grams extracted from content-masked text — and routinely outperforms unfiltered character n-grams on cross-genre PAN tasks.

Reference

CategorizedCharNgramExtractor

Bases: BaseFeatureExtractor

Character n-gram extractor that filters n-gram occurrences by Sapkota category.

Unlike CharNgramExtractor, this extractor counts each occurrence of an n-gram separately and tags it with the category of its position in the source text. The n-gram string "the" can therefore contribute to multiple category channels.

Feature columns are named "<ngram>|<category>" so the origin of each column is explicit and auditable.

Parameters:

n (int, default 3)
    N-gram order. Fixed int, no range: ranges are an easy extension but complicate the classification logic enough to be deferred.

categories (iterable of Category, default None, meaning all 7)
    Which categories to retain. Set to ("prefix", "suffix") to get the cross-topic-robust affix-only feature set that Sapkota et al. recommend.

scale ("none" | "zscore" | "l1" | "l2", default "none")
    Per-feature scaling applied at transform time. Same semantics as CharNgramExtractor.

lowercase (bool, default False)
    Case-fold before extracting n-grams.

Examples:

>>> # Cross-topic-robust forensic feature set:
>>> extractor = CategorizedCharNgramExtractor(
...     n=3, categories=("prefix", "suffix", "punct"), scale="zscore", lowercase=True
... )

tamga.forensic.char_ngrams.classify_ngram

classify_ngram(ngram: str, left: str, right: str) -> Category

Classify a single n-gram occurrence.

The n-gram string itself is insufficient — its category depends on the context in which it was extracted. left and right are the characters immediately before and after the n-gram occurrence (or an empty string at document boundaries).

Priority order (following Sapkota et al. 2015 convention):

  1. punct — any punctuation character inside the n-gram wins immediately.
  2. whole_word — both the left and right neighbours are spaces (or empty) AND the n-gram contains no internal whitespace.
  3. multi_word — contains internal whitespace.
  4. prefix — left neighbour is space/empty and the last char is a word-internal letter.
  5. suffix — right neighbour is space/empty and the first char is word-internal.
  6. space — contains whitespace but not enough of a word-boundary match for the above categories (rare; gaps at the start or end).
  7. mid_word — otherwise.
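
The priority order above can be sketched in a few lines. This is an illustrative re-implementation under the stated rules, not tamga's classify_ngram; for simplicity it treats any whitespace in the n-gram as disqualifying whole_word:

```python
import string

def classify_sketch(ngram: str, left: str, right: str) -> str:
    """Apply the seven rules in priority order (boundary = space or text edge)."""
    def boundary(c: str) -> bool:
        return c == "" or c.isspace()

    if any(c in string.punctuation for c in ngram):
        return "punct"                                    # rule 1
    if boundary(left) and boundary(right) and not any(c.isspace() for c in ngram):
        return "whole_word"                               # rule 2
    if any(c.isspace() for c in ngram[1:-1]):
        return "multi_word"                               # rule 3: internal whitespace
    if boundary(left) and not ngram[-1].isspace():
        return "prefix"                                   # rule 4
    if boundary(right) and not ngram[0].isspace():
        return "suffix"                                   # rule 5
    if any(c.isspace() for c in ngram):
        return "space"                                    # rule 6
    return "mid_word"                                     # rule 7

classify_sketch("the", " ", " ")   # "whole_word"
classify_sketch("the", " ", "r")   # "prefix"  ("the" in "there")
classify_sketch("ing", "n", " ")   # "suffix"  ("ing" in "running")
classify_sketch("e d", "h", "o")   # "multi_word"
```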

tamga.forensic.distortion.distort_corpus

distort_corpus(corpus: Corpus, *, mode: DistortionMode = 'dv_ma', function_words: frozenset[str] | set[str] | list[str] | None = None, language: str | None = None) -> Corpus

Produce a new Corpus with each document's text distorted.

Document ids and metadata are preserved unchanged; metadata["distortion_mode"] is set on each new Document to record how it was produced.

Parameters:

corpus (Corpus, required)

mode ("dv_ma" | "dv_sa", default "dv_ma")

function_words (iterable of str, default None)
    Words to preserve verbatim. If None, the bundled list for language (or corpus.language when language is not given) is used.

language (str, default None)
    Language code overriding corpus.language for function-word selection.

tamga.forensic.distortion.distort_text

distort_text(text: str, *, mode: DistortionMode = 'dv_ma', function_words: frozenset[str] | set[str] | list[str] | None = None, language: str = 'en') -> str

Apply Stamatatos distortion to a single string.

Parameters:

text (str, required)
    Input text.

mode ("dv_ma" | "dv_sa", default "dv_ma")
    Distortion variant. dv_ma preserves word length; dv_sa collapses each content word to one *.

function_words (iterable of str, default None)
    Words to preserve verbatim. If None, the bundled list for language is used.

language (str, default "en")
    Language code ("en", "tr", "de", "es", "fr") selecting the bundled function-word list when function_words is None. Ignored otherwise.

Returns:

str
    The distorted text: identical length to the input for DV-MA, shorter for DV-SA. All non-word characters (spaces, punctuation, digits) are preserved.