
Topic-invariant features

Use when: your questioned and known documents might be on different topics — you need features that capture style without leaking topic. Don't use when: topic is part of the question (for example, a plagiarism check where the two documents should share content). Use regular features then. Expect: feature extractors that discard most content-word signal while preserving function-word, morphology, and punctuation patterns.

Two techniques live under tamga.forensic: Sapkota char-n-gram categorisation and Stamatatos distortion. Both compose with any downstream verifier.

Cross-topic is the most common failure mode of classical stylometry on real forensic data. A suspect's threat letter and personal email are typically on different topics but presumably the same author; unfiltered character-n-gram and word-n-gram features collapse into topic detection in that setting.

tamga ships two complementary tools.

Sapkota character n-gram categories

Use when: you want char-n-gram features for verification but need to strip topic-sensitive whole-word n-grams, keeping only the affix, punctuation-adjacent, and space-adjacent categories. Don't use when: your corpus is so small that further filtering collapses the feature space below ~500 dimensions. Expect: a sparse count matrix with only the chosen categories; ("prefix", "suffix", "punct") is the affix-plus-punctuation recipe that generalises best across topics.

CategorizedCharNgramExtractor classifies each character n-gram occurrence (not just the string) by its position in the source text. Feature columns are named <ngram>|<category>, so the|whole_word and the|prefix are separate channels — explicit and auditable.
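
To make "each occurrence, not just the string" concrete, here is a minimal sketch of occurrence enumeration with left/right context. `ngram_occurrences` is a hypothetical helper for illustration, not tamga's internal API:

```python
def ngram_occurrences(text: str, n: int = 3):
    """Yield (ngram, left, right) for every occurrence.

    left/right are the characters immediately before/after the occurrence,
    or "" at the document boundaries.
    """
    for i in range(len(text) - n + 1):
        yield (
            text[i:i + n],
            text[i - 1] if i > 0 else "",
            text[i + n] if i + n < len(text) else "",
        )

# The same trigram string shows up in different contexts, which is what
# drives the per-occurrence category tagging:
occs = list(ngram_occurrences("to be"))
# [("to ", "", "b"), ("o b", "t", "e"), (" be", "o", "")]
```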

Seven categories:

Category     Description
prefix       starts at a word boundary, ends inside the word (e.g., "the" in "there")
suffix       starts inside the word, ends at a word boundary (e.g., "ing" in "running")
whole_word   exactly one word, boundaries at both ends
mid_word     entirely internal to a single word
multi_word   spans the whitespace between two words
punct        contains any punctuation character
space        contains whitespace but does not qualify as multi_word (whitespace at the n-gram's start or end)

Sapkota et al. (2015) showed that selecting only affix (prefix + suffix) + punct dramatically improves cross-topic attribution — the forensic default.

from tamga.forensic import CategorizedCharNgramExtractor

extractor = CategorizedCharNgramExtractor(
    n=3,
    categories=("prefix", "suffix", "punct"),  # topic-invariant subset
    scale="zscore",
    lowercase=True,
)
fm = extractor.fit_transform(corpus)

Stamatatos distortion

Use when: you want aggressive topic removal via content-word masking: content words are replaced with placeholders while function words, morphology, and punctuation are preserved. Don't use when: you need any content-word signal downstream (e.g., Zeta on distinctive vocabulary). Expect: a new Corpus object you pass to any existing extractor. Modes: "dv_ma" replaces each content-word character with * (length-preserving); "dv_sa" collapses each content word to a single *.

distort_corpus pre-processes documents to mask content while preserving style: function words, punctuation, digits, and whitespace remain verbatim; content-word characters are replaced.

Two modes

DV-MA (Distortion View — Multiple Asterisks): each content-word character → *. Length-preserving — morphological habits (typical word lengths) remain visible.

DV-SA (Distortion View — Single Asterisk): each content word → single *. Aggressive; only function-word and punctuation pattern survives.

from tamga.forensic import distort_corpus
from tamga import MFWExtractor

distorted = distort_corpus(corpus, mode="dv_ma")

# Downstream extractors see the distorted text — topic signal is masked out.
fm = MFWExtractor(n=200, scale="zscore").fit_transform(distorted)

Contractions

Both _TOKEN_RE and the bundled function-word list treat common English contractions (don't, it's, we'll, they've, …) as single words, so they are preserved verbatim. o'clock and other apostrophised content words are likewise tokenised as one word and masked as a single contiguous string (e.g., *******) rather than split into fragments.
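
A minimal sketch of this masking behaviour, assuming a simplified token pattern and a toy function-word set (neither is tamga's actual _TOKEN_RE or bundled list):

```python
import re

# Simplified stand-in for the real token pattern:
# runs of letters, with internal apostrophes kept inside one token.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*")

def distort_sketch(text: str, mode: str = "dv_ma",
                   function_words: frozenset = frozenset()) -> str:
    """Mask content words; function words, punctuation, digits, spaces survive."""
    def repl(m: re.Match) -> str:
        word = m.group(0)
        if word.lower() in function_words:
            return word                                  # preserved verbatim
        return "*" * len(word) if mode == "dv_ma" else "*"
    return TOKEN_RE.sub(repl, text)

fw = {"the", "on", "don't"}
distort_sketch("The cat sat on the mat.", "dv_ma", fw)
# "The *** *** on the ***."
distort_sketch("I don't know o'clock.", "dv_ma", fw)
# "* don't **** *******."
```

Note how "don't" survives because it is on the function-word list, while "o'clock" is masked as one seven-character string.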

Custom function-word list

distorted = distort_corpus(
    corpus,
    mode="dv_ma",
    function_words={"the", "a", "of", "to", "and"},   # minimal stoplist
)

Pass frozenset() to treat every word as content (DV-MA will produce an all-* text).

Combining the two

Sapkota categories + Stamatatos distortion compose cleanly:

distorted = distort_corpus(corpus, mode="dv_ma")
extractor = CategorizedCharNgramExtractor(
    n=3, categories=("prefix", "suffix", "punct"), lowercase=True
)
fm = extractor.fit_transform(distorted)

This produces a feature set that is doubly topic-invariant — affix-and-punctuation n-grams extracted from content-masked text — and routinely outperforms unfiltered character n-grams on cross-genre PAN tasks.

Reference

CategorizedCharNgramExtractor

Bases: BaseFeatureExtractor

Character n-gram extractor that filters n-gram occurrences by Sapkota category.

Unlike CharNgramExtractor, this extractor counts each occurrence of an n-gram separately and tags it with the category of its position in the source text. The n-gram string "the" can therefore contribute to multiple category channels.

Feature columns are named "<ngram>|<category>" so the origin of each column is explicit and auditable.

Parameters:

n (int, default 3)
    N-gram order. Fixed int, no range: ranges are an easy extension but complicate the classification logic enough to be deferred.

categories (iterable of Category, default None, meaning all 7)
    Which categories to retain. Set to ("prefix", "suffix") to get the cross-topic-robust affix-only feature set that Sapkota et al. recommend.

scale ("none" | "zscore" | "l1" | "l2", default "none")
    Per-feature scaling applied at transform time. Same semantics as CharNgramExtractor.

lowercase (bool, default False)
    Case-fold before extracting n-grams.

Examples:

>>> # Cross-topic-robust forensic feature set:
>>> extractor = CategorizedCharNgramExtractor(
...     n=3, categories=("prefix", "suffix", "punct"), scale="zscore", lowercase=True
... )

tamga.forensic.char_ngrams.classify_ngram

classify_ngram(ngram: str, left: str, right: str) -> Category

Classify a single n-gram occurrence.

The n-gram string itself is insufficient — its category depends on the context in which it was extracted. left and right are the characters immediately before and after the n-gram occurrence (or an empty string at document boundaries).

Priority order (following Sapkota et al. 2015 convention):

  1. punct — any punctuation character inside the n-gram wins immediately.
  2. whole_word — both the left and right neighbours are spaces (or empty) AND the n-gram contains no internal whitespace.
  3. multi_word — contains internal whitespace.
  4. prefix — left neighbour is space/empty and the last char is a word-internal letter.
  5. suffix — right neighbour is space/empty and the first char is word-internal.
  6. space — contains whitespace but not enough of a word-boundary match for the above categories (rare; gaps at the start or end).
  7. mid_word — otherwise.
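
The priority order above can be sketched in a few lines. This is an illustrative re-implementation under the stated rules, not tamga's classify_ngram; for simplicity it treats any whitespace in the n-gram as disqualifying whole_word:

```python
import string

def classify_sketch(ngram: str, left: str, right: str) -> str:
    """Apply the seven rules in priority order (boundary = space or text edge)."""
    def boundary(c: str) -> bool:
        return c == "" or c.isspace()

    if any(c in string.punctuation for c in ngram):
        return "punct"                                    # rule 1
    if boundary(left) and boundary(right) and not any(c.isspace() for c in ngram):
        return "whole_word"                               # rule 2
    if any(c.isspace() for c in ngram[1:-1]):
        return "multi_word"                               # rule 3: internal whitespace
    if boundary(left) and not ngram[-1].isspace():
        return "prefix"                                   # rule 4
    if boundary(right) and not ngram[0].isspace():
        return "suffix"                                   # rule 5
    if any(c.isspace() for c in ngram):
        return "space"                                    # rule 6
    return "mid_word"                                     # rule 7

classify_sketch("the", " ", " ")   # "whole_word"
classify_sketch("the", " ", "r")   # "prefix"  ("the" in "there")
classify_sketch("ing", "n", " ")   # "suffix"  ("ing" in "running")
classify_sketch("e d", "h", "o")   # "multi_word"
```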

tamga.forensic.distortion.distort_corpus

distort_corpus(corpus: Corpus, *, mode: DistortionMode = 'dv_ma', function_words: frozenset[str] | set[str] | list[str] | None = None, language: str | None = None) -> Corpus

Produce a new Corpus with each document's text distorted.

Document ids and metadata are preserved unchanged; metadata["distortion_mode"] is set on each new Document to record how it was produced.

Parameters:

corpus (Corpus, required)

mode ("dv_ma" | "dv_sa", default "dv_ma")

function_words (iterable of str, default None)
    Words to preserve verbatim. If None, the bundled list for language (or corpus.language when language is not given) is used.

language (str, default None)
    Language code overriding corpus.language for function-word selection.

tamga.forensic.distortion.distort_text

distort_text(text: str, *, mode: DistortionMode = 'dv_ma', function_words: frozenset[str] | set[str] | list[str] | None = None, language: str = 'en') -> str

Apply Stamatatos distortion to a single string.

Parameters:

text (str, required)
    Input text.

mode ("dv_ma" | "dv_sa", default "dv_ma")
    Distortion variant. dv_ma preserves word length; dv_sa collapses each content word to one *.

function_words (iterable of str, default None)
    Words to preserve verbatim. If None, the bundled list for language is used.

language (str, default "en")
    Language code ("en", "tr", "de", "es", "fr") selecting the bundled function-word list when function_words is None. Ignored otherwise.

Returns:

str
    The distorted text: identical length to the input for DV-MA, shorter for DV-SA. All non-word characters (spaces, punctuation, digits) are preserved.