Features

Every feature extractor returns a FeatureMatrix — the shared numeric envelope that methods consume.

The FeatureMatrix

@dataclass
class FeatureMatrix:
    X: np.ndarray            # (n_docs, n_features)
    document_ids: list[str]
    feature_names: list[str]
    feature_type: str
    extractor_config: dict[str, Any]
    provenance_hash: str

Key properties:

  • fm.n_features for the number of feature columns; len(fm) for n_docs
  • fm.as_dataframe() — pandas DataFrame indexed by document_ids
  • fm.concat(other) — column-concatenate two matrices with identical row ids
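
The concat contract can be illustrated with a small stand-in. This is only a sketch, not tamga's implementation: MiniMatrix is a hypothetical class that uses nested lists in place of np.ndarray, and raising ValueError on mismatched row ids is an assumption about how the check might behave.

```python
from dataclasses import dataclass


@dataclass
class MiniMatrix:
    """Hypothetical stand-in for FeatureMatrix (lists instead of np.ndarray)."""
    X: list[list[float]]          # (n_docs, n_features)
    document_ids: list[str]
    feature_names: list[str]

    def __len__(self) -> int:
        # len(fm) is the number of documents (rows).
        return len(self.document_ids)

    @property
    def n_features(self) -> int:
        return len(self.feature_names)

    def concat(self, other: "MiniMatrix") -> "MiniMatrix":
        # Column-wise concatenation only makes sense when the rows line up.
        if self.document_ids != other.document_ids:
            raise ValueError("row ids differ; cannot column-concatenate")
        rows = [a + b for a, b in zip(self.X, other.X)]
        return MiniMatrix(rows, list(self.document_ids),
                          self.feature_names + other.feature_names)


mfw = MiniMatrix([[0.1, 0.2], [0.3, 0.4]], ["d1", "d2"], ["the", "of"])
punct = MiniMatrix([[5.0], [7.0]], ["d1", "d2"], [","])
combined = mfw.concat(punct)   # 2 rows, 3 feature columns
```

The row-id check is the important part: concatenating matrices built from differently ordered corpora would silently pair features from different documents.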

Available extractors

Import from tamga:

| Extractor | Input | Output |
| --- | --- | --- |
| MFWExtractor(n=..., scale=..., lowercase=...) | Corpus | top-n word relative frequencies (z-scored, L1, L2, or raw) |
| CharNgramExtractor(n=..., include_boundaries=...) | Corpus | character n-gram counts (delegates to sklearn CountVectorizer) |
| WordNgramExtractor(n=..., lowercase=...) | Corpus | word n-gram counts |
| PosNgramExtractor(n=..., coarse=...) | Corpus | spaCy POS n-grams |
| DependencyBigramExtractor() | Corpus | (head_lemma, dep, child_lemma) triples |
| FunctionWordExtractor(wordlist=...) | Corpus | bundled English function-word frequencies |
| PunctuationExtractor() | Corpus | ASCII punctuation frequencies |
| ReadabilityExtractor() | Corpus | six readability indices (Flesch, FK-grade, Gunning Fog, Coleman-Liau, ARI, SMOG) |
| SentenceLengthExtractor() | Corpus | mean, SD, skew of per-sentence tokens |
| LexicalDiversityExtractor() | Corpus | TTR, MATTR, MTLD, HD-D, Yule's K/I, Herdan's C, Simpson's D |
| SentenceEmbeddingExtractor(model=...) | Corpus | sentence-transformers pooled embedding (extra: tamga[embeddings]) |
| ContextualEmbeddingExtractor(model=..., pooling=...) | Corpus | HF transformer hidden-state vectors (extra: tamga[embeddings]) |

Extractor detail

Each extractor above is a callable object; fit_transform(corpus) returns a FeatureMatrix.

MFWExtractor

MFWExtractor(n=200, scale="zscore", lowercase=True)

Use when: you want the canonical stylometric feature, the relative frequencies of the most frequent words. It is the default choice for Delta-family attribution. Don't use when: your corpus is very small (<200 unique tokens), or the question calls for topic-invariant features (MFW is topic-sensitive; see CategorizedCharNgramExtractor). Expect: an (n_docs, n) float matrix; rows sum to ~1 under scale="l1", and columns are zero-centred with unit variance under scale="zscore".
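
To make "relative frequencies of the most frequent words" concrete, here is a minimal pure-Python sketch. It is not the MFWExtractor implementation: the function name is hypothetical, tokenisation is plain whitespace splitting, and frequencies are taken relative to each document's total token count, all of which are assumptions made for illustration.

```python
from collections import Counter


def mfw_relative_frequencies(docs: list[str], n: int) -> tuple[list[str], list[list[float]]]:
    """Relative frequencies of the n most frequent corpus words, per document."""
    tokenised = [doc.lower().split() for doc in docs]
    # Rank words by their total count over the whole corpus.
    corpus_counts = Counter(tok for toks in tokenised for tok in toks)
    vocab = [word for word, _ in corpus_counts.most_common(n)]
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        total = len(toks) or 1   # guard against empty documents
        rows.append([counts[word] / total for word in vocab])
    return vocab, rows


docs = ["the cat and the dog", "the bird and the cat and the fish"]
vocab, rows = mfw_relative_frequencies(docs, n=3)
# vocab is ["the", "and", "cat"]; rows[0] is [0.4, 0.2, 0.2]
```

The vocabulary is chosen corpus-wide but the frequencies are per document, which is why every document shares the same columns.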

CharNgramExtractor

CharNgramExtractor(n=3, include_boundaries=True)

Use when: you want features that capture sub-word style (prefixes, suffixes, punctuation adjacency) and that cope with OOV words or misspellings. Don't use when: your corpus mixes scripts (n-grams across scripts produce noise), or you specifically need word-level semantic sensitivity. Expect: a sparse count matrix; counting is delegated to sklearn's CountVectorizer.
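
The sketch below shows why boundary-aware character n-grams pick up prefixes and suffixes. It is a pure-Python illustration, not what CharNgramExtractor runs internally: char_ngrams is a hypothetical helper, and treating include_boundaries as "pad each word with one space on either side" is an assumption about the flag's meaning.

```python
from collections import Counter


def char_ngrams(text: str, n: int = 3, include_boundaries: bool = True) -> Counter:
    """Count character n-grams word by word (hypothetical helper)."""
    counts: Counter = Counter()
    for word in text.split():
        # Padding marks word edges, so " th" (word-initial) differs from "the".
        padded = f" {word} " if include_boundaries else word
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


with_edges = char_ngrams("the theme", n=3)
without_edges = char_ngrams("the theme", n=3, include_boundaries=False)
```

With padding, " th" and "he " appear as features of their own, so word-initial and word-final habits are separated from word-internal ones.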

WordNgramExtractor

WordNgramExtractor(n=1, lowercase=True)

Use when: you need unigrams (the MFW equivalent) or short bigram phrases and you don't want z-scoring. Bigrams are useful for detecting fixed expressions. Don't use when: n ≥ 3 in small corpora, where sparsity dominates. Prefer MFWExtractor for unigrams unless you need raw counts. Expect: sparse count matrix; vocabulary grows fast with n.

PosNgramExtractor

PosNgramExtractor(n=2, coarse=False)

Use when: you want syntactic-style features (sequences of part-of-speech tags) that are insensitive to content words but sensitive to register and syntactic habits. Don't use when: your spaCy pipeline doesn't include a tagger (most _trf models do include one), or your documents are very short. Expect: sparse count matrix over POS n-grams. coarse=True uses UD coarse tags (fewer dimensions, more robust).
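
A small sketch of why POS n-grams ignore content words. The tags are hand-assigned (word, tag) pairs standing in for a tagger's output, and pos_ngrams is a hypothetical helper rather than part of tamga.

```python
from collections import Counter


def pos_ngrams(tags: list[str], n: int = 2) -> Counter:
    """Count n-grams over a sequence of POS tags (hypothetical helper)."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hand-tagged sentences with UD coarse tags; a real pipeline would supply these.
sent_a = [("the", "DET"), ("dog", "NOUN"), ("chased", "VERB"), ("a", "DET"), ("ball", "NOUN")]
sent_b = [("a", "DET"), ("judge", "NOUN"), ("signed", "VERB"), ("the", "DET"), ("order", "NOUN")]

counts_a = pos_ngrams([tag for _, tag in sent_a])
counts_b = pos_ngrams([tag for _, tag in sent_b])
```

The two sentences share no content words, yet their tag-bigram counts are identical, which is the property that makes these features topic-insensitive.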

DependencyBigramExtractor

DependencyBigramExtractor()

Use when: you want syntax-sensitive style features — specifically, the (head-lemma, dependency-relation, child-lemma) triples parsed by spaCy. Don't use when: your parser is a bottleneck; dependency parsing is the slowest step in the spaCy pipeline and you may be able to substitute POS n-grams. Expect: sparse count matrix over dependency triples.

FunctionWordExtractor

FunctionWordExtractor(wordlist=None)

Use when: you want the short, topic-insensitive function-word list (the classic anti-topic signal for stylometry) for the document's language. Don't use when: your corpus mixes languages without a per-doc language tag — the per-language word list won't apply. Expect: (n_docs, |wordlist|) relative-frequency matrix. Defaults come from the bundled per-language list (see Languages).

PunctuationExtractor

PunctuationExtractor()

Use when: you want pure-style features that are nearly topic-invariant — punctuation usage is remarkably author-specific and corpus-robust. Don't use when: your source text has been normalised or stripped of punctuation (e.g., OCR output without correction). Expect: (n_docs, ~20) matrix of ASCII punctuation relative frequencies.

ReadabilityExtractor

ReadabilityExtractor()

Use when: you want readability-as-style — Flesch, FK-grade, Gunning Fog, etc. — as a lightweight feature set to combine with MFW. Don't use when: readability itself is the question (for that, read the metric directly; don't bundle into a Delta). For non-English, use the per-language native-formula variant — see concepts/languages.md. Expect: (n_docs, 6) matrix of readability indices (English defaults: Flesch, FK-grade, Gunning Fog, Coleman-Liau, ARI, SMOG).

SentenceLengthExtractor

SentenceLengthExtractor()

Use when: you want the sentence-rhythm signature — mean, SD, and skew of per-sentence token counts. Small but strong stylistic signal. Don't use when: your text has aggressive sentence-boundary errors (e.g., ALL CAPS legal text breaks most sentencizers). Expect: (n_docs, 3) matrix: [mean, std, skew].
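
The three statistics can be sketched in pure Python as follows. Two assumptions are made here that the page does not specify: the SD is the population SD, and skew is the standardised third moment. sentence_length_stats is a hypothetical helper that takes token counts already split per sentence.

```python
from statistics import mean, pstdev


def sentence_length_stats(lengths: list[int]) -> tuple[float, float, float]:
    """Return (mean, SD, skew) of per-sentence token counts."""
    mu = mean(lengths)
    sd = pstdev(lengths)          # population SD (an assumption)
    if sd == 0:
        return mu, 0.0, 0.0       # all sentences equally long: no spread, no skew
    # Skew as the mean cubed z-score.
    skew = sum(((x - mu) / sd) ** 3 for x in lengths) / len(lengths)
    return mu, sd, skew


mostly_short = sentence_length_stats([4, 6, 5, 21])   # one long outlier
balanced = sentence_length_stats([3, 5, 7])           # symmetric lengths
```

A writer with mostly short sentences and an occasional long one shows positive skew; a symmetric distribution of lengths gives skew near zero.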

LexicalDiversityExtractor

LexicalDiversityExtractor()

Use when: you want vocabulary-richness features — TTR, MATTR, MTLD, HD-D, Yule's K/I, Herdan's C, Simpson's D. Eight indices let you compare sensitivities. Don't use when: your documents are very short (<200 tokens); most indices become unstable. Expect: (n_docs, 8) matrix; columns are the 8 indices.
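
Three of the indices are simple enough to sketch directly from their textbook definitions. The function names and the window size are illustrative choices, not tamga's API or defaults.

```python
from collections import Counter


def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct words / total words."""
    return len(set(tokens)) / len(tokens)


def mattr(tokens: list[str], window: int) -> float:
    """Moving-average TTR: mean TTR over every window of fixed size."""
    if len(tokens) <= window:
        return ttr(tokens)
    spans = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(span) for span in spans) / len(spans)


def yules_k(tokens: list[str]) -> float:
    """Yule's K = 10^4 * (sum_m m^2 * V_m - N) / N^2, V_m = types seen m times."""
    n = len(tokens)
    freq_of_freqs = Counter(Counter(tokens).values())
    s2 = sum(m * m * v_m for m, v_m in freq_of_freqs.items())
    return 1e4 * (s2 - n) / (n * n)


tokens = "a b a c a b".split()
scores = (ttr(tokens), mattr(tokens, window=3), yules_k(tokens))
```

Plain TTR falls as a document gets longer, which is one reason to compare it against window-based or frequency-spectrum indices such as MATTR and Yule's K.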

SentenceEmbeddingExtractor

SentenceEmbeddingExtractor(model="paraphrase-MiniLM-L6-v2")

Use when: you want a modern neural-embedding feature set, with pooled sentence-transformer output per document. Strong for classification and clustering; fast enough for moderate corpora. Don't use when: your hardware lacks GPU / MPS and your corpus is large (CPU inference is slow), or when interpretability matters (these vectors are opaque). Expect: (n_docs, embedding_dim) dense matrix. Requires tamga[embeddings].

ContextualEmbeddingExtractor

ContextualEmbeddingExtractor(model="bert-base-multilingual-cased", pooling="mean")

Use when: you want HuggingFace-model hidden states aggregated per document — language-specific embeddings (e.g., dbmdz/bert-base-turkish-cased for Turkish) with configurable pooling. Don't use when: you don't need a specific model's representation — use SentenceEmbeddingExtractor for a lighter, faster default. Expect: (n_docs, hidden_dim) dense matrix. Requires tamga[embeddings].

Composing features

Two ways to build a multi-feature matrix:

Python

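from tamga import MFWExtractor, PunctuationExtractor

# corpus is a tamga Corpus that has already been loaded.
mfw = MFWExtractor(n=200, scale="zscore").fit_transform(corpus)
punct = PunctuationExtractor().fit_transform(corpus)
# Both matrices share the same row ids, so they can be joined column-wise.
combined = mfw.concat(punct)  # (n_docs, n_mfw + n_punct)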

study.yaml

features:
  - id: mfw
    type: mfw
    n: 200
    scale: zscore
  - id: punct
    type: punctuation

Methods can reference feature ids; the runner builds each matrix once and reuses it.

Forensic feature extractors

Two tools for topic-invariant features live under tamga.forensic: an extractor and a corpus transform.

CategorizedCharNgramExtractor

CategorizedCharNgramExtractor(n=4, categories=("prefix","suffix","punct"))

Use when: you want topic-invariant character-level features for forensic verification — n-grams classified by position in the word so you can keep only the style-carrying categories (affixes, punctuation) and drop the topic-sensitive whole-word category. Don't use when: topic robustness isn't the goal — a plain CharNgramExtractor is faster and carries more signal per dimension. Expect: sparse count matrix restricted to the chosen n-gram categories.

Based on Sapkota et al. (2015). categories=("prefix","suffix","punct") is the affix-plus-punctuation recipe that generalises best across topics.
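
The idea of classifying n-grams by position can be sketched as below. This is a deliberately simplified illustration, not the extractor's logic: the category rules are assumptions (Sapkota et al. define a finer-grained scheme), tokens are split on whitespace only, and both function names are hypothetical.

```python
import string
from collections import Counter

PUNCT = set(string.punctuation)


def categorise(token: str, n: int):
    """Yield (category, ngram) pairs for one whitespace-delimited token."""
    for i in range(len(token) - n + 1):
        gram = token[i:i + n]
        if any(ch in PUNCT for ch in gram):
            yield "punct", gram
        elif len(token) == n:
            yield "whole_word", gram      # topic-sensitive: the word itself
        elif i == 0:
            yield "prefix", gram
        elif i == len(token) - n:
            yield "suffix", gram
        else:
            yield "mid_word", gram


def categorised_counts(text: str, n: int = 4,
                       categories=("prefix", "suffix", "punct")) -> Counter:
    """Count only the n-grams whose category is in `categories`."""
    keep = set(categories)
    counts: Counter = Counter()
    for token in text.lower().split():
        for category, gram in categorise(token, n):
            if category in keep:
                counts[(category, gram)] += 1
    return counts


counts = categorised_counts("Good stylometry works, mostly.")
```

Here "good" is a whole-word 4-gram and is dropped, while "styl" (prefix), "etry" (suffix), and "rks," (punct) are kept.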

distort_corpus

distort_corpus(corpus, mode="dv_ma")

Use when: you want Stamatatos (2013) topic masking — replaces content words with placeholders while keeping function words and punctuation. Pair with any extractor for a topic-invariant pipeline. Don't use when: your analysis needs content-word signal (e.g., Zeta looking for distinctive vocabulary). Expect: a new Corpus object you feed to any existing extractor. Modes: "dv_ma" masks all content words, "dv_sa" masks selectively.
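
As a rough picture of what topic masking does to a text, the sketch below replaces every word outside a small function-word set with asterisks. It does not reproduce distort_corpus or either of its modes: mask_content_words is a hypothetical helper, the word list is a tiny illustrative sample, and one asterisk per letter is an arbitrary placeholder choice.

```python
import re

# Tiny illustrative list; a real list would be much longer and per-language.
FUNCTION_WORDS = {"the", "a", "of", "and", "in", "to", "is", "on"}


def mask_content_words(text: str) -> str:
    """Replace alphabetic words not in FUNCTION_WORDS with asterisks."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        if word.lower() in FUNCTION_WORDS:
            return word                   # keep function words untouched
        return "*" * len(word)            # mask content words, keep length
    # Punctuation and whitespace are not matched, so they pass through.
    return re.sub(r"[A-Za-z]+", repl, text)


masked = mask_content_words("The cat sat on the mat, and purred.")
# masked == "The *** *** on the ***, and ******."
```

Any extractor run on the masked text only sees function words, punctuation, and word-length patterns, which is what makes the downstream features topic-invariant.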

See Topic-invariant features.

Scaling

Most extractors accept scale ∈ {"none", "zscore", "l1", "l2"}:

  • none — raw counts. Use for Bayesian Wallace–Mosteller.
  • l1 — relative frequencies (row sums to 1). Use for Zeta-like contrast methods.
  • l2 — unit-norm rows. Use for cosine-based distances.
  • zscore — per-column z-score on training means / SDs (Stylo convention). Required for Burrows Delta.

The z-score means and SDs are learned at fit time and applied at transform time, so scores on unseen documents use the training distribution.
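
The fit/transform split and the two row normalisations can be sketched in pure Python. This is an illustration under stated assumptions, not tamga's code: ZScaler, l1, and l2 are hypothetical names, the sample SD (n - 1 denominator) is assumed, and zero-variance columns are left unscaled.

```python
from math import sqrt
from statistics import mean, stdev


def l1(row: list[float]) -> list[float]:
    """Scale a row so its absolute values sum to 1."""
    total = sum(abs(x) for x in row)
    return [x / total for x in row] if total else list(row)


def l2(row: list[float]) -> list[float]:
    """Scale a row to unit Euclidean norm."""
    norm = sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm else list(row)


class ZScaler:
    """Per-column z-scoring with statistics learned from training rows only."""

    def fit(self, rows: list[list[float]]) -> "ZScaler":
        cols = list(zip(*rows))
        self.means = [mean(col) for col in cols]
        # Sample SD is an assumption; `or 1.0` avoids dividing by zero.
        self.sds = [stdev(col) or 1.0 for col in cols]
        return self

    def transform(self, rows: list[list[float]]) -> list[list[float]]:
        return [[(x - m) / s for x, m, s in zip(row, self.means, self.sds)]
                for row in rows]


train = [[1.0, 10.0], [3.0, 10.0], [5.0, 40.0]]
scaler = ZScaler().fit(train)
unseen_z = scaler.transform([[7.0, 20.0]])   # scored against training mean / SD
```

The unseen document is never used to compute the means or SDs, so adding or removing test documents does not change any training document's scores.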

Next

  • Methods — take the FeatureMatrix and produce a Result.