tamga
Computational stylometry for authorship attribution, author-group comparison, and forensic-linguistic analysis. A Python replacement for R's Stylo, with a modern NLP pipeline (spaCy, transformer embeddings), a Bayesian layer (PyMC), and a full forensic-evidential toolkit on top.
Named after the tamga, the Turkic clan mark by which individual and familial identity was recognised at a glance: the material-culture counterpart to a stylistic fingerprint.
Architecture
Every layer is sklearn-compatible; every Result carries full provenance (corpus hash,
feature hash, seed, spaCy version, timestamp, resolved config) so a study written as a
study.yaml is reproducible to the exact random draw years later.
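As a rough illustration of what a provenance record of this kind involves, the sketch below hashes a corpus by content and bundles the hash with a seed, timestamp, and config. The names `corpus_hash` and `ProvenanceSketch`, the file names, and the config keys are hypothetical and are not tamga's actual API or schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


def corpus_hash(documents: dict[str, str]) -> str:
    """Hash document ids and texts in sorted order, so insertion order is irrelevant."""
    digest = hashlib.sha256()
    for doc_id in sorted(documents):
        digest.update(doc_id.encode("utf-8"))
        digest.update(b"\x00")  # separator so id and text cannot run together
        digest.update(documents[doc_id].encode("utf-8"))
        digest.update(b"\x00")
    return digest.hexdigest()


@dataclass(frozen=True)
class ProvenanceSketch:
    """Hypothetical record; a real one would also carry feature hash and library versions."""
    corpus_hash: str
    seed: int
    timestamp: str
    config: dict

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder corpus, for illustration only.
docs = {"doc_a.txt": "first document text", "doc_b.txt": "second document text"}
record = ProvenanceSketch(
    corpus_hash=corpus_hash(docs),
    seed=42,
    timestamp=datetime.now(timezone.utc).isoformat(),
    config={"method": "burrows_delta", "mfw": 100},
)
print(record.to_json())
```

Because the hash depends only on content, any edit to a single text changes the hash, which is what lets a stored result be tied to the exact corpus it was computed on.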
Quick navigation
- Getting started: Install tamga, build your first corpus, and run a Burrows Delta study from the CLI.
- Concepts: Corpus → Features → Methods → Results. The four layers of the pipeline, explained.
- Forensic toolkit: General Impostors verification, Unmasking, LR output + calibration, PAN evaluation.
- Tutorials: Reproduce Mosteller & Wallace on the Federalist Papers; run PAN-style forensic verification end-to-end.
What's in the box
| Layer | Highlights |
|---|---|
| Corpus | .txt + TSV metadata ingestion, filter / groupby, content-addressed hashing |
| Languages | EN / TR / DE / ES / FR first-class — per-language function words, readability formulas, contextual/sentence embedding defaults. Turkish via Stanford Stanza (BOUN) through spacy-stanza |
| Features | MFW, char / word / POS n-grams, dependency bigrams, function words, punctuation, readability (EN + TR/DE/ES/FR native formulas), sentence length, lexical diversity, sentence + contextual embeddings |
| Methods | Burrows / Eder / Argamon / Cosine / Quadratic Delta; Zeta; PCA / UMAP / t-SNE / MDS; Ward / k-means / HDBSCAN; bootstrap consensus; sklearn classify + CV; Wallace–Mosteller Bayesian |
| Forensic | General Impostors, Unmasking, Stamatatos distortion, Sapkota n-gram categories, Platt / isotonic calibration, log-LR + C_llr + AUC + c@1 + F0.5u + ECE + Brier + Tippett, PANReport, chain-of-custody Provenance, LR-framed HTML report |
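To make the Methods row concrete, here is a minimal NumPy sketch of classic Burrows Delta: z-score the relative frequencies of the most frequent words, then take the mean absolute z-score difference between the disputed text and each candidate. It is not tamga's implementation. The function name, the toy frequencies, and the choice to estimate mean and standard deviation from the reference texts only (with `ddof=1`) are assumptions made for illustration.

```python
import numpy as np


def burrows_delta(reference: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Return the Delta distance from `query` to every row of `reference`.

    reference: (n_docs, n_words) relative frequencies of the most frequent words.
    query:     (n_words,) relative frequencies for the disputed text.
    """
    mean = reference.mean(axis=0)
    std = reference.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # avoid division by zero for words with constant frequency
    z_reference = (reference - mean) / std
    z_query = (query - mean) / std
    # Mean absolute difference of z-scores across the word list.
    return np.abs(z_reference - z_query).mean(axis=1)


# Toy relative frequencies for three candidate texts over three words.
reference = np.array(
    [
        [0.050, 0.030, 0.020],
        [0.040, 0.035, 0.010],
        [0.060, 0.020, 0.030],
    ]
)
query = reference[0].copy()  # a query identical to the first candidate
deltas = burrows_delta(reference, query)
print(deltas.argmin())  # index of the nearest candidate
```

The smallest Delta marks the stylistically closest candidate. The other Delta variants in the table differ mainly in the distance taken over the same z-scored frequencies.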
Status
Phase 5 landed — visualisation, Jinja2 reports, declarative runner (tamga run), and a
Rich-based interactive tamga shell.
Forensic phase landed — six additions (General Impostors, LR + calibration + evaluation metrics, Sapkota categories + Stamatatos distortion, Unmasking, chain-of-custody + forensic report template, PAN harness).
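For readers new to the LR evaluation metrics, the log-likelihood-ratio cost C_llr can be sketched in a few lines. This follows the standard definition, half the sum of the mean of log2(1 + 1/LR) over same-author trials and the mean of log2(1 + LR) over different-author trials. The function name `cllr` and the example values are illustrative and do not come from tamga.

```python
import numpy as np


def cllr(lr_same, lr_diff) -> float:
    """Log-likelihood-ratio cost.

    lr_same: likelihood ratios from trials where the authors truly match.
    lr_diff: likelihood ratios from trials where the authors truly differ.
    """
    lr_same = np.asarray(lr_same, dtype=float)
    lr_diff = np.asarray(lr_diff, dtype=float)
    penalty_same = np.mean(np.log2(1.0 + 1.0 / lr_same))  # penalises low LRs on true matches
    penalty_diff = np.mean(np.log2(1.0 + lr_diff))  # penalises high LRs on true non-matches
    return float(0.5 * (penalty_same + penalty_diff))


uninformative = cllr([1.0, 1.0], [1.0, 1.0])  # always reporting LR = 1 costs exactly 1
informative = cllr([100.0], [0.01])  # strong, correct LRs cost close to 0
print(uninformative, informative)
```

A system that always reports LR = 1 scores exactly 1, so values below 1 indicate LRs that carry useful, reasonably calibrated information.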
Multi-language phase landed — first-class support for English, Turkish, German, Spanish,
French behind a tamga.languages registry. Turkish parses through Stanford Stanza (BOUN
treebank) via spacy-stanza, returning native spaCy Doc objects so every feature extractor
works unchanged. Native readability formulas per language (Ateşman + Bezirci–Yılmaz for Turkish,
Flesch-Amstad + Wiener Sachtextformel for German, Fernández-Huerta + Szigriszt-Pazos for
Spanish, Kandel–Moles + LIX for French). Function-word lists generated reproducibly from
UD closed-class tokens.
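As an example of how simple these readability formulas are, here is a sketch of LIX: mean sentence length plus the percentage of words longer than six letters. The regex-based sentence and word splitting below is a deliberate simplification and an assumption of this sketch. A pipeline like the one described above would presumably count sentences and tokens from the parsed spaCy `Doc` instead.

```python
import re


def lix(text: str) -> float:
    """LIX = words / sentences + 100 * long_words / words, long meaning more than 6 letters."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[^\W\d_]+", text)  # runs of Unicode letters
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    long_words = sum(1 for word in words if len(word) > 6)
    return len(words) / len(sentences) + 100 * long_words / len(words)


sample = "Le chat dort. La stylométrie compare des textes."
# 8 words over 2 sentences, 2 long words: 8 / 2 + 100 * 2 / 8 = 29.0
print(lix(sample))
```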
Docs site landed — this MkDocs Material site with Concepts, Forensic toolkit, Federalist + PAN-CLEF + Turkish tutorials, and CLI/API reference. 417 tests passing.
Docs site is multilingual — English (default) and Turkish (/tr/) launched via
mkdocs-static-i18n; DE/ES/FR infrastructure ready, translation content deferred.
Remaining — PyPI publish.
License & citation
BSD-3-Clause. See LICENSE.
If you use tamga in published work, please cite it via
CITATION.cff.