tamga¶

Computational stylometry for authorship attribution, author-group comparison, and forensic-linguistic analysis. A Python replacement for R's Stylo, with a modern NLP pipeline (spaCy, transformer embeddings), a Bayesian layer (PyMC), and a full forensic-evidential toolkit on top.

Named after the tamga — the Turkic clan-mark by which individual and familial identity was recognised at a glance — the material-culture counterpart to a stylistic fingerprint.

Architecture¶

corpus → features → methods → forensic → output

Every layer is sklearn-compatible; every Result carries full provenance (corpus hash, feature hash, seed, spaCy version, timestamp, resolved config) so a study written as a study.yaml is reproducible to the exact random draw years later.

Getting started

Install tamga, build your first corpus, and run a Burrows Delta study from the CLI.

Install & quickstart
Concepts

Corpus → Features → Methods → Results. The four layers of the pipeline, explained.

Concepts
Forensic toolkit

General Impostors verification, Unmasking, LR output + calibration, PAN evaluation.

Forensic toolkit
Tutorials

Reproduce Mosteller & Wallace on the Federalist Papers; run PAN-style forensic verification end-to-end.

Tutorials

What's in the box¶

Layer	Highlights
Corpus	`.txt` + TSV metadata ingestion, filter / groupby, content-addressed hashing
Languages	EN / TR / DE / ES / FR first-class — per-language function words, readability formulas, contextual/sentence embedding defaults. Turkish via Stanford Stanza (BOUN) through `spacy-stanza`
Features	MFW, char / word / POS n-grams, dependency bigrams, function words, punctuation, readability (EN + TR/DE/ES/FR native formulas), sentence length, lexical diversity, sentence + contextual embeddings
Methods	Burrows / Eder / Argamon / Cosine / Quadratic Delta; Zeta; PCA / UMAP / t-SNE / MDS; Ward / k-means / HDBSCAN; bootstrap consensus; sklearn classify + CV; Wallace–Mosteller Bayesian
Forensic	General Impostors, Unmasking, Stamatatos distortion, Sapkota n-gram categories, Platt / isotonic calibration, log-LR + C_llr + AUC + c@1 + F0.5u + ECE + Brier + Tippett, PANReport, chain-of-custody Provenance, LR-framed HTML report

Status¶

Phase 5 landed — visualisation, Jinja2 reports, declarative runner (tamga run), and a Rich-based interactive tamga shell.

Forensic phase landed — six additions (General Impostors, LR + calibration + evaluation metrics, Sapkota categories + Stamatatos distortion, Unmasking, chain-of-custody + forensic report template, PAN harness).

Multi-language phase landed — first-class support for English, Turkish, German, Spanish, French behind a tamga.languages registry. Turkish parses through Stanford Stanza (BOUN treebank) via spacy-stanza, returning native spaCy Doc objects so every feature extractor works unchanged. Native readability formulas per language (Ateşman + Bezirci–Yılmaz for Turkish, Flesch-Amstad + Wiener Sachtextformel for German, Fernández-Huerta + Szigriszt-Pazos for Spanish, Kandel–Moles + LIX for French). Function-word lists generated reproducibly from UD closed-class tokens.

Docs site landed — this MkDocs Material site with Concepts, Forensic toolkit, Federalist + PAN-CLEF + Turkish tutorials, and CLI/API reference. 417 tests passing.

Docs site is multilingual — English (default) and Turkish (/tr/) launched via mkdocs-static-i18n; DE/ES/FR infrastructure ready, translation content deferred.

Remaining — PyPI publish.

License & citation¶

BSD-3-Clause. See LICENSE.

If you use tamga in published work, please cite it via CITATION.cff.

tamga¶

Architecture¶

Quick navigation¶

What's in the box¶

Status¶

License & citation¶