Skip to content

Concepts

What tamga is for

tamga answers three questions about who wrote a text:

  • Attribution — which of a set of candidate authors most likely wrote this document?
  • Verification — was this document written by this specific person?
  • Group comparison — how does one author's style differ from another's, or from a defined group?

On top of those core questions it ships a forensic layer: calibrated likelihood ratios, chain-of-custody metadata, and evaluation metrics tuned for courtroom use.

Not sure which question you're asking? Start with the Choosing a method guide.

The four layers

tamga is organised around four layers, each with one primary return type. Understanding these four types lets you compose anything tamga can do:

Layer Type Purpose
Corpus Corpus (wraps Documents) texts + metadata
Features FeatureMatrix numeric document-feature table
Methods Result outcome of an analysis
Reporting report file publication- or forensic-ready artifact

The pipeline

flowchart LR
  C["Corpus<br><code>.txt + metadata.tsv</code>"] --> F["FeatureMatrix<br>(n_docs × n_features)"]
  F --> M["Result<br>(method-specific)"]
  M --> R["Report<br>(HTML / Markdown)"]

  style C fill:#FBF3DE,stroke:#C9A34A
  style F fill:#FBF3DE,stroke:#C9A34A
  style M fill:#FBF3DE,stroke:#C9A34A
  style R fill:#FBF3DE,stroke:#C9A34A

Each arrow is a boundary where data serialises cleanly to disk:

  • Corpus → FeatureMatrix: cached parquet + spaCy DocBin.
  • FeatureMatrix → Result: every result directory contains result.json + optional table_*.parquet + figures.
  • Result → Report: Jinja2 templates render the result directory to a single HTML or Markdown file.

Provenance, everywhere

Every Result carries a Provenance record with:

  • tamga version, Python version, spaCy model + version
  • corpus hash (content-addressed)
  • feature hash (config + corpus hash)
  • seed used for the run
  • timestamp
  • resolved study.yaml config

Plus six optional forensic fields (questioned / known descriptions, hypothesis pair, acquisition + custody notes, source-file SHA-256s) for chain-of-custody.

Two runs of the same study.yaml against the same corpus produce byte-identical result.json under matching seeds. See Results & provenance.