Python API

Auto-generated from the source via mkdocstrings. Every symbol listed below is re-exported at the tamga top level (unless otherwise noted).

Corpus

Corpus dataclass

An ordered collection of Documents that share a metadata schema.

Iteration yields Documents in the order provided. Equality and hashing ignore order: two Corpora containing the same documents in different orders compare equal and hash identically.

__getitem__

__getitem__(index: int | slice | Sequence[int] | ndarray) -> Document | Corpus

Index by int → Document; by slice or array-like → Corpus.

The array-like branch is needed for sklearn's cross-validation splitters, which slice X with an ndarray of fold indices.

filter

filter(**query: Any) -> Corpus

Return a new Corpus containing documents whose metadata matches every key in query.

Values may be scalars (exact match) or lists (membership).
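The matching rule (scalar values match exactly, list values match by membership) can be sketched in plain Python over bare metadata dicts. This is an illustrative re-implementation of the documented contract, not tamga's actual code:

```python
def matches(metadata: dict, query: dict) -> bool:
    """Illustrative sketch of the filter rule: scalar query values
    require an exact match, list values require membership."""
    for key, wanted in query.items():
        value = metadata.get(key)
        if isinstance(wanted, list):
            if value not in wanted:
                return False
        elif value != wanted:
            return False
    return True

docs = [
    {"author": "A", "genre": "letter"},
    {"author": "B", "genre": "novel"},
    {"author": "A", "genre": "novel"},
]
# Keep author A AND genre in {novel, essay}: only the third dict survives.
kept = [d for d in docs if matches(d, {"author": "A", "genre": ["novel", "essay"]})]
```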

groupby

groupby(field_name: str) -> dict[Any, Corpus]

Group documents by a metadata field value.

Raises KeyError if any document lacks the field.
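The grouping contract, including the KeyError on a missing field, amounts to the following sketch over plain metadata dicts (not tamga's implementation):

```python
def group_documents(metadata_rows, field_name):
    """Sketch of the groupby contract: one bucket per distinct field
    value; a document lacking the field raises KeyError."""
    groups = {}
    for row in metadata_rows:
        key = row[field_name]  # KeyError propagates if the field is absent
        groups.setdefault(key, []).append(row)
    return groups

rows = [{"author": "A"}, {"author": "B"}, {"author": "A"}]
by_author = group_documents(rows, "author")  # {"A": [...2 rows...], "B": [...]}
```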

metadata_column

metadata_column(field_name: str) -> list[Any]

Return the list of metadata values at field_name, in document order.

Missing values become None; use filter first if you want to exclude them.
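The missing-value behaviour amounts to a dict.get per document, in order. A minimal sketch over plain metadata dicts (not the library's implementation):

```python
rows = [{"year": 1850}, {}, {"year": 1851}]
# Missing fields yield None rather than raising, preserving document order.
column = [row.get("year") for row in rows]
# column == [1850, None, 1851]
```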

hash

hash() -> str

Stable hash computed from the sorted document hashes, the sorted metadata, and the language.

Document dataclass

A single text with optional metadata.

hash is derived lazily from text: identical texts hash identically regardless of id or metadata. This is intentional, because the cache key for parsed documents is content-addressed.
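Content-addressing can be illustrated with hashlib. The use of SHA-256 here is an assumption for the sketch; the docs only guarantee that the hash depends on text alone:

```python
import hashlib

def text_hash(text: str) -> str:
    # Hypothetical stand-in for Document.hash: depends on text only,
    # so id and metadata never affect the cache key.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = {"id": "doc-1", "metadata": {"author": "A"}, "text": "call me ishmael"}
b = {"id": "doc-2", "metadata": {"author": "B"}, "text": "call me ishmael"}
same = text_hash(a["text"]) == text_hash(b["text"])  # True: same text, same key
```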

Features

FeatureMatrix dataclass

concat

concat(other: FeatureMatrix) -> FeatureMatrix

Column-concatenate two FeatureMatrix objects with identical document_ids.
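The identical-document_ids precondition can be sketched with plain lists of rows; this is an illustration of the contract, not tamga's code:

```python
def concat_columns(ids_a, cols_a, ids_b, cols_b):
    """Sketch of column concatenation: both matrices must describe the
    same documents in the same order, otherwise refuse to merge."""
    if ids_a != ids_b:
        raise ValueError("document_ids differ; cannot column-concatenate")
    return [row_a + row_b for row_a, row_b in zip(cols_a, cols_b)]

ids = ["d1", "d2"]
merged = concat_columns(ids, [[1, 2], [3, 4]], ids, [[5], [6]])
# merged == [[1, 2, 5], [3, 4, 6]]
```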

MFWExtractor

Bases: BaseFeatureExtractor

Most-Frequent-Words table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | Retain the top n words by corpus frequency. | 1000 |
| min_df | int | Drop words appearing in fewer than min_df documents. | 1 |
| max_df | float | Drop words appearing in more than a max_df fraction of documents (1.0 disables the filter). | 1.0 |
| scale | ('none', 'zscore', 'l1', 'l2') | Per-feature scaling applied at transform time. "zscore" z-scores the relative frequencies (per-document rates), the classical Mosteller & Wallace / Burrows formulation required by Burrows Delta. "l1" normalises each row to sum to 1 (relative frequencies); "l2" normalises each row to unit length. | "none" |
| lowercase | bool | If True, case-fold before counting. | False |
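The n / min_df / max_df selection rules can be sketched with collections.Counter. Whitespace tokenisation is an assumption of this sketch; tamga's tokenizer may differ:

```python
from collections import Counter

def mfw_vocabulary(texts, n=1000, min_df=1, max_df=1.0, lowercase=False):
    """Sketch of MFW selection: rank by corpus frequency, then drop
    words outside the [min_df, max_df * n_docs] document-frequency band."""
    tokenised = [(t.lower() if lowercase else t).split() for t in texts]
    corpus_freq = Counter()
    doc_freq = Counter()
    for tokens in tokenised:
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(texts)
    kept = [
        word for word, _ in corpus_freq.most_common()
        if doc_freq[word] >= min_df and doc_freq[word] / n_docs <= max_df
    ]
    return kept[:n]

vocab = mfw_vocabulary(["the cat sat", "the dog sat", "a dog ran"], n=2, min_df=2)
# vocab == ["the", "sat"]: "cat", "a", "ran" fail min_df; top 2 by frequency kept
```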


Delta

BurrowsDelta

Bases: _DeltaBase

Burrows Classic Delta.

Distance = mean(|x_i - c_i|) across features, i.e. the L1 norm divided by the feature count. Assumes both X and the centroid are z-scored in the same coordinate system.
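Under the z-scored convention above, the distance reduces to a mean absolute per-feature difference. A self-contained numeric sketch (not the library's implementation):

```python
def burrows_delta(x, centroid):
    """Classic Delta between a z-scored document vector and a z-scored
    centroid: mean absolute difference, i.e. L1 norm / n_features."""
    assert len(x) == len(centroid)
    return sum(abs(a - b) for a, b in zip(x, centroid)) / len(x)

d = burrows_delta([1.0, -0.5, 0.0], [0.0, 0.5, 0.0])
# d == (1.0 + 1.0 + 0.0) / 3
```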

Zeta

ZetaClassic

Bases: _ZetaBase

Clustering

HierarchicalCluster

scipy-based hierarchical clustering. Returns both flat labels and the full linkage matrix; the latter is what the viz layer uses to render dendrograms.

Classification

tamga.methods.classify.build_classifier

build_classifier(name: str, **kwargs: Any) -> BaseEstimator

tamga.methods.classify.cross_validate_tamga

cross_validate_tamga(estimator: BaseEstimator, fm: FeatureMatrix, y: ndarray, *, cv_kind: str = 'stratified', groups_from: ndarray | None = None, folds: int = 5, seed: int = 42) -> dict[str, Any]

Run cross-validation with a stylometry-aware CV strategy.

cv_kind:

- "stratified": StratifiedKFold(folds), uses seed for the shuffle
- "loao": LeaveOneGroupOut (requires groups_from; deterministic)
- "leave_one_text_out": LeaveOneOut (deterministic)
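The "loao" strategy matters for stylometry because it holds out one whole group (e.g. one author) per fold, so no shuffling or seed is involved. A pure-Python sketch of that split, not sklearn's implementation:

```python
def leave_one_group_out(groups):
    """Sketch of the "loao" split: one fold per distinct group label,
    that group's documents form the test set. Fully deterministic."""
    indices = list(range(len(groups)))
    for held_out in sorted(set(groups)):
        test = [i for i in indices if groups[i] == held_out]
        train = [i for i in indices if groups[i] != held_out]
        yield train, test

folds = list(leave_one_group_out(["A", "B", "A", "C"]))
# three folds: one each for authors A, B, C
```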

Results

Result dataclass

to_json

to_json(path: str | Path) -> None

Persist params + values + provenance to a single JSON file.

ndarray values are encoded as {"__ndarray__": list, "shape": ..., "dtype": ...} so they round-trip exactly.
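The encoding convention can be mimicked with the stdlib alone. The sketch below round-trips a 2-D nested list standing in for an ndarray; tamga's actual encoder operates on real numpy arrays:

```python
import json

def encode_array(nested, dtype="float64"):
    """Flatten a 2-D nested list into the {"__ndarray__": ...} shape
    described above (illustrative stand-in for the real encoder)."""
    shape = (len(nested), len(nested[0]))
    flat = [value for row in nested for value in row]
    return {"__ndarray__": flat, "shape": shape, "dtype": dtype}

def decode_array(obj):
    rows, cols = obj["shape"]
    flat = obj["__ndarray__"]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

original = [[1.0, 2.0], [3.0, 4.0]]
payload = json.dumps(encode_array(original))  # plain JSON on disk
restored = decode_array(json.loads(payload))  # round-trips exactly
```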

save

save(directory: str | Path) -> Path

Persist everything to directory/: result.json plus tables as parquet; figures are deferred to the viz layer.

Provenance dataclass

has_forensic_metadata property

has_forensic_metadata: bool

True if any chain-of-custody field has been populated.

Runner

tamga.runner.run_study

run_study(config_path: str | Path, *, output_dir: str | Path | None = None, run_name: str | None = None) -> Path

Execute a full study from a study.yaml file and save all results.

Returns the path to the run directory (e.g., results/2026-04-17T10-15-30/).

Reporting

tamga.report.render.build_report

build_report(result_dir: str | Path, *, output: str | Path, format: Format = 'html', title: str = 'tamga study', corpus_summary: dict[str, Any] | None = None) -> Path

Assemble result.json + figures in result_dir into a single HTML/MD file at output.

result_dir may be either a single Result directory (one method) or a parent directory containing per-method subdirectories (produced by tamga run).

tamga.report.render.build_forensic_report

build_forensic_report(result_dir: str | Path, *, output: str | Path, title: str = 'tamga forensic report', lr_summaries: dict[str, dict[str, str]] | None = None) -> Path

Render a forensic-styled report with known/questioned framing, LR output, and a chain-of-custody block.

The forensic fields (hypothesis_pair, questioned_description, known_description, acquisition_notes, custody_notes, source_hashes) are read from the provenance of the first Result in result_dir. Populate them when calling Provenance.current from your verification pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| result_dir | path | A Result directory or a parent directory of per-method Result subdirectories. | required |
| output | path | Output HTML path. | required |
| title | str | Report title. | 'tamga forensic report' |
| lr_summaries | dict | Per-method LR summaries keyed by method name, e.g. {"general_impostors": {"log_lr": "1.34", "lr": "21.9"}}. These render under the relevant method section. If None, no per-method LR block is drawn. | None |