Python API

Auto-generated from the source via mkdocstrings. Every symbol listed below is re-exported at the tamga top level (unless otherwise noted).

Corpus

Corpus dataclass

An ordered collection of Documents that share a metadata schema.

Iteration yields Documents in the order provided. Equality and hashing ignore order: two Corpora containing the same documents in different orders compare equal and hash identically.

__getitem__

__getitem__(index: int | slice | Sequence[int] | ndarray) -> Document | Corpus

Index by int → Document; by slice or array-like → Corpus.

The array-like branch is needed for sklearn's cross-validation splitters, which slice X with an ndarray of fold indices.

filter

filter(**query: Any) -> Corpus

Return a new Corpus containing documents whose metadata matches every key in query.

Values may be scalars (exact match) or lists (membership).
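The matching rule (scalar values match exactly, list values match by membership) can be sketched in plain Python over bare metadata dicts. This is an illustrative re-implementation of the documented contract, not tamga's actual code:

```python
def matches(metadata: dict, query: dict) -> bool:
    """Illustrative sketch of the filter rule: scalar query values
    require an exact match, list values require membership."""
    for key, wanted in query.items():
        value = metadata.get(key)
        if isinstance(wanted, list):
            if value not in wanted:
                return False
        elif value != wanted:
            return False
    return True

docs = [
    {"author": "A", "genre": "letter"},
    {"author": "B", "genre": "novel"},
    {"author": "A", "genre": "novel"},
]
# Keep author A AND genre in {novel, essay}: only the third dict survives.
kept = [d for d in docs if matches(d, {"author": "A", "genre": ["novel", "essay"]})]
```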

groupby

groupby(field_name: str) -> dict[Any, Corpus]

Group documents by a metadata field value.

Raises KeyError if any document lacks the field.
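The grouping contract, including the KeyError on a missing field, amounts to the following sketch over plain metadata dicts (not tamga's implementation):

```python
def group_documents(metadata_rows, field_name):
    """Sketch of the groupby contract: one bucket per distinct field
    value; a document lacking the field raises KeyError."""
    groups = {}
    for row in metadata_rows:
        key = row[field_name]  # KeyError propagates if the field is absent
        groups.setdefault(key, []).append(row)
    return groups

rows = [{"author": "A"}, {"author": "B"}, {"author": "A"}]
by_author = group_documents(rows, "author")  # {"A": [...2 rows...], "B": [...]}
```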

metadata_column

metadata_column(field_name: str) -> list[Any]

Return the list of metadata values at field_name, in document order.

Missing values become None; use filter first if you want to exclude them.
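The missing-value behaviour amounts to a dict.get per document, in order. A minimal sketch over plain metadata dicts (not the library's implementation):

```python
rows = [{"year": 1850}, {}, {"year": 1851}]
# Missing fields yield None rather than raising, preserving document order.
column = [row.get("year") for row in rows]
# column == [1850, None, 1851]
```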

hash

hash() -> str

Stable hash computed from the sorted document hashes, the sorted metadata, and the language.

Document dataclass

A single text with optional metadata.

hash is derived lazily from text: identical texts hash identically regardless of id or metadata. This is intentional, because the cache key for parsed documents is content-addressed.
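Content-addressing can be illustrated with hashlib. The use of SHA-256 here is an assumption for the sketch; the docs only guarantee that the hash depends on text alone:

```python
import hashlib

def text_hash(text: str) -> str:
    # Hypothetical stand-in for Document.hash: depends on text only,
    # so id and metadata never affect the cache key.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = {"id": "doc-1", "metadata": {"author": "A"}, "text": "call me ishmael"}
b = {"id": "doc-2", "metadata": {"author": "B"}, "text": "call me ishmael"}
same = text_hash(a["text"]) == text_hash(b["text"])  # True: same text, same key
```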

Features

FeatureMatrix dataclass

concat

concat(other: FeatureMatrix) -> FeatureMatrix

Column-concatenate two FeatureMatrix objects with identical document_ids.
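The identical-document_ids precondition can be sketched with plain lists of rows; this is an illustration of the contract, not tamga's code:

```python
def concat_columns(ids_a, cols_a, ids_b, cols_b):
    """Sketch of column concatenation: both matrices must describe the
    same documents in the same order, otherwise refuse to merge."""
    if ids_a != ids_b:
        raise ValueError("document_ids differ; cannot column-concatenate")
    return [row_a + row_b for row_a, row_b in zip(cols_a, cols_b)]

ids = ["d1", "d2"]
merged = concat_columns(ids, [[1, 2], [3, 4]], ids, [[5], [6]])
# merged == [[1, 2, 5], [3, 4, 6]]
```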

MFWExtractor

Bases: BaseFeatureExtractor

Most-Frequent-Words table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | Retain the top n words by corpus frequency. | 1000 |
| min_df | int | Drop words appearing in fewer than min_df documents. | 1 |
| max_df | float | Drop words appearing in more than a max_df fraction of documents (1.0 disables the filter). | 1.0 |
| scale | ('none', 'zscore', 'l1', 'l2') | Per-feature scaling applied at transform time. "zscore" z-scores the relative frequencies (per-document rates), the classical Mosteller & Wallace / Burrows formulation required by Burrows Delta. "l1" normalises each row to sum to 1 (relative frequencies); "l2" normalises each row to unit length. | "none" |
| lowercase | bool | If True, case-fold before counting. | False |
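The n / min_df / max_df selection rules can be sketched with collections.Counter. Whitespace tokenisation is an assumption of this sketch; tamga's tokenizer may differ:

```python
from collections import Counter

def mfw_vocabulary(texts, n=1000, min_df=1, max_df=1.0, lowercase=False):
    """Sketch of MFW selection: rank by corpus frequency, then drop
    words outside the [min_df, max_df * n_docs] document-frequency band."""
    tokenised = [(t.lower() if lowercase else t).split() for t in texts]
    corpus_freq = Counter()
    doc_freq = Counter()
    for tokens in tokenised:
        corpus_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each word once per document
    n_docs = len(texts)
    kept = [
        word for word, _ in corpus_freq.most_common()
        if doc_freq[word] >= min_df and doc_freq[word] / n_docs <= max_df
    ]
    return kept[:n]

vocab = mfw_vocabulary(["the cat sat", "the dog sat", "a dog ran"], n=2, min_df=2)
# vocab == ["the", "sat"]: "cat", "a", "ran" fail min_df; top 2 by frequency kept
```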


Delta

BurrowsDelta

Bases: _DeltaBase

Burrows Classic Delta.

Distance = mean(|x_i - c_i|) across features, i.e. the L1 norm divided by the feature count. Assumes both X and the centroid are z-scored in the same coordinate system.
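Under the z-scored convention above, the distance reduces to a mean absolute per-feature difference. A self-contained numeric sketch (not the library's implementation):

```python
def burrows_delta(x, centroid):
    """Classic Delta between a z-scored document vector and a z-scored
    centroid: mean absolute difference, i.e. L1 norm / n_features."""
    assert len(x) == len(centroid)
    return sum(abs(a - b) for a, b in zip(x, centroid)) / len(x)

d = burrows_delta([1.0, -0.5, 0.0], [0.0, 0.5, 0.0])
# d == (1.0 + 1.0 + 0.0) / 3
```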

Zeta

ZetaClassic

Bases: _ZetaBase

Clustering

HierarchicalCluster

scipy-based hierarchical clustering. Returns both flat labels and the full linkage matrix; the latter is what the viz layer uses to render dendrograms.

Classification

tamga.methods.classify.build_classifier

build_classifier(name: str, **kwargs: Any) -> BaseEstimator

tamga.methods.classify.cross_validate_tamga

cross_validate_tamga(estimator: BaseEstimator, fm: FeatureMatrix, y: ndarray, *, cv_kind: str = 'stratified', groups_from: ndarray | None = None, folds: int = 5, seed: int = 42) -> dict[str, Any]

Run cross-validation with a stylometry-aware CV strategy.

cv_kind:

- "stratified": StratifiedKFold(folds), uses seed for the shuffle
- "loao": LeaveOneGroupOut (requires groups_from; deterministic)
- "leave_one_text_out": LeaveOneOut (deterministic)
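The "loao" strategy matters for stylometry because it holds out one whole group (e.g. one author) per fold, so no shuffling or seed is involved. A pure-Python sketch of that split, not sklearn's implementation:

```python
def leave_one_group_out(groups):
    """Sketch of the "loao" split: one fold per distinct group label,
    that group's documents form the test set. Fully deterministic."""
    indices = list(range(len(groups)))
    for held_out in sorted(set(groups)):
        test = [i for i in indices if groups[i] == held_out]
        train = [i for i in indices if groups[i] != held_out]
        yield train, test

folds = list(leave_one_group_out(["A", "B", "A", "C"]))
# three folds: one each for authors A, B, C
```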

Results

Result dataclass

to_json

to_json(path: str | Path) -> None

Persist params + values + provenance to a single JSON file.

ndarray values are encoded as {"__ndarray__": list, "shape": ..., "dtype": ...} so they round-trip exactly.
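The encoding convention can be mimicked with the stdlib alone. The sketch below round-trips a 2-D nested list standing in for an ndarray; tamga's actual encoder operates on real numpy arrays:

```python
import json

def encode_array(nested, dtype="float64"):
    """Flatten a 2-D nested list into the {"__ndarray__": ...} shape
    described above (illustrative stand-in for the real encoder)."""
    shape = (len(nested), len(nested[0]))
    flat = [value for row in nested for value in row]
    return {"__ndarray__": flat, "shape": shape, "dtype": dtype}

def decode_array(obj):
    rows, cols = obj["shape"]
    flat = obj["__ndarray__"]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

original = [[1.0, 2.0], [3.0, 4.0]]
payload = json.dumps(encode_array(original))  # plain JSON on disk
restored = decode_array(json.loads(payload))  # round-trips exactly
```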

save

save(directory: str | Path) -> Path

Persist everything to directory/: result.json plus tables as parquet; figures are deferred to the viz layer.

Provenance dataclass

has_forensic_metadata property

has_forensic_metadata: bool

True if any chain-of-custody field has been populated.

Runner

tamga.runner.run_study

run_study(config_path: str | Path, *, output_dir: str | Path | None = None, run_name: str | None = None) -> Path

Execute a full study from a study.yaml file and save all results.

Returns the path to the run directory (e.g., results/2026-04-17T10-15-30/).

Reporting

tamga.report.render.build_report

build_report(result_dir: str | Path, *, output: str | Path, format: Format = 'html', title: str = 'tamga study', corpus_summary: dict[str, Any] | None = None) -> Path

Assemble result.json + figures in result_dir into a single HTML/MD file at output.

result_dir may be either a single Result directory (one method) or a parent directory containing per-method subdirectories (produced by tamga run).

tamga.report.render.build_forensic_report

build_forensic_report(result_dir: str | Path, *, output: str | Path, title: str = 'tamga forensic report', lr_summaries: dict[str, dict[str, str]] | None = None) -> Path

Render a forensic-styled report with known/questioned framing, LR output, and a chain-of-custody block.

The forensic fields (hypothesis_pair, questioned_description, known_description, acquisition_notes, custody_notes, source_hashes) are read from the provenance of the first Result in result_dir. Populate them when calling Provenance.current from your verification pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| result_dir | path | A Result directory or a parent directory of per-method Result subdirectories. | required |
| output | path | Output HTML path. | required |
| title | str | Report title. | 'tamga forensic report' |
| lr_summaries | dict | Per-method LR summaries keyed by method name, e.g. {"general_impostors": {"log_lr": "1.34", "lr": "21.9"}}. These render under the relevant method section. If None, no per-method LR block is drawn. | None |