
Python API

Auto-generated from source via mkdocstrings. Every symbol listed below is re-exported at the tamga top level unless noted otherwise.

Corpus

Corpus dataclass

An ordered collection of Documents that share a metadata schema.

Iteration yields Documents in the order provided. Equality and hashing ignore order — two Corpora with the same documents in different orders hash identically.

__getitem__

__getitem__(index: int | slice | Sequence[int] | ndarray) -> Document | Corpus

Index by int → Document; by slice or array-like → Corpus.

The array-like branch is needed for sklearn's cross-validation splitters, which slice X with an ndarray of fold indices.
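The dispatch described above can be sketched with a simplified stand-in class (MiniCorpus and its string "documents" are illustrative, not part of the tamga API):

```python
import numpy as np

class MiniCorpus:
    """Illustrative stand-in for Corpus: int -> item, anything else -> MiniCorpus."""

    def __init__(self, docs):
        self.docs = list(docs)

    def __getitem__(self, index):
        if isinstance(index, (int, np.integer)):
            return self.docs[index]                # single Document
        if isinstance(index, slice):
            return MiniCorpus(self.docs[index])    # new sub-corpus
        # array-like branch: what sklearn's CV splitters hit when slicing X
        return MiniCorpus(self.docs[i] for i in np.asarray(index))

corpus = MiniCorpus(["doc-a", "doc-b", "doc-c", "doc-d"])
print(corpus[1])                       # doc-b
print(corpus[np.array([0, 2])].docs)   # ['doc-a', 'doc-c']
```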

filter

filter(**query: Any) -> Corpus

Return a new Corpus containing documents whose metadata matches every key in query.

Values may be scalars (exact match) or lists (membership).
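A minimal sketch of those matching semantics over plain dict metadata (the helper and sample data are illustrative, not the library's internals):

```python
def matches(meta: dict, query: dict) -> bool:
    # Scalar query values mean exact match; list values mean membership.
    for key, wanted in query.items():
        value = meta.get(key)
        if isinstance(wanted, list):
            if value not in wanted:
                return False
        elif value != wanted:
            return False
    return True

docs = [
    {"author": "A", "genre": "letter"},
    {"author": "B", "genre": "essay"},
    {"author": "A", "genre": "essay"},
]
kept = [d for d in docs if matches(d, {"author": "A", "genre": ["essay", "letter"]})]
print(len(kept))  # 2
```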

groupby

groupby(field_name: str) -> dict[Any, Corpus]

Group documents by a metadata field value.

Raises KeyError if any document lacks the field.
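The grouping behaviour, including the strict missing-field error, can be sketched over plain metadata dicts (a hypothetical helper, not the library implementation):

```python
def group_by_field(docs, field_name):
    groups = {}
    for meta in docs:
        if field_name not in meta:
            raise KeyError(field_name)  # documented behaviour: missing field is an error
        groups.setdefault(meta[field_name], []).append(meta)
    return groups

docs = [{"author": "A"}, {"author": "B"}, {"author": "A"}]
by_author = group_by_field(docs, "author")
print(sorted(by_author))  # ['A', 'B']
```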

metadata_column

metadata_column(field_name: str) -> list[Any]

Return the list of metadata values at field_name, in document order.

Missing values become None; use filter first if you want to exclude them.
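The same contract in miniature, assuming metadata stored as dicts (illustrative only):

```python
def metadata_column(docs, field_name):
    # Missing values become None; document order is preserved.
    return [meta.get(field_name) for meta in docs]

docs = [{"year": 1840}, {}, {"year": 1852}]
print(metadata_column(docs, "year"))  # [1840, None, 1852]
```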

hash

hash() -> str

Stable hash — sorted document hashes + sorted metadata + language.

Document dataclass

A single text with optional metadata.

hash is derived lazily from text — identical texts hash identically regardless of id or metadata, which is intentional: the cache key for parsed documents is content-addressed.
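A sketch of what content-addressing buys you; SHA-256 is an assumption here, since the actual digest algorithm is unspecified:

```python
import hashlib

def content_hash(text: str) -> str:
    # Hypothetical: the hash depends only on the text, so identical texts
    # map to the same cache key regardless of id or metadata.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = content_hash("call me ishmael")
b = content_hash("call me ishmael")  # different id/metadata, same text
assert a == b
```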

Features

FeatureMatrix dataclass

concat

concat(other: FeatureMatrix) -> FeatureMatrix

Column-concatenate two FeatureMatrix objects with identical document_ids.
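The alignment requirement can be sketched with bare arrays (a hypothetical helper over ids and matrices, not the dataclass itself):

```python
import numpy as np

def concat(ids_a, X_a, ids_b, X_b):
    # Column-concatenation only makes sense when rows describe the same documents.
    if ids_a != ids_b:
        raise ValueError("document_ids must match")
    return np.hstack([X_a, X_b])

ids = ["d1", "d2"]
X = concat(ids, np.ones((2, 3)), ids, np.zeros((2, 2)))
print(X.shape)  # (2, 5)
```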

MFWExtractor

Bases: BaseFeatureExtractor

Most-Frequent-Words table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| n | int | Retain the top n words by corpus frequency. | 1000 |
| min_df | int | Drop words appearing in fewer than min_df documents. | 1 |
| max_df | float | Drop words appearing in more than max_df fraction of documents (1.0 disables). | 1.0 |
| scale | 'none', 'zscore', 'l1', 'l2' | Per-feature scaling applied at transform time. Burrows Delta requires "zscore", which z-scores the relative frequencies (per-document rates), the classical Mosteller & Wallace / Burrows formulation. "l1" normalises rows to sum to 1 (relative frequencies); "l2" normalises rows to unit length. | "none" |
| lowercase | bool | If True, case-fold before counting. | False |
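The vocabulary-selection step (n, min_df, lowercase) can be sketched with a Counter; scaling and max_df are omitted, and the helper is illustrative rather than the extractor's actual code:

```python
from collections import Counter

def mfw_vocabulary(texts, n=5, min_df=1, lowercase=True):
    # Rank words by total corpus frequency, then apply the min_df filter.
    corpus_counts, doc_freq = Counter(), Counter()
    for text in texts:
        tokens = text.lower().split() if lowercase else text.split()
        corpus_counts.update(tokens)
        doc_freq.update(set(tokens))
    ranked = [w for w, _ in corpus_counts.most_common() if doc_freq[w] >= min_df]
    return ranked[:n]

texts = ["the cat sat on the mat", "the dog sat"]
print(mfw_vocabulary(texts, n=3))  # ['the', 'sat', 'cat']
```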

Methods

Delta

BurrowsDelta

Bases: _DeltaBase

Burrows Classic Delta.

Distance = mean(|x_i - c_i|) across features, i.e. the L1 norm divided by the feature count. Assumes both X and centroid are z-scored in the same coordinate system.
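The formula above is a one-liner in numpy; this sketch assumes the inputs are already z-scored, as the docstring requires:

```python
import numpy as np

def burrows_delta(X, centroid):
    # Mean absolute difference across features: L1 norm / feature count.
    # Both X rows and centroid must be z-scored in the same coordinates.
    return np.abs(X - centroid).mean(axis=1)

X = np.array([[1.0, -1.0, 0.0],
              [2.0, -2.0, 3.0]])
centroid = np.array([0.0, 0.0, 1.0])
print(burrows_delta(X, centroid))  # [1. 2.]
```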

Zeta

ZetaClassic

Bases: _ZetaBase

Clustering

HierarchicalCluster

scipy-based hierarchical clustering — returns both flat labels and the full linkage matrix (the latter is what the viz layer uses to render dendrograms).

Classification

tamga.methods.classify.build_classifier

build_classifier(name: str, **kwargs: Any) -> BaseEstimator

tamga.methods.classify.cross_validate_tamga

cross_validate_tamga(estimator: BaseEstimator, fm: FeatureMatrix, y: ndarray, *, cv_kind: str = 'stratified', groups_from: ndarray | None = None, folds: int = 5, seed: int = 42) -> dict[str, Any]

Run cross-validation with a stylometry-aware CV strategy.

cv_kind:
- "stratified": StratifiedKFold(folds), uses seed for the shuffle
- "loao": LeaveOneGroupOut (requires groups_from; deterministic)
- "leave_one_text_out": LeaveOneOut (deterministic)
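The cv_kind dispatch maps directly onto scikit-learn splitters; this sketch shows one plausible mapping consistent with the documented behaviour (the helper name is illustrative):

```python
from sklearn.model_selection import StratifiedKFold, LeaveOneGroupOut, LeaveOneOut

def make_splitter(cv_kind, folds=5, seed=42):
    # Mirrors the documented cv_kind options.
    if cv_kind == "stratified":
        return StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    if cv_kind == "loao":
        return LeaveOneGroupOut()  # groups are supplied at split() time
    if cv_kind == "leave_one_text_out":
        return LeaveOneOut()
    raise ValueError(f"unknown cv_kind: {cv_kind}")

print(type(make_splitter("stratified")).__name__)  # StratifiedKFold
```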

Results

Result dataclass

to_json

to_json(path: str | Path) -> None

Persist params + values + provenance to a single JSON file.

ndarray values are encoded as {"__ndarray__": list, "shape": ..., "dtype": ...} so they round-trip exactly.
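One encoding scheme consistent with those keys can be sketched as follows; the exact encoder is unspecified (e.g. whether the list is flat or nested), so treat this as an assumption:

```python
import json
import numpy as np

def encode(arr):
    # Flatten to a plain list and record shape/dtype for exact round-tripping.
    return {"__ndarray__": arr.ravel().tolist(),
            "shape": list(arr.shape),
            "dtype": str(arr.dtype)}

def decode(obj):
    return np.array(obj["__ndarray__"], dtype=obj["dtype"]).reshape(obj["shape"])

a = np.arange(6, dtype=np.float64).reshape(2, 3)
restored = decode(json.loads(json.dumps(encode(a))))
assert (restored == a).all() and restored.dtype == a.dtype
```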

save

save(directory: str | Path) -> Path

Persist everything to directory/: result.json, tables as parquet, figures deferred to viz layer.

Provenance dataclass

has_forensic_metadata property

has_forensic_metadata: bool

True if any chain-of-custody field has been populated.

Runner

tamga.runner.run_study

run_study(config_path: str | Path, *, output_dir: str | Path | None = None, run_name: str | None = None) -> Path

Execute a full study from a study.yaml file and save all results.

Returns the path to the run directory (e.g., results/2026-04-17T10-15-30/).
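A study.yaml might look something like the fragment below; every field name here is hypothetical, since the schema is not documented in this section:

```yaml
# Hypothetical study.yaml sketch -- illustrative field names only.
corpus:
  path: data/letters/
  language: en
features:
  mfw:
    n: 500
    scale: zscore
methods:
  - burrows_delta
output_dir: results/
```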

Reporting

tamga.report.render.build_report

build_report(result_dir: str | Path, *, output: str | Path, format: Format = 'html', title: str = 'tamga study', corpus_summary: dict[str, Any] | None = None) -> Path

Assemble result.json + figures in result_dir into a single HTML/MD file at output.

result_dir may be either a single Result directory (one method) or a parent directory containing per-method subdirectories (produced by tamga run).

tamga.report.render.build_forensic_report

build_forensic_report(result_dir: str | Path, *, output: str | Path, title: str = 'tamga forensic report', lr_summaries: dict[str, dict[str, str]] | None = None) -> Path

Render a forensic-styled report with known/questioned framing, LR output, and a chain-of-custody block.

The forensic fields (hypothesis_pair, questioned_description, known_description, acquisition_notes, custody_notes, source_hashes) are read from the provenance of the first Result in result_dir. Populate them when calling Provenance.current from your verification pipeline.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| result_dir | path | A Result directory or a parent directory of per-method Result subdirectories. | required |
| output | path | Output HTML path. | required |
| title | str | Report title. | 'tamga forensic report' |
| lr_summaries | dict | Per-method LR summaries keyed by method name, e.g., {"general_impostors": {"log_lr": "1.34", "lr": "21.9"}}. These render under the relevant method section. If None, no LR block is drawn per method. | None |
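The documented example values are consistent with a base-10 likelihood ratio (lr = 10 ** log_lr); that interpretation is an assumption, and the snippet below only shows how to build the dict:

```python
# Build a per-method LR summary in the documented shape, assuming log_lr
# is log base 10 of the likelihood ratio.
log_lr = 1.34
lr_summaries = {
    "general_impostors": {"log_lr": f"{log_lr:.2f}", "lr": f"{10 ** log_lr:.1f}"}
}
print(lr_summaries["general_impostors"]["lr"])  # 21.9
```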