Python API¶
Auto-generated from the source via mkdocstrings. Every symbol listed below is
re-exported at the tamga top level (unless noted otherwise).
Corpus¶
Corpus dataclass ¶
An ordered collection of Documents that share a metadata schema.
Iteration yields Documents in the order provided. Equality and hashing ignore order — two Corpora with the same documents in different orders hash identically.
__getitem__ ¶
Index by int → Document; by slice or array-like → Corpus.
The array-like branch is needed for sklearn's cross-validation splitters, which
slice X with an ndarray of fold indices.
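The dispatch can be sketched in plain Python (an illustrative reimplementation, not tamga's actual source — `docs` stands in for the Corpus's document list):

```python
# Illustrative sketch of the __getitem__ dispatch: an int returns one item,
# a slice or an array-like of indices returns a new collection — the latter
# branch is what sklearn's CV splitters rely on.

def getitem(docs, key):
    if isinstance(key, int):
        return docs[key]              # -> single Document
    if isinstance(key, slice):
        return docs[key]              # -> sub-Corpus
    return [docs[i] for i in key]     # ndarray / list of fold indices -> sub-Corpus
```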
filter ¶
Return a new Corpus containing documents whose metadata matches every key in query.
Values may be scalars (exact match) or lists (membership).
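The matching rule can be sketched as follows, treating each document's metadata as a plain dict (an assumption for illustration — the real type may differ):

```python
# A scalar query value means exact match; a list means membership.
# A document matches only if every key in the query matches.

def matches(meta, query):
    for key, want in query.items():
        if isinstance(want, list):
            if meta.get(key) not in want:
                return False
        elif meta.get(key) != want:
            return False
    return True
```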
groupby ¶
Group documents by a metadata field value.
Raises KeyError if any document lacks the field.
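A minimal sketch of the grouping behavior, again with dicts standing in for document metadata:

```python
# Bucket documents by a metadata field; plain dict indexing raises
# KeyError when a document lacks the field, matching the contract above.

def groupby(metas, field):
    groups = {}
    for meta in metas:
        groups.setdefault(meta[field], []).append(meta)
    return groups
```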
metadata_column ¶
Return the list of metadata values at field_name, in document order.
Missing values become None; use filter first if you want to exclude them.
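The contrast with groupby can be sketched in one line: `.get()` turns a missing key into None instead of raising.

```python
# Sketch of the lenient lookup: documents without the field contribute None.

def metadata_column(metas, field_name):
    return [meta.get(field_name) for meta in metas]
```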
Document dataclass ¶
A single text with optional metadata.
hash is derived lazily from text — identical texts hash identically regardless of id or
metadata, which is intentional: the cache key for parsed documents is content-addressed.
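Content addressing can be sketched like this (the actual hash implementation may differ — SHA-256 here is only an illustration of the idea):

```python
import hashlib

# The key depends only on text, so two Documents with identical texts but
# different ids or metadata share one cache entry for parsed results.

def cache_key(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()
```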
Features¶
FeatureMatrix dataclass ¶
concat ¶
Column-concatenate two FeatureMatrix objects with identical document_ids.
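A sketch of the id check and the column-concatenation, with nested lists standing in for the real matrix payload (an assumption for illustration):

```python
# Concatenation is only valid when both matrices describe the same
# documents in the same order; otherwise rows would be misaligned.

def concat(ids_a, rows_a, ids_b, rows_b):
    if list(ids_a) != list(ids_b):
        raise ValueError("document_ids must match")
    return [ra + rb for ra, rb in zip(rows_a, rows_b)]
```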
MFWExtractor ¶
Bases: BaseFeatureExtractor
Most-Frequent-Words table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Retain the top n words. | 1000 |
| min_df | int | Drop words appearing in fewer than min_df documents. | 1 |
| max_df | float | Drop words appearing in more than this fraction of documents. | 1.0 |
| scale | ('none', 'zscore', 'l1', 'l2') | Per-feature scaling applied at transform-time. Burrows Delta requires "zscore", which z-scores the relative frequencies (per-document rates) — the classical Mosteller & Wallace / Burrows formulation. "l1" normalises rows to sum to 1 (relative frequencies); "l2" normalises rows to unit length. | "none" |
| lowercase | bool | If True, case-fold before counting. | False |
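The scale options can be sketched as follows (assumed semantics: "l1" and "l2" act on rows, "zscore" acts per feature, i.e. per column, on the relative frequencies, as Burrows Delta requires):

```python
import math

def l1(row):
    # Row sums to 1 -> relative frequencies.
    total = sum(row)
    return [x / total for x in row]

def l2(row):
    # Row scaled to unit Euclidean length.
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

def zscore(column):
    # Per-feature: subtract the corpus mean, divide by the std deviation.
    mu = sum(column) / len(column)
    sd = math.sqrt(sum((x - mu) ** 2 for x in column) / len(column))
    return [(x - mu) / sd for x in column]
```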
Methods¶
Delta¶
BurrowsDelta ¶
Bases: _DeltaBase
Burrows Classic Delta.
Distance = mean(|x_i - c_i|) across features -- the L1 norm divided by the feature count.
Assumes both X and centroid are z-scored in the same coordinate system.
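The distance as stated, applied to already z-scored vectors:

```python
# mean(|x_i - c_i|): the L1 norm between document and candidate centroid,
# divided by the number of features.

def burrows_delta(x, centroid):
    return sum(abs(a - b) for a, b in zip(x, centroid)) / len(x)
```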
Zeta¶
ZetaClassic ¶
Bases: _ZetaBase
Clustering¶
HierarchicalCluster ¶
scipy-based hierarchical clustering — returns both flat labels and the full linkage matrix (the latter is what the viz layer uses to render dendrograms).
Classification¶
tamga.methods.classify.build_classifier ¶
tamga.methods.classify.cross_validate_tamga ¶
cross_validate_tamga(estimator: BaseEstimator, fm: FeatureMatrix, y: ndarray, *, cv_kind: str = 'stratified', groups_from: ndarray | None = None, folds: int = 5, seed: int = 42) -> dict[str, Any]
Run cross-validation with a stylometry-aware CV strategy.
cv_kind:
- "stratified": StratifiedKFold(folds) — uses seed for the shuffle
- "loao": LeaveOneGroupOut (requires groups_from; deterministic)
- "leave_one_text_out": LeaveOneOut (deterministic)
Results¶
Result dataclass ¶
Provenance dataclass ¶
has_forensic_metadata property ¶
True if any chain-of-custody field has been populated.
Runner¶
tamga.runner.run_study ¶
run_study(config_path: str | Path, *, output_dir: str | Path | None = None, run_name: str | None = None) -> Path
Execute a full study from a study.yaml file and save all results.
Returns the path to the run directory (e.g., results/2026-04-17T10-15-30/).
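The run-directory naming shown above can be sketched like so — the exact timestamp format is an assumption based on the example path, not the documented behavior:

```python
from datetime import datetime
from pathlib import Path

# Build a results/<ISO-timestamp>/ path in the style of the example
# (results/2026-04-17T10-15-30/); colons are replaced so the name is
# filesystem-safe on all platforms.

def make_run_dir(output_dir="results"):
    stamp = datetime.now().strftime("%Y-%m-%dT%H-%M-%S")
    return Path(output_dir) / stamp
```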
Reporting¶
tamga.report.render.build_report ¶
build_report(result_dir: str | Path, *, output: str | Path, format: Format = 'html', title: str = 'tamga study', corpus_summary: dict[str, Any] | None = None) -> Path
Assemble result.json + figures in result_dir into a single HTML/MD file at output.
result_dir may be either a single Result directory (one method) or a parent directory
containing per-method subdirectories (produced by tamga run).
tamga.report.render.build_forensic_report ¶
build_forensic_report(result_dir: str | Path, *, output: str | Path, title: str = 'tamga forensic report', lr_summaries: dict[str, dict[str, str]] | None = None) -> Path
Render a forensic-styled report with known/questioned framing, LR output, and a chain-of-custody block.
The forensic fields (hypothesis_pair, questioned_description, known_description,
acquisition_notes, custody_notes, source_hashes) are read from the
provenance of the first Result in result_dir. Populate them when calling
Provenance.current from your verification pipeline.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| result_dir | path | A Result directory or a parent directory of per-method Result subdirectories. | required |
| output | path | Output HTML path. | required |
| title | str |  | 'tamga forensic report' |
| lr_summaries | dict | Per-method LR summaries keyed by method name. | None |