study.yaml schema¶
The declarative study config consumed by tamga run. A minimal example:
name: my-study
seed: 42
corpus:
path: corpus
metadata: corpus/metadata.tsv
features:
- id: mfw200
type: mfw
n: 200
scale: zscore
lowercase: true
methods:
- id: burrows
kind: delta
method: burrows
features: mfw200
group_by: author
Top-level keys¶
| Key | Type | Required | Description |
|---|---|---|---|
name |
str | yes | Study name; shows in reports |
seed |
int | no | Default seed (42). Threaded to every stochastic method. |
corpus |
object | yes | Corpus config (below) |
features |
list | yes | One or more feature extractors |
methods |
list | yes | One or more methods to run |
output |
object | no | Output directory / timestamping |
cache |
object | no | DocBin cache directory |
preprocess |
object | no | spaCy model selection |
corpus¶
corpus:
path: corpus # directory of .txt files
metadata: corpus/metadata.tsv # optional TSV with filename + arbitrary fields
strict: true # default: raise if any file lacks metadata
filter: # optional: subset the corpus before running
role: [train]
features¶
Each feature extractor is a dict with an id (referenced by methods), a type, and
type-specific params.
Supported types¶
| type | params |
|---|---|
mfw |
n, min_df, max_df, scale ({none, zscore, l1, l2}), lowercase |
word_ngram |
n (int or [min, max]), lowercase, scale |
char_ngram |
n, include_boundaries, scale |
function_word |
wordlist (optional list or path), scale |
punctuation |
(none) |
lexical_diversity |
(none) |
readability |
(none) |
methods¶
Each method is a dict with an id, a kind, an optional features (feature id), plus
params.
Supported kinds¶
| kind | Description |
|---|---|
delta |
Nearest-centroid attribution (method: burrows by default) |
zeta |
Craig's Zeta; requires group_by and either inferred or specified params.group_a / group_b |
reduce |
Dim-reduction (default PCA); params.n_components |
cluster |
Hierarchical (default Ward); params.n_clusters, params.linkage |
consensus |
Bootstrap consensus tree; params.mfw_bands, params.replicates |
classify |
sklearn classifier; params.estimator, cv.kind, cv.folds |
output¶
cache¶
preprocess¶
A realistic multi-method example¶
name: federalist
seed: 42
output: { dir: results, timestamp: false }
corpus:
path: corpus
metadata: corpus/metadata.tsv
filter:
role: [train]
features:
- id: mfw200
type: mfw
n: 200
scale: zscore
lowercase: true
methods:
- id: burrows
kind: delta
method: burrows
features: mfw200
group_by: author
- id: pca
kind: reduce
features: mfw200
params: { n_components: 2 }
- id: ward
kind: cluster
features: mfw200
params: { n_clusters: 3, linkage: ward }
- id: zeta_h_m
kind: zeta
group_by: author
params:
top_k: 50
group_a: Hamilton
group_b: Madison