study.yaml schema¶

The declarative study config consumed by tamga run. A minimal example:

name: my-study
seed: 42

corpus:
  path: corpus
  metadata: corpus/metadata.tsv

features:
  - id: mfw200
    type: mfw
    n: 200
    scale: zscore
    lowercase: true

methods:
  - id: burrows
    kind: delta
    method: burrows
    features: mfw200
    group_by: author

Top-level keys¶

Key	Type	Required	Description
`name`	str	yes	Study name; shows in reports
`seed`	int	no	Default seed (42). Threaded to every stochastic method.
`corpus`	object	yes	Corpus config (below)
`features`	list	yes	One or more feature extractors
`methods`	list	yes	One or more methods to run
`output`	object	no	Output directory / timestamping
`cache`	object	no	DocBin cache directory
`preprocess`	object	no	spaCy model selection

corpus¶

corpus:
  path: corpus                    # directory of .txt files
  metadata: corpus/metadata.tsv   # optional TSV with filename + arbitrary fields
  strict: true                    # default: raise if any file lacks metadata
  filter:                         # optional: subset the corpus before running
    role: [train]

features¶

Each feature extractor is a dict with an id (referenced by methods), a type, and type-specific params.

Supported types¶

type	params
`mfw`	`n`, `min_df`, `max_df`, `scale` ({none, zscore, l1, l2}), `lowercase`
`word_ngram`	`n` (int or [min, max]), `lowercase`, `scale`
`char_ngram`	`n`, `include_boundaries`, `scale`
`function_word`	`wordlist` (optional list or path), `scale`
`punctuation`	(none)
`lexical_diversity`	(none)
`readability`	(none)

methods¶

Each method is a dict with an id, a kind, an optional features (feature id), plus params.

Supported kinds¶

kind	Description
`delta`	Nearest-centroid attribution (`method: burrows` by default)
`zeta`	Craig's Zeta; requires `group_by` and either inferred or specified `params.group_a` / `group_b`
`reduce`	Dim-reduction (default PCA); `params.n_components`
`cluster`	Hierarchical (default Ward); `params.n_clusters`, `params.linkage`
`consensus`	Bootstrap consensus tree; `params.mfw_bands`, `params.replicates`
`classify`	sklearn classifier; `params.estimator`, `cv.kind`, `cv.folds`

output¶

output:
  dir: results          # default
  timestamp: true       # wrap runs in timestamped subdirectories

cache¶

cache:
  dir: .tamga/cache     # spaCy DocBin cache location

preprocess¶

preprocess:
  spacy:
    model: en_core_web_trf    # default; change to sm/md for speed

A realistic multi-method example¶

name: federalist
seed: 42
output: { dir: results, timestamp: false }

corpus:
  path: corpus
  metadata: corpus/metadata.tsv
  filter:
    role: [train]

features:
  - id: mfw200
    type: mfw
    n: 200
    scale: zscore
    lowercase: true

methods:
  - id: burrows
    kind: delta
    method: burrows
    features: mfw200
    group_by: author

  - id: pca
    kind: reduce
    features: mfw200
    params: { n_components: 2 }

  - id: ward
    kind: cluster
    features: mfw200
    params: { n_clusters: 3, linkage: ward }

  - id: zeta_h_m
    kind: zeta
    group_by: author
    params:
      top_k: 50
      group_a: Hamilton
      group_b: Madison