study.yaml schema

The declarative study config consumed by tamga run. A minimal example:

name: my-study
seed: 42

corpus:
  path: corpus
  metadata: corpus/metadata.tsv

features:
  - id: mfw200
    type: mfw
    n: 200
    scale: zscore
    lowercase: true

methods:
  - id: burrows
    kind: delta
    method: burrows
    features: mfw200
    group_by: author

Top-level keys

Key         Type    Required  Description
name        str     yes       Study name; shown in reports
seed        int     no        Random seed (default 42); threaded to every stochastic method
corpus      object  yes       Corpus config (below)
features    list    yes       One or more feature extractors
methods     list    yes       One or more methods to run
output      object  no        Output directory and timestamping
cache       object  no        DocBin cache directory
preprocess  object  no        spaCy model selection

corpus

corpus:
  path: corpus                    # directory of .txt files
  metadata: corpus/metadata.tsv   # optional TSV with filename + arbitrary fields
  strict: true                    # default: raise if any file lacks metadata
  filter:                         # optional: subset the corpus before running
    role: [train]

features

Each feature extractor is a dict with an id (referenced by methods), a type, and type-specific params.

Supported types

type               params
mfw                n, min_df, max_df, scale ({none, zscore, l1, l2}), lowercase
word_ngram         n (int or [min, max]), lowercase, scale
char_ngram         n, include_boundaries, scale
function_word      wordlist (optional list or path), scale
punctuation        (none)
lexical_diversity  (none)
readability        (none)
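
To make the table concrete, here is a sketch of a features list that combines several extractor types. The param names come from the table; the ids and the chosen values are illustrative. The sketch assumes that include_boundaries is a boolean and that scale accepts the same options for every type as it does for mfw. Params sit directly on the extractor, as in the minimal example.

```yaml
features:
  - id: w12                   # arbitrary id chosen for this sketch
    type: word_ngram
    n: [1, 2]                 # [min, max] form: unigrams and bigrams
    lowercase: true
    scale: l1                 # assumed: same scale options as mfw

  - id: char4
    type: char_ngram
    n: 4                      # single-int form
    include_boundaries: true  # assumed to be a boolean
    scale: zscore

  - id: fw
    type: function_word       # wordlist omitted; the table marks it optional
    scale: zscore

  - id: punct
    type: punctuation         # takes no params
```

A method would then pick one of these extractors through its features key, for example features: char4.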

methods

Each method is a dict with an id, a kind, an optional features key (the id of a feature extractor), plus method-specific params.

Supported kinds

kind       Description
delta      Nearest-centroid attribution (method: burrows by default)
zeta       Craig's Zeta; requires group_by; params.group_a and params.group_b are either specified or inferred
reduce     Dimensionality reduction (default PCA); params.n_components
cluster    Hierarchical clustering (default Ward); params.n_clusters, params.linkage
consensus  Bootstrap consensus tree; params.mfw_bands, params.replicates
classify   sklearn classifier; params.estimator, cv.kind, cv.folds
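
The classify row is the only one that names keys outside params, so a sketch of its shape may help. The key layout (estimator under params, cv as a sibling of params) follows the table. The values logreg and stratified are hypothetical placeholders, since this page does not list the accepted estimator or cross-validation names. Using group_by as the label column is also an assumption, made by analogy with delta.

```yaml
methods:
  - id: clf                 # arbitrary id chosen for this sketch
    kind: classify
    features: mfw200        # id of a feature extractor defined under features
    group_by: author        # assumed: label column, by analogy with delta
    params:
      estimator: logreg     # hypothetical value; accepted names are not listed here
    cv:
      kind: stratified      # hypothetical value
      folds: 5
```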

output

output:
  dir: results          # default
  timestamp: true       # wrap runs in timestamped subdirectories

cache

cache:
  dir: .tamga/cache     # spaCy DocBin cache location

preprocess

preprocess:
  spacy:
    model: en_core_web_trf    # default; change to sm/md for speed
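
As a sketch of the speed trade-off mentioned in the comment above, a study could select the small English pipeline instead of the transformer default. en_core_web_sm is a standard spaCy package name. The sketch assumes the package is installed and that tamga accepts any installed pipeline name in this field.

```yaml
preprocess:
  spacy:
    model: en_core_web_sm   # smaller and faster than the en_core_web_trf default
```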

A realistic multi-method example

name: federalist
seed: 42
output: { dir: results, timestamp: false }

corpus:
  path: corpus
  metadata: corpus/metadata.tsv
  filter:
    role: [train]

features:
  - id: mfw200
    type: mfw
    n: 200
    scale: zscore
    lowercase: true

methods:
  - id: burrows
    kind: delta
    method: burrows
    features: mfw200
    group_by: author

  - id: pca
    kind: reduce
    features: mfw200
    params: { n_components: 2 }

  - id: ward
    kind: cluster
    features: mfw200
    params: { n_clusters: 3, linkage: ward }

  - id: zeta_h_m
    kind: zeta
    group_by: author
    params:
      top_k: 50
      group_a: Hamilton
      group_b: Madison