Turkish stylometry walkthrough¶
A complete runnable example: attribute a small Turkish short-story corpus using MFW + Ateşman readability + Burrows Delta. The stories are public-domain Ömer Seyfettin texts from Turkish Wikisource.
Setup¶
uv pip install 'tamga[turkish]'
python -c "import stanza; stanza.download('tr')"
tamga init seyfettin --language tr
cd seyfettin
This scaffolds a project directory with study.yaml pre-configured for Turkish. Confirm
with:
The language row shows tr.
Corpus¶
Place 3-5 Turkish short stories in corpus/ as UTF-8 .txt files. A good public-domain
source is Ömer Seyfettin on Turkish
Wikisource — dozens of early-20th-
century short stories are already transcribed there.
Add a corpus/metadata.tsv:
filename author year
bomba.txt Omer_Seyfettin 1910
kesik_biyik.txt Omer_Seyfettin 1911
forsa.txt Omer_Seyfettin 1913
pembe_incili_kaftan.txt Omer_Seyfettin 1917
A real study would include several authors. For a single-author demo, pair Seyfettin with a few stories by Refik Halit Karay (also public domain) to give Delta something to discriminate.
Running the study¶
tamga ingest corpus/ --language tr --metadata corpus/metadata.tsv
tamga run study.yaml --name first-run
tamga ingest runs Stanza through spacy-stanza. The first run parses every document and
caches the DocBins; subsequent runs hit the cache and finish in seconds.
What you get¶
A default Turkish study computes:
- MFW (top 1000 tokens, z-scored relative frequencies)
- Turkish function words — loaded from
resources/languages/tr/function_words.txt, derived from UD Turkish BOUN closed-class tokens - Ateşman and Bezirci-Yılmaz readability indices
- Burrows Delta + PCA/MDS reduction plots
The output folder results/first-run/ contains:
result.jsonwith Delta scores and provenancetable_*.parquetfeature matrices- PNG / PDF figures (distance heatmap, PCA scatter)
- a
provenance.jsonthat records the corpus hash, seed, and full resolved config
Customising¶
Edit study.yaml to swap features or methods. For example, to use contextual embeddings
instead of MFW:
features:
- id: bert_tr
type: contextual_embedding
# model auto-resolves to `dbmdz/bert-base-turkish-cased` via the language registry
pool: mean
For a heavier Turkish encoder, point model: at any HuggingFace checkpoint:
Notes on Turkish specifics¶
- Morphology. Turkish is agglutinative; a token like
evlerinizdenpacksev+ler+iniz+deninto one form. Stanza's Turkish BOUN model lemmatises and tags these correctly, which matters for POS-ngram and dependency-based features. - Syllable counting. Both Ateşman and Bezirci-Yılmaz count syllables using a
vowel-counter specialised for Turkish orthography (including
ı,ğ,ş,ç,ü,ö). - Function words. The bundled list leans on Turkish's closed-class postpositions,
conjunctions, and discourse particles (e.g.
ile,ancak,fakat,çünkü,ki,ise).
Troubleshooting¶
ModuleNotFoundError: No module named 'spacy_stanza'— runuv pip install 'tamga[turkish]'.FileNotFoundError: ... stanza_resources/tr/default.zip— runpython -c "import stanza; stanza.download('tr')". The model is about 600 MB.- Very slow first ingest on MPS. Stanza's Turkish model does not yet support Apple Silicon MPS. Expect CPU parse rates on first run; subsequent runs are cache hits.