Verification

Authorship verification is a one-class decision: did a particular candidate produce this questioned document? Real casework rarely offers a closed set of candidates, so verification, unlike authorship attribution, is the forensically standard task.

tamga provides two complementary verifiers.

General Impostors

Use when: you have one questioned document, a candidate's known documents, and a pool of ~100+ impostors drawn from other authors, and you need to answer the forensically standard same-author-or-not question. Do not use when: you have no impostor pool, or the candidate's known texts total fewer than ~1000 words (the test becomes dependent on sample size). Expected result: a score in [0, 1]; calibrate it with CalibratedScorer before reporting it as a likelihood ratio.

Koppel & Winter (2014). Given a questioned document Q, the candidate's known documents K, and an impostor pool I drawn from other authors, iteratively:

A random feature subspace is sampled.
m impostors are sampled from the pool.
Q is checked for being closer to K than to any of the sampled impostors.

The fraction of winning iterations is the verification score in [0, 1].
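The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not tamga's implementation: the function name `general_impostors_score`, the list-of-floats vector representation, and the zero-norm handling in `cosine` are all hypothetical. It uses the centroid of K, cosine similarity, the strict comparison described under tie-breaking, and `ceil(sqrt(pool_size))` impostors per iteration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity; returns 0.0 for a zero vector (an assumption of this sketch)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def general_impostors_score(q, known, impostors, n_iterations=100,
                            feature_rate=0.5, seed=42):
    """Fraction of iterations in which Q is closer to K than to every sampled impostor."""
    rng = random.Random(seed)
    n_features = len(q)
    n_keep = max(1, round(feature_rate * n_features))
    m = math.ceil(math.sqrt(len(impostors)))
    # Aggregate K into a single comparison point (the "centroid" option).
    centroid = [sum(col) / len(known) for col in zip(*known)]
    wins = 0
    for _ in range(n_iterations):
        cols = rng.sample(range(n_features), n_keep)   # random feature subspace
        sampled = rng.sample(impostors, m)             # m impostors from the pool
        q_sub = [q[c] for c in cols]
        sim_known = cosine(q_sub, [centroid[c] for c in cols])
        best_impostor = max(cosine(q_sub, [imp[c] for c in cols]) for imp in sampled)
        if sim_known > best_impostor:                  # strict: ties count as losses
            wins += 1
    return wins / n_iterations


# Toy data: Q points in the same direction as the known documents.
q = [1.0, 2.0, 3.0, 4.0]
known = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]]
pool = [[4.0, 3.0, 2.0, 1.0], [1.0, -1.0, 1.0, -1.0],
        [4.0, 1.0, 1.0, 1.0], [0.0, 5.0, 0.0, 5.0]]
print(general_impostors_score(q, known, pool))
```

Real inputs would be rows of a shared feature matrix rather than four-dimensional toy vectors.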

from tamga.features import MFWExtractor
from tamga.forensic import GeneralImpostors

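# Build features on the pooled corpus so that Q, K, and the impostors
# share a single vocabulary.
# Note: slice_by_ids is not defined in this snippet; it is assumed to be a
# user-supplied helper that selects rows of the feature matrix by document id.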
fm = MFWExtractor(n=200, scale="zscore", lowercase=True).fit_transform(pooled_corpus)
q_fm      = slice_by_ids(fm, ["questioned"])
known_fm  = slice_by_ids(fm, known_doc_ids)
impostors = slice_by_ids(fm, impostor_doc_ids)

gi = GeneralImpostors(n_iterations=100, feature_subsample_rate=0.5, seed=42)
result = gi.verify(questioned=q_fm, known=known_fm, impostors=impostors)
result.values["score"]       # in [0, 1]
result.values["wins"]        # raw win count

Parameters

Parameter	Default	Description
`n_iterations`	100	Number of random-subspace + impostor-sampling iterations
`feature_subsample_rate`	0.5	Fraction of features sampled per iteration
`impostor_sample_size`	`ceil(sqrt(pool_size))`	Impostors per iteration; scales sub-linearly so that large pools do not trivialize the test
`similarity`	`"cosine"`	`"cosine"` (real-valued features) or `"minmax"` (non-negative features only)
`aggregate`	`"centroid"`	`"centroid"` (mean of K) or `"nearest"` (most similar known document; the conservative option under within-author stylistic heterogeneity)
`seed`	42	RNG seed (feature + impostor sampling)
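Two of these defaults are easy to check by hand. The sketch below computes the default per-iteration impostor count and the MinMax similarity, sum(min(u, v)) / sum(max(u, v)). The function names are hypothetical and the error message is illustrative; only the formulas and the non-negativity requirement come from this page. One consequence worth noting: z-scored features take negative values, so they are not suitable for `"minmax"`.

```python
import math


def default_impostor_sample_size(pool_size):
    """ceil(sqrt(pool_size)): grows sub-linearly with the impostor pool."""
    return math.ceil(math.sqrt(pool_size))


def minmax_similarity(u, v):
    """MinMax similarity for non-negative vectors (e.g. raw relative frequencies)."""
    if any(x < 0 for x in u) or any(x < 0 for x in v):
        raise ValueError("minmax similarity requires non-negative features")
    denominator = sum(max(a, b) for a, b in zip(u, v))
    if denominator == 0:
        return 0.0  # both vectors all-zero; this return value is an assumption
    return sum(min(a, b) for a, b in zip(u, v)) / denominator


for size in (2, 100, 101, 400):
    print(size, default_impostor_sample_size(size))

print(minmax_similarity([0.25, 0.5], [0.5, 0.25]))
```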

Ties

Ties are broken in favor of the impostors (strict >). If Q is equally close to K and to an impostor, the iteration counts as a loss, which is the forensically conservative choice.
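A minimal sketch of that rule, assuming the per-iteration similarities have already been computed (the helper name `iteration_wins` is hypothetical):

```python
def iteration_wins(sim_known, impostor_sims):
    """True only if Q is strictly more similar to K than to every sampled impostor."""
    return all(sim_known > s for s in impostor_sims)


print(iteration_wins(0.9, [0.5, 0.8]))  # K strictly ahead of both impostors
print(iteration_wins(0.9, [0.5, 0.9]))  # tie with one impostor
```

With a non-strict comparison the second case would count as a win, so the strict form can only lower the score.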

Unmasking

Use when: you have long prose candidates (novel chapters, long essays, blog archives) and want a verification that makes no distributional assumptions; the accuracy-degradation curve is itself an interpretable chain of evidence. Do not use when: your documents are short (<~1500 words per side); Unmasking needs enough chunks to run cross-validation meaningfully. Expected result: an accuracy curve across elimination rounds; same-author pairs show a steep drop, while different-author pairs stay at or above chance level.

Koppel & Schler (2004). A verification method for long texts that makes no distributional assumptions. Q and K are split into word-window chunks; then, iteratively:

A binary classifier is trained to distinguish Q chunks from K chunks.
CV accuracy is measured.
The top-N features most discriminating for Q and the top-N most discriminating for K are removed (2 × N per round, following Koppel & Schler).
The process repeats.

Same-author documents are stylistically similar: once a few superficial differences are removed, the classifier collapses quickly (large drop). Different-author documents keep supplying discriminating features, so accuracy stays high (small drop).
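The elimination step and the summary statistic can be illustrated without a classifier. The sketch below assumes a linear model whose positive weights point toward Q and whose negative weights point toward K, matching the coefficient convention in the reference section; the function names and the toy weights are hypothetical, and this is not tamga's code.

```python
def features_to_eliminate(weights, n_eliminate):
    """Pick the N strongest Q-side (positive) and N strongest K-side (negative) features."""
    positive = sorted((name for name, w in weights.items() if w > 0),
                      key=lambda name: weights[name], reverse=True)
    negative = sorted((name for name, w in weights.items() if w < 0),
                      key=lambda name: weights[name])
    # Up to 2 * N names; fewer if one side has fewer than N signed weights.
    return positive[:n_eliminate] + negative[:n_eliminate]


def accuracy_drop(curve):
    """Scalar summary of the degradation curve: first round minus last round."""
    return curve[0] - curve[-1]


# Toy classifier weights for one round.
weights = {"the": 0.9, "of": 0.4, "so": 0.05, "and": -0.7, "but": -0.1}
print(features_to_eliminate(weights, n_eliminate=2))
print(accuracy_drop([0.82, 0.79, 0.70, 0.58, 0.55]))
```

In a full run, the selected features would be dropped from the matrix before the classifier is retrained for the next round.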

from tamga.features import MFWExtractor
from tamga.forensic import Unmasking

unmasking = Unmasking(chunk_size=500, n_rounds=10, n_eliminate=3, seed=42)
result = unmasking.verify(
    questioned=questioned_text,            # str, Document, or Corpus
    known=known_text,
    extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
)
result.values["accuracy_curve"]    # list[float], length n_rounds
result.values["accuracy_drop"]     # scalar summary (curve[0] - curve[-1])
result.values["eliminated_per_round"]   # auditable record of features removed in each round

Which to choose

Situation	Choice
Short CMC / threat texts (< ~2000 words total)	`GeneralImpostors`. Unmasking needs more text on each side to run CV meaningfully.
Long prose (novel, essay, blog archive)	`Unmasking`: the accuracy-degradation curve is directly interpretable. Combine with GI as a second opinion.
Preparing an evidentiary report	Run both and calibrate both with `CalibratedScorer`. Agreement between the two methods is itself an evidentiary signal (Juola-style multi-method decision).

Reference

GeneralImpostors

Authorship verification via the General Impostors method.

Parameters:

Name	Type	Description	Default
`n_iterations`	`int`	Number of random feature-subspace + impostor-subsample iterations (Koppel & Winter 2014 used 100; PAN baselines typically use 50-200).	`100`
`feature_subsample_rate`	`float`	Fraction of feature columns sampled per iteration, in (0, 1]. 0.5 is the classical default; smaller values increase variance but decorrelate impostor rankings.	`0.5`
`impostor_sample_size`	`int or None`	Number of impostors to sample per iteration. If None, uses `ceil(sqrt(n_impostors))` (a defensible default that scales sub-linearly with the pool size, so a large pool does not make every iteration trivially win).	`None`
`similarity`	`('cosine', 'minmax')`	Similarity function used to compare the questioned document to the candidate and to each impostor. `cosine`: dot(u, v) / (||u|| * ||v||); works for any real-valued features. `minmax`: sum(min(u, v)) / sum(max(u, v)); Koppel et al.'s MinMax similarity, which requires non-negative features (e.g., raw relative frequencies). Raises if any feature in Q/known/impostors is negative.	`"cosine"`
`aggregate`	`('centroid', 'nearest')`	How to build a single comparison point from the candidate's known documents: `centroid`: compare Q to the mean vector of the known samples (simple, standard). `nearest`: compare Q to the nearest known sample in the projected subspace (more conservative under stylistic heterogeneity within an author's corpus).	`"centroid"`
`seed`	`int`	Seed for the numpy random generator used to sample features and impostors.	`42`

Examples:

>>> from tamga.forensic import GeneralImpostors
>>> gi = GeneralImpostors(n_iterations=100, seed=42)
>>> result = gi.verify(questioned=q_fm, known=known_fm, impostors=pool_fm)
>>> result.values["score"]    # fraction of iterations where known beat all impostors
0.87

verify

verify(*, questioned: FeatureMatrix, known: FeatureMatrix, impostors: FeatureMatrix) -> Result

Run the GI algorithm for one questioned document against one candidate's known set.

Parameters:

Name	Type	Description	Default
`questioned`	`FeatureMatrix`	Exactly one row (the Q document).	required
`known`	`FeatureMatrix`	The candidate author's known documents (>= 1 row).	required
`impostors`	`FeatureMatrix`	Pool of documents from other authors (>= 2 rows so each iteration can sample distinct impostors even with the smallest default `impostor_sample_size`).	required

Returns:

Type	Description
`Result`	With `values["score"]` in [0, 1] (higher = more likely same author), `values["wins"]` raw iteration-win count, and sampling counts in `params`.

Notes

All three FeatureMatrix inputs must share the same feature space — i.e., identical feature_names in the same order. Callers that build features independently should use a single fit_transform on the pooled corpus and then slice by document id.
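A pre-flight check along these lines can catch a mismatched feature space before calling `verify`. This is only a sketch: `Matrix` is a hypothetical stand-in for `FeatureMatrix`, and the only attribute assumed from the note above is `feature_names`.

```python
from dataclasses import dataclass, field


@dataclass
class Matrix:
    """Hypothetical stand-in for a feature matrix; not tamga's class."""
    feature_names: list
    rows: list = field(default_factory=list)


def assert_same_feature_space(*matrices):
    """Raise ValueError unless all matrices list identical features in the same order."""
    reference = list(matrices[0].feature_names)
    for index, matrix in enumerate(matrices[1:], start=1):
        if list(matrix.feature_names) != reference:
            raise ValueError(f"matrix {index} does not share the reference feature space")


q = Matrix(["the", "of", "and"])
known = Matrix(["the", "of", "and"])
shuffled = Matrix(["of", "the", "and"])

assert_same_feature_space(q, known)          # passes silently
try:
    assert_same_feature_space(q, shuffled)   # same names, different order
except ValueError as err:
    print(err)
```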

Unmasking

Koppel & Schler (2004) Unmasking for authorship verification.

Parameters:

Name	Type	Description	Default
`chunk_size`	`int`	Words per chunk. 500 is the common default in the literature.	`500`
`n_rounds`	`int`	Number of iteration rounds (each round eliminates `n_eliminate` features and retrains). 10 is the standard setting.	`10`
`n_eliminate`	`int`	Per-class top-N features to remove each round, following Koppel & Schler 2004: the N most Q-discriminating (top positive coefficients) and the N most K-discriminating (top negative coefficients), so up to 2 * N features drop per round. Default 3 (6 features per round), matching the classical literature setup.	`3`
`n_folds`	`int`	CV folds for accuracy estimation per round. Must be >= 2 and <= min(#Q chunks, #K chunks).	`10`
`min_chunks_per_class`	`int`	Minimum chunks required on each side before Unmasking is meaningful. Raises ValueError if either side falls below this threshold.	`3`
`seed`	`int`	Seed for the CV split's random state.	`42`
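The interaction between `chunk_size`, `min_chunks_per_class`, and `n_folds` can be checked before a run. The sketch below is illustrative only and rests on assumptions this page does not confirm: whitespace tokenization, dropping a trailing partial chunk, and raising (rather than clamping) when `n_folds` is out of range. The function names are hypothetical.

```python
def chunk_words(text, chunk_size=500):
    """Split text into consecutive chunks of chunk_size words (partial tail dropped)."""
    words = text.split()
    n_full = len(words) // chunk_size
    return [words[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]


def validate_chunks(q_chunks, k_chunks, n_folds=10, min_chunks_per_class=3):
    """Check the documented constraints; return the smaller chunk count."""
    smallest = min(len(q_chunks), len(k_chunks))
    if smallest < min_chunks_per_class:
        raise ValueError(f"need at least {min_chunks_per_class} chunks per side, got {smallest}")
    if not 2 <= n_folds <= smallest:
        raise ValueError(f"n_folds must be between 2 and {smallest}, got {n_folds}")
    return smallest


q_chunks = chunk_words("word " * 2600)   # 2600 words at 500 per chunk
k_chunks = chunk_words("word " * 1600)   # 1600 words at 500 per chunk
print(len(q_chunks), len(k_chunks))
print(validate_chunks(q_chunks, k_chunks, n_folds=3))
```

Under these assumptions, texts near the ~1500-word floor yield only three chunks per side, so a fold count as large as the default of 10 would not satisfy the stated bound.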

Examples:

>>> from tamga.features import MFWExtractor
>>> from tamga.forensic import Unmasking
>>> unmasking = Unmasking(chunk_size=500, n_rounds=10, seed=42)
>>> result = unmasking.verify(
...     questioned=questioned_text,
...     known=known_text,
...     extractor=MFWExtractor(n=200, scale="zscore", lowercase=True),
... )
>>> result.values["accuracy_curve"]
[0.82, 0.79, 0.70, 0.58, 0.55, ...]
>>> result.values["accuracy_drop"]
0.27  # large drop = same author; small drop = different author

verify

verify(*, questioned: Corpus | Document | str, known: Corpus | Document | str, extractor: BaseFeatureExtractor) -> Result

Run Unmasking and return the accuracy-degradation curve plus summary stats.

Parameters:

Name	Type	Description	Default
`questioned`	`Corpus, Document, or str`	The questioned text.	required
`known`	`Corpus, Document, or str`	The candidate author's known text.	required
`extractor`	`BaseFeatureExtractor`	Any tamga feature extractor. Fit on the combined chunks, so the feature space is the union of terms seen in Q and K.	required

Returns:

Type	Description
`Result`	`values["accuracy_curve"]` (list of CV accuracies per round, length `n_rounds`), `values["accuracy_drop"]` (accuracy at round 0 minus accuracy at final round), `values["n_q_chunks"]`, `values["n_k_chunks"]`.
