Corpus¶
A Corpus is a list of Documents plus (implicitly) their metadata. Both are dataclasses
defined in tamga.corpus.
Building a Corpus¶
From the filesystem¶
metadata.tsv is a tab-separated file where the first column is filename (matched
against the basenames in corpus/) and every other column becomes Document.metadata:
filename author year genre
fed_01.txt Hamilton 1787 political essay
fed_10.txt Madison 1787 political essay
fed_50.txt Unknown 1788 political essay
In Python¶
from tamga.io import load_corpus
corpus = load_corpus("corpus/", metadata="corpus/metadata.tsv", strict=True)
strict=True(default) raises if any document is missing a metadata row.strict=Falseallows partial coverage — useful when a new unlabeled document arrives.
Programmatically¶
from tamga.corpus import Corpus, Document
corpus = Corpus(documents=[
Document(id="q", text=q_text, metadata={"role": "questioned"}),
Document(id="k1", text=k1_text, metadata={"author": "Alice"}),
Document(id="k2", text=k2_text, metadata={"author": "Alice"}),
])
Filtering and grouping¶
Corpus.filter(**query) returns a new Corpus with only documents matching every
key-value pair:
A value can be a list to match any:
Corpus.groupby(field) returns a dict of sub-corpora:
grouped = corpus.groupby("author")
# {"Hamilton": Corpus(...), "Madison": Corpus(...), "Jay": Corpus(...)}
Hashing¶
Corpus.hash() produces a stable SHA-256 derived from the sorted SHA-256 hashes of each
document's text plus sorted metadata entries. This hash ends up on every Provenance
record, so two studies with the same corpus share a hash regardless of filesystem path,
run time, or document ordering in the input directory.
Order sensitivity
Corpus.hash() is order-invariant by design (same texts + metadata = same hash). If
you need an order-sensitive hash (e.g., to detect row-reorderings that affect feature
matrix layout), hash [d.id for d in corpus.documents] separately.
Next¶
- Features — turn the corpus into a numeric matrix.