Architecture
How the corpus is modelled, and how cross-references work across translations.
The unified addressing model
Every text in Open Scriptorium — the Hebrew Bible, the Greek New Testament, the Septuagint, the apocrypha, Josephus, Philo, the Gospel of Thomas — lives in the same data model. There is no separate “scripture table” and “commentary table.” Instead, three layers cooperate:
- Works are concrete documents: a translation, a source-language edition, a commentary, a lexicon. Each work has a license, a language, one or more traditions, and a list of work units.
- Work units are the smallest addressable piece of any work. For scripture this is usually a verse; for Loeb commentary it might be a numbered section; for the Gospel of Thomas it’s a logion. Every work unit carries the actual text body.
- Canonical refs are the abstract concept of “Genesis 1:1.” Translations and editions of the same passage all align to the same canonical ref. This is what makes parallel reading and cross-referencing possible.
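As a rough illustration of how the three layers fit together, here is a minimal in-memory sketch. The struct names mirror the concepts above, but the fields (`slug`, `work_key`) and the `parallel_texts` helper are assumptions made for the example, not the project's actual schema.

```ruby
# Hypothetical in-memory stand-ins for the three layers. Field names are
# illustrative only; the real tables carry more columns (license, traditions, ...).
CanonicalRef = Struct.new(:work_key, :hierarchy, keyword_init: true)
WorkUnit     = Struct.new(:canonical_ref, :body, keyword_init: true)
Work         = Struct.new(:slug, :language, :work_units, keyword_init: true)

GEN_1_1 = CanonicalRef.new(work_key: "genesis", hierarchy: [1, 1])

BSB = Work.new(slug: "bsb", language: "en", work_units: [
  WorkUnit.new(canonical_ref: GEN_1_1,
               body: "In the beginning God created the heavens and the earth.")
])
RAHLFS = Work.new(slug: "rahlfs-lxx", language: "grc", work_units: [
  WorkUnit.new(canonical_ref: GEN_1_1,
               body: "ἐν ἀρχῇ ἐποίησεν ὁ θεὸς τὸν οὐρανὸν καὶ τὴν γῆν")
])

# Parallel reading: collect each work's text for one canonical ref.
# A work with no unit for that ref maps to nil.
def parallel_texts(works, ref)
  works.to_h do |work|
    unit = work.work_units.find { |u| u.canonical_ref == ref }
    [work.slug, unit&.body]
  end
end

PARALLEL = parallel_texts([BSB, RAHLFS], GEN_1_1)
```

Because both work units point at the same canonical ref, lining them up needs no per-translation mapping.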
Cross-references at the canonical layer
Because canonical refs are the shared addressing scheme, cross-references live there too, not at the work level. A statement like “Jude 1:14 quotes 1 Enoch 1:9” is one row, language-independent and edition-independent. It applies to every translation of Jude and every translation of Enoch automatically.
Each cross-reference stores its source and target as Postgres
int4range values over the
sequential ord column on canonical refs. With a GIST index,
asking “what cross-references touch this verse?”
is a single indexed lookup — whether the matching reference points
at one verse, a chapter, or a whole book.
The corpus currently carries 682,741 cross-references
across three datasets: BSB inline references (1,139), OpenBible.info
(344,798), and Treasury of Scripture Knowledge (336,804). Each dataset
is tagged with a source_dataset string, and each
cross-reference row carries a note column for editorial
commentary. The reader footer exposes a cross-reference picker that
lets users opt into which datasets to display; the default is off, so
the reading experience stays clean unless you want the references.
The overlap query source_range && int4range(N, N+1)
finds every cross-reference touching canonical ord N, including ones
whose range covers an entire chapter or book.
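The overlap semantics can be mimicked in plain Ruby. This is only a sketch of what the `&&` operator computes on half-open integer ranges: the ords and dataset labels below are invented, and the linear scan stands in for what the GIST index does inside Postgres.

```ruby
CrossRef = Struct.new(:source_range, :target_range, :source_dataset, keyword_init: true)

# Two half-open ranges [lo, hi) overlap when each starts before the other ends.
def overlaps?(a, b)
  a.begin < b.end && b.begin < a.end
end

# Equivalent in spirit to: source_range && int4range(ord, ord + 1)
def touching(cross_refs, ord)
  probe = (ord...ord + 1)
  cross_refs.select { |xr| overlaps?(xr.source_range, probe) }
end

# Invented sample rows (ords and labels are placeholders).
SAMPLE_CROSS_REFS = [
  CrossRef.new(source_range: (10...11), target_range: (500...501), source_dataset: "example-verse"),
  CrossRef.new(source_range: (1...32),  target_range: (900...905), source_dataset: "example-chapter"),
  CrossRef.new(source_range: (40...41), target_range: (100...101), source_dataset: "example-unrelated")
].freeze

HITS = touching(SAMPLE_CROSS_REFS, 10).map(&:source_dataset)
```

Here `HITS` contains both the single-verse row and the chapter-wide row, since each of their ranges overlaps `[10, 11)`.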
Versification — one canonical work per scheme
Different traditions disagree about verse numbering. The Masoretic Text counts Psalm superscriptions as verse 1; the Septuagint combines what the MT splits into Psalms 9 and 10; Jeremiah is shorter and reordered in the LXX; Catholic and Orthodox canons structure several books distinctly. We don’t pretend these differences don’t exist, and we don’t invent ad-hoc mappings.
Each canonical work is tagged with a versification scheme
drawn from the schemes documented by TVTMS (see below):
KJV, MT, LXX, Vulgate,
and a few specific sub-traditions where they matter. For most books
every translation lines up verse-by-verse on the same canonical work
(a BSB Genesis 1:1 work_unit and a Rahlfs Genesis 1:1 work_unit both
point at the same canonical_ref). For books where the schemes
disagree we create a separate canonical_work per scheme:
- psalms: KJV scheme, used by BSB, KJV, WLC, Robinson-Pierpont
- psalms-lxx: LXX scheme, used by Rahlfs, Brenton
- psalms-german: German scheme, used by Lutherbibel 1912
- jeremiah-lxx, job-lxx: LXX scheme, reordered / shorter
- daniel-theodotion, bel-and-the-dragon-theodotion, susanna-theodotion: alternative Greek recensions
- joshua-vaticanus-b, judges-vaticanus-b, tobit-sinaiticus: alternative manuscript recensions
- 3-john-german, revelation-german: German verse-numbering divergences (Lutherbibel 1912)
German-scheme canonical works exist because the Lutherbibel tradition
has verse-numbering divergences that don’t map to any of the
standard TVTMS schemes. A local TVTMS overlay file
(db/data/tvtms_openscriptorium_overlay.tsv) supplies the
German → KJV verse mappings for these books,
processed by the same TVTMS importer that handles the upstream data.
Each scheme-specific canonical work shares a book_key with its primary
counterpart (psalms-lxx and psalms both
have book_key="psalms"), so the URL routing
(/bsb/psalms/9 vs /rahlfs-lxx/psalms/9)
resolves to the canonical work in which the requested work has the
most coverage. The user gets clean URLs; the schema preserves
the academic distinction.
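The routing rule might look something like the following sketch. The `coverage` hash (work-unit counts per work and canonical work) and the `resolve_canonical_work` helper are hypothetical, and the counts are made up.

```ruby
CanonicalWork = Struct.new(:slug, :book_key, :scheme, keyword_init: true)

CANONICAL_WORKS = [
  CanonicalWork.new(slug: "psalms",        book_key: "psalms", scheme: "KJV"),
  CanonicalWork.new(slug: "psalms-lxx",    book_key: "psalms", scheme: "LXX"),
  CanonicalWork.new(slug: "psalms-german", book_key: "psalms", scheme: "German")
].freeze

# Made-up counts: how many work units a work has in each canonical work.
COVERAGE = {
  ["bsb", "psalms"]            => 100,
  ["rahlfs-lxx", "psalms-lxx"] => 100
}.freeze

# Among canonical works sharing the book_key, pick the one where the
# requested work has the most work units. Returns nil for an unknown book_key.
def resolve_canonical_work(canonical_works, coverage, work_slug, book_key)
  canonical_works
    .select { |cw| cw.book_key == book_key }
    .max_by { |cw| coverage.fetch([work_slug, cw.slug], 0) }
end
```

With this data, `/bsb/psalms/9` would resolve to `psalms` and `/rahlfs-lxx/psalms/9` to `psalms-lxx`.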
Cross-scheme verse mappings — versification_mappings
To put a Rahlfs verse next to a BSB verse on a parallel reader, we
need to know that LXX Psalm 9:22 is the same passage as
MT Psalm 10:1. That mapping data lives in the
versification_mappings table, with one row per
cross-scheme equivalence:
from_scheme: LXX
from_canonical_ref: psalms-lxx 9:22
to_scheme: KJV
to_canonical_ref: psalms 10:1
mapping_type: exact
The data comes from the
TVTMS
(Translators Versification Traditions with Methodology for
Standardisation) file maintained by Tyndale House Cambridge
and STEPBible.org under CC BY 4.0. TVTMS is the
most thorough public cross-walk of verse numbering across English,
Hebrew, Latin, and Greek traditions; we download it at import time
and populate versification_mappings from its
Expanded section. Attribution is recorded on every
import_run row the importer creates.
Each TVTMS row has the shape (SourceType, SourceRef, StandardRef,
Action): “in tradition X, the verse numbered Y is the
same passage as KJV verse Z, via this kind of action.” The
importer maps each Action string to a
mapping_type on our side — exact,
split, merge, or missing —
so the parallel reader can render splits, merges, and gaps without
silently fudging anything.
How the importer handles Greek conventions
TVTMS encodes several Greek traditions because real-world LXX texts disagree about whether the title of a Psalm counts as verse 1. The two most relevant labels:
- Greek/Latin+Greek: title-as-v1 convention (LXX Ps 9 has 39 verses, with the title numbered 9:1). This is what Rahlfs’s 1935 print edition does, what the CCAT digital encoding does, and what the Eliran Wong repository we import from preserves.
- Greek2: title-merged convention (LXX Ps 9 has 38 verses; what the title would be numbered is collapsed into v1). This is what NETS and Brenton typically use.
The importer does not filter rows by SourceType label.
Instead, each TVTMS row carries a Tests column with
conditional expressions like
Psa.3:9=Last & Psa.3:TextBeforeV1=NotExist. The
importer evaluates these tests against each scheme’s actual
verse structure: a row produces mappings only for schemes whose
data satisfies its conditions. This means the Greek/Greek2
convention difference is handled implicitly — a row designed
for title-as-v1 texts will have tests that naturally fail against
any scheme with a different psalm structure.
The practical consequence: if we ever import a NETS-style LXX, we give it its own scheme tag and the same importer will produce correct mappings automatically, because the tests will evaluate differently against that scheme’s verse data.
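To make the test-driven selection concrete, here is a minimal sketch of evaluating a Tests expression against a scheme's verse structure. It handles only the two test forms quoted above, and it assumes that `N=Last` means "verse N is the final verse of the chapter" and that `TextBeforeV1=NotExist` means "no unnumbered text precedes verse 1". The structure hashes and method names are hypothetical.

```ruby
# Supports only the two forms quoted in the text; real TVTMS rows use more.
TEST_FORM = /\A(?<book>\w+)\.(?<chapter>\d+):(?<subject>\w+)=(?<predicate>\w+)\z/

def test_passes?(expression, structure)
  m = TEST_FORM.match(expression.strip)
  raise ArgumentError, "unsupported test: #{expression}" unless m

  chapter = structure[[m[:book], m[:chapter].to_i]]
  return false unless chapter # the scheme has no such chapter

  if m[:predicate] == "Last" && m[:subject].match?(/\A\d+\z/)
    chapter[:last_verse] == m[:subject].to_i
  elsif m[:subject] == "TextBeforeV1" && m[:predicate] == "NotExist"
    !chapter[:text_before_v1]
  else
    raise ArgumentError, "unsupported test: #{expression}"
  end
end

# A row applies to a scheme only if every &-joined test passes.
def row_applies?(tests, structure)
  tests.split("&").all? { |expression| test_passes?(expression, structure) }
end

# Illustrative structures: Psalm 3 with the title numbered as verse 1
# (9 verses) versus the title left unnumbered (8 verses).
TITLE_AS_V1      = { ["Psa", 3] => { last_verse: 9, text_before_v1: false } }.freeze
TITLE_UNNUMBERED = { ["Psa", 3] => { last_verse: 8, text_before_v1: true } }.freeze
ROW_TESTS = "Psa.3:9=Last & Psa.3:TextBeforeV1=NotExist"
```

The same row yields mappings for the first structure and none for the second, which is the implicit filtering described above.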
Looking up equivalences — CanonicalRef#equivalent_in
The parallel reader doesn’t join across the mappings table
inline; it asks
canonical_ref.equivalent_in(target_scheme) and gets
back an array of equivalent canonical_refs. The helper
handles all four cases explicitly:
- identity: same scheme as the source; returns [self].
- explicit mapping (forward or reverse): returns the mapped ref(s); split/merge are handled by returning multiple refs or by repeating one ref across N rows.
- missing: the verse does not exist in the target tradition; returns [] and the parallel column shows an em-dash.
- identity fallback: no mapping row exists at all; assumes identity (most verses agree across schemes), looking up the same hierarchy in a target-scheme canonical_work that shares this book_key, or falling through to self if no scheme-specific work exists for the book.
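The four cases can be sketched in plain Ruby as follows. This is not the actual `CanonicalRef#equivalent_in` implementation: the structs, the choice to store a "missing" mapping with a nil `to_ref`, and the sample rows (apart from the Psalm 9:22 / 10:1 pair quoted earlier) are assumptions. Repeating one ref across N rows for merges is not modelled.

```ruby
SchemeRef = Struct.new(:scheme, :book_key, :hierarchy, keyword_init: true)
# Assumption: a "missing" mapping is stored with to_ref = nil.
SchemeMapping = Struct.new(:from_ref, :to_scheme, :to_ref, :mapping_type, keyword_init: true)

def equivalent_in(ref, target_scheme, mappings, known_refs)
  # Case 1, identity: already in the target scheme.
  return [ref] if ref.scheme == target_scheme

  # Cases 2 and 3, explicit mapping (forward or reverse), including "missing".
  forward = mappings.select { |m| m.from_ref == ref && m.to_scheme == target_scheme }
  reverse = mappings.select { |m| m.to_ref == ref && m.from_ref.scheme == target_scheme }
  unless forward.empty? && reverse.empty?
    return (forward.map(&:to_ref) + reverse.map(&:from_ref)).compact.uniq
  end

  # Case 4, identity fallback: same book_key and hierarchy in the target
  # scheme, or the ref itself when no scheme-specific twin exists.
  twin = known_refs.find do |r|
    r.scheme == target_scheme && r.book_key == ref.book_key && r.hierarchy == ref.hierarchy
  end
  [twin || ref]
end

LXX_9_22  = SchemeRef.new(scheme: "LXX", book_key: "psalms", hierarchy: [9, 22])
KJV_10_1  = SchemeRef.new(scheme: "KJV", book_key: "psalms", hierarchy: [10, 1])
LXX_1_1   = SchemeRef.new(scheme: "LXX", book_key: "psalms", hierarchy: [1, 1])
KJV_1_1   = SchemeRef.new(scheme: "KJV", book_key: "psalms", hierarchy: [1, 1])
LXX_151_1 = SchemeRef.new(scheme: "LXX", book_key: "psalms", hierarchy: [151, 1])

SAMPLE_MAPPINGS = [
  SchemeMapping.new(from_ref: LXX_9_22, to_scheme: "KJV", to_ref: KJV_10_1, mapping_type: "exact"),
  # Illustrative "missing" row; whether the real data encodes it this way is assumed.
  SchemeMapping.new(from_ref: LXX_151_1, to_scheme: "KJV", to_ref: nil, mapping_type: "missing")
].freeze
KNOWN_REFS = [LXX_9_22, KJV_10_1, LXX_1_1, KJV_1_1, LXX_151_1].freeze
```

Returning an array in every case lets the caller treat identity, one-to-many, and missing results uniformly.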
Tradition taxonomy
Every work carries a traditions column — a
string array (GIN-indexed) rather than a scalar — because
many texts belong to more than one tradition. Values use a
lowercase-chi (χ) prefix for Christian sub-traditions:
- χ-orthodox, χ-catholic, χ-protestant, χ-patristic: the four main Christian groupings
- jewish: Jewish tradition (WLC, JPS 1917)
- academic: non-confessional critical editions (Rahlfs LXX, Nestle 1904, Robinson-Pierpont)
A work like the BSB is tagged ["χ-protestant"];
Brenton’s English LXX is
["χ-orthodox", "academic"].
The array model lets the /works page filter by tradition without
creating join tables.
Word-level alignment
Source-language texts (Hebrew, Greek) carry word-level data: inflected surface form, lemma, Strong’s number, morphological parsing. Words in a translation can be aligned to source words via alignment groups — the same model used by the Macula Hebrew and Macula Greek projects.
The BSB currently has 411,427 alignment groups linking English words to their Hebrew or Greek source words, along with per-word Strong’s numbers and lemmas. Each alignment group is a logical unit containing some source words and some target words, supporting one-to-one, one-to-many, many-to-one, many-to-many, and null alignments uniformly. This is how interlinear-style displays and quotation matching (e.g. detecting NT quotations of the LXX) will be powered.
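A minimal sketch of the alignment-group idea, assuming a group simply holds two word lists. The struct, the `alignment_shape` helper, and the transliterated sample words are illustrative and not taken from the BSB alignment data. The shape is read as source-to-target.

```ruby
AlignmentGroup = Struct.new(:source_words, :target_words, keyword_init: true)

# Classify a group by how many words sit on each side.
# An empty side is treated as a null alignment.
def alignment_shape(group)
  s = group.source_words.size
  t = group.target_words.size
  return "null" if s.zero? || t.zero?

  "#{s == 1 ? 'one' : 'many'}-to-#{t == 1 ? 'one' : 'many'}"
end

# Illustrative groups using transliterated Hebrew from Genesis 1:1.
SAMPLE_GROUPS = [
  AlignmentGroup.new(source_words: ["bereshit"], target_words: ["In", "the", "beginning"]),
  AlignmentGroup.new(source_words: ["elohim"],   target_words: ["God"]),
  AlignmentGroup.new(source_words: ["et"],       target_words: []) # untranslated object marker
].freeze

SHAPES = SAMPLE_GROUPS.map { |g| alignment_shape(g) }
```

One record type covers every case, so display code does not need a separate path for each alignment shape.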
Sub-verse references — Gen 1:1a
Critical editions, lexica, and academic commentaries cite half-verses
constantly: Rom 5:12b, 1 Cor 11:24c,
Heb 4:14ab. The canonical layer represents these by
extending the hierarchy array. A whole verse is
[chapter, verse]; a sub-verse adds a third element where
1 means “a”, 2 means “b”, and so on. Postgres
array comparison gives the right ordering for free, so a chapter
rendered in canonical order naturally interleaves whole verses and
parts.
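Ruby's `Array#<=>` orders integer arrays the same way for this purpose (a shorter array that is a prefix of a longer one sorts first), so the ordering can be illustrated without a database. The `hierarchy_for` parser is a hypothetical helper that handles a single sub-verse letter only, not spans such as `14ab`.

```ruby
SUB_VERSE = /\A(\d+):(\d+)([a-z])?\z/

# "5:12b" -> [5, 12, 2]; "1:1" -> [1, 1]
def hierarchy_for(reference)
  m = SUB_VERSE.match(reference)
  raise ArgumentError, "unsupported reference: #{reference}" unless m

  parts = [m[1].to_i, m[2].to_i]
  parts << (m[3].ord - "a".ord + 1) if m[3] # a => 1, b => 2, ...
  parts
end

# Sorting by the hierarchy array interleaves whole verses and parts.
ORDERED = ["1:2", "1:1b", "1:1", "1:1a"].sort_by { |r| hierarchy_for(r) }
# ORDERED is ["1:1", "1:1a", "1:1b", "1:2"]
```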
Provenance — import_runs + PaperTrail
Reproducible research needs to know which import produced this
row, not just what changed. Every importer creates an
import_run row at the start of its execution, recording
the source URL, source revision (sha or mtime), byte count, options,
and the importer class. Every text row, word, lemma, cross-reference,
and apparatus reading the importer creates is stamped with that run’s
id.
PaperTrail captures field-level changes; import_runs
captures which run made them. Together they answer both
“what is the history of this verse?” and
“which version of which source did this come from?”
Stable citations — SBL
Every canonical_work carries an SBL Handbook of Style
abbreviation (Gen, Matt,
Philo, QG, Gos. Thom.) and a citation
format. Every canonical_ref caches an
sbl_citation string computed from those: Gen 1:1,
Gen 1:1a, Philo, QG 1.1,
Josephus, Ant. 1.1.1. These are stable across slug
changes and URL refactors, and are what every user-facing citation
string comes from.
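As a sketch of how such a cached string could be derived, assume two hypothetical citation formats: `:chapter_verse` (colon-separated, with an optional sub-verse letter) and `:dotted` (dot-separated sections). The format names and the helper are invented for illustration.

```ruby
# Build a citation string from an abbreviation, a format, and a hierarchy array.
def sbl_citation(abbreviation, format, hierarchy)
  case format
  when :chapter_verse
    chapter, verse, part = hierarchy
    suffix = part ? ("a".ord + part - 1).chr : "" # 1 => "a", 2 => "b", ...
    "#{abbreviation} #{chapter}:#{verse}#{suffix}"
  when :dotted
    "#{abbreviation} #{hierarchy.join('.')}"
  else
    raise ArgumentError, "unknown citation format: #{format}"
  end
end

sbl_citation("Gen", :chapter_verse, [1, 1, 1])      # "Gen 1:1a"
sbl_citation("Josephus, Ant.", :dotted, [1, 1, 1])  # "Josephus, Ant. 1.1.1"
```

Since the inputs are the abbreviation and the hierarchy rather than a slug or URL, the result does not change when slugs or routes do.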
Notes, pericopes, and the apparatus
Three more academic-grade tables that don’t fit elsewhere:
- work_unit_notes: translator and editor notes stored as rows, typed by purpose (textual / translation / cross-reference / explanation / editorial). Querying “all text-critical notes in BSB” is one indexed query, not a DOM walk.
- pericopes: named passage groupings (BHS pericopes, NRSV section headings, lectionary readings, parashot and sedarim) at the canonical layer with int4range bounds. Multiple pericope schemes coexist via source_dataset, so a single verse can belong to a Hebrew parashah, a Christian pericope, and a lectionary reading simultaneously. Each pericope also carries a tradition string, and each work declares which pericope traditions it participates in via pericope_traditions. BSB section headings, for example, use tradition bsb_editorial (3,002 headings). This keeps tradition-specific headings from bleeding into works where they don’t belong.
- variant_units / readings / reading_witnesses: carry the lemma text, surrounding context, corrector hand, folio/column/line, and lacuna/supplement flags needed to render a real critical apparatus citation like ℵ²ᵃ B 16r col 2.
Feature flags on /works
The /works page displays small badges per work indicating
which features its data supports. These are computed from the data, not
declared by hand:
- NTS: translator/editor footnotes (work_unit_notes)
- LEM: lemma data on words
- STR: Strong’s numbers on words
- PRS: morphological parsing on words
- ALN: word-level alignment groups
- PRC: pericope / section-heading data
A client-side JavaScript filter on the same page lets users narrow the works list by language, traditions, work type, and feature flags — no server roundtrip needed.
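Computing the badges from data rather than declaring them could look roughly like this. The hash shape used for a work (`:notes`, `:words`, `:alignment_groups`, `:pericopes`) is a hypothetical stand-in for the real associations.

```ruby
# Each badge maps to a predicate over the work's data.
FEATURE_CHECKS = {
  "NTS" => ->(work) { work[:notes].any? },
  "LEM" => ->(work) { work[:words].any? { |w| w[:lemma] } },
  "STR" => ->(work) { work[:words].any? { |w| w[:strongs] } },
  "PRS" => ->(work) { work[:words].any? { |w| w[:parsing] } },
  "ALN" => ->(work) { work[:alignment_groups].any? },
  "PRC" => ->(work) { work[:pericopes].any? }
}.freeze

# Returns the badges whose predicate holds, in declaration order.
def feature_flags(work)
  FEATURE_CHECKS.select { |_flag, check| check.call(work) }.keys
end

# Invented sample: notes and lemma/Strong's data, but no parsing,
# alignment groups, or pericopes.
SAMPLE_WORK = {
  notes: ["example footnote"],
  words: [{ lemma: "λόγος", strongs: "G3056", parsing: nil }],
  alignment_groups: [],
  pericopes: []
}.freeze

FLAGS = feature_flags(SAMPLE_WORK)
```

A badge derived this way cannot drift from the data: importing alignment groups for a work would make ALN appear without any manual step.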
CanonicalWork#display_title
Many canonical works have titles with scheme parentheticals like
Psalms (LXX) or 3 John (German). The
display_title method strips these suffixes for reader-facing
display — breadcrumbs, chapter headings, and book lists show
“Psalms” rather than “Psalms (LXX).” The full
title remains in the database and in the schema reference.
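A minimal sketch of the stripping step, assuming the parenthetical always comes last in the title. The actual method may be stricter, for instance by matching only known scheme names.

```ruby
# Remove one trailing "(...)" suffix and the whitespace before it.
def display_title(title)
  title.sub(/\s*\([^()]*\)\z/, "")
end

display_title("Psalms (LXX)")    # "Psalms"
display_title("3 John (German)") # "3 John"
display_title("Genesis")         # "Genesis" (unchanged)
```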
Canonical ord monotonicity
The ord column on canonical_refs must be
monotonically increasing with respect to the hierarchy
array ordering within each canonical_work. This invariant
is what makes int4range overlap queries on cross-references
and pericopes correct — if ords are out of order, a range that
should cover “Genesis 1:1–1:5” might accidentally
include or exclude verses. Thirty-six canonical works were renumbered
after the initial import to fix non-monotonic ords introduced when
disputed or apocryphal verses were added after the initial numbering
pass.
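The invariant is straightforward to check. The sketch below uses hypothetical hashes in place of canonical_refs rows and shows how a verse added after the numbering pass breaks monotonicity, plus a naive renumbering. Any stored int4range bounds would need remapping as well, which this sketch ignores.

```ruby
# Adjacent pairs (in hierarchy order) whose ords fail to increase.
def out_of_order_pairs(refs)
  refs.sort_by { |r| r[:hierarchy] }
      .each_cons(2)
      .select { |a, b| a[:ord] >= b[:ord] }
end

# Reassign ords sequentially in hierarchy order.
def renumber(refs, start: 1)
  refs.sort_by { |r| r[:hierarchy] }
      .each_with_index
      .map { |r, i| r.merge(ord: start + i) }
end

# Invented sample: a sub-verse appended after the initial numbering pass
# received the next free ord instead of one between its neighbours.
SAMPLE_REFS = [
  { hierarchy: [1, 1],    ord: 1 },
  { hierarchy: [1, 2],    ord: 2 },
  { hierarchy: [1, 3],    ord: 3 },
  { hierarchy: [1, 2, 1], ord: 4 }
].freeze

VIOLATIONS = out_of_order_pairs(SAMPLE_REFS)
FIXED = renumber(SAMPLE_REFS)
```

Here the one violation is the pair `[1, 2, 1]` (ord 4) followed by `[1, 3]` (ord 3). After renumbering, a range over ords 2 to 3 covers verse 1:2 and its sub-verse, as the hierarchy order implies.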
BSB data features
The Berean Standard Bible import is the most feature-complete work in the corpus. Beyond the verse text, it carries:
- 4,854 translator footnotes as typed work_unit_notes rows
- 3,002 section headings as pericopes (tradition bsb_editorial)
- 411,427 alignment groups linking English words to Hebrew/Greek source words
- Per-word Strong’s numbers and lemmas
- Red-letter markup stored in work_units.markup (JSONB), not inline HTML
Current corpus
The full list of works in the corpus, with their languages, licenses, traditions, and feature flags, is on the works index.
Things to read next
- Schema reference — every table, column, and association.
- TVTMS — the Translators Versification Traditions file (CC BY 4.0).
- Macula Hebrew
- Macula Greek