About

Architecture

How the corpus is modelled, and how cross-references work across translations.

The unified addressing model

Every text in Open Scriptorium — the Hebrew Bible, the Greek New Testament, the Septuagint, the apocrypha, Josephus, Philo, the Gospel of Thomas — lives in the same data model. There is no separate “scripture table” and “commentary table.” Instead, three layers cooperate:

Works are concrete documents: a translation, a source-language edition, a commentary, a lexicon. Each work has a license, a language, one or more traditions, and a list of work units.
Work units are the smallest addressable piece of any work. For scripture this is usually a verse; for Loeb commentary it might be a numbered section; for the Gospel of Thomas it’s a logion. Every work unit carries the actual text body.
Canonical refs are the abstract concept of “Genesis 1:1.” Translations and editions of the same passage all align to the same canonical ref. This is what makes parallel reading and cross-referencing possible.

Diagram: BSB Genesis 1:1, WLC Genesis 1:1, and the LXX all point at the same canonical ref. The canonical layer carries no text — it is the address shared between editions.
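
The three layers can be sketched in plain Ruby. This is only an illustration under stated assumptions: the Struct and field names below are hypothetical and are not the application's actual models or columns.

```ruby
# Plain-Ruby sketch of the three layers. Field names are illustrative assumptions.
Work         = Struct.new(:slug, :language, :traditions, keyword_init: true)
CanonicalRef = Struct.new(:book_key, :hierarchy, keyword_init: true) # an address, no text
WorkUnit     = Struct.new(:work, :canonical_ref, :body, keyword_init: true)

GEN_1_1 = CanonicalRef.new(book_key: "genesis", hierarchy: [1, 1])

BSB = Work.new(slug: "bsb", language: "en", traditions: ["χ-protestant"])
WLC = Work.new(slug: "wlc", language: "he", traditions: ["jewish"])

UNITS = [
  WorkUnit.new(work: BSB, canonical_ref: GEN_1_1,
               body: "In the beginning God created the heavens and the earth."),
  WorkUnit.new(work: WLC, canonical_ref: GEN_1_1,
               body: "(Hebrew text of Genesis 1:1)")
].freeze

# Parallel reading: collect every work unit aligned to one canonical ref,
# keyed by the slug of the work it belongs to.
def parallel_units(units, canonical_ref)
  units.select { |u| u.canonical_ref == canonical_ref }
       .to_h { |u| [u.work.slug, u.body] }
end

puts parallel_units(UNITS, GEN_1_1).keys.inspect
```

The text lives only on the work units; the canonical ref is the shared key that lines them up.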

Cross-references at the canonical layer

Because canonical refs are the shared addressing scheme, cross-references live there too, not at the work level. A statement like “Jude 1:14 quotes 1 Enoch 1:9” is one row, language-independent and edition-independent. It applies to every translation of Jude and every translation of Enoch automatically.

Each cross-reference stores its source and target as Postgres int4range values over the sequential ord column on canonical refs. With a GiST index, asking “what cross-references touch this verse?” is a single indexed range-overlap lookup, whether the matching reference points at one verse, a chapter, or a whole book.

The corpus currently carries 682,741 cross-references across three datasets: BSB inline references (1,139), OpenBible.info (344,798), and Treasury of Scripture Knowledge (336,804). Each dataset is tagged with a source_dataset string, and each cross-reference row carries a note column for editorial commentary. The reader footer exposes a cross-reference picker that lets users opt into which datasets to display; the default is off, so the reading experience stays clean unless you want the references.

Diagram: a single GiST overlap query against source_range && int4range(N, N+1) finds every cross-reference touching canonical ord N, including ones whose range covers an entire chapter or book.
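
The overlap test itself is small enough to sketch in Ruby. The example below mirrors the half-open semantics of int4range(N, N+1) with Ruby's exclusive ranges; the ord values and dataset labels are invented, and the linear scan stands in for what the index does in Postgres.

```ruby
CrossRef = Struct.new(:source_range, :target_range, :source_dataset, keyword_init: true)

# Two half-open ranges overlap when each one starts before the other ends.
# This is the same condition Postgres evaluates for `&&` on int4range values.
def ranges_overlap?(a, b)
  a.begin < b.end && b.begin < a.end
end

# Every cross-reference whose source range touches canonical ord `ord`.
def cross_refs_touching(cross_refs, ord)
  probe = (ord...(ord + 1)) # same shape as int4range(N, N+1)
  cross_refs.select { |x| ranges_overlap?(x.source_range, probe) }
end

# Hypothetical ords: one verse, a chapter, a whole book, and a neighbouring verse.
SAMPLE_XREFS = [
  CrossRef.new(source_range: (100...101), target_range: (900...901), source_dataset: "verse"),
  CrossRef.new(source_range: (90...130),  target_range: (500...501), source_dataset: "chapter"),
  CrossRef.new(source_range: (1...1500),  target_range: (700...701), source_dataset: "book"),
  CrossRef.new(source_range: (101...102), target_range: (300...301), source_dataset: "neighbour")
].freeze

puts cross_refs_touching(SAMPLE_XREFS, 100).map(&:source_dataset).inspect
```

The neighbouring verse is excluded because the upper bound of a half-open range is not part of the range.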

Versification — one canonical work per scheme

Different traditions disagree about verse numbering. The Masoretic Text counts Psalm superscriptions as verse 1; the Septuagint combines what the MT splits into Psalms 9 and 10; Jeremiah is shorter and reordered in the LXX; Catholic and Orthodox canons structure several books distinctly. We don’t pretend these differences don’t exist, and we don’t invent ad-hoc mappings.

Each canonical work is tagged with a versification scheme drawn from the schemes documented by TVTMS (see below): KJV, MT, LXX, Vulgate, and a few specific sub-traditions where they matter. For most books every translation lines up verse-by-verse on the same canonical work (a BSB Genesis 1:1 work_unit and a Rahlfs Genesis 1:1 work_unit both point at the same canonical_ref). For books where the schemes disagree we create a separate canonical_work per scheme:

psalms — KJV scheme — used by BSB, KJV, WLC, Robinson-Pierpont
psalms-lxx — LXX scheme — used by Rahlfs, Brenton
psalms-german — German scheme — used by Lutherbibel 1912
jeremiah-lxx, job-lxx — LXX scheme — reordered / shorter
daniel-theodotion, bel-and-the-dragon-theodotion, susanna-theodotion — alternative Greek recensions
joshua-vaticanus-b, judges-vaticanus-b, tobit-sinaiticus — alternative manuscript recensions
3-john-german, revelation-german — German verse-numbering divergences (Lutherbibel 1912)

German-scheme canonical works exist because the Lutherbibel tradition has verse-numbering divergences that don’t map to any of the standard TVTMS schemes. A local TVTMS overlay file (db/data/tvtms_openscriptorium_overlay.tsv) supplies the German → KJV verse mappings for these books, processed by the same TVTMS importer that handles the upstream data.

Each of these shares a book_key with its primary counterpart (psalms-lxx and psalms both have book_key="psalms"), so the URL routing (/bsb/psalms/9 vs /rahlfs-lxx/psalms/9) resolves to the canonical work, among those sharing the book_key, in which the requested work has the most coverage. The user gets clean URLs; the schema preserves the academic distinction.

Cross-scheme verse mappings — `versification_mappings`

To put a Rahlfs verse next to a BSB verse on a parallel reader, we need to know that LXX Psalm 9:22 is the same passage as MT Psalm 10:1. That mapping data lives in the versification_mappings table, with one row per cross-scheme equivalence:

  from_scheme:        LXX
  from_canonical_ref: psalms-lxx 9:22
  to_scheme:          KJV
  to_canonical_ref:   psalms 10:1
  mapping_type:       exact

The data comes from the TVTMS (Translators Versification Traditions with Methodology for Standardisation) file maintained by Tyndale House Cambridge and STEPBible.org under CC BY 4.0. TVTMS is the most thorough public cross-walk of verse numbering across English, Hebrew, Latin, and Greek traditions; we download it at import time and populate versification_mappings from its Expanded section. Attribution is recorded on every import_run row the importer creates.

Each TVTMS row has the shape (SourceType, SourceRef, StandardRef, Action): “in tradition X, the verse numbered Y is the same passage as KJV verse Z, via this kind of action.” The importer maps each Action string to a mapping_type on our side — exact, split, merge, or missing — so the parallel reader can render splits, merges, and gaps without silently fudging anything.
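
A rough Ruby sketch of that translation step follows. The action labels in the table are placeholders chosen for illustration and are not the real TVTMS Action vocabulary; the method and Struct names are likewise assumptions. The point it shows is that an unrecognised action raises instead of being guessed at.

```ruby
VersificationMapping = Struct.new(:from_scheme, :from_ref, :to_scheme, :to_ref,
                                  :mapping_type, keyword_init: true)

# Placeholder action labels (assumed); the real TVTMS strings are not reproduced here.
ACTION_TO_MAPPING_TYPE = {
  "keep"     => :exact,
  "renumber" => :exact,
  "split"    => :split,
  "merge"    => :merge,
  "absent"   => :missing
}.freeze

# Build one mapping row from one TVTMS-shaped row. StandardRef is always
# expressed in the KJV scheme, so the target scheme is fixed.
def mapping_from_row(scheme:, source_ref:, standard_ref:, action:)
  type = ACTION_TO_MAPPING_TYPE.fetch(action) do
    raise ArgumentError, "unmapped action: #{action.inspect}"
  end
  VersificationMapping.new(
    from_scheme: scheme, from_ref: source_ref,
    to_scheme: "KJV", to_ref: (type == :missing ? nil : standard_ref),
    mapping_type: type
  )
end

row = mapping_from_row(scheme: "LXX", source_ref: "psalms-lxx 9:22",
                       standard_ref: "psalms 10:1", action: "renumber")
puts row.mapping_type
```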

How the importer handles Greek conventions

TVTMS encodes several Greek traditions because real-world LXX texts disagree about whether the title of a Psalm counts as verse 1. The two most relevant labels:

Greek / Latin+Greek — title-as-v1 convention (LXX Ps 9 has 39 verses, with the title numbered 9:1). This is what Rahlfs’s 1935 print edition does, what the CCAT digital encoding does, and what the Eliran Wong repository we import from preserves.
Greek2 — title-merged convention (LXX Ps 9 has 38 verses; what the title would be numbered is collapsed into v1). This is what NETS and Brenton typically use.

The importer does not filter rows by SourceType label. Instead, each TVTMS row carries a Tests column with conditional expressions like Psa.3:9=Last & Psa.3:TextBeforeV1=NotExist. The importer evaluates these tests against each scheme’s actual verse structure: a row produces mappings only for schemes whose data satisfies its conditions. This means the Greek/Greek2 convention difference is handled implicitly — a row designed for title-as-v1 texts will have tests that naturally fail against any scheme with a different psalm structure.

The practical consequence: if we ever import a NETS-style LXX, we give it its own scheme tag and the same importer will produce correct mappings automatically, because the tests will evaluate differently against that scheme’s verse data.
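
To make the test-driven selection concrete, here is a small Ruby sketch that evaluates the example expression quoted above against two hypothetical verse structures. It handles only the condition forms shown in this section (plus an assumed Exist counterpart to NotExist), and the reading of each predicate is an interpretation rather than the TVTMS specification.

```ruby
# structure: { ["Psa", 3] => { last_verse: 9, text_before_v1: false } }
def condition_holds?(condition, structure)
  m = /\A(\w+)\.(\d+):(\w+)=(\w+)\z/.match(condition)
  raise ArgumentError, "unsupported condition: #{condition}" unless m

  book, chapter, subject, predicate = m[1], m[2].to_i, m[3], m[4]
  info = structure[[book, chapter]]
  return false if info.nil? # the scheme has no such chapter

  if subject == "TextBeforeV1"
    case predicate
    when "Exist"    then info[:text_before_v1]   # assumed counterpart of NotExist
    when "NotExist" then !info[:text_before_v1]
    else raise ArgumentError, "unsupported predicate: #{predicate}"
    end
  elsif subject.match?(/\A\d+\z/) && predicate == "Last"
    info[:last_verse] == subject.to_i
  else
    raise ArgumentError, "unsupported condition: #{condition}"
  end
end

# A row applies to a scheme only if every `&`-joined condition holds.
def row_applies?(tests, structure)
  tests.split("&").map(&:strip).all? { |c| condition_holds?(c, structure) }
end

TESTS = "Psa.3:9=Last & Psa.3:TextBeforeV1=NotExist"

# Hypothetical structures: the title counted as verse 1 vs. left unnumbered.
TITLE_AS_V1  = { ["Psa", 3] => { last_verse: 9, text_before_v1: false } }.freeze
TITLE_MERGED = { ["Psa", 3] => { last_verse: 8, text_before_v1: true } }.freeze

puts row_applies?(TESTS, TITLE_AS_V1)
puts row_applies?(TESTS, TITLE_MERGED)
```

No SourceType label is consulted anywhere: the same row is accepted for one structure and rejected for the other purely on the verse data.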

Looking up equivalences — `CanonicalRef#equivalent_in`

The parallel reader doesn’t join across the mappings table inline; it asks canonical_ref.equivalent_in(target_scheme) and gets back an array of equivalent canonical_refs. The helper handles all four cases explicitly:

identity — same scheme as the source: returns [self].
explicit mapping (forward or reverse) — returns the mapped ref(s); split/merge are handled by returning multiple refs or by repeating one ref across N rows.
missing — the verse does not exist in the target tradition: returns [] and the parallel column shows an em-dash.
identity fallback — no mapping row exists at all: assumes identity (most verses agree across schemes), looking up the same hierarchy in a target-scheme canonical_work that shares this book_key, or falling through to self if no scheme-specific work exists for the book.
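
The four cases can be sketched in plain Ruby as follows. This is not the actual CanonicalRef#equivalent_in implementation: the Structs, the in-memory mapping list, and the works_by_scheme lookup are assumptions standing in for the database.

```ruby
Ref     = Struct.new(:scheme, :work, :book_key, :hierarchy)
Mapping = Struct.new(:from, :to_scheme, :to, :type) # :to is nil when type is :missing

class EquivalenceLookup
  def initialize(mappings, works_by_scheme)
    @mappings = mappings
    @works_by_scheme = works_by_scheme # { [book_key, scheme] => canonical work slug }
  end

  def equivalent_in(ref, target_scheme)
    return [ref] if ref.scheme == target_scheme                  # 1. identity

    forward = @mappings.select { |m| m.from == ref && m.to_scheme == target_scheme }
    return [] if forward.any? { |m| m.type == :missing }         # 3. missing

    reverse  = @mappings.select { |m| m.to == ref && m.from.scheme == target_scheme }
    explicit = (forward.map(&:to) + reverse.map(&:from)).uniq
    return explicit unless explicit.empty?                       # 2. explicit mapping

    work = @works_by_scheme[[ref.book_key, target_scheme]]       # 4. identity fallback
    work ? [Ref.new(target_scheme, work, ref.book_key, ref.hierarchy)] : [ref]
  end
end

LXX_9_22 = Ref.new("LXX", "psalms-lxx", "psalms", [9, 22])
KJV_10_1 = Ref.new("KJV", "psalms", "psalms", [10, 1])

LOOKUP = EquivalenceLookup.new(
  [Mapping.new(LXX_9_22, "KJV", KJV_10_1, :exact)],
  { ["psalms", "KJV"] => "psalms", ["psalms", "LXX"] => "psalms-lxx" }
)

puts LOOKUP.equivalent_in(LXX_9_22, "KJV").map(&:hierarchy).inspect
```

A split would simply appear as several forward rows for one source ref, so the same code returns multiple refs without a special case.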

Tradition taxonomy

Every work carries a traditions column — a string array (GIN-indexed) rather than a scalar — because many texts belong to more than one tradition. Values use a lowercase-chi (χ) prefix for Christian sub-traditions:

χ-orthodox, χ-catholic, χ-protestant, χ-patristic — the four main Christian groupings
jewish — Jewish tradition (WLC, JPS 1917)
academic — non-confessional critical editions (Rahlfs LXX, Nestle 1904, Robinson-Pierpont)

A work like the BSB is tagged ["χ-protestant"]; Brenton’s English LXX is ["χ-orthodox", "academic"]. The array model lets the /works page filter by tradition without creating join tables.

Word-level alignment

Source-language texts (Hebrew, Greek) carry word-level data: inflected surface form, lemma, Strong’s number, morphological parsing. Words in a translation can be aligned to source words via alignment groups — the same model used by the Macula Hebrew and Macula Greek projects.

The BSB currently has 411,427 alignment groups linking English words to their Hebrew or Greek source words, along with per-word Strong’s numbers and lemmas. Each alignment group is a logical unit containing some source words and some target words, supporting one-to-one, one-to-many, many-to-one, many-to-many, and null alignments uniformly. This is how interlinear-style displays and quotation matching (e.g. detecting NT quotations of the LXX) will be powered.

Diagram: a many-to-many alignment group joining a Hebrew construct chain to its English rendering.
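
As a hedged illustration, an alignment group can be modelled as two word lists, with the alignment kind derived from their sizes. The Struct, the `kind` helper, and the transliterated sample words are assumptions made for this sketch, not the stored schema.

```ruby
AlignmentGroup = Struct.new(:source_words, :target_words, keyword_init: true) do
  # Classify the group by how many words sit on each side.
  def kind
    return :null if source_words.empty? || target_words.empty?

    source_side = source_words.size == 1 ? "one" : "many"
    target_side = target_words.size == 1 ? "one" : "many"
    :"#{source_side}_to_#{target_side}"
  end
end

# A two-word Hebrew construct chain (transliterated) rendered by five English words.
construct_chain = AlignmentGroup.new(
  source_words: %w[bet YHWH],
  target_words: %w[the house of the LORD]
)

# A source word with no English counterpart is a null alignment.
untranslated = AlignmentGroup.new(source_words: %w[et], target_words: [])

puts construct_chain.kind
puts untranslated.kind
```

Because every case is the same shape, a renderer never needs separate code paths for one-to-one and many-to-many groups.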

Syntax trees and discourse analysis

The Westminster Leningrad Codex and Nestle 1904 carry 450,891 syntax nodes imported from the Macula Hebrew and Macula Greek projects (CC BY 4.0). Each node sits in a tree hierarchy stored as an ltree path, with a node_class (clause, phrase, word group), a grammatical role (subject, predicate, object, adjunct, etc.), and links to the words it contains.

This powers syntactic search: a query like syntax: subject(god) verb(create) finds every verse where a word glossed “god” fills the subject role and a word glossed “create” fills the verb role. The query works with Hebrew/Greek lemmas, English glosses, or a mix of both, across both testaments simultaneously. Results include an expandable syntax tree showing the clause structure of each matching verse.
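
A toy version of this query can be written over ltree-style dotted paths. Everything here is an assumption for illustration: the node data is invented, the role labels follow the query syntax above, and a linear scan replaces the indexed ltree lookup.

```ruby
Node = Struct.new(:path, :node_class, :role, :gloss, :verse, keyword_init: true)

# ltree-style descendant test: "1" is an ancestor of "1.2" but not of "10.2".
def descendant?(node, ancestor)
  node.path.start_with?("#{ancestor.path}.")
end

# wanted: { "subject" => "god", "verb" => "create" }
# Returns the verses containing a clause in which every wanted role is
# filled by a node with the wanted gloss.
def syntax_search(nodes, wanted)
  clauses = nodes.select { |n| n.node_class == "clause" }
  matching = clauses.select do |clause|
    inside = nodes.select { |n| descendant?(n, clause) }
    wanted.all? { |role, gloss| inside.any? { |n| n.role == role && n.gloss == gloss } }
  end
  matching.map(&:verse).uniq
end

TOY_NODES = [
  Node.new(path: "1",   node_class: "clause", verse: "Gen 1:1"),
  Node.new(path: "1.1", node_class: "phrase", role: "verb",    gloss: "create", verse: "Gen 1:1"),
  Node.new(path: "1.2", node_class: "phrase", role: "subject", gloss: "god",    verse: "Gen 1:1"),
  Node.new(path: "2",   node_class: "clause", verse: "Gen 1:3"),
  Node.new(path: "2.1", node_class: "phrase", role: "verb",    gloss: "say",    verse: "Gen 1:3"),
  Node.new(path: "2.2", node_class: "phrase", role: "subject", gloss: "god",    verse: "Gen 1:3")
].freeze

puts syntax_search(TOY_NODES, "subject" => "god", "verb" => "create").inspect
```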

Alongside the syntax trees, the corpus carries 217,402 participant references linking pronouns and verb forms to their referents (e.g. “he” in Genesis 1:3 → God). These come from Macula’s participantref and subjref annotations and enable discourse-level questions like “who is speaking in this verse?”

Passage aliases

The corpus includes 4,174 passage aliases with 10,194 searchable names mapping common passage names to their canonical references. Searching “Parable of the Sower” or “Ten Commandments” or “Shema” returns the actual verses directly, with a “named passage” badge on the result. The alias system handles many-to-many relationships (a name like “Great Commission” maps to both Matthew 28:18–20 and Mark 16:15–16, and a single passage like the Sermon on the Mount has multiple names). Fuzzy matching via trigram similarity catches partial and approximate name queries.
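
Trigram similarity can be approximated in a few lines of Ruby. The sketch below follows the documented behaviour of the Postgres pg_trgm extension (lowercase, pad each word with two leading spaces and one trailing space, compare sets of three-character windows). Whether the application uses pg_trgm itself, and with what threshold, is an assumption; the 0.3 cut-off and the alias list are illustrative.

```ruby
require "set"

# Set of trigrams for a string, padding each word the way pg_trgm does.
def trigrams(text)
  text.downcase.scan(/[[:alnum:]]+/).flat_map do |word|
    padded = "  #{word} "
    (0..padded.length - 3).map { |i| padded[i, 3] }
  end.to_set
end

# Shared trigrams divided by all distinct trigrams (0.0 to 1.0).
def trigram_similarity(a, b)
  ta = trigrams(a)
  tb = trigrams(b)
  return 0.0 if ta.empty? || tb.empty?

  (ta & tb).size.to_f / (ta | tb).size
end

# Best-scoring alias name, or nil when nothing clears the threshold.
def best_alias(query, names, threshold: 0.3)
  name, score = names.map { |n| [n, trigram_similarity(query, n)] }.max_by { |_, s| s }
  score && score >= threshold ? name : nil
end

ALIAS_NAMES = ["Parable of the Sower", "Ten Commandments", "Shema", "Great Commission"].freeze

puts best_alias("parable of the sowr", ALIAS_NAMES).inspect
```

A misspelled query still shares most of its trigrams with the intended name, which is why this kind of matching tolerates typos and partial input.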

Localized book names

Each canonical_work carries an alternate_titles JSONB column with book names in Greek, Latin, French, and German. When reading the Lutherbibel, book and chapter headings display both the English name and the German name (Genesis / 1. Mose); when reading Crampon, the French name appears (Genèse). The Hebrew native_title column provides the original Hebrew name for Old Testament books (בְּרֵאשִׁית).

Sub-verse references — `Gen 1:1a`

Critical editions, lexica, and academic commentaries cite half-verses constantly: Rom 5:12b, 1 Cor 11:24c, Heb 4:14ab. The canonical layer represents these by extending the hierarchy array. A whole verse is [chapter, verse]; a sub-verse adds a third element where 1 means “a”, 2 means “b”, and so on. Postgres array comparison gives the right ordering for free, so a chapter rendered in canonical order naturally interleaves whole verses and parts.
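
The ordering claim is easy to check in Ruby, whose Array comparison is element-by-element with a shorter prefix sorting first, which matches the Postgres behaviour for these cases. The parser below is a hypothetical helper that accepts only a single part letter (so a combined form like 14ab is out of scope for this sketch).

```ruby
# "1:1a" -> [1, 1, 1], "1:2" -> [1, 2]
def hierarchy_for(ref)
  m = /\A(\d+):(\d+)([a-z])?\z/.match(ref)
  raise ArgumentError, "unsupported reference: #{ref}" unless m

  base = [m[1].to_i, m[2].to_i]
  m[3] ? base + [m[3].ord - "a".ord + 1] : base
end

refs   = %w[1:2 1:1b 1:10 1:1 1:1a]
sorted = refs.sort_by { |r| hierarchy_for(r) }

puts sorted.inspect
# => ["1:1", "1:1a", "1:1b", "1:2", "1:10"]
```

The whole verse sorts before its parts, and the parts sort before the next verse, with no string-based sorting tricks.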

Provenance — `import_runs` + PaperTrail

Reproducible research needs to know which import produced this row, not just what changed. Every importer creates an import_run row at the start of its execution, recording the source URL, source revision (sha or mtime), byte count, options, and the importer class. Every text row, word, lemma, cross-reference, and apparatus reading the importer creates is stamped with that run’s id.

PaperTrail captures field-level changes; import_runs captures which run made them. Together they answer both “what is the history of this verse?” and “which version of which source did this come from?”

Stable citations — SBL

Every canonical_work carries an SBL Handbook of Style abbreviation (Gen, Matt, Philo, QG, Gos. Thom.) and a citation format. Every canonical_ref caches an sbl_citation string computed from those: Gen 1:1, Gen 1:1a, Philo, QG 1.1, Josephus, Ant. 1.1.1. These are stable across slug changes and URL refactors, and are what every user-facing citation string comes from.
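
One way the cached string could be derived is sketched below. The two format names (:chapter_verse and :dotted) are hypothetical labels inferred from the examples above, not the application's actual citation formats.

```ruby
# abbreviation: SBL abbreviation of the work, e.g. "Gen" or "Josephus, Ant."
# format:       :chapter_verse or :dotted (assumed names)
# hierarchy:    the canonical ref's hierarchy array
def sbl_citation(abbreviation, format, hierarchy)
  case format
  when :chapter_verse
    chapter, verse, part = hierarchy
    suffix = part ? ("a".ord + part - 1).chr : "" # 1 => "a", 2 => "b"
    "#{abbreviation} #{chapter}:#{verse}#{suffix}"
  when :dotted
    "#{abbreviation} #{hierarchy.join('.')}"
  else
    raise ArgumentError, "unknown citation format: #{format.inspect}"
  end
end

puts sbl_citation("Gen", :chapter_verse, [1, 1, 1])
puts sbl_citation("Josephus, Ant.", :dotted, [1, 1, 1])
```

Because the string depends only on the abbreviation, the format, and the hierarchy, it does not change when a slug or URL does.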

Notes, pericopes, and the apparatus

Three more academic-grade tables that don’t fit elsewhere:

work_unit_notes — translator and editor notes stored as rows, typed by purpose (textual / translation / cross-reference / explanation / editorial). Querying “all text-critical notes in BSB” is one indexed query, not a DOM walk.
pericopes — named passage groupings (BHS pericopes, NRSV section headings, lectionary readings, parashot and sedarim) at the canonical layer with int4range bounds. Multiple pericope schemes coexist via source_dataset, so a single verse can belong to a Hebrew parashah, a Christian pericope, and a lectionary reading simultaneously. Each pericope also carries a tradition string, and each work declares which pericope traditions it participates in via pericope_traditions. BSB section headings, for example, use tradition bsb_editorial (3,002 headings). This keeps tradition-specific headings from bleeding into works where they don’t belong.
variant_units / readings / reading_witnesses carry the lemma text, surrounding context, corrector hand, folio/column/line, and lacuna/supplement flags needed to render a real critical apparatus citation like ℵ²ᵃ B 16r col 2.

Feature flags on /works

The /works page displays small badges per work indicating which features its data supports. These are computed from the data, not declared by hand:

NTS — translator/editor footnotes (work_unit_notes)
LEM — lemma data on words
STR — Strong’s numbers on words
PRS — morphological parsing on words
ALN — word-level alignment groups
PRC — pericope / section-heading data
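
Computing the badges from data rather than declaring them might look like the following sketch. The per-work statistic keys and the sample counts are invented for illustration; only the six badge codes come from the list above.

```ruby
# Each badge is present when the corresponding kind of data exists for the work.
FEATURE_CHECKS = {
  "NTS" => ->(stats) { stats.fetch(:notes, 0).positive? },
  "LEM" => ->(stats) { stats.fetch(:words_with_lemma, 0).positive? },
  "STR" => ->(stats) { stats.fetch(:words_with_strongs, 0).positive? },
  "PRS" => ->(stats) { stats.fetch(:words_with_parsing, 0).positive? },
  "ALN" => ->(stats) { stats.fetch(:alignment_groups, 0).positive? },
  "PRC" => ->(stats) { stats.fetch(:pericopes, 0).positive? }
}.freeze

# Badge codes for one work, in the display order of FEATURE_CHECKS.
def feature_flags(stats)
  FEATURE_CHECKS.select { |_code, check| check.call(stats) }.keys
end

# Hypothetical counts for an example work.
example_stats = { notes: 12, words_with_lemma: 900, words_with_strongs: 900, pericopes: 40 }

puts feature_flags(example_stats).inspect
```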

A client-side JavaScript filter on the same page lets users narrow the works list by language, traditions, work type, and feature flags — no server roundtrip needed.

`CanonicalWork#display_title`

Many canonical works have titles with scheme parentheticals like Psalms (LXX) or 3 John (German). The display_title method strips these suffixes for reader-facing display — breadcrumbs, chapter headings, and book lists show “Psalms” rather than “Psalms (LXX).” The full title remains in the database and in the schema reference.
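
A minimal sketch of that stripping, assuming the scheme marker is always a single trailing parenthetical (the real method may apply other rules):

```ruby
# Remove one trailing "(...)" group and the whitespace before it.
def display_title(title)
  title.sub(/\s*\([^()]*\)\s*\z/, "")
end

puts display_title("Psalms (LXX)")
puts display_title("3 John (German)")
puts display_title("Genesis")
```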

Canonical ord monotonicity

The ord column on canonical_refs must be monotonically increasing with respect to the hierarchy array ordering within each canonical_work. This invariant is what makes int4range overlap queries on cross-references and pericopes correct — if ords are out of order, a range that should cover “Genesis 1:1–1:5” might accidentally include or exclude verses. Thirty-six canonical works were renumbered after the initial import to fix non-monotonic ords introduced when disputed or apocryphal verses were added after the initial numbering pass.
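
The invariant and a simple repair can be sketched as follows. The hash shape, the method names, and the renumbering strategy (reassigning consecutive ords in hierarchy order) are assumptions for illustration, not a description of the actual migration.

```ruby
# refs: array of { hierarchy: [chapter, verse, ...], ord: Integer }
# True when ords strictly increase in hierarchy order.
def monotonic_ords?(refs)
  refs.sort_by { |r| r[:hierarchy] }
      .each_cons(2)
      .all? { |a, b| a[:ord] < b[:ord] }
end

# Reassign consecutive ords following hierarchy order.
def renumber(refs, start: 1)
  refs.sort_by { |r| r[:hierarchy] }
      .each_with_index
      .map { |r, i| r.merge(ord: start + i) }
end

# A verse added after the first numbering pass received the next free ord,
# which places it after verses that follow it in the text.
BROKEN = [
  { hierarchy: [1, 1], ord: 1 },
  { hierarchy: [1, 2], ord: 2 },
  { hierarchy: [1, 3], ord: 3 },
  { hierarchy: [1, 2, 1], ord: 4 } # late addition: verse 2, part "a"
].freeze

puts monotonic_ords?(BROKEN)
puts monotonic_ords?(renumber(BROKEN))
```

With the broken numbering, a range covering ords 1 through 3 would miss the late addition even though it sits between verses 2 and 3 in the text.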

BSB data features

The Berean Standard Bible import is the most feature-complete work in the corpus. Beyond the verse text, it carries:

4,854 translator footnotes as typed work_unit_notes rows
3,002 section headings as pericopes (tradition bsb_editorial)
411,427 alignment groups linking English words to Hebrew/Greek source words
Per-word Strong’s numbers and lemmas
Red-letter markup stored in work_units.markup (JSONB), not inline HTML

Current corpus

The full list of works in the corpus, with their languages, licenses, traditions, and feature flags, is on the works index.

Search

Search has three layers. The first is traditional full-text search: Postgres tsvector indexes on every work unit, with a custom scriptorium_english text search configuration that handles stemming and a scriptorium_simple configuration for exact token matching. Hebrew diacritics (niqqud, cantillation) are stripped at index time via a custom strip_hebrew_points() SQL function so that pointed and unpointed searches match the same verses. A table-driven synonym system (bible_synonyms) expands archaic English at query time so that “honour thy father” finds “Honor your father.”
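
Both normalisation steps can be illustrated in Ruby. This is a sketch only: the real strip_hebrew_points() is a SQL function whose exact character set is not given here, so the Unicode ranges below (cantillation marks and vowel points, leaving maqaf and other punctuation alone) are an assumption, as are the two synonym entries.

```ruby
# Cantillation marks (U+0591-U+05AF) and points (U+05B0-U+05BD, U+05BF,
# U+05C1, U+05C2, U+05C4, U+05C5, U+05C7). Assumed set, not the SQL function's.
HEBREW_POINTS = /[\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/

def strip_hebrew_points(text)
  text.gsub(HEBREW_POINTS, "")
end

# Hypothetical rows standing in for the bible_synonyms table.
SYNONYMS = { "honour" => ["honor"], "thy" => ["your"] }.freeze

# Each query token becomes a list of acceptable alternatives.
def expand_query(query)
  query.downcase.split.map { |token| [token, *SYNONYMS.fetch(token, [])] }
end

# The first word of Genesis, written with escapes so the points are explicit.
pointed   = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05C1\u05B4\u05D9\u05EA"
unpointed = "\u05D1\u05E8\u05D0\u05E9\u05D9\u05EA"

puts strip_hebrew_points(pointed) == unpointed
puts expand_query("honour thy father").inspect
```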

The second layer is semantic search via sentence embeddings. A multilingual embedding model (Qwen3-Embedding-0.6B) fine-tuned on 1.2 million parallel pairs extracted from the corpus itself (parallel translations, Hebrew–English word alignments, Greek–English word alignments, and cross-reference pairs) maps every work unit to a 1024-dimensional vector stored in pgvector. At query time, the search encodes the user’s query into the same vector space and finds semantically similar verses via HNSW approximate nearest-neighbor lookup. This means a search for “What did Jesus teach about wealth?” finds the parable of the rich fool and “you cannot serve God and mammon” even though the word “wealth” never appears in those verses.
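
The ranking idea can be shown with a brute-force cosine-similarity search. Note the substitutions: this sketch scans every vector exactly instead of using an HNSW index, and it uses invented three-dimensional vectors in place of real 1024-dimensional embeddings.

```ruby
# Cosine similarity between two equal-length numeric vectors.
def cosine(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

# index: { reference => vector }. Returns the k most similar references,
# best match first. Exact search; HNSW approximates this result faster.
def nearest(query_vector, index, k: 2)
  index.max_by(k) { |_ref, vector| cosine(query_vector, vector) }.map(&:first)
end

# Toy vectors (assumed): the first axis loosely stands for "wealth".
TOY_INDEX = {
  "Luke 12:16-21" => [0.9, 0.2, 0.0],
  "Matt 6:24"     => [0.8, 0.3, 0.1],
  "Gen 1:1"       => [0.0, 0.1, 0.9]
}.freeze

query = [1.0, 0.1, 0.0] # stand-in for an encoded query about wealth

puts nearest(query, TOY_INDEX, k: 2).inspect
```

Nothing in the ranking looks at the words of the query or the verses, only at the direction of their vectors, which is why matches need not share vocabulary.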

Because the embedding model was trained on the corpus’s own word-level alignments, Hebrew, Greek, and English all share the same vector space. An English query finds source-language verses by meaning, not gloss lookup.

A third layer fuses in topical themes: thematic verse mappings drawn from OpenBible.info Topics (CC BY 4.0, editorially filtered) and Sefaria Topics (CC BY 4.0, Tanakh-only). A search for “anxiety” therefore surfaces Phil 4:6–7 and Prov 12:25 even when those verses are buried deep in lexical results.

The query classifier automatically detects citations, Strong’s numbers, Hebrew/Greek script, lemma prefixes, proximity operators, and regex patterns, dispatching each to a specialised strategy. Full details are on the search architecture page.

Things to read next

Schema reference — every table, column, and association.
TVTMS — the Translators Versification Traditions file (CC BY 4.0).
Macula Hebrew
Macula Greek