Open Scriptorium

Architecture

How the corpus is modelled, and how cross-references work across translations.

The unified addressing model

Every text in Open Scriptorium — the Hebrew Bible, the Greek New Testament, the Septuagint, the apocrypha, Josephus, Philo, the Gospel of Thomas — lives in the same data model. There is no separate “scripture table” and “commentary table.” Instead, three layers cooperate:

  1. Works are concrete documents: a translation, a source-language edition, a commentary, a lexicon. Each work has a license, a language, one or more traditions, and a list of work units.
  2. Work units are the smallest addressable piece of any work. For scripture this is usually a verse; for Loeb commentary it might be a numbered section; for the Gospel of Thomas it’s a logion. Every work unit carries the actual text body.
  3. Canonical refs are the abstract concept of “Genesis 1:1.” Translations and editions of the same passage all align to the same canonical ref. This is what makes parallel reading and cross-referencing possible.
[Figure: Unified addressing model, with three layers: works, work_units, canonical_refs]
  BSB (English translation; public domain): work_unit "In the beginning God created..."
  WLC (Hebrew · Leningrad; public domain): work_unit בְּרֵאשִׁית בָּרָא אֱלֹהִים...
  LXX (Rahlfs) (Greek · Septuagint; PD in CA / EU): work_unit Ἐν ἀρχῇ ἐποίησεν ὁ θεὸς...
  All three align to one canonical_ref: Genesis · ord=1 · hierarchy=[1,1] · display: "Gen 1:1"
Diagram: BSB Genesis 1:1, WLC Genesis 1:1, and the LXX all point at the same canonical ref. The canonical layer carries no text — it is the address shared between editions.
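
The three layers can be sketched in a few lines of Python. This is an illustration only, not the project's schema: the class and field names are assumptions, and the "wlc" slug is hypothetical (only "bsb" and "rahlfs-lxx" appear in the URLs later on this page).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CanonicalRef:
    book: str
    ord: int            # sequential position within the canonical work
    hierarchy: tuple    # (chapter, verse)
    display: str


@dataclass
class WorkUnit:
    text: str                     # the actual text body
    canonical_ref: CanonicalRef   # the shared address; carries no text itself


@dataclass
class Work:
    slug: str
    language: str
    license: str
    units: list = field(default_factory=list)


gen_1_1 = CanonicalRef("Genesis", 1, (1, 1), "Gen 1:1")

works = [
    Work("bsb", "English", "Public domain",
         [WorkUnit("In the beginning God created...", gen_1_1)]),
    Work("wlc", "Hebrew", "Public domain",
         [WorkUnit("בְּרֵאשִׁית בָּרָא אֱלֹהִים...", gen_1_1)]),
    Work("rahlfs-lxx", "Greek", "PD in CA / EU",
         [WorkUnit("Ἐν ἀρχῇ ἐποίησεν ὁ θεὸς...", gen_1_1)]),
]


def parallel(works, ref):
    """Collect the text of every work unit aligned to one canonical ref."""
    return {w.slug: u.text for w in works for u in w.units if u.canonical_ref == ref}
```

Calling parallel(works, gen_1_1) returns one entry per edition, which is the essence of a parallel reader: the lookup key is the canonical ref, never a particular translation.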

Cross-references at the canonical layer

Because canonical refs are the shared addressing scheme, cross-references live there too, not at the work level. A statement like “Jude 1:14 quotes 1 Enoch 1:9” is one row, language-independent and edition-independent. It applies to every translation of Jude and every translation of Enoch automatically.

Each cross-reference stores its source and target as Postgres int4range values over the sequential ord column on canonical refs. With a GiST index, asking “what cross-references touch this verse?” is a single indexed lookup, whether the matching reference points at one verse, a chapter, or a whole book.

The corpus currently carries 682,741 cross-references across three datasets: BSB inline references (1,139), OpenBible.info (344,798), and Treasury of Scripture Knowledge (336,804). Each dataset is tagged with a source_dataset string, and each cross-reference row carries a note column for editorial commentary. The reader footer exposes a cross-reference picker that lets users opt into which datasets to display; the default is off, so the reading experience stays clean unless you want the references.

[Figure: Cross-references via int4range overlap]
  Axis: canonical_refs.ord (Genesis), a sequential integer per verse, from Gen 1:1 (ord 1) to Gen 1:16 (ord 16)
  Query: ord = 6
  Overlapping cross_references: single verse → quoted by Heb 11:3; whole chapter → John 1:1-3 alludes; verse range → 2 Cor 4:6
  source_range && int4range(6, 7) matches all three with one indexed query.
Diagram: a single GiST overlap query against source_range && int4range(N, N+1) finds every cross-reference touching canonical ord N, including ones whose range covers an entire chapter or book.
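
The overlap test itself is small. The Python sketch below mimics the half-open semantics of int4range and its && operator with a linear scan; in the database the same predicate is answered from the index. The range bounds are illustrative assumptions (the whole-chapter range assumes Genesis 1 occupies ords 1 to 31).

```python
from typing import NamedTuple


class IntRange(NamedTuple):
    lower: int  # inclusive, like int4range's default '[' bound
    upper: int  # exclusive, like int4range's default ')' bound

    def overlaps(self, other):
        """Equivalent of the Postgres && operator for half-open integer ranges."""
        return self.lower < other.upper and other.lower < self.upper


# Hypothetical rows: (label, source_range over canonical ord)
cross_references = [
    ("single verse", IntRange(6, 7)),
    ("verse range", IntRange(3, 9)),
    ("whole chapter", IntRange(1, 32)),
    ("elsewhere", IntRange(40, 45)),
]


def touching(verse_ord):
    """Labels of every cross-reference whose range covers the given ord."""
    probe = IntRange(verse_ord, verse_ord + 1)
    return [label for label, rng in cross_references if rng.overlaps(probe)]
```

Here touching(6) returns the single-verse, verse-range, and whole-chapter rows, while touching(7) drops the single-verse row because its upper bound of 7 is exclusive.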

Versification — one canonical work per scheme

Different traditions disagree about verse numbering. The Masoretic Text counts Psalm superscriptions as verse 1; the Septuagint combines what the MT splits into Psalms 9 and 10; Jeremiah is shorter and reordered in the LXX; Catholic and Orthodox canons structure several books distinctly. We don’t pretend these differences don’t exist, and we don’t invent ad-hoc mappings.

Each canonical work is tagged with a versification scheme drawn from the schemes documented by TVTMS (see below): KJV, MT, LXX, Vulgate, and a few specific sub-traditions where they matter. For most books every translation lines up verse-by-verse on the same canonical work (a BSB Genesis 1:1 work_unit and a Rahlfs Genesis 1:1 work_unit both point at the same canonical_ref). For books where the schemes disagree we create a separate canonical_work per scheme:

German-scheme canonical works exist because the Lutherbibel tradition has verse-numbering divergences that don’t map to any of the standard TVTMS schemes. A local TVTMS overlay file (db/data/tvtms_openscriptorium_overlay.tsv) supplies the German → KJV verse mappings for these books, processed by the same TVTMS importer that handles the upstream data.

Each of these shares a book_key with its primary counterpart (psalms-lxx and psalms both have book_key="psalms"), so URL routing (/bsb/psalms/9 vs /rahlfs-lxx/psalms/9) resolves to the canonical work in which the requested work has the most coverage. The user gets clean URLs; the schema preserves the academic distinction.
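
A rough Python sketch of that resolution rule follows. The coverage numbers are made up for illustration, and the function and table names are assumptions rather than the application's routing code.

```python
# canonical work slug -> book_key (slugs taken from the example above)
CANONICAL_WORKS = {
    "psalms": "psalms",
    "psalms-lxx": "psalms",
    "genesis": "genesis",
}

# (work slug, canonical work slug) -> number of aligned work units (toy values)
COVERAGE = {
    ("bsb", "psalms"): 150,
    ("rahlfs-lxx", "psalms-lxx"): 151,
}


def resolve(work_slug, book_key):
    """Pick the canonical work under book_key that the given work covers best."""
    candidates = [cw for cw, key in CANONICAL_WORKS.items() if key == book_key]
    if not candidates:
        return None
    return max(candidates, key=lambda cw: COVERAGE.get((work_slug, cw), 0))
```

With these toy values, resolve("bsb", "psalms") yields "psalms" and resolve("rahlfs-lxx", "psalms") yields "psalms-lxx", so both URLs can use the same book segment.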

Cross-scheme verse mappings — versification_mappings

To put a Rahlfs verse next to a BSB verse on a parallel reader, we need to know that LXX Psalm 9:22 is the same passage as MT Psalm 10:1. That mapping data lives in the versification_mappings table, with one row per cross-scheme equivalence:

  from_scheme:        LXX
  from_canonical_ref: psalms-lxx 9:22
  to_scheme:          KJV
  to_canonical_ref:   psalms 10:1
  mapping_type:       exact

The data comes from the TVTMS (Translators Versification Traditions with Methodology for Standardisation) file maintained by Tyndale House Cambridge and STEPBible.org under CC BY 4.0. TVTMS is the most thorough public cross-walk of verse numbering across English, Hebrew, Latin, and Greek traditions; we download it at import time and populate versification_mappings from its Expanded section. Attribution is recorded on every import_run row the importer creates.

Each TVTMS row has the shape (SourceType, SourceRef, StandardRef, Action): “in tradition X, the verse numbered Y is the same passage as KJV verse Z, via this kind of action.” The importer maps each Action string to a mapping_type on our side — exact, split, merge, or missing — so the parallel reader can render splits, merges, and gaps without silently fudging anything.

How the importer handles Greek conventions

TVTMS encodes several Greek traditions because real-world LXX texts disagree about whether the title of a Psalm counts as verse 1. The two most relevant labels:

The importer does not filter rows by SourceType label. Instead, each TVTMS row carries a Tests column with conditional expressions like Psa.3:9=Last & Psa.3:TextBeforeV1=NotExist. The importer evaluates these tests against each scheme’s actual verse structure: a row produces mappings only for schemes whose data satisfies its conditions. This means the Greek/Greek2 convention difference is handled implicitly — a row designed for title-as-v1 texts will have tests that naturally fail against any scheme with a different psalm structure.
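
To make the idea concrete, here is a small Python evaluator for the kind of expression quoted above. It is a sketch under stated assumptions: it supports only the Last, Exist, and NotExist values, treats & as a conjunction, and models a scheme as a dictionary of per-chapter facts. The real Tests grammar is richer, and the importer's own data structures are not shown on this page.

```python
def parse_ref(ref):
    """Split 'Psa.3:9' into ('Psa', 3, '9'); the last part may be 'TextBeforeV1'."""
    book, rest = ref.split(".", 1)
    chapter, part = rest.split(":", 1)
    return book, int(chapter), part


def exists(scheme, book, chapter, part):
    info = scheme.get((book, chapter))
    if info is None:
        return False
    if part == "TextBeforeV1":
        return info["text_before_v1"]
    return part.isdigit() and 1 <= int(part) <= info["last_verse"]


def evaluate(tests, scheme):
    """True when every '&'-joined clause holds for the scheme's verse structure."""
    for clause in tests.split("&"):
        ref, expected = (s.strip() for s in clause.split("="))
        book, chapter, part = parse_ref(ref)
        if expected == "Last":
            info = scheme.get((book, chapter))
            ok = info is not None and part.isdigit() and int(part) == info["last_verse"]
        elif expected == "Exist":
            ok = exists(scheme, book, chapter, part)
        elif expected == "NotExist":
            ok = not exists(scheme, book, chapter, part)
        else:
            raise ValueError(f"unsupported test value: {expected}")
        if not ok:
            return False
    return True


# Illustrative schemes: Psalm 3 with the title counted as verse 1 (nine verses),
# and Psalm 3 with an unnumbered title before verse 1 (eight verses).
title_as_v1 = {("Psa", 3): {"last_verse": 9, "text_before_v1": False}}
title_unnumbered = {("Psa", 3): {"last_verse": 8, "text_before_v1": True}}
```

The quoted expression passes against title_as_v1 and fails against title_unnumbered, so a row carrying it would only produce mappings for the first scheme.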

The practical consequence: if we ever import a NETS-style LXX, we give it its own scheme tag and the same importer will produce correct mappings automatically, because the tests will evaluate differently against that scheme’s verse data.

Looking up equivalences — CanonicalRef#equivalent_in

The parallel reader doesn’t join across the mappings table inline; it asks canonical_ref.equivalent_in(target_scheme) and gets back an array of equivalent canonical_refs. The helper handles all four cases explicitly:
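
A minimal Python sketch of such a lookup, assuming the four cases correspond to the four mapping_type values (exact, split, merge, missing). Only the first row comes from the example above; the "example" rows are hypothetical, and returning an empty list when no row exists is an assumption about behaviour this page does not specify.

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    from_scheme: str
    from_ref: str
    to_scheme: str
    to_ref: Optional[str]   # None when the passage is missing in the target scheme
    mapping_type: str       # exact | split | merge | missing


MAPPINGS = [
    Mapping("LXX", "psalms-lxx 9:22", "KJV", "psalms 10:1", "exact"),
    Mapping("LXX", "example 1:1", "KJV", "example 1:1", "split"),
    Mapping("LXX", "example 1:1", "KJV", "example 1:2", "split"),
    Mapping("LXX", "example 2:1", "KJV", "example 2:1", "merge"),
    Mapping("LXX", "example 2:2", "KJV", "example 2:1", "merge"),
    Mapping("LXX", "example 3:1", "KJV", None, "missing"),
]


def equivalent_in(from_scheme, ref, target_scheme):
    """Return the refs in target_scheme that cover the same passage as ref."""
    if from_scheme == target_scheme:
        return [ref]
    rows = [m for m in MAPPINGS
            if (m.from_scheme, m.from_ref, m.to_scheme) == (from_scheme, ref, target_scheme)]
    # exact and merge yield one target, split yields several, missing yields none
    return [m.to_ref for m in rows if m.mapping_type != "missing"]
```

Returning a list in every case means the caller never has to special-case splits or gaps: one element is the common case, several elements is a split, and an empty list is a gap to render explicitly.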

Tradition taxonomy

Every work carries a traditions column — a string array (GIN-indexed) rather than a scalar — because many texts belong to more than one tradition. Values use a lowercase-chi (χ) prefix for Christian sub-traditions:

A work like the BSB is tagged [“χ-protestant”]; Brenton’s English LXX is [“χ-orthodox”, “academic”]. The array model lets the /works page filter by tradition without creating join tables.
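
In Python terms, the filter is a membership test on each work's array, loosely analogous to an array containment query in Postgres. The "brenton" slug is a hypothetical name for Brenton's English LXX.

```python
# slug -> traditions array, using the two examples given above
WORKS = {
    "bsb": ["χ-protestant"],
    "brenton": ["χ-orthodox", "academic"],
}


def with_traditions(works, wanted):
    """Slugs of works whose traditions array contains every wanted value."""
    wanted = set(wanted)
    return [slug for slug, traditions in works.items() if wanted <= set(traditions)]
```

For example, with_traditions(WORKS, ["academic"]) selects only "brenton", and an empty filter selects every work.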

Word-level alignment

Source-language texts (Hebrew, Greek) carry word-level data: inflected surface form, lemma, Strong’s number, morphological parsing. Words in a translation can be aligned to source words via alignment groups — the same model used by the Macula Hebrew and Macula Greek projects.

The BSB currently has 411,427 alignment groups linking English words to their Hebrew or Greek source words, along with per-word Strong’s numbers and lemmas. Each alignment group is a logical unit containing some source words and some target words, supporting one-to-one, one-to-many, many-to-one, many-to-many, and null alignments uniformly. This is how interlinear-style displays and quotation matching (e.g. detecting NT quotations of the LXX) will be powered.

[Figure: Macula-style alignment groups]
  alignment_group · type=literal: a single logical alignment with many source and target words
  source · WLC Hebrew: בְּרֵאשִׁית בָּרָא אֱלֹהִים
  target · BSB English: In the beginning God created
  1 group, 3 source words, 5 target words: an m:n alignment in a single row.
Diagram: a many-to-many alignment group joining a three-word Hebrew phrase to its five-word English rendering.
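
One way to see why a single group shape covers every case is to derive the alignment kind from the two word lists. The Python sketch below is illustrative; the class and property names are assumptions, not the Macula or Open Scriptorium schema.

```python
from dataclasses import dataclass


@dataclass
class AlignmentGroup:
    source_words: list   # words in the Hebrew or Greek source text
    target_words: list   # words in the translation

    @property
    def cardinality(self):
        """Classify the group by how many words sit on each side."""
        s, t = len(self.source_words), len(self.target_words)
        if s == 0 or t == 0:
            return "null"   # a word with no counterpart on the other side

        def side(n):
            return "one" if n == 1 else "many"

        return f"{side(s)}-to-{side(t)}"


group = AlignmentGroup(
    source_words=["בְּרֵאשִׁית", "בָּרָא", "אֱלֹהִים"],
    target_words=["In", "the", "beginning", "God", "created"],
)
```

The group from the diagram classifies as many-to-many, and the same structure with one word on each side, or an empty side, needs no extra columns.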

Sub-verse references — Gen 1:1a

Critical editions, lexica, and academic commentaries cite half-verses constantly: Rom 5:12b, 1 Cor 11:24c, Heb 4:14ab. The canonical layer represents these by extending the hierarchy array. A whole verse is [chapter, verse]; a sub-verse adds a third element where 1 means “a”, 2 means “b”, and so on. Postgres array comparison gives the right ordering for free, so a chapter rendered in canonical order naturally interleaves whole verses and parts.
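
Python lists compare element by element with a shorter prefix sorting first, which is the same ordering rule Postgres applies to integer arrays, so the behaviour can be demonstrated in a few lines. The parser is a hypothetical helper that handles a single part letter only; combined forms such as 4:14ab are out of scope for this sketch.

```python
import re

_REF = re.compile(r"^(\d+):(\d+)([a-z]?)$")


def hierarchy(ref):
    """Turn '1:1' into [1, 1] and '1:1a' into [1, 1, 1]."""
    match = _REF.match(ref)
    if match is None:
        raise ValueError(f"unsupported reference: {ref}")
    chapter, verse, part = match.groups()
    result = [int(chapter), int(verse)]
    if part:
        result.append(ord(part) - ord("a") + 1)   # a -> 1, b -> 2, ...
    return result


refs = ["1:2", "1:1b", "1:1", "1:1a"]
in_canonical_order = sorted(refs, key=hierarchy)
# in_canonical_order == ["1:1", "1:1a", "1:1b", "1:2"]
```

The whole verse sorts before its parts because [1, 1] is a prefix of [1, 1, 1], and every part of verse 1 sorts before verse 2 because the second element decides the comparison.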

Provenance — import_runs + PaperTrail

Reproducible research needs to know which import produced a given row, not just what changed. Every importer creates an import_run row at the start of its execution, recording the source URL, source revision (sha or mtime), byte count, options, and the importer class. Every text row, word, lemma, cross-reference, and apparatus reading the importer creates is stamped with that run’s id.
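
The stamping pattern can be sketched as follows. This is not the project's importer: the function name, the example URL, and the choice of SHA-256 as the revision are assumptions made for illustration.

```python
import hashlib
import itertools
from dataclasses import dataclass, field

_next_id = itertools.count(1)


@dataclass
class ImportRun:
    source_url: str
    source_revision: str   # here a sha of the payload; an mtime would also do
    byte_count: int
    importer_class: str
    options: dict = field(default_factory=dict)
    id: int = field(default_factory=lambda: next(_next_id))


def run_import(source_url, payload, importer_class, options=None):
    """Create the run record first, then stamp every row created from the payload."""
    run = ImportRun(
        source_url=source_url,
        source_revision=hashlib.sha256(payload).hexdigest(),
        byte_count=len(payload),
        importer_class=importer_class,
        options=options or {},
    )
    rows = [
        {"text": line, "import_run_id": run.id}
        for line in payload.decode("utf-8").splitlines()
        if line
    ]
    return run, rows


run, rows = run_import("https://example.org/source.txt", b"first\nsecond\n", "ExampleImporter")
```

Because the run record exists before any row is written, every row can point back to the exact source revision it came from.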

PaperTrail captures field-level changes; import_runs captures which run made them. Together they answer both “what is the history of this verse?” and “which version of which source did this come from?”

Stable citations — SBL

Every canonical_work carries an SBL Handbook of Style abbreviation (Gen, Matt, Philo, QG, Gos. Thom.) and a citation format. Every canonical_ref caches an sbl_citation string computed from those: Gen 1:1, Gen 1:1a, Philo, QG 1.1, Josephus, Ant. 1.1.1. These are stable across slug changes and URL refactors, and are what every user-facing citation string comes from.
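
A hedged Python sketch of how such a cached string might be computed from an abbreviation, a citation format, and the hierarchy array. The two format names ("chapter_verse" and "dotted") are hypothetical labels, chosen only to reproduce the example strings above.

```python
def sbl_citation(abbreviation, citation_format, hierarchy):
    """Build a citation string from a work's abbreviation and a ref's hierarchy."""
    if citation_format == "chapter_verse":
        chapter, verse, *part = hierarchy
        suffix = chr(ord("a") + part[0] - 1) if part else ""   # 1 -> a, 2 -> b
        return f"{abbreviation} {chapter}:{verse}{suffix}"
    if citation_format == "dotted":
        return f"{abbreviation} " + ".".join(str(n) for n in hierarchy)
    raise ValueError(f"unknown citation format: {citation_format}")
```

Under these assumptions, sbl_citation("Gen", "chapter_verse", [1, 1, 1]) gives "Gen 1:1a" and sbl_citation("Josephus, Ant.", "dotted", [1, 1, 1]) gives "Josephus, Ant. 1.1.1". Neither depends on a slug or URL, which is what keeps the citation stable.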

Notes, pericopes, and the apparatus

Three more academic-grade tables that don’t fit elsewhere:

Feature flags on /works

The /works page displays small badges per work indicating which features its data supports. These are computed from the data, not declared by hand:

A client-side JavaScript filter on the same page lets users narrow the works list by language, traditions, work type, and feature flags — no server roundtrip needed.

CanonicalWork#display_title

Many canonical works have titles with scheme parentheticals like Psalms (LXX) or 3 John (German). The display_title method strips these suffixes for reader-facing display — breadcrumbs, chapter headings, and book lists show “Psalms” rather than “Psalms (LXX).” The full title remains in the database and in the schema reference.
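
The stripping step amounts to removing one trailing parenthetical. A Python sketch of the idea follows; note that this version removes any trailing parenthetical, whereas the real method may restrict itself to known scheme names.

```python
import re

_SCHEME_SUFFIX = re.compile(r"\s*\([^()]*\)\s*$")


def display_title(title):
    """Drop a trailing parenthetical such as ' (LXX)' for reader-facing display."""
    return _SCHEME_SUFFIX.sub("", title)
```

So display_title("Psalms (LXX)") returns "Psalms", display_title("3 John (German)") returns "3 John", and a title without a suffix is returned unchanged.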

Canonical ord monotonicity

The ord column on canonical_refs must be monotonically increasing with respect to the hierarchy array ordering within each canonical_work. This invariant is what makes int4range overlap queries on cross-references and pericopes correct — if ords are out of order, a range that should cover “Genesis 1:1–1:5” might accidentally include or exclude verses. Thirty-six canonical works were renumbered after the initial import to fix non-monotonic ords introduced when disputed or apocryphal verses were added after the initial numbering pass.
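
Both the invariant and a renumbering pass are easy to express. The Python sketch below works on (hierarchy, ord) pairs for a single canonical work; the function names are assumptions, and a real renumbering would presumably also have to rewrite any stored ranges that refer to the old ords.

```python
def is_monotonic(refs):
    """True when ord strictly increases as refs are walked in hierarchy order."""
    ordered = sorted(refs, key=lambda r: r[0])
    return all(a[1] < b[1] for a, b in zip(ordered, ordered[1:]))


def renumber(refs, start=1):
    """Assign fresh sequential ords following hierarchy order."""
    ordered = sorted(refs, key=lambda r: r[0])
    return [(hierarchy, start + i) for i, (hierarchy, _) in enumerate(ordered)]


# A part-verse [1, 2, 1] appended after the initial numbering pass received ord 5,
# although it belongs between [1, 2] and [1, 3].
refs = [([1, 1], 1), ([1, 2], 2), ([1, 3], 4), ([1, 2, 1], 5)]
fixed = renumber(refs)
```

In this example is_monotonic(refs) is False, and after renumbering the late addition takes ord 3 and the following verse moves to ord 4, so a range over ords again covers a contiguous run of verses.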

BSB data features

The Berean Standard Bible import is the most feature-complete work in the corpus. Beyond the verse text, it carries:

Current corpus

The full list of works in the corpus, with their languages, licenses, traditions, and feature flags, is on the works index.
