Search Architecture
How full-text search, semantic embeddings, and query classification work together to search across languages and traditions.
Overview
Every query passes through four stages: classification, execution, semantic enrichment (for text queries), and topical enrichment from external thematic indexes. The classifier inspects the raw query string and dispatches it to one of eleven strategies. Most queries also get a cross-language sidebar with related Hebrew and Greek lemmas.
Query classification
The classifier examines the query and assigns it one of these types:
| Type | Example | Detection |
|---|---|---|
| citation | Gen 1:1, 1 Cor 13 | Matches book abbreviation + chapter pattern |
| strongs | H430, G3056 | Matches /^[HG]\d{1,5}$/ |
| hebrew | בראשית | Pure Hebrew/Aramaic Unicode range |
| greek | ἀγάπη | Pure Greek Unicode range |
| lemma | lemma:λόγος | lemma: prefix |
| morphological | morph:V-QAI | morph: prefix |
| proximity | faith NEAR/3 works | Contains NEAR operator |
| regex | regex:blessed.*Lord | regex: prefix |
| xref traversal | xref:Rom 8:28 | xref: prefix |
| syntax | syntax: subject(god) verb(create) | syntax: prefix, or subj:/verb:/obj: shorthand |
| apparatus | apparatus substitution | apparatus keyword |
| text | in the beginning | Everything else |
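The detection rules in the table can be read as an ordered chain of checks. The following is a minimal Python sketch of that idea, not the project's classifier: the abbreviated book list, the order in which the checks run, and the exact Unicode ranges are all assumptions made for illustration.

```python
import re

# Hypothetical, abbreviated book list; a real classifier would know every book.
BOOKS = r"(?:[1-3]\s?)?(?:Gen|Exod|Ps|Matt|John|Rom|Cor)"
CITATION = re.compile(rf"^{BOOKS}\.?\s+\d+(?::\d+(?:-\d+)?)?$", re.IGNORECASE)
STRONGS = re.compile(r"^[HG]\d{1,5}$")             # pattern from the table
HEBREW = re.compile(r"^[\u0590-\u05FF\s]+$")       # assumed Hebrew block
GREEK = re.compile(r"^[\u0370-\u03FF\u1F00-\u1FFF\s]+$")  # assumed Greek blocks
NEAR = re.compile(r"\bNEAR(?:/\d+)?\b")

PREFIXES = [
    ("lemma:", "lemma"),
    ("morph:", "morphological"),
    ("regex:", "regex"),
    ("xref:", "xref traversal"),
    ("syntax:", "syntax"),
    ("subj:", "syntax"),
    ("verb:", "syntax"),
    ("obj:", "syntax"),
]


def classify(query: str) -> str:
    """Return the query type name; the check order is an assumption."""
    q = query.strip()
    lowered = q.lower()
    for prefix, kind in PREFIXES:
        if lowered.startswith(prefix):
            return kind
    if STRONGS.match(q):
        return "strongs"
    if CITATION.match(q):
        return "citation"
    if HEBREW.match(q):
        return "hebrew"
    if GREEK.match(q):
        return "greek"
    if NEAR.search(q):
        return "proximity"
    if re.search(r"\bapparatus\b", lowered):
        return "apparatus"
    return "text"  # everything else
```

Explicit prefixes are checked first in this sketch so that a query such as lemma:λόγος is not mistaken for a pure Greek-script query.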
Full-text search
Text queries are the common case. Every work_unit has a
pre-computed body_tsvector column populated by a trigger
that picks a text search configuration based on the work’s
language:
- English works use scriptorium_english (Snowball stemmer + custom stop words)
- Hebrew works are pre-processed by strip_hebrew_points(), which removes niqqud and cantillation marks, then indexed with scriptorium_simple
- Greek works are indexed with scriptorium_simple for exact token matching
- Latin works use scriptorium_simple
Both configurations are searched at once: the query is parsed into a
websearch_to_tsquery against both scriptorium_english
(for stemmed matching) and scriptorium_simple (for exact
matching), then OR’d together. So “beginning”
matches English works via the stemmer and Hebrew/Greek works via
exact token overlap. The tsquery combines synonym-expanded and raw
forms:
-- Three tsqueries OR'd: synonym-expanded stemmed, raw stemmed, raw exact
(websearch_to_tsquery('scriptorium_english', expand_bible_synonyms('honour thy father'))
|| websearch_to_tsquery('scriptorium_english', 'honour thy father')
|| websearch_to_tsquery('scriptorium_simple', 'honour thy father'))
Hebrew diacritics: strip_hebrew_points()
The body_tsvector trigger calls a custom SQL function
to strip niqqud (vowel points) and cantillation marks before indexing
Hebrew text:
CREATE FUNCTION strip_hebrew_points(text) RETURNS text AS $$
  SELECT regexp_replace($1, '[\u0591-\u05BD\u05BF-\u05C7]', '', 'g')
$$ LANGUAGE sql IMMUTABLE;
A search for אלהים
(consonants only) matches
אֱלֹהִים
(fully pointed), since the tsvector stores only consonantal forms.
Synonym expansion
Before building the tsquery, the query is passed through
expand_bible_synonyms(), a SQL function that does a
per-word table lookup against bible_synonyms:
-- Pure SQL synonym expansion, no filesystem dictionary needed.
-- Compatible with managed Postgres (RDS, Cloud SQL, etc.).
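CREATE FUNCTION expand_bible_synonyms(query text) RETURNS text AS $$
-- WITH ORDINALITY + ORDER BY keeps the words in their original order;
-- string_agg alone does not guarantee it.
SELECT string_agg(COALESCE(bs.target, t.word), ' ' ORDER BY t.pos)
FROM unnest(string_to_array(lower(query), ' ')) WITH ORDINALITY AS t(word, pos)
LEFT JOIN bible_synonyms bs ON bs.source = t.word
$$ LANGUAGE sql STABLE;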
This maps archaic English to modern equivalents
(thou → you,
hath → has,
honour → honor) and
normalises spelling variants
(hallelujah ↔ alleluia,
immanuel ↔ emmanuel).
The original query is also searched unstemmed, so exact matches still
rank highest.
Ranking
Results are ranked by a composite score:
relevance = ts_rank(tsvector, tsquery)
× (1 + ln(1 + xref_count))
× phrase_bonus
The three factors:
- ts_rank: Postgres’s built-in relevance score based on term frequency and document length.
- Cross-reference density: a precomputed xref_count on each canonical_ref counts how many cross-references touch that verse. Iconic verses like Genesis 1:1 and John 3:16 have high counts and get a logarithmic boost. That’s why “in the beginning God created” ranks Genesis 1:1 first even though plenty of verses contain similar words.
- Phrase bonus: for multi-word queries, verses containing the exact phrase (with punctuation stripped) get a 10× bonus. Synonym-expanded phrase matches get 8×. For single-word queries, exact word-boundary matches (\y regex) get 5×, which prevents stemming contamination where “Eve” would otherwise match “even” and “evening.”
The full ranking query in practice:
SELECT work_units.*,
ts_rank(body_tsvector, tsquery)
* (1.0 + LN(1 + canonical_refs.xref_count))
* CASE
-- Multi-word: exact phrase in body (punctuation stripped)
WHEN REGEXP_REPLACE(body, '[^a-zA-Z\s]', '', 'g')
ILIKE '%in the beginning God created%' THEN 10.0
-- Synonym-expanded phrase match
WHEN REGEXP_REPLACE(body, '[^a-zA-Z\s]', '', 'g')
ILIKE '%' || expand_bible_synonyms(...) || '%' THEN 8.0
ELSE 1.0
END AS relevance
FROM work_units
JOIN canonical_refs ON canonical_refs.id = work_units.canonical_ref_id
JOIN works ON works.id = work_units.work_id
WHERE body_tsvector @@ (tsquery)
ORDER BY relevance DESC,
canonical_refs.ord ASC, -- deterministic tiebreaker
works.slug ASC
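The composite score can also be traced outside the database. The Python sketch below mirrors the three factors described above under stated assumptions: the ts_rank value is taken as an input rather than computed, Python's \b stands in for Postgres's \y word boundary, and the function names are hypothetical.

```python
import math
import re


def phrase_bonus(query: str, expanded_query: str, body: str) -> float:
    """Bonus multiplier: 10x exact phrase, 8x synonym-expanded phrase,
    5x single-word boundary match, otherwise 1x."""
    words = query.lower().split()
    if not words:
        return 1.0
    if len(words) > 1:
        # Strip punctuation from the verse body before comparing phrases.
        letters_only = re.sub(r"[^a-zA-Z\s]", "", body).lower()
        if query.lower() in letters_only:
            return 10.0
        if expanded_query.lower() in letters_only:
            return 8.0
        return 1.0
    # Single word: require a whole-word match, so "Eve" does not hit "evening".
    if re.search(rf"\b{re.escape(words[0])}\b", body, re.IGNORECASE):
        return 5.0
    return 1.0


def relevance(ts_rank: float, xref_count: int, bonus: float) -> float:
    """ts_rank x (1 + ln(1 + xref_count)) x phrase_bonus."""
    return ts_rank * (1.0 + math.log(1 + xref_count)) * bonus
```

With xref_count = 0 the middle factor is exactly 1, so verses without cross-references are neither boosted nor penalised; the logarithm keeps heavily referenced verses from overwhelming the text-match signal.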
OR fallback
Multi-word queries use AND logic by default. If the AND query returns zero results (common for long phrases with stopwords), the search retries with OR logic so verses matching most content words still appear. That’s how “put on the whole armor of God” finds Ephesians 6:11 even when the stemmer disagrees about “armor.”
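The retry logic is small enough to sketch in a few lines of Python. This is an illustration only: run_tsquery is a hypothetical callable that executes a to_tsquery-style string (using the & and | operators) and returns matching verses, and the words are assumed to be already sanitised.

```python
from typing import Callable, List, Sequence


def search_with_or_fallback(
    words: Sequence[str],
    run_tsquery: Callable[[str], List[str]],
) -> List[str]:
    """Try an AND query first; retry with OR only when AND finds nothing."""
    results = run_tsquery(" & ".join(words))
    if results or len(words) < 2:
        return results
    # Zero hits for a multi-word query: relax to OR and let ranking sort it out.
    return run_tsquery(" | ".join(words))
```

Because the fallback fires only on an empty result set, queries that already match under AND logic keep their stricter, more precise results.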
Semantic search (embeddings)
The full-text layer matches the words the user typed. The embedding layer catches paraphrases and conceptual matches that share no vocabulary.
A multilingual sentence embedding model (Qwen3-Embedding-0.6B, 1024 dimensions) encodes every work unit in the corpus into a dense vector. At query time the query is encoded into the same space and the nearest neighbours come back from a pgvector HNSW index. That dense list is fused with the FTS list via Reciprocal Rank Fusion (RRF, k=60), so items that rank high in both lists rise to the top.
-- Find the 50 nearest verses by cosine similarity.
-- The <=> operator is pgvector's cosine distance.
-- HNSW index makes this sub-millisecond across 672,000 vectors.
-- Vectors live in a side table to keep the work_units row narrow.
SELECT wu.*,
1 - (e.embedding <=> query_embedding) AS similarity
FROM work_unit_embeddings_qwen3_06b e
JOIN work_units wu ON wu.id = e.work_unit_id
ORDER BY e.embedding <=> query_embedding
LIMIT 50
Training data
The model was fine-tuned on 1.2 million pairs extracted entirely from the corpus:
| Dataset | Pairs | What it teaches |
|---|---|---|
| English paraphrase pairs | 96,778 | Same verse in BSB, KJV, WEBBE, JPS: different wording, same meaning |
| Hebrew–English word alignments | 288,067 | Hebrew word ↔ English translation via Macula alignment data |
| Greek–English word alignments | 123,360 | Greek word ↔ English translation via Macula alignment data |
| Cross-reference pairs | 200,000 | Thematically related passages (sampled from 681,659 total) |
| Verse-level cross-language | 57,858 | Full Hebrew/Greek verses paired with their English translations |
The training uses MultipleNegativesRankingLoss: given a pair of texts that should be similar, other items in the batch act as negatives. The model learns to pull parallel and related texts together in the vector space and push unrelated texts apart.
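To make the in-batch negatives idea concrete, here is a dependency-free Python sketch of the loss on toy vectors. It is not the training code: real training runs on batched tensors, and the scale of 20 applied to the cosine similarities is an assumption made for this example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnr_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    scale: float = 20.0,  # assumed similarity scale
) -> float:
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

The loss approaches zero when each anchor is most similar to its own positive, and grows when another item in the batch is closer, which is the pull-together, push-apart behaviour described above.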
Trilingual vector space
Because the training data includes 411,000 word-level alignments between Hebrew/Greek source words and their English translations, the model learns a single vector space where all three languages coexist. A search for “peace” is geometrically close to שָׁלוֹם (shalom) and εἰρήνη (eirene). No dictionary lookup is involved; the model learned the relationship from aligned usage in parallel texts.
Hybrid scoring (Reciprocal Rank Fusion)
The final ranking combines both layers via
Reciprocal Rank Fusion: each item gets a score of
1 / (k + rank_L) from each list L it appears in, where rank_L is
the item’s position in that list, and the scores are summed across
lists. Items that rank well in both lists
end up on top, so keyword hits and semantic hits reinforce each
other. The constant k = 60 comes from the original RRF
paper; it dampens the spread between rank 1 and rank 50 enough that
a doc at rank 5 in both lists (2/65 ≈ 0.031) outranks a doc at rank 1
in only one (1/61 ≈ 0.016).
-- RRF score = sum of 1/(k + rank) across both lists.
-- No score normalization needed; ranks are comparable.
WITH fts AS (
  SELECT id,
         ROW_NUMBER() OVER (ORDER BY ts_rank DESC) AS rnk
  FROM ...  -- full-text query
), semantic AS (
  SELECT wu.id,
         ROW_NUMBER() OVER (ORDER BY e.embedding <=> query_embedding) AS rnk
  FROM work_unit_embeddings_qwen3_06b e
  JOIN work_units wu ON wu.id = e.work_unit_id
  ORDER BY e.embedding <=> query_embedding
  LIMIT 100
)
SELECT COALESCE(fts.id, semantic.id) AS id,
       COALESCE(1.0 / (60 + fts.rnk), 0)
       + COALESCE(1.0 / (60 + semantic.rnk), 0) AS score
FROM fts
FULL OUTER JOIN semantic ON fts.id = semantic.id
ORDER BY score DESC
Measured on a 355-concept thematic eval (2026-04-26): RRF beats bi-encoder alone by +7.8 pp MRR and FTS alone by +11.9 pp MRR.
The α and β weights are tuned so exact keyword matches always rank above pure semantic matches, with semantic matches filling in where keywords fail. A search for “forgiveness” shows verses containing that word first (FTS), then verses about pardoning, mercy, and reconciliation that never use the word itself (embeddings).
Semantic search can be disabled in the advanced search controls for users who want strict keyword-only results.
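The fusion step itself is easy to reproduce outside SQL. The following Python sketch applies the 1 / (k + rank) formula to any number of ranked lists; the function name and the alphabetical tie-break are assumptions made for the example.

```python
from typing import Dict, Hashable, List, Sequence, Tuple


def rrf_fuse(
    ranked_lists: Sequence[Sequence[Hashable]],
    k: int = 60,
) -> List[Tuple[Hashable, float]]:
    """Sum 1 / (k + rank) over every list an item appears in (ranks start at 1)."""
    scores: Dict[Hashable, float] = {}
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by item for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


fts_hits = ["a", "b", "c", "d", "e"]       # "e" is rank 5 in the keyword list
semantic_hits = ["x", "y", "z", "w", "e"]  # and rank 5 in the semantic list
fused = rrf_fuse([fts_hits, semantic_hits])
```

Here "e" scores 2/65 and lands first, ahead of "a" and "x", which each score 1/61 from a single list. Only ranks enter the formula, so the raw ts_rank and cosine values never need to be put on a common scale.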
Topical themes
On top of FTS and semantic search, the executor merges in a third
signal: topical themes. Two external CC BY 4.0
datasets are imported into a generic
topics + topic_verses schema:
- OpenBible.info Topics: community-voted thematic mappings, ~6,700 topics (after editorial filtering, see below) covering ~210,000 topic–verse pairs. Ranked by quality score (percentage of votes for each passage).
- Sefaria Topics: curator-vetted Jewish topical index, filtered to Tanakh refs only (~2,100 topics, ~16,000 pairs). Ranked by Sefaria’s PageRank-style order.pr score.
At query time, the TopicRetrieval strategy extracts every
1- to 4-word contiguous substring of the query and looks them up
against an indexed topics.name column. Each match
contributes its top-ranked verses, scored as
(rank_within_topic / topic_max) × source_weight × 3.0.
Sefaria gets a 1.5× source weight relative to
OpenBible’s 1.0×: on contested topics where
both have data, Sefaria’s ranking is the stronger signal, per
the
2026-04-27 bias audit
that compared OpenBible, Nave’s, and Sefaria against modern
academic critical consensus.
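The phrase extraction and scoring described above can be sketched in Python as follows. The function names and the dictionary keys are hypothetical; the 1.0× and 1.5× source weights and the 3.0 multiplier come from the text, and rank_within_topic is taken as given by the importer.

```python
from typing import List

# Source weights from the text: Sefaria 1.5x relative to OpenBible 1.0x.
SOURCE_WEIGHTS = {"openbible": 1.0, "sefaria": 1.5}


def candidate_phrases(query: str, max_words: int = 4) -> List[str]:
    """Every contiguous run of 1 to max_words words in the query."""
    words = query.lower().split()
    return [
        " ".join(words[i : i + n])
        for n in range(1, max_words + 1)
        for i in range(len(words) - n + 1)
    ]


def topic_verse_score(rank_within_topic: float, topic_max: float, source: str) -> float:
    """(rank_within_topic / topic_max) x source_weight x 3.0."""
    return (rank_within_topic / topic_max) * SOURCE_WEIGHTS[source] * 3.0
```

A three-word query such as "love your enemies" yields six candidates (three single words, two pairs, one triple), each of which would be looked up against topics.name.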
Editorial filtering on OpenBible Topics
OpenBible’s topics dataset is community-voted, and the voter
base skews American evangelical, which encodes detectable bias on
socially contested questions. A small editorial blacklist
(config/openbible_blacklist.yml) excludes ~13 topic
clusters where the verse-to-topic mapping is anachronistic
(importing modern moral framing onto texts that don’t address
the question) or significantly distorting (the foundational text in
an active scholarly debate is missing from the top-N). Topics where
a conservative reading is vocal but the verse mapping itself is
honest are not excluded. The full list and reasoning live
in the version-controlled YAML file.
The blacklist applies only to OpenBible’s ranking; the verses themselves remain available via FTS, semantic, and (where applicable) Sefaria’s alternative topical view. Filtering doesn’t remove text from the corpus; it removes one source’s framing of which texts answer which question.
Match-stop list
Two classes of topic names never trigger a match: query
meta-vocabulary (“bible,” “scripture,”
“verse”) and biblical person names (“jesus,”
“paul,” “moses,” …). Person names
belong to PersonTopicSearch; matching them as topics
double-counts and floods results with generic person verses. Query
meta-vocabulary fires on phrases like “what does the bible
say about X,” which would otherwise activate a niche
“bible” topic about scripture itself rather than the
user’s actual topic word.
Script and word-level search
Queries in Hebrew or Greek script skip text search and go through
the word-level index. The query is normalised (Hebrew niqqud
stripped, Greek accents and breathings stripped) and matched against
the words.normalized column using trigram similarity
(GIN-indexed). So a user can type
אלהים
(without pointing) and find every occurrence of
אֱלֹהִים
(with pointing) across the WLC.
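The two normalisation steps can be sketched in Python. The Hebrew character class mirrors the range used by strip_hebrew_points(); treating Greek accents and breathings as Unicode combining marks removed after NFD decomposition is an assumption about how the Greek side works, and both function names are hypothetical.

```python
import re
import unicodedata

# Same code-point ranges as strip_hebrew_points(): niqqud and cantillation marks.
HEBREW_POINTS = re.compile("[\u0591-\u05BD\u05BF-\u05C7]")


def normalize_hebrew(text: str) -> str:
    """Drop vowel points and cantillation, leaving the consonantal text."""
    return HEBREW_POINTS.sub("", text)


def normalize_greek(text: str) -> str:
    """Decompose, drop combining marks (accents, breathings), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", bare)
```

Applying the same function to both the stored words and the incoming query is what lets an unpointed query line up with pointed text.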
Strong’s number queries (H430, G3056)
search the words.strongs_number column directly and
return every verse containing a word tagged with that number, plus a
sidebar card with the associated lemma, gloss, and frequency. The
language prefix (H or G) filters the sidebar to the correct language
family.
Proximity search
The NEAR operator finds verses where two terms
co-occur. faith NEAR/3 works finds verses where both
terms appear in the same verse, or within 3 verses of each other in
the same work. Both terms are matched via body_tsvector
(not ILIKE), so the search is index-accelerated and avoids substring
false positives.
-- Same-verse: both terms in one work_unit (GIN-indexed)
WHERE body_tsvector @@ plainto_tsquery('scriptorium_english', 'faith')
  AND body_tsvector @@ plainto_tsquery('scriptorium_english', 'works')

-- Cross-verse: term A in one verse, term B within N ords
SELECT DISTINCT a.id
FROM work_units a
JOIN canonical_refs cr_a ON cr_a.id = a.canonical_ref_id
WHERE a.body_tsvector @@ plainto_tsquery('scriptorium_english', 'faith')
  AND EXISTS (
    SELECT 1
    FROM work_units b
    JOIN canonical_refs cr_b ON cr_b.id = b.canonical_ref_id
    WHERE b.work_id = a.work_id
      AND b.body_tsvector @@ plainto_tsquery('scriptorium_english', 'works')
      AND ABS(cr_a.ord - cr_b.ord) <= 3
  )
Cross-language sidebar
For English text queries, the search runs a parallel lookup against the lemma glosses to find related Hebrew and Greek terms. The sidebar shows up to three entries per language, with the original script, transliteration, part of speech, primary gloss, and corpus frequency.
Users and lexicons don’t always pick the same word, so a concept synonym layer expands queries with known equivalences: “law” also searches for “instruction,” “statute,” and “commandment” in the glosses, which is how תּוֹרָה (torah, glossed as “instruction”) shows up in the sidebar for a search for “law.”
Transliteration fallback
When a single Latin-script word returns very few text results (fewer than 5), the search checks whether it matches a lemma transliteration. If it does, the search pivots to a lemma search and returns every verse containing that lemma. That’s how typing “agape” finds every occurrence of ἀγάπη across the Greek New Testament.
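The pivot can be sketched as a small Python function. Everything passed in is hypothetical: text_search and lemma_search stand for the two search paths, and lemma_by_translit is assumed to be a mapping from diacritic-stripped transliterations to lemmas. The threshold of 5 comes from the text.

```python
import unicodedata
from typing import Callable, List, Mapping


def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn").lower()


def search_with_transliteration_fallback(
    query: str,
    text_search: Callable[[str], List[str]],
    lemma_by_translit: Mapping[str, str],
    lemma_search: Callable[[str], List[str]],
    min_results: int = 5,
) -> List[str]:
    """Pivot to a lemma search when a single Latin word finds few verses."""
    results = text_search(query)
    is_single_latin_word = query.isascii() and query.isalpha()
    if len(results) >= min_results or not is_single_latin_word:
        return results
    lemma = lemma_by_translit.get(strip_diacritics(query))
    return lemma_search(lemma) if lemma else results
```

Common English words never reach the lemma lookup in this sketch, because they return at least five text results on their own.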
Default scoping
By default, text searches are scoped to scripture-type works in English, Hebrew, Aramaic, Greek, and Latin. That keeps commentary, Talmud, and non-English translations from diluting results for typical searches. The scope can be widened via section filters (Hebrew Bible, Greek NT, Deuterocanon, Pseudepigrapha, Apostolic Fathers, Rabbinic, Patristic), tradition filters, specific work selection, or by choosing “all works” in the search controls.
Syntactic search
Queries with the syntax: prefix search by grammatical role,
using 450,891 syntax nodes from the Macula Hebrew and
Greek projects. The query syntax: subject(god) verb(create)
finds every verse where a word glossed “god” fills the
subject role and a word glossed “create” fills the verb
role. Roles can be specified with full names or abbreviations:
| Full name | Short | Code |
|---|---|---|
| subject | subj | s |
| verb | — | v |
| object | obj | o |
| indirect object | iobj | io |
| complement | comp | c |
| adjunct | adj | a |
The lemma argument accepts Hebrew or Greek script
(subject(אלהים)), English glosses
(subject(god)), or a mix of both. English glosses are
matched against the lemma’s primary_gloss via
ILIKE, FTS stemming, and the senses JSONB array, so
“speak” finds both “speaks” and
“spoken.” When explicit role annotations are absent,
verb detection falls back to morphological codes
(morph_code LIKE 'V%').
Each result includes an expandable syntax tree rendered
client-side, showing the clause structure with colour-coded
grammatical roles. The tree is fetched from the
/api/v1/syntax/:work/:book/:chapter/:verse endpoint.
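As an illustration of how the role names and abbreviations from the table might be resolved, the Python sketch below parses both the role(lemma) form and the role:lemma shorthand into (role, lemma) pairs. The parser, its regular expression, and the error handling are assumptions; only the role names and abbreviations come from the table.

```python
import re
from typing import List, Tuple

# Abbreviations and codes from the table, mapped to full role names.
ROLE_ALIASES = {
    "subject": "subject", "subj": "subject", "s": "subject",
    "verb": "verb", "v": "verb",
    "object": "object", "obj": "object", "o": "object",
    "iobj": "indirect object", "io": "indirect object",
    "complement": "complement", "comp": "complement", "c": "complement",
    "adjunct": "adjunct", "adj": "adjunct", "a": "adjunct",
}

# Either role(lemma) or the role:lemma shorthand.
TERM = re.compile(r"(\w+)\(([^)]+)\)|(\w+):(\S+)")


def parse_syntax_query(query: str) -> List[Tuple[str, str]]:
    """Return (full role name, lemma or gloss) pairs in query order."""
    body = query.strip()
    if body.lower().startswith("syntax:"):
        body = body[len("syntax:"):]
    constraints = []
    for match in TERM.finditer(body):
        role = (match.group(1) or match.group(3)).lower()
        lemma = (match.group(2) or match.group(4)).strip()
        if role not in ROLE_ALIASES:
            raise ValueError(f"unknown grammatical role: {role}")
        constraints.append((ROLE_ALIASES[role], lemma))
    return constraints
```

The lemma argument is passed through untouched, so Hebrew or Greek script and English glosses can be mixed in one query and resolved later against the lemma table.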
Passage aliases and named passage search
The corpus includes 4,174 passage aliases with 10,194 searchable names covering well-known passages across the entire Bible: parables, miracles, speeches, prophecies, psalms, laws, and narrative episodes. The search executor injects matching alias results alongside text search results, with a “named passage” badge.
Name matching uses exact lookup (with parenthetical variants and prefix stripping, so “Parable of the Sower” is still found when the query omits the “parable of the” prefix) plus trigram similarity for fuzzy matching. The alias system is many-to-many: one name can map to multiple passages (e.g. the Great Commission in Matthew and Mark), and one passage can have multiple names.
Pericope-based retrieval
For natural-language questions (“What did Jesus say about war?”), the search runs a secondary retrieval pass against the pericope name embeddings. Each of the 3,002+ pericopes has its name encoded into the same 1024-dimensional vector space as the verse embeddings. The query is encoded and matched against pericope names by cosine similarity, then the top-matching pericopes’ verses are injected into the result set in a single batched join.
For broad topics, retrieval uses multi-hop encoding: a topic like “war” is expanded into multiple facets (“war violence battle,” “peace nonviolence,” “enemies love retaliation”), each encoded separately. The union of pericope matches across all facets gives broader coverage than a single query vector.
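The union across facets can be sketched in Python. Both callables are hypothetical stand-ins: encode represents the embedding model and nearest_pericopes represents the vector lookup, returning (pericope id, similarity) pairs. Keeping the best similarity per pericope is an assumption; the text only says the matches are unioned.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def multi_hop_pericopes(
    facets: Sequence[str],
    encode: Callable[[str], object],
    nearest_pericopes: Callable[[object, int], List[Tuple[str, float]]],
    per_facet: int = 5,
) -> List[str]:
    """Encode each facet separately and union the pericope matches."""
    best: Dict[str, float] = {}
    for facet in facets:
        for pericope_id, similarity in nearest_pericopes(encode(facet), per_facet):
            # A pericope found by several facets keeps its highest similarity.
            if similarity > best.get(pericope_id, float("-inf")):
                best[pericope_id] = similarity
    return sorted(best, key=lambda pid: (-best[pid], pid))
```

Encoding facets separately avoids averaging opposing ideas such as "war" and "peace" into a single vector that sits close to neither.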
Search suggestions
As the user types, an autocomplete dropdown shows two kinds of suggestions:
- Lemma transliterations: typing “aga” suggests ἀγάπη (agape, “love”). Matched against the transliteration column on lemmas with diacritics stripped, so “agape” matches “agapē”.
- Passage aliases: typing “prodig” suggests “Prodigal Son.” Matched via trigram similarity against the passage_alias_names table.
Suggestions are fetched from /search/completions?q=...
with a 250ms debounce. Keyboard navigation (arrow keys + enter)
is supported.
Things to read next
- Architecture overview
- Schema reference
- API documentation: the search endpoint supports all query types programmatically.