Search Architecture

How full-text search, semantic embeddings, and query classification work together to search across languages and traditions.

Overview

Every query passes through four stages: classification, execution, semantic enrichment (for text queries), and topical enrichment from external thematic indexes. The classifier inspects the raw query string and dispatches it to one of the twelve strategies listed below. Most queries also get a cross-language sidebar with related Hebrew and Greek lemmas.

Query classification

The classifier examines the query and assigns it one of these types:

| Type | Example | Detection |
| --- | --- | --- |
| citation | Gen 1:1, 1 Cor 13 | Matches book abbreviation + chapter pattern |
| strongs | H430, G3056 | Matches /^[HG]\d{1,5}$/ |
| hebrew | בראשית | Pure Hebrew/Aramaic Unicode range |
| greek | ἀγάπη | Pure Greek Unicode range |
| lemma | lemma:λόγος | lemma: prefix |
| morphological | morph:V-QAI | morph: prefix |
| proximity | faith NEAR/3 works | Contains NEAR operator |
| regex | regex:blessed.*Lord | regex: prefix |
| xref traversal | xref:Rom 8:28 | xref: prefix |
| syntax | syntax: subject(god) verb(create) | syntax: prefix, or subj:/verb:/obj: shorthand |
| apparatus | apparatus substitution | apparatus keyword |
| text | in the beginning | Everything else |
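The dispatch described in the table can be sketched in a few lines of Python. This is a simplified illustration, not the production classifier: the check order, the `classify` and `only_script` names, and the tiny book-abbreviation list are assumptions made for the example. Only the Strong's pattern is taken directly from the table.

```python
import re
import unicodedata

# Prefix-based types from the table (including the syntax shorthand).
PREFIXES = {
    "lemma:": "lemma", "morph:": "morphological", "regex:": "regex",
    "xref:": "xref traversal", "syntax:": "syntax",
    "subj:": "syntax", "verb:": "syntax", "obj:": "syntax",
}
STRONGS = re.compile(r"^[HG]\d{1,5}$")      # pattern quoted in the table
NEAR = re.compile(r"\bNEAR(?:/\d+)?\b")
# Assumption: a handful of abbreviations standing in for the real book list.
CITATION = re.compile(
    r"^(?:[1-3]\s)?(?:Gen|Exod|Ps|Rom|Cor|John)\.?\s\d+(?::\d+)?$", re.I
)
HEBREW = [(0x0590, 0x05FF)]
GREEK = [(0x0370, 0x03FF), (0x1F00, 0x1FFF)]


def only_script(text, ranges):
    """True if every non-space character falls inside one of the ranges."""
    chars = [c for c in text if not c.isspace()]
    return bool(chars) and all(
        any(lo <= ord(c) <= hi for lo, hi in ranges) for c in chars
    )


def classify(query):
    # NFC composes accents so precomposed Greek letters land in the Greek blocks.
    q = unicodedata.normalize("NFC", query.strip())
    for prefix, kind in PREFIXES.items():
        if q.lower().startswith(prefix):
            return kind
    if STRONGS.match(q):
        return "strongs"
    if CITATION.match(q):
        return "citation"
    if NEAR.search(q):
        return "proximity"
    if q.lower().split()[:1] == ["apparatus"]:
        return "apparatus"
    if only_script(q, HEBREW):
        return "hebrew"
    if only_script(q, GREEK):
        return "greek"
    return "text"  # everything else
```

Checking explicit prefixes first keeps the cheap, unambiguous cases away from the looser pattern checks. The real ordering may differ.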

Full-text search

Text queries are the common case. Every work_unit has a pre-computed body_tsvector column, populated by a trigger that picks one of two text search configurations, scriptorium_english or scriptorium_simple, based on the work’s language.

Both configurations are searched at once: the query is parsed with websearch_to_tsquery under both scriptorium_english (for stemmed matching) and scriptorium_simple (for exact matching), and the resulting tsqueries are OR’d together. So “beginning” matches English works via the stemmer and Hebrew/Greek works via exact token overlap. The tsquery combines synonym-expanded and raw forms:

-- Three tsqueries OR'd: synonym-expanded stemmed, raw stemmed, raw exact
(websearch_to_tsquery('scriptorium_english', expand_bible_synonyms('honour thy father'))
 || websearch_to_tsquery('scriptorium_english', 'honour thy father')
 || websearch_to_tsquery('scriptorium_simple', 'honour thy father'))

Hebrew diacritics: strip_hebrew_points()

The body_tsvector trigger calls a custom SQL function to strip niqqud (vowel points) and cantillation marks before indexing Hebrew text:

CREATE FUNCTION strip_hebrew_points(text) RETURNS text AS $$
  SELECT regexp_replace($1, '[\u0591-\u05BD\u05BF-\u05C7]', '', 'g')
$$ LANGUAGE sql IMMUTABLE;

A search for אלהים (consonants only) matches אֱלֹהִים (fully pointed), since the tsvector stores only consonantal forms.
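The same stripping can be reproduced outside the database, for example to normalise a query before comparing it with indexed tokens. The sketch below is a Python equivalent that reuses the character class from the SQL function. The function name mirrors the SQL one, but the Python port itself is an illustration rather than part of the system.

```python
import re

# Same character class as the SQL function: U+0591-U+05BD and U+05BF-U+05C7.
# U+05BE (maqaf, the Hebrew hyphen) falls outside both ranges, so it is kept.
_POINTS = re.compile("[\u0591-\u05BD\u05BF-\u05C7]")


def strip_hebrew_points(text):
    """Remove vowel points and cantillation marks, leaving consonants."""
    return _POINTS.sub("", text)


# Elohim, fully pointed, versus consonants only.
pointed = "\u05D0\u05B1\u05DC\u05B9\u05D4\u05B4\u05D9\u05DD"
bare = "\u05D0\u05DC\u05D4\u05D9\u05DD"
print(strip_hebrew_points(pointed) == bare)
```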

Synonym expansion

Before building the tsquery, the query is passed through expand_bible_synonyms(), a SQL function that does a per-word table lookup against bible_synonyms:

-- Pure SQL synonym expansion, no filesystem dictionary needed.
-- Compatible with managed Postgres (RDS, Cloud SQL, etc.).
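CREATE FUNCTION expand_bible_synonyms(query text) RETURNS text AS $$
  -- WITH ORDINALITY + ORDER BY keeps the words in their original order;
  -- string_agg gives no ordering guarantee without it.
  SELECT string_agg(COALESCE(bs.target, t.word), ' ' ORDER BY t.ord)
  FROM unnest(string_to_array(lower(query), ' ')) WITH ORDINALITY AS t(word, ord)
  LEFT JOIN bible_synonyms bs ON bs.source = t.word
$$ LANGUAGE sql STABLE;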

This maps archaic English to modern equivalents (thou → you, hath → has, honour → honor) and normalises spelling variants (hallelujah ↔ alleluia, immanuel ↔ emmanuel). The original query is also searched unstemmed, so exact matches still rank highest.
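As a rough illustration of the per-word lookup, here is the same idea in Python. The dictionary holds only the three archaic-to-modern mappings named above and stands in for the bible_synonyms table. Everything else about the table's contents is unknown here.

```python
# Sample rows standing in for the bible_synonyms table (source -> target).
BIBLE_SYNONYMS = {"thou": "you", "hath": "has", "honour": "honor"}


def expand_bible_synonyms(query):
    """Lowercase the query and replace each word that has a synonym entry.

    Words without an entry pass through unchanged, and word order is kept.
    """
    words = query.lower().split(" ")
    return " ".join(BIBLE_SYNONYMS.get(word, word) for word in words)


print(expand_bible_synonyms("Honour thy father"))
```

Because unmatched words fall through untouched, the expansion is safe to apply to every query.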

Ranking

Results are ranked by a composite score:

relevance = ts_rank(tsvector, tsquery)
          × (1 + ln(1 + xref_count))
          × phrase_bonus

The three factors:

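  1. ts_rank: Postgres’s built-in relevance score, based on how often the query terms occur in the document. (Called without a normalization argument, as it is here, it does not adjust for document length.)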
  2. Cross-reference density: a precomputed xref_count on each canonical_ref counts how many cross-references touch that verse. Iconic verses like Genesis 1:1 and John 3:16 have high counts and get a logarithmic boost. That’s why “in the beginning God created” ranks Genesis 1:1 first even though plenty of verses contain similar words.
  3. Phrase bonus: for multi-word queries, verses containing the exact phrase (with punctuation stripped) get a 10× bonus. Synonym-expanded phrase matches get 8×. For single-word queries, exact word-boundary matches (\y regex) get 5×, which prevents stemming contamination where “Eve” would otherwise match “even” and “evening.”

The full ranking query in practice:

SELECT work_units.*,
       ts_rank(body_tsvector, tsquery)
         * (1.0 + LN(1 + canonical_refs.xref_count))
         * CASE
             -- Multi-word: exact phrase in body (punctuation stripped)
             WHEN REGEXP_REPLACE(body, '[^a-zA-Z\s]', '', 'g')
                  ILIKE '%in the beginning God created%' THEN 10.0
             -- Synonym-expanded phrase match
             WHEN REGEXP_REPLACE(body, '[^a-zA-Z\s]', '', 'g')
                  ILIKE '%' || expand_bible_synonyms(...) || '%' THEN 8.0
             ELSE 1.0
           END AS relevance
FROM work_units
JOIN canonical_refs ON canonical_refs.id = work_units.canonical_ref_id
JOIN works ON works.id = work_units.work_id
WHERE body_tsvector @@ (tsquery)
ORDER BY relevance DESC,
         canonical_refs.ord ASC,  -- deterministic tiebreaker
         works.slug ASC
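To see how the three factors interact, the composite score can be worked through in Python. This is a sketch of the arithmetic only: `phrase_bonus` and `relevance` are illustrative names, and the inputs in the usage line are made-up numbers, not measured values.

```python
import math
import re


def phrase_bonus(query, body, expanded_query=None):
    """Return the 10x / 8x / 5x / 1x multiplier described above."""
    if len(query.split()) > 1:
        # Multi-word: compare against the body with punctuation stripped.
        clean = re.sub(r"[^a-zA-Z\s]", "", body).lower()
        if query.lower() in clean:
            return 10.0
        if expanded_query and expanded_query.lower() in clean:
            return 8.0
        return 1.0
    # Single word: require a whole-word match (Python \b plays the role of \y).
    if re.search(rf"\b{re.escape(query)}\b", body, re.IGNORECASE):
        return 5.0
    return 1.0


def relevance(ts_rank, xref_count, bonus):
    """ts_rank x (1 + ln(1 + xref_count)) x phrase_bonus."""
    return ts_rank * (1.0 + math.log(1 + xref_count)) * bonus


verse = "In the beginning, God created the heavens and the earth."
bonus = phrase_bonus("in the beginning God created", verse)
print(relevance(0.1, 100, bonus))
```

The logarithm keeps heavily cross-referenced verses from swamping everything else: going from 0 to 100 cross-references multiplies the score by roughly 5.6, not by 100.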

OR fallback

Multi-word queries use AND logic by default. If the AND query returns zero results (common for long phrases with stopwords), the search retries with OR logic so verses matching most content words still appear. That’s how “put on the whole armor of God” finds Ephesians 6:11 even when the stemmer disagrees about “armor.”
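The retry logic amounts to a small wrapper around the query runner. In the sketch below, `run` is a hypothetical callback that executes a websearch-style query string and returns matching ids. Joining the words with OR relies on websearch_to_tsquery treating the word "or" as the OR operator. How the real executor builds its OR query is not stated, so that part is an assumption.

```python
def search_with_or_fallback(query, run):
    """Try the default AND query first; retry with OR only if it is empty."""
    results = run(query)              # bare words are ANDed by default
    words = query.split()
    if results or len(words) < 2:
        return results
    return run(" OR ".join(words))    # any word may match on the retry


# Usage with a stand-in runner that only finds something on the OR form.
calls = []


def fake_run(q):
    calls.append(q)
    return [611] if " OR " in q else []


print(search_with_or_fallback("whole armor", fake_run))
```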

Semantic search (embeddings)

The full-text layer matches the words the user typed. The embedding layer catches paraphrases and conceptual matches that share no vocabulary.

A multilingual sentence embedding model (Qwen3-Embedding-0.6B, 1024 dimensions) encodes every work unit in the corpus into a dense vector. At query time the query is encoded into the same space and the nearest neighbours come back from a pgvector HNSW index. That dense list is fused with the FTS list via Reciprocal Rank Fusion (RRF, k=60), so items that rank high in both lists rise to the top.

-- Find the 50 nearest verses by cosine similarity.
-- The <=> operator is pgvector's cosine distance.
-- HNSW index makes this sub-millisecond across 672,000 vectors.
-- Vectors live in a side table to keep the work_units row narrow.
SELECT wu.*,
       1 - (e.embedding <=> query_embedding) AS similarity
FROM work_unit_embeddings_qwen3_06b e
JOIN work_units wu ON wu.id = e.work_unit_id
ORDER BY e.embedding <=> query_embedding
LIMIT 50

Training data

The model was fine-tuned on 1.2 million pairs extracted entirely from the corpus:

| Dataset | Pairs | What it teaches |
| --- | --- | --- |
| English paraphrase pairs | 96,778 | Same verse in BSB, KJV, WEBBE, JPS: different wording, same meaning |
| Hebrew–English word alignments | 288,067 | Hebrew word ↔ English translation via Macula alignment data |
| Greek–English word alignments | 123,360 | Greek word ↔ English translation via Macula alignment data |
| Cross-reference pairs | 200,000 | Thematically related passages (sampled from 681,659 total) |
| Verse-level cross-language | 57,858 | Full Hebrew/Greek verses paired with their English translations |

The training uses MultipleNegativesRankingLoss: given a pair of texts that should be similar, other items in the batch act as negatives. The model learns to pull parallel and related texts together in the vector space and push unrelated texts apart.
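The in-batch-negatives idea can be shown with a dependency-free toy version of the loss. This is a conceptual sketch, not the training code: real training runs on batched tensors, and the cosine similarity and scale of 20 used here are assumptions about the configuration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the match for anchors[i]
    and every other positives[j] in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
shuffled = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is near zero when each anchor is closest to its own partner and grows when a different item in the batch is closer, which is the pressure that pulls paired texts together.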

Trilingual vector space

Because the training data includes 411,000 word-level alignments between Hebrew/Greek source words and their English translations, the model learns a single vector space where all three languages coexist. A search for “peace” is geometrically close to שָׁלוֹם (shalom) and εἰρήνη (eirene). No dictionary lookup is involved; the model learned the relationship from aligned usage in parallel texts.

Hybrid scoring (Reciprocal Rank Fusion)

The final ranking combines both layers via Reciprocal Rank Fusion: each item gets a score of 1 / (k + rank_L) from each list L it appears in, where rank_L is the item’s rank in that list, and the scores are summed across lists. Items that rank well in both lists end up on top, so keyword hits and semantic hits reinforce each other. The constant k = 60 comes from the original RRF paper; it dampens the spread between rank 1 and rank 50 enough that a doc at rank 5 in both lists outranks a doc at rank 1 in only one.

-- RRF score = sum of 1/(k + rank) across both lists.
-- No score normalization needed; ranks are comparable.
WITH fts AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank DESC) AS rnk
  FROM ... -- full-text query
),
semantic AS (
  SELECT wu.id,
         ROW_NUMBER() OVER (ORDER BY e.embedding <=> query_embedding) AS rnk
  FROM work_unit_embeddings_qwen3_06b e
  JOIN work_units wu ON wu.id = e.work_unit_id
  ORDER BY e.embedding <=> query_embedding
  LIMIT 100
)
SELECT COALESCE(fts.id, semantic.id) AS id,
       COALESCE(1.0 / (60 + fts.rnk), 0) +
       COALESCE(1.0 / (60 + semantic.rnk), 0) AS score
FROM fts
FULL OUTER JOIN semantic ON fts.id = semantic.id
ORDER BY score DESC
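The same fusion is easy to check by hand in Python. The sketch below mirrors the SQL with illustrative ids and confirms the rank-5-in-both versus rank-1-in-one claim: 2/65 ≈ 0.0308 against 1/61 ≈ 0.0164. The `rrf` helper and the sample ids are assumptions made for the example.

```python
from collections import defaultdict


def rrf(ranked_lists, k=60):
    """Sum 1 / (k + rank) over every list an item appears in (ranks start at 1)."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest score first; the id breaks ties deterministically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))


fts = ["fts_top", "f2", "f3", "f4", "shared"]       # keyword ranking
semantic = ["sem_top", "s2", "s3", "s4", "shared"]  # embedding ranking
fused = rrf([fts, semantic])
print(fused[0][0])
```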

Measured on a 355-concept thematic eval (2026-04-26): RRF beats bi-encoder alone by +7.8 pp MRR and FTS alone by +11.9 pp MRR.

RRF as described has no α or β weights to tune: both lists contribute on equal terms, and a verse found by both layers outranks a verse found by only one at a similar rank. A search for “forgiveness” therefore surfaces verses containing that word (FTS) alongside verses about pardoning, mercy, and reconciliation that never use the word itself (embeddings).

Semantic search can be disabled in the advanced search controls for users who want strict keyword-only results.

Topical themes

On top of FTS and semantic search, the executor merges in a third signal: topical themes. Two external CC BY 4.0 datasets, OpenBible Topics and Sefaria’s topic index, are imported into a generic topics + topic_verses schema.

At query time, the TopicRetrieval strategy extracts every 1- to 4-word contiguous substring of the query and looks them up against an indexed topics.name column. Each match contributes its top-ranked verses, scored as (rank_within_topic / topic_max) × source_weight × 3.0. Sefaria gets a 1.5× source weight relative to OpenBible’s 1.0×: on contested topics where both have data, Sefaria’s ranking is the stronger signal, per the 2026-04-27 bias audit that compared OpenBible, Nave’s, and Sefaria against modern academic critical consensus.
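A minimal Python sketch of the two mechanical pieces, n-gram extraction and the score formula, follows. The function names are illustrative, and the formula is taken at face value from the text, so a larger rank_within_topic yields a larger score. How rank_within_topic is actually defined in the data is not specified here.

```python
def query_ngrams(query, max_n=4):
    """Every contiguous run of 1 to max_n words, lowercased."""
    words = query.lower().split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


# Source weights quoted in the text.
SOURCE_WEIGHT = {"sefaria": 1.5, "openbible": 1.0}


def topic_score(rank_within_topic, topic_max, source):
    """(rank_within_topic / topic_max) x source_weight x 3.0."""
    return (rank_within_topic / topic_max) * SOURCE_WEIGHT[source] * 3.0


print(query_ngrams("love your enemies"))
print(topic_score(10, 10, "sefaria"), topic_score(10, 10, "openbible"))
```

Each n-gram would then be looked up against topics.name. A three-word query produces six candidates, and a ten-word query produces 34, so the lookup stays cheap.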

Editorial filtering on OpenBible Topics

OpenBible’s topics dataset is community-voted, and the voter base skews American evangelical, which encodes detectable bias on socially contested questions. A small editorial blacklist (config/openbible_blacklist.yml) excludes ~13 topic clusters where the verse-to-topic mapping is anachronistic (importing modern moral framing onto texts that don’t address the question) or significantly distorting (the foundational text in an active scholarly debate is missing from the top-N). Topics where a conservative reading is vocal but the verse mapping itself is honest are not excluded. The full list and reasoning live in the YAML file, version-controlled.

The blacklist applies only to OpenBible’s ranking; the verses themselves remain available via FTS, semantic, and (where applicable) Sefaria’s alternative topical view. Filtering doesn’t remove text from the corpus; it removes one source’s framing of which texts answer which question.

Match-stop list

Two classes of topic names never trigger a match: query meta-vocabulary (“bible,” “scripture,” “verse”) and biblical person names (“jesus,” “paul,” “moses,” …). Person names belong to PersonTopicSearch; matching them as topics double-counts and floods results with generic person verses. Query meta-vocabulary fires on phrases like “what does the bible say about X,” which would otherwise activate a niche “bible” topic about scripture itself rather than the user’s actual topic word.

Script and word-level search

Queries in Hebrew or Greek script skip text search and go through the word-level index. The query is normalised (Hebrew niqqud stripped, Greek accents and breathings stripped) and matched against the words.normalized column using trigram similarity (GIN-indexed). So a user can type אלהים (without pointing) and find every occurrence of אֱלֹהִים (with pointing) across the WLC.
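The normalise-then-compare step can be approximated in Python. This sketch strips marks by Unicode combining class rather than by the exact ranges used in the database, and its trigram similarity follows the general pg_trgm scheme (pad the word, then divide shared trigrams by total distinct trigrams). It is not guaranteed to match pg_trgm's results for every input or locale.

```python
import unicodedata


def normalize_word(word):
    """Drop combining marks: Hebrew points, Greek accents and breathings."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept).lower()


def trigrams(word):
    padded = "  " + word + " "  # two leading spaces, one trailing
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both words."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


agape = normalize_word("\u1F00\u03B3\u03AC\u03C0\u03B7")  # accented Greek
print(agape, similarity(agape, agape))
```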

Strong’s number queries (H430, G3056) search the words.strongs_number column directly and return every verse containing a word tagged with that number, plus a sidebar card with the associated lemma, gloss, and frequency. The language prefix (H or G) filters the sidebar to the correct language family.

Proximity search

The NEAR operator finds verses where two terms co-occur. faith NEAR/3 works finds verses where both terms appear in the same verse, or within 3 verses of each other in the same work. Both terms are matched via body_tsvector (not ILIKE), so the search is index-accelerated and avoids substring false positives.

-- Same-verse: both terms in one work_unit (GIN-indexed)
WHERE body_tsvector @@ plainto_tsquery('scriptorium_english', 'faith')
  AND body_tsvector @@ plainto_tsquery('scriptorium_english', 'works')

-- Cross-verse: term A in one verse, term B within N ords
SELECT DISTINCT a.id FROM work_units a
JOIN canonical_refs cr_a ON cr_a.id = a.canonical_ref_id
WHERE a.body_tsvector @@ plainto_tsquery('scriptorium_english', 'faith')
  AND EXISTS (
    SELECT 1 FROM work_units b
    JOIN canonical_refs cr_b ON cr_b.id = b.canonical_ref_id
    WHERE b.work_id = a.work_id
      AND b.body_tsvector @@ plainto_tsquery('scriptorium_english', 'works')
      AND ABS(cr_a.ord - cr_b.ord) <= 3
  )
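Before either SQL form can run, the NEAR expression has to be split into its two terms and a distance. The parser below is a sketch under assumptions: the regular expression, the `Proximity` tuple, and the default distance of 0 for a bare NEAR are all illustrative choices. The `within` helper restates the ABS(...) <= N condition from the query.

```python
import re
from typing import NamedTuple, Optional


class Proximity(NamedTuple):
    left: str
    right: str
    distance: int


_NEAR = re.compile(r"^\s*(.+?)\s+NEAR(?:/(\d+))?\s+(.+?)\s*$")


def parse_near(query, default_distance=0) -> Optional[Proximity]:
    """Split 'A NEAR/N B' into its parts; return None if there is no NEAR."""
    match = _NEAR.match(query)
    if match is None:
        return None
    distance = int(match.group(2)) if match.group(2) else default_distance
    return Proximity(match.group(1), match.group(3), distance)


def within(ord_a, ord_b, distance):
    """Mirror of ABS(cr_a.ord - cr_b.ord) <= N in the cross-verse query."""
    return abs(ord_a - ord_b) <= distance


print(parse_near("faith NEAR/3 works"))
```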

Cross-language sidebar

For English text queries, the search runs a parallel lookup against the lemma glosses to find related Hebrew and Greek terms. The sidebar shows up to three entries per language, with the original script, transliteration, part of speech, primary gloss, and corpus frequency.

Users and lexicons don’t always pick the same word, so a concept synonym layer expands queries with known equivalences: “law” also searches for “instruction,” “statute,” and “commandment” in the glosses, which is how תּוֹרָה (torah, glossed as “instruction”) shows up in the sidebar for a search for “law.”

Transliteration fallback

When a single Latin-script word returns very few text results (fewer than 5), the search checks whether it matches a lemma transliteration. If it does, the search pivots to a lemma search and returns every verse containing that lemma. That’s how typing “agape” finds every occurrence of ἀγάπη across the Greek New Testament.
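The pivot decision reduces to a small guard. In this sketch, `transliterations` is a hypothetical mapping from transliteration to lemma, standing in for whatever lookup the system performs. The threshold of 5 is the one stated above.

```python
def choose_strategy(query, text_hit_count, transliterations, threshold=5):
    """Return ('lemma', lemma) when the fallback applies, else ('text', query)."""
    words = query.split()
    is_single_latin_word = len(words) == 1 and words[0].isascii()
    if is_single_latin_word and text_hit_count < threshold:
        lemma = transliterations.get(words[0].lower())
        if lemma is not None:
            return ("lemma", lemma)
    return ("text", query)


lemma_agape = "\u1F00\u03B3\u03AC\u03C0\u03B7"
lookup = {"agape": lemma_agape}
print(choose_strategy("agape", 2, lookup))
```

Gating on the hit count means a word that is both ordinary English and a transliteration keeps its normal text results whenever those are plentiful.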

Default scoping

By default, text searches are scoped to scripture-type works in English, Hebrew, Aramaic, Greek, and Latin. That keeps commentary, Talmud, and non-English translations from diluting results for typical searches. The scope can be widened via section filters (Hebrew Bible, Greek NT, Deuterocanon, Pseudepigrapha, Apostolic Fathers, Rabbinic, Patristic), tradition filters, specific work selection, or by choosing “all works” in the search controls.

Syntactic search

Queries with the syntax: prefix search by grammatical role, using 450,891 syntax nodes from the Macula Hebrew and Greek projects. The query syntax: subject(god) verb(create) finds every verse where a word glossed “god” fills the subject role and a word glossed “create” fills the verb role. Roles can be specified with full names or abbreviations:

| Full name | Short | Code |
| --- | --- | --- |
| subject | subj | s |
| verb | | v |
| object | obj | o |
| indirect object | iobj | io |
| complement | comp | c |
| adjunct | adj | a |

The lemma argument accepts Hebrew or Greek script (subject(אלהים)), English glosses (subject(god)), or a mix of both. English glosses are matched against the lemma’s primary_gloss via ILIKE, FTS stemming, and the senses JSONB array, so “speak” finds both “speaks” and “spoken.” When explicit role annotations are absent, verb detection falls back to morphological codes (morph_code LIKE 'V%').
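Parsing the query into role constraints might look like the following sketch. The token grammar is inferred from the two forms shown on this page, role(term) and role:term. Accepting the one- and two-letter codes as role names is an assumption, and the two-word name "indirect object" is only reachable through its short forms here.

```python
import re

# Every accepted spelling mapped to its full role name.
ROLE_ALIASES = {
    "subject": "subject", "subj": "subject", "s": "subject",
    "verb": "verb", "v": "verb",
    "object": "object", "obj": "object", "o": "object",
    "iobj": "indirect object", "io": "indirect object",
    "complement": "complement", "comp": "complement", "c": "complement",
    "adjunct": "adjunct", "adj": "adjunct", "a": "adjunct",
}
# role(term), e.g. subject(god), or role:term, e.g. subj:god
_TOKEN = re.compile(r"(\w+)\(([^)]+)\)|(\w+):(\S+)")


def parse_syntax_query(query):
    """Return {full role name: search term} for each recognised constraint."""
    body = query.strip()
    if body.lower().startswith("syntax:"):
        body = body[len("syntax:"):]
    constraints = {}
    for match in _TOKEN.finditer(body):
        role = (match.group(1) or match.group(3)).lower()
        term = match.group(2) or match.group(4)
        if role in ROLE_ALIASES:
            constraints[ROLE_ALIASES[role]] = term.strip()
    return constraints


print(parse_syntax_query("syntax: subject(god) verb(create)"))
```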

Each result includes an expandable syntax tree rendered client-side, showing the clause structure with colour-coded grammatical roles. The tree is fetched from the /api/v1/syntax/:work/:book/:chapter/:verse endpoint.

Passage aliases and named passage search

The corpus includes 4,174 passage aliases with 10,194 searchable names covering well-known passages across the entire Bible: parables, miracles, speeches, prophecies, psalms, laws, and narrative episodes. The search executor injects matching alias results alongside text search results, with a “named passage” badge.

Name matching uses exact lookup (with parenthetical variants and prefix stripping, so “parable of the sower” matches even without the “parable of the” prefix) plus trigram similarity for fuzzy matching. The alias system is many-to-many: one name can map to multiple passages (e.g. the Great Commission in Matthew and Mark), and one passage can have multiple names.

Pericope-based retrieval

For natural-language questions (“What did Jesus say about war?”), the search runs a secondary retrieval pass against the pericope name embeddings. Each of the 3,002+ pericopes has its name encoded into the same 1024-dimensional vector space as the verse embeddings. The query is encoded and matched against pericope names by cosine similarity, then the top-matching pericopes’ verses are injected into the result set in a single batched join.

For broad topics, retrieval uses multi-hop encoding: a topic like “war” is expanded into multiple facets (“war violence battle,” “peace nonviolence,” “enemies love retaliation”), each encoded separately. The union of pericope matches across all facets gives broader coverage than a single query vector.

Search suggestions

As the user types, an autocomplete dropdown shows two kinds of suggestions:

Suggestions are fetched from /search/completions?q=... with a 250ms debounce. Keyboard navigation (arrow keys + enter) is supported.
