When you search for a term in our search engine, the results panel shows two numbers: how many sequences were found and how many were expected by pure chance. The first without the second says nothing — and that is probably the single most important idea in the entire codes phenomenon. This article explains precisely where the second number comes from, what it assumes, what it does not capture, and how to use it to run searches that withstand scrutiny.

The Torah as a statistical object

The Torah corpus in our search engine (Koren edition; the reasons for that choice are documented separately) has exactly N = 304,805 letters. Before talking about significance, you need to know the raw material: how often each letter occurs. This is the actual distribution, computed over the corpus (final forms counted with their base letter, the engine's convention):

Letter      Occurrences   Frequency
י yod       31,531        10.34%
ו vav       30,513        10.01%
ה he        28,056        9.20%
א alef      27,059        8.88%
מ mem       25,090        8.23%
ל lamed     21,570        7.08%
ר resh      18,125        5.95%
ת tav       17,950        5.89%
ב bet       16,345        5.36%
ש shin      15,595        5.12%
נ nun       14,126        4.63%
כ kaf       11,968        3.93%
ע ayin      11,250        3.69%
ח het       7,189         2.36%
ד dalet     7,032         2.31%
פ pe        4,805         1.58%
ק qof       4,695         1.54%
צ tsadi     3,962         1.30%
ז zayin     2,198         0.72%
ג gimel     2,109         0.69%
ס samekh    1,833         0.60%
ט tet       1,804         0.59%

The five final forms (ך ם ן ף ץ) are counted with their base letter — the same equivalence the search engine uses in its classic mode.

Two observations with direct consequences for ELS (equidistant letter sequences). First: the distribution is highly uneven; yod is roughly 17 times more frequent than tet. Second: just five letters (י ו ה א מ) account for 46.7% of the text. A term made of common letters (like תורה or יהוה) will appear as an ELS tens of thousands of times out of sheer arithmetic; one with rare letters (ט, ג, ז) will be naturally scarce. Comparing raw counts across different terms is meaningless without this context.
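The distribution above can be recomputed from any plain-text copy of the corpus. What follows is a minimal sketch, not the engine's actual code: the function name `letter_frequencies` and the two constants are our own illustrative choices, and the input is assumed to be a Unicode string in which anything that is not one of the 22 letters (spaces, punctuation, vowel points) should simply be ignored.

```python
from collections import Counter

# Final forms folded into their base letters (the convention described above).
FINAL_TO_BASE = {"ך": "כ", "ם": "מ", "ן": "נ", "ף": "פ", "ץ": "צ"}
# The 22 base letters; every other character is skipped.
HEBREW_LETTERS = set("אבגדהוזחטיכלמנסעפצקרשת")


def letter_frequencies(text):
    """Return {letter: (count, relative frequency)}, most frequent first."""
    folded = (FINAL_TO_BASE.get(ch, ch) for ch in text)
    counts = Counter(ch for ch in folded if ch in HEBREW_LETTERS)
    total = sum(counts.values())
    return {ch: (n, n / total) for ch, n in counts.most_common()}
```

Run over the full corpus, the counts should add up to N and the top five entries should cover a little under half of the text.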

The null model: how many ELS does a message-free text produce?

In statistics, a null hypothesis is the boring scenario everything else is compared against: here, a text that contains no code whatsoever — a sequence of 304,805 letters drawn independently with the frequencies in the table. If the actual count of your search is indistinguishable from what that message-free text produces, there is nothing to explain.

Under that model, the probability that one specific starting position n and skip d spell out your k-letter term is the product of its letters' frequencies: p = p(c₁)·p(c₂)·…·p(c_k). The expected number of matches sums that probability over all valid starting positions and all skips in the range:

E = directions × [ Σ_{d = d_min}^{d_max} max(0, N − (k−1)·d) ] × ∏_{i=1}^{k} p(cᵢ)

A worked example with משיח (mem-shin-yod-het): p = 0.0823 × 0.0512 × 0.1034 × 0.0236 ≈ 1.03 × 10⁻⁵. That looks minuscule — but the 2–1000 skip range in both directions offers some 600 million (position, skip) pairs. Multiplying: E ≈ 6,227 expected occurrences. Six thousand appearances of "Mashiach" in a random text. That is the power of combinatorics, and it is the reason why finding a word is never news.
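The worked example can be checked in a few lines. This is a sketch under stated assumptions: the letter counts are copied from the frequency table, the names `COUNTS`, `position_skip_pairs` and `expected_els` are ours, and the term is assumed to be written with base letters only (no final forms).

```python
# Letter counts from the frequency table (final forms already folded in).
COUNTS = {
    "י": 31531, "ו": 30513, "ה": 28056, "א": 27059, "מ": 25090, "ל": 21570,
    "ר": 18125, "ת": 17950, "ב": 16345, "ש": 15595, "נ": 14126, "כ": 11968,
    "ע": 11250, "ח": 7189, "ד": 7032, "פ": 4805, "ק": 4695, "צ": 3962,
    "ז": 2198, "ג": 2109, "ס": 1833, "ט": 1804,
}
N = sum(COUNTS.values())  # 304,805 letters


def position_skip_pairs(n, k, d_min, d_max, directions=2):
    """Number of (start, skip) pairs where a k-letter ELS fits in n letters."""
    return directions * sum(max(0, n - (k - 1) * d) for d in range(d_min, d_max + 1))


def expected_els(term, d_min, d_max, directions=2):
    """Expected ELS count under the independent-letters null model."""
    p = 1.0
    for ch in term:
        p *= COUNTS[ch] / N  # product of the letters' frequencies
    return position_skip_pairs(N, len(term), d_min, d_max, directions) * p


print(position_skip_pairs(N, 4, 2, 1000))    # prints 605997396
print(round(expected_els("משיח", 2, 1000)))  # prints 6227
```

The pair count of about 606 million is the "some 600 million" quoted above; the tiny per-pair probability only looks small until it is multiplied by it.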

From expected count to p-value: the Poisson approximation

The expected count E gives the average; what is still missing is how much variation is normal. When millions of individually improbable events are added together, the total count approximately follows a Poisson distribution with parameter λ = E. That makes it possible to compute the p-value: the probability of observing at least as many matches as you saw, if the text were noise — P(X ≥ observed). A p-value of 0.5 means "utterly ordinary"; one of 0.000001 means "this almost never happens by chance".
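The tail probability P(X ≥ observed) can be computed with the standard library alone. The sketch below is one possible way to do it, not necessarily how the engine does: the function name `poisson_p_value` is ours, and the terms are evaluated in log space so that a λ in the thousands does not overflow.

```python
import math


def poisson_p_value(observed, expected):
    """P(X >= observed) for X ~ Poisson(expected), summed term by term."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    if observed <= 0:
        return 1.0
    total, k = 0.0, observed
    while True:
        # log of the Poisson probability of exactly k matches
        log_pmf = k * math.log(expected) - expected - math.lgamma(k + 1)
        term = math.exp(log_pmf)
        total += term
        # Past the mean the terms only shrink, so stop once they no longer matter.
        if k > expected and term <= total * 1e-16:
            break
        k += 1
    return min(total, 1.0)


print(poisson_p_value(6398, 6227))  # about 0.016 for 6,398 found against 6,227 expected
```

For counts far below the expected value the loop walks all the way up past the mean, which is slow but still correct; a production version would more likely call a library routine for the Poisson survival function.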

Four real searches, read through the model

Everything below is computed on our actual corpus — you can reproduce every row in the search engine:

Term     Skips                Found     Expected    Reading
תורה     2–1000, both dir.    19,334    ≈ 19,554    Nothing: chance predicts even slightly more.
ישראל    2–1000, both dir.    1,190     ≈ 1,196     Nothing: near-exact agreement with noise.
אהבה     2–100, both dir.     2,433     ≈ 2,433     Agreement to within 0.003%: the model is calibrated.
משיח     2–1000, both dir.    6,398     ≈ 6,227     2.7% excess (p ≈ 0.016). See below.

The third row deserves a pause: the actual count of אהבה is 2,433 against a theoretical prediction of 2,432.9. This matters for both sides of the debate. For the enthusiast: it confirms the model is not rigged, since it predicts the real text with astonishing precision. For the skeptic: it confirms that, at the level of raw counts, the Torah behaves exactly like a text with its letter frequencies. If there is anything extraordinary in it, it is not in how many times a word appears.

Why a small p-value is not enough either

The fourth row (משיח, p ≈ 0.016) looks interesting. Is it? Here enters the most common error in the entire codes literature: the multiple-search problem (the look-elsewhere effect). A p of 0.016 means chance produces an excess like that about one time in 60. But if you explored 60 terms, or one term across 60 configurations of book, range and direction, you expect to find about one like it even when nothing is there. And every user of a search engine explores dozens of combinations without realizing that each one is an "attempt".
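The arithmetic behind "one time in 60" is short enough to write out. The sketch below assumes the attempts are independent of one another, which real searches (overlapping ranges, related spellings) only approximate; both function names are ours.

```python
def expected_false_hits(p, attempts):
    """Average number of searches that reach p-value <= p when nothing is there."""
    return p * attempts


def chance_of_at_least_one(p, attempts):
    """Probability that at least one of `attempts` independent null searches hits."""
    return 1.0 - (1.0 - p) ** attempts


print(round(expected_false_hits(0.016, 60), 2))     # prints 0.96
print(round(chance_of_at_least_one(0.016, 60), 2))  # prints 0.62
```

So after 60 tries a result like the משיח row is, under these assumptions, more likely to turn up than not.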

The perfect example is the most famous finding of all: תורה at a skip of exactly 50 in Genesis. Our engine reports 19 occurrences where the model expects 9.8, which gives p ≈ 0.006. Impressive? Only if skip 50 had been fixed before looking. Historically it was the other way around: skip 50 is famous because something was found there. Testing, after the fact, a configuration you already knew was a winner invalidates the p-value: it is betting on the horse after the race. (We devoted a full article to WRR 1994, the only experiment that tried to resolve this with a protocol fixed in advance, and to its refutation.)

What the model does not capture (and we say so ourselves)

Our null model is deliberately simple, and its limits are worth declaring:

  • Real letters are not independent. Hebrew has morphology: prefixes (ו, ה, ב, ל), suffixes, root patterns. Two consecutive letters are not independent draws. For large skips the effect washes out, but for very small skips (2–5) the model is only approximate.
  • Matches overlap. Two matches of the same term can share letters, which correlates the events; Poisson ignores this. In practice the effect is minor, as the calibration in the table shows.
  • It does not model crossings or proximity. The number in the panel applies to the count of a single term. The significance of two nearby terms (the WRR question) demands permutation methods that lie outside this calculation — which is why we show no expected count in the crossings tab.

How to search with rigor: a five-rule protocol

  1. Fix everything before searching. Term, exact spelling (malei or chaser?), book, skip range, directions. Every decision made after seeing results turns your search into exploration — legitimate, but without evidential value.
  2. Always read the found/expected pair. 19,334 occurrences of תורה impress until you see the ≈ 19,554 next to them. The observed/expected ratio is your first filter; the count alone, never.
  3. Discount your attempts. If you tried 20 variants, multiply your p-value by 20 (Bonferroni correction) and cap the result at 1. A p of 0.016 after 60 attempts becomes 0.016 × 60 ≈ 0.96, which is no evidence at all.
  4. Use controls. Repeat your search in another book of the Tanakh — the search engine puts it one click away. A pattern that shows up just as well in any text of the same size is arithmetic, not message. That is the lesson of the Moby Dick experiment.
  5. Distinguish exploring from confirming. Exploring is valid and fascinating — that is how hypotheses are born. But a hypothesis born from exploration is only confirmed by a new test, fixed in advance, ideally on data you did not use to generate it.
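Rule 3 can be applied mechanically if you keep a log of every search you ran. A minimal sketch, assuming you recorded one p-value per attempt; the function name `bonferroni_adjust` and the example log are hypothetical.

```python
def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of attempts, capped at 1."""
    attempts = len(p_values)
    return [min(1.0, p * attempts) for p in p_values]


# Hypothetical log: one promising search among 20 attempts.
log = [0.016] + [0.5] * 19
adjusted = bonferroni_adjust(log)
print(round(adjusted[0], 2))  # prints 0.32
```

The correction is conservative by design; its value here is that it forces an honest count of how many attempts were made.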

What the search engine actually does

Every time you run a search, the engine computes the actual frequencies of the loaded corpus (no precomputed tables: if you load only Psalms, it uses the frequencies of Psalms), evaluates the formula for E with your term, your range and your directions, and shows the rounded result next to the count. The calculation runs in your browser, on the frozen corpus verified by checksums — the same numbers any programmer can reproduce with the formula above.
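For programmers who want to reproduce the found/expected pair, the steps described above can be sketched as follows. This is an illustration rather than the engine's source: the function names are ours, the text is assumed to be already normalized (letters only, final forms folded), and the brute-force loop is far too slow for the full corpus at wide skip ranges.

```python
from collections import Counter


def count_els(text, term, d_min, d_max, both_directions=True):
    """Brute-force count of ELS matches of `term` for skips d_min..d_max."""
    # A backward ELS is the reversed term read forward at the same skip.
    # (A palindromic term is therefore counted once per direction.)
    patterns = [term, term[::-1]] if both_directions else [term]
    n, k, found = len(text), len(term), 0
    for d in range(d_min, d_max + 1):
        span = (k - 1) * d  # distance from the first letter to the last
        for start in range(max(0, n - span)):
            window = text[start:start + span + 1:d]  # every d-th letter, k letters
            found += sum(window == pattern for pattern in patterns)
    return found


def expected_els(text, term, d_min, d_max, both_directions=True):
    """Expected count, using the frequencies of the loaded text itself."""
    freq = Counter(text)
    n, k = len(text), len(term)
    p = 1.0
    for ch in term:
        p *= freq[ch] / n
    pairs = sum(max(0, n - (k - 1) * d) for d in range(d_min, d_max + 1))
    return (2 if both_directions else 1) * pairs * p
```

Because `expected_els` derives the frequencies from whatever text it is given, loading a single book changes both numbers together, which is the behavior described above.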

The "expected by chance" number is not there to discourage anyone. It is there because a code search engine without it is a generator of false miracles — and because the interesting question was never whether the words appear, but whether they appear more often than arithmetic forces them to. Now you have the tool to answer it.