Tuesday 28 July 2020

The maleh text and the Leningrad Codex

For the last 4 months I have been comparing the pointed (WLC) and unpointed (maleh) texts of the Hebrew Bible. Maleh (the word is variously spelled) roughly means 'full spelling'. I think of the maleh text as the Bible according to the spelling standards of modern Hebrew, as you might find it in a newspaper.

Let's see an example.

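Qohelet 4:17. The lines below give, in order, column 1 (the pointed WLC text), column 2 (SimHebrew), column 3 (the maleh text), and an English translation.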
שְׁמֹ֣ר רַגְלְךָ֗ כַּאֲשֶׁ֤ר תֵּלֵךְ֙ אֶל־בֵּ֣ית הָאֱלֹהִ֔ים וְקָר֣וֹב לִשְׁמֹ֔עַ מִתֵּ֥ת הַכְּסִילִ֖ים זָ֑בַח
כִּֽי־אֵינָ֥ם יוֹדְעִ֖ים לַעֲשׂ֥וֹת רָֽע
iz wmor rglç cawr tlç al-bit halohim vqrob lwmoy mtt hcsilim zbk
ci-ainm iodyim lywot ry
יז שמור רגלך כאשר תלך אל־בית האלוהים וקרוב לשמוע מתת הכסילים זבח
כי־אינם יודעים לעשות רע
17 Keep your footing as you are walking to the house of God, and approach more to hear than to give an offering among the dullards,
for they haven't a clue that what they do is evil.

Comparing the two is a long way from just dropping the dots and saying you are done. The relationship between columns 2 and 3 is (almost) reversible. Column 1 cannot be built from either column 2 or column 3. Information has been lost and transformed in several ways. The algorithm I have written takes column 1 and transforms it into column 2, a left-to-right simulation of column 3. So in effect, I am comparing the Leningrad Codex (tanach.us) with the maleh text such as is found on the Mechon-Mamre site (mechon-mamre.org). You can transform any Hebrew to SimHebrew here. (So why are you doing this exercise, Bob?) Read on.

So the first word above, שְׁמֹ֣ר, is שמור in the full spelling, or wmor in SimHebrew. The rule of allowing the holam to become o is not followed for all words but is easily seen in this example. The second and third words do appear to just drop the dots. רַגְלְךָ֗ becomes רגלך or rglç, and כַּאֲשֶׁ֤ר becomes כאשר or cawr. Here the SimHebrew is simply a mirror image of the Hebrew letters. The w is not, of course, the w familiar to an English speaker, but an /s/ or, in this case, /sh/ sound.

One verse will not illustrate all the rules. Of the 38,543 words in my test data, 19,568 simply drop the dots. About 50%. Good grief, what do the rest of them do? Dropping the dots would be a trivial string manipulation problem. One could do that in an hour or so. It would not require 6 months of experimental programming.
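To make 'dropping the dots' concrete, here is a minimal Python sketch. It is not the program described in this post. It assumes the pointed text is ordinary Unicode, in which the vowel points and accents are combining marks (general category Mn), and the function names are hypothetical.

```python
import unicodedata


def drop_dots(pointed: str) -> str:
    """Remove vowel points and accents; keep letters, maqaf and spaces."""
    # NFD splits any precomposed presentation forms into letter + marks.
    decomposed = unicodedata.normalize("NFD", pointed)
    # Hebrew points and cantillation marks are nonspacing marks (Mn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def simply_drops_dots(pointed: str, maleh: str) -> bool:
    """True when the maleh word is just the pointed word without its marks."""
    return drop_dots(pointed) == maleh


# Two word pairs from the example verse (assumed copied correctly).
print(simply_drops_dots("רַגְלְךָ֗", "רגלך"))
print(simply_drops_dots("שְׁמֹ֣ר", "שמור"))
```

The first pair should match. The second should not, because the maleh form has an extra vav that no amount of mark-stripping will produce. That is the half of the problem that takes months rather than an hour.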

Of the remaining, there are 9,217 with my column 2 (Sim) showing o, u, and v where column 3 (maleh) will have only v. Rule: o and u are realized in SimHebrew but not in the maleh text. These vowels may also come from holam or qubuts so that is why columns 2 and 3 are not fully reversible. When my program converts WLC to SimHebrew, it sees the qubuts, otherwise invisible in maleh text.
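The one-way nature of o, u, and v can be sketched in Python. The letter table below is an assumption: it is inferred only from the example verse in this post, plus the rule that o, u, and v all correspond to vav, so it is not the full SimHebrew alphabet. The function name is hypothetical.

```python
# SimHebrew letter -> Hebrew letter, as inferred from the example verse only.
SIM_TO_HEBREW = {
    "a": "\u05d0",  # alef
    "b": "\u05d1",  # bet
    "g": "\u05d2",  # gimel
    "d": "\u05d3",  # dalet
    "h": "\u05d4",  # he
    "v": "\u05d5",  # vav as consonant
    "o": "\u05d5",  # vav carrying an o vowel
    "u": "\u05d5",  # vav carrying a u vowel
    "z": "\u05d6",  # zayin
    "k": "\u05d7",  # het
    "i": "\u05d9",  # yod
    "c": "\u05db",  # kaf
    "ç": "\u05da",  # final kaf
    "l": "\u05dc",  # lamed
    "m": "\u05de",  # mem (final form handled below)
    "n": "\u05e0",  # nun
    "s": "\u05e1",  # samekh
    "y": "\u05e2",  # ayin
    "q": "\u05e7",  # qof
    "r": "\u05e8",  # resh
    "w": "\u05e9",  # shin
    "t": "\u05ea",  # tav
}

# Only the final form attested in the verse: mem at the end of a word.
FINAL_FORMS = {"\u05de": "\u05dd"}

MAQAF = "\u05be"


def sim_to_maleh(text: str) -> str:
    """Map SimHebrew letters back to Hebrew letters, one by one."""
    out = []
    for i, ch in enumerate(text):
        if ch == "-":
            out.append(MAQAF)
            continue
        heb = SIM_TO_HEBREW.get(ch)
        if heb is None:
            out.append(ch)  # spaces and anything unmapped pass through
            continue
        # A letter ends its word when the next character is not a letter.
        at_word_end = i + 1 == len(text) or text[i + 1] not in SIM_TO_HEBREW
        out.append(FINAL_FORMS.get(heb, heb) if at_word_end else heb)
    return "".join(out)


print(sim_to_maleh("vqrob"))   # v and o both come out as vav
print(sim_to_maleh("al-bit halohim"))
```

Going the other way, from vav back to SimHebrew, there is no way to choose among v, o, and u from the letter alone, which is why the two columns are only almost reversible.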

That still leaves us with another 10,000 or so to account for. These are the places where hireq is or is not rendered, where tsere and segol become i or double i, and the myriad of possibilities for the various shades of the vowels for /a/, hatef and otherwise. These become double i, or i, or o, or v. Even the humble schwa plays vital roles in the process.

In the test data, there are 10 verses that show a conflict, where the identical word in WLC has been converted into conflicting forms in the maleh text. That is only 10 words in 10 verses out of 38,543 words. (The test data is a little over 1/8th of the Bible.)
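A conflict of this kind can be found mechanically. The following Python sketch assumes the two texts have already been aligned into (WLC word, maleh word) pairs. That alignment, the function name, and the placeholder strings in the sample are all hypothetical, and the sample is not real data.

```python
from collections import defaultdict


def find_conflicts(pairs):
    """Return WLC words that map to more than one maleh form."""
    forms = defaultdict(set)
    for wlc_word, maleh_word in pairs:
        forms[wlc_word].add(maleh_word)
    # Keep only the words with two or more distinct maleh spellings.
    return {wlc: sorted(seen) for wlc, seen in forms.items() if len(seen) > 1}


# Placeholder tokens standing in for aligned words.
sample = [
    ("wlc_A", "maleh_A"),
    ("wlc_B", "maleh_B1"),
    ("wlc_A", "maleh_A"),
    ("wlc_B", "maleh_B2"),
]
print(find_conflicts(sample))
```

Grouping by the exact pointed word, accents included, is the strictest reading of 'identical word'. Whether accents should be ignored before grouping is a choice this sketch does not make.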

So roughly 50% are simple string manipulation, 25% are v-related, 25% are i-related, and 0.03% are in conflict. I should note that the maqaf plays no part in the transformation (and there are variations in the placement of maqaf between the two text sources). You can see in the result that SimHebrew contains some i's, o's, u's, and a's (only some of which sound like a) but no e's. /a/, like /y/, is a guttural and can carry any vowel.
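The rounded percentages can be checked from the counts given in this post. In this short Python sketch the 'remaining' figure is obtained by subtraction, so treating it as the i-related group (conflicts included) is an assumption.

```python
total = 38_543       # words in the test data
dots_only = 19_568   # words that simply drop the dots
v_related = 9_217    # words with o, u, v in SimHebrew, vav in maleh
conflicts = 10       # words with conflicting maleh forms

# Everything not in the first two groups (assumed to be the i-related rest).
remaining = total - dots_only - v_related


def pct(count: int) -> float:
    """Share of the test data, in percent."""
    return 100 * count / total


for label, count in [
    ("drop the dots", dots_only),
    ("v-related", v_related),
    ("remaining (by subtraction)", remaining),
    ("conflicts", conflicts),
]:
    print(f"{label:<28}{count:>7}{pct(count):>9.3f}%")
```

The shares come out near 50.8%, 23.9%, 25.3%, and 0.026%, which agrees with the rounded figures above.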

Now I will search for a way to explain to you how the program decides what to do with a word. (See the next post.) And you can tell me how to simplify it further. Then we will begin to 'make visible' a little of the psychology behind the use of vowels in the Hebrew text.
