Tuesday 22 September 2020

Will the real rules please stand up?

One of my pandemic projects is to map the differences between a pointed and an unpointed text of the Hebrew Bible. I am doing this using the very memorable and capable left-to-right version of the square text called SimHebrew.

I had 'my own version' of a left-to-right abbreviation in capital letters, with a few punctuation marks for the gutturals aleph and ayin, but I converted to the lower-case version developed by Jonathan Orr-Stav. (Such a Latin-letter code is an abbreviation because each letter takes one byte as opposed to 7 bytes in rendered Unicode, and one byte as opposed to 2 internally. Using a Unicode database is really awkward for me, and the technology was new 10 years ago, so it was a non-starter.)
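One possible reading of those byte counts, sketched in Python. It assumes that "rendered Unicode" means an HTML numeric character reference and that "internally" means a 2-byte encoding such as UTF-8 or UTF-16 for a Hebrew letter; neither assumption is confirmed by the post.

```python
# Byte counts for a single Hebrew letter under the assumptions stated above.
ayin = "\u05e2"  # ע, written y in SimHebrew

entity = f"&#{ord(ayin)};"                 # HTML numeric character reference
entity_len = len(entity)                   # bytes when the page source is ASCII
utf8_len = len(ayin.encode("utf-8"))       # bytes in UTF-8
utf16_len = len(ayin.encode("utf-16-le"))  # bytes in UTF-16 (no BOM)
sim_len = len("y".encode("ascii"))         # bytes for the Latin-letter code

print(entity, entity_len, utf8_len, utf16_len, sim_len)
```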

My method of data capture is to use the unpointed Mechon-Mamre text that can be downloaded from their site (one book at a time) and to run it through the SimHebrew converter. Then I manipulate the text in Notepad until I have legitimate insert statements for my database. This is somewhat prone to error, but eventually I get a clean script. (I use a combination of Word and Notepad for the necessary global changes.)

After this I create a temporary table that matches, word by word: the book; the chapter; the verse number (converted from the alef-betic numeral to a real number); the Hebrew word in SimHebrew; the full pointed word from the Leningrad codex; the stem code, raw word form, and semantic domain from my database; the word number relative to the start of the verse; and the word id (assigned by an Oracle sequence and connecting to my word table).

I recently added 2 Kings to my data. At first I assigned the wrong chapter numbers - a little oversight. That resulted in over 5000 differences in my calculations. When I fixed the issue, the differences went down to about 230. I then discovered that M-M is treating דִּבְיוֹנִ֖ים (dbivn) which they render as dung of a dove as two words (db ivnh - apparently a poor substitute for salt). Fixing those 4 instances dropped my mismatches by 100. Then word by word I work through the remainder until the differences are accounted for.

My program that calculates the words also suggests a code snippet that will fix any discrepancy (as long as I put it in the right place in the code). The program as it stands gets 99% of the words right on first pass now. So my predicted SimHebrew Bible is about 99% right. Sorry - for the Bible, that's not good enough.
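A minimal sketch of how such a mismatch count might be computed, using a few source/calculated pairs from the yxr example further down. The function names are hypothetical, and the comparison rule (treating /o/ and /u/ as vav, v, before comparing) is an assumption about how the two columns relate, not a description of the actual program.

```python
# Each row: (reference, SimHebrew from the unpointed source, calculated SimHebrew).
ROWS = [
    ("2 Chronicles 14:10(30)", "-iyxvr", "-iyxor"),
    ("2 Chronicles 13:20(2)", "-yxr", "-yxr"),
    ("Proverbs 30:16(2)", "vyvxr", "vyoxr"),
    ("1 Kings 18:44(19)", "iyxvrç", "iyxorc"),
]


def to_vav(sim: str) -> str:
    """Collapse o and u to v so both columns use the same letter for vav (assumed rule)."""
    return sim.replace("o", "v").replace("u", "v")


def find_mismatches(rows):
    """Return the references whose source and calculated forms still differ."""
    return [ref for ref, source, calc in rows if to_vav(source) != to_vav(calc)]


mismatched = find_mismatches(ROWS)
match_rate = 1 - len(mismatched) / len(ROWS)
print(mismatched, match_rate)
```

Under this comparison the last row is flagged simply because the two strings differ in their final letter; whether the real program counts it as a difference is not stated in the post.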

About this time in the history of the project, I ask, Is there a better way? I know I can finish the way I have begun and my brute force spelling changes are next to 0. I have over a third of the words in my test data now and over 75% of the stems represented. But the code is specific to prefix and suffix and sometimes to particular vowel combinations in the WLC. What are the real rules? 

Native speakers who 'just know' the pronunciation - what are they really doing? Certainly they have retention in memory by word form, by context, and by stem. But can I get the program to discover the shortcuts that people use, and in some ways see what they are doing (and thereby discover the nature of the evolution of language usage)?

Here's an example: the 31 uses so far of עצר (yxr)

For yxr, the following rules are noted. 

  • First you see that the holem is rendered as /o/ in lines 1 and 2 and others. This is the default, but some words, with some exceptions, do not render the o.
  • Next you will note that the qamats is not rendered as /o/ (or vav for RTL Hebrew) in line 3, but it is in line 4: the qamats under the second letter of the stem is rendered as /o/ with the prefixes /i/ and /t/, in each case without a suffix.
  • Tsere is rendered as /i/ in lines 6 and 12, but only the tsere under the last letter of the prefix. (I would like to color-code these, but the new editor absolutely ruins RTL display with embedded color coding.)
  • Notice that the final h is dropped in line 18.
  • Lines 26 and 27 would have rendered the qamats in the closed syllable as /o/ if the suffix had not been /u/. That's a general rule, but there are exceptions here too, and exceptions to the exceptions.
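Two of the rules above (holem rendered as /o/ by default, and tsere rendered as /i/ only under the last letter of the prefix) can be sketched in Python. This is an illustration under stated assumptions, not the program described in the post: the function `render`, its `prefix_len` parameter, and the small letter map (read off the yxr examples) are all hypothetical, and the qamats, shuruq, and final-h rules are deliberately left out.

```python
# Consonants needed for the examples, with the SimHebrew letters seen in the table.
LETTERS = {
    "\u05d1": "b",  # bet
    "\u05d4": "h",  # he
    "\u05d5": "v",  # vav
    "\u05d9": "i",  # yod
    "\u05e2": "y",  # ayin
    "\u05e6": "x",  # tsade
    "\u05e8": "r",  # resh
    "\u05ea": "t",  # tav
}
HOLEM = "\u05b9"
TSERE = "\u05b5"


def render(pointed: str, prefix_len: int = 0) -> str:
    """Render a pointed word, applying only the holem and prefix-tsere rules."""
    out = []
    consonants = 0  # consonants emitted so far
    for ch in pointed:
        if ch in LETTERS:
            out.append(LETTERS[ch])
            consonants += 1
        elif ch == HOLEM:
            out.append("o")  # default rule: holem becomes /o/
        elif ch == TSERE and prefix_len and consonants == prefix_len:
            out.append("i")  # tsere under the last prefix letter becomes /i/
        # every other vowel point or mark is dropped in this sketch
    return "".join(out)


# יַעְצֹר (line 16): prefix i, stem yxr
YAATSOR = "\u05d9\u05b7\u05e2\u05b0\u05e6\u05b9\u05e8"
# בְּהֵעָצֵר (line 6): prefix bh, stem yxr
BEHEATSER = "\u05d1\u05b0\u05bc\u05d4\u05b5\u05e2\u05b8\u05e6\u05b5\u05e8"

print(render(YAATSOR, prefix_len=1))
print(render(BEHEATSER, prefix_len=2))
```

The second tsere in בְּהֵעָצֵר sits under a stem letter, so it is not rendered, which is the restriction the third bullet describes.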

| # | Ref (book chap:vs(word within vs)) | Stem | Word Form | Morph | Sim Source | Sim Calc. | WLC Word | Domain | Rendering |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 Chronicles 14:10(30) | yxr | iyxr | i/yxr | -iyxvr | -iyxor | ־יַעְצֹר | BOUND | יעצר will coerce |
| 2 | 1 Chronicles 29:14(7) | yxr | nyxr | n/yxr | -nyxvr | -nyxor | ־נַעְצֹר | BOUND | נעצר contained of |
| 3 | 2 Chronicles 13:20(2) | yxr | yxr | yxr | -yxr | -yxr | ־עָצַר | BOUND | עצר did coerce |
| 4 | 2 Kings 4:24(9) | yxr | tyxr | t/yxr | -tyxvr | -tyxor | ־תַּעֲצָר | BOUND | תעצר do detain |
| 5 | 2 Chronicles 7:13(2) | yxr | ayxr | a/yxr | ayxvr | ayxor | אֶעֱצֹר | BOUND | אעצר I contain |
| 6 | 1 Kings 8:35(1) | yxr | bhyxr | bh/yxr | bhiyxr | bhiyxr | בְּהֵעָצֵר | BOUND | בהעצר when is contained |
| 7 | 2 Chronicles 6:26(1) | yxr | bhyxr | bh/yxr | bhiyxr | bhiyxr | בְּהֵעָצֵר | BOUND | בהעצר when are contained |
| 8 | Amos 5:21(6) | yxr | byxrticm | b/yxr\ticm | byxrvticm | byxroticm | בְּעַצְּרֹתֵיכֶם | COVENANT | בעצרתיכם in your conclaves |
| 9 | 2 Kings 9:8(9) | yxr | vyxvr | v/yxvr | vyxvr | vyxur | וְעָצוּר | BOUND | ועצור the contained |
| 10 | 1 Kings 21:21(11) | yxr | vyxvr | v/yxvr | vyxvr | vyxur | וְעָצוּר | BOUND | ועצור the contained |
| 11 | Proverbs 30:16(2) | yxr | vyxr | v/yxr | vyvxr | vyoxr | וְעֹצֶר | BOUND | ועצר and contained |
| 12 | 1 Chronicles 21:22(17) | yxr | vtyxr | vt/yxr | vtiyxr | vtiyxr | וְתֵעָצַר | BOUND | ותעצר that may be contained |
| 13 | 2 Kings 17:4(20) | yxr | viyxrhv | vi/yxr\hv | viyxrhv | viyxrhu | וַיַּעַצְרֵהוּ | BOUND | ויעצרהו so detained him |
| 14 | Job 4:2(5) | yxr | vyxr | v/yxr | vyxvr | vyxor | וַעְצֹר | BOUND | ועצר but contain |
| 15 | Psalms 106:30(4) | yxr | vtyxr | vt/yxr | vtiyxr | vtiyxr | וַתֵּעָצַר | BOUND | ותעצר and was contained |
| 16 | Job 12:15(2) | yxr | iyxr | i/yxr | iyxvr | iyxor | יַעְצֹר | BOUND | יעצר he contains |
| 17 | 2 Chronicles 2:5(2) | yxr | iyxr | i/yxr | iyxvr | iyxor | יַעֲצָר | BOUND | יעצר contains |
| 18 | 1 Kings 18:44(19) | yxr | iyxrch | i/yxr\ch | iyxvrç | iyxorc | יַעַצָרְכָה | BOUND | יעצרכה detain you |
| 19 | 2 Chronicles 22:9(27) | yxr | lyxr | l/yxr | lyxvr | lyxor | לַעְצֹר | BOUND | לעצר to coerce |
| 20 | Psalms 107:39(3) | yxr | myxr | m/yxr | myvxr | myoxr | מֵעֹצֶר | BOUND | מעצר through coercion of |
| 21 | Proverbs 25:28(8) | yxr | myxr | m/yxr | myxr | myxr | מַעְצָר | BOUND | מעצר containment |
| 22 | 2 Chronicles 7:9(4) | yxr | yxrt | yxr\t | yxrt | yxrt | עֲצָרֶת | COVENANT | עצרת a conclave |
| 23 | Joel 1:14(4) | yxr | yxrh | yxr\h | yxrh | yxrh | עֲצָרָה | COVENANT | עצרה a conclave |
| 24 | Joel 2:15(7) | yxr | yxrh | yxr\h | yxrh | yxrh | עֲצָרָה | COVENANT | עצרה a conclave |
| 25 | 2 Kings 10:20(4) | yxr | yxrh | yxr\h | yxrh | yxrh | עֲצָרָה | COVENANT | עצרה a conclave |
| 26 | Job 29:9(2) | yxr | yxrv | yxr\v | yxrv | yxru | עָצְרוּ | BOUND | עצרו contained |
| 27 | 2 Chronicles 20:37(19) | yxr | yxrv | yxr\v | yxrv | yxru | עָצְרוּ | BOUND | עצרו could be coerced |
| 28 | 2 Kings 14:26(10) | yxr | yxvr | yxvr | yxvr | yxur | עָצוּר | BOUND | עצור coercion |
| 29 | Jeremiah 36:5(7) | yxr | yxvr | yxvr | yxur | yxur | עָצוּר | BOUND | עצור am detained |
| 30 | 1 Chronicles 12:1(7) | yxr | yxvr | yxvr | yxvr | yxur | עָצוּר | BOUND | עצור he contained himself |
| 31 | 1 Kings 14:10(12) | yxr | yxvr | yxvr | yxvr | yxur | עָצוּר | BOUND | עצור those who are contained |

All that is pretty straightforward. But what are the real rules? 

Some of my rules are mile-long conditions: strings of stems with prefix and suffix combinations and the occasional appeal to an odd Unicode value.
The 'rules' for when to render hireq as /i/ are long:
  • 200 distinct stems which do not contain a yod are allowed to render hireq as yod without exception.
  • Another 281 allow hireq as yod on an exception basis.
  • Only 7 stems containing a yod disallow hireq as yod.
  • Another 46 disallow it on an exception basis.
  • Hireq is usually ignored in a closed syllable - but there are exceptions and exceptions by word form to the exceptions (only 8 stems).
  • Hireq is rendered as /ii/ for several reasons: when the prefix is i, and occasionally for tsere, patah, and qamats. What! These also are rendered as o or v sometimes. Who can know?
Any ideas?

I have been debating whether to design a data table for the rule combinations. I don't want to do it unless I cannot simplify the code. Last time I asked I discovered several simplifications. Looking for more...

Here's the rule for qamats becoming vav (RTL) or /o/ (SIM). I have converted the code to 'English'.
(+ is my single-letter internal abbreviation for tet ט.)
Unconditionally for stems ahl, anih, arc, avn, bvw, ctl, grn, iq+n, iqwn, irqym, nvh, rnn, yziali, zvh
or the stem is +rk and prefix suffix combination in (none, cm) 
or the stem is +hr and prefix suffix combination in (none, t, c, t)
or the stem is abd and prefix suffix combination in (b, n)
or the stem is acl and prefix suffix combination in (l, h, b, nu) 
or the stem is adm and v_prefix = m and the first part of the word is m with a schwa
or the stem is amn and the word form is amn and there is no tsere under the second stem letter
or prefix suffix combination = b, h
or the stem is ark and v_suffix in (ti, vt)
or the stem is bit and v_suffix in (icm)
or the stem is bzz and prefix suffix combination in (i, vm)
or the stem is cvl and prefix suffix combination in (v, clu)
or the stem is csh and prefix suffix combination in (none, u)
or the stem is dbr and prefix suffix combination in (b, c) 
or the stem is gal and prefix suffix combination in (l, c)
or the stem is gbh and prefix suffix combination = l, h
or the stem is gml and prefix suffix combination in (none, h,none, c) 
or the stem is hrh and prefix suffix combination in (vh, tihm)
or the stem is ivn and prefix suffix combination in (none, none)
or the stem is iwr and prefix suffix combination in (none, o, b, o)
or the stem is imn and prefix suffix combination in (m, none)
or the stem is isd and prefix suffix combination in (b, i)
or the stem is knn and not (suffix in (u) and there's a qamats under the second letter) and the domain is not in (PERSON, LOCATION)
or the stem is kq and v_suffix not in (u)
or the stem is kgg and prefix suffix combination in (none, i) 
or the stem is krb and prefix suffix combination in (none, h, l, h) 
or the stem is krm and prefix suffix combination in (none, h, vb, h) 
or the stem is kwc and prefix suffix combination in (none, i) 
or the stem is kzq and prefix suffix combination in (b, h, b, nu) 
or the stem is lcd and prefix suffix combination in (l, h) 
or the stem is mlc and prefix suffix combination in (l, o, b, o, c, o) 
or the stem is mvt and prefix suffix combination in (h, h) 
or the stem is pyl and prefix suffix combination in (c, h) 
or the stem is q+n and prefix suffix combination in (none, i)
or the stem is qdw and v_suffix in (im, iv)
or the stem is qra and prefix suffix combination in (none, nu, none, ic, none, im) 
or the stem is qrk and prefix suffix combination is none, h
or the stem is rb  and prefix suffix combination in (b, none) 
or the stem is sll and prefix suffix combination in (none, vh) 
or the stem is rbb and prefix suffix combination in (v, none)
or the stem is rbb and prefix suffix combination in (hb, none)
or the stem is rkb and prefix suffix combination not in (none, none)
or the stem is rkx and prefix suffix combination in (l, h)
or the stem is tmm and prefix suffix combination in (b, none) and the first two letters of the word are not m
or the stem is wby and prefix suffix combination in (m, h, l, h) 
or the stem is wcb and prefix suffix combination in (b, o) 
or the stem is wcr and prefix suffix combination is l, h
or the stem is wdd and prefix suffix combination in (i, m)
or the stem is wmr and prefix suffix combination in (a, none,vl, h,none, h)
or the stem is wp+ and prefix suffix combination in (none, h) 
or the stem is wr and suffix in (c, rc)
or the stem is wrw and prefix suffix combination in (none, ih) 
or the stem is xrp and prefix suffix combination in (none, h) 
or the stem is ybd and prefix suffix combination in (l, h)
or the stem is yni and prefix suffix combination in (none, nu) 
or the stem is yzz and prefix suffix combination in (none, i) 
or the stem is zcr and prefix suffix combination in (b, nu) 
or the stem is yzr and prefix suffix combination in (l, ni, l, u,none, ni,none, nu)
then
the qamats under the first letter of the stem becomes /o/.
And by the way if the stem is qdw and v_suffix in (im) then the schwa under the first letter becomes /o/!
(or maybe only v)
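The data table debated above could look something like the following Python sketch, which holds a handful of the conditions as data rather than as one long boolean expression. Everything here is hypothetical: the names `Word`, `UNCONDITIONAL`, `PAIRS`, `SUFFIX_ONLY`, and `qamats_becomes_o` are invented for illustration, only a few stems are included, and the flattened lists in the rule text, such as (l, o, b, o, c, o), are assumed to be consecutive (prefix, suffix) pairs.

```python
from typing import NamedTuple, Optional


class Word(NamedTuple):
    stem: str
    prefix: Optional[str]  # None stands for 'none' in the rule text
    suffix: Optional[str]


# Stems for which the rule applies unconditionally (first five from the list).
UNCONDITIONAL = {"ahl", "anih", "arc", "avn", "bvw"}

# Stem -> allowed (prefix, suffix) pairs, under the pairing assumption above.
PAIRS = {
    "ybd": {("l", "h")},
    "wcb": {("b", "o")},
    "mlc": {("l", "o"), ("b", "o"), ("c", "o")},
    "krb": {(None, "h"), ("l", "h")},
}

# Stem -> suffixes that trigger the rule whatever the prefix.
SUFFIX_ONLY = {
    "ark": {"ti", "vt"},
    "bit": {"icm"},
}


def qamats_becomes_o(word: Word) -> bool:
    """True if the qamats under the first stem letter would become /o/."""
    if word.stem in UNCONDITIONAL:
        return True
    if (word.prefix, word.suffix) in PAIRS.get(word.stem, set()):
        return True
    return word.suffix in SUFFIX_ONLY.get(word.stem, set())


print(qamats_becomes_o(Word("mlc", "c", "o")))
print(qamats_becomes_o(Word("mlc", "c", "h")))
```

Conditions that look at the pointing or the domain (adm, amn, knn, tmm) do not fit a plain lookup like this; they would need either an extra predicate per entry or to stay in code.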

