Dust: Will the real rules please stand up?

Tuesday, 22 September 2020

Will the real rules please stand up?

One of my pandemic projects is to map the differences between a pointed and an unpointed text of the Hebrew Bible. I am doing this using SimHebrew, a very memorable and capable left-to-right version of the square text.

I had 'my own version' of a left-to-right abbreviation in capital letters, with a few punctuation marks for the gutturals aleph and ayin, but I converted to the lower-case version developed by Jonathan Orr-Stav. (Such a Latin-letter code is an abbreviation because it takes one byte as opposed to 7 bytes in rendered Unicode, and one byte as opposed to 2 internally. Using a Unicode database is really awkward for me, and the technology was new 10 years ago, so it was a non-starter.)
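To make the byte counts concrete, here is a small Python sketch. It assumes that '7 bytes in rendered Unicode' means a decimal HTML character reference such as &#1488; and that '2 internally' means UTF-8 or UTF-16 storage. Both readings are guesses on my part, not something the tools themselves state.

```python
# Compare storage sizes for one Hebrew letter (alef) and its Latin-letter code.
alef = "\u05d0"   # Hebrew letter alef
sim = "a"         # the Latin letter that stands for alef in the tables of this post

utf8_len = len(alef.encode("utf-8"))        # bytes when stored as UTF-8
utf16_len = len(alef.encode("utf-16-le"))   # bytes when stored as UTF-16, no byte-order mark
entity = f"&#{ord(alef)};"                  # decimal HTML character reference (assumed meaning of 'rendered')
entity_len = len(entity.encode("ascii"))
sim_len = len(sim.encode("ascii"))

print(utf8_len, utf16_len, entity, entity_len, sim_len)
```

Pointed text costs more again, since every vowel point and accent is a further code point of its own.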

My method of data capture is to download the unpointed Mechon-Mamre text from their site (one book at a time) and run it through the SimHebrew converter here. Then I manipulate the text in Notepad until I have legitimate insert statements for my database. This is somewhat prone to error, but eventually I get a clean script. (I use a combination of Word and Notepad for the necessary global changes.)
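The hand editing could in principle be scripted. The following is only a sketch in Python: the table name sim_verse and its columns are hypothetical, and the sample word stands in for a whole verse of converted text.

```python
def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def insert_statement(book: str, chapter: int, verse: int, text: str) -> str:
    """Build one insert for a hypothetical table sim_verse(book, chapter, verse, text)."""
    return (
        "insert into sim_verse (book, chapter, verse, text) values ("
        f"{sql_literal(book)}, {chapter}, {verse}, {sql_literal(text)});"
    )


# A single word stands in here for a full verse of SimHebrew text.
print(insert_statement("2 Kings", 4, 24, "tyxvr"))
```

Generating the statements this way would remove the global search-and-replace step, though loading through bind variables would be more robust than building literals.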

After this I create a temporary table that matches, word by word: the book; the chapter; the verse number (converted from the alef-betic numeral to a real number); the Hebrew word in SimHebrew; the full pointed word from the Leningrad codex; the stem code, raw word form, and semantic domain from my database; the word number relative to the start of the verse; and the word id (assigned by an Oracle sequence and connecting to my word table).
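The conversion of an alef-betic verse number to a real one is simple addition of letter values. Here is a sketch in Python, assuming the verse labels use only the 22 regular letters (no final forms) plus an optional geresh or gershayim. The function name verse_number is mine.

```python
UNITS = "אבגדהוזחט"     # alef..tet   = 1..9
TENS = "יכלמנסעפצ"       # yod..tsadi  = 10..90
HUNDREDS = "קרשת"        # qof..tav    = 100..400

VALUES = {}
for scale, letters in ((1, UNITS), (10, TENS), (100, HUNDREDS)):
    for position, letter in enumerate(letters, start=1):
        VALUES[letter] = position * scale

# Punctuation sometimes attached to Hebrew numerals: geresh, gershayim, ASCII quotes.
IGNORED = "\u05f3\u05f4'\""


def verse_number(label: str) -> int:
    """Add up the letter values of an alef-betic verse label."""
    total = 0
    for ch in label.strip():
        if ch in IGNORED:
            continue
        if ch not in VALUES:
            raise ValueError(f"unexpected character {ch!r} in verse label {label!r}")
        total += VALUES[ch]
    return total


print(verse_number("כד"))   # kaf (20) + dalet (4)
```

Because the values are simply added, the conventional spellings of 15 and 16 (tet-vav and tet-zayin) need no special case.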

I recently added 2 Kings to my data. At first I assigned the wrong chapter numbers - a little oversight. That resulted in over 5000 differences in my calculations. When I fixed the issue, the differences went down to about 230. I then discovered that Mechon-Mamre (M-M) treats דִּבְיוֹנִ֖ים (dbivn), which they render as 'dung of a dove' (apparently a poor substitute for salt), as two words (db ivnh). Fixing those 4 instances dropped my mismatches by 100. Then, word by word, I work through the remainder until the differences are accounted for.

My program that calculates the words also suggests a code snippet that will fix any discrepancy (as long as I put it in the right place in the code). The program as it stands now gets 99% of the words right on the first pass. So my predicted SimHebrew Bible is about 99% right. Sorry - for the Bible, that's not good enough.

About this time in the history of the project, I ask, Is there a better way? I know I can finish the way I have begun and my brute force spelling changes are next to 0. I have over a third of the words in my test data now and over 75% of the stems represented. But the code is specific to prefix and suffix and sometimes to particular vowel combinations in the WLC. What are the real rules?

Native speakers who 'just know' the pronunciation - what are they really doing? Certainly they have retention in memory by word form, by context, and by stem. But can I get the program to discover the shortcuts that people use? And in some ways see what they are doing. (And thereby discover the nature of the evolution of language usage.)

Here's an example: the 31 uses so far of עצר (yxr)

For yxr, the following rules are noted.

First you see that the holem is rendered as /o/ in lines 1 and 2 and others. This is the default, but some words, with some exceptions, do not render the /o/.
Next, note that in line 3 the qamats is not rendered as /o/ (or vav for RTL Hebrew), but in line 4 it is: the qamats under the second letter of the stem is rendered when the prefix is /i/ or /t/, in each case without a suffix.
Tsere is rendered as /i/ in lines 6 and 12 - but only the tsere under the last letter of the prefix. (I would like to color code these, but the new editor absolutely ruins RTL display with embedded color coding.)
Notice that the final h is dropped in line 18.
Lines 26 and 27 would have rendered the qamats in the closed syllable as /o/ if the suffix had not been /u/. That's a general rule - but there are exceptions here too, and exceptions to the exceptions.

Ref (book chap:vs(word within vs))	Stem	Word Form	Morph	Sim Source	Sim Calc.	WLC Word	Domain	Rendering
2 Chronicles 14:10(30)	yxr	iyxr	i/yxr	-iyxvr	-iyxor	־יַעְצֹר	BOUND	יעצר will coerce
1 Chronicles 29:14(7)	yxr	nyxr	n/yxr	-nyxvr	-nyxor	־נַעְצֹר	BOUND	נעצר contained of
2 Chronicles 13:20(2)	yxr	yxr	yxr	-yxr	-yxr	־עָצַר	BOUND	עצר did coerce
2 Kings 4:24(9)	yxr	tyxr	t/yxr	-tyxvr	-tyxor	־תַּעֲצָר	BOUND	תעצר do detain
2 Chronicles 7:13(2)	yxr (5)	ayxr	a/yxr	ayxvr	ayxor	אֶעֱצֹר	BOUND	אעצר I contain
1 Kings 8:35(1)	yxr	bhyxr	bh/yxr	bhiyxr	bhiyxr	בְּהֵעָצֵר	BOUND	בהעצר when is contained
2 Chronicles 6:26(1)	yxr	bhyxr	bh/yxr	bhiyxr	bhiyxr	בְּהֵעָצֵר	BOUND	בהעצר when are contained
Amos 5:21(6)	yxr	byxrticm	b/yxr\ticm	byxrvticm	byxroticm	בְּעַצְּרֹתֵיכֶם	COVENANT	בעצרתיכם in your conclaves
2 Kings 9:8(9)	yxr	vyxvr	v/yxvr	vyxvr	vyxur	וְעָצוּר	BOUND	ועצור the contained
1 Kings 21:21(11)	yxr (10)	vyxvr	v/yxvr	vyxvr	vyxur	וְעָצוּר	BOUND	ועצור the contained
Proverbs 30:16(2)	yxr	vyxr	v/yxr	vyvxr	vyoxr	וְעֹצֶר	BOUND	ועצר and contained
1 Chronicles 21:22(17)	yxr	vtyxr	vt/yxr	vtiyxr	vtiyxr	וְתֵעָצַר	BOUND	ותעצר that may be contained
2 Kings 17:4(20)	yxr	viyxrhv	vi/yxr\hv	viyxrhv	viyxrhu	וַיַּעַצְרֵהוּ	BOUND	ויעצרהו so detained him
Job 4:2(5)	yxr	vyxr	v/yxr	vyxvr	vyxor	וַעְצֹר	BOUND	ועצר but contain
Psalms 106:30(4)	yxr (15)	vtyxr	vt/yxr	vtiyxr	vtiyxr	וַתֵּעָצַר	BOUND	ותעצר and was contained
Job 12:15(2)	yxr	iyxr	i/yxr	iyxvr	iyxor	יַעְצֹר	BOUND	יעצר he contains
2 Chronicles 2:5(2)	yxr	iyxr	i/yxr	iyxvr	iyxor	יַעֲצָר	BOUND	יעצר contains
1 Kings 18:44(19)	yxr	iyxrch	i/yxr\ch	iyxvrç	iyxorc	יַעַצָרְכָה	BOUND	יעצרכה detain you
2 Chronicles 22:9(27)	yxr	lyxr	l/yxr	lyxvr	lyxor	לַעְצֹר	BOUND	לעצר to coerce
Psalms 107:39(3)	yxr (20)	myxr	m/yxr	myvxr	myoxr	מֵעֹצֶר	BOUND	מעצר through coercion of
Proverbs 25:28(8)	yxr	myxr	m/yxr	myxr	myxr	מַעְצָר	BOUND	מעצר containment
2 Chronicles 7:9(4)	yxr	yxrt	yxr\t	yxrt	yxrt	עֲצָרֶת	COVENANT	עצרת a conclave
Joel 1:14(4)	yxr	yxrh	yxr\h	yxrh	yxrh	עֲצָרָה	COVENANT	עצרה a conclave
Joel 2:15(7)	yxr	yxrh	yxr\h	yxrh	yxrh	עֲצָרָה	COVENANT	עצרה a conclave
2 Kings 10:20(4)	yxr (25)	yxrh	yxr\h	yxrh	yxrh	עֲצָרָה	COVENANT	עצרה a conclave
Job 29:9(2)	yxr	yxrv	yxr\v	yxrv	yxru	עָצְרוּ	BOUND	עצרו contained
2 Chronicles 20:37(19)	yxr	yxrv	yxr\v	yxrv	yxru	עָצְרוּ	BOUND	עצרו could be coerced
2 Kings 14:26(10)	yxr	yxvr	yxvr	yxvr	yxur	עָצוּר	BOUND	עצור coercion
Jeremiah 36:5(7)	yxr	yxvr	yxvr	yxur	yxur	עָצוּר	BOUND	עצור am detained
1 Chronicles 12:1(7)	yxr (30)	yxvr	yxvr	yxvr	yxur	עָצוּר	BOUND	עצור he contained himself
1 Kings 14:10(12)	yxr	yxvr	yxvr	yxvr	yxur	עָצוּר	BOUND	עצור those who are contained
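Judging from the table, the Sim Source column writes every vav as v, while the Sim Calc. column writes it as /o/ or /u/. A sketch of a comparison that allows for this might look as follows in Python. The function names are mine, and the folding of o and u back to v is an assumption drawn from these 31 rows only.

```python
def normalise(word: str) -> str:
    """Write /o/ and /u/ back as v so that both columns spell vav the same way."""
    return word.replace("o", "v").replace("u", "v")


def agrees(sim_source: str, sim_calc: str) -> bool:
    """True when source and calculated forms match once vav is spelled uniformly."""
    return normalise(sim_source) == normalise(sim_calc)


# (reference, Sim Source, Sim Calc.) copied from a few rows of the table above
rows = [
    ("2 Chronicles 14:10(30)", "-iyxvr", "-iyxor"),
    ("2 Kings 9:8(9)", "vyxvr", "vyxur"),
    ("Jeremiah 36:5(7)", "yxur", "yxur"),
    ("1 Kings 18:44(19)", "iyxvrç", "iyxorc"),
]

mismatches = [ref for ref, source, calc in rows if not agrees(source, calc)]
print(mismatches)
```

The row for 1 Kings 18:44 is still reported, because the two columns differ in the final letter (ç against c). Whether that should count as a real difference is a separate decision.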

All that is pretty straightforward. But what are the real rules?

Some of my rules are mile-long conditions: strings of stems with prefix and suffix combinations and the occasional appeal to an odd Unicode value.

The 'rules' are long for when to render hireq as /i/.

200 distinct stems which do not contain a yod are allowed to render hireq as yod without exception.
Another 281 allow hireq as yod on an exception basis.
Only 7 stems containing a yod disallow hireq as yod.
Another 46 disallow it on an exception basis.
Hireq is usually ignored in a closed syllable - but there are exceptions and exceptions by word form to the exceptions (only 8 stems).
Hireq is rendered as /ii/ for several reasons: when the prefix is i, and occasionally for tsere, patah, and qamats. What! These are also rendered as o or v sometimes. Who can know?

Any ideas?

I have been debating whether to design a data table for the rule combinations. I don't want to do it unless I cannot simplify the code. Last time I asked I discovered several simplifications. Looking for more...
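One way to look for simplifications is to let the test data say which stems never need a condition. The following is a sketch in Python. The observation rows are illustrative only, chosen to agree with the qamats rule in this post, and are not actual counts from my database. The function name summarise is hypothetical.

```python
from collections import defaultdict


def summarise(observations):
    """Split stems into 'always rendered' and 'rendered only for certain pairs'.

    Each observation is (stem, prefix, suffix, rendered), where rendered is True
    when the vowel came out as a letter in the source text.
    """
    by_stem = defaultdict(lambda: {True: set(), False: set()})
    for stem, prefix, suffix, rendered in observations:
        by_stem[stem][rendered].add((prefix, suffix))

    unconditional = set()
    exceptions = {}
    for stem, seen in by_stem.items():
        if seen[True] and not seen[False]:
            unconditional.add(stem)          # no counter-example observed
        elif seen[True]:
            exceptions[stem] = seen[True]    # needs an explicit pair list
    return unconditional, exceptions


# Illustrative rows only; "" stands for no prefix or no suffix.
sample = [
    ("ahl", "", "", True),
    ("ahl", "b", "im", True),
    ("gml", "", "h", True),
    ("gml", "", "c", True),
    ("gml", "i", "", False),
]
unconditional, exceptions = summarise(sample)
print(unconditional, exceptions)
```

A stem that lands in the first set can move out of the long condition and into a plain list, and a stem in the second set shows exactly which combinations it needs.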

Here's the rule for qamats becoming vav (RTL) or /o/ (SIM). I have converted the code to 'English':

Unconditionally for stems ahl, anih, arc, avn, bvw, ctl, grn, iq+n, iqwn, irqym, nvh, rnn, yziali, zvh,

(+ is my single letter internal abbreviation for tet ט.)

or the stem is +rk and prefix suffix combination in (none, cm)

or the stem is +hr and prefix suffix combination in (none, t, c, t)

or the stem is abd and prefix suffix combination in (b, n)

or the stem is acl and prefix suffix combination in (l, h, b, nu)

or the stem is adm and v_prefix = m and the first part of the word is m with a schwa

or the stem is amn and the word form is amn and there is no tsere under the second stem letter

or prefix suffix combination = b, h

or the stem is ark and v_suffix in (ti, vt)

or the stem is bit and v_suffix in (icm)

or the stem is bzz and prefix suffix combination in (i, vm)

or the stem is cvl and prefix suffix combination in (v, clu)

or the stem is csh and prefix suffix combination in (none, u)

or the stem is dbr and prefix suffix combination in (b, c)

or the stem is gal and prefix suffix combination in (l, c)

or the stem is gbh and prefix suffix combination = l, h

or the stem is gml and prefix suffix combination in (none, h, none, c)

or the stem is hrh and prefix suffix combination in (vh, tihm)

or the stem is ivn and prefix suffix combination in (none, none)

or the stem is iwr and prefix suffix combination in (none, o, b, o)

or the stem is imn and prefix suffix combination in (m, none)

or the stem is isd and prefix suffix combination in (b, i)

or the stem is knn and not (suffix in (u) and there's a qamats under the second letter) and the domain is not in (PERSON, LOCATION)

or the stem is kq and v_suffix not in (u)

or the stem is kgg and prefix suffix combination in (none, i)

or the stem is krb and prefix suffix combination in (none, h, l, h)

or the stem is krm and prefix suffix combination in (none, h, vb, h)

or the stem is kwc and prefix suffix combination in (none, i)

or the stem is kzq and prefix suffix combination in (b, h, b, nu)

or the stem is lcd and prefix suffix combination in (l, h)

or the stem is mlc and prefix suffix combination in (l, o, b, o, c, o)

or the stem is mvt and prefix suffix combination in (h, h)

or the stem is pyl and prefix suffix combination in (c, h)

or the stem is q+n and prefix suffix combination in (none, i)

or the stem is qdw and v_suffix in (im, iv)

or the stem is qra and prefix suffix combination in (none, nu, none, ic, none, im)

or the stem is qrk and prefix suffix combination is none, h

or the stem is rb and prefix suffix combination in (b, none)

or the stem is sll and prefix suffix combination in (none, vh)

or the stem is rbb and prefix suffix combination in (v, none)

or the stem is rbb and prefix suffix combination in (hb, none)

or the stem is rkb and prefix suffix combination not in (none, none)

or the stem is rkx and prefix suffix combination in (l, h)

or the stem is tmm and prefix suffix combination in (b, none) and the first two letters of the word are not m

or the stem is wby and prefix suffix combination in (m, h, l, h)

or the stem is wcb and prefix suffix combination in (b, o)

or the stem is wcr and prefix suffix combination is l, h

or the stem is wdd and prefix suffix combination in (i, m)

or the stem is wmr and prefix suffix combination in (a, none, vl, h, none, h)

or the stem is wp+ and prefix suffix combination in (none, h)

or the stem is wr and suffix in (c, rc)

or the stem is wrw and prefix suffix combination in (none, ih)

or the stem is xrp and prefix suffix combination in (none, h)

or the stem is ybd and prefix suffix combination in (l, h)

or the stem is yni and prefix suffix combination in (none, nu)

or the stem is yzz and prefix suffix combination in (none, i)

or the stem is zcr and prefix suffix combination in (b, nu)

or the stem is yzr and prefix suffix combination in (l, ni, l, u, none, ni, none, nu)

then

the qamats under the first letter of the stem becomes /o/.

And by the way if the stem is qdw and v_suffix in (im) then the schwa under the first letter becomes /o/!

(or maybe only v)
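For comparison, here is what a data-table version of part of the rule above might look like in Python. It is a sketch only. It reads each 'prefix suffix combination in (...)' list as consecutive (prefix, suffix) pairs, which is my assumption about the notation, and it carries just a handful of the stems. The clauses with extra conditions (adm, amn, knn, tmm and so on) would still need code of their own.

```python
NONE = ""  # stands for 'none' (no prefix, or no suffix) in the rule list

# Stems for which the qamats under the first stem letter always becomes /o/.
UNCONDITIONAL = {
    "ahl", "anih", "arc", "avn", "bvw", "ctl", "grn",
    "iq+n", "iqwn", "irqym", "nvh", "rnn", "yziali", "zvh",
}

# A small subset of the stem-specific clauses, as (prefix, suffix) pairs.
PAIR_RULES = {
    "+rk": {(NONE, "cm")},
    "abd": {("b", "n")},
    "gml": {(NONE, "h"), (NONE, "c")},
    "mlc": {("l", "o"), ("b", "o"), ("c", "o")},
    "krb": {(NONE, "h"), ("l", "h")},
}


def qamats_to_o(stem: str, prefix: str, suffix: str) -> bool:
    """True when the table says the first-letter qamats is rendered as /o/."""
    if stem in UNCONDITIONAL:
        return True
    return (prefix, suffix) in PAIR_RULES.get(stem, set())


print(qamats_to_o("gml", NONE, "c"), qamats_to_o("gml", "b", "c"))
```

With the pairs held as data, adding a newly discovered exception becomes one more table entry rather than another 'or' in the condition.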
