Thursday 30 April 2020

Comparing SimHebrew with the WLC

All my youth as a Hebrew student, I have used the Aleppo and Westminster Leningrad traditions. I am now into my teen-aged years as a student of this tongue, and I have recently come across the undotted version of the text that is common in modern usage. My colleague and coach, Jonathan Orr-Stav, has invented a new and simple method of encoding undotted Hebrew or Aramaic (of any sort) in English (Latin) characters, called simulated Hebrew, or SimHebrew for short. (You can find an introduction here.) Jonathan explains:
ḥaser in the context of ktiv (spelling) means 'lacking, deficient'—as opposed to malé, which means 'full'. 
The former refers to the austere use of yods and vavs to indicate /i/ and /o/ or /u/ sounds — limiting them to where they are actually part of the word stem, and relying on niqqud to dispel misunderstandings. The latter refers to the generous use of yods and vavs to indicate /i/, /o/ or /u/. [and as will be noted below, sometimes qamats and patah (a)].
Undotted Hebrew (both today and in the Second Temple era) tends to use malé, esp. for secular purposes. 
I was wondering if it was possible to write a program that would analyse the WLC (a ḥaser text) and produce an undotted version (malé). I think it is.

It is very clear to me that information is lost in this conversion process, notably vowels and accents, but the result does simulate the ancient versions that had no vowels (though I would not rush to say that they had no accents). With that simulation plus a few extra matres lectionis (vavs and yods that aid the reader), we have a text that could be subject to text mining without the need to manage Unicode.

I am holding my breath about putting the SimHebrew text into the music. At present that will have to wait. But the exercise of programming 9 chapters of test data has proven very instructive so far. One thing it has taught me is how illogical I am at times, thinking one thing and coding another. This is common in programming, especially I imagine in old programmers, but that is anecdotal.

I am now going to try and 'explain' my rules in English.

There is an easy mapping from Unicode to Latin characters. For the sake of understanding I mapped the members of the non-grammatical team first. A straightforward replace. The Unicode values translate unambiguously:
1490 - g, 1491 - d, 1494 - z, 1495 - k, 1496 - +, 1505 - s,
1506 - y, 1508 - p, 1507 - f, 1510 - x, 1509 - x, 1511 - q, 1512 - r
Note I use + for tet (internally). It does have the odd use grammatically, but not for hitting runs, just keeping score.
Then the grammatical letters:
1488 - a, 1489 - b, 1492 - h, 1493 - v, 1497 - i, 1499 - c, 1498 - ç,
1500 - l, 1502 - m, 1501 - m, 1504 - n, 1503 - n, 1513 - w, 1514 - t
These all have significant impacts on the placement and usage of i's and o's in the SimHebrew representation of the malé square text. Yod and vav pose the most complex problems. There is an initial quick conversion for vav: vav+1460 becomes vi, vav+1466 becomes vo, and vav plus any other vowel becomes vv. These are not final decisions, since they depend on other variables. The holam (1465) can follow many letters and has a number of rules; 1466 is used only with vav and is generally fixed.
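A minimal sketch in Python of this first pass, using the code points listed above. The function names are hypothetical, and two details are assumptions the post does not settle: which points count as "other vowels" for vav, and whether the vowel point is consumed once vav has been rewritten. Vowel points that are not handled here are left in place for later rules.

```python
# Decimal code points from the two lists above, mapped to Latin letters.
# '+' is the internal stand-in for tet.
CONSONANTS = {
    1488: "a", 1489: "b", 1490: "g", 1491: "d", 1492: "h", 1493: "v",
    1494: "z", 1495: "k", 1496: "+", 1497: "i", 1498: "ç", 1499: "c",
    1500: "l", 1501: "m", 1502: "m", 1503: "n", 1504: "n", 1505: "s",
    1506: "y", 1507: "f", 1508: "p", 1509: "x", 1510: "x", 1511: "q",
    1512: "r", 1513: "w", 1514: "t",
}

VAV, HIREQ, HOLAM_FOR_VAV = 1493, 1460, 1466

# Assumption: "other vowels" is read as these points (hatafs, tsere, segol,
# patah, qamats, qubuts). Sheva and holam 1465 are left to later rules.
OTHER_VOWELS = {1457, 1458, 1459, 1461, 1462, 1463, 1464, 1467}


def quick_vav(mark):
    """Provisional rendering of vav plus the following point, or None."""
    if mark == HIREQ:
        return "vi"
    if mark == HOLAM_FOR_VAV:
        return "vo"
    if mark in OTHER_VOWELS:
        return "vv"
    return None


def first_pass(text):
    """Replace consonants with Latin letters; keep every other character."""
    out = []
    i = 0
    while i < len(text):
        cp = ord(text[i])
        nxt = ord(text[i + 1]) if i + 1 < len(text) else None
        vav = quick_vav(nxt) if cp == VAV else None
        if vav is not None:
            out.append(vav)
            i += 2  # assumption: the vowel point is consumed with the vav
        elif cp in CONSONANTS:
            out.append(CONSONANTS[cp])
            i += 1
        else:
            out.append(text[i])  # vowels and accents stay for later rules
            i += 1
    return "".join(out)
```

For example, `first_pass(chr(1491) + chr(1489) + chr(1512))` gives `dbr`, and a vav followed by hireq gives `vi`.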

I also allow myself the generality of converting some common suffixes. It's a bit surprising, but it saves a lot of hunting later.
Pattern                  Result
t-1461 c-1462 m          ticm
l-1461 c-1462 m          licm
l-1461 h-1462 m          lihm
n-1464 i-1460 m          ni*im
t-1463 i-1460 m          ti*im
b-1468-1463 i-1460 m     bi*im
i-1468-1464 h            i*ih
(The * keeps the result from being seen as a double i, which later rules might suppress. Like the +, the * is removed as a last step in constructing the full version of the verse.)
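The suffix step can be sketched as follows. This is only an illustration: it reads each pattern in the table as a Latin letter followed by its vowel code points, with the last letter of each row belonging to the pattern rather than the result, and it assumes the patterns apply at the end of a word. The helper names are hypothetical, and the final handling of + is not shown.

```python
def pts(*codes):
    """Build a string of Hebrew points from decimal code points."""
    return "".join(chr(c) for c in codes)


# (pattern after the consonant pass, replacement), in table order.
SUFFIX_RULES = [
    ("t" + pts(1461) + "c" + pts(1462) + "m", "ticm"),
    ("l" + pts(1461) + "c" + pts(1462) + "m", "licm"),
    ("l" + pts(1461) + "h" + pts(1462) + "m", "lihm"),
    ("n" + pts(1464) + "i" + pts(1460) + "m", "ni*im"),
    ("t" + pts(1463) + "i" + pts(1460) + "m", "ti*im"),
    ("b" + pts(1468, 1463) + "i" + pts(1460) + "m", "bi*im"),
    ("i" + pts(1468, 1464) + "h", "i*ih"),
]


def apply_suffix_rules(word):
    """Rewrite the first matching suffix; leave the word alone otherwise."""
    for pattern, result in SUFFIX_RULES:
        if word.endswith(pattern):
            return word[: -len(pattern)] + result
    return word


def strip_star(word):
    """Last step: drop the * that protected a deliberate double i."""
    return word.replace("*", "")
```

Keeping the * until the very end means a later "prevent double i" rule can test for the literal string `ii` without touching these suffixes.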

Some of the above may need restricting, e.g. there are 291 rows with the last combo and they might not all behave the same way in the rest of the WLC.

This is the beginning, and I won't continue at this level of detail. I do need to explain that the remaining rules, keyed by stem, are processed in sequence. They allow one to see whether a vowel in the text will cause a conversion to a mater lectionis. All the jots and tittles gradually disappear.

This is the matrix: (It will extend - and who knows, may become simpler if the rules appear to have patterns, particularly with respect to some diphthongs.)

It is similar to doing a program to deal with English lemmas. So many exceptions. I began my career as a programmer 54 years ago. I got the job because I could remember a host of three-character nonsense syllables. This program seems to be my bookend.

This table breaks down into three sections: getting to the o's, and then getting to the i's in two parts, (A) vowels that undergo strange transformations and (B) finally getting to the real i.

Rule abbreviated | Applies to stem (+ = ט) | Comment
tsere vav | nvh | exceptions to vav+vowel becomes vv
qamats qatan | ycrn nsy mlc krm ahl awih azn +rk krb kpry lcd pyl sll | render vowel a (Unicode 1464) as o
qamats qatan afx | pqd acl | render a as o except for some affixes
qamats qatan b | wmy | render a as o for prefix b
allow o l | acl | allow holam with prefix l
allow o b | ywh | allow holam with prefix b
allow o v | ywh | allow holam with prefix v
allow o e | ywh | allow holam with segol
allow o sf | ywh | allow holam with some suffixes
prevent o | ywh acl aph azn ch la mwh pry raw xan zat | prevent holam
prevent o pref | amr | prevent holam with some common prefixes

tsere hireq | rcc yrc rgn zrh wrq wrp kth kmr kln bar aph yvr ird pl+ wpl +vb | render tsere (1461) as i (some conditions)
tsere hireq m | man | allow for some stems beginning with m but prevent i for single prefix mem with tsere
tsere hireq t | rpa ywh | allow hireq from tsere for t
patah hireq | rcc wnh mdi avli ybd pnh kq dbr yin al ph | becomes ii but not for prefixed vai
patah hireq i | ild ird | יַ ip (1497-1463) becomes ii
qamats u | pqd | הָת ht becomes hut -- specialized prefix
qamats hireq v | wlv ikd | 'ָv' qamats-v (1464-1493) becomes iv
qamats hireq | hih pnh yl | qamats i becomes ii except for suffix ' th nh '
qamats hireq pf | itr ivm | qamats i becomes ii - except for prefix 'vָi'
prevent final i | kih | suppress rendering of final ai as ii
allow init pi | lvn wby | allow initial patah-i as ii
allow init pi exc | iwb lqk | allow initial patah-i as ii except for trailing u

allow i | ww yzz rpa qxx nsy ywr nqb nkl lvn ktt abd acr am amn awh aw at azn bxr clm csh cpr cys dbr dmm gbr gll hlc hll hnm kmw kx lb lbb lqk mxa mla ml+ nba ntc npl ntn npx pla psl q+r qnh rgy rnn wqx wbr wck wvb +vb tmm xvh xih yl yll ycb yxb yxm | allow hireq (1460) to be realized -- too generous? Note that hireq is rendered as i when a stem contains an i anywhere.
allow i t | qrb nwa ntn mxa ngp lcd lkm | allow hireq to be realized for prefix t
allow i h | wlv wby rby lvi kll nqm nsc lkm | allow hireq to be realized for prefix h
allow i i | tmm wmm lkm | allow hireq to be realized for prefix i
allow i v | zvd wlk wlm tpw rmh nwa nqm nkm itr | allow hireq to be realized for prefix v
allow i l | pnh | allow hireq to be realized for prefix l
allow i c | xpkt | allow hireq to be realized for prefix c
allow i m | zmm | allow hireq to be realized for prefix m
prevent i | ymindb ymihvd ynqi xvriwdi wlmial pgyial wyir tnin sin sini sir lvi ymiwdi gdyni cnyni brik bin di riq kih acl abidn cid rib irmih hia yir ci mi kli nbia csdi ihvdi cwdi ict itr bli ikm ial id idy anci bnimn iwb cli bit iwral ivm hih ani | prevent hireq -- too restrictive? And what will happen with names?
prevent i n | wbr rpa lkm ird mxa qnh csh | suppress hireq prefix n
prevent i h | ixg ixt hlc rgy +tb q+r | suppress hireq prefix h
prevent i m | dmm | suppress hireq prefix m
prevent i c | dbr | suppress hireq prefix c
prevent i a | ww ird | suppress tsere or hireq prefix aleph or h - may need refining
prevent i io | ird | suppress hireq prefix yod vav
prevent i v | qrb lqk dmm | suppress hireq prefix vav --
prevent i t | | suppress hireq for prefix t
prevent i i | lqk qnh | suppress hireq for preterite/imperfect yod
prevent i l | -xxx | suppress hireq for prefix l
prevent i u | mla mxa | suppress hireq for suffix u
allow dbl i | rcc wmm lqk ixt iin mdi bxr nplti yl yvr dmm igy wbr qnh kq ntn ml+ pla irw yll ixm gvi npl | allow initial double hireq
prevent dbl i | wlk wlm xpkt wby rpa rby yzz ywr ww nqm nkl mni tmm mla mim kx lvn ill ild nptli ixg nwa cpr kmw npx gmlial gll amn dbr azn mah +vb aliab aliwmy alixvr ink amr at ail ain akiry akiyzr adryi iyd iwb aim psl ptiw abir xvh xih xivn wit bvw wvb aiw awh gbr anci lb cys brit ira bli idy nsc q+r pl+ | prevent double hireq
except for | lkm | ensures leading ii for stem, no ii anywhere else. There is supposedly a rule that with a prefix, the i is not doubled. Sometimes...
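One way to hold stem-keyed rows like these is as data rather than as nested conditions. The sketch below uses only a handful of rows from the hireq section and is not the actual program: the `Rule` type and function name are hypothetical, and the precedence (a later matching row overrides an earlier one) is an assumption made so that the sequence of the table matters.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    name: str
    prefix: Optional[str]  # None means the row applies whatever the prefix
    stems: frozenset
    allow: bool


# A small subset of the rows above, kept in table order.
HIREQ_RULES = [
    Rule("allow i", None, frozenset("dbr lqk ntn mxa".split()), True),
    Rule("allow i t", "t",
         frozenset("qrb nwa ntn mxa ngp lcd lkm".split()), True),
    Rule("prevent i c", "c", frozenset({"dbr"}), False),
    Rule("prevent i v", "v", frozenset("qrb lqk dmm".split()), False),
]


def hireq_becomes_i(stem, prefix=None, default=False):
    """Walk the rows in sequence; the last matching row decides."""
    decision = default
    for rule in HIREQ_RULES:
        if stem in rule.stems and rule.prefix in (None, prefix):
            decision = rule.allow
    return decision
```

With these rows, `hireq_becomes_i("dbr")` is true, while `hireq_becomes_i("dbr", prefix="c")` is false because the prefix-specific row comes later. Adding a stem to a row is then a change to data, which may make patterns in the exceptions easier to spot.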


