Dust: Comparing SimHebrew with the WLC

Thursday, 30 April 2020

Comparing SimHebrew with the WLC

All my youth as a Hebrew student, I have used the Aleppo and Westminster Leningrad traditions. I am now into my teen-aged years as a student of this tongue, and I have recently come across the undotted version of the text that is common in modern usage. My colleague and coach, Jonathan Orr-Stav, has invented a new and simple method of encoding undotted Hebrew or Aramaic (of any sort) in English (Latin) characters, called simulated Hebrew, or SimHebrew for short. (You can find an introduction here.) Jonathan explains:

ḥaser in the context of ktiv (spelling) means 'lacking, deficient'—as opposed to malé, which means 'full'.

The former refers to the austere use of yods and vavs to indicate /i/ and /o/ or /u/ sounds — limiting them to where they are actually part of the word stem, and relying on niqqud to dispel misunderstandings. The latter refers to the generous use of yods and vavs to indicate /i/, /o/ or /u/. [and as will be noted below, sometimes qamats and patah (a)].

Undotted Hebrew (both today and in the Second Temple era) tends to use malé, esp. for secular purposes.

I was wondering if it was possible to write a program that would analyse the WLC (a ḥaser text) and produce an undotted version (malé). I think it is.

It is very clear to me that information is lost in this conversion process, notably vowels and accents, but this does simulate the ancient versions that had no vowels (though I would not rush to say that they had no accents). And with that simulation plus a few extra matres lectionis (vavs and yods that aid the reader), we do have a text that could be subject to text mining without the need to manage Unicode.

I am holding my breath about putting the SimHebrew text into the music. At present that will have to wait. But the exercise of programming 9 chapters of test data has proven very instructive so far. One thing it has taught me is how illogical I am at times, thinking one thing and coding another. This is common in programming, especially I imagine in old programmers, but that is anecdotal.

I am now going to try and 'explain' my rules in English.

There is an easy mapping from Unicode to Latin characters. For the sake of understanding I mapped the members of the non-grammatical team first. A straightforward replace. The Unicode code points (given here in decimal) translate unambiguously:
1490 - g, 1491 - d, 1494 - z, 1495 - k, 1496 - +, 1505 - s,
1506 - y, 1508 - p, 1507 - f, 1510 - x, 1509 - x, 1511 - q, 1512 - r
Note I use + for tet (internally). It does have the odd use grammatically, but not for hitting runs, just keeping score.
Then the grammatical letters:
1488 - a, 1489 - b, 1492 - h, 1493 - v, 1497 - i, 1499 - c, 1498 - ç,
1500 - l, 1502 - m, 1501 - m, 1504 - n, 1503 - n, 1513 - w, 1514 - t
These all have significant impacts on the placement and usage of i's and o's in the SimHebrew representation of the malé square text. Yod and vav have the most complex problems. There is an initial quick conversion for vav: vav+1460 becomes vi, vav+1466 becomes vo, and vav with any other vowel becomes vv. There is some nuance here, since these are not final decisions; they depend on other variables. The holam (1465) can follow many letters and has a number of rules; 1466 is used only with vav and is generally fixed.
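
The letter mapping and the quick first pass for vav can be sketched in a few lines of Python. This is only an illustration of the idea, not the program described in this post: the function names map_letters and quick_vav are invented for the example, and the set of code points counted as "other vowels" (1457 to 1467, leaving out hireq, both holams and the sheva) is an assumption.

```python
import re

# Decimal code points exactly as listed above (1488 is aleph, U+05D0).
NON_GRAMMATICAL = {1490: "g", 1491: "d", 1494: "z", 1495: "k", 1496: "+",
                   1505: "s", 1506: "y", 1508: "p", 1507: "f", 1510: "x",
                   1509: "x", 1511: "q", 1512: "r"}
GRAMMATICAL = {1488: "a", 1489: "b", 1492: "h", 1493: "v", 1497: "i",
               1499: "c", 1498: "ç", 1500: "l", 1502: "m", 1501: "m",
               1504: "n", 1503: "n", 1513: "w", 1514: "t"}
LETTERS = {**NON_GRAMMATICAL, **GRAMMATICAL}

HIREQ, HOLAM, HOLAM_VAV = chr(1460), chr(1465), chr(1466)
# Assumption: "other vowels" means the points 1457..1467 except hireq and
# the two holams. The plain holam (1465) is left for later rules.
OTHER_VOWELS = "".join(chr(c) for c in range(1457, 1468)
                       if c not in (1460, 1465, 1466))


def map_letters(text: str) -> str:
    """Replace Hebrew consonants with Latin letters; points are left in place."""
    return text.translate(LETTERS)


def quick_vav(text: str) -> str:
    """Tentative first pass for vav, run after map_letters."""
    text = text.replace("v" + HIREQ, "vi")
    text = text.replace("v" + HOLAM_VAV, "vo")
    return re.sub("v[" + OTHER_VOWELS + "]", "vv", text)
```

Running map_letters first and quick_vav second leaves a string of Latin consonants with most of the points still attached, which is the form the suffix patterns below are written in.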

I also allow myself the generality of converting some common suffixes. It's a bit surprising, but it saves a lot of hunting later.

Pattern	Result
t-1461 c-1462;m	ticm
l-1461 c-1462;m	licm
l-1461 h-1462;m	lihm
n-1464 i-1460;m	ni*im
t-1463 i-1460;m	ti*im
b-1468-1463;i-1460;m	bi*im
i-1468-1464;h	i*ih

(The * prevents the result from being seen as a double i, which might be suppressed by later rules. The *, like the +, is removed as a last step in constructing the full version of the verse.)

Some of the above may need restricting, e.g. there are 291 rows with the last combo and they might not all behave the same way in the rest of the WLC.
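
A minimal sketch of how the suffix table could be applied, assuming the consonants have already been mapped to Latin letters and the points are still present as Unicode characters. The helper names (compile_pattern, apply_suffix, finish) are invented for this example, and anchoring each pattern at the end of a word is an assumption based on the word "suffixes". The notation in the Pattern column is read as Latin letters and decimal code points separated by hyphens, semicolons or spaces.

```python
import re

# (pattern in the notation of the table above, result)
SUFFIX_RULES = [
    ("t-1461 c-1462;m", "ticm"),
    ("l-1461 c-1462;m", "licm"),
    ("l-1461 h-1462;m", "lihm"),
    ("n-1464 i-1460;m", "ni*im"),
    ("t-1463 i-1460;m", "ti*im"),
    ("b-1468-1463;i-1460;m", "bi*im"),
    ("i-1468-1464;h", "i*ih"),
]


def compile_pattern(spec: str) -> "re.Pattern[str]":
    """Turn e.g. 't-1461 c-1462;m' into a regex anchored at the word end."""
    tokens = re.split(r"[-; ]", spec)
    literal = "".join(chr(int(t)) if t.isdigit() else t for t in tokens)
    return re.compile(re.escape(literal) + r"\Z")


COMPILED = [(compile_pattern(spec), result) for spec, result in SUFFIX_RULES]


def apply_suffix(word: str) -> str:
    """Apply the first suffix rule that matches the end of the word."""
    for pattern, result in COMPILED:
        if pattern.search(word):
            # A function replacement keeps '*' and friends literal.
            return pattern.sub(lambda _m, r=result: r, word)
    return word


def finish(word: str) -> str:
    """Last step: drop the '*' guard. The '+' for tet is left alone here,
    because its final rendering is not specified in this post."""
    return word.replace("*", "")
```

Keeping the * until the very end means a later "prevent double i" rule can test for a literal ii without tripping over a suffix that has already been settled.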

This is the beginning, and I won't continue at this level of detail. I need to explain that the remaining rules, each of which applies to a list of stems, are processed in sequence. They allow one to see if a vowel in the text will cause a conversion to a mater lectionis. All the jots and tittles gradually disappear.

This is the matrix: (It will extend - and who knows, may become simpler if the rules appear to have patterns, particularly with respect to some diphthongs.)

It is similar to doing a program to deal with English lemmas. So many exceptions. I began my career as a programmer 54 years ago. I got the job because I could remember a host of three-character nonsense syllables. This program seems to be my bookend.

This table breaks down into three sections. The first gets to the o's. The other two get to the i's: (A) vowels that undergo strange transformations, and (B) finally getting to the real i.

Rule abbreviated	Applies to stem (+ = ט)	Comment
tsere vav	nvh	exceptions to vav+vowel becomes vv
qamats qatan	ycrn nsy mlc krm ahl awih azn +rk krb kpry lcd pyl sll	render vowel a (Unicode 1464) as o
qamats qatan afx	pqd acl	render a as o except for some affixes
qamats qatan b	wmy	render a as o for prefix b
allow o l	acl	allow holam with prefix l
allow o b	ywh	allow holam with prefix b
allow o v	ywh	allow holam with prefix v
allow o e	ywh	allow holam with segol
allow o sf	ywh	allow holam with some suffixes
prevent o	ywh acl aph azn ch la mwh pry raw xan zat	prevent holam
prevent o pref	amr	prevent holam with some common prefixes

tsere hireq	rcc yrc rgn zrh wrq wrp kth kmr kln bar aph yvr ird pl+ wpl +vb	render tsere (1461) as i (some conditions)
tsere hireqm	man	allow for some stems beginning with m but prevent i for single prefix mem with tsere
tsere hireq t	rpa ywh	allow hireq from tsere for t
patah hireq	rcc wnh mdi avli ybd pnh kq dbr yin al	ph becomes ii but not for prefixed vai
patah hireqi	ild ird	יַ ip (1497-1463) becomes ii
qamats u	pqd	הָת ht becomes hut -- specialized prefix
qamats hireqv	wlv ikd	'ָv' qamats-v (1464-1493) becomes iv
qamats hireq	hih pnh yl	qamats i becomes ii except for suffix ' th nh '
qamats hireq pf	itr ivm	qamats i becomes ii - except for prefix 'vָi'
prevent final i	kih	suppress rendering of final ai as ii
allow init pi	lvn wby	allow initial patah-i as ii
allow init pi exc	iwb lqk	allow initial patah-i as ii except for trailing u

allow i	ww yzz rpa qxx nsy ywr nqb nkl lvn ktt abd acr am amn awh aw at azn bxr clm csh cpr cys dbr dmm gbr gll hlc hll hnm kmw kx lb lbb lqk mxa mla ml+ nba ntc npl ntn npx pla psl q+r qnh rgy rnn wqx wbr wck wvb +vb tmm xvh xih yl yll ycb yxb yxm	allow hireq (1460) to be realized -- too generous? Note that hireq is rendered as i when a stem contains an i anywhere.
allow i t	qrb nwa ntn mxa ngp lcd lkm	allow hireq to be realized for prefix t
allow i h	wlv wby rby lvi kll nqm nsc lkm	allow hireq to be realized for prefix h
allow i i	tmm wmm lkm	allow hireq to be realized for prefix i
allow i v	zvd wlk wlm tpw rmh nwa nqm nkm itr	allow hireq to be realized for prefix v
allow i l	pnh	allow hireq to be realized for prefix l
allow i c	xpkt	allow hireq to be realized for prefix c
allow i m	zmm	allow hireq to be realized for prefix m
prevent i	ymindb ymihvd ynqi xvriwdi wlmial pgyial wyir tnin sin sini sir lvi ymiwdi gdyni cnyni brik bin di riq kih acl abidn cid rib irmih hia yir ci mi kli nbia csdi ihvdi cwdi ict itr bli ikm ial id idy anci bnimn iwb cli bit iwral ivm hih ani	prevent hireq -- too restrictive? And what will happen with names?
prevent i n	wbr rpa lkm ird mxa qnh csh	suppress hireq prefix n
prevent i h	ixg ixt hlc rgy +tb q+r	suppress hireq prefix h
prevent i m	dmm	suppress hireq prefix m
prevent i c	dbr	suppress hireq prefix c
prevent i a	ww ird	suppress tsere or hireq prefix aleph or h - may need refining
prevent i io	ird	suppress hireq prefix yod vav
prevent i v	qrb lqk dmm	suppress hireq prefix vav --
prevent i t		suppress hireq for prefix t
prevent i i	lqk qnh	suppress hireq for preterite/imperfect yod
prevent i l	-xxx	suppress hireq for prefix l
prevent i u	mla mxa	suppress hireq for suffix u
allow dbl i	rcc wmm lqk ixt iin mdi bxr nplti yl yvr dmm igy wbr qnh kq ntn ml+ pla irw yll ixm gvi npl	allow initial double hireq
prevent dbl i	wlk wlm xpkt wby rpa rby yzz ywr ww nqm nkl mni tmm mla mim kx lvn ill ild nptli ixg nwa cpr kmw npx gmlial gll amn dbr azn mah +vb aliab aliwmy alixvr ink amr at ail ain akiry akiyzr adryi iyd iwb aim psl ptiw abir xvh xih xivn wit bvw wvb aiw awh gbr anci lb cys brit ira bli idy nsc q+r pl+	prevent double hireq
except for	lkm	ensures leading ii for stem, no ii anywhere else. There is supposedly a rule that with a prefix, the i is not doubled. Sometimes...
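
One way to hold a matrix like this in a program is as an ordered list of rows, each with a rule name, a set of stems and a comment, so that the rules for a given stem can be looked up in table order. The sketch below assumes the rows are tab-separated as above; parse_row and rules_for are invented names, only a handful of rows are included, and what each rule actually does to a word is not shown.

```python
from typing import List, NamedTuple, Set


class Rule(NamedTuple):
    name: str
    stems: Set[str]
    comment: str


def parse_row(line: str) -> Rule:
    """Split one tab-separated row into rule name, stem set and comment."""
    name, stems, comment = line.split("\t")
    return Rule(name, set(stems.split()), comment)


# A small excerpt of the table, in its original order.
ROWS = [
    "qamats qatan b\twmy\trender a as o for prefix b",
    "allow o l\tacl\tallow holam with prefix l",
    "allow o b\tywh\tallow holam with prefix b",
    "prevent o\tywh acl aph azn ch la mwh pry raw xan zat\tprevent holam",
    "prevent i m\tdmm\tsuppress hireq prefix m",
]
RULES: List[Rule] = [parse_row(row) for row in ROWS]


def rules_for(stem: str) -> List[str]:
    """Names of the rules that list this stem, in table order."""
    return [rule.name for rule in RULES if stem in rule.stems]
```

With this excerpt, rules_for("ywh") returns the rules "allow o b" and "prevent o" in that order, which is the sequence in which they would be tried for a word from that stem.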
