Sunday, 19 April 2020

SimHebrew and learning its uses

As many will have seen, I am learning and experimenting with SimHebrew, a way of representing the square-script, right-to-left Hebrew text with a left-to-right mapping to Latin letters.

I rolled my own version of this some 10 years ago when I was learning how to manage the Biblical text in a database. My mapping is different, of course, but it can be converted to SimHebrew: there are only a half-dozen differences. SimHebrew also differs from the mapping for the Hebrew keyboard I have (Tyndale).

Aleppo codex, Jeremiah 32:13
But the problem I have been working on is mapping the voweled text of the Leningrad codex to what is called a malé (full) consonantal text of Hebrew. (You can get a copy from mechon-mamre here: http://mechon-mamre.org/dlk.htm. Don't click this if you don't want it: it is an immediate download, but safe.)
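As a rough illustration only (this is not the routine described in this post), the simplest part of going from a voweled text to a consonantal one can be sketched in Python. It relies on the fact that Hebrew vowel points and cantillation accents are Unicode combining marks (category `Mn`). The function name `strip_points` is made up for the example. Note that this only yields the bare consonants as written; a malé text also needs vowel letters added in places, which is where the real difficulty lies.

```python
import unicodedata


def strip_points(text: str) -> str:
    """Drop niqud and accents, keeping consonants, spaces and punctuation.

    Hypothetical helper for illustration. Every Hebrew point and accent is
    a nonspacing mark (Unicode category "Mn"); the maqaf (U+05BE) is
    punctuation, so it survives.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The first word of Genesis, written with escapes so the points are visible:
# bet + sheva + dagesh, resh + tsere, alef, shin + shin dot + hireq, yod, tav
voweled = "\u05D1\u05B0\u05BC\u05E8\u05B5\u05D0\u05E9\u05C1\u05B4\u05D9\u05EA"
bare = strip_points(voweled)
```

Because the maqaf is kept, a text stripped this way can still be searched for a joined or unjoined pair of words.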

Information is lost in this mapping, but some ease is gained. The mapping is suitable for readable URLs, for instance. And I recently used it to search the Aleppo Codex for confirmation of a missing maqaf in Jeremiah 32:13. It is hard for me to read a manuscript. So what I did was read some of the ms, transcribe it to SimHebrew in my head, and then look up that text in a text file of the chapter to see where I was on the ms page. That way, I could work out where the text I wanted was relative to what I was reading, and so where on the page to look for it. (There are no verse numbers in the ms.)
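The lookup step could be sketched like this in Python. It is a minimal sketch under assumptions: the chapter is held as a dictionary of verse number to transliterated text, and the sample strings are placeholders, not real SimHebrew. The names `squash` and `locate` are invented for the example.

```python
import re


def squash(s: str) -> str:
    """Remove spaces and hyphens so word division does not block a match."""
    return re.sub(r"[\s\-]+", "", s)


def locate(fragment: str, verses: dict) -> list:
    """Return the verse numbers whose text contains the fragment."""
    needle = squash(fragment)
    if not needle:
        return []
    return [n for n, text in verses.items() if needle in squash(text)]


# Placeholder chapter data (hypothetical, not actual transliterated text).
chapter = {1: "abc def", 2: "ghi-jkl mno", 3: "def ghi"}
hits = locate("jkl m", chapter)
```

Ignoring spaces and hyphens is a deliberate choice here: word breaks are exactly what is uncertain when reading a manuscript, so a fragment read off the page should match whether or not the words are joined. The cost is an occasional false match across a word boundary.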

But the most value will be in studying the consonantal text and observing differences in word usage. I am already able to do this with just a few chapters of test data. It is a learning experience.

We lose most of the vowels. There are plenty of i's and o's, but /a/ is aleph, not what we think of as 'a'. The real info that is lost is the prosody. Hebrew speakers take this for granted. But the music and stress in the accents are not in a malé text. I could revise my algorithms for converting the text to music, and incorporate the malé text in the left to right libretto. Maybe someday I will do this. That music XML conversion routine exposes the transcription of the Hebrew syllable by syllable. It is reasonably robust. My initial algorithm for the mapping of Leningrad to malé is based on pattern recognition. This may prove more difficult than working syllable by syllable.

The guts of my fragile first algorithm for converting Leningrad to a text without niqud or accent have some techniques that will evolve as the routine sees more data:
  • a list of stems that allow or prevent a double yod in the result.
  • a list of stems that use a qamats-qatan 'o' for a qamats. These are typically if not always in the first syllable of a strong stem.
  • a list of stems that convert a tsere, or patah, or even a qamats to hireq. My rules here may need some more conditions on their application.
  • exceptions for various prefixes like yod or nun, i.e., binyan dependencies.
That's a handful of rules that might seem complex. I don't have a fully decomposed analysis of the prefixes and suffixes for every word, but I do have the stem (aka root). I remember asking, 10 years ago, if anyone had a definitive table of Hebrew roots. Someone gave me a spreadsheet as a bootstrap. May he be blessed. Eventually I put it all together by hand over the past 10 years. It proves useful in this exercise. It remains essential for the translation project.
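For readers who think in code, stem lists like those above could be held as sets keyed by stem and consulted per word. The following Python is only a sketch of that shape, not the actual routine: the class `StemRules`, its field names, the placeholder stems, and the assumption that yod is written 'i' in the Latin mapping are all made up for illustration. Only the double-yod rule is applied; the other two lists are merely looked up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemRules:
    """Hypothetical container for the per-stem exception lists."""
    no_double_yod: frozenset = frozenset()   # stems that prevent a double yod
    qamats_as_o: frozenset = frozenset()     # stems with qamats-qatan read as 'o'
    vowel_to_hireq: frozenset = frozenset()  # stems where a vowel becomes hireq

    def flags_for(self, stem: str) -> dict:
        """Report which of the lists a stem belongs to."""
        return {
            "no_double_yod": stem in self.no_double_yod,
            "qamats_as_o": stem in self.qamats_as_o,
            "vowel_to_hireq": stem in self.vowel_to_hireq,
        }


def apply_double_yod_rule(word: str, stem: str, rules: StemRules, yod: str = "i") -> str:
    """Collapse repeated yods to one when the stem is on the prevent list."""
    if stem in rules.no_double_yod:
        while yod * 2 in word:
            word = word.replace(yod * 2, yod)
    return word


# Placeholder stems and words (not real Hebrew roots).
rules = StemRules(no_double_yod=frozenset({"stem1"}))
collapsed = apply_double_yod_rule("xiiy", "stem1", rules)
untouched = apply_double_yod_rule("xiiy", "stem2", rules)
```

Keeping the lists as data rather than as branches in the code means they can grow as the routine sees more chapters, without the logic itself changing.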

One thing this old programmer gets from this work is an escape from news and the current pandemic. One only hopes that human intelligence can hold. My code may be fragile, but so are we all.
