Sunday, November 19, 2017

Deriving a lemma for an English gloss

One tends to take one's own tongue for granted. For the last three days I have been running tests and improving my very ad hoc algorithms to discover where I am breaking my own rules. Essentially this is a move from gloss to lemma, the form you would find in a dictionary.

I thought at first I would need an English dictionary app, but it turns out a few simple rules will take a gloss and uncover its lemma form. I have uncovered about 200 stems that I need to investigate one at a time to see why I have an overlap in my translations that I had not already known about and explicitly allowed.

My explicit allowance rules are:
I can reuse a gloss with a different Hebrew stem for these reasons.
  • literary effect, like use in an acrostic, or English idiom that I want to reflect in the reading.
  • stems that are so general (come, go, bring, set, put) that distinguishing them is well nigh impossible. In each of these cases, there is generally a most frequently used gloss.
  • stems that are in both Aramaic and Hebrew.
  • homonyms, of course.
  • places where there seemed no other choice. There are quite a few of these.
My strategy has been to run an algorithm that counts the 'main' word in an English phrase and the number of Hebrew stems that are used for that English string. It has been pretty accurate. To find the main word, I remove some helping verbs, most prepositions, some modifiers and then count the number of Hebrew stems used with the result.
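The counting step could be sketched in Python roughly as follows. The stop list, the placeholder stems s1 and s2, and the function names are illustrative assumptions, not the actual lists or code behind these results.

```python
from collections import defaultdict

# Hypothetical stop list: a few articles, prepositions, helping verbs and
# modifiers. The real lists are longer and are not given in this post.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and",
              "will", "shall", "has", "have", "is", "was", "very"}


def main_word(gloss):
    """Return the gloss with the stop words removed."""
    kept = [w for w in gloss.lower().split() if w not in STOP_WORDS]
    return " ".join(kept)


def stems_per_gloss(pairs):
    """Map each main word to the set of Hebrew stems glossed with it."""
    table = defaultdict(set)
    for stem, gloss in pairs:
        key = main_word(gloss)
        if key:
            table[key].add(stem)
    return table


def overlaps(pairs):
    """Return only the main words that are shared by more than one stem."""
    return {k: v for k, v in stems_per_gloss(pairs).items() if len(v) > 1}


# Two placeholder stems sharing one main word are reported as an overlap.
print(overlaps([("s1", "will come"), ("s2", "to come")]))
```

Because the comparison is on the literal string, "brass" and "brasses" count as different main words in a sketch like this.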

This strategy misses some overlaps. For example, as would be expected, I use brass with נחשׁ. That was fine, and I had not used brass with any other Hebrew stem, but I missed that I had used brasses in Zechariah 14:20 for צלל, a word with the tinkling sound of cymbals. This is an example I do not want to leave in place. But what is the technical term for the place on the horse where the word would be written? I am keeping my eye out for this, but in the meantime I have added the adjective tinkling to indicate the noise implied in the Hebrew.

Now I can derive the singular from the plural and restore silent e's when removing parsed components like ed and ing, while allowing for exceptions. The English analysis has been much easier than the Hebrew root algorithm I wrote as a bootstrap to my Hebrew dictionary (and which, like other bootstraps, is no longer in use).

For the first 200,000 words of the Bible, I now have a full mapping between Hebrew stem and English lemma. I know how many lemmas occur for a stem and how many stems occur for a lemma, so I can check the ones I have missed.
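A two-way index of this kind might look like the following Python sketch. The function names are assumptions, and the sample pairs are illustrative only: brass for both stems follows the Zechariah example above, while tinkle is a guessed lemma for tinkling.

```python
from collections import defaultdict


def build_indexes(pairs):
    """Build stem -> lemmas and lemma -> stems from (stem, lemma) pairs."""
    lemmas_by_stem = defaultdict(set)
    stems_by_lemma = defaultdict(set)
    for stem, lemma in pairs:
        lemmas_by_stem[stem].add(lemma)
        stems_by_lemma[lemma].add(stem)
    return lemmas_by_stem, stems_by_lemma


def shared_lemmas(stems_by_lemma):
    """List the lemmas used with more than one stem, in sorted order."""
    return sorted(l for l, stems in stems_by_lemma.items() if len(stems) > 1)


# Illustrative data only, not the real mapping.
pairs = [("נחשׁ", "brass"), ("צלל", "brass"), ("צלל", "tinkle")]
by_stem, by_lemma = build_indexes(pairs)
print(shared_lemmas(by_lemma))
```

The lemmas returned by shared_lemmas are the candidates to check against the explicit allowance rules.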

This is a summary of the steps in the algorithm. 
  1. As before, remove various components from the English phrase for the Hebrew word: general terms, articles, interjections, vocatives, negatives, demonstrative pronouns, common modifiers, prepositions, helping verbs, possessive, relative and reflexive pronouns, and conjunctions.
  2. I leave in a few pronouns that are used to distinguish some significant verbs and nouns. There is some subjectivity in this decision. E.g. ascend is also go up, descend is also go down. These two verbs, however, are also a part of the generic issue noted above where I have largely abandoned my rules.
  3. I remove some additional helping verbs. I keep these in two groups, since some helping verbs, like do or make, are also used on their own.
  4. Then I apply a series of rules - remarkably simple given the irregularity of English.
There are five lists of exceptions to the rules: words that must end in 'ly', like comply; words that must end in 's', like debris; words that must end in 'ing', like sling; words that must end in 'at'; and words ending in 'ite' that must have a silent e.

There is one list of the last two letters of any word that demand a silent e on the lemma. This list is nc:dg:bl:at:as:iz:ut:su:ls:rg:rs:ev:ir:ur:av:dy:ud:us: (and a few more) but with the exceptions noted above.

Given these exceptions, the algorithm removes adverbial endings (ly, ily, ingly, ably), participle endings (ied, ed, ing), and plurals (ches, shes, sses, ies, s), except for words that end in s like debris. It also removes doubling of the final letter before one of these endings and restores the silent e where needed. There are exceptions to the silent e rule that have to be noted.
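A minimal Python sketch of these rules follows. It is not the actual implementation and covers only some of the endings (ingly and ably are left out). The exception lists hold just the one member named above, eat is a hypothetical entry for the 'at' list, and both the guard against stripping s from a double s and the letters excluded from undoubling are assumptions.

```python
# Exception lists. Only one member of each is named in the post.
KEEP_LY = {"comply"}
KEEP_S = {"debris"}
KEEP_ING = {"sling"}
NO_SILENT_E = {"eat"}  # hypothetical member of the "must end in at" list

# Final two letters that call for a restored silent e (partial list).
SILENT_E_TAILS = set(
    "nc:dg:bl:at:as:iz:ut:su:ls:rg:rs:ev:ir:ur:av:dy:ud:us".split(":"))

# (suffix, replacement) pairs, longest suffix first within each group.
ADVERB = [("ily", "y"), ("ly", "")]
PARTICIPLE = [("ied", "y"), ("ed", ""), ("ing", "")]
PLURAL = [("ches", "ch"), ("shes", "sh"), ("sses", "ss"),
          ("ies", "y"), ("s", "")]


def repair(stem):
    """Undo a doubled final consonant, or else restore a silent e."""
    # Assumption: vowels and l, s, f, z are naturally doubled (call, pass).
    if len(stem) > 2 and stem[-1] == stem[-2] and stem[-1] not in "aeioulsfz":
        return stem[:-1]
    if stem[-2:] in SILENT_E_TAILS and stem not in NO_SILENT_E:
        return stem + "e"
    return stem


def lemma(word):
    """Strip one adverbial, participle or plural ending from a word."""
    word = word.lower()
    if word in KEEP_LY or word in KEEP_S or word in KEEP_ING:
        return word
    for suffix, repl in ADVERB:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    for suffix, repl in PARTICIPLE:
        if word.endswith(suffix):
            stem = word[:-len(suffix)] + repl
            return stem if suffix == "ied" else repair(stem)
    if word.endswith("ss"):  # assumption: never strip s from a double s
        return word
    for suffix, repl in PLURAL:
        if word.endswith(suffix):
            return word[:-len(suffix)] + repl
    return word


for w in ["brasses", "danced", "judging", "running", "eating", "cities"]:
    print(w, "->", lemma(w))
```

Short words such as red or holy would be mangled by this sketch, which is exactly the kind of case the exception lists exist to catch.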

As for comparatives, most of the English words I have used that end in er or est are not comparatives. Some of the functional 'er' words should be noted. And I see some compromised words ending in est, e.g. highest (a name for God, but not related to high, רום). So I will be adding these to the algorithm. One can't remove est from words that require it, like guest, arrest or harvest. The 'er' words are frequent and appear to be of several different classes of word: comparative (better), functional (farmer), and maybe others that are neither. Good thing I didn't start with 'er'. I might have got discouraged. But I did get to it. See the next post.
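The guard for est could be as simple as the following Python sketch. It assumes a keep-list holding only the three words named above, and the function name is illustrative. It does not restore a silent e (largest would become larg) and it does not handle the 'er' classes at all.

```python
# Words named above that end in est but are not superlatives.
KEEP_EST = {"guest", "arrest", "harvest"}


def strip_est(word):
    """Remove a superlative est unless the word is on the keep-list."""
    word = word.lower()
    if word in KEEP_EST or not word.endswith("est"):
        return word
    return word[:-3]


print(strip_est("greatest"), strip_est("harvest"))
```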

I think the algorithm is over 90% successful (99% after dealing with er, but I still don't strip ion or ess endings). I am still experimenting. I am also investigating one by one the rule violations I have discovered with this refinement. There are about 8500 English lemma forms in the 200,000 or so words done or guessed so far. These are used with about 3000 Hebrew lemma forms (stems). The 200 new exceptions will go by quickly [wrong - see next post!] since some of them are prepositions that I ignore anyway for reasons I have explained before.