Saturday, May 23, 2015

An example of domain analysis

This process of deciding semantic domains is long and difficult. It will occupy some of my time for as long as the project continues. But it is a bit of a rest for the mind - searching for the right place(s) for a word stem. I can add domains and subdomains just by naming them. I can move individual words into multiple domains depending on what role they play in the sentence.

There are lots of words. I am working with about 60,000 of about 300,000 words in the Hebrew Bible. Eventually, if I complete the project, I will have all of them in various slots. How many slots, you might wonder. Notice that in the first 60,000 words, there are currently only about 2250 distinct stems. Some stems are content, some are grammar. Many have multiple meanings - i.e. they are homonyms in Hebrew - really they are different words, but they sound similar or at least have the same stem. Some are 100% in one domain always, though I may choose to translate them with multiple synonyms in English. Best to look at an example.

The picture shows that I have 8 domains at the highest level. 21% of the words so far are grammar only: form rather than content. Creation is selected in this example. I have at present about 20 subdomains for this domain. One word that appears under it is highlighted in the list in the middle, תחת or TXT in my transcription. (I manipulate Hebrew in the database by transcription to Latin characters. Far easier than messing with Unicode.) תחת operates in three domains, even though it has no homonyms that I know of yet. Its basic meaning is as a preposition, 'under' or 'instead of'. Prepositions are notoriously hard to translate consistently. They seem to be used in many ways, and often a verb implies the preposition or requires it to render a word into English. Another subdomain that this word fits into is Pronoun. It is still a preposition, but because it is combined with a pronoun, I let the pronoun subdomain take precedence. It is easy enough to see (both on the page and in the computer) that this is more than a pronoun. I have this sort of logic coded into my grammatical form recognition software, another tool that is evolving in this project. A third subdomain I have for תחת is related to Distance. It can have the sense of 'beyond'. I wonder if I should combine Distance with Measure - not sure at this point.
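
For readers who think in code, here is a minimal sketch of how one stem can sit in several slots at once. It is not my actual database. The class name DomainIndex and its methods are hypothetical, the subdomain label Preposition is my guess at a name, and the parent domains given for each subdomain are assumptions made only for illustration.

```python
from collections import defaultdict


class DomainIndex:
    """Hypothetical two-level index: stem -> set of (domain, subdomain) slots."""

    def __init__(self):
        # Each transcribed stem maps to the set of slots it occupies.
        self._slots = defaultdict(set)

    def assign(self, stem, domain, subdomain):
        """Place a stem in a slot; a stem may occupy several slots."""
        self._slots[stem].add((domain, subdomain))

    def slots(self, stem):
        """Return the slots for a stem in sorted order (empty if unassigned)."""
        return sorted(self._slots.get(stem, set()))

    def stems_in(self, domain, subdomain=None):
        """List stems in a domain, optionally narrowed to one subdomain."""
        return sorted(
            stem
            for stem, slots in self._slots.items()
            if any(d == domain and (subdomain is None or s == subdomain)
                   for d, s in slots)
        )


index = DomainIndex()
# TXT is the transcription of the stem discussed above.
# The parent domains here are assumed, not taken from the real data.
index.assign("TXT", "Grammar", "Preposition")
index.assign("TXT", "Grammar", "Pronoun")
index.assign("TXT", "Creation", "Distance")

txt_slots = index.slots("TXT")
creation_stems = index.stems_in("Creation")
```

Using a set per stem means that assigning the same slot twice does no harm, and merging two subdomains (Distance into Measure, say) would only require rewriting the affected pairs.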

In the middle left, I can adjust the domain and the stem with a few well-placed clicks. Not all my words have the right stem yet, and about 400 or so are unassigned. It is not always easy to derive the stem from the word. As I add new chapters, the software examines each word and makes a best guess. I could spend more time on the stem recognition logic, but the more data I have, the more accurate it becomes anyway. I wrote that logic several years ago and I know I would not take the same approach today.

You might ask, should you have a greater hierarchical depth to your analysis? Possibly, but look how much is reduced by the concept of stem - from 60,000 words to just under 2250 stems. Psalms, with about 20,000 words, uses about 1350 of these stems. With 3 times the number of words, we have added only 900 new stems. How many more will there be? Perhaps as many as 4000 stems in total. Can one handle 4000 stems with a two-level hierarchy? If there were 20 domains with 20 subdomains each, then there would be only 10 stems in each subdomain. So suppose there were only 9 domains (stretching the human mind) and 25 subdomains, but wait... 21% of the stems are Grammar and this has 9 subdomains. So drop 21% of 4000 from consideration. Then 8 domains with 20 subdomains each would have an average stem count x given by (4000-840) = (8*20*x), x = 19.75. A two-level hierarchy governs quite handily.
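
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate, using the round figures from this post; the function name and its parameters are made up for the illustration, and the 4000-stem total is itself a guess.

```python
def average_stems_per_subdomain(total_stems, grammar_percent,
                                domains, subdomains_per_domain):
    """Average number of content stems landing in each subdomain.

    grammar_percent is the share of stems set aside as Grammar,
    given as a whole-number percentage to keep the arithmetic exact.
    """
    grammar_stems = total_stems * grammar_percent // 100
    content_stems = total_stems - grammar_stems
    return content_stems / (domains * subdomains_per_domain)


# New stems added when going from Psalms alone to the first 60,000 words.
new_stems = 2250 - 1350

# 20 domains of 20 subdomains each, ignoring Grammar entirely.
flat_average = average_stems_per_subdomain(4000, 0, 20, 20)

# Set aside 21% as Grammar, then spread the rest over 8 domains of 20 subdomains.
adjusted_average = average_stems_per_subdomain(4000, 21, 8, 20)

print(new_stems, flat_average, adjusted_average)
```

Either way the average stays at about 20 stems or fewer per subdomain, which is a list short enough to scan by eye.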

This is one of a half-dozen screens that I use to control my reading project.

Semantic Domain page constructed using GX-LEAF, Live Enterprise Accountability Framework.