Writing a morphological analyser (parser) for any one language nearly always starts with making global compromises between universal and specific approaches, deciding which platform and programming tools to prefer, stating the goal and audience. Before describing the analyser in more detail I will briefly mention some of these considerations.
It was nearly ten years ago that the first attempts were made to build a morphological analyser (Hein 1990) on a VAX clone using Pascal as the programming language. Although the program nearly worked, it seemed unsuited for any practical work, being just too slow due to constant disk access. In addition, the old mainframe made room for desktop PCs. We started to rewrite the programs last year. My primary (programming) language has always been Pascal and there was no reason to change the habit. DOS is the most widespread operating system in these parts and the only one I feel familiar with, so we made no attempts to develop the parser for UNIX or Macintosh environments. Hopefully our approach is generic enough to allow a shift to a different platform in the future.
As a proof of this statement we decided to make the jump to Windows, and it took just a couple of months to finish the new version of the parser, although our previous DOS-based product was quite acceptable. Most of the time was spent implementing the new user interface. Why Windows? First, it made life easier. I no longer had to worry about memory management and some nagging file I/O limitations. The whole program and some of the data can be kept in computer memory, and disk caching programs improve the speed without intervention. The more memory is available, the faster the analyser will work, and this is the direction in which computers evolve. Windows also provides a handy user interface with no special effort. And last, if this parser should ever turn out to be a usable tool, users of Windows word processors will be the most grateful.
In the following text, list and dictionary are treated as synonyms since both contain the same kind of data to guide the analysis. Lexicon might be used as a common denominator for both. A dictionary is usually something voluminous, fixed and containing various data fields. Due to their size, lists in this text usually refer to formatives and dictionaries are used for stem listings but this need not always be the case.
Basically, our morphological parser implements the most basic approach to any morphological problem - it cuts off part of the word form and tries to find both parts in dictionaries. If the searches are successful, then one possible analysis is presented to the user. This parser adds a few features to the main level of the morphological analysis - we also implement a search among exceptions and the most frequent word forms. Both are listed in dictionaries with full morphological information, making further analysis redundant. Indeclinable word forms are analysed next, but they need further analysis to find possible parallel variants.
The only Estonian-specific addition to this scheme is GI/KI analysis. GI (KI) is a clitic particle that can be added to any word form except conjunctions and interjections. The choice between GI and KI is made on purely phonological grounds - if the word form ends with a vowel or a voiced consonant then GI is used, otherwise KI is used. The particle has no lexical meaning but gives the inflected form the semantic flavor of 'even', 'also', 'too' to stress that this particular item is opposed to a list of other possible items or actions.
Paat osteti saarelt. \The boat was bought from the island.\
Paatki osteti saarelt. \Even the boat was bought from the island.\
Paat ostetigi saarelt. \The boat WAS (as opposed to 'was not') bought from the island.\
Paat osteti saareltki. \A boat was bought from the island as well.\
Therefore, when a word form has undergone the morphological analysis (possibly without results) and it ends with the sequence GI or KI, then the possible particle is cut off and the remaining part gets analysed a second time.
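The GI/KI rule and the stripping step can be sketched as follows. This is only an illustration in Python, not the analyser's Pascal code; the function names and the exact set of letters counted as voiced consonants are assumptions, since the text does not list them.

```python
# Illustrative sketch only; names and letter sets are assumptions.
VOWELS = set("aeiouõäöü")
VOICED_CONSONANTS = set("lmnrvj")  # assumed set, not given in the text


def expected_clitic(base: str) -> str:
    """Return 'gi' after a vowel or voiced consonant, otherwise 'ki'."""
    last = base[-1].lower()
    return "gi" if last in VOWELS or last in VOICED_CONSONANTS else "ki"


def strip_clitic(word_form: str):
    """Return the word form without GI/KI if the particle fits its
    phonetic context; return None if nothing can be cut off."""
    lower = word_form.lower()
    if len(lower) > 2 and lower[-2:] in ("gi", "ki"):
        base = word_form[:-2]
        if expected_clitic(base) == lower[-2:]:
            return base
    return None
```

With these assumptions, strip_clitic("Paatki") gives "Paat" for a second round of analysis, while an ill-formed sequence such as "paatgi" is left alone.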
All this indicates that our algorithm is not particularly suitable for highly agglutinative languages like Finnish where it would be more productive to separate and identify one suffix at a time. We are using finite lists of all possible formatives regardless of whether they are compound (TELE = TE "pl" + LE "ill") or not.
I have used a modular approach in the implementation wherever possible. The analyser maintains two queues - one for intermediate results and the other for final output. Each item in a queue holds all the information gathered so far during the analysis. Every module gets access to the queue of partially analysed items, examines each item in turn and possibly adds new information to it. There are then three options: if the item is acceptable, it is placed in the output queue; if the item has been updated but the module cannot yet decide whether it is a legal analysis, it is pushed back into the working queue; if the item is found unacceptable, no further action is taken. The last module in the chain must leave the working queue empty so that every possible analysis is either accepted or rejected. All information in the output queue is then presented to the user. The output may also be redirected to a file instead of the screen and sorted according to statistical data. By sorting we mean rearranging the items in the output queue so that the most probable analyses are put out first. This process - although not implemented yet - can be purely statistical (based on stem or formative frequencies), or certain syntactic considerations may be taken into account (e.g. a certain part of speech or case is expected to appear in the text).
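The two-queue scheme might be sketched roughly as below. Everything here is hypothetical: the names Verdict, run_pipeline, split_module and lookup_module, the toy stem and formative sets, and the representation of an item as a Python dict are assumptions made for illustration, not a description of the actual program.

```python
from collections import deque
from enum import Enum


class Verdict(Enum):
    ACCEPT = 1  # item goes to the output queue
    DEFER = 2   # item goes back to the working queue for later modules
    REJECT = 3  # item is dropped


def run_pipeline(word_form, modules):
    """Pass every working item through each module in turn."""
    working = deque([{"form": word_form}])
    output = []
    for module in modules:
        pending = deque()
        while working:
            item = working.popleft()
            for verdict, result in module(item):
                if verdict is Verdict.ACCEPT:
                    output.append(result)
                elif verdict is Verdict.DEFER:
                    pending.append(result)
                # Verdict.REJECT: nothing is kept
        working = pending
    if working:
        raise RuntimeError("the last module left undecided items")
    return output


# Two toy modules (hypothetical data): split the form, then look both parts up.
TOY_STEMS = {"laps"}
TOY_FORMATIVES = {"i"}


def split_module(item):
    form = item["form"]
    for cut in range(1, len(form) + 1):
        yield Verdict.DEFER, {**item, "stem": form[:cut], "formative": form[cut:]}


def lookup_module(item):
    ok = item["stem"] in TOY_STEMS and item["formative"] in TOY_FORMATIVES
    yield (Verdict.ACCEPT if ok else Verdict.REJECT), item
```

Calling run_pipeline("lapsi", [split_module, lookup_module]) leaves a single item with stem "laps" and formative "i" in the output. In this sketch a module may return several items for one input, which is how a splitting module can fill the working queue.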
MODULE 1. Indeclinable word forms and exceptions
Gets a full word form and tries to find it in the corresponding dictionary. Upon success the word form is sent to the output queue with all information found in the dictionary. Some examples:
JA J "conjunction" (JA) \and\
ROOTSI G "attribute in genitive" (ROOTSI) \Swedish\
SÜDANT S 04 "sg part" noun (SÜDA) \heart\, the regular form would be 'SÜDAT'
MODULE 2. Most frequent word forms
Looks up the word form in the list of most frequent word forms. The list is not closed and can be very flexible depending on the task. The dictionary is updated constantly and therefore adapts to user-specific text types. When there is enough computer memory, any analysed word form from the current text session can also be added to this list, thus speeding up the parser.
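A minimal sketch of such a self-extending list is given below. The class name FrequentForms, the max_size limit (standing in for "enough computer memory") and the full_analysis callback are assumptions made for this example.

```python
class FrequentForms:
    """Open list of word forms with ready-made analyses (hypothetical)."""

    def __init__(self, initial=None, max_size=10000):
        self.entries = dict(initial or {})
        self.max_size = max_size  # crude stand-in for available memory

    def lookup(self, form):
        return self.entries.get(form)

    def remember(self, form, analyses):
        if len(self.entries) < self.max_size:
            self.entries[form] = analyses


def analyse_with_cache(form, cache, full_analysis):
    """Try the frequent-form list first; fall back to the full analysis
    and remember any successful result for the rest of the session."""
    hit = cache.lookup(form)
    if hit is not None:
        return hit
    analyses = full_analysis(form)
    if analyses:
        cache.remember(form, analyses)
    return analyses
```

A word form analysed once in a session is then answered from the list on every later occurrence, without going through the remaining modules.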
MODULE 3. Clitic particles (GI or KI)
This module causes the whole analysis cycle to be run a second time if the word form under analysis ends with GI/KI in the right phonetic context. Full analysis is necessary because of possible morphological homonyms, cf.
SAAGI S 22 "sg part" (SAAG) \saw\
SAAGI S 22 "sg gen" (SAAK) \harvest\
SAAGI V 37 "imp pr sg2" + GI (SAAMA) \to get\
V 37 "ind pr ps (neg)" + GI (SAAMA) \to get\
Any increase in the number of such particles would inevitably cause exponential growth of the morphological dictionaries used, requiring a different approach. Even with this single particle, adding the GI/KI variants to the list of about 170 formatives that we use in the analysis would double its size, which would be both an impractical and an unaesthetic solution.
MODULE 4. Formatives
This module is the first to fill the working queue. All possible formatives are separated from the stem and compared against a closed list. A formative is accepted only if it is found in the list and the phonetic environment recorded with it matches the actual environment. The environment - the left context - is described as a vowel (V), a consonant (C) or an arbitrary string. Thus the formative 'I' can be used as a plural partitive (------1p) with stem classes 13at (inflection type 13, lemma-stem a, strong grade) and 14at (inflection type 14, lemma-stem a, strong grade) only if the stem ends with a consonant, whereas 'I' can always be used as past tense 3rd person indicative (-02031--) with stem classes 36bn (inflection type 36, inflection stem b, weak grade) and 38ct (inflection type 38, inflection stem c, strong grade) because the context field is left unspecified.
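The formative check with its left context could look roughly like this. The two 'I' entries are the ones quoted above; the record layout, the function names and the use of an empty string for an unspecified context are assumptions made for the sketch.

```python
from dataclasses import dataclass

VOWELS = set("aeiouõäöü")


@dataclass(frozen=True)
class Formative:
    ending: str
    context: str         # "V", "C", a literal string, or "" if unspecified
    code: str            # morphological description
    stem_classes: tuple  # stem dictionaries to be searched next


FORMATIVES = [
    Formative("i", "C", "------1p", ("13at", "14at")),
    Formative("i", "", "-02031--", ("36bn", "38ct")),
]


def context_matches(stem: str, context: str) -> bool:
    """Check the end of the stem against the formative's left context."""
    if context == "":
        return True
    if not stem:
        return False
    last_is_vowel = stem[-1] in VOWELS
    if context == "V":
        return last_is_vowel
    if context == "C":
        return not last_is_vowel
    return stem.endswith(context)  # arbitrary literal string


def candidate_splits(word_form: str):
    """Return every (stem, formative) pair allowed by the closed list."""
    form = word_form.lower()
    results = []
    for cut in range(1, len(form) + 1):
        stem, ending = form[:cut], form[cut:]
        for formative in FORMATIVES:
            if formative.ending == ending and context_matches(stem, formative.context):
                results.append((stem, formative))
    return results
```

For LAPSI both entries survive, because the stem LAPS ends in a consonant; for a stem ending in a vowel only the entry with the unspecified context would remain.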
MODULE 5. Stems
The morphological classification originates from A Concise Morphological Dictionary of Estonian (Viks 1992). For the present, it is sufficient to say that members of one stem class share the same paradigm of stem changes (lemmatic or inflection stems, strong or weak grades). It is often possible to deduce one stem variant from another by means of some kind of rule system. Such rules would greatly decrease the size of the dictionaries we use now - instead of separating all stem variants into separate dictionaries, one could get one variant from a dictionary and generate the other or others. If the generated product matches the remaining stem, the analysis has succeeded. Implementing such rules will be the next step in building the analyser; so far we still use separate lists for all possible stem variants for all inflection types.
Let us assume that we are analysing LAPSI, supposing there are no more possible formatives except the two 'I' lines given above. After examining every possible stem+formative pair from LA+PSI to LAPSI+0 there is only LAPS+I left in the working queue after the previous module 4 has finished. But the formative list suggests that LAPS can be found in any one or even in all of the four stem dictionaries 13at, 14at, 36bn, 38ct. Module 5 is responsible for searching all those dictionaries to leave only such entries in the queue that can be found. In our case LAPS \child\ can only be found in 14at, leaving only one entry in the queue. This entry already has all the information needed to send it to output - we know that this must be LAPS+I and no other analysis is possible, that 'I' stands for plural partitive case and LAPS is an existing stem. Module 5 also finds the corresponding lemma with morphosyntactic information: LAPS S 14.
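The LAPSI example can be replayed in a few lines. The dictionary contents below are toy data covering only this example, and the function name confirm_stems and the shape of the records are assumptions.

```python
# One toy dictionary per stem class; only LAPS in 14at is filled in.
STEM_DICTIONARIES = {
    "13at": {},
    "14at": {"laps": ("LAPS", "S", "14")},  # stem -> (lemma, part of speech, type)
    "36bn": {},
    "38ct": {},
}


def confirm_stems(candidates):
    """Keep only candidates whose stem is found in one of the stem
    dictionaries named by the formative.
    candidates: list of (stem, formative_code, stem_classes)."""
    confirmed = []
    for stem, code, stem_classes in candidates:
        for stem_class in stem_classes:
            entry = STEM_DICTIONARIES.get(stem_class, {}).get(stem)
            if entry is not None:
                lemma, pos, inflection_type = entry
                confirmed.append({
                    "stem": stem,
                    "code": code,
                    "stem_class": stem_class,
                    "lemma": lemma,
                    "pos": pos,
                    "type": inflection_type,
                })
    return confirmed


# The two candidates left in the working queue for LAPS+I.
lapsi_candidates = [
    ("laps", "------1p", ("13at", "14at")),
    ("laps", "-02031--", ("36bn", "38ct")),
]
analyses = confirm_stems(lapsi_candidates)
```

Only the plural partitive reading remains in analyses, with the lemma information LAPS S 14 attached.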
MODULE 6. Exceptions
Although it seems that the analyser has by now reached a conclusion about a word form, something still has to be checked. Suppose the input contains a word form that is produced by the rules of the morphological type description in question but is nonetheless incorrect. The correct form in such a case is an exception like
SÜDANT S 04 "sg part" noun (SÜDA) \heart\
The expected regular form would be SÜDAT which passes through all the previous modules undetected because the rules predict all words of this type (04) to have T added to the lemma-stem (04a0) in singular partitive (SIDE+T, ASE+T) and SÜDANT is the only exception to this rule. Now we have to run through the list of exceptions once more, this time trying to find if there exists an entry for the same word with the same morphological analysis. E.g. we will find that the proposed analysis for SÜDA + "sg part" is incorrect because the corresponding entry for SÜDA gives a different word form.
S 04 * SÜDA ------0P- SÜDANT
P 02 * MÕLEMA ------0N-
N 02 * TUHAT ------0P- &TUHAT
The last two entries tell us that MÕLEMA \both\ cannot be used in singular nominative (for semantic reasons) and that TUHAT \thousand\ has parallel forms in singular partitive: TUHANDET as the type description suggests and also TUHAT, which is an irregular form.
There is also another kind of exception - paradigmatic, where a whole branch of a paradigm (the so-called analogy group) can be considered irregular.
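The final check against the list of exceptions might be sketched as below. The three entries are the ones shown above; reading an empty form as "this form does not exist" and a leading & as "parallel to the regular form" follows the explanation in the text, while the function name and the dictionary layout are assumptions.

```python
# (lemma, morphological code) -> irregular form
# "" means the form does not exist; a leading "&" marks a parallel form.
EXCEPTIONS = {
    ("SÜDA", "------0P-"): "SÜDANT",
    ("MÕLEMA", "------0N-"): "",
    ("TUHAT", "------0P-"): "&TUHAT",
}


def survives_exception_check(lemma: str, code: str, word_form: str) -> bool:
    """Return False if the list of exceptions overrides the proposed analysis."""
    entry = EXCEPTIONS.get((lemma, code))
    if entry is None:
        return True   # no exception recorded, the regular form stands
    if entry == "":
        return False  # the form is not used at all
    if entry.startswith("&"):
        return True   # the regular form exists alongside the irregular one
    return word_form == entry  # only the listed irregular form is correct
```

Under this reading, the regular-looking SÜDAT is rejected as singular partitive of SÜDA, while TUHANDET passes because TUHAT is only a parallel form.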
These lines in the formative list show that the formative TAVAT behaves exactly like TUD (the stem classes permitted are the same) and that TUD (and therefore TAVAT) has exceptions in types 27, 28 and 32.
27* instructs to look up the list of exceptions and the analyser finds the following matching line
V 27 * AJAMA 411--0--- `AE[TUD
that gives the analyser all data about the nature of this exception. After encountering e.g. the word form AETAVAT, the analyser is able to find both the morphological description (-101-0-- describes TAVAT as quotative present impersonal) and the irregular stem variant of the verb AJAMA - AE which is usable in all word forms belonging to the analogy group of TUD-form (past participle impersonal: 411--0---).
All items in the output queue are now sorted and returned to the user; each presents one way to analyse the given word form. Deciding which one is correct requires information from outside 'morphology proper'. At present, the analyser recognises about 75% of all word forms in running newspaper text at a speed of about 1 second per word form. The remaining 25% consist of compound words (about 20%), proper names, neologisms, typos, etc. The speed will improve considerably when things settle down and the analyser is able to operate with binary dictionaries instead of text files. It is too early to predict whether we will achieve near-hundred-per-cent correctness in the near future when separate modules for compound words, word formation and stem changes plus type recognition are added, but getting only one (and correct) analysis for any word form seems to require large-scale statistical data and will not be implemented soon.