A MORPHOLOGICAL ANALYZER FOR THE ESTONIAN LANGUAGE:
the possibilities and impossibilities of automatic analysis

Ülle Viks

INTRODUCTION

At the Institute of the Estonian Language a project is under way to develop a morphological analyzer of the Estonian language. The analyzer is designed as a system of computer programs meant to find the initial form (lemma) and the grammatical meaning (i.e. morphosyntactical description) for any input word form.

This means that if we enter the word form vett the analyzer should be able to define it as the partitive singular of the word vesi: v`ett → VESI "sg part". If the input is a text, the output may represent the same text morphologically tagged: Ema <EMA \mother\ "sg ngp"> `aitas <`AITA[MA \to help\ "ind ipf sg 3"> lapsel <L`APS \child\ "sg ad"> jalule <J`ALG \foot\ "pl all"> t`õusta <T`ÕUS[MA \to get up\ "inf">.

If a word form has more than one morphological interpretation, the system will produce all possible analyses as the right choice among morphological homonyms requires additional information (syntactic or other):

jaluta → JALUTA[MA \to walk\ "imp pr sg 2" (Jaluta siiapoole!)

→ JALUTA[MA \to walk\ "ind pr neg" (Miks te ei jaluta minuga?)

→ J`ALG \foot\ "pl ab" (Käteta vehib, jaluta jookseb.)

In practical applications it often suffices to lemmatize, i.e. to find the initial form of the word. The lemma serves as a key to information contained in dictionaries. Recognition of the grammatical meaning is needed first of all to solve certain linguistic tasks.

The morphological analyzer as such has long ceased to be a novelty: to have one is a must for any language to be processed by computer. For many languages they do exist. Yet for us it is difficult to use any ready-made systems, as in most cases their object languages have a relatively simple morphology, whereas the Estonian language is morphologically rather complicated and rich in inflectional forms. On average every Estonian word has 33 simple forms (parallel forms included), apart from the analytic verb forms.

The best example to follow might be the Finnish language, for which several analyzers have been developed. The better-known of the implemented systems are KIMMO (Koskenniemi 1983) and MORFO (Jäppinen 1983, Jäppinen & Ylilammi 1986). Yet neither of them fits Estonian very well, as the problems facing computer analysis differ between the two languages. Vowel harmony, for example, so typical of the Finnish language, is absent from standard Estonian. Another feature specific to Finnish is its extensive use of possessive suffixes and clitic particles, which multiplies the number of word forms per stem dozens of times over as compared to Estonian. At the same time those numerous forms are easier to handle by computer, as the Finnish language is much more agglutinative than Estonian, i.e. the Finnish word forms are rather transparent: the morphemes are easily segmented and associated with appropriate grammatical meanings.

The Estonian word forms are considerably less transparent. Very often it is impossible to decide unambiguously whether we have a morphological formative or a stem element, cf. j`algu → J`ALG \foot\ "pl p" (j`alga + u) and h`algu → H`ALG \piece of firewood\ "sg p" (h`algu + 0).

Historical phonic changes have rendered the Estonian system of stem alternations quite complicated. Stems of one and the same phonological structure may be governed by quite different rules, cf. pidu : p`eo, sadu : saju, sodi : sodi, padi : padja : p`atja, lage : lageda, lagi : l`ae : lage : l`akke, puge[ma : p`oe[b, p`ood : p`oe, etc.

Several problems are due to the extensive morphological homonymy characteristic of the Estonian word forms (Viks 1984), e.g. p`oega → P`OEG \son\ "sg part" & P`OOD \shop\ "sg kom" or

v`eeta → VESI \water\ "sg ab" & V`EETMA \to spend\ "inf" & VEDAMA \to draw\ "ind pr ips (neg)".

The few programs that are available for the morphological analysis of Estonian, developed in Tartu University (Kaasik & Korjus 1960, Litvak & Roosmaa & Saluveer & Õim 1980, Litvak & Roosmaa & Saluveer 1983) and the Institute of Language and Literature (Hein 1990), have not been adopted into general use. Their common fault is insufficient consideration for linguistic regularities as compared to formal computer requirements. This inevitably limits their possibilities to those of closed systems bound to a more or less limited vocabulary, i.e. they can handle only those words whose lemmatic forms and stem variants are fixed in their lexicons. Texts, however, tend to contain words absent from any dictionary.

The ever increasing popularity of computers breeds a growing necessity for a morphological analyzer capable of handling as large a number and variety of texts as possible. Morphological analysis lies at the base of all programs of automatic text processing of the Estonian language. Next we should like to point out a few spheres in which a morphological analyzer is indispensable.

User systems

The most widely used systems are text editors. These require a spelling checker to decide whether a text word is a normal word of the language and whether it looks acceptable from the orthographical and morphological points of view. Newer editors have English spelling checkers built in (see Rummo 1993). For the Estonian language the first spelling checker has been developed recently (Kaalep 1993a, 1993b).

Although hyphenation routines are usually based on phonological information they may sometimes have to resort to the morphological analyzer as well, esp. with compounds, e.g. pool-aeg, not poo-laeg or esi-klaas, not esik-laas (but: esik-laps and ürg-laas).

More advanced checkers reach even beyond orthography and morphology, checking sentences against some syntactic rules and certain stylistic parameters (too frequent occurrence of a word, the appropriateness of a word in the particular type of text, etc.) and suggesting replacements. A good example is the MORFO-based VIRKKU system for the Finnish language (H. Jäppinen/Arnola).

A morphological analyzer is indispensable in various information systems. In information retrieval, for example, requests are usually presented in the lemmatic form (vesi) while texts contain several other forms of the word (vett, veed, vetes) and the analyzer has to work out the correct associations. Similar problems arise with automatic annotation and keyword recognition.

Computational linguistics

The requirements of user systems are usually limited to lemmatizing. Other applications, however, demand complete morphological analysis, i.e. recognition of the grammatical meaning of the word form as well. Such morphological analyzers are usually part of larger systems, their output serving as input for syntactic analysis. The systems of analysis, in turn, are part of still bigger systems that make it possible, for example, to translate texts from one language into another, or to hold a dialogue between computer and man (Koit 1987).

Linguistics

The most important domain of automatic morphology is nevertheless linguistics.

On the one hand there are certain linguistic problems the study of which can be based only on vast amounts of morphologically analyzed and tagged texts. Over the recent years corpus linguistics (problems of text choice, tagging principles, etc.) has developed into a separate branch of computational linguistics. For the bigger languages various special-purpose corpora have been produced. An Estonian text corpus is being developed at the Chair of General Linguistics, Tartu University (Õim 1991, Hennoste & Muischnek & Potter & Roosmaa 1993). A collection of texts has also been started at the Institute of the Estonian Language.

A text corpus would offer a better opportunity for studying the actual language usage, a field still rather poorly cultivated in Estonia. New prospects are opened up in the studying of a) grammar: the usage of word forms, phrases, and collocations; b) lexicography: frequency dictionaries, dictionaries of individual styles, authors, dialects, etc.; concordances instead of card files (see Langemets 1993) as well as c) textology and stylistics: grammatical and lexical peculiarities of different text types.

On the other hand a morphological analyzer is necessary to improve the description of the morphology itself. We still lack reliable data on the realization of grammatical categories and on the actual usage of concrete forms.

The morphology parts of every Estonian grammar are, without exception, synthesis-oriented. They provide rules for the formation of inflectional forms, but say next to nothing about the usage of those forms. In actual usage one word form usually dominates over its parallel forms. Of some words only the singular, or plural, or just a couple of concrete forms are used. Some forms occur only in certain fixed word combinations, etc. This kind of information can be obtained only through the analysis of large text corpora. In turn, the availability of such information may have positive repercussions on the quality of automatic analysis.

STRATEGIES OF ANALYSIS

The different strategies underlying morphological analysis are based on the following properties of morphological units and their relationships:

-- integrity of word forms,

-- segmental structure of word forms,

-- variability of units,

-- regularity/irregularity of relations.

The following discussion sets out to evaluate some strategies from the point of view of their suitability for automatic morphological analysis of the Estonian language. No strategy should be considered exclusive of the others; on the contrary, the preceding strategies may always be incorporated into the following ones. The concrete choice of a strategy depends, of course, on the practical purpose of the analysis, yet even more it depends on the peculiarities of the language to be analyzed.

1. Searching

The strategy of searching requires that the word form as a text unit be treated as a whole. The computer stores a huge dictionary (or lexicon) providing all word forms with the necessary information (lemma, part of speech, grammatical meanings, etc.). To analyze a word form means simply to look up the word form in the dictionary and retrieve the information appended to it. Such analysis might fit languages "without morphology", i.e. morphologically simple languages in which grammatical meaning is mostly expressed by syntactic means. As Estonian does not belong to such languages, the strategy of searching can fulfil only an auxiliary function.
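
The searching strategy can be illustrated with a minimal sketch. The lexicon below is a hypothetical two-entry fragment built from the examples given earlier (vett, jaluta); the names FORM_LEXICON and analyze_by_search are assumptions introduced only for illustration, not part of any existing system.

```python
# Hypothetical full-form lexicon: every word form is stored whole, together
# with all of its analyses as (lemma, gloss, grammatical meaning) triples.
FORM_LEXICON = {
    "vett": [("VESI", "water", "sg part")],
    "jaluta": [
        ("JALUTA[MA", "to walk", "imp pr sg 2"),
        ("JALUTA[MA", "to walk", "ind pr neg"),
        ("J`ALG", "foot", "pl ab"),
    ],
}


def analyze_by_search(word_form):
    """Return every stored analysis of word_form, or an empty list if unknown."""
    return FORM_LEXICON.get(word_form.lower(), [])


for lemma, gloss, meaning in analyze_by_search("Jaluta"):
    print(lemma, gloss, meaning)
```

A form missing from the lexicon (e.g. veed) simply yields no analysis, which shows why a full-form dictionary of a richly inflecting language would have to be very large to be useful.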

2. Segmentation

2.0. Segmentation is the basic strategy of automatic analysis. This assumes that a word form consists of certain smaller linguistic units with a certain lexical or grammatical meaning and a certain phonological shape. To analyze a word form means to segment it into units and to find those units in lexicons. The lexicons provide every unit with appropriate information from which the output analysis of the word form in question can be generated.

2.1. The depth of segmentation may differ according to what kind of segments are aimed at.

The minimal morphological segmentation divides a word form into two parts: word form → stem + formative. The stem carries the lexical meaning of the word form, whereas the formative carries the whole complex of grammatical meanings:

hammas[tega → HAMMAS \tooth\ + "pl kom",

h`amba[id → HAMMAS + "pl p",

h`amba[ga → HAMMAS + "sg kom".

A more detailed morphological segmentation divides the word form into morphemes: word form → stem morpheme + grammatical morpheme(s). The stem remains as it is, but the formative is segmented into morphemes according to distinct grammatical meanings:

hammas[te[ga → HAMMAS + "pl" + "kom",

h`amba[ga → HAMMAS + "kom".

Derivational segmentation is applied to the stem: stem → root + derivational affix(es). The root carries the basic lexical meaning of the stem, whereas the derivational affixes (in the Estonian language mostly suffixes) modify the lexical meaning of the root and may also affect the category (part of speech) of the word, cf.

kala \fish\ "noun" -- kala|ke "diminutive noun",

palu[ma \to ask\ "verb" -- palu|mine \asking\ "noun", palu|ja \one who asks\ "noun".

Compound segmentation divides a compound word form as follows: compound word → attributive word(s) + base word. The resulting components (each carrying a lexical meaning) are one base word and one or more attributive words (k`aitse+vägi, õhu+k`aitse+s`uur+tüki+vägi). The component words may further be segmented both derivationally and morphologically (p`ea[ta+ole|ku[st, `uuri|mis+`andme|st`ik). The attributive component of a compound word, as a rule, remains uninflected in different inflectional forms of the word (k`aitse+vägi : k`aitse+v`äe[s : k`aitse+väge[dega, p`ea[ta+ole|k : p`ea[ta+ole|ku[id), with a few exceptions (v`aene+l`aps : v`aese[st+lapse[st).

The different depth levels of segmentation correspond to the traditional division of grammar into morphology, derivation, and compounding.

2.2. The segmentation of a word form is followed by a searching of lexicons for the segments. Usually there is one dictionary for the lexical units (lexemes) and several dictionaries (or lists) for the grammatical units (grammemes). The number and contents of the lexicons used depend, above all, on the depth of segmentation.

Morphological segmentation requires a big dictionary of stems, including roots, derived word stems, and compound words (kala, kala|ke, palu([ma), palu|mine, abi+palu|ja, abi[ks+ole|k). Lists of grammemes contain morphological formatives or morphemes (for detailed segmentation).

Derivational segmentation enables one to decrease the volume of the dictionary of stems, increasing at the same time its capacity. If a morphological analyzer can recognize such universal Estonian suffixes as the verbal-noun-producing mine (palu([ma) → palu|mine) or the diminutive ke (kala → kala|ke) the need to keep those derivatives in the stem dictionary is eliminated. As a result the stem dictionary will rather approach a root dictionary (palu([ma), kala) while the derivational suffixes together with the attached information make up a separate list of grammemes (mine "noun ...", ke "noun dim ...").

Compound segmentation makes for an even greater economy in dictionary volume as the proportion of compound words in Estonian is relatively large. About two thirds of the Orthological Dictionary (ÕS 1976) is filled with compound words, in newspaper texts nearly every fifth word is a compound. If the morphological analyzer can recognize the boundaries of the components, the stem dictionary is relieved of those compounds the components of which are included in the dictionary as separate words. Every such component word can be subjected to further morphological analysis like any ordinary word form (abi, palu|ja, abi[ks, ole|k). The stem dictionary will then contain either roots or word stems (depending on the availability of derivational segmentation).

Analysis can be started either from the beginning or from the end of the word form. With suffixing languages right-to-left analysis is more common, as the lists of grammemes, being smaller than stem dictionaries, permit quicker retrieval. The part not found in any of the grammeme lists is finally looked for in the stem dictionary. Left-to-right analysis starts with a search in the stem dictionary (if there are no prefixes, of course), as the stem is the only obligatory unit present in a word form.
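
Right-to-left minimal segmentation (stem + formative) might be sketched as follows. The two lexicons are tiny hypothetical fragments based on the HAMMAS examples above; input is assumed to be in plain orthography, so the quantity mark ` is omitted, and the names STEMS, FORMATIVES and segment_right_to_left are introduced here for illustration only.

```python
# Hypothetical lexicons (a tiny illustrative subset).
STEMS = {"hammas": "HAMMAS", "hamba": "HAMMAS"}   # stem variant -> lemma
FORMATIVES = {"tega": "pl kom", "id": "pl p", "ga": "sg kom"}


def segment_right_to_left(word_form):
    """Try ever longer endings from the right edge; keep splits whose ending is
    a known formative and whose remainder is a known stem variant."""
    analyses = []
    for cut in range(len(word_form) - 1, 0, -1):
        stem, ending = word_form[:cut], word_form[cut:]
        # The small formative list is consulted first, the stem dictionary last.
        if ending in FORMATIVES and stem in STEMS:
            analyses.append((STEMS[stem], FORMATIVES[ending]))
    return analyses


for form in ("hammastega", "hambaid", "hambaga"):
    print(form, segment_right_to_left(form))
```

Note that this sketch also accepts the incorrect form *hammasga, because nothing in it checks whether a given stem variant may combine with a given formative variant.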

2.3. The segmentation of a word form often produces many candidates for units and, as a result, many different analyses, erroneous ones included. The right choice requires that the possibilities be checked against rules of morphotactics governing the combinability of units.

2.3.1. Rules of morphotactics define what units in what order and on what conditions may be combined in one and the same word form, i.e. they describe the inner morphological structure of a word form. In the Estonian language, for example, the typical morphological structure of a simple noun is STEM + NUMBER + CASE. As all units except stem may also occur without phonological expression there are practically four morphotactical combinations: STEM + NUMBER + CASE (`aasta[te[ga), STEM + CASE (`aasta[ga), STEM + NUMBER (`aasta[te), STEM (`aasta). The number of combinations will double if we also consider the clitic particle gi/ki that may be added to almost any word form. The morphotactics of the verb is a little more complex as it involves more grammatical categories.

In the case of derivational segmentation morphotactical rules do not describe only the purely morphological structure of a word form, but also the derivational structure of the stem by prescribing what suffixes may follow a root and in what order. This, however, does not work very well for automatic analysis, as stem formation is not too regular. What affix is used tends to depend rather on lexical meanings or some indefinable circumstances (see Vare 1993) not easily handled by computer. The only promising sphere is the so-called paradigmatic derivation (Viks 1992a: 58--60) in which the suffix does not affect the lexical meaning of the stem but mainly its grammatical characteristics (e.g. part of speech). Paradigmatic derivation involves a score of suffixes (-mine, -ja, -v, -m "komp" a.o.) that are most productive and morphologically rather regular. Many derived stems, however, remain unsegmented in automatic analysis.

The morphotactics of the Estonian compound words is a rather intricate domain of which relatively little is known as yet. Most compounds consist of two or three components, but five-component compounds are possible as well, e.g. õhu+k`aitse+s`uur+tüki+vägi, `all+m`aa+r`aud+t`ee+j`aam, põlev+kivi+t`uhk+side+aine. Most components are also known as separate words, e.g. kl`aas+k`uul, but a compound may also consist of bound components only, e.g. lühi+nägel`ik. As a rule, only the base word is inflected, but sometimes this happens also to the attributive part, e.g. k`uus+sada : kuue+saja[le, `emb+k`umb : emma[st+kumma[st, s`ee+sama : selle[le+sama[le. Usually the attributive component does not take any formatives (r`aud+t`ee, raua+m`aak) or appears in a shortened form (lühi+rom`aan, v`alg+ala), but in principle neither morphological formatives nor derivative suffixes are excluded (ve[te[l+pääste, l`aul|mis+t`und, vali|ja+m`ees).

The formation and morphotactics of the compounds are not so easy to formalize as the combinatorial properties of components depend more on their lexical meanings than on their forms. If rules of compounding are applied without semantic restrictions they allow too much freedom and many spurious analyses remain undiscovered. Neither is there a clear dividing line between derivation and compounding, cf. perenaisel`ik (pere+naine | l`ik) and ebanaisel`ik (eba + naise|l`ik). As the Estonian compound formation involves morphological, derivative, syntactic and semantic factors, the mechanisms underlying the morphotactics of compound words may differ a great deal (v. also Kerge 1990).

2.3.2. There are several ways to formalize the rules of morphotactics. In the case of right-to-left analysis the means of formal grammars are usually applied. The rules of the formal grammar must be able to generate all acceptable morpheme combinations (structures). The analysis of a concrete word form is considered a success if the assumed structure of the word form belongs to those allowed by the grammar used (Kaalep 1993a).

In the case of left-to-right analysis it is more common that the morphotactic rules are dealt with in the dictionary. This approach has been realized in the two-level morphology of the Finnish language (Koskenniemi 1983: 43--69) and also in the morphological analyzer of Estonian (Hein 1990). All units of the morphological system are divided among minilexicons according to their meaning and morphotactic features. Separate minilexicons are formed, e.g. for noun stems, derivative suffixes, plural endings, etc. Beside other information every unit is also provided with data on its continuation classes, i.e. those lexicons the units of which may follow the given unit. A noun stem (`aasta), for example, has such continuation classes as plural (`aasta[te), case (`aasta[ga), or end of word form (`aasta). The continuation classes for plural are either case (`aasta[te[ga) or end of word form (`aasta[te, `aasta[d). The analysis of a concrete word form is considered a success if a sequence of permissible continuation classes can be ascertained.
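
The continuation-class mechanism can be sketched in a few lines. The minilexicons below are a hypothetical fragment covering only the `aasta examples (plain orthography, without the quantity mark); the class names, the "END" marker and the function analyze are assumptions made for this illustration and do not reproduce the lexicon format of any of the cited systems.

```python
# Hypothetical minilexicons: unit -> (label, continuation classes).
# "END" marks a permitted end of the word form.
LEXICONS = {
    "NOUN_STEM": {"aasta": ("AASTA", ["PLURAL", "CASE", "END"])},
    "PLURAL": {"te": ("pl", ["CASE", "END"]), "d": ("pl", ["END"])},
    "CASE": {"ga": ("kom", ["END"])},
}


def analyze(rest, classes=("NOUN_STEM",)):
    """Return all label sequences that consume `rest` along permitted classes."""
    results = []
    for cls in classes:
        if cls == "END":
            if rest == "":
                results.append([])          # word form ended where it may end
            continue
        for unit, (label, continuations) in LEXICONS[cls].items():
            if rest.startswith(unit):
                for tail in analyze(rest[len(unit):], continuations):
                    results.append([label] + tail)
    return results


for form in ("aastatega", "aastaga", "aastate", "aastad", "aasta", "aastadga"):
    print(form, analyze(form))
```

Because every unit carries its own continuation classes, the plural ending d may close a word form but cannot be followed by a case ending, so *aastadga is rejected.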

Continuation classes are good to use with linear structures (like a simple inflectional form), but compound words are not so convenient to handle as there a case ending can be followed by still another stem (halva[ks+panu). Here a little more flexibility is found in formal grammars.

2.4. The strategy of segmentation is effective in the case of agglutinative languages with their rich inflection. The separation of stems and grammemes into different lexicons makes for a considerable economy in dictionary volume (as compared to a dictionary of word forms). As for the Estonian language segmentation suits it quite well in some parts, whereas in some other parts it may lead to solutions that are linguistically poorly motivated.

2.4.1. Sometimes problems may arise with segmentation as well. Unit boundaries need not be unambiguously recognizable, no matter what the level, e.g.:

täisarv: t`äis+`arv \whole number\ or t`äi+s`arv \horn of louse\

nimekaim: nime+k`aim \namesake\ or nime|ka|im \the most well-known\

elataks: ela|ta[ks → ELA|TA[MA \to maintain\ "knd pr sg 3" or

ela[ta[ks → ELA[MA \to live\ "knd pr ips"

jooksin: j`ooks[in → J`OOKS[MA \to run\ "ind ipf sg 1" or

j`oo[ksi[n → J`OO[MA \to drink\ "knd pr sg 1"

And even if segmentation is unambiguous, the exact place of the boundary may be hard to ascertain, either because of a fusion of the stem ending and formative beginning (jala + ul → jal[ul or jalu[l), or for orthographic reasons (e.g. kä[tega, instead of kät[tega with morpheme boundary in the middle of the stop). This leads to linguistically artificial structures that, even if they have no adverse effect on the results of analysis, may impose limitations on the further development and use of the system.
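
The ambiguity of compound boundaries illustrated above by täisarv can be reproduced with a simple exhaustive splitter. This is only a sketch under the assumption that every component is listed in a lexicon as a plain orthographic string; the function split_compound and the four-word LEXICON are hypothetical.

```python
def split_compound(word, lexicon):
    """Return every way of splitting `word` into components found in `lexicon`."""
    if word == "":
        return [[]]
    splits = []
    for cut in range(1, len(word) + 1):
        head = word[:cut]
        if head in lexicon:
            # Recursively split the remainder after each known component.
            for tail in split_compound(word[cut:], lexicon):
                splits.append([head] + tail)
    return splits


LEXICON = {"täis", "arv", "täi", "sarv"}
print(split_compound("täisarv", LEXICON))
```

Both t`äi+s`arv and t`äis+`arv are returned. Without semantic restrictions the splitter has no means of preferring one over the other.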

2.4.2. More serious problems are due to variation of morphological units. Segmentational analysis requires that in the dictionary of stems each variant of a stem be represented as a separate unit of the lexicon. Correct association is guaranteed by providing every variant with a reference to the lemma: hammas >HAMMAS, h`amba >HAMMAS. Likewise, every variant of a grammatical unit (formative, allomorph) yields a separate unit of the lexicon: de "pl g", te "pl g".

As variation is characteristic of all morphological units, stems as well as grammemes, the checking of the results of analysis requires rules reaching beyond unit level. Otherwise, although the recognized units may be joined in a morphotactically correct structure, the variants of the units need not fit together at all. E.g. the structure STEM + "pl g" is normal in every respect. The word LIIGE \member\ has two stem variants: liige and l`iikme, while the genitive plural may be formed either by te or de. Yet of the four possible combinations only one -- l`iikme[te is correct for this word, not *l`iikme[de (cf. k`iike[de), *liige[te (cf. hõige[te), or *liige[de (cf. kolge[de).

Consequently the right choice presupposes checking the mutual compatibility of the variants. The rules describing the conditions of the choice and concurrence of unit variants (allomorphs) in a word form could be called rules of allotactics, i.e. rules for allomorph combinations (cf. morphotactics -- rules for morpheme combinations).

It is not easy to formalize the rules of allotactics in the framework of segmentational strategy. Usually they are not considered separately at all but together with morphotactic ones. For left-to-right analysis every stem variant in the lexicon is provided with a list of allomorphs or formative variants that can be taken by this particular stem variant:

liige >LIIGE: 0 "sg n", t "sg p"

l`iikme >LIIGE: 0 "sg g", sse "sg ill",...; te "pl g", tesse "pl ill",...

For the sake of economy the list is usually replaced by a continuation class code referring to the minilexicon in which the necessary set of formative variants can be found (Kaalep 1993a, Hein 1990).
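
A minimal sketch of such a left-to-right allotactic check, using the LIIGE entries listed above, might look as follows. The empty string stands for the zero formative, the quantity mark is omitted because the input is assumed to be in plain orthography, and the names STEM_VARIANTS and analyze_allotactic are hypothetical.

```python
# Stem variant -> (lemma, {formative variant: grammatical meaning}).
# The empty string stands for the zero formative (0).
STEM_VARIANTS = {
    "liige": ("LIIGE", {"": "sg n", "t": "sg p"}),
    "liikme": ("LIIGE", {"": "sg g", "sse": "sg ill",
                         "te": "pl g", "tesse": "pl ill"}),
}


def analyze_allotactic(word_form):
    """Accept only those stem + formative splits that the stem variant allows."""
    analyses = []
    for stem, (lemma, formatives) in STEM_VARIANTS.items():
        if word_form.startswith(stem):
            ending = word_form[len(stem):]
            if ending in formatives:
                analyses.append((lemma, formatives[ending]))
    return analyses


for form in ("liikmete", "liikmede", "liigete", "liigede"):
    print(form, analyze_allotactic(form))
```

Of the four candidate genitive plurals only liikmete receives an analysis; the other three are rejected because the required formative variant is not listed with the stem variant in question.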

For right-to-left analysis in which the first step is to search a grammatical unit, allotactic information should, in principle, be attached to allomorphs or formative variants:

te "pl g": l`iikme >LIIGE, hõige >HÕIGE, ...

de "pl g": k`iike >K`IIK, kolge >KOLGE, ...

This means that the stem variants must previously be divided among minilexicons according to their combinability with formative variants.

The joining of allotactic rules with morphotactic ones is not recommended from the linguistic point of view as this leads to a disintegration of morphotactic minilexicons based on grammatical meanings into numerous subminilexicons, that have lost their connection with grammatical categories. As a result, case endings will find themselves divided between different minilexicons, while the division is different for different words, cf. the above example of LIIGE and the following one of J`ALG \foot\:

j`alg >J`ALG: 0 "sg n"

jala >J`ALG: 0 "sg g", sse "sg ill", ... d "pl n"

j`alga >J`ALG: 0 "sg p", 0 "sg adt", de "pl g", desse "pl ill", ...

It would be even less acceptable to divide the variants of one stem between several minilexicons.

3. Transformation

3.0. The strategy of transformation can complement segmentational analysis wherever we have variation of units. This is applicable if morphological units appear in word forms as different variants that have one and the same lexical or grammatical meaning, but different phonological shapes. In order to analyze a word form it is first necessary to segment it into units which will then have to be transformed into the shape of their initial forms to be, in turn, searched for in dictionaries. The result of the analysis is generated from information attached to the initial forms.

3.1. In morphological analysis it is not possible to consider unit variation on the level of an individual unit; instead, the word form must be treated as a member of a paradigm. The paradigmatic approach serves as the basis for the model of classificatory morphology, presented in the most compact way in Viks 1992a and Viks 1994.

A paradigm represents an ordered set of the inflectional forms of a word, in which every word form has a fixed position determined by its grammatical meanings. The stem is the stable part of the paradigm, retaining its lexical meaning throughout the whole paradigm. The variable elements are morphological formatives, each of which has its own grammatical meaning corresponding to its position in the paradigm.

The variability of the stem appears within the paradigm, i.e. it is revealed if we compare different inflectional forms of the same word. The variability of the formatives, on the contrary, is revealed interparadigmatically, i.e. if we compare one and the same inflectional form across different words.

Stem variation could be considered from two aspects. One is the phonological relationship between the variants (e.g. s`alga[ma : sala[ta \to deny\), i.e. stem changes describable by rules of transformation (g → 0). There is no direct link between a stem change and a concrete inflection. The rules are similar for noun and verb morphology, as well as for derivation, cf. s`alga[ma \to deny\ : sala[ta, j`alg \foot\ : jala : j`alga; j`alg \foot\ -- jala|ts \footwear\ -- jalu|ta[ma \to walk\.

The other aspect concerns the allotactic properties of the stem, i.e. the association of stem variants with certain paradigmatic positions (what stem variants are used with what inflections), cf.

"sg n : sg g : sg p"         j`alg : jala : j`alga \foot\
                             pale : p`alge : pale[t \cheek\

"sup : inf : ind pr sg 3"    s`alga[ma : sala[ta : s`alga[b \to deny\
                             s`ulge[ma : s`ulge[da : sule[b \to close\

This association can be described by alternation patterns.

3.2. The inclusion of transformational rules in the system does not change the principal strategy of analysis; it only complements it. The lexicons used are the same: separate dictionaries for lexical and grammatical units. The difference lies in that the stem dictionary has one stem for every word, even if different stem variants occur in a paradigm. The rules of transformation provide a linkage between the variants, making it possible to transform one stem variant into another.

3.2.1. As the Estonian system of stem variation is rather sophisticated its adequate description requires different kinds of transformational rules. The principal types of changes are the following:

a) stem-grade changes in which the stem is either in a strong or a weak grade; the grades are differentiated first of all by phonetic quantity (2nd or 3rd degree of quantity) that may be accompanied by various sound changes such as gemination of stops, assimilation, loss of sound, etc.: t`elli[ma - telli[b (3-2), p`aika[ma - paiga[ta (3-2, k-g), kr`unti[ma - krundi[b (3-2, t-d), vanne - v`ande (2-3, nn-nd), puue - p`uude (2-3, 0-d).

b) stem-end changes in which the stem appears either as a lemmatic stem or an inflection stem, being subjected to such sound changes as apocope or epenthesis, sound exchange, etc.: v`eok - v`eoki (0-i), m`under - m`undri (er-ri), soolane - soolase (ne-se), sipelgas - sipelga (s-0), paigas - paigase (0-e).

Although in the above examples stem-grade and stem-end changes occur separately, it is more usual for them to occur simultaneously in one and the same word: p`aik - paiga - p`aika (3-2-3, k-g-k, 0-a), kaigas - k`aika (2-3, g-k, s-0), pundar - p`untra (2-3, d-t, ar-ra), kannel - k`andle (2-3, nn-nd, el-le), suue - s`uudme (2-3, 0-d, e-me).

c) In addition to stem-grade and stem-end changes, changes on the boundary of stem and formative may be conditioned by the morphonological properties of concrete units. Such stem and formative changes are described by rules of morphonological distribution. Examples:

-- a long vowel of stem is shortened if preceding another vowel: id`ee + id → id`e[id

-- a stem vowel fuses with the plural vowel: j`alga + V → j`alg[u

-- an epenthetic vowel is inserted between consonants: n`aer + v → n`aer[ev

3.2.2. The use of transformation rules brings forward the problem of what the dictionary form of the word should be.

In systems following the spirit of generative grammar (e.g. Koskenniemi 1983: 69--82) the stem included in the lexicon is a deep structure (underlying form, lexical representation) that need not coincide with any stem shape ever met in texts. The deep structure may contain archiphonemes or morphophonemes that are conveniently transformed into any sequence of a real stem variant (surface form, phonological representation), e.g. the Finnish kaTo → kato and kado[n, katTo → katto and kato[n. The stem variants are related to each other through the deep structure. As to the grammatical units they can also be reduced to invariant underlying forms, e.g. the Finnish llA "sg ad" → lla (kato[lla) and llä (tytö[llä).

Another option is for the dictionary to include one of the actual stem variants, together with rules that transform the entry stem into the other possible variants, e.g. the Estonian `oota[ma → ooda[ta. In this case the stem variants are related to each other directly, without an intermediary form. If the stem variant presented in the dictionary of the morphological analyzer is lemmatic, access to other dictionaries is direct. The lexicon variants of grammatical units, however, remain separated in most cases.
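
Deriving a weak-grade variant directly from a strong-grade entry stem can be sketched as below. The sketch assumes that the sign ` marks the third degree of quantity (as the 3-2 examples in 3.2.1 suggest) and that a sound change can be encoded as an (old, new) pair applied to the last occurrence of the old sound. Both the encoding and the function name weaken are hypothetical simplifications, not the rule formalism of the model itself.

```python
def weaken(strong_stem, sound_change=None):
    """Derive a weak-grade stem variant from a strong-grade entry stem.

    strong_stem  -- entry stem, with ` marking the third degree of quantity
    sound_change -- optional (old, new) pair applied to the last occurrence of old
    """
    stem = strong_stem.replace("`", "")        # quantity change 3-2
    if sound_change:
        old, new = sound_change
        head, sep, tail = stem.rpartition(old)
        if sep:                                # the old sound was found
            stem = head + new + tail
    return stem


print(weaken("t`elli"))                 # quantity change only
print(weaken("p`aika", ("k", "g")))     # 3-2, k-g
print(weaken("kr`unti", ("t", "d")))    # 3-2, t-d
print(weaken("`oota", ("t", "d")))      # 3-2, t-d
```

In analysis the same rules would have to be applied in reverse, or the generated variants compared with the segmented stem. Irregular changes would still need to be listed individually.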

3.2.3. Rules of transformation provide for effective analysis of languages in which the variation of morphological units is regular and can be associated with certain morphonological conditions. The rules make it possible to avoid repetitions and cross-references between variants, and to present units in the lexicon proceeding from their meanings, which means considerable economy in dictionary volume.

With certain reservations it may be stated that the strategy of transformation suits the Estonian language quite well. Problems tend to arise wherever variation is not phonologically conditioned. On the one hand a stem change may be irregular, as in `aeda - aia pro aja (cf. `aega - aja) or k`üt[ma - k`öe[takse. On the other hand the phonological structure of the stem need not always suffice to trigger a particular rule. Cf. for example, s`eedi[ma - s`eedi[b, but kr`aadi[ma - kraadi[b, pr`aadi[ma - pr`ae[b; sodi - sodi - sodi, but lodi - lodja - l`otja.

3.3.1. The allotactic properties of stems, which determine the use of stem variants in different inflectional forms, can be described by alternation patterns. According to the kind of stem change there are two kinds of alternation patterns:

a) the stem-grade alternation pattern defines the paradigmatic positions of the strong (T) and the weak (N) grade, cf.

sup          inf          ind pr sg 3    pts pt ips
h`inda[ma    hinna[ta     h`inda[b       hinna[tud     (TNTN)
s`undi[ma    s`undi[da    sunni[b        sunni[tud     (TTNN)

sg n     sg g      sg p       pl n        pl p
huige    h`uike    huige[t    h`uike[d    h`uike[id     (NTNTT)
l`uik    luige     l`uike     luige[d     l`uike[sid    (TNTNT) (ABBBB)

b) the stem-end alternation pattern defines the paradigmatic positions of the lemmatic stem (A) and the inflection stems (B, C), cf.

sg n       sg g       sg p         pl g
r`audne    r`audse    r`audse[t    r`audse[te    (ABBB)
ase        aseme      ase[t        aseme[te      (ABAB)
soolane    soolase    soolas[t     soolas[te     (ABCC)
hammas     h`amba     hammas[t     hammas[te     (ABAA) (NTNN)
l`uik      luige      l`uike       l`uike[de     (ABBB) (TNTT)
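As a rough illustration of how the two kinds of pattern combine, the following Python sketch stores both pattern strings for a word and uses them to pick the stem variant for a paradigmatic position. The dictionary layout and the name stem_for are assumptions made for the example; the data come from the last two rows of the table above.

```python
# Paradigmatic positions, in the order used by the pattern strings.
SLOTS = ("sg n", "sg g", "sg p", "pl g")

# Each word carries a stem-end pattern, a stem-grade pattern and the
# stem variants indexed by (stem-end, stem-grade).
LUIK = {
    "end": "ABBB",
    "grade": "TNTT",
    "variants": {("A", "T"): "l`uik", ("B", "N"): "luige", ("B", "T"): "l`uike"},
}
HAMMAS = {
    "end": "ABAA",
    "grade": "NTNN",
    "variants": {("A", "N"): "hammas", ("B", "T"): "h`amba"},
}


def stem_for(word: dict, slot: str) -> str:
    """Return the stem variant that the two patterns select for a slot."""
    i = SLOTS.index(slot)
    return word["variants"][(word["end"][i], word["grade"][i])]


print([stem_for(LUIK, s) for s in SLOTS])
print([stem_for(HAMMAS, s) for s in SLOTS])
```

The formatives are left out here on purpose, since the alternation patterns only determine which stem variant appears in which position.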

3.3.2. In terms of language history the Estonian stem alternation patterns have largely been conditioned by morphonological phenomena: stem-grade alternation developed under the influence of the syllable structure of word forms. Notably, in two-syllable stems a closed second syllable caused the development of the weak grade, whereas an open second syllable supported the formation of the strong grade. Stem-end alternation was more closely connected with certain concrete inflectional forms. Notably, two-stem words had a vowel-final stem in some forms, whereas in others the stem ended in a consonant. The current state of the language is the result of numerous historical sound changes (cases of apocope, syncope, etc.) that may have masked the underlying factors of stem alternation, but not the patterns. This is why stem alternation patterns have to a great extent become idiosyncratic to individual words.

Likewise, the formative variants were once morphonologically conditioned, depending on the position of the formative in the syllable structure of the word, the stress status of the syllable and the preceding sounds. Historical developments, however, have masked this dependence as well, so that the choice of formative variant now depends largely on the individual word, cf. norse[te, morse[de.

3.4. As the allotactic properties of both the Estonian stems and formatives often depend on concrete words it is only natural to find classification used in morphological descriptions as well as in traditional dictionaries. In dictionaries an entry word is followed by a type number referring to a type description that provides usage information on both the stem and the formative variants for words of the given type.

3.4.1. Morphological classification is the most economical way to present allotactic rules. The classification may be more or less detailed according to the number of distinctive features lying at its base. The dictionary by E. Muuk (1937), for example, differentiates between 845 types, the Orthological Dictionary (ÕS 1976) has 115 types and A Concise Morphological Dictionary of Estonian (MDE) has 38 (Viks 1992). The MDE classification presents only allotactic combinations, classifying the words by the following three distinctive features:

-- stem-grade alternation pattern,

-- stem-end alternation pattern,

-- set of formative variants in the paradigm.

Stem changes are not taken into account as MDE provides each entry word with all possible stem variants. The inclusion of grade and end changes of stems would certainly increase the number of types, while the extent of the growth would depend on the level of abstraction the stem changes are treated on, i.e. whether the changes are viewed on the level of a class of rules or on the level of concrete sounds.

The great number of types distinguished in the Orthological Dictionary, and particularly in Muuk's dictionary, is the result of their classifications considering stem changes as well.

3.4.2. The description of allotactic rules through morphological classification provides for a most economical presentation of dictionary information. In the stem dictionary every word has a type number referring to a type description, in which the basic forms of the paradigm are presented so that all variant combinations characteristic of the type are fixed.

In an MDE type description every basic form is represented by a combined marker in which each element stands for a variable phenomenon:

-- stem-end variant: lemmatic stem (A), inflection stems (B, C..)

-- stem-grade variant: strong stem (T), weak stem (N)

-- variant of formative

E.g.

sg n:   AN      liige          AN      hõige        AT       k`iik
sg g:   BT      l`iikme        AT      h`õike       BN       kiige
sg p:   ANt     liige[t        ANt     hõige[t      BT       k`iike
pl g:   BTte    l`iikme[te     ANte    hõige[te     BTde     k`iike[de
pl p:   BTid    l`iikme[id     ATid    h`õike[id    BTsid    k`iike[sid
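A combined marker of this kind is easy to decode mechanically. The Python sketch below assumes, for illustration only, that a marker is one stem-end letter, one stem-grade letter and an optional formative, and that the stem variants of a word are indexed by the first two letters; the names parse_marker and basic_forms are hypothetical.

```python
def parse_marker(marker: str) -> tuple[str, str, str]:
    """Split a combined marker such as 'BTte' into (stem-end, grade, formative)."""
    return marker[0], marker[1], marker[2:]


# Type descriptions copied from the liige and k`iik columns of the table.
TYPE_LIIGE = {"sg n": "AN", "sg g": "BT", "sg p": "ANt", "pl g": "BTte", "pl p": "BTid"}
TYPE_KIIK = {"sg n": "AT", "sg g": "BN", "sg p": "BT", "pl g": "BTde", "pl p": "BTsid"}

# Stem variants of the two words, indexed by (stem-end, grade).
LIIGE = {("A", "N"): "liige", ("B", "T"): "l`iikme"}
KIIK = {("A", "T"): "k`iik", ("B", "N"): "kiige", ("B", "T"): "k`iike"}


def basic_forms(type_description: dict, variants: dict) -> dict:
    """Build the basic forms of a paradigm from its type description."""
    forms = {}
    for slot, marker in type_description.items():
        end, grade, formative = parse_marker(marker)
        stem = variants[(end, grade)]
        forms[slot] = stem + "[" + formative if formative else stem
    return forms


print(basic_forms(TYPE_LIIGE, LIIGE))
print(basic_forms(TYPE_KIIK, KIIK))
```

The same type description can be reused for every word of the type; only the table of stem variants changes from word to word.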

3.4.3. The remaining members of the paradigm fall into groups of analogy that belong to the basic forms. Within an analogy group all inflectional forms are characterized by one and the same allotactic configuration: the same stem variant as the basic form has and formative variants all deducible by a rule of analogy. E.g. the analogy group of genitive plural includes all the other forms of plural beginning with the illative case: pl g → pl ill, pl in, pl el, pl all, pl ad, pl abl, pl tr, pl ter, pl es, pl ab, pl kom

l`iikme[te → l`iikme[tesse, l`iikme[tes, ...

hõige[te → hõige[tesse, hõige[tes, ...

k`iike[de → k`iike[desse, k`iike[des, ...
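The rule of analogy for the genitive plural group can be sketched in Python as a table of suffixes appended to the basic form. The text only spells out the first two members (sse and s); the remaining suffixes are the ordinary Estonian case endings, supplied here as an assumption for the sake of a complete example, and the names ANALOGY_PL_G and expand_pl_g are hypothetical.

```python
# Suffix added to the genitive plural form for each member of its analogy group.
# Only "sse" and "s" are quoted in the text; the rest are the standard case endings.
ANALOGY_PL_G = {
    "pl ill": "sse",
    "pl in": "s",
    "pl el": "st",
    "pl all": "le",
    "pl ad": "l",
    "pl abl": "lt",
    "pl tr": "ks",
    "pl ter": "ni",
    "pl es": "na",
    "pl ab": "ta",
    "pl kom": "ga",
}


def expand_pl_g(pl_g_form: str) -> dict:
    """Derive every form of the analogy group from the genitive plural."""
    return {tag: pl_g_form + suffix for tag, suffix in ANALOGY_PL_G.items()}


for basic in ("l`iikme[te", "hõige[te", "k`iike[de"):
    forms = expand_pl_g(basic)
    print(basic, "->", forms["pl ill"], forms["pl in"])
```

Because the stem variant and the first part of the formative are inherited from the basic form, the type description itself needs to list only the basic forms.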

In a sense, a type number resembles the code of a continuation class, but while the minilexicon referred to by a continuation class code contains only the formative variants associated with one stem variant, the type description referred to by a type number determines the allotactic configurations for all basic forms of the given word (and by means of rules of analogy for the whole paradigm), thus including all stem and formative variants belonging to the paradigm of the particular word.

3.4.4. The inclusion of classification in the analyzer is effective if there is extensive variation of morphological units in the language. As a concentration of all allotactic rules, classification makes it possible to present dictionary information in a very economical way.

Classification makes it possible to distinguish every conceivable peculiarity of word inflection if every differently inflected word is assigned its own type number. This is quite a practical solution, yet not so good from the linguistic point of view, as paradigmatic similarities become difficult to observe. To prevent the number of types from growing unreasonably, MDE employs the notion of exception. If the inflection of one type differs from that of another by very few forms of very few words, those few may be labelled as exceptions and the two types merged into one. In this way the word olema becomes an exception of the tulema-type as it has an irregular form in two inflections: `on "ind pr sg 3" and "ind pr pl 3" (cf. tule[b and tule[vad), while all the other forms are analogous to the corresponding forms of tulema: ole[n - tule[n, oll[akse - tull[akse, etc.
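The treatment of olema as an exception of the tulema-type can be pictured as a regular type table plus a small override list. The Python sketch below is an assumption-laden illustration: the names TULEMA_TYPE, EXCEPTIONS and inflect are invented, only three present-tense forms are modelled, and the tag "ind pr sg 1" for the n-forms is inferred rather than quoted from the text.

```python
# Formatives of the (hypothetical, heavily reduced) tulema-type description.
TULEMA_TYPE = {
    "ind pr sg 1": "n",
    "ind pr sg 3": "b",
    "ind pr pl 3": "vad",
}

# Exceptional word forms, listed individually by stem and grammatical meaning.
EXCEPTIONS = {
    ("ole", "ind pr sg 3"): "`on",
    ("ole", "ind pr pl 3"): "`on",
}


def inflect(stem: str, tag: str) -> str:
    """Return the listed exception if there is one, else the regular type form."""
    if (stem, tag) in EXCEPTIONS:
        return EXCEPTIONS[(stem, tag)]
    return stem + "[" + TULEMA_TYPE[tag]


for tag in TULEMA_TYPE:
    print(tag, inflect("tule", tag), inflect("ole", tag))
```

Only the two irregular forms are stored; every other form of olema is still produced by the shared type description.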

True, type exception is a relative notion as the number and nomenclature of exceptions depend on classification principles, but it still helps to keep apart the typical and the atypical, including the former in type descriptions and the latter among exceptions. In general, exceptions form a mixed collection arising from a variety of causes. An exception may be either morphotactical (if the paradigm is defective: mõlema) or allotactical (if the combination of variants is atypical: m`aa[de & maa[de - weak grade) or morphonological (if the stem alternation is irregular: k`üt[ma : k`öe[takse).

4. Recognition

4.0. The strategy of recognition may complement segmentational and transformational analysis in the regular part of morphology. This assumes that the morphological system of the language falls, as Toomas Help has suggested, into two parts: active and passive morphology (Help 1985, 1990, see also Viks 1991).

In active morphology rules apply to words automatically, triggered by information contained in the phonological shape of the word. In passive morphology certain rules apply to certain words only if the word bears an additional marker to trigger the rule. If the marker is a type number, it serves as a reference to a whole set of rules contained in the type description.

In passive morphology a word form is analyzed following the strategy of segmentation: the necessary marker is found in the dictionary. In active morphology search in the lexicon is replaced by the analysis of the phonological structure of the unit in question.

4.1. In the Estonian language there is a large number of words whose morphology is totally determined by the phonological structure of the stem (Viks 1990). E.g. stems of at least 3 syllables ending in lik take an additional u by stem-end change and have stem-grade change in the final syllable (õnnel`ik - õnneliku - õnnel`ikku); stems ending in a long vowel take the formative variant d in partitive singular (m`aa - m`aa[d, rokok`oo - rokok`oo[d). Certain phonological properties of the stem serve to trigger this or that morphological rule (Help 1990).

In Estonian morphology the relevant phonological features of stems include the number of syllables, degree of quantity, final sounds, medial sounds, sometimes also the first syllable vowel. A combination of these features forms the phonological pattern of the stem, which often enables one to predict the morphological behaviour of the word, i.e. what formative variants occur in the paradigm and what alternation patterns apply to stem variants in the paradigm. Consequently, the phonological pattern of the stem makes it possible to recognize the inflection type to which the word belongs in the morphological classification.

The rigidity of the relationship between phonological patterns and inflection options can differ. E.g. verbs with the phonological pattern '3 syllables, 1st degree of quantity, ending in ele' are all conjugated like kõnele[ma (kõnele[ma : kõnel[da & kõnele[da : kõnele[n). At the same time there is the pattern '2 syllables, 3rd degree of quantity, ending in u': most verbs of this pattern are conjugated like m`uutu[ma, i.e. without stem-grade change (j`ahtu[ma : j`ahtu[b), whereas a smaller part of the verbs with the same phonological pattern are subject to stem-grade alternation and are conjugated like `õppi[ma (m`ahtu[ma : mahu[b). In the former case the rule of recognition works with absolute certainty as the phonological pattern determines the inflection type of the word unambiguously, whereas in the latter case the rule allows for a number of exceptions.
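The two recognition rules just described can be sketched in Python as follows. The sketch rests on several simplifying assumptions: syllables are counted as maximal runs of vowel letters, the third degree of quantity is read off the grave accent of the notation (so the first and second degrees are not distinguished), and the names RULES, MARKED and recognize are hypothetical.

```python
import re

VOWEL_RUN = re.compile("[aeiouõäöü]+")


def phonological_pattern(stem: str) -> tuple[int, bool]:
    """Return (number of syllables, is the stem in the 3rd degree of quantity)."""
    plain = stem.replace("`", "")
    return len(VOWEL_RUN.findall(plain)), "`" in stem


# (syllables, 3rd degree?, final sounds, inflection type)
RULES = [
    (3, False, "ele", "kõnelema"),
    (2, True, "u", "m`uutuma"),
]

# Stems that match a pattern but inflect differently carry an individual marker.
MARKED = {"m`ahtu": "`õppima"}


def recognize(stem: str):
    """Return the inflection type of a verb stem, or None if no rule applies."""
    if stem in MARKED:
        return MARKED[stem]
    syllables, third_degree = phonological_pattern(stem)
    plain = stem.replace("`", "")
    for n, q3, ending, type_name in RULES:
        if syllables == n and third_degree == q3 and plain.endswith(ending):
            return type_name
    return None


for stem in ("kõnele", "j`ahtu", "m`ahtu"):
    print(stem, recognize(stem))
```

The first rule needs no exception list at all, while the second only works because m`ahtu is looked up before the pattern is consulted.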

Those words for which the dependence between the phonological pattern and the inflection type holds make up the regular part of the morphological system of the language. The rest are irregular words which, in order to be recognized, need an additional individual marker. The same principle has also been followed by some systems designed for the automatic morphological synthesis of Finnish, e.g. Holman 1988, Kettunen 1991.

4.2. The inclusion of rules of recognition in the system of analysis makes it possible to limit the lexicon to irregular stems only. So the lexicon of the morphological analyzer can omit 1) all words of the kõnelema-type as well as 2) the u-final words of the m`uutuma-type, the inflection of which is determined by their phonological pattern. At the same time the lexicon must present the u-final words of the `õppima-type, which otherwise, i.e. following the rule of recognition, would be wrongly classified. For these words the appropriate type marker can be found in the dictionary.

As a result considerable economy is gained in dictionary volume. For example, an absolute majority of the hundreds of Estonian iin-final words are declined just like t`oon, i.e. the stem-end vowel i is added, and stem-grade change takes place in the last syllable. If the analyzer applies the recognition rule relating the iin-ending to the t`oon-type, the lexicon can omit such words as m`iin, bens`iin, levomütset`iin, etc., keeping only those that are inflected differently, like v`iin - viina, p`iin - piina, t`iin - tiinu (the stem-end vowel is a or u, not i). This means that instead of hundreds only three words (according to the Orthological Dictionary) need be included in the lexicon.
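The economy gained can be illustrated with a Python sketch in which the lexicon holds only the three irregular iin-final words and everything else is handled by the rule. The function name genitive_singular is hypothetical, and modelling the weak grade simply by dropping the quantity mark is an assumption based on the pattern seen in v`iin - viina.

```python
# The whole lexicon for iin-final words: only the irregular ones are listed.
IRREGULAR_IIN = {"v`iin": "viina", "p`iin": "piina", "t`iin": "tiinu"}


def genitive_singular(lemma: str) -> str:
    """Genitive singular of an iin-final word: lexicon first, then the rule."""
    if lemma in IRREGULAR_IIN:
        return IRREGULAR_IIN[lemma]
    plain = lemma.replace("`", "")   # weak grade: the quantity mark is dropped
    if plain.endswith("iin"):
        return plain + "i"           # t`oon-type: stem-end vowel i is added
    raise ValueError(f"no recognition rule for {lemma!r}")


for word in ("m`iin", "bens`iin", "levomütset`iin", "v`iin", "t`iin"):
    print(word, genitive_singular(word))
```

A new loan word in iin is covered without any change to the lexicon, which is the point of the recognition rule.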

4.3. All paradigmatic forms of regular stems are analyzed by means of rules and grammeme dictionaries, without consulting the large dictionary of stems. E.g. the word form lapatakse can be segmented into lapa [ takse. The sequence takse can be found in the list of formatives where its grammatical meaning is defined as "ind pr ips". If the preceding sequence lapa is not in the dictionary of irregular stems, the next step is to ascertain its phonological pattern. A rule of recognition states that the stem structure '2 syllables, ending in a' should be related to the h`akkama-type. This type description, in turn, prescribes that the formative variant in "ind pr ips" is takse and that the formative is preceded by a lemmatic stem (A) in the weak (N) grade: ANtakse. In order to arrive at the lemma form (supine) the stem must be changed into the strong grade, lapa → l`appa, and the supine formative ma added (l`appa[ma). So the analysis results in: lapa[takse → L`APPA[MA "ind pr ips".
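The steps of this walk-through can be strung together in a short Python sketch. Everything in it is a toy under stated assumptions: the formative list holds a single entry, the dictionary of irregular stems is empty, the recognition rule ignores quantity, and strengthen only handles stems of the shape consonants + short vowel + p/t/k + vowel. All function and table names are hypothetical.

```python
import re

VOWELS = "aeiouõäöü"

FORMATIVES = {"takse": "ind pr ips"}        # list of formatives (one entry only)
IRREGULAR_STEMS = {}                        # dictionary of irregular stems (empty here)
TYPE_HAKKAMA = {"ind pr ips": "ANtakse"}    # fragment of the h`akkama type description


def recognize_type(stem: str):
    """Rule of recognition: '2 syllables, ending in a' -> h`akkama-type."""
    if len(re.findall(f"[{VOWELS}]+", stem)) == 2 and stem.endswith("a"):
        return TYPE_HAKKAMA
    return None


def strengthen(weak_stem: str):
    """Toy weak -> strong transformation, e.g. lapa -> l`appa."""
    m = re.fullmatch(f"([^{VOWELS}]*)([{VOWELS}])([ptk])([{VOWELS}])", weak_stem)
    if m is None:
        return None
    onset, nucleus, stop, final = m.groups()
    return f"{onset}`{nucleus}{stop}{stop}{final}"


def analyze(word_form: str) -> list:
    """Return (lemma, grammatical meaning) pairs for a regular verb form."""
    results = []
    for formative, tag in FORMATIVES.items():
        if not word_form.endswith(formative):
            continue
        stem = word_form[: -len(formative)]
        if stem in IRREGULAR_STEMS:
            continue                        # passive morphology: not modelled here
        description = recognize_type(stem)
        marker = description.get(tag) if description else None
        if marker is None or marker[2:] != formative:
            continue
        # Marker = stem-end letter + grade letter + formative, e.g. ANtakse.
        lemma_stem = strengthen(stem) if marker[1] == "N" else stem
        if lemma_stem is not None:
            results.append((lemma_stem.upper() + "[MA", tag))
    return results


print(analyze("lapatakse"))
```

No stem dictionary entry for lapa is consulted at any point; the lemma is reconstructed from the formative list, the recognition rule and the type description alone.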

4.4. Such a system of morphological analysis as described above is feasible due to the general law stating that most words are inflected simply and regularly, on active principles. New words usually conform to active patterns of inflection (Kross 1984). The old words, however, having passed through various language historical developments have retained traces of several of them, thus presenting a morphologically rather obscure picture. At the same time the number of such words with passive morphology is relatively small and shows no considerable tendency towards increase. So they make up a closed set that can be fixed in full.

The use of rules of recognition renders the system of analysis an open one: the absence of a word from the dictionary does not mean the end of work for the analyzer; instead a set of active rules is triggered. The type number given in the dictionary directs the system to use the rules of passive morphology. The role of the lexicon decreases and that of the grammar increases. This in turn means that the efficiency of the analysis does not depend so much on the volume of the dictionary as on the adequacy of the morphological description.

CONCLUSION

If the aim is to create an open system of morphological analysis it is inevitable that the effort should be based on linguistic regularities. Every regular and productive feature can be handled by the rules of active morphology.

The rules of recognition relate the stems to certain inflection types. A type description defines the stem-end and stem-grade alternation patterns as well as the set of formative variants for the basic inflectional forms. The rules of analogy produce the allotactic configurations for the whole paradigm. Different variants of one and the same stem are linked by rules of transformation. The options for the inner morphological structure of a word form are defined by rules of morphotactics.

All those words that behave in an irregular and unproductive way morphologically are handled by passive morphology and they need some additional information contained in the dictionary. As a word may be exceptional in several different senses the additional information needed can vary as well.

Exceptions to the rules of recognition require a type marker referring to another type description. Exceptions to types (to allotactic rules) are pointed out as irregular word forms. Exceptions to transformational rules need either to be referred to another rule or to be presented as ready-made stem variants. Exceptions to rules of morphotactics require information about the missing forms.

Even though the boundary between the active and the passive morphologies is not always equally clear, depending on the way the rules are formulated, the distinction it makes is essential indeed, as this is what makes the system open.

The aim of our project is precisely to create an open system for the morphological analysis of the Estonian language, a system capable of analyzing even those words that cannot be found in any dictionary, although they are correct Estonian words in every respect. Closed systems of analysis require an auxiliary system for dictionary update by means of which the system lexicon could be complemented with new words, or additional user lexicons could be created for specific domains.

A natural language is subject to constant change and development. New words are coming in all the time: they are produced by derivation and compounding, borrowed from other languages or dialects, they are even coined artificially if a new notion is felt to require an entirely new way of expression. Hence the need to be able to imitate the openness of the system underlying a natural language, a system not limited to the computer-stored lexicon with its finite selection of stems.

Besides being able to handle unfamiliar but regular words, such an analyzer stands out for economy, as the smaller volume of the dictionary accordingly reduces search time.
