Inflection types, also called morphological classes, characterise the different ways in which words inflect. Because the inflection of a given word depends to a considerable extent on its phonological structure, inflection types can be recognised automatically.
Here the phonological structure (pattern) is regarded as a combination of several phonological features that can be ascertained from the letter sequence that makes up the word.
The selection of the necessary features for the pattern depends on the transcription used. If an Estonian word is provided with special marks for the third (overlong) degree of phonetic quantity and the main stress (as is done in A Concise Morphological Dictionary = CMD = Viks 1992), it is possible to apply a pattern extended up to six features (see Lind, Viks 1994):
The present approach proceeds from the spelling of the word, with a view to the future applicability of the rules to as extensive a body of material as possible. The same aim determined the initial word form for the automatic recognition algorithm: the so-called lemma, the traditional headword form entered in Estonian dictionaries. For nouns the lemmatic form is the nominative singular and for verbs the supine (ma-infinitive).
As we proceed from orthography we cannot use those phonological features that are not reflected in the Estonian spelling, i.e. degrees of phonetic quantity, main stress and the number of syllables counted from the main stress. The use of medial sounds is possible, if (as usual) the main stress occurs on the first syllable.
The rules of recognition fall into two classes: first, the rules for word class recognition, and second, the rules for inflection type recognition. The rules for word class recognition enable one to differentiate between nouns and verbs. Owing to the supine being the initial form it is actually necessary to apply only one rule there:
MA → verb
If the initial form ends in ma, it is a verb; otherwise it is a noun (uninflected words are not discussed here). The other (and the basic) group of rules serves to recognise inflection types. According to word class, they are divided into rules for noun (declension) type recognition and rules for verb (conjugation) type recognition.
As there is no one-to-one correspondence between
the phonological structure of an Estonian word and its morphological
behaviour the rules cannot do without exceptions. Exceptions are
words that behave differently from what would as a rule be expected
from a word of that particular phonological structure. E.g. most
of the two-syllable verbs ending in le belong to type 30
in CMD (vaatle[ma - vaadel[da, nautle[ma - naudel[da),
but three such verbs (taotle[ma, nõutle[ma, loetle[ma)
belong to type 27 (loetle[ma - loetle[da). All such words
are listed as exceptions.
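The interplay between a rule and its exceptions can be sketched as follows. This is only a minimal illustration built on the le-verb example above: the function name and the dictionary are hypothetical, and the syllable count of the stem is assumed to be supplied by the caller rather than computed from the spelling.

```python
# Verbs named in the text as type 27 although their pattern suggests type 30.
VERB_TYPE_EXCEPTIONS = {"taotlema": "27", "nõutlema": "27", "loetlema": "27"}

def le_verb_type(supine, stem_syllables):
    """Return the CMD type of a two-syllable verb stem ending in 'le', else None.

    stem_syllables is an assumed input: the number of syllables in the stem
    (the supine without its final 'ma').
    """
    word = supine.lower()
    # The exception list is consulted first, so listed words never reach the rule.
    if word in VERB_TYPE_EXCEPTIONS:
        return VERB_TYPE_EXCEPTIONS[word]
    stem = word[:-2] if word.endswith("ma") else word
    if stem_syllables == 2 and stem.endswith("le"):
        return "30"
    return None  # outside the scope of this single rule

print(le_verb_type("vaatlema", 2))
print(le_verb_type("loetlema", 2))
```

Looking up the exception list before applying the rule keeps the rule itself simple, at the cost of maintaining the list.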
The following rules and the statistics presented are exclusively based on the classification applied in CMD (Vol. 1: 37-51) and the words contained in CMD (Vol. 2). There are 33,000 inflected words, of which more than 26,000 are nouns and about 7,000 are verbs.
The CMD transcription system has been reduced to the orthographic form by eliminating the quantity and stress marks. The rules are based on three features:
The rules to recognise syllable boundaries on the basis of CMD transcription are described in an earlier article (Lind, Viks 1994: 59-60). Spelling-based syllabification may present problems if the division into syllables depends directly on the position of the main stress, e.g. one and the same sequence of two vowels may form a syllable in the stressed part of a word, but be divided between two syllables in an unstressed position, cf. kor'ea (2s) - h'evea (3s), linol'eum (3s) - petr'ooleum (4s).
Spelling allows only partial reliance on medial sounds, as their position depends on the morphological stress of the word (which need not even coincide with its phonological main stress), cf. p'õleng (õl) - poj'eng (eng), skol'astik (ast) - t'alastik (ik).
Therefore medial sounds can be used mainly in the case of shorter words of 1-2 syllables. The medial sounds of 3-syllable words are worth consideration in the case of words with suffixes line/lane/mine/kene. In many cases, however, medial sounds are irrelevant for our purposes as the number of syllables and the final sounds are entirely sufficient to recognise the inflection type of the word in question.
Final sounds are always relevant and available for analysis, but
it is not so easy to decide how many final sounds need to be
considered. Sometimes one would suffice, whereas in other cases
(esp. in derivatives, in which the essential letters precede the
suffix) as many as seven need to be taken into account - mostly
as compensation for overlooking the medial sounds, cf. t'aanlane
- hisp'aanlane - retorom'aanlane (in all of them the medial
sounds are aanl).
The rules for type recognition are presented in Table 1:
1: number of syllables
'+' written after the number N means 'N syllables or more'
2: final sounds
presented either in terms of sound classes (lower case letters) or as concrete letters (capitals).
'ww' means two similar vowels, 'Ww' stands for two different vowels
3: medial sounds
presented either as their number, following '|' or as sound classes:
'0' means that medial sounds are not considered
4: type number in CMD classification
(for parallel types two numbers are presented)
5: coverage of the rule
i.e. how many words match the rule
6: number of exceptions
i.e. words of the similar pattern belonging to other types of inflection
7: examples - regular word
8: examples - irregular words with their type numbers
The set of rules is ordered: as soon as the first matching rule
is found it is applied regardless of the following ones. The
lines beginning with a semicolon contain comments disregarded by the program.
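The ordered, first-match behaviour of such a rule set can be illustrated with a small sketch. It is not the program described in this paper. Several things are assumptions made for the example: the arrow is written as '->', the class letters 'v' and 'c' are read as "any vowel" and "any consonant", the number of syllables is passed in by the caller, and only rules whose medial-sound field is '0' are handled. All function names are hypothetical.

```python
VOWELS = set("aeiouõäöü")  # assumption: Estonian vowel letters

def parse_rules(text):
    """Read rule lines in order, skipping comment lines that start with ';'."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        left, type_no = (part.strip() for part in line.split("->"))
        syllables, final, medial = left.split()
        rules.append((syllables, final, medial, type_no))
    return rules

def syllables_match(spec, count):
    """'3' means exactly three syllables, '3+' means three or more."""
    if spec.endswith("+"):
        return count >= int(spec[:-1])
    return count == int(spec)

def final_matches(pattern, word):
    """Compare the last len(pattern) letters of word with the pattern symbols."""
    tail = word.lower()[-len(pattern):]
    if len(tail) < len(pattern):
        return False
    for symbol, letter in zip(pattern, tail):
        if symbol == "v":                     # sound class: vowel (assumed)
            ok = letter in VOWELS
        elif symbol == "c":                   # sound class: consonant (assumed)
            ok = letter.isalpha() and letter not in VOWELS
        elif symbol.isupper():                # capital: that concrete letter
            ok = letter == symbol.lower()
        else:
            raise ValueError(f"sound class {symbol!r} is not covered by this sketch")
        if not ok:
            return False
    return True

def recognise(word, syllable_count, rules):
    """Return the type of the first matching rule, or None if no rule matches."""
    for syllables, final, medial, type_no in rules:
        if medial != "0":
            continue  # medial-sound conditions are outside this sketch
        if syllables_match(syllables, syllable_count) and final_matches(final, word):
            return type_no
    return None

NOUN_RULES = parse_rules("""
; four of the large noun rules quoted in the text
1 c 0 -> 22
3 cUS 0 -> 11
3+ vNE 0 -> 12
2 cvc 0 -> 02
""")

print(recognise("harjutus", 3, NOUN_RULES))
```

Because the first match wins, narrower rules have to be placed before more general ones in the rule list.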
The coverage of the rules differs greatly. There are six rules covering more than 1,000 words, one for verbs and 5 rules for nouns:
3+ v 0 → 27 (3630)
1 c 0 → 22 (2612)
3 cUS 0 → 11 (2036)
3+ vNE 0 → 12 (1928)
2 cvc 0 → 02 (1652)
3+ wwn 0 →
The above six rules (relatively simple ones) cover
more than 1/3 (38.4%) of all inflected words contained in CMD,
accounting for the main big types. Eleven rules cover 500-1,000
words each, which accounts for 23.1% of all CMD words, and eleven more
cover 200-500 words each (11.2%). Together those 28 rules cover 73% of CMD
words. The remaining 89 rules cover 27% of the words and are relatively
complicated. Almost all rules referring to the medial
sounds (all except 3) belong to this group.
The set of rules has been compiled with a view to the "best" result, i.e. a minimum of exceptions combined with a maximum coverage. This is why the final set of rules is rather numerous (117), while part of the rules may seem too detailed and narrow. The present set of rules covers 93% of CMD nouns and 96% of CMD verbs.
In the formation of rules I also tried to keep an
eye on the productivity of phonological patterns that could render
some rules eligible for inclusion despite their relatively narrow
coverage in CMD. E.g. the pattern '3+ IKKUS' covers only 23 CMD
words, but the ik-suffix is a productive adjectival suffix
and of every ik-adjective a corresponding us-noun
can be derived (ik + us → ikkus), so the pattern was set forth as
Table 2 presents the summary statistics type-by-type,
adding a few of the more typical phonological patterns. In doing so,
some of the patterns have been combined into one, e.g. the pattern
'2+ cvc' actually covers two patterns from the set of rules: '2
cvc' and '3+ cvc'.
Table 2: Statistics of type recognition (by types)
2: number of regular words
3: number of exceptions
5: major patterns
1        2      3     %      5               6
00              18                           see kes keegi mina
01       1730   149   92     '3+ IA/JA'      mütoloogia avaja
                             '3+ IKA'        matemaatika
                             '3+ cU/cI'      vallatu uimasti
02       5469   278   95     '2+ cvc'        õpik rahustav
                             '2+ cNE'        raudne monoliitne
                             '2 cg/kS'       vahend madrats
03              19                           vaher armas
04              29                           ase
05              115                          liige
06       342    54    86.4   '2 cE |>2'      hinne
07       127    38    77     '2 cAS'         hammas
08              41                           tütar sammal
09       1128   197   85.1   '2 cIS'         tennis
10       774    109   87.6   '3 vNE |=2/3'   vesine karvane
11       3167   98    97     '3 cUS'         harjutus
                             '3+ cIS'        seadeldis
12       2413   91    96.4   '3+ KE'         tilluke
                             '4+ vNE'        oluline
11~09                        '2 cUS'         raskus
12~(10)                      '3 LANE/LINE/   eestlane näiline
13              30                           suur
14              25                           uus
15              9                            käsi
16       569    148   79.4   '2 cv |=3/vp'   kaabu hapu
                             '3+ cA/cO/cE'   seljanka embargo andante
17       491    6     98.9   '2 cv vl'       saba
18              50                           sõda
19       331    39    89.5   '3+ cvn'        molekul
19~02                        '3+ Wwn'        pension
20              16                           nimi
21              8                            jõgi
22       6587   139   98     '1 c'           sepp viil
                             '2+ cc'         hotell kujutelm
                             '3+ wwn'        versioon kolesterool
                             '2+ vvc'        mentool eksponaat
23              18                           hein
24              74                           padi sõber
25       1255   36    97     '2 cnIK'        ristmik
                             '3+ STIK'       kuristik
                             '3+ nIK'        õnnelik
26       289          100    '1 v' '2+
         24672  1834  93
27       4483   76    98.3   '2 U/E'         juurduma riknema
                             '2 A/I vl'      elama väsima
                             '3+ A/E/U'      kirjutama ragisema
28       1345   154   89.7   '2+ I'          leppima aplodeerima
29       492    16    96.8   '2 A'           hüppama
30       176    1     99.4   '2 cLE'         riidlema
31       42           100    '3 ELE'         rabelema
32       9      2     81.8   '1 S'           seisma
33       7            100    '1 n'           naerma
34       68           100    '1 k |>2'       söötma
35       11           100    '1 k |=2'       nutma
36       65     8     89     '4+ ELE'        mõtiskelema
37              8                            võima
38              9                            sööma
         6698   274   96
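The percentage column of Table 2 can be checked with a few lines of arithmetic. The formula below (regular words divided by regular words plus exceptions) is an assumption inferred from the figures, not a definition given in the text. It reproduces the two summary rows and the type 06 row used here. The function name is hypothetical.

```python
def regular_share(regular, exceptions):
    """Percentage of words assigned to the right type (assumed formula)."""
    return 100 * regular / (regular + exceptions)

# Summary rows of Table 2: (regular words, exceptions)
NOUN_TOTALS = (24672, 1834)
VERB_TOTALS = (6698, 274)

print(round(regular_share(*NOUN_TOTALS)))   # summary row for nouns
print(round(regular_share(*VERB_TOTALS)))   # summary row for verbs
print(round(regular_share(342, 54), 1))     # type 06
```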
There are few types without exceptions - just one noun type (26) and four verb types (31, 33, 34, 35). The more regular types (98% or more regular words) are 17 and 22 for nouns and 27 and 30 for verbs. Exceptions are more numerous (over 10%) in types 06, 07, 09, 10, 16, 19 for nouns and 28, 32, 36 for verbs. The list of exceptions includes words whose phonological pattern would, according to the set of rules, attribute them to some other type. E.g. the predominant pattern in type 07 is '2 cAS' (excl. '2 JAS/KAS'), but there are also a number of words that should regularly belong to type 09: '2 cIS/cES → 09', e.g.
helves - helbe (07), cf. ilves - ilvese (09) or
kallis - kalli (07~05), cf. tellis - tellise (09).
On the other hand there are some '2 cAS'- words in other types beside 07, e.g. kolmas, pagas (02), armas, ergas (03~05), pargas, linnas (09) etc.
There are many types that cannot be recognised at all by the set of rules proposed here. E.g. type 13 (keel - keele - keelt) includes only one pattern, '1 c', but the same pattern is much more numerously represented in type 22 (viil - viili - viili).
Of the 26 noun types the system fails to recognise 12 (plus all words of 00-type). As each of those types, however, contains but few words (only one type containing more than 100) they cover but 1.7% of all nouns.
Verb types are much easier to recognise than noun types. The rules are few (22) and the percentage of recognition is higher. Of the 12 verb types only two small ones remained unrecognised. Both contained one-syllable vowel-stem words, 17 words all told (0.24% of CMD verbs).
According to the two kinds of rules the program for type recognition uses two kinds of exceptions:
(1) As for verbs the initial form is supine, the supine formative ma is quite efficient in distinguishing the class of verbs from that of nouns. There are, however, 53 nouns ending in ma, 15 of which have a verb-like phonological pattern (according to our set of rules): astma, muroma, mõlema, panama, pidžaama, firma, gamma, kisma, lemma, mamma, plasma, prisma, sperma, summa, trilma.
If a rule for 1-syllable verbs with a long vowel or a diphthong (1 vv → 38) were added to the set of type recognition rules, modelling the rule after the most numerous type (containing 9 words), the list of word class exceptions would have to be supplemented by 11 ma-final nouns that violate this rule: draama, duuma, kliima, kooma, laama, puuma, reuma, struuma, teema, trauma, treema.
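Word class recognition with its exception list might be sketched as below. This is a simplified illustration. The function and constant names are hypothetical, and only the 15 ma-final nouns with a verb-like pattern are listed. The remaining ma-final nouns (for instance draama), which the text says fail to match any verb pattern, are not handled here and would be classed as verbs by this sketch.

```python
# The 15 ma-final nouns with a verb-like pattern, as listed in the text
# (spelled 'pidžaama' here).
MA_NOUNS = frozenset(
    "astma muroma mõlema panama pidžaama firma gamma kisma "
    "lemma mamma plasma prisma sperma summa trilma".split()
)

def word_class(lemma):
    """Classify a lemma as 'verb' or 'noun' using the single MA rule."""
    word = lemma.lower()
    if word in MA_NOUNS:          # word class exceptions are checked first
        return "noun"
    return "verb" if word.endswith("ma") else "noun"

print(word_class("kirjutama"))
print(word_class("firma"))
```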
(2) The list of type exceptions contains those words
that according to their phonological pattern would be attributed
to the wrong type of inflection. The list has two parts - one
for nouns, the other for verbs. All exceptional words are provided
with the actual number of the inflection type. The following sample
comes from the noun exceptions (the line for comments carries
the rule violated by the following exceptions).
Sample 1: Exceptions to the type recognition rules
; 2 cv vl → 17
MINA 00 ,
SINA 00 ,
IGA 18 ,
PADA 18 ,
PIDU 18 ,
KALI 24 ,
KARI 24 ,
MARI 24 ,
; 2 cv vp → 16
SETU 05 ,
; 3+ Wwn 0 → 19~02
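A list in the format of Sample 1 could be read into a lookup table roughly as follows. The parser is a hypothetical sketch: it assumes one word and one type number per line, writes the arrow as '->', remembers the most recent comment line as the rule being violated, and simply strips the trailing comma without interpreting it.

```python
def parse_exceptions(text):
    """Map each word to a list of (type number, violated rule) pairs."""
    exceptions = {}
    violated_rule = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(";"):
            # A comment line names the rule violated by the entries below it.
            violated_rule = line[1:].strip()
            continue
        fields = line.rstrip(",").split()
        word, type_no = fields[0].lower(), fields[1]
        exceptions.setdefault(word, []).append((type_no, violated_rule))
    return exceptions

SAMPLE = """\
; 2 cv vl -> 17
MINA 00 ,
PADA 18 ,
KALI 24 ,
; 2 cv vp -> 16
SETU 05 ,
"""

table = parse_exceptions(SAMPLE)
print(table["pada"])
```

Storing a list per word leaves room for homonyms that appear in the exception list more than once.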
In Estonian there are also quite a few words that, despite having one and the same lemmatic form, belong to different types of inflection. The cases are of two kinds: parallel types and morphological homonyms.
In parallel types one and the same word has some regular parallel forms (different forms with the same grammatical meaning) in its inflectional paradigm. In CMD such words have two type numbers, e.g. raskus (types 11~09), which has regular parallel forms in the plural:
raskusi, raskusisse etc. (like harjutus: type 11) and
raskuseid, raskuseisse etc. (like katus: type 09).
A similar strategy has been followed in the type recognition system: two type numbers are used in the corresponding rules as well as in the exceptions, e.g.
rule: 2 cUS 0 → 11~09 raskusi (11) ~ raskuseid (09)
exception: laius 11~09 laiusi (11) ~ laiuseid
Morphological homonyms are the words the lemmatic forms of which look similar, but are inflected according to different paradigms. In CMD those words are all entered as two or three separate headwords, e.g. ehe (eheda) A 02, ehe (ehte) S 06, ehe (ehtme) S 05.
In the list of exceptions to the type recognition rules a comma is used to indicate that in addition to the irregular type(s) a word also belongs to a regular type. E.g. the list of exceptions includes the words:
ehe 05 ,
ehe 06 ,
whereas the type of the third homonym (ehe : eheda) can be found by means of the rule:
2 cE vl → 02
(like kibe : kibeda and others of the kind).
The number of CMD homonyms that belong to a regular as well as
to an irregular type is 124 (117 nouns and 7 verbs).
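How parallel types and homonyms might be combined at lookup time is sketched below. The representation is an assumption made for illustration: the set `also_regular` stands in for the marker that a listed word additionally belongs to a regular type, and `rule_type` is the type the ordered rules would return for the word. Neither the names nor the data layout come from the program itself.

```python
def split_parallel(type_field):
    """Split a parallel-type field such as '11~09' into its type numbers."""
    return type_field.split("~")

def all_types(word, exception_types, rule_type, also_regular):
    """Collect every inflection type that applies to a lemma.

    exception_types: dict mapping a word to its irregular type numbers
    rule_type:       type assigned by the ordered rules (assumed given)
    also_regular:    words that belong to a regular type as well
    """
    types = list(exception_types.get(word, []))
    if word not in exception_types or word in also_regular:
        types.append(rule_type)
    return types

EXCEPTION_TYPES = {"ehe": ["05", "06"], "pada": ["18"]}
ALSO_REGULAR = {"ehe"}

print(all_types("ehe", EXCEPTION_TYPES, "02", ALSO_REGULAR))
print(split_parallel("11~09"))
```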
The program of type recognition was compiled by Peeter Lind. The
algorithm is presented in Figure 1.
The above way is just one of the possible realisations of automatic type recognition. There are certainly other ways to compile the set of rules, some more general, some less. One way would be to discard the medial sounds criterion and work just with two features. Good results could also be expected of a pattern based on a combination of word syllable structure and final sounds.
A different set of rules would mean different lists of exceptions.
The application of more general rules would increase the number
of exceptions, whereas an increase in the detail and complexity
of the rules would mean fewer exceptions. As the program is independent
of the set of rules and lists of exceptions it can be applied
to a totally different classification as well. Such an experiment
has been carried out on the verb types of the Orthological Dictionary
of Estonian (ÕS 1976).
As to the spheres of application of the type recognition system, one of the most essential is automatic morphological analysis and synthesis in the context of an open model of morphology. Such a model proceeds from the rules operating in a natural language, and its aim is to find a rule-based computer presentation of everything that is regular in the language, referring to a dictionary only in those cases that are not covered by the rules applied, i.e. exceptions (Viks 1994). This means that the inflection type of all those words not included in the lists of exceptions must be recognisable automatically.
But even if the computer system is based not on rules
but on a big dictionary in which every entry word is supplied
with all necessary information, rules of type recognition would
be necessary for complementing the dictionary with new words (as
no dictionary can, in principle, include all the words used in
a natural language). Automatic type recognition would help one
to provide new words with the grammatical information, which in
the case of a language as complex morphologically as Estonian
may not be so trivial at all.
Another important application is practical lexicography. There is hardly a dictionary, either mono- or bilingual, with Estonian as a source language, that could do without providing the headword with morphological information: at least the type number and a couple of inflected forms to indicate the relevant stem changes and formative variants. A lexicographer's mind, however, is usually occupied mainly with quite different problems such as explanation, equivalents, examples etc. Type recognition combined with morphological synthesis enables the morphological component of the entry to be generated automatically. An experiment of this kind has been carried out (see Kuusik, Lind, Viks 1995) on the Estonian-Russian dictionary, the first volume of which is going to be published in the near future.
A third application could be envisaged in computer
systems of language learning, e.g. in morphology tests (type identification).
ÕS 1976 = Õigekeelsussõnaraamat. Tallinn.
Viks 1992 = Ü. Viks, Väike vormisõnastik.
I: Sissejuhatus & grammatika. II: Sõnastik & lisad.
A Concise Morphological Dictionary of Estonian. I: Introduction
& Grammar. II: The Dictionary & Appendices. Tallinn.
Viks 1994 = Ü. Viks, A morphological
analyzer for the Estonian language: the possibilities and impossibilities
of automatic analysis. In: Automatic Morphology of Estonian 1:
Lind, Viks 1994 = P. Lind, Ü. Viks, MALL
- the tool of a linguist. In: Automatic Morphology of Estonian
Kuusik, Lind, Viks 1995 = Evelin Kuusik, Peeter
Lind, Ülle Viks, An Estonian Morpho-Generator for Dictionaries.
Preprint FU 1995. Tallinn.