RULES FOR RECOGNITION OF INFLECTION TYPES

Ülle Viks

Inflection types, alias morphological classes, serve to characterise the different possibilities of word inflection. Because the inflection of a given word depends to a considerable extent on its phonological structure, inflection types can be recognised automatically.

Here the phonological structure (pattern) is regarded as a combination of several phonological features that can be ascertained from the letter sequence that makes up the word.

The selection of the necessary features for the pattern depends on the transcription used. If an Estonian word is provided with special marks for the third (overlong) degree of phonetic quantity and the main stress (as is the practice in A Concise Morphological Dictionary = CMD = Viks 1992), it is possible to apply a pattern extended up to six features (see Lind, Viks 1994):

  1. number of syllables counted from word beginning,
  2. number of syllables counted from the main-stress syllable,
  3. degree of phonetic quantity,
  4. final sounds,
  5. medial sounds (from the first vowel of the main-stress syllable up to the first vowel of the next syllable, excl.),
  6. syllable structure (sequence of syllable types).

The present approach proceeds from the spelling of the word, with a view to the future applicability of the rules to as extensive a body of material as possible. The same aim determined the initial word form for the automatic recognition algorithm - this is the so-called lemma, the traditional headword form entered in Estonian dictionaries. For nouns the lemmatic form is the nominative singular and for verbs the supine (ma-infinitive).

As we proceed from orthography we cannot use those phonological features that are not reflected in the Estonian spelling, i.e. degrees of phonetic quantity, main stress and the number of syllables counted from the main stress. The use of medial sounds is possible, if (as usual) the main stress occurs on the first syllable.

The rules of recognition fall into two classes: first, the rules for word class recognition, and second, the rules for inflection type recognition. The rules for word class recognition enable one to differentiate between nouns and verbs. Owing to the supine being the initial form it is actually necessary to apply only one rule there:

MA → verb

If the initial form ends in ma, it is a verb, otherwise it is a noun (uninflected words are not discussed here). The other (and basic) group of rules serves to recognise inflection types. According to the word classes they are divided into rules for noun (declension) type recognition and rules for verb (conjugation) type recognition.

As there is no one-to-one correspondence between the phonological structure of an Estonian word and its morphological behaviour the rules cannot do without exceptions. Exceptions are words that behave differently from what would as a rule be expected from a word of that particular phonological structure. E.g. most of the two-syllable verbs ending in le belong to type 30 in CMD (vaatle[ma - vaadel[da, nautle[ma - naudel[da), but three such verbs (taotle[ma, nõutle[ma, loetle[ma) belong to type 27 (loetle[ma - loetle[da). All such words are listed as exceptions.

Rules

The following rules and the statistics presented are based exclusively on the classification applied in CMD (Vol. 1: 37-51) and the words contained in CMD (Vol. 2). There are some 33,000 inflected words, of which more than 26,000 are nouns and about 7,000 are verbs.

The CMD transcription system has been reduced to the orthographic form by eliminating the quantity and stress marks. The rules are based on three features:

  1. number of syllables from word beginning,
  2. final sounds,
  3. medial sounds (in some cases).

The rules to recognise syllable boundaries on the basis of CMD transcription are described in an earlier article (Lind, Viks 1994: 59-60). Spelling-based syllabification may present problems if the division into syllables depends directly on the position of the main stress: one and the same sequence of two vowels may form a single syllable in the stressed part of a word, but be divided between two syllables in an unstressed position, cf. kor'ea (2s) - h'evea (3s), linol'eum (3s) - petr'ooleum (4s).
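The difficulty can be illustrated with a deliberately naive counter that treats every unbroken run of vowel letters as one syllable nucleus. This is only an assumed stand-in, not the syllabification rules of Lind, Viks 1994, and the function name is hypothetical. It agrees with the transcription for korea and linoleum but undercounts hevea and petrooleum, because spelling alone does not show where the sequence ea or eu is split.

```python
import re

# Any unbroken run of Estonian vowel letters is treated as one syllable nucleus.
VOWEL_RUN = re.compile(r"[aeiouõäöü]+")


def naive_syllable_count(word: str) -> int:
    """Count vowel runs in the spelling (an assumed heuristic, stress-blind)."""
    return len(VOWEL_RUN.findall(word.lower()))


# Syllable counts marked "(2s)", "(3s)", "(4s)" in the text above.
STATED = {"korea": 2, "hevea": 3, "linoleum": 3, "petrooleum": 4}

for word, stated in STATED.items():
    print(word, "naive:", naive_syllable_count(word), "stated:", stated)
```

Any spelling-only counter has to treat the vowel sequence in korea and hevea identically, so one of the two is bound to be miscounted unless the word is listed separately.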

Spelling allows only partial reliance on medial sounds, as their position depends on the morphological stress of the word (which need not even coincide with its phonological main stress), cf. p'õleng (õl) - poj'eng (eng), skol'astik (ast) - t'alastik (ik).

Therefore medial sounds can be used mainly in the case of shorter words of 1-2 syllables. The medial sounds of 3-syllable words are worth consideration in the case of words with suffixes line/lane/mine/kene. In many cases, however, medial sounds are irrelevant for our purposes as the number of syllables and the final sounds are entirely sufficient to recognise the inflection type of the word in question.
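A minimal sketch of spelling-based extraction of medial sounds, assuming the main stress falls on the first syllable. The function names are hypothetical; the classes v, p and l are those used for the medial-sounds column of Table 1. The sketch reproduces õl for põleng but yields oj for pojeng, where the stressed syllable is in fact the second one.

```python
VOWELS = set("aeiouõäöü")
P_CLASS = set("kptfš")           # class 'p' of the medial-sounds column
L_CLASS = set("bdghjlmnrszžv")   # class 'l' of the medial-sounds column


def medial_sounds(word: str) -> str:
    """Letters from the first vowel up to (excluding) the first vowel of the
    next syllable, assuming first-syllable stress."""
    w = word.lower()
    start = next((i for i, ch in enumerate(w) if ch in VOWELS), None)
    if start is None:
        return ""
    end = start
    while end < len(w) and w[end] in VOWELS:      # vowel(s) of the first syllable
        end += 1
    while end < len(w) and w[end] not in VOWELS:  # consonants before the next vowel
        end += 1
    return w[start:end]


def medial_pattern(word: str) -> str:
    """Rewrite the medial sounds in terms of the classes v, p and l."""
    def cls(ch: str) -> str:
        if ch in VOWELS:
            return "v"
        if ch in P_CLASS:
            return "p"
        if ch in L_CLASS:
            return "l"
        return "?"  # letter outside the three classes
    return "".join(cls(ch) for ch in medial_sounds(word))


print(medial_sounds("põleng"), medial_sounds("pojeng"))
print(medial_pattern("saba"), medial_pattern("hapu"), len(medial_sounds("kaabu")))
```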

Final sounds are always relevant and available for analysis, but it is not so easy to decide how many final sounds need to be considered. Sometimes one would suffice, whereas in another case (esp. in derivatives, in which the essential letters precede the suffix) as many as seven need to be taken into account - mostly as compensation for overlooking the medial sounds, cf. t'aanlane - hisp'aanlane - retorom'aanlane (in all of them the medial sounds are aanl).

The rules for type recognition are presented in Table 1:

Columns:

1: number of syllables

'+' written after the number N means 'N syllables or more'

2: final sounds

presented either in terms of sound classes (lower case letters) or as concrete letters (capitals).

Sound classes:

v=AEIOUÕÄÖÜ

c=BDFGHJKLMNPRSŠZŽTV

g=GBD

k=KPT

s=SH

n=LMNR

j=JV

'ww' means two identical vowels, 'Ww' stands for two different vowels

3: medial sounds

presented either as their number, following '|' or as sound classes:

v= AEIOUÕÄÖÜ

p=KPTFŠ

l=BDGHJLMNRSZŽV

.=any letter

'0' means that medial sounds are not considered

4: type number in CMD classification

(for parallel types two numbers are presented)

5: coverage of the rule

i.e. how many words match the rule

6: number of exceptions

i.e. words of the same pattern belonging to other types of inflection

7: examples - regular word

8: examples - irregular words with their type numbers

The set of rules is ordered: as soon as the first matching rule is found, it is applied regardless of the following ones. The lines beginning with a semicolon contain comments disregarded by the program.
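The ordered, first-match application of the rules can be sketched as follows. This is an illustration under stated assumptions, not the program itself: the syllable count is supplied by the caller, the medial-sounds column is left out (it is '0' in the rules used here), and the names Rule, matches_final and recognise are hypothetical. The five noun rules are taken from the patterns quoted in this article; their order in the sketch is arbitrary.

```python
from dataclasses import dataclass
from typing import Optional

VOWELS = set("AEIOUÕÄÖÜ")
CLASSES = {  # sound classes of the final-sounds column
    "v": VOWELS,
    "c": set("BDFGHJKLMNPRSŠZŽTV"),
    "g": set("GBD"),
    "k": set("KPT"),
    "s": set("SH"),
    "n": set("LMNR"),
    "j": set("JV"),
}


def matches_final(word: str, pattern: str) -> bool:
    """True if the end of `word` fits a final-sounds pattern such as 'cUS' or 'wwn'."""
    letters = word.upper()
    pos = len(letters)   # letters[:pos] is still unmatched
    i = len(pattern)     # pattern[:i] is still unmatched
    while i > 0:
        pair = pattern[i - 2:i] if i >= 2 else ""
        if pair in ("ww", "Ww"):  # two identical / two different vowels
            if pos < 2:
                return False
            a, b = letters[pos - 2], letters[pos - 1]
            if a not in VOWELS or b not in VOWELS or (a == b) != (pair == "ww"):
                return False
            pos, i = pos - 2, i - 2
            continue
        if pos < 1:
            return False
        token, letter = pattern[i - 1], letters[pos - 1]
        if token in CLASSES:      # lower case: sound class
            if letter not in CLASSES[token]:
                return False
        elif letter != token:     # capital: concrete letter
            return False
        pos, i = pos - 1, i - 1
    return True


@dataclass(frozen=True)
class Rule:
    syllables: str   # "3" means exactly 3, "3+" means 3 or more
    final: str       # final-sounds pattern
    infl_type: str   # CMD type number


def syllables_ok(spec: str, n: int) -> bool:
    return n >= int(spec[:-1]) if spec.endswith("+") else n == int(spec)


RULES = [  # illustrative order only
    Rule("1", "c", "22"),
    Rule("3", "cUS", "11"),
    Rule("3+", "vNE", "12"),
    Rule("2", "cvc", "02"),
    Rule("3+", "wwn", "22"),
]


def recognise(word: str, n_syllables: int, rules=RULES) -> Optional[str]:
    """Return the type of the first matching rule, or None if no rule matches."""
    for rule in rules:
        if syllables_ok(rule.syllables, n_syllables) and matches_final(word, rule.final):
            return rule.infl_type
    return None


for word, n in [("sepp", 1), ("harjutus", 3), ("oluline", 4), ("õpik", 2), ("versioon", 3)]:
    print(word, recognise(word, n))
```

Because the first match wins, a narrow rule has to be placed before any broader rule that would also accept the same words.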

Comments:

The coverage of the rules differs greatly. There are six rules covering more than 1,000 words each, one for verbs and five for nouns:

3+ v 0 → 27 (3630)

1 c 0 → 22 (2612)

3 cUS 0 → 11 (2036)

3+ vNE 0 → 12 (1928)

2 cvc 0 → 02 (1652)

3+ wwn 0 → 22 (1007)

The above six rules (relatively simple ones) cover more than 1/3 (38.4%) of all inflected words contained in CMD, accounting for the main big types. Eleven rules cover 500-1,000 words each, which accounts for 23.1% of all CMD words, and eleven more cover 200-500 words each (11.2%). Together those 28 rules cover 73% of CMD words. The remaining 89 rules cover 27% of the words and are relatively complicated. Almost all rules (except 3) referring to the medial sounds belong to this group.
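The 38.4% figure can be checked from the coverage numbers quoted above. The sketch assumes that the total number of CMD words is the sum of the regular words and exceptions given in the total lines of Table 2 (nouns 24672 + 1834, verbs 6698 + 274); the variable names are illustrative.

```python
# Coverage of the six largest rules, as listed above.
big_rules = {
    "3+ v": 3630,
    "1 c": 2612,
    "3 cUS": 2036,
    "3+ vNE": 1928,
    "2 cvc": 1652,
    "3+ wwn": 1007,
}

# Assumed total: regular words + exceptions from the total lines of Table 2.
total_words = (24672 + 1834) + (6698 + 274)

covered = sum(big_rules.values())
share = 100 * covered / total_words
print(covered, total_words, round(share, 1))

# The three groups of rules taken together (percentages as given in the text).
print(round(38.4 + 23.1 + 11.2, 1))
```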

The set of rules has been compiled with a view to the "best" result, i.e. a minimum of exceptions combined with a maximum coverage. This is why the final set of rules is rather numerous (117), while part of the rules may seem too detailed and narrow. The present set of rules covers 93% of CMD nouns and 96% of CMD verbs.

In the formation of rules I also tried to keep an eye on the productivity of phonological patterns that could render some rules eligible for inclusion despite their relatively narrow coverage in CMD. E.g. the pattern '3+ IKKUS' covers only 23 CMD words, but the ik-suffix is a productive adjectival suffix and from every ik-adjective a corresponding us-noun can be derived (ik + us → ikkus), so the pattern was set forth as a rule.

Table 2 presents the summary statistics type by type, adding a few of the more typical phonological patterns. In doing so, some of the patterns have been combined into one, e.g. the pattern '2+ cvc' actually covers two patterns from the set of rules: '2 cvc' and '3+ cvc'.

Table 2: Statistics of type recognition (by types)

Columns:

1: type

2: number of regular words

3: number of exceptions

4: %

5: major patterns

6: examples


1        2      3     %      5                    6

Noun
00       -      18    -      -                    see kes keegi mina
01       1730   149   92     '3+ IA/JA'           mütoloogia avaja
                             '3+ IKA'             matemaatika
                             '3+ cU/cI'           vallatu uimasti
02       5469   278   95     '2+ cvc'             õpik rahustav
                             '2+ cNE'             raudne monoliitne
                             '2 cg/kS'            vahend madrats
03       -      19    -      -                    vaher armas
04       -      29    -      -                    ase
05       -      115   -      -                    liige
06       342    54    86.4   '2 cE |>2'           hinne
07       127    38    77     '2 cAS'              hammas
08       -      41    -      -                    tütar sammal
09       1128   197   85.1   '2 cIS'              tennis
10       774    109   87.6   '3 vNE |=2/3'        vesine karvane
11       3167   98    97     '3 cUS'              harjutus
                             '3+ cIS'             seadeldis
12       2413   91    96.4   '3+ KE'              tilluke
                             '4+ vNE'             oluline
11~09    -      -     -      '2 cUS'              raskus
12~(10)  -      -     -      '3 LANE/LINE/MINE'   eestlane näiline pealmine
13       -      30    -      -                    suur
14       -      25    -      -                    uus
15       -      9     -      -                    käsi
16       569    148   79.4   '2 cv |=3/vp'        kaabu hapu
                             '3+ cA/cO/cE'        seljanka embargo andante
17       491    6     98.9   '2 cv vl'            saba
18       -      50    -      -                    sõda
19       331    39    89.5   '3+ cvn'             molekul
19~02    -      -     -      '3+ Wwn'             pension
20       -      16    -      -                    nimi
21       -      8     -      -                    jõgi
22       6587   139   98     '1 c'                sepp viil
                             '2+ cc'              hotell kujutelm
                             '3+ wwn'             versioon kolesterool
                             '2+ vvc'             mentool eksponaat
23       -      18    -      -                    hein
24       -      74    -      -                    padi sõber
25       1255   36    97     '2 cnIK'             ristmik
                             '3+ STIK'            kuristik
                             '3+ nIK'             õnnelik
26       289    -     100    '1 v' '2+ cvv'       boa idee
Total    24672  1834  93

Verb
27       4483   76    98.3   '2 U/E'              juurduma riknema
                             '2 A/I vl'           elama väsima
                             '3+ A/E/U'           kirjutama ragisema valmistuma
28       1345   154   89.7   '2+ I'               leppima aplodeerima
29       492    16    96.8   '2 A'                hüppama
30       176    1     99.4   '2 cLE'              riidlema
31       42     -     100    '3 ELE'              rabelema
32       9      2     81.8   '1 S'                seisma
33       7      -     100    '1 n'                naerma
34       68     -     100    '1 k |>2'            söötma
35       11     -     100    '1 k |=2'            nutma
36       65     8     89     '4+ ELE'             mõtiskelema
37       -      8     -      -                    võima
38       -      9     -      -                    sööma
Total    6698   274   96


Comments:

There are few types without exceptions - just one noun type (26) and four verb types (31, 33, 34, 35). The most regular types (98% or more regular words) are 17 and 22 for nouns and 27 and 30 for verbs. Exceptions are more numerous (over 10%) in types 06, 07, 09, 10, 16, 19 for nouns and 28, 32, 36 for verbs. The list of exceptions includes words whose phonological pattern would, according to the set of rules, attribute them to some other type. E.g. the predominant pattern in type 07 is '2 cAS' (excl. '2 JAS/KAS'), but there are also a number of words that should regularly belong to type 09: '2 cIS/cES → 09', e.g.

helves - helbe (07), cf. ilves - ilvese (09) or

kallis - kalli (07~05), cf. tellis - tellise (09).

On the other hand, there are some '2 cAS' words in types other than 07, e.g. kolmas, pagas (02), armas, ergas (03~05), pargas, linnas (09) etc.

There are many types that cannot be recognised at all by the set of rules proposed here. E.g. type 13 (keel - keele - keelt) includes only one pattern, '1 c', but the same pattern is much more numerously represented in type 22 (viil - viili - viili).

Of the 26 noun types the system fails to recognise 12 (plus all words of 00-type). As each of those types, however, contains but few words (only one type containing more than 100) they cover but 1.7% of all nouns.

Verb types are much easier to recognise than noun types. The rules are few (22) and the percentage of recognition is higher. Of the 12 verb types only two small ones remained unrecognised. Both contained one-syllable vowel-stem words, 17 words all told (0.24% of CMD verbs).
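The recognition percentages quoted in these comments follow from the totals of Table 2. A small check, assuming that for the unrecognised types the single number given in Table 2 is the word count of the type (the 1.7% figure is reproduced when the 18 words of type 00 are included); the variable names are illustrative.

```python
noun_regular, noun_exceptions = 24672, 1834
verb_regular, verb_exceptions = 6698, 274

# Word counts of the unrecognised types as given in Table 2:
# nouns 00, 03, 04, 05, 08, 13, 14, 15, 18, 20, 21, 23, 24; verbs 37, 38.
unrecognised_nouns = [18, 19, 29, 115, 41, 30, 25, 9, 50, 16, 8, 18, 74]
unrecognised_verbs = [8, 9]


def percent(part: int, whole: int) -> float:
    return 100 * part / whole


all_nouns = noun_regular + noun_exceptions
all_verbs = verb_regular + verb_exceptions

print(round(percent(noun_regular, all_nouns)))                 # nouns recognised
print(round(percent(verb_regular, all_verbs)))                 # verbs recognised
print(round(percent(sum(unrecognised_nouns), all_nouns), 1))   # nouns in unrecognised types
print(round(percent(sum(unrecognised_verbs), all_verbs), 2))   # verbs in unrecognised types
```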

Exceptions

According to the two kinds of rules the program for type recognition uses two kinds of exceptions:

  1. exceptions to word classes (noun - verb),
  2. exceptions to inflection types.

(1) As the initial form for verbs is the supine, the supine formative ma is quite efficient in distinguishing the class of verbs from that of nouns. There are, however, 53 nouns ending in ma, 15 of which have a verb-like phonological pattern (according to our set of rules): astma, muroma, mõlema, panama, pidžaama, firma, gamma, kisma, lemma, mamma, plasma, prisma, sperma, summa, trilma.

If a rule for 1-syllable verbs with a long vowel or a diphthong (1 vv → 38) were added to the set of type recognition rules, modelling the rule after the most numerous type (containing 9 words), the list of word class exceptions would have to be supplemented by 11 ma-final nouns that violate this rule: draama, duuma, kliima, kooma, laama, puuma, reuma, struuma, teema, trauma, treema.
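Word class recognition with its exception list might be sketched as below. The function and set names are hypothetical. Only the 15 verb-like nouns named above are listed; the sketch does not model how the remaining ma-final nouns are kept apart from verbs, so it would still class a word like kliima as a verb.

```python
# The 15 ma-final nouns with a verb-like pattern, as listed in the text.
NOUNS_IN_MA = {
    "astma", "muroma", "mõlema", "panama", "pidžaama",
    "firma", "gamma", "kisma", "lemma", "mamma",
    "plasma", "prisma", "sperma", "summa", "trilma",
}


def word_class(lemma: str, noun_exceptions=NOUNS_IN_MA) -> str:
    """Apply the rule 'MA -> verb' after consulting the exception list."""
    word = lemma.lower()
    if word in noun_exceptions:
        return "noun"
    return "verb" if word.endswith("ma") else "noun"


for lemma in ["elama", "firma", "saba"]:
    print(lemma, word_class(lemma))
```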

(2) The list of type exceptions contains those words that according to their phonological pattern would be attributed to the wrong type of inflection. The list has two parts - one for nouns, the other for verbs. All exceptional words are provided with the actual number of the inflection type. The following sample comes from the noun exceptions (the line for comments carries the rule violated by the following exceptions).

Sample 1: Exceptions to the type recognition rules

; 2 cv vl → 17

;

MINA 00 ,

SINA 00 ,

TEMA 00

SÜDA 04

KÄSI 15

SUSI 15

TÕSI 15

VESI 15

KSERO 16

PANI 16

IGA 18 ,

KODA 18

LUBA 18

LUGU 18

NÄGU 18

PADA 18 ,

PIDU 18 ,

RIDA 18

SADA 18

TUBA 18

VIGA 18

LUMI 20

MERI 20

TULI 20

JÕGI 21

LAGI 21

TÕBI 21

AHI 24

ASI 24

HARI 24

KABI 24

KALI 24 ,

KARI 24 ,

KIRI 24

MARI 24 ,

PADI 24

ROHI 24

VILI 24

ÕLU 24

;

; 2 cv vp → 16

;

LUKA 05

MITU 05

SETU 05 ,

VÕTI 05

;

; 3+ Wwn 0 → 19~02

;

KONVEIER 02

LIINEAL 19

LINOLEUM 22

MAUSOLEUM 22

BASSEIN 22

DETAIL 22
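A list in the format of Sample 1 could be read along the following lines. This is a sketch with a hypothetical function name, not the actual program: lines beginning with a semicolon are skipped as comments, each remaining line gives a word and its type number (possibly a parallel type such as 11~09), and a trailing comma is recorded as a flag, following the convention the article gives for homonyms that also belong to a regular type.

```python
def parse_exceptions(text: str) -> dict:
    """Return {WORD: [(type, also_regular), ...]} from an exception list."""
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):  # blank line or comment
            continue
        fields = line.split()
        also_regular = fields[-1] == ","      # trailing comma flag
        if also_regular:
            fields = fields[:-1]
        word, infl_type = fields[0], fields[1]
        entries.setdefault(word, []).append((infl_type, also_regular))
    return entries


# A few lines copied from Sample 1.
SAMPLE = """\
; 2 cv vl → 17
;
MINA 00 ,
KÄSI 15
IGA 18 ,
;
; 2 cv vp → 16
;
LUKA 05
"""

exceptions = parse_exceptions(SAMPLE)
print(exceptions)
```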

In Estonian there are also quite a few words that, despite having one and the same lemmatic form, belong to different types of inflection. The cases are of two kinds: parallel types and morphological homonyms.

In parallel types one and the same word has some regular parallel forms (different forms with the same grammatical meaning) in its inflectional paradigm. In CMD such words have got two type numbers, e.g. raskus (types 11~09) that has regular parallel forms in the plural:

raskusi, raskusisse etc. (like harjutus: type 11) and

raskuseid, raskuseisse etc. (like katus: type 09).

A similar strategy has been followed in the type recognition system: two type numbers are used in the corresponding rules as well as in the exceptions, e.g.

rule: 2 cUS 0 → 11~09 raskusi (11) ~ raskuseid (09)

exception: laius 11~09 laiusi (11) ~ laiuseid (09)

Morphological homonyms are words whose lemmatic forms are identical, but which are inflected according to different paradigms. In CMD those words are all entered as two or three separate headwords, e.g. ehe (eheda) A 02, ehe (ehte) S 06, ehe (ehtme) S 05.

In the list of exceptions to the type recognition rules a comma is used to indicate that in addition to the irregular type(s) a word also belongs to a regular type. E.g. the list of exceptions includes the words:

ehe 05 ,

ehe 06 ,

whereas the type of the third homonym (ehe : eheda) can be found by means of the rule:

2 cE vl → 02

(like kibe : kibeda and others of the kind). The number of CMD homonyms that belong to a regular as well as to an irregular type is 124 (117 nouns and 7 verbs).
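The treatment of homonyms can be sketched as a lookup that returns every type a lemma may belong to. The names are hypothetical, and the regular rule set is represented by a stand-in function passed in by the caller; here it simply returns 02, the type that the rule '2 cE vl → 02' assigns to ehe and kibe.

```python
from typing import Callable

# Exception entries as (type, also_regular); the flag records the trailing comma.
EXCEPTIONS = {
    "EHE": [("05", True), ("06", True)],
    "KÄSI": [("15", False)],
}


def inflection_types(word: str, exceptions: dict,
                     regular_type: Callable[[str], str]) -> list:
    """All types of a lemma: listed exceptions, plus the rule-based type
    when the word is unlisted or its entries carry the comma flag."""
    entries = exceptions.get(word.upper())
    if entries is None:
        return [regular_type(word)]
    types = [infl_type for infl_type, _ in entries]
    if any(flag for _, flag in entries):
        types.append(regular_type(word))
    return types


def stand_in_rules(word: str) -> str:
    """Assumed placeholder for the full rule set."""
    return "02"


print(inflection_types("ehe", EXCEPTIONS, stand_in_rules))
print(inflection_types("käsi", EXCEPTIONS, stand_in_rules))
print(inflection_types("kibe", EXCEPTIONS, stand_in_rules))
```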

Program

The program of type recognition was compiled by Peeter Lind. The algorithm is presented in Figure 1.

Conclusion

The approach described above is just one of the possible realisations of automatic type recognition. There are certainly other ways to compile the set of rules, some more general, some less. One way would be to discard the medial sounds criterion and work with just two features. Good results could also be expected of a pattern based on a combination of word syllable structure and final sounds.

A different set of rules would mean different lists of exceptions. The application of more general rules would increase the number of exceptions, whereas an increase in the detail and complexity of the rules would mean fewer exceptions. As the program is independent of the set of rules and lists of exceptions it can be applied to a totally different classification as well. Such an experiment has been carried out on the verb types of the Orthological Dictionary of Estonian (ÕS 1976).

As to the spheres of application of the type recognition system, one of the most essential is automatic morphological analysis and synthesis in the context of an open model of morphology. Such a model proceeds from the rules operating in a natural language and its aim is to find a rule-based computer presentation of everything that is regular in the language, referring to a dictionary only in those cases that are not covered by the rules applied, i.e. exceptions (Viks 1994). This means that the inflection type of all those words not included in the lists of exceptions must be recognisable automatically.

But even if the computer system is based not on rules but on a big dictionary in which every entry word is supplied with all necessary information, rules of type recognition would be necessary for complementing the dictionary with new words (as no dictionary can, in principle, include all the words used in a natural language). Automatic type recognition would help one to provide new words with the grammatical information, which in the case of a language as morphologically complex as Estonian may be far from trivial.

Another important application is practical lexicography. There is hardly a dictionary, either mono- or bilingual, with Estonian as a source language, that could do without providing the headword with morphological information: at least the type number and a couple of inflected forms to indicate the relevant stem changes and formative variants. A lexicographer's mind, however, is usually occupied mainly with quite different problems such as explanation, equivalents, examples etc. Type recognition combined with morphological synthesis enables the morphological component of the entry to be generated automatically. An experiment of this kind has been carried out (see Kuusik, Lind, Viks 1995) on the Estonian-Russian dictionary, the first volume of which is going to be published in the near future.

A third application could be envisaged in computer systems of language learning, e.g. in morphology tests (type identification).

References

ÕS 1976 = Õigekeelsussõnaraamat. Tallinn.

Viks 1992 = Ü. Viks, Väike vormisõnastik. I: Sissejuhatus & grammatika. II: Sõnastik & lisad. A Concise Morphological Dictionary of Estonian. I: Introduction & Grammar. II: The Dictionary & Appendices. Tallinn.

Viks 1994 = Ü. Viks, A morphological analyzer for the Estonian language: the possibilities and impossibilities of automatic analysis. In: Automatic Morphology of Estonian 1: 7-28.

Lind, Viks 1994 = P. Lind, Ü. Viks, MALL - the tool of a linguist. In: Automatic Morphology of Estonian 1: 49-61.

Kuusik, Lind, Viks 1995 = Evelin Kuusik, Peeter Lind, Ülle Viks, An Estonian Morpho-Generator for Dictionaries. Preprint FU 1995. Tallinn.