AUTOMATIC MORPHOLOGY OF ESTONIAN

MALL - THE TOOL OF A LINGUIST

Peeter Lind, Ülle Viks

1. WHAT IS MALL?

MALL is a computer program that enables a linguist to analyze the phonological structure of words and to detect correlations between their phonological structure and some other properties.

The first major task MALL has been applied to is type recognition: the phonological properties of the initial form (lemma) of a word are used as a clue to the place of the word in a morphological classification. The classification is taken from Viks 1992: A Concise Morphological Dictionary of Estonian (MDE).

Because the phonological and morphological properties are closely related, the phonological shape of an Estonian word usually suffices to identify the inflection type of the word (see Viks 1990). If, for example, the lemma stem of a verb consists of two syllables and ends in the sequence 'consonant + LE', it is very likely to be conjugated like r`iidle[ma (Type 30 in MDE), e.g. v`aatle[ma : vaadel[da, h`üple[ma : hüpel[da, n`õudle[ma : nõuel[da. MDE contains 175 such words, but three words of the same structure are conjugated like ela[ma (Type 27): t`aotle[ma : t`aotle[da, l`oetle[ma : l`oetle[da, n`õutle[ma : n`õutle[da. The MALL-program thus helps the linguist to find out which phonological properties correlate with the inflection type of a word, and to what extent.

More generally, MALL enables one to ascertain correlations between two groups of characteristic features: the phonological properties and some other properties of a word that are taken as distinctive features for a classification. Although the program has been written with an eye to the requirements of morphology, it is quite possible to replace morphological classification with some other one, thus enabling the researcher to study the correlation between the phonological properties and, e.g. the derivational type of a word, or between different phonological characteristics.

2. GENERAL OUTLINE

The application of MALL requires an input lexicon stored as type lists. Although this particular program has been designed on the basis of MDE, it may, in principle, be applied to other word lists as well.

The input lexicon must be stored in a single directory; the number of files for an inflection type equals the number of different stem variants in that type. File names follow the mask tp??*.txt, in which characters 3-4 give the number of the inflection type and the following 2 or 3 characters give the code of the stem variant (A-/B-stem, strong/weak stem, etc.).
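The naming convention can be illustrated with a short Python sketch. It is only an assumed reading of the mask tp??*.txt (two digits for the type, then a stem code of 2 or 3 characters); the function name and the sample file names are hypothetical.

```python
import re

# Assumed reading of the mask tp??*.txt: "tp", two digits (inflection type),
# then a stem code of 2 or 3 characters, then ".txt".
FILE_RE = re.compile(r"^tp(\d{2})([A-Za-z0-9?]{2,3})\.txt$", re.IGNORECASE)

def parse_lexicon_filename(name):
    """Return (type_number, stem_code), or None if the name does not fit the mask."""
    match = FILE_RE.match(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).upper()

print(parse_lexicon_filename("tp28AT.txt"))
```

A name that does not fit the mask yields None, so unrelated files in the same directory can simply be skipped.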

The program consists of a working module and an initialization file. The ini-file (see Suppl. 1) describes the alphabet and defines several sound classes that can be adapted to a concrete task. In essence, MALL is a database system with certain linguistic functions added.

3. PHONOLOGICAL PATTERN

3.1. Pattern generation

The MALL-program enables the user to specify, for every word in the input lexicon, six phonological features to form the phonological pattern of the word. The features are as follows:

a) number of syllables from the beginning of the word (S1),

b) number of syllables from the main stress syllable (S2),

c) degree of phonetic quantity (Vl),

d) final sounds (Lh),

e) medial sounds (Sh),

f) syllable structure of the word (Sss).

The medial and the final sounds can be considered either in the original form or in various sound classes specified by the user in the ini-file.

The features are selected and varied in dialog mode, because the relevance of a given feature for type recognition may differ from type to type. In some cases it suffices to fix just a couple of general features, whereas other cases require several features in greater detail.

For the generation of a pattern, the user is offered a menu provided with the default values of the features (see Fig. 1).

STRUKTUURIMALLI KOOSTAMINE

 1 Silpe sõna algusest (J/E)            : J
 2 Silpe rõhust alates (J/E)            : J
 3 Välde (J/E)                          : J
 4 Lõpuhäälikute arv (0..5)             : 1   ei asenda
 5 Sisehäälikud (J/E)                   : E
 6 Tüübid (Nt.00,02-07,12-34)           : 00-38
 7 Tüvekoodid [ABCD?][TN0G?][RV?]       : TP*.TXT
 8 Lemmatüved (J/E)                     : E
 9 Sõna silbistruktuur (J/E)            : E

 T         TEGUTSE
 A         Asenduste ridade vaatamine
 0 või Esc Tagasi algmenüüsse

Figure 1. Dialog screen for the generation of a pattern

Lines 1-5 and 9 specify the phonological features to be included in the pattern. Lines 6-8 serve to narrow down the set of words to be analyzed. Selection can be based on the type number (6) and on the stem code (7). To simplify the selection of initial forms, line 8 offers the lemma stems. After the final and the medial sounds have been selected, the program asks about sound class replacements. Once the initial requirements have been fixed in the form of a pattern, the program can proceed to analysis.

3.2. Word analysis

As the input lexicon is MDE, the MALL-program follows the transcription used in MDE. It differs from the orthographic spelling in that the third (overlong) quantity degree is always marked, and stress is marked wherever its position cannot be ascertained automatically. The sign ' ` ' designates the third quantity degree and ' ' ' stands for stress; both are placed before the nuclear vowel of the syllable (Viks 1992a: 8--9).

One of the basic modules of the program deals with syllabification. Several features cannot be identified until the word has been syllabified and the syllable carrying the main stress has been identified. For the rules underlying the syllabification algorithm see Suppl. 2.

The number of syllables is counted in two ways: from the beginning of the word (S1) and from the syllable carrying the main stress (S2). For genuine Estonian words, S1 and S2 usually coincide as in Estonian it is the first syllable that carries the main stress, as a rule. The difference appears in the case of foreign words, e.g. šokol`aad (S1=3, S2=1), gig`ant (S1=2, S2=1) and of such genuine Estonian words that contain a suffix with grade alternation, e.g. k`oolk`ond (S1=2, S2=1), sõbral`ik (S1=3, S2=1). Although from the phonetic point of view the main stress is often moved from the non-initial syllable to the first syllable (š'okol`aad -- not šokol'`aad, s'õbral`ik -- not sõbral'`ik), MALL overrides pronunciation in favour of the, so to say, morphological main stress.

Final sounds (Lh) are those denoted by the final letters of the word. If a word happens to be shorter than the number of positions asked for the final sounds, blanks remain to the left of them.

Medial sounds (Sh) begin from the first vowel of the main-stress syllable and end either with the sound preceding the first vowel of the next syllable, or with the end of the word (if the main stress falls on the last syllable).
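The definition of the medial sounds can be sketched in Python as follows. The sketch assumes that the word has already been syllabified and that the index of the main-stress syllable is known; the function name and the decision to keep the third-quantity mark in the result (as in the `ERN example of the paper) are assumptions of this illustration.

```python
VOWELS = set("aeiouõäöü")

def medial_sounds(syllables, stress):
    """Medial sounds (Sh) of a syllabified word in MDE transcription.

    syllables: list of syllables, e.g. ["sen", "joo", "ra"]
    stress:    index of the syllable carrying the main stress
    """
    syl = syllables[stress].lower().replace("'", "")
    first_vowel = next(i for i, ch in enumerate(syl) if ch in VOWELS)
    # Keep a third-quantity mark that stands directly before the nuclear vowel.
    start = first_vowel
    if first_vowel > 0 and syl[first_vowel - 1] == "`":
        start = first_vowel - 1
    medial = syl[start:]
    # Add the consonants of the next syllable that precede its first vowel.
    if stress + 1 < len(syllables):
        following = syllables[stress + 1].lower().replace("`", "").replace("'", "")
        for ch in following:
            if ch in VOWELS:
                break
            medial += ch
    return medial.upper()

print(medial_sounds(["sen", "joo", "ra"], 1))
```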

The quantity degree (Vl) of a word is ascertained as follows:

the word is of the third (overlong) quantity if its medial sounds are preceded by ' ` ';
the word is of the first quantity if it has just two medial sounds the first of which is a short vowel and the second is a short consonant;
all the rest of the words are of the second quantity.
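The three conditions translate almost directly into code. The following Python sketch takes the medial sounds as a string in upper case, with a leading ' ` ' for the third quantity degree; the set of short consonants is copied from line 4 of the ini-file in Suppl. 1, while the function name is hypothetical.

```python
VOWELS = set("AEIOUÕÄÖÜ")
SHORT_CONSONANTS = set("BDGHJLMNRSZŽV")  # line 4 of Mall.ini (Suppl. 1)

def quantity_degree(medial):
    """Quantity degree (Vl) derived from the medial sounds (Sh)."""
    if medial.startswith("`"):
        return 3  # overlong: the medial sounds are preceded by the mark
    if (len(medial) == 2
            and medial[0] in VOWELS
            and medial[1] in SHORT_CONSONANTS):
        return 1  # short vowel followed by a short consonant
    return 2      # everything else

for sh in ("OOR", "AT", "`ERN", "EL"):
    print(sh, quantity_degree(sh))
```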

The syllable structure (Sss) of a word is a representation of the word as a sequence of syllable types. The underlying classification is, in principle, the one suggested by P. Päll in 1986, but with slight modifications. The relevant sounds are those from the syllable nucleus (consonants preceding the first vowel are not considered) up to the syllable boundary (which may coincide with the end of the word). The syllables are classified as follows ('v' = vowel, 'c' = consonant, 's' = word-final s, '-' = syllable boundary):

   open syllable (ends in a vowel):
     -- short (contains one vowel)    v-   = Y
     -- long (contains two vowels)    vv-  = E

   closed syllable (ends in a consonant):
     -- contains one vowel:
        -- ends in a word-final s     vs-  = S
        -- ends in one consonant      vc-  = G
        -- ends in several consonants vcc  = K
     -- contains two vowels:
        -- ends in one consonant      vvc- = D
        -- ends in several consonants vvcc = T

In order to identify the syllable structure of a word the system does the following:

1) doubles the fortis stops (k, p, t) and the foreign fortis consonants (f, š) in a voiced environment, the end of a word included:

latern → lattern, laatsaret → laatsarett

2) syllabizes the word according to the rules of syllabification:

lat-tern, laat-sa-rett

3) identifies the type of each syllable according to the type descriptions of syllables:

lat-tern → GK, laat-sa-rett → DYK
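Step 3 can be sketched in Python for a word that has already passed steps 1 and 2, i.e. one that is given with doubled consonants and with syllable boundaries written as '-'. The function names are hypothetical. Treating a single word-final s after two vowels as type D is an assumption, since the table defines type S only for syllables with one vowel.

```python
VOWELS = set("aeiouõäöü")

def syllable_type(syllable, word_final):
    """Map one syllable to its type letter (Y, E, S, G, K, D or T)."""
    s = syllable.lower().replace("`", "").replace("'", "")
    # Consonants preceding the first vowel are not considered.
    start = 0
    while start < len(s) and s[start] not in VOWELS:
        start += 1
    s = s[start:]
    n_vowels = 0
    while n_vowels < len(s) and s[n_vowels] in VOWELS:
        n_vowels += 1
    coda = s[n_vowels:]
    if n_vowels not in (1, 2):
        raise ValueError("a syllable nucleus has one or two vowels: %r" % syllable)
    if not coda:                      # open syllable
        return "Y" if n_vowels == 1 else "E"
    if n_vowels == 1:                 # closed syllable, one vowel
        if coda == "s" and word_final:
            return "S"
        return "G" if len(coda) == 1 else "K"
    return "D" if len(coda) == 1 else "T"

def syllable_structure(syllabified):
    """Sss of a word written with '-' at the syllable boundaries."""
    syllables = syllabified.split("-")
    last = len(syllables) - 1
    return "".join(syllable_type(s, i == last) for i, s in enumerate(syllables))

print(syllable_structure("lat-tern"), syllable_structure("laat-sa-rett"))
```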

In the following example of analysis all properties belonging to the pattern have been found for three words: SENJOORA, LATERN and TSISTERN:

      SEN-JOO-RA  LA-TERN  TSIS-TERN
  S1     3          2         2
  S2     2          2         1
  Vl     2          2         3
  Lh     vcv        vcc       vcc
  Sh     OOR        AT        `ERN
  Sss    GEY        GK        GK

(There are three final sounds, divided into vowels and consonants; the medial sounds are presented in their original form.)

4. RESULTS

The results of the analysis can be made available for the user in two basic formats: a file and a screen format.

First, all analyzed words are recorded in a text file in which each word is supplemented by a sequence of phonological features with separators in the following order: /S1 (S2 )Sh !Vl =Lh ?Sss

This is what the above examples look like in the text file:

   SENJOORA   /  3(  2)  OOR!2=vcv?GEY
   LATERN     /  2(  2)   AT!2=vcc?GK
   TSIST`ERN  /  2(  1) `ERN!3=vcc?GK

The text file can be queried further with 'grep' or 'agrep'. It can also be processed by other programs for sorting, restructuring, etc.
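As an illustration of such further processing, the records can be split into their fields with a regular expression. The following Python sketch infers the layout solely from the separator sequence and the three sample lines above; the pattern and the field names are assumptions, and real output files may need a more tolerant expression.

```python
import re

# Layout inferred from the samples: WORD / S1( S2) Sh!Vl=Lh?Sss
RECORD_RE = re.compile(
    r"^\s*(?P<word>\S+)\s*/\s*(?P<s1>\d+)\(\s*(?P<s2>\d+)\)\s*"
    r"(?P<sh>[^!\s]*)!(?P<vl>\d)=(?P<lh>[^?]*)\?(?P<sss>\S*)\s*$"
)

def parse_record(line):
    """Return the fields of one record as a dict, or None if the line does not fit."""
    match = RECORD_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("s1", "s2", "vl"):
        fields[key] = int(fields[key])
    return fields

print(parse_record("   TSIST`ERN  /  2(  1) `ERN!3=vcc?GK"))
```

The final-sound field (Lh) is returned unstripped, so the blanks that pad short words on the left are preserved.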

Second, the screen format represents a table of patterns displaying the correlations between the patterns and the inflection types.

Figure 2 displays the results of an analysis where the inquiry concerned the lemma stems of Types 28-30 of MDE and the pattern was to include: the number of syllables from the beginning of the word (S1), the number of syllables from the stressed syllable (S2), and the quantity degree (Vl) of the word.

 Nr Kordi   M/T     T/M    Tüüp  Mall: S1  S2 Vl Lh Sh Sss

  1   20    1.33  100.00   28AT        2   2  1
  2  831   55.44   54.85   28AT        2   2  3
  3  508  100.00   33.53   29AT        2   2  3
  4  176   99.44   11.62   30AT        2   2  3
  5  175   11.67  100.00   28AT        3   2  3
  6  295   19.68   99.66   28AT        4   2  3
  7    1    0.56    0.34   30AT        4   2  3
  8  130    8.67  100.00   28AT        5   2  3
  9   39    2.60  100.00   28AT        6   2  3
 10    8    0.53  100.00   28AT        7   2  3
 11    1    0.07  100.00   28AT        8   2  3

Lõpetab ESC, ENTER näitab sõnu, P Print, PgUp, PgDn,
F1 Abi, F2, F3

Figure 2. Table of patterns

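Every line represents a combination of a type and a pattern. Column 1 contains the line number; Column 2 shows how many words of the given type correspond to the given pattern. Column 3 (M/T, pattern/type) gives the percentage of the words of the type that match the pattern, and Column 4 (T/M, type/pattern) gives the percentage of the words matching the pattern that belong to the type. In line 2, for example, the 831 words make up 55.44% of the Type 28 words in the table and 54.85% of the words that share the pattern of lines 2-4. These columns are followed by the number of the inflection type together with the stem code, and by the phonological pattern in the composition asked for.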
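The two ratios can be recomputed from the word counts alone. The Python sketch below uses the counts of Figure 2 and reads M/T as the share of a type's words that match the pattern and T/M as the share of a pattern's words that belong to the type; this reading is an interpretation that happens to reproduce the printed values, and the data structure is hypothetical.

```python
from collections import Counter

# (type with stem code, pattern as read from the S1/S2/Vl columns, word count)
ROWS = [
    ("28AT", (2, 2, 1), 20), ("28AT", (2, 2, 3), 831), ("29AT", (2, 2, 3), 508),
    ("30AT", (2, 2, 3), 176), ("28AT", (3, 2, 3), 175), ("28AT", (4, 2, 3), 295),
    ("30AT", (4, 2, 3), 1), ("28AT", (5, 2, 3), 130), ("28AT", (6, 2, 3), 39),
    ("28AT", (7, 2, 3), 8), ("28AT", (8, 2, 3), 1),
]

def ratios(rows):
    """Add the M/T and T/M percentages to every (type, pattern, count) row."""
    per_type, per_pattern = Counter(), Counter()
    for type_code, pattern, count in rows:
        per_type[type_code] += count
        per_pattern[pattern] += count
    return [
        (type_code, pattern, count,
         round(100 * count / per_type[type_code], 2),    # M/T
         round(100 * count / per_pattern[pattern], 2))   # T/M
        for type_code, pattern, count in rows
    ]

for row in ratios(ROWS):
    print(row)
```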

The user has the following options:

a) sort the table by columns (number of words, ratios, types, patterns);

b) print the whole table;

c) see what words correspond to a line;

d) store the words of a line in a separate file;

e) divide the input files into two parts depending on correlations (full or partial correlation).

5. APPLICATION: TYPE RECOGNITION

Automatic recognition of inflection types is needed to make words that are missing from the input lexicon nevertheless accessible to the rules of automatic morphology. So that the system can also be used with other dictionaries (word lists), the patterns are generated without the phonological features that are specific to MDE, i.e. the marking of the third quantity degree and stress. Thus only the number of syllables from the beginning of the word (S1), the word-final sounds (Lh) and, for some two-syllable words, the medial sounds (Sh) are taken into account. Nevertheless, those three features suffice to recognize the inflection type of a verb in about 96% of cases and of a noun in about 92% of cases. Words that can be classified into inflection types by their phonological pattern need not be included in the lexicon with type information at all; the lexicon need contain only the exceptions to the recognition rules.

Let us take, for example, declinable words of three or more syllables ending in a consonant (S1=3... & Lh=c). According to MDE they are divided among the following six inflection types: 02 (õpik), 09 (katus), 11 (harjutus), 19 (seminar), 22 (s`epp), 25 (õnnel`ik). The three final sounds divide the words into three groups (vvc, cvc and cc), within which the patterns can be refined further. For a sample of the recognition rules see Suppl. 3. The following is a closer study of one subgroup:

  Lh=vvc
      v1v2n    →  19~02 (141) * 02 (2), 19 (2), 22 (3)
      v1v1n    →   22 (1007)  * 19 (2)
      vv+^n    →   22 (605)   * 02 (4), 09 (1), 11 (15)

The distinctive feature in the 'vvc'-group is the sequence v1v2n (two different vowels + a voiced consonant), which indicates that words with such final sounds belong to two types (19 and 02) in parallel. E.g. st`aadion can be declined either like seminar (19): st`aadion : st`aadioni : st`aadioni : st`aadioni[de or like õpik (02): st`aadion : st`aadioni : st`aadioni[t : st`aadioni[te. The number of such words in MDE is 141, apart from 7 exceptions: konv`eier, biidermeier (02); karaul, liineal (19); linol`eum, mausol`eum, karbolin`eum (22).

The rest of the 'vvc'-words, a) 'v1v1n' (two identical vowels + a voiced consonant) and b) 'vv+^n' (two vowels + a voiceless consonant), belong to Type 22 (s`epp), e.g. illusi`oon, kartot`eek, sinus`oid. MDE contains 1612 such words, with 22 exceptions: kont`iinuum, v`aakuum (19); küren'aik, m`essias, paran'oik, tobias (02); skarab`eus (09); avaus, bakal`aureus, f`aatsies, g`eenius, `iileus, `ishias, k`aaries, n`oonius, n`untsius, ordin`aarius, paleus, p`ankreas, r`aadius, stradiv`aarius, teenuis (11).
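The rules for the 'vvc'-group, combined with a small lexicon of exceptions, can be sketched in Python as follows. The set of voiced consonants is class 'n' of Suppl. 1 and the three exceptions are taken from the lists above; the function name and the dictionary are hypothetical, and the caller is assumed to have checked that the word is declinable and has at least three syllables.

```python
VOWELS = set("aeiouõäöü")
VOICED = set("jlmnrzžv")          # class 'n' in Suppl. 1

# A few of the exceptions listed for MDE; a full lexicon would hold all of them.
EXCEPTIONS = {"konveier": "02", "karaul": "19", "linoleum": "22"}

def recognize_vvc(word):
    """Inflection type for a word ending in vowel + vowel + consonant.

    Returns None when the word does not end in 'vvc'.
    """
    w = word.lower().replace("`", "").replace("'", "")
    if w in EXCEPTIONS:
        return EXCEPTIONS[w]
    if len(w) < 3:
        return None
    v1, v2, c = w[-3], w[-2], w[-1]
    if v1 not in VOWELS or v2 not in VOWELS or c in VOWELS:
        return None
    if c in VOICED and v1 != v2:
        return "19~02"            # v1v2n: two types in parallel
    return "22"                   # v1v1n or vv+^n

for w in ("st`aadion", "illusi`oon", "kartot`eek", "linol`eum"):
    print(w, recognize_vvc(w))
```

Looking up the exceptions before applying the rules mirrors the idea that the lexicon need store only the words that the rules would classify wrongly.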

SUPPLEMENTS:

Supplement 1. Mall.ini -- the initiating file of the MALL-program

ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ
lcclkccllkllllklckllkllllll
AEIOUÕÄÖÜ
BDGHJLMNRSZŽV

The following are comments, which the program ignores.

Line 1 : letters accepted
Line 2 : voiced (l) and voiceless (k/c) sounds
Line 3 : vowels
Line 4 : short consonants

Line 2 is replaced by an alphabet according to the user-selected code. The maximum number of possible choices is 9.

The code (name) of a user's alphabet is on a separate line. The code can be at most 8 characters long and is enclosed in brackets '['...']'. The next line contains the user's alphabet in terms of sound classes.

Vowels/consonants

[V-C-]
vccvcccvcccccvcccccccvcvvvv

Long/short consonants

[V-Cpl]
vllvpllvlplllvpllpllpvlvvvv

Voiced/voiceless consonants

[V-Cht]
vttvtttvhthhhvthtthhtvhvvvv

Full classification of consonants

[V-C+]
vggvfgsvnknnnvknsfnnkvnvvvv

Classification of vowels

[V+C-]
xccqcccycccccqcccccccycqxqy

Stem-final sounds

[Lh]
AggEfgsIjknnnOknsfnnkUjvvvv

Medial sounds

[Sh]
vggvkgsvnknnnvknsknnkvnvvvv

Anything

[oma]
vllvpllvlplMlvpllpllpvlvvvv

Sound classes:

 v = vowels:                AEIOUÕÄÖÜ

   y = high:                IUÜ
   q = medium-high:         EOÕÖ
   x = low:                 AÄ

 c = consonants:            BDFGHJKLMNPRSŠZŽTV

   p = long:                KPTFŠ
   l = short:               BDGHJLMNRSZŽV

   h = voiced:              JLMNRZŽV
   t = voiceless:           BDFGHKPSŠT

   g = lenis:               GBD
   k = fortis:              KPT
   f = foreign fortis:      FŠ
   s = sibilants, spirants: SH
   n = voiced:              JLMNRZŽV
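How such an alphabet replaces letters by sound classes can be shown with the final sounds (Lh) as an example. The Python sketch below copies the letter line and two of the user alphabets from the ini-file above; the function name and the dictionary layout are hypothetical and do not describe how MALL reads the file.

```python
LETTERS = "ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ"          # line 1 of Mall.ini
ALPHABETS = {
    "V-C-":  "vccvcccvcccccvcccccccvcvvvv",      # vowels/consonants
    "V-Cht": "vttvtttvhthhhvthtthhtvhvvvv",      # voiced/voiceless consonants
}

def final_sounds(word, count, alphabet=None):
    """Last `count` sounds of a word, optionally replaced by sound classes.

    Quantity and stress marks are skipped; a word shorter than `count`
    is padded with blanks on the left.
    """
    letters = [ch for ch in word.upper() if ch in LETTERS]
    tail = "".join(letters[-count:]) if count > 0 else ""
    if alphabet is not None:
        table = dict(zip(LETTERS, ALPHABETS[alphabet]))
        tail = "".join(table[ch] for ch in tail)
    return tail.rjust(count)

print(final_sounds("TSIST`ERN", 3, "V-C-"))
```

Each alphabet is positional: the class of a letter is the character at the same index as the letter in line 1 of the ini-file.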

Supplement 2. Rules of syllabification

A syllable is the smallest integral unit of pronunciation within a word. A syllable consists of one or two vowels (syllable nucleus) that can be preceded and followed by consonants. The number of syllables in a word is equal to the number of syllable nuclei.
In a word it is important to differentiate between:
1. the first syllable and the non-initial syllables,
2. the syllable carrying the main stress and the rest of syllables (those carrying a secondary stress or the unstressed ones).
As a rule, the main stress falls on the first syllable in Estonian.
A non-initial syllable carries the morphological main stress if it:
1. is overlong (linol`eum) (marked as a third-quantity degree in MDE),
2. contains a long vowel (armaada),
3. carries a stress mark in MDE (fil'ipika).
If there are two or more syllables fulfilling the above conditions for a main-stress syllable, the stress is regarded as falling on the last of them (k`onst`ant).
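The conditions for the morphological main stress, and the two syllable counts S1 and S2 that depend on it, can be sketched in Python. The sketch assumes a word that is already syllabified and written in MDE transcription; a long vowel is taken to be two identical adjacent vowel letters, and the function names are hypothetical.

```python
VOWELS = set("aeiouõäöü")

def main_stress_index(syllables):
    """Index of the syllable carrying the morphological main stress."""
    index = 0  # as a rule, the first syllable
    for k, syllable in enumerate(syllables):
        if k == 0:
            continue
        plain = syllable.lower().replace("`", "").replace("'", "")
        long_vowel = any(a == b and a in VOWELS for a, b in zip(plain, plain[1:]))
        if "`" in syllable or "'" in syllable or long_vowel:
            index = k  # the last qualifying syllable wins
    return index

def syllable_counts(syllables):
    """Return (S1, S2): syllables from the word start and from the main stress."""
    return len(syllables), len(syllables) - main_stress_index(syllables)

print(syllable_counts(["šo", "ko", "l`aad"]), syllable_counts(["sen", "joo", "ra"]))
```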
The syllable nucleus consists of vowels:
- (3.1) a single vowel = syllable nucleus (e-la-sin)
- (3.2) two similar vowels = syllable nucleus (long vowel)* (kaa-lu, po-k`aal)
- (3.3) two different vowels:
  - (a) in the first syllable or in a non-initial syllable carrying the main stress = 1 syllable nucleus (diphthong) (s`ea-tud, geo-l`oogia, asa-l`ea, tera-p`eut)
  - (b) in the part of the word not carrying the main stress
    - if the second vowel is 'i' = 1 syllable nucleus (diphthong) ** (osa-vaid, ela-mui-le)
    - in other sequences = 2 syllable nuclei (2 single vowels) (ste-re-od, he-ve-a-le, l`üt-se-um)
- (3.4) three vowels yield two syllable nuclei:
  - (a) if v2=v3, then 1+2 (spi-`oon, du-`aal)
  - (b) the more usual division is 2+1 (lau-al, kuu-es)
A syllable boundary lies between two syllable nuclei:
1. if there are no consonants between two nuclei, the boundary lies between two vowels (rii-ul, ego-`ist);
2. if there are consonants between two syllable nuclei, the syllable boundary immediately precedes the last (or the only) consonant (k`art-sid, v`intsk-les, e-la-taks). ***
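The nucleus and boundary rules can be combined into a small syllabifier. The Python sketch below is a simplified illustration under stated assumptions, not the algorithm of the MALL-program: any vowel preceded by a quantity or stress mark is treated as the start of a main-stress nucleus, vowel runs longer than three are split off two at a time, and the exceptions and ambiguous cases of the notes are not handled. All names are hypothetical.

```python
VOWELS = set("aeiouõäöü")
MARKS = "`'"

def tokenize(word):
    """Return (mark, letter) pairs; a mark attaches to the letter after it."""
    tokens, mark = [], ""
    for ch in word.lower():
        if ch in MARKS:
            mark = ch
        else:
            tokens.append((mark, ch))
            mark = ""
    return tokens

def split_run(run, first):
    """Divide one uninterrupted vowel run into syllable nuclei (rules 3.1-3.4)."""
    letters = [letter for _, letter in run]
    if len(run) == 1:
        return [run]
    if len(run) == 2:
        stressed = run[0][0] != ""
        if letters[0] == letters[1] or first or stressed or letters[1] == "i":
            return [run]                      # long vowel or diphthong
        return [run[:1], run[1:]]             # two single vowels
    if len(run) == 3 and letters[1] == letters[2]:
        return [run[:1], run[1:]]             # 1+2
    return [run[:2]] + split_run(run[2:], False)  # 2+1 (longer runs: assumption)

def syllabify(word):
    tokens = tokenize(word)
    nuclei = []                               # (start, end) token ranges
    i, first = 0, True
    while i < len(tokens):
        if tokens[i][1] not in VOWELS:
            i += 1
            continue
        j = i + 1
        # A run ends at a consonant or at a vowel that carries a mark.
        while j < len(tokens) and tokens[j][1] in VOWELS and not tokens[j][0]:
            j += 1
        pos = i
        for part in split_run(tokens[i:j], first):
            nuclei.append((pos, pos + len(part)))
            pos += len(part)
        first = False
        i = j
    cuts = set()
    for (_, end), (start, _) in zip(nuclei, nuclei[1:]):
        # No consonants in between: cut between the vowels;
        # otherwise cut immediately before the last consonant.
        cuts.add(start if start == end else start - 1)
    out = []
    for k, (mark, letter) in enumerate(tokens):
        if k in cuts:
            out.append("-")
        out.append(mark + letter)
    return "".join(out)

print(syllabify("v`intskles"), syllabify("stereod"), syllabify("spi`oon"))
```

The examples in the rules mark only the boundary relevant to each rule, whereas the sketch writes every boundary, so ego`ist comes out as e-go-`ist.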

Notes:

* (3.2) exceptions: in two foreign words, two identical vowels in a non-main-stress syllable form two syllable nuclei (v`aaku-um, kont`iinu-um); variants with one vowel are in parallel use (v`aakum, kont`iinum).

** (3.3.b) a sequence ending in 'i' is also created by the suffixes istika, ist and ism if they get linked to a vowel. But as those suffixes carry the main stress they create a syllable boundary in front of them (kasu-'istika, ate-`ist, ego-`ism).

*** (syllable boundary, rule 2) an ambiguous situation may arise:

1) on the word boundary in compound words if the second member begins with a vowel (t`äis-`arv) or a consonant cluster (`öö-klubi);

2) before a final foreign component of a compound-like word if the component begins with a consonant cluster (tele-gr`amm);

3) in foreign names (Neu-stadt, Dobro-ljubov, Gorba-tšov).

Supplement 3. Sample of rules for type recognition

Symbols: v = vowel, c = consonant, n = voiced consonant, k = fortis stop, s = 'S H', ^ = 'not'. An arrow points to the type number, parentheses contain the number of words. An asterisk is followed by exceptions (type and number of words).

Lh=vvc
    v1v2n             →  19~02 (141) * 02 (2), 19 (2), 22 (3)
    v1v1n             →  22 (1007)   * 19 (2)
    vv+^n             →  22 (605)    * 02 (4), 09 (1), 11 (15)

Lh=cvc:
  Lh=cvn
    c+(EL/ER/OR)      →  02 (277)    * 19 (27)
    c+^(EL/ER/OR)     →  19 (190)    * 02 (37)
  Lh=cvk
    (n/D/ST)+IK       →  25 (1064)   * 02 (50)
    ^(n/D/ST)+IK      →  02 (47)     * 25 (3)
    c+^(IK)           →  22 (15)
  Lh=cvs
    c+IS              →  11 (148)    * 09 (12)
    c+US & S1=3       →  11 (2041)   * 11~09 (46)
         & S1=4...    →  11~09 (313) * 11 (46)
    c+^(IS/US)        →  02 (696)    * 09 (1), 11~09 (1)
  Lh=cv+^(n/k/s)      →  02 (208)    * 19 (1)

Lh=cc
    NG                →  02 (48)     * 22 (8)
    ^(NG)             →  22 (949)    * 02 (4)