RULES FOR FINDING BOUNDARIES IN COMPOUND WORDS

Indrek Hein

The aim of the paper is to find out whether, and if so to what extent, component boundaries in Estonian compound words can be found by rules that take into account only feasible and unfeasible letter sequences. A set of such simple rules could speed up the morphological analysis of such word forms: the rather time-consuming search for a boundary in a possible compound (checking the acceptability of a possible component and then searching the dictionary of stems) could be replaced by a finite state automaton that marks certain letter sequences as indicating a boundary, either without exceptions or with high probability. The rules would also be useful for applications that do not rely on morphological or semantic analysis, such as database queries and hyphenation algorithms. For example, in the word form 'päästeüksusi' the boundary can be fixed at once, because in Estonian the sequence 'eü' can occur only at compound boundaries. The sequence 'dt' may denote either a boundary ('raud+tee', 'med+töötaja', 'üld+tunnustatud') or a proper name of Germanic origin ('Schmidt', 'Markvardt', 'Rembrandt' etc.). As the examples show, the notion of a compound has been interpreted rather freely.
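The idea of marking a boundary from the letter context alone can be illustrated with a minimal sketch. It uses Python regular expressions instead of a hand-built finite state automaton, and the rule notation ('+' marks the boundary position) as well as the names `RULES`, `compile_rule` and `mark_boundaries` are assumptions made for illustration, not part of the original implementation.

```python
import re

# Rules taken from the examples in the text; '+' marks the boundary position.
RULES = ["e+ü", "d+t"]

def compile_rule(rule):
    """Turn 'e+ü' into a zero-width pattern matching between 'e' and 'ü'."""
    left, right = rule.split("+")
    pattern = ""
    if left:
        pattern += "(?<=" + re.escape(left) + ")"
    if right:
        pattern += "(?=" + re.escape(right) + ")"
    return re.compile(pattern)

COMPILED = [compile_rule(r) for r in RULES]

def mark_boundaries(word):
    """Insert '+' wherever any rule fires (the word-initial position is excluded)."""
    lowered = word.lower()
    hits = {m.start() for rx in COMPILED for m in rx.finditer(lowered)}
    return "".join(("+" if i in hits and i > 0 else "") + ch
                   for i, ch in enumerate(word))

print(mark_boundaries("päästeüksusi"))   # pääste+üksusi
print(mark_boundaries("raudtee"))        # raud+tee
print(mark_boundaries("Schmidt"))        # Schmid+t (the proper-name exception)
```

The last call shows why 'dt' cannot be treated as exceptionless: the same rule that splits 'raudtee' correctly also fires inside a Germanic proper name.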

The combinatorial analysis of the letter sequences is based on the text corpus of the Institute of the Estonian Language. More emphasis has been laid on newspaper texts («Eesti Ekspress», «Hommikuleht» and «Eesti Sõnumid»), as they display more linguistic variation. All word forms occurring in the texts were sorted alphabetically; the number after the slash gives the frequency. All compound boundaries were then marked by '+'. An excerpt from the resulting dictionary:

  • eksamineeritute/1
  • eksami+nõuete/1
  • eksami+sessiooni/1
  • eksamit/4
  • eksamita/1
  • eksamite/2
  • eksamitega/1
  • eksamitel/1
  • eksamitele/1
  • eksamitelt/1
  • eksamiteta/1
  • eksami+ukse/1
  • eksami+ülesanne/1
  • eksamplar/1 (a typo)
  • eks+ase+linna+pea/1
  • eks+direktor/1
  • eks+direktoriga/1
  • eks+direktorile/1
  • ekseemid/1

The corpus used in the analysis was 14 MB in size and consisted of 1,733,000 word forms. As the ordered dictionary has approx. 200,000 entries, the average frequency of a word form in the selection is 8.5. The respective numbers for compound words were 204,000 in the texts and 78,000 in the dictionary. Consequently, compound words make up 12% of the word forms in Estonian (newspaper) texts. One tenth of the compounds consist of more than two components; the respective figures were:

    5 components 8

    4 components 290

    3 components 7426

    2 components 70117

    The comparison between the average frequency of compounds in the text (2.6) and simple word forms (12.5) indicates that compound words are mostly used to denote more specified notions.
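The quoted percentages and averages appear to follow directly from the four corpus figures given above. The short check below is only a sketch: it assumes that the average for simple word forms was obtained by subtracting the compound counts from the totals, which the text does not state explicitly, and all variable names are illustrative.

```python
# Figures quoted in the text (rounded as given there).
tokens_total = 1_733_000      # word forms in the corpus
types_total = 200_000         # entries in the ordered dictionary (approx.)
tokens_compound = 204_000     # compound word forms in the texts
types_compound = 78_000       # compound entries in the dictionary

share_of_compounds = tokens_compound / tokens_total
avg_freq_compound = tokens_compound / types_compound
# Assumption: simple word forms = everything that is not a compound.
avg_freq_simple = (tokens_total - tokens_compound) / (types_total - types_compound)

# Number of compounds by number of components, as listed in the text.
component_counts = {2: 70117, 3: 7426, 4: 290, 5: 8}
multi = sum(n for k, n in component_counts.items() if k > 2)
share_multi = multi / sum(component_counts.values())

print(round(share_of_compounds, 2))  # 0.12
print(round(avg_freq_compound, 1))   # 2.6
print(round(avg_freq_simple, 1))     # 12.5
print(round(share_multi, 2))         # 0.1
```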

    It also appears that compounds spontaneously created in conversation are rare; most of the compounds used are established (i.e. not unique) in the language. For automatic analysis this means that a relatively large part of compound words can be included in the dictionary. This would leave the texts with only 2% of word forms that need the lengthy procedure of boundary-finding. By way of illustration, the following list gives the compound words whose first component is the relatively neutral 'öö' (night). The non-initial components are lemmatised. Words that can be found in the Orthological Dictionary (OD) are underlined; the dotted underline marks derivatives of OD entries and newer compounds that seem established enough to be included in the dictionary as well.

    öö+aja+kiri/8, öö+baar/9, öö+ekspress/1, öö+elu/4, öö+hakuks/1, öö+jook/1, öö+kapp/2, öö+klubi/18, öö+klubi+hääl/1, öö+klubi+omanik/1, öö+kreem/1, öö+kull/6, öö+kulli+topis/1, öö+kuninganna/2, öö+külm/2, öö+lokaal/7, öö+lõhn/1, öö+maja/9, öö+must/1, öö+muusik/1, öö+pikaks/1, öö+pikk+silm/6, öö+pimedus/6, öö+pood/1, öö+pott/1, öö+printsess/1, öö+päev/69, öö+päeva+ringne/14, öö+rahu/2, öö+show/1, öö+särk/7, öö+söök/1, öö+tund/3, öö+töö/2, öö+vaikus/1, öö+valve/1, öö+valvur/2, öö+video/7, öö+visiit/1, öö+õde/1.

    Summarising the frequencies of the two groups we can see that the text ratio of established compounds and occasional ones is 175 to 30.

    To find the letter sequences occurring exclusively on the boundaries of compound words a program was used that for every possible letter sequence found its frequency of occurrence on a compound boundary (positive result) versus its frequency elsewhere (negative result).

    One-letter sequences yield only prohibiting rules: as 'j', 'h' and 'õ' never occur at the end of a word form, they never mark the end of a compound word component either.

    A two-letter sequence requires three hypotheses to be checked: the boundary can be found before the sequence (+xx), after the sequence (xx+) or between the letters (x+x). As the number of possibilities grows exponentially with the length of the sequence (~3000 for two letters, ~130,000 for three letters and ~5,000,000 for four letters), sequences of more than four letters were not analysed. For two-letter sequences the results are as follows:

    Sequence   +xx (pos / neg)   x+x (pos / neg)   xx+ (pos / neg)
    iü         0 / 1             130 / 1           0 / 1
    ja         234 / 11854       0 / 8288          3840 / 13172
    jb         0 / 0             0 / 0             0 / 0

    For each hypothesis, the first number denotes positive results and the second negative results. For example, all 130 boundary occurrences of 'iü' had the boundary between the two letters, and the single occurrence elsewhere was a spelling error. The 'ja' sequence was found 234 times at the beginning of a component and 3840 times at the end of a component; the boundary never fell between the two letters, and in most cases 'ja' occurred elsewhere in the word form. The sequence 'jb' was not found at all.
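The counting step can be sketched roughly as follows. The original program is not described in detail, so this is only one possible reading: counts are weighted by word form frequency and word edges are not counted as either positive or negative. Both choices, like the function name `count_hypotheses`, are assumptions, and the sketch is not expected to reproduce the published figures exactly.

```python
from collections import Counter

def count_hypotheses(marked_words, n=2):
    """Count boundary hypotheses for all n-letter sequences.

    marked_words: iterable of (word form with '+' marks, corpus frequency).
    Returns two Counters keyed by hypothesis strings such as '+ja', 'j+a', 'ja+'.
    """
    positive, negative = Counter(), Counter()
    for marked, freq in marked_words:
        letters = marked.replace("+", "")
        # Offsets (in the unmarked word) at which a '+' was marked.
        boundaries, offset = set(), 0
        for ch in marked:
            if ch == "+":
                boundaries.add(offset)
            else:
                offset += 1
        for start in range(len(letters) - n + 1):
            seq = letters[start:start + n]
            for k in range(n + 1):            # k = boundary position inside the sequence
                pos = start + k
                if pos == 0 or pos == len(letters):
                    continue                  # assumption: word edges are not counted
                hyp = seq[:k] + "+" + seq[k:]
                target = positive if pos in boundaries else negative
                target[hyp] += freq
    return positive, negative

# Three entries from the dictionary excerpt shown earlier.
sample = [("eksami+nõuete", 1), ("eksamit", 4), ("eksami+ukse", 1)]
pos, neg = count_hypotheses(sample)
print(pos["mi+"], neg["mi+"])   # 2 4
```

For an n-letter sequence the inner loop checks n + 1 hypotheses, which is why the number of candidates grows so quickly with sequence length.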

    There is no formal criterion for selecting the best set of rules. One possible approach is to take into account the ratio of positive and negative results, but this method has several drawbacks.

    1. Most of the exceptions (negative results) can be accounted for by the occurrence of the corresponding sequence in simple words. If those are eliminated before the boundary search starts, the problem disappears.
    2. There is hardly a rule without any exceptions. Even if negative results did not show up in the corpus, they are bound to surface sooner or later.
    3. For a rule it is much more important just how many boundaries it can detect than how many exceptions it has. One can always take all the marked word forms in the dictionary and claim this to be an exceptionally good set of rules but this lacks both practical and theoretical relevance.
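To make the trade-off between exceptions and coverage concrete, the sketch below filters candidate rules by the ratio of exceptions to detected boundaries and by a minimum coverage, then orders them by coverage. It is one possible heuristic, not a criterion proposed in the paper; the thresholds (1 exception per 10 boundaries, at least 50 boundaries) are borrowed from the four-letter lists later in the text, and the counts are taken from the rule tables of the first group.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    rule: str
    hits: int      # boundaries detected (positive results)
    misses: int    # exceptions (negative results)

def select(candidates, max_ratio=0.1, min_hits=50):
    """Keep rules with at most `max_ratio` exceptions per hit and enough coverage."""
    kept = [c for c in candidates
            if c.hits >= min_hits and c.misses <= max_ratio * c.hits]
    # Coverage matters more than the exact exception count, so sort by hits.
    return sorted(kept, key=lambda c: c.hits, reverse=True)

candidates = [
    Candidate("T+P", 533, 0),
    Candidate("US+J", 534, 1),
    Candidate("S+H", 603, 358),
    Candidate("G+K", 230, 228),
    Candidate("T+G", 6, 23),
    Candidate("Ö+F", 4, 0),
]
print([c.rule for c in select(candidates)])              # ['US+J', 'T+P']
print([c.rule for c in select(candidates, min_hits=0)])  # ['US+J', 'T+P', 'Ö+F']
```

Setting `min_hits=0` keeps rare but exceptionless sequences such as 'Ö+F', which matches the view that phonotactically motivated rules are worth keeping even when they are infrequent.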

    A closer look at the letter sequences that qualify as rules enables us to differentiate between several groups. The first group of rules is phonotactically based. Those rules have practically no exceptions and should be included in the final set of rules even if the letter sequence is very rare. Also assigned to this group are rules that have exceptions among loanwords and foreign words, or in paradigms of unproductive declensions, but otherwise seem to be phonotactic. The distinction between these two sets is somewhat arbitrary. For example, although at first sight the sequence 'tg' seems quite impossible within a word, the word 'röntgen' (x-rays) is much more frequent than 't+g' on a compound boundary. At the same time, anyone can come up with an exception to the rule 't+p': 'tpruu' (whoa!). Still, the rule 't+p' has no exceptions in the corpus, while 't+g' can hardly be considered useful.

    The rules are presented in tables. The third column contains the number of boundaries covered by the rule (the corpus number of the boundaries was 85,871). The numbers of exceptions, if any, are separated by a slash. I have also tried to arrange the rules in the order of 'goodness', but there are no formal criteria.
    Rule                    Exceptions                                Frequency
    {A,E,I,O,U}+{Õ,Ä,Ö,Ü}   puänt                                     3007/4
      (any vowel of the first set combined with any of the second set)
    I+J{O,U,Õ,Ä,Ü}          -                                         550
      (any vowel except A)
    T+P                     -                                         533
    K+P                     -                                         495
    G+P                     -                                         341
    D+T                     Brandt, Landtag, …                        334
    V+P                     -                                         233
    G+H                     -                                         196
    ÖÖ+A                    -                                         191
    M+T                     -                                         154
    D+P                     -                                         127
    M+V                     -                                         126
    K+F                     -                                         99
    V+M                     -                                         41
    G+T                     -                                         38
    Ö+Ü                     -                                         29
    M+H                     -                                         28
    Ö+K                     -                                         28
    G+Õ                     -                                         16
    Ö+Õ                     -                                         7
    Ä+A                     -                                         4
    Ö+F                     -                                         4
    US+J                    soomusjad                                 534/1
    S+H                     show, ekshibitsionism, isheemia           603/358
    P+K                     knopka, papka,                            171/12
                            the -kond suffix: piiskopkond, …
                            the ki-enclitic: kappki, …
    G+J                     õlgjad                                    74/1
    L+R                     taalri, maalri, …                         141/16
    M+J                     liimjas, piimjas, …                       58/5
    K+H                     khaan, khmeerid                           81/10
    G+K                     the ki-enclitic: lepingki, poegki, …      230/228
                            the -kond suffix: ringkond, aegkond, …
    T+G                     röntgen                                   6/23
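Rules written with letter sets in braces can be translated mechanically into patterns. The sketch below does this with Python regular expressions; the function name `rule_to_regex` is hypothetical, and the translation assumes that every alternative inside a set is a single letter (Python's lookbehind requires a fixed width).

```python
import re

def rule_to_regex(rule):
    """Translate e.g. '{A,E,I,O,U}+{Õ,Ä,Ö,Ü}' or 'ÖÖ+A' into a zero-width pattern."""
    def side(text):
        parts = []
        # A token is either a brace-delimited set or a single letter.
        for token in re.findall(r"\{[^}]*\}|.", text):
            if token.startswith("{"):
                letters = [t.strip().lower() for t in token[1:-1].split(",")]
                parts.append("(?:" + "|".join(re.escape(l) for l in letters) + ")")
            else:
                parts.append(re.escape(token.lower()))
        return "".join(parts)

    left, right = rule.split("+")
    pattern = ""
    if left:
        pattern += "(?<=" + side(left) + ")"
    if right:
        pattern += "(?=" + side(right) + ")"
    return re.compile(pattern)

vowel_rule = rule_to_regex("{A,E,I,O,U}+{Õ,Ä,Ö,Ü}")
print([m.start() for m in vowel_rule.finditer("päästeüksusi")])   # [6]

night_rule = rule_to_regex("ÖÖ+A")
print([m.start() for m in night_rule.finditer("ööajakiri")])      # [2]
```

The reported offsets are the positions at which '+' would be inserted, giving 'pääste+üksusi' and 'öö+ajakiri'.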

    The second group contains rules that are the result of a combination of several circumstances such as the typical structure of a word stem in Estonian, the occurrence of certain letters in case and person suffixes, restrictions on the occurrence of 'Õ', 'Ä', 'Ö' and 'Ü' in non-initial syllables, frequency of word forms etc. E.g. the rule 'EE+J' covers only two extremely frequent word forms 'seejuures' (202) and 'seejärel' (317).
    Rule                    Exceptions                                        Frequency
    +KÕ                     kaksikõde                                         1044/5
    +PÕ                     tippõiguskaitsja                                  638/1
    +TÕ                     importõlu, importõunad                            1129/5
    +VÕ                     administratiivõigus, sugestiivõpe, korvõieline    5386/5
    +PÄ                     trumpäss, tippärimees                             5366/8
    +VÄ                     reväär, skväär                                    3026/8
    +MÄ                     paremäärmus,                                      2926/4
                            proper names ending in -mäe
    +NÄ{D, G, H, I, L, O}   -                                                 ~1000
      (all except the suffix -när/-näär: aktsionär, etc.)
    +JÕ                     -                                                 876
      (partial overlap with rule I+JÕ)
    EE+J                    -                                                 571
    +HÄ                     -                                                 496
    OO+A                    -                                                 355
    EA+O                    -                                                 145
    I+UU                    -                                                 142
    ÄE+O                    -                                                 109
    +HÕ                     -                                                 67
    +VÖ                     -                                                 50
    +NÕ                     vaseliinõli                                       916/1
    +KÄ                     vasakäärmus                                       997/5
    +PÜ                     kupüür, tippüritus                                345/14
    +PÖ                     epopöa, pompöösne                                 126/17
    +HÜ                     formaldehüüd                                      86/3

    The third group is mainly the result of an analysis of the word form frequencies. Many of the words carry political or journalistic connotations ('OND+ER' for koonderakond and 'ISA+MAA', both names of political parties; 'NE+MAA' for Venemaa (Russia); 'AJA+KIR' for ajakirjandus/ajakirjanik (journal, journalist, journalistic)). Some of the rules added here could, in principle, belong to one of the previous groups, but their context has been extended to reduce the number of exceptions ('+SÜS' and '+SÜN').
    Rule                    Exceptions            Frequency
    +SÜ{S, N}               -                     584
    I+RÄ                    -                     439
    I+TÖ                    -                     427
    {A, E, I, O}+TÜ         atatürk, trotüül      317/9
    LE+A                    asalea                314/2
    L+PO                    kolposkoopia          305/5
    O+AJ                    -                     286
    GE+P                    Taagepera             266
    E+AAST                  -                     204
    TS+P                    -                     182
    A+EE                    -                     127/1
    ÄE+{A, P, R}            -                     147
    IU+P                    -                     146
    UP+M                    -                     133
    A+UU                    -                     31/2

    The following rules of the third group are presented in a list:

    {A,E,I,U}+PEA

    +AMETI

    +BÜRO

    +FILM

    +FIR

    +GRUP

    +KIRJ

    +MINISTE

    +POOL

    +PROG

    +RÄÄ

    +RÄH

    +SUGU

    +TÖÖ

    A+LEHE

    A+LEHT

    AA+IL

    AJA+KIR

    AJA+LOO

    AJA+LU

    AKTSIA+SE

    ARU+SA

    AS+AEG

    ASJA+O

    AU+HI

    BI+EL

    BI+KAA

    DU+MAA

    EEL+ARV

    EES+KUJ

    EESTI+M

    GI+KO

    HE+KO

    I+MEES

    ISA+MAA

    ISE+EN

    JA+MAA

    JA+PI

    KÄES+O

    KOOS+S

    KSA+MAA

    LE+OL

    LGE+O

    LIS+MAA

    LU+KO

    MAA+VA

    MITTE+

    NE+MAA

    OMA+E

    OMA+KA

    OND+ER

    ÖÖ+KO

    ÕU+KO

    PEA+A

    PEA+DI

    S+TÖÖ

    SE+LO

    SE+PR

    SI+ALG

    ST+KI

    ST+KU

    TE+VAH

    TE+VAA

    TI+MAA

    TO+JU

    US+MAA

    US+VA

    The following rules could be used if the rules are ordered or when the logical operator 'not' can be used:

    S+ÕIGU > +SÕ

    NIM+Õ > +MÕ

    AAL+Õ > +LÕ

    EEL+Õ > +LÕ

    OOL+Õ > +LÕ

    M+ÜH > +MÜ
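One way to read the notation above is that the more specific rule on the left takes priority over the general rule on the right. The sketch below implements that reading: rules are applied in priority order, and once a rule has fired, no later rule may place a boundary inside the letter context it matched. This interpretation, and the names `ORDERED_RULES` and `apply_ordered`, are assumptions made for illustration.

```python
import re

# Rules in priority order (earlier rules win); notation as in the text, lower-cased.
ORDERED_RULES = ["nim+õ", "+mõ"]

def apply_ordered(word, rules=ORDERED_RULES):
    boundaries = set()
    # blocked[i] is True if gap i lies inside the context of a rule that already fired.
    blocked = [False] * (len(word) + 1)
    for rule in rules:
        left, right = rule.split("+")
        # The lookahead lets finditer report overlapping occurrences as well.
        pattern = re.compile("(?=" + re.escape(left + right) + ")")
        for m in pattern.finditer(word):
            start = m.start()
            pos = start + len(left)               # gap where '+' would go
            end = start + len(left) + len(right)
            if pos == 0 or blocked[pos]:
                continue
            boundaries.add(pos)
            for i in range(start, end + 1):
                blocked[i] = True
    return "".join(("+" if i in boundaries else "") + ch
                   for i, ch in enumerate(word))

print(apply_ordered("inimõigused"))            # inim+õigused
print(apply_ordered("inimõigused", ["+mõ"]))   # ini+mõigused (wrong without ordering)
```

With the ordering in place, the general rule '+MÕ' no longer produces a false boundary in forms of 'inimõigused', at the cost of an extra bookkeeping pass per rule.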

    A similar analysis has been carried out using different material (programs and implementation by Indrek Kiissel). The analysis was applied to the compound words contained in the Orthological Dictionary. There are two main reasons for differences in the resulting rules: first, OD is poor in proper names, neologisms, foreign words and terminology, and, second, OD does not inform the user of word frequencies.

    According to OD, there are 55 two-letter combinations (x+x) that without exceptions mark the compound boundary:

    I+Õ 234

    D+T 228

    A+Õ 183

    E+Õ 162

    D+P 150

    U+Õ 110

    E+Ü 96

    I+Ü 96

    A+Ü 83

    V+P 81

    V+V 52

    U+Ü 50

    A+Ä 33

    E+Ä 32

    A+Ö 28

    D+B 28

    I+Ä 22

    P+H 20

    T+D 17

    K+B 16

    E+Ö 13

    O+Õ 10

    Ä+A 10

    Ö+Õ 9

    D+D 7

    P+F 6

    K+G 5

    V+G 5

    V+F 5

    F+K 4

    H+P 4

    U+Ö 4

    Ö+Ü 4

    S+Ž 3

    D+Ž 2

    G+F 2

    O+Ö 2

    P+B 2

    P+G 2

    Š+T 2

    V+B 2

    G+G 1

    H+S 1

    K+Š 1

    M+Z 1

    O+Ä 1

    P+Š 1

    S+Š 1

    Š+K 1

    Š+P 1

    Š+R 1

    Ž+F 1

    Ž+J 1

    Ž+S 1

    Ä+Ž 1

    52 combinations had more positive results than negative ones. For every rule, the number of positive results, the number of negative results and their ratio (negative to positive) are given.

    K+P 229 2 0.009

    T+P 274 3 0.011

    G+P 115 2 0.017

    V+M 55 1 0.018

    S+D 50 1 0.02

    S+H 406 10 0.025

    M+H 40 1 0.025

    T+B 22 1 0.045

    N+P 88 4 0.045

    S+R 542 25 0.046

    T+H 85 4 0.047

    G+Õ 20 1 0.05

    M+T 83 5 0.06

    G+T 81 5 0.062

    M+V 92 6 0.065

    L+R 123 8 0.065

    D+K 250 18 0.072

    K+F 12 1 0.083

    P+K 83 7 0.084

    G+H 42 4 0.095

    D+H 61 6 0.098

    K+H 96 11 0.115

    U+Ä 17 2 0.118

    M+K 176 23 0.131

    N+M 52 7 0.135

    V+K 128 18 0.141

    D+F 7 1 0.143

    G+K 144 21 0.146

    M+R 31 5 0.161

    O+Ü 16 3 0.188

    N+B 5 1 0.200

    B+P 5 1 0.200

    T+F 18 4 0.222

    D+Õ 17 4 0.235

    S+G 41 10 0.244

    P+D 4 1 0.25

    U+O 131 33 0.252

    B+V 7 2 0.286

    V+T 47 14 0.298

    B+K 19 6 0.316

    O+Z 3 1 0.333

    D+G 3 1 0.333

    T+V 185 71 0.384

    S+V 1097 478 0.436

    S+B 57 25 0.439

    I+P 1705 785 0.460

    V+H 36 17 0.472

    L+H 109 52 0.477

    W+P 4 2 0.5

    M+G 2 1 0.5

    K+D 10 5 0.5

    G+Ä 2 1 0.5

    The worst rules were:

    R+I 21 11035 525.4

    V+E 7 3689 527.0

    P+Õ 3 1588 529.3

    N+D 12 6925 577.0

    M+E 10 6028 602.8

    M+Ä 2 1233 616.5

    V+U 2 1253 626.5

    P+Ä 2 1328 664.0

    G+U 4 3037 759.2

    N+I 8 6083 760.3

    B+I 2 1804 902.0

    P+I 5 4528 905.6

    K+U 7 6524 932.0

    P+O 3 2800 933.3

    Ü+H 1 1082 1082.0

    H+K 1 1247 1247.0

    T+U 9 11907 1323.0

    N+E 14 20132 1438.0

    M+I 8 12024 1503.0

    P+U 2 3392 1696.0

    B+E 1 1963 1963.0

    Ü+L 1 1986 1986.0

    Ä+Ä 1 1999 1999.0

    N+U 1 2054 2054.0

    V+Ä 1 2123 2123.0

    V+I 2 5333 2666.5

    M+U 1 3104 3104.0

    H+A 1 4187 4187.0

    For the purpose of comparison we will also present two lists of four-letter sequences conforming to the pattern xx+xx. The first of them presents such letter combinations that occurred exclusively on compound boundaries 50 times or more:

    IS+VÄ 152

    US+MA 133

    IS+VI 101

    TE+VA 91

    LE+KA 88

    US+VI 82

    US+VÄ 82

    SE+VA 75

    US+KI 73

    US+TÖ 71

    UU+VI 68

    NA+KO 66

    IS+PU 60

    ME+KA 59

    LA+KO 56

    IS+AE 55

    SE+KI 55

    US+PI 55

    US+RA 54

    US+RI 53

    US+PA 52

    JA+VA 51

    UD+TE 51

    US+AS 51

    US+PU 51

    US+SÄ 51

    LI+VA 50

    SE+KO 50

    The second list contains sequences with exceptions, but no more than 1 exception to 10 recognised boundaries:

    US+VA 170 1 0.006

    SE+KA 100 1 0.010

    IS+VA 93 1 0.011

    JA+KO 58 1 0.017

    TE+KA 56 1 0.018

    SI+VA 54 1 0.019

    SI+PU 54 1 0.019

    DE+VA 54 1 0.019

    VA+KA 43 1 0.023

    US+ME 43 1 0.023

    JA+TE 43 1 0.023

    IS+PI 42 1 0.024

    KU+VA 40 1 0.025

    SI+TE 38 1 0.026

    NI+TE 39 1 0.026

    SA+VA 37 1 0.027

    RI+KI 37 1 0.027

    US+KE 36 1 0.028

    LI+PA 36 1 0.028

    AL+MA 34 1 0.029

    US+TO 35 1 0.029

    NI+LA 35 1 0.029

    ER+KA 35 1 0.029

    TI+KO 31 1 0.032

    DI+MA 31 1 0.032

    RI+LE 30 1 0.033

    US+SA 29 1 0.034

    SE+VI 29 1 0.034

    SE+RA 28 1 0.036

    SI+PI 27 1 0.037

    NA+SA 27 1 0.037

    MA+KI 27 1 0.037

    LA+PA 27 1 0.037

    AS+PU 27 1 0.037

    IS+KI 76 3 0.039

    TE+SA 25 1 0.040

    KU+KA 25 1 0.040

    LE+HA 24 1 0.042

    HA+VA 24 1 0.042

    NI+PU 48 2 0.042

    HA+KU 23 1 0.043

    US+PE 22 1 0.045

    TE+TE 22 1 0.045

    SI+ME 21 1 0.048

    SE+TO 21 1 0.048

    HE+LA 21 1 0.048

    BA+MA 21 1 0.048

    NI+SI 20 1 0.050

    ME+KE 20 1 0.050

    ER+KO 20 1 0.050

    EL+AR 20 1 0.050

    BI+VA 20 1 0.050

    US+IN 19 1 0.053

    NA+LE 19 1 0.053

    NA+KE 18 1 0.056

    AS+LI 18 1 0.056

    SI+PA 36 2 0.056

    US+SE 90 5 0.056

    DI+VA 35 2 0.057

    LI+RI 17 1 0.059

    KI+KA 17 1 0.059

    GU+PI 17 1 0.059

    DI+TU 17 1 0.059

    RA+MA 33 2 0.061

    SI+SI 16 1 0.063

    LU+TE 16 1 0.063

    KA+ME 16 1 0.063

    DE+PA 16 1 0.063

    DU+KA 32 2 0.063

    SI+KL 15 1 0.067

    ON+KA 15 1 0.067

    NI+TO 15 1 0.067

    NA+VE 15 1 0.067

    KU+TU 15 1 0.067

    DU+VA 15 1 0.067

    LU+PI 14 1 0.071

    KE+LA 14 1 0.071

    IN+HA 14 1 0.071

    HI+LA 14 1 0.071

    AE+KA 14 1 0.071

    AL+KA 28 2 0.071

    LU+MA 56 4 0.071

    TU+KU 13 1 0.077

    TU+VA 13 1 0.077

    RI+AR 13 1 0.077

    OR+SO 13 1 0.077

    JA+AR 13 1 0.077

    GI+ME 13 1 0.077

    ER+PI 13 1 0.077

    ES+PI 13 1 0.077

    IS+PA 39 3 0.077

    LI+PI 38 3 0.079

    IS+MA 114 9 0.079

    US+KA 139 11 0.079

    SE+AS 12 1 0.083

    RE+VY 12 1 0.083

    GA+TA 12 1 0.083

    GA+VI 12 1 0.083

    EL+SE 12 1 0.083

    BE+KA 12 1 0.083

    BI+KA 12 1 0.083

    DA+VA 23 2 0.087

    AS+PI 23 2 0.087

    TI+PE 11 1 0.091

    TI+PI 11 1 0.091

    OO+TA 11 1 0.091

    MA+VO 11 1 0.091

    KS+PA 11 1 0.091

    IK+VA 11 1 0.091

    IS+SO 11 1 0.091

    HU+LA 11 1 0.091

    EO+TE 11 1 0.091

    NU+KA 22 2 0.091

    RA+PU 22 2 0.091

    NA+PA 22 2 0.091

    NA+PO 22 2 0.091

    NA+MA 43 4 0.093

    VA+TE 32 3 0.094

    US+LU 21 2 0.095

    GI+LA 21 2 0.095

    RI+PA 42 4 0.095

    RI+LA 52 5 0.096

    RI+PU 41 4 0.098

    Conclusion

    1. The selection of the rules, their number and realisation all depend on the target application requirements.
    2. The most reasonable place of such a rule system in the framework of automatic morphological analysis would be immediately before compound word analysis. This would exempt already lemmatised (recognised) words from the necessity to pass this stage of analysis.
    3. The rules obtained can be divided into two groups. The first could include exceptionless rules with sufficient coverage to be applied at once. In the case of a positive result the word form in question could be directly submitted to the compound analysis. The expedience of this solution depends on the way of realisation of the analysis algorithm, and mostly on the working speed of different components of the analysis.
    4. As to the number of rules to be applied one must reach a reasonable compromise. If the rules are too many, the analysis will slow down as the same set of rules is applied to all words, mostly without avail. If the rules are too many or too complex, dictionary search could turn out to be the quicker way. On the other hand, every new stage added to the automatic analysis will make it slower, therefore the rules must find enough compound boundaries to pay off.
    5. Implementing the rules as a finite state automaton makes it almost impossible to order the rules. For example, '+MÕ' would be a fine rule if it could be triggered only after the rule 'NIM+Õ', which would eliminate most of the exceptions, represented by the various forms of the word 'inimõigused' (human rights). Where ordered rules give better results, speed will suffer.
    6. The rules could be more complex. The generality of the rules would increase considerably if:
      • the rules could overlap;
      • several boundaries could be covered by one rule;
      • the beginning and end of a word were marked and used in the rules;
      • inhibiting rules could be introduced.
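Two of the extensions listed above, word-edge markers and inhibiting rules, can be sketched together. In the sketch '#' stands for a word edge, and an inhibiting rule is written like an ordinary rule but forbids a boundary at the marked position. The inhibitor 'd+t#' (no boundary before a word-final 't' after 'd') is a hypothetical example motivated by the proper names quoted in the introduction; it is not a rule proposed in the paper, and the function names are likewise illustrative.

```python
import re

def positions(padded, rule):
    """Yield the indices in `padded` at which the '+' of `rule` falls."""
    left, right = rule.split("+")
    for m in re.finditer("(?=" + re.escape(left + right) + ")", padded):
        yield m.start() + len(left)

def mark(word, rules, inhibitors):
    padded = "#" + word.lower() + "#"          # '#' marks both word edges
    allowed = {p for r in rules for p in positions(padded, r)}
    banned = {p for r in inhibitors for p in positions(padded, r)}
    # Padded index p is the gap before word[p - 1]; gaps at the word edges are skipped.
    cuts = {p - 1 for p in allowed - banned if 1 < p < len(padded) - 1}
    return "".join(("+" if i in cuts else "") + ch for i, ch in enumerate(word))

RULES = ["d+t"]
INHIBITORS = ["d+t#"]   # hypothetical: block 'd+t' when 'dt' ends the word

print(mark("raudtee", RULES, INHIBITORS))   # raud+tee
print(mark("Schmidt", RULES, INHIBITORS))   # Schmidt
```

Whether such inhibitors pay off would depend, as with the other rules, on how much they slow down the matching step.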

    The rules proposed here cover approx. 45% of the compound boundaries while the addition of some rules without noticeable slowdown of the process would probably raise the efficiency to 60% of all compound boundaries.