RULES FOR FINDING BOUNDARIES IN COMPOUND WORDS

Indrek Hein

The aim of the paper is to find out whether, and if so to what extent, component boundaries in Estonian compound words can be found by rules that take into account only feasible and unfeasible letter sequences. A set of such simple rules could evidently speed up the morphological analysis of word forms: the rather time-consuming search for a boundary in a possible compound (checking the acceptability of a possible component and then searching for it in the dictionary of stems) could be replaced by a finite state automaton that marks certain letter sequences as indicating a boundary, either without exceptions or with high probability. Such rules would also be useful for applications that do not rely on morphological or semantic analysis, such as database queries and hyphenation algorithms. E.g. in the word form 'päästeüksusi' the boundary can be fixed at once, as in Estonian the sequence 'eü' can occur only at compound boundaries. The sequence 'dt' may denote either a boundary ('raud+tee', 'med+töötaja', 'üld+tunnustatud') or a proper name of Germanic origin ('Schmidt', 'Markvardt', 'Rembrandt' etc.). As the examples show, the notion of a compound has been interpreted rather freely.

The combinatorial analysis of the letter sequences is based on the text corpus of the Institute of the Estonian Language. More emphasis has been laid on newspaper texts («Eesti Ekspress», «Hommikuleht» and «Eesti Sõnumid») as they display more linguistic variation. All word forms occurring in the texts were sorted alphabetically; the number after the slash gives the frequency. All compound boundaries were then marked by '+'. An excerpt from the resulting dictionary:

eksamineeritute/1

eksami+nõuete/1

eksami+sessiooni/1

eksamit/4

eksamita/1

eksamite/2

eksamitega/1

eksamitel/1

eksamitele/1

eksamitelt/1

eksamiteta/1

eksami+ukse/1

eksami+ülesanne/1

eksamplar/1 (a typo)

eks+ase+linna+pea/1

eks+direktor/1

eks+direktoriga/1

eks+direktorile/1

ekseemid/1

The size of the corpus used in the analysis was 14 Mbytes, consisting of 1,733,000 word forms. As the ordered dictionary has approx. 200,000 entries, the average frequency of a word form in the selection is 8.5. The respective numbers for compound words were 204,000 in the texts and 78,000 in the dictionary. Consequently the percentage of compound words in Estonian (newspaper) texts is 12%. One tenth of the compounds consist of more than two components; the respective figures were

5 components 8

4 components 290

3 components 7426

2 components 70117

The comparison between the average frequency of compounds in the text (2.6) and that of simple word forms (12.5) indicates that compound words are mostly used to denote more specific notions.
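The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers given in the text; the variable names are illustrative. It also shows that the component table implies 85,871 boundaries in the compound entries (an entry with k components has k - 1 boundaries), which is the number quoted later in the paper as the corpus number of boundaries.

```python
# Figures taken from the text; the names are illustrative only.
word_forms_in_text = 1_733_000
dictionary_entries = 200_000
compounds_in_text = 204_000
compound_entries = 78_000

# Compound entries by number of components, as listed in the table above.
entries_by_components = {5: 8, 4: 290, 3: 7426, 2: 70117}

compound_share = compounds_in_text / word_forms_in_text
avg_freq_compound = compounds_in_text / compound_entries
avg_freq_simple = (word_forms_in_text - compounds_in_text) / (
    dictionary_entries - compound_entries
)

total_entries = sum(entries_by_components.values())
multi_share = (
    sum(n for k, n in entries_by_components.items() if k > 2) / total_entries
)
# An entry with k components contains k - 1 boundaries.
boundaries = sum((k - 1) * n for k, n in entries_by_components.items())

print(f"compound share of running text: {compound_share:.1%}")
print(f"average frequency, compounds:   {avg_freq_compound:.1f}")
print(f"average frequency, simple:      {avg_freq_simple:.1f}")
print(f"entries with 3+ components:     {multi_share:.1%}")
print(f"boundaries in compound entries: {boundaries}")
```

The rounded dictionary figures (200,000 and 78,000) are approximations in the text, so the derived averages agree with the quoted ones only to the precision shown.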

It also appears that compounds spontaneously created in conversation are rare; most of the compounds used are established (i.e. not unique) in the language. For automatic analysis this means that a relatively large part of compound words can be included in the dictionary. This would leave the texts with only 2% of word forms that need the lengthy procedure of boundary-finding. By way of illustration, the following list gives the compound words whose first component is the relatively neutral 'öö' ("night"). The non-initial components are lemmatised. Words that can be found in the Orthological Dictionary (OD) are underlined; the dotted underline marks derivatives of OD entries and such newer compounds that seem to be established enough to be included in the dictionary as well.

öö+aja+kiri/8, öö+baar/9, öö+ekspress/1, öö+elu/4, öö+hakuks/1, öö+jook/1, öö+kapp/2, öö+klubi/18, öö+klubi+hääl/1, öö+klubi+omanik/1, öö+kreem/1, öö+kull/6, öö+kulli+topis/1, öö+kuninganna/2, öö+külm/2, öö+lokaal/7, öö+lõhn/1, öö+maja/9, öö+must/1, öö+muusik/1, öö+pikaks/1, öö+pikk+silm/6, öö+pimedus/6, öö+pood/1, öö+pott/1, öö+printsess/1, öö+päev/69, öö+päeva+ringne/14, öö+rahu/2, öö+show/1, öö+särk/7, öö+söök/1, öö+tund/3, öö+töö/2, öö+vaikus/1, öö+valve/1, öö+valvur/2, öö+video/7, öö+visiit/1, öö+õde/1.

Summarising the frequencies of the two groups we can see that the text ratio of established compounds and occasional ones is 175 to 30.

To find the letter sequences occurring exclusively on the boundaries of compound words a program was used that for every possible letter sequence found its frequency of occurrence on a compound boundary (positive result) versus its frequency elsewhere (negative result).

One-letter sequences yield only forbidding rules: as 'j', 'h' and 'õ' never occur at the end of a word form, they never mark the end of a compound word component either.
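A forbidding rule of this kind can be used to discard split points before any dictionary lookup. A minimal sketch, assuming lower-case input; the function name `candidate_splits` is hypothetical:

```python
# Letters that, according to the text, never end a word form or a component.
FORBIDDEN_FINAL = set("jhõ")

def candidate_splits(word):
    """Return the positions at which a component boundary is not ruled out.

    Position i stands for a boundary between word[:i] and word[i:].
    """
    return [
        i
        for i in range(1, len(word))
        if word[i - 1] not in FORBIDDEN_FINAL
    ]

# Position 2 ('aj' + 'akiri') is excluded because the left part ends in 'j'.
print(candidate_splits("ajakiri"))
```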

A two-letter sequence requires three hypotheses to be checked: the boundary can be found before the sequence (+xx), after the sequence (xx+) or between the letters (x+x). As the number of possibilities grows exponentially (for two letters ~3000, for three letters ~130,000, the number of different four-letter sequences being ~5,000,000), sequences of more than four letters were not analysed. For two-letter sequences the results are as follows:

Hypothesis	Pos.	Neg.	Hypothesis	Pos.	Neg.	Hypothesis	Pos.	Neg.
+iü	0	1	i+ü	130	1	iü+	0	1
+ja	234	11854	j+a	0	8288	ja+	3840	13172
+jb	0	0	j+b	0	0	jb+	0	0

The first number after the hypothesis denotes positive results, whereas the second stands for negative results. E.g. 130 'iü' sequences occurred on a compound boundary, each time with the boundary running between the two letters (the only exception was a spelling error). The 'ja' sequence was found 234 times at the beginning of a component and 3840 times at the end of a component; not once did the boundary run between the two letters, and in most cases 'ja' was found elsewhere in the word form. The sequence 'jb' was not found at all.
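The counting step can be sketched as follows. This is not the program used for the paper, whose exact counting convention is not described. The sketch assumes that an occurrence of a sequence counts as positive for a hypothesis when exactly one boundary touches it, at the hypothesised place, and as negative for every hypothesis when no boundary touches it. Each dictionary entry is counted once. The names `boundary_counts` and `hypothesis_count` are hypothetical, and the 32-letter alphabet in `hypothesis_count` is an assumption.

```python
from collections import Counter

def hypothesis_count(n, alphabet_size=32):
    """Number of boundary hypotheses for n-letter sequences (assumed alphabet size)."""
    return (n + 1) * alphabet_size ** n

def boundary_counts(entries, n=2):
    """Count positive and negative results per hypothesis such as 'i+ü' or '+ja'."""
    positive, negative = Counter(), Counter()
    for entry in entries:
        marked = entry.split("/")[0]          # drop the frequency after the slash
        parts = marked.split("+")
        bare = "".join(parts)
        cuts, offset = set(), 0
        for part in parts[:-1]:               # boundary offsets in the unmarked word
            offset += len(part)
            cuts.add(offset)
        for i in range(len(bare) - n + 1):
            seq = bare[i:i + n]
            touching = [k for k in range(n + 1) if i + k in cuts]
            if not touching:                  # no boundary: negative for every hypothesis
                for k in range(n + 1):
                    negative[seq[:k] + "+" + seq[k:]] += 1
            elif len(touching) == 1:          # exactly one boundary: one positive result
                k = touching[0]
                positive[seq[:k] + "+" + seq[k:]] += 1
    return positive, negative

sample = ["eksami+ukse/1", "eks+direktor/1", "ekseemid/1"]
positive, negative = boundary_counts(sample)
print(positive["i+u"], positive["ks+"], negative["ks+"])
print([hypothesis_count(n) for n in (2, 3, 4)])
```

Under the 32-letter assumption, `hypothesis_count` gives 3,072, 131,072 and 5,242,880 for two, three and four letters, which is of the same order as the approximate figures quoted above.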

There is no formal criterion for selecting the best set of rules. One possible approach is to take into account the ratio of positive and negative results, but this method has several drawbacks.

Most of the exceptions (negative results) can be accounted for by the occurrence of the corresponding sequence in simple words. If those are eliminated before the boundary search starts, the problem disappears.
There is hardly a rule without any exceptions. Even if negative results did not show up in the corpus, they are bound to surface sooner or later.
For a rule it is much more important just how many boundaries it can detect than how many exceptions it has. One can always take all the marked word forms in the dictionary and claim this to be an exceptionally good set of rules but this lacks both practical and theoretical relevance.

A closer look at the letter sequences that qualify as rules enables us to differentiate between several groups. The first group of rules is phonotactically based. Those rules have practically no exceptions and should be included in the final set of rules even if the letter sequence is very rare. Assigned to the same group are rules that have exceptions among loanwords and foreign words, or in paradigms of unproductive declensions, but otherwise seem to be phonotactic. The distinction between these two sets is somewhat arbitrary. E.g. although at first sight the sequence 'tg' seems quite impossible within a word, the word 'röntgen' ("X-rays") is much more frequent than 't+g' on a compound boundary. At the same time anyone can come up with an exception to the rule 't+p': 'tpruu' ("whoa!"). Still, the rule 't+p' has no exceptions in the corpus, while 't+g' can hardly be considered useful.

The rules are presented in tables. The third column contains the number of boundaries covered by the rule (the corpus number of the boundaries was 85,871). The numbers of exceptions, if any, are separated by a slash. I have also tried to arrange the rules in the order of 'goodness', but there are no formal criteria.

Rule	Exceptions	Frequency
{A,E,I,O,U}+{Õ,Ä,Ö,Ü} (any vowel of the first set combined with any of the second set)	puänt	3007/4
I+J{O,U,Õ,Ä,Ü} (any vowel except A)		550
T+P		533
K+P		495
G+P		341
D+T	Brandt, Landtag, …	334
V+P		233
G+H		196
ÖÖ+A		191
M+T		154
D+P		127
M+V		126
K+F		99
V+M		41
G+T		38
Ö+Ü		29
M+H		28
Ö+K		28
G+Õ		16
Ö+Õ		7
Ä+A		4
Ö+F		4

US+J	soomusjad	534/1
S+H	show, ekshibitsionism, isheemia	603/358
P+K	knopka, papka; the -kond suffix: piiskopkond, …; the ki-enclitic: kappki, …	171/12
G+J	õlgjad	74/1
L+R	taalri, maalri, …	141/16
M+J	liimjas, piimjas, …	58/5
K+H	khaan, khmeerid	81/10
G+K	the ki-enclitic: lepingki, poegki, …; the -kond suffix: ringkond, aegkond, …	230/228
T+G	röntgen	6/23
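Rules of this kind translate directly into patterns over letters. The sketch below is an illustration only, not the author's implementation: it uses Python's `re` module in place of a hand-built finite state automaton, the functions `compile_rule` and `mark_boundaries` are hypothetical names, and only three rules from the table are loaded.

```python
import re

def compile_rule(rule):
    """Compile a rule such as 'D+T' or '{A,E}+{Õ,Ü}' into a zero-width pattern."""
    def side(text):
        # '{A,E,I}' becomes the character class '[aei]'; other letters stay literal.
        return re.sub(
            r"\{([^}]*)\}",
            lambda m: "[" + "".join(x.strip() for x in m.group(1).split(",")) + "]",
            text,
        ).lower()

    left, right = rule.split("+")
    behind = f"(?<={side(left)})" if left else ""
    ahead = f"(?={side(right)})" if right else ""
    # \B keeps the match inside the word, so word edges are never marked.
    return re.compile(r"\B" + behind + ahead)

def mark_boundaries(word, compiled_rules):
    """Insert '+' at every position where at least one rule fires."""
    lowered = word.lower()
    cuts = sorted({m.start() for rx in compiled_rules for m in rx.finditer(lowered)})
    pieces, prev = [], 0
    for cut in cuts:
        pieces.append(word[prev:cut])
        prev = cut
    pieces.append(word[prev:])
    return "+".join(pieces)

RULES = [compile_rule(r) for r in ("{A,E,I,O,U}+{Õ,Ä,Ö,Ü}", "D+T", "T+P")]

for w in ("päästeüksusi", "raudtee", "Rembrandt"):
    print(w, "->", mark_boundaries(w, RULES))
```

The third word shows the kind of exception listed for 'D+T': the rule also fires inside the proper name, so in practice such names would have to be filtered out beforehand or the rule accepted as having exceptions.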

The second group contains rules that are the result of a combination of several circumstances such as the typical structure of a word stem in Estonian, the occurrence of certain letters in case and person suffixes, restrictions on the occurrence of 'Õ', 'Ä', 'Ö' and 'Ü' in non-initial syllables, frequency of word forms etc. E.g. the rule 'EE+J' covers only two extremely frequent word forms 'seejuures' (202) and 'seejärel' (317).

Rule	Exceptions	Frequency
+KÕ	kaksikõde	1044/5
+PÕ	tippõiguskaitsja	638/1
+TÕ	importõlu, importõunad	1129/5
+VÕ	administratiivõigus, sugestiivõpe, korvõieline	5386/5
+PÄ	trumpäss, tippärimees	5366/8
+VÄ	reväär, skväär	3026/8
+MÄ	paremäärmus, proper names ending in -mäe	2926/4
+NÄ{D, G, H, I, L, O} (all except the suffix -när/-näär: aktsionär, etc.)		~1000
+JÕ (partial overlap with rule I+JÕ)		876
EE+J		571
+HÄ		496
OO+A		355
EA+O		145
I+UU		142
ÄE+O		109
+HÕ		67
+VÖ		50
+NÕ	vaseliinõli	916/1
+KÄ	vasakäärmus	997/5
+PÜ	kupüür, tippüritus	345/14
+PÖ	epopöa, pompöösne	126/17
+HÜ	formaldehüüd	86/3

The third group is mainly the result of an analysis of word form frequencies. Many of the words carry political or journalistic connotations ('OND+ER' for 'koonderakond' and 'ISA+MAA', both names of political parties; 'NE+MAA' for 'Venemaa' ("Russia"); 'AJA+KIR' for 'ajakirjandus'/'ajakirjanik' ("journal, journalist, journalistic")). Some of the rules added here could, in principle, belong to one of the previous groups, but their context has been extended to reduce the number of exceptions ('+SÜS' and '+SÜN').

Rule	Exceptions	Frequency
+SÜ{S, N}		584
I+RÄ		439
I+TÖ		427
{A, E, I, O}+TÜ	atatürk, trotüül	317/9
LE+A	asalea	314/2
L+PO	kolposkoopia	305/5
O+AJ		286
GE+P	Taagepera	266
E+AAST		204
TS+P		182
A+EE		127/1
ÄE+{A, P, R}		147
IU+P		146
UP+M		133
A+UU		31/2

The following rules of the third group are presented in a list:

{A,E,I,U}+PEA

+AMETI

+BÜRO

+FILM

+FIR

+GRUP

+KIRJ

+MINISTE

+POOL

+PROG

+RÄÄ

+RÄH

+SUGU

+TÖÖ

A+LEHE

A+LEHT

AA+IL

AJA+KIR

AJA+LOO

AJA+LU

AKTSIA+SE

ARU+SA

AS+AEG

ASJA+O

AU+HI

BI+EL

BI+KAA

DU+MAA

EEL+ARV

EES+KUJ

EESTI+M

GI+KO

HE+KO

I+MEES

ISA+MAA

ISE+EN

JA+MAA

JA+PI

KÄES+O

KOOS+S

KSA+MAA

LE+OL

LGE+O

LIS+MAA

LU+KO

MAA+VA

MITTE+

NE+MAA

OMA+E

OMA+KA

OND+ER

ÖÖ+KO

ÕU+KO

PEA+A

PEA+DI

S+TÖÖ

SE+LO

SE+PR

SI+ALG

ST+KI

ST+KU

TE+VAH

TE+VAA

TI+MAA

TO+JU

US+MAA

US+VA

The following rules could be used if the rules are ordered or when the logical operator 'not' can be used:

S+ÕIGU > +SÕ

NIM+Õ > +MÕ

AAL+Õ > +LÕ

EEL+Õ > +LÕ

OOL+Õ > +LÕ

M+ÜH > +MÜ
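One way to read the notation 'NIM+Õ > +MÕ' is as a priority order: the more specific rule is tried first and blocks the more general one wherever their letters overlap. The following is a minimal sketch under that assumption. The function name `apply_ordered` and the example word 'maamõõtja' are not from the paper.

```python
# Rules as (left context, right context), highest priority first.
ORDERED_RULES = [("nim", "õ"), ("", "mõ")]   # NIM+Õ > +MÕ

def apply_ordered(word, rules=ORDERED_RULES):
    """Mark boundaries, letting earlier rules block later overlapping ones."""
    claimed = set()                      # letter positions used by a fired rule
    cuts = set()
    for left, right in rules:
        for i in range(1, len(word)):    # never mark the word edges
            if word[:i].endswith(left) and word[i:].startswith(right):
                span = set(range(i - len(left), i + len(right)))
                if span & claimed:
                    continue             # blocked by a higher-priority rule
                cuts.add(i)
                claimed |= span
    pieces, prev = [], 0
    for cut in sorted(cuts):
        pieces.append(word[prev:cut])
        prev = cut
    pieces.append(word[prev:])
    return "+".join(pieces)

print(apply_ordered("inimõigused"))   # inim+õigused
print(apply_ordered("maamõõtja"))     # maa+mõõtja
```

Applied on its own, '+MÕ' would split the first word as 'ini+mõigused'; the ordering removes that error at the cost of an extra pass over the word for each priority level.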

A similar analysis has been carried out using different material (programs and implementation by Indrek Kiissel). The analysis was applied to the compound words contained in the Orthological Dictionary. There are two main reasons for differences in the resulting rules: first, OD is poor in proper names, neologisms, foreign words and terminology, and, second, OD does not inform the user of word frequencies.

According to OD, there are 55 two-letter combinations (x+x) that without exceptions mark the compound boundary:

I+Õ 234

D+T 228

A+Õ 183

E+Õ 162

D+P 150

U+Õ 110

E+Ü 96

I+Ü 96

A+Ü 83

V+P 81

V+V 52

U+Ü 50

A+Ä 33

E+Ä 32

A+Ö 28

D+B 28

I+Ä 22

P+H 20

T+D 17

K+B 16

E+Ö 13

O+Õ 10

Ä+A 10

Ö+Õ 9

D+D 7

P+F 6

K+G 5

V+G 5

V+F 5

F+K 4

H+P 4

U+Ö 4

Ö+Ü 4

S+Ž 3

D+Ž 2

G+F 2

O+Ö 2

P+B 2

P+G 2

Š+T 2

V+B 2

G+G 1

H+S 1

K+Š 1

M+Z 1

O+Ä 1

P+Š 1

S+Š 1

Š+K 1

Š+P 1

Š+R 1

Ž+F 1

Ž+J 1

Ž+S 1

Ä+Ž 1

52 combinations had more positive results than negative ones. For every rule, the number of positive results, the number of negative results and the ratio of negative to positive results are provided.

K+P 229 2 0.009

T+P 274 3 0.011

G+P 115 2 0.017

V+M 55 1 0.018

S+D 50 1 0.02

S+H 406 10 0.025

M+H 40 1 0.025

T+B 22 1 0.045

N+P 88 4 0.045

S+R 542 25 0.046

T+H 85 4 0.047

G+Õ 20 1 0.05

M+T 83 5 0.06

G+T 81 5 0.062

M+V 92 6 0.065

L+R 123 8 0.065

D+K 250 18 0.072

K+F 12 1 0.083

P+K 83 7 0.084

G+H 42 4 0.095

D+H 61 6 0.098

K+H 96 11 0.115

U+Ä 17 2 0.118

M+K 176 23 0.131

N+M 52 7 0.135

V+K 128 18 0.141

D+F 7 1 0.143

G+K 144 21 0.146

M+R 31 5 0.161

O+Ü 16 3 0.188

N+B 5 1 0.200

B+P 5 1 0.200

T+F 18 4 0.222

D+Õ 17 4 0.235

S+G 41 10 0.244

P+D 4 1 0.25

U+O 131 33 0.252

B+V 7 2 0.286

V+T 47 14 0.298

B+K 19 6 0.316

O+Z 3 1 0.333

D+G 3 1 0.333

T+V 185 71 0.384

S+V 1097 478 0.436

S+B 57 25 0.439

I+P 1705 785 0.460

V+H 36 17 0.472

L+H 109 52 0.477

W+P 4 2 0.5

M+G 2 1 0.5

K+D 10 5 0.5

G+Ä 2 1 0.5

The worst rules were:

R+I 21 11035 525.4

V+E 7 3689 527.0

P+Õ 3 1588 529.3

N+D 12 6925 577.0

M+E 10 6028 602.8

M+Ä 2 1233 616.5

V+U 2 1253 626.5

P+Ä 2 1328 664.0

G+U 4 3037 759.2

N+I 8 6083 760.3

B+I 2 1804 902.0

P+I 5 4528 905.6

K+U 7 6524 932.0

P+O 3 2800 933.3

Ü+H 1 1082 1082.0

H+K 1 1247 1247.0

T+U 9 11907 1323.0

N+E 14 20132 1438.0

M+I 8 12024 1503.0

P+U 2 3392 1696.0

B+E 1 1963 1963.0

Ü+L 1 1986 1986.0

Ä+Ä 1 1999 1999.0

N+U 1 2054 2054.0

V+Ä 1 2123 2123.0

V+I 2 5333 2666.5

M+U 1 3104 3104.0

H+A 1 4187 4187.0
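The ratio in the third column is the number of negative results divided by the number of positive results, so smaller values indicate better rules. A short sketch of that ranking, using a handful of rows copied from the lists above; the function name `rank_rules` and the threshold parameter `max_ratio` are illustrative:

```python
def rank_rules(counts, max_ratio=None):
    """Sort rules by negative/positive ratio, optionally keeping only good ones."""
    ranked = sorted(
        ((rule, pos, neg, neg / pos) for rule, (pos, neg) in counts.items() if pos > 0),
        key=lambda row: row[3],
    )
    if max_ratio is not None:
        ranked = [row for row in ranked if row[3] <= max_ratio]
    return ranked

# (positive, negative) counts copied from the lists above.
counts = {
    "K+P": (229, 2),
    "T+P": (274, 3),
    "S+V": (1097, 478),
    "R+I": (21, 11035),
    "H+A": (1, 4187),
}

for rule, pos, neg, ratio in rank_rules(counts):
    print(f"{rule}\t{pos}\t{neg}\t{ratio:.3f}")

# Keep only rules with at most 1 exception per 10 recognised boundaries.
good = [rule for rule, *_ in rank_rules(counts, max_ratio=0.1)]
print(good)
```

A ratio alone says nothing about coverage, which is why the positive count is kept in each row: a rule with a very low ratio but only a handful of positive results may not be worth the extra check.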

For the purpose of comparison we will also present two lists of four-letter sequences conforming to the pattern xx+xx. The first of them presents such letter combinations that occurred exclusively on compound boundaries 50 times or more:

IS+VÄ 152

US+MA 133

IS+VI 101

TE+VA 91

LE+KA 88

US+VI 82

US+VÄ 82

SE+VA 75

US+KI 73

US+TÖ 71

UU+VI 68

NA+KO 66

IS+PU 60

ME+KA 59

LA+KO 56

IS+AE 55

SE+KI 55

US+PI 55

US+RA 54

US+RI 53

US+PA 52

JA+VA 51

UD+TE 51

US+AS 51

US+PU 51

US+SÄ 51

LI+VA 50

SE+KO 50

The second list contains sequences with exceptions, but no more than 1 exception to 10 recognised boundaries:

US+VA 170 1 0.006

SE+KA 100 1 0.010

IS+VA 93 1 0.011

JA+KO 58 1 0.017

TE+KA 56 1 0.018

SI+VA 54 1 0.019

SI+PU 54 1 0.019

DE+VA 54 1 0.019

VA+KA 43 1 0.023

US+ME 43 1 0.023

JA+TE 43 1 0.023

IS+PI 42 1 0.024

KU+VA 40 1 0.025

SI+TE 38 1 0.026

NI+TE 39 1 0.026

SA+VA 37 1 0.027

RI+KI 37 1 0.027

US+KE 36 1 0.028

LI+PA 36 1 0.028

AL+MA 34 1 0.029

US+TO 35 1 0.029

NI+LA 35 1 0.029

ER+KA 35 1 0.029

TI+KO 31 1 0.032

DI+MA 31 1 0.032

RI+LE 30 1 0.033

US+SA 29 1 0.034

SE+VI 29 1 0.034

SE+RA 28 1 0.036

SI+PI 27 1 0.037

NA+SA 27 1 0.037

MA+KI 27 1 0.037

LA+PA 27 1 0.037

AS+PU 27 1 0.037

IS+KI 76 3 0.039

TE+SA 25 1 0.040

KU+KA 25 1 0.040

LE+HA 24 1 0.042

HA+VA 24 1 0.042

NI+PU 48 2 0.042

HA+KU 23 1 0.043

US+PE 22 1 0.045

TE+TE 22 1 0.045

SI+ME 21 1 0.048

SE+TO 21 1 0.048

HE+LA 21 1 0.048

BA+MA 21 1 0.048

NI+SI 20 1 0.050

ME+KE 20 1 0.050

ER+KO 20 1 0.050

EL+AR 20 1 0.050

BI+VA 20 1 0.050

US+IN 19 1 0.053

NA+LE 19 1 0.053

NA+KE 18 1 0.056

AS+LI 18 1 0.056

SI+PA 36 2 0.056

US+SE 90 5 0.056

DI+VA 35 2 0.057

LI+RI 17 1 0.059

KI+KA 17 1 0.059

GU+PI 17 1 0.059

DI+TU 17 1 0.059

RA+MA 33 2 0.061

SI+SI 16 1 0.063

LU+TE 16 1 0.063

KA+ME 16 1 0.063

DE+PA 16 1 0.063

DU+KA 32 2 0.063

SI+KL 15 1 0.067

ON+KA 15 1 0.067

NI+TO 15 1 0.067

NA+VE 15 1 0.067

KU+TU 15 1 0.067

DU+VA 15 1 0.067

LU+PI 14 1 0.071

KE+LA 14 1 0.071

IN+HA 14 1 0.071

HI+LA 14 1 0.071

AE+KA 14 1 0.071

AL+KA 28 2 0.071

LU+MA 56 4 0.071

TU+KU 13 1 0.077

TU+VA 13 1 0.077

RI+AR 13 1 0.077

OR+SO 13 1 0.077

JA+AR 13 1 0.077

GI+ME 13 1 0.077

ER+PI 13 1 0.077

ES+PI 13 1 0.077

IS+PA 39 3 0.077

LI+PI 38 3 0.079

IS+MA 114 9 0.079

US+KA 139 11 0.079

SE+AS 12 1 0.083

RE+VY 12 1 0.083

GA+TA 12 1 0.083

GA+VI 12 1 0.083

EL+SE 12 1 0.083

BE+KA 12 1 0.083

BI+KA 12 1 0.083

DA+VA 23 2 0.087

AS+PI 23 2 0.087

TI+PE 11 1 0.091

TI+PI 11 1 0.091

OO+TA 11 1 0.091

MA+VO 11 1 0.091

KS+PA 11 1 0.091

IK+VA 11 1 0.091

IS+SO 11 1 0.091

HU+LA 11 1 0.091

EO+TE 11 1 0.091

NU+KA 22 2 0.091

RA+PU 22 2 0.091

NA+PA 22 2 0.091

NA+PO 22 2 0.091

NA+MA 43 4 0.093

VA+TE 32 3 0.094

US+LU 21 2 0.095

GI+LA 21 2 0.095

RI+PA 42 4 0.095

RI+LA 52 5 0.096

RI+PU 41 4 0.098

Conclusion

The selection of the rules, their number and realisation all depend on the target application requirements.
The most reasonable place of such a rule system in the framework of automatic morphological analysis would be immediately before compound word analysis. This would exempt already lemmatised (recognised) words from the necessity to pass this stage of analysis.
The rules obtained can be divided into two groups. The first could include exceptionless rules with sufficient coverage to be applied at once. In the case of a positive result the word form in question could be directly submitted to the compound analysis. The expediency of this solution depends on how the analysis algorithm is implemented, and mostly on the working speed of the different components of the analysis.
As to the number of rules to be applied, one must reach a reasonable compromise. If there are too many rules, the analysis will slow down, as the same set of rules is applied to all words, mostly to no avail. If the rules are too numerous or too complex, dictionary search could turn out to be the quicker way. On the other hand, every new stage added to the automatic analysis makes it slower, so the rules must find enough compound boundaries to pay off.
The implementation of the rules as a finite state automaton makes it almost impossible to order the rules. E.g. '+MÕ' would be a fine rule if it could be triggered only after the other rule 'NIM+Õ', thus eliminating most of the exceptions represented by the various forms of the word 'inimõigused' ("human rights"). Where ordered rules give better results, the speed will suffer.
The rules could be more complex. The generality of the rules would increase considerably if:
- the rules could overlap;
- several boundaries could be covered by one rule;
- the beginning and end of a word were marked and used in the rules;
- inhibiting rules could be introduced.

The rules proposed here cover approx. 45% of the compound boundaries while the addition of some rules without noticeable slowdown of the process would probably raise the efficiency to 60% of all compound boundaries.
