Spelling and presentation of names

Spelling of names

Name preferences

In the KNAB foreign place names data the principal name is always the one considered to be the endonym or the local official name. It is given in the original spelling if possible, or romanized according to internationally accepted systems. For some languages written in non-Roman scripts, KNAB's own provisional systems are also in use. See an overview of romanization systems used in KNAB.

For features that are shared by two countries and have two different names, both name forms are usually given as equal; the principal name is simply, and mechanically, the name used in the country whose ISO code comes first alphabetically. E.g. for features shared between Spain (ES) and Portugal (PT) the Spanish name is given as the first form. This is done solely for technical reasons and does not imply any recognition of the degree of importance each name form may possess.
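The ordering rule for features shared by two countries can be sketched as follows. This is a minimal illustration in Python, not KNAB's actual code: the function name `order_shared_names` and the sample name pair (Tajo and Tejo, for the river shared by Spain and Portugal) are assumptions chosen for the example.

```python
def order_shared_names(names_by_iso):
    """Order name forms by the ISO country code of each country.

    The first element of the result plays the role of the principal
    name; the order carries no judgment about importance.
    """
    return [name for _code, name in sorted(names_by_iso.items())]


# Hypothetical input: ISO code -> name form used in that country.
shared = {"PT": "Tejo", "ES": "Tajo"}
ordered = order_shared_names(shared)

print(ordered)  # ES sorts before PT, so the Spanish form comes first
```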

For features shared by more than two countries or having more than two different names, and also for names of features beyond national sovereignty (seas, oceans), the conventional English form is given as the principal name, with French as a parallel name.

Romanized and original-script names

In principle, all name variants in KNAB should appear at least in Roman script, as this is used for sorting and queries. Russian names for features outside the C.I.S. may appear only in Cyrillic form; these can be queried using the Russian-language query form.

The non-Roman script forms presented here are actually regenerated from the transcription and transliteration schemes by which KNAB records the original-script forms. In principle, only those non-Roman script forms are presented that have been recorded de visu, i.e. directly from sources in the original script. (This is why it has not always been possible to give an original-script form for each name variant.) There are, however, cases where the romanization used in a source is judged sufficient to represent the original-script form, and then the non-Roman script forms are generated as well. Errors, which unfortunately can never be totally excluded, can arise at several stages: errors in the original sources (i.e. the romanization used there is inadequate); errors in transcription or transliteration procedures; errors in conversion modules.

Encoding of names data

Since July 19, 2004 all the names data in the Internet version of KNAB have been presented in full compliance with the Unicode (or ISO 10646) standard, without any of the conventional sequences (formatting commands, etc.) used earlier. Names that were earlier given in Roman, Greek and Cyrillic scripts are now additionally given in other non-Roman scripts (Arabic, Chinese, Tamil, etc.). Correct presentation of these name forms on your computer screen and in your applications depends on the use of the latest browsers (recommended are Firefox, Netscape 7.1 or Internet Explorer 6.0) and of fonts with the widest Unicode coverage.

The widest ranges of Unicode characters are found in fonts such as Arial Unicode MS or Bitstream Cyberbit. For a detailed description of Unicode and suitable fonts please refer to Alan Wood's Unicode Resources and David McCreedy's Gallery of Unicode Fonts. You will also find links there to download fonts for missing scripts. A further problem is that even when Unicode characters are present in a font (e.g. Arial Unicode MS), this does not guarantee correct presentation of the names data, as additional OpenType Layout Tables are needed to ensure correct selection of glyphs (ligatures, etc.) and their sequences. With gradual updating these problems will be overcome. The following notes are based on observations with Windows 98 II, so any newer computers are likely to perform better.

Arial Unicode MS should correctly present names data in the following scripts: Latin (i.e. Roman, incl. Vietnamese), Arabic (incl. Pashto, Persian, Uighur, Urdu), Armenian, Chinese (simplified and traditional), Cyrillic (incl. characters for Turkic languages), Devanagari (Hindi, Marathi, Nepali, etc.), Georgian, Greek (incl. polytonic), Gujarati, Gurmukhi (Panjabi), Hebrew, Japanese (default), Kannada, Korean, Tamil and Thai. The font also contains characters for Bengali, Lao, Malayalam, Oriya, Telugu and Tibetan, but their presentation is not correct (no OpenType tables).

Bitstream Cyberbit contains characters for the following scripts: Latin (Roman), Arabic, Chinese (default), Cyrillic, Greek, Hebrew, Japanese, Thai.

For the following languages/scripts additional fonts are needed (see the links above): Bengali, Burmese (Myanmar), Ethiopian (Amharic and Tigrinya), Unified Canadian Aboriginal Syllabics (Inuktitut), Khmer, Lao, Malayalam, Oriya, Sinhala, Telugu, Thaana (Maldivian) and Tibetan. For Burmese and Sinhala it is at present almost impossible to find a publicly available Unicode font; there are also difficulties with Telugu. A good selection of fonts is presented on a page of geonames.de by Werner Fröhlich.

If, for example, you have set Arial Unicode MS as your default font for Unicode pages, you might experience problems in correctly viewing names data for Lao, Tibetan, etc., even if you have installed a font suitable for these scripts and have set your language options to use that font. This is because the browser takes all characters from the default Unicode font and turns to other available fonts only if characters are missing there. There are two somewhat complicated ways to overcome presentation problems if you are seriously interested in viewing these data correctly: 1) for the time of viewing, change your default Unicode font to a font that does not contain this particular range, e.g. Times New Roman (then the browser will follow your preferences in choosing fonts for languages); 2) copy the text into another application, such as a Word document, and change the font there. This can also be done automatically, as text portions in various scripts often contain the tag ...... If you replace it with e.g. ...., then your application will automatically use the fonts prescribed. NB! Some scripts are read from right to left (Arabic, Hebrew, Thaana) and for these scripts the tag also contains an indication of direction (e.g. ); if you delete that, the names are again distorted.

The following technical remarks could be useful to those wishing to view the names data correctly, or to use these data in other applications. For various reasons, Unicode encoding has in some cases been applied in ways that differ from Unicode recommendations.

Latin (Roman)

To economize on conversion procedures, some precomposed characters might be presented as a sequence of the base character and a combining diacritical mark (Unicode range U+0300 to U+036F). These sequences might be used for precomposed characters in the ranges U+01CF..U+01E3, U+01EA..U+01EF, U+01F7..U+0217 and U+1E00..U+1E9B. However, all these instances should be very rare, and all significant precomposed characters (e.g. all Vietnamese characters) should be encoded with one code only.
If more than one combining diacritical mark (U+0300..U+036F) is attached to one base character, the usual order of encoding is to move from "inside" the character to the top and then to the bottom. Should there be, e.g., a character with a combining stroke (inside the character), combining diaeresis, combining acute (above the previous sign), combining line below and combining ogonek (below the previous sign), then the marks will be encoded in the order mentioned here.
The apostrophe and reversed apostrophe are presented using Unicode code points U+2019 and U+2018, not U+02BC or U+02BB (this applies to Arabic, Hebrew, and other names).
The Unicode standard appears to lack a code for the character "N with descender", which is used in the new Chechen and Tatar Roman orthographies. Nor can it be presented using code sequences, as there is no "combining descender". For the time being the "combining descender" is encoded by the combining diacritical mark U+0321.
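For readers reusing the data, the remarks above on precomposed characters and on the order of combining marks can be checked with Python's standard `unicodedata` module. The sketch below is an illustration only; the function name `to_precomposed` and the sample strings are assumptions, not KNAB data. It shows that NFC normalization folds a base-plus-mark sequence into its precomposed character, and that Unicode canonical ordering arranges marks by combining class, which differs from the inside, top, bottom order described here.

```python
import unicodedata


def to_precomposed(name):
    """Compose base + combining mark sequences where Unicode allows (NFC)."""
    return unicodedata.normalize("NFC", name)


# "i" + COMBINING CARON composes to U+01D0, which lies in one of the
# ranges listed above (U+01CF..U+01E3).
composed = to_precomposed("i\u030c")

# Marks in the order described above: stroke overlay, diaeresis, acute,
# line (macron) below, ogonek.
knab_order = "o\u0335\u0308\u0301\u0331\u0328"
classes = [unicodedata.combining(ch) for ch in knab_order[1:]]

# Canonical ordering sorts marks by combining class (stable sort), so
# the ogonek and the line below move ahead of the marks above.
canonical = unicodedata.normalize("NFD", knab_order)

print(f"U+{ord(composed):04X}")
print(classes)
print([f"U+{ord(ch):04X}" for ch in canonical])
```

An application that normalizes incoming text will therefore hold the marks in a different order from the one delivered; the two sequences remain canonically equivalent, so comparisons should be made on normalized strings.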

Arabic

In the case of Pashto, Persian and Urdu the code U+06CC, which represents dotless or dotted Y depending on position, has not been used, as it appears that browsers do not support it adequately. Instead the separate codes for dotless and dotted Y are used (U+0649 and U+064A, respectively), depending on position. The Uighur Y, which should appear dotless in all positions, does not seem to be supported at present in browser applications, but this problem can be fixed by using a font like Riwaj.
The Uighur Ä should be represented in Unicode by U+06D5, but this is currently not correctly supported either. It has therefore been replaced by U+0647 (Arabic H), accompanied by U+200C (zero-width non-joiner) in the middle of a word.
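To see which of these code points a given name actually contains, the characters can be listed with their Unicode names. The following is a small inspection sketch using Python's standard `unicodedata` module; the helper name `describe` and the short sample strings are assumptions for illustration, not KNAB records.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# The substitute pair described above: Arabic H followed by the
# zero-width non-joiner, standing in for U+06D5 in mid-word position.
substitute = describe("\u0647\u200c")
recommended = describe("\u06d5")

for code, name in substitute + recommended:
    print(code, name)
```

Note that the zero-width non-joiner is invisible on screen, so a listing of this kind is the easiest way to tell the substitute sequence from an ordinary Arabic H.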

Cyrillic

The sign for abruptivity in Caucasian languages should correctly be encoded using U+04C0; for technical reasons this has been substituted by the code for capital I (=Ukrainian I) U+0406.
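Applications that need the recommended code point can reverse this substitution. The sketch below is one hedged way to do so in Python; `restore_palochka` is a hypothetical helper and the sample string is only an illustrative input, not a KNAB record. Because U+0406 is a legitimate letter in Ukrainian and Belarusian, the replacement should be applied only to names known to be in a Caucasian language.

```python
import unicodedata

SUBSTITUTE = "\u0406"  # Cyrillic capital I, as used in the data
PALOCHKA = "\u04c0"    # code point recommended for the abruptivity sign


def restore_palochka(name):
    """Replace the substitute letter with U+04C0.

    Assumes the caller has already established that the name is in a
    Caucasian language; the function cannot detect this itself.
    """
    return name.replace(SUBSTITUTE, PALOCHKA)


sample = "\u043a\u0406\u0430\u043d\u0442"  # illustrative input only
restored = restore_palochka(sample)

print(unicodedata.name(restored[1]))
```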

Devanagari, Bengali, Gurmukhi (Panjabi), Kannada, Tamil

It seems that names in these scripts are best viewed using Internet Explorer, provided your computer has an updated usp10.dll file; Netscape does not seem to handle ligatures or the sequence of characters correctly.

Chinese and Japanese

Certain Unicode code points for Chinese and Japanese logographic characters are the same if the characters share the same origin. The glyphs, however, might differ somewhat between the two languages. Arial Unicode MS contains Japanese glyphs as a default, and Bitstream Cyberbit seems to contain Chinese glyphs as a default.

Myanmar

Currently there are no publicly available fonts that would correctly present names in Myanmar script. Therefore the KNAB data here are perhaps ahead of their time. A modification of the script is used for the Mon and Shan languages, but Unicode has not yet included special characters for these languages, so the original-script names cannot be encoded.