Letter database: languages, character sets, names etc

How to read the query results

decimal: ā UTF-8 (196, 129) ā	name: LATIN SMALL LETTER A WITH MACRON
	old name: ~~LATIN SMALL LETTER A MACRON~~
	Adobe glyph name: amacron
	mnemonic name(s): <a->
	category: Ll (Letter, Lowercase)
	combining: 0
	decomposition info: 0061 0304
	comment:
	found in charsets: 8859-10 (E0); 8859-13 (E2); 8859-4 (E0); CP1257 (E2); CP775 (83);
	found in languages: hawa [Hawaiian]; livo [Livonian]; lv [Latvian]; mars [Marshallese]; mi [Maori];
	used in romanization of: am_r [Amharic (ethiopic)]; ar_r [Arabic (perso-arabic)]; as_r [Assamese (assamese)]; bn_r [Bengali (bengali)]; fa_r [Persian (perso-arabic)]; gu_r [Gujarati]; hi_r [Hindi (devanagari)]; kn_r [Kannada]; ml_r [Malayalam]; or_r [Oriya]; pa_r [Punjabi]; ps_r [Pashto (perso-arabic)]; ta_r [Tamil (tamil)]; te_r [Telugu]; ur_r [Urdu (perso-arabic)]; zh_r [Chinese (sino-japanese)];
	uppercase: 0100

General comments

The query does not list the characters in the underlying script (Latin or Cyrillic). For some languages, when the information could be readily found, I provided a small note describing the absence of some basic characters in the alphabet or their general 'foreignness'.

Most languages also have at least some loanwords containing characters not listed in the query results. While these characters clearly do not belong under the 'required' category, one can argue about their 'importance'. So far, only Norwegian has comments of this kind.

Glyph

Standard disclaimer applies -- the glyph presented here is by no means normative. U0101 is the Unicode value (= 101 hexadecimal).

Decimal

Decimal representation of the Unicode value that can be used e.g. in HTML 4.
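
As a small illustration, the decimal value is simply the code point written in base 10, and in HTML it would typically appear as a numeric character reference. A minimal Python sketch (the helper name `decimal_reference` is made up for this example and is not part of the database):

```python
def decimal_reference(ch: str) -> str:
    """Return the HTML decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


a_macron = "\u0101"  # LATIN SMALL LETTER A WITH MACRON (U0101)

print(ord(a_macron))                # 257, i.e. hexadecimal 101 in decimal
print(decimal_reference(a_macron))  # &#257;
```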

UTF-8

UTF-8 representation of the Unicode value that can be used e.g. in HTML 4. You can see the result if you change the document encoding to UTF-8. Netscape users: View - Character set - Unicode (UTF-8). Font support is also needed, so don't be surprised if a hollow box is displayed for many characters.
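
The byte values shown in the query result can be reproduced with any UTF-8 encoder. A short Python sketch, assuming the values in parentheses are the decimal values of the individual bytes:

```python
a_macron = "\u0101"  # LATIN SMALL LETTER A WITH MACRON

utf8_bytes = a_macron.encode("utf-8")

print(tuple(utf8_bytes))  # (196, 129)
print(utf8_bytes.hex())   # c481

# Decoding the two bytes gives the original character back.
print(utf8_bytes.decode("utf-8") == a_macron)  # True
```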

Name

Character names as currently defined in the Unicode standard (UnicodeData-Latest.txt).
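
For a quick cross-check, Python's standard `unicodedata` module exposes the same names. This is only a sketch: the module reflects whichever Unicode version ships with the interpreter, which may differ from the file the database was built from.

```python
import unicodedata

a_macron = "\u0101"

name = unicodedata.name(a_macron)
print(name)  # LATIN SMALL LETTER A WITH MACRON

# The lookup also works in the other direction, from name to character.
print(unicodedata.lookup(name) == a_macron)  # True
```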

Old name

Deprecated names as currently defined in the Unicode standard (UnicodeData-Latest.txt).

Adobe glyph name

Glyph names as currently defined in the Adobe Glyph List. This name is used in PostScript. Be sure to read Unicode and Glyph Names.

Mnemonic name

Mnemonic representation as currently defined in mnemonic,ds. Mainly used in POSIX environments to define mapping tables, collating rules etc. The enclosing <angle brackets> are not part of the name.

Character sets

Mapping tables for various character sets can be found in subdirectories (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/). The bracketed hexadecimal number after the set name is the character's code position in that set, e.g. 8859-10 (E0).
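
The code positions can be approximated with the codecs bundled with Python. The sketch below is an assumption-laden illustration: the codec names in `CHARSETS` are Python's own, chosen here to match the sets in the example record, and Python's tables are not guaranteed to be identical to the mapping files mentioned above. `code_positions` is a hypothetical helper.

```python
# Label as shown in the query result -> Python codec name (assumed equivalent).
CHARSETS = {
    "8859-4": "iso8859_4",
    "8859-10": "iso8859_10",
    "8859-13": "iso8859_13",
    "CP1257": "cp1257",
    "CP775": "cp775",
    "8859-1": "iso8859_1",  # included to show a set that lacks the character
}


def code_positions(ch: str) -> dict:
    """Return {charset label: hex code position} for the sets containing ch."""
    found = {}
    for label, codec in CHARSETS.items():
        try:
            encoded = ch.encode(codec)
        except UnicodeEncodeError:
            continue  # the character is not part of this set
        found[label] = encoded.hex().upper()
    return found


for label, position in sorted(code_positions("\u0101").items()):
    print(f"{label} ({position})")
```

The character is missing from 8859-1, so that set is silently skipped, just as it does not appear in the "found in charsets" line.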

Category, combining, decomposition info, comment, upper/lowercase

Various categories as currently defined in the Unicode standard. Source UnicodeData-Latest.txt. Explanations about these fields can be found in ReadMe-Latest.txt.
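
To show where these fields come from, here is a minimal sketch that splits one semicolon-separated record of the kind found in UnicodeData. The sample line is written out to match the example values above, and the field positions are my reading of the file layout; `UnicodeRecord` and `parse_record` are names invented for this illustration.

```python
from typing import NamedTuple


class UnicodeRecord(NamedTuple):
    code: str
    name: str
    category: str
    combining: int
    decomposition: str
    old_name: str
    comment: str
    uppercase: str
    lowercase: str


def parse_record(line: str) -> UnicodeRecord:
    """Split one UnicodeData-style line into the fields shown by the query."""
    fields = line.strip().split(";")
    return UnicodeRecord(
        code=fields[0],
        name=fields[1],
        category=fields[2],
        combining=int(fields[3]),
        decomposition=fields[5],
        old_name=fields[10],
        comment=fields[11],
        uppercase=fields[12],
        lowercase=fields[13],
    )


# Sample record, assumed to follow the UnicodeData layout.
SAMPLE = (
    "0101;LATIN SMALL LETTER A WITH MACRON;Ll;0;L;0061 0304;;;;N;"
    "LATIN SMALL LETTER A MACRON;;0100;;0100"
)

record = parse_record(SAMPLE)
print(record.category, record.combining, record.decomposition, record.uppercase)
```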

Decomposition info is shown only if present. In this example, LATIN SMALL LETTER A WITH MACRON can be decomposed into 0061 (LATIN SMALL LETTER A) and 0304 (COMBINING MACRON). Note that some parts may be decomposed further and the full decomposition requires a recursive algorithm.
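
The recursion mentioned above can be sketched with Python's `unicodedata` module. This is a simplified illustration rather than the full normalization algorithm: it expands canonical decompositions only, leaves compatibility decompositions (those marked with a `<tag>`) untouched, and does not reorder combining marks. `full_decomposition` is a hypothetical helper name.

```python
import unicodedata
from typing import List


def full_decomposition(ch: str) -> List[str]:
    """Recursively expand canonical decompositions into hex code points."""
    info = unicodedata.decomposition(ch)
    # Empty string: nothing to decompose. A leading "<tag>" marks a
    # compatibility decomposition, which this sketch deliberately skips.
    if not info or info.startswith("<"):
        return [f"{ord(ch):04X}"]
    parts = []
    for code in info.split():
        parts.extend(full_decomposition(chr(int(code, 16))))
    return parts


# One step is enough for the example character.
print(full_decomposition("\u0101"))  # ['0061', '0304']

# U01DE (capital A with diaeresis and macron) needs two steps:
# 01DE -> 00C4 0304, then 00C4 -> 0041 0308.
print(full_decomposition("\u01DE"))  # ['0041', '0308', '0304']
```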

In addition to the Unicode comment field some characters may be provided with an additional "note"-field.

The uppercase equivalent of LATIN SMALL LETTER A WITH MACRON is 0100 (LATIN CAPITAL LETTER A WITH MACRON). A few exceptions to the common upper/lowercase behaviour of the Latin and Cyrillic scripts are:

Ligatures are decomposed in the upper case (e.g. ligature ff > FF)
German ß is uppercased as SS
Some uppercase forms are not precomposed (e.g. small n preceded by apostrophe > 'N)
The dotted i - dotless I pair is not used in Turkish and Azerbaijani, which instead use separate dotless (U0131 - U0049) and dotted (U0069 - U0130) pairs.
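
The exceptions listed above can be observed with Python's built-in string methods, which apply locale-independent case mappings. The sketch below is illustrative only; `turkish_upper` is a hypothetical helper, not a standard function, and it covers nothing beyond the dotted/dotless i distinction.

```python
# Ligatures are decomposed when uppercased.
print("\ufb00".upper() == "FF")        # True (LATIN SMALL LIGATURE FF)

# German sharp s becomes two letters.
print("\u00df".upper() == "SS")        # True

# U0149 (n preceded by apostrophe) has no precomposed uppercase form.
print("\u0149".upper() == "\u02bcN")   # True

# The default mapping pairs dotted i with dotless I, which is wrong
# for Turkish and Azerbaijani text.
print("i".upper() == "I")              # True


def turkish_upper(text: str) -> str:
    """Uppercase text using the Turkish dotted/dotless i pairs (sketch)."""
    # Map dotted i to U0130 first; dotless U0131 already uppercases to I.
    return text.replace("i", "\u0130").upper()


print(turkish_upper("i\u0131") == "\u0130I")  # True
```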

Not all fields defined by Unicode are shown or used by the Letter Database.