This article first appeared in the Concatenator, Fall 1996.
(Please note: many diacritics found in this article cannot be stored in a web page.)
If you ask Big Bird or my three-year-old son, they'll say that the alphabet and its order are easy: First there's the letter A, then the letter B, then C, then . . . you probably know the rest. A discussion by those two on this topic would be very short and, for the most part, they have the subject well in hand. But be forewarned: For those of us who deal with large alphabetized lists and many languages, there are plenty of situations that can easily trick the unwary. I hope you'll find this brief overview of alphabetization issues of the Roman alphabet informative. Big Bird does not tell the whole story!
American alphabetization
The fact that the English word "cheap" has five letters but three constituent sounds exemplifies a central problem of most languages written in the Roman alphabet: there is seldom a one-to-one correspondence between sounds and letters. The Roman alphabet was originally devised for Latin, and even the relationship between that language and its alphabet is complex enough. No wonder, then, that the adoption of that alphabet by other languages is often fraught with difficulties. In English, the "ch" sound needs two letters. So does the pair of sounds represented by "th": the voiced "th," as in "bathe," and the unvoiced "th," as in "bath." Far more than 26 sounds are needed to speak English, so the 26 letters cannot each stand for exactly one sound. Of course, English isn't even consistent in this mapping process, but for now that is beside the point. Alphabetization, then, deals with the written products of this mapping process. The English words "ache" and "achieve" are in close alphabetic proximity, despite their different "ch" sounds, because of the way that English links sounds to letters. In sum, the alphabetizing of English words is built on the shaky foundation of English spelling.
But, illogical spelling issues aside, the practice of alphabetizing words is a logical means of organizing them. The practice has a long and venerable history, and it's hard for most of us to conceive of organizing written lists of words in any other way. At present there are two primary methods of alphabetizing: word-by-word and letter-by-letter. In the former, the phrase "musical works" precedes "musicals" because a space is counted as a character that precedes all letters. In letter-by-letter alphabetization, spaces are ignored and the order of those two is reversed, because "s" precedes "w" in the alphabet. Word-by-word organization tends to be preferred, though for no intrinsic reason. The debate over the merits of the two systems continues unabated, as recent articles by Cousins and Brackney show.
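The difference between the two methods can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not a rendering of any published filing rule: the function names `word_by_word_key` and `letter_by_letter_key` are invented here, and the sketch handles only plain letters and spaces.

```python
def word_by_word_key(entry: str) -> str:
    # Keep the spaces: the space character sorts before every letter,
    # so a shorter first word files ahead of a longer one.
    return entry.lower()


def letter_by_letter_key(entry: str) -> str:
    # Drop the spaces so that only the letters are compared.
    return entry.lower().replace(" ", "")


entries = ["musicals", "musical works"]

word_by_word = sorted(entries, key=word_by_word_key)
letter_by_letter = sorted(entries, key=letter_by_letter_key)

print(word_by_word)      # ['musical works', 'musicals']
print(letter_by_letter)  # ['musicals', 'musical works']
```

Real filing rules also have to decide what to do with hyphens, punctuation, and numerals, which is where most of the length of the published manuals comes from.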
So, let's just have a page of alphabetization rules and be done with it, right? North American libraries, in need of such a set of rules, usually rely on the manual published by the Library of Congress (see Rather and Biebel). It is meant to place all words written in Roman letters in a logical, consistent, unambiguous order. Far from a single page, the manual is well over 100 pages! Issues include primary name headings, classifying ampersands, and Indonesian plurals. Perusing this publication gives some idea of the difficulty of creating a consistent set of alphabetization rules.
There are other sets of American alphabetization rules as well as critical literature about them. Wellisch's critique of the American Library Association filing rules is perhaps the best at showing at a glance that alphabetization is a very thorny matter.
Alphabetization of other languages
It should hardly come as a surprise that most other languages using the Roman alphabet have both quirks and their own traditions of alphabetization. When using reference works in languages other than English, beware!
Diacritic marks
Most languages employ some sort of diacritic mark. Their functions vary, and I wouldn't hazard a guess as to all of their uses. They can denote spoken emphasis, as in the Italian word "città." They can serve to visually distinguish words otherwise spelled the same, such as the French words "a" and "à." They may have originated as a sign of contraction, such as the umlaut in German or the acute accent in French. Vietnamese has two independent sets of diacritic marks, one for tone and one for vowel quality.
Of interest here is the case in which a diacritic mark creates a distinction within an alphabetization system. For instance, "z" and "ż" are treated as different letters in standard Polish alphabetization practice. In fact, there are nine such letters in Polish. In a Polish dictionary, all words beginning with "ś" follow all those beginning with an unaccented "s"; all words beginning with "ę" follow those beginning with an unaccented "e," and so on. Another set of examples comes from the Scandinavian languages. A small set of accented letters in these languages is alphabetized at the end of the alphabet, including the Swedish "ö," the Norwegian "ø," and the Danish "å."
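A minimal sketch of how such an ordering might be programmed follows. It assumes the modern 32-letter Polish alphabet, which is supplied here rather than taken from the sources cited in this article, and the names `POLISH_ALPHABET`, `RANK`, and `polish_key` are invented for illustration. Each word is ranked by the position of its letters in that alphabet instead of by raw character codes.

```python
import unicodedata

# Assumed alphabet: the 32 letters of modern Polish, in dictionary order.
# The letters q, v, and x occur only in foreign words and are left out.
POLISH_ALPHABET = "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"
RANK = {letter: position for position, letter in enumerate(POLISH_ALPHABET)}


def polish_key(word: str) -> list[int]:
    # Normalize so each accented letter is a single character, then rank
    # every letter by its place in the alphabet. Characters outside the
    # alphabet all share one rank after "ż" (a simplification).
    text = unicodedata.normalize("NFC", word).lower()
    return [RANK.get(ch, len(POLISH_ALPHABET)) for ch in text]


words = ["zebra", "dom", "ćma", "cena", "syn", "ślub", "sad"]

naive = sorted(words)                    # compares raw character codes
polish = sorted(words, key=polish_key)   # compares alphabet positions

print(naive)   # ['cena', 'dom', 'sad', 'syn', 'zebra', 'ćma', 'ślub']
print(polish)  # ['cena', 'ćma', 'dom', 'sad', 'syn', 'ślub', 'zebra']
```

The naive sort pushes every accented word to the end of the list, while the alphabet-based key files "ćma" between "cena" and "dom" and "ślub" after every unaccented "s" word. The same approach would serve for the Scandinavian letters by placing them at the end of the alphabet string.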
Ligatures
Ligatures are two letters that are considered as one. An example is the Dutch "ij": "Rijksmuseum" in a Dutch encyclopedia is placed after "Rwanda" because the "ij" ligature is alphabetized by the Dutch between the letters "w" and "y." Another example is the Czech ligature "ch," which is alphabetized between "h" and "i." The Welsh language has eight ligatures. There is not a clear distinction between a ligature and a pair of letters representing one sound. The Spanish "ll" forms one sound, but Spanish alphabetization does not give it special treatment. "Llama" is found between "libélula" and "lobo."
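The Czech treatment of "ch" can be sketched as follows. This is a deliberately simplified illustration: it assumes only the 26 basic letters plus "ch" filed after "h," it ignores the Czech accented letters entirely, and the names `ALPHABET`, `split_units`, and `digraph_key` are invented here. The idea is to split each word into alphabet units first, reading "ch" as a single unit, and only then to compare.

```python
import string

# Assumed, simplified alphabet: the 26 basic letters with "ch" after "h".
ALPHABET = []
for letter in string.ascii_lowercase:
    ALPHABET.append(letter)
    if letter == "h":
        ALPHABET.append("ch")
RANK = {unit: position for position, unit in enumerate(ALPHABET)}


def split_units(word: str) -> list[str]:
    # Read the word left to right, taking "ch" as one unit wherever it occurs.
    text = word.lower()
    units = []
    i = 0
    while i < len(text):
        if text.startswith("ch", i):
            units.append("ch")
            i += 2
        else:
            units.append(text[i])
            i += 1
    return units


def digraph_key(word: str) -> list[int]:
    # Characters outside the simplified alphabet share one rank at the end.
    return [RANK.get(unit, len(ALPHABET)) for unit in split_units(word)]


words = ["idea", "chata", "hrad", "cihla", "had"]

naive = sorted(words)
with_digraph = sorted(words, key=digraph_key)

print(naive)         # ['chata', 'cihla', 'had', 'hrad', 'idea']
print(with_digraph)  # ['cihla', 'had', 'hrad', 'chata', 'idea']
```

A sort that compares single characters files "chata" under "c"; once "ch" is read as one unit, it moves to its place between the "h" words and the "i" words.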
Non-Roman letters
Why stop at twenty-six letters? Icelandic has the letter "þ" (thorn) and the Romanian language has four letters adopted from the Cyrillic alphabet. In this century all sorts of characters have been devised by linguists in their pathbreaking attempts to write languages that had existed previously only through oral tradition. The best set of examples may be the languages of Native American peoples. Scholars invariably use the Roman alphabet when writing down these languages for the first time. A survey of the introductory materials to dictionaries for some of these languages shows the difficulty and complexity of writing words and ordering them. Thankfully, a list of the alphabet employed often appears, because there seem to be no standard procedures for alphabetizing created and derived characters. Deceptive characters abound. "C" is a different letter from "c" in a dictionary of the Plains Miwok (see Callaghan). In a recently published Navajo dictionary (see Young and Morgan), an added diacritic sometimes alters the alphabetization of a letter and sometimes doesn't. In Navajo the letter "'" comes after "kw" and before "l." A dictionary of the Nez Perce language (see Aoki, p. xii) clearly discusses the alphabet employed: