So far, I have described how to make various programs understand Cyrillic text. Basically, each program required its own method, very different from the others. Moreover, some programs had incomplete support for languages other than English, not to mention their inability to interact with the user in his mother tongue instead of English.
The problems outlined above are very pressing, since software is rarely developed for the home market only. Rewriting substantial parts of the software each time a new international market is approached is very inefficient, and making each program implement its own proprietary solution for handling different languages is not a great idea in the long term either.
Therefore, a need for standardization arises. And the standard shows up.
Everything related to the problems above is divided into two basic concepts: localization and internationalization. By localization we mean making programs able to handle the different language conventions of different countries. Let me give an example. The way a date is printed in the United States is MM/DD/YY. In Russia, however, the most popular format is DD.MM.YY. Other issues include time representation, number formatting, and currency representation. Apart from that, one of the most important aspects of localization is defining the appropriate character classes, that is, defining which characters in the character set are language units (letters) and how they are ordered. On the other hand, localization doesn't deal with fonts.
Internationalization (or i18n for brevity) is supposed to solve the problems related to the ability of a program to interact with the user in his native language.
Both of the concepts above had to be implemented in a standard, giving programmers a consistent way of making their programs aware of national environments.
Although the standard hasn't been finished yet, many parts of it have, so they can be used without much of a problem.
I am going to outline the general scheme of making the programs use the features above in a standard way. Since this deserves a separate document, I'll just try to give a very basic description and pointers to more thorough sources.
One of the main concepts of localization is the locale. By a locale we mean a set of conventions specific to a certain language in a certain country. It is usually wrong to say that a locale is just country-specific. For example, in Canada two locales can be defined: Canada/English language and Canada/French language. Moreover, Canada/English is not equivalent to UK/English or US/English, just as Canada/French is not equivalent to France/French or Switzerland/French.
Each locale is a special database, defining at least the following rules:
In RedHat 4.1, which I am using, there are actually two locale databases: one for the C library (libc) and one for the X libraries. In the ideal case there should be only one locale database for everything.
To change your default locale, it is usually enough to set the LANG environment variable. For example, in sh:
LANG=ru_RU; export LANG
Sometimes, you may want to change only one aspect of the locale without affecting the others. For example, you may decide (God knows why) to stick with the ru_RU locale, but print numbers according to the standard POSIX one. For such cases, there is a set of environment variables which you can use to configure specific parts of the current locale. In the last example it would be:
LANG=ru_RU; LC_NUMERIC=POSIX; export LANG LC_NUMERIC
For the full description of those variables, see locale(7).
Now let's be more Linux-specific. Unfortunately, Linux libc version 5.3.12, supplied with RedHat 4.1, doesn't have a Russian locale. In this case one must be downloaded from the Internet (I don't know the exact address, however).
To check which languages you have locales for, run 'locale -a'. It will list all the locale databases available to libc.
Fortunately, the Linux community is rapidly moving to the new GNU libc (glibc) version 2, which is much more POSIX-compliant and has a proper Russian locale. The next "stable" RedHat system will already use glibc.
As for the X libraries, they have their own locale database. In the version I am using (XFree86 3.3), there already is a Russian locale database. I am not sure about the previous versions. In any case, you may check it by looking into usr/lib/X11/locale/ (on most systems). In my case, there already are subdirectories named koi8-r and even iso8859-5.
With locales, programs don't have to explicitly implement the various character conversion and comparison rules described above. Instead, they use a special API which makes use of the rules defined by the locale. Also, it is not necessary for a program to use the same locale for all rules: it is possible to handle different rules using different locales (although such a technique should be strongly discouraged).
From the setlocale(3) manual page:
A program may be made portable to all locales by calling setlocale(LC_ALL, "") after program initialization, by using the values returned from a localeconv() call for locale-dependent information, and by using strcoll() or strxfrm() to compare strings.
SunSoft, for example, defines 5 levels of program localization:
The program calls setlocale(), it doesn't make any assumptions about the 8th bit of each character, it uses functions from ctype.h and limits from limits.h, and it takes care of signed/unsigned issues.
It is very important not to make any assumptions about the nature and ordering of the character set. The following programming practice must be avoided:
if (c >= 'A' && c <= 'Z') { ...
Instead, the macros from the ctype.h header file are locale-aware and should be used on all such occasions.
The program uses strcoll() and strxfrm() instead of strcmp() for strings; it uses time(), localtime(), and strftime() for time services; and finally, it uses localeconv() for proper number and currency representation.
The program accesses its messages through gettext() (Sun/POSIX standard) or catgets() (X/Open standard). For more information on that, see section i18n.
The program does not use the char type. Instead it uses wchar_t, which defines entities big enough to contain Unicode characters. ANSI C defines this data type and an appropriate API.
For a more detailed explanation of locale, see, for example, (Voropay1) or (SingleUnix).
While localization describes how to adapt a program to a foreign environment, internationalization (or i18n for brevity) details the ways to make a program communicate with a non-English-speaking user.
Previously, that was done by developing some abstraction of the messages to output from the program's code. Now, such a mechanism is (more or less) standardized. And, of course, there are free implementations of it!
The GNU project has finally adopted a way of making internationalized applications. Ulrich Drepper (drepper@ipd.info.uni-karlsruhe.de) developed a package called gettext. This package is available at all GNU sites, such as prep.ai.mit.edu. It allows you to develop programs in such a way that you can easily make them support more languages. I don't intend to describe the programming techniques, especially because the gettext package comes with an excellent manual.
Request for collaboration: If you want to learn the gettext package and contribute to the GNU project at the same time, or even if you just want to contribute, then you can do it! GNU is going international, so all the utilities are being made locale-aware. The problem is to translate the messages from English into Russian (and other languages if you'd like). Basically, what one has to do is get the special .po file consisting of the English messages for a certain utility and add to each message its equivalent in Russian. Ultimately, this will make the system speak Russian if the user wants it to! For more details and further directions, contact Ulrich Drepper (drepper@ipd.info.uni-karlsruhe.de).