Advanced String Techniques in C++ - Part I: Unicode
by Fredrik Andersson (28 August 2000)
Introduction
These tutorials (there'll be two of them) will discuss how to
implement a few neat string handling techniques in your applications
and games. I've intentionally stayed clear of references to game
development in these tutorials because I feel the techniques
presented here benefit from being treated in a more general context.
After all, a large part of game development is building the
underlying foundation technology, and there's nothing more
fundamental than string management, is there?
This first
tutorial discusses Unicode and localization techniques. What this
basically means is how to add support for character sets and
languages other than English in a very simple way. The reasons for
doing so should be quite obvious to everyone: At the very least,
making it possible for non-English players of your games to input
data in their native language (for instance, when using network chat
modes or perhaps when naming a character in an RPG) is a way of
showing them great
respect.
A Farewell To Char
I happen to live in Sweden, and despite common belief we
neither have polar bears wandering around the towns nor have we
limited our diet to meatballs. What we do have, however, is an
extended Latin alphabet. For the sole purpose of limiting our
communication with the outside world (or so it feels), we've placed
little dots and circles above certain letters in our alphabet,
making them virtually unpronounceable to anyone not living in
Northern Europe. Such characters are a pain to store and transmit
electronically since the characters basically don't exist on
anything but Swedish computers.
The reason for this is quite
obvious: Characters in computer systems are normally stored as ASCII
codes, or a derivative thereof (such as the ANSI codes used by
Windows). The problem with these encoding methods is the limited
space available for characters; since these encodings use eight bits
to store each character, a maximum of 2^8 = 256 characters can be
defined (plain ASCII actually only defines 128 of them). This is
quite enough for encoding the standard Latin alphabet, the digits,
punctuation and a few diacritics, but not the exotic characters of
non-English languages (just take a look at languages such as Chinese
or Japanese, whose writing systems have thousands of characters
representing complete syllables, words and concepts).
The
solution is to increase the number of bits used to store each
character, and to do so in a standardized way that allows painless
data transfers between systems using different languages. Two such
standards exist and are in use today: Multibyte Character Sets
(MBCS) and Unicode.
The Mess of MBCS
To be
blunt, MBCS is the inferior of the two. The character set is based
on the ASCII-friendly char data type, but each character occupies
either one or two chars, effectively rendering all your favorite C
string functions useless. When parsing an MBCS string (yes, it has
to be parsed before it's used!), you must examine the bits of every
byte you read to determine whether the next byte in the string is
part of the current character, and if it is, how to combine the two
to form a human-readable character. While this standard certainly
provides you with more than 256 characters, it requires you to use
complex (and hence slow) string functions for even the most trivial
tasks.
Most modern Windows compilers
come with a set of string functions (all of which have the _mbs
prefix) that operate on multi-byte character strings. They behave
like their regular C counterparts (e.g. _mbslen() implements the same
functionality as strlen()). Windows even has a few functions for
parsing MBCS strings character-by-character, namely CharNext(),
CharPrev() and IsDBCSLeadByte(). Look them up in the Win32 API
reference for more info.
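Just to show what that parsing looks like in practice, here's a rough sketch of counting the characters in an MBCS string with CharNextA() (the explicit ANSI version of CharNext()), assuming windows.h has been included and SomeMbcsString holds a multi-byte string obtained from somewhere:

const char *Position = SomeMbcsString;
int CharacterCount = 0;
while (*Position != '\0')
{
    Position = CharNextA(Position); // steps over one or two bytes, depending on the lead byte
    ++CharacterCount;
}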
Now that you know what MBCS is,
don't use it. If you do, you'll be sorry. Instead, read on and
discover the wonders of
Unicode!
Unicode
Unicode was invented by Apple
and Xerox in the late 80's, and is now maintained by an industry
consortium responsible for assigning new character codes
etc.
With Unicode, every character is encoded as a 16-bit (or
2-byte) quantity (unsigned short in C), thus making available as
many as 65,536 characters, more than enough for all significant
written languages in the world today.
C's good old string
functions won't work on Unicode strings either, but luckily there's
another set of C runtime library functions available for Unicode
strings, prefixed with wcs (for "wide character string", not to be
confused with the aforementioned _mbs function set). You'll find all
your old workhorses here, such as wcslen(), which implements
behavior equivalent to strlen(). In addition, writing your own
functions to operate on Unicode strings is nowhere near as difficult
or frustrating as writing MBCS functions, since no parsing is
necessary. There's no hassle with any CharNext()-like functions;
traversing a string is once again as easy as blindly increasing a
pointer and looking for the terminating null character (as is the
case with ASCII strings).
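For example, a homegrown equivalent of wcslen() boils down to little more than this sketch (using the wchar_t character type introduced later in this tutorial):

unsigned int MyWcslen(const wchar_t *String)
{
    const wchar_t *Character = String;
    while (*Character != 0)   // just walk the characters until the terminating null
        ++Character;
    return (unsigned int)(Character - String);
}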
The Unicode consortium has defined code
points (a code point being the Unicode index of a specific
character) for a wide variety of languages, diacritics, special
symbols, dingbats, mathematical and scientific symbols etc. They've
also reserved quite a bit of room for you to store any custom
characters your application might use. And they've been farsighted
enough to place the standard ASCII characters at code points 0-127
(with the Latin-1 characters filling out 0-255), making ASCII to
Unicode translation and comparison a breeze.
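Thanks to that layout, a quick-and-dirty ASCII-to-Unicode conversion is little more than a widening loop, as in the sketch below (the Win32 conversion functions covered later in this tutorial are the more robust alternative):

void AsciiToUnicode(const char *Source, wchar_t *Dest)
{
    // The first 128 code points match ASCII exactly, so each byte maps straight across.
    while (*Source != '\0')
        *Dest++ = (wchar_t)(unsigned char)*Source++;
    *Dest = 0; // terminate the Unicode string
}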
Where's the Catch?
However, Unicode does
have a few drawbacks (or rather, design issues that you need to be
aware of). First, of course, is the fact that Unicode strings occupy
twice as much space as ASCII strings, since two bytes are used per
character instead of just one. This same fact leads to a few other
issues that need to be pointed out: You cannot treat Unicode strings
as arrays of bytes, as is perfectly legal with ASCII strings.
Instead, you need to treat them as arrays of characters. You must
also make sure you're not performing any arithmetic operations on
your strings under the assumption that characters occupy only one
byte each.
There are also portability issues to consider: Not
all operating systems and compilers have support for Unicode. An
operating system without Unicode support isn't that big a problem
actually - just make sure you're not using Unicode strings when
calling OS functions. On the other hand, your compiler and C runtime
library must have explicit support for Unicode in order for you to
use it, for reasons that will soon become obvious.
If you're
targeting the Windows platform, note that NT and Windows 2000 have
full Unicode support (in fact, all NT-API functions expect Unicode
strings, ASCII is supported by an internal conversion stage), but
Windows 95 and 98 have only very rudimentary support for Unicode.
We'll take care of that problem a little
later.
Trying It At Home
So, how do we take Unicode from concept to reality? Assuming
you're on a platform and compiler that supports it, it's quite
simple. So let's pretend we're using Visual C++ on Windows 2000 for
a moment, shall we?
First, you need to inform the C runtime
library that you wish to use Unicode. That is done by placing the
following lines before any other C headers:
#define _UNICODE // Tell C we're using Unicode, notice the _
#include <tchar.h> // Include Unicode support functions
#include <stdlib.h>
#include <string.h>
#include ...
The
_UNICODE macro tells tchar.h, which is the Unicode header file
shipped with the compiler, to include the following definition of
the Unicode character type:
typedef unsigned short wchar_t;
...which
should be used instead of char for Unicode strings. Since we're
running on Windows, we also need to tell the Win32 API we're
interested in taking advantage of Unicode. This is done by placing
the following line before the inclusion of windows.h:
#define UNICODE // No underscore this time
#include <windows.h>
...
This
line causes Windows to redefine a few of its internal string data
types to be 16-bit quantities. It might be a good idea to stick
these definitions and inclusions in a common header file included by
all program modules, to avoid wreaking havoc if some module is
unintentionally not using Unicode.
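Such a common header might look roughly like this (the file name and layout are just one possibility):

// stringdefs.h - shared by all modules so everyone agrees on Unicode
#ifndef STRINGDEFS_H
#define STRINGDEFS_H

#define _UNICODE        // the C runtime library should use Unicode
#include <tchar.h>

#define UNICODE         // the Win32 API should use Unicode
#include <windows.h>

#endif // STRINGDEFS_H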
Next in line is the
problem of literals. The following code would work perfectly well
with any C compiler:
char mystring[] = "flipcode"; // ASCII literal assignment
But
try the following:
wchar_t mystring[] = "flipcode"; // Unicode literal assignment
The
compiler will tell you it's an illegal assignment since you can't
assign a string to an array of 16-bit integers. But try the
following:
wchar_t mystring[] = _TEXT("flipcode");
I bet
it'll work perfectly. What's that _TEXT thing and what magic is
lurking beneath it? The answer is it's a macro defined in tchar.h.
Here it is written out:
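#define _TEXT(x) L ## x   // simplified; the real tchar.h goes through an intermediate __T(x) macro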
For
those of you not very familiar with C's macro system, all this macro
does is stick a capital L in front of the string literal (## is the
preprocessor's token-pasting operator, which simply merges the L
with the parameter), thus making our initial source line look like
this to the compiler:
wchar_t mystring[] = L"flipcode";
The
magic L is what tells the compiler this is a Unicode literal and not
a char->unsigned short conversion. This is why you'll need a
Unicode-capable compiler to compile such programs. The same goes for
character literals. Following are character literal assignments with
both ASCII and Unicode character variables, respectively:
char mychar = 'A';
wchar_t mychar = _TEXT('A');
To
aid in porting applications that use Unicode to non-Unicode
platforms, tchar.h contains a few more features. If you don't define
_UNICODE prior to including tchar.h, _TEXT will be defined in the
following way:
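#define _TEXT(x) x   // again simplified from the actual header: the literal is left untouched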
...Which
means it virtually does nothing, thus falling back to standard ASCII
string functionality. In addition, using the special data type TCHAR
(also defined in tchar.h), data type independence can be achieved
since TCHAR is set up to be equal to a wchar_t when _UNICODE is
defined, and char when it isn't. This makes the following source
line work in both Unicode and non-Unicode environments:
TCHAR SomeString[] = _TEXT("flipcode");
As
another aid in porting your applications, tchar.h defines a set of
string manipulation macros (all having the _tcs prefix) that expand
to either the corresponding ASCII or Unicode functions, depending on
whether or not _UNICODE has been defined. As an example, _tcslen
expands to wcslen when used in Unicode applications, and strlen in
ASCII applications.
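To illustrate (a minimal sketch, assuming tchar.h has been included as shown earlier):

TCHAR Name[] = _TEXT("flipcode");
size_t Length = _tcslen(Name);   // wcslen() when _UNICODE is defined, strlen() when it isn't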
...And that's about all you need to know
to start using Unicode! But as always, operating systems tend to put
restrictions upon programmers, and Unicode is no
exception...
Speaking Unicode To A Window
As I said earlier, Windows NT and 2000 are built with Unicode
in mind, whereas 95/98 aren't. Even though Microsoft did their best
(?) to hide this from the programmers, we must still be cautious
under some circumstances.
Internally, the Win32 API maintains
two versions of any function that operates on strings in any way,
one for Unicode and one for ASCII strings. Take for example the
CreateWindow() function, to which the first two arguments are
strings (window class and window title). It comes in two flavors,
both declared in winuser.h (one of Windows' own
headers):
HWND CreateWindowA(LPCSTR lpClassName, LPCSTR lpWindowTitle, ...);
HWND CreateWindowW(LPCWSTR lpClassName, LPCWSTR lpWindowTitle, ...);

#ifdef UNICODE
#define CreateWindow CreateWindowW
#else
#define CreateWindow CreateWindowA
#endif
If
you've specified the UNICODE macro before including windows.h,
Windows automatically defines CreateWindow to call CreateWindowW
(the Unicode function); otherwise it is defined to call
CreateWindowA (the ASCII/ANSI version). The same applies to all
string processing Win32 API functions. (The LPCSTR and LPCWSTR data
types are pointers to constant ANSI and Unicode strings,
respectively; the related LPCTSTR type is just Microsoft's way of
saying "pointer to a constant string in either Unicode or ASCII
format, depending upon whether or not UNICODE has been #defined".)
In the same way, there are also Unicode and ASCII versions of many
structures.
Nothing stops us from
running a UNICODE-compiled application under Windows 95 or 98, but
it surely won't work correctly if we start passing Unicode strings
to the Win32 API functions there, which expect strings in ASCII
format. There are two ways to get around this limitation:

1. Maintain one Unicode version and one non-Unicode version of your app.
2. Convert any Unicode strings to ASCII format before calling a Windows 95/98 Win32 API function.

I prefer the second choice, since working with multiple code bases
or build commands is a constant source of headaches. In addition,
it's much more convenient for the end user to have but one
executable that runs on all platforms (in reality, I guess this is
more or less expected by today's users of Windows programs). Let's
look at an example of such a situation, again involving
CreateWindow.
Converting Between Unicode And ASCII
We'll often need functions to convert from Unicode
to ASCII and vice versa. Such functions are easy to implement
yourself, but you could also use the ones included in the Win32
API:
// Convert an ASCII string to a Unicode String
char SomeAsciiStr[] = "Ascii!";
wchar_t SomeUnicodeStr[1024];
MultiByteToWideChar(CP_ACP, 0, SomeAsciiStr, -1, SomeUnicodeStr, 1024);
// Convert a Unicode string to an ASCII string
char SomeAsciiStr[1024];
wchar_t SomeUnicodeStr[] = L"Unicode!";
WideCharToMultiByte(CP_ACP, 0, SomeUnicodeStr, -1, SomeAsciiStr, 1024, NULL, NULL);
Using The Back Door to Detect Unicode Support
Of course, we
need to determine whether we're running on a Unicode-compatible
version of Windows, because if we are, there's naturally no need to
convert strings to ASCII before calling the API functions. For
reasons unknown, the Win32 API does not provide a function to
determine whether or not a particular Windows installation is
capable of using Unicode. However, it can be detected using the
following little function:
// Use a harmless Win32 API function to determine if Windows is currently capable of using
// Unicode. Since we're calling the Unicode version (W), the function will fail if called on
// a version of Windows that's not Unicode-compatible.
// It might be a good idea to determine this in the app's initialization phase, and store the
// result in a global boolean variable.
bool IsUnicodeOS()
{
    OSVERSIONINFOW os;
    memset(&os, 0, sizeof(OSVERSIONINFOW));
    os.dwOSVersionInfoSize = sizeof(OSVERSIONINFOW);
    return (GetVersionExW(&os) != 0);
}
Supporting Two Worlds
Armed with such a function, it's a no-brainer to wrap the call to
CreateWindow so that it works on both Unicode and non-Unicode
versions of Windows:
HWND MyCreateWindow(const TCHAR *ClassName, const TCHAR *WindowTitle, ...)
{
#ifdef UNICODE // This is a Unicode program, must see if the OS has Unicode support
    if (IsUnicodeOS() == false) // Win95/98, must build ASCII strings
    {
        char aClassName[1024], aWindowTitle[1024];
        WideCharToMultiByte(CP_ACP, 0, ClassName, -1, aClassName, 1024, NULL, NULL);
        WideCharToMultiByte(CP_ACP, 0, WindowTitle, -1, aWindowTitle, 1024, NULL, NULL);
        return CreateWindowA(aClassName, aWindowTitle, ...);
    }
    else
#endif
    {
        // If we get here, we're either running a Unicode version of the app on a Unicode version
        // of Windows, or a non-Unicode version on non-Unicode Windows.
        return CreateWindow(ClassName, WindowTitle, ...); // Use the one defined by windows.h
    }
}
Prizes and Penalties
One thing worth noting (particularly since
this is going to be used in the context of game development) is the
issue of performance.
If you're using Unicode, the Win32 API functions on Windows NT and
Windows 2000 will execute faster, since they do not have to convert
your strings to Unicode before doing the actual work. That
conversion is, however, exactly what happens with ASCII strings:
since the functions use Unicode internally, every ASCII string must
be converted to Unicode first, and that takes time. The situation is
reversed under Windows 95/98.
Localization
The next thing you need to implement is the localization
functionality that actually makes use of all this Unicode stuff.
What this means is that when you're about to display a string of
text to the user, you first browse a database to see if that
particular string is available in some language selected by the
user.
There are many different ways to accomplish this; one
way is to simply use Windows' string table resources for
localization, thus defining a string table for each language you
wish to support (Windows resources are always stored in Unicode
format). This has two obvious drawbacks: First, it makes your
application very hard to port, as non-Microsoft platforms (e.g.
Linux) have no support for such string table resources. Second, it
makes it hard to provide support for new languages after the product
has shipped.
One great way of solving these problems is
to perform localization the same way it's done in Unreal. Here, you
store a file (for instance using regular INI syntax) containing all
the strings for a language, like this:

english.str:
OutOfMemoryError=Out of memory!
FileNotFoundError=File not found!

swedish.str:
OutOfMemoryError=Slut på minne!
FileNotFoundError=Kan inte hitta filen!
Then create
a function to load such strings:
TCHAR *LoadLocalizedString(char *language, char *Key, char *Default);
By
replacing all explicit string references with calls to such a
function, you achieve full language independence. If a string isn't
localized for a particular language (meaning it cannot be found in
the language file), it might be good to default to English (hence
the Default argument in the function prototype above). Here's how
such a function can be used:
Old way: MessageBox(hWnd, "Out of memory!", NULL, MB_OK);
New way: MessageBox(hWnd, LoadLocalizedString("swedish", "OutOfMemoryError", "Out of memory!"), NULL, MB_OK);
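One possible minimal, in-memory sketch of such a function follows (a slight variation on the prototype above, with const-qualified parameters and a TCHAR default; the tables would be filled from the .str files at startup, and all names here are just for illustration):

#include <map>
#include <string>
#include <tchar.h>

typedef std::map<std::string, std::basic_string<TCHAR> > StringTable; // key is the ASCII identifier
std::map<std::string, StringTable> g_Languages;                       // one table per language

const TCHAR *LoadLocalizedString(const char *Language, const char *Key, const TCHAR *Default)
{
    std::map<std::string, StringTable>::const_iterator lang = g_Languages.find(Language);
    if (lang == g_Languages.end())
        return Default;                        // unknown language: fall back to the default text

    StringTable::const_iterator entry = lang->second.find(Key);
    return (entry != lang->second.end()) ? entry->second.c_str() : Default;
}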
The string files must of course be written in Unicode format for
languages to take advantage of the extended character set; such
files can be written with, for instance, Microsoft Word or Notepad.
In fact, Unicode .txt files differ from ASCII .txt files in only
two ways:

1. Unicode .txt files always start with a 2-byte header, the first
byte being 0xFF and the second 0xFE (for little-endian files). Use
these bytes to determine whether the text following the header is in
Unicode format or not. If no Unicode header is found, the two bytes
are of course part of the actual text and not a header (see the
sketch below).
2. The text in Unicode files is stored in Unicode format, meaning
there are two bytes per character for you to read.
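Here's a rough sketch of loading such a file with that header check (assuming the 16-bit wchar_t used throughout this tutorial; the function name and the fixed-size buffer are just for illustration, and error handling is kept to a minimum):

#include <stdio.h>

// Reads a .str/.txt file into Buffer as Unicode text. If the file starts with the
// 0xFF 0xFE header it is already Unicode; otherwise every byte is widened to a wchar_t.
// Returns the number of characters read, or -1 on failure.
int LoadTextFile(const char *FileName, wchar_t *Buffer, int MaxChars)
{
    FILE *file = fopen(FileName, "rb");
    if (file == NULL)
        return -1;

    unsigned char header[2];
    int count = 0;

    if (fread(header, 1, 2, file) == 2 && header[0] == 0xFF && header[1] == 0xFE)
    {
        // Unicode (little-endian): read two bytes per character.
        count = (int)fread(Buffer, sizeof(wchar_t), MaxChars, file);
    }
    else
    {
        // No header: plain ASCII text. The two bytes we just read are part of the text, so rewind.
        fseek(file, 0, SEEK_SET);
        int c;
        while (count < MaxChars && (c = fgetc(file)) != EOF)
            Buffer[count++] = (wchar_t)(unsigned char)c;
    }

    fclose(file);
    return count;
}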
Until Next Time...
If you're not fed up with strings yet, there's one more
tutorial to take care of that.
In the next tutorial, we'll
examine the use of string classes for encapsulating all this
ASCII/Unicode functionality, plus we'll add some extras to make C++
string management really earn the pluses.
Fredrik Andersson (f01fan@efd.lth.se)
Lead Programmer, Herring Interactive
(This address is only temporary; I'll soon have another mail address...)