
Upgrading an STL-based application to use Unicode.
By Taka Muraoka

Problems that developers will face when upgrading an STL-based application to use Unicode and how to solve them. 
  C++ (VC7.1, VC7, VC6)
Windows (WinXP, Win2K, Win2003, Win95, Win98, WinME)
STL, Win32, VS
  Posted 17 Jul 2003
Introduction

I recently upgraded a reasonably large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t- functions and wrapped all my string literals and character constants in _T() macros, safe in the knowledge that when it came time to switch to Unicode, all I had to do was define UNICODE and _UNICODE and everything would Just Work (tm).

Man, was I ever wrong :((

So, I write this article as therapy for the past two weeks of work and in the hope that it will maybe save others some of the pain and misery I have endured. Sigh...

The basics

In theory, writing code that can be compiled using single- or double-byte characters is straightforward. I was going to write a section on the basics but Chris Maunder has already done it. The techniques he describes are widely known so we'll just get right on to the meat of this article.

Wide file I/O

There are wide versions of the usual stream classes and it is easy to define t-style macros to manage them:

#ifdef _UNICODE
    #define tofstream wofstream 
    #define tstringstream wstringstream
    // etc...
#else 
    #define tofstream ofstream 
    #define tstringstream stringstream
    // etc...
#endif // _UNICODE

And you would use them like this:

tofstream testFile( "test.txt" ) ; 
testFile << _T("ABC") ;

Now, you would expect the above code to produce a 3-byte file when compiled using single-byte characters and a 6-byte file when using double-byte. Except you don't. You get a 3-byte file for both. WTH is going on?!

It turns out that the C++ standard requires wide file streams to convert wide characters to narrow ones when writing to a file. So in the example above, the wide string L"ABC" (which is 6 bytes long) gets converted to a narrow string (3 bytes) before it is written to the file. And if that weren't bad enough, how this conversion is done is implementation-dependent.
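The narrowing is easy to reproduce outside the application. The sketch below is not from the original code base: narrowedFileSize() is a hypothetical helper, and it assumes the stream still carries the default "C" locale, in which plain ASCII characters are expected to come out as one byte each.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: write a wide string through an ordinary wofstream,
// then report how many bytes actually ended up in the file.
std::streamoff narrowedFileSize(const char* path, const std::wstring& text)
{
    {
        std::wofstream out(path,
            std::ios::out | std::ios::binary | std::ios::trunc);
        assert(out.is_open());
        out << text;
    }   // leaving the scope closes (and flushes) the stream

    // Open at the end of the file so that tellg() reports its size in bytes.
    std::ifstream in(path,
        std::ios::in | std::ios::binary | std::ios::ate);
    return static_cast<std::streamoff>(in.tellg());
}
```

For an ASCII-only string such as L"ABC" the reported size should equal the number of characters, not the number of characters times sizeof(wchar_t).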

I haven't been able to find a definitive explanation of why things were specified like this. My best guess is that a file, by definition, is considered to be a stream of (single-byte) characters and allowing stuff to be written 2 bytes at a time would break that abstraction. Right or wrong, this causes serious problems. For example, you can't write binary data to a wofstream because the class will try to narrow it first (usually failing miserably) before writing it out.

This was particularly problematic for me because I have a lot of functions that look like this:

void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}

which would work fine (i.e. it streamed out wide characters) if you passed in a tstringstream object but gave weird results if you passed in a tofstream (because everything was getting narrowed).

Wide file I/O: the solution

Stepping through the STL in the debugger (what joy!) revealed that wofstream invokes a std::codecvt object to narrow the output data just before it is written out to the file. std::codecvt objects are responsible for converting strings from one character set to another and C++ requires that two be provided as standard: one that converts chars to chars (i.e. effectively does nothing) and one that converts wchar_ts to chars. This latter one was the one that was causing me so much grief.

The solution: write a new codecvt-derived class that converts wchar_ts to wchar_ts (i.e. does nothing) and attach it to the wofstream object. When the wofstream tries to convert the data it is writing out, it invokes the new codecvt object, which does nothing, and the data is written out unchanged.

A bit of poking around on Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC) but I had problems getting it to compile with STLport 4.5.3. This is the version I finally hacked together:

#include <locale>

// nb: MSVC6+Stlport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ; 
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

class NullCodecvt
    : public NullCodecvtBase
{

public:
    typedef wchar_t _E ;
    typedef char _To ;
    typedef mbstate_t _St ;

    explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

protected:
    virtual result do_in( _St& _State ,
                   const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                   _E* _F2 , _E* _L2 , _E*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_out( _St& _State ,
                   const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                   _To* _F2 , _To* _L2 , _To*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_unshift( _St& _State , 
            _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
    {
        return noconv ;
    }
    virtual int do_length( _St& _State , const _To* _F1 , 
           const _To* _L1 , size_t _N2 ) const _THROW0()
    {
        return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
    }
    virtual bool do_always_noconv() const _THROW0()
    {
        return true ;
    }
    virtual int do_max_length() const _THROW0()
    {
        return 2 ;
    }
    virtual int do_encoding() const _THROW0()
    {
        return 2 ;
    }
} ;

You can see that the functions that are supposed to do the conversions actually do nothing and return noconv to indicate that.

The only thing left to do is instantiate one of these and connect it to the wofstream object. Using MSVC, you are supposed to use the (non-standard) _ADDFAC() macro to imbue objects with a locale, but it didn't want to work with my new NullCodecvt class so I ripped out the guts of the macro and wrote a new one that did:

#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
    NullCodecvt* pNullCodecvt = new NullCodecvt ; \
    locale loc = locale::classic() ; \
    loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
    (outputFile).imbue( loc ) ; \
}

So, the example code given above that didn't work properly can now be written like this:

tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ; 
testFile << _T("ABC") ;

It is important that the file stream object be imbued with the new codecvt object before it is opened. The file must also be opened in binary mode. If it isn't, every time the file sees a wide character that has the value 10 in its high or low byte, it will perform CR/LF translation, which is definitely not what you want. If you really want a CR/LF sequence, you will have to insert it explicitly using "\r\n" instead of std::endl.
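What a wide file buffer does with a noconv result appears to depend on the standard library, so the null-codecvt trick may not carry over to other compilers. As a hedged alternative, and explicitly a different technique from the NullCodecvt above, the following sketch uses a facet that really does convert, emitting each wchar_t as two little-endian bytes. Utf16LeCodecvt, writeUtf16Le() and readBytes() are hypothetical names. Characters above 0xFFFF are simply reported as errors, and surrogate values are not checked.

```cpp
#include <cassert>
#include <cwchar>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>

// Sketch only: converts each wchar_t to two bytes, low byte first.
class Utf16LeCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
protected:
    result do_out(std::mbstate_t&,
                  const wchar_t* from, const wchar_t* fromEnd,
                  const wchar_t*& fromNext,
                  char* to, char* toEnd, char*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        while (fromNext != fromEnd) {
            if (toEnd - toNext < 2)
                return partial;          // caller has to supply more room
            const unsigned long ch = static_cast<unsigned long>(*fromNext);
            if (ch > 0xFFFFul)
                return error;            // would need a surrogate pair
            *toNext++ = static_cast<char>(ch & 0xFF);           // low byte
            *toNext++ = static_cast<char>((ch >> 8) & 0xFF);    // high byte
            ++fromNext;
        }
        return ok;
    }
    result do_unshift(std::mbstate_t&,
                      char* to, char*, char*& toNext) const override
    {
        toNext = to;                     // stateless: nothing to flush
        return noconv;
    }
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 2; }
    int do_encoding() const noexcept override { return 2; }
};

// Hypothetical helper: imbue first, then open in binary mode, then write.
bool writeUtf16Le(const char* path, const std::wstring& text)
{
    assert(path != nullptr);
    std::wofstream out;
    out.imbue(std::locale(std::locale::classic(), new Utf16LeCodecvt));
    out.open(path, std::ios::out | std::ios::binary | std::ios::trunc);
    out << text;
    out.close();
    return !out.fail();
}

// Hypothetical helper: read a file back as raw bytes for inspection.
std::string readBytes(const char* path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

The order of operations mirrors the advice above: the locale is imbued before open(), and the file is opened with ios::binary so that no CR/LF translation touches the bytes.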

wchar_t problems

wchar_t is the type that is used for wide characters. In VC6, and by default in VC7 and VC7.1, it is defined like this:

typedef unsigned short wchar_t ;

Unfortunately, because it is a typedef instead of a real C++ type, defining it like this has one serious flaw: you can't overload on it, since the compiler cannot tell it apart from unsigned short. Look at the following code:

TCHAR ch = _T('A') ;
tcout << ch << endl ;

Using narrow strings, this does what you would expect: print out the letter A. Using wide strings, it prints out 65. The compiler decides that you are streaming out an unsigned short and prints it out as a numeric value instead of a wide character. Aaargh!!! There is no solution for this other than going through your entire code base, looking for instances where you stream out individual characters and fixing them. I wrote a little function to make it a little more obvious what was going on:

#ifdef _UNICODE
    // NOTE: Can't stream out wchar_t's - convert to a string first!
    inline std::wstring toStreamTchar( wchar_t ch ) 
            { return std::wstring(&ch,1) ; }
#else 
    // NOTE: It's safe to stream out narrow char's directly.
    inline char toStreamTchar( char ch ) { return ch ; }
#endif // _UNICODE    

TCHAR ch = _T('A') ;
tcout << toStreamTchar(ch) << endl ;
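The effect can be reproduced on any compiler by spelling out the unsigned short explicitly, which is all the stream gets to see when wchar_t is only a typedef. This is a minimal sketch; both function names are hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// What the stream sees when wchar_t is just a typedef for unsigned short:
// overload resolution picks the numeric inserter.
std::wstring streamedAsUnsignedShort(wchar_t ch)
{
    std::wostringstream os;
    os << static_cast<unsigned short>(ch);
    return os.str();
}

// The workaround from the text: wrap the character in a one-element string
// so that the string inserter is chosen instead.
std::wstring streamedAsString(wchar_t ch)
{
    std::wostringstream os;
    os << std::wstring(1, ch);
    return os.str();
}
```

Assuming an ASCII-compatible character set, the first function yields the digits of the character code for L'A' and the second yields the letter itself.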

Wide exception classes

Most C++ programs will be using exceptions to handle error conditions. Unfortunately, std::exception is defined like this:

class std::exception
{
    // ...
    virtual const char *what() const throw() ;
} ;

and can only handle narrow error messages. I only ever throw exceptions that I have defined myself or std::runtime_error, so I wrote a wide version of std::runtime_error like this:

class wruntime_error
    : public std::runtime_error
{

public:                 // --- PUBLIC INTERFACE ---

// constructors:
                        wruntime_error( const std::wstring& errorMsg ) ;
// copy/assignment:
                        wruntime_error( const wruntime_error& rhs ) ;
    wruntime_error&     operator=( const wruntime_error& rhs ) ;
// destructor:
    virtual             ~wruntime_error() ;

// exception methods:
    const std::wstring& errorMsg() const ;

private:                // --- DATA MEMBERS ---

// data members:
    std::wstring        mErrorMsg ; ///< Exception error message.
    
} ;

#ifdef _UNICODE
    #define truntime_error wruntime_error
#else 
    #define truntime_error runtime_error
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wruntime_error::wruntime_error( const wstring& errorMsg )
    : runtime_error( toNarrowString(errorMsg) )
    , mErrorMsg(errorMsg)
{
    // NOTE: We give the runtime_error base the narrow version of the 
    //  error message. This is what will get shown if what() is called.
    //  The wruntime_error inserter or errorMsg() should be used to get 
    //  the wide version.
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::wruntime_error( const wruntime_error& rhs )
    : runtime_error( toNarrowString(rhs.errorMsg()) )
    , mErrorMsg(rhs.errorMsg())
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error&
wruntime_error::operator=( const wruntime_error& rhs )
{
    // copy the wruntime_error
    runtime_error::operator=( rhs ) ; 
    mErrorMsg = rhs.mErrorMsg ; 

    return *this ; 
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

wruntime_error::~wruntime_error()
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }

(toNarrowString() is a little helper function that converts a wide string to a narrow string and is given below.) wruntime_error simply keeps a copy of the wide error message itself and gives a narrow version to the base std::exception in case somebody calls what(). Exception classes that I define myself, I modified to look like this:

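// NOTE: In Unicode builds std::truntime_error expands to std::wruntime_error,
//  so this assumes wruntime_error is visible in namespace std.
class MyExceptionClass : public std::truntime_error
{
public:
    MyExceptionClass( const std::tstring& errorMsg ) : 
                            std::truntime_error(errorMsg) { } 
} ;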

The final problem was that I had lots and lots of code that looked like this:

try
{
    // do something...
}
catch( exception& xcptn )
{
    tstringstream buf ;
    buf << _T("An error has occurred: ") << xcptn ; 
    AfxMessageBox( buf.str().c_str() ) ;
}

where I had defined an inserter for std::exception like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: toTstring() converts a string to a tstring - defined below
    os << toTstring( xcptn.what() ) ;

    return os ;
}

The problem is that my inserter called what() which only returns the narrow version of the error message. But if the error message contains foreign characters, I'd like to see them in the error dialog! So I rewrote the inserter to look like this:

tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
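    // NOTE: Going through toTstring() keeps this compiling in narrow builds,
    //  where a wstring can't be streamed into a narrow tostream.
    if ( const wruntime_error* p = 
            dynamic_cast<const wruntime_error*>(&xcptn) )
        os << toTstring( p->errorMsg() ) ; 
    else 
        os << toTstring( xcptn.what() ) ;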

    return os ;
}

Now it detects if it has been given a wide exception class and if so, streams out the wide error message. Otherwise it falls back to using the standard (narrow) error message. Even though I might exclusively use truntime_error-derived classes in my app, this latter case is still important since the STL or other third-party libraries might throw a std::exception-derived error.
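The whole arrangement can be condensed into a portable sketch. It is not the code above: WideError stands in for wruntime_error, narrowLossy() and widenAscii() are hypothetical ASCII-only substitutes for toNarrowString() and toTstring(), and describe() plays the role of the inserter.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for toNarrowString(): keeps ASCII, replaces the rest.
inline std::string narrowLossy(const std::wstring& wide)
{
    std::string narrow;
    narrow.reserve(wide.size());
    for (std::wstring::size_type i = 0; i < wide.size(); ++i)
        narrow += (wide[i] >= 0 && wide[i] < 0x80)
                      ? static_cast<char>(wide[i]) : '?';
    return narrow;
}

// Hypothetical stand-in for toTstring(): widens byte by byte.
inline std::wstring widenAscii(const char* narrow)
{
    std::wstring wide;
    for (; *narrow != '\0'; ++narrow)
        wide += static_cast<wchar_t>(static_cast<unsigned char>(*narrow));
    return wide;
}

// Keeps the wide message and hands a narrow copy to the base for what().
class WideError : public std::runtime_error
{
public:
    explicit WideError(const std::wstring& msg)
        : std::runtime_error(narrowLossy(msg)), mMsg(msg) {}
    const std::wstring& errorMsg() const { return mMsg; }
private:
    std::wstring mMsg;
};

// Prefer the wide message when there is one, else fall back to what().
std::wstring describe(const std::exception& xcptn)
{
    if (const WideError* p = dynamic_cast<const WideError*>(&xcptn))
        return p->errorMsg();
    return widenAscii(xcptn.what());
}
```

A handler that catches std::exception& and calls describe() gets the untouched wide text for a WideError, while what() on the same object returns the lossy narrow copy.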

Other miscellaneous problems

  • Q100639: If you are writing an MFC app using Unicode, you need to specify wWinMainCRTStartup as your entry point (in the Link page of your Project Options).
  • Many Windows functions accept a buffer to return their results in. The buffer size is usually specified in characters, not bytes. So while the following code will work fine when compiled using single-byte characters:
    // get our EXE name 
    TCHAR buf[ _MAX_PATH+1 ] ; 
    GetModuleFileName( NULL , buf , sizeof(buf) ) ;

    it is wrong for double-byte characters. The call to GetModuleFileName() needs to be written like this:

    GetModuleFileName( NULL , buf , sizeof(buf)/sizeof(TCHAR) ) ;
  • If you are reading a file character by character using the wide (or t-) functions, you need to test for WEOF (or _TEOF), not EOF.
  • HttpSendRequest() accepts a string that specifies additional headers to attach to an HTTP request before it is sent. ANSI builds accept a string length of -1 to mean that the header string is null-terminated. Unicode builds require the string length to be explicitly provided. Don't ask me why.
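For the buffer-size item in the list above, one way to avoid repeating the sizeof division is a small helper that counts array elements. countOf() is a hypothetical name and this is only a sketch, but unlike sizeof(buf)/sizeof(TCHAR) it refuses to compile when handed a pointer instead of an array.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements in a built-in array, whatever the element size.
template <typename T, std::size_t N>
inline std::size_t countOf(T (&)[N])
{
    return N;
}

// Intended use (Windows only, shown here as a comment):
//     TCHAR buf[ _MAX_PATH+1 ] ;
//     GetModuleFileName( NULL , buf , static_cast<DWORD>(countOf(buf)) ) ;
```

The same call then passes the correct character count in both narrow and wide builds.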

Miscellaneous useful stuff

Finally, some little helper functions that you might find useful if you are doing this kind of work.

extern std::wstring toWideString( const char* pStr , int len=-1 ) ; 
inline std::wstring toWideString( const std::string& str )
{
    return toWideString(str.c_str(),str.length()) ;
}
inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::wstring(pStr,len) ;
}
inline std::wstring toWideString( const std::wstring& str )
{
    return str ;
}
extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ; 
inline std::string toNarrowString( const std::wstring& str )
{
    return toNarrowString(str.c_str(),str.length()) ;
}
inline std::string toNarrowString( const char* pStr , int len=-1 )
{
    return (len < 0) ? pStr : std::string(pStr,len) ;
}
inline std::string toNarrowString( const std::string& str )
{
    return str ;
}

#ifdef _UNICODE
    inline TCHAR toTchar( char ch )
    {
        return (wchar_t)ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return ch ;
    }
    inline std::tstring toTstring( const std::string& s )
    {
        return toWideString(s) ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return toWideString(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return (len < 0) ? p : std::wstring(p,len) ;
    }
#else 
    inline TCHAR toTchar( char ch )
    {
        return ch ;
    }
    inline TCHAR toTchar( wchar_t ch )
    {
        return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ;
    } 
    inline std::tstring toTstring( const std::string& s )
    {
        return s ;
    }
    inline std::tstring toTstring( const char* p , int len=-1 )
    {
        return (len < 0) ? p : std::string(p,len) ;
    }
    inline std::tstring toTstring( const std::wstring& s )
    {
        return toNarrowString(s) ;
    }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
    {
        return toNarrowString(p,len) ;
    }
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wstring 
toWideString( const char* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // figure out how many wide characters we are going to get 
    int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ; 
    if ( len == -1 )
        -- nChars ; 
    if ( nChars == 0 )
        return L"" ;

    // convert the narrow string to a wide string 
    // nb: slightly naughty to write directly into the string like this
    wstring buf ;
    buf.resize( nChars ) ; 
    MultiByteToWideChar( CP_ACP , 0 , pStr , len , 
        const_cast<wchar_t*>(buf.c_str()) , nChars ) ; 

    return buf ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  */

string 
toNarrowString( const wchar_t* pStr , int len )
{
    ASSERT_PTR( pStr ) ; 
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ; 

    // figure out how many narrow characters we are going to get 
    int nChars = WideCharToMultiByte( CP_ACP , 0 , 
             pStr , len , NULL , 0 , NULL , NULL ) ; 
    if ( len == -1 )
        -- nChars ; 
    if ( nChars == 0 )
        return "" ;

    // convert the wide string to a narrow string
    // nb: slightly naughty to write directly into the string like this
    string buf ;
    buf.resize( nChars ) ;
    WideCharToMultiByte( CP_ACP , 0 , pStr , len , 
          const_cast<char*>(buf.c_str()) , nChars , NULL , NULL ) ; 

    return buf ; 
}
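The same measure-then-convert pattern can be sketched with the standard C library, for code that cannot call the Win32 functions. This is an assumption-laden analogue rather than a replacement: widenWithCurrentLocale() is a hypothetical name, the conversion follows the current C locale instead of CP_ACP, and it stops at the first null character. It also converts into a scratch vector rather than writing into the string's own buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>
#include <vector>

std::wstring widenWithCurrentLocale(const std::string& narrow)
{
    // Pass 1: a null destination only asks how many wide characters result.
    std::mbstate_t state = std::mbstate_t();
    const char* src = narrow.c_str();
    const std::size_t nChars = std::mbsrtowcs(NULL, &src, 0, &state);
    if (nChars == static_cast<std::size_t>(-1))
        throw std::runtime_error("invalid multibyte sequence");
    if (nChars == 0)
        return std::wstring();

    // Pass 2: convert into a buffer we own (one extra slot for the null).
    std::vector<wchar_t> buf(nChars + 1);
    state = std::mbstate_t();
    src = narrow.c_str();
    const std::size_t written =
        std::mbsrtowcs(&buf[0], &src, buf.size(), &state);
    assert(written == nChars);
    (void)written;                       // silence NDEBUG builds

    return std::wstring(&buf[0], nChars);
}
```

The narrow direction can be written the same way with std::wcsrtombs, which accepts a null destination in the same manner.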

Updated: 17 Jul 2003
Article content copyright Taka Muraoka, 2003