Introduction

I recently upgraded a reasonably large program to use Unicode instead of single-byte characters. Apart from a few legacy modules, I had dutifully used the t- functions and wrapped all my string literals and character constants in `_T()`, so I assumed the switch would be painless. Man, was I ever wrong :((

So, I write this article as therapy for the past two weeks of work and in the hope that it will maybe save others some of the pain and misery I have endured. Sigh...

The basics

In theory, writing code that can be compiled using single- or double-byte characters is straightforward. I was going to write a section on the basics but Chris Maunder has already done it. The techniques he describes are widely known so we'll just get right on to the meat of this article.

Wide file I/O

There are wide versions of the usual stream classes and it is easy to define t-style macros to manage them:

```cpp
#ifdef _UNICODE
    #define tofstream wofstream
    #define tstringstream wstringstream
    // etc...
#else
    #define tofstream ofstream
    #define tstringstream stringstream
    // etc...
#endif // _UNICODE
```

And you would use them like this:

```cpp
tofstream testFile( "test.txt" ) ;
testFile << _T("ABC") ;
```

Now, you would expect the above code to produce a 3-byte file when compiled using single-byte characters and a 6-byte file when using double-byte. Except you don't. You get a 3-byte file for both. WTH is going on?!

It turns out that the C++ standard dictates that wide streams are required to convert wide characters to narrow ones when writing to a file. So in the example above, the wide string L"ABC" (which is 6 bytes long) gets converted to a narrow string (3 bytes) before it is written to the file. And if that wasn't bad enough, how this conversion is done is implementation-dependent.

I haven't been able to find a definitive explanation of why things were
specified like this. My best guess is that a file, by definition, is considered to be a stream of (single-byte) characters and allowing stuff to be written 2 bytes at a time would break that abstraction. Right or wrong, this causes serious problems. For example, you can't write binary data to a wide file stream.

This was particularly problematic for me because I have a lot of functions that look like this:

```cpp
void outputStuff( tostream& os )
{
    // output stuff to the stream
    os << ....
}
```

which would work fine (i.e. it streamed out wide characters) if you
passed in a wide string stream, but not if you passed in a wide file stream.

Wide file I/O: the solution

Stepping through the STL in the debugger (what joy!) revealed that the wide file stream uses a `codecvt` facet to narrow the output just before it is written to the file.

The solution: write a new `codecvt`-derived class that does no conversion at all and attach it to the file stream.

A bit of poking around on Google Groups turned up some code written by P. J. Plauger (the author of the STL that ships with MSVC) but I had problems getting it to compile with STLport 4.5.3. This is the version I finally hacked together:

```cpp
#include <locale>

// nb: MSVC6+STLport can't handle "std::"
// appearing in the NullCodecvtBase typedef.
using std::codecvt ;
typedef codecvt < wchar_t , char , mbstate_t > NullCodecvtBase ;

class NullCodecvt
    : public NullCodecvtBase
{
public:
    typedef wchar_t _E ;
    typedef char _To ;
    typedef mbstate_t _St ;

    explicit NullCodecvt( size_t _R=0 ) : NullCodecvtBase(_R) { }

protected:
    // narrow -> wide: no conversion
    virtual result do_in( _St& _State ,
                   const _To* _F1 , const _To* _L1 , const _To*& _Mid1 ,
                   _E* _F2 , _E* _L2 , _E*& _Mid2
                   ) const
    {
        return noconv ;
    }
    // wide -> narrow: no conversion
    // nb: the output range must be declared as _To* (not _E*),
    // otherwise this doesn't override the base class method.
    virtual result do_out( _St& _State ,
                   const _E* _F1 , const _E* _L1 , const _E*& _Mid1 ,
                   _To* _F2 , _To* _L2 , _To*& _Mid2
                   ) const
    {
        return noconv ;
    }
    virtual result do_unshift( _St& _State ,
            _To* _F2 , _To* _L2 , _To*& _Mid2 ) const
    {
        return noconv ;
    }
    virtual int do_length( _St& _State , const _To* _F1 ,
           const _To* _L1 , size_t _N2 ) const _THROW0()
    {
        return (_N2 < (size_t)(_L1 - _F1)) ? _N2 : _L1 - _F1 ;
    }
    virtual bool do_always_noconv() const _THROW0()
    {
        return true ;
    }
    virtual int do_max_length() const _THROW0()
    {
        return 2 ;
    }
    virtual int do_encoding() const _THROW0()
    {
        return 2 ;
    }
} ;
```

You can see that the functions that are supposed to do the conversions actually do nothing and return `noconv`.

The only thing left to do is instantiate one of these and connect it to the file stream. With MSVC this can be wrapped up in a macro:

```cpp
#define IMBUE_NULL_CODECVT( outputFile ) \
{ \
    NullCodecvt* pNullCodecvt = new NullCodecvt ; \
    locale loc = locale::classic() ; \
    loc._Addfac( pNullCodecvt , NullCodecvt::id, NullCodecvt::_Getcat() ) ; \
    (outputFile).imbue( loc ) ; \
}
```

So, the example code given above that didn't work properly can now be written like this:

```cpp
tofstream testFile ;
IMBUE_NULL_CODECVT( testFile ) ;
testFile.open( "test.txt" , ios::out | ios::binary ) ;
testFile << _T("ABC") ;
```

It is important that the file stream object be imbued with the new facet before the file is opened, as in the example above.
wchar_t problems

`wchar_t` is the type used for wide characters, and it is defined like this:

```cpp
typedef unsigned short wchar_t ;
```

Unfortunately, because it is a typedef rather than a distinct type, the stream inserters can't tell it apart from an unsigned short. Look at the following code:

```cpp
TCHAR ch = _T('A') ;
tcout << ch << endl ;
```

Using narrow strings, this does what you would expect: print out the letter A. Using wide strings, it prints out 65. The compiler decides that you are streaming out an unsigned short and prints it out as a numeric value instead of a wide character. Aaargh!!!

There is no solution for this other than going through your entire code base, looking for instances where you stream out individual characters and fixing them. I wrote a little function to make it a little more obvious what was going on:

```cpp
#ifdef _UNICODE
    // NOTE: Can't stream out wchar_t's - convert to a string first!
    inline std::wstring toStreamTchar( wchar_t ch )
            { return std::wstring(&ch,1) ; }
#else
    // NOTE: It's safe to stream out narrow char's directly.
    inline char toStreamTchar( char ch ) { return ch ; }
#endif // _UNICODE

TCHAR ch = _T('A') ;
tcout << toStreamTchar(ch) << endl ;
```

Wide exception classes

Most C++ programs will be using exceptions to handle error conditions.
Unfortunately, `std::exception` is defined like this:

```cpp
class std::exception
{
    // ...
    virtual const char *what() const throw() ;
} ;
```

and can only handle narrow error messages. I only ever throw exceptions that I have defined myself or `std::runtime_error`, so I wrote a wide version of `std::runtime_error`:

```cpp
class wruntime_error
    : public std::runtime_error
{
public:
                    // --- PUBLIC INTERFACE ---

    // constructors:
    wruntime_error( const std::wstring& errorMsg ) ;
    // copy/assignment:
    wruntime_error( const wruntime_error& rhs ) ;
    wruntime_error& operator=( const wruntime_error& rhs ) ;
    // destructor:
    virtual ~wruntime_error() ;

    // exception methods:
    const std::wstring& errorMsg() const ;

private:
                    // --- DATA MEMBERS ---

    // data members:
    std::wstring mErrorMsg ; ///< Exception error message.
} ;

#ifdef _UNICODE
    #define truntime_error wruntime_error
#else
    #define truntime_error runtime_error
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wruntime_error::wruntime_error( const wstring& errorMsg )
    : runtime_error( toNarrowString(errorMsg) )
    , mErrorMsg(errorMsg)
{
    // NOTE: We give the runtime_error base the narrow version of the
    //  error message. This is what will get shown if what() is called.
    //  The wruntime_error inserter or errorMsg() should be used to get
    //  the wide version.
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

wruntime_error::wruntime_error( const wruntime_error& rhs )
    : runtime_error( toNarrowString(rhs.errorMsg()) )
    , mErrorMsg(rhs.errorMsg())
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

wruntime_error&
wruntime_error::operator=( const wruntime_error& rhs )
{
    // copy the wruntime_error
    runtime_error::operator=( rhs ) ;
    mErrorMsg = rhs.mErrorMsg ;

    return *this ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

wruntime_error::~wruntime_error()
{
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

const wstring& wruntime_error::errorMsg() const { return mErrorMsg ; }
```

(`toNarrowString()` is one of the helper functions given at the end of this article.) Exception classes can then be derived like this:

```cpp
class MyExceptionClass : public std::truntime_error
{
public:
    MyExceptionClass( const std::tstring& errorMsg )
        : std::truntime_error(errorMsg) { }
} ;
```

The final problem was that I had lots and lots of code that looked like this:

```cpp
try
{
    // do something...
}
catch( exception& xcptn )
{
    tstringstream buf ;
    buf << _T("An error has occurred: ") << xcptn ;
    AfxMessageBox( buf.str().c_str() ) ;
}
```

where I had defined an inserter for `std::exception` like this:

```cpp
tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: toTstring() converts a string to a tstring - defined below
    os << toTstring( xcptn.what() ) ;

    return os ;
}
```

The problem is that my inserter called `what()`, which only returns the narrow error message. So I changed it to this:

```cpp
tostream&
operator<<( tostream& os , const exception& xcptn )
{
    // insert the exception
    // NOTE: errorMsg() is always wide, so it goes through toTstring()
    //  to keep this compiling in non-Unicode builds as well.
    if ( const wruntime_error* p =
            dynamic_cast<const wruntime_error*>(&xcptn) )
        os << toTstring( p->errorMsg() ) ;
    else
        os << toTstring( xcptn.what() ) ;

    return os ;
}
```

Now it detects if it has been given a wide exception class and if so,
streams out the wide error message. Otherwise it falls back to using the
standard (narrow) error message. Even though I might exclusively use
Other miscellaneous problems
Miscellaneous useful stuff

Finally, some little helper functions that you might find useful if you are doing this kind of work.

```cpp
extern std::wstring toWideString( const char* pStr , int len=-1 ) ;
inline std::wstring toWideString( const std::string& str )
    { return toWideString(str.c_str(),str.length()) ; }
inline std::wstring toWideString( const wchar_t* pStr , int len=-1 )
    { return (len < 0) ? pStr : std::wstring(pStr,len) ; }
inline std::wstring toWideString( const std::wstring& str )
    { return str ; }

extern std::string toNarrowString( const wchar_t* pStr , int len=-1 ) ;
inline std::string toNarrowString( const std::wstring& str )
    { return toNarrowString(str.c_str(),str.length()) ; }
inline std::string toNarrowString( const char* pStr , int len=-1 )
    { return (len < 0) ? pStr : std::string(pStr,len) ; }
inline std::string toNarrowString( const std::string& str )
    { return str ; }

#ifdef _UNICODE
    inline TCHAR toTchar( char ch )
        { return (wchar_t)ch ; }
    inline TCHAR toTchar( wchar_t ch )
        { return ch ; }
    inline std::tstring toTstring( const std::string& s )
        { return toWideString(s) ; }
    inline std::tstring toTstring( const char* p , int len=-1 )
        { return toWideString(p,len) ; }
    inline std::tstring toTstring( const std::wstring& s )
        { return s ; }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
        { return (len < 0) ? p : std::wstring(p,len) ; }
#else
    inline TCHAR toTchar( char ch )
        { return ch ; }
    inline TCHAR toTchar( wchar_t ch )
        { return (ch >= 0 && ch <= 0xFF) ? (char)ch : '?' ; }
    inline std::tstring toTstring( const std::string& s )
        { return s ; }
    inline std::tstring toTstring( const char* p , int len=-1 )
        { return (len < 0) ? p : std::string(p,len) ; }
    inline std::tstring toTstring( const std::wstring& s )
        { return toNarrowString(s) ; }
    inline std::tstring toTstring( const wchar_t* p , int len=-1 )
        { return toNarrowString(p,len) ; }
#endif // _UNICODE

/* -------------------------------------------------------------------- */

wstring
toWideString( const char* pStr , int len )
{
    ASSERT_PTR( pStr ) ;
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ;

    // work with an explicit length so that the terminating null is never
    // counted (otherwise the second call needs one more character than
    // the buffer we give it)
    if ( len == -1 )
        len = (int)strlen( pStr ) ;

    // figure out how many wide characters we are going to get
    int nChars = MultiByteToWideChar( CP_ACP , 0 , pStr , len , NULL , 0 ) ;
    if ( nChars == 0 )
        return L"" ;

    // convert the narrow string to a wide string
    // nb: slightly naughty to write directly into the string like this
    wstring buf ;
    buf.resize( nChars ) ;
    MultiByteToWideChar( CP_ACP , 0 , pStr , len ,
        const_cast<wchar_t*>(buf.c_str()) , nChars ) ;

    return buf ;
}

/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */

string
toNarrowString( const wchar_t* pStr , int len )
{
    ASSERT_PTR( pStr ) ;
    ASSERT( len >= 0 || len == -1 , _T("Invalid string length: ") << len ) ;

    // work with an explicit length (see toWideString)
    if ( len == -1 )
        len = (int)wcslen( pStr ) ;

    // figure out how many narrow characters we are going to get
    int nChars = WideCharToMultiByte( CP_ACP , 0 ,
             pStr , len , NULL , 0 , NULL , NULL ) ;
    if ( nChars == 0 )
        return "" ;

    // convert the wide string to a narrow string
    // nb: slightly naughty to write directly into the string like this
    string buf ;
    buf.resize( nChars ) ;
    WideCharToMultiByte( CP_ACP , 0 , pStr , len ,
          const_cast<char*>(buf.c_str()) , nChars , NULL , NULL ) ;

    return buf ;
}
```

Taka Muraoka
Updated: 17 Jul 2003
Article content copyright Taka Muraoka, 2003; everything else Copyright © CodeProject, 1999-2006.