Khaled Sayed Blog: Working with Characters and Strings in Windows

This post summarizes Chapter 2, "Working with Characters and Strings," of the book Windows via C/C++, Fifth Edition, by Jeffrey Richter (Wintellect) and Christophe Nasarre.

Introduction & Some Definitions:

UTF : Unicode Transformation Format.
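UTF-16 : encodes most characters as a single 16-bit (2-byte) code unit; characters outside the Basic Multilingual Plane are encoded as two 16-bit code units (a surrogate pair).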
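
To make the definition concrete, here is a minimal sketch of UTF-16 encoding in plain C. The function name utf16_encode is hypothetical (it is not a Windows or C run-time function); it only illustrates how a code point above U+FFFF is split into a surrogate pair.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                       /* out of range, or a lone surrogate */

    if (cp < 0x10000) {                 /* Basic Multilingual Plane: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }

    cp -= 0x10000;                      /* 20 significant bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate: low 10 bits */
    return 2;
}
```

For example, 'A' (U+0041) comes out as the single unit 0x0041, while U+1F600 comes out as the pair 0xD83D 0xDE00. This is why the length of a UTF-16 string in code units is not always its length in characters.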

// Characters in C/C++

char : data type to represent an 8-bit ANSI character.
wchar_t : a built-in data type that, with Microsoft's C/C++ compiler, represents a 16-bit Unicode (UTF-16) character.

// Characters in Windows Programming

CHAR : An 8-bit character
WCHAR : A 16-bit character

// Pointer to 8-bit character(s)

PCHAR : Pointer to CHAR (CHAR *)
PSTR : Pointer to CHAR (CHAR *)
PCSTR : Pointer to Constant CHAR (CONST CHAR *)

// Pointer to 16-bit character(s)

PWCHAR : Pointer to WCHAR (WCHAR *)
PWSTR : Pointer to WCHAR (WCHAR *)
PCWSTR : Pointer to Constant WCHAR (CONST WCHAR *)

// in Generic Case
// #ifdef UNICODE

TCHAR = WCHAR (Wide Character)
PTCHAR = PWCHAR (Pointer to Wide Character)
PTSTR = PWSTR (Pointer to Wide Character)
PCTSTR = PCWSTR (Pointer to Constant Wide Character)

// #else

TCHAR = CHAR (Character)
PTCHAR = PCHAR (Pointer to Character)
PTSTR = PSTR (Pointer to Character)
PCTSTR = PCSTR (Pointer to Constant Character)

// How to Use TCHAR (Generic case):
// #define __TEXT(quote) L##quote   (when UNICODE is defined; otherwise __TEXT(quote) expands to quote unchanged)

TCHAR c = TEXT('A');
TCHAR szBuffer[100] = TEXT("A String");

_tcslen : expands to wcslen when _UNICODE is defined (Unicode), and to strlen otherwise (ANSI)
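
The mechanism behind the generic types can be sketched without the Windows headers. In the following, MY_TCHAR, MY_TEXT, and my_tcslen are hypothetical stand-ins for TCHAR, TEXT, and _tcslen (the real ones come from the Windows and C run-time headers); the only point is that one source line compiles to either the narrow or the wide variant.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-ins for TCHAR, TEXT and _tcslen, so that the
   mechanism can be compiled anywhere. Not the real header definitions. */
#ifdef UNICODE
typedef wchar_t MY_TCHAR;
#define MY_TEXT_(quote) L##quote
#define my_tcslen wcslen
#else
typedef char MY_TCHAR;
#define MY_TEXT_(quote) quote
#define my_tcslen strlen
#endif

/* Extra level of indirection so that macro arguments are expanded first */
#define MY_TEXT(quote) MY_TEXT_(quote)

/* The same source line serves both builds; only the underlying types change. */
size_t generic_length(void)
{
    MY_TCHAR szBuffer[100] = MY_TEXT("A String");
    return my_tcslen(szBuffer);   /* length in characters, not bytes */
}
```

Compiling this once with UNICODE defined and once without gives the same character count from generic_length, although the buffer occupies a different number of bytes in each build.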

Unicode and ANSI Functions in Windows and C Run-Time Library:

If you call a Windows function and pass it an ANSI string (a string of 1-byte characters), the function first converts the string to Unicode and then passes the Unicode string to the operating system. If you are expecting ANSI strings back from a function, the system converts the Unicode string to an ANSI string before returning to your application.

In the C run-time library, unlike in Windows, the ANSI functions do the work themselves; they do not translate the strings to Unicode and then call the Unicode version of the functions internally. And, of course, the Unicode versions do the work themselves too; they do not internally call the ANSI versions.
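
The Windows-side pattern described above (allocate, convert, forward to the wide version) can be sketched roughly as follows. This is only an illustration under stated assumptions: TitleLengthA and TitleLengthW are hypothetical functions, and the portable mbstowcs stands in for the conversion that Windows performs internally.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical "wide" worker: the only place where real work happens.
   Here it simply counts the characters in the title. */
size_t TitleLengthW(const wchar_t *title)
{
    return wcslen(title);
}

/* Hypothetical "ANSI" entry point: allocate, convert, forward, free.
   Returns (size_t)-1 if the allocation or the conversion fails. */
size_t TitleLengthA(const char *title)
{
    /* A string never has more wide characters than it has bytes */
    size_t capacity = strlen(title) + 1;
    wchar_t *wide = (wchar_t *)malloc(capacity * sizeof(wchar_t));
    if (wide == NULL)
        return (size_t)-1;

    size_t result = (size_t)-1;
    if (mbstowcs(wide, title, capacity) != (size_t)-1)
        result = TitleLengthW(wide);    /* forward to the wide version */

    free(wide);
    return result;
}
```

The extra allocation and conversion on every call is the overhead that an application avoids by calling the wide version directly.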

Secure String Functions in the C Run-Time Library:

Any function that modifies a string exposes a potential danger: if the destination string buffer is not large enough to contain the resulting string, memory corruption occurs. Here is an example:
// The following puts 4 characters in a
// 3-character buffer, resulting in memory corruption
WCHAR szBuffer[3] = L"";
wcscpy(szBuffer, L"abc"); // The terminating 0 is a character too!

The problem with the strcpy and wcscpy functions (and most other string manipulation functions) is that they do not accept an argument specifying the maximum size of the buffer, and therefore, the function doesn't know that it is corrupting memory. Because the function doesn't know that it is corrupting memory, it can't report an error back to your code, and therefore, you have no way of knowing that memory was corrupted. And, of course, it would be best if the function just failed without corrupting any memory at all.

Each existing function, like _tcscpy or _tcscat, has a corresponding new function with the same name plus an _s (for secure) suffix.

All of the secure (_s) functions validate their arguments as the first thing they do. Checks are performed to make sure that pointers are not NULL, that integers are within a valid range, that enumeration values are valid, and that buffers are large enough to hold the resulting data. If any of these checks fail, the functions set the thread-local C run-time variable errno and the function returns an errno_t value to indicate success or failure.

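By default, however, these functions never get as far as returning the error. A failed check invokes the C run-time's invalid parameter handler: in a debug build, it displays a user-unfriendly assertion dialog box and then your application is terminated; in a release build, the application is terminated directly. To have the function return its errno_t value instead, install your own handler with _set_invalid_parameter_handler.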
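
The "validate first, then copy" idea can be sketched in portable C. The function checked_wcscpy below is hypothetical and is not the run-time's wcscpy_s: it returns an error code directly instead of invoking an invalid parameter handler, and it uses int where the real functions use errno_t.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical checked copy: validates its arguments before touching dest.
   Returns 0 on success, EINVAL for unusable arguments,
   ERANGE if src (including its terminator) does not fit. */
int checked_wcscpy(wchar_t *dest, size_t destCount, const wchar_t *src)
{
    if (dest == NULL || destCount == 0)
        return EINVAL;
    if (src == NULL) {
        dest[0] = L'\0';
        return EINVAL;
    }

    size_t needed = wcslen(src) + 1;    /* the terminating 0 is a character too */
    if (needed > destCount) {
        dest[0] = L'\0';                /* leave an empty string, never overflow */
        return ERANGE;
    }

    wmemcpy(dest, src, needed);
    return 0;
}
```

With the 3-character buffer from the earlier example, copying L"abc" fails with ERANGE and leaves the buffer empty rather than writing a fourth character past its end.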

Why We Should Use Unicode:

Makes it easy to localize your application.
Lets you distribute a single binary (.exe or DLL) file that supports all languages.
Improves the efficiency of your application because the code performs faster and uses less memory. Windows internally does everything with Unicode characters and strings, so when you pass an ANSI character or string, Windows must allocate memory and convert the ANSI character or string to its Unicode equivalent.
Ensures that your application can easily call all non-deprecated Windows functions, as some Windows functions offer versions that operate only on Unicode characters and strings.
Integrates easily with COM (which requires the use of Unicode characters and strings).
Integrates easily with the .NET Framework (which also requires the use of Unicode characters and strings).

Recommendations for Working with Characters and Strings:

Start thinking of text strings as arrays of characters, not as arrays of chars or arrays of bytes.

Use generic data types (such as TCHAR/PTSTR) for text characters and strings.
Use explicit data types (such as BYTE and PBYTE) for bytes, byte pointers, and data buffers.
Use the TEXT or _T macro for literal characters and strings, but avoid mixing both for the sake of consistency and for better readability.

Perform global replaces. (For example, replace PSTR with PTSTR.)

Fix string arithmetic problems. For example, functions usually expect you to pass a buffer's size in characters, not bytes. This means you should pass _countof(szBuffer) instead of sizeof(szBuffer). Also, if you need to allocate a block of memory for a string and you have the number of characters in the string, remember that you allocate memory in bytes. This means that you must call malloc(nCharacters * sizeof(TCHAR)) and not malloc(nCharacters). Of all the guidelines listed here, this is the most difficult one to remember, and the compiler offers no warnings or errors if you make a mistake. This is a good opportunity to define your own macros, such as the following:

#define chmalloc(nCharacters) ((TCHAR*)malloc((nCharacters) * sizeof(TCHAR)))
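
The difference between counting characters and counting bytes can be sketched in portable C. COUNT_OF is a hypothetical stand-in for _countof, and alloc_wide_chars is a hypothetical function equivalent of the macro above with an added overflow check; wchar_t is used in place of TCHAR so that the sketch does not need the Windows headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for _countof: number of elements in a true array
   (it gives a wrong answer if handed a pointer) */
#define COUNT_OF(array) (sizeof(array) / sizeof((array)[0]))

/* Allocate room for nCharacters wide characters.
   The byte count is the character count times the element size. */
wchar_t *alloc_wide_chars(size_t nCharacters)
{
    if (nCharacters > SIZE_MAX / sizeof(wchar_t))
        return NULL;                    /* the multiplication would overflow */
    return (wchar_t *)malloc(nCharacters * sizeof(wchar_t));
}

/* Capacity of a 16-element buffer, measured in characters rather than bytes */
size_t buffer_capacity_in_chars(void)
{
    wchar_t szBuffer[16];
    /* sizeof(szBuffer) alone would be 16 * sizeof(wchar_t) bytes */
    return COUNT_OF(szBuffer);
}
```

A function form also sidesteps the classic macro pitfall where an argument such as n + 1 is multiplied incorrectly unless every use of the parameter is parenthesized.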

Avoid the printf family of functions, especially for converting ANSI strings to Unicode and vice versa with the %s and %S field types. Use MultiByteToWideChar and WideCharToMultiByte instead, as described in the book's section "Translating Strings Between Unicode and ANSI".

Always define both the UNICODE and _UNICODE symbols or neither of them: the Windows headers check UNICODE, while the C run-time headers check _UNICODE.

In terms of string manipulation functions, here are the basic guidelines that you should follow:

Always work with safe string manipulation functions such as those suffixed with _s or prefixed with StringCch. Use the latter for explicit truncation handling, but prefer the former otherwise.

Don't use the unsafe C run-time library string manipulation functions. More generally, don't use or implement any buffer manipulation routine that does not take the size of the destination buffer as a parameter. The C run-time library provides replacements for buffer manipulation functions, such as memcpy_s, memmove_s, wmemcpy_s, and wmemmove_s. All these functions are available when the __STDC_WANT_SECURE_LIB__ symbol is defined, which is the case by default in CrtDefs.h. So don't undefine __STDC_WANT_SECURE_LIB__.

Don't use Kernel32 functions for string manipulation, such as lstrcat and lstrcpy.

There are two kinds of strings that we compare in our code. Programmatic strings are file names, paths, XML elements and attributes, and registry keys/values. For these, use CompareStringOrdinal, as it is very fast and does not take the user's locale into account. This is good because these strings remain the same no matter where your application is running in the world. User strings are typically strings that appear in the user interface. For these, call CompareString(Ex), as it takes the locale into account when comparing strings.
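
To illustrate what "ordinal" means for programmatic strings, here is a minimal sketch. The function compare_ordinal is hypothetical and is not CompareStringOrdinal (the real function has its own parameters and return-value convention); it only shows a comparison driven by raw character values, with no locale involved.

```c
#include <wchar.h>

/* Hypothetical ordinal comparison: compares raw character values only,
   with no locale or linguistic rules. Returns -1, 0 or 1. */
int compare_ordinal(const wchar_t *a, const wchar_t *b)
{
    /* Skip the common prefix */
    while (*a != L'\0' && *a == *b) {
        ++a;
        ++b;
    }
    if (*a == *b)
        return 0;                       /* both strings ended together */

    /* Compare as unsigned values so the result does not depend on
       whether wchar_t is signed on this platform */
    return ((unsigned long)*a < (unsigned long)*b) ? -1 : 1;
}
```

Under this comparison L"Zebra" sorts before L"apple", because 'Z' (0x5A) is numerically smaller than 'a' (0x61). That is fine for registry keys or XML element names, but it is not the order a user expects to see in a sorted list, which is why user-facing strings call for a locale-aware comparison.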

You don't have a choice: as a professional developer, you can't write code based on unsafe buffer manipulation functions. And this is the reason why all the code in the book relies on these safer functions from the C run-time library.
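
The explicit truncation handling mentioned in the guidelines can be sketched as follows. The function copy_truncating is hypothetical and is not one of the StringCch functions; it only illustrates the behavior of copying as much as fits, always terminating the destination, and reporting whether anything was cut off.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical truncating copy: copies as much of src as fits and always
   null-terminates dest. Returns 1 if all of src fit, 0 if it was
   truncated or the arguments were unusable. */
int copy_truncating(wchar_t *dest, size_t destCount, const wchar_t *src)
{
    if (dest == NULL || destCount == 0 || src == NULL)
        return 0;

    size_t i = 0;
    while (i + 1 < destCount && src[i] != L'\0') {  /* keep room for the 0 */
        dest[i] = src[i];
        ++i;
    }
    dest[i] = L'\0';

    return src[i] == L'\0';             /* did we reach the end of src? */
}
```

Copying L"abc" into a 3-character buffer leaves L"ab" and returns 0, so the caller can decide whether a shortened string is acceptable or should be treated as an error.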


Tuesday, March 18, 2014
