This document provides a brief background on Unicode, its development, and how it is accommodated by Unicode and non-Unicode DataDirect Connect® for ODBC drivers.
Most people know that Unicode is a standard encoding that can be used to support multi-lingual character sets. Unfortunately, understanding Unicode is not as simple as its name would indicate. Software developers have used a number of character encodings, from ASCII to Unicode, to solve the many problems that arise when developing software applications that can be used worldwide.
Most legacy computing environments have used ASCII character encoding developed by the ANSI standards body to store and manipulate character strings inside software applications. ASCII encoding was convenient for programmers because each ASCII character could be stored as a byte. The initial version of ASCII used only 7 of the 8 bits available in a byte, which meant that software applications could use only 128 different characters. This version of ASCII could not account for European characters, and was completely inadequate for Asian characters. Using the eighth bit to extend the total range of characters to 256 added support for most European characters. Today, ASCII refers to either the 7-bit or 8-bit encoding of characters.
As the need increased for applications with additional international support, ANSI again increased the functionality of ASCII by developing an extension to accommodate multi-lingual software. The extension, known as the Double-Byte Character Set or DBCS, allowed existing applications to function without change, but provided for the use of additional characters, including complex Asian characters. With DBCS, characters map to either one byte (such as American ASCII characters) or two bytes (for example, Asian characters). The DBCS environment also introduced the concept of an operating system code page that identified how characters would be encoded into byte sequences in a particular computing environment. DBCS encoding provides a cross-platform mechanism for building multi-lingual applications; however, using variable-width codes is not ideal.
DataDirect Connect for ODBC UNIX drivers are capable of using double-byte character sets. The drivers normally use the character set defined by the default locale "C" unless explicitly pointed to another character set. The default locale "C" corresponds to the 7-bit ASCII character set in which only characters from ISO 8859-1 are valid. Use the following procedure to set the locale to a different character set:
1. Add the following line at the very beginning of applications that use double-byte character sets:
setlocale(LC_ALL, "");
This is a standard UNIX function. It selects the character set indicated by the environment variable LANG as the one to be used by X/Open compliant character handling functions. If this line is not present, or if LANG is either not set or is set to NULL, the default locale "C" is used.
2. Set the LANG environment variable to the appropriate character set. The UNIX command locale -a can be used to display all supported character sets on your system. For more information, see the man pages for "locale" and "setlocale."
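The locale procedure can be sketched in C as follows. This is a minimal illustration, not DataDirect code: the helper name init_locale_from_env is hypothetical, and the fallback to the "C" locale when the environment names an unavailable locale is an assumption about reasonable application behavior.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: adopt the locale named by the environment (LANG and
 * related variables). If that locale is not available on this system,
 * fall back explicitly to the default "C" locale. */
const char *init_locale_from_env(void)
{
    const char *name = setlocale(LC_ALL, "");
    if (name == NULL) {
        /* The environment named a locale this system does not provide. */
        name = setlocale(LC_ALL, "C");
    }
    assert(name != NULL); /* the "C" locale is always available */
    return name;          /* name of the locale now in effect */
}
```

An application would call init_locale_from_env() as the first statement of main, before any character handling or ODBC calls are made.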
Many developers felt that there was a better way to solve the problem than using double-byte character sets. A group of leading software companies joined forces to form the Unicode Consortium. Together, they produced a new solution to building worldwide applications: Unicode. Unicode was originally designed as a fixed-width, uniform two-byte designation that could represent all modern scripts without the use of code pages. The Unicode Consortium has continued to evaluate new characters, and the current number of supported characters is over 95,000.
Although it seemed to be the perfect solution to building multi-lingual applications, Unicode started off with a significant drawback: it would have to be retrofitted into existing computing environments. To use the new paradigm, all applications would have to change. This was clearly unacceptable, and several standards-based transliterations were designed to convert two-byte fixed Unicode values into more appropriate character encodings, including, among others, UTF-8, UCS-2, and UTF-16.
UTF-8 is a standard method for transforming Unicode values into byte sequences that maintain transparency for all ASCII codes. UTF-8 is endorsed by the Unicode Consortium as a standard mechanism for transforming Unicode values and is popular for use with HTML, XML, and similar protocols. UTF-8 is, however, currently used primarily on AIX, HP-UX, Solaris, and Linux.
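To make the ASCII transparency of UTF-8 concrete, the following C sketch encodes a single Unicode code point as UTF-8 bytes. The function name utf8_encode is chosen for illustration only. Code points below 0x80 are written as one unchanged byte, which is why existing ASCII data is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns the byte count, or 0 if cp is
 * not encodable (a surrogate value or anything above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* ASCII: stored unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two-byte sequence */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF) {    /* surrogates are not characters */
        return 0;
    }
    if (cp < 0x10000) {                    /* three-byte sequence */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four-byte sequence */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

For example, the letter A (U+0041) encodes as the single byte 0x41, while the euro sign (U+20AC) encodes as the three bytes 0xE2 0x82 0xAC.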
UCS-2 encoding is a fixed two-byte encoding sequence and is a method for transforming Unicode values into byte sequences for Microsoft Windows platforms. It is the standard for Windows 95, Windows 98, Windows Me, and Windows NT.
UTF-16 is a superset of UCS-2, with the addition of some special characters in surrogate pairs. UTF-16 is the standard encoding for Windows 2000, Windows XP, and Windows Server 2003.
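The relationship between UCS-2 and UTF-16 can be sketched in C. In the illustration below (the function name utf16_encode is hypothetical), any code point that fits in 16 bits is emitted as a single code unit, exactly as UCS-2 would store it, and anything larger is split into a surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16.
 * Writes 1 or 2 code units into out and returns the count, or 0 if cp is
 * not encodable (a surrogate value or anything above U+10FFFF). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;                      /* reserved for surrogate pairs */
    }
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;         /* same value UCS-2 would store */
        return 1;
    }
    if (cp > 0x10FFFF) {
        return 0;                      /* beyond the Unicode range */
    }
    cp -= 0x10000;                     /* 20 bits remain to be split */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+1F600 lies outside the 16-bit range and is encoded as the pair 0xD83D 0xDE00. A component that assumes fixed two-byte characters would see this as two separate values.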
Unicode Support in Databases
Recently, database vendors have begun to support Unicode data types natively in their systems. With Unicode support, one database can hold multiple languages. For example, a large multinational corporation could store expense data in the local languages for the Japanese, U.S., English, German, and French offices in one database.
Not surprisingly, the implementation of Unicode data types varies from vendor to vendor. For example, the Microsoft SQL Server 2000 implementation of Unicode provides data in UTF-16 format, while Oracle provides Unicode data types in UTF-8 and UTF-16 formats. A consistent implementation of Unicode not only depends on the operating system, but also on the database itself.
Unicode Support in ODBC
Prior to the ODBC 3.5 standard, all ODBC access to function calls and string data types was through ANSI encoding (either ASCII or DBCS). Applications and drivers were both ANSI-based.
The ODBC 3.5 standard specified that the ODBC Driver Manager (on both Windows and UNIX) be capable of mapping both Unicode function calls and string data types to ANSI encoding as transparently as possible. This meant that ODBC 3.5-compliant Unicode applications could use Unicode function calls and string data types with ANSI drivers because the Driver Manager could convert them to ANSI. Because of character limitations in ANSI, however, not all conversions are possible.
The ODBC Driver Manager version 3.5 or later, therefore, supports the following configurations:
A Unicode application can work with an ANSI driver because the Driver Manager provides limited Unicode-to-ANSI mapping. The Driver Manager makes it possible for a pre-3.5 ANSI driver to work with a Unicode application. What distinguishes a Unicode driver from a non-Unicode driver is the Unicode driver's capacity to interpret Unicode function calls without the intervention of the Driver Manager, as described in the following section.
The way in which a driver handles function calls from a Unicode application determines whether it is called a "Unicode driver."
Function Calls
Instead of the standard ANSI SQL function calls, such as SQLConnect, Unicode applications employ "W" (wide) function calls, such as SQLConnectW. If the driver is a true Unicode driver, it can understand the "W" function calls and the Driver Manager can pass them through to the driver without conversion to ANSI.
If the driver is a non-Unicode driver, it cannot understand the W function calls, and the Driver Manager must convert them to ANSI calls before sending them to the driver. The Driver Manager determines the ANSI encoding system to which it must convert by referring to a code page. On Windows, this reference is to the Active Code Page. On UNIX, it is to the IANAAppCodePage connection string attribute, part of the odbc.ini file.
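The mapping of a "W" call onto an ANSI call can be sketched in C without any ODBC headers. Everything in this block is a hypothetical stand-in: ansi_connect plays the role of an ANSI driver entry point, connect_w plays the role of the Driver Manager's handling of the wide call, and the target code page is hard-coded to ISO 8859-1, whereas a real Driver Manager would consult the Active Code Page or IANAAppCodePage. The sketch also shows why some conversions fail: a character with no equivalent in the target code page cannot be passed to the ANSI driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ANSI driver entry point (not a real ODBC function):
 * it simply reports the length of the data source name it received. */
int ansi_connect(const char *dsn)
{
    return (int)strlen(dsn);
}

/* Convert a NUL-terminated UTF-16 string to ISO 8859-1.
 * Returns 0 on success, or -1 if a character has no ISO 8859-1
 * equivalent or the output buffer is too small. */
int utf16_to_latin1(const uint16_t *src, char *dst, size_t dst_size)
{
    size_t i;
    for (i = 0; src[i] != 0; i++) {
        if (src[i] > 0xFF || i + 1 >= dst_size) {
            return -1;
        }
        dst[i] = (char)(unsigned char)src[i];
    }
    if (i >= dst_size) {
        return -1;
    }
    dst[i] = '\0';
    return 0;
}

/* Hypothetical "W" entry point as mapped onto an ANSI driver:
 * convert the wide argument first, then call the ANSI function. */
int connect_w(const uint16_t *dsn)
{
    char narrow[256];
    assert(dsn != NULL);
    if (utf16_to_latin1(dsn, narrow, sizeof narrow) != 0) {
        return -1;   /* not representable in the ANSI code page */
    }
    return ansi_connect(narrow);
}
```

With a true Unicode driver, the conversion step disappears and the wide string is handed to the driver unchanged.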
The following examples illustrate these conversion streams for DataDirect Connect for ODBC drivers. On UNIX, the Driver Manager in editions prior to DataDirect Connect for ODBC Edition 5.0 assumes that Unicode applications and Unicode drivers use the same encoding (UTF-8). In DataDirect Connect for ODBC Edition 5.0 on UNIX, the Driver Manager determines the type of Unicode encoding of both the application and the driver, and performs conversions when the application and driver use different types of encoding. This determination is made by checking two ODBC Environment Attributes, SQL_ATTR_APP_UNICODE_TYPE and SQL_ATTR_DRIVER_UNICODE_TYPE. The section "The Driver Manager and Unicode Encoding on UNIX" describes in detail how this is done.
Unicode Application with Non-Unicode Driver
An operation involving a Unicode application and a non-Unicode driver incurs more overhead because function conversion is involved.
[Figures: conversion streams on Windows; on UNIX with DataDirect Connect for ODBC Editions prior to 5.0; and on UNIX with DataDirect Connect for ODBC Edition 5.0]
Unicode Application with Unicode Driver
An operation involving a Unicode application and a Unicode driver that use the same Unicode encoding is more efficient because no function conversion is involved. If the application and the driver each use different types of encoding, there is some conversion overhead. See The Driver Manager and Unicode Encoding on UNIX for details.
[Figures: conversion streams on Windows; on UNIX with DataDirect Connect for ODBC Editions prior to 5.0; and on UNIX with DataDirect Connect for ODBC Edition 5.0]
Data
ODBC C data types are used to indicate the type of C buffers that store data in the application. This is in contrast to SQL data types, which are mapped to native database types to store data in a database (data store). ANSI applications bind to the C data type SQL_C_CHAR and expect to receive information bound in the same way. Similarly, most Unicode applications bind to the C data type SQL_C_WCHAR (wide data type) and expect to receive information bound in the same way. Any ODBC 3.5-compliant Unicode driver must be capable of supporting SQL_C_CHAR and SQL_C_WCHAR so that it can return data of one type to both ANSI and Unicode applications.
When the driver communicates with the database, it must use ODBC SQL data types, such as SQL_CHAR and SQL_WCHAR, that map to native database types. In the case of ANSI data and an ANSI database, the driver receives data bound to SQL_C_CHAR and passes it to the database as SQL_CHAR. The same is true of SQL_C_WCHAR and SQL_WCHAR in the case of Unicode data and a Unicode database.
When data from the application and the data stored in the database differ in format, for example, ANSI application data and Unicode database data, then conversions must be performed. The driver cannot receive SQL_C_CHAR data and pass it to a Unicode database that expects to receive a SQL_WCHAR data type. The driver or the Driver Manager must, therefore, be capable of converting SQL_C_CHAR to SQL_WCHAR, and vice versa.
The simplest cases of data communication are when the application, the driver, and the database are all of the same type and encoding, ANSI to ANSI to ANSI or Unicode to Unicode to Unicode. There is no data conversion involved in these instances.
When there is a difference in types of data, it must be converted from one type to another at the driver or Driver Manager level, which involves additional overhead. The type of driver determines whether these conversions are performed by the driver or the Driver Manager.
With Unicode data and an ANSI database, the Driver Manager performs the conversion.
In the case of ANSI databases, which character set to use is determined from the database code page. Flat-file databases often use the code page of the client, while RDBMSs may manage their own code pages.
The following sections discuss two basic types of data conversion in DataDirect Connect for ODBC drivers and the Driver Manager. The Driver Manager and Unicode Encoding on UNIX describes how the Driver Manager determines the type of Unicode encoding of the application and driver. How an individual driver exchanges different types of data with a particular database at the database level is beyond the scope of this discussion.
Unicode Driver
The Unicode driver, not the Driver Manager, must convert SQL_C_CHAR (ANSI) data to SQL_WCHAR (Unicode) data, and vice versa, as well as SQL_C_WCHAR (Unicode) data to SQL_CHAR (ANSI) data, and vice versa. The driver must use client code page information (Active Code Page on Windows, IANAAppCodePage attribute on UNIX) to determine which ANSI code page to use for the conversions.
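As a rough illustration of the ANSI-to-Unicode direction, the C sketch below widens a buffer of ANSI bytes into UTF-16 code units. It assumes the client code page is ISO 8859-1, where each byte value equals its Unicode code point; a driver would need a lookup table for any other code page. The function name latin1_to_utf16 is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Widen a NUL-terminated ISO 8859-1 string to UTF-16 code units.
 * Returns the number of code units written (excluding the terminator),
 * or -1 if dst cannot hold the result plus its terminator. */
long latin1_to_utf16(const char *src, uint16_t *dst, size_t dst_units)
{
    size_t i;
    assert(src != NULL && dst != NULL);
    for (i = 0; src[i] != '\0'; i++) {
        if (i + 1 >= dst_units) {
            return -1;
        }
        /* Go through unsigned char so bytes above 0x7F do not sign-extend. */
        dst[i] = (uint16_t)(unsigned char)src[i];
    }
    if (i >= dst_units) {
        return -1;
    }
    dst[i] = 0;
    return (long)i;
}
```

Conversion in this direction always succeeds for valid input, because every ISO 8859-1 character has a Unicode equivalent. The reverse direction can fail when the Unicode data contains characters that the ANSI code page lacks.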
ANSI Driver
The Driver Manager, not the ANSI driver, must convert SQL_C_WCHAR (Unicode) data to SQL_CHAR (ANSI) data, and vice versa (see Unicode Support in ODBC for a detailed discussion). This is necessary because ANSI drivers do not support any Unicode ODBC types.
The Driver Manager must use client code page information (Active Code Page on Windows, IANAAppCodePage attribute on UNIX) to determine which ANSI code page to use for the conversions.
The Driver Manager and Unicode Encoding on UNIX
Unicode ODBC drivers on UNIX can be written with either UTF-8 or UTF-16 encoding. This would normally mean that a UTF-8 application could not work with a UTF-16 driver, and, conversely, that a UTF-16 application could not work with a UTF-8 driver. To accomplish the goal of being able to use a single UTF-8 or UTF-16 application with either a UTF-8 or UTF-16 driver, the Driver Manager must be able to determine with which type of encoding the application and driver are written and, if necessary, convert them accordingly.
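The kind of conversion involved when a UTF-8 application meets a UTF-16 driver can be sketched in C as follows. This is an illustrative transcoder written for this discussion, not the Driver Manager's actual implementation; the function name utf8_to_utf16 and its error convention are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a NUL-terminated UTF-8 string to UTF-16 code units.
 * Returns the number of code units written (excluding the terminator),
 * or -1 on malformed input or if dst is too small. */
long utf8_to_utf16(const unsigned char *src, uint16_t *dst, size_t dst_units)
{
    /* Smallest code point allowed for each sequence length. */
    static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    size_t n = 0;

    assert(src != NULL && dst != NULL);
    while (*src != 0) {
        uint32_t cp;
        int extra;                    /* continuation bytes expected */
        unsigned char b = *src++;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;               /* invalid lead byte */

        for (int i = 0; i < extra; i++) {
            if ((*src & 0xC0) != 0x80) {
                return -1;            /* truncated or invalid sequence */
            }
            cp = (cp << 6) | (uint32_t)(*src++ & 0x3F);
        }
        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF)) {
            return -1;
        }
        if (cp < 0x10000) {
            if (n + 1 >= dst_units) return -1;
            dst[n++] = (uint16_t)cp;
        } else {
            if (n + 2 >= dst_units) return -1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        }
    }
    if (n >= dst_units) return -1;
    dst[n] = 0;
    return (long)n;
}
```

Unlike a conversion to an ANSI code page, this conversion loses no information: both encodings represent the same set of characters, so the only cost is the processing overhead.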
To make this determination, the Driver Manager supports two ODBC Environment Attributes: SQL_ATTR_APP_UNICODE_TYPE and SQL_ATTR_DRIVER_UNICODE_TYPE, each with possible values of SQL_DD_CP_UTF8 and SQL_DD_CP_UTF16. The default value is SQL_DD_CP_UTF8.
There are several steps the Driver Manager must undertake before actually connecting to the driver to achieve this goal.
Determine if the driver supports SQL_ATTR_WCHAR_TYPE: SQLSetConnectAttr (SQL_ATTR_WCHAR_TYPE, X) is called in the driver by the Driver Manager, where X is either SQL_DD_CP_UTF8 or SQL_DD_CP_UTF16, depending on the value of the SQL_ATTR_APP_UNICODE_TYPE environment setting. If the driver returns any error on this call to SQLSetConnectAttr, the Driver Manager assumes that the driver does not support this connection attribute.
In the case of an error, the Driver Manager converts all data bound as SQL_C_WCHAR to the application Unicode type as specified by SQL_ATTR_APP_UNICODE_TYPE. The Driver Manager also converts all bound parameter data from the application Unicode type to the driver Unicode type specified by SQL_ATTR_DRIVER_UNICODE_TYPE.
Based on the information it has gathered prior to connection, the Driver Manager either does not have to convert function calls, or it converts to either UTF-8 or UTF-16 all string arguments to calls to the ODBC "W" functions before calling the driver.
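The outcome of these steps can be modeled as a small decision function. The C sketch below is one reading of the text, not the Driver Manager's source: the enum names are hypothetical stand-ins for SQL_DD_CP_UTF8 and SQL_DD_CP_UTF16, and it assumes that a driver which accepts SQL_ATTR_WCHAR_TYPE then exchanges data in the application's encoding, so that no conversion is needed.

```c
#include <assert.h>

/* Hypothetical stand-ins for SQL_DD_CP_UTF8 and SQL_DD_CP_UTF16. */
typedef enum { ENC_UTF8, ENC_UTF16 } unicode_type;

/* What the Driver Manager does with "W" string arguments and wide data. */
typedef enum { DM_PASS_THROUGH, DM_CONVERT } dm_action;

/* app:    encoding given by SQL_ATTR_APP_UNICODE_TYPE
 * driver: encoding given by SQL_ATTR_DRIVER_UNICODE_TYPE
 * driver_accepts_wchar_type: nonzero if the driver returned no error when
 *         SQL_ATTR_WCHAR_TYPE was set to the application's encoding */
dm_action dm_plan(unicode_type app, unicode_type driver,
                  int driver_accepts_wchar_type)
{
    assert(app == ENC_UTF8 || app == ENC_UTF16);
    assert(driver == ENC_UTF8 || driver == ENC_UTF16);

    if (driver_accepts_wchar_type) {
        /* Assumption: the driver adopts the application's encoding. */
        return DM_PASS_THROUGH;
    }
    /* The driver keeps its own encoding; convert only if they differ. */
    return (app == driver) ? DM_PASS_THROUGH : DM_CONVERT;
}
```

Under this model, conversion overhead arises only when the driver rejects the attribute and the two encodings differ.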
UTF-16 Applications on UNIX
Because the DataDirect Driver Manager allows applications to use either UTF-8 or UTF-16 Unicode encoding, applications written in UTF-16 for Windows platforms can now also be used on UNIX platforms.
The Driver Manager assumes a default of UTF-8 applications; therefore, two things must occur for it to determine that the application is UTF-16:
Although Unicode was developed to expand the number of available characters and ultimately to simplify data access in a world-wide setting, these goals have not been fully realized. The character set has been expanded, but data access still involves a number of conversions. This is because Unicode must be able to work with existing ANSI applications and because database vendors make data available in a number of different Unicode encoding formats, including UCS-2, UTF-16, and UTF-8.
ODBC drivers and the ODBC Driver Manager are the components responsible for processing function call and data encoding conversions. Developers of these components must code them to be able to recognize the type of function call and the various Unicode encoding schemes, and to make the appropriate conversions. The drivers and Driver Manager must make these conversions; Unicode data in a database can be accessed only by W function calls, and ANSI data can be accessed only by standard, non-W function calls.
Application developers, on the other hand, need only consider whether a Unicode or ANSI application is most appropriate for a particular circumstance and code their function calls appropriately: W function calls, such as SQLConnectW, for Unicode, or standard function calls, such as SQLConnect, for ANSI. They can also code an application to switch dynamically between Unicode and ANSI calls.
As Unicode applications and data become more prevalent, and more agreements are reached concerning encoding and implementation of Unicode, data access will become more efficient as the need for function call and data conversion is reduced.