9 Globalization and Unicode Support

This chapter describes OCCI support for multibyte and Unicode charactersets.

This chapter contains these topics:

Overview of Globalization and Unicode Support
Specifying Charactersets
Data Types for Globalization and Unicode Support
Objects and OTT Support

Overview of Globalization and Unicode Support

OCCI now enables application development in all Oracle supported multibyte and Unicode charactersets. The UTF16 encoding of Unicode is fully supported. Application programs can specify their charactersets when the OCCI Environment is created. OCCI interfaces that take character string arguments (such as SQL statements, usernames, error messages, object names, and so on) have been extended to handle data in any characterset. Character data from relational tables or objects can be in any characterset. OCCI can be used to develop multi-lingual, global and Unicode applications.

Specifying Charactersets

OCCI applications must specify the client characterset and client national characterset when initializing the OCCI Environment. The client characterset specifies the characterset for all SQL statements, object/user names, error messages, and data of all CHAR data type (CHAR, VARCHAR2, LONG) columns/attributes. The client national characterset specifies the characterset for data of all NCHAR data type (NCHAR, NVARCHAR2) columns/attributes.

A new createEnvironment() interface that takes the client characterset and client national characterset is now provided. This allows OCCI applications to set characterset information dynamically, independent of the NLS_LANG and NLS_CHAR initialization parameter.

Note that if an application specifies OCCIUTF16 as the client characterset (first argument), then the application should use only the UTF16 interfaces of OCCI. These interfaces take UString argument types.

The charactersets in the OCCI Environment are client-side only. They indicate the charactersets the OCCI application uses to interact with Oracle. The database characterset and database national characterset are specified when the database is created. Oracle converts all data from the client characterset/national characterset to the database characterset/national characterset before the server processes the data.

Example 9-1 How to Use Globalization and Unicode Support

Environment *env = Environment:createEnvironment("JA16SJIS","UTF8");

This statement creates an OCCI Environment with JA16SJIS as the client characterset and UTF8 as the client national characterset.

Any valid Oracle characterset name (except AL16UTF16) can be passed to createEnvironment(). An OCCI specific string OCCIUTF16 (in uppercase) can be passed to specify UTF16 as the characterset.

Environment *env = Environment::createEnvironment("OCCIUTF16","OCCIUTF16");
Environment *env = Environment::createEnvironment("US7ASCII", "OCCIUTF16");

Data Types for Globalization and Unicode Support

The data types used for supporting globalization and use of unicode include UString Data Type, Multibyte and UTF16 data, and CLOB and NCLOB Data Types.

UString Data Type

UString is a data type that enables applications and the OCCI library to pass and receive Unicode data in UTF-16 encoding. UString is templated from the C++ STL basic_string with Oracle's utext data type.

typedef basic_string<utext> UString;

Oracle's utext data type is a 2 byte short data type and represents Unicode characters in the UTF-16 encoding. A Unicode character's codepoint can be represented in 1 utext or 2 utexts (2 or 4 bytes). Characters from European and most Asian scripts are represented in a single utext. Supplementary characters defined in the Unicode 3.1 standard are represented with 2 utext elements.

In Microsoft Windows platforms, UString is equivalent to the C++ standard wstring data type. This is because the wchar_t data type is type defined to a 2 byte short in these platforms, which is same as Oracle's utext, allowing applications to use a wstring type variable where a UString would be normally required. Consequently, applications can also pass wide-character string literals, created by prefixing the literal with the letter 'L', to OCCI Unicode APIs.

Example 9-2 Using wstring Data Type

//bind Unicode data using wstring data type
//binding the Euro symbol, UTF16 codepoint 0x20AC
wchar_t eurochars[] = {0x20AC,0x00};
wstring eurostr(eurochars);
stmt->setUString(1,eurostr);
 
//Call the Unicode version of createConnection by
//passing widechar literals
Connection *conn = Connection(L"HR",L"password",L"");

OCCI applications should use the UString data type for data in UTF16 characterset

Multibyte and UTF16 data

For data in multibyte charactersets like JA16SJIS and UTF8, applications should use the C++ string type. The existing OCCI APIs that take string arguments can handle data in any multibyte characterset. Due to the use of string type, OCCI supports only byte length semantics for multibyte characterset strings.

Example 9-3 Binding UTF8 Data Using the string Data Type

//bind UTF8 data
//binding the Euro symbol, UTF8 codepoint : 0xE282AC
char eurochars[] = {0xE2,0x82,0xAC,0x00};
string eurostr(eurochars)
stmt->setString(1,eurostr);//use the string interface

For Unicode data in the UTF16 characterset, the OCCI specific data type: UString and the OCCI UTF16 interfaces must be used.

Example 9-4 Binding UTF16 Data Using the UString Data Type

//bind Unicode data using UString data type
//binding the Euro symbol, UTF16 codepoint 0x20AC
utext eurochars[] = {0x20AC,0x00};
UString eurostr(eurochars);
stmt->setUString(1,eurostr);//use the UString interface

CLOB and NCLOB Data Types

Oracle provides the CLOB and NCLOB data types for storing and processing large amounts of character data. CLOBs represent data in the database characterset and NCLOBs represent data in the database national characterset. CLOBs and NCLOBs can be used as column types in relational tables and as attributes in object types.

The OCCI Clob class is used to work with both CLOB and NCLOB data types. If the database type is NCLOB, then the Clob set CharSetForm() method should be called with OCCI_SQLCS_NCHAR before reading/writing from the LOB.

The OCCI Clob class has support for multibyte and UTF16 charactersets. By default, the Clob interfaces assume the data is encoded in the client-side characterset (for both CLOBs and NCLOBs). To specify a different characterset or to specify the client-side national characterset for a NCLOB, call the setCharSetId() or setCharSetIdUString() methods with the appropriate characterset. The OCCI specific string 'OCCIUTF16' can be passed to indicate UTF16 as the characterset.

Example 9-5 Using CLOB and NCLOB Data Types

//client characterset - ZHT16BIG5, national characterset - UTF16
Environment *env = Environment::createEnvironment("ZHT16BIG5","OCCIUTF16");...
Clob nclobvar;
//for NCLOBs, must call setCharSetForm method. 
nclobvar.setCharSetForm(OCCI_SQLCS_NCHAR);...
//if reading/writing data in UTF16 for this NCLOB, still must 
//explicitly call setCharSetId
nclobvar.setCharSetId("OCCIUTF16")

To read or write data in multibyte charactersets, use the existing read and write interfaces that take a char buffer. New overloaded interfaces that take utext buffers for UTF16 data have been added to the Clob Class as read(), write() and writeChunk() methods. The arguments and return values for these methods are either bytes or characters, depending on the characterset of the LOB.

Objects and OTT Support

Multibyte and UTF16 charactersets are supported for handling character data in object attributes. All CHAR data type (CHAR or VARCHAR2) attributes hold data in the client-side characterset, while all NCHAR data type (NCHAR or NVARCHAR2) attributes hold data in the client-side national characterset. A member variable of UString data type represents an attribute in UTF16 characterset.