Unicode and UnicodeString class
The library uses strings for various reasons, among which are:
- File names
- URL links
- Text for writing text
- Logging and tracing
In each of these cases (and any others I fail to mention) the input is either an std string object or a plain char pointer. Either way, it is expected to be encoded in UTF8.
This is rather native for Unix or Mac OSX; however, some other systems, including Windows, require wide characters and UTF16 encoding. The wide char on Mac and Unix is 4 bytes, while on Windows it is 2 bytes, so I can't use it as a common type. Because of this difference I had to go with the UTF8 method.
Two points about that:
- Windows still requires the non-ANSI "_wfopen" to open files with wide character names. I got this ifdefed in my file stream IO (actually in a separate file, referenced by both) on condition that the WIN32 pre-processor symbol exists. Note that everywhere else I'm using "fopen" for file opening. This has a bearing on the type of file names you can provide. For Mac OSX it means you should pass POSIX paths. If you want something else... use a custom stream implementation. See Custom input and output on how to do that.
- To help you with conversions, in case you want such help, and for my own usages, there's the UnicodeString class, where I'm implementing all the encoding conversions that I deemed important. Note that internally UnicodeString holds unsigned longs for each character, making it essentially UCS4/UTF32 encoded, or simply put, the unicode values themselves.
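To make the first point a bit more concrete, here is a minimal sketch of the ifdef idea. It is not the library's actual file stream code: OpenUTF8Path and WidenUTF8 are hypothetical names, and the Windows branch assumes the Win32 MultiByteToWideChar call is used for the UTF8 to UTF16 step.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

#ifdef WIN32
#include <windows.h> // MultiByteToWideChar, CP_UTF8

// Hypothetical helper: widen a UTF8 string to the 2 byte wchar_t used on Windows.
static std::wstring WidenUTF8(const std::string& inUTF8)
{
    // The first call only asks for the required length, terminator included.
    int length = MultiByteToWideChar(CP_UTF8, 0, inUTF8.c_str(), -1, NULL, 0);
    if (length <= 0)
        return std::wstring();
    std::wstring result(static_cast<std::wstring::size_type>(length), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, inUTF8.c_str(), -1, &result[0], length);
    result.resize(static_cast<std::wstring::size_type>(length) - 1); // drop the terminator
    return result;
}
#endif

// Hypothetical helper: open a file whose name is given as a UTF8 encoded string.
// Returns NULL on failure, just like fopen.
std::FILE* OpenUTF8Path(const std::string& inPath, const std::string& inMode)
{
    assert(!inMode.empty()); // a mode such as "rb" or "wb" is required
#ifdef WIN32
    // Windows: the wide character variant is needed for non-ANSI names.
    return _wfopen(WidenUTF8(inPath).c_str(), WidenUTF8(inMode).c_str());
#else
    // Unix and Mac OSX: UTF8 POSIX paths go straight to fopen.
    return std::fopen(inPath.c_str(), inMode.c_str());
#endif
}
```

A caller would pass the same UTF8 file name on every platform, for example OpenUTF8Path(path, "rb"), and check the result for NULL before use.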
The rest of the discussion will relate to UnicodeString methods:
Two methods relate to UTF8, and you can use them for conversion's sake:
EStatusCode FromUTF8(const string& inString);
EStatusCodeAndString ToUTF8() const;
FromUTF8 gets an std string object encoded in UTF8 and builds the string's internal representation with the matching unicode values.
ToUTF8 returns a UTF8 encoded string paired with a status. Check the status; the string is valid only if the status is OK. Not that it is supposed to fail if you have proper Unicode values there.
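For readers who want to see what a conversion between UTF8 bytes and unicode values involves, here is a standalone sketch. It does not use or reproduce the UnicodeString code: DecodeUTF8, EncodeUTF8 and CodePointList are hypothetical names, with a list of unsigned longs standing in for the internal representation. Like ToUTF8, the encoder returns a status paired with the string.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <utility>

// Stand-in for a list of unicode values (an assumption, not the library type).
typedef std::list<unsigned long> CodePointList;

// Decode UTF8 bytes into unicode values. Returns false on malformed input.
// Overlong forms are not rejected here, to keep the sketch short.
bool DecodeUTF8(const std::string& inString, CodePointList& outValues)
{
    outValues.clear();
    std::string::size_type i = 0;
    while (i < inString.size())
    {
        unsigned char lead = static_cast<unsigned char>(inString[i++]);
        unsigned long value;
        int trailing; // number of continuation bytes that must follow
        if (lead < 0x80)                { value = lead;        trailing = 0; }
        else if ((lead & 0xE0) == 0xC0) { value = lead & 0x1F; trailing = 1; }
        else if ((lead & 0xF0) == 0xE0) { value = lead & 0x0F; trailing = 2; }
        else if ((lead & 0xF8) == 0xF0) { value = lead & 0x07; trailing = 3; }
        else return false; // stray continuation byte or invalid lead byte

        for (; trailing > 0; --trailing)
        {
            if (i >= inString.size()) return false; // truncated sequence
            unsigned char next = static_cast<unsigned char>(inString[i++]);
            if ((next & 0xC0) != 0x80) return false; // not a continuation byte
            value = (value << 6) | (next & 0x3F);
        }
        assert(trailing == 0); // every expected continuation byte was consumed
        outValues.push_back(value);
    }
    return true;
}

// Encode unicode values as UTF8. The bool is false when a value is not
// encodable (a surrogate value or anything above 0x10FFFF).
std::pair<bool, std::string> EncodeUTF8(const CodePointList& inValues)
{
    std::string result;
    for (CodePointList::const_iterator it = inValues.begin(); it != inValues.end(); ++it)
    {
        unsigned long v = *it;
        if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
            return std::make_pair(false, std::string());
        if (v < 0x80)
            result += static_cast<char>(v);
        else if (v < 0x800)
        {
            result += static_cast<char>(0xC0 | (v >> 6));
            result += static_cast<char>(0x80 | (v & 0x3F));
        }
        else if (v < 0x10000)
        {
            result += static_cast<char>(0xE0 | (v >> 12));
            result += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (v & 0x3F));
        }
        else
        {
            result += static_cast<char>(0xF0 | (v >> 18));
            result += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
            result += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
            result += static_cast<char>(0x80 | (v & 0x3F));
        }
    }
    return std::make_pair(true, result);
}
```

The same pattern applies when calling the real methods: look at the status first, and use the string only when the status says the conversion worked.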
UTF16 is very interesting for Windows, where wide strings are made of 2 byte wchar_t values (or simply unsigned shorts) encoded using UTF16, or UCS2, sort of.
You can use the unicode class to convert to and from unsigned shorts encoded as UTF16 (No BOM! byte ordering is implied by OS):
EStatusCode FromUTF16UShort(const unsigned short* inShorts, unsigned long inLength);
EStatusCodeAndUShortList ToUTF16UShort() const;
FromUTF16UShort converts a UTF16 encoded unsigned short input to the internal unicode representation. This can be your way out if you are using wstring on Windows (or any system that uses 2 bytes for wchar_t). Just do the relevant casting and pass the result to this method.
ToUTF16UShort returns a list of short values encoded as UTF16, which can be used to initialize a matching 2 byte wstring, or for whatever other usage you have in mind. Here too there's a status code, in case you played with the unicode values (and put in some single surrogate values or something of that naughty sort).
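The surrogate handling mentioned above can be sketched as follows. This is a standalone illustration and not the UnicodeString implementation: DecodeUTF16, EncodeUTF16, CodePointList and UShortList are hypothetical names, and the failure rules (lone surrogates are rejected in both directions) are an assumption about reasonable behaviour.

```cpp
#include <cassert>
#include <list>
#include <utility>

// Stand-ins for the lists involved (assumptions, not the library types).
typedef std::list<unsigned long> CodePointList;
typedef std::list<unsigned short> UShortList;

// Decode UTF16 units (no BOM, native byte order) into unicode values.
// Returns false when a surrogate appears without its partner.
bool DecodeUTF16(const unsigned short* inShorts, unsigned long inLength,
                 CodePointList& outValues)
{
    assert(inShorts != 0 || inLength == 0); // a null buffer must be empty
    outValues.clear();
    for (unsigned long i = 0; i < inLength; ++i)
    {
        unsigned long unit = inShorts[i];
        if (unit >= 0xD800 && unit <= 0xDBFF)
        {
            // High surrogate: a low surrogate must follow it.
            if (i + 1 >= inLength) return false;
            unsigned long low = inShorts[i + 1];
            if (low < 0xDC00 || low > 0xDFFF) return false;
            outValues.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the pair took two units
        }
        else if (unit >= 0xDC00 && unit <= 0xDFFF)
            return false; // low surrogate with no high surrogate before it
        else
            outValues.push_back(unit);
    }
    return true;
}

// Encode unicode values as UTF16 units. The bool is false when a value is a
// surrogate on its own or lies above 0x10FFFF.
std::pair<bool, UShortList> EncodeUTF16(const CodePointList& inValues)
{
    UShortList result;
    for (CodePointList::const_iterator it = inValues.begin(); it != inValues.end(); ++it)
    {
        unsigned long v = *it;
        if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
            return std::make_pair(false, UShortList());
        if (v < 0x10000)
            result.push_back(static_cast<unsigned short>(v));
        else
        {
            v -= 0x10000; // split the remaining 20 bits over two units
            result.push_back(static_cast<unsigned short>(0xD800 + (v >> 10)));
            result.push_back(static_cast<unsigned short>(0xDC00 + (v & 0x3FF)));
        }
    }
    return std::make_pair(true, result);
}
```

On a system with a 2 byte wchar_t, the "relevant casting" for a wstring named w would be along the lines of reinterpret_cast<const unsigned short*>(w.c_str()) together with w.size() as the length.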
Other methods are for actual byte-level encoding to/from UTF16. The data stays in single bytes, but is encoded as UTF16. Here, of course, it is important to state the byte order (with a BOM), or to use the method that matches it:
// convert from UTF16 string, requires BOM
EStatusCode FromUTF16(const string& inString);
EStatusCode FromUTF16(const unsigned char* inString, unsigned long inLength);
// convert from UTF16BE, do not include BOM
EStatusCode FromUTF16BE(const string& inString);
EStatusCode FromUTF16BE(const unsigned char* inString, unsigned long inLength);
// convert from UTF16LE do not include BOM
EStatusCode FromUTF16LE(const string& inString);
EStatusCode FromUTF16LE(const unsigned char* inString, unsigned long inLength);
// convert to UTF16 BE
EStatusCodeAndString ToUTF16BE(bool inPrependWithBom) const;
// convert to UTF16 LE
EStatusCodeAndString ToUTF16LE(bool inPrependWithBom) const;
The first 6 "From" methods deal with byte input encoded in UTF16. Note that there's always an option to use either std strings or unsigned chars. Whatever gets you kickin'.
The first pair, FromUTF16, gets an input with a BOM, so it can determine the byte order and decode accordingly.
The other two pairs, FromUTF16BE and FromUTF16LE, allow you to get input from UTF16 where you know the byte order. In this case no BOM is required (in fact... make sure not to pass one).
The last two methods, ToUTF16BE and ToUTF16LE, can be used in case you are looking for encoding to a byte string of UTF16. You can tell them whether or not to place a BOM.
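The role of the BOM and the byte order in these methods can be shown with a small standalone sketch. It only covers the byte to 16 bit unit step (surrogates are left out), and it is not the library's code: EByteOrder, DetectBOM, ReadUnits, ReadUnitsWithBOM and WriteUnits are hypothetical names.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum EByteOrder { eBigEndian, eLittleEndian, eUnknownOrder };

// Look at the first two bytes for a BOM (the value 0xFEFF in either order).
EByteOrder DetectBOM(const std::string& inBytes)
{
    if (inBytes.size() < 2) return eUnknownOrder;
    unsigned char b0 = static_cast<unsigned char>(inBytes[0]);
    unsigned char b1 = static_cast<unsigned char>(inBytes[1]);
    if (b0 == 0xFE && b1 == 0xFF) return eBigEndian;
    if (b0 == 0xFF && b1 == 0xFE) return eLittleEndian;
    return eUnknownOrder;
}

// Read byte pairs as 16 bit units in a known byte order. The input has no BOM.
bool ReadUnits(const std::string& inBytes, EByteOrder inOrder,
               std::vector<unsigned short>& outUnits)
{
    outUnits.clear();
    if (inOrder == eUnknownOrder || inBytes.size() % 2 != 0) return false;
    for (std::string::size_type i = 0; i < inBytes.size(); i += 2)
    {
        unsigned int first = static_cast<unsigned char>(inBytes[i]);
        unsigned int second = static_cast<unsigned char>(inBytes[i + 1]);
        unsigned int unit = (inOrder == eBigEndian) ? ((first << 8) | second)
                                                    : ((second << 8) | first);
        outUnits.push_back(static_cast<unsigned short>(unit));
    }
    return true;
}

// Input that starts with a BOM: detect the order, skip the BOM, read the rest.
bool ReadUnitsWithBOM(const std::string& inBytes, std::vector<unsigned short>& outUnits)
{
    EByteOrder order = DetectBOM(inBytes);
    if (order == eUnknownOrder) return false; // no BOM, so the order is unknown
    return ReadUnits(inBytes.substr(2), order, outUnits);
}

// Append one 16 bit unit to a byte string in the requested order.
static void AppendUnit(std::string& ioBytes, unsigned short inUnit, EByteOrder inOrder)
{
    char high = static_cast<char>((inUnit >> 8) & 0xFF);
    char low = static_cast<char>(inUnit & 0xFF);
    if (inOrder == eBigEndian) { ioBytes += high; ioBytes += low; }
    else                       { ioBytes += low;  ioBytes += high; }
}

// Write units as bytes in the requested order, optionally starting with a BOM.
std::string WriteUnits(const std::vector<unsigned short>& inUnits, EByteOrder inOrder,
                       bool inPrependWithBom)
{
    assert(inOrder != eUnknownOrder); // the caller must pick a byte order
    std::string result;
    if (inPrependWithBom)
        AppendUnit(result, 0xFEFF, inOrder);
    for (std::vector<unsigned short>::size_type i = 0; i < inUnits.size(); ++i)
        AppendUnit(result, inUnits[i], inOrder);
    return result;
}
```

This also shows why passing a BOM to a fixed byte order reader is a bad idea: ReadUnits would treat the BOM bytes as a regular character.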
There isn't really any dedicated UCS4/UTF32 handling, so there's no direct support for the Mac OSX 4 byte wchar_t. HOWEVER, note that the internal representation already holds a whole unicode value (an unsigned long) per character, so you can access it directly and just set the values from the wchar_t string:
const ULongList& GetUnicodeList() const;
ULongList& GetUnicodeList();
The GetUnicodeList methods just give you direct access to the internal representation of the unicode values, which can be considered the UCS4/UTF32 encoding for all relevant matters.
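Setting the values from a 4 byte wchar_t string could look roughly like the sketch below. It is standalone, so it assumes that ULongList is a std::list of unsigned longs (the typedef here is a stand-in, not taken from the library headers), and SetFromWide and ToWide are hypothetical helper names. With the real class, the list to fill would be the one returned by the non-const GetUnicodeList().

```cpp
#include <cassert>
#include <list>
#include <string>

// Assumption: a stand-in for the library's ULongList type.
typedef std::list<unsigned long> ULongList;

// Copy 4 byte wchar_t values (UCS4/UTF32) straight into a unicode value list.
void SetFromWide(const std::wstring& inWide, ULongList& outValues)
{
    // Only meaningful where wchar_t already holds whole unicode values,
    // which is the 4 byte case (Mac OSX, most Unix systems).
    assert(sizeof(wchar_t) >= 4);
    outValues.clear();
    for (std::wstring::const_iterator it = inWide.begin(); it != inWide.end(); ++it)
        outValues.push_back(static_cast<unsigned long>(*it));
}

// The opposite direction: build a 4 byte wchar_t string from the values.
std::wstring ToWide(const ULongList& inValues)
{
    assert(sizeof(wchar_t) >= 4);
    std::wstring result;
    for (ULongList::const_iterator it = inValues.begin(); it != inValues.end(); ++it)
        result += static_cast<wchar_t>(*it);
    return result;
}
```

On a system with a 2 byte wchar_t this shortcut does not apply, and the UTF16 unsigned short methods are the ones to use instead.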