Andrew

April 14, 2007

UTF-8 Conversion Tricks

Filed under: cplusplus,Optimization,Programming — floodyberry @ 3:04 am

UTF-8 is a wonderfully simple encoding format with some very nice properties, but the juggling required to convert to UTF-16 and UTF-32 can be a little tricky and fairly easy to do poorly. This is further compounded by the various error conditions you must keep an eye out for, such as overlong encodings, reserved ranges, surrogate markers, incomplete sequences, and so on.

These are a couple of tricks you can employ to hopefully keep the conversion fast and robust.

Tail Length Lookup

Our first trick is to use a lookup table for the initial byte. This allows you to both a) tell whether the byte is valid (80 to bf and fe to ff are invalid leading bytes, as are f8 to fd if you don’t want to handle 5 and 6 byte sequences) and b) determine the number of trailing bytes in the expected sequence. We will also need the length of the sequence to quickly ensure there are enough bytes left in the input, as well as for other upcoming tricks, so this actually results in multiple wins.

If you want to cut down on the table size, you could use 128 values indexed with (c - 0x80), or 64 values indexed with (c - 0xc0), although you’ll need an extra check for 80-bf with the 64 value table.
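As a sketch of the 64 value variant (the table and helper names here are illustrative; the code below uses the full 256 entry table):

```cpp
// Hypothetical compact variant: tail lengths for lead bytes c0-ff only.
// Bytes 80-bf never reach the table, hence the extra range check.
const unsigned char UTF8TailLengths64[64] = {
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
	3,3,3,3,3,3,3,3,4,4,4,4,5,5,0,0
};

// Returns the tail length for a lead byte, or 0 for ASCII and for
// invalid lead bytes (80-bf, fe-ff), matching the 256 entry table.
unsigned int TailLength64( unsigned char c ) {
	if ( c < 0xc0 )
		return 0;   // ASCII, or 80-bf which is invalid as a lead byte
	return UTF8TailLengths64[ c - 0xc0 ];
}
```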

const UTF32 Replacement = ( 0xfffd );

const unsigned char UTF8TailLengths[256] = {
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
	2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
	3,3,3,3,3,3,3,3,4,4,4,4,5,5,0,0
};
UTF32 utf8_to_utf32( UTF8 *&s, const UTF8 *end ) {
	UTF32 c = ( *s++ );
	if ( c < 0x80 )
		return ( c );
	unsigned int tail = ( UTF8TailLengths[ c ] );
	if ( ( !tail ) || ( ( s + tail ) > end ) )
		return ( Replacement );

Overlong Encodings and Magic Subtraction

Once we know the length of the expected sequence, we can attempt to decode it. The basic decoding loop is something like:

c &= ( 0x3f >> tail );

unsigned int i;
for (  i = 0; i < tail; ++i ) {
	if ( ( s[i] & 0xc0 ) != 0x80 )
		break;

	c = ( ( c << 6 ) + ( s[i] & 0x3f ) );
}

s += i;
if ( i != tail )
	return ( Replacement );
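Wrapped into a self-contained function, the basic version looks something like this (a sketch with illustrative typedefs; no overlong or codepoint validity checks yet):

```cpp
typedef unsigned char UTF8;
typedef unsigned int  UTF32;

const UTF32 Replacement = 0xfffd;

// Decode one multi-byte sequence; s points at the lead byte and tail is
// its tail length (1-5). On a bad tail byte, s is left pointing at the
// first invalid byte and Replacement is returned.
UTF32 DecodeBasic( const UTF8 *&s, unsigned int tail ) {
	UTF32 c = ( *s++ ) & ( 0x3f >> tail );

	unsigned int i;
	for ( i = 0; i < tail; ++i ) {
		if ( ( s[i] & 0xc0 ) != 0x80 )
			break;
		c = ( c << 6 ) + ( s[i] & 0x3f );
	}

	s += i;
	return ( i == tail ) ? c : Replacement;
}
```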

At the end of decoding, we will still be faced with a problem: how do you tell if it was an overlong encoding? To keep the mapping of UTF-8 to UTF-32 one to one, we are required to reject any encoding that uses more bytes than it requires. Markus Kuhn’s utf8_check.c has a jungle of conditionals to detect the specific lead and tail byte encodings that would indicate an overlong encoding, but this is not something we want to do in our inner loop!

This is where our Overlong Encoding and Magic Subtraction lookup comes in. Since we know the length of the tail, we can create a lookup of the minimum value a sequence with tail bytes needs to be.

Magic Subtraction is a side bonus of knowing the length of the tail. With Magic Subtraction, we can skip masking off the lead byte and eliminate the &0x3f mask in the inner loop! Magic Subtraction works by accumulating the value of the masked-off bits into a single value, and subtracting that value at the end. Because we’re making sure each byte is well formed, we can be sure that the masked-off bits will add up to a constant value. I got this trick from ConvertUTF.c by Mark E. Davis.

If you want to double check the magic subtraction values, you can calculate them yourself like so: Find the constant value for the lead byte of each sequence, then for each byte in the sequence, shift the value over 6 bits and add 80.

  • 1 tail byte: (c0<<6)+80
  • 2 tail bytes: (((e0<<6)+80)<<6)+80
  • 3 tail bytes: (((((f0<<6)+80)<<6)+80)<<6)+80
  • etc.
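If you’d rather not do that arithmetic by hand, the constants can be generated mechanically; this sketch (names are mine) reproduces the table below, including the 32 bit truncation on the 5 and 6 tail byte entries:

```cpp
typedef unsigned int UTF32;

// Compute the magic subtraction constant for a sequence with 'tail'
// trailing bytes: start from the lead byte's marker bits, then for each
// tail byte shift left 6 and add 0x80. Truncated to 32 bits at the end.
UTF32 MagicConstant( unsigned int tail ) {
	// Lead byte marker values for 1..5 tail bytes: c0, e0, f0, f8, fc.
	const unsigned int leads[6] = { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
	unsigned long long magic = leads[ tail ];
	for ( unsigned int i = 0; i < tail; ++i )
		magic = ( magic << 6 ) + 0x80;
	return ( UTF32 )( magic & 0xffffffff );
}
```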
struct UTF8Lookup {
	UTF32 mOverlongMinimum, mMagicSubtraction;
} const UTF8Lookups[ 6 ] = {
	{ 0x00000000, 0x00000000 },
	{ 0x00000080, 0x00003080 },
	{ 0x00000800, 0x000E2080 },
	{ 0x00010000, 0x03C82080 },
	{ 0x00200000, 0xFA082080 },
	{ 0x04000000, 0x82082080 },
};
unsigned int i;
for ( i = 0; i < tail; ++i ) {
	if ( ( s[i] & 0xc0 ) != 0x80 )
		break;

	c = ( c << 6 ) + s[i];
}

s += i;
if ( i != tail )
	return ( Replacement );

const UTF8Lookup &lookup = UTF8Lookups[ tail ];
c -= ( lookup.mMagicSubtraction );
if ( c < lookup.mOverlongMinimum )
	return ( Replacement );

Tail Byte Error Bits

You may have noticed that we are checking every single tail byte to see if it is well formed ( ( *s & 0xc0 ) != 0x80 ). Even if we used a switch on tail to unroll our loop, we would still need all of the conditionals. The Tail Byte Error Bits trick is what I came up with to remove the checks.

If we have a lookup table that has 1 for invalid tail bytes and 0 for valid bytes, we can accumulate these values in a mask variable which will be non-zero at the end of our decoding loop if any of the tail bytes were invalid. Furthermore, if we accumulate them so that ( mask = ( mask << 1 ) | UTF8InvalidTailBytes[ s[i] ] ), we can also tell which was the first invalid tail byte by looking at which bits in mask are set. This allows us to back the source pointer up to the last valid byte.

Feb. 27, 2008: Re-looking at this, I realized you can shrink the table to just 4 values, since only the top 2 bits are being checked. This leaves us with ( mask = ( mask << 1 ) | UTF8InvalidTailBits[ s[i]>>6 ] ).

const unsigned char UTF8InvalidTailBits[4] = {
	1,1,0,1,
};

const unsigned int UTF8InvalidOffset[32] = {
	0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
	5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};
unsigned int mask = ( 0 );
for ( unsigned int i = 0; i < tail; ++i ) {
	c = ( c << 6 ) + s[i];
	mask = ( mask << 1 ) | UTF8InvalidTailBits[ s[i]>>6 ];
}

s += tail;
if ( mask ) {
	s -= UTF8InvalidOffset[ mask ];
	return ( Replacement );
}

const UTF8Lookup &lookup = UTF8Lookups[ tail ];
c -= ( lookup.mMagicSubtraction );
if ( c < lookup.mOverlongMinimum )
	return ( Replacement );
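To see the error-bits bookkeeping in isolation, here is a hypothetical helper (not from the original code; the two tables are repeated for self-containment) that computes the mask for a run of tail bytes. Combined with UTF8InvalidOffset it tells you exactly how far to back up:

```cpp
const unsigned char UTF8InvalidTailBits[4] = { 1,1,0,1 };

const unsigned int UTF8InvalidOffset[32] = {
	0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
	5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};

// Accumulate one error bit per tail byte; the most significant bit of
// the result corresponds to the first (earliest) tail byte.
unsigned int TailErrorMask( const unsigned char *s, unsigned int tail ) {
	unsigned int mask = 0;
	for ( unsigned int i = 0; i < tail; ++i )
		mask = ( mask << 1 ) | UTF8InvalidTailBits[ s[i] >> 6 ];
	return mask;
}
```

After `s += tail`, subtracting `UTF8InvalidOffset[ mask ]` leaves s pointing at the first invalid byte.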

Finishing Up

If we made it this far, we can be sure that our UTF-8 sequence is valid. Well, almost valid. There are still certain UTF-32 values that, even when properly encoded in UTF-8, are still illegal, and must be checked for if you want a robust converter.

The offending values:

  • d800-dfff: UTF-16 uses d800-dfff to encode its surrogate pairs, i.e. values that don’t fit into 16 bits. This means UTF-8/UTF-32 are not allowed to encode these values.
  • fdd0-fdef: This range was added to make your life more difficult.
  • xfffe-xffff: x ranges from 0 to 10 (hex), so the values to check for are fffe-ffff, 1fffe-1ffff, etc.
  • > 10ffff: 10ffff is the highest Unicode codepoint.

The first two ranges can each be checked using a subtraction and a compare instead of two compares, a trick I got from uClibc’s wchar.c. The third range can be checked with a simple AND and a compare. The last range (> 10ffff) is a simple compare.

bool InRange( UTF32 c, UTF32 lo, UTF32 hi ) { return ( (UTF32)( c - lo ) < ( hi - lo + 1 ) ); }

bool IsSurrogate( UTF32 c ) { return ( InRange( c, 0xd800, 0xdfff ) ); }
bool IsNoncharacter( UTF32 c ) { return ( InRange( c, 0xfdd0, 0xfdef ) ); }
bool IsReserved( UTF32 c ) { return ( ( c & 0xfffe ) == 0xfffe ); }
bool IsOutOfRange( UTF32 c ) { return ( c > 0x10ffff ); }

This can be further simplified if you want to make use of a rather excessive lookup table. If you initialize a 32k entry lookup table like so, you can check any value under 0x10000 with a simple lookup.

bool BMPInvalid[ 0x8000 ];

for ( UTF32 c = 0; c < 0x10000; ++c )
	BMPInvalid[ c >> 1 ] = ( IsSurrogate( c ) || IsNoncharacter( c ) || IsReserved( c ) );

bool IsInvalidBMP( UTF32 c ) { return ( BMPInvalid[ c >> 1 ] ); }
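Pasting the pieces together, a complete single-codepoint decoder might look like the following sketch. The typedefs, the branchy TailLength stand-in for the 256 entry table, and the combined IsInvalidCodepoint helper are mine, not from the original code, and I’ve used const pointers here:

```cpp
typedef unsigned char UTF8;
typedef unsigned int  UTF32;

const UTF32 Replacement = 0xfffd;

// Branchy stand-in for the UTF8TailLengths table; returns 0 for ASCII
// and for invalid lead bytes (80-bf, fe-ff).
unsigned int TailLength( UTF32 c ) {
	if ( c < 0xc0 ) return 0;
	if ( c < 0xe0 ) return 1;
	if ( c < 0xf0 ) return 2;
	if ( c < 0xf8 ) return 3;
	if ( c < 0xfc ) return 4;
	if ( c < 0xfe ) return 5;
	return 0;
}

struct UTF8Lookup {
	UTF32 mOverlongMinimum, mMagicSubtraction;
} const UTF8Lookups[6] = {
	{ 0x00000000, 0x00000000 },
	{ 0x00000080, 0x00003080 },
	{ 0x00000800, 0x000E2080 },
	{ 0x00010000, 0x03C82080 },
	{ 0x00200000, 0xFA082080 },
	{ 0x04000000, 0x82082080 },
};

const unsigned char UTF8InvalidTailBits[4] = { 1,1,0,1 };

const unsigned int UTF8InvalidOffset[32] = {
	0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4,
	5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,
};

bool InRange( UTF32 c, UTF32 lo, UTF32 hi ) { return ( (UTF32)( c - lo ) < ( hi - lo + 1 ) ); }

// Hypothetical helper combining the four post-decode checks.
bool IsInvalidCodepoint( UTF32 c ) {
	return InRange( c, 0xd800, 0xdfff )   // UTF-16 surrogates
	    || InRange( c, 0xfdd0, 0xfdef )   // noncharacters
	    || ( ( c & 0xfffe ) == 0xfffe )   // xfffe/xffff
	    || ( c > 0x10ffff );              // above the last codepoint
}

UTF32 utf8_to_utf32( const UTF8 *&s, const UTF8 *end ) {
	UTF32 c = *s++;
	if ( c < 0x80 )
		return c;

	unsigned int tail = TailLength( c );
	if ( ( !tail ) || ( ( s + tail ) > end ) )
		return Replacement;

	// Accumulate tail bytes unmasked; the magic subtraction removes the
	// lead and continuation markers in one step.
	unsigned int mask = 0;
	for ( unsigned int i = 0; i < tail; ++i ) {
		c = ( c << 6 ) + s[i];
		mask = ( mask << 1 ) | UTF8InvalidTailBits[ s[i] >> 6 ];
	}

	s += tail;
	if ( mask ) {
		s -= UTF8InvalidOffset[ mask ];
		return Replacement;
	}

	const UTF8Lookup &lookup = UTF8Lookups[ tail ];
	c -= lookup.mMagicSubtraction;
	if ( c < lookup.mOverlongMinimum )
		return Replacement;

	return IsInvalidCodepoint( c ) ? Replacement : c;
}
```

The pointer reference lets the caller resume at the next sequence, and the offset table backs up on bad input so no potentially valid byte is skipped.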

Cheating

Ok, all we have to do is paste this all together and we’ll have a lightning fast UTF-8 to UTF-32 converter, right? Well.. consider the following example: you have a 1000k UTF-8 file which is 90% ASCII that you want converted to UTF-32. Would you rather convert it with a function that takes 30 cycles for an ASCII character and 50 cycles for a multibyte character, or a function that takes 10 cycles for an ASCII character and 90 cycles for a multibyte character? At that mix, the first averages 32 cycles per character while the second averages only 18.

If you have an input agnostic converter and/or unlucky compiler optimizations, your conversion function can end up looking something like:

mov esi, [ start ]
mov edi, [ dest ]
cmp esi, [ end ]
jge finished
loop:
	movzx eax, byte ptr [ esi ]
	inc esi
	cmp eax, 80h
	jl ascii

	...130 bytes of code to handle multi-byte encodings...

ascii:
	mov [ edi ], eax
	add edi, 4
	cmp esi, [ end ]
	jl loop
finished:

Yuck! Instead of a nice tight loop for the easy ASCII case, this will most likely crap all over the cache and slow you down. If you’re especially unlucky when trusting your compiler to handle templates and functions that should obviously be inlined, you’ll even end up with a call or two per character. Doing a little optimization and tweaking can result in code like:

while ( start < end ) {
	while ( ( start < end ) && ( *start < 0x80 ) ) {
		*to++ = *start++;
	}
	if ( start < end ) {
	...
	}
}
mov esi, [ start ]
mov edi, [ dest ]
cmp esi, [ end ]
jge finished
loop:
	movzx eax, byte ptr [ esi ]
	inc esi
	cmp eax, 80h
	jge notascii
	mov [ edi ], eax
	add edi, 4
	cmp esi, [ end ]
	jl loop
	jmp finished

notascii:

	...130 bytes of code to handle multi-byte encodings...

	mov [ edi ], eax
	add edi, 4
	cmp esi, [ end ]
	jl loop
finished:

This function should cut through ASCII oriented UTF-8 like butter, even if the multi-byte handling is a little slower than a more optimized converter. This reworking may have little to no gain if your input is highly varied, but if you have a good idea what you’ll be facing, it may be worth it to tweak your functions to the data.

5 and 6 byte sequences

The original UTF-8 specification allowed for 5 and 6 byte sequences (up to 31 bits of data); however, only sequences up to 4 bytes are valid under RFC 3629. So what do you do with 5 and 6 byte sequences? You can interpret the entire sequence and dump it as a single invalid character, or dump an invalid character for every byte in the sequence. Since the lead bytes for 5 and 6 byte sequences (f8-fd) will never appear in any 4 byte or shorter sequence, interpreting the sequence as a single (invalid) character appears to make the most sense:

  • If you are not actually processing UTF-8 text, or your input is corrupted, it won’t matter how you interpret them, as any interpretation will produce garbage.
  • If you are processing valid UTF-8 text, they can only appear due to an intentional 5 or 6 byte sequence. While illegal, it still represents a single character, not 5 or 6 invalid characters.

Fin

dreamprojections’ wonderful Syntax Highlighter was a contributing factor to the length of this post.
