
Vladimir's Tech Blog


Ruby: String vs. UTF-8

December 16, 2008

While debugging and fixing the hessian library, I was reminded once more of the weakness of Ruby’s string support.

The Hessian specification states:

… string encoded in UTF-8. Strings are encoded in chunks. ‘S’ represents the final chunk and ‘s’ represents any initial chunk. Each chunk has a 16-bit length value.

The length is the number of characters, which may be different than the number of bytes.
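To see the difference, take a string with one non-ASCII character (a small illustration, assuming Ruby 1.8 and a UTF-8 encoded source file):

val = "héllo"
val.unpack('U*').length   # => 5 characters (code points)
val.length                # => 6, because Ruby 1.8 counts bytes and "é" takes two bytes in UTF-8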

This is very easy to implement in most programming languages on most platforms. Why is it so hard and error-prone in Ruby?

The original implementation did not work; in Ruby 1.8 String#length returns the byte count, and unpack('C*').pack('U*') treats every byte as its own code point, double-encoding anything outside ASCII:

[ 'S', val.length ].pack('an') << val.unpack('C*').pack('U*')
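The re-encoding is easy to see on a single non-ASCII character (again assuming Ruby 1.8 with UTF-8 source and terminal):

"é".unpack('C*')              # => [195, 169]  the two UTF-8 bytes
"é".unpack('C*').pack('U*')   # => "Ã©"        each byte encoded again as its own code point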

I’ve fixed this with the following:

length = val.unpack('U*').length   # decode the UTF-8 bytes into code points to count characters
[ 'S', length ].pack('an') << val  # 'S' marker and big-endian 16-bit length, then the raw UTF-8 bytes

It works, but it is not completely reliable (it assumes the input is UTF-8 encoded and cannot verify that assumption), and it is ugly. If the string is longer than 64K characters, the implementation becomes even more complex: there is no easy way to slice a string by characters in Ruby, and converting a long string into an array where each element is an object representing one character will eat all your RAM!
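A chunked version might look roughly like this (a sketch only; write_string is a hypothetical helper, and it still pays the cost of holding the whole code-point array in memory):

def write_string(val)
  points = val.unpack('U*')             # every code point becomes an array element -- the RAM problem
  out = ''
  while points.length > 65000
    chunk = points.slice!(0, 65000)     # take the next 65000 characters
    out << [ 's', chunk.length ].pack('an') << chunk.pack('U*')
  end
  out << [ 'S', points.length ].pack('an') << points.pack('U*')
end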

Now, here is how it should work:

('S', len(val), val.encode('utf-8'))   # python: len counts characters, encode returns the UTF-8 bytes

And for slicing into chunks use

val[:65000]    # the first 65000 characters
val[65000:]    # the rest

Can’t we learn from Python? Does everybody have to make their own mistakes? After 1.8, screw it up again in 1.9 and repair it in Ruby 3000?
