
December 16, 2008

Ruby: String vs. UTF-8

While debugging and fixing the Hessian library, I was reminded once more of how weak Ruby’s string support is.

The specification for hessian states:

… string encoded in UTF-8. Strings are encoded in chunks. ‘S’ represents the final chunk and ‘s’ represents any initial chunk. Each chunk has a 16-bit length value.

The length is the number of characters, which may be different than the number of bytes.
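To make the difference concrete, here is a tiny Python 3 sketch. The sample word is only an illustration and does not come from the Hessian specification.

```python
text = "Grüße"                  # sample text with two non-ASCII letters
encoded = text.encode("utf-8")  # the bytes that would go on the wire

char_count = len(text)          # 5 characters
byte_count = len(encoded)       # 7 bytes, because ü and ß take two bytes each
```

The chunk header has to carry char_count, not byte_count.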

This is very easy to implement in most programming languages on most platforms. Why is it so hard and error-prone in Ruby?

The original implementation did not work:

[ 'S', val.length ].pack('an') << val.unpack('C*').pack('U*')

I’ve fixed this with the following:

length = val.unpack('U*').length
[ 'S', length ].pack('an') << val

It works, but it is ugly and not entirely reliable: it assumes the input is UTF-8 encoded and cannot check that assumption. If the string is longer than 64K, the implementation becomes even more complex. There is no easy way to slice a string by characters in Ruby. And converting a long string to an array where each element is an object representing one character will eat all your RAM!

Now, here is how it should work:

import struct
b'S' + struct.pack('>H', len(val)) + val.encode('utf-8') # python: tag, 16-bit character count, UTF-8 bytes

And for slicing, use

val[i:i+65000] # the chunk of up to 65000 characters starting at index i
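Putting both pieces together, a chunked encoder could look roughly like the Python 3 sketch below. The function name encode_hessian_string and the chunk_chars parameter are made up for illustration. The sketch also assumes that "characters" means Unicode code points, which the quoted spec text does not spell out.

```python
import struct

CHUNK_CHARS = 65000  # illustrative chunk size; it must fit into 16 bits


def encode_hessian_string(val, chunk_chars=CHUNK_CHARS):
    """Encode val as 's' chunks followed by one final 'S' chunk (sketch)."""
    # Cut by characters, not bytes; an empty string still yields one final chunk.
    chunks = [val[i:i + chunk_chars]
              for i in range(0, len(val), chunk_chars)] or [""]
    out = bytearray()
    for index, chunk in enumerate(chunks):
        tag = b"S" if index == len(chunks) - 1 else b"s"
        # Tag byte, big-endian 16-bit character count, then the UTF-8 bytes.
        out += tag + struct.pack(">H", len(chunk)) + chunk.encode("utf-8")
    return bytes(out)
```

Each chunk is cut with a start:stop slice, so its character count never exceeds chunk_chars and always fits into the 16-bit length field.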

Can't we learn from Python? Does everybody have to make their own mistakes? After 1.8, screw it up again in 1.9 and repair it in Ruby 3000?

Posted by VladimirDobriakov at 12:23 AM

December 6, 2008

Python 3.0 released

Some are talking about a big break, but some of the changes mentioned simply polish the implementation of well-known Python principles.

Strings

Python has had a perfect way to represent chunks of written human language for ages. In Python 2.6, 2.5 and 2.4 the type was called unicode. The type for representing a byte sequence was called str. The syntax for (unicode) string literals was also not very intuitive: u'hello'.

In Python 3 the names were changed and are perfect now: str holds text and bytes holds a byte sequence.
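A small Python 3 sketch of the split, for illustration only: text lives in str, raw data lives in bytes, and the conversion between the two is always explicit.

```python
text = "hello"                     # str: text, no u'' prefix needed any more
data = text.encode("utf-8")        # bytes: what actually goes over the wire
round_trip = data.decode("utf-8")  # back to str, with the encoding named

try:
    text + data                    # mixing text and bytes is refused
    mixing_allowed = True
except TypeError:
    mixing_allowed = False
```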

There should be only one, obvious way to do things

The obsolete ways, marked as such several point releases ago, have been removed in the current version. So if you strive to memorize the syntax of a programming language and its standard library, you've just gained some spare space in your brain and can have a look at the details.

Posted by VladimirDobriakov at 10:04 AM