Friday, August 26, 2005

Python note: unicode

When you get some error output like this:

'ascii' codec can't encode ...

The first thing to check is how your code handles Python unicode objects.

Python uses a separate type of object to support Unicode, in order to stay compatible with existing string code. Thus Python has two kinds of strings: str and unicode. A str object is much like a standard C string: an array of chars (bytes). It is the type used in most function calls.

Each char (byte) can hold 256 different values, and ASCII uses only the first 128 of them. For a language like English, the alphabet has fewer than 30 letters, so there is plenty of room to give each letter its own char value. Such a mapping is called an encoding. ASCII is the most common encoding for mapping char values to letters.

For languages with far more than 256 characters, such as Chinese and Japanese, several chars need to be combined to represent one character. A problem arises: because different encodings interpret char values differently, the same string can be mapped to different letters / characters. For example, when viewing a web page you can switch your browser to different encodings, and the page will be displayed differently (of course only one encoding is 'correct').

Unicode was proposed to solve this encoding clash, and it aims to include every character / letter of every language. Interestingly, there are several different encodings of Unicode itself, including UTF-8, UTF-16, etc.
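To make the clash concrete, here is a small sketch that interprets the same two bytes under two encodings, and then encodes one character with two Unicode encodings. The byte values and variable names are only illustrative. The u'' and b'' prefixes are an assumption on my side so that the snippet runs on both Python 2.6+ and Python 3.3+ (in Python 3 the two types are called bytes and str instead of str and unicode).

```python
# The same two bytes, read under two different encodings.
raw = b'\xc3\xa9'

as_utf8 = raw.decode('utf-8')      # one character: U+00E9 (e with acute accent)
as_latin1 = raw.decode('latin-1')  # two characters: U+00C3 and U+00A9

print(len(as_utf8), len(as_latin1))

# One Unicode character, written out with two different Unicode encodings.
zhong = u'\u4e2d'
utf8_form = zhong.encode('utf-8')       # three bytes
utf16_form = zhong.encode('utf-16-be')  # two bytes, big-endian, no byte order mark

print(len(utf8_form), len(utf16_form))
```

Only the decoding that matches how the bytes were produced gives back the intended text. The other one succeeds silently and yields the wrong characters, which is what a browser does when it is switched to the wrong encoding.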

Unicode objects in Python are not strings in any particular encoding such as UTF-8; they are sequences of Unicode characters (code points). A unicode object can be seen as the abstract representation of the real characters / letters, which can be encoded into different byte strings by different encodings. In other words, if strings are viewed as the outside form, Unicode can be viewed as the inside meaning.

A unicode object can be converted to a str object with the method 'encode'. It translates the meaning into a raw string in the encoding you name.

Conversely, a raw string can be converted to unicode with the method 'decode'. When you know the 'correct' encoding of a raw string, you can tell the system and get a unicode object.
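A minimal round trip through encode and decode might look like the sketch below. The sample text and the choice of 'utf-8' and 'gbk' are my own illustration, and the u'' prefix is used so the snippet should run on Python 2.6+ as well as Python 3.3+ (where the raw string type is named bytes).

```python
text = u'\u4e2d\u6587'  # the two characters of the word "Chinese"

# encode: abstract characters -> raw bytes in a named encoding
utf8_bytes = text.encode('utf-8')
gbk_bytes = text.encode('gbk')

# The same meaning has different outside forms.
print(len(utf8_bytes), len(gbk_bytes))

# decode: raw bytes -> abstract characters, given the correct encoding
from_utf8 = utf8_bytes.decode('utf-8')
from_gbk = gbk_bytes.decode('gbk')

# Naming the wrong encoding fails (or, with some encodings, gives garbage).
try:
    utf8_bytes.decode('ascii')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
```

Both decoded values compare equal to the original text, even though the byte strings they came from are different.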

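There are functions to help you find out which encodings are in effect: sys.getdefaultencoding() and sys.getfilesystemencoding(). The first returns the encoding Python uses for implicit conversions between str and unicode (normally 'ascii'); the second returns the encoding used for file names on the operating system.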
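A quick way to look at both values is sketched below. The results depend on the Python version and the platform, so no output is shown; as far as I know, Python 2 reports 'ascii' for the default encoding while Python 3 reports 'utf-8'.

```python
import sys

# Encoding used when Python converts between the two string types implicitly.
default_enc = sys.getdefaultencoding()

# Encoding used to turn file names into bytes for the operating system.
fs_enc = sys.getfilesystemencoding()

print(default_enc)
print(fs_enc)
```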

Some methods in Python work with str while others work with unicode. You will have no difficulty with those that take both types, but you need to be careful when calling a method that takes only str or only unicode parameters. Also, the return type of a method is often overlooked. For example, file.readline() returns a str. Even if the file is a 'unicode file' saved as UTF-8, what you get is still a str holding the UTF-8 encoded bytes, not a unicode object.
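The readline() case can be tried out with a temporary file, as in this sketch. The file contents are made up for the example, and the file is opened in binary mode ('rb') so that the behaviour is the same on Python 2.6+ and Python 3.3+: the line comes back as raw bytes and has to be decoded by hand.

```python
import os
import tempfile

text = u'\u4e2d\u6587\n'

# Write the text to a temporary file as UTF-8 bytes.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Reading in binary mode returns raw bytes, not a unicode object.
    with open(path, 'rb') as f:
        line = f.readline()
finally:
    os.remove(path)

# We know the file was written as UTF-8, so we say so explicitly.
decoded = line.decode('utf-8')
print(type(line), type(decoded))
```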

When a unicode object is passed to a method that takes str parameters, or vice versa, the system will try to convert between them automatically. However, because we did not specify an encoding beforehand, it uses ascii by default. When the data contains characters outside the ascii char set, the exception at the beginning of this article occurs. To fix the problem, first check the type of the string with the type() function, then convert it to the correct type with encode() or decode(), specifying the encoding.
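The error and the fix can be sketched as follows. Here the ascii conversion is requested explicitly, which raises the same exception as the automatic conversion described above and also works on Python 3, where the automatic conversion no longer exists. The helper names to_text and to_raw are hypothetical, and the 'utf-8' default is an assumption; use whatever encoding your data really has.

```python
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: return value as a unicode object."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_raw(value, encoding='utf-8'):
    """Hypothetical helper: return value as a raw byte string."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Reproduce the error: this character has no ascii representation.
try:
    u'\u4e2d'.encode('ascii')
    message = ''
except UnicodeEncodeError as exc:
    message = str(exc)
print(message)

# The fix: check the type, then convert with an explicit encoding.
raw = to_raw(u'\u4e2d')
text = to_text(raw)
print(type(raw), type(text))
```

Converting once at the edges of the program (decode on input, encode on output) keeps the two types from mixing in the middle.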
