Exploring Python and the Unicode System


Software applications often need to display messages in a variety of languages, such as English, French, Japanese, Hebrew, or Hindi. Python’s string type uses the Unicode Standard to represent characters, which lets a program work with all of these characters.

A character is the smallest possible component of a text. ‘A’, ‘B’, ‘C’, etc., are all different characters. So are ‘È’ and ‘Í’.

According to the Unicode Standard, characters are represented by code points. A code point is an integer in the range 0 to 0x10FFFF.
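The mapping between characters and code points can be inspected with the built-in ord() and chr() functions. The following is a minimal sketch; the sample characters are chosen only for illustration.

```python
# ord() returns the code point of a one-character string.
for ch in ("A", "\u00C8", "\u20B9"):
    code_point = ord(ch)
    print(ch, code_point, f"U+{code_point:04X}")

# chr() is the inverse of ord().
rupee = chr(0x20B9)

# 0x10FFFF is the highest valid code point; anything above raises ValueError.
highest = chr(0x10FFFF)
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
print(out_of_range_rejected)
```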

A sequence of code points is represented in memory as a sequence of code units, which are in turn mapped to 8-bit bytes. The rules for translating a Unicode string into a sequence of bytes are called a character encoding.

There are three standard Unicode encodings: UTF-8, UTF-16, and UTF-32. UTF stands for Unicode Transformation Format.
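To see how the three encodings differ, the sketch below counts the bytes each one needs for a few sample characters. The helper name encoded_sizes is hypothetical, and the little-endian codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(ch):
    """Return the number of bytes needed to encode ch in each encoding."""
    # The "-le" codecs fix the byte order, so no BOM is prepended.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

# An ASCII letter, a Latin-1 symbol, the Rupee sign, and an emoji.
for ch in ("A", "\u00BE", "\u20B9", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-8 uses one to four bytes per code point, UTF-16 uses two or four, and UTF-32 always uses four.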

Python has had built-in support for Unicode since version 3.0. The str type holds Unicode characters, so any string created with single-, double-, or triple-quoted syntax is stored as Unicode. The default encoding for Python source code is UTF-8.

Hence, a string may contain the literal representation of a Unicode character (¾) or its Unicode value (\u00BE).

var = "¾"
print(var)
var = "\u00BE"
print(var)

The above code will produce the following output −

¾
¾
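The literal and the escape sequence denote the same one-character string. The sketch below checks this and, as an additional illustration not taken from the example above, spells the same character with a \N{...} name escape and looks the name up with the standard unicodedata module.

```python
import unicodedata

literal = "¾"
escaped = "\u00BE"
named = "\N{VULGAR FRACTION THREE QUARTERS}"

# All three spellings produce the same string of length 1.
print(literal == escaped == named)
print(len(escaped))

# unicodedata.name() returns the official character name.
print(unicodedata.name(escaped))
```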

In the following example, the string ‘10’ is stored using the Unicode values of 1 and 0, which are \u0031 and \u0030 respectively.

var = "\u0031\u0030"
print(var)

It will produce the following output −

10

Strings hold text in a human-readable form, while bytes store the characters as binary data. Encoding converts a character string into a series of bytes. Decoding translates the bytes back into human-readable characters and symbols. It is important not to confuse the two methods: encode() is a method of the str type, while decode() is a method of the bytes type.

In the following example, we have a string variable that consists of ASCII characters. ASCII is a subset of the Unicode character set. The encode() method is used to convert the string into a bytes object.

string = "Hello"
tobytes = string.encode('utf-8')
print(tobytes)
string = tobytes.decode('utf-8')
print(string)

The decode() method converts the bytes object back to a str object. The encoding used is UTF-8.

b'Hello'
Hello
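The split between the two types can be checked directly. The following sketch, with variable names chosen only for illustration, shows that a str object has encode() but no decode(), and that a bytes object is the reverse.

```python
text = "Hello"
data = text.encode("utf-8")

# encode() yields bytes; decode() yields str.
print(type(text).__name__, type(data).__name__)

# In Python 3, str has no decode() and bytes has no encode().
print(hasattr(text, "decode"), hasattr(data, "encode"))

# A bytes object is a sequence of integers in the range 0 to 255.
byte_values = list(data)
print(byte_values)
```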

In the following example, the Rupee symbol (₹) is stored in the variable using its Unicode value. We convert the string to bytes and back to str.

string = "\u20B9"
print(string)
tobytes = string.encode('utf-8')
print(tobytes)
string = tobytes.decode('utf-8')
print(string)

When you execute the above code, it will produce the following output −

₹
b'\xe2\x82\xb9'
₹
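The round trip works because UTF-8 can represent every code point. As a hedged sketch of what happens when it cannot, the code below tries to encode the same symbol as ASCII and to decode a truncated byte sequence. The errors argument of encode() and decode() is not covered in the text above and is shown only as one possible way to handle such cases.

```python
rupee = "\u20B9"

# ASCII has no Rupee sign, so strict encoding raises UnicodeEncodeError.
try:
    rupee.encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)

# errors= selects a fallback instead of raising.
print(rupee.encode("ascii", errors="replace"))
print(rupee.encode("ascii", errors="backslashreplace"))

# Dropping the last byte leaves an incomplete UTF-8 sequence.
truncated = rupee.encode("utf-8")[:2]
try:
    truncated.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)

# With errors="replace", undecodable bytes become U+FFFD.
recovered = truncated.decode("utf-8", errors="replace")
```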




