A Brief Overview of Character Encodings

By Brooks Kline

The ability to recognize, understand, and migrate among different character encodings is an essential skill in localization engineering. As this knowledge comprises the technical basis of any localization project, it is important for localization professionals to have a good grasp of the subject. This article provides a basic introduction to character encoding.

What is character encoding, and how does it function?

All digital data exists as binary 'code'. There are only two binary digits: 0 and 1, so any character representations must be built from these two digits. In order to work with character data on a computer, characters must be associated with binary numeric values. This creates a map—generally called the character encoding—that contains a collection of code points (numeric values) with character values assigned to them. A piece of software that is programmed to understand a particular character encoding is able to process the binary values that represent character information using this mapping.
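As a rough illustration of the idea of a character map, the following Python sketch builds a tiny made-up encoding. The names TOY_MAP, toy_encode, and toy_decode are hypothetical and invented for this example; the code point assignments do not correspond to any real encoding.

```python
# A toy character map: each character is assigned a numeric code point.
# These assignments are invented for illustration only.
TOY_MAP = {" ": 0, "A": 1, "B": 2, "C": 3}
REVERSE_MAP = {code: char for char, code in TOY_MAP.items()}


def toy_encode(text):
    """Turn each character into its numeric code point."""
    return [TOY_MAP[char] for char in text]


def toy_decode(codes):
    """Turn numeric code points back into characters."""
    return "".join(REVERSE_MAP[code] for code in codes)


codes = toy_encode("A CAB")
print(codes)                              # the numeric values
print([format(c, "08b") for c in codes])  # the same values as 8-bit binary
print(toy_decode(codes))                  # back to text
```

Software that knows the same map can reverse the process; software using a different map would read the same numbers as different characters.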

There are three principal facets of an encoding, or character map:

  • Character sets are abstract collections of characters, such as the Latin alphabet or the Hiragana characters in Japanese
  • Coded character sets (CCS), also known as character encodings, in which the characters from the character set are mapped to code points
  • Character encoding schemes (CES), which define the method by which the virtual map (the CCS) is translated into computer-usable data
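The three facets listed above can be seen on a single character. The sketch below uses Python's built-in Unicode support as one possible example; the choice of the Hiragana letter "あ" and of the two schemes shown is an assumption made for illustration.

```python
# An abstract character from the Hiragana character set.
char = "あ"

# Coded character set: the character is mapped to a code point.
code_point = ord(char)

# Character encoding schemes: the code point becomes actual bytes.
utf8_bytes = char.encode("utf-8")
utf16_bytes = char.encode("utf-16-be")

print(hex(code_point))    # 0x3042
print(utf8_bytes.hex())   # e38182
print(utf16_bytes.hex())  # 3042
```

The code point is the same in both cases; only the byte sequence produced by the encoding scheme differs.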

It is useful to note the distinction between a character—which is an abstract representation of a unit of written language—and a glyph—which is a graphical representation of a character (such as the letters you see on this page).

A little more about bits and bytes

In order to understand how character encodings function, it is helpful to understand basic binary math. A bit is a single unit of digital data, and its value can be either 0 or 1. A byte is 8 bits, which is the amount of data early computer engineers thought practical for storing the value of a character. Binary numbers have a different structure from the decimal numbers with which most of us are familiar. With decimal numbers each position in a sequence of digits represents a power of ten, increasing from right to left. In binary notation each position represents a power of two. For example, the following table shows counting in binary using three bits (or three positions):

Binary Value    Decimal Value
001             1
010             2
011             3
100             4
101             5

Just as with decimal counting, when a position has reached its maximum value (in binary, that value is 1), the position to the left is incremented by 1. Notice that each position, or 'place', represents a value times a power of 2: the first position, counting from the right, is multiplied by 2^0, or 1; the second is multiplied by 2^1, or 2; the third by 2^2, or 4; and so on. That means that adding a single bit to a chunk of data doubles the number of possible values that data can contain. That is, a single bit can hold two possible values; two bits, four; three bits, eight; and so on.
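The positional arithmetic can be checked with a few lines of Python. This is only a sketch; the function name binary_to_decimal is made up for this example (Python's built-in int(bits, 2) does the same job).

```python
def binary_to_decimal(bits):
    """Sum digit * 2**position, counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        total += int(digit) * 2 ** position
    return total


# Reproduce the three-bit counting table.
for bits in ["001", "010", "011", "100", "101"]:
    print(bits, binary_to_decimal(bits))

# Each extra bit doubles the number of possible values.
for n in range(1, 9):
    print(n, "bits ->", 2 ** n, "possible values")
```

With eight bits, one byte, the last line shows 256 possible values.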

Binary values also lend themselves to being expressed in hexadecimal (base 16) notation. In hexadecimal notation there are 16 digits: 0-9, and then a-f. The following table shows counting in hexadecimal and binary (using four bits):

Hex Value    Binary Value    Decimal Value
8            1000            8
9            1001            9
a            1010            10
b            1011            11
c            1100            12

It's common to see character values—which are typically 8 or 16 bits—written as two or four digit hexadecimal numbers. Hexadecimal values are commonly indicated with an 'x' or '0x' preceding the number. For example, in ASCII (see Unicode and its precursors, below) the code point for the Question Mark character, '?' is 63 (decimal), or 0x3F (hexadecimal).
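For readers who want to verify these conversions, here is a short Python sketch using only built-in functions. The variable names are arbitrary.

```python
# Counting from 8 to 12 in hexadecimal, four-bit binary, and decimal.
for value in range(8, 13):
    print(format(value, "x"), format(value, "04b"), value)

# The ASCII code point of the question mark, in decimal and hexadecimal.
question_mark = ord("?")
print(question_mark, hex(question_mark))  # 63 0x3f

# Converting a hexadecimal string back to a number.
from_hex = int("3F", 16)
print(from_hex)  # 63
```

Note that Python prints hexadecimal digits in lowercase; 0x3f and 0x3F are the same value.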

Why are things the way they are?

Sorting out the various encodings we come across can seem complex. Much of this complexity is the result of the way computer systems evolved: older, more limited systems were adapted and expanded as improving technology provided greater resources. Originally, these systems left very little room for character data. A multitude of different 8-bit encodings were created, partly out of necessity and partly because of organizational divisions. As computer resources increased, a universal encoding was developed that overcomes the limitations of, and lack of standardization in, 8-bit encodings: Unicode. However, 8-bit encodings are still prevalent today, because of reluctance to adopt Unicode and a lack of understanding of the benefits of migrating.

One primary drawback to 8-bit encodings is that each supports only a limited number of character sets, or even just a single character set. There is also the problem of competing encodings, where different encodings can be used to encode a given character set. For example, a French document could be encoded using Windows 1252, ISO-8859-1, or Macintosh (Western European), among others. Each of these 8-bit encodings supports the characters of the French document, but they do not all assign the same byte values to the same characters, so data written in one cannot be safely read as another. This lack of standardization can result in confusion and problems, such as corruption or mojibake, when data is processed with the wrong encoding.
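To make the problem of competing encodings concrete, the sketch below encodes the same French word with three of the encodings mentioned, using the codec names Python's standard library gives them (cp1252, latin-1, mac_roman). The sample word is an assumption chosen for illustration.

```python
text = "café"

# Encode the same text with three competing 8-bit encodings.
encoded = {name: text.encode(name) for name in ("cp1252", "latin-1", "mac_roman")}
for name, data in encoded.items():
    print(name, data.hex())

# Reading Mac Roman bytes as if they were Windows 1252 gives the wrong character.
wrong = encoded["mac_roman"].decode("cp1252")
print(wrong)  # cafŽ
```

Windows 1252 and ISO-8859-1 happen to agree on the byte for "é" (0xE9), while Mac Roman uses 0x8E, which Windows 1252 interprets as "Ž".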

Some 8-bit encodings (mainly the ones used for Asian character sets) use a system of combining two 8-bit values to accommodate the large number of characters they contain. These encodings are often referred to as double-byte, or multi-byte, encodings. Some examples are: Shift-JIS (Japanese), GB2312 (Simplified Chinese), and Big5 (Traditional Chinese).
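A quick way to see the double-byte behavior is to encode one character with each of these encodings and count the bytes. The sketch below relies on the shift_jis, gb2312, and big5 codecs that ship with Python; the sample characters are arbitrary choices.

```python
samples = [("shift_jis", "あ"), ("gb2312", "中"), ("big5", "中")]

lengths = {}
for codec, char in samples:
    data = char.encode(codec)
    lengths[codec] = len(data)
    print(codec, char, data.hex(), len(data), "bytes")

# Plain ASCII letters still occupy a single byte in these encodings.
ascii_length = len("A".encode("shift_jis"))
print(ascii_length)
```

Because ASCII characters take one byte and Asian characters take two, the byte length of a string in these encodings depends on its content.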

Unicode and its precursors

Unicode is a universal character encoding consisting of a single, unified character map containing the characters for all modern languages, as well as some archaic character sets used in academia. Unicode was built upon previous standards, and is therefore said to be backwards compatible with them. These earlier standards include: ASCII, which was a very early 7-bit encoding (which means it contains 128 code points); and ISO-8859-1, also known as Latin1, which is an 8-bit encoding that contains the ASCII range as well as an additional 128 extended Latin characters, formatting characters, and control characters. Unicode's character repertoire is kept synchronized with the international standard ISO/IEC 10646.
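The backwards compatibility described above can be checked directly. The following Python sketch is one way to do so; the sample string is an assumption.

```python
# ASCII text produces identical bytes whether encoded as ASCII or as UTF-8.
ascii_text = "Hello?"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Every ISO-8859-1 byte value decodes to the Unicode code point
# with the same number (0 through 255).
latin1_bytes = bytes(range(256))
decoded = latin1_bytes.decode("latin-1")
matches = [ord(char) for char in decoded] == list(range(256))
print(matches)
```

Both checks print True: the first 128 Unicode code points match ASCII, and the first 256 match ISO-8859-1.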

Most characters in everyday use have a code point in Unicode's coded character set (CCS) that fits in 16 bits. These are usually written as a four-digit hexadecimal number. For example, the value 0x0668 is the Arabic-Indic digit eight, "٨". Because Unicode supports all modern language characters, any number of different character sets can be used in a single document (which was impossible with 8-bit encodings). Also, there is no confusion with documents containing character sets that have more than one possible encoding.
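As a small demonstration of several character sets living in one document, the sketch below puts Latin, accented Latin, Hiragana, and Cyrillic characters into a single Python string and prints each code point with its official name. The sample characters are arbitrary.

```python
import unicodedata

# One string holding characters from several scripts.
mixed = "Aéあж"

code_points = [ord(char) for char in mixed]
for char, code_point in zip(mixed, code_points):
    # Print the code point as a four-digit hexadecimal number with its name.
    print(char, format(code_point, "04X"), unicodedata.name(char))
```

Each character has exactly one code point, regardless of which script it belongs to.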

Another important practical aspect of Unicode is its character encoding schemes (CES). Although the Unicode code points are (mostly) 16-bit values, Unicode data are not always written in this form. The CES is a method of translating CCS values to computer-usable byte sequences, taking into account machine architecture and data storage methodologies. One facet of the CES process is translating code points into code units. A code unit is basically a building block for storing character data. It has a defined width; in Unicode it is typically 8 or 16 bits. A CES uses a character encoding form (CEF) that defines the way in which code points are translated into code units. There are fixed-width CEFs, which use the same number of code units per character, and variable-width CEFs, which use a variable number of code units per character. Another aspect of a CES is byte order. When a CES uses 16-bit code units, each code unit can be written with either its most significant byte first, known as big-endian, or its least significant byte first, known as little-endian. The byte order is usually declared at the beginning of a Unicode document with the byte order mark, or BOM.
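Byte order and the BOM are easier to follow with actual bytes in view. The sketch below uses Python's standard codecs module; the sample character is an assumption.

```python
import codecs

char = "あ"  # code point 0x3042

# The same 16-bit code unit written in the two possible byte orders.
big = char.encode("utf-16-be")
little = char.encode("utf-16-le")
print(big.hex(), little.hex())  # 3042 4230

# The byte order marks for each order.
print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex())  # feff fffe

# With a BOM in front, a generic UTF-16 decoder can work out the byte order.
with_bom = codecs.BOM_UTF16_LE + little
recovered = with_bom.decode("utf-16")
print(recovered)
```

Without the BOM, the reader has to be told (or has to guess) which byte order was used.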

Here are the two most common Unicode CESs:

  • UTF-8, which uses a variable-width CEF of one to four 8-bit code units to encode each character. ASCII characters are encoded using single 8-bit code units, resulting in significant size reductions for English and Western European data
  • UTF-16LE, which uses 16-bit code units written in little-endian order. Most characters take a single code unit; characters outside the 16-bit range take two. This is often the 'default' Unicode CES used by many Windows applications, such as Microsoft Word
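The size difference between the two schemes can be measured. The following sketch compares byte counts for a short English and a short Japanese string; both sample strings are assumptions chosen for illustration.

```python
samples = {"English": "Hello", "Japanese": "こんにちは"}

sizes = {}
for label, text in samples.items():
    utf8_size = len(text.encode("utf-8"))
    utf16_size = len(text.encode("utf-16-le"))
    sizes[label] = (utf8_size, utf16_size)
    print(label, "UTF-8:", utf8_size, "bytes,", "UTF-16LE:", utf16_size, "bytes")
```

For this sample, UTF-8 halves the size of the English text, while the Japanese text is smaller in UTF-16LE.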

What are the implications for people working in localization?

One common encoding-related issue is corruption (in some cases called mojibake). Corruption is usually caused by a file being processed with the wrong encoding. Often this leads to some or all of the code point values being misinterpreted and/or lost in the process. The best solution for corruption is prevention, which means always being sure of the proper encoding(s) when processing files—otherwise you'll have to spend time later trying to fix the data in damaged files. Mojibake, which is a Japanese term, refers to data displayed with the incorrect encoding. One common example of this can be seen with web-based documents that fail to specify (or incorrectly declare) the proper encoding.
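The sketch below reproduces a typical case in Python: UTF-8 data read as Windows 1252. It also shows that the damage can sometimes be reversed if no bytes were lost, and that it cannot be once characters have been replaced. The sample word is an assumption.

```python
original = "café"
utf8_bytes = original.encode("utf-8")

# Mojibake: UTF-8 bytes interpreted with the wrong encoding.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # cafÃ©

# If nothing was lost, reversing the wrong step recovers the text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired)  # café

# If unknown bytes were replaced during processing, the data is gone.
lossy = utf8_bytes.decode("ascii", errors="replace")
print(lossy)
```

In the last case each undecodable byte becomes the replacement character U+FFFD, and the original "é" cannot be recovered from the result.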

Another, more general, issue is working with old or outdated processes. A lot of unnecessary work can result from having to accommodate outdated encoding specifications for projects. Also, we are often asked to use complex methodologies that evolved in prior times to accommodate these outdated encodings. The best solution is ultimately prevention: always try to promote and use Unicode for projects.

Copyright © 2012 ENLASO Corporation. All Rights Reserved. ENLASO is a trademark of ENLASO Corporation. Rhonix is a trademark of ENLASO Corporation.