UTF-8 Encoding
UTF-8 is a character encoding that converts text in any language into bytes. It is a variable-width encoding for Unicode characters and uses one to four bytes to represent each character.
Why UTF-8?
UTF-8 has become the most popular character encoding standard. 60% of the websites on the internet use UTF-8 and, if you include ASCII, which is a subset of UTF-8, that number goes up to 80%. The main reason for using UTF-8 is to display text in languages other than English. It also supports emojis, which have become very popular in text messages.
How does UTF-8 work?
UTF-8 works by turning the code point of each Unicode character into a sequence of one to four bytes. The high-order bits of the first byte tell us how many bytes were used to encode the character.
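As a quick illustration, the sketch below uses Python's built-in `str.encode` to print the bytes of a few characters in binary, so the high-order bits are visible. The helper name `utf8_bits` and the sample characters are chosen here for illustration only.

```python
def utf8_bits(ch: str) -> list[str]:
    """Return each UTF-8 byte of one character as an 8-bit binary string."""
    return [format(b, "08b") for b in ch.encode("utf-8")]


# A (1 byte), ओ (3 bytes), and an emoji (4 bytes)
for ch in ["A", "\u0913", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_bits(ch))
```

The first byte of each result starts with 0, 1110 and 11110 respectively, and every following byte starts with 10.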
1 Byte UTF-8 Encoding
ASCII characters, which have code points from 0 to 127, are represented using a single byte. All ASCII characters can be represented using 7 bits. This frees up the first bit of the byte to store 0, which tells us that the character was encoded into a single byte. For example, the capital letter A has a code point of 65. It can be represented using the binary number 1000001.
Character: A (hex 41)

Bit Utility | Indicator | Bit 1 | Bit 2 | Bit 3 | Bit 4 | Bit 5 | Bit 6 | Bit 7 |
---|---|---|---|---|---|---|---|---|
Binary | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 Byte UTF-8 Encoding
2-byte encodings are for characters with code points from 128 to 2047. The first 3 bits of the first byte are always set to 110. The first 2 bits of the second byte are always set to 10. That leaves 11 bits available for the actual character being encoded. Given below is a breakdown of how 2-byte UTF-8 encoding works for the Unicode character ©, the copyright sign. Its code point is 169, which is 000 1010 1001 in binary when padded to 11 bits.

Character: © (hex C2 A9)

Byte | Indicator | Data Bits | Binary | Hex |
---|---|---|---|---|
First | 110 | 00010 | 1100 0010 | C2 |
Second | 10 | 101001 | 1010 1001 | A9 |
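The 2-byte layout can be sketched with a couple of bit operations. This is a minimal illustration rather than a complete encoder: the function name `encode_two_bytes` and the example code points are assumptions made here, and the result is checked against Python's built-in encoder.

```python
def encode_two_bytes(code_point: int) -> bytes:
    """Encode a code point in the range 128..2047 as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("code point does not fit the 2-byte form")
    first = 0b11000000 | (code_point >> 6)         # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10 + low 6 bits
    return bytes([first, second])


# Example inputs picked for illustration: U+00A9 (copyright sign) and
# U+07FF, the largest code point that fits in two bytes.
for cp in (0x00A9, 0x07FF):
    encoded = encode_two_bytes(cp)
    assert encoded == chr(cp).encode("utf-8")  # agrees with the built-in encoder
    print(f"U+{cp:04X} -> {encoded.hex().upper()}")
```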
Similarly, characters with code points from 2,048 to 65,535 take up 3 bytes, while all remaining characters, from 65,536 to 1,114,111, take 4 bytes.
Character | Code Point | UTF-8 Bytes (Hex) | UTF-8 Bytes (Binary) | Bytes Used | Indicators |
---|---|---|---|---|---|
A | U+0041 (65) | 41 | 0100 0001 | 1 | 0 |
© | U+00A9 (169) | C2 A9 | 1100 0010 1010 1001 | 2 | 110 10 |
ओ | U+0913 (2,323) | E0 A4 93 | 1110 0000 1010 0100 1001 0011 | 3 | 1110 10 10 |
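The byte counts follow directly from the code point ranges. A small sketch, assuming a helper named `utf8_length` (a name made up for this example), maps a code point to its byte count and checks the range boundaries against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes UTF-8 uses for a given Unicode code point."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if code_point <= 0x7F:      # 0 to 127
        return 1
    if code_point <= 0x7FF:     # 128 to 2,047
        return 2
    if code_point <= 0xFFFF:    # 2,048 to 65,535
        return 3
    return 4                    # 65,536 to 1,114,111


# Check each range boundary against the built-in encoder.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

Surrogate code points (55,296 to 57,343) fall inside the 3-byte range but are not valid characters on their own; this sketch does not treat them specially.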
Advantages of UTF-8
- Works with null-terminated string functions, because no byte of a multi-byte character is zero
- Widely used, for example in HTML, JSON and XML
- Any Unicode character can be encoded without having to choose a code page
- Encoding and decoding need only simple bit operations, which keeps them fast
- Does not depend on the endianness of the computer
- Smaller in size compared to UTF-16 when the text is mostly ASCII characters, such as English
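A few of these properties can be checked with Python's built-in codecs. The sample strings below are made up for the illustration, and the checks are a sketch rather than a proof.

```python
text = "Gr\u00fc\u00dfe, \u4e16\u754c"  # "Grüße, 世界": ASCII mixed with non-ASCII

utf8 = text.encode("utf-8")

# No zero byte appears unless the text itself contains U+0000, so
# null-terminated string functions do not stop early.
assert 0 not in utf8

# Every byte of a multi-byte character has its high bit set, so ASCII bytes
# such as the comma and the space always stand for themselves.
non_ascii = "".join(ch for ch in text if ord(ch) > 127)
assert all(b >= 0x80 for b in non_ascii.encode("utf-8"))

# UTF-16 comes in two byte orders; UTF-8 has only one.
assert text.encode("utf-16-le") != text.encode("utf-16-be")

# ASCII-only text: one byte per character in UTF-8, two in UTF-16.
ascii_text = "hello"
print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-16-le")))
```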
Disadvantages of UTF-8
- Larger in size for text in languages whose characters need 3 or 4 bytes each
- Most characters in Japanese, Chinese and Korean require 3 bytes in UTF-8 compared to 2 in UTF-16
- Takes 2x the space to encode Cyrillic and Greek text compared to their dedicated single-byte encodings
- Takes 3x the space to encode Hindi and Thai text compared to their dedicated single-byte encodings
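The size differences can be measured with Python's standard codecs. In this sketch the sample words and the choice of comparison codecs (`iso8859_7`, `iso8859_5`, `tis_620`) are assumptions made for illustration. Hindi is left out because Python's standard library has no codec for its legacy single-byte encoding.

```python
# (label, sample text, codec to compare against); samples are illustrative only
samples = [
    ("Japanese", "日本語", "utf-16-le"),
    ("Greek", "αβγ", "iso8859_7"),
    ("Cyrillic", "Привет", "iso8859_5"),
    ("Thai", "ไทย", "tis_620"),
]


def byte_sizes(text: str, other_codec: str) -> tuple[int, int]:
    """Return (size in UTF-8, size in the other codec) in bytes."""
    return len(text.encode("utf-8")), len(text.encode(other_codec))


for label, text, codec in samples:
    utf8_size, other_size = byte_sizes(text, codec)
    print(f"{label}: UTF-8 {utf8_size} bytes, {codec} {other_size} bytes")
```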
It all comes down to the indicator bits that tell us how many bytes were used to encode a single character. Because binary has only two digits, these prefixes grow longer as the number of bytes increases:
- 0
- 110
- 1110
- 11110
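A decoder can read these prefixes to find out how long each character is. The following is a minimal sketch: the names `sequence_length` and `split_characters` are made up for this example, the input is assumed to be well-formed UTF-8, and not every invalid first byte (for example 0xC0 or 0xF5) is rejected.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judging by its first byte."""
    if lead_byte >> 7 == 0b0:        # 0xxxxxxx
        return 1
    if lead_byte >> 5 == 0b110:      # 110xxxxx
        return 2
    if lead_byte >> 4 == 0b1110:     # 1110xxxx
        return 3
    if lead_byte >> 3 == 0b11110:    # 11110xxx
        return 4
    raise ValueError("not a valid first byte (10xxxxxx marks a continuation byte)")


def split_characters(data: bytes) -> list[bytes]:
    """Split well-formed UTF-8 data into one byte string per character."""
    chunks, i = [], 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters("Aओ".encode("utf-8")))
```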