Jun 23, 2019

UTF-8 Encoding

UTF-8 is a character encoding that converts text in any language into bytes. It is a variable-width encoding for Unicode characters: each character is represented using one to four bytes.

Why UTF-8?

UTF-8 has become the most popular character encoding standard. About 60% of the websites on the internet use UTF-8, and if you include ASCII, which is a subset of UTF-8, that number goes up to 80%. The main reason for using UTF-8 is that it can represent text in languages other than English. It also supports emojis, which have become very popular in text messages.

How does UTF-8 work?

UTF-8 represents each Unicode character as a sequence of one to four bytes. The high-order bits of the first byte tell us how many bytes were used to encode the character.

1 Byte UTF-8 Encoding

ASCII characters, which range from 0 to 127, are represented using a single byte. All ASCII characters can be represented using 7 bits. This frees up the first bit of the byte to store 0, which tells us that the character was encoded into a single byte. For example, the capital letter A has a code point of 65. It can be represented using the 7-bit binary number 1000001, which becomes 01000001 once the leading indicator bit is added.

Character	A
Binary	`0`	`1`	`0`	`0`	`0`	`0`	`0`	`1`
Bit Utility	Indicator	Bit 1	Bit 2	Bit 3	Bit 4	Bit 5	Bit 6	Bit 7
Hex	4				1
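
The single-byte case can be checked with a short Python sketch. Python is our choice for illustration here; the variable names are made up for this example.

```python
# Encode the ASCII letter "A" and inspect the single resulting byte.
encoded = "A".encode("utf-8")

byte_count = len(encoded)            # number of bytes produced
bits = format(encoded[0], "08b")     # the byte written as 8 binary digits
indicator_bit = bits[0]              # leading bit; 0 marks a 1 byte character
hex_value = encoded.hex().upper()    # the same byte in hexadecimal

print(byte_count, bits, indicator_bit, hex_value)  # prints: 1 01000001 0 41
```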

2 Byte UTF-8 Encoding

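2 byte encodings are for characters with code points ranging from 128 to 2047. The first 3 bits of the first byte are always set to 110. The first 2 bits of the second byte are always set to 10. That leaves 11 bits (5 in the first byte and 6 in the second) available for the actual character being encoded. Given below is a breakdown of how 2 byte UTF-8 encoding happens for the Unicode character ©, the copyright sign, which has a code point of 169 (binary 00010 101001 when padded to 11 bits).

Character	©
Byte	Byte 1	Byte 2
Binary	`11000010`	`10101001`
Indicator Bits	`110`	`10`
Code Point Bits	`00010`	`101001`
Hex	C2	A9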

Similarly, characters from 2,048 to 65,535 take up 3 bytes, while all other characters, from 65,536 to 1,114,111, take 4 bytes.
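
The byte ranges above can be turned into a small encoder built from shifts and masks. This is a minimal sketch rather than a production implementation; `utf8_encode` is a hypothetical name, and the result is compared against Python's built-in encoder as a sanity check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point into UTF-8 bytes using bit operations."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point < 0x80:        # 0 to 127: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 128 to 2,047: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 2,048 to 65,535: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 65,536 and above: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the sketch with Python's own encoder for 1, 2, 3 and 4 byte characters.
for ch in ["A", "©", "ओ", "😀"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
    print(ch, utf8_encode(ord(ch)).hex())
```

Each branch places the indicator bits with a bitwise OR and fills the remaining positions with 6-bit slices of the code point, which is why no lookup tables are needed.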

Character	Code Point (Decimal)	UTF-8 Bytes (Hex)	UTF-8 Bytes (Binary)	Bytes Used	Indicators
A	65	41	0100 0001	1	0
©	169	C2A9	1100 0010 1010 1001	2	110 10
ओ	2,323	E0A493	1110 0000 1010 0100 1001 0011	3	1110 10 10
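
Rows like the ones above can be generated with Python's built-in encoder. The sketch below uses a hypothetical helper named `describe`; it reports both the code point and the encoded bytes read as one big-endian integer, since the two values differ for every multi-byte character.

```python
def describe(ch: str) -> dict:
    """Summarise how a single character is stored in UTF-8."""
    encoded = ch.encode("utf-8")
    return {
        "character": ch,
        "code_point": ord(ch),                            # Unicode code point
        "bytes_as_int": int.from_bytes(encoded, "big"),   # encoded bytes as one number
        "hex": encoded.hex().upper(),
        "binary": " ".join(format(b, "08b") for b in encoded),
        "bytes_used": len(encoded),
    }


for ch in ["A", "ओ"]:
    print(describe(ch))
```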

Advantages of UTF-8

Works with null-terminated string functions
Widely used, for example in HTML, JSON and XML
Any Unicode character can be encoded without having to choose a code page
Encoding and decoding need only simple bit operations, which keeps them fast
Does not depend on the endianness of the computer
Smaller than UTF-16 for text that uses only basic Latin (ASCII) characters
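
A few of these points can be observed directly. The following sketch assumes a short ASCII sample string of our own choosing and compares its UTF-8 bytes with the two byte orders of UTF-16.

```python
text = "Hello, world"                   # sample text, ASCII only

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")     # UTF-16, little-endian byte order
utf16_be = text.encode("utf-16-be")     # UTF-16, big-endian byte order

# Size: one byte per ASCII character in UTF-8, two in UTF-16.
print(len(utf8), len(utf16_le))         # prints: 12 24

# Endianness: UTF-16 has two possible byte orders, UTF-8 has only one.
print(utf16_le == utf16_be)             # prints: False

# Null termination: UTF-8 produces a zero byte only for the character U+0000,
# while UTF-16 pads every ASCII character with a zero byte.
print(0 in utf8, 0 in utf16_le)         # prints: False True
```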

Disadvantages of UTF-8

Larger in size for text in languages that need 3 or 4 bytes per character
Most characters in Japanese, Chinese and Korean require 3 bytes in UTF-8 compared to 2 in UTF-16
Takes 2x the space to encode Cyrillic and Greek text compared to their dedicated single-byte encodings
Takes 3x the space to encode Hindi and Thai text compared to their dedicated single-byte encodings
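
The size differences can be measured with codecs that ship with Python. This is a rough sketch: `size_ratio` is a hypothetical helper, the sample words are our own, and KOI8-R, ISO 8859-7 and TIS-620 stand in for the dedicated encodings. Hindi is left out because, as far as we know, Python's standard library does not include an ISCII codec.

```python
def size_ratio(text: str, other_codec: str) -> float:
    """Return the UTF-8 size of text divided by its size in another encoding."""
    return len(text.encode("utf-8")) / len(text.encode(other_codec))


samples = [
    ("Привет", "koi8_r"),       # Cyrillic compared with KOI8-R
    ("αβγδ", "iso8859_7"),      # Greek compared with ISO 8859-7
    ("สวัสดี", "tis_620"),        # Thai compared with TIS-620
    ("日本語", "utf-16-le"),     # Japanese compared with UTF-16
]

for text, codec in samples:
    print(codec, size_ratio(text, codec))
```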

It all comes down to the indicator bits that tell us how many bytes were used to encode a single character. Because there are only two digits in binary, these prefixes grow longer as the number of bytes increases.

0
110
1110
11110
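
The four prefixes above are enough to find character boundaries in a byte stream. Below is a minimal sketch with hypothetical function names; it trusts the input and does not validate the continuation bytes, which a real decoder would need to do.

```python
def sequence_length(lead_byte: int) -> int:
    """Return how many bytes a character occupies, based on its first byte."""
    if lead_byte >> 7 == 0b0:
        return 1
    if lead_byte >> 5 == 0b110:
        return 2
    if lead_byte >> 4 == 0b1110:
        return 3
    if lead_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a lead byte (continuation byte or invalid prefix)")


def split_characters(data: bytes) -> list:
    """Split UTF-8 data into one bytes object per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_characters("A©ओ😀".encode("utf-8")))
```

A byte that starts with 10 is a continuation byte, so `sequence_length` rejects it when it appears where a new character should begin.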

UTF-8 Tools

Given below is a list of all our tools that deal with UTF-8 Encoding.

UTF-8 Encoder

Encode any text in UTF-8