Have you ever thought about how to design a multi-language website using UTF-8?

fordem · May 31, 2025

Multi-language design serves several purposes, inclusion and reach being the most important. In this short tutorial I'm going to explain what I learned from ChatGPT about UTF-8 validation: how to accept multibyte UTF-8 characters and treat the input as well-formed once every byte sequence has been verified. After that, other validation methods can be used to catch errors specific to your business rules.

From here on, everything came from ChatGPT. I'm not a liar, and I don't like taking credit for something I didn't do. So here we go.

Let's go step by step and rigorously understand how UTF-8 byte sequences work, and why the hex constants (like 0xC0, 0xE0, etc.) are used in UTF-8 validation.

🧠 What is UTF-8?

UTF-8 is a variable-length encoding for Unicode characters using 1 to 4 bytes per character.

Each byte starts with a specific bit pattern that tells you what kind of byte it is:

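| Byte type | Format (binary) | Range (hex) | Meaning |
|---|---|---|---|
| 1-byte (ASCII) | `0xxxxxxx` | `0x00`–`0x7F` | Standard ASCII (U+0000 to U+007F) |
| 2-byte lead | `110xxxxx` | `0xC0`–`0xDF` | Start of 2-byte sequence |
| 3-byte lead | `1110xxxx` | `0xE0`–`0xEF` | Start of 3-byte sequence |
| 4-byte lead | `11110xxx` | `0xF0`–`0xF7` | Start of 4-byte sequence |
| Continuation | `10xxxxxx` | `0x80`–`0xBF` | Must follow a multi-byte lead |

These ranges follow from the bit patterns alone. The UTF-8 standard additionally rules out 0xC0, 0xC1, and 0xF5–0xF7 as lead bytes, because they can only produce overlong or out-of-range encodings.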

⚠️ Every byte after the lead must match 10xxxxxx, i.e., it must be in range 0x80–0xBF.
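To make the table concrete, here is a minimal sketch in C that classifies a single byte by its leading bits. The function name `utf8_sequence_length` and its return convention are my own assumptions for illustration, not part of any library.

```c
#include <stdint.h>

/* Hypothetical helper: how many bytes a sequence starting with `b` claims
   to occupy, judged only by its leading bits.
   Returns 0 for a continuation byte (10xxxxxx) and -1 for a byte that
   matches none of the patterns in the table (11111xxx). */
int utf8_sequence_length(uint8_t b)
{
    if ((b & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII            */
    if ((b & 0xC0) == 0x80) return 0;  /* 10xxxxxx: continuation     */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte lead      */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte lead      */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte lead      */
    return -1;                         /* 11111xxx: no valid pattern */
}
```

Each test uses a mask that keeps one more leading bit than the previous one, which is why the order of the checks does not matter here.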

📦 How the Validator Works

Let’s walk through what each part of the validator does:

🔹 Case 1: 1-byte (ASCII)

if (bytes[i] <= 0x7F)

If the byte is 0x00–0x7F, it's a valid ASCII char — move to next byte.

🔹 Case 2: 2-byte sequence

else if ((bytes[i] & 0xE0) == 0xC0)

⚙️ Breakdown:

bytes[i] & 0xE0 keeps only the top 3 bits (the mask is 11100000)
We're testing for 110xxxxx, the lead byte of a 2-byte character
0xC0 = 11000000
So if (bytes[i] & 0xE0) == 0xC0, we know it's a 2-byte lead. The parentheses matter: in C-family languages == binds tighter than &.

✅ Then check:

(bytes[i + 1] & 0xC0) == 0x80

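This ensures that the next byte is a continuation byte (10xxxxxx), which must be true for valid UTF-8. Before reading bytes[i + 1], the validator also has to confirm that i + 1 is still inside the buffer; a lead byte at the very end of the input is a truncated sequence and therefore invalid.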

🔹 Case 3: 3-byte sequence

else if ((bytes[i] & 0xF0) == 0xE0)

⚙️ Breakdown:

Mask top 4 bits: 11110000
0xE0 = 11100000 → matching 1110xxxx, the lead for a 3-byte sequence

We check that the next 2 bytes are valid continuation bytes:

(bytes[i+1] & 0xC0) == 0x80 &&
(bytes[i+2] & 0xC0) == 0x80

🔹 Case 4: 4-byte sequence

else if ((bytes[i] & 0xF8) == 0xF0)

⚙️ Breakdown:

Mask top 5 bits: 11111000
0xF0 = 11110000 → matches 11110xxx, the lead for a 4-byte UTF-8 char

Check next 3 continuation bytes:

(bytes[i+1] & 0xC0) == 0x80 &&
(bytes[i+2] & 0xC0) == 0x80 &&
(bytes[i+3] & 0xC0) == 0x80

❌ Default: Invalid Byte

else {
    return false;
}

If the byte matches none of the patterns above (for example a stray continuation byte 10xxxxxx with no lead before it, or a byte of the form 11111xxx), the sequence is invalid.
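Putting the four cases and the default together, the whole validator could look roughly like the sketch below. The function names and the `(bytes, len)` signature are assumptions of mine, and I added an explicit bounds check so a truncated sequence at the end of the buffer is rejected instead of read past.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if p[0..n-1] are all continuation bytes (10xxxxxx). */
static bool all_continuations(const uint8_t *p, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        if ((p[k] & 0xC0) != 0x80) return false;
    }
    return true;
}

/* Structural check only: lead bytes must be followed by the right
   number of continuation bytes. Hypothetical name and signature. */
bool is_structurally_valid_utf8(const uint8_t *bytes, size_t len)
{
    size_t i = 0;
    while (i < len) {
        size_t extra;  /* continuation bytes expected after the lead */
        if (bytes[i] <= 0x7F)                extra = 0;  /* Case 1 */
        else if ((bytes[i] & 0xE0) == 0xC0)  extra = 1;  /* Case 2 */
        else if ((bytes[i] & 0xF0) == 0xE0)  extra = 2;  /* Case 3 */
        else if ((bytes[i] & 0xF8) == 0xF0)  extra = 3;  /* Case 4 */
        else return false;  /* stray continuation byte or 11111xxx */

        /* Bounds check: i < len here, so len - i - 1 cannot underflow. */
        if (extra > len - i - 1) return false;
        if (!all_continuations(bytes + i + 1, extra)) return false;
        i += extra + 1;
    }
    return true;
}
```

Keep in mind that this only checks structure. For instance, the overlong pair C0 80 passes it, so a stricter check on the decoded value is still needed if you want full conformance.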

🧪 Example: How It Validates `"日本"` (`U+65E5 U+672C`)

Encoded in UTF-8: E6 97 A5 E6 9C AC

E6 → 11100110 = 3-byte lead
97 → 10010111 = continuation
A5 → 10100101 = continuation
✅ Pass

Then:

E6 → 3-byte lead
9C → continuation
AC → continuation
✅ Pass
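To see why these six bytes really are U+65E5 and U+672C, you can strip the marker bits and glue the payload bits together. The helper below is a small sketch with a made-up name (`utf8_decode3`); it assumes the three bytes already passed the structural check.

```c
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a 3-byte sequence.
   Assumes b0 is a 1110xxxx lead and b1, b2 are 10xxxxxx continuations. */
uint32_t utf8_decode3(uint8_t b0, uint8_t b1, uint8_t b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)  /* 4 payload bits of the lead   */
         | ((uint32_t)(b1 & 0x3F) << 6)   /* 6 payload bits, continuation */
         |  (uint32_t)(b2 & 0x3F);        /* 6 payload bits, continuation */
}
```

For E6 97 A5 the payloads are 0x6, 0x17, and 0x25, so the result is 0x6000 + 0x5C0 + 0x25 = 0x65E5, which is 日. The second group E6 9C AC gives 0x672C, which is 本.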

✅ Final Takeaway

The hex masks like 0xE0, 0xF0, 0xC0, etc., are used to extract the bit patterns that define UTF-8 byte types. The bitwise & tests identify which type of byte it is by focusing on the leading bits that encode structure.
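The masks only tell you the structure of a sequence. A structurally correct sequence can still be rejected by the UTF-8 standard if it is overlong (more bytes than the value needs), encodes a surrogate (U+D800 to U+DFFF), or goes above U+10FFFF. A rough sketch of that extra check, run on the decoded code point, might look like this; the name `utf8_scalar_ok` and its parameters are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical follow-up check. `cp` is the decoded code point and
   `len` is the number of bytes it was encoded in (1 to 4). */
bool utf8_scalar_ok(uint32_t cp, int len)
{
    /* Smallest code point that legitimately needs `len` bytes. */
    static const uint32_t min_for_len[5] = {0, 0x0, 0x80, 0x800, 0x10000};

    if (len < 1 || len > 4) return false;
    if (cp < min_for_len[len]) return false;         /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate range   */
    if (cp > 0x10FFFF) return false;                 /* beyond Unicode    */
    return true;
}
```

For example, the pair C0 80 decodes to 0 in two bytes, and since 0 is below 0x80 it is rejected as overlong.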

