Hello @mc!
Why does System.Text.Encoding.Unicode.GetBytes("x") return only 2 bytes?
In .NET, Encoding.Unicode means UTF-16 little-endian encoding. UTF-16 represents every character in the Basic Multilingual Plane (BMP), including "x", as a single 16-bit code unit. "x" has the Unicode code point U+0078. In UTF-16 LE, with the low byte first, that becomes:
0x78 0x00 → two bytes
That’s why you only see 2 bytes. If you tried a character outside the BMP (e.g., an emoji like "😀", U+1F600), UTF-16 would use a surrogate pair of two code units → 4 bytes. So, the number of bytes depends on the character’s code point and the encoding scheme. Reference: https://free.blessedness.top/en-us/dotnet/api/system.text.encoding.unicode?view=net-9.0
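To check the byte counts yourself, here is a minimal sketch as a console program using top-level statements. The three sample strings are my own choice for illustration, written with escapes so the source file's encoding doesn't matter.

```csharp
using System;
using System.Text;

// Encoding.Unicode is UTF-16 little-endian in .NET.
// Sample characters chosen for illustration:
//   U+0078 "x" (BMP), U+6728 "木" (BMP), U+1F600 "😀" (outside the BMP).
string[] samples = { "\u0078", "\u6728", "\U0001F600" };

foreach (string s in samples)
{
    byte[] bytes = Encoding.Unicode.GetBytes(s);
    int codePoint = char.ConvertToUtf32(s, 0);

    // s.Length counts UTF-16 code units, so a surrogate pair reports 2.
    Console.WriteLine(
        $"U+{codePoint:X4}: {s.Length} code unit(s), {bytes.Length} byte(s): {BitConverter.ToString(bytes)}");
}

// Prints:
// U+0078: 1 code unit(s), 2 byte(s): 78-00
// U+6728: 1 code unit(s), 2 byte(s): 28-67
// U+1F600: 2 code unit(s), 4 byte(s): 3D-D8-00-DE
```

Note that the low byte comes first in each 16-bit unit, which is the "little-endian" part.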
Why are Chinese, Japanese, and Korean characters all in U+4E00–U+9FFF?
That block is called CJK Unified Ideographs in Unicode. It holds the Han ideographs only; Japanese kana and Korean Hangul have their own separate blocks. Unicode designers noticed that Chinese, Japanese, and Korean share many Han characters (called Hanzi in Chinese, Kanji in Japanese, Hanja in Korean). Instead of encoding the same character three times, Unicode assigned each shared character a single code point. Reference: https://www.unicodepedia.com/groups/cjk-unified-ideographs
For example:
- The character 木 (U+6728) means “tree” in Chinese, Japanese, and Korean.
- It’s encoded once, but each language may pronounce it differently.

The reason for unifying them was to avoid duplicate encodings and ensure compatibility across systems. But this also led to the famous “Han unification” controversy, since some characters look slightly different in each language. Unicode leaves that difference to font rendering: the same code point can look different depending on the locale/font.
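As a small sketch of the point above, the following top-level program looks up the code point of 木 and checks that it falls in the U+4E00–U+9FFF block. `IsCjkUnifiedIdeograph` is a hypothetical helper I wrote for this example, not a .NET API, and it checks only this one block.

```csharp
using System;
using System.Text;

// Hypothetical helper (not part of .NET): true if the code point lies in
// the CJK Unified Ideographs block, U+4E00 to U+9FFF.
static bool IsCjkUnifiedIdeograph(int codePoint) =>
    codePoint >= 0x4E00 && codePoint <= 0x9FFF;

string tree = "\u6728"; // 木
int cp = char.ConvertToUtf32(tree, 0);

Console.WriteLine($"U+{cp:X4}");                            // U+6728
Console.WriteLine(IsCjkUnifiedIdeograph(cp));               // True
Console.WriteLine(Encoding.Unicode.GetBytes(tree).Length);  // 2
```

The code point is the same whichever language the text is in; only the font chosen at display time decides which regional glyph shape you see.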
I hope this answers your questions! Let me know if you have any others. If this helped, kindly mark it as the final answer to your question!