Why can't I get the right bytes of a string?

mc 6,056 Reputation points
2025-10-22T12:49:57.5133333+00:00

Why does System.Text.Encoding.Unicode.GetBytes("x") return only two bytes?

And why are Chinese, Japanese, and Korean characters all in the range 4E00-9FFF? Those are three different languages, right?

Developer technologies | .NET | .NET MAUI

2 answers

  1. Tony Dinh (WICLOUD CORPORATION) 3,585 Reputation points Microsoft External Staff
    2025-10-23T03:34:51.2+00:00

    Hello @mc !

    Why does System.Text.Encoding.Unicode.GetBytes("x") return only 2 bytes?

    In .NET, Encoding.Unicode is UTF-16 little-endian. UTF-16 encodes every character in the Basic Multilingual Plane (BMP), including "x", as a single 16-bit code unit. "x" is code point U+0078, so in UTF-16 LE it becomes:

    0x78 0x00 → two bytes
    

    That’s why you only see 2 bytes. A character outside the Basic Multilingual Plane (BMP), such as the emoji "😀" (U+1F600), is encoded in UTF-16 as a surrogate pair, which takes 4 bytes. So the number of bytes depends on both the character’s code point and the encoding. Reference: https://free.blessedness.top/en-us/dotnet/api/system.text.encoding.unicode?view=net-9.0
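As a quick way to check the byte counts described above, here is a minimal sketch written as a C# console program with top-level statements. The variable names are only illustrative, and the emoji is built from its code point so the example does not depend on how the source file is saved.

```csharp
using System;
using System.Text;

// "x" is U+0078, inside the BMP: one UTF-16 code unit, so two bytes.
byte[] xUtf16 = Encoding.Unicode.GetBytes("x");
Console.WriteLine(BitConverter.ToString(xUtf16));      // 78-00

// U+1F600 is outside the BMP: UTF-16 needs a surrogate pair, so four bytes.
string emoji = char.ConvertFromUtf32(0x1F600);
byte[] emojiUtf16 = Encoding.Unicode.GetBytes(emoji);
Console.WriteLine(BitConverter.ToString(emojiUtf16));  // 3D-D8-00-DE

// string.Length counts UTF-16 code units, not characters.
Console.WriteLine(emoji.Length);                       // 2

// The same text has a different size under other encodings.
int xUtf8Count = Encoding.UTF8.GetByteCount("x");        // 1
int emojiUtf8Count = Encoding.UTF8.GetByteCount(emoji);  // 4
int xUtf32Count = Encoding.UTF32.GetByteCount("x");      // 4
Console.WriteLine($"{xUtf8Count} {emojiUtf8Count} {xUtf32Count}");
```

If the goal is the smallest output for mostly-ASCII text, Encoding.UTF8 is usually the better fit. Encoding.Unicode always spends at least two bytes per character.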

    Why are Chinese, Japanese, and Korean characters all in U+4E00–U+9FFF?

    That block is called CJK Unified Ideographs in Unicode. Unicode designers noticed that Chinese, Japanese, and Korean share many Han characters (called Hanzi in Chinese, Kanji in Japanese, Hanja in Korean). Instead of duplicating the same character three times, Unicode unified them into one code point. Refer: https://www.unicodepedia.com/groups/cjk-unified-ideographs.

    • For example:
      • The character 木 (U+6728) means “tree” in Chinese, Japanese, and Korean.
      • It’s encoded once, but each language may pronounce it differently.
    • Unification saves code space and keeps text compatible across systems. It also led to the well-known “Han unification” controversy, because some characters are drawn slightly differently in each language. Unicode leaves that difference to rendering: the same code point can look different depending on the font and locale.
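To see the single shared code point in practice, here is a small sketch (C# top-level statements, illustrative variable names). It checks that U+6728 falls inside the 4E00-9FFF range and, being in the BMP, still takes only two bytes in UTF-16.

```csharp
using System;
using System.Text;

// U+6728, the "tree" character. It is the same code point whether the
// surrounding text is Chinese, Japanese, or Korean.
string tree = "\u6728";

int codePoint = char.ConvertToUtf32(tree, 0);
bool inCjkUnifiedBlock = codePoint >= 0x4E00 && codePoint <= 0x9FFF;
byte[] treeUtf16 = Encoding.Unicode.GetBytes(tree);

Console.WriteLine($"U+{codePoint:X4}");                 // U+6728
Console.WriteLine(inCjkUnifiedBlock);                   // True
Console.WriteLine(BitConverter.ToString(treeUtf16));    // 28-67
```

Nothing in the bytes records which language the character belongs to. That information has to come from elsewhere, such as the font or a language tag on the text.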

    I hope this answers your questions! Let me know if you have any others. If this helped, kindly mark it as the final answer to your question!

    1 person found this answer helpful.

  2. Starry Night 600 Reputation points
    2025-10-23T02:17:00.9333333+00:00

    Why are Chinese, Japanese, and Korean characters all in the range 4E00-9FFF? Those are three different languages, right?

    Do you mean the CJK Unified Ideographs? CJK Unified Ideographs is a block of the Unicode standard that contains the most commonly used Han characters for writing Chinese, Japanese, and Korean. It is one of the largest blocks in the standard, with over 20,000 characters, and it covers both simplified and traditional forms. It does not cover everything these languages need: Japanese kana and Korean Hangul have their own blocks, and rarer ideographs live in other blocks such as CJK Unified Ideographs Extension A and B. Text processing and display applications normally support all of these together.

    By the way, may I ask what functionality you want to achieve?
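If you need to tell these blocks apart in code, a rough sketch could look like the following. CjkBlockName is a hypothetical helper written for this example, not a .NET API, and it only knows the three ranges listed in its body. Extension B lies outside the BMP, so its characters take four bytes in Encoding.Unicode.

```csharp
using System;
using System.Text;

// Hypothetical helper: maps a code point to one of three CJK blocks.
// Only these three ranges are handled; everything else is "other".
static string CjkBlockName(int cp) => cp switch
{
    >= 0x4E00 and <= 0x9FFF => "CJK Unified Ideographs",
    >= 0x3400 and <= 0x4DBF => "CJK Unified Ideographs Extension A",
    >= 0x20000 and <= 0x2A6DF => "CJK Unified Ideographs Extension B",
    _ => "other"
};

int[] samples = { 0x6728, 0x3400, 0x20000, 0x0078 };
foreach (int cp in samples)
{
    string text = char.ConvertFromUtf32(cp);
    int byteCount = Encoding.Unicode.GetByteCount(text);
    Console.WriteLine($"U+{cp:X4}: {CjkBlockName(cp)}, {byteCount} bytes in UTF-16");
}

// Byte counts for the two ends of the spectrum, kept for reference.
int mainBlockBytes = Encoding.Unicode.GetByteCount(char.ConvertFromUtf32(0x6728));  // 2
int extensionBBytes = Encoding.Unicode.GetByteCount(char.ConvertFromUtf32(0x20000)); // 4
```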

