Why can't I get the right bytes of a string?

Question

mc 6,056

System.Text.Encoding.Unicode.GetBytes("x") returns only two bytes. Why?

And why are Chinese, Japanese, and Korean all in 4E00-9FFF? Those are three languages, right?

2 answers

Answer 1

Tony Dinh (WICLOUD CORPORATION) 3,585 Microsoft External Staff

Hello @mc !

Why does System.Text.Encoding.Unicode.GetBytes("x") return only 2 bytes?

In .NET, Encoding.Unicode means UTF-16 little-endian encoding. UTF-16 represents every character in the Basic Multilingual Plane (including "x") as a single 16-bit code unit. "x" has the Unicode code point U+0078. In UTF-16 LE, that becomes:

0x78 0x00 → two bytes

That's why you see only 2 bytes. A character outside the Basic Multilingual Plane (BMP), such as the emoji "😀" (U+1F600), needs a surrogate pair in UTF-16, which is 4 bytes. So the number of bytes depends on the character's code point and the encoding scheme. Refer: https://free.blessedness.top/en-us/dotnet/api/system.text.encoding.unicode?view=net-9.0
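To see the byte counts described above, here is a minimal sketch. The class name `Utf16ByteCounts` and the helper `Utf16Hex` are made up for illustration; only `Encoding.Unicode`, `Encoding.UTF8`, and `BitConverter.ToString` come from .NET.

```csharp
using System;
using System.Text;

public static class Utf16ByteCounts
{
    // Illustrative helper: UTF-16 LE bytes of s as a hex string.
    public static string Utf16Hex(string s) =>
        BitConverter.ToString(Encoding.Unicode.GetBytes(s));

    public static void Main()
    {
        // U+0078 fits in one 16-bit code unit, so 2 bytes.
        Console.WriteLine(Utf16Hex("x"));                    // 78-00

        // U+1F600 needs the surrogate pair D83D DE00, so 4 bytes.
        Console.WriteLine(Utf16Hex("\U0001F600"));           // 3D-D8-00-DE

        // The same "x" is 1 byte in UTF-8: the count depends on the encoding.
        Console.WriteLine(Encoding.UTF8.GetByteCount("x"));  // 1
    }
}
```

Note that `GetBytes` does not prepend a byte order mark, which is why "x" gives exactly two bytes.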

Why are Chinese, Japanese, and Korean characters all in U+4E00–U+9FFF?

That block is called CJK Unified Ideographs in Unicode. Unicode designers noticed that Chinese, Japanese, and Korean share many Han characters (called Hanzi in Chinese, Kanji in Japanese, Hanja in Korean). Instead of duplicating the same character three times, Unicode unified them into one code point. Refer: https://www.unicodepedia.com/groups/cjk-unified-ideographs.

For example:
- The character 木 (U+6728) means “tree” in Chinese, Japanese, and Korean.
- It’s encoded once, but each language may pronounce it differently.

Unification was meant to save code space and ensure compatibility across systems. It also led to the well-known “Han unification” controversy, since some characters look slightly different in each language. Unicode leaves that difference to font rendering: the same code point can look different depending on the locale or font.
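As a small illustration of the point that 木 is stored once, the sketch below prints its code point and its UTF-16 LE bytes. Nothing in the stored value says which language it belongs to. The class name `HanUnificationDemo` and the helper `CodePoint` are hypothetical names used only for this example.

```csharp
using System;
using System.Text;

public static class HanUnificationDemo
{
    // Illustrative helper: formats the first code point of s as "U+XXXX".
    public static string CodePoint(string s) =>
        $"U+{char.ConvertToUtf32(s, 0):X4}";

    public static void Main()
    {
        string tree = "\u6728"; // the character 木

        // One code point, shared by Chinese, Japanese, and Korean text.
        Console.WriteLine(CodePoint(tree)); // U+6728

        // The bytes are the same in every language; there is no language tag.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(tree))); // 28-67
    }
}
```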

I hope this answers your questions! Let me know if you have more. If this answer helped, kindly mark it as the final answer to your question.

mc 6,056 Reputation points

2025-10-24T01:43:18.8733333+00:00

There may be differences between Chinese, Korean, and Japanese. How can I tell whether a code in 4E00-9FCB is Korean, Chinese, or Japanese?

mc 6,056 Reputation points

2025-10-24T01:47:13.83+00:00

Also, some Korean characters cannot be found in 4E00-9FCB.

Tony Dinh (WICLOUD CORPORATION) 3,585 Reputation points Microsoft External Staff

2025-10-24T02:48:31.93+00:00

Hello @mc !

You cannot tell from the Unicode code point alone whether a CJK Unified Ideograph (U+4E00–U+9FFF) is being used as Chinese, Japanese, or Korean. The block is shared because of Han unification. The actual language depends on context (surrounding text, locale, or font). Also, not all Korean characters are in that block—modern Korean is written with Hangul (U+AC00–U+D7AF), while Hanja (Chinese-derived characters) are the ones that appear in the CJK ranges.

Unicode does not store “this is Chinese” or “this is Japanese.” It only stores the abstract character.

Language detection must be done by:

- Surrounding text (if the sentence has Hiragana, it’s Japanese; if it has Hangul, it’s Korean).
- Locale or font (operating system or app chooses a font that renders the glyph in the style of the target language).
- Statistical language models (software like NLP libraries can guess based on word frequency and context).
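The "surrounding text" approach can be sketched roughly as below. This is a crude heuristic under stated assumptions, not a real language detector: the class `CjkScriptGuess`, the method `Guess`, and its return strings are invented for this example, and it only checks three block ranges (Hiragana and Katakana U+3040–U+30FF, Hangul Syllables U+AC00–U+D7AF, CJK Unified Ideographs U+4E00–U+9FFF).

```csharp
using System;

public static class CjkScriptGuess
{
    // Hypothetical helper: guesses a language from the scripts found in the text.
    public static string Guess(string text)
    {
        bool hasKana = false, hasHangul = false, hasHan = false;

        foreach (char c in text)
        {
            if (c >= '\u3040' && c <= '\u30FF') hasKana = true;        // Hiragana, Katakana
            else if (c >= '\uAC00' && c <= '\uD7AF') hasHangul = true; // Hangul Syllables
            else if (c >= '\u4E00' && c <= '\u9FFF') hasHan = true;    // CJK Unified Ideographs
        }

        if (hasKana) return "Japanese";
        if (hasHangul) return "Korean";
        // Han characters alone do not identify the language.
        if (hasHan) return "Han (ambiguous)";
        return "none";
    }

    public static void Main()
    {
        Console.WriteLine(Guess("日本語です")); // Japanese
        Console.WriteLine(Guess("한국어"));     // Korean
        Console.WriteLine(Guess("木"));         // Han (ambiguous)
    }
}
```

Text made only of Han characters stays ambiguous here, which matches the point above: the code point itself does not carry the language. Mixed text is resolved by a fixed priority (kana first), which is an arbitrary choice for this sketch.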

Why some Korean characters are not in U+4E00–U+9FFF

Modern Korean writing uses Hangul syllables (U+AC00–U+D7AF). These are unique to Korean and not part of the CJK Unified Ideographs block. Hanja (Chinese characters used in Korean) are encoded in the CJK blocks, shared with Chinese and Japanese; Korean uses only a subset of them.

So, if you’re looking for a Korean word written in Hangul, you won’t find it in U+4E00–U+9FFF because that block is only for Han characters.

Hope this helps!

Answer 2

Why are Chinese, Japanese, and Korean all in 4E00-9FFF? Those are three languages, right?

Do you mean the CJK Unified Ideographs? As described on this website, CJK Unified Ideographs is a block of the Unicode standard that contains the most commonly used characters for writing Chinese, Japanese, and Korean. It is one of the largest blocks of the Unicode standard in terms of the number of characters it contains, with over 20,000 characters. This block includes the characters needed to write the majority of Chinese, Japanese, and Korean text, including both simplified and traditional characters. It is often used in conjunction with other blocks, such as CJK Unified Ideographs Extension A and B, to fully support these languages in text processing and display applications.

By the way, may I ask, what functionality do you want to achieve?
