What is Unicode?
The previous Hangul code was the completion type. The completion type allocates a 2-byte code and assigns one complete character to each code value. In the completion-type code, only the 2,350 characters used most frequently in daily life are assigned codes, which made it problematic as a standard. Since the Windows 95 operating system supported only the completion type, programs running on Windows 95 could use only those 2,350 Hangul characters. In fact, 99% of the characters used today fall within these 2,350. However, Hangul can express 11,172 characters, so the completion type can display only about 1/5 of all Hangul. Letters outside the set could not be written properly, which was also a limitation of Windows 95. For this reason, the word processor '아래 한글' has been favored by many people for its ability to express all Hangul. Unicode was later revised so that most Hangul characters can be used. Therefore, if you have Windows 98 installed, you can freely enter Hangul in a program such as Notepad or Word. In fact, '아래 한글' uses a 3-byte code system internally, so it has the advantage of handling old Korean or extended Chinese characters more freely than Unicode.
Unicode is the name of an international code convention proposed to represent the world's languages in a unified way. It is a standard code that extends ASCII to express all the characters in the world. For English or Latin-script text, the 256-character limit of an 8-bit code is no problem, but it is too small for scripts such as Korean, Japanese, Chinese, and Arabic. In English-speaking countries Unicode requires twice as much space as ASCII, which is wasteful in ordinary communication, but its ability to handle the characters of every country is an advantage. Unicode 2.0, adopted by the Unicode Technical Committee (UTC), encodes the 11,172 Hangul characters from '가' through '힣' in '가나다라' order in one continuous range. Microsoft made considerable internal efforts to support all Hangul combinations (KSC5601-1992). However, compatibility with existing programs and data, which supported only the completion type, was always a serious problem.
Limitations of Extended Completion Code
Before looking at the limitations of the extended completion code, you need to know why it was created; only then are its limitations and problems easy to understand. The main reason for establishing the extended completion type was to process the 8,822 modern Hangul characters that the KS completion code could not handle, and its guiding principle was to maintain compatibility with programs and data that supported only the existing completion type. The extended completion type was criticized because, for the sake of compatibility, its code assignment ignores Korean alphabetical order. The existing 2,350 KS completion characters are arranged in '가나다' order, but the remaining characters are placed in front of them, so the order of Hangul across the whole code is tangled. This has a serious effect on sorting and searching.
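To make the sorting problem concrete, here is a minimal sketch in C. It assumes five sample syllables (가, 각, 갂, 갃, 간) with code values as I understand the Windows extended completion mapping; `HangulPair` and `sort_pairs` are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* One syllable with its code in both systems (sample values only). */
typedef struct {
    unsigned short ks;   /* extended completion code (2 bytes) */
    unsigned short uni;  /* Unicode code point */
} HangulPair;

/* qsort comparators: ascending by one of the two codes. */
static int by_ks(const void *a, const void *b) {
    const HangulPair *x = a;
    const HangulPair *y = b;
    return (x->ks > y->ks) - (x->ks < y->ks);
}

static int by_uni(const void *a, const void *b) {
    const HangulPair *x = a;
    const HangulPair *y = b;
    return (x->uni > y->uni) - (x->uni < y->uni);
}

/* Sort n pairs in place by Unicode (use_unicode != 0) or by KS code. */
void sort_pairs(HangulPair *p, size_t n, int use_unicode) {
    assert(p != NULL);
    qsort(p, n, sizeof *p, use_unicode ? by_uni : by_ks);
}
```

Given the pairs {0xb0a1, 0xac00}, {0xb0a2, 0xac01}, {0x8141, 0xac02}, {0x8142, 0xac03}, {0xb0a3, 0xac04}, sorting by the extended completion code puts 갂 and 갃 ahead of 가, while sorting by Unicode gives 가, 각, 갂, 갃, 간.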
Assignment of Hangul code in Unicode
Since Unicode is so important, let's look at the background of its appearance and examine the characteristics of Hangul code in Unicode. Unicode is not an international standards organization. It was established in 1989 as a consortium, mainly of companies in the computer industry, to create a code system for processing multilingual text efficiently. Unicode 1.0 was released at the end of 1991. Because its Hangul code included only the KSC5601-1987 completion code, it was not welcomed in Korea. Moreover, Unicode 1.1 accepted combination code only partially, so even in Unicode 1.1 Hangul still could not be fully used.
An important reason why Hangul was not assigned as a combination type at the time was that Korean combination codes were rejected for requiring too much code space compared with other countries' scripts. Until then Unicode had not been widely used around the world, and in 1995 Unicode 2.0 was established. In Unicode 2.0, Hangul was assigned to two areas. The first holds the 11,172 modern Hangul characters; it is called completion type code because, as in the completion type, codes are assigned to completed syllables. Unlike the KSC5601 completion code, however, it follows a regular combination rule. The second is a combination type code in which onset, nucleus, and coda are assigned as individual elements. It is an N-byte format whose length depends on the number of elements, rather than a fixed length like the 2-byte commercial combination type or the KSC5601-1992 standard combination type. Unicode 2.1 has now been released as the latest version, and as soon as Unicode 2.0 is released in Korea, it will be adopted as the national standard code under the name KSC5700.
Code Conversion
Code conversion here means conversion among three code types. As mentioned above, there are several kinds of codes, but the only ones in actual use are the KS completion type, the combination type (commercial and standard), and Unicode, so I will introduce the conversion process for each. In this article, the combination type means the commercial combination type, and the KS completion type means the extended completion type used since Windows 95. For Unicode, the conversion covers the 1-byte ASCII characters that are compatible with other codes and the 11,172 completion-type modern Hangul characters. Rather than explaining from basic principles, please refer to the source code below to learn the specific conversion process.
KS completion form <-> Unicode
On the Unicode side, the conversion covers the compatible 1-byte characters and the Uni-completion part. Unicode, Inc. provides the information necessary for conversion through the Internet and books.
┌------┬-----------------------------┐
│ Char │ KS completion <---> Unicode │
├------┼-----------------------------┤
│ 가   │ 0xb0a1        <---> 0xac00  │
│ 각   │ 0xb0a2        <---> 0xac01  │
│ 갂   │ 0x8141        <---> 0xac02  │
│ 갃   │ 0x8142        <---> 0xac03  │
│ 간   │ 0xb0a3        <---> 0xac04  │
│ ...  │ ...                         │
│ 힠   │ 0xc64f        <---> 0xd7a0  │
│ 힡   │ 0xc650        <---> 0xd7a1  │
│ 힢   │ 0xc651        <---> 0xd7a2  │
│ 힣   │ 0xc652        <---> 0xd7a3  │
└------┴-----------------------------┘
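Because the extended completion code has no arithmetic relationship to Unicode, conversion of this kind is typically done by table lookup. The following is a minimal sketch in C, assuming a tiny nine-row excerpt of the mapping rather than the full published data. `KsUniEntry`, `ks_to_unicode`, and `unicode_to_ks` are hypothetical names, and returning 0 for "not found" is a choice made only for this example.

```c
#include <stddef.h>

/* One row of the mapping: extended completion code and Unicode code point. */
typedef struct {
    unsigned short ks;
    unsigned short uni;
} KsUniEntry;

/* Excerpt only; a real converter would load the complete mapping. */
static const KsUniEntry kTable[] = {
    {0xb0a1, 0xac00}, {0xb0a2, 0xac01}, {0x8141, 0xac02},
    {0x8142, 0xac03}, {0xb0a3, 0xac04}, {0xc64f, 0xd7a0},
    {0xc650, 0xd7a1}, {0xc651, 0xd7a2}, {0xc652, 0xd7a3}
};
#define TABLE_LEN (sizeof kTable / sizeof kTable[0])

/* KS code -> Unicode. 1-byte ASCII passes through; 0 means "not in table". */
unsigned short ks_to_unicode(unsigned short ks) {
    size_t i;
    if (ks < 0x80) return ks;
    for (i = 0; i < TABLE_LEN; i++)
        if (kTable[i].ks == ks) return kTable[i].uni;
    return 0;
}

/* Unicode -> KS code, with the same conventions as above. */
unsigned short unicode_to_ks(unsigned short uni) {
    size_t i;
    if (uni < 0x80) return uni;
    for (i = 0; i < TABLE_LEN; i++)
        if (kTable[i].uni == uni) return kTable[i].ks;
    return 0;
}
```

A linear search is enough for a nine-row excerpt; with the full table, two arrays sorted by each key and a binary search would be the more usual design.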
Combination form <-> Unicode
On the Unicode side, only 1-byte ASCII characters and Uni-completion code are handled. Unlike KS completion code, Uni-completion code is fundamentally built on grapheme information, so the onset, nucleus, and coda can be obtained as easily as from a combination code, and both composing and converting are very simple. Note, however, that incomplete syllable characters used in the combination type cannot be converted, and that after the onset, nucleus, and coda are obtained, they must be synthesized according to the construction rule of each code.
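The whole path from a combination code to Unicode can be sketched in C as follows. This is an illustration under stated assumptions, not a reference implementation: the three lookup tables reflect my understanding of the commercial combination layout (5-bit element values with a filler value and some unused slots) and should be checked against the actual code chart, and `kssm_to_unicode` is a hypothetical name. The function returns 0 for incomplete syllables, which have no Uni-completion counterpart.

```c
#include <stdint.h>

/* 5-bit combination value -> zero-based Unicode position, or -1 when the
   value is a filler or unused. ASSUMED layout; verify against the chart. */
static const int8_t kCho[32] = {
    -1, -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13,
    14, 15, 16, 17, 18, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
};
static const int8_t kJung[32] = {
    -1, -1, -1,  0,  1,  2,  3,  4, -1, -1,  5,  6,  7,  8,  9, 10,
    -1, -1, 11, 12, 13, 14, 15, 16, -1, -1, 17, 18, 19, 20, -1, -1
};
/* For the coda, the filler (value 1) is valid and means "no coda" (0). */
static const int8_t kJong[32] = {
    -1,  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14,
    15, 16, -1, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, -1, -1
};

/* Returns the Unicode code point (0xac00..0xd7a3), or 0 if kssm is not a
   complete modern syllable (missing MSB, filler onset/nucleus, unused slot). */
uint16_t kssm_to_unicode(uint16_t kssm) {
    int cho, jung, jong;
    if (!(kssm & 0x8000)) return 0;
    cho  = kCho[(kssm >> 10) & 0x1f];
    jung = kJung[(kssm >> 5) & 0x1f];
    jong = kJong[kssm & 0x1f];
    if (cho < 0 || jung < 0 || jong < 0) return 0;
    /* 21 nuclei and 28 codas (including "no coda") per onset. */
    return (uint16_t)(0xac00 + (cho * 21 + jung) * 28 + jong);
}
```

Under these assumed tables, 0x8861 maps to 0xac00 (가) and 0xd065 maps to 0xd55c (한), while 0x8841 (an onset with no nucleus) is rejected with 0.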
Finding the grapheme in the combination code: if you divide the bits other than the most significant bit (MSB) into 5-bit groups in turn, they are the onset, nucleus, and coda, respectively.
kssmcode(Combination characters) = MSB(Most Significant Bit) + ChoJaso(Onset) + JungJaso(Nucleus) + JongJaso(Coda)
= (0x8000 | (ChoJaso << 10) | (JungJaso << 5) | (JongJaso))
ChoJaso = ((BYTE)((((HGCODE)(kssmcode)) >> 10) & 0x1f))
JungJaso = ((BYTE)((((HGCODE)(kssmcode)) >> 5) & 0x1f))
JongJaso = ((BYTE)(((HGCODE)(kssmcode)) & 0x1f))
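The same bit layout can be written as a pair of small C functions. This is a sketch under the assumption that the code is a 16-bit value laid out exactly as above; the `Jaso` struct and the names `kssm_split` and `kssm_join` are hypothetical, and the sample value 0x8861 is assumed to be '가' in the commercial combination code.

```c
#include <assert.h>
#include <stdint.h>

/* The three 5-bit element values of one combination character. */
typedef struct {
    uint8_t cho;   /* onset   (bits 14..10) */
    uint8_t jung;  /* nucleus (bits 9..5)   */
    uint8_t jong;  /* coda    (bits 4..0)   */
} Jaso;

/* Split a 16-bit combination code into its three 5-bit fields. */
Jaso kssm_split(uint16_t kssm) {
    Jaso j;
    j.cho  = (uint8_t)((kssm >> 10) & 0x1f);
    j.jung = (uint8_t)((kssm >> 5) & 0x1f);
    j.jong = (uint8_t)(kssm & 0x1f);
    return j;
}

/* Rebuild the code: MSB set, then onset, nucleus, coda. */
uint16_t kssm_join(Jaso j) {
    assert(j.cho < 32 && j.jung < 32 && j.jong < 32);
    return (uint16_t)(0x8000 | (j.cho << 10) | (j.jung << 5) | j.jong);
}
```

For 0x8861, `kssm_split` yields onset 2, nucleus 3, and coda 1, and `kssm_join` on those values returns 0x8861 again.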
To find a grapheme in a uni-complete form:
Uni-complete character: UncCode
BaseVal(Starting Position of Uni-complete) = 0xac00
ChoJamoNum(Number of Onset) = 19 Characters
JungJamoNum(Number of Nucleus) = 21 Characters
JongJamoNum(Number of Coda) = 28 Characters(Including the Coda Fill State)
JungJongNum(Composite Number of Nucleus and Coda) = JungJamoNum * JongJamoNum
(Here ChoJaso, JungJaso, and JongJaso are zero-based positions within each set, not the 5-bit combination values.)
UncCode = ((((ChoJaso * JungJamoNum) + JungJaso) * JongJamoNum) + JongJaso + BaseVal)
UncInx = UncCode - BaseVal
ChoJaso = (UncInx / JungJongNum)
JungJaso = (UncInx % JungJongNum) / JongJamoNum
JongJaso = (UncInx % JongJamoNum)
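The arithmetic above can be checked with a short C sketch. It assumes zero-based element positions (onset 0..18, nucleus 0..20, coda 0..27, where 0 means no coda) and takes JungJongNum to be the number of nucleus and coda combinations, 21 * 28 = 588. `JasoIndex`, `unc_compose`, and `unc_decompose` are hypothetical names chosen for this example.

```c
#include <assert.h>
#include <stdint.h>

enum {
    BASE_VAL     = 0xac00,              /* first Uni-completion syllable */
    CHO_NUM      = 19,
    JUNG_NUM     = 21,
    JONG_NUM     = 28,                  /* includes the "no coda" state */
    JUNGJONG_NUM = JUNG_NUM * JONG_NUM  /* 588 syllables per onset */
};

/* Zero-based positions of the three elements. */
typedef struct {
    int cho, jung, jong;
} JasoIndex;

/* Positions -> Uni-completion code. */
uint16_t unc_compose(int cho, int jung, int jong) {
    assert(cho >= 0 && cho < CHO_NUM);
    assert(jung >= 0 && jung < JUNG_NUM);
    assert(jong >= 0 && jong < JONG_NUM);
    return (uint16_t)(BASE_VAL + (cho * JUNG_NUM + jung) * JONG_NUM + jong);
}

/* Uni-completion code -> positions. Returns 1 on success, or 0 when the
   code lies outside 0xac00..0xd7a3 (in which case *out is left untouched). */
int unc_decompose(uint16_t code, JasoIndex *out) {
    int inx;
    if (code < BASE_VAL || code > BASE_VAL + CHO_NUM * JUNGJONG_NUM - 1)
        return 0;
    inx = code - BASE_VAL;
    out->cho  = inx / JUNGJONG_NUM;
    out->jung = (inx % JUNGJONG_NUM) / JONG_NUM;
    out->jong = inx % JONG_NUM;
    return 1;
}
```

For example, `unc_compose(18, 0, 4)` gives 0xd55c (한), and `unc_decompose(0xd55c, &j)` recovers onset 18, nucleus 0, and coda 4. The last syllable, `unc_compose(18, 20, 27)`, is 0xd7a3.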