Great explanation. The only part that tripped me up was in determining the number of octets to represent the codepoint. From the post:

> From the previous diagram the value 0x1F602 falls in the range for a 4 octets header (between 0x10000 and 0x10FFFF)

Relying on the diagram in the post feels like a crutch. It seems easier to remember the maximum number of "data" bits that each octet layout can hold (7, 11, 16, 21 for 1, 2, 3 and 4 octets). Then, knowing that 0x1F602 is 11111011000000010 in binary, which is 17 bits, you know it needs the 4-octet layout: 17 bits is too many for the 16 bits of the 3-octet layout but fits in the 21 bits of the 4-octet one.
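To make the bit-counting rule concrete, here is a rough Python sketch. The function name `utf8_octet_count` and the `CAPACITIES` tuple are made up for illustration; only the capacities (7, 11, 16, 21) come from the comment above. It ignores the surrogate range, which is not encodable in UTF-8.

```python
# Payload bits available in the 1-, 2-, 3- and 4-octet UTF-8 layouts.
CAPACITIES = (7, 11, 16, 21)

def utf8_octet_count(codepoint: int) -> int:
    """Return how many octets UTF-8 needs for a codepoint (hypothetical helper)."""
    if not 0 <= codepoint <= 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    bits = codepoint.bit_length()  # 0x1F602 -> 17
    for octets, capacity in enumerate(CAPACITIES, start=1):
        if bits <= capacity:
            return octets
    raise AssertionError("unreachable: 0x10FFFF fits in 21 bits")

print(utf8_octet_count(0x1F602))  # 4
```

The result can be cross-checked against Python's own encoder with `len(chr(0x1F602).encode("utf-8"))`.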



Since continuation bytes always carry their payload in the low 6 bits, Connor Lane Smith suggests writing them out in octal[1]. It is also convenient and easy to remember that 3 octets of UTF-8 cover exactly the BMP (but perhaps don’t use that fact the way MySQL did[2]...).

[1] http://www.lubutu.com/soso/write-out-unicode-in-octal

[2] https://mathiasbynens.be/notes/mysql-utf8mb4
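A quick sketch of why octal is handy here, assuming Python: 6 payload bits are exactly two octal digits, so each continuation byte prints as `2xx` and the `xx` parts, read in order, are the tail of the codepoint written in octal. The helper names `utf8_in_octal` and `continuation_payload_octal` are invented for this example.

```python
def utf8_in_octal(ch: str) -> list[str]:
    """Each UTF-8 byte of ch as a 3-digit octal string (hypothetical helper)."""
    return [format(b, "03o") for b in ch.encode("utf-8")]

def continuation_payload_octal(ch: str) -> str:
    """Low 6 bits of every continuation byte, two octal digits each."""
    return "".join(format(b & 0x3F, "02o") for b in ch.encode("utf-8")[1:])

emoji = "\U0001F602"
print(utf8_in_octal(emoji))               # ['360', '237', '230', '202']
print(continuation_payload_octal(emoji))  # 373002
print(format(ord(emoji), "o"))            # 373002
```

For this codepoint the leading byte (`360`) contributes only zero payload bits, so the continuation digits alone already spell out the whole codepoint in octal.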



