High value Unicodes need special treatment

While working on a UTF-8 byte list -> Unicode converter, I seem to have discovered an issue with large Unicode code points.

All code points up to 0xFFFF (65535) work fine
[image: unicodeUTF8 script pic (1)]
and convert both ways,

but e.g. 0x10000 (65536) doesn't.

And this seems to be due to JavaScript encoding values above 0xFFFF as two 16-bit code units ("surrogate pairs").
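For anyone who wants to see the arithmetic, here is a minimal JavaScript sketch of how a code point above 0xFFFF maps to a surrogate pair and back. The helper names `toSurrogatePair` and `fromSurrogatePair` are made up for illustration; they are not part of Snap! or of JavaScript itself.

```javascript
// Split a code point above 0xFFFF into its two UTF-16 surrogate code units.
// (Hypothetical helper, for illustration only.)
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

const pair = toSurrogatePair(0x10000);     // [0xD800, 0xDC00]

// String.fromCharCode works on 16-bit code units, so it needs both halves;
// String.fromCodePoint accepts the full code point directly.
const viaPair = String.fromCharCode(pair[0], pair[1]);
const viaCodePoint = String.fromCodePoint(0x10000);
console.log(viaPair === viaCodePoint, viaPair.length); // true 2
```

Passing 0x10000 straight to `String.fromCharCode` silently keeps only the low 16 bits, which is one way the conversion can go wrong without any error.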

With a bit more searching, I came up with a way to allow for it and created a reporter:
[image: unicodeUTF8 script pic (6)]
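The scripts above were posted as images, so the following is not the reporter itself, just a rough JavaScript sketch of the same idea: turn a list of UTF-8 bytes into code points, then build the text with `String.fromCodePoint`, which copes with values above 0xFFFF. The function names are hypothetical, and the sketch assumes well-formed input (no validation of continuation bytes or overlong forms).

```javascript
// Decode a list of UTF-8 bytes into a list of Unicode code points.
// Assumption: the input is well-formed UTF-8; nothing is validated here.
function utf8BytesToCodePoints(bytes) {
  const codePoints = [];
  let i = 0;
  while (i < bytes.length) {
    const lead = bytes[i];
    let extra; // number of continuation bytes that follow the lead byte
    let cp;    // code point being assembled
    if (lead < 0x80) { extra = 0; cp = lead; }               // 0xxxxxxx
    else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }   // 110xxxxx
    else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }   // 1110xxxx
    else { extra = 3; cp = lead & 0x07; }                    // 11110xxx
    for (let k = 1; k <= extra; k++) {
      cp = (cp << 6) | (bytes[i + k] & 0x3F);                // 10xxxxxx
    }
    codePoints.push(cp);
    i += extra + 1;
  }
  return codePoints;
}

// Build a string from the decoded code points.
// String.fromCodePoint emits surrogate pairs where needed.
function utf8BytesToString(bytes) {
  return String.fromCodePoint(...utf8BytesToCodePoints(bytes));
}

console.log(utf8BytesToCodePoints([0xF0, 0x90, 0x80, 0x80])); // [ 65536 ]
```

In plain JavaScript the built-in `new TextDecoder().decode(new Uint8Array(bytes))` does the same job with proper error handling; the manual version is only here to show where the four-byte case comes in.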

All scripts

That was an interesting article, although I boggled at "the English language letter (the meaning)" in the second paragraph of section 1. Letters aren't meanings. Letters don't even have meanings. Words have meanings, often. (Not always, because, for example, of metasyntactic variables, e.g., "foo.")

I'm surprised, in these days when memory is basically free, that they don't just use UTF-32 and forget about planes altogether. Although that wouldn't eliminate the need to handle combining forms when testing for equality. (And not just combining forms; glyphs with MATHEMATICAL in their names are often visually indistinguishable from the corresponding character without the MATHEMATICAL. And there's been a smiley face ☺︎ (U+263A U+FE0E) in Unicode from the beginning, long before they had emoji as a category, because it was in IBM's version of 8-bit ASCII. (Or maybe I mean Microsoft's version. Anyway, it was in DOS.))

Really there are several different kinds of equality testing. The article talks about the two ways to encode é in Unicode, but for some purposes, such as dictionary sorting, "café" and "cafe" should be considered equal, so that "caféx" sorts before "cafey." (Okay, here's a realistic example: "café" comes before "cafeteria" in the dictionary.) That's an issue that transcends the details of Unicode. The German "ß" is, for some purposes, equal to "ss" even though they don't look the same or have the same string length. (I think .normalize handles that one, but I wouldn't swear to it.) Similarly, there's the ongoing debate about whether lower case letters and upper case letters should be considered equal. (Hint: yes. ☺︎)
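To make the different kinds of equality concrete, here is a small JavaScript sketch. The helpers `canonicallyEqual` and `stripMarks` are hypothetical names. Stripping combining marks is only one crude way to get "café" to match "cafe"; real dictionary sorting would normally go through a locale-aware collator instead. For what it's worth, `normalize` leaves "ß" alone in every form; it is the case mapping (`toUpperCase`) that turns it into "SS".

```javascript
const composed = 'caf\u00E9';     // é as a single code point
const decomposed = 'cafe\u0301';  // e followed by COMBINING ACUTE ACCENT

// Canonical equivalence: same text, different encodings of é.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

// Accent-insensitive comparison (a rough approach): decompose, then drop
// every combining mark (Unicode general category M).
function stripMarks(s) {
  return s.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(composed === decomposed);                // false
console.log(canonicallyEqual(composed, decomposed)); // true
console.log(stripMarks(composed) === 'cafe');        // true

// ß has no decomposition, so normalization does not change it...
console.log('\u00DF'.normalize('NFKD') === '\u00DF'); // true
// ...but upper-casing maps it to two letters.
console.log('stra\u00DFe'.toUpperCase());             // STRASSE
```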

And then there are the Delphic Unicode Consortium pronouncements. For example, they're very insistent that font styles such as italic and boldface aren't separate glyphs, except in MATHEMATICAL-land, where they are. And Elvish and Klingon characters aren't characters at all, but they have a non-Consortium semi-official position in the private use area. But emoji are characters. :~( There's a SUPERSCRIPT LATIN SMALL LETTER N (ⁿ) and digits, but not other letters, not even superscript k, which is almost as common in math as superscript n. And ligatures (ff etc.), another font variation, are Unicode characters.

There's a LATIN SMALL LETTER CHI (ꭓ, U+AB53) and a GREEK SMALL LETTER CHI (χ, U+03C7). The Latin one is way far away from all the other Latin letters, even the weird ones such as ƻ (LATIN LETTER TWO WITH STROKE, U+01BB, the only digit with a stroke). Who uses chi in Latin or Latin-derived languages?

tl;dr: "looks the same" and "linguistically the same" are intersecting sets, but neither is a subset of the other.

P.S.: By the way,


:~(

This is actually a pretty useful block! Too bad it (somehow) still doesn't support all characters...

...but it's still cool to see how easily you fixed the problem with high-value unicodes. I'm definitely going to be using this soon!

Oh dear 😦
I don't think things like this (which I've found are called Variation Selectors) are going to be easy to deal with efficiently, so I'm out of the Unicode game 🙂


It's properly decoded as two code points:

  1. White Smiling Face, 0x263A [image: untitled script pic (23)];
  2. followed by VS15, 0xFE0E (65038), a non-printing code point that, when used after an emoji, should select the monochrome text variant, but for White Smiling Face does nothing.

There is no other way to express those 2 code points, as they are already Unicode'd 😉
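The same breakdown can be reproduced in plain JavaScript. This is only a sketch; `codePointsOf` is a made-up helper name. It relies on string iteration going by code point rather than by 16-bit code unit.

```javascript
// List the code points of a string.
// Array.from iterates a string by code point, so surrogate pairs stay together.
function codePointsOf(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// Format code points as hex for display.
function toHex(codePoints) {
  return codePoints.map((cp) => '0x' + cp.toString(16).toUpperCase());
}

const whiteSmiley = '\u263A\uFE0E';       // White Smiling Face + VS15
const slightSmile = '\u{1F642}\uFE0E';    // Slightly Smiling Face + VS15

console.log(toHex(codePointsOf(whiteSmiley))); // [ '0x263A', '0xFE0E' ]
console.log(toHex(codePointsOf(slightSmile))); // [ '0x1F642', '0xFE0E' ]
```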

BTW:
The 0x1F642 Slightly Smiling Face is another code point.
[image: untitled script pic (25)]
Combined with the VS15:
[image: untitled script pic (27)]


There are commented parts of Process.prototype.reportUnicode in threads.js to properly handle code points above 0xFFFF.
But JS handles "large" code points only in selected functions. Finding the text length or splitting by letter, for large text, may break Snap!/JS. So maybe an extended Unicode library should be created.
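A quick JavaScript illustration of which operations count 16-bit code units and which count code points. The two helper functions are hypothetical, just one possible shape for such a library; none of this is taken from Snap!'s source.

```javascript
const text = 'a\u{1F642}b'; // "a", Slightly Smiling Face, "b"

// Code-unit based operations see the emoji as two separate units.
console.log(text.length);           // 4
console.log(text.split('').length); // 4 (the emoji is split into lone surrogates)
console.log(text.charCodeAt(1));    // 55357 (0xD83D, the high surrogate only)

// Code-point aware alternatives (hypothetical helpers).
function countCodePoints(s) {
  return [...s].length;             // spread iterates by code point
}

function splitCodePoints(s) {
  return Array.from(s);             // each element is one whole code point
}

console.log(countCodePoints(text)); // 3
console.log(text.codePointAt(1));   // 128578 (0x1F642)
```

Even the code-point aware versions treat a character followed by a variation selector or combining mark as several items, so they only solve the surrogate-pair part of the problem.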


This problem usually arises when someone is trying to "compress" binary data into Unicode text for sharing or transmitting. MQTT and Web Services libraries have built-in support for binary data. For other cases, some encoding should be used, say Base64 or quoted-printable.
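As a sketch of the Base64 route in JavaScript: the helper names below are hypothetical, and the code assumes the global `btoa` and `atob` functions are available (they are in browsers and, as far as I know, in recent Node.js versions). Every byte value survives the round trip because the encoded text only uses plain ASCII characters.

```javascript
// Encode a list of byte values (0..255) as Base64 text.
function bytesToBase64(bytes) {
  let binary = '';
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one character per byte, all <= 0xFF
  }
  return btoa(binary);
}

// Decode Base64 text back into a list of byte values.
function base64ToBytes(text) {
  return Array.from(atob(text), (ch) => ch.charCodeAt(0));
}

const payload = [0, 255, 16, 128];
const encoded = bytesToBase64(payload);
console.log(encoded);
console.log(base64ToBytes(encoded)); // [ 0, 255, 16, 128 ]
```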

Oh. So in other words it's macOS's fault for giving me an unnecessary extra code point when I clicked on the smiley. :~( Thanks for the explanation!

Sorry for replying to myself, but I've just figured out why macOS does that. It's to compensate for the other, even more annoying way they mess up Unicode, namely replacing certain geometric glyphs (such as ▶︎ iirc) with colored keycap-like emoji. They use the monochrome text variant to prevent themselves from doing that to the smiley.